Latest working kernel version: 2.6.22.x
Earliest failing kernel version: 2.6.23.x (haven't yet tested the rc's)
Distribution: Debian Sid
Hardware Environment: Intel SC450NX system (Quad P3 Xeon)
Software Environment: Debian Sid
Problem Description:
With all kernels I have tried later than 2.6.22.x, my SC450NX system hangs for a long time at boot with messages like:

    scsi 0:0:6:0: ABORT operation timed-out.
    scsi 0:0:6:0: DEVICE RESET operation started
    scsi 1:0:6:0: ABORT operation timed-out.
    scsi 1:0:6:0: DEVICE RESET operation started
    scsi 0:0:6:0: DEVICE RESET operation timed-out.
    scsi 0:0:6:0: BUS RESET operation started
    scsi 1:0:6:0: DEVICE RESET operation timed-out.
    scsi 1:0:6:0: BUS RESET operation started
    scsi 0:0:6:0: BUS RESET operation timed-out.
    scsi 0:0:6:0: HOST RESET operation started
    sym0: SCSI BUS has been reset.
    scsi 1:0:6:0: BUS RESET operation timed-out.
    scsi 1:0:6:0: HOST RESET operation started
    sym1: SCSI BUS has been reset.
    scsi 0:0:6:0: HOST RESET operation timed-out.
    scsi 0:0:6:0: Device offlined - not ready after error recovery
    scsi 1:0:6:0: HOST RESET operation timed-out.
    scsi 1:0:6:0: Device offlined - not ready after error recovery
    scsi 0:0:8:0: ABORT operation started
    scsi 1:0:8:0: ABORT operation started

on all SCSI IDs (in use or not) on two SCSI channels.

This is alarming enough, but when the system finally starts up, I also find that one of my NICs has stopped working. A quick look at /proc/interrupts shows that there are lots of differences from the "usual" interrupts. I'll attach /proc/interrupts from both a working and a non-working kernel, and post dmesg from both. The problem is that with the 2.6.25-rc3 kernel, I can't get the beginning of the dmesg. Is there a trick other than dmesg -s 65536? Thanks :)

Steps to reproduce:
Created attachment 15033 [details] interrupts from working kernel (2.6.22.9)
Created attachment 15034 [details] dmesg from working kernel (2.6.22.9)
Created attachment 15035 [details] interrupts from broken kernel (2.6.25-rc3)
Created attachment 15036 [details] Partial dmesg from broken kernel (2.6.25-rc3)
Will you please attach the output of acpidump and lspci -vvxxx? (Please use the working kernel). Thanks.
Created attachment 15045 [details] acpidump
Created attachment 15046 [details] lspci -vvxxx
Hi, Stian
It seems that the interrupt for 02:03.0/02:03.1 (the SCSI controller) is incorrect, so the OS reports the following error messages and the system can't work normally.

> scsi 0:0:0:0: ABORT operation started
> scsi 1:0:0:0: ABORT operation started
> scsi 0:0:0:0: ABORT operation timed-out.
> scsi 0:0:0:0: DEVICE RESET operation started
> scsi 1:0:0:0: ABORT operation timed-out.

Will you please attach the full dmesg for 2.6.25-rc3? (Please add the boot option "initcall_debug".)
Thanks.
Will do later tonight. But as I wrote earlier, I don't know how to get the full dmesg. IIRC you could earlier set the size of the dmesg buffer in kernel config, but I'm not able to find this config option anymore... Please note that the network card sharing the same (wrong) interrupt with the scsi controller also does not work. Thanks. -Stian
Please increase CONFIG_LOG_BUF_SHIFT in the kernel configuration (for example, to 18 or 19). Then we may be able to get the full dmesg output. Thanks.
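For reference, CONFIG_LOG_BUF_SHIFT selects the kernel log buffer size as a power of two, so each increment doubles the buffer. Note that dmesg -s only sets the size of the buffer dmesg reads into; it cannot bring back early messages that the kernel ring buffer has already overwritten. A minimal sketch of the arithmetic follows; the helper name log_buf_bytes is made up for illustration and is not a kernel or dmesg interface.

```python
def log_buf_bytes(shift: int) -> int:
    """Kernel log buffer size in bytes for a given CONFIG_LOG_BUF_SHIFT.

    Hypothetical helper: it only restates "size = 2 ** shift".
    """
    return 1 << shift


# Compare a few candidate values, including the 18 and 19 suggested above.
for shift in (16, 17, 18, 19):
    size = log_buf_bytes(shift)
    print(f"CONFIG_LOG_BUF_SHIFT={shift}: {size} bytes ({size // 1024} KiB)")
```

With a shift of 18 or 19 the buffer is 256 KiB or 512 KiB, which leaves far more room for the extra output that initcall_debug produces during boot.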
Created attachment 15093 [details] new dmesg from broken kernel (2.6.25-rc3). With debug_initcall
Created attachment 15116 [details] try the patch
Will you please try the attached patch and see whether the problem can be fixed? Thanks.
Works fine, thanks :)
Why isn't this patch in 2.6.25-rc6?
*** Bug 10396 has been marked as a duplicate of this bug. ***
Re: irq numbers

The IRQ numbers in 2.6.25 look different from 2.6.22 because "irq compression" is now disabled on i386 for IRQ numbers below 64. The IOAPIC on a 450NX has 64 entries, so what you see now are always the real pin numbers == GSI numbers == IRQ numbers, identity mapped.

E.g. the mylex was on IRQ 16 and is now on IRQ 17, but has been on GSI 17 all along:

< ACPI: PCI Interrupt 0000:01:08.0[A] -> GSI 17 (level, low) -> IRQ 16
---
> ACPI: PCI Interrupt 0000:01:08.0[A] -> GSI 17 (level, low) -> IRQ 17

and the symbios on 58 and the uhci_hcd:usb1 on 54 now show that way:

< ACPI: PCI Interrupt 0000:00:08.0[A] -> GSI 58 (level, low) -> IRQ 21
< ACPI: PCI Interrupt 0000:00:0c.2[D] -> GSI 54 (level, low) -> IRQ 22
---
> ACPI: PCI Interrupt 0000:00:08.0[A] -> GSI 58 (level, low) -> IRQ 58
> ACPI: PCI Interrupt 0000:00:0c.2[D] -> GSI 54 (level, low) -> IRQ 54

But that just obscures the real failure, which is this:

> ACPI: Bus 0000:02 not present in PCI namespace
< ACPI: PCI Interrupt 0000:02:03.0[A] -> GSI 57 (level, low) -> IRQ 19
< ACPI: PCI Interrupt 0000:02:03.1[B] -> GSI 56 (level, low) -> IRQ 20
---
> ACPI: PCI Interrupt 0000:02:03.0[A]: no GSI - using IRQ 11
> ACPI: PCI Interrupt 0000:02:03.1[B]: no GSI - using IRQ 11

and your 2nd ethernet is on bus 2 and thus died also:

ACPI: Unable to derive IRQ for device 0000:02:01.0
ACPI: PCI Interrupt 0000:02:01.0[A]: no GSI - using IRQ 11
3c59x: Donald Becker and others.
0000:02:01.0: 3Com PCI 3c980C Python-T at f881c000.
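The distinction drawn above (harmless renumbering versus a device that lost its GSI routing) can be checked mechanically in a dmesg capture. The following is only a rough sketch, assuming the two message formats quoted in this comment; the function name classify and the labels "identity", "compressed" and "no-gsi" are invented for illustration and are not kernel terminology.

```python
import re

# The two message shapes quoted in this thread:
#   ACPI: PCI Interrupt 0000:01:08.0[A] -> GSI 17 (level, low) -> IRQ 17
#   ACPI: PCI Interrupt 0000:02:03.0[A]: no GSI - using IRQ 11
ROUTED = re.compile(
    r"ACPI: PCI Interrupt (\S+)\[([A-D])\] -> GSI (\d+) .*-> IRQ (\d+)"
)
FALLBACK = re.compile(
    r"ACPI: PCI Interrupt (\S+)\[([A-D])\]: no GSI - using IRQ (\d+)"
)


def classify(line):
    """Return (device, pin, gsi, irq, kind) for a routing line, else None.

    kind is "identity" when IRQ == GSI, "compressed" when the IRQ number
    was renumbered, and "no-gsi" when the kernel fell back to a legacy IRQ.
    """
    m = ROUTED.search(line)
    if m:
        gsi, irq = int(m.group(3)), int(m.group(4))
        kind = "identity" if gsi == irq else "compressed"
        return (m.group(1), m.group(2), gsi, irq, kind)
    m = FALLBACK.search(line)
    if m:
        return (m.group(1), m.group(2), None, int(m.group(3)), "no-gsi")
    return None


sample = [
    "ACPI: PCI Interrupt 0000:01:08.0[A] -> GSI 17 (level, low) -> IRQ 16",
    "ACPI: PCI Interrupt 0000:00:08.0[A] -> GSI 58 (level, low) -> IRQ 58",
    "ACPI: PCI Interrupt 0000:02:03.0[A]: no GSI - using IRQ 11",
]
for entry in sample:
    print(classify(entry))
```

Only the "no-gsi" entries point at the actual breakage here; a changed IRQ number with an unchanged GSI is just the renumbering described above.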
assigning to jbarnes for him to push the appropriate PCI fix upstream.
ykzhao's patch already seems to be upstream, as of April 15, so I'm closing this out.
shipped in 2.6.25 - closed.

commit b87e81e5c6e64ae0eae3b4f61bf07bfeec856184
Author: yakui.zhao@intel.com <yakui.zhao@intel.com>
Date:   Tue Apr 15 14:34:49 2008 -0700

    acpi: unneccessary to scan the PCI bus already scanned

    http://bugzilla.kernel.org/show_bug.cgi?id=10124