Bug 10124

Summary: no GSI - using IRQ 11 - Intel SC450NX - 2.6.23 regression
Product: Drivers
Reporter: Stian Jordet (stian_web)
Component: PCI
Assignee: Jesse Barnes (jbarnes)
Status: CLOSED CODE_FIX
Severity: normal
CC: acpi-bugzilla, linux
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.25-rc3 Tree: Mainline
Regression: Yes
Attachments:
  interrupts from working kernel (2.6.22.9)
  dmesg from working kernel (2.6.22.9)
  interrupts from broken kernel (2.6.25-rc3)
  Partial dmesg from broken kernel (2.6.25-rc3)
  acpidump
  lspci -vvxxx
  new dmesg from broken kernel (2.6.25-rc3). With debug_initcall
  try the patch

Description Stian Jordet 2008-02-27 14:19:16 UTC
Latest working kernel version: 2.6.22.x
Earliest failing kernel version: 2.6.23.x (haven't yet tested the rc's)
Distribution: Debian Sid
Hardware Environment: Intel SC450NX system (Quad P3 Xeon)
Software Environment: Debian Sid
Problem Description:
With every kernel later than 2.6.22.x that I have tried, my SC450NX system hangs for a long time at boot with messages like:

scsi 0:0:6:0: ABORT operation timed-out.
scsi 0:0:6:0: DEVICE RESET operation started
scsi 1:0:6:0: ABORT operation timed-out.
scsi 1:0:6:0: DEVICE RESET operation started
scsi 0:0:6:0: DEVICE RESET operation timed-out.
scsi 0:0:6:0: BUS RESET operation started
scsi 1:0:6:0: DEVICE RESET operation timed-out.
scsi 1:0:6:0: BUS RESET operation started
scsi 0:0:6:0: BUS RESET operation timed-out.
scsi 0:0:6:0: HOST RESET operation started
sym0: SCSI BUS has been reset.
scsi 1:0:6:0: BUS RESET operation timed-out.
scsi 1:0:6:0: HOST RESET operation started
sym1: SCSI BUS has been reset.
scsi 0:0:6:0: HOST RESET operation timed-out.
scsi 0:0:6:0: Device offlined - not ready after error recovery
scsi 1:0:6:0: HOST RESET operation timed-out.
scsi 1:0:6:0: Device offlined - not ready after error recovery
scsi 0:0:8:0: ABORT operation started
scsi 1:0:8:0: ABORT operation started

This happens on all SCSI IDs (in use or not) on two SCSI channels. That is alarming enough, but when the system finally starts up, I also find that one of my NICs has stopped working.

A quick look at /proc/interrupts shows lots of differences from the "usual" interrupts. I'll attach /proc/interrupts from both a working and a non-working kernel, and post dmesg from both. The problem is that with the 2.6.25-rc3 kernel I can't get the beginning of the dmesg. Is there a trick other than dmesg -s 65536?

Thanks :)

Steps to reproduce:
Comment 1 Stian Jordet 2008-02-27 14:21:10 UTC
Created attachment 15033 [details]
interrupts from working kernel (2.6.22.9)
Comment 2 Stian Jordet 2008-02-27 14:21:41 UTC
Created attachment 15034 [details]
dmesg from working kernel (2.6.22.9)
Comment 3 Stian Jordet 2008-02-27 14:22:07 UTC
Created attachment 15035 [details]
interrupts from broken kernel (2.6.25-rc3)
Comment 4 Stian Jordet 2008-02-27 14:22:33 UTC
Created attachment 15036 [details]
Partial dmesg from broken kernel (2.6.25-rc3)
Comment 5 ykzhao 2008-02-27 17:46:22 UTC
Will you please attach the output of acpidump and lspci -vvxxx? 
(Please use the working kernel).
Thanks.
Comment 6 Stian Jordet 2008-02-27 23:08:32 UTC
Created attachment 15045 [details]
acpidump
Comment 7 Stian Jordet 2008-02-27 23:08:55 UTC
Created attachment 15046 [details]
lspci -vvxxx
Comment 8 ykzhao 2008-02-29 00:55:57 UTC
Hi, Stian
    It seems that the interrupt for 02:03.0/02:03.1 (SCSI controller) is incorrect, so the OS reports the following error messages and the system can't work normally.
  > scsi 0:0:0:0: ABORT operation started
  > scsi 1:0:0:0: ABORT operation started
  > scsi 0:0:0:0: ABORT operation timed-out.
  > scsi 0:0:0:0: DEVICE RESET operation started
  > scsi 1:0:0:0: ABORT operation timed-out.

    Will you please attach the full dmesg for 2.6.25-rc3? (Please add the boot option "initcall_debug".)
    Thanks.
Comment 9 Stian Jordet 2008-02-29 00:58:31 UTC
Will do later tonight. But as I wrote earlier, I don't know how to get the full dmesg. IIRC you used to be able to set the size of the dmesg buffer in the kernel config, but I can't find that config option anymore...

Please note that the network card sharing the same (wrong) interrupt with the scsi controller also does not work.

Thanks.

-Stian
Comment 10 ykzhao 2008-02-29 05:59:51 UTC
Please increase CONFIG_LOG_BUF_SHIFT in the kernel configuration (for example, to 18 or 19). Maybe then we can get the full dmesg output.
Thanks.
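
For reference, CONFIG_LOG_BUF_SHIFT is the base-2 logarithm of the kernel log buffer size in bytes, so each step up doubles the buffer. The sketch below only shows that arithmetic. The helper name log_buf_bytes is made up for illustration, and the value 17 is a hypothetical baseline, since the shift in the reporter's config is not stated in this report.

```python
def log_buf_bytes(shift: int) -> int:
    """Kernel log buffer size in bytes for a given CONFIG_LOG_BUF_SHIFT."""
    # The buffer is 2**shift bytes, so shift=18 means 256 KiB.
    return 1 << shift


# 17 is a hypothetical baseline; 18 and 19 are the values suggested above.
for shift in (17, 18, 19):
    print(f"CONFIG_LOG_BUF_SHIFT={shift}: {log_buf_bytes(shift) // 1024} KiB")
```

A larger dmesg -s value only enlarges the buffer that dmesg reads into. It cannot bring back messages that have already been overwritten in the kernel's ring buffer, which is why the kernel-side size has to be raised.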
Comment 11 Stian Jordet 2008-02-29 08:54:57 UTC
Created attachment 15093 [details]
new dmesg from broken kernel (2.6.25-rc3). With debug_initcall
Comment 12 ykzhao 2008-03-02 22:21:25 UTC
Created attachment 15116 [details]
try the patch

Will you please try the attached patch and see whether the problem can be fixed?
Thanks.
Comment 13 Stian Jordet 2008-03-03 04:51:28 UTC
Works fine, thanks :)
Comment 14 Stian Jordet 2008-03-18 01:36:01 UTC
Why isn't this patch in 2.6.25-rc6?
Comment 15 TJ 2008-04-11 08:08:52 UTC
*** Bug 10396 has been marked as a duplicate of this bug. ***
Comment 16 Len Brown 2008-06-24 18:31:49 UTC
Re: irq numbers

The reason that the IRQ numbers in 2.6.25 look different from
2.6.22 is because "irq compression" is now disabled on i386
for IRQ numbers below 64.  The IOAPIC on a 450NX has 64
entries, so what you see now are always the real pin numbers ==
GSI numbers == IRQ numbers, identity mapped.

eg. the mylex was on IRQ16, now on IRQ17,
but has been on GSI 17 all along:

< ACPI: PCI Interrupt 0000:01:08.0[A] -> GSI 17 (level, low) -> IRQ 16
---
> ACPI: PCI Interrupt 0000:01:08.0[A] -> GSI 17 (level, low) -> IRQ 17

and the symbios on 58 and the uhci_hcd:usb1 on 54 now show that way:

< ACPI: PCI Interrupt 0000:00:08.0[A] -> GSI 58 (level, low) -> IRQ 21
< ACPI: PCI Interrupt 0000:00:0c.2[D] -> GSI 54 (level, low) -> IRQ 22
---
> ACPI: PCI Interrupt 0000:00:08.0[A] -> GSI 58 (level, low) -> IRQ 58
> ACPI: PCI Interrupt 0000:00:0c.2[D] -> GSI 54 (level, low) -> IRQ 54

But that just obscures the real failure, which is this:

> ACPI: Bus 0000:02 not present in PCI namespace

< ACPI: PCI Interrupt 0000:02:03.0[A] -> GSI 57 (level, low) -> IRQ 19
< ACPI: PCI Interrupt 0000:02:03.1[B] -> GSI 56 (level, low) -> IRQ 20
---
> ACPI: PCI Interrupt 0000:02:03.0[A]: no GSI - using IRQ 11
> ACPI: PCI Interrupt 0000:02:03.1[B]: no GSI - using IRQ 11

and your 2nd ethernet is on bus 2 and thus died also:

ACPI: Unable to derive IRQ for device 0000:02:01.0
ACPI: PCI Interrupt 0000:02:01.0[A]: no GSI - using IRQ 11
3c59x: Donald Becker and others.
0000:02:01.0: 3Com PCI 3c980C Python-T at f881c000.
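
The comparison above can be roughly automated. This is a minimal sketch, assuming the dmesg lines follow exactly the two "ACPI: PCI Interrupt" formats quoted in this comment. The function name classify_pci_interrupts and the regular expressions are illustrative only and do not come from any kernel tool.

```python
import re

# PCI address followed by the interrupt pin, e.g. "0000:02:03.0[A]".
_DEV = r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\[[A-D]\]"
ROUTED = re.compile(
    r"ACPI: PCI Interrupt " + _DEV + r" -> GSI (\d+) \(\w+, \w+\) -> IRQ (\d+)"
)
FALLBACK = re.compile(r"ACPI: PCI Interrupt " + _DEV + r": no GSI - using IRQ (\d+)")


def classify_pci_interrupts(lines):
    """Split devices into ACPI-routed ones and ones that fell back to a legacy IRQ.

    Returns (routed, fallback): routed maps device -> (gsi, irq),
    fallback maps device -> irq.
    """
    routed, fallback = {}, {}
    for line in lines:
        m = ROUTED.search(line)
        if m:
            routed[m.group(1)] = (int(m.group(2)), int(m.group(3)))
            continue
        m = FALLBACK.search(line)
        if m:
            fallback[m.group(1)] = int(m.group(2))
    return routed, fallback


# Lines taken from the 2.6.25-rc3 log quoted in this comment.
sample = [
    "ACPI: PCI Interrupt 0000:01:08.0[A] -> GSI 17 (level, low) -> IRQ 17",
    "ACPI: PCI Interrupt 0000:00:08.0[A] -> GSI 58 (level, low) -> IRQ 58",
    "ACPI: PCI Interrupt 0000:02:03.0[A]: no GSI - using IRQ 11",
    "ACPI: PCI Interrupt 0000:02:03.1[B]: no GSI - using IRQ 11",
    "ACPI: PCI Interrupt 0000:02:01.0[A]: no GSI - using IRQ 11",
]
routed, fallback = classify_pci_interrupts(sample)
print("fell back to a legacy IRQ:", sorted(fallback))
print("not identity mapped:", sorted(d for d, (gsi, irq) in routed.items() if gsi != irq))
```

Any device that lands in the fallback set never got a GSI from the ACPI routing tables. In this report those are exactly the devices on bus 0000:02.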
Comment 17 Len Brown 2008-06-24 18:44:09 UTC
assigning to jbarnes for him to push the appropriate PCI fix upstream.
Comment 18 Jesse Barnes 2008-06-25 12:17:53 UTC
ykzhao's patch already seems to be upstream, as of April 15, so I'm closing this out.
Comment 19 Len Brown 2008-06-25 14:34:34 UTC
shipped in 2.6.25 - closed.

commit b87e81e5c6e64ae0eae3b4f61bf07bfeec856184
Author: yakui.zhao@intel.com <yakui.zhao@intel.com>
Date:   Tue Apr 15 14:34:49 2008 -0700

    acpi: unneccessary to scan the PCI bus already scanned
    
    http://bugzilla.kernel.org/show_bug.cgi?id=10124