Created attachment 72854 [details] kernel log of failed 3.3.1 bootup

I have an M2N-LR motherboard with a pair of 3Ware 9550SX RAID controllers plugged into PCI-X slots. With kernel 3.0.x or older, everything works fine. However, with kernel 3.1.x or newer (tested with 3.1, 3.2, and 3.3.1) the controller IRQ assignments are all wonky and the cards don't come up properly as the SCSI bus probe fails.

*** kernel 3.0.6 (good)

3ware 9000 Storage Controller device driver for Linux v2.26.02.014.
ACPI: PCI Interrupt Link [LNEC] enabled at IRQ 18
3w-9xxx 0000:03:00.0: PCI INT A -> Link[LNEC] -> GSI 18 (level, low) -> IRQ 18
scsi2 : 3ware 9000 Storage Controller
3w-9xxx: scsi2: Found a 3ware 9000 Storage Controller at 0xefdff000, IRQ: 18.
3w-9xxx: scsi2: Firmware FE9X 3.08.00.029, BIOS BE9X 3.10.00.003, Ports: 8.
3w-9xxx 0000:03:04.0: PCI INT A -> Link[LNEC] -> GSI 18 (level, low) -> IRQ 18
scsi 2:0:0:0: Direct-Access AMCC 9550SXU-8L DISK 3.08 PQ: 0 ANSI: 5
scsi 2:0:1:0: Direct-Access AMCC 9550SXU-8L DISK 3.08 PQ: 0 ANSI: 5
scsi7 : 3ware 9000 Storage Controller
3w-9xxx: scsi7: Found a 3ware 9000 Storage Controller at 0xefdfe000, IRQ: 18.
3w-9xxx: scsi7: Firmware FE9X 3.08.00.029, BIOS BE9X 3.10.00.003, Ports: 4.
scsi 7:0:0:0: Direct-Access AMCC 9550SX-4LP DISK 3.08 PQ: 0 ANSI: 5

*** kernel 3.3.1 (Same results with 3.1.x and 3.2.x)

3ware 9000 Storage Controller device driver for Linux v2.26.02.014.
3w-9xxx 0000:03:00.0: PCI IRQ 0 -> rerouted to legacy IRQ 16
ACPI: Invalid index 16
3w-9xxx 0000:03:00.0: PCI INT A: no GSI - using ISA IRQ 14
scsi4 : 3ware 9000 Storage Controller
3w-9xxx: scsi4: Found a 3ware 9000 Storage Controller at 0xefdff000, IRQ: 14.
3w-9xxx: scsi4: Firmware FE9X 3.08.00.029, BIOS BE9X 3.10.00.003, Ports: 8.
3w-9xxx 0000:03:04.0: PCI IRQ 0 -> rerouted to legacy IRQ 16
ACPI: Invalid index 16
3w-9xxx 0000:03:04.0: PCI INT A: no GSI - using ISA IRQ 14
scsi8 : 3ware 9000 Storage Controller
3w-9xxx: scsi8: Found a 3ware 9000 Storage Controller at 0xefdfe000, IRQ: 14.
3w-9xxx: scsi8: Firmware FE9X 3.08.00.029, BIOS BE9X 3.10.00.003, Ports: 4.
scsi: waiting for bus probes to complete ...
scsi 4:0:0:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
scsi 8:0:0:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
scsi 4:0:0:0: WARNING: (0x06:0x002C): Command (0x0) timed out, resetting card.
scsi 8:0:0:0: WARNING: (0x06:0x002C): Command (0x0) timed out, resetting card.
scsi 4:0:0:0: Device offlined - not ready after error recovery
scsi 8:0:0:0: Device offlined - not ready after error recovery
[repeat above six lines fifteen more times, once for each LUN]

Strictly speaking these are Fedora 15 kernels (specifically versions "2.6.40.6" and "2.6.43.1") but I don't think that has any bearing on this problem. I've reported this to RedHat (RHBZ #808880) but this is likely an upstream kernel bug so I'm reporting it here too.

I'll be attaching kernel logs of successful 3.0.6 and unsuccessful 3.3.1 boots, as well as 'dmidecode' and 'lspci -v' output.
Created attachment 72855 [details] kernel log of successful 3.0.6 bootup.
Created attachment 72856 [details] output of dmidecode
Created attachment 72857 [details] output of 'lspci -v'
working:

ACPI: PCI Interrupt Link [LNEC] (IRQs 16 17 18 19) *14
...
ACPI: PCI Interrupt Link [LNEC] enabled at IRQ 18
3w-9xxx 0000:03:00.0: PCI INT A -> Link[LNEC] -> GSI 18 (level, low) -> IRQ 18

failing:

ACPI: PCI Interrupt Link [LNEC] (IRQs 16 17 18 19) *14
...
<LNEC is never accessed again>
3w-9xxx 0000:03:00.0: PCI IRQ 0 -> rerouted to legacy IRQ 16
ACPI: Invalid index 16
3w-9xxx 0000:03:00.0: PCI INT A: no GSI - using ISA IRQ 14

So for some reason on the new kernel, when PCI enumerated 3w-9xxx, it did not find the PCI interrupt link device for that device!

My 1st guess is that something in PCI broke, just b/c we've not touched the code in this area in ACPI...

Please attach the output from acpidump from the working system.
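The difference between the two boots can be spotted mechanically. The following is only a rough sketch, not part of any kernel tooling: it scans dmesg text for the "no GSI - using ISA IRQ" fallback message quoted above. The function name and the regular expression are illustrative assumptions.

```python
import re

# Matches the fallback message seen in the failing boot, e.g.
# "3w-9xxx 0000:03:00.0: PCI INT A: no GSI - using ISA IRQ 14"
# (hypothetical helper; the pattern is derived only from the log lines in this report)
FALLBACK_RE = re.compile(
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCI INT [A-D]: no GSI - using ISA IRQ (?P<irq>\d+)"
)


def find_irq_fallbacks(dmesg_text):
    """Return (PCI address, IRQ) pairs for devices that fell back to an ISA IRQ."""
    return [
        (m.group("bdf"), int(m.group("irq")))
        for m in FALLBACK_RE.finditer(dmesg_text)
    ]
```

Feeding it the saved output of dmesg from a failing boot should list both controllers; a working boot should give an empty list.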
Please try with and without CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS.

Please report if reverting this patch between 3.0 and 3.1 has an effect:

d7f6169a0d32002657886fee561c641acddb9a75
ACPI: fix CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS
Created attachment 72871 [details] output of 'acpidump' on working 3.0 kernel

When generating this output, acpidump printed this on stderr:

Wrong checksum for generic table!
The failing kernels are all built with CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y

It's going to take a couple of days to schedule a maintenance window and get the patched/reconfigured kernel staged for testing, but I'll get the test results ASAP.
Created attachment 72920 [details] boot log of patched 3.3.2 kernel (success)

This is the boot log of the Fedora 15 3.3.2 kernel (aka 2.6.43.2-2.fc15.x86_64) with the following commit reverted:

d7f6169a0d32002657886fee561c641acddb9a75
ACPI: fix CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS

The kernel was still built with CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y.

In short, when that patch was reverted, the kernel booted up successfully, with correct IRQ routing.
*** Bug 43150 has been marked as a duplicate of this bug. ***
Could you please add the output of "lspci -nn". Thanks!
Created attachment 73077 [details] output of 'lspci -nn' Here's the dump, taken from the system running the patched 3.3.2 kernel.
The problem is likely related to the 6702PXH PCI Express-to-PCI Bridge and your system being unable to use MSIs. You should be able to work around the problem by adding "pci=noioapicquirk" to the kernel command line for now. I'll try to come up with a debug patch to gather more information.
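To confirm that the workaround is actually in effect on a given boot, one can check the kernel command line. This is a small sketch, not an official tool; the function names are made up for illustration, and it assumes the usual form where "pci=" takes a comma-separated list of options.

```python
def has_noioapicquirk(cmdline):
    """Return True if any pci= option on the command line contains noioapicquirk."""
    for token in cmdline.split():
        if token.startswith("pci="):
            # pci= accepts several comma-separated values, e.g. pci=nomsi,noioapicquirk
            if "noioapicquirk" in token[len("pci="):].split(","):
                return True
    return False


def running_kernel_has_noioapicquirk(path="/proc/cmdline"):
    """Check the command line of the currently running kernel (Linux only)."""
    with open(path) as f:
        return has_noioapicquirk(f.read())
```

Calling running_kernel_has_noioapicquirk() on the affected machine should return True only when the option was passed at boot.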
Created attachment 73159 [details] debug-prt.patch Hi Solomon, could you test this patch on top of a 3.3 kernel? Please provide a full dmesg. Thanks!
I am currently building a new kernel with the debug patch, and I'll have results in a few hours.
Created attachment 73318 [details] dmesg with debug-prt.patch applied I had to boot with pci=noioapicquirk, but the debugging output is attached.
Just as an FYI, I no longer have the hardware in question. The PCI-X controller cards have been replaced with a single PCIe controller.
So I'll close this bug, as it cannot be reproduced any more. Please feel free to reopen it if anyone can reproduce the problem.
Due to an odd series of events, I once again have a 3Ware 9550SXU PCI-X card plugged into an M2N-LR motherboard. The problem I reported four years ago is still present with Fedora's 4.8.14 kernel -- Just as described in comment #4. Adding 'pci=noioapicquirk' to the kernel cmdline is still a successful workaround for the problem. As I am now able to reproduce the problem with current kernels, I'm re-opening this bug...
Is it possible for you to bisect this to find the culprit commit, since the bug is marked as a regression? The symptom is deterministic, so this should be a bisectable bug.

Thanks
Lv
According to what I tried earlier, reverting this commit allowed things to work: (see comment #8) d7f6169a0d32002657886fee561c641acddb9a75 ACPI: fix CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS That said, these symptoms started when Fedora enabled CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS in their kernels, which coincided with the 3.0->3.1 transition. IIUC, disabling this option and using the pci=noioapicquirk cmdline option essentially accomplish the same thing. Consequently, I suspect that this IRQ rerouting feature never worked on this particular system -- or at least the heuristics to determine what is "broken" are incorrect, as the evidence suggests that the boot IRQs aren't actually "broken" because the system appears to work fine with the assignments that the BIOS handed out at boot time. Also, comment #12 probably explains the underlying cause.
(In reply to Solomon Peachy from comment #20)
> According to what I tried earlier, reverting this commit allowed things to
> work: (see comment #8)

Ah, thanks.

>
> d7f6169a0d32002657886fee561c641acddb9a75
> ACPI: fix CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS
>
> That said, these symptoms started when Fedora enabled
> CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS in their kernels, which coincided
> with the 3.0->3.1 transition. IIUC, disabling this option and using the
> pci=noioapicquirk cmdline option essentially accomplish the same thing.
>
> Consequently, I suspect that this IRQ rerouting feature never worked on this
> particular system -- or at least the heuristics to determine what is
> "broken" are incorrect, as the evidence suggests that the boot IRQs aren't
> actually "broken" because the system appears to work fine with the
> assignments that the BIOS handed out at boot time.
>
> Also, comment #12 probably explains the underlying cause.

OK. Let's first ping Stefan Assmann back...
I'll leave this open.

Thanks
Lv
Hi Solomon,
since you have a system to test this again let's continue the search. I didn't follow up on this so far as there was no way to test any changes.

Looking at the debug output you provided

[ 1.906802] 3w-9xxx 0000:03:04.0: PRT pin=1 link=ffff8802140621b8 index=0

index is 0 and that link carries a pointer to an acpi_handle. IIRC we would have to get the interrupt information via the _CRS method. Back in the day when the quirk was introduced all systems we saw got their IRQ via index, so this case was missed.

However, since you report that things work fine with noioapicquirk and my ACPI knowledge is not so sophisticated, I'd lean toward extending the code so that it detects that index is 0 and avoids any rerouting. The 6700PXH is a relic and I don't have any of those systems available anymore, so I feel this would be the safest way to proceed.

Lv, Len, would that be ok with you?
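The proposal above can be illustrated with a toy model. This is not the kernel code and not the contents of any patch in this thread: the class, the function, and the "(index % 4) + 16" mapping are assumptions, chosen only because they reproduce the "PCI IRQ 0 -> rerouted to legacy IRQ 16" message from the failing boot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrtEntry:
    """Toy stand-in for a _PRT entry as printed by the debug patch."""
    pin: int
    link: Optional[str]  # interrupt link device (e.g. "LNEC"), or None
    index: int


def rerouted_index(entry, skip_zero_index):
    """Return the index after the (modelled) boot interrupt reroute.

    skip_zero_index=False models the behaviour seen in the failing boots.
    skip_zero_index=True models the proposal: leave the entry alone when index is 0.
    """
    if skip_zero_index and entry.index == 0:
        return entry.index  # no rerouting, the link device is looked up as usual
    # Assumed mapping, consistent with "PCI IRQ 0 -> rerouted to legacy IRQ 16"
    return (entry.index % 4) + 16


entry = PrtEntry(pin=1, link="LNEC", index=0)
print(rerouted_index(entry, skip_zero_index=False))
print(rerouted_index(entry, skip_zero_index=True))
```

In this model the unpatched path turns index 0 into 16, which would then be rejected for a link device ("ACPI: Invalid index 16"), while the proposed check leaves the entry at index 0.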
Created attachment 249641 [details] bootirq-zero-index.patch Please give this patch a try.
ping...
I'm on an overseas trip right now; I won't have physical access to the system for another week.
After a second last-minute trip overseas, and a tree creating three new skylights (it's been a long month...) I finally have physical access to the system this weekend. I applied the patch on top of the Fedora 4.9.6-200 kernel. Compiling it now.
Good news! With the patch in #23, the system boots up fine without needing 'pci=noioapicquirk'. I'll attach the kernel log.
Created attachment 254161 [details] dmesg of successful boot with a patched 4.9.6 kernel (without pci=noioapicquirk)
@sassmann@redhat.com is the patch in comment #23 for upstream?
Yes, if nobody objects I'll submit it upstream.
please mark this bug as resolved one you've sent the patch out.
s/one/once
Already sent. http://marc.info/?l=linux-acpi&m=148672725418693&w=2
Created attachment 255271 [details] DMI quirk patch Solomon, upstream has asked to solve this by a different approach, namely excluding the M2N-LR by DMI quirk. Attached is a patch that hopefully triggers on your system. You should see "disable boot interrupt reroute" in your dmesg with the patch applied. Please verify that it works before I submit the new patch. Thanks!
So rather than a more generic solution, the "fix" is to add a workaround for my specific system... I'll compile a patched kernel and stage it for a test, but given that the machine is currently deployed in a very remote location [1], I'm rather reluctant to try this until I have physical access -- probably around the end of March. [1] literally a cabin in the woods
I can finally confirm that the second patch (DMI quirk) also successfully resolves the problem.

...
[ 0.498225] pci 0000:01:05.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[ 0.498463] ASUSTek Computer INC. M2N-LR detected: disable boot interrupt reroute
[ 0.498699] PCI: CLS 64 bytes, default 64
...
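For anyone who wants to check whether another machine would hit the same DMI match, here is a rough userspace sketch. It is not the kernel patch: the function names are invented, the strings come from the dmesg line above, and matching on the board vendor and board name fields (rather than the system fields) is an assumption.

```python
import os


def should_disable_boot_irq_reroute(dmi):
    """Model of the DMI match: True only for the ASUS M2N-LR board."""
    return (
        dmi.get("board_vendor") == "ASUSTek Computer INC."
        and dmi.get("board_name") == "M2N-LR"
    )


def read_board_dmi(base="/sys/class/dmi/id"):
    """Read the board vendor and name from sysfs (Linux only)."""
    dmi = {}
    for field in ("board_vendor", "board_name"):
        with open(os.path.join(base, field)) as f:
            dmi[field] = f.read().strip()
    return dmi
```

On the affected system, should_disable_boot_irq_reroute(read_board_dmi()) would be expected to return True if the firmware reports exactly these strings.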
(In reply to Solomon Peachy from comment #36)
> So rather than a more generic solution, the "fix" is to add a workaround for
> my specific system...

If you can explain the "entry->index == 0" solution, I'm more than happy to take that. My questions about it start here:

https://lkml.kernel.org/r/20170308223410.GA12086@bhelgaas-glaptop.roam.corp.google.com