Bug 43074

Summary: Bisected: CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS broken unless 'pci=noioapicquirk' - ASUS M2N-LR
Product: ACPI
Reporter: Solomon Peachy (pizza)
Component: Config-Interrupts
Assignee: Stefan Assmann (sassmann)
Status: RESOLVED CODE_FIX
Severity: high
CC: acpi-bugzilla, bjorn, lenb, lv.zheng, rui.zhang, sassmann, trenn
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.8.14
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: kernel log of failed 3.3.1 bootup
kernel log of successful 3.0.6 bootup.
output of dmidecode
output of 'lspci -v'
output of 'acpidump' on working 3.0 kernel
boot log of patched 3.3.2 kernel (success)
output of 'lspci -nn'
debug-prt.patch
dmesg with debug-prt.patch applied
bootirq-zero-index.patch
dmesg of successful boot with a patched 4.9.6 kernel (without pci=noioapicquirk)
DMI quirk patch

Description Solomon Peachy 2012-04-08 22:48:48 UTC
Created attachment 72854 [details]
kernel log of failed 3.3.1 bootup

I have an M2N-LR motherboard with a pair of 3Ware 9550SX RAID controllers plugged into PCI-X slots.  With kernel 3.0.x or older, everything works fine.  However, with kernel 3.1.x or newer (tested with 3.1, 3.2, and 3.3.1) the controller IRQ assignments are all wonky and the cards don't come up properly as the SCSI bus probe fails.

*** kernel 3.0.6 (good)
3ware 9000 Storage Controller device driver for Linux v2.26.02.014.
ACPI: PCI Interrupt Link [LNEC] enabled at IRQ 18
3w-9xxx 0000:03:00.0: PCI INT A -> Link[LNEC] -> GSI 18 (level, low) -> IRQ 18
scsi2 : 3ware 9000 Storage Controller
3w-9xxx: scsi2: Found a 3ware 9000 Storage Controller at 0xefdff000, IRQ: 18.
3w-9xxx: scsi2: Firmware FE9X 3.08.00.029, BIOS BE9X 3.10.00.003, Ports: 8.
3w-9xxx 0000:03:04.0: PCI INT A -> Link[LNEC] -> GSI 18 (level, low) -> IRQ 18
scsi 2:0:0:0: Direct-Access     AMCC     9550SXU-8L DISK  3.08 PQ: 0 ANSI: 5
scsi 2:0:1:0: Direct-Access     AMCC     9550SXU-8L DISK  3.08 PQ: 0 ANSI: 5
scsi7 : 3ware 9000 Storage Controller
3w-9xxx: scsi7: Found a 3ware 9000 Storage Controller at 0xefdfe000, IRQ: 18.
3w-9xxx: scsi7: Firmware FE9X 3.08.00.029, BIOS BE9X 3.10.00.003, Ports: 4.
scsi 7:0:0:0: Direct-Access     AMCC     9550SX-4LP DISK  3.08 PQ: 0 ANSI: 5

*** kernel 3.3.1 (Same results with 3.1.x and 3.2.x)

3ware 9000 Storage Controller device driver for Linux v2.26.02.014.
3w-9xxx 0000:03:00.0: PCI IRQ 0 -> rerouted to legacy IRQ 16
ACPI: Invalid index 16
3w-9xxx 0000:03:00.0: PCI INT A: no GSI - using ISA IRQ 14
scsi4 : 3ware 9000 Storage Controller
3w-9xxx: scsi4: Found a 3ware 9000 Storage Controller at 0xefdff000, IRQ: 14.
3w-9xxx: scsi4: Firmware FE9X 3.08.00.029, BIOS BE9X 3.10.00.003, Ports: 8.
3w-9xxx 0000:03:04.0: PCI IRQ 0 -> rerouted to legacy IRQ 16
ACPI: Invalid index 16
3w-9xxx 0000:03:04.0: PCI INT A: no GSI - using ISA IRQ 14
scsi8 : 3ware 9000 Storage Controller
3w-9xxx: scsi8: Found a 3ware 9000 Storage Controller at 0xefdfe000, IRQ: 14.
3w-9xxx: scsi8: Firmware FE9X 3.08.00.029, BIOS BE9X 3.10.00.003, Ports: 4.
scsi: waiting for bus probes to complete ...
scsi 4:0:0:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
scsi 8:0:0:0: WARNING: (0x06:0x002C): Command (0x12) timed out, resetting card.
scsi 4:0:0:0: WARNING: (0x06:0x002C): Command (0x0) timed out, resetting card.
scsi 8:0:0:0: WARNING: (0x06:0x002C): Command (0x0) timed out, resetting card.
scsi 4:0:0:0: Device offlined - not ready after error recovery
scsi 8:0:0:0: Device offlined - not ready after error recovery
[repeat above six lines fifteen more times, once for each LUN]

Strictly speaking these are Fedora 15 kernels (specifically versions "2.6.40.6" and "2.6.43.1") but I don't think that has any bearing on this problem.  I've reported this to RedHat (RHBZ #808880) but this is likely an upstream kernel bug so I'm reporting it here too.

I'll be attaching kernel logs of successful 3.0.6 and unsuccessful 3.3.1 boots, as well as 'dmidecode' and 'lspci -v' output.
Comment 1 Solomon Peachy 2012-04-08 22:49:21 UTC
Created attachment 72855 [details]
kernel log of successful 3.0.6 bootup.
Comment 2 Solomon Peachy 2012-04-08 22:49:45 UTC
Created attachment 72856 [details]
output of dmidecode
Comment 3 Solomon Peachy 2012-04-08 22:50:10 UTC
Created attachment 72857 [details]
output of 'lspci -v'
Comment 4 Len Brown 2012-04-10 02:49:12 UTC
working:
ACPI: PCI Interrupt Link [LNEC] (IRQs 16 17 18 19) *14
...
ACPI: PCI Interrupt Link [LNEC] enabled at IRQ 18
3w-9xxx 0000:03:00.0: PCI INT A -> Link[LNEC] -> GSI 18 (level, low) -> IRQ 18


failing:

ACPI: PCI Interrupt Link [LNEC] (IRQs 16 17 18 19) *14
...
<LNEC is never accessed again>
3w-9xxx 0000:03:00.0: PCI IRQ 0 -> rerouted to legacy IRQ 16
ACPI: Invalid index 16
3w-9xxx 0000:03:00.0: PCI INT A: no GSI - using ISA IRQ 14


So for some reason on the new kernel, when PCI
enumerated 3w-9xxx, it did not find the PCI interrupt
link device for that device!

My 1st guess is that something in PCI broke, just b/c
we've not touched the code in this area in ACPI...

Please attach the output from acpidump from the working system.
Comment 5 Len Brown 2012-04-10 02:55:17 UTC
Please try with and without CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS

Please report if reverting this patch between 3.0 and 3.1 has an effect:

d7f6169a0d32002657886fee561c641acddb9a75
ACPI: fix CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS
Comment 6 Solomon Peachy 2012-04-10 10:35:17 UTC
Created attachment 72871 [details]
output of 'acpidump' on working 3.0 kernel

When generating this output, acpidump printed this on stderr:

  Wrong checksum for generic table!
Comment 7 Solomon Peachy 2012-04-10 10:42:24 UTC
The failing kernels are all built with CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y

It's going to take a couple of days to schedule a maintenance window and get the patched/reconfigured kernel staged for testing, but I'll get the test results ASAP.
Comment 8 Solomon Peachy 2012-04-14 20:35:52 UTC
Created attachment 72920 [details]
boot log of patched 3.3.2 kernel (success)

This is the boot log of the Fedora 15 3.3.2 kernel (aka 2.6.43.2-2.fc15.x86_64) with the following commit reverted:

d7f6169a0d32002657886fee561c641acddb9a75
ACPI: fix CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS

Also, CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y

In short, when that patch was reverted, the kernel booted up successfully, with correct IRQ routing.
Comment 9 Len Brown 2012-04-24 02:47:14 UTC
*** Bug 43150 has been marked as a duplicate of this bug. ***
Comment 10 Stefan Assmann 2012-04-25 11:06:08 UTC
Could you please add the output of "lspci -nn". Thanks!
Comment 11 Solomon Peachy 2012-04-25 11:16:16 UTC
Created attachment 73077 [details]
output of 'lspci -nn'

Here's the dump, taken from the system running the patched 3.3.2 kernel.
Comment 12 Stefan Assmann 2012-04-27 09:18:07 UTC
The problem is likely related to the 6702PXH PCI Express-to-PCI Bridge and your system being unable to use MSIs. You should be able to work around the problem by adding "pci=noioapicquirk" to the kernel command line for now. I'll try to come up with a debug patch to gather more information.
Comment 13 Stefan Assmann 2012-05-02 11:04:16 UTC
Created attachment 73159 [details]
debug-prt.patch

Hi Solomon,
could you test this patch on top of a 3.3 kernel? Please provide a full dmesg.
Thanks!
Comment 14 Solomon Peachy 2012-05-06 11:43:04 UTC
I am currently building a new kernel with the debug patch, and I'll have results in a few hours.
Comment 15 Solomon Peachy 2012-05-16 21:57:16 UTC
Created attachment 73318 [details]
dmesg with debug-prt.patch applied

I had to boot with pci=noioapicquirk, but the debugging output is attached.
Comment 16 Solomon Peachy 2012-10-18 20:12:38 UTC
Just as an FYI, I no longer have the hardware in question.  The PCI-X controller cards have been replaced with a single PCIe controller.
Comment 17 Zhang Rui 2012-11-28 13:19:58 UTC
So I'll close this bug, as it cannot be reproduced any more.
Please feel free to reopen it if anyone can reproduce the problem.
Comment 18 Solomon Peachy 2016-12-20 12:48:49 UTC
Due to an odd series of events, I once again have a 3Ware 9550SXU PCI-X card plugged into an M2N-LR motherboard.

The problem I reported four years ago is still present with Fedora's 4.8.14 kernel -- Just as described in comment #4.

Adding 'pci=noioapicquirk' to the kernel cmdline is still a successful workaround for the problem.

As I am now able to reproduce the problem with current kernels, I'm re-opening this bug...
Comment 19 Lv Zheng 2016-12-21 06:12:02 UTC
Would it be possible for you to bisect this to find the culprit?
The bug is marked as a regression and the symptom is deterministic, so it should be bisectable.

Thanks
Lv
Comment 20 Solomon Peachy 2016-12-21 12:22:57 UTC
According to what I tried earlier, reverting this commit allowed things to work:  (see comment #8)

d7f6169a0d32002657886fee561c641acddb9a75
ACPI: fix CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS

That said, these symptoms started when Fedora enabled CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS in their kernels, which coincided with the 3.0->3.1 transition.  IIUC, disabling this option and using the pci=noioapicquirk cmdline option essentially accomplish the same thing.  

Consequently, I suspect that this IRQ rerouting feature never worked on this particular system -- or at least the heuristics to determine what is "broken" are incorrect, as the evidence suggests that the boot IRQs aren't actually "broken" because the system appears to work fine with the assignments that the BIOS handed out at boot time.

Also, comment #12 probably explains the underlying cause.
Comment 21 Lv Zheng 2016-12-22 02:26:11 UTC
(In reply to Solomon Peachy from comment #20)
> According to what I tried earlier, reverting this commit allowed things to
> work:  (see comment #8)

Ah, thanks.

> 
> d7f6169a0d32002657886fee561c641acddb9a75
> ACPI: fix CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS
> 
> That said, these symptoms started when Fedora enabled
> CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS in their kernels, which coincided
> with the 3.0->3.1 transition.  IIUC, disabling this option and using the
> pci=noioapicquirk cmdline option essentially accomplish the same thing.  
> 
> Consequently, I suspect that this IRQ rerouting feature never worked on this
> particular system -- or at least the heuristics to determine what is
> "broken" are incorrect, as the evidence suggests that the boot IRQs aren't
> actually "broken" because the system appears to work fine with the
> assignments that the BIOS handed out at boot time.
> 
> Also, comment #12 probably explains the underlying cause.

OK.
Let's first ping Stefan Assmann back...
I'll leave this open.

Thanks
Lv
Comment 22 Stefan Assmann 2017-01-02 12:18:51 UTC
Hi Solomon,
since you have a system to test this again let's continue the search. I didn't follow up on this so far as there was no way to test any changes.

Looking at the debug output you provided
[    1.906802] 3w-9xxx 0000:03:04.0: PRT pin=1 link=ffff8802140621b8 index=0
index is 0 and that link carries a pointer to an acpi_handle. IIRC we would have to get the interrupt information via the _CRS method. Back in the day when the quirk was introduced all systems we saw got their IRQ via index, so this case was missed.
However, since you report that things work fine with noioapicquirk, and my ACPI knowledge is not that deep, I'd lean toward extending the code so that it detects that index is 0 and avoids any rerouting. The 6700PXH is a relic and I don't have any of those systems available anymore, so I feel this would be the safest way to proceed.

Lv, Len, would that be ok with you?
Comment 23 Stefan Assmann 2017-01-02 13:52:31 UTC
Created attachment 249641 [details]
bootirq-zero-index.patch

Please give this patch a try.
Comment 24 Zhang Rui 2017-01-11 02:42:03 UTC
ping...
Comment 25 Solomon Peachy 2017-01-11 09:07:37 UTC
I'm on an overseas trip right now; I won't have physical access to the system for another week.
Comment 26 Zhang Rui 2017-01-23 06:38:42 UTC
ping...
Comment 27 Solomon Peachy 2017-02-04 21:06:23 UTC
After a second last-minute trip overseas, and a tree creating three new skylights (it's been a long month...) I finally have physical access to the system this weekend.

I applied the patch on top of the Fedora 4.9.6-200 kernel.  Compiling it now.
Comment 28 Solomon Peachy 2017-02-05 01:11:03 UTC
Good news!  With the patch in #23, the system boots up fine without needing 'pci=noioapicquirk'.

I'll attach the kernel log.
Comment 29 Solomon Peachy 2017-02-05 01:12:42 UTC
Created attachment 254161 [details]
dmesg of successful boot with a patched 4.9.6 kernel (without pci=noioapicquirk)
Comment 30 Zhang Rui 2017-02-06 01:55:32 UTC
@sassmann@redhat.com

Is the patch in comment #23 intended for upstream?
Comment 31 Stefan Assmann 2017-02-06 07:19:18 UTC
Yes, if nobody objects I'll submit it upstream.
Comment 32 Zhang Rui 2017-02-13 03:17:07 UTC
please mark this bug as resolved one you've sent the patch out.
Comment 33 Zhang Rui 2017-02-13 03:17:26 UTC
s/one/once
Comment 34 Stefan Assmann 2017-02-13 11:02:06 UTC
Already sent.
http://marc.info/?l=linux-acpi&m=148672725418693&w=2
Comment 35 Stefan Assmann 2017-03-15 15:12:09 UTC
Created attachment 255271 [details]
DMI quirk patch

Solomon,
upstream has asked to solve this by a different approach, namely excluding the M2N-LR by DMI quirk. Attached is a patch that hopefully triggers on your system. You should see "disable boot interrupt reroute" in your dmesg with the patch applied.
Please verify that it works before I submit the new patch.
Thanks!
Comment 36 Solomon Peachy 2017-03-15 17:36:43 UTC
So rather than a more generic solution, the "fix" is to add a workaround for my specific system...

I'll compile a patched kernel and stage it for a test, but given that the machine is currently deployed in a very remote location [1], I'm rather reluctant to try this until I have physical access -- probably around the end of March.

[1] literally a cabin in the woods
Comment 37 Solomon Peachy 2017-04-15 12:26:04 UTC
I can finally confirm that the second patch (DMI quirk) also successfully resolves the problem.

...
[    0.498225] pci 0000:01:05.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[    0.498463] ASUSTek Computer INC. M2N-LR detected: disable boot interrupt reroute
[    0.498699] PCI: CLS 64 bytes, default 64
...
Comment 38 Bjorn Helgaas 2017-04-25 20:33:02 UTC
(In reply to Solomon Peachy from comment #36)
> So rather than a more generic solution, the "fix" is to add a workaround for
> my specific system...

If you can explain the "entry->index == 0" solution, I'm more than happy to take that.  My questions about it start here: https://lkml.kernel.org/r/20170308223410.GA12086@bhelgaas-glaptop.roam.corp.google.com