Bug 84281

Summary: [BISECTED] LSI PCI FC Adapter not enumerated
Product: Drivers Reporter: Bjorn Helgaas (bjorn)
Component: PCI Assignee: drivers_pci (drivers_pci)
Status: RESOLVED CODE_FIX    
Severity: normal CC: dirk
Priority: P1    
Hardware: All   
OS: Linux   
URL: http://lkml.kernel.org/r/ghiol53r9u.fsf@lena.gouders.net
Kernel Version: 3.14 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: lspci (showing all devices)
dmesg (working)
dmesg (failing)
Output of dmidecode
AIDA64 report (txt)
AIDA64 report (html)
dmesg with Win2k/XP BIOS setting
dmesg with ACPI SRAT Table = disable BIOS setting
dmesg with Secured Setup Configurations = Yes BIOS setting

Description Bjorn Helgaas 2014-09-11 17:24:44 UTC
Dirk Gouders reported (see http://lkml.kernel.org/r/ghiol53r9u.fsf@lena.gouders.net):

On a Tyan VX50 (B4985) I ran into problems when updating the kernel: the
PCI FC Adapter is no longer recognized.

Actually, this machine caused more problems, because I was not able to
get it to run with Fedora, Debian or Ubuntu, but I am somewhat sure that has
to do with grub2, and now I am (again) running Gentoo on it, booting with
legacy grub.  So, I hope that has nothing to do with the PCI FC Adapter
not being recognized.

I bisected this PCI problem to commit 1820ffdccb9b4398 (PCI: Make sure
bus number resources stay within their parents bounds).

With this commit the "bridge configuration invalid..." message is
triggered; I'll attach the dmesg output.
Comment 1 Bjorn Helgaas 2014-09-11 17:46:10 UTC
Created attachment 149811 [details]
lspci (showing all devices)

Extracted from Dirk's email: 
http://lkml.kernel.org/r/ghwq9k3m6z.fsf@lena.gouders.net
Comment 2 Bjorn Helgaas 2014-09-11 18:06:44 UTC
Created attachment 149891 [details]
dmesg (working)

Dmesg log from 3.14.17, which works correctly, with PCI debug messages turned on.

From http://lkml.kernel.org/r/ghioku5l29.fsf@lena.gouders.net (timestamps removed).
Comment 3 Bjorn Helgaas 2014-09-11 18:08:39 UTC
Created attachment 149901 [details]
dmesg (failing)

Dmesg log from 1820ffdccb9b4398 (commit identified by bisection as the first bad commit), with PCI debug messages turned on.

From http://lkml.kernel.org/r/ghioku5l29.fsf@lena.gouders.net (timestamps removed).
Comment 4 Dirk Gouders 2014-09-14 11:33:44 UTC
Created attachment 150211 [details]
Output of dmidecode
Comment 5 Bjorn Helgaas 2014-09-19 17:23:21 UTC
Analysis (I wrote this up based on diagnosis by Andreas Noever):

Dirk tested a Tyan VX50 (B4985) with this FC adapter.  Prior to
1820ffdccb9b, enumeration looked like this:

    bus: [bus 00-7f] on node 0 link 1
    ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-07])
    pci 0000:00:0e.0: PCI bridge to [bus 0a] 
    pci_bus 0000:0a: busn_res: can not insert [bus 0a] under [bus 00-07] (conflicts with (null) [bus 00-07])
    pci 0000:0a:00.0: [1000:0646] type 00 class 0x0c0400 (FC adapter)

Note that the root bridge [bus 00-07] aperture is wrong; this is a BIOS
defect in the PCI0 _CRS method.  But prior to 1820ffdccb9b, we didn't
enforce that aperture, and the FC adapter worked fine at 0a:00.0.

After 1820ffdccb9b, we notice that 00:0e.0's aperture is not contained
in the root bridge's aperture, so we reconfigure it so it *is*
contained:

    pci 0000:00:0e.0: bridge configuration invalid ([bus 0a-0a]), reconfiguring
    pci 0000:00:0e.0: PCI bridge to [bus 06-07]

This effectively moves the FC device from 0a:00.0 to 07:00.0, which
should be legal.  But when we enumerate bus 06, the FC device doesn't
respond, so we don't find anything.  This is probably a defect in the
FC device.
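
The containment check described above can be illustrated with a small sketch. This is not kernel code; the BusRange class and its contains() method are hypothetical names used only to show why [bus 0a] fails against the [bus 00-07] root aperture while the reconfigured [bus 06-07] passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusRange:
    """Inclusive range of PCI bus numbers (hypothetical helper, not a kernel type)."""
    start: int
    end: int

    def contains(self, other: "BusRange") -> bool:
        # A child aperture fits only if both ends lie inside the parent.
        return self.start <= other.start and other.end <= self.end

    def __str__(self) -> str:
        if self.start == self.end:
            return f"[bus {self.start:02x}]"
        return f"[bus {self.start:02x}-{self.end:02x}]"


# Values taken from the dmesg excerpts in this comment.
root = BusRange(0x00, 0x07)          # root bridge aperture from _CRS (wrong per BIOS defect)
bios_bridge = BusRange(0x0a, 0x0a)   # 00:0e.0 as configured by the BIOS
new_bridge = BusRange(0x06, 0x07)    # 00:0e.0 after reconfiguration

for bridge in (bios_bridge, new_bridge):
    verdict = "fits in" if root.contains(bridge) else "is outside"
    print(f"{bridge} {verdict} {root}")
```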
Comment 6 Bjorn Helgaas 2014-09-19 17:43:49 UTC
[Oops, the FC device moves from 0a:00.0 to *06:00.0*]
Comment 7 Dirk Gouders 2014-09-20 19:23:40 UTC
Created attachment 151071 [details]
AIDA64 report (txt)
Comment 8 Dirk Gouders 2014-09-20 19:24:35 UTC
Created attachment 151081 [details]
AIDA64 report (html)
Comment 9 Dirk Gouders 2014-09-20 19:35:44 UTC
Created attachment 151091 [details]
dmesg with Win2k/XP BIOS setting

Output of dmesg when started with BIOS setting:

Advanced->Installed O/S->Win2k/XP
Comment 10 Dirk Gouders 2014-09-20 19:57:51 UTC
More tests with the VX50:

* The BIOS has a choice entry "Advanced -> Installed O/S" with
  the following possible choices:

    * Other
    * Win95
    * Win98
    * WinMe
    * Win2k/XP
    * Linux

  Our setting is "Linux", but today I tested all the others to check whether this
  setting is relevant to this issue.  None of these choices helped.  Comparing the
  timestamp-cleaned files, I don't see differences apart from message ordering
  and CPU and process numbers, so I am uploading just one of those files and will
  provide the others if wanted.

* I tested two other toggled settings:

  1) Advanced -> Hammer Configurations -> ACPI SRAT Table = disable
  2) Advanced -> Secured Setup Configurations = Yes

  Neither helped, but both produced different dmesg output, which I also attach.

* I created an AIDA64 report on Win2008 (Win98 failed to start after installation).
  I could not manage the FC adapter with the management software and could not
  identify it in the hardware list or AIDA report but will leave serious evaluation
  to more competent persons.
Comment 11 Dirk Gouders 2014-09-20 19:59:14 UTC
Created attachment 151101 [details]
dmesg with ACPI SRAT Table = disable BIOS setting
Comment 12 Dirk Gouders 2014-09-20 20:00:20 UTC
Created attachment 151111 [details]
dmesg with Secured Setup Configurations = Yes BIOS setting
Comment 13 Bjorn Helgaas 2014-09-22 14:26:27 UTC
Win98 doesn't start, and Windows Server 2008 starts but the FC adapter doesn't work at all.

The initial configuration from BIOS is the same as for Linux:

  ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-07])
  pci 0000:00:0e.0: PCI bridge to [bus 0a]

The AIDA64 report in comment #7 shows that Windows reconfigured the 00:0e.0 bridge to fit within the host bridge's [bus 00-07] range:

  B00 D0E F00:  nVIDIA nForce Pro 2200 (CK8-04 Pro) - PCI Express Root Port
    Offset 010:  00 00 00 00  00 00 00 00  00 07 07 00  31 31 00 20

The secondary (at 0x19) and subordinate (at 0x1a) bus numbers are both 07.
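
As a rough illustration of how those bytes are read out of the dump, here is a minimal sketch. It assumes the AIDA64 line format shown above (a hex base offset followed by 16 hex bytes); parse_dump_line is a hypothetical helper. The register offsets are the standard type 1 (bridge) header offsets, named as in the Linux pci_regs.h header.

```python
# Standard PCI-to-PCI bridge header offsets.
PCI_PRIMARY_BUS = 0x18
PCI_SECONDARY_BUS = 0x19
PCI_SUBORDINATE_BUS = 0x1A


def parse_dump_line(line: str) -> dict[int, int]:
    """Map config-space offset -> byte value for one 'Offset XXX: ...' line.

    Hypothetical helper; assumes the base offset is the last token before
    the colon and is written in hex.
    """
    label, _, data = line.partition(":")
    base = int(label.split()[-1], 16)
    return {base + i: int(tok, 16) for i, tok in enumerate(data.split())}


# Line copied from the AIDA64 report for B00 D0E F00.
line = "Offset 010:  00 00 00 00  00 00 00 00  00 07 07 00  31 31 00 20"
regs = parse_dump_line(line)

print(f"primary     = {regs[PCI_PRIMARY_BUS]:02x}")
print(f"secondary   = {regs[PCI_SECONDARY_BUS]:02x}")
print(f"subordinate = {regs[PCI_SUBORDINATE_BUS]:02x}")
```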

There is no device on bus 07.  The [1000:0646] LSI FC adapter is not visible at all.

This is essentially the same behavior that prompted this bug report.  The only real difference is that Linux reprogrammed the bridge to [bus 06-07], while Windows set it to [bus 07].
Comment 14 Bjorn Helgaas 2014-12-09 21:25:23 UTC
This should be fixed by 12d8706963f0 (Revert "PCI: Make sure bus number resources stay within their parents bounds"), which appeared in v3.17-rc7.

In 12d8706963f0, I mentioned other possible fixes:

1) Add a quirk to fix the _CRS information based on what amd_bus.c read from the hardware
2) Reset the FC device after we change its bus number

I'm not sure 1) is a good idea because it makes the hardware configuration out of sync with the platform's idea of it.  But this would be a corner case quirk and maybe good enough for this one platform.

For 2), it seems a little too aggressive to always reset devices when bus numbers change, but we could conceivably have a quirk to say "this device needs to be reset on bus number changes."  That would let us handle this broken LSI device even on other platforms.
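
To make the shape of option 2) concrete, here is a purely illustrative sketch of a per-device quirk lookup. Nothing in it exists in the kernel: the table name, the function name, and the decision logic are all hypothetical, and only the [1000:0646] vendor/device ID comes from this report.

```python
# Hypothetical quirk table: (vendor, device) IDs that are assumed to need a
# reset whenever their bus number changes.  Only this one ID is from the report.
RESET_ON_BUS_CHANGE = {
    (0x1000, 0x0646),  # LSI FC adapter seen at 0a:00.0 on the Tyan VX50
}


def needs_reset(vendor: int, device: int, old_bus: int, new_bus: int) -> bool:
    """Return True only for quirked devices whose bus number actually changed."""
    if old_bus == new_bus:
        return False
    return (vendor, device) in RESET_ON_BUS_CHANGE


# The FC adapter moving from bus 0a to bus 06 would be flagged;
# a hypothetical unlisted device making the same move would not.
print(needs_reset(0x1000, 0x0646, 0x0A, 0x06))
print(needs_reset(0x1234, 0x5678, 0x0A, 0x06))
```

Keeping the reset behind an ID table, rather than resetting every device on renumbering, matches the concern above that an unconditional reset would be too aggressive.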