Hello,

I found that the patch "PCI: Rework ASPM disable code" (commit 3c076351c4027a56d5005a39a0b518a4ba393ce2) causes LSI/Intel RAID adapters to fail on Intel servers. Tested with Intel S2600IP and Intel S2400SC mainboards.

Dmesg output on patched kernels:

[ 2444.630689] megasas: 0x1000:0x005b:0x8086:0x3510: bus 133:slot 0:func 0
[ 2444.630897] megasas: Waiting for FW to come to ready state
[ 2444.630900] megasas: FW in FAULT state!!

After reverting this patch everything works well:

[ 30.052181] megasas: 0x1000:0x005b:0x8086:0x3510: bus 133:slot 0:func 0
[ 30.052392] megasas: FW now in Ready state
[ 30.052436] megaraid_sas 0000:85:00.0: irq 132 for MSI/MSI-X
[ 30.052445] megaraid_sas 0000:85:00.0: irq 133 for MSI/MSI-X
[ 30.052454] megaraid_sas 0000:85:00.0: irq 134 for MSI/MSI-X
[ 30.052462] megaraid_sas 0000:85:00.0: irq 135 for MSI/MSI-X
[ 30.052470] megaraid_sas 0000:85:00.0: irq 136 for MSI/MSI-X
[ 30.052479] megaraid_sas 0000:85:00.0: irq 137 for MSI/MSI-X
[ 30.052487] megaraid_sas 0000:85:00.0: irq 138 for MSI/MSI-X
[ 30.052495] megaraid_sas 0000:85:00.0: irq 139 for MSI/MSI-X
[ 30.052504] megaraid_sas 0000:85:00.0: irq 140 for MSI/MSI-X
[ 30.052512] megaraid_sas 0000:85:00.0: irq 141 for MSI/MSI-X
[ 30.052521] megaraid_sas 0000:85:00.0: irq 142 for MSI/MSI-X
[ 30.052534] megaraid_sas 0000:85:00.0: irq 143 for MSI/MSI-X
[ 30.052543] megaraid_sas 0000:85:00.0: irq 144 for MSI/MSI-X
[ 30.052551] megaraid_sas 0000:85:00.0: irq 145 for MSI/MSI-X
[ 30.052559] megaraid_sas 0000:85:00.0: irq 146 for MSI/MSI-X
[ 30.052568] megaraid_sas 0000:85:00.0: irq 147 for MSI/MSI-X
[ 30.078270] megasas:IOC Init cmd success
[ 30.108295] megasas: INIT adapter done

This issue was also reported here:
https://bugs.launchpad.net/ubuntu/+source/debian-installer/+bug/1091465
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1091263
http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg17919.html
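For anyone comparing several boot logs, the two outcomes shown above can be told apart by the firmware-state messages alone. The following is only a minimal sketch: the function name `classify_megasas_boot` is hypothetical, and the message fragments are copied from the dmesg excerpts in this report, so other driver versions may word them differently.

```python
# Message fragments copied from the dmesg excerpts in this report.
# Assumption: other megaraid_sas versions may use different wording.
FAULT_MSG = "megasas: FW in FAULT state"
READY_MSG = "megasas: FW now in Ready state"


def classify_megasas_boot(dmesg_text):
    """Return "fault", "ready", or "unknown" for a captured dmesg log.

    A fault message anywhere in the log takes precedence, because a
    failing boot may retry and print the fault message more than once.
    """
    if FAULT_MSG in dmesg_text:
        return "fault"
    if READY_MSG in dmesg_text:
        return "ready"
    return "unknown"


if __name__ == "__main__":
    failing = (
        "[ 2444.630897] megasas: Waiting for FW to come to ready state\n"
        "[ 2444.630900] megasas: FW in FAULT state!!\n"
    )
    working = "[ 30.052392] megasas: FW now in Ready state\n"
    print(classify_megasas_boot(failing))
    print(classify_megasas_boot(working))
```

Feeding it the output of `dmesg` from each boot gives a quick pass/fail label without reading the whole log.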
Arkadiusz, can you please attach the complete dmesg logs (with and without the revert) and the complete "lspci -vvxxx" output? Thanks!
This looks similar to the ASPM issues in bug #64541 (iwlwifi, resolved by a driver change), bug #59311 (sdhci), and bug #73241 (sdhci). The sdhci issues are still open, but I suspect they are also driver problems. I think this megasas issue is also a driver problem, but I can't tell without more information (requested in comment #1). I'm reassigning this to SCSI drivers on that assumption, but if we don't get any more information, I guess we should just close this.
@Bjorn: I have a Supermicro system with the same problem. I haven't reverted that patch yet to test, but plan to later this weekend. pcie_aspm=off as a boot param has NO effect on the problem.

Here's the lspci you wanted:

# lspci -d 1000: -vvxxx
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
    Subsystem: Super Micro Computer Inc LSI MegaRAID ROMB
    Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 16
    Region 0: I/O ports at 8000 [disabled] [size=256]
    Region 1: Memory at dfe60000 (64-bit, non-prefetchable) [size=16K]
    Region 3: Memory at dfe00000 (64-bit, non-prefetchable) [size=256K]
    Expansion ROM at dfe40000 [disabled] [size=128K]
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
            EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest+
    Capabilities: [d0] Vital Product Data
pcilib: sysfs_read_vpd: read failed: Connection timed out
        Not readable
    Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000 Data: 0000
    Capabilities: [c0] MSI-X: Enable- Count=16 Masked-
        Vector table: BAR=1 offset=00002000
        PBA: BAR=1 offset=00003000
00: 00 10 5b 00 02 00 10 00 05 00 04 01 10 00 00 00
10: 01 80 00 00 04 00 e6 df 00 00 00 00 04 00 e0 df
20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 90 06
30: 00 00 e4 df 50 00 00 00 00 00 00 00 0b 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 68 03 06 08 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 01 00 00 10 d0 02 00 25 80 00 10
70: 20 28 00 00 83 04 40 00 40 00 83 10 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 16 00 00 00
90: 00 00 00 00 0e 00 00 00 03 00 3e 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 05 c0 80 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 11 00 0f 00 01 20 00 00 01 30 00 00 00 00 00 00
d0: 03 a8 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
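Since the question here is about ASPM state, it may help to cross-check the decoded "LnkCap"/"LnkCtl" lines against the raw config-space bytes in the same dump. The sketch below is only an illustration: the helper names `parse_rows` and `read_le` are hypothetical, the two hex rows are copied from the dump above, and the register offsets (Link Capabilities at capability + 0x0C, Link Control at capability + 0x10) and bit positions are the standard PCI Express capability layout as I understand it.

```python
# Rows 0x60 and 0x70 copied from the "lspci -vvxxx" dump in this comment.
HEX_ROWS = """\
60: 00 00 00 00 00 01 00 00 10 d0 02 00 25 80 00 10
70: 20 28 00 00 83 04 40 00 40 00 83 10 00 00 00 00
"""

PCIE_CAP = 0x68  # from "Capabilities: [68] Express (v2) Endpoint"


def parse_rows(text):
    """Map config-space offset -> byte value from lspci -x style rows."""
    cfg = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        base, data = line.split(":", 1)
        for i, byte in enumerate(data.split()):
            cfg[int(base, 16) + i] = int(byte, 16)
    return cfg


def read_le(cfg, offset, size):
    """Read a little-endian register of `size` bytes at `offset`."""
    return sum(cfg[offset + i] << (8 * i) for i in range(size))


cfg = parse_rows(HEX_ROWS)
lnkcap = read_le(cfg, PCIE_CAP + 0x0C, 4)  # Link Capabilities
lnkctl = read_le(cfg, PCIE_CAP + 0x10, 2)  # Link Control

# ASPM support is LnkCap bits 11:10; ASPM control is LnkCtl bits 1:0.
ASPM_NAMES = {0: "Disabled", 1: "L0s", 2: "L1", 3: "L0s L1"}
aspm_supported = (lnkcap >> 10) & 0x3
aspm_enabled = lnkctl & 0x3
common_clock = bool(lnkctl & 0x40)  # LnkCtl bit 6, shown as CommClk

print("LnkCap = 0x%08x, ASPM supported: %s" % (lnkcap, ASPM_NAMES[aspm_supported]))
print("LnkCtl = 0x%04x, ASPM enabled: %s, CommClk: %s"
      % (lnkctl, ASPM_NAMES[aspm_enabled], common_clock))
```

For this dump the raw bytes agree with the decoded text: the device advertises L0s only, and ASPM is disabled in Link Control with the common clock bit set.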
Arkadiusz, Robin, is this still a problem? Does booting with "pci=conf1" make a difference? That was a workaround for the Ubuntu issue (mentioned in the original description). I still need two complete dmesg logs: one from a boot showing the problem, and one from a working boot. It's best if these are from a recent upstream kernel, e.g., v4.0.
Created attachment 175171
dmesg (working, v3.16-rc4)

Sorry, I forgot that Robin did extensive testing of this and attached logs here: http://www.spinics.net/lists/linux-scsi/msg76204.html . I'm going to attach them to this bugzilla as well so they don't get lost.

This v3.16-rc4 dmesg log is a working boot.
Created attachment 175181
dmesg (failing, v3.16-rc4)

Same v3.16-rc4 kernel, but it fails on this boot.
Created attachment 175191
lspci (working)
Created attachment 175201
lspci (failing)
Summary of Robin's testing (from his email):

Kernels:
  K.1: Ubuntu's 3.16-rc4
  K.2: 3.2-rc4 3c076351c402 - aspm merged
  K.3: 3.2-rc4 69166fbf02c7 - aspm merge parent
  Notes: 3.2* compiled with GCC4.6, 3.16-rc4 with GCC4.8

BIOS: Boot -> FastBoot:
  B1.1 Off
  B1.2 On (CMOS reset default)

BIOS: Advanced -> PCIe/PCI/PnP Configuration -> ASPM Support
  B2.1 Force L0s
  B2.2 BIOS (CMOS reset default)
  B2.3 Disabled

Reduced Karnaugh map of results:

  Kernels, B1,   B2:   Result
  *,       B1.1, *     PASS
  *,       B1.2, B2.1  VARIABLE (9 runs: 5 fail, 4 pass, no kernel consistency)
  K.1,     B1.2, B2.2  FAIL
  K.1,     B1.2, B2.3  FAIL
  K.2,     B1.2, B2.2  FAIL
  K.2,     B1.2, B2.3  FAIL
  K.3,     B1.2, B2.2  PASS
  K.3,     B1.2, B2.3  PASS
My understanding of the Karnaugh map is that:

  - Fast Boot disabled: all kernels always passed
  - Fast Boot enabled, ASPM set to Force L0s: variable; no consistency of results
  - Fast Boot enabled, ASPM set to BIOS or Disabled: pre-3c076351c402 always passed, post-3c076351c402 always failed

Here are some diffs between the working and failing v3.16-rc4 boots:

--- dmesg.working 2015-04-28 11:23:19.900776670 -0500
+++ dmesg.broken 2015-04-28 11:23:14.632848652 -0500
 megasas: 06.803.01.00-rc1 Mon. Mar. 10 17:00:00 PDT 2014
 megasas: 0x1000:0x005b:0x15d9:0x0690: bus 1:slot 0:func 0
-megasas: FW now in Ready state
+megaraid_sas 0000:01:00.0: enabling device (0000 -> 0002)
+megasas: Waiting for FW to come to ready state
+megasas: FW in FAULT state!!
+megaraid_sas 0000:01:00.0: megasas: FW restarted successfully from megasas_init_fw!
+megasas: Waiting for FW to come to ready state
+megasas: FW in FAULT state!!

My theory is that when Fast Boot is enabled, the BIOS does not run the megasas option ROM. In that case, Linux receives the device uninitialized (hence the new "enabling device" message). I suspect megaraid_sas depends on something done by the option ROM, possibly something related to ASPM.
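The three-way reading of Robin's results can be checked mechanically by expanding the wildcard rows of the reduced map into the full 3 x 2 x 3 grid. This is only a sketch: the K.*/B1.*/B2.* labels and results are taken from the summary of Robin's email, while the names `REDUCED_MAP` and `expand` are hypothetical.

```python
from itertools import product

KERNELS = ("K.1", "K.2", "K.3")    # K.3 is the parent of 3c076351c402
FAST_BOOT = ("B1.1", "B1.2")       # B1.1 = Off, B1.2 = On
ASPM = ("B2.1", "B2.2", "B2.3")    # Force L0s, BIOS, Disabled

# Rows of the reduced map as reported; "*" matches every value.
REDUCED_MAP = [
    ("*", "B1.1", "*", "PASS"),
    ("*", "B1.2", "B2.1", "VARIABLE"),
    ("K.1", "B1.2", "B2.2", "FAIL"),
    ("K.1", "B1.2", "B2.3", "FAIL"),
    ("K.2", "B1.2", "B2.2", "FAIL"),
    ("K.2", "B1.2", "B2.3", "FAIL"),
    ("K.3", "B1.2", "B2.2", "PASS"),
    ("K.3", "B1.2", "B2.3", "PASS"),
]


def expand(rows):
    """Expand wildcard rows into {(kernel, fast_boot, aspm): result}."""
    table = {}
    for kernel, fast_boot, aspm, result in rows:
        kernels = KERNELS if kernel == "*" else (kernel,)
        boots = FAST_BOOT if fast_boot == "*" else (fast_boot,)
        aspms = ASPM if aspm == "*" else (aspm,)
        for key in product(kernels, boots, aspms):
            if key in table:
                raise ValueError("combination listed twice: %r" % (key,))
            table[key] = result
    return table


table = expand(REDUCED_MAP)

# Collect the set of results for each case in the interpretation.
fast_boot_off = {table[(k, "B1.1", a)] for k in KERNELS for a in ASPM}
force_l0s = {table[(k, "B1.2", "B2.1")] for k in KERNELS}
pre_commit = {table[("K.3", "B1.2", a)] for a in ("B2.2", "B2.3")}
post_commit = {table[(k, "B1.2", a)]
               for k in ("K.1", "K.2") for a in ("B2.2", "B2.3")}

print(len(table), fast_boot_off, force_l0s, pre_commit, post_commit)
```

The reduced map covers all 18 combinations exactly once, and each of the three cases collapses to a single result, which is what makes the Fast Boot setting stand out as the deciding factor.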
Created attachment 175271
debug patch for 69166fbf02c7

This patch applies on 69166fbf02c7. Please boot it with "pci=earlydump" and attach the resulting dmesg log here.
Created attachment 175281
debug patch for v4.1-rc1

This patch applies on v4.1-rc1. Please boot it with "pci=earlydump" and attach the resulting dmesg log here.
I collected other similar reports on the web. Here's a summary of what I found.

Chris reported an issue [1] on an unspecified system with megaraid_sas and a MegaRAID SAS 2208 adapter on Debian Wheezy (kernel based on v3.2). He later reported [11] that neither "acpi=off" nor "pci=conf1" helped.

Ron reported [2] that v3.0.0 worked, v3.2 through v3.7.1 did not work, and "pci=conf1" was a workaround on his Intel S2600CP system.

Gunnar reported [3] that "acpi=off" was a workaround for Ubuntu 12.04.

Arkadiusz reported [4] a similar problem on Intel S2600IP and S2400SC systems and bisected it to 3c076351c402 ("PCI: Rework ASPM disable code"), which appeared in v3.3-rc1.

Robin reported [5] a similar problem on a Supermicro X9DRH-7TF system, where "pcie_aspm=off" didn't help (but I'm not confident that "pcie_aspm=off" is equivalent to reverting the commit Arkadiusz identified). Furthermore [10], "pci=conf1" and "disable_msi=1" didn't help either. The failure happens only after 3c076351c402 ("PCI: Rework ASPM disable code"), and turning off the BIOS Fast Boot feature is a workaround [12].

Joro reported [6] a similar problem on an Intel S2600CP4 system with Ubuntu 12.04, 12.10, and 13.04, and that CentOS 6.3 worked fine. Ron [7] suggested "pci=conf1" as a workaround.

Michał reported [8] that on an Intel S2600IP4 system, v3.2.4 worked, but v3.2.5 had the same problem. v3.2.5 added 3c076351c402 ("PCI: Rework ASPM disable code").

Matthias confirmed [9] the same problem as Michał on an Intel S1200BTLR system with v3.2.24 and v3.5.0, both of which contain 3c076351c402.

In all cases the failing kernel includes 3c076351c402 (I couldn't verify this for Chris' report on Wheezy). When reported, the working kernel (3.0.0, CentOS 6.3, v3.2.4, and the bisected result) does not include 3c076351c402.

Ron suggested "pci=conf1" as a workaround on an Intel S2600CP system, but others have tried it without success.

Robin found that turning off BIOS "Fast Boot" was a workaround on a Supermicro system, but nobody else has tried this.

[1] http://debian.2.n7.nabble.com/Wheezy-Driver-for-Intel-RMS25CB080-RAID-Controller-tp2783386.html
[2] http://debian.2.n7.nabble.com/Wheezy-Driver-for-Intel-RMS25CB080-RAID-Controller-td2783386.html#a2836241
[3] https://lists.debian.org/debian-user/2012/10/msg01332.html
[4] https://bugzilla.kernel.org/show_bug.cgi?id=63661
[5] https://bugzilla.kernel.org/show_bug.cgi?id=63661#c3
[6] https://bugs.launchpad.net/ubuntu/+source/debian-installer/+bug/1091465
[7] https://bugs.launchpad.net/ubuntu/+source/debian-installer/+bug/1091465/comments/6
[8] https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg17919.html
[9] https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg17925.html
[10] http://permalink.gmane.org/gmane.linux.scsi/92439
[11] https://lists.debian.org/debian-user/2013/01/msg00056.html
[12] http://www.spinics.net/lists/linux-scsi/msg76204.html
Bjorn, thanks for the excellent summary. And thanks a lot to Arkadiusz for finding the commit that introduced the problem.

I can only confirm all of this. We had customers reporting a similar issue on Intel S2600 series systems running SLES 11 SP3 and SLES 12, both of which contain commit 3c076351c402 (SLES 11 SP3 got it from stable kernel update 3.0.20). The same customers are running SLES 11 SP2 with no problem (it also includes kernel update 3.0.20, but with 3c076351c402 reverted because it caused PS/2 keyboard and touchpad misdetection on other systems, a problem apparently solved in the meantime).

I'm going to suggest to the customers that they look for a Fast Boot option in the BIOS and disable it if it exists.
Created attachment 182051
Candidate fix

This candidate fix I received from Kashyap Desai (Avago) appears to solve the problem.
Let me adjust my previous comment: the candidate fix I received fixed the problem on SLES 12 (based on kernel 3.12) but not on SLES 11 SP3 (based on kernel 3.0.) So it should be sufficient for upstream, but for older kernels it seems that some other commits must be backported.
Bjorn, are you still interested in a boot log with the debug patch from comment #11?
> Bjorn, are you still interested in a boot log with the debug patch from
> comment #11?

I don't think so. It sounds like the driver change, i.e., comment 15, solves the problem, so I guess there's nothing for me to do.
Actually, I'm not completely sure that it solves the problem. With the patch from comment #15 backported to an older kernel, the controller starts, but only after a reset and a significant delay (40 seconds). There was no reset and no delay with older kernels, so I'm afraid the patch is really only a workaround and not a proper fix. I'll do more tests and report.
I'm closing this because it seems that we're stalled. If this is still an issue, please reopen and maybe attach dmesg logs from a current kernel, e.g., v4.8.
For those wondering about a lack of response from me here, I don't work for the company with the problematic hardware anymore, and it was put into full production use after the workaround was found to be sufficient.