Bug 63661 - [BISECTED]Several Intel/LSI adapters doesn't work on Intel Servers when using kernels with "pci: Rework ASPM disable code" patch applied
Status: RESOLVED WILL_NOT_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: x86-64 Linux
Importance: P1 high
Assignee: other_modules
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-10-25 08:27 UTC by Arkadiusz Bubała
Modified: 2016-10-28 21:51 UTC

See Also:
Kernel Version: 3.0.20+; 3.2.5+; 3.3+
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg (working, v3.16-rc4) (71.37 KB, text/plain)
2015-04-28 17:01 UTC, Bjorn Helgaas
Details
dmesg (failing, v3.16-rc4) (69.65 KB, text/plain)
2015-04-28 17:03 UTC, Bjorn Helgaas
Details
lspci (working) (228.45 KB, text/plain)
2015-04-28 17:06 UTC, Bjorn Helgaas
Details
lspci (failing) (228.36 KB, text/plain)
2015-04-28 17:06 UTC, Bjorn Helgaas
Details
debug patch for 69166fbf02c7 (6.49 KB, patch)
2015-04-29 17:27 UTC, Bjorn Helgaas
Details | Diff
debug patch for v4.1-rc1 (7.88 KB, patch)
2015-04-29 17:28 UTC, Bjorn Helgaas
Details | Diff
Candidate fix (6.57 KB, patch)
2015-07-07 07:23 UTC, Jean Delvare
Details | Diff

Description Arkadiusz Bubała 2013-10-25 08:27:29 UTC
Hello,

I found that the patch "pci: Rework ASPM disable code" (commit 3c076351c4027a56d5005a39a0b518a4ba393ce2) causes LSI/Intel RAID adapters to fail on Intel servers. Tested with Intel S2600IP and Intel S2400SC mainboards.

Dmesg output on patched kernels:
[ 2444.630689] megasas: 0x1000:0x005b:0x8086:0x3510: bus 133:slot 0:func 0
[ 2444.630897] megasas: Waiting for FW to come to ready state
[ 2444.630900] megasas: FW in FAULT state!! 

After reverting this patch everything works well:
[   30.052181] megasas: 0x1000:0x005b:0x8086:0x3510: bus 133:slot 0:func 0
[   30.052392] megasas: FW now in Ready state
[   30.052436] megaraid_sas 0000:85:00.0: irq 132 for MSI/MSI-X
[   30.052445] megaraid_sas 0000:85:00.0: irq 133 for MSI/MSI-X
[   30.052454] megaraid_sas 0000:85:00.0: irq 134 for MSI/MSI-X
[   30.052462] megaraid_sas 0000:85:00.0: irq 135 for MSI/MSI-X
[   30.052470] megaraid_sas 0000:85:00.0: irq 136 for MSI/MSI-X
[   30.052479] megaraid_sas 0000:85:00.0: irq 137 for MSI/MSI-X
[   30.052487] megaraid_sas 0000:85:00.0: irq 138 for MSI/MSI-X
[   30.052495] megaraid_sas 0000:85:00.0: irq 139 for MSI/MSI-X
[   30.052504] megaraid_sas 0000:85:00.0: irq 140 for MSI/MSI-X
[   30.052512] megaraid_sas 0000:85:00.0: irq 141 for MSI/MSI-X
[   30.052521] megaraid_sas 0000:85:00.0: irq 142 for MSI/MSI-X
[   30.052534] megaraid_sas 0000:85:00.0: irq 143 for MSI/MSI-X
[   30.052543] megaraid_sas 0000:85:00.0: irq 144 for MSI/MSI-X
[   30.052551] megaraid_sas 0000:85:00.0: irq 145 for MSI/MSI-X
[   30.052559] megaraid_sas 0000:85:00.0: irq 146 for MSI/MSI-X
[   30.052568] megaraid_sas 0000:85:00.0: irq 147 for MSI/MSI-X
[   30.078270] megasas:IOC Init cmd success
[   30.108295] megasas: INIT adapter done
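
For anyone comparing many boots, the two outcomes above can be told apart mechanically. The following is only a minimal sketch: the function name `megasas_fw_state` is made up for illustration, and the only strings it relies on are the message fragments quoted in the logs above.

```python
# Message fragments copied from the dmesg excerpts quoted above.
FAULT = "FW in FAULT state"
READY = "FW now in Ready state"


def megasas_fw_state(dmesg_text):
    """Return 'fault', 'ready', or 'unknown' for a captured dmesg log.

    Hypothetical helper: a FAULT message anywhere wins, because the
    failing boots report FAULT and never reach the Ready message.
    """
    state = "unknown"
    for line in dmesg_text.splitlines():
        if "megasas" not in line:
            continue  # ignore unrelated kernel messages
        if FAULT in line:
            return "fault"
        if READY in line:
            state = "ready"
    return state
```

Feeding it the text of a saved log (for example the output of `dmesg` redirected to a file) returns `"fault"` for the first excerpt and `"ready"` for the second.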


This issue was also reported here:
https://bugs.launchpad.net/ubuntu/+source/debian-installer/+bug/1091465
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1091263
http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg17919.html
Comment 1 Bjorn Helgaas 2014-01-10 16:54:01 UTC
Arkadiusz, can you please attach the complete dmesg logs (with and without the revert) and the complete "lspci -vvxxx" output?  Thanks!
Comment 2 Bjorn Helgaas 2014-06-04 23:52:35 UTC
This looks similar to the ASPM issues in bug #64541 (iwlwifi, resolved by a driver change), bug #59311 (sdhci), and bug #73241 (sdhci).  The sdhci issues are still open, but I suspect they are also driver problems.

I think this megasas issue is also a driver problem, but I can't tell without more information (requested in comment #1).

I'm reassigning this to SCSI drivers on that assumption, but if we don't get any more information, I guess we should just close this.
Comment 3 Robin H. Johnson 2014-07-12 12:00:14 UTC
@Bjorn:
I have a Supermicro system with the same problem. I haven't reverted that patch yet to test, but plan to later this weekend.

pcie_aspm=off as a boot param has NO effect on the problem.

Here's the lspci you wanted:
# lspci -d 1000: -vvxxx
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
	Subsystem: Super Micro Computer Inc LSI MegaRAID ROMB
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 16
	Region 0: I/O ports at 8000 [disabled] [size=256]
	Region 1: Memory at dfe60000 (64-bit, non-prefetchable) [size=16K]
	Region 3: Memory at dfe00000 (64-bit, non-prefetchable) [size=256K]
	Expansion ROM at dfe40000 [disabled] [size=128K]
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest+
	Capabilities: [d0] Vital Product Data
pcilib: sysfs_read_vpd: read failed: Connection timed out
		Not readable
	Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [c0] MSI-X: Enable- Count=16 Masked-
		Vector table: BAR=1 offset=00002000
		PBA: BAR=1 offset=00003000
00: 00 10 5b 00 02 00 10 00 05 00 04 01 10 00 00 00
10: 01 80 00 00 04 00 e6 df 00 00 00 00 04 00 e0 df
20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 90 06
30: 00 00 e4 df 50 00 00 00 00 00 00 00 0b 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 68 03 06 08 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 01 00 00 10 d0 02 00 25 80 00 10
70: 20 28 00 00 83 04 40 00 40 00 83 10 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 16 00 00 00
90: 00 00 00 00 0e 00 00 00 03 00 3e 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 05 c0 80 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 11 00 0f 00 01 20 00 00 01 30 00 00 00 00 00 00
d0: 03 a8 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
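
The ASPM fields that lspci prints as "LnkCap: ... ASPM L0s" and "LnkCtl: ASPM Disabled" can be cross-checked against the raw bytes in the dump. This is a rough sketch, not lspci's own code: it assumes the standard PCI Express capability layout (Link Capabilities at capability offset 0x0C with ASPM support in bits 11:10, Link Control at offset 0x10 with ASPM control in bits 1:0), and the helper names are invented for illustration.

```python
# Rows 0x60 and 0x70 of the hex dump above, keyed by their starting offset.
DUMP = {
    0x60: "00 00 00 00 00 01 00 00 10 d0 02 00 25 80 00 10",
    0x70: "20 28 00 00 83 04 40 00 40 00 83 10 00 00 00 00",
}

PCIE_CAP = 0x68            # "Capabilities: [68] Express" in the listing
LNKCAP = PCIE_CAP + 0x0C   # Link Capabilities register, 32 bits
LNKCTL = PCIE_CAP + 0x10   # Link Control register, 16 bits

# Two-bit ASPM encoding; "none" in the control field means ASPM is disabled.
ASPM_NAMES = {0: "none", 1: "L0s", 2: "L1", 3: "L0s L1"}


def config_bytes(dump):
    """Map config-space offset -> byte value for the rows provided."""
    space = {}
    for base, row in dump.items():
        for i, token in enumerate(row.split()):
            space[base + i] = int(token, 16)
    return space


def read_le(space, offset, size):
    """Read a little-endian register of `size` bytes at `offset`."""
    return sum(space[offset + i] << (8 * i) for i in range(size))


def aspm_fields(dump):
    """Return (supported, enabled) ASPM states as strings."""
    space = config_bytes(dump)
    lnkcap = read_le(space, LNKCAP, 4)
    lnkctl = read_le(space, LNKCTL, 2)
    supported = (lnkcap >> 10) & 0x3   # LnkCap bits 11:10
    enabled = lnkctl & 0x3             # LnkCtl bits 1:0
    return ASPM_NAMES[supported], ASPM_NAMES[enabled]
```

For the dump above, `aspm_fields(DUMP)` gives `("L0s", "none")`, which agrees with the decoded lspci text: L0s is supported and ASPM is currently disabled on the endpoint.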
Comment 4 Bjorn Helgaas 2015-04-28 15:28:03 UTC
Arkadiusz, Robin, is this still a problem?

Does booting with "pci=conf1" make a difference?  That was a workaround for the Ubuntu issue (mentioned in the original description).

I still need two complete dmesg logs: one from a boot showing the problem, and one from a working boot.  It's best if these are from a recent upstream kernel, e.g., v4.0.
Comment 5 Bjorn Helgaas 2015-04-28 17:01:29 UTC
Created attachment 175171 [details]
dmesg (working, v3.16-rc4)

Sorry, I forgot that Robin did extensive testing of this and attached logs here: http://www.spinics.net/lists/linux-scsi/msg76204.html .  I'm going to attach them to this bugzilla as well so they don't get lost.

This v3.16-rc4 dmesg log is a working boot.
Comment 6 Bjorn Helgaas 2015-04-28 17:03:42 UTC
Created attachment 175181 [details]
dmesg (failing, v3.16-rc4)

Same v3.16-rc4 kernel, but it fails on this boot.
Comment 7 Bjorn Helgaas 2015-04-28 17:06:18 UTC
Created attachment 175191 [details]
lspci (working)
Comment 8 Bjorn Helgaas 2015-04-28 17:06:39 UTC
Created attachment 175201 [details]
lspci (failing)
Comment 9 Bjorn Helgaas 2015-04-29 17:14:31 UTC
Summary of Robin's testing (from his email):

Kernels:
K.1: Ubuntu's 3.16-rc4
K.2: 3.2-rc4 3c076351c402 - aspm merged
K.3: 3.2-rc4 69166fbf02c7 - aspm merge parent
Notes: 3.2* compiled with GCC4.6, 3.16-rc4 with GCC4.8

BIOS: Boot -> FastBoot:
B1.1 Off
B1.2 On (CMOS reset default)

BIOS: Advanced -> PCIe/PCI/PnP Configuration -> ASPM Support
B2.1 Force L0s
B2.2 BIOS (CMOS reset default)
B2.3 Disabled

Reduced Karnaugh Map of results:
Kernels,B1,B2:   Result
  *, B1.1,    *  PASS
  *, B1.2, B2.1  VARIABLE (9 runs: 5 fail, 4 pass, no kernel consistency)
K.1, B1.2, B2.2  FAIL
K.1, B1.2, B2.3  FAIL
K.2, B1.2, B2.2  FAIL
K.2, B1.2, B2.3  FAIL
K.3, B1.2, B2.2  PASS
K.3, B1.2, B2.3  PASS
Comment 10 Bjorn Helgaas 2015-04-29 17:25:31 UTC
My understanding of the Karnaugh map is that:

  - Fast Boot disabled: all kernels always passed

  - Fast Boot enabled, ASPM set to Force L0s: variable; no
    consistency of results

  - Fast Boot enabled, ASPM set to BIOS or Disabled: pre-3c076351c402
    always passed, post-3c076351c402 always failed
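
The three rules above can be written down and checked against the rows reported in comment 9. This is only an illustrative sketch: `expected_result` and `HAS_COMMIT` are made-up names, and treating K.1 (3.16-rc4) as containing 3c076351c402 is an inference from its version rather than something Robin stated.

```python
FAST_BOOT_OFF = "B1.1"
FORCE_L0S = "B2.1"

# Whether each tested kernel contains 3c076351c402.
# K.2 is the commit itself and K.3 is its parent; K.1 is assumed to
# contain it because 3.16-rc4 is much newer than the commit.
HAS_COMMIT = {"K.1": True, "K.2": True, "K.3": False}

# Fully specified rows from the results table in comment 9.
REPORTED = [
    ("K.1", "B1.2", "B2.2", "FAIL"),
    ("K.1", "B1.2", "B2.3", "FAIL"),
    ("K.2", "B1.2", "B2.2", "FAIL"),
    ("K.2", "B1.2", "B2.3", "FAIL"),
    ("K.3", "B1.2", "B2.2", "PASS"),
    ("K.3", "B1.2", "B2.3", "PASS"),
]


def expected_result(kernel, fast_boot, aspm):
    """Apply the three rules: Fast Boot off, Force L0s, everything else."""
    if fast_boot == FAST_BOOT_OFF:
        return "PASS"          # all kernels passed with Fast Boot disabled
    if aspm == FORCE_L0S:
        return "VARIABLE"      # no consistency with Force L0s
    return "FAIL" if HAS_COMMIT[kernel] else "PASS"


def rules_match_table():
    """True if the rules reproduce every fully specified reported row."""
    return all(expected_result(k, b1, b2) == result
               for k, b1, b2, result in REPORTED)
```

`rules_match_table()` holding for all six rows is what makes the commit, rather than the compiler or kernel version, the distinguishing factor once Fast Boot is enabled.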

Here are some diffs between the working and failing v3.16-rc4 boots:

--- dmesg.working       2015-04-28 11:23:19.900776670 -0500
+++ dmesg.broken        2015-04-28 11:23:14.632848652 -0500
 megasas: 06.803.01.00-rc1 Mon. Mar. 10 17:00:00 PDT 2014
 megasas: 0x1000:0x005b:0x15d9:0x0690: bus 1:slot 0:func 0 
-megasas: FW now in Ready state
+megaraid_sas 0000:01:00.0: enabling device (0000 -> 0002)
+megasas: Waiting for FW to come to ready state
+megasas: FW in FAULT state!!
+megaraid_sas 0000:01:00.0: megasas: FW restarted successfully from megasas_init_fw!
+megasas: Waiting for FW to come to ready state
+megasas: FW in FAULT state!!

My theory is that when Fast Boot is enabled, the BIOS does not run the megasas option ROM.  In that case, Linux receives the device uninitialized (hence the new "enabling device" message).  I suspect megaraid_sas depends on something done by the option ROM, possibly something related to ASPM.
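
To make the "enabling device (0000 -> 0002)" message concrete: as far as I understand, the two numbers are the PCI Command register before and after the kernel enables the device. The sketch below decodes only the three low bits defined by the PCI specification (I/O Space, Memory Space, Bus Master); the function names are invented for illustration.

```python
# Low bits of the PCI Command register, labelled the way lspci prints them.
COMMAND_BITS = {0: "I/O", 1: "Mem", 2: "BusMaster"}


def decode_command(value):
    """List the names of the decoded Command register bits set in `value`."""
    return [name for bit, name in COMMAND_BITS.items() if value & (1 << bit)]


def newly_enabled(old, new):
    """Bits that are set in `new` but were clear in `old`."""
    return decode_command(new & ~old)
```

`newly_enabled(0x0000, 0x0002)` returns `["Mem"]`: the device arrived with memory decoding off and Linux had to turn it on, which fits the idea that the firmware never touched the adapter on the failing boots.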
Comment 11 Bjorn Helgaas 2015-04-29 17:27:10 UTC
Created attachment 175271 [details]
debug patch for 69166fbf02c7

This patch applies on 69166fbf02c7.  Please boot it with "pci=earlydump" and attach the resulting dmesg log here.
Comment 12 Bjorn Helgaas 2015-04-29 17:28:10 UTC
Created attachment 175281 [details]
debug patch for v4.1-rc1

This patch applies on v4.1-rc1.  Please boot it with "pci=earlydump" and attach the resulting dmesg log here.
Comment 13 Bjorn Helgaas 2015-04-29 21:40:09 UTC
I collected other similar reports on the web.  Here's a summary of what I found.

Chris reported an issue [1] on an unspecified system with megaraid_sas and a MegaRAID SAS 2208 adapter on Debian Wheezy (kernel based on v3.2).  He later reported [11] that neither "acpi=off" nor "pci=conf1" helped.

Ron reported [2] that v3.0.0 worked, v3.2 through v3.7.1 did not work, and
"pci=conf1" was a workaround on his Intel S2600CP system.

Gunnar reported [3] that "acpi=off" was a workaround for Ubuntu 12.04.

Arkadiusz reported [4] a similar problem on Intel S2600IP and S2400SC systems
and bisected it to 3c076351c402 ("PCI: Rework ASPM disable code"), which 
appeared in v3.3-rc1.

Robin reported [5] a similar problem on a Supermicro X9DRH-7TF system and
"pcie_aspm=off" didn't help (but I'm not confident that "pcie_aspm=off" is
equivalent to reverting the commit Arkadiusz identified).  And furthermore
[10], "pci=conf1" and "disable_msi=1" didn't help either.  The failure
happens only after 3c076351c402 ("PCI: Rework ASPM disable code"), and
turning off the BIOS Fast Boot feature is a workaround [12].

Joro reported [6] a similar problem on an Intel S2600CP4 system with Ubuntu
12.04, 12.10, and 13.04, but that CentOS 6.3 worked fine.

Ron [7] suggested "pci=conf1" as a workaround.

Michał reported [8] that on an Intel S2600IP4 system, v3.2.4 worked, but v3.2.5 had the same problem.  v3.2.5 added 3c076351c402 ("PCI: Rework ASPM disable code").

Matthias confirmed [9] the same problem as Michał on an Intel S1200BTLR system with v3.2.24 and v3.5.0, both of which contain 3c076351c402.

In all cases the failing kernel includes 3c076351c402 (I couldn't verify this for Chris' report on Wheezy).  When reported, the working kernel (3.0.0, CentOS6.3, v3.2.4, and bisected result) does not include 3c076351c402.

Ron suggested "pci=conf1" as a workaround on an Intel S2600CP system, but others have tried it without success.

Robin found that turning off BIOS "Fast Boot" was a workaround on a Supermicro system, but nobody else has tried this.

[1] http://debian.2.n7.nabble.com/Wheezy-Driver-for-Intel-RMS25CB080-RAID-Controller-tp2783386.html
[2] http://debian.2.n7.nabble.com/Wheezy-Driver-for-Intel-RMS25CB080-RAID-Controller-td2783386.html#a2836241
[3] https://lists.debian.org/debian-user/2012/10/msg01332.html
[4] https://bugzilla.kernel.org/show_bug.cgi?id=63661
[5] https://bugzilla.kernel.org/show_bug.cgi?id=63661#c3
[6] https://bugs.launchpad.net/ubuntu/+source/debian-installer/+bug/1091465
[7] https://bugs.launchpad.net/ubuntu/+source/debian-installer/+bug/1091465/comments/6
[8] https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg17919.html
[9] https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg17925.html
[10] http://permalink.gmane.org/gmane.linux.scsi/92439
[11] https://lists.debian.org/debian-user/2013/01/msg00056.html
[12] http://www.spinics.net/lists/linux-scsi/msg76204.html
Comment 14 Jean Delvare 2015-05-28 08:14:18 UTC
Bjorn, thanks for the excellent summary. And thanks a lot to Arkadiusz for finding the commit that introduced the problem.

I can only confirm all of this. We had customers reporting a similar issue on Intel S2600 series systems running SLES 11 SP3 and SLES 12, which both contain commit 3c076351c402 (SLES 11 SP3 got it from stable kernel update 3.0.20.) The same customers are running SLES 11 SP2 with no problem (it also includes kernel update 3.0.20, but with 3c076351c402 reverted because it caused PS/2 keyboard and touchpad misdetection on other systems, a problem that has apparently been solved since.)

I'm going to suggest to the customers to look for a Fast Boot option in the BIOS and disable it if it exists.
Comment 15 Jean Delvare 2015-07-07 07:23:52 UTC
Created attachment 182051 [details]
Candidate fix

This candidate fix I received from Kashyap Desai (Avago) appears to solve the problem.
Comment 16 Jean Delvare 2015-08-03 18:11:54 UTC
Let me adjust my previous comment: the candidate fix I received fixed the problem on SLES 12 (based on kernel 3.12) but not on SLES 11 SP3 (based on kernel 3.0.) So it should be sufficient for upstream, but for older kernels it seems that some other commits must be backported.
Comment 17 Jean Delvare 2015-08-04 12:52:36 UTC
Bjorn, are you still interested in a boot log with the debug patch from comment #11?
Comment 18 Bjorn Helgaas 2015-08-04 15:44:02 UTC
> Bjorn, are you still interested in a boot log with the debug patch from
> comment #11?

I don't think so.  It sounds like the driver change, i.e., comment 15, solves the problem, so I guess there's nothing for me to do.
Comment 19 Jean Delvare 2015-08-26 11:05:45 UTC
Actually I'm not completely sure if it solves the problem. With the patch from comment #15 backported to an older kernel, the controller starts but with a reset and a significant delay (40 seconds.) There was no reset and no delay with older kernels, so I'm afraid the patch is really only a workaround and not a proper fix.

I'll do more tests and report.
Comment 20 Bjorn Helgaas 2016-10-28 20:55:41 UTC
I'm closing this because it seems that we're stalled.  If this is still an issue, please reopen and maybe attach dmesg logs from a current kernel, e.g., v4.8.
Comment 21 Robin H. Johnson 2016-10-28 21:51:53 UTC
For those wondering about a lack of response from me here, I don't work for the company with the problematic hardware anymore, and it was put into full production use after the workaround was found to be sufficient.
