Most recent kernel where this bug did not occur: 2.6.17
Distribution: Ubuntu 6.06
Hardware Environment: IBM x366, 4x 3.16GHz Xeon, 8G RAM, aic94xx SAS controller, 6x 36G SAS disks, Calgary IOMMU
Software Environment: netbooted Ubuntu mutant.
Problem Description: aic94xx driver NMIs the system if calgary iommu is enabled.

Steps to reproduce:
1. Clone linux-2.6 and add in scsi-misc, scsi-rc-fixes and aic94xx from git.kernel.org.
2. Build a netbootable "allmodconfig" kernel with attached config file.
3. Calgary IOMMU code activates.
4. Subsequent loading of the aic94xx driver causes the service processor to log a PCI SERR with a split completion timeout, followed by an NMI, followed by a reboot.

Note that disabling the Calgary IOMMU for the bus that the SAS controller is connected to (either via kconfig or iommu=none or calgary=disable=1) will make the problem go away.
Created attachment 9069
.config file that I use

This is the .config file that I use. Everything should be modular except for a few NIC drivers and enough code to mount nfs root.
Created attachment 9070
lspci -vvv output from the affected machine

Here's the lspci output. Note that I've hit this bug on a few other machines/models (x260) in our lab.
Begin forwarded message:

Date: Thu, 21 Sep 2006 18:10:56 -0700
From: bugme-daemon@bugzilla.kernel.org
To: bugme-new@lists.osdl.org
Subject: [Bugme-new] [Bug 7180] New: aic94xx driver locks up on IBM x366 with Calgary IOMMU enabled

http://bugzilla.kernel.org/show_bug.cgi?id=7180

Summary: aic94xx driver locks up on IBM x366 with Calgary IOMMU enabled
Kernel Version: 2.6.18
Status: NEW
Severity: normal
Owner: djwong@us.ibm.com
Submitter: djwong@us.ibm.com
CC: alexisb@us.ibm.com

[snip]
Reply-To: muli@il.ibm.com

On Thu, Sep 21, 2006 at 06:23:50PM -0700, Andrew Morton wrote:
> Begin forwarded message:
>
> Date: Thu, 21 Sep 2006 18:10:56 -0700
> From: bugme-daemon@bugzilla.kernel.org
> To: bugme-new@lists.osdl.org
> Subject: [Bugme-new] [Bug 7180] New: aic94xx driver locks up on IBM
> x366 with Calgary IOMMU enabled
>
[snip]
>
> http://bugzilla.kernel.org/show_bug.cgi?id=7180

Thanks Andrew, we're looking into it.

Cheers,
Muli
Also, the machine doesn't always reboot after the NMI; sometimes the machine just locks up hard and has to be power cycled.
Reply-To: James.Bottomley@SteelEye.com

On Fri, 2006-09-22 at 09:05 +0300, Muli Ben-Yehuda wrote:
> Thanks Andrew, we're looking into it.

Just to add to the complexity, this seems to be working for me on a maia system with aic94xx boot drives.

James

-----

PCI-DMA: Calgary IOMMU detected. TCE table spec is 6.
[...]
PCI-DMA: Using Calgary IOMMU
Calgary: enabling translation on PHB 0
Calgary: errant DMAs will now be prevented on this bus.
Calgary: enabling translation on PHB 1
Calgary: errant DMAs will now be prevented on this bus.
[...]
aic94xx: Adaptec aic94xx SAS/SATA driver version 1.0.2 loaded
GSI 16 sharing vector 0xA9 and IRQ 16
ACPI: PCI Interrupt 0000:01:02.0[A] -> GSI 25 (level, low) -> IRQ 16
aic94xx: found Adaptec AIC-9410W SAS/SATA Host Adapter, device 0000:01:02.0
scsi0 : aic94xx
aic94xx: BIOS present (1,1), 1548
aic94xx: ue num:1, ue size:88
aic94xx: 1Found FLASH(8) manuf:1, dev_id:0xda, sec_prot:0
aic94xx: manuf sect SAS_ADDR 5005076a0115a840
aic94xx: manuf sect PCBA SN
aic94xx: ms: num_phy_desc: 8
aic94xx: ms: phy0: ENEBLEABLE
aic94xx: ms: phy1: ENEBLEABLE
aic94xx: ms: phy2: ENEBLEABLE
aic94xx: ms: phy3: ENEBLEABLE
aic94xx: ms: phy4: ENEBLEABLE
aic94xx: ms: phy5: ENEBLEABLE
aic94xx: ms: phy6: ENEBLEABLE
aic94xx: ms: phy7: ENEBLEABLE
aic94xx: ms: max_phys:0x8, num_phys:0x8
aic94xx: ms: enabled_phys:0xff
aic94xx: ctrla: phy0: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy1: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy2: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy3: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy4: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy5: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy6: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy7: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: max_scbs:512, max_ddbs:128
aic94xx: setting phy0 addr to 5005076a0115a840
aic94xx: setting phy1 addr to 5005076a0115a840
aic94xx: setting phy2 addr to 5005076a0115a840
aic94xx: setting phy3 addr to 5005076a0115a840
aic94xx: setting phy4 addr to 5005076a0115a840
aic94xx: setting phy5 addr to 5005076a0115a840
aic94xx: setting phy6 addr to 5005076a0115a840
aic94xx: setting phy7 addr to 5005076a0115a840
aic94xx: num_edbs:21
aic94xx: num_escbs:3
aic94xx: using sequencer V17/10c6
aic94xx: downloading CSEQ...
aic94xx: dma-ing 8192 bytes
aic94xx: verified 8192 bytes, passed
aic94xx: downloading LSEQs...
aic94xx: dma-ing 14336 bytes
aic94xx: LSEQ0 verified 14336 bytes, passed
aic94xx: LSEQ1 verified 14336 bytes, passed
aic94xx: LSEQ2 verified 14336 bytes, passed
aic94xx: LSEQ3 verified 14336 bytes, passed
aic94xx: LSEQ4 verified 14336 bytes, passed
aic94xx: LSEQ5 verified 14336 bytes, passed
aic94xx: LSEQ6 verified 14336 bytes, passed
aic94xx: LSEQ7 verified 14336 bytes, passed
aic94xx: max_scbs:446
aic94xx: first_scb_site_no:0x20
aic94xx: last_scb_site_no:0x1fe
aic94xx: First SCB dma_handle: 0x9000
aic94xx: device 0000:01:02.0: SAS addr 5005076a0115a840, PCBA SN , 8 phys, 8 enabled phys, flash present, BIOS build 1548
aic94xx: posting 3 escbs
aic94xx: escbs posted
aic94xx: posting 8 control phy scbs
aic94xx: enabled phys
aic94xx: control_phy_tasklet_complete: phy4, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy4: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 2c 50
aic94xx: 14: 03 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: control_phy_tasklet_complete: phy0, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy0: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 06 20
aic94xx: 14: 03 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: control_phy_tasklet_complete: phy1, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy1: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 06 20
aic94xx: 14: 04 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: control_phy_tasklet_complete: phy6, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy6: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 2c 50
aic94xx: 14: 07 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: control_phy_tasklet_complete: phy2, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy2: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 06 20
aic94xx: 14: 07 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: control_phy_tasklet_complete: phy7, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy7: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 2c 50
aic94xx: 14: 08 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: control_phy_tasklet_complete: phy3, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy3: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 06 20
aic94xx: 14: 08 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: control_phy_tasklet_complete: phy5, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy5: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 2c 50
aic94xx: 14: 04 00 00 00
aic94xx: 18: 00 00 00 00
sas: phy4 added to port0, phy_mask:0x10
sas: phy0 added to port1, phy_mask:0x1
sas: phy1 matched wide port1
sas: phy1 added to port1, phy_mask:0x3
sas: phy6 matched wide port0
sas: phy6 added to port0, phy_mask:0x50
sas: phy2 matched wide port1
sas: phy2 added to port1, phy_mask:0x7
sas: phy7 matched wide port0
sas: phy7 added to port0, phy_mask:0xd0
sas: phy3 matched wide port1
sas: phy3 added to port1, phy_mask:0xf
sas: phy5 matched wide port0
sas: phy5 added to port0, phy_mask:0xf0
sas: DOING DISCOVERY on port 0, pid:869
sas: ex 5005076a00002c50 phy00:D attached: 50010b9000001369
sas: ex 5005076a00002c50 phy01:T attached: 0000000000000000
sas: ex 5005076a00002c50 phy02:T attached: 0000000000000000
sas: ex 5005076a00002c50 phy03:S attached: 5005076a0115a840
sas: ex 5005076a00002c50 phy04:S attached: 5005076a0115a840
sas: ex 5005076a00002c50 phy05:T attached: 0000000000000000
sas: ex 5005076a00002c50 phy06:T attached: 0000000000000000
sas: ex 5005076a00002c50 phy07:S attached: 5005076a0115a840
sas: ex 5005076a00002c50 phy08:S attached: 5005076a0115a840
sas: ex 5005076a00002c50 phy09:T attached: 0000000000000000
sas: ex 5005076a00002c50 phy10:T attached: 0000000000000000
sas: ex 5005076a00002c50 phy11:T attached: 0000000000000000
sas: ex 5005076a00002c50 phy12:D attached: 5005076a00002c5d
scsi 0:0:0:0: Direct access IBM-ESXS BBA036C3ESTT0Z N BH06 PQ: 0 ANSI: 5
SCSI device sda: 71096640 512-byte hdwr sectors (36401 MB)
sda: Write Protect is off
sda: Mode Sense: d3 00 10 08
SCSI device sda: drive cache: write through w/ FUA
SCSI device sda: 71096640 512-byte hdwr sectors (36401 MB)
sda: Write Protect is off
sda: Mode Sense: d3 00 10 08
SCSI device sda: drive cache: write through w/ FUA
sda: sda1 sda2 sda3
sd 0:0:0:0: Attached scsi disk sda
scsi 0:0:1:0: Enclosure IBM-ESXS VSC7160 1.01 PQ: 0 ANSI: 3
sas: device 5005076a00002c5d, LUN 0 doesn't support TCQ
sas: DONE DISCOVERY on port 0, pid:869, result:0
sas: DOING DISCOVERY on port 1, pid:869
sas: ex 5005076a00000620 phy00:T attached: 0000000000000000
sas: ex 5005076a00000620 phy01:T attached: 0000000000000000
sas: ex 5005076a00000620 phy02:T attached: 0000000000000000
sas: ex 5005076a00000620 phy03:S attached: 5005076a0115a840
sas: ex 5005076a00000620 phy04:S attached: 5005076a0115a840
sas: ex 5005076a00000620 phy05:T attached: 0000000000000000
sas: ex 5005076a00000620 phy06:T attached: 0000000000000000
sas: ex 5005076a00000620 phy07:S attached: 5005076a0115a840
sas: ex 5005076a00000620 phy08:S attached: 5005076a0115a840
sas: ex 5005076a00000620 phy09:T attached: 0000000000000000
sas: ex 5005076a00000620 phy10:T attached: 0000000000000000
sas: ex 5005076a00000620 phy11:T attached: 0000000000000000
sas: ex 5005076a00000620 phy12:D attached: 5005076a0000062d
scsi 0:0:2:0: Enclosure IBM-ESXS VSC7160 1.01 PQ: 0 ANSI: 3
sas: device 5005076a0000062d, LUN 0 doesn't support TCQ
sas: DONE DISCOVERY on port 1, pid:869, result:0
> Just to add to the complexity...

Hey, I can add some too! This condition doesn't always happen the first time the driver loads. Sometimes the machine survives a load and unload cycle anywhere between 1 and a half dozen times before the SERR/NMI show up.
Created attachment 9281
Patch to disable the Calgary's split completion timers

This is a heavy-handed patch that disables all of the Calgary PCIX bridge's split completion timers. Really, this sledgehammer should be cut down to do this only on PHB 1 of a x366/x260/x460. Or perhaps simply increasing the timeout would suffice. I'll go experiment with that, but in the meantime here's a crappy patch in case anybody else wants to try it out.
Created attachment 9282
Patch to disable the Calgary's split completion timers

Whoops, bad patch. Let's try again.
Patch looks reasonable, I'll give it a spin on my machine. Does it help on your problematic machine?
Created attachment 9285
Even more restricted version of above patches.

Yes, these patches fix the SERR -> SC Timeout -> NMI problems on my x366 and x260. Attached is an even more restrictive version of the patch that (a) only acts upon the particular bus in question and (b) sets a higher timeout instead of turning it off altogether.
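For readers without the attachment, here is a minimal sketch of the general idea behind (b): replace one PHB's timeout field inside a shared 64-bit register value instead of zeroing it. This is not the attached patch. The 4-bit field width, the shift positions, the timeout encoding and the name `set_timer_field` are all assumptions made up for illustration, and real hardware access (MMIO reads and writes, endianness conversion) is left out entirely.

```c
#include <stdint.h>

/* Hypothetical: assume each PHB's split completion timer is a 4-bit field. */
#define TIMER_FIELD_MASK 0xFULL

/*
 * Return `reg` with the 4-bit field starting at bit `shift` replaced by
 * `timeout`. All other bits, including the other PHBs' fields, are kept.
 * `shift` and `timeout` are placeholders, not values from the Calgary docs.
 */
static uint64_t set_timer_field(uint64_t reg, unsigned int shift,
                                uint64_t timeout)
{
    reg &= ~(TIMER_FIELD_MASK << shift);          /* clear only this field */
    reg |= (timeout & TIMER_FIELD_MASK) << shift; /* store the new timeout */
    return reg;
}
```

In a driver the caller would read the register, pass the value through a helper like this for the one PHB of interest, and write the result back, which is what keeps the change restricted to a single bus.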
Created attachment 9292
slightly reworked version

Great to hear it! I think there's a slight mistake in your latest patch though: you're changing the timer for PHB2 rather than PHB1. Can you please give this slightly reworked patch a spin and confirm that it solves the problem? If it does, we'll go ahead and apply it (without the debug printks).
Patch works on my x366. Will see about shaking out a x260/x460 for testing there too, though I'm fairly confident that a fix for one fixes all three.
Fixes the x260 also. ACK.
Excellent, I'll push the patch upstream. Nice work :-)
For a 4-node x460 with the Adaptec Razor (9410SAS) enabled in the second, third and fourth nodes, and the ServeRAID card in the first node, would this patch still work? I presume (and I am probably incorrect) that the PHB for the aic94xx in nodes 2, 3 and 4 is not 1?
> For a 4-node x460 with the Adaptec Razor (9410SAS) enabled in the second,
> third and fourth nodes, and the ServeRAID card in the first node, would
> this patch still work?

Hmm... what does the machine have on PCI bus 1? Basically, it will only modify the timer on bus 1, so it won't "work" in that sense, but if you haven't seen any problems that this could in theory fix, everything should continue to work fine.

> I presume (and I am probably incorrect) that the PHB for the aic94xx in
> nodes 2, 3 and 4 is not 1?

The bus number is not 1 but the PHB ID is 1 (there are 4 PHBs per Calgary, numbered 0-3, so multiple busses will have PHB ID 1, one for each Calgary).
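To make the bus number versus PHB ID distinction concrete, here is a small sketch. It assumes, purely for illustration, that PHBs are counted consecutively across Calgary chips, so a global PHB index splits into a chip number and a PHB ID of 0-3. The real bus-to-PHB mapping on these machines may well differ, and `locate_phb` and `struct phb_location` are hypothetical names, not kernel symbols.

```c
/* From the comment above: each Calgary has 4 PHBs, numbered 0-3. */
#define PHBS_PER_CALGARY 4u

/* Hypothetical helper type: which Calgary chip, and which PHB on it. */
struct phb_location {
    unsigned int calgary; /* index of the Calgary chip */
    unsigned int phb_id;  /* PHB ID within that chip, 0..3 */
};

/*
 * Assumption: `global_phb_index` counts PHBs consecutively across chips.
 * Under that assumption, indices 1, 5, 9 and 13 all have PHB ID 1.
 */
static struct phb_location locate_phb(unsigned int global_phb_index)
{
    struct phb_location loc;

    loc.calgary = global_phb_index / PHBS_PER_CALGARY;
    loc.phb_id = global_phb_index % PHBS_PER_CALGARY;
    return loc;
}
```

Under this assumed numbering, a 4-node machine would have four distinct busses that all report PHB ID 1, one per Calgary, which is the situation described above.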
Maybe we should look for PHBs with aic94xx adapters hanging off, and increase the SC timeout for any that we find? Or just increase all of them across the board?
Yeah, we could do that... but I'd rather understand first why it makes a difference. Do we have any contacts at Adaptec that could maybe help on this?
Jack Hammer might be able to help us. Either that or possibly the documentation.
Patch is in 2.6.19-rc3, closing.