Bug 7180

Summary: aic94xx driver locks up on IBM x366 with Calgary IOMMU enabled
Product: SCSI Drivers
Reporter: Darrick J. Wong (djwong)
Component: AIC94XX
Assignee: Muli Ben-Yehuda (muli)
Status: CLOSED CODE_FIX
Severity: normal
CC: alexisb, jdmason, scsi_drivers-aic94xx
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.18
Subsystem:
Regression: ---
Bisected commit-id:
Attachments:
  9069: .config file that i use
  9070: lspci -vvv output from the affected machine
  9281: Patch to disable the Calgary's split completion timers
  9282: Patch to disable the Calgary's split completion timers
  9285: Even more restricted version of above patches.
  9292: slightly reworked version

Description Darrick J. Wong 2006-09-21 18:01:08 UTC
Most recent kernel where this bug did not occur: 2.6.17
Distribution: Ubuntu 6.06
Hardware Environment: IBM x366, 4x 3.16GHz Xeon, 8G RAM, aic94xx SAS controller,
6x 36G SAS disks, Calgary IOMMU
Software Environment: netbooted Ubuntu mutant.
Problem Description: aic94xx driver NMIs the system if calgary iommu is enabled.

Steps to reproduce:
1. Clone linux-2.6 and add in scsi-misc, scsi-rc-fixes and aic94xx from
git.kernel.org.
2. Build a netbootable "allmodconfig" kernel with the attached config file.
3. Boot it; the Calgary IOMMU code activates.
4. Loading the aic94xx driver afterwards causes the service processor to log
a PCI SERR with a split completion timeout, followed by an NMI, followed by a
reboot.

Note that disabling the Calgary IOMMU for the bus that the SAS controller is
connected to (either via kconfig or iommu=none or calgary=disable=1) will make
the problem go away.
Comment 1 Darrick J. Wong 2006-09-21 18:03:29 UTC
Created attachment 9069 [details]
.config file that i use

This is the .config file that I use.  Everything should be modular except for a
few NIC drivers and enough code to mount nfs root.
Comment 2 Darrick J. Wong 2006-09-21 18:04:29 UTC
Created attachment 9070 [details]
lspci -vvv output from the affected machine

Here's the lspci output.  Note that I've hit this bug on a few other
machines/models (x260) in our lab.
Comment 3 Andrew Morton 2006-09-21 18:14:10 UTC

Begin forwarded message:

Date: Thu, 21 Sep 2006 18:10:56 -0700
From: bugme-daemon@bugzilla.kernel.org
To: bugme-new@lists.osdl.org
Subject: [Bugme-new] [Bug 7180] New: aic94xx driver locks up on IBM x366 with Calgary IOMMU enabled


http://bugzilla.kernel.org/show_bug.cgi?id=7180

           Summary: aic94xx driver locks up on IBM x366 with Calgary IOMMU
                    enabled
    Kernel Version: 2.6.18
            Status: NEW
          Severity: normal
             Owner: djwong@us.ibm.com
         Submitter: djwong@us.ibm.com
                CC: alexisb@us.ibm.com




Comment 4 Anonymous Emailer 2006-09-21 22:55:51 UTC
Reply-To: muli@il.ibm.com

On Thu, Sep 21, 2006 at 06:23:50PM -0700, Andrew Morton wrote:
> 
> 
> Begin forwarded message:
> 
> Date: Thu, 21 Sep 2006 18:10:56 -0700
> From: bugme-daemon@bugzilla.kernel.org
> To: bugme-new@lists.osdl.org
> Subject: [Bugme-new] [Bug 7180] New: aic94xx driver locks up on IBM
> x366 with Calgary IOMMU enabled
>
[snip]
>
> http://bugzilla.kernel.org/show_bug.cgi?id=7180

Thanks Andrew, we're looking into it.

Cheers,
Muli

Comment 5 Darrick J. Wong 2006-09-22 10:47:50 UTC
Also, the machine doesn't always reboot after the NMI; sometimes the machine
just locks up hard and has to be power cycled.
Comment 6 Anonymous Emailer 2006-09-22 15:17:12 UTC
Reply-To: James.Bottomley@SteelEye.com

On Fri, 2006-09-22 at 09:05 +0300, Muli Ben-Yehuda wrote:
> Thanks Andrew, we're looking into it.

Just to add to the complexity, this seems to be working for me on a maia
system with aic94xx boot drives.

James

-----

PCI-DMA: Calgary IOMMU detected. TCE table spec is 6.
[...]
PCI-DMA: Using Calgary IOMMU
Calgary: enabling translation on PHB 0
Calgary: errant DMAs will now be prevented on this bus.
Calgary: enabling translation on PHB 1
Calgary: errant DMAs will now be prevented on this bus.
[...]
aic94xx: Adaptec aic94xx SAS/SATA driver version 1.0.2 loaded
GSI 16 sharing vector 0xA9 and IRQ 16
ACPI: PCI Interrupt 0000:01:02.0[A] -> GSI 25 (level, low) -> IRQ 16
aic94xx: found Adaptec AIC-9410W SAS/SATA Host Adapter, device 0000:01:02.0
scsi0 : aic94xx
aic94xx: BIOS present (1,1), 1548
aic94xx: ue num:1, ue size:88
aic94xx: 1Found FLASH(8) manuf:1, dev_id:0xda, sec_prot:0
aic94xx: manuf sect SAS_ADDR 5005076a0115a840
aic94xx: manuf sect PCBA SN 
aic94xx: ms: num_phy_desc: 8
aic94xx: ms: phy0: ENEBLEABLE
aic94xx: ms: phy1: ENEBLEABLE
aic94xx: ms: phy2: ENEBLEABLE
aic94xx: ms: phy3: ENEBLEABLE
aic94xx: ms: phy4: ENEBLEABLE
aic94xx: ms: phy5: ENEBLEABLE
aic94xx: ms: phy6: ENEBLEABLE
aic94xx: ms: phy7: ENEBLEABLE
aic94xx: ms: max_phys:0x8, num_phys:0x8
aic94xx: ms: enabled_phys:0xff
aic94xx: ctrla: phy0: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy1: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy2: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy3: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy4: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy5: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy6: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy7: sas_addr: 5005076a0115a840, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: max_scbs:512, max_ddbs:128
aic94xx: setting phy0 addr to 5005076a0115a840
aic94xx: setting phy1 addr to 5005076a0115a840
aic94xx: setting phy2 addr to 5005076a0115a840
aic94xx: setting phy3 addr to 5005076a0115a840
aic94xx: setting phy4 addr to 5005076a0115a840
aic94xx: setting phy5 addr to 5005076a0115a840
aic94xx: setting phy6 addr to 5005076a0115a840
aic94xx: setting phy7 addr to 5005076a0115a840
aic94xx: num_edbs:21
aic94xx: num_escbs:3
aic94xx: using sequencer V17/10c6
aic94xx: downloading CSEQ...
aic94xx: dma-ing 8192 bytes
aic94xx: verified 8192 bytes, passed
aic94xx: downloading LSEQs...
aic94xx: dma-ing 14336 bytes
aic94xx: LSEQ0 verified 14336 bytes, passed
aic94xx: LSEQ1 verified 14336 bytes, passed
aic94xx: LSEQ2 verified 14336 bytes, passed
aic94xx: LSEQ3 verified 14336 bytes, passed
aic94xx: LSEQ4 verified 14336 bytes, passed
aic94xx: LSEQ5 verified 14336 bytes, passed
aic94xx: LSEQ6 verified 14336 bytes, passed
aic94xx: LSEQ7 verified 14336 bytes, passed
aic94xx: max_scbs:446
aic94xx: first_scb_site_no:0x20
aic94xx: last_scb_site_no:0x1fe
aic94xx: First SCB dma_handle: 0x9000
aic94xx: device 0000:01:02.0: SAS addr 5005076a0115a840, PCBA SN , 8 phys, 8 enabled phys, flash present, BIOS build 1548
aic94xx: posting 3 escbs
aic94xx: escbs posted
aic94xx: posting 8 control phy scbs
aic94xx: enabled phys
aic94xx: control_phy_tasklet_complete: phy4, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy4: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 2c 50
aic94xx: 14: 03 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: control_phy_tasklet_complete: phy0, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy0: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 06 20
aic94xx: 14: 03 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: control_phy_tasklet_complete: phy1, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy1: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 06 20
aic94xx: 14: 04 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: control_phy_tasklet_complete: phy6, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy6: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 2c 50
aic94xx: 14: 07 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: control_phy_tasklet_complete: phy2, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy2: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 06 20
aic94xx: 14: 07 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: control_phy_tasklet_complete: phy7, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy7: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 2c 50
aic94xx: 14: 08 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: control_phy_tasklet_complete: phy3, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy3: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 06 20
aic94xx: 14: 08 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: control_phy_tasklet_complete: phy5, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy5: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 20 00 02 02
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 05 07 6a
aic94xx: 10: 00 00 2c 50
aic94xx: 14: 04 00 00 00
aic94xx: 18: 00 00 00 00
sas: phy4 added to port0, phy_mask:0x10
sas: phy0 added to port1, phy_mask:0x1
sas: phy1 matched wide port1
sas: phy1 added to port1, phy_mask:0x3
sas: phy6 matched wide port0
sas: phy6 added to port0, phy_mask:0x50
sas: phy2 matched wide port1
sas: phy2 added to port1, phy_mask:0x7
sas: phy7 matched wide port0
sas: phy7 added to port0, phy_mask:0xd0
sas: phy3 matched wide port1
sas: phy3 added to port1, phy_mask:0xf
sas: phy5 matched wide port0
sas: phy5 added to port0, phy_mask:0xf0
sas: DOING DISCOVERY on port 0, pid:869
sas: ex 5005076a00002c50 phy00:D attached: 50010b9000001369
sas: ex 5005076a00002c50 phy01:T attached: 0000000000000000
sas: ex 5005076a00002c50 phy02:T attached: 0000000000000000
sas: ex 5005076a00002c50 phy03:S attached: 5005076a0115a840
sas: ex 5005076a00002c50 phy04:S attached: 5005076a0115a840
sas: ex 5005076a00002c50 phy05:T attached: 0000000000000000
sas: ex 5005076a00002c50 phy06:T attached: 0000000000000000
sas: ex 5005076a00002c50 phy07:S attached: 5005076a0115a840
sas: ex 5005076a00002c50 phy08:S attached: 5005076a0115a840
sas: ex 5005076a00002c50 phy09:T attached: 0000000000000000
sas: ex 5005076a00002c50 phy10:T attached: 0000000000000000
sas: ex 5005076a00002c50 phy11:T attached: 0000000000000000
sas: ex 5005076a00002c50 phy12:D attached: 5005076a00002c5d
scsi 0:0:0:0: Direct access     IBM-ESXS BBA036C3ESTT0Z N BH06 PQ: 0 ANSI: 5
SCSI device sda: 71096640 512-byte hdwr sectors (36401 MB)
sda: Write Protect is off
sda: Mode Sense: d3 00 10 08
SCSI device sda: drive cache: write through w/ FUA
SCSI device sda: 71096640 512-byte hdwr sectors (36401 MB)
sda: Write Protect is off
sda: Mode Sense: d3 00 10 08
SCSI device sda: drive cache: write through w/ FUA
 sda: sda1 sda2 sda3
sd 0:0:0:0: Attached scsi disk sda
scsi 0:0:1:0: Enclosure         IBM-ESXS VSC7160          1.01 PQ: 0 ANSI: 3
sas: device 5005076a00002c5d, LUN 0 doesn't support TCQ
sas: DONE DISCOVERY on port 0, pid:869, result:0
sas: DOING DISCOVERY on port 1, pid:869
sas: ex 5005076a00000620 phy00:T attached: 0000000000000000
sas: ex 5005076a00000620 phy01:T attached: 0000000000000000
sas: ex 5005076a00000620 phy02:T attached: 0000000000000000
sas: ex 5005076a00000620 phy03:S attached: 5005076a0115a840
sas: ex 5005076a00000620 phy04:S attached: 5005076a0115a840
sas: ex 5005076a00000620 phy05:T attached: 0000000000000000
sas: ex 5005076a00000620 phy06:T attached: 0000000000000000
sas: ex 5005076a00000620 phy07:S attached: 5005076a0115a840
sas: ex 5005076a00000620 phy08:S attached: 5005076a0115a840
sas: ex 5005076a00000620 phy09:T attached: 0000000000000000
sas: ex 5005076a00000620 phy10:T attached: 0000000000000000
sas: ex 5005076a00000620 phy11:T attached: 0000000000000000
sas: ex 5005076a00000620 phy12:D attached: 5005076a0000062d
scsi 0:0:2:0: Enclosure         IBM-ESXS VSC7160          1.01 PQ: 0 ANSI: 3
sas: device 5005076a0000062d, LUN 0 doesn't support TCQ
sas: DONE DISCOVERY on port 1, pid:869, result:0


Comment 7 Darrick J. Wong 2006-09-22 23:13:31 UTC
> Just to add to the complexity...

Hey, I can add some too!  This condition doesn't always happen the first time
the driver loads.  Sometimes the machine survives anywhere from one to half a
dozen load/unload cycles before the SERR/NMI shows up.
Comment 8 Darrick J. Wong 2006-10-16 15:58:10 UTC
Created attachment 9281 [details]
Patch to disable the Calgary's split completion timers

This is a heavy-handed patch that disables all of the Calgary PCI-X bridge's
split completion timers.  Really, this sledgehammer should be cut down to do
this only on PHB 1 of an x366/x260/x460.  Or perhaps simply increasing the
timeout would suffice.  I'll go experiment with that, but in the meantime
here's a crappy patch in case anybody else wants to try it out.
Comment 9 Darrick J. Wong 2006-10-16 16:03:56 UTC
Created attachment 9282 [details]
Patch to disable the Calgary's split completion timers

Whoops, bad patch.  Let's try again.
Comment 10 Muli Ben-Yehuda 2006-10-17 04:42:23 UTC
Patch looks reasonable, I'll give it a spin on my machine. Does it help on your
problematic machine?
Comment 11 Darrick J. Wong 2006-10-17 10:11:50 UTC
Created attachment 9285 [details]
Even more restricted version of above patches.

Yes, these patches fix the SERR -> SC Timeout -> NMI problems on my x366 and
x260.  Attached is an even more restrictive version of the patch that (a) only
acts upon the particular bus in question and (b) sets a higher timeout instead
of turning it off altogether.
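
For illustration only (this is not the attached patch): raising one PHB's
timeout instead of disabling every timer amounts to a read-modify-write of that
PHB's field in a shared config register.  The sketch below assumes a
hypothetical layout with one 4-bit timer field per PHB; the shifts, the field
width, the TIMEOUT_LONG encoding and the name set_phb_timeout are placeholders,
not Calgary's real register definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: one 4-bit timeout field per PHB inside a single
 * 64-bit config register.  These shifts are placeholders and do NOT
 * describe Calgary's actual bit positions. */
#define TIMER_FIELD_MASK 0xFULL
#define TIMEOUT_LONG     0x8ULL  /* hypothetical "longer timeout" encoding */

static const unsigned int phb_shift[4] = { 0, 4, 8, 12 };

/* Return a copy of reg in which only the timer field of phb_id is replaced
 * by value; the fields belonging to the other PHBs are preserved. */
uint64_t set_phb_timeout(uint64_t reg, unsigned int phb_id, uint64_t value)
{
    assert(phb_id < 4);
    assert(value <= TIMER_FIELD_MASK);

    unsigned int shift = phb_shift[phb_id];

    reg &= ~(TIMER_FIELD_MASK << shift);  /* clear this PHB's field */
    reg |= value << shift;                /* install the new timeout */
    return reg;
}
```

In a driver the value would be read from and written back to the bridge's MMIO
register; the pure function above only shows that the other PHBs' fields are
left untouched, e.g. set_phb_timeout(reg, 1, TIMEOUT_LONG) for PHB 1.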
Comment 12 Muli Ben-Yehuda 2006-10-18 02:42:53 UTC
Created attachment 9292 [details]
slightly reworked version

Great to hear it! I think there's a slight mistake in your latest patch though:
you're changing the timer for PHB2 rather than PHB1. Can you please give this
slightly reworked patch a spin and confirm that it solves the problem? If it
does, we'll go ahead and apply it (without the debug printks).
Comment 13 Darrick J. Wong 2006-10-18 11:32:40 UTC
Patch works on my x366.  Will see about shaking out an x260/x460 for testing
there too, though I'm fairly confident that a fix for one fixes all three.
Comment 14 Darrick J. Wong 2006-10-18 11:52:28 UTC
Fixes the x260 also.  ACK.
Comment 15 Muli Ben-Yehuda 2006-10-18 12:41:22 UTC
Excellent, I'll push the patch upstream. Nice work :-)
Comment 16 Konrad Rzeszutek 2006-10-19 07:03:45 UTC
On a 4-node x460 with the Adaptec Razor (9410SAS) enabled in the second, third
and fourth nodes, and the ServeRAID card in the first node, would this patch
still work?

I presume (and I am probably incorrect) that the PHB for the aic94xx on nodes
2, 3 and 4 is not 1?
Comment 17 Muli Ben-Yehuda 2006-10-19 09:39:55 UTC
> For 4-node x460 with the Adaptec Razor (9410SAS) being enabled in second,
> third and fourth node. and the ServerRAID card in the first node, would this
> patch still work?

Hmm... what does the machine have on PCI bus 1? Basically it will only modify
the timer on bus 1, so it won't "work" in that sense, but if you haven't seen
any problems that this could in theory fix, everything should continue to work
fine.

> I presume (and I am probably incorrect) that the 2,3,4-node PHB for the
> aic94xx is not 1?

The bus number is not 1 but the PHB ID is 1 (there are 4 PHBs per Calgary,
numbered 0-3, so multiple busses will have PHB ID 1, one for each Calgary).
Comment 18 Darrick J. Wong 2006-10-19 09:47:47 UTC
Maybe we should look for PHBs with aic94xx adapters hanging off, and increase
the SC timeout for any that we find?  Or just increase all of them across the board?
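
A rough sketch of the first option, assuming a simplified device table rather
than the kernel's PCI core: walk the devices, match the adapter's vendor/device
ID, and flag each bus that hosts one so that only those PHBs get the longer
timeout.  The struct, the function name and the ID values are illustrative
assumptions and should be checked against the lspci output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified device record (not the kernel's struct pci_dev). */
struct pci_dev_info {
    uint16_t vendor;
    uint16_t device;
    uint8_t  bus;     /* PCI bus number the device sits on */
};

/* Illustrative IDs only; verify against lspci before relying on them. */
#define EXAMPLE_VENDOR 0x9005
#define EXAMPLE_DEVICE 0x0410

/* Set needs_fix[bus] = 1 for every bus that hosts a matching adapter.
 * Returns the number of distinct buses newly flagged by this call. */
size_t mark_buses_with_adapter(const struct pci_dev_info *devs, size_t n,
                               uint8_t needs_fix[256])
{
    size_t count = 0;

    assert(devs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (devs[i].vendor != EXAMPLE_VENDOR ||
            devs[i].device != EXAMPLE_DEVICE)
            continue;                     /* not the adapter of interest */
        if (!needs_fix[devs[i].bus]) {
            needs_fix[devs[i].bus] = 1;   /* first match on this bus */
            count++;
        }
    }
    return count;
}
```

The flagged buses could then be mapped to their PHB IDs and have the timeout
raised one by one, instead of changing every PHB across the board.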
Comment 19 Muli Ben-Yehuda 2006-10-19 09:52:06 UTC
Yeah, we could do that... but I'd rather understand first why it makes a
difference. Do we have any contacts at Adaptec that could maybe help on this?
Comment 20 Darrick J. Wong 2006-10-19 17:16:22 UTC
Jack Hammer might be able to help us.  Either that or possibly the documentation.
Comment 21 Muli Ben-Yehuda 2006-10-24 04:50:00 UTC
Patch is in 2.6.19-rc3, closing.