Bug 14124 - Boot failure with ICH6R in AHCI mode
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All Linux
Importance: P1 normal
Assignee: Tejun Heo
Reported: 2009-09-04 18:22 UTC by Thomas Jarosch
Modified: 2009-09-25 11:39 UTC
Kernel Version: 2.6.30.5
Regression: No

Attachments
Log files from serial console + kernel config (24.00 KB, application/x-compressed-tar)
2009-09-05 11:35 UTC, Thomas Jarosch
dmidecode and lspci output (6.75 KB, application/x-compressed-tar)
2009-09-16 07:33 UTC, Thomas Jarosch

Description Thomas Jarosch 2009-09-04 18:22:39 UTC
Hello,

I just upgraded from kernel 2.6.29.6 to kernel 2.6.30.5.
One HP Proliant DL320 G3 box with two sata drives fails to
detect the drives properly on boot.

Here's a write-down of the messages:
ata1: SATA max UDMA/133 abar m1024@0xfbee0000 port 0xfbee0100 irq 17
ata2: SATA max UDMA/133 abar m1024@0xfbee0000 port 0xfbee0180 irq 17
ata3: SATA max UDMA/133 abar m1024@0xfbee0000 port 0xfbee0200 irq 17
ata4: SATA max UDMA/133 abar m1024@0xfbee0000 port 0xfbee0280 irq 17
ata4: SATA link down (SStatus 0 SControl 300)
ata3: SATA link down (SStatus 0 SControl 300)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00 qc timeout (cmd 0xec)
ata1.00 failed to IDENTIFY (I/O error, err_mask=0x4)
ata2.00 qc timeout (cmd 0xec)
ata2.00 failed to IDENTIFY (I/O error, err_mask=0x4)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00 qc timeout (cmd 0xec)
ata1.00 failed to IDENTIFY (I/O error, err_mask=0x4)
ata1: limiting SATA link speed to 1.5 Gbps
ata2.00 qc timeout (cmd 0xec)
ata2.00 failed to IDENTIFY (I/O error, err_mask=0x4)
ata2: limiting SATA link speed to 1.5 Gbps
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
--hangs for some time--
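
The register values in the write-down above can be decoded by hand. The following is a minimal sketch, assuming the SStatus/SControl bit layout from the Serial ATA specification (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8) and assuming libata prints these registers in hex without a prefix. The function name `decode_sata_reg` is hypothetical, and the two constants mirror values I believe libata uses (0xEC for IDENTIFY DEVICE, bit 2 of err_mask for a timeout); treat them as assumptions rather than quotes from the kernel source.

```python
def decode_sata_reg(value):
    """Split an SStatus or SControl value into its DET, SPD and IPM fields."""
    return {
        "det": value & 0xF,         # device detection / interface initialisation
        "spd": (value >> 4) & 0xF,  # negotiated speed (SStatus) or speed limit (SControl)
        "ipm": (value >> 8) & 0xF,  # interface power management
    }

# Meaning of the SPD field for the generations relevant to this hardware.
SPEEDS = {0: "none negotiated / no limit", 1: "1.5 Gbps", 2: "3.0 Gbps"}

ATA_CMD_IDENTIFY = 0xEC   # "cmd 0xec" in the log (assumed libata value)
AC_ERR_TIMEOUT = 1 << 2   # "err_mask=0x4" in the log (assumed libata value)

for name, raw in [("SStatus", "113"), ("SControl", "300"), ("SControl", "310")]:
    fields = decode_sata_reg(int(raw, 16))
    print(name, raw, fields, SPEEDS.get(fields["spd"], "unknown"))
```

Read this way, SStatus 113 says the link itself is up at 1.5 Gbps, and the change from SControl 300 to 310 matches the "limiting SATA link speed" messages. The failure is therefore in command completion (IDENTIFY timing out), not in link negotiation.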


I can also see that the ahci driver is activated for this chipset.
The same kernel image boots fine on 14 other boxes,
including some HP Proliant DL320 boxes in AHCI mode.

lspci -v:
0000:00:1f.2 RAID bus controller: Intel Corp.: Unknown device 2652 (rev 03)
        Subsystem: Hewlett-Packard Company: Unknown device 3206            
        Flags: bus master, 66Mhz, medium devsel, latency 0, IRQ 17         
        I/O ports at 1080                                                  
        I/O ports at 1088 [size=4]                                         
        I/O ports at 1090 [size=8]                                         
        I/O ports at 1098 [size=4]                                         
        I/O ports at 10a0 [size=16]                                        
        Memory at fbee0000 (32-bit, non-prefetchable) [size=1K]            
        Capabilities: [70] Power Management version 2     

I've now reverted to kernel 2.6.29.6, which boots fine.

Any idea what that could be?

Thanks,
Thomas
Comment 1 Tejun Heo 2009-09-05 00:33:18 UTC
Looks like an IRQ delivery problem.  Does irqpoll or pci=noacpi help?
Comment 2 Thomas Jarosch 2009-09-05 11:33:43 UTC
Yes, it really feels like an IRQ problem: Once the system hangs and I start to hit the return key, it will show one more line of output for every key pressed (=generated interrupt).

Unfortunately, irqpoll, pci=noacpi or pci=nomsi didn't help. I've created boot logs with a serial console for every variant. I'll also attach my kernel config and a working boot log from kernel 2.6.29. The "pci=noacpi" boot log contains a backtrace related to IRQs.

Maybe this helps a little: The box is packed with 6 network interfaces, so I guess it will have more IRQ sharing than the other boxes. It's a production system, so I can only trash it on the weekend or in the evening.
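
One way to check the IRQ-sharing guess is to list which drivers are registered on the same line as ahci in /proc/interrupts. The sketch below is illustrative only: `devices_on_irq` is a hypothetical helper, it assumes the 2.6-era layout where the interrupt controller type is a single token (for example `IO-APIC-fasteoi`), and the `SAMPLE` text is invented for demonstration, not taken from the reporter's machine.

```python
def devices_on_irq(interrupts_text, irq):
    """Return the device names registered on one IRQ line of /proc/interrupts."""
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or label.strip() != str(irq):
            continue
        fields = rest.split()
        # Skip the per-CPU interrupt counters at the start of the row.
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1
        # fields[i] is the controller type; the remainder is a comma-separated name list.
        names = " ".join(fields[i + 1:])
        return [n.strip() for n in names.split(",") if n.strip()]
    return []

# Invented sample in the assumed layout; real input would come from /proc/interrupts.
SAMPLE = """\
           CPU0
  0:        153   IO-APIC-edge      timer
 17:          2   IO-APIC-fasteoi   ahci, eth2, eth3
"""

print(devices_on_irq(SAMPLE, 17))
```

On a live system the text would come from `open("/proc/interrupts").read()`. A counter for the ahci line that stays near zero while commands time out would support the IRQ-delivery theory.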
Comment 3 Thomas Jarosch 2009-09-05 11:35:18 UTC
Created attachment 23014
Log files from serial console + kernel config
Comment 4 Thomas Jarosch 2009-09-07 13:53:08 UTC
I've silently replaced the production box with a "spare" one and now can play around with it. What I've tested so far:

2.6.29 vanilla -> ok
2.6.30 vanilla -> fails
2.6.31-rc9 -> fails

I'll try to bisect 2.6.29 <-> 2.6.30.
Comment 5 Thomas Jarosch 2009-09-07 16:08:11 UTC
Here's my current bisect log:

[root@intradev linux-2.6]# git bisect log
git bisect start
# bad: [07a2039b8eb0af4ff464efd3dfd95de5c02648c6] Linux 2.6.30
git bisect bad 07a2039b8eb0af4ff464efd3dfd95de5c02648c6
# good: [8e0ee43bc2c3e19db56a4adaa9a9b04ce885cd84] Linux 2.6.29
git bisect good 8e0ee43bc2c3e19db56a4adaa9a9b04ce885cd84
# bad: [3c6fae67d026d57f64eb3da9c0d0e76983e39ae3] Merge branch 'hwmon-for-linus' of git://jdelvare.pck.nerim.net/jdelvare-2.6
git bisect bad 3c6fae67d026d57f64eb3da9c0d0e76983e39ae3
# bad: [a8416961d32d8bb757bcbb86b72042b66d044510] Merge branch 'irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
git bisect bad a8416961d32d8bb757bcbb86b72042b66d044510
# good: [fe85ff8299538c8488645e7d72539079dad5bae6] usbnet: convert dms9601 driver to net_device_ops
git bisect good fe85ff8299538c8488645e7d72539079dad5bae6
# good: [928a726b0e12184729900c076e13dbf1c511c96c] Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6
git bisect good 928a726b0e12184729900c076e13dbf1c511c96c
# bad: [d3f12d36f148f101c568bdbce795e41cd9ceadf3] Merge branch 'kvm-updates/2.6.30' of git://git.kernel.org/pub/scm/virt/kvm/kvm
git bisect bad d3f12d36f148f101c568bdbce795e41cd9ceadf3
# good: [61a091827e273650b39eb87c799a6d260913fa0b] Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6
git bisect good 61a091827e273650b39eb87c799a6d260913fa0b
# good: [71450f78853b82d55cda4e182c9db6e26b631485] KVM: Report IRQ injection status for MSI delivered interrupts
git bisect good 71450f78853b82d55cda4e182c9db6e26b631485
# bad: [39f15003c7b268e4199d5ddce60a6944a74a14b7] Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
git bisect bad 39f15003c7b268e4199d5ddce60a6944a74a14b7
# bad: [9223d01b2fdf638a73888ad73a1784fca3454c1e] pata-rb532-cf: platform_get_irq() fix ignored failure
git bisect bad 9223d01b2fdf638a73888ad73a1784fca3454c1e
# good: [6be976e79db3ba691b657476a8bf4a635e5586f9] pata-rb532-cf: drop custom freeze and thaw
git bisect good 6be976e79db3ba691b657476a8bf4a635e5586f9
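
To see where a bisect log like the one above currently stands, the most recent good and bad marks can be pulled out mechanically. This is a small sketch, assuming the plain `git bisect log` format shown here; `latest_marks` is a hypothetical helper, and `LOG_TAIL` simply repeats the last four lines of the log. Note that git itself tracks every good commit, not just the latest, so with merges in the history this is only an approximation of the remaining range.

```python
import re

# Matches the command lines of `git bisect log`, not the "# good:" / "# bad:" comments.
BISECT_LINE = re.compile(r"^git bisect (good|bad) ([0-9a-f]{7,40})$")

def latest_marks(log_text):
    """Return (last_good, last_bad) commit ids from `git bisect log` output."""
    last = {"good": None, "bad": None}
    for line in log_text.splitlines():
        match = BISECT_LINE.match(line.strip())
        if match:
            last[match.group(1)] = match.group(2)
    return last["good"], last["bad"]

LOG_TAIL = """\
# bad: [9223d01b2fdf638a73888ad73a1784fca3454c1e] pata-rb532-cf: platform_get_irq() fix ignored failure
git bisect bad 9223d01b2fdf638a73888ad73a1784fca3454c1e
# good: [6be976e79db3ba691b657476a8bf4a635e5586f9] pata-rb532-cf: drop custom freeze and thaw
git bisect good 6be976e79db3ba691b657476a8bf4a635e5586f9
"""

good, bad = latest_marks(LOG_TAIL)
print("last good:", good)
print("last bad: ", bad)
```

The commits still under suspicion lie between those two marks; `git bisect visualize` shows the exact remaining set.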


I'm very close to the issue and have to abort for today :o)

Maybe it's commit a5bfc4714b3f01365aef89a92673f2ceb1ccf246
"ahci: drop intx manipulation on msi enable"?
Comment 6 Thomas Jarosch 2009-09-08 08:26:31 UTC
Reverting a5bfc4714b3f01365aef89a92673f2ceb1ccf246
"ahci: drop intx manipulation on msi enable"
solved the issue, I can now boot 2.6.30 / 2.6.31-rc9.

Would it hurt to revert this patch?
It was already talked about:
http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg31682.html
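
The thread does not spell out the mechanism, but the commit title suggests the driver used to touch the INTx-disable bit in the PCI command register. A hedged sketch for inspecting that bit from userspace follows. It assumes the standard PCI header layout (16-bit command register at offset 0x04, INTx disable at bit 10) and the Linux sysfs `config` file; `intx_disabled` and `read_pci_config` are hypothetical helper names, and the header built at the end is synthetic, not taken from the attached dumps.

```python
import struct

PCI_COMMAND_OFFSET = 0x04         # 16-bit command register in the standard config header
PCI_COMMAND_INTX_DISABLE = 0x400  # bit 10: device may not assert legacy INTx

def intx_disabled(config_bytes):
    """True if the INTx-disable bit is set in a raw PCI config-space dump."""
    (command,) = struct.unpack_from("<H", config_bytes, PCI_COMMAND_OFFSET)
    return bool(command & PCI_COMMAND_INTX_DISABLE)

def read_pci_config(bdf="0000:00:1f.2"):
    """Read the first 64 config bytes of a device from sysfs (Linux only)."""
    with open("/sys/bus/pci/devices/%s/config" % bdf, "rb") as f:
        return f.read(64)

# Synthetic header for illustration: vendor 0x8086, device 0x2652, command 0x0407.
header = bytearray(64)
struct.pack_into("<HHH", header, 0, 0x8086, 0x2652, 0x0407)
print(intx_disabled(header))
```

If firmware leaves this bit set and the controller has no MSI capability (the `lspci -v` output in the description lists only Power Management), a driver that no longer clears the bit would never see an interrupt. That would be consistent with the IDENTIFY timeouts, though it remains an assumption here.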
Comment 7 Thomas Jarosch 2009-09-15 12:04:47 UTC
Any short comment on this one?
Comment 8 Tejun Heo 2009-09-15 17:40:28 UTC
Can you please post the output of "lspci -nnvvvxxx" and "dmidecode"?
Comment 9 Thomas Jarosch 2009-09-16 07:33:22 UTC
Created attachment 23098
dmidecode and lspci output

The logs are from kernel 2.6.30.5 + the reverted code in ahci.c.
Comment 10 Tejun Heo 2009-09-16 14:13:26 UTC
Hmmm... for now I think reverting the offending commit is the right thing to do.  I'll post a patch to revert it.  Longer term, I think this is something the PCI layer should take care of, not individual drivers.  I'll see whether there's a good place I can hook into.

Thanks.
Comment 11 Thomas Jarosch 2009-09-16 14:43:57 UTC
> Thanks.

Thank -you- :)
Comment 12 Thomas Jarosch 2009-09-18 08:15:38 UTC
Patch has been applied upstream.
Comment 13 Alexander Huemer 2009-09-25 11:39:40 UTC
Please see [1].
It seems this issue also appears under other circumstances; my environment does not match the one reported earlier in this bug.
It seems commit [2] did not make it into linux-2.6.31.1.
I assume the first kernel releases not affected by the issue will be 2.6.30.9 (if that is released) and 2.6.31.2, right?

[1] http://thread.gmane.org/gmane.linux.kernel/894187
[2] 31b239ad1ba7225435e13f5afc47e48eb674c0cc
