Hello,

I just upgraded from kernel 2.6.29.6 to kernel 2.6.30.5. One HP Proliant DL320 G3 box with two SATA drives fails to detect the drives properly on boot. Here's a write-down of the messages:

ata1: SATA max UDMA/133 abar m1024@xfbee0000 port 0xfbee0100 irq17
ata2: SATA max UDMA/133 abar m1024@xfbee0000 port 0xfbee0180 irq17
ata3: SATA max UDMA/133 abar m1024@xfbee0000 port 0xfbee0200 irq17
ata4: SATA max UDMA/133 abar m1024@xfbee0000 port 0xfbee0280 irq17
ata4: SATA link down (SStatus 0 SControl 300)
ata3: SATA link down (SStatus 0 SControl 300)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00 qc timeout (cmd 0xec)
ata1.00 failed to IDENTIFY (I/O error, err_mask=0x4)
ata2.00 qc timeout (cmd 0xec)
ata2.00 failed to IDENTIFY (I/O error, err_mask=0x4)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00 qc timeout (cmd 0xec)
ata1.00 failed to IDENTIFY (I/O error, err_mask=0x4)
ata1: limiting SATA link speed to 1.5 Gbps
ata2.00 qc timeout (cmd 0xec)
ata2.00 failed to IDENTIFY (I/O error, err_mask=0x4)
ata2: limiting SATA link speed to 1.5 Gbps
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
--hangs for some time--

I can also see that the ahci driver is activated for this chipset. The same kernel image boots fine on 14 other boxes, including some HP Proliant DL320 boxes in AHCI mode.

lspci -v:

0000:00:1f.2 RAID bus controller: Intel Corp.: Unknown device 2652 (rev 03)
        Subsystem: Hewlett-Packard Company: Unknown device 3206
        Flags: bus master, 66Mhz, medium devsel, latency 0, IRQ 17
        I/O ports at 1080
        I/O ports at 1088 [size=4]
        I/O ports at 1090 [size=8]
        I/O ports at 1098 [size=4]
        I/O ports at 10a0 [size=16]
        Memory at fbee0000 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [70] Power Management version 2

I've now reverted to kernel 2.6.29.6, which boots fine.

Any idea what that could be?

Thanks,
Thomas
Looks like an IRQ delivery problem. Does irqpoll or pci=noacpi help?
Yes, it really feels like an IRQ problem: once the system hangs and I start hitting the return key, it shows one more line of output for every key pressed (= one generated interrupt).

Unfortunately, none of irqpoll, pci=noacpi or pci=nomsi helped. I've created boot logs with a serial console for every variant. I'll also attach my kernel config and a working boot log from kernel 2.6.29. The "pci=noacpi" boot log contains a backtrace related to IRQs.

Maybe this helps a little: the box is packed with 6 network interfaces, so I guess it has more IRQ sharing than the other boxes.

It's a production system, so I can only trash it on the weekend or in the evening.
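To check the IRQ-sharing guess above, something like the following could list the interrupt lines that have more than one handler attached. This is only a rough sketch: the function name `shared_irqs` is made up for illustration, and it assumes the usual /proc/interrupts layout, where the handlers on a shared line are separated by commas and handler names contain no spaces.

```python
import re


def shared_irqs(interrupts_text):
    """Return {irq_number: [handler names]} for lines with 2+ handlers.

    Assumption: input is the text of /proc/interrupts, where a shared
    line ends in a comma-separated handler list, e.g. "ahci, eth0".
    """
    shared = {}
    for line in interrupts_text.splitlines():
        # Only numbered IRQ lines; this skips the CPU header and
        # summary rows such as "NMI:" or "ERR:".
        match = re.match(r"\s*(\d+):\s+(.*)$", line)
        if not match:
            continue
        parts = [part.strip() for part in match.group(2).split(",")]
        if len(parts) < 2:
            continue  # a single handler, so the line is not shared
        # The first piece still carries the counters and the chip type;
        # the handler name is its last whitespace-separated token.
        first_handler = parts[0].split()[-1]
        shared[int(match.group(1))] = [first_handler] + parts[1:]
    return shared
```

On a running system the input would come from reading /proc/interrupts, for example `shared_irqs(open("/proc/interrupts").read())`. Comparing the result between this box and one of the working boxes might show whether IRQ 17 is shared more heavily here.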
Created attachment 23014
Log files from serial console + kernel config
I've silently replaced the production box with a "spare" one and can now play around with it.

What I've tested so far:

2.6.29 vanilla -> ok
2.6.30 vanilla -> fails
2.6.31-rc9 -> fails

I'll try to bisect 2.6.29 <-> 2.6.30.
Here's my current bisect log:

[root@intradev linux-2.6]# git bisect log
git bisect start
# bad: [07a2039b8eb0af4ff464efd3dfd95de5c02648c6] Linux 2.6.30
git bisect bad 07a2039b8eb0af4ff464efd3dfd95de5c02648c6
# good: [8e0ee43bc2c3e19db56a4adaa9a9b04ce885cd84] Linux 2.6.29
git bisect good 8e0ee43bc2c3e19db56a4adaa9a9b04ce885cd84
# bad: [3c6fae67d026d57f64eb3da9c0d0e76983e39ae3] Merge branch 'hwmon-for-linus' of git://jdelvare.pck.nerim.net/jdelvare-2.6
git bisect bad 3c6fae67d026d57f64eb3da9c0d0e76983e39ae3
# bad: [a8416961d32d8bb757bcbb86b72042b66d044510] Merge branch 'irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
git bisect bad a8416961d32d8bb757bcbb86b72042b66d044510
# good: [fe85ff8299538c8488645e7d72539079dad5bae6] usbnet: convert dms9601 driver to net_device_ops
git bisect good fe85ff8299538c8488645e7d72539079dad5bae6
# good: [928a726b0e12184729900c076e13dbf1c511c96c] Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6
git bisect good 928a726b0e12184729900c076e13dbf1c511c96c
# bad: [d3f12d36f148f101c568bdbce795e41cd9ceadf3] Merge branch 'kvm-updates/2.6.30' of git://git.kernel.org/pub/scm/virt/kvm/kvm
git bisect bad d3f12d36f148f101c568bdbce795e41cd9ceadf3
# good: [61a091827e273650b39eb87c799a6d260913fa0b] Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6
git bisect good 61a091827e273650b39eb87c799a6d260913fa0b
# good: [71450f78853b82d55cda4e182c9db6e26b631485] KVM: Report IRQ injection status for MSI delivered interrupts
git bisect good 71450f78853b82d55cda4e182c9db6e26b631485
# bad: [39f15003c7b268e4199d5ddce60a6944a74a14b7] Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
git bisect bad 39f15003c7b268e4199d5ddce60a6944a74a14b7
# bad: [9223d01b2fdf638a73888ad73a1784fca3454c1e] pata-rb532-cf: platform_get_irq() fix ignored failure
git bisect bad 9223d01b2fdf638a73888ad73a1784fca3454c1e
# good: [6be976e79db3ba691b657476a8bf4a635e5586f9] pata-rb532-cf: drop custom freeze and thaw
git bisect good 6be976e79db3ba691b657476a8bf4a635e5586f9

I'm very close to the issue and have to abort for today :o)

Maybe it's commit a5bfc4714b3f01365aef89a92673f2ceb1ccf246 "ahci: drop intx manipulation on msi enable"?
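For anyone who wants to see which commits are still suspects at this point in a bisect, here is a small sketch that turns a saved `git bisect log` into a `git log` command listing them. The helper name `remaining_suspects_cmd` is hypothetical, and it assumes the marks were made by a normal bisect run, so that the most recent "bad" mark is the tightest upper bound.

```python
import re

# Matches the replayable lines of `git bisect log`, such as
# "git bisect good <sha>"; the "# good: [...]" comment lines are ignored.
_MARK = re.compile(r"^\s*git bisect (good|bad) ([0-9a-f]{7,40})\s*$")


def remaining_suspects_cmd(bisect_log_text):
    """Build a `git log` argument list showing the commits still suspect.

    Suspects are the commits reachable from the latest bad mark but
    not from any commit marked good.
    """
    latest_bad = None
    goods = []
    for line in bisect_log_text.splitlines():
        match = _MARK.match(line)
        if not match:
            continue
        verdict, sha = match.groups()
        if verdict == "bad":
            latest_bad = sha  # later bad marks narrow the range
        else:
            goods.append(sha)
    if latest_bad is None or not goods:
        raise ValueError("need at least one good and one bad mark")
    # `--not` excludes everything reachable from the revisions after it.
    return ["git", "log", "--oneline", latest_bad, "--not"] + goods
```

The returned list can be passed to `subprocess.run` inside the kernel tree, or joined with spaces and pasted into a shell.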
Reverting a5bfc4714b3f01365aef89a92673f2ceb1ccf246 "ahci: drop intx manipulation on msi enable" solved the issue. I can now boot 2.6.30 / 2.6.31-rc9.

Would it hurt to revert this patch? It has already been discussed here: http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg31682.html
Any short comment on this one?
Can you please post the output of "lspci -nnvvvxxx" and "dmidecode"?
Created attachment 23098
dmidecode and lspci output

The logs are from kernel 2.6.30.5 + the reverted code in ahci.c.
Hmmm... for now I think reverting the offending commit is the right thing to do. I'll post a patch to revert it. Longer term, I think this is something the PCI layer should take care of, not individual drivers. I'll see whether there's a good place I can hook into.

Thanks.
> Thanks.

Thank -you- :)
Patch has been applied upstream.
Please see [1]. It seems this issue also appears under other circumstances; my environment does not match the one reported earlier in this bug.

It seems commit [2] did not make it into linux-2.6.31.1. I assume the first kernel releases not affected by the issue will be 2.6.30.9 (if that is released) and 2.6.31.2, right?

[1] http://thread.gmane.org/gmane.linux.kernel/894187
[2] 31b239ad1ba7225435e13f5afc47e48eb674c0cc