Latest working kernel version: 2.6.27-rc5 Earliest failing kernel version: 2.6.27-rc6 Bug caused by commit 2fd673ecf0378ddeeeb87b3605e50212e0c0ddc6 Distribution: Ubuntu 8.10 Hardware Environment: x86_64 (Athlon X2 4400+) nVidia CK804 chipset 1 PATA disk ST3250823A 1 SATA disk ST3500630AS Problem Description: Disk IO to SATA is slow, and during IO the kernel logs messages like this: [ 86.416012] ata4: EH complete [ 89.884011] ata4: EH complete [ 91.440012] ata4: EH complete [ 95.632013] ata4: EH complete [ 120.320012] ata4: EH complete [ 146.912695] ata4: EH pending after 5 tries, giving up [ 146.912748] ata4: EH complete [ 148.472754] ata4: EH pending after 5 tries, giving up [ 148.472806] ata4: EH complete [ 150.033090] ata4: EH pending after 5 tries, giving up [ 150.033142] ata4: EH complete [ 151.592379] ata4: EH pending after 5 tries, giving up I first noticed this with Ubuntus 2.6.27-3-generic kernel, which is based on -rc6. I then verified the bug using HEAD on Linux linux-2.6 git. The problem seems to be caused by the latest commit for drivers/ata/sata_nv.c, reverting this solves the problem for me: commit 2fd673ecf0378ddeeeb87b3605e50212e0c0ddc6 Author: Tejun Heo <tj@kernel.org> Date: Fri Aug 29 16:13:12 2008 +0200 sata_nv: disable hardreset for generic
Created attachment 17926 [details] dmesg
Created attachment 17927 [details] lspci -vv
Created attachment 18001 [details] sata_nv-eh-debug.patch Please apply the attached patch and post the resulting log. Please don't forget to turn on KALLSYMS.
Created attachment 18011 [details] System log This is my system log on a patched 2.6.27-rc7 kernel.
Created attachment 18057 [details] dmesg.2.6.27-rc7_sata_nv-eh-debug Here is my log as well, looks similar to the one from Peter. Mine is also using 2.6.27-rc7 + patch.
Looks like nv_host_intr is triggering a freeze. Looking at the code the only reason it should be able to do that is if it detects a hotplug condition?
Created attachment 18078 [details] sata_nv-reinstate-nv_hardreset.patch Can you please test this patch? It seems I made a stupid mistake while trying to fix nv_generic hardreset regression. Thanks.
The patch corrects the problem here.
Handled-By : Tejun Heo <tj@kernel.org> Patch : http://bugzilla.kernel.org/attachment.cgi?id=18078&action=view
The patch corrects the problem here as well.
Patch posted. http://article.gmane.org/gmane.linux.ide/34750
Sorry for my slow responses, I have not had enough time the last days. The patch fixes the bug on my machine as well. I see that it also made it into -rc8, thanks a lot!
Fixed by: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4c1eb90a0908c0c60db2169dce08fb672e7582f1
I saw this problem for the first time when I upgraded to the most recent stable kernel, 2.6.28.7. The kern.log file is full of messages such as: Feb 27 09:58:38 dolphin kernel: ata2: EH pending after 5 tries, giving up Feb 27 09:58:38 dolphin kernel: ata2: EH complete Feb 27 09:58:39 dolphin kernel: ata2: EH pending after 5 tries, giving up If I reboot into kernel 2.6.28.6, built with exactly the same configuration, the messages disappear. The "hardinfo" program reports that I have the following hardware: Processor: AMD Athlon(tm) 64 Processor 3000+ stepping 15, 31, 0 (AMD Opteron/Athlon64/FX) PCI Devices: Host bridge nVidia Corporation nForce3 250Gb Host Bridge ISA bridge nVidia Corporation nForce3 250Gb LPC Bridge SMBus nVidia Corporation nForce 250Gb PCI System Management USB Controller nVidia Corporation CK8S USB Controller USB Controller nVidia Corporation CK8S USB Controller USB Controller nVidia Corporation nForce3 EHCI USB 2.0 Controller Audio Controller nVidia Corporation nForce3 250Gb AC'97 Audio Controller IDE interface nVidia Corporation CK8S Parallel ATA Controller IDE interface nVidia Corporation nForce3 Serial ATA Controller PCI bridge nVidia Corporation nForce3 250Gb AGP Host to PCI Bridge PCI bridge nVidia Corporation nForce3 250Gb PCI-to-PCI Bridge Host bridge Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration Host bridge Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map Host bridge Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller Host bridge Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control VGA controller ATI Technologies Inc RV280 [Radeon 9200 SE] Ethernet controller National Semiconductor Corporation DP83815 FireWire (IEEE 1394) VIA Technologies, Inc. VT6306 Fire II IEEE 1394 OHCI Link Layer Controller Ethernet controller Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller Storage: IDE Disks LITE-ON COMBO SOHC-5235K SCSI Disks ATA WDC WD800JD-00JN
Can you please post the output of "lspci -nn"? Also, is it possible for you to catch full failing boot log?
Created attachment 20394 [details] Output of "lspci -nn"
Created attachment 20395 [details] Contents of /var/log/dmesg for kernel 2.6.28.6
Created attachment 20396 [details] Contents of /var/log/dmesg for kernel 2.6.28.7
I hope the three attachments provide the information you need. I see that the line "ata2: SATA link down (SStatus 0 SControl 300)", present in the dmesg output when I boot with 2.6.28.6, is missing with 2.6.28.7. I am no Linux expert and only compile my own kernel because the stock Debian kernel did not support my network card when I first purchased this machine. It is likely that my kernel configuration is dumb. Please ask if you need to see it. Sorry for creating four messages instead of one; I'm still learning to use bugzilla. Thanks.
Creating separate entry for each file is fine. Bugzilla is pretty dumb that way. For 2.6.28.7, /var/log/dmesg looks truncated and missing some log entries. Can you please do "dmesg > dmesg.out" after boot and attach the result?
I believe that dmesg gave the same truncated entry (missing some stuff at the beginning) as appeared in /var/log/dmesg. I am away from home at present and won't be able to access the computer in question (which is currently off) again until Tuesday evening, UK time, but will try again then and report.
I have the same problem. I had it with Ubuntu Intrepid (testing) and it was fixed. It appeared again recently with Ubuntu Jaunty (testing), kernel 2.6.28.8 dan@shuttle:~$ lspci -nn 00:00.0 Host bridge [0600]: nVidia Corporation nForce3 250Gb Host Bridge [10de:00e1] (rev a1) 00:01.0 ISA bridge [0601]: nVidia Corporation nForce3 250Gb LPC Bridge [10de:00e0] (rev a2) 00:01.1 SMBus [0c05]: nVidia Corporation nForce 250Gb PCI System Management [10de:00e4] (rev a1) 00:02.0 USB Controller [0c03]: nVidia Corporation CK8S USB Controller [10de:00e7] (rev a1) 00:02.1 USB Controller [0c03]: nVidia Corporation CK8S USB Controller [10de:00e7] (rev a1) 00:02.2 USB Controller [0c03]: nVidia Corporation nForce3 EHCI USB 2.0 Controller [10de:00e8] (rev a2) 00:06.0 Multimedia audio controller [0401]: nVidia Corporation nForce3 250Gb AC'97 Audio Controller [10de:00ea] (rev a1) 00:08.0 IDE interface [0101]: nVidia Corporation CK8S Parallel ATA Controller (v2.5) [10de:00e5] (rev a2) 00:0a.0 IDE interface [0101]: nVidia Corporation nForce3 Serial ATA Controller [10de:00e3] (rev a2) 00:0b.0 PCI bridge [0604]: nVidia Corporation nForce3 250Gb AGP Host to PCI Bridge [10de:00e2] (rev a2) 00:0e.0 PCI bridge [0604]: nVidia Corporation nForce3 250Gb PCI-to-PCI Bridge [10de:00ed] (rev a2) 00:18.0 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration [1022:1100] 00:18.1 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map [1022:1101] 00:18.2 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller [1022:1102] 00:18.3 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control [1022:1103] 01:00.0 VGA compatible controller [0300]: nVidia Corporation NV44A [GeForce 6200] [10de:0221] (rev a1) 02:06.0 Ethernet controller [0200]: Atheros Communications Inc. AR2413 802.11bg NIC [168c:001a] (rev 01) 02:07.0 FireWire (IEEE 1394) [0c00]: VIA Technologies, Inc. VT6306 Fire II IEEE 1394 OHCI Link Layer Controller [1106:3044] (rev 80) 02:08.0 Ethernet controller [0200]: Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller [11ab:4320] (rev 13)
Dan, can you please post the output of "dmesg" after boot? Matthew, you can increase dmesg buf using kernel parameter log_buf_len=n where n is the capacity. Also, dmesg command itself has size limitation which can be adjusted using -s parameter. Thanks.
Created attachment 20410 [details] dmesg with EH errors, obtained after boot Here it is, dmesg on my computer exhibiting the problem, obtained immediately after boot
Strange. Does kernel parameter "scsi.scan=sync" make any difference? Please note that if scsi is built as module and loaded from initrd, you might need to do different things to pass the parameter to the scsi module depending on your distro.
booting with scsi_mod.scan=sync and the messages still appear. [ 0.000000] Kernel command line: root=UUID=69049c1f-f6f6-45a5-aaea-0737bb69ec0e ro quiet splash scsi_mod.scan=sync scsi is built-in, not as a module.
Thanks for verifying. I'm still quite confused how the recent change can cause such failure tho. Can you build your own kernel and try some patches? Thanks.
Tejun, yes, I can build my own kernel. I'll be back in 12 hours (have to go to work now) and I'll try it then. Thanks for working on this bug.
2.6.27.19-170.2.35.fc10.x86_64 has the same problem... is there some command that would stop it? i can't reboot so often (my box records TV programmings quite often)... :-) why did u offer a kernel update, that was known to have that bug? :-) -Arne
Okay, successfully reproduced the problem here. Will know when I know more. Arne, I didn't know about this problem before releasing the patch. I'm not that nuts (yet). :-) Sorry about the trouble.
Created attachment 20420 [details] libata-thaw-after-reset-skip.patch This patch should fix the problem. Sorry about all the breakages. sata_nv reset peculiarities combined with obscure libata EH bug made things very convoluted. Hopefully, this is the last fallout from the prefer-hardreset conversion. Thanks.
Resolving as CODE_FIX.
Thanks Tejun, this is great. I was just about to report that I have my kernel compiled and ready to start trying your patches but, surprise, the bug is already fixed.
It would be nice if you can verify the patch. Thanks.
Hi Tejun. Thanks very much for your work. I tried to test your patch this morning before work. After unpacking a fresh copy of 2.6.28 into /usr/src/linux-2.6.28.7, I did cd /usr/src/linux-2.6.28.7 patch -p1 < patch-2.6.28.7 patch -p1 < patch-libata-thaw-after-reset-skip at which point I saw the following error message patching file drivers/ata/libata-eh.c Hunk #1 FAILED at 2423. 1 out of 1 hunk FAILED -- saving rejects to file drivers/ata/libata-eh.c.rej Was your patch against 2.6.28.7?
Created attachment 20423 [details] libata-thaw-after-reset-skip.patch It was against 2.6.29-rc. I'm attaching one for 2.6.28.
Hi Tejun! Thanks for responding so quickly... What about fedora 10's current kernel (2.6.27.19-170.2.35.fc10.x86_64)? *sob* :-) I opened a bug report there: https://bugzilla.redhat.com/show_bug.cgi?id=488371 Is there a way to inform fedora's developers about bugs? I mean: The bug is known since 2009-03-01, but fedora's kernel update was yesterday... Bye Arne
Test completed: vanilla 2.6.28.7: EH errors appear in dmesg output patched 2.6.28.7: no more EH errors, your patch eliminated them Sorry for the delay, after you fixed it I completely removed the compiled kernel and the source files, so I had to compile everything from beginning. Since I used the default Ubuntu .config file, it compiled a lot of modules (I guess all of them), thus such a big delay. Thanks again, Dan.
Great. Thanks for verifying. Arne, eh.. I don't know much about fedora but I bet they will pick it up once the next -stable is released.
Your patch against 2.6.28.7 eliminated the EH errors from the dmesg output on my machine also. Thanks very much. Matthew