|Summary:||sata_nv EH problems|
|Product:||IO/Storage||Reporter:||Pär Lindfors (paran)|
|Component:||Serial ATA||Assignee:||Tejun Heo (tj)|
|Severity:||normal||CC:||danyer, hancockrwd, htejun, m.foulkes, marc.c.dionne, peter, rjw|
|Bug Depends on:|
Output of "lspci -nn"
Contents of /var/log/dmesg for kernel 188.8.131.52
Contents of /var/log/dmesg for kernel 184.108.40.206
dmesg with EH errors, obtained after boot
Description Pär Lindfors 2008-09-21 18:09:13 UTC
Latest working kernel version: 2.6.27-rc5 Earliest failing kernel version: 2.6.27-rc6 Bug caused by commit 2fd673ecf0378ddeeeb87b3605e50212e0c0ddc6 Distribution: Ubuntu 8.10 Hardware Environment: x86_64 (Athlon X2 4400+) nVidia CK804 chipset 1 PATA disk ST3250823A 1 SATA disk ST3500630AS Problem Description: Disk IO to SATA is slow, and during IO the kernel logs messages like this: [ 86.416012] ata4: EH complete [ 89.884011] ata4: EH complete [ 91.440012] ata4: EH complete [ 95.632013] ata4: EH complete [ 120.320012] ata4: EH complete [ 146.912695] ata4: EH pending after 5 tries, giving up [ 146.912748] ata4: EH complete [ 148.472754] ata4: EH pending after 5 tries, giving up [ 148.472806] ata4: EH complete [ 150.033090] ata4: EH pending after 5 tries, giving up [ 150.033142] ata4: EH complete [ 151.592379] ata4: EH pending after 5 tries, giving up I first noticed this with Ubuntus 2.6.27-3-generic kernel, which is based on -rc6. I then verified the bug using HEAD on Linux linux-2.6 git. The problem seems to be caused by the latest commit for drivers/ata/sata_nv.c, reverting this solves the problem for me: commit 2fd673ecf0378ddeeeb87b3605e50212e0c0ddc6 Author: Tejun Heo <firstname.lastname@example.org> Date: Fri Aug 29 16:13:12 2008 +0200 sata_nv: disable hardreset for generic
Comment 3 Tejun Heo 2008-09-24 02:40:54 UTC
Created attachment 18001 [details] sata_nv-eh-debug.patch Please apply the attached patch and post the resulting log. Please don't forget to turn on KALLSYMS.
Comment 4 Peter Kerwien 2008-09-24 12:09:46 UTC
Created attachment 18011 [details] System log This is my system log on a patched 2.6.27-rc7 kernel.
Comment 5 Pär Lindfors 2008-09-26 11:45:43 UTC
Created attachment 18057 [details] dmesg.2.6.27-rc7_sata_nv-eh-debug Here is my log as well, looks similar to the one from Peter. Mine is also using 2.6.27-rc7 + patch.
Comment 6 Robert Hancock 2008-09-27 11:47:52 UTC
Looks like nv_host_intr is triggering a freeze. Looking at the code the only reason it should be able to do that is if it detects a hotplug condition?
Comment 7 Tejun Heo 2008-09-27 14:21:25 UTC
Created attachment 18078 [details] sata_nv-reinstate-nv_hardreset.patch Can you please test this patch? It seems I made a stupid mistake while trying to fix nv_generic hardreset regression. Thanks.
Comment 8 Marc Dionne 2008-09-27 15:19:17 UTC
The patch corrects the problem here.
Comment 9 Rafael J. Wysocki 2008-09-27 16:04:46 UTC
Handled-By : Tejun Heo <email@example.com> Patch : http://bugzilla.kernel.org/attachment.cgi?id=18078&action=view
Comment 10 Peter Kerwien 2008-09-27 16:18:58 UTC
The patch corrects the problem here as well.
Comment 11 Tejun Heo 2008-09-27 16:26:12 UTC
Patch posted. http://article.gmane.org/gmane.linux.ide/34750
Comment 12 Pär Lindfors 2008-09-30 14:54:56 UTC
Sorry for my slow responses, I have not had enough time the last days. The patch fixes the bug on my machine as well. I see that it also made it into -rc8, thanks a lot!
Comment 13 Rafael J. Wysocki 2008-10-04 13:27:34 UTC
Comment 14 Matthew Foulkes 2009-02-27 02:33:56 UTC
I saw this problem for the first time when I upgraded to the most recent stable kernel, 220.127.116.11. The kern.log file is full of messages such as: Feb 27 09:58:38 dolphin kernel: ata2: EH pending after 5 tries, giving up Feb 27 09:58:38 dolphin kernel: ata2: EH complete Feb 27 09:58:39 dolphin kernel: ata2: EH pending after 5 tries, giving up If I reboot into kernel 18.104.22.168, built with exactly the same configuration, the messages disappear. The "hardinfo" program reports that I have the following hardware: Processor: AMD Athlon(tm) 64 Processor 3000+ stepping 15, 31, 0 (AMD Opteron/Athlon64/FX) PCI Devices: Host bridge nVidia Corporation nForce3 250Gb Host Bridge ISA bridge nVidia Corporation nForce3 250Gb LPC Bridge SMBus nVidia Corporation nForce 250Gb PCI System Management USB Controller nVidia Corporation CK8S USB Controller USB Controller nVidia Corporation CK8S USB Controller USB Controller nVidia Corporation nForce3 EHCI USB 2.0 Controller Audio Controller nVidia Corporation nForce3 250Gb AC'97 Audio Controller IDE interface nVidia Corporation CK8S Parallel ATA Controller IDE interface nVidia Corporation nForce3 Serial ATA Controller PCI bridge nVidia Corporation nForce3 250Gb AGP Host to PCI Bridge PCI bridge nVidia Corporation nForce3 250Gb PCI-to-PCI Bridge Host bridge Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration Host bridge Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map Host bridge Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller Host bridge Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control VGA controller ATI Technologies Inc RV280 [Radeon 9200 SE] Ethernet controller National Semiconductor Corporation DP83815 FireWire (IEEE 1394) VIA Technologies, Inc. VT6306 Fire II IEEE 1394 OHCI Link Layer Controller Ethernet controller Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller Storage: IDE Disks LITE-ON COMBO SOHC-5235K SCSI Disks ATA WDC WD800JD-00JN
Comment 15 Tejun Heo 2009-03-01 04:06:18 UTC
Can you please post the output of "lspci -nn"? Also, is it possible for you to catch full failing boot log?
Comment 16 Matthew Foulkes 2009-03-01 05:46:21 UTC
Created attachment 20394 [details] Output of "lspci -nn"
Comment 17 Matthew Foulkes 2009-03-01 05:47:12 UTC
Created attachment 20395 [details] Contents of /var/log/dmesg for kernel 22.214.171.124
Comment 18 Matthew Foulkes 2009-03-01 05:47:49 UTC
Created attachment 20396 [details] Contents of /var/log/dmesg for kernel 126.96.36.199
Comment 19 Matthew Foulkes 2009-03-01 05:53:43 UTC
I hope the three attachments provide the information you need. I see that the line "ata2: SATA link down (SStatus 0 SControl 300)", present in the dmesg output when I boot with 188.8.131.52, is missing with 184.108.40.206. I am no Linux expert and only compile my own kernel because the stock Debian kernel did not support my network card when I first purchased this machine. It is likely that my kernel configuration is dumb. Please ask if you need to see it. Sorry for creating four messages instead of one; I'm still learning to use bugzilla. Thanks.
Comment 20 Tejun Heo 2009-03-01 16:20:28 UTC
Creating separate entry for each file is fine. Bugzilla is pretty dumb that way. For 220.127.116.11, /var/log/dmesg looks truncated and missing some log entries. Can you please do "dmesg > dmesg.out" after boot and attach the result?
Comment 21 Matthew Foulkes 2009-03-02 03:47:31 UTC
I believe that dmesg gave the same truncated entry (missing some stuff at the beginning) as appeared in /var/log/dmesg. I am away from home at present and won't be able to access the computer in question (which is currently off) again until Tuesday evening, UK time, but will try again then and report.
Comment 22 Dan Andresan 2009-03-02 12:21:08 UTC
I have the same problem. I had it with Ubuntu Intrepid (testing) and it was fixed. It appeared again recently with Ubuntu Jaunty (testing), kernel 18.104.22.168 dan@shuttle:~$ lspci -nn 00:00.0 Host bridge : nVidia Corporation nForce3 250Gb Host Bridge [10de:00e1] (rev a1) 00:01.0 ISA bridge : nVidia Corporation nForce3 250Gb LPC Bridge [10de:00e0] (rev a2) 00:01.1 SMBus [0c05]: nVidia Corporation nForce 250Gb PCI System Management [10de:00e4] (rev a1) 00:02.0 USB Controller [0c03]: nVidia Corporation CK8S USB Controller [10de:00e7] (rev a1) 00:02.1 USB Controller [0c03]: nVidia Corporation CK8S USB Controller [10de:00e7] (rev a1) 00:02.2 USB Controller [0c03]: nVidia Corporation nForce3 EHCI USB 2.0 Controller [10de:00e8] (rev a2) 00:06.0 Multimedia audio controller : nVidia Corporation nForce3 250Gb AC'97 Audio Controller [10de:00ea] (rev a1) 00:08.0 IDE interface : nVidia Corporation CK8S Parallel ATA Controller (v2.5) [10de:00e5] (rev a2) 00:0a.0 IDE interface : nVidia Corporation nForce3 Serial ATA Controller [10de:00e3] (rev a2) 00:0b.0 PCI bridge : nVidia Corporation nForce3 250Gb AGP Host to PCI Bridge [10de:00e2] (rev a2) 00:0e.0 PCI bridge : nVidia Corporation nForce3 250Gb PCI-to-PCI Bridge [10de:00ed] (rev a2) 00:18.0 Host bridge : Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration [1022:1100] 00:18.1 Host bridge : Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map [1022:1101] 00:18.2 Host bridge : Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller [1022:1102] 00:18.3 Host bridge : Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control [1022:1103] 01:00.0 VGA compatible controller : nVidia Corporation NV44A [GeForce 6200] [10de:0221] (rev a1) 02:06.0 Ethernet controller : Atheros Communications Inc. AR2413 802.11bg NIC [168c:001a] (rev 01) 02:07.0 FireWire (IEEE 1394) [0c00]: VIA Technologies, Inc. VT6306 Fire II IEEE 1394 OHCI Link Layer Controller [1106:3044] (rev 80) 02:08.0 Ethernet controller : Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller [11ab:4320] (rev 13)
Comment 23 Tejun Heo 2009-03-02 16:39:39 UTC
Dan, can you please post the output of "dmesg" after boot? Matthew, you can increase dmesg buf using kernel parameter log_buf_len=n where n is the capacity. Also, dmesg command itself has size limitation which can be adjusted using -s parameter. Thanks.
Comment 24 Dan Andresan 2009-03-02 16:55:38 UTC
Created attachment 20410 [details] dmesg with EH errors, obtained after boot Here it is, dmesg on my computer exhibiting the problem, obtained immediately after boot
Comment 25 Tejun Heo 2009-03-02 18:03:01 UTC
Strange. Does kernel parameter "scsi.scan=sync" make any difference? Please note that if scsi is built as module and loaded from initrd, you might need to do different things to pass the parameter to the scsi module depending on your distro.
Comment 26 Dan Andresan 2009-03-03 00:02:46 UTC
booting with scsi_mod.scan=sync and the messages still appear. [ 0.000000] Kernel command line: root=UUID=69049c1f-f6f6-45a5-aaea-0737bb69ec0e ro quiet splash scsi_mod.scan=sync scsi is built-in, not as a module.
Comment 27 Tejun Heo 2009-03-03 00:50:28 UTC
Thanks for verifying. I'm still quite confused how the recent change can cause such failure tho. Can you build your own kernel and try some patches? Thanks.
Comment 28 Dan Andresan 2009-03-03 01:04:56 UTC
Tejun, yes, I can build my own kernel. I'll be back in 12 hours (have to go to work now) and I'll try it then. Thanks for working on this bug.
Comment 29 Arne Woerner 2009-03-03 14:13:20 UTC
22.214.171.124-170.2.35.fc10.x86_64 has the same problem... is there some command that would stop it? i can't reboot so often (my box records TV programmings quite often)... :-) why did u offer a kernel update, that was known to have that bug? :-) -Arne
Comment 30 Tejun Heo 2009-03-03 22:04:34 UTC
Okay, successfully reproduced the problem here. Will know when I know more. Arne, I didn't know about this problem before releasing the patch. I'm not that nuts (yet). :-) Sorry about the trouble.
Comment 31 Tejun Heo 2009-03-03 23:03:12 UTC
Created attachment 20420 [details] libata-thaw-after-reset-skip.patch This patch should fix the problem. Sorry about all the breakages. sata_nv reset peculiarities combined with obscure libata EH bug made things very convoluted. Hopefully, this is the last fallout from the prefer-hardreset conversion. Thanks.
Comment 32 Tejun Heo 2009-03-03 23:03:41 UTC
Resolving as CODE_FIX.
Comment 33 Dan Andresan 2009-03-03 23:56:52 UTC
Thanks Tejun, this is great. I was just about to report that I have my kernel compiled and ready to start trying your patches but, surprise, the bug is already fixed.
Comment 34 Tejun Heo 2009-03-04 00:16:50 UTC
It would be nice if you can verify the patch. Thanks.
Comment 35 Matthew Foulkes 2009-03-04 01:39:45 UTC
Hi Tejun. Thanks very much for your work. I tried to test your patch this morning before work. After unpacking a fresh copy of 2.6.28 into /usr/src/linux-126.96.36.199, I did cd /usr/src/linux-188.8.131.52 patch -p1 < patch-184.108.40.206 patch -p1 < patch-libata-thaw-after-reset-skip at which point I saw the following error message patching file drivers/ata/libata-eh.c Hunk #1 FAILED at 2423. 1 out of 1 hunk FAILED -- saving rejects to file drivers/ata/libata-eh.c.rej Was your patch against 220.127.116.11?
Comment 36 Tejun Heo 2009-03-04 01:45:45 UTC
Created attachment 20423 [details] libata-thaw-after-reset-skip.patch It was against 2.6.29-rc. I'm attaching one for 2.6.28.
Comment 37 Arne Woerner 2009-03-04 02:06:47 UTC
Hi Tejun! Thanks for responding so quickly... What about fedora 10's current kernel (18.104.22.168-170.2.35.fc10.x86_64)? *sob* :-) I opened a bug report there: https://bugzilla.redhat.com/show_bug.cgi?id=488371 Is there a way to inform fedora's developers about bugs? I mean: The bug is known since 2009-03-01, but fedora's kernel update was yesterday... Bye Arne
Comment 38 Dan Andresan 2009-03-04 02:51:28 UTC
Test completed: vanilla 22.214.171.124: EH errors appear in dmesg output patched 126.96.36.199: no more EH errors, your patch eliminated them Sorry for the delay, after you fixed it I completely removed the compiled kernel and the source files, so I had to compile everything from beginning. Since I used the default Ubuntu .config file, it compiled a lot of modules (I guess all of them), thus such a big delay. Thanks again, Dan.
Comment 39 Tejun Heo 2009-03-04 03:10:58 UTC
Great. Thanks for verifying. Arne, eh.. I don't know much about fedora but I bet they will pick it up once the next -stable is released.
Comment 40 Matthew Foulkes 2009-03-04 12:07:16 UTC
Your patch against 188.8.131.52 eliminated the EH errors from the dmesg output on my machine also. Thanks very much. Matthew