Latest working kernel version: 2.6.27-rc5
Earliest failing kernel version: 2.6.27-rc6
Bug caused by commit 2fd673ecf0378ddeeeb87b3605e50212e0c0ddc6
Distribution: Ubuntu 8.10
x86_64 (Athlon X2 4400+)
nVidia CK804 chipset
1 PATA disk ST3250823A
1 SATA disk ST3500630AS
Disk IO to SATA is slow, and during IO the kernel logs messages like this:
[ 86.416012] ata4: EH complete
[ 89.884011] ata4: EH complete
[ 91.440012] ata4: EH complete
[ 95.632013] ata4: EH complete
[ 120.320012] ata4: EH complete
[ 146.912695] ata4: EH pending after 5 tries, giving up
[ 146.912748] ata4: EH complete
[ 148.472754] ata4: EH pending after 5 tries, giving up
[ 148.472806] ata4: EH complete
[ 150.033090] ata4: EH pending after 5 tries, giving up
[ 150.033142] ata4: EH complete
[ 151.592379] ata4: EH pending after 5 tries, giving up
I first noticed this with Ubuntus 2.6.27-3-generic kernel, which is based on -rc6. I then verified the bug using HEAD on Linux linux-2.6 git.
The problem seems to be caused by the latest commit for drivers/ata/sata_nv.c, reverting this solves the problem for me:
Author: Tejun Heo <firstname.lastname@example.org>
Date: Fri Aug 29 16:13:12 2008 +0200
sata_nv: disable hardreset for generic
Created attachment 17926 [details]
Created attachment 17927 [details]
Created attachment 18001 [details]
Please apply the attached patch and post the resulting log. Please don't forget to turn on KALLSYMS.
Created attachment 18011 [details]
This is my system log on a patched 2.6.27-rc7 kernel.
Created attachment 18057 [details]
Here is my log as well, looks similar to the one from Peter. Mine is also using 2.6.27-rc7 + patch.
Looks like nv_host_intr is triggering a freeze. Looking at the code the only reason it should be able to do that is if it detects a hotplug condition?
Created attachment 18078 [details]
Can you please test this patch? It seems I made a stupid mistake while trying to fix nv_generic hardreset regression. Thanks.
The patch corrects the problem here.
Handled-By : Tejun Heo <email@example.com>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=18078&action=view
The patch corrects the problem here as well.
Sorry for my slow responses, I have not had enough time the last days. The patch fixes the bug on my machine as well.
I see that it also made it into -rc8, thanks a lot!
Fixed by: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4c1eb90a0908c0c60db2169dce08fb672e7582f1
I saw this problem for the first time when I upgraded to the most recent stable
kernel, 184.108.40.206. The kern.log file is full of messages such as:
Feb 27 09:58:38 dolphin kernel: ata2: EH pending after 5 tries, giving up
Feb 27 09:58:38 dolphin kernel: ata2: EH complete
Feb 27 09:58:39 dolphin kernel: ata2: EH pending after 5 tries, giving up
If I reboot into kernel 220.127.116.11, built with exactly the same configuration,
the messages disappear.
The "hardinfo" program reports that I have the following hardware:
AMD Athlon(tm) 64 Processor 3000+
stepping 15, 31, 0 (AMD Opteron/Athlon64/FX)
Host bridge nVidia Corporation nForce3 250Gb Host Bridge
ISA bridge nVidia Corporation nForce3 250Gb LPC Bridge
SMBus nVidia Corporation nForce 250Gb PCI System Management
USB Controller nVidia Corporation CK8S USB Controller
USB Controller nVidia Corporation CK8S USB Controller
USB Controller nVidia Corporation nForce3 EHCI USB 2.0 Controller
Audio Controller nVidia Corporation nForce3 250Gb AC'97 Audio Controller
IDE interface nVidia Corporation CK8S Parallel ATA Controller
IDE interface nVidia Corporation nForce3 Serial ATA Controller
PCI bridge nVidia Corporation nForce3 250Gb AGP Host to PCI Bridge
PCI bridge nVidia Corporation nForce3 250Gb PCI-to-PCI Bridge
Host bridge Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
Host bridge Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
Host bridge Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
Host bridge Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
VGA controller ATI Technologies Inc RV280 [Radeon 9200 SE]
Ethernet controller National Semiconductor Corporation DP83815
FireWire (IEEE 1394) VIA Technologies, Inc. VT6306 Fire II IEEE 1394 OHCI Link Layer Controller
Ethernet controller Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller
LITE-ON COMBO SOHC-5235K
ATA WDC WD800JD-00JN
Can you please post the output of "lspci -nn"? Also, is it possible for you to catch full failing boot log?
Created attachment 20394 [details]
Output of "lspci -nn"
Created attachment 20395 [details]
Contents of /var/log/dmesg for kernel 18.104.22.168
Created attachment 20396 [details]
Contents of /var/log/dmesg for kernel 22.214.171.124
I hope the three attachments provide the information you need. I see that the
line "ata2: SATA link down (SStatus 0 SControl 300)", present in
the dmesg output when I boot with 126.96.36.199, is missing with 188.8.131.52.
I am no Linux expert and only compile my own kernel because the stock Debian kernel did not support my network card when I first purchased this machine.
It is likely that my kernel configuration is dumb. Please ask if you need to see it.
Sorry for creating four messages instead of one; I'm still learning to use bugzilla.
Creating separate entry for each file is fine. Bugzilla is pretty dumb that way. For 184.108.40.206, /var/log/dmesg looks truncated and missing some log entries. Can you please do "dmesg > dmesg.out" after boot and attach the result?
I believe that dmesg gave the same truncated entry (missing some stuff at the
beginning) as appeared in /var/log/dmesg. I am away from home at present
and won't be able to access the computer in question (which is currently off) again until Tuesday evening, UK time, but will try again then and report.
I have the same problem. I had it with Ubuntu Intrepid (testing) and it was fixed. It appeared again recently with Ubuntu Jaunty (testing), kernel 220.127.116.11
dan@shuttle:~$ lspci -nn
00:00.0 Host bridge : nVidia Corporation nForce3 250Gb Host Bridge [10de:00e1] (rev a1)
00:01.0 ISA bridge : nVidia Corporation nForce3 250Gb LPC Bridge [10de:00e0] (rev a2)
00:01.1 SMBus [0c05]: nVidia Corporation nForce 250Gb PCI System Management [10de:00e4] (rev a1)
00:02.0 USB Controller [0c03]: nVidia Corporation CK8S USB Controller [10de:00e7] (rev a1)
00:02.1 USB Controller [0c03]: nVidia Corporation CK8S USB Controller [10de:00e7] (rev a1)
00:02.2 USB Controller [0c03]: nVidia Corporation nForce3 EHCI USB 2.0 Controller [10de:00e8] (rev a2)
00:06.0 Multimedia audio controller : nVidia Corporation nForce3 250Gb AC'97 Audio Controller [10de:00ea] (rev a1)
00:08.0 IDE interface : nVidia Corporation CK8S Parallel ATA Controller (v2.5) [10de:00e5] (rev a2)
00:0a.0 IDE interface : nVidia Corporation nForce3 Serial ATA Controller [10de:00e3] (rev a2)
00:0b.0 PCI bridge : nVidia Corporation nForce3 250Gb AGP Host to PCI Bridge [10de:00e2] (rev a2)
00:0e.0 PCI bridge : nVidia Corporation nForce3 250Gb PCI-to-PCI Bridge [10de:00ed] (rev a2)
00:18.0 Host bridge : Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration [1022:1100]
00:18.1 Host bridge : Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map [1022:1101]
00:18.2 Host bridge : Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller [1022:1102]
00:18.3 Host bridge : Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control [1022:1103]
01:00.0 VGA compatible controller : nVidia Corporation NV44A [GeForce 6200] [10de:0221] (rev a1)
02:06.0 Ethernet controller : Atheros Communications Inc. AR2413 802.11bg NIC [168c:001a] (rev 01)
02:07.0 FireWire (IEEE 1394) [0c00]: VIA Technologies, Inc. VT6306 Fire II IEEE 1394 OHCI Link Layer Controller [1106:3044] (rev 80)
02:08.0 Ethernet controller : Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller [11ab:4320] (rev 13)
Dan, can you please post the output of "dmesg" after boot? Matthew, you can increase dmesg buf using kernel parameter log_buf_len=n where n is the capacity. Also, dmesg command itself has size limitation which can be adjusted using -s parameter. Thanks.
Created attachment 20410 [details]
dmesg with EH errors, obtained after boot
Here it is, dmesg on my computer exhibiting the problem, obtained immediately after boot
Strange. Does kernel parameter "scsi.scan=sync" make any difference? Please note that if scsi is built as module and loaded from initrd, you might need to do different things to pass the parameter to the scsi module depending on your distro.
booting with scsi_mod.scan=sync and the messages still appear.
[ 0.000000] Kernel command line: root=UUID=69049c1f-f6f6-45a5-aaea-0737bb69ec0e ro quiet splash scsi_mod.scan=sync
scsi is built-in, not as a module.
Thanks for verifying. I'm still quite confused how the recent change can cause such failure tho. Can you build your own kernel and try some patches? Thanks.
Tejun, yes, I can build my own kernel. I'll be back in 12 hours (have to go to work now) and I'll try it then. Thanks for working on this bug.
18.104.22.168-170.2.35.fc10.x86_64 has the same problem...
is there some command that would stop it?
i can't reboot so often (my box records TV programmings quite often)... :-)
why did u offer a kernel update, that was known to have that bug? :-)
Okay, successfully reproduced the problem here. Will know when I know more.
Arne, I didn't know about this problem before releasing the patch. I'm not that nuts (yet). :-) Sorry about the trouble.
Created attachment 20420 [details]
This patch should fix the problem. Sorry about all the breakages. sata_nv reset peculiarities combined with obscure libata EH bug made things very convoluted. Hopefully, this is the last fallout from the prefer-hardreset conversion.
Resolving as CODE_FIX.
Thanks Tejun, this is great.
I was just about to report that I have my kernel compiled and ready to start trying your patches but, surprise, the bug is already fixed.
It would be nice if you can verify the patch. Thanks.
Hi Tejun. Thanks very much for your work.
I tried to test your patch this morning before work. After unpacking a
fresh copy of 2.6.28 into /usr/src/linux-22.214.171.124, I did
patch -p1 < patch-126.96.36.199
patch -p1 < patch-libata-thaw-after-reset-skip
at which point I saw the following error message
patching file drivers/ata/libata-eh.c
Hunk #1 FAILED at 2423.
1 out of 1 hunk FAILED -- saving rejects to file drivers/ata/libata-eh.c.rej
Was your patch against 188.8.131.52?
Created attachment 20423 [details]
It was against 2.6.29-rc. I'm attaching one for 2.6.28.
Thanks for responding so quickly...
What about fedora 10's current kernel (184.108.40.206-170.2.35.fc10.x86_64)?
I opened a bug report there:
Is there a way to inform fedora's developers about bugs?
I mean: The bug is known since 2009-03-01, but fedora's kernel update was yesterday...
vanilla 220.127.116.11: EH errors appear in dmesg output
patched 18.104.22.168: no more EH errors, your patch eliminated them
Sorry for the delay, after you fixed it I completely removed the compiled kernel and the source files, so I had to compile everything from beginning. Since I used the default Ubuntu .config file, it compiled a lot of modules (I guess all of them), thus such a big delay.
Great. Thanks for verifying.
Arne, eh.. I don't know much about fedora but I bet they will pick it up once the next -stable is released.
Your patch against 22.214.171.124 eliminated the EH errors from the dmesg output on my machine also.
Thanks very much.