Bug 11615

Summary: sata_nv EH problems
Product: IO/Storage Reporter: Pär Lindfors (paran)
Component: Serial ATAAssignee: Tejun Heo (tj)
Status: RESOLVED CODE_FIX    
Severity: normal CC: danyer, hancockrwd, htejun, marc.c.dionne, peter, rjw, wimacofo
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.27-rc6 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 11167    
Attachments: dmesg
lspci -vv
sata_nv-eh-debug.patch
System log
dmesg.2.6.27-rc7_sata_nv-eh-debug
sata_nv-reinstate-nv_hardreset.patch
Output of "lspci -nn"
Contents of /var/log/dmesg for kernel 2.6.28.6
Contents of /var/log/dmesg for kernel 2.6.28.7
dmesg with EH errors, obtained after boot
libata-thaw-after-reset-skip.patch
libata-thaw-after-reset-skip.patch

Description Pär Lindfors 2008-09-21 18:09:13 UTC
Latest working kernel version: 2.6.27-rc5
Earliest failing kernel version: 2.6.27-rc6
Bug caused by commit 2fd673ecf0378ddeeeb87b3605e50212e0c0ddc6

Distribution: Ubuntu 8.10
Hardware Environment:
x86_64 (Athlon X2 4400+)
nVidia CK804 chipset
1 PATA disk ST3250823A
1 SATA disk ST3500630AS

Problem Description:
Disk IO to SATA is slow, and during IO the kernel logs messages like this:

[   86.416012] ata4: EH complete
[   89.884011] ata4: EH complete
[   91.440012] ata4: EH complete
[   95.632013] ata4: EH complete
[  120.320012] ata4: EH complete
[  146.912695] ata4: EH pending after 5 tries, giving up
[  146.912748] ata4: EH complete
[  148.472754] ata4: EH pending after 5 tries, giving up
[  148.472806] ata4: EH complete
[  150.033090] ata4: EH pending after 5 tries, giving up
[  150.033142] ata4: EH complete
[  151.592379] ata4: EH pending after 5 tries, giving up

I first noticed this with Ubuntus 2.6.27-3-generic kernel, which is based on -rc6. I then verified the bug using HEAD on Linux linux-2.6 git.

The problem seems to be caused by the latest commit for drivers/ata/sata_nv.c, reverting this solves the problem for me:

commit 2fd673ecf0378ddeeeb87b3605e50212e0c0ddc6
Author: Tejun Heo <tj@kernel.org>
Date:   Fri Aug 29 16:13:12 2008 +0200

    sata_nv: disable hardreset for generic
Comment 1 Pär Lindfors 2008-09-21 18:13:25 UTC
Created attachment 17926 [details]
dmesg
Comment 2 Pär Lindfors 2008-09-21 18:15:14 UTC
Created attachment 17927 [details]
lspci -vv
Comment 3 Tejun Heo 2008-09-24 02:40:54 UTC
Created attachment 18001 [details]
sata_nv-eh-debug.patch

Please apply the attached patch and post the resulting log.  Please don't forget to turn on KALLSYMS.
Comment 4 Peter Kerwien 2008-09-24 12:09:46 UTC
Created attachment 18011 [details]
System log

This is my system log on a patched 2.6.27-rc7 kernel.
Comment 5 Pär Lindfors 2008-09-26 11:45:43 UTC
Created attachment 18057 [details]
dmesg.2.6.27-rc7_sata_nv-eh-debug

Here is my log as well, looks similar to the one from Peter. Mine is also using 2.6.27-rc7 + patch.
Comment 6 Robert Hancock 2008-09-27 11:47:52 UTC
Looks like nv_host_intr is triggering a freeze. Looking at the code the only reason it should be able to do that is if it detects a hotplug condition?
Comment 7 Tejun Heo 2008-09-27 14:21:25 UTC
Created attachment 18078 [details]
sata_nv-reinstate-nv_hardreset.patch

Can you please test this patch?  It seems I made a stupid mistake while trying to fix nv_generic hardreset regression.  Thanks.
Comment 8 Marc Dionne 2008-09-27 15:19:17 UTC
The patch corrects the problem here.
Comment 9 Rafael J. Wysocki 2008-09-27 16:04:46 UTC
Handled-By : Tejun Heo <tj@kernel.org>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=18078&action=view
Comment 10 Peter Kerwien 2008-09-27 16:18:58 UTC
The patch corrects the problem here as well.
Comment 11 Tejun Heo 2008-09-27 16:26:12 UTC
Patch posted.

  http://article.gmane.org/gmane.linux.ide/34750
Comment 12 Pär Lindfors 2008-09-30 14:54:56 UTC
Sorry for my slow responses, I have not had enough time the last days. The patch fixes the bug on my machine as well.

I see that it also made it into -rc8, thanks a lot!
Comment 14 Matthew Foulkes 2009-02-27 02:33:56 UTC
I saw this problem for the first time when I upgraded to the most recent stable
kernel, 2.6.28.7. The kern.log file is full of messages such as:

Feb 27 09:58:38 dolphin kernel: ata2: EH pending after 5 tries, giving up
Feb 27 09:58:38 dolphin kernel: ata2: EH complete
Feb 27 09:58:39 dolphin kernel: ata2: EH pending after 5 tries, giving up

If I reboot into kernel 2.6.28.6, built with exactly the same configuration,
the messages disappear.

The "hardinfo" program reports that I have the following hardware:

Processor:
AMD Athlon(tm)  64 Processor 3000+
stepping        15, 31, 0 (AMD Opteron/Athlon64/FX)

PCI Devices:
Host bridge      nVidia Corporation nForce3 250Gb Host Bridge
ISA bridge       nVidia Corporation nForce3 250Gb LPC Bridge
SMBus            nVidia Corporation nForce 250Gb PCI System Management
USB Controller   nVidia Corporation CK8S USB Controller
USB Controller   nVidia Corporation CK8S USB Controller
USB Controller   nVidia Corporation nForce3 EHCI USB 2.0 Controller
Audio Controller nVidia Corporation nForce3 250Gb AC'97 Audio Controller
IDE interface    nVidia Corporation CK8S Parallel ATA Controller
IDE interface    nVidia Corporation nForce3 Serial ATA Controller
PCI bridge       nVidia Corporation nForce3 250Gb AGP Host to PCI Bridge
PCI bridge       nVidia Corporation nForce3 250Gb PCI-to-PCI Bridge
Host bridge      Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
Host bridge      Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
Host bridge      Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
Host bridge      Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
VGA controller   ATI Technologies Inc RV280 [Radeon 9200 SE]
Ethernet controller  National Semiconductor Corporation DP83815
FireWire (IEEE 1394) VIA Technologies, Inc. VT6306 Fire II IEEE 1394 OHCI Link Layer Controller
Ethernet controller  Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller

Storage:
IDE Disks
LITE-ON COMBO SOHC-5235K
SCSI Disks
ATA WDC WD800JD-00JN
Comment 15 Tejun Heo 2009-03-01 04:06:18 UTC
Can you please post the output of "lspci -nn"?  Also, is it possible for you to catch full failing boot log?
Comment 16 Matthew Foulkes 2009-03-01 05:46:21 UTC
Created attachment 20394 [details]
Output of "lspci -nn"
Comment 17 Matthew Foulkes 2009-03-01 05:47:12 UTC
Created attachment 20395 [details]
Contents of /var/log/dmesg for kernel 2.6.28.6
Comment 18 Matthew Foulkes 2009-03-01 05:47:49 UTC
Created attachment 20396 [details]
Contents of /var/log/dmesg for kernel 2.6.28.7
Comment 19 Matthew Foulkes 2009-03-01 05:53:43 UTC
I hope the three attachments provide the information you need. I see that the
line "ata2: SATA link down (SStatus 0 SControl 300)", present in
the dmesg output when I boot with 2.6.28.6, is missing with 2.6.28.7.

I am no Linux expert and only compile my own kernel because the stock Debian kernel did not support my network card when I first purchased this machine.
It is likely that my kernel configuration is dumb. Please ask if you need to see it.

Sorry for creating four messages instead of one; I'm still learning to use bugzilla.

Thanks.
Comment 20 Tejun Heo 2009-03-01 16:20:28 UTC
Creating separate entry for each file is fine.  Bugzilla is pretty dumb that way.  For 2.6.28.7, /var/log/dmesg looks truncated and missing some log entries.  Can you please do "dmesg > dmesg.out" after boot and attach the result?
Comment 21 Matthew Foulkes 2009-03-02 03:47:31 UTC
I believe that dmesg gave the same truncated entry (missing some stuff at the
beginning) as appeared in /var/log/dmesg. I am away from home at present 
and won't be able to access the computer in question (which is currently off) again until Tuesday evening, UK time, but will try again then and report.
Comment 22 Dan Andresan 2009-03-02 12:21:08 UTC
I have the same problem. I had it with Ubuntu Intrepid (testing) and it was fixed. It appeared again recently with Ubuntu Jaunty (testing), kernel 2.6.28.8

dan@shuttle:~$ lspci -nn
00:00.0 Host bridge [0600]: nVidia Corporation nForce3 250Gb Host Bridge [10de:00e1] (rev a1)
00:01.0 ISA bridge [0601]: nVidia Corporation nForce3 250Gb LPC Bridge [10de:00e0] (rev a2)
00:01.1 SMBus [0c05]: nVidia Corporation nForce 250Gb PCI System Management [10de:00e4] (rev a1)
00:02.0 USB Controller [0c03]: nVidia Corporation CK8S USB Controller [10de:00e7] (rev a1)
00:02.1 USB Controller [0c03]: nVidia Corporation CK8S USB Controller [10de:00e7] (rev a1)
00:02.2 USB Controller [0c03]: nVidia Corporation nForce3 EHCI USB 2.0 Controller [10de:00e8] (rev a2)
00:06.0 Multimedia audio controller [0401]: nVidia Corporation nForce3 250Gb AC'97 Audio Controller [10de:00ea] (rev a1)
00:08.0 IDE interface [0101]: nVidia Corporation CK8S Parallel ATA Controller (v2.5) [10de:00e5] (rev a2)
00:0a.0 IDE interface [0101]: nVidia Corporation nForce3 Serial ATA Controller [10de:00e3] (rev a2)
00:0b.0 PCI bridge [0604]: nVidia Corporation nForce3 250Gb AGP Host to PCI Bridge [10de:00e2] (rev a2)
00:0e.0 PCI bridge [0604]: nVidia Corporation nForce3 250Gb PCI-to-PCI Bridge [10de:00ed] (rev a2)
00:18.0 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration [1022:1100]
00:18.1 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map [1022:1101]
00:18.2 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller [1022:1102]
00:18.3 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control [1022:1103]
01:00.0 VGA compatible controller [0300]: nVidia Corporation NV44A [GeForce 6200] [10de:0221] (rev a1)
02:06.0 Ethernet controller [0200]: Atheros Communications Inc. AR2413 802.11bg NIC [168c:001a] (rev 01)
02:07.0 FireWire (IEEE 1394) [0c00]: VIA Technologies, Inc. VT6306 Fire II IEEE 1394 OHCI Link Layer Controller [1106:3044] (rev 80)
02:08.0 Ethernet controller [0200]: Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller [11ab:4320] (rev 13)
Comment 23 Tejun Heo 2009-03-02 16:39:39 UTC
Dan, can you please post the output of "dmesg" after boot?  Matthew, you can increase dmesg buf using kernel parameter log_buf_len=n where n is the capacity.  Also, dmesg command itself has size limitation which can be adjusted using -s parameter.  Thanks.
Comment 24 Dan Andresan 2009-03-02 16:55:38 UTC
Created attachment 20410 [details]
dmesg with EH errors, obtained after boot

Here it is, dmesg on my computer exhibiting the problem, obtained immediately after boot
Comment 25 Tejun Heo 2009-03-02 18:03:01 UTC
Strange.  Does kernel parameter "scsi.scan=sync" make any difference?  Please note that if scsi is built as module and loaded from initrd, you might need to do different things to pass the parameter to the scsi module depending on your distro.
Comment 26 Dan Andresan 2009-03-03 00:02:46 UTC
booting with scsi_mod.scan=sync and the messages still appear.

[    0.000000] Kernel command line: root=UUID=69049c1f-f6f6-45a5-aaea-0737bb69ec0e ro quiet splash scsi_mod.scan=sync

scsi is built-in, not as a module.
Comment 27 Tejun Heo 2009-03-03 00:50:28 UTC
Thanks for verifying.  I'm still quite confused how the recent change can cause such failure tho.  Can you build your own kernel and try some patches?  Thanks.
Comment 28 Dan Andresan 2009-03-03 01:04:56 UTC
Tejun, yes, I can build my own kernel. I'll be back in 12 hours (have to go to work now) and I'll try it then. Thanks for working on this bug.
Comment 29 Arne Woerner 2009-03-03 14:13:20 UTC
2.6.27.19-170.2.35.fc10.x86_64 has the same problem... 

is there some command that would stop it?
i can't reboot so often (my box records TV programmings quite often)... :-)

why did u offer a kernel update, that was known to have that bug? :-)

-Arne
Comment 30 Tejun Heo 2009-03-03 22:04:34 UTC
Okay, successfully reproduced the problem here.  Will know when I know more.

Arne, I didn't know about this problem before releasing the patch.  I'm not that nuts (yet). :-)  Sorry about the trouble.
Comment 31 Tejun Heo 2009-03-03 23:03:12 UTC
Created attachment 20420 [details]
libata-thaw-after-reset-skip.patch

This patch should fix the problem.  Sorry about all the breakages.  sata_nv reset peculiarities combined with obscure libata EH bug made things very convoluted.  Hopefully, this is the last fallout from the prefer-hardreset conversion.

Thanks.
Comment 32 Tejun Heo 2009-03-03 23:03:41 UTC
Resolving as CODE_FIX.
Comment 33 Dan Andresan 2009-03-03 23:56:52 UTC
Thanks Tejun, this is great.

I was just about to report that I have my kernel compiled and ready to start trying your patches but, surprise, the bug is already fixed.
 
Comment 34 Tejun Heo 2009-03-04 00:16:50 UTC
It would be nice if you can verify the patch.  Thanks.
Comment 35 Matthew Foulkes 2009-03-04 01:39:45 UTC
Hi Tejun. Thanks very much for your work.

I tried to test your patch this morning before work. After unpacking a 
fresh copy of 2.6.28 into /usr/src/linux-2.6.28.7, I did

  cd /usr/src/linux-2.6.28.7
  patch -p1 < patch-2.6.28.7
  patch -p1 < patch-libata-thaw-after-reset-skip

at which point I saw the following error message

  patching file drivers/ata/libata-eh.c
  Hunk #1 FAILED at 2423.
  1 out of 1 hunk FAILED -- saving rejects to file drivers/ata/libata-eh.c.rej

Was your patch against 2.6.28.7?
Comment 36 Tejun Heo 2009-03-04 01:45:45 UTC
Created attachment 20423 [details]
libata-thaw-after-reset-skip.patch

It was against 2.6.29-rc.  I'm attaching one for 2.6.28.
Comment 37 Arne Woerner 2009-03-04 02:06:47 UTC
Hi Tejun!

Thanks for responding so quickly...

What about fedora 10's current kernel (2.6.27.19-170.2.35.fc10.x86_64)?
*sob* :-)
I opened a bug report there:
https://bugzilla.redhat.com/show_bug.cgi?id=488371

Is there a way to inform fedora's developers about bugs?
I mean: The bug is known since 2009-03-01, but fedora's kernel update was yesterday...

Bye
Arne
Comment 38 Dan Andresan 2009-03-04 02:51:28 UTC
Test completed:
vanilla 2.6.28.7: EH errors appear in dmesg output
patched 2.6.28.7: no more EH errors, your patch eliminated them

Sorry for the delay, after you fixed it I completely removed the compiled kernel and the source files, so I had to compile everything from beginning. Since I used the default Ubuntu .config file, it compiled a lot of modules (I guess all of them), thus such a big delay.

Thanks again,
Dan.
Comment 39 Tejun Heo 2009-03-04 03:10:58 UTC
Great.  Thanks for verifying.

Arne, eh.. I don't know much about fedora but I bet they will pick it up once the next -stable is released.
Comment 40 Matthew Foulkes 2009-03-04 12:07:16 UTC
Your patch against 2.6.28.7 eliminated the EH errors from the dmesg output on my machine also.

Thanks very much.

Matthew