Bug 13492

Summary: pata_via: cd/dvd drive gets disabled by driver after timeout error
Product: IO/Storage Reporter: Jesse Gilles (jesse.gilles)
Component: Serial ATAAssignee: Tejun Heo (tj)
Status: RESOLVED WILL_FIX_LATER    
Severity: normal CC: mybigspam, tj
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.24 to 2.6.29.4 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg output
lspci -vv output
config-2.6.29.4

Description Jesse Gilles 2009-06-09 15:10:59 UTC
Created attachment 21828 [details]
dmesg output

After some time (a few hours, maybe?), my IDE DVD/RW drive stops functioning completely and appears to be disabled by the pata_via driver.  Before the error, the drive works perfectly fine and can read and write CDs.  Once the timeout error occurs, the drive doesn't function until rebooting.

This problem starting after upgrading from Ubuntu Gutsy (2.6.22 kernel) to Hardy (2.6.24 kernel).  I upgraded to Ubuntu Jaunty (2.6.28 kernel) and still saw the problem.  I compiled and ran a vanilla 2.6.29.4 kernel and see the same behavior as well.

Hardware:
Lite-On DVD-RW SHW-160P6S
MSI K9VGM-V mainboard
AMD Athlon X2 5200+
SATA hard drive

lspci -vv and dmesg output from 2.6.29.4 attached.

Ubuntu bug link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/231575
Comment 1 Jesse Gilles 2009-06-09 15:11:20 UTC
Created attachment 21829 [details]
lspci -vv output
Comment 2 Jesse Gilles 2009-06-09 15:17:14 UTC
Created attachment 21830 [details]
config-2.6.29.4
Comment 3 Tejun Heo 2009-06-10 00:34:03 UTC
Does 'hal-disable-polling /dev/sr0' make any difference?  Please note that it will disable media presence detection and thus won't mount media automatically.
Comment 4 Jesse Gilles 2009-06-10 18:42:33 UTC
I haven't let it run for very long yet (about 5 hours), but that does seem to have fixed things so far.  I ran 'hal-disable-polling --device /dev/sr0' and rebooted (to re-enable the device) and haven't had any timeout errors since.
Comment 5 Tejun Heo 2009-06-11 02:15:46 UTC
Hmmm... Linux is currently using slightly different (from windows) command sequence to probe for media presence.  It's perfectly valid but there seems to be a few drives which lock up after a while.  It's most likely a firmware bug.  In the long run, we'll probably have to move media presence probing into kernel and mimic what windows is doing. :-( For shorter term, maybe there's a firmware update available?

Thanks.
Comment 6 Jesse Gilles 2009-06-11 14:06:25 UTC
Everything is still working fine now after a full 24 hours with the media detection polling disabled.  Thanks for the info and advice.  I did find a firmware update for my Lite-On drive, but it requires Windows to flash...so I'd have to move it to another machine to flash it and then put it back.  I might try that later.

Was media detect polling done differently in older kernels?  Or is it completely implemented in user space?  I was just remembering that I had no problems when running the Ubuntu gutsy kernel based on 2.6.22.  Something must have changed since then. This all started happening when I upgraded from gutsy to hardy (2.6.24) and has happened with every kernel since.  When I was still running hardy, I downgraded to the gutsy kernel and the problem went away.

Thanks again -- I can easily live without media detection :)
Comment 7 Tejun Heo 2009-06-12 02:07:26 UTC
I don't know much about ubuntu versions but the difference could have come from the transition from ide to libata.  libata uses SCSI cdrom driver (sr) to drive cdroms which uses different open sequence from ide-cdrom.  Both are perfectly legit and time-tested but well it was something new to ide cdroms, I suppose.  Please try to upgrade firmware when you've got a chance.

Thanks.
Comment 8 Jesse Gilles 2009-06-12 15:08:42 UTC
Yeah, I think it was triggered by switching to libata in Ubuntu hardy (8.04).

I upgraded my dvd drive to the latest firmware yesterday (PS0A to PS0C), enabled polling, and still got the timeout errors.  I wasn't too optimistic about the firmware update since Lite-On's description of the update simply said "detect more media" with no other information, but it was certainly worth a try.

From the ubuntu bug comments, it looks people are having issues with numerous different IDE dvd drives (not just Lite-On).
Comment 9 Jesse Gilles 2009-06-16 16:27:04 UTC
Well, the timeout error has happened again even with media check polling disabled.  It just seems like it took longer.  I got the same errors in dmesg again after about 3.5 days of uptime.  This is running the stock ubuntu jaunty kernel (2.6.28 based).  It seems likely that 2.6.29.4 would behave similarly given that they have the same behavior with polling on.
Comment 10 Tejun Heo 2009-06-17 05:40:29 UTC
Hmmm... interesting.  Could it be that the device is just getting old and failing?  With polling disabled, nothing is really touching the drive and it's still checking out and doesn't respond to reset.  Can you please give a shot at running one of the old distros?

Thanks.
Comment 11 Jesse Gilles 2009-06-17 20:36:19 UTC
I thinks it's pretty unlikely that the drive is failing.  It's not very old and works just fine whenever it's not disabled by the driver.  Also, and more importantly, I have another identical workstation running Windows with an identical drive and because I could only upgrade the drive firmware from Windows, I flashed the drive that was already in the Windows machine and then swapped it with the one in linux box.  And the swapped drive is doing the same thing (with the latest firmware).

Unfortunately, I use both machines daily for development work so I can't easily run an old distro on either of them for testing.

What older distro would you suggest?  I was running Ubuntu gutsy for quite a while and never had the issue.  When I upgraded to hardy, the issue surfaced and when I reverted to the gutsy kernel (kernel change only, still hardy distro otherwise), the problem completely went away.

Or are you referring to building an older kernel release and running with my current distro (jaunty)?

Thanks again.
Comment 12 Tejun Heo 2009-06-18 01:34:33 UTC
I forgot some of the context, sorry about that.  I still don't know who's responsible for the behavior change.  It could be the kernel or some change in userland which ends up issuing different commands.  The timing out command being the simplest command w/o any data attached there isn't much the driver can screw up.  Can you please post dmesg output after the drive check out with hal polling disabled?  Let's see which command times out.

Thanks.
Comment 13 Jesse Gilles 2009-06-18 22:32:48 UTC
Here is the output after a few days of uptime, running Ubuntu jaunty with ubuntu kernel (2.6.28) and with hal polling disabled.

[295326.000033] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[295326.000040] ata3.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[295326.000042]          cdb 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
[295326.000043]          res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[295326.000045] ata3.00: status: { DRDY }
[295326.000065] ata3: soft resetting link
[295326.212248] ata3.00: configured for UDMA/66
[295331.212031] ata3.00: qc timeout (cmd 0xa0)
[295331.212038] ata3.00: TEST_UNIT_READY failed (err_mask=0x5)
[295331.212061] ata3: soft resetting link
[295331.424237] ata3.00: configured for UDMA/66
[295336.424024] ata3.00: qc timeout (cmd 0xa0)
[295336.424031] ata3.00: TEST_UNIT_READY failed (err_mask=0x5)
[295336.424035] ata3.00: limiting speed to UDMA/66:PIO3
[295336.424055] ata3: soft resetting link
[295336.636258] ata3.00: configured for UDMA/66
[295341.636029] ata3.00: qc timeout (cmd 0xa0)
[295341.636037] ata3.00: TEST_UNIT_READY failed (err_mask=0x5)
[295341.636039] ata3.00: disabled
[295341.636073] ata3: soft resetting link
[295341.792082] ata3: EH complete
Comment 14 Tejun Heo 2009-06-25 10:56:02 UTC
The failing command is still TUR.  If hal media polling is disabled, I wonder who's still issuing the command.  Can you please run "while true; do fuser -v /dev/sr0; done" and see whether you can catch anything?  As fuser only checks whether there's any program which has the file open at the time of execution, it will miss the polling program most of the time but if you keep the loop running for a while, it should eventually catch who's doing it.

Thanks.
Comment 15 Jesse Gilles 2009-06-25 13:55:14 UTC
Ok, I can run that for a while and see what I find.  My current suspicion is that it could be VirtualBox, which I run here and there for testing software.
Comment 16 Jesse Gilles 2009-08-25 21:00:01 UTC
I never found anything out from running fuser, but I also haven't had the problem for quite a while.  My drive has been working fine for weeks with the media polling disabled.  Is there any potential of a real fix getting made for this?
Comment 17 Tejun Heo 2009-08-27 06:02:46 UTC
Hmmm... In general, it looks like the drive isn't quite happy with TURs being issued frequently, which really is the device's fault. :-) But blame game aside, there are quite some number of devices which choke on repeated TURs issued by hal media presence polling.  The root cause is that windows isn't using TUR during media presence polling so vendors never verify against TURs being issued constantly.  I'm currently working on in-kernel media presence polling so that we can what windows does and issue only single command (GET_EVENT_STATUS) for media presence polling.  I expect it to be finished in a few weeks but it'll take some time to trickle down to distros.  So, yeap, long term solution is in progress and for the time being, disabling media presence polling seems like the workaround.

I think it would be better to resolve this one as WONTFIX for the time.  Jeff?

Thanks.
Comment 18 Tejun Heo 2009-08-27 06:03:40 UTC
Heh... Apparently I can steal it.  Resolving as WILL_FIX_LATER.

Thank you.
Comment 19 Jesse Gilles 2009-08-27 14:27:14 UTC
Thanks very much for the detailed info.
Comment 20 Jesse Gilles 2010-05-12 22:01:42 UTC
Has there been any progress on addressing this issue?

I recently upgraded to Ubuntu 10.04, which is running a kernel based on  2.6.32.11 and I have the exact same problem as always.  I also found another DVD burner laying around (also Lite-On though) and it has the same problems.  Model is DH20A4P.

Most recent output, which looks pretty much the same:

[ 2382.000028] ata3: lost interrupt (Status 0x51)
[ 2387.000036] ata3.00: qc timeout (cmd 0xa0)
[ 2387.000047] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 2387.000051] sr 2:0:0:0: CDB: Test Unit Ready: 00 00 00 00 00 00
[ 2387.000061] ata3.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[ 2387.000062]          res 51/20:03:00:00:00/00:00:00:00:00/a0 Emask 0x5 (timeout)
[ 2387.000064] ata3.00: status: { DRDY ERR }
[ 2387.000085] ata3: soft resetting link
[ 2387.212240] ata3.00: configured for UDMA/66
[ 2392.212028] ata3.00: qc timeout (cmd 0xa0)
[ 2392.212035] ata3.00: TEST_UNIT_READY failed (err_mask=0x5)
[ 2392.212056] ata3: soft resetting link
[ 2392.424242] ata3.00: configured for UDMA/66
[ 2397.424027] ata3.00: qc timeout (cmd 0xa0)
[ 2397.424034] ata3.00: TEST_UNIT_READY failed (err_mask=0x5)
[ 2397.424038] ata3.00: limiting speed to UDMA/66:PIO3
[ 2397.424059] ata3: soft resetting link
[ 2397.636262] ata3.00: configured for UDMA/66
[ 2402.636029] ata3.00: qc timeout (cmd 0xa0)
[ 2402.636036] ata3.00: TEST_UNIT_READY failed (err_mask=0x5)
[ 2402.636039] ata3.00: disabled
[ 2402.636072] ata3: soft resetting link
[ 2402.792090] ata3: EH complete
Comment 21 Tejun Heo 2010-05-13 07:33:42 UTC
I was working on it then found out that it required another feature in the kernel so I began working on that.  The latter one is nearing completion so I'm hoping to go back to polling thing soonish.  Eh well, it always takes longer than expected.  So, sorry about the long delay but I think it isn't too far now.

Thanks.
Comment 22 Jesse Gilles 2010-05-13 14:09:47 UTC
Thanks for the update!  Just out of curiosity, is this issue specific to the pata_via driver (or are there other drivers that issue the same command to poll media)?  I assume this has to be pretty limited to certain hardware/drivers or there would be a LOT of people complaining :)
Comment 23 Tejun Heo 2010-05-17 09:57:51 UTC
It's more dependent on the drive than the controller and although it isn't very widespread there still are quite a few reports, so yeah it's one of the well known issues.

Thanks.