Bug 8605 - ata2: failed to recover some devices
Summary: ata2: failed to recover some devices
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other (show other bugs)
Hardware: i386 Linux
: P2 normal
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-06-10 04:35 UTC by Soner Tari
Modified: 2007-06-26 03:55 UTC (History)
2 users (show)

See Also:
Kernel Version: 2.6.21.4
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
messages log file, ata debug enabled (853.72 KB, application/octet-stream)
2007-06-13 07:24 UTC, Soner Tari
Details
messages, lspci -vv output, and dmesg (tar bziped) (35.33 KB, application/x-bzip)
2007-06-14 04:52 UTC, Soner Tari
Details
Patch to limit ATAPI DMA to R/W only for Samsung SN-308B (558 bytes, patch)
2007-06-14 23:51 UTC, Albert Lee
Details | Diff
Debug patch for the trailing data (514 bytes, patch)
2007-06-18 18:14 UTC, Albert Lee
Details | Diff
Tejun's patch to limit ATAPI DMA to multiple of 16 bytes (4.21 KB, patch)
2007-06-25 02:43 UTC, Albert Lee
Details | Diff
Detailed debug patch for the trailing data (3.39 KB, patch)
2007-06-25 02:48 UTC, Albert Lee
Details | Diff
kern.log with trailing data (153.48 KB, application/x-bzip)
2007-06-25 06:51 UTC, Soner Tari
Details

Description Soner Tari 2007-06-10 04:35:30 UTC
Most recent kernel where this bug did *NOT* occur: Kernels with old libata
Distribution: Ubuntu
Hardware Environment:
Software Environment:
Problem Description:

If I run K3b to burn a CD, I get the following lines at the tail of dmesg:

[  451.780000] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[  451.780000] ata2.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0xac
data 8 in
[  451.780000]          res 40/00:03:00:3c:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[  451.780000] ata2: soft resetting port
[  452.082000] ata2.00: revalidation failed (errno=-2)
[  452.082000] ata2: failed to recover some devices, retrying in 5 secs
[  457.083000] ata2: soft resetting port
[  457.385000] ata2.00: revalidation failed (errno=-2)
[  457.385000] ata2: failed to recover some devices, retrying in 5 secs
[  462.386000] ata2: soft resetting port
[  462.688000] ata2.00: revalidation failed (errno=-2)
[  462.688000] ata2.00: disabled
[  463.189000] ata2: EH complete

And my CD drive becomes inaccessible even for reading, until I reboot the system.

I am having this problem for sometime now, please also see:
https://launchpad.net/ubuntu/+bug/99018 But since then I've downloaded the
latest stable vanilla kernel, hence this error report.

The device is SAMSUNG  CDRW/DVD SN-308B U021.

If I switch to old libata kernels (/dev/hdX), there is no problem at all.

Steps to reproduce:

Just start K3b or Refresh the devices on K3b config page.
Comment 1 Andrew Morton 2007-06-10 13:08:07 UTC
Guys, this is a regression.
Comment 2 Tejun Heo 2007-06-10 21:22:36 UTC
Andrew, this is not a regression in libata.  The report is a bit confusing but
the device has never worked on libata (it works with the ide driver).  Albert,
this seems to be continuation of the following.

  http://bugzilla.kernel.org/show_bug.cgi?id=8244#c71

Any ideas?
Comment 3 Soner Tari 2007-06-11 02:07:57 UTC
Sorry if I used the wrong terminology, the problem has been with libata all
along. By "old libata" I meant the ide drivers. I don't have any problem with
the ide drivers on this system (for the last 4 years).

After applying Albert's patch (as you can read on the link Tejun provides too),
I was able to burn CD iso's *on the command line* almost without any problems,
but K3b (or Gnome CD Writer) problems persisted. So I asked this issue to Albert
in a private e-mail, how it could be possible that his fix solved the issue on
the command line but not on K3b. He suggested that I download and try the latest
vanilla kernel, and report if there are problems. Unfortunately, with the latest
vanilla kernel, I can't write (or blank) CD's even on the command line.

I'm running my vanilla ide kernel now, but I have libata one also (not to
mention Ubuntu Feisty kernels). So I am ready to try your suggestions.
Comment 4 Albert Lee 2007-06-13 00:17:53 UTC
Hi Soner,

Could you please turn on libata debug trace by
 - edit include/linux/libata.h
   #define ATA_DEBUG
   #define ATA_VERBOSE_DEBUG
 - rebuild the kernel and reboot
 - /var/log/messages will be flooded with ton of libata debug trace
 - Try to reproduce the timeout with k3b
 - Capture and attach full dmesg with the timeout to this bugzilla

Hopefully we can find some clue about the timeout. Thanks.
Comment 5 Soner Tari 2007-06-13 07:24:10 UTC
Created attachment 11739 [details]
messages log file, ata debug enabled

I've started K3b after 17:05 in the logs.
Comment 6 Albert Lee 2007-06-13 22:14:54 UTC
Hi Soner,

It seems the log is not verbose enough. Could you also #define ATA_VERBOSE_DEBUG, rebuild the kernel and recapture. Thanks.
Comment 7 Albert Lee 2007-06-13 22:47:58 UTC
Hi Soner,

BTW, could you also attach the output of "dmesg" and "lspci -v -v", thanks.
Comment 8 Soner Tari 2007-06-14 04:52:56 UTC
Created attachment 11749 [details]
messages, lspci -vv output, and dmesg (tar bziped)

Hi Albert,

Actually, I had already turned on ATA_VERBOSE_DEBUG per your previous instructions. But since you asked again, I've make-cleaned first and rebuilt the kernel. However, I don't think the result is too different.

Here is the output of 'grep -H "ATA_.*DEBUG" *':
libata.h:#define ATA_DEBUG              /* debugging output */
libata.h:#define ATA_VERBOSE_DEBUG      /* yet more debugging output */

Do you think I am doing something wrong, or some kernel config problem?

Please let me know if you need anything else.
Thank you,
Comment 9 Albert Lee 2007-06-14 23:37:30 UTC
Summary of the problem:
 - 00:1f.1 IDE interface: Intel Corporation 82801CAM IDE U100 (rev 02)
 - SAMSUNG  CDRW/DVD SN-308B U021

During k3b burning, the command 0xac (GET_PERFORMANCE) times out:
[  451.780000] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[  451.780000] ata2.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0xac
data 8 in
[  451.780000]          res 40/00:03:00:3c:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Comment 10 Albert Lee 2007-06-14 23:51:06 UTC
Created attachment 11759 [details]
Patch to limit ATAPI DMA to R/W only for Samsung SN-308B

Hi Soner,

Could you please try
1. #undef ATA_DEBUG and #undef ATA_VERBOSE_DEBUG
2. Apply the attached patch
3. rebuild/reboot the kernel
4. Run k3b and see if the timeout problem still occurs

Thanks.
Comment 11 Soner Tari 2007-06-15 03:30:35 UTC
Hi Albert,

OK, I can burn CDs using K3b now, there's no problem there, but I see the following lines in messages log (with explanations):

K3b starts:
Jun 15 11:41:56 laptop kernel: [  139.836000] ata2.00: 16 bytes trailing data

K3b device list refresh:
Jun 15 11:42:20 laptop kernel: [  164.107000] ata2.00: 16 bytes trailing data

Blank CDRW inserted:
Jun 15 11:45:09 laptop kernel: [  333.091000] ata2.00: 16 bytes trailing data
Jun 15 11:45:09 laptop kernel: [  333.336000] cdrom: This disc doesn't have any tracks I recognize!
Jun 15 11:45:09 laptop kernel: [  333.339000] ata2.00: 24 bytes trailing data

After CD image burn starts:
Jun 15 11:47:37 laptop kernel: [  481.355000] ata2.00: 24 bytes trailing data

After CD inserted back (K3b pops it out after successful write):
Jun 15 11:48:55 laptop kernel: [  559.335000] ata2.00: 24 bytes trailing data
Jun 15 11:48:55 laptop kernel: [  559.347000] ata2.00: 16 bytes trailing data

After starting CDRW erase:
Jun 15 11:50:14 laptop kernel: [  638.947000] ata2.00: 24 bytes trailing data

There were no other log lines between those above.

Therefore, I turned ata verbose debug back on to see what the trailing data are (thinking that it could help you), but unfortunately neither messages log file nor dmesg has any information on that.

One more thing, trying to burn the same image with Gnome CD Writer fails with the following line in messages, and I would appreciate if anyone knows how to fix it (not that it's too important, because K3b seems ok now):
Jun 15 11:52:38 laptop kernel: [  782.619000] scsi: unknown opcode 0x01

In short, I tried to reproduce the timeout error using K3b (up to 10-15 trials), I don't see any timeout problems now. Of course I don't want to be misled by this first impression like the last time, but my initial report is that I can use K3b now with a vanilla libata kernel without apparent problems (with possibly minor issues above). I don't know if this patch is considered as the final fix for my problem, but I have switched to that kernel now.

So, thank you very much for your help. Please let me know if you need me to do anything else for the trailing data issues above (I don't know if they are important at all).
Comment 12 Soner Tari 2007-06-17 02:26:08 UTC
Hmm, I've got my first error. It seems like the previous timeout error logs, but now there are more lines. This happened when K3b was blanking a CDRW:

Jun 16 21:31:09 laptop kernel: [29960.236000] ata2.00: 16 bytes trailing data
Jun 16 21:31:09 laptop kernel: [29960.636000] ata2.00: 24 bytes trailing data
Jun 16 21:31:46 laptop kernel: [29997.210000] ata2.00: 24 bytes trailing data
Jun 16 21:33:02 laptop kernel: [30073.886000]          res 40/00:03:00:10:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Jun 16 21:33:09 laptop kernel: [30080.927000] ata2: port is slow to respond, please be patient (Status 0xd0)
Jun 16 21:33:32 laptop kernel: [30103.928000] ata2: soft resetting port
Jun 16 21:34:03 laptop kernel: [30134.230000] ata2.00: qc timeout (cmd 0xa1)
Jun 16 21:34:03 laptop kernel: [30134.230000] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun 16 21:34:03 laptop kernel: [30134.230000] ata2: failed to recover some devices, retrying in 5 secs
Jun 16 21:34:15 laptop kernel: [30146.272000] ata2: port is slow to respond, please be patient (Status 0x80)
Jun 16 21:34:38 laptop kernel: [30169.273000] ata2: soft resetting port
Jun 16 21:35:08 laptop kernel: [30199.575000] ata2.00: qc timeout (cmd 0xa1)
Jun 16 21:35:08 laptop kernel: [30199.575000] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun 16 21:35:08 laptop kernel: [30199.575000] ata2.00: limiting speed to MWDMA2:PIO3
Jun 16 21:35:08 laptop kernel: [30199.575000] ata2: failed to recover some devices, retrying in 5 secs
Jun 16 21:35:20 laptop kernel: [30211.617000] ata2: port is slow to respond, please be patient (Status 0x80)
Jun 16 21:35:43 laptop kernel: [30234.618000] ata2: soft resetting port
Jun 16 21:36:13 laptop kernel: [30264.920000] ata2.00: qc timeout (cmd 0xa1)
Jun 16 21:36:13 laptop kernel: [30264.920000] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun 16 21:36:13 laptop kernel: [30264.920000] ata2.00: disabled
Jun 16 21:36:14 laptop kernel: [30265.421000] ata2: EH complete

As before, if this happens, I have to reboot the system (if I want to use the CD drive even for reading). Otherwise, I am burning CD's successfully.
Comment 13 Albert Lee 2007-06-18 18:14:51 UTC
Created attachment 11788 [details]
Debug patch for the trailing data

> laptop kernel: [  139.836000] ata2.00: 16 bytes trailing data

The trailing data looks bad. Usually when we saw the "trailing data" error, we might have some data that is not written. Please verify the data burned by comparing the md5sum of your burned cd by "dd if=/dev/scd0|md5um" with the md5sum of your iso image file.

Could you also please try the attached debug patch to see the direction of the data flow? Thanks.
Comment 14 Soner Tari 2007-06-19 03:04:03 UTC
Hi Albert, I've checked the MD5 sum as you suggested, and it is correct.

Also, after your debug patch, I see the following sample lines in messages:

Jun 19 12:23:05 laptop kernel: [  140.049000] ata2.00: 24 bytes trailing data read
Jun 19 12:23:38 laptop kernel: [  172.448000] ata2.00: 16 bytes trailing data read

So these are reads from the CD drive. I guess that explains successful writes, right?
Comment 15 Albert Lee 2007-06-20 02:14:02 UTC
> So these are reads from the CD drive. I guess that explains successful
> writes,
right?

Yes. It's good to see the written data being correct.

I wonder why the device ever returns more data than requested. Possible causes might be firmware bug or cabling/data integrity problem. It's worth a try if firmware is available for this drive.

Since the Samsung SN-308B drive behaves much better with k3b issued command 0xac (GET_PERFORMANCE) after the DMA_RW_ONLY patch, maybe that's what we can do at this moment...
Comment 16 Soner Tari 2007-06-20 04:48:38 UTC
I think the most recent firmware is from 2003, and installed on my system already.

But, I actually had ruled out a firmware bug or cabling problem, because I am running Linux on this laptop (not to mention WindowsXP and OpenBSD) for more than 4 years now. Also, there is no problem with the latest vanilla kernel with the old ide driver. I'm not sure if the new libata reveals a problem hidden for years, but I'd find it quite unlikely. (These facts make me still think that the problem is within the new libata somewhere, but you are the expert.)

Alright then, thank you for your efforts. I agree, that's enough for now. If it becomes a headache, I can always switch to old ide drivers anyway.

(If you know a way to capture the trailing data, probably I can try that as the last thing in this thread.)
Comment 17 Albert Lee 2007-06-20 23:32:03 UTC
Hi Soner,

Tejun has a new patch related to ATAPI DMA. Could you please drop my previous "DMA_RW_ONLY blacklist" patch and try Tejun's new patch with k3b:

https://bugzilla.novell.com/attachment.cgi?id=147389

Hopefully with the 16-byte alignment similar to what IDE does, we can have better result. (The original bug is 
https://bugzilla.novell.com/show_bug.cgi?id=229260)
Comment 18 Soner Tari 2007-06-21 03:26:42 UTC
Hi Albert, I've applied Tejun's patch as you suggested. Good news is that I can burn CDs without my drive in the blacklist. But the bad news is that I still see the same trailing data errors. (But, with blacklist, I had experienced one major error, comment #12 above, hopefully I won't have any with the new patch.)
Comment 19 Albert Lee 2007-06-21 06:48:51 UTC
> bad news is that I still see the same trailing data errors.

I am also curious why the trailing data, will make a patch to dig into that later.
Comment 20 Albert Lee 2007-06-25 02:43:06 UTC
Created attachment 11865 [details]
Tejun's patch to limit ATAPI DMA to multiple of 16 bytes
Comment 21 Albert Lee 2007-06-25 02:48:09 UTC
Created attachment 11866 [details]
Detailed debug patch for the trailing data

Hi Soner,

Could you please apply both the "limit ATAPI DMA to multiple of 16 bytes" and this "detailed debug patch". Rebuild/reboot the kernel and try k3b to trigger the trailing data problem.

Hopefully we can have more clue about what's in the trailing data. Thanks.
Comment 22 Soner Tari 2007-06-25 06:51:37 UTC
Created attachment 11867 [details]
kern.log with trailing data

Hi Albert,

I did as you suggested, and I think now we can see the trailing data in the kern.log file. I hope you can decipher them :).

(During these logs, I blanked and burned a CD using K3b.)
Thanks,
Comment 23 Albert Lee 2007-06-25 19:40:31 UTC
Thanks for the debug log. From the log, the trailing data only happens on READ_TOC_PMA_ATIP (0x43) and GET_PERFORMANCE (0xac). 
================================================
1. READ_TOC_PMA_ATIP:

First k3b allocated 4 bytes buffer and issued the command.
The drive actually had 28 bytes of data and wants to transfer all of them.
libata correctly transferred the first 4-bytes of data to the buffer and discarded the trailing 24 bytes.

[  120.291000] CDB (2:0,0,0) 43 00 04 00 00 00 00 00 04
[  120.291000] ata2: protocol 5 task_state 1 (dev_stat 0x58)
[  120.292000] ata2: protocol 5 task_state 2
[  120.292000] ata2: protocol 5 task_state 2 (dev_stat 0x58)
[  120.292000] ata2: xfering 28 bytes
[  120.292000] ata2.00: 24 bytes trailing data read
[  120.292000] 00b3 <3>00ce <3>1a61 <3>0041 <3>3b4f <3>004a <3>1a24 <3>00bc <3>0000 <3>0000 <3>0000 <3>0000 <3>
[  120.292000] ata2: protocol 5 task_state 3
[  120.292000] ata2: protocol 5 task_state 3 (dev_stat 0x50)
[  120.292000] ata2: dev 0 command complete, drv_stat 0x50
-------------------------------------------------------------
After receiving the first 4-bytes of data, k3b knew the actual data length was 28-bytes. So, k3b re-issued the command with 28 bytes of buffer. And the command is ok this time.
[  120.294000] CDB (2:0,0,0) 43 00 04 00 00 00 00 00 1c
[  120.294000] ata2: protocol 5 task_state 1 (dev_stat 0x58)
[  120.295000] ata2: protocol 5 task_state 2
[  120.295000] ata2: protocol 5 task_state 2 (dev_stat 0x58)
[  120.295000] ata2: xfering 28 bytes
[  120.295000] ata2: protocol 5 task_state 3
[  120.295000] ata2: protocol 5 task_state 3 (dev_stat 0x50)
[  120.295000] ata2: dev 0 command complete, drv_stat 0x50

===============================================================
2. GET_PERFORMANCE:

Bytes 8-9 are the "max number of descriptors", but not logged below.
Anyway, it seems k3b prepared only 8 bytes of buffer and the drive tried to return 24-bytes of data. Libata correctly discarded the trailing 16-bytes of data.

Jun 25 16:29:26 laptop kernel: [  163.914000] CDB (2:0,0,0) ac 00 00 00 00 00 00 00 00
Jun 25 16:29:26 laptop kernel: [  163.914000] ata2: protocol 5 task_state 1 (dev_stat 0x58)
Jun 25 16:29:26 laptop kernel: [  163.914000] ata2: protocol 5 task_state 2
Jun 25 16:29:26 laptop kernel: [  163.914000] ata2: protocol 5 task_state 2 (dev_stat 0x58)
Jun 25 16:29:26 laptop kernel: [  163.914000] ata2: xfering 24 bytes
Jun 25 16:29:26 laptop kernel: [  163.914000] ata2.00: 16 bytes trailing data read
Jun 25 16:29:26 laptop kernel: [  163.914000] 0002 <3>0000 <3>0200 <3>800a <3>0000 <3>080b <3>0000 <3>8405 <3>
Jun 25 16:29:26 laptop kernel: [  163.914000] ata2: protocol 5 task_state 3
Jun 25 16:29:26 laptop kernel: [  163.914000] ata2: protocol 5 task_state 3 (dev_stat 0x50)
Jun 25 16:29:26 laptop kernel: [  163.914000] ata2: dev 0 command complete, drv_stat 0x50
Comment 24 Albert Lee 2007-06-25 19:46:35 UTC
After the log analyzed, it seems the CDRW/DVD SN-308B drive wants to return more data than requested/allocated by k3b. We can safely ignore the trailing data since k3b always got what it wanted.

Also, it seems the old IDE subsystem simply ingore the trailing data silently. For libata, although the trailing data is safe to ignore, this is not a good behavior of the drive. So, we had better keep the warning messages in libata.
Comment 25 Albert Lee 2007-06-25 19:50:32 UTC
Hi Soner,

From the log, I cannot find any timeout message. Need your help to confirm that after Tejun's "limit ATAPI DMA to multiple of 16 bytes" patch applied, there is no more timeout.
Comment 26 Soner Tari 2007-06-26 02:45:42 UTC
Hi Albert,

I guess it has been more than a week now that I haven't seen any timeouts with Tejun's patch. Also, I didn't have any major problems yet either. So, yes, I can confirm that timeouts are gone. A couple of trials in a week may be enough to conclude anything, but so far so good.

I think your previous comments above make sense, especially if the old ide driver really has been ignoring the same trailing data for years.

So I conclude that my problems are gone, and I hope this patch goes into the Ubuntu kernel soon, so I won't have to build a kernel myself after each update.

Thank you for your help,
Comment 27 Albert Lee 2007-06-26 03:11:31 UTC
> So I conclude that my problems are gone, and I hope this patch goes
> into the Ubuntu kernel soon

Hi Soner,
Nice to know that. I am also looking forward to Tejun's patch being upstream.
Many thanks for your help and test.

Hi Tejun,

The "limit ATAPI DMA to multiple of 16-bytes" looks promising. Do you have any plan to submit it upstream, say, 2.6.22-rc5?
(BTW, maybe this bug can be closed.)
Comment 28 Tejun Heo 2007-06-26 03:55:03 UTC
Yeah, thanks a lot for all the hard work.  I'll forward the patch upstream.  Closing the bug.

Note You need to log in before you can comment on or make changes to this bug.