Bug 8605
Summary: | ata2: failed to recover some devices | ||
---|---|---|---|
Product: | Drivers | Reporter: | Soner Tari (list) |
Component: | Other | Assignee: | Tejun Heo (htejun) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | albertcc, htejun |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.21.4 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
messages log file, ata debug enabled
messages, lspci -vv output, and dmesg (tar bziped) Patch to limit ATAPI DMA to R/W only for Samsung SN-308B Debug patch for the trailing data Tejun's patch to limit ATAPI DMA to multiple of 16 bytes Detailed debug patch for the trailing data kern.log with trailing data |
Description
Soner Tari
2007-06-10 04:35:30 UTC
Guys, this is a regression. Andrew, this is not a regression in libata. The report is a bit confusing but the device has never worked on libata (it works with the ide driver). Albert, this seems to be continuation of the following. http://bugzilla.kernel.org/show_bug.cgi?id=8244#c71 Any ideas? Sorry if I used the wrong terminology, the problem has been with libata all along. By "old libata" I meant the ide drivers. I don't have any problem with the ide drivers on this system (for the last 4 years). After applying Albert's patch (as you can read on the link Tejun provides too), I was able to burn CD iso's *on the command line* almost without any problems, but K3b (or Gnome CD Writer) problems persisted. So I asked this issue to Albert in a private e-mail, how it could be possible that his fix solved the issue on the command line but not on K3b. He suggested that I download and try the latest vanilla kernel, and report if there are problems. Unfortunately, with the latest vanilla kernel, I can't write (or blank) CD's even on the command line. I'm running my vanilla ide kernel now, but I have libata one also (not to mention Ubuntu Feisty kernels). So I am ready to try your suggestions. Hi Soner, Could you please turn on libata debug trace by - edit include/linux/libata.h #define ATA_DEBUG #define ATA_VERBOSE_DEBUG - rebuild the kernel and reboot - /var/log/messages will be flooded with ton of libata debug trace - Try to reproduce the timeout with k3b - Capture and attach full dmesg with the timeout to this bugzilla Hopefully we can find some clue about the timeout. Thanks. Created attachment 11739 [details]
messages log file, ata debug enabled
I've started K3b after 17:05 in the logs.
Hi Soner, It seems the log is not verbose enough. Could you also #define ATA_VERBOSE_DEBUG, rebuild the kernel and recapture. Thanks. Hi Soner, BTW, could you also attach the output of "dmesg" and "lspci -v -v", thanks. Created attachment 11749 [details]
messages, lspci -vv output, and dmesg (tar bziped)
Hi Albert,
Actually, I had already turned on ATA_VERBOSE_DEBUG per your previous instructions. But since you asked again, I've make-cleaned first and rebuilt the kernel. However, I don't think the result is too different.
Here is the output of 'grep -H "ATA_.*DEBUG" *':
libata.h:#define ATA_DEBUG /* debugging output */
libata.h:#define ATA_VERBOSE_DEBUG /* yet more debugging output */
Do you think I am doing something wrong, or some kernel config problem?
Please let me know if you need anything else.
Thank you,
Summary of the problem: - 00:1f.1 IDE interface: Intel Corporation 82801CAM IDE U100 (rev 02) - SAMSUNG CDRW/DVD SN-308B U021 During k3b burning, the command 0xac (GET_PERFORMANCE) times out: [ 451.780000] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen [ 451.780000] ata2.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0xac data 8 in [ 451.780000] res 40/00:03:00:3c:00/00:00:00:00:00/a0 Emask 0x4 (timeout) Created attachment 11759 [details]
Patch to limit ATAPI DMA to R/W only for Samsung SN-308B
Hi Soner,
Could you please try
1. #undef ATA_DEBUG and #undef ATA_VERBOSE_DEBUG
2. Apply the attached patch
3. rebuild/reboot the kernel
4. Run k3b and see if the timeout problem still occurs
Thanks.
Hi Albert, OK, I can burn CDs using K3b now, there's no problem there, but I see the following lines in messages log (with explanations): K3b starts: Jun 15 11:41:56 laptop kernel: [ 139.836000] ata2.00: 16 bytes trailing data K3b device list refresh: Jun 15 11:42:20 laptop kernel: [ 164.107000] ata2.00: 16 bytes trailing data Blank CDRW inserted: Jun 15 11:45:09 laptop kernel: [ 333.091000] ata2.00: 16 bytes trailing data Jun 15 11:45:09 laptop kernel: [ 333.336000] cdrom: This disc doesn't have any tracks I recognize! Jun 15 11:45:09 laptop kernel: [ 333.339000] ata2.00: 24 bytes trailing data After CD image burn starts: Jun 15 11:47:37 laptop kernel: [ 481.355000] ata2.00: 24 bytes trailing data After CD inserted back (K3b pops it out after successful write): Jun 15 11:48:55 laptop kernel: [ 559.335000] ata2.00: 24 bytes trailing data Jun 15 11:48:55 laptop kernel: [ 559.347000] ata2.00: 16 bytes trailing data After starting CDRW erase: Jun 15 11:50:14 laptop kernel: [ 638.947000] ata2.00: 24 bytes trailing data There were no other log lines between those above. Therefore, I turned ata verbose debug back on to see what the trailing data are (thinking that it could help you), but unfortunately neither messages log file nor dmesg has any information on that. One more thing, trying to burn the same image with Gnome CD Writer fails with the following line in messages, and I would appreciate if anyone knows how to fix it (not that it's too important, because K3b seems ok now): Jun 15 11:52:38 laptop kernel: [ 782.619000] scsi: unknown opcode 0x01 In short, I tried to reproduce the timeout error using K3b (up to 10-15 trials), I don't see any timeout problems now. Of course I don't want to be misled by this first impression like the last time, but my initial report is that I can use K3b now with a vanilla libata kernel without apparent problems (with possibly minor issues above). I don't know if this patch is considered as the final fix for my problem, but I have switched to that kernel now. So, thank you very much for your help. Please let me know if you need me to do anything else for the trailing data issues above (I don't know if they are important at all). Hmm, I've got my first error. It seems like the previous timeout error logs, but now there are more lines. This happened when K3b was blanking a CDRW: Jun 16 21:31:09 laptop kernel: [29960.236000] ata2.00: 16 bytes trailing data Jun 16 21:31:09 laptop kernel: [29960.636000] ata2.00: 24 bytes trailing data Jun 16 21:31:46 laptop kernel: [29997.210000] ata2.00: 24 bytes trailing data Jun 16 21:33:02 laptop kernel: [30073.886000] res 40/00:03:00:10:00/00:00:00:00:00/a0 Emask 0x4 (timeout) Jun 16 21:33:09 laptop kernel: [30080.927000] ata2: port is slow to respond, please be patient (Status 0xd0) Jun 16 21:33:32 laptop kernel: [30103.928000] ata2: soft resetting port Jun 16 21:34:03 laptop kernel: [30134.230000] ata2.00: qc timeout (cmd 0xa1) Jun 16 21:34:03 laptop kernel: [30134.230000] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4) Jun 16 21:34:03 laptop kernel: [30134.230000] ata2: failed to recover some devices, retrying in 5 secs Jun 16 21:34:15 laptop kernel: [30146.272000] ata2: port is slow to respond, please be patient (Status 0x80) Jun 16 21:34:38 laptop kernel: [30169.273000] ata2: soft resetting port Jun 16 21:35:08 laptop kernel: [30199.575000] ata2.00: qc timeout (cmd 0xa1) Jun 16 21:35:08 laptop kernel: [30199.575000] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4) Jun 16 21:35:08 laptop kernel: [30199.575000] ata2.00: limiting speed to MWDMA2:PIO3 Jun 16 21:35:08 laptop kernel: [30199.575000] ata2: failed to recover some devices, retrying in 5 secs Jun 16 21:35:20 laptop kernel: [30211.617000] ata2: port is slow to respond, please be patient (Status 0x80) Jun 16 21:35:43 laptop kernel: [30234.618000] ata2: soft resetting port Jun 16 21:36:13 laptop kernel: [30264.920000] ata2.00: qc timeout (cmd 0xa1) Jun 16 21:36:13 laptop kernel: [30264.920000] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4) Jun 16 21:36:13 laptop kernel: [30264.920000] ata2.00: disabled Jun 16 21:36:14 laptop kernel: [30265.421000] ata2: EH complete As before, if this happens, I have to reboot the system (if I want to use the CD drive even for reading). Otherwise, I am burning CD's successfully. Created attachment 11788 [details] Debug patch for the trailing data > laptop kernel: [ 139.836000] ata2.00: 16 bytes trailing data The trailing data looks bad. Usually when we saw the "trailing data" error, we might have some data that is not written. Please verify the data burned by comparing the md5sum of your burned cd by "dd if=/dev/scd0|md5um" with the md5sum of your iso image file. Could you also please try the attached debug patch to see the direction of the data flow? Thanks. Hi Albert, I've checked the MD5 sum as you suggested, and it is correct. Also, after your debug patch, I see the following sample lines in messages: Jun 19 12:23:05 laptop kernel: [ 140.049000] ata2.00: 24 bytes trailing data read Jun 19 12:23:38 laptop kernel: [ 172.448000] ata2.00: 16 bytes trailing data read So these are reads from the CD drive. I guess that explains successful writes, right? > So these are reads from the CD drive. I guess that explains successful
> writes,
right?
Yes. It's good to see the written data being correct.
I wonder why the device ever returns more data than requested. Possible causes might be firmware bug or cabling/data integrity problem. It's worth a try if firmware is available for this drive.
Since the Samsung SN-308B drive behaves much better with k3b issued command 0xac (GET_PERFORMANCE) after the DMA_RW_ONLY patch, maybe that's what we can do at this moment...
I think the most recent firmware is from 2003, and installed on my system already. But, I actually had ruled out a firmware bug or cabling problem, because I am running Linux on this laptop (not to mention WindowsXP and OpenBSD) for more than 4 years now. Also, there is no problem with the latest vanilla kernel with the old ide driver. I'm not sure if the new libata reveals a problem hidden for years, but I'd find it quite unlikely. (These facts make me still think that the problem is within the new libata somewhere, but you are the expert.) Alright then, thank you for your efforts. I agree, that's enough for now. If it becomes a headache, I can always switch to old ide drivers anyway. (If you know a way to capture the trailing data, probably I can try that as the last thing in this thread.) Hi Soner, Tejun has a new patch related to ATAPI DMA. Could you please drop my previous "DMA_RW_ONLY blacklist" patch and try Tejun's new patch with k3b: https://bugzilla.novell.com/attachment.cgi?id=147389 Hopefully with the 16-byte alignment similar to what IDE does, we can have better result. (The original bug is https://bugzilla.novell.com/show_bug.cgi?id=229260) Hi Albert, I've applied Tejun's patch as you suggested. Good news is that I can burn CDs without my drive in the blacklist. But the bad news is that I still see the same trailing data errors. (But, with blacklist, I had experienced one major error, comment #12 above, hopefully I won't have any with the new patch.) > bad news is that I still see the same trailing data errors.
I am also curious why the trailing data, will make a patch to dig into that later.
Created attachment 11865 [details]
Tejun's patch to limit ATAPI DMA to multiple of 16 bytes
Created attachment 11866 [details]
Detailed debug patch for the trailing data
Hi Soner,
Could you please apply both the "limit ATAPI DMA to multiple of 16 bytes" and this "detailed debug patch". Rebuild/reboot the kernel and try k3b to trigger the trailing data problem.
Hopefully we can have more clue about what's in the trailing data. Thanks.
Created attachment 11867 [details]
kern.log with trailing data
Hi Albert,
I did as you suggested, and I think now we can see the trailing data in the kern.log file. I hope you can decipher them :).
(During these logs, I blanked and burned a CD using K3b.)
Thanks,
Thanks for the debug log. From the log, the trailing data only happens on READ_TOC_PMA_ATIP (0x43) and GET_PERFORMANCE (0xac). ================================================ 1. READ_TOC_PMA_ATIP: First k3b allocated 4 bytes buffer and issued the command. The drive actually had 28 bytes of data and wants to transfer all of them. libata correctly transferred the first 4-bytes of data to the buffer and discarded the trailing 24 bytes. [ 120.291000] CDB (2:0,0,0) 43 00 04 00 00 00 00 00 04 [ 120.291000] ata2: protocol 5 task_state 1 (dev_stat 0x58) [ 120.292000] ata2: protocol 5 task_state 2 [ 120.292000] ata2: protocol 5 task_state 2 (dev_stat 0x58) [ 120.292000] ata2: xfering 28 bytes [ 120.292000] ata2.00: 24 bytes trailing data read [ 120.292000] 00b3 <3>00ce <3>1a61 <3>0041 <3>3b4f <3>004a <3>1a24 <3>00bc <3>0000 <3>0000 <3>0000 <3>0000 <3> [ 120.292000] ata2: protocol 5 task_state 3 [ 120.292000] ata2: protocol 5 task_state 3 (dev_stat 0x50) [ 120.292000] ata2: dev 0 command complete, drv_stat 0x50 ------------------------------------------------------------- After receiving the first 4-bytes of data, k3b knew the actual data length was 28-bytes. So, k3b re-issued the command with 28 bytes of buffer. And the command is ok this time. [ 120.294000] CDB (2:0,0,0) 43 00 04 00 00 00 00 00 1c [ 120.294000] ata2: protocol 5 task_state 1 (dev_stat 0x58) [ 120.295000] ata2: protocol 5 task_state 2 [ 120.295000] ata2: protocol 5 task_state 2 (dev_stat 0x58) [ 120.295000] ata2: xfering 28 bytes [ 120.295000] ata2: protocol 5 task_state 3 [ 120.295000] ata2: protocol 5 task_state 3 (dev_stat 0x50) [ 120.295000] ata2: dev 0 command complete, drv_stat 0x50 =============================================================== 2. GET_PERFORMANCE: Bytes 8-9 are the "max number of descriptors", but not logged below. Anyway, it seems k3b prepared only 8 bytes of buffer and the drive tried to return 24-bytes of data. Libata correctly discarded the trailing 16-bytes of data. Jun 25 16:29:26 laptop kernel: [ 163.914000] CDB (2:0,0,0) ac 00 00 00 00 00 00 00 00 Jun 25 16:29:26 laptop kernel: [ 163.914000] ata2: protocol 5 task_state 1 (dev_stat 0x58) Jun 25 16:29:26 laptop kernel: [ 163.914000] ata2: protocol 5 task_state 2 Jun 25 16:29:26 laptop kernel: [ 163.914000] ata2: protocol 5 task_state 2 (dev_stat 0x58) Jun 25 16:29:26 laptop kernel: [ 163.914000] ata2: xfering 24 bytes Jun 25 16:29:26 laptop kernel: [ 163.914000] ata2.00: 16 bytes trailing data read Jun 25 16:29:26 laptop kernel: [ 163.914000] 0002 <3>0000 <3>0200 <3>800a <3>0000 <3>080b <3>0000 <3>8405 <3> Jun 25 16:29:26 laptop kernel: [ 163.914000] ata2: protocol 5 task_state 3 Jun 25 16:29:26 laptop kernel: [ 163.914000] ata2: protocol 5 task_state 3 (dev_stat 0x50) Jun 25 16:29:26 laptop kernel: [ 163.914000] ata2: dev 0 command complete, drv_stat 0x50 After the log analyzed, it seems the CDRW/DVD SN-308B drive wants to return more data than requested/allocated by k3b. We can safely ignore the trailing data since k3b always got what it wanted. Also, it seems the old IDE subsystem simply ingore the trailing data silently. For libata, although the trailing data is safe to ignore, this is not a good behavior of the drive. So, we had better keep the warning messages in libata. Hi Soner, From the log, I cannot find any timeout message. Need your help to confirm that after Tejun's "limit ATAPI DMA to multiple of 16 bytes" patch applied, there is no more timeout. Hi Albert, I guess it has been more than a week now that I haven't seen any timeouts with Tejun's patch. Also, I didn't have any major problems yet either. So, yes, I can confirm that timeouts are gone. A couple of trials in a week may be enough to conclude anything, but so far so good. I think your previous comments above make sense, especially if the old ide driver really has been ignoring the same trailing data for years. So I conclude that my problems are gone, and I hope this patch goes into the Ubuntu kernel soon, so I won't have to build a kernel myself after each update. Thank you for your help, > So I conclude that my problems are gone, and I hope this patch goes
> into the Ubuntu kernel soon
Hi Soner,
Nice to know that. I am also looking forward to Tejun's patch being upstream.
Many thanks for your help and test.
Hi Tejun,
The "limit ATAPI DMA to multiple of 16-bytes" looks promising. Do you have any plan to submit it upstream, say, 2.6.22-rc5?
(BTW, maybe this bug can be closed.)
Yeah, thanks a lot for all the hard work. I'll forward the patch upstream. Closing the bug. |