Bug 8316
Description
Rohan B. Dhruva
2007-04-09 12:17:25 UTC
Created attachment 11113 [details]
Complete dmesg output.
Created attachment 11114 [details]
lspci -vvv output
Created attachment 11115 [details]
cdrecord dev=/dev/sr0 -inq -vvVV output
> Judging that the dmesg output is exactly the same as bug #8244, > I think the two bugs are somehow related. The patch for bug 8244 is in 2.6.21-rc6. Could you please build and reboot to 2.6.21-rc6 and see if the problem persists? Seeing the CVS of archlinux - http://cvs.archlinux.org/cgi- bin/viewcvs.cgi/kernels/kernel26/?cvsroot=Current&only_with_tag=CURRENT - that patch has been applied, "libata-eh.patch". I will still try kernel 2.6.21-rc6, but is there any way to manually "induce" or "stimulate" the bug. Waiting for the bug to occur is unpredictable at times. > I will still try kernel 2.6.21-rc6, Yes, please. > but is there any way to manually "induce" or "stimulate" the bug. > Waiting for the bug to occur is unpredictable at times. This bug looks different from bug #8244, which always happens during boot. Currently I guess the only way is be patient and capture the dmesg when the problem happens. Created attachment 11117 [details]
Complete dmesg on kernel 2.6.21-rc6
The above attachment is when the pc was running on 2.6.21-rc6, with no other patches applied. Please tell me if I need to attach any other information. The cdrecord output remains the same. Any updates, Albert ? > Any updates, Albert? 1. This bug is not bug 8244. 2. From the dmesg, libata can initialize without problem during boot. 3. The "qc timeout message" might be caused by lost irq or something else. From both dmesgs, the problem happened when "cdb 0x0" (TEST UNIT READY) was issued to the drive. Rohan, Since the machine is at your hand, could you please 1. try if any reliable method to reproduce the problem? e.g. "dd if=/dev/scd0 | md5sum" e.g. stress both the hard drive and the dvd drive at the same time. e.g. Leave a cd in the dvd drive and check if the problem disappears. 2. observe any other symptom or clue when such "qc timeout" happens? Say, the system appears to be frozen? mouse/key board not responding for a moment? The LED of the USB web cam flashing? 3. Collect more dmesgs when such timeout happens. We can check whether this is related to specific command. Yes, keyboard and mouse stop responding when this happens. The Webcam light does not flash, though. All the applications stop responding, but resume from the same point. E.g. if i am playing music when this happens, music stops playing,but resumes from the same point. I will try to get more dmesg's. Created attachment 11153 [details]
One more :)
http://bugzilla.kernel.org/attachment.cgi?id=11153&action=view This dmesg was captured when i used the command "eject" to eject the CD. The media in CD Drive was a cdrw which has been working great. I had done the done md5sum command you suggested, but it worked perfectly and no errors during that time. This time, during eject, as you can see, I got some buffer IO errors in dmesg. And after eject, when i inserted the tray back in (manually), the usual symptoms - pc lag, no kbd/mouse, and the error in dmesg. Also, I don't know whether it is related or no - but while the 30 second lag, I had left mplayer on; the music started playing suddenly at HIGH volume for short intervals. And then after 30 secs it was playing normally. I guess maybe something occupied the CPU. Say, a long busy delay in the interrupt handler or the device appears to be busy when the interrupt happens. I'll be out for several days and be back on 4/21. Will try to make a debug patch to check the activity in the irq handler when back. OK, I will try to investigate and find some way of reliably reproducing the error till then :) Attaching a dmesg taken on ubuntu 7.04 feisty. It has libata enabled, too. Created attachment 11248 [details]
Dmesg on ubuntu 7.04
Maybe this gives more clues ?
Created attachment 11253 [details]
dmesg output on 2.6.21-rc7
Just for the record, problem is not solved in 2.6.21-rc7 :)
Hi albert, I noticed something. From a day or two I've intentionally kept a CD in the drive at all times. All during that period, I never had a single problem. Hope that helps, Rohan. More data point: Both the hard drive and the DVD drive are PATA, connected to the ICH7 IDE port in master/slave configuration. From the dmesg, the timeout happens on command 0x43, 0x1e, 0x0 and 0x25. => Looks not related to specific command. > From a day or two I've intentionally kept a CD in the drive > at all times. All during that period, I never had a single problem. So, if no periodic commands issued to the DVD drive, then no timeout. => This looks like something related to the DVD device. Maybe drive did not response to the command, lost irq or something... Hi Rohan, Could you please try updating the firmware of the DVD drive first? On the TSST website, there is a new firmware "SC03" for the TS-L632D drive, dated 2007-02-15: http://www.samsungodd.com/eng/Firmware/liveupdate.asp Let's see if firmware update can fix this problem... Thanks a lot for pointing me to the update. I have updated to AC01 update, dated 01-Jan-07, because it is Acer OEM. If that does not work, I will update to the SC03 Samsung firmware. After that, I'll comment whether it worked or no. Problem is not solved on upgrading the Acer firmware. Now I am going to upgrade to the samsung SC03 firmware, to see if it helps. I am attaching the dmesg on a kernel 2.6.20, on ubuntu install. Is that ok ? Or do I still attach dmesg of 2.6.21, the released newest kernel ? Created attachment 11369 [details] dmesg after firmware upgrade to Acer AC01 So the problem occurs even after firmware upgrade to Acer AC01 version. Now I will try upgrade to the Samsung SC03 version. http://www.toshibaer.com/firmware/index.php?path=TS-L632D/ It's unable to update to the SC03 firmware. The flashing utility says that "The version of firmware selected is incompatible with your hardware." when I select the SC03 update. However, right now, my drive is updated to the latest Acer OEM upgrade - AC01, released on 20-Jan-07. Please tell me the next step to take :) Created attachment 11370 [details]
Error occurs on 2.6.21.1
OK, the error occurred on a custom compiled kernel 2.6.21.1 which has no
outside patches applied. The firmware, as I said, is updated to the latest
provided by Acer.
I have a Acer TravelMate 5610 laptop with a TSSTcorp CD/DVDW TS-L632D using a shared PATA bus with the WDC WD1200UE-22K hard drive and I have experienced this same problem. After finding that hald was causing the timeout issue and after disabling hald the timeouts stopped. I had a look at the commands that hald was sending to the cdrom and then wrote a test program to determine the cause of the timeouts. I've attached two test programs that I have created to isolate the calls and use of the cdrom device pertaining to this timeout issue. The first test program, by calling the open and close command for the cdrom device every second, causes a timeout after a few hours of execution in an apparent random fashion. This seems to be how hald currently queries the cdrom. I have tested this code with the query commands included but they had no impact on it's ability to cause a timeout issue so I've dropped them from the test program. It is my understanding that the cause of the timeout issue is related directly to the repeated calls to the open and close command when referencing a TSSTcorp CD/DVDW TS-L632D drive. I have attached a second program to show that by keeping the cdrom device open and then repeatedly performing cdrom querying calls I have been unable to cause any cdrom timeouts events even after continuous running for many hours. My conclusion is that the timeout is caused by the repeated use of the open and/or close commands on this cdrom device. Created attachment 11390 [details]
dmesg output
Created attachment 11391 [details]
lspci -vvv output
Created attachment 11392 [details]
cdrecord dev=/dev/sr0 -inq -vvVV output
Created attachment 11393 [details]
cdstatus_1.c
This is the first test program.
Created attachment 11394 [details]
cdstatus_2.c
This is the second test program.
Matthew, I can confirm that the same program causes timeouts for me too. But isn't the method used by HAL very crude ? I have no experience in hardware related programming, but constantly pinging a cd-rom device to check for new media seems strange. Also, would it not drain the laptop battery very fast ? After doing some debugging in the kernel I've found that the timeouts seem to be from the sr scsi cdrom driver always executing certain commands in certain functions that should only be executed when media is in the drive. In the sr_media_changed function, it calls scsi_test_unit_ready which sets the changed flag and returns 0 if the drive returns NOT_READY which my drive returns if no media is in the drive. This causes sr_cd_check and get_sectorsize to be called on every call to sr_media_changed when no media is in the drive which causes the drive to often timeout. Also, in sr_release, sr_set_blocklength is called which sometimes causes a timeout on my drive as well. I looked through the code of the ide-cd driver and it looks like its media_changed function doesn't changed flag if the drive returns NOT_READY. Even when a media change is detected it defers checking of multisession and sectorsize info to the cdrom_read_toc function. When I commented out these three calls, my test programs didn't cause a timeout when no media was in the drive. Perhaps now that libata supports PATA devices some of the code from the ide-cd driver should be ported over to enhance either the sr driver or the Uniform cdrom driver. Created attachment 11569 [details]
A quick patch to try to workaround some of the cdrom timeouts.
Created attachment 11579 [details]
cdstatus_5.c
With the new changes, it looks like I'm still getting occasional timeouts from
TEST UNIT READY. Here is my latest test program which just calls TEST UNIT
READY every second.
I have noticed that I seem to get more timeouts if I add lots of extra
debugging output to the cdrom driver which gets logged to the hard drive via
syslog.
Any updates ? So, it seems somehow TEST UNIT ready triggered the timeiut. Could you please turn on libata debug trace by - edit include/linux/libata.h #define ATA_DEBUG #define ATA_VERBOSE_DEBUG - rebuild the kernel and reboot - /var/log/messages will be flooded with ton of libata debug trace - Try to reproduce the timeout with Matthew's test program - Capture and attach full dmesg with the timeout to this bugzilla Hopefully we can find some clue about the timeout. Thanks. After doing the above, Alex, and upgrading to kernel 2.6.21.5, I am unable to reproduce the timeout under any circumstances, atleast for now. I have tried loading the system, running Matthew's test program number 5, but no success. Previously, under 2.6.21.3 and libata debug disabled, I had timeouts every now and then, very very frequently. However, I am still attaching the dmesg and /var/log/messages.log, hope it gives some clue. Matthew, can you please also try it out ? Thanks Created attachment 11740 [details]
dmesg on 2.6.21.5
On this kernel, and with libata debug enabled, the problem DOES NOT occur. However, still attaching the dmesg hoping that it might help.
Created attachment 11741 [details]
messages.log on 2.6.21.5
Problem does not occur on 2.6.21.5, but still attaching messages.log hoping that it might help.
Created attachment 11743 [details]
FINALLY the timeout occcurred :)
After waiting for 2 hours or so, the timeout occurred, and attached is the captured dmesg. Bingo :)
Hi Rohan,
The attachment #11743 [details] doesn't contain the traces with the timeout. :(
Could you please attach the full /var/log/messages with the timeout by
1. cd /var/log
2. cp messages messages-libata
3. bzip2 messages-libata
4. attach the "messages-libata.bz2" to this bugzilla
Thanks.
Created attachment 11745 [details]
The traces
Oh, here they are. I thought that because they are exactly the same as previous messages.log attachment it'd be ok not to attach them. But anyway :)
Created attachment 11746 [details]
Patch to limit the TSSTcorp TS-L632D ATAPI DMA to R/W only
Strangely /var/log/messages only contains the "ata1: ata_dev_select: ENTER, device 0, wait 1" and no libata HSM trace. :(
Anyway, I guess maybe this is ATAPI DMA problem.
Hi Rohan/Matthew,
Could you please verify if the attached patch fixes the timeout problem? Thanks.
Albert, can you please tell me which line to specifically look out for, in dmesg, which signifies that the timeout occurred ? I sometimes leave the pc on, and am away - so if the timeout occurs at that time, I can capture it .. But with so many debug messages, I cannot understand .. Also, I have applied the patch, and Matthew's test program number 5 is running as we speak .. > ... which line to specifically look out for in dmesg
Hi Rohan,
1. Please find something with the "timeout" keyword like:
ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.01: cmd a0/00:00:00:00:20/00:00:00:00:00/b0 tag 0 cdb 0x0 data 0
res 40/00:03:00:00:00/00:00:00:00:00/b0 Emask 0x4 (timeout)
ata1: port is slow to respond, please be patient (Status 0xd0)
ata1: port failed to respond (30 secs, Status 0xd0)
ata1: soft resetting port
ata1.00: configured for UDMA/100
ata1.01: configured for UDMA/33
ata1: EH complete
===========
2. If such timeout is seen, please capture the log of whole transation, from CDB (1:0,1,0), the timeout itself until the EH traces. Something like below:
ata_scsi_dump_cdb: CDB (1:0,1,0) 00 00 00 00 00 00 00 00 00
ata_scsi_translate: ENTER
ata1: ata_dev_select: ENTER, device 1, wait 1
ata_tf_load: feat 0x0 nsect 0x0 lba 0x0 0x0 0x20
ata_tf_load: device 0xB0
ata_exec_command: ata1: cmd 0xA0
ata_scsi_translate: EXIT
ata_hsm_move: ata1: protocol 6 task_state 1 (dev_stat 0x58)
atapi_send_cdb: send cdb
ata_host_intr: ata1: protocol 6 task_state 3
ata_hsm_move: ata1: protocol 6 task_state 3 (dev_stat 0x51)
ata_hsm_move: ata1: protocol 6 task_state 4 (dev_stat 0x51)
ata_scsi_timed_out: ENTER
ata_scsi_timed_out: EXIT, ret=0
ata_scsi_error: ENTER
ata_port_flush_task: ENTER
ata_port_flush_task: flush #1
ata1: ata_port_flush_task: flush #2
ata1: ata_port_flush_task: EXIT
ata_eh_autopsy: ENTER
atapi_eh_request_sense: ATAPI request sense
ata1: ata_dev_select: ENTER, device 1, wait 1
ata_tf_load: feat 0x0 nsect 0x0 lba 0x0 0x0 0x20
ata_tf_load: device 0xB0
ata_exec_command: ata1: cmd 0xA0
ata_hsm_move: ata1: protocol 5 task_state 1 (dev_stat 0x58)
atapi_send_cdb: send cdb
ata_host_intr: ata1: protocol 5 task_state 2
ata_hsm_move: ata1: protocol 5 task_state 2 (dev_stat 0x58)
atapi_pio_bytes: ata1: xfering 18 bytes
__atapi_pio_bytes: data read
ata_host_intr: ata1: protocol 5 task_state 2
ata_hsm_move: ata1: protocol 5 task_state 2 (dev_stat 0x50)
ata_hsm_move: ata1: protocol 5 task_state 3 (dev_stat 0x50)
ata_hsm_move: ata1: dev 1 command complete, drv_stat 0x50
ata_port_flush_task: ENTER
ata_port_flush_task: flush #1
ata1: ata_port_flush_task: flush #2
ata1: ata_port_flush_task: EXIT
ata_eh_autopsy: EXIT
ata_eh_recover: ENTER
ata_eh_prep_resume: ENTER
ata_eh_prep_resume: EXIT
ata_eh_revalidate_and_attach: ENTER
ata_eh_resume: ENTER
ata_eh_resume: EXIT
ata_eh_suspend: ENTER
ata_eh_suspend: EXIT
ata_eh_recover: EXIT, rc=0
atapi_qc_complete: ENTER, err_mask 0x0
ata_scsi_error: EXIT
No time out as yet with that patch .. Matthew's program has been constantly running too .. Anything else I can try to stimulate the problem ? > Anything else I can try to stimulate the problem ?
If it didn't happen, please try with the libata debug trace turned off by:
#undef ATA_DEBUG
#undef ATA_VERBOSE_DEBUG
Please keep the patch "Limit ATAPI DMA to Read/Write only" applied.
Thanks.
I've just compiled a 2.6.22-rc4 kernel with the added debug and limit atapi dma patches and with SMP and Preempt turned off and am currently running my cdstatus_5 program. Albert, after using that patch for so long, and with libata debug disabled, the timeout did not occur even once.. Great going ! Matthew, any timeouts on your system ? BTW, Albert, does this patch limit or slow down the performance of the CD-ROM drive ? After running cdstatus_5 for a few hours, I did not get any timeouts. Then I disabled the debug patches and ran hald and cdstatus_5 at the same time and got lots of timeouts not long after my desktop gui had finished loading. I'm still running the original ac00 firmware though. Rohan, are you getting any timeouts with hald running with this patch? If not I might upgrade the firmware to see if that helps. I'm going to try cdstatus_1 now with full debugging enabled. Matthew, all the time I was running cdstatus_5 and hald simultaneously. I did not get any timeouts. I will try again today, though... Ok, I still haven't got any timeouts after that patch being applied, and using the computer for long periods .. Any news from your side, Matthew ? Anything else for us to try, Albert ? Created attachment 11771 [details]
Timeout after patch too
I am very sorry, I spoke to soon I think.. The timeout occurs even _after_ I apply the patch.. Here's the dmesg :)
Yes I updated the firmware on my cdrom to AC01 and am still getting timeouts. I'm hoping to do some more testing in a few days in single user mode and then gradually adding options to see if the timeouts still occur then. This bugs happens in ubuntu, a long thread can be found here: https://bugs.launchpad.net/linux/+bug/84603 what type of informations do you need? a fix so far is to crossflah your cd drive as per: https://bugs.launchpad.net/linux/+bug/84603/comments/46 Thanks a lot Chantra. Is it confirmed that the bug is the fault of the CD-ROM drive ? And if so, why is the timeout not caused when libata is disabled ? However, I have upgraded to SC03 and will try for some time whether timeout occurs or no :) > why is the timeout not caused when libata is disabled ?
I guess this is due to hald/scsi mid don't issue as many of commands that triggers the problem under libata.
Anyway, if Samsung's SC03 firmware works well but Acer's AC01 firmware doesn't, maybe we should notify Acer about the problem. Hopefully Acer could come up with better firmware, say "AC03"...
Do we know which commands are problematic? From the dmesg, the timeout happens on command 0x43 (READ_TOC), 0x1e (ALLOW_MEDIUM_REMOVAL), 0x00 (TEST_UNIT_READY) and 0x25 (READ_CAPACITY). I see. All over the place. :-( I was hoping sr/cdrom was issuing commands after TUR failed. That would be something we can fix. We are seeing timeout issues for the TSSTcorp and Samsung drives, like this bug 8316 and another bug 8605. Maybe increasing the timeout to the magic 7 seconds as Alan pointed out can help? (http://www.ussg.iu.edu/hypermail/linux/kernel/0701.0/0196.html) AFAIK, we're using 7sec timeout. drivers/cdrom/cdrom.c::CDROM_DEF_TIMEOUT or are you referring to a different one? Ah, the timeout is already 7 secs... Originally I hoped we can workaround with longer timeout, if it is not long enough... Regarding the timeout, maybe the hald is using its own timeout value via the sgio.timeout (even if drivers/cdrom/cdrom.c::CDROM_DEF_TIMEOUT is 7)? I will check the hald code and see what the timeout value used there. AFAIK, hald simply tries open the device but what do I know? :-) Hmm, instead of workaround from hald or scsi mid (per Matthew's comment 33), maybe firmware update that fixes the root cause is better way to go. I've sent an e-mail to Acer asking if any TS-L632D firmware update available. I've also sent a bug report to Asus, asking if they have firmware update newer than "AS99". The customer service staff replied "Sorry, Linux not supported." I said this is not Linux problem but more likely firmware problem, but the CSR insists "Linux is not supported on Asus notebooks." Actually libata is not alone. The TS-L632D drive is also also causing trouble with old IDE subsystem. https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/75295 https://bugs.launchpad.net/ubuntu/+source/hal/+bug/85695 Comment #69 is false. Ubuntu enabled libata since 2.6.17 (and problem occured since this migration) Check https://launchpad.net/ubuntu/+spec/libata-for-all-ata-disks > Comment #69 is false. Ubuntu enabled libata since 2.6.17 > (and problem occured since this migration) Really? How come the TS-L632D is driven by old IDE in that bug report? ====================== Linux version 2.6.20-6-generic (root@palmer) (gcc version 4.1.2 20070106 (prerelease) (Ubuntu 4.1.1-21ubuntu7)) #2 SMP Wed Jan 31 20:53:39 UTC 2007 (Ubuntu 2.6.20-6.11-generic) <snip> [ 3.444000] ICH6: IDE controller at PCI slot 0000:00:1f.1 [ 3.444000] ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 18 (level, low) -> IRQ 20 [ 3.444000] ICH6: chipset revision 4 [ 3.444000] ICH6: not 100% native mode: will probe irqs later [ 3.444000] ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:DMA [ 3.444000] ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:pio, hdd:pio [ 3.444000] Probing IDE interface ide0... [ 3.684000] usb 1-1: configuration #1 chosen from 1 choice [ 4.152000] hda: ST9100825A, ATA DISK drive [ 4.396000] usb 4-1: new full speed USB device using uhci_hcd and address 2 [ 5.116000] usb 4-1: configuration #1 chosen from 1 choice [ 5.120000] usbcore: registered new interface driver hiddev [ 5.132000] input: Logitech Optical USB Mouse as /class/input/input1 [ 5.132000] input: USB HID v1.10 Mouse [Logitech Optical USB Mouse] on usb-0000:00:1d.0-1 [ 5.132000] usbcore: registered new interface driver usbhid [ 5.132000] drivers/usb/input/hid-core.c: v2.6:USB HID core driver [ 5.520000] hdb: TSSTcorpCD/DVDW TS-L632D, ATAPI CD/DVD-ROM drive [ 8.180000] hdb: ATAPI 24X DVD-ROM DVD-R-RAM CD-R/RW drive, 2048kB Cache, UDMA(33) <snip> [ 681.880000] hdb: status timeout: status=0xd0 { Busy } [ 681.880000] ide: failed opcode was: unknown [ 681.880000] hdb: drive not ready for command =========== And this: https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/75295/comments/18 "I have an HP Pavilion dv6106eu with a TSSTcorpCD/DVDW TS-L632D. Killing hald-addon-storage: polling /dev/hda brings the system back to normal working speed." Ok, my mistake. Indeed on my system, cdrom is still managed by ide-cd. Sorry Created attachment 11883 [details]
Debug patch for the timeout
Hi Rohan/Laurent:
I guess maybe we still need detailed trace of this TS-L632D drive.
Need your help to
- Apply the debug patch to mainline 2.6.22-rc5 kernel
(kernel should be configured to libata, instead of old IDE)
- Rebuild the kernel and reboot
- Hopefully /var/log/messages will be flooded with ton of libata debug trace
- Try to reproduce the timeout with Matthew's test program
- Capture and /var/log/messages with the timeout to this bugzilla
Hopefully we can have better understanding about the timeout. Thanks.
Albert, now that the SC03 firmware is flashed, I don't see a single timeout .. How can I induce one ? Should I downgrade my firmware ? Rohan, nice to know that SC03 firmware yields good result. If not too inconvenient, please help to downgrade the firmware to AC0x and collect some libata traces. Thanks and appreciates. Albert, I am running SC03 too, but I can downgrade and test out the patch. Will do tonight when I get home Hi, I attached both my messages log and my syslog (it seems that all the debug messages are put into syslog). I kept both files unchanged. FYI, cd-rom drive is ata1, hard drive is ata2 Created attachment 11890 [details]
messages logfile since last reboot
Created attachment 11891 [details]
syslog logfile since last reboot
I've downgraded to AC01 and running the cdstatus program. Will attach the log as soon as timeout occurs. Been running the 2.6.22-rc6 kernel with hald killed, and Matthew's program running .. But not a single timeout experienced till now .. > syslog logfile since last reboot
Hi Laurent,
Thanks for the debug trace. Could you also attach "dmesg" without the debug patch. (For the info during boot.) Thanks.
Albert, can you suggest me anything to induce the timeout ? The program's been running from a long time, with no timeout .. Created attachment 11893 [details]
dmesg logfile
Here's the dmesg. There's the nvidia module which taints the kernel but it was disabled during the test.
> Albert, can you suggest me anything to induce the timeout ? > The program's been running from a long time, with no timeout There is no clue yet about the root cause or how to trigger the timeout. The only clue from Laurent's log is: the timeout only happens on REQUEST_SENSE. (This reminds of bug 8244 in which AOpen cdrom 56X/AKH also timeouts on REQUEST_SENSE. However, these two bugs are quite different. For bug 8244, the REQUEST_SENSE timeout is caused by garbage in the taskfile registers and always reproducible. For this bug, the TS-L632D seems timeout on indefinite REQUEST SENSE. The command that REQUEST_SENSE times out are READ_CAPACITY and TEST_UNTI_READY in Laurent's log.) Created attachment 11894 [details]
Laurent's log filtered for ata1 messages only
Created attachment 11895 [details]
Tejun's patch to limit ATAPI DMA to multiple of 16 bytes
Hi Laurent,
Need your help to apply both the "Debug patch for the timeout" and Tejun's patch to "limit ATAPI DMA to multiple of 16 bytes" (and trigger the timeout for debug trace).
Hopefully we could know more about how the TS-L632D behaves under Tejun's patch and any timeout behavior change.
Ok, I trimmed down my ubuntu kernel so I can now compile a new kernel in reasonable time. I extracted the relevant part from the logs. As I can't find a way to attach a file (pointers will be welcome), you can get them from http://www.debuntu.org/files/libata_debug.txt.bz2 Test done with ubuntu's 2.6.22-rc5 patch with both "Tejun's patch" and "Debug patch for the timeout" Jun 28 22:13:46 gutsytop kernel: [ 1989.404000] ata1: xfering 18 bytes Jun 28 22:13:46 gutsytop kernel: [ 1989.404000] ata1: protocol 5 task_state 2 Jun 28 22:13:46 gutsytop kernel: [ 1989.404000] ata1: protocol 5 task_state 2 (dev_stat 0x50) Jun 28 22:13:46 gutsytop kernel: [ 1989.404000] ata1: protocol 5 task_state 3 (dev_stat 0x50) Jun 28 22:13:46 gutsytop kernel: [ 1989.404000] ata1: dev 1 command complete, drv_stat 0x50 Jun 28 22:13:46 gutsytop kernel: [ 1989.404000] CDB (1:0,1,0) 00 00 00 00 00 00 00 00 00 Jun 28 22:14:16 gutsytop kernel: [ 2019.424000] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen Jun 28 22:14:16 gutsytop kernel: [ 2019.424000] ata1.01: cmd a0/00:00:00:00:20/00:00:00:00:00/b0 tag 0 cdb 0x0 data 0 Jun 28 22:14:16 gutsytop kernel: [ 2019.424000] res 40/00:03:00:00:00/00:00:00:00:00/b0 Emask 0x4 (timeout) Jun 28 22:14:22 gutsytop kernel: [ 2024.468000] ata1: port is slow to respond, please be patient (Status 0xd0) Jun 28 22:14:27 gutsytop kernel: [ 2029.452000] ata1: device not ready (errno=-16), forcing hardreset Jun 28 22:14:27 gutsytop kernel: [ 2029.452000] ata1: BUG: prereset() requested invalid reset type Jun 28 22:14:27 gutsytop kernel: [ 2029.452000] ata1: soft resetting port Jun 28 22:14:27 gutsytop kernel: [ 2029.788000] ata1: protocol 2 task_state 2 (dev_stat 0x58) Hi chantra, Thanks for the log. In your log, the timeout happens on TEST_UNIT_READY itself, not the coming REQUEST_SENSE. It seems the behavior somewhat changed under Tejun's patch (or, under different version of firmware?). I was thinking doing some workaround in REQUEST_SENSE. If timeout on the command itself, we have to find other clue. BTW, the timeout is 30sec in this case, that's much longer than 7sec and what I guessed. Do you have more timeout traces (under Tejun's patch) for analysis? (accessing large log file from http://www.debuntu.org is ok for me) Hi Albert, I will boot my box on my libatadebug kernel tonight when I get home. If you have a some code to trigger the REQUEST_SENSE , I will run it. Are cdstatus_1 and cdstatus_2 meant to trigger it? Unfortunately I am not much aware of cd programming, so I can't code it without to much research. Also, could you tell me how to spot a REQUEST_SENSE call. should it be: ata1.01: cmdxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx tag x cdb 0x03 data x (cdb 0x03)? Has I have compiled a light kernel, I can easily apply new patches if required without taking ages :D. Created attachment 11904 [details]
syslog with Tejun's patch
Here's my syslog (untrimmed) with Tejun.
Created attachment 11905 [details]
syslog cdb 0x0 cdb 0x25
I am rerunning the debug kernel. I have add 2 different errors: cdb 0x25 and cdb 0x0 (GPCMD_READ_CDVD_CAPACITY and TEST_UNIT_READY) if I am right.
I have attach a trimmed version, 40 lines befire and after the timeout. Hope it will do.
I will let the machine run and upload a big files later.
by the way, I am using firmware AC01. I have had new errors: START_STOP (0x1b), ALLOW_MEDIUM_REMOVAL (0x1e) and READ_TOC (0x43) Checked both Laurent and chantra's libata traces. With Tejun's patch, Laurent's drive still timeouts on REQUEST_SENSE, but chantra's drive times out on the command itself. I guess this might be due to firmware versions. Hi Laurent, could you please provide your firmware version for reference. (Adding Alan and Jeff to the cc list for advice.) Summary: From the log, most of the timeout happened on 0x25 (READ_CAPACITY) and 0x00 (TEST_UNIT_READY). It seems there is little we can do from libata since the commands timeouts randomly looks normal by libata. The current workarounds for the TSSTcorp TS-L632D problem are: - update firmware to SC03 or some version that works (preferred) - kill hald-addon-storage - keep a disc in the drive However, we might be able to do something in the Linux SCSI layer, as Matthew pointed out, the behavior of sr_media_change() and scsi_test_unit_ready() is strange: When there is no disc in the drive, scsi_test_unit_ready() set sdev->changed = 1 and returns zero. This makes sr_media_change() keep issuing 0x43 and 0x25 to the drive, in a loop of clearing sdev->changed (and later setting sdev->changed by scsi_test_unit_ready). There is a patch by Alan Stern might be helpful to this bug: http://marc.info/?l=linux-kernel&m=115412002912837&w=2 Anyway, maybe the SCSI developers can look into this. Changing the bug component to SCSI. I have a Asus Laptop: firmware version is AS04 can anybody confirm that updating firmware solve this problem? i have no ms windows and this is quite complicated to update for me. thanks for advice yes, I can confirm that a firmware upgrade fixes the issue. check out this thread: https://bugs.launchpad.net/linux/+bug/84603/comments/74 Yes, updating the firmware to SC03 solved the problem for me, too. thank you chantra and rohan, i will try to update too Unfortunately for me, it seems that Asus firmwares AS05 and AS99 don't solve the problem. (Moreover they have been removed from download some days ago and I haven't find another site) Laurent, crossflash to SC03 -- I've done that for my Acer laptop. In windows, the SFNDWIN.exe tool accepts the command line /nocheck or -nocheck , that allows you to flash the drive regardless of the OEM. The SC03 firmware has been working perfectly, and I've not noticed and performance hits.. Created attachment 11935 [details]
more output
Some extra output, trimmed to 40 lines before and after the timeout.
Created attachment 12710 [details] sr_media_change: avoid READ TOC/CAPACITY if no media present (Patch based on http://marc.info/?l=linux-kernel&m=115412002912837&w=2) Hi Rohan and Laurent, Could you please crossflash the firmware of your TS-L632D to the original version and test if the attached patch fixes the TS-L632D timeout problem? (The patch is against the lastest 2.6.23-rc5 kernel) Thanks! i have a dell precision m90 notebook, which also uses this drive. it has firmware DE04 installed. i am using opensuse10.2 with kernel 2.6.18.8. and i am having similar problems with this drive. whenever my cd-burning-program k3b (and some other applications) accesses the drive, the drive firstly "freezes". it takes about 30 seconds to respond. when i check the output of dmesg at this time, i see the following: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen ata2.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0xac data 8 in res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout) ata2: port is slow to respond, please be patient (Status 0xd0) ata2: port failed to respond (30 secs, Status 0xd0) ata2: soft resetting port ata2.00: configured for UDMA/33 ata2: EH complete when i start burning it firstly does this again, and then burning works. so it seems to me, that every access to the drive results in such an "soft reset". if i am not patient enough, and killing the burning software while drive is not respondig, or ejecting the disc, than it gets even worse. then the drive speed will be limited which each soft-reset down to PIO-4... to which kernel versions do these patches from which you are talking about here belong? will a 2.6.22 kernel also have these patches? Hi Albert, I patched a 2.6.22 kernel with http://bugzilla.kernel.org/attachment.cgi?id=12710&action=view Will update this bug report once I have tested it enough. Thanks a million for the patch, regards, chantra ---- http://www.debuntu.org Debuntu deb's repository Debuntu Debian and Ubuntu Tips and Tricks <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN"> <HTML> <HEAD> <META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8"> <META NAME="GENERATOR" CONTENT="GtkHTML/3.16.0"> </HEAD> <BODY> Hi Albert,<BR> <BR> I patched a 2.6.22 kernel with <PRE> <FONT COLOR="#000000"><A HREF="http://bugzilla.kernel.org/attachment.cgi?id=12710&action=view">http://bugzilla.kernel.org/attachment.cgi?id=12710&action=view</A></FONT> </PRE> Will update this bug report once I have tested it enough.<BR> <BR> Thanks a million for the patch,<BR> <BR> regards,<BR> chantra<BR> <TABLE CELLSPACING="0" CELLPADDING="0" WIDTH="100%"> <TR> <TD> <FONT COLOR="#969696">----</FONT> <PRE> <A HREF="http://www.debuntu.org">http://www.debuntu.org</A> <A HREF="http://repository.debuntu.org">Debuntu deb's repository</A> </PRE> <BR> <A HREF="http://www.debuntu.org/"><IMG SRC="cid:1191270106.4195.0.camel@gutsytop" ALIGN="bottom" ALT="Debuntu Debian and Ubuntu Tips and Tricks"></A><BR> <BR> <BR> </TD> </TR> </TABLE> </BODY> </HTML> Spam detection software, running on the system "murdock.linux-foundation.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: Hi Albert, I patched a 2.6.22 kernel with http://bugzilla.kernel.org/attachment.cgi?id=12710&action=view [...] Content analysis details: (6.6 points, 5.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -1.0 OSDL_HEADER_SUBJECT_BRACKETED Subject starts with []'d phrase 0.7 EXTRA_MPART_TYPE Header has extraneous Content-type:...type= entry 3.0 HTML_IMAGE_ONLY_12 BODY: HTML: images with 800-1200 bytes of words 0.0 HTML_MESSAGE BODY: HTML included in message 0.1 HTML_70_80 BODY: Message is 70% to 80% HTML -1.1 BAYES_05 BODY: Bayesian spam probability is 1 to 5% [score: 0.0231] 2.7 HTML_SHORT_LINK_IMG_2 HTML is very short with a linked image 2.1 OSDL_IMAGE_SPAM Nearly all HTML with nothing but images Hi Albert, Arg, the error is back. Here is what I have in dmesg: [ 1152.096000] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen [ 1152.096000] ata1.01: cmd a0/00:00:00:00:20/00:00:00:00:00/b0 tag 0 cdb 0x0 data 0 [ 1152.096000] res 40/00:03:00:00:00/00:00:00:00:00/b0 Emask 0x4 (timeout) [ 1157.136000] ata1: port is slow to respond, please be patient (Status 0xd0) [ 1162.120000] ata1: device not ready (errno=-16), forcing hardreset [ 1162.120000] ata1: soft resetting port [ 1162.472000] ata1.00: configured for UDMA/100 [ 1162.660000] ata1.01: configured for UDMA/33 [ 1162.660000] ata1: EH complete [ 1162.684000] sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors (80026 MB) [ 1162.692000] sd 0:0:0:0: [sda] Write Protect is off [ 1162.692000] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 [ 1162.716000] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 1163.052000] sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors (80026 MB) [ 1163.052000] sd 0:0:0:0: [sda] Write Protect is off [ 1163.052000] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 [ 1163.056000] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA This happened while I was compiling a kernel, it seems that it mainly happens when running resource intensive process but I am not 100% positive on this assumption :). Sorry for the spam earlier on, but my HTML mail was not really accepted by the spam filter. Closing out old stale bug |