Bug 10887
Summary: | Kernel report fake device busy intermittently | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Masanari Iida (standby24x7) |
Component: | IDE | Assignee: | Bartlomiej Zolnierkiewicz (bzolnier) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | ||
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.25 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | [PATCH] ide: increase timeout in wait_drive_not_busy() |
Reassigned to IO/storage:IDE Let me ask you a question. The linux kernel retry 100 times, so wait 1m sec to monitor the status. But why the developer chose 100 times retries? Is this documented on ATAPI paper? Created attachment 16492 [details]
[PATCH] ide: increase timeout in wait_drive_not_busy()
Hi, please see if the attached patch fixes the problem.
w.r.t. timeout value: according to the comment we already use value greater than defined in the ATA/ATAPI spec (but it would be great if somebody can re-check it)
Thank you for supporting this issue. I was uncomfortable in simply increasing this retry count because I did not consider adding any form of delay in to code that executes within an interrupt context a good idea; I would like to show you call flow as shown by systemtap collected on a RHEL4 kernel. Vanilla kernel should work as same as this. ## Good request trace: 0 swapper(0): -> ide_intr (1243213214313202), cpu# 1 10 swapper(0): -> drive_is_ready (1243213214335822), cpu# 1 20 swapper(0): -> ide_inb (1243213214361474), cpu# 1 31 swapper(0): <- ide_inb (1243213214386490), cpu# 1 39 swapper(0): <- drive_is_ready (1243213214405662), cpu# 1 49 swapper(0): -> task_in_intr (1243213214430530), cpu# 1 60 swapper(0): -> ide_inb (1243213214455770), cpu# 1 70 swapper(0): <- ide_inb (1243213214480618), cpu# 1 81 swapper(0): -> task_buffer_sectors (1243213214506462), cpu# 1 94 swapper(0): -> taskfile_input_data (1243213214539922), cpu# 1 107 swapper(0): -> ata_input_data (1243213214571146), cpu# 1 121 swapper(0): -> ide_insw (1243213214605714), cpu# 1 386 swapper(0): <- ide_insw (1243213215242606), cpu# 1 396 swapper(0): <- ata_input_data (1243213215264574), cpu# 1 405 swapper(0): <- taskfile_input_data (1243213215288010), cpu# 1 414 swapper(0): <- task_buffer_sectors (1243213215306626), cpu# 1 425 swapper(0): -> wait_drive_not_busy (1243213215333458), cpu# 1 437 swapper(0): -> ide_inb (1243213215361894), cpu# 1 447 swapper(0): <- ide_inb (1243213215387442), cpu# 1 468 swapper(0): -> ide_inb (1243213215436266), cpu# 1 ......... success after 11 retries 743 swapper(0): -> ide_inb (1243213216095630), cpu# 1 753 swapper(0): <- ide_inb (1243213216121602), cpu# 1 773 swapper(0): -> ide_inb (1243213216169802), cpu# 1 784 swapper(0): <- ide_inb (1243213216194814), cpu# 1 792 swapper(0): <- wait_drive_not_busy (1243213216214366), cpu# 1 803 swapper(0): -> task_end_request (1243213216239386), cpu# 1 814 swapper(0): -> default_end_request (1243213216266998), cpu# 1 827 swapper(0): -> ide_end_request (1243213216299342), cpu# 1 839 swapper(0): -> __ide_end_request (1243213216330186), cpu# 1 853 swapper(0): <- __ide_end_request (1243213216365674), cpu# 1 864 swapper(0): <- ide_end_request (1243213216388058), cpu# 1 872 swapper(0): <- default_end_request (1243213216407238), cpu# 1 881 swapper(0): <- task_end_request (1243213216427218), cpu# 1 889 swapper(0): <- task_in_intr (1243213216445942), cpu# 1 899 swapper(0): -> ide_do_request (1243213216471294), cpu# 1 909 swapper(0): <- ide_do_request (1243213216493054), cpu# 1 1011 swapper(0): <- ide_intr (1243213216738754), cpu# 1 0 swapper(0): -> ide_intr (1243213216762710), cpu# 1 11 swapper(0): -> ide_inb (1243213216788322), cpu# 1 21 swapper(0): <- ide_inb (1243213216812170), cpu# 1 29 swapper(0): <- ide_intr (1243213216830854), cpu# 1 ## Bad trace case: 0 swapper(0): -> ide_intr (1243481579247902), cpu# 1 11 swapper(0): -> drive_is_ready (1243481579271978), cpu# 1 22 swapper(0): -> ide_inb (1243481579299038), cpu# 1 32 swapper(0): <- ide_inb (1243481579324874), cpu# 1 42 swapper(0): <- drive_is_ready (1243481579346102), cpu# 1 52 swapper(0): -> task_in_intr (1243481579371050), cpu# 1 63 swapper(0): -> ide_inb (1243481579400282), cpu# 1 75 swapper(0): <- ide_inb (1243481579425914), cpu# 1 85 swapper(0): -> task_buffer_sectors (1243481579450638), cpu# 1 97 swapper(0): -> taskfile_input_data (1243481579480066), cpu# 1 108 swapper(0): -> ata_input_data (1243481579508678), cpu# 1 121 swapper(0): -> ide_insw (1243481579538806), cpu# 1 392 swapper(0): <- ide_insw (1243481580189762), cpu# 1 401 swapper(0): <- ata_input_data (1243481580212802), cpu# 1 411 swapper(0): <- taskfile_input_data (1243481580234210), cpu# 1 419 swapper(0): <- task_buffer_sectors (1243481580252390), cpu# 1 430 swapper(0): -> wait_drive_not_busy (1243481580277614), cpu# 1 441 swapper(0): -> ide_inb (1243481580304758), cpu# 1 451 swapper(0): <- ide_inb (1243481580330622), cpu# 1 ........ exceeded the 100 retries 3592 swapper(0): -> ide_inb (1243481587868790), cpu# 1 3602 swapper(0): <- ide_inb (1243481587894306), cpu# 1 3611 swapper(0): <- wait_drive_not_busy (1243481587914714), cpu# 1 3622 swapper(0): -> ide_cdrom_error (1243481587941562), cpu# 1 By increasing the retry count, we could end up running this interrupt handler for up to 10ms. This seems to contravene all the rules of writing good interrupt handlers. Shouldn't this kind of activity be potentially deferred to a tasklet or workqueue?? What happens to our other handlers during this time, for example, clock or disk...?? Admittedly this increased delay loop only happens corner cases, but poor code design is poor code design....?? Thanks > By increasing the retry count, we could end up running this interrupt > handler for up to 10ms. This seems to contravene all the rules of > writing good interrupt handlers. Shouldn't this kind of activity be > potentially deferred to a tasklet or workqueue?? What happens to our > other handlers during this time, for example, clock or disk...?? Usually we unmask other IRQs while running IRQ handler so they also have a chance to be serviced (hdparm -u command, some host drivers enable it by default) but ideally PIO handling should be done using tasklet or workqueue (as you've noticed). > Admittedly this increased delay loop only happens corner cases, > but poor code design is poor code design....?? Fully agreed and patches to improve the current situation are always welcomed. In the meantime fixing the bug by increasing timeout is OK. I have confirmed increase the retry count from 100 to 1000 fixed my problem. The test was done with TEAC DV-W28ECW. Will you include this patch into Linus's tree? Let me know if this happen on current -rc series, or after the official 2.6.26 release. Tested-by: Masanari Iida <standby24x7@gmail.com> I have confirmed this one was fixed on kernel 2.6.26. You may close this bug. Thank you. Thank you for verifying the fix and sorry for the late update. The commit in Linus' tree is: commit f54feafa6d47d0aa1a96adefdc763b708b02f94f Author: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Date: Fri Jun 20 20:53:33 2008 +0200 ide: increase timeout in wait_drive_not_busy() Some ATAPI devices take longer than the current max timeout value to become ready (i.e. TEAC DV-W28ECW takes 6 ms) so increase the timeout value to 10 ms. This fixes kernel.org bugzilla bug #10887: http://bugzilla.kernel.org/show_bug.cgi?id=10887 Reported-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> I'm closing the bug. |
Distribution: RHEL4 + vanilla kernel (2.6.25) Hardware Environment:TEAC DV-W28ECW Software Environment: Problem Description: When I access /proc/ide/hda/identify, following error is seen intermittently. kernel: hda: task_in_intr: status=0xd0 { Busy } kernel: hda: task_in_intr: error=0xd0LastFailedSense 0x0d Steps to reproduce: # cat /proc/ide/hda/identify This symptom occured very rare. (ex) 22 times in 60000 access. Analyze: After DVD drive return the identify data, the linux kernel wait 1ms to see if the drive status back to "Ready". But this TEAC DV-W28ECW drive, average retry count was 22 times. But in some cases, it exceed the 1ms (100 times), but it just takes a bit longer time... 6m sec. drivers/ide/ide-taskfile.c static u8 wait_drive_not_busy(ide_drive_t *drive) { int retries; u8 stat; /* * Last sector was transfered, wait until drive is ready. * This can take up to 10 usec, but we will wait max 1 ms. */ for (retries = 0; retries < 100; retries++) { stat = ide_read_status(drive); if (stat & BUSY_STAT) udelay(10); else break; } if (stat & BUSY_STAT) printk(KERN_ERR "%s: drive still BUSY!\n", drive->name); return stat; } > * Last sector was transfered, wait until drive is ready. > * This can take up to 10 usec, but we will wait max 1 ms. This part is not documented within ATAPI specification. So the Drive vender never take care of the retry. In order to support some corner cases drive, I expect retry count in the loop need to be adjustable by /proc or /sys. Then just set the same value, 100 as a default.