Bug 15945 - Filling disk with data leads to [sda] Unhandled error code. [sda] Result hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Summary: Filling disk with data leads to [sda] Unhandled error code. [sda] Result host...
Status: RESOLVED INSUFFICIENT_DATA
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Jeff Garzik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-05-09 11:49 UTC by Alexander Konovalenko
Modified: 2020-05-28 04:57 UTC (History)
3 users (show)

See Also:
Kernel Version: 2.6.32-21-generic #32-Ubuntu SMP
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Output of smartctl -a /dev/sda (5.97 KB, text/plain)
2010-05-09 11:49 UTC, Alexander Konovalenko
Details
Output of hdparm -I /dev/sda (2.64 KB, text/plain)
2010-05-09 11:50 UTC, Alexander Konovalenko
Details
Output of dmesg (40.03 KB, text/plain)
2010-05-09 12:17 UTC, Alexander Konovalenko
Details
log-eh-race.patch (442 bytes, patch)
2010-05-17 10:17 UTC, Tejun Heo
Details | Diff

Description Alexander Konovalenko 2010-05-09 11:49:02 UTC
Created attachment 26296 [details]
Output of smartctl -a /dev/sda

I'm running things like badblocks -w that fill the whole hard disk with data and then attempt to read it. Once in a while errors like this occur:

May  9 01:13:53 ubuntu kernel: [29838.149173] sd 0:0:0:0: [sda] Unhandled error code
May  9 01:13:53 ubuntu kernel: [29838.149193] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
May  9 01:13:53 ubuntu kernel: [29838.149216] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 01 36 c8 00 00 08 00
May  9 01:13:53 ubuntu kernel: [29838.149272] end_request: I/O error, dev sda, sector 79560

This error occurred while writing to the disk through a dm-crypt device with dd bs=512. But the kernel log says 'Read', so I'm confused whether it actually happened during a read or a write. The system was otherwise idle and no other processes should have accessed the disk.

After an error occurs, the faulty sector and the LBAs around it can be both read and written to without any further errors. So the condition is not reliably reproducible.

There is some evidence that the error messages might indicate a bug in the kernel and not just faulty hardware:

1) The SMART error log is empty and reallocated sector count is 0. The SMART data gives no indication that the disk is aware of any read or write errors.

2) Someone has encountered very similar error messages in a virtual machine guest, while the host OS didn't complain about the actual hardware:
http://forums.virtualbox.org/viewtopic.php?f=3&t=25568

The disk has a sector size of 512 bytes (both physical and logical). Sometimes I used 512-byte blocks for dd (dd bs=512) and badblocks (badblocks -b 512). dd with bs=512 is much slower than with 4096-byte blocks, and when writing through a dm-crypt device it is painfully slow (less than 1 MB/s). Not sure if that can be relevant.

Another interesting thing is that when the laptop is idle it often freezes until I move the mouse or start typing. I observe the freeze by looking at the current time (including seconds) on the GNOME panel which stops updating. Once I move the mouse, the clock catches up, but it still tends to be slow, roughly 10 minutes per hour or so. (date output is consistent with the GNOME clock.) That's certainly a separate bug. I'm describing it here in case it might be related to the I/O timeout.

How can I further debug this?

I'm not sure how to tell unstable or faulty hardware (motherboard, CPU, memory, or the disk itself) from a kernel problem here. I tested the memory with Memtest86+ and it found no errors. Everything else seems to work fine. Any pointers would be highly appreciated.

I'm running Ubuntu 10.04 from a live DVD. The Linux kernel package version is 2.6.32-21.32; it is based on mainline kernel 2.6.32.11+drm33.2. Hardware: an Asus Eee PC 1101HA laptop with a Seagate Momentus 5400.6 SATA 250 GB HDD (model ST9250315AS).
Comment 1 Alexander Konovalenko 2010-05-09 11:50:10 UTC
Created attachment 26297 [details]
Output of hdparm -I /dev/sda
Comment 2 Tejun Heo 2010-05-09 11:52:56 UTC
Can you please post the output of dmesg after such problem occurred?  Please turn on printk timestamp if possible.

Thanks.
Comment 3 Alexander Konovalenko 2010-05-09 12:17:35 UTC
Created attachment 26298 [details]
Output of dmesg

Here it is. Are the timestamps in square brackets at the start of each line OK? Or should I turn on some other timestamps?
Comment 4 Tejun Heo 2010-05-17 10:17:41 UTC
Created attachment 26409 [details]
log-eh-race.patch

Sorry about the delay.  Yeap, those timestamps are good.  Something definitely is fishy.  I can't see how you can get READ failure without libata complaining first.  The read can be from dm-crypt if you're writing blocks which are smaller than encryption unit size.

Can you please apply the attached patch, boot with ignore_loglevel kernel parameter, trigger the problem and post the dmesg output?

Thank you.

Note You need to log in before you can comment on or make changes to this bug.