Bug 15945

Summary: Filling disk with data leads to [sda] Unhandled error code. [sda] Result hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Product: IO/Storage Reporter: Alexander Konovalenko (alexkon)
Component: Serial ATAAssignee: Jeff Garzik (jgarzik)
Status: RESOLVED INSUFFICIENT_DATA    
Severity: normal CC: alan, ovilewade9, tj
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.32-21-generic #32-Ubuntu SMP Subsystem:
Regression: No Bisected commit-id:
Attachments: Output of smartctl -a /dev/sda
Output of hdparm -I /dev/sda
Output of dmesg
log-eh-race.patch

Description Alexander Konovalenko 2010-05-09 11:49:02 UTC
Created attachment 26296 [details]
Output of smartctl -a /dev/sda

I'm running things like badblocks -w that fill the whole hard disk with data and then attempt to read it. Once in a while errors like this occur:

May  9 01:13:53 ubuntu kernel: [29838.149173] sd 0:0:0:0: [sda] Unhandled error code
May  9 01:13:53 ubuntu kernel: [29838.149193] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
May  9 01:13:53 ubuntu kernel: [29838.149216] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 01 36 c8 00 00 08 00
May  9 01:13:53 ubuntu kernel: [29838.149272] end_request: I/O error, dev sda, sector 79560

This error occurred while writing to the disk through a dm-crypt device with dd bs=512. But the kernel log says 'Read', so I'm confused whether it actually happened during a read or a write. The system was otherwise idle and no other processes should have accessed the disk.

After an error occurs, the faulty sector and the LBAs around it can be both read and written to without any further errors. So the condition is not reliably reproducible.

There is some evidence that the error messages might indicate a bug in the kernel and not just faulty hardware:

1) The SMART error log is empty and reallocated sector count is 0. The SMART data gives no indication that the disk is aware of any read or write errors.

2) Someone has encountered very similar error messages in a virtual machine guest, while the host OS didn't complain about the actual hardware:
http://forums.virtualbox.org/viewtopic.php?f=3&t=25568

The disk has a sector size of 512 bytes (both physical and logical). Sometimes I used 512-byte blocks for dd (dd bs=512) and badblocks (badblocks -b 512). dd with bs=512 is much slower than with 4096-byte blocks, and when writing through a dm-crypt device it is painfully slow (less than 1 MB/s). Not sure if that can be relevant.

Another interesting thing is that when the laptop is idle it often freezes until I move the mouse or start typing. I observe the freeze by looking at the current time (including seconds) on the GNOME panel which stops updating. Once I move the mouse, the clock catches up, but it still tends to be slow, roughly 10 minutes per hour or so. (date output is consistent with the GNOME clock.) That's certainly a separate bug. I'm describing it here in case it might be related to the I/O timeout.

How can I further debug this?

I'm not sure how to tell unstable or faulty hardware (motherboard, CPU, memory, or the disk itself) from a kernel problem here. I tested the memory with Memtest86+ and it found no errors. Everything else seems to work fine. Any pointers would be highly appreciated.

I'm running Ubuntu 10.04 from a live DVD. The Linux kernel package version is 2.6.32-21.32; it is based on mainline kernel 2.6.32.11+drm33.2. Hardware: an Asus Eee PC 1101HA laptop with a Seagate Momentus 5400.6 SATA 250 GB HDD (model ST9250315AS).
Comment 1 Alexander Konovalenko 2010-05-09 11:50:10 UTC
Created attachment 26297 [details]
Output of hdparm -I /dev/sda
Comment 2 Tejun Heo 2010-05-09 11:52:56 UTC
Can you please post the output of dmesg after such problem occurred?  Please turn on printk timestamp if possible.

Thanks.
Comment 3 Alexander Konovalenko 2010-05-09 12:17:35 UTC
Created attachment 26298 [details]
Output of dmesg

Here it is. Are the timestamps in square brackets at the start of each line OK? Or should I turn on some other timestamps?
Comment 4 Tejun Heo 2010-05-17 10:17:41 UTC
Created attachment 26409 [details]
log-eh-race.patch

Sorry about the delay.  Yeap, those timestamps are good.  Something definitely is fishy.  I can't see how you can get READ failure without libata complaining first.  The read can be from dm-crypt if you're writing blocks which are smaller than encryption unit size.

Can you please apply the attached patch, boot with ignore_loglevel kernel parameter, trigger the problem and post the dmesg output?

Thank you.