Bug 17491

Summary: Reproducible crash on large 64-bit write to SATA device
Product: IO/Storage
Reporter: carl.janzen
Component: Serial ATA
Assignee: Jeff Garzik (jgarzik)
Status: RESOLVED UNREPRODUCIBLE    
Severity: high    
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.31 64bit, 2.6.35 64bit
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Ubuntu 9.10 livecd dmesg
Ubuntu 9.10 64bit livecd lspci
Ubuntu 9.10 64bit livecd log
Smartctl for involved drive
lspci data
kernel 2.6.35 crash (photo)
kernel 2.6.35 crash (photo)

Description carl.janzen 2010-08-30 18:27:28 UTC
Created attachment 28461
Ubuntu 9.10 livecd dmesg

This bug affects recent 64-bit kernels, including the kernel in Ubuntu 10.10 alpha 3 (kernel 2.6.35). The SMART data from the brand-new 2TB Western Digital hard drive involved shows no errors (the motherboard is a brand-new Asus P5Q Pro Turbo). I also tried an older hard drive (also a 2TB Western Digital) and an earlier motherboard (Asus P5B) with the same results.

This error likely affects other hard drives, or most likely has nothing to do with the hard drives at all. I had a problem with corruption of my 4-drive array of 500MB Western Digital drives. Before rebuilding the array I wanted to copy the data from those drives to a 2TB backup, and that is when I started seeing this reproducible crash. Leading up to this point the system had been crashing every other day or so, which suggests to me that the bug probably caused that file system corruption as well.

The kernel on the Fedora 10 live DVD (2.6.27) does not crash, but I did not confirm whether it produces the messages described below.

The kernel on the Ubuntu 9.10 live CD (2.6.31-14-generic) does not crash either, but it produced the attached dmesg, messages, lspci and smartctl files. Judging by the attempts to access blocks past the end of the device, it looks like a 64-bit-specific problem. Convert the block number to hex and it stands out conspicuously.
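
To illustrate what "convert the number to hex" means here, this is a minimal Python sketch. The device size and the out-of-range block number are hypothetical placeholders, not values taken from the attached logs; substitute the numbers from your own dmesg output. The helper stray_high_bits is also just an illustrative name. It returns whatever bits of the requested sector lie above the range needed to address the device, which is where a 64-bit corruption would show up.

```python
def stray_high_bits(sector: int, device_sectors: int) -> int:
    """Return the bits of `sector` that lie above the device's addressable range."""
    # Number of bits needed to address the last valid sector on the device.
    needed_bits = (device_sectors - 1).bit_length()
    mask = (1 << needed_bits) - 1
    return sector & ~mask


# Hypothetical values for illustration only (not from the attached logs).
DEVICE_SECTORS = 3_907_029_168      # assumed: a 2TB drive with 512-byte sectors
requested = 0x0000_0400_1234_5678   # assumed: a block number past the end of the device

print(f"device end : {DEVICE_SECTORS:#018x}")
print(f"requested  : {requested:#018x}")
print(f"stray bits : {stray_high_bits(requested, DEVICE_SECTORS):#018x}")
```

If the stray bits are a single set bit or a clean pattern in the upper 32 bits, that points towards corruption of the value (in software or in hardware) rather than a genuinely oversized request.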

The latest Ubuntu distribution freezes with the keyboard LEDs flashing. I tried to reproduce the problem in text mode so I could take a picture of the trace/panic; that is what the two JPGs are.

The way I have been triggering the bug is with the following command:

dd if=/dev/zero of=/dev/sdb1 bs=2048

There does not seem to be a predictable delay between the start of that command and when it actually crashes/freezes or produces the errors in the log file. Sometimes I can transfer 40GB before the error hits; once it happened immediately. I also noticed that when the device is detected there is a complaint: "device reported invalid CHS sector 0".
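
Because the failure point varies so much, it may help to record how far a write got before it failed. The following is a rough Python sketch of the same kind of zero-fill write as the dd command above, not a replacement for it. The function name write_zeros and its parameters are my own for illustration. It can only report an offset if the kernel returns an error to the process; on a hard freeze nothing will be printed. Like the dd command, it destroys whatever is on the target, so try it on a scratch file first.

```python
import os


def write_zeros(path, block_size=2048, max_bytes=None):
    """Write zero-filled blocks to `path` and return the number of bytes written.

    Stops when `max_bytes` is reached (if given), when the target is full,
    or when the kernel reports a write error.
    """
    block = bytes(block_size)  # one block of zeros, like dd if=/dev/zero bs=2048
    written = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        while max_bytes is None or written < max_bytes:
            try:
                n = os.write(fd, block)
            except OSError as exc:
                # Report the offset so runs can be compared with each other.
                print(f"write failed at byte offset {written}: {exc}", flush=True)
                break
            if n == 0:
                break
            written += n
    finally:
        os.close(fd)
    return written
```

For example, write_zeros("/tmp/scratch.img", max_bytes=1024 * 2048) writes 1024 blocks to a scratch file. Pointing it at a device node such as /dev/sdb1 would need root and would overwrite that partition.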
Comment 1 carl.janzen 2010-08-30 18:28:07 UTC
Created attachment 28471
Ubuntu 9.10 64bit livecd lspci
Comment 2 carl.janzen 2010-08-30 18:29:41 UTC
Created attachment 28481
Ubuntu 9.10 64bit livecd log
Comment 3 carl.janzen 2010-08-30 18:30:30 UTC
Created attachment 28491
Smartctl for involved drive
Comment 4 carl.janzen 2010-08-30 18:30:59 UTC
Created attachment 28501
lspci data
Comment 5 carl.janzen 2010-08-30 18:32:36 UTC
Created attachment 28511
kernel 2.6.35 crash (photo)
Comment 6 carl.janzen 2010-08-30 18:33:45 UTC
Created attachment 28521
kernel 2.6.35 crash (photo)
Comment 7 carl.janzen 2010-08-31 01:29:45 UTC
It seems it was likely hardware after all. I forgot that the new motherboard had been set for overclocking. When it was reset to factory settings, everything was fine. The old motherboard is gone, so I cannot test it. In any case, the problem is not reproducible.