Bug 17491 - Reproducible crash on large 64bit write to sata device
Status: RESOLVED UNREPRODUCIBLE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All Linux
Importance: P1 high
Assignee: Jeff Garzik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-30 18:27 UTC by carl.janzen
Modified: 2010-08-31 01:29 UTC
CC List: 0 users

See Also:
Kernel Version: 2.6.31 64bit, 2.6.35 64bit
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Ubuntu 9.10 livecd dmesg (64.62 KB, text/plain)
2010-08-30 18:27 UTC, carl.janzen
Ubuntu 9.10 64bit livecd lspci (14.79 KB, text/plain)
2010-08-30 18:28 UTC, carl.janzen
Ubuntu 9.10 64bit livecd log (71.01 KB, application/octet-stream)
2010-08-30 18:29 UTC, carl.janzen
Smartctl for involved drive (4.60 KB, text/plain)
2010-08-30 18:30 UTC, carl.janzen
lspci data (14.79 KB, text/plain)
2010-08-30 18:30 UTC, carl.janzen
kernel 2.6.35 crash (photo) (898.60 KB, image/jpeg)
2010-08-30 18:32 UTC, carl.janzen
kernel 2.6.35 crash (photo) (926.39 KB, image/jpeg)
2010-08-30 18:33 UTC, carl.janzen

Description carl.janzen 2010-08-30 18:27:28 UTC
Created attachment 28461 [details]
Ubuntu 9.10 livecd dmesg

This bug affects recent 64-bit kernels, including the kernel in Ubuntu 10.10 alpha 3 (kernel 2.6.35). The SMART data from the drive involved, a brand-new 2TB Western Digital, shows no errors (the motherboard is a brand-new Asus P5Q Pro Turbo). I tried an older hard drive (also a 2TB Western Digital) on an earlier motherboard (Asus P5B) with the same results.

This error likely affects other hard drives too, or more likely has nothing to do with the hard drives at all. I had a problem with corruption on my 4-drive array of 500MB Western Digital drives. Before rebuilding the array I wanted to copy the data from those drives to a 2TB backup, and that is when I started seeing this reproducible crash. Before that, the system had been crashing every other day or so, which suggests to me that this bug probably caused the file system corruption as well.

The kernel on the Fedora 10 live DVD (2.6.27) does not crash, but I did not confirm whether it produces the messages described below.

The kernel on the Ubuntu 9.10 live CD (2.6.31-14-generic) does not crash either, but it produced the attached dmesg, messages, lspci and smartctl files. Judging by the attempts to access blocks past the end of the device, it looks like a 64-bit-specific problem. Convert the block number to hex and it stands out conspicuously.
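
The hex check described above can be done from a shell. This is only a minimal sketch: the helper names `to_hex` and `past_end` are hypothetical, and it assumes the block number is copied by hand from the log and that the device size is given in 512-byte sectors (for example as printed by `blockdev --getsz`).

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a block number taken from the log.
# Nothing here is taken from the attached files.

# Print a decimal block/sector number in hexadecimal.
to_hex() {
    printf '0x%x\n' "$1"
}

# Exit status 0 when the requested sector lies at or past the end of the device.
#   $1 = requested sector (decimal)
#   $2 = device size in 512-byte sectors
past_end() {
    [ "$1" -ge "$2" ]
}

# Example against a real device (needs root; the device name is an assumption):
#   size=$(blockdev --getsz /dev/sdb)
#   past_end "$sector" "$size" && to_hex "$sector"
```

A value with unexpected high bits set is much easier to spot in the hexadecimal form than in decimal.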

The latest Ubuntu distribution freezes up with the keyboard LEDs flashing. I tried to reproduce the problem in text mode so I could take a picture of the trace/panic. That's what the two JPGs are.

The way I have been triggering the bug is with the following command:

dd if=/dev/zero of=/dev/sdb1 bs=2048

There does not seem to be a predictable delay between starting that command and the crash/freeze or the errors appearing in the log file. Sometimes I can transfer 40GB before the error hits; once it happened immediately. I also noticed that when the device is detected there is a complaint: "device reported invalid CHS sector 0".
Comment 1 carl.janzen 2010-08-30 18:28:07 UTC
Created attachment 28471 [details]
Ubuntu 9.10 64bit livecd lspci
Comment 2 carl.janzen 2010-08-30 18:29:41 UTC
Created attachment 28481 [details]
Ubuntu 9.10 64bit livecd log
Comment 3 carl.janzen 2010-08-30 18:30:30 UTC
Created attachment 28491 [details]
Smartctl for involved drive
Comment 4 carl.janzen 2010-08-30 18:30:59 UTC
Created attachment 28501 [details]
lspci data
Comment 5 carl.janzen 2010-08-30 18:32:36 UTC
Created attachment 28511 [details]
kernel 2.6.35 crash (photo)
Comment 6 carl.janzen 2010-08-30 18:33:45 UTC
Created attachment 28521 [details]
kernel 2.6.35 crash (photo)
Comment 7 carl.janzen 2010-08-31 01:29:45 UTC
Seems it was likely hardware after all. I forgot that the new motherboard had been set up for overclocking. When it was reset to factory settings, everything was fine. The old one is gone, so I cannot test it. In any case, not reproducible.
