17491 – Reproducible crash on large 64bit write to sata device


Summary: Reproducible crash on large 64bit write to sata device

Status:	RESOLVED UNREPRODUCIBLE

Alias:	None

Product:	IO/Storage
Classification:	Unclassified
Component:	Serial ATA
Hardware:	All Linux

Importance:	P1 high
Assignee:	Jeff Garzik

URL:
Keywords:

Depends on:
Blocks:

Reported:	2010-08-30 18:27 UTC by carl.janzen
Modified:	2010-08-31 01:29 UTC
CC List:	0 users

See Also:
Kernel Version:	2.6.31 64bit, 2.6.35 64bit
Subsystem:
Regression:	No
Bisected commit-id:

Attachments
Ubuntu 9.10 livecd dmesg (64.62 KB, text/plain) 2010-08-30 18:27 UTC, carl.janzen
Ubuntu 9.10 64bit livecd lspci (14.79 KB, text/plain) 2010-08-30 18:28 UTC, carl.janzen
Ubuntu 9.10 64bit livecd log (71.01 KB, application/octet-stream) 2010-08-30 18:29 UTC, carl.janzen
Smartctl for involved drive (4.60 KB, text/plain) 2010-08-30 18:30 UTC, carl.janzen
lspci data (14.79 KB, text/plain) 2010-08-30 18:30 UTC, carl.janzen
kernel 2.6.35 crash (photo) (898.60 KB, image/jpeg) 2010-08-30 18:32 UTC, carl.janzen
kernel 2.6.35 crash (photo) (926.39 KB, image/jpeg) 2010-08-30 18:33 UTC, carl.janzen

Description carl.janzen 2010-08-30 18:27:28 UTC

Created attachment 28461 [details]
Ubuntu 9.10 livecd dmesg

This bug affects recent 64-bit kernels, including the kernel in Ubuntu 10.10 alpha 3 (kernel 2.6.35). The SMART data from the drive involved, a brand-new 2TB Western Digital hard drive, shows no errors (the motherboard is a brand-new Asus P5Q Pro Turbo). I also tried an older hard drive (also a 2TB Western Digital) with an earlier motherboard (Asus P5B), with the same results.

This error likely affects other hard drives, or most likely has nothing to do with the hard drives at all. I had a problem with corruption on my 4-drive array of 500MB Western Digital drives. Before rebuilding the array I wanted to copy the data from those drives to a 2TB backup, and that is when I started seeing this reproducible crash. Leading up to this point the system had been crashing every other day or so, which suggests to me that the bug probably caused that file system corruption as well.

The kernel on the Fedora 10 live DVD (2.6.27) does not crash, but I did not confirm whether it produces the messages described below.

The kernel on the Ubuntu 9.10 live CD (2.6.31-14-generic) does not crash either, but it produced the attached dmesg, messages, lspci and smartctl files. Judging by the attempts to access blocks past the end of the device, it looks like a 64-bit-specific problem. Convert the block number to hex and it stands out conspicuously.
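For anyone checking similar "access beyond end of device" messages, the hex comparison can be sketched as below. This is only an illustration: the function name describe_sector is made up, the device size is the usual sector count for a 2TB drive with 512-byte sectors (an assumption, not read from the attached logs), and the sample sector value is hypothetical rather than taken from the dmesg attachment.

```python
def describe_sector(sector: int, device_sectors: int) -> str:
    """Summarize a sector number taken from a kernel I/O error message."""
    # A request at or beyond the device size is what the kernel complains about.
    past_end = sector >= device_sectors
    # Anything set in the upper 32 bits of a 64-bit sector number is
    # implausible for a 2TB drive and hints at a corrupted value.
    high_bits = sector >> 32
    return (f"sector={sector} hex={sector:#018x} "
            f"high32={high_bits:#x} past_end={past_end}")


# ASSUMPTION: typical sector count of a 2TB drive with 512-byte sectors.
DEVICE_SECTORS = 3907029168

# HYPOTHETICAL sector value, for illustration only (not from the attached logs).
print(describe_sector(0xFFFFFFFF00001000, DEVICE_SECTORS))
# A plausible in-range sector, for comparison.
print(describe_sector(2048, DEVICE_SECTORS))
```

A value whose upper 32 bits are all set, next to an otherwise sane low half, is the kind of pattern that stands out once the number is printed in hex.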

The latest Ubuntu distribution freezes up with the keyboard LEDs flashing. I tried to reproduce the problem in text mode so I could take a picture of the trace/panic. That is what the two JPGs are.

I have been triggering the bug with the following command:

dd if=/dev/zero of=/dev/sdb1 bs=2048

There does not seem to be a predictable delay between the start of that command and the point where it crashes, freezes, or produces the errors in the log file. Sometimes I can transfer 40GB before the error hits. Once it happened immediately. I also noticed that when the device is detected there is a complaint: "device reported invalid CHS sector 0".
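Since the failure point varies so much, it may help to record how far the write got before things went wrong. The following is a rough sketch of a dd-like zero writer that reports its offset; it is not what was run for this report. The function name write_zeros and its parameters are made up for illustration. Like the dd command above, it destroys whatever is on the target.

```python
import os
import sys
from typing import Optional


def write_zeros(path: str, block_size: int = 2048,
                max_bytes: Optional[int] = None,
                report_every: int = 1 << 30) -> int:
    """Write zero-filled blocks to path and return the number of bytes written.

    Stops when a write fails (for example at the end of the device) or once
    at least max_bytes have been written, if max_bytes is given.
    """
    block = bytes(block_size)  # a block of zero bytes
    written = 0
    next_report = report_every
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        while max_bytes is None or written < max_bytes:
            try:
                n = os.write(fd, block)
            except OSError as exc:
                # Record the offset of the failing write.
                print(f"write failed at byte offset {written}: {exc}",
                      file=sys.stderr, flush=True)
                break
            if n == 0:
                break
            written += n
            if written >= next_report:
                # Periodic progress line, flushed immediately.
                print(f"{written} bytes written", file=sys.stderr, flush=True)
                next_report += report_every
    finally:
        os.close(fd)
    return written


if __name__ == "__main__" and len(sys.argv) > 1:
    # WARNING: overwrites the target given on the command line.
    print(write_zeros(sys.argv[1]))
```

On a hard freeze the last progress line may only survive if the output goes to a serial console or another machine.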

Comment 1 carl.janzen 2010-08-30 18:28:07 UTC

Created attachment 28471 [details]
Ubuntu 9.10 64bit livecd lspci

Comment 2 carl.janzen 2010-08-30 18:29:41 UTC

Created attachment 28481 [details]
Ubuntu 9.10 64bit livecd log

Comment 3 carl.janzen 2010-08-30 18:30:30 UTC

Created attachment 28491 [details]
Smartctl for involved drive

Comment 4 carl.janzen 2010-08-30 18:30:59 UTC

Created attachment 28501 [details]
lspci data

Comment 5 carl.janzen 2010-08-30 18:32:36 UTC

Created attachment 28511 [details]
kernel 2.6.35 crash (photo)

Comment 6 carl.janzen 2010-08-30 18:33:45 UTC

Created attachment 28521 [details]
kernel 2.6.35 crash (photo)

Comment 7 carl.janzen 2010-08-31 01:29:45 UTC

Seems it was likely hardware after all. I forgot that the new motherboard had been set up for overclocking. When reset to factory settings, everything was fine. The old motherboard is gone, so I cannot test it. In any case, not reproducible.
