Bug 13982

Summary: [libata] (?) causing Hardlock in 2.6.30.4 during simultaneous read & write
Product: IO/Storage Reporter: Wylda (wylda)
Component: SCSI Assignee: linux-scsi (linux-scsi)
Status: RESOLVED DUPLICATE    
Severity: normal CC: devzero, hilld, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.30.4 Subsystem:
Regression: No Bisected commit-id:
Attachments: kernel config
dmesg
lspci -v
First trace during 5days of testing/restarting...

Description Wylda 2009-08-14 08:26:27 UTC
Created attachment 22713 [details]
kernel config

Hi.

HW: Server Board Intel STL2, 2x P3 @ 1GHz, 1GB ECC RAM

SW: self-compiled kernel 2.6.30.4 on Debian Lenny

Symptom: PC completely stops responding (ping, ALT+F2..., Numlock,
CTRL-ALT-DEL, ALT-SysRq)

Traces: No Oops, nothing in syslog etc.


I don't think it's a HW failure, because it never happened when running:

 * 2x dd if=/dev/zero bs=1M count=200000 | md5sum -b

 * 2x dd if=/dev/zero of=test-x bs=1M count=200000

These tests take a long time on this HW (51min and 85min) and the checksums are
always OK. Tested many times.



Anyway, I'm usually able to trigger the Hardlock within 2min. I use this script:

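#!/bin/bash

# Two CPU-bound pipelines checksumming zeros (no disk I/O)
dd if=/dev/zero bs=1M count=200000 | md5sum -b &
dd if=/dev/zero bs=1M count=200000 | md5sum -b &
# Two concurrent read-and-verify passes over files on the HDD
cd /home/pik/a
md5sum -c office.md5 &
cd /home/pik/b
md5sum -c office.md5 &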

So I run this stress script _and_ start an FTP write to the same HDD. Usually it
hardlocks by itself, but if there is no Hardlock within 60sec I can help it
along with another dd (dd if=/dev/zero of=test1 bs=1M count=200000).
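
The procedure above could be wrapped into a single script, roughly as below.
This is only a sketch: the function names (read_stress, verify_stress,
write_stress) are made up for illustration, and using a local dd in place of
the FTP transfer assumes that what matters is a concurrent write to the same HDD.

```shell
#!/bin/bash
# Hypothetical helpers; each one starts its load in the background and
# returns immediately, so the caller decides when to "wait".

# read_stress N COUNT_MB: start N "dd | md5sum" pipelines over zeros.
read_stress() {
    local n=$1 count=$2 i
    for ((i = 0; i < n; i++)); do
        dd if=/dev/zero bs=1M count="$count" 2>/dev/null | md5sum -b &
    done
}

# verify_stress DIR LIST: re-read and checksum the files listed in DIR/LIST.
verify_stress() {
    ( cd "$1" && md5sum -c "$2" ) &
}

# write_stress FILE COUNT_MB: write COUNT_MB MiB of zeros to FILE.
write_stress() {
    dd if=/dev/zero of="$1" bs=1M count="$2" 2>/dev/null &
}
```

With these helpers, the equivalent of the test described above would be
something like:

    read_stress 2 200000
    verify_stress /home/pik/a office.md5
    verify_stress /home/pik/b office.md5
    write_stress test1 200000
    wait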


More reasons why this should not be a HW failure: no complaints from EDAC, and
it happens on different HW:

 * PATA drive IC35L040AVVA07 on ServerWorks OSB4 (MOBO's chipset aka IB6566
South Bridge)

 * SATA drives 2xWD5000AADS in md0 on Sil3114

 * Network card: PCI-X, Intel 1Gbps 82543GC

 * Network card: PCI Realtek RTL8139


Today, when doing the last test for this bug report, there was a trace, but the
Hardlock was not 100% the same (as always, ping stopped working, console
switching did not work and Numlock did not react, but Alt-SysRq worked). Hope
it's not misleading - see attachment.


Another proof(?) that this is not a HW failure:
  * never happens with Debian's 2.6.26-17lenny1 all_generic_ide=1 gcc4.1.3
  * easy to trigger with 2.6.30.4 gcc4.3.2

...I know these differ in kernel version, kernel parameters and gcc, but a HW error would occur either way.


Kernel config, dmesg and lspci attached.
Comment 1 Wylda 2009-08-14 08:28:52 UTC
Created attachment 22714 [details]
dmesg
Comment 2 Wylda 2009-08-14 08:29:23 UTC
Created attachment 22715 [details]
lspci -v
Comment 3 Wylda 2009-08-14 08:30:36 UTC
Created attachment 22716 [details]
First trace during 5days of testing/restarting...
Comment 4 Roland Kletzing 2009-08-15 16:48:29 UTC
this may be http://bugzilla.kernel.org/show_bug.cgi?id=13933
(someone mark it as duplicate then)

can you check if 2.6.29 works without problems?
Comment 5 David Hill 2009-08-15 17:45:39 UTC
Same network card as in bug 13219
Comment 6 Roland Kletzing 2009-08-15 18:01:07 UTC
is it possible to temporarily disable/remove the realtek nic?
Comment 7 Rafael J. Wysocki 2009-08-15 20:33:44 UTC

*** This bug has been marked as a duplicate of bug 13933 ***