Distribution: Debian GNU/Linux 6.0.1

Hardware Environment:
Motherboard: ASUS E35M1-M PRO
CPU: AMD E-350 Processor 1600 MHz
RAM: 1GB RAM DDR3 DIMM 1333
BIOS: American Megatrends Inc., 0506
Ethernet: Realtek PCI-E RTL8111/8168B
Storage: SATA controller, 2xST2000DL003-9VT1, 2xSAMSUNG HD204UI

Software Environment:
1) kernel raid6
Doing raid6 resync on a newly created array. Not mounted.
Resync started by 'echo 0 > /dev/md0'.
The array consists of /dev/sda4 and /dev/sd[bcd].
/dev/sda4 is the entire first disk minus the first 20 GB.
2) netcat
Causing network traffic using 'nc 192.168.0.11 6789 < /dev/zero > /dev/null'.
The other machine does 'nc -l -p 6789 < /dev/zero > /dev/null', and this saturates its 100 Mbit/s interface, 12 MB/s duplex.

Problem Description:
With either the raid6 resync or the network load running by itself, the error does not reproduce. With both running at the same time, the machine reboots suddenly, usually after 5-20 s but sometimes only after a few minutes.

After removing the 'quiet' boot option, various panic messages appear instead of a sudden reboot most of the time.

The hang/crash does not reproduce when using a PCI-E network card of the same kind instead of the one built into the motherboard. 04 is motherboard, 03 is add-on card, lspci:
03:00.0 0200: 10ec:8168 (rev 01)
04:00.0 0200: 10ec:8168 (rev 06) <-- makes problem

The motherboard has been exchanged, same effect on both specimens. A different memory stick has also been used, same effect.

The hard drives have been replaced two at a time by older Maxtor 7L300S0 drives, i.e. 2 new and 2 old. The error occurs with both combinations. I could not reproduce it when using a set of (four) old disks alone, perhaps due to too little load. Also, raid6 on only 3 disks makes it much harder to reproduce.

Another PSU has also been tried, same effect.

Running badblocks instead of raid6 resync on the disks (together with the network load) does not produce the error, at least within several hours.
The error has been seen with Debian's 2.6.32 stable kernel and their recent 2.6.38 snapshot. It has also been reproduced with kernel.org 2.6.27.58 (earlier versions not tested), 2.6.38.2, linux/kernel/git/torvalds/linux-2.6.git, and linux/kernel/git/next/linux-next.git. The kernel.org kernels use the default configuration plus r8169 and raid6 compiled in.

Attached are a number of panic message samples.

With the debug boot option, a panic happened, but no additional information appeared.

With nosmp the problem has not reproduced. Also, with nosmp the messages 'r8169 0000:04:00.0: eth0: link up' appear only once per hour or so. Probably one core is not enough to create the load needed.

Could some kernel debug options help?
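The combined load described in the report can be sketched as a small shell helper. This is only an illustration, not the reporter's actual procedure: the function names (`resync_mentioned`, `net_load`, `repro`) and the `MDSTAT` override are hypothetical, the address and port are the ones quoted in the report, and the sketch assumes the raid6 resync has already been started separately. By default it only prints the netcat command; nothing is started unless `run` is passed.

```shell
#!/bin/sh
# Sketch only: pairs the network load from the report with a running resync.
PEER="${PEER:-192.168.0.11}"      # receiver address quoted in the report
PORT="${PORT:-6789}"              # port quoted in the report
MDSTAT="${MDSTAT:-/proc/mdstat}"  # md status file; overridable (assumption)

# Succeeds if the md status file mentions a resync.
resync_mentioned() {
    grep -q 'resync' "$MDSTAT" 2>/dev/null
}

# Prints the netcat command from the report.
# Pass "run" as the first argument to start it in the background instead.
net_load() {
    cmd="nc $PEER $PORT < /dev/zero > /dev/null"
    if [ "$1" = "run" ]; then
        sh -c "$cmd" &
        echo "started: $cmd (pid $!)"
    else
        echo "would run: $cmd"
    fi
}

# Only adds the network load when a resync is reported.
repro() {
    if resync_mentioned; then
        net_load "$1"
    else
        echo "no resync reported in $MDSTAT; start it first" >&2
        return 1
    fi
}
```

Calling `repro` performs a dry run, and `repro run` starts the network load once a resync shows up in the md status file. The listening side ('nc -l -p 6789 < /dev/zero > /dev/null') still has to be started on the other machine as described in the report.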
Created attachment 53132 [details] panic example 2
Created attachment 53142 [details] panic example 3
Created attachment 53152 [details] panic example 4
Created attachment 53162 [details] panic example 5 (with debug boot option)
Created attachment 53172 [details] lspci -nn
Created attachment 53182 [details] lshw
Created attachment 53192 [details] another example
Created attachment 53202 [details] another example
Created attachment 53212 [details] another example
Created attachment 53222 [details] another example

Perhaps too many panic examples...
(In reply to comment #0)
[...]
> 03:00.0 0200: 10ec:8168 (rev 01)
> 04:00.0 0200: 10ec:8168 (rev 06) <-- makes problem

The XID line in one of the dmesg outputs shows that the rev 06 device is an 8168e (RTL_GIGA_MAC_VER_33). Its support is very (very) limited before 3.0.

I'd suggest working with 3.0, even though it will probably not be a silver bullet for your panics.

-- 
Ueimor
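For anyone wanting to repeat the two checks mentioned in this reply, a minimal shell sketch follows. The helper names `xid_lines` and `kernel_is_3_or_newer` are made up for illustration, and the sketch assumes that the r8169 probe message in the kernel log contains both the driver name and the string "XID". It only filters log text and compares the major version number; it does not decode the XID into a chip name.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Filters a kernel log read from stdin down to r8169 lines that carry an XID.
xid_lines() {
    grep 'r8169' | grep 'XID'
}

# Succeeds if the given kernel release (default: the running kernel)
# has a major version of 3 or higher.
kernel_is_3_or_newer() {
    release="${1:-$(uname -r)}"
    major="${release%%.*}"        # text before the first dot
    [ "$major" -ge 3 ] 2>/dev/null
}
```

Typical use would be `dmesg | xid_lines` to find the probe line, and `kernel_is_3_or_newer || echo "kernel older than 3.0"` to flag a kernel that predates the suggested version.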