Bug 32492

Summary: r8169 together with raid6 causes reboot/panic
Product: Drivers Reporter: Håkan (disp0se4bl3)
Component: Network Assignee: Francois Romieu (romieu)
Status: RESOLVED OBSOLETE    
Severity: normal CC: alan, disp0se4bl3, romieu
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.38.2 and others Subsystem:
Regression: No Bisected commit-id:
Attachments: panic example 2
panic example 3
panic example 4
panic example 5 (with debug boot option)
lspci -nn
lshw
another example
another example
another example
another example

Description Håkan 2011-04-02 13:17:29 UTC
Distribution: 
    Debian GNU/Linux 6.0.1
Hardware Environment: 
    Motherboard: ASUS E35M1-M PRO
    CPU: AMD E-350 Processor 1600 MHz
    RAM: 1GB RAM DDR3 DIMM 1333
    BIOS: American Megatrends Inc., 0506
    Ethernet: Realtek PCI-E RTL8111/8168B
    Storage: SATA controller,
      2xST2000DL003-9VT1, 2xSAMSUNG HD204UI

Software Environment:
    1) kernel raid6
       Doing raid6 resync on a newly created array.  Not mounted.  
       Resync started by 'echo 0 > /dev/md0'.
       The array consists of /dev/sda4 and /dev/sd[bcd]
       /dev/sda4 is the entire first disk minus the first 20 GB.
    2) netcat
       Causing network traffic using 
       'nc 192.168.0.11 6789 < /dev/zero > /dev/null'
       The other machine does 'nc -l -p 6789 < /dev/zero > /dev/null',
       and this saturates its 100 Mbit/s interface, 12 MB/s duplex.
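
    As a quick sanity check of the throughput figure above, the
    following sketch converts the peer's 100 Mbit/s link rate to MB/s
    and compares it with the reported 12 MB/s per direction.  It
    assumes decimal units and ignores protocol overhead; the names are
    illustrative only.

```python
def mbit_to_mbyte_per_s(mbit_per_s):
    """Convert a link rate from Mbit/s to MB/s (decimal units, 8 bits per byte)."""
    return mbit_per_s / 8


LINK_MBIT_PER_S = 100      # peer interface speed, as stated in the report
OBSERVED_MBYTE_PER_S = 12  # per-direction throughput, as stated in the report

# Raw line rate of the peer's interface in MB/s.
line_rate = mbit_to_mbyte_per_s(LINK_MBIT_PER_S)

# Fraction of the raw line rate that the observed traffic accounts for.
utilisation = OBSERVED_MBYTE_PER_S / line_rate

print(f"line rate: {line_rate} MB/s, utilisation: {utilisation:.0%}")
```

    The observed rate sits close to the raw line rate, which is
    consistent with the peer's interface being the bottleneck.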

Problem Description:

    With either the raid6 resync or the network load running by
    itself, the error does not reproduce.

    With both running at the same time, the machine reboots suddenly,
    usually after 5-20 s but sometimes only after a few minutes.  After
    removing the 'quiet' boot option, various panic messages appear
    instead of a sudden reboot most of the time.

    The hang/crash does not reproduce when using a PCI-E network card
    of the same kind instead of the one built into the motherboard.
    In the lspci output, 04 is the onboard NIC and 03 is the add-on
    card:

03:00.0 0200: 10ec:8168 (rev 01)
04:00.0 0200: 10ec:8168 (rev 06)    <-- makes problem
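
    To tell the two otherwise identical devices apart in a script, a
    minimal sketch along these lines could parse the numeric lspci
    lines above and pick out the revision.  The regular expression and
    function names are illustrative assumptions, not part of any
    existing tool; only the two sample lines come from this report.

```python
import re

# Matches numeric lspci lines such as "04:00.0 0200: 10ec:8168 (rev 06)".
# Anything after the optional revision (e.g. a trailing note) is ignored.
LINE_RE = re.compile(
    r"^(?P<slot>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r"(?P<cls>[0-9a-f]{4}):\s+"
    r"(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})"
    r"(?:\s+\(rev (?P<rev>[0-9a-f]{2})\))?",
    re.IGNORECASE,
)


def parse_lspci_line(line):
    """Return a dict of slot/cls/vendor/device/rev, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


SAMPLE = """\
03:00.0 0200: 10ec:8168 (rev 01)
04:00.0 0200: 10ec:8168 (rev 06)    <-- makes problem
"""

devices = [d for d in map(parse_lspci_line, SAMPLE.splitlines()) if d]

# Class 0200 is an Ethernet controller; vendor 10ec is Realtek.
realtek_nics = [d for d in devices if d["cls"] == "0200" and d["vendor"] == "10ec"]

for nic in realtek_nics:
    print(nic["slot"], nic["device"], "rev", nic["rev"])
```

    Both devices share the same vendor:device ID, so the revision is
    the only field in this output that distinguishes them.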

    The motherboard has been exchanged, with the same effect on both
    specimens.  A different memory stick has also been tried, same
    effect.  The hard drives have been replaced two at a time by older
    Maxtor 7L300S0 drives, i.e. 2 new and 2 old.  The error occurs with
    both combinations.  I could not reproduce it when using a set of
    four old disks alone, perhaps due to too little load.  Also, raid6
    on only 3 disks makes it much harder to reproduce.  Another PSU has
    also been tried, same effect.

    Running badblocks instead of raid6 resync on the disks (together
    with the network load) does not produce the error, at least within
    several hours.

    The error has been seen with Debian's 2.6.32 stable kernel, and
    their recent 2.6.38 snapshot.  Reproduced also with kernel.org
    2.6.27.58 (have not tested earlier), 2.6.38.2,
    linux/kernel/git/torvalds/linux-2.6.git, and
    linux/kernel/git/next/linux-next.git.  The kernel.org kernels use
    the default configuration plus r8169 and raid6 compiled in.

    Attached are a number of panic message samples.

    With the debug boot option, a panic happened, but no additional
    information appeared.  With nosmp the problem has not reproduced.
    In that case the messages 'r8169 0000:04:00.0: eth0: link up'
    also appear only about once per hour.  Probably one core is not
    enough to create the load needed.

    Could some kernel debug options help?
Comment 1 Håkan 2011-04-02 13:18:39 UTC
Created attachment 53132 [details]
panic example 2
Comment 2 Håkan 2011-04-02 13:19:15 UTC
Created attachment 53142 [details]
panic example 3
Comment 3 Håkan 2011-04-02 13:19:55 UTC
Created attachment 53152 [details]
panic example 4
Comment 4 Håkan 2011-04-02 13:20:32 UTC
Created attachment 53162 [details]
panic example 5 (with debug boot option)
Comment 5 Håkan 2011-04-02 13:20:55 UTC
Created attachment 53172 [details]
lspci -nn
Comment 6 Håkan 2011-04-02 13:21:22 UTC
Created attachment 53182 [details]
lshw
Comment 7 Håkan 2011-04-02 13:27:03 UTC
Created attachment 53192 [details]
another example
Comment 8 Håkan 2011-04-02 13:27:28 UTC
Created attachment 53202 [details]
another example
Comment 9 Håkan 2011-04-02 13:27:50 UTC
Created attachment 53212 [details]
another example
Comment 10 Håkan 2011-04-02 13:28:30 UTC
Created attachment 53222 [details]
another example

Perhaps too many panic examples...
Comment 11 Francois Romieu 2011-08-05 11:56:18 UTC
(In reply to comment #0)
[...]
> 03:00.0 0200: 10ec:8168 (rev 01)
> 04:00.0 0200: 10ec:8168 (rev 06)    <-- makes problem

The XID line in one of the dmesg outputs shows the rev 06 device is an
8168e (RTL_GIGA_MAC_VER_33). Its support is very (very) limited before
3.0. I'd suggest working with 3.0 even if it will probably not be a
silver bullet for your panics.

-- 
Ueimor
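
For anyone scripting a check against the 3.0 threshold mentioned in
comment 11, a minimal sketch could compare the running kernel's release
string with (3, 0). The helper names are hypothetical, and the parsing
assumes release strings that start with dotted numbers, such as
'2.6.38.2' or '3.0.0-1-amd64'.

```python
import platform
import re


def kernel_version_tuple(release):
    """Extract the leading dotted numeric part of a release string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_at_least(release, minimum=(3, 0)):
    """Return True if the release is at or above the given minimum version."""
    return kernel_version_tuple(release) >= minimum


if __name__ == "__main__":
    running = platform.release()
    print(running, "is 3.0 or later:", is_at_least(running))
```

A result of True only says the kernel is new enough to carry the
improved 8168e support; it does not by itself rule out the panics.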