Bug 42921

Summary: r8169 ethernet stops working after about an hour of use
Product: Drivers
Reporter: Guido Winkelmann (guido-kern-bugs)
Component: Network
Assignee: Francois Romieu (romieu)
Status: RESOLVED CODE_FIX
Severity: normal
CC: alan, kernel, mpagano, romieu, thomas.pi
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.2.9
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on: 42899
Bug Blocks:
Attachments: This is the output of dmesg after the problem occurred
dmesg output from 3.2.11
Dmesg output from Kernel 3.3-rc7
r8169.c as of 41de8d4cff21a2e81e3d9ff66f5f7c903f9c3ab1
Dmesg output from Kernel 3.3 with r8169.c replaced by attachment 4
Dmesg output from Kernel 3.3 with new driver and firmware loaded

Description Guido Winkelmann 2012-03-13 13:28:18 UTC
Created attachment 72578 [details]
This is the output of dmesg after the problem occurred

With all kernel versions greater than 3.0.x, my r8169 ethernet device will completely stop sending and/or receiving packets after some minutes or hours of use.


The ethernet device is the on-board ethernet of a Gigabyte GA-990FXA-UD3 mainboard. lspci will list it as

05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)

I have tried unplugging and replugging the network cable as well as doing "ifconfig eth0 down; ifconfig eth0 up", neither of which solved the problem. The only way I could find to regain network connectivity was to reboot.

See also https://bugs.gentoo.org/show_bug.cgi?id=399777

Another regression that coincides with this one is that, with all kernels > 3.0.x, automatic power-down after shutdown no longer works.
Comment 1 Francois Romieu 2012-03-20 13:29:22 UTC
The driver is unable to load its firmware but if the driver worked before I can only hope
the firmware is not needed yet.

There are a few messages from the IOMMU just before things go bad :
[...]
AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e address=0x0000000000000000 flags=0x0000]
AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e address=0x0000000000000080 flags=0x0000]
AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e address=0x0000000000000040 flags=0x0000]
AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e address=0x00000000000000c0 flags=0x0000]
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x247/0x250()
Hardware name: GA-990FXA-UD3
NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out

Can you check:
- if they happen more or less at the same time
- if they did happen with kernel 3.0.6
- which device hides behind 00:14.4 (lspci should tell it)

Thanks.

-- 
Ueimor
Comment 2 Francois Romieu 2012-03-20 14:16:41 UTC
*ouch*

v3.1.6 does not include c7c2c39be8ed4e503e987151f4599455060e219a.

Actually it does not include any v3.1..v3.2 fix. :o/

Can you give 3.3 a try ?

-- 
Ueimor
Comment 3 Guido Winkelmann 2012-03-21 22:57:15 UTC
(In reply to comment #1)
> The driver is unable to load its firmware but if the driver worked before I
> can only hope the firmware is not needed yet.
> 
> There are a few messages from the IOMMU just before things go bad :
> [...]
> AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e
> address=0x0000000000000000 flags=0x0000]
> AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e
> address=0x0000000000000080 flags=0x0000]
> AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e
> address=0x0000000000000040 flags=0x0000]
> AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e
> address=0x00000000000000c0 flags=0x0000]
> ------------[ cut here ]------------
> WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x247/0x250()
> Hardware name: GA-990FXA-UD3
> NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
> 
> Can you check:

Conveniently, my logrotate has been broken for the last couple of months, so it turns out yes...

> - if they happen more or less at the same time

No, the network outage happened about half an hour after that.

> - if they did happen with kernel 3.0.6

They did, but not for that same device:

Dec 28 21:54:32 tolkien kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x000000004022d070 flags=0x0030]
Dec 28 21:54:32 tolkien kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x000000004022d010 flags=0x0030]
Dec 28 21:54:32 tolkien kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x000000004022d050 flags=0x0030]
Dec 28 21:54:32 tolkien kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x000000004022d030 flags=0x0030]

(That device would be my graphics card.)

That wasn't followed by any noticeable problems, though.

With 3.1.6, I had an IO_PAGE_FAULT that did directly reference my network card:

Jan 26 13:02:52 tolkien kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x001a address=0x0000000000003000 flags=0x0050]

lspci says:
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)

It's a bit weird: now that I'm going through my old /var/log/messages, it seems that nearly every time this happens, there's an IO_PAGE_FAULT that references device 05:00.0. 00:14.4 seems to be the big exception. In all those other cases, the IO_PAGE_FAULT was followed by the described network outage about 8 seconds later.
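
A check like this can be scripted rather than done by eye. The following is only a minimal sketch, assuming syslog-style lines such as the ones quoted above ("Jan 26 13:02:52 tolkien kernel: ..."). The function names, the 60-second window, and the default year are assumptions made for illustration (syslog timestamps carry no year). Plain dmesg output without timestamps will not match.

```python
import re
from datetime import datetime, timedelta

# Syslog timestamp prefix, e.g. "Jan 26 13:02:52"
STAMP = r"(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2})"
FAULT_RE = re.compile(STAMP + r" .*IO_PAGE_FAULT device=(?P<device>[0-9a-f:.]+)")
WATCHDOG_RE = re.compile(
    STAMP + r" .*NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>\w+)\)"
)


def parse_stamp(stamp, year):
    """Turn a syslog timestamp into a datetime; the year is an assumption."""
    normalized = " ".join(stamp.split())  # collapse padding such as "Jan  6"
    return datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S")


def faults_before_timeouts(lines, window=timedelta(seconds=60), year=2012):
    """Pair each watchdog timeout with the IOMMU faults seen shortly before it.

    Returns a list of (device, interface, seconds_between) tuples.
    """
    faults = []
    pairs = []
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            faults.append((parse_stamp(match["stamp"], year), match["device"]))
            continue
        match = WATCHDOG_RE.search(line)
        if match:
            when = parse_stamp(match["stamp"], year)
            for fault_time, device in faults:
                delta = when - fault_time
                if timedelta(0) <= delta <= window:
                    pairs.append((device, match["iface"], delta.total_seconds()))
    return pairs
```

It could be run over an open log file, for example `faults_before_timeouts(open("/var/log/messages"))`. A fault on 00:14.4 half an hour before the outage would fall outside the window and not be paired.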

> - which device hides behind 00:14.4 (lspci should tell it)

It says:
00:14.4 PCI bridge: Advanced Micro Devices [AMD] nee ATI SBx00 PCI to PCI Bridge (rev 40)
Comment 4 Guido Winkelmann 2012-03-21 22:58:21 UTC
(In reply to comment #2)
> *ouch*
> 
> v3.1.6 does not include c7c2c39be8ed4e503e987151f4599455060e219a.
> 
> Actually it does not include any v3.1..v3.2 fix. :o/
> 
> Can you give 3.3 a try ?

The bug is still present in both 3.2.11 and 3.3-rc7. Attaching respective dmesg-outputs.
Comment 5 Guido Winkelmann 2012-03-21 22:59:34 UTC
Created attachment 72674 [details]
dmesg output from 3.2.11
Comment 6 Guido Winkelmann 2012-03-21 23:00:15 UTC
Created attachment 72675 [details]
Dmesg output from Kernel 3.3-rc7
Comment 7 Francois Romieu 2012-03-22 10:29:12 UTC
Created attachment 72682 [details]
r8169.c as of 41de8d4cff21a2e81e3d9ff66f5f7c903f9c3ab1
Comment 8 Francois Romieu 2012-03-22 11:10:24 UTC
Ok, can you :
- send a 'cat /sys/bus/pci/devices/0000:05:00.0/resource'.
  With the IOMMU address, it should help identify where in its PCI regions the driver tried to poke into.
- use the attached driver. It should build with your -rc7 or later
- install Realtek's firmware. It will silence the "r8169 0000:05:00.0: eth0: unable to load firmware patch rtl_nic/rtl8168e-3.fw (-2)" message.
  Firmware is not required as long as things work and I doubt it will change much per se
  but I'd rather be safe.
- specify if you are using jumbo frames
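
For reference, the "(-2)" in that firmware message is a negated errno value, and errno 2 is ENOENT, i.e. the firmware file was not found. A small sketch of decoding such a message follows. The function name is hypothetical and the regular expression only assumes the message format quoted above.

```python
import errno
import os
import re


def decode_firmware_error(message):
    """Extract the firmware path and errno name from an r8169 load failure.

    Returns (path, errno_name, description), or None if the message does
    not look like the "unable to load firmware patch" line.
    """
    match = re.search(r"unable to load firmware patch (\S+) \((-?\d+)\)", message)
    if match is None:
        return None
    path = match.group(1)
    code = abs(int(match.group(2)))  # the kernel reports errors as negative errno
    return path, errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)
```

Applied to the message above, this gives the path rtl_nic/rtl8168e-3.fw and the errno name ENOENT.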

Thanks for testing.

-- 
Ueimor
Comment 9 Guido Winkelmann 2012-03-25 03:03:10 UTC
(In reply to comment #8)
> Ok, can you :
> - send a 'cat /sys/bus/pci/devices/0000:05:00.0/resource'.

Sure:

# cat /sys/bus/pci/devices/0000:05:00.0/resource
0x000000000000ae00 0x000000000000aeff 0x0000000000040101
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000fd8ff000 0x00000000fd8fffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000fd8f8000 0x00000000fd8fbfff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
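
To compare a logged address against these regions, something like the following could be used. It is a minimal sketch, assuming each line of the sysfs resource file holds start, end, and flags in hex, with all-zero lines marking unused slots. The names Region, parse_resource, and find_region are made up for illustration. The two flag constants are taken from include/linux/ioport.h.

```python
from typing import List, NamedTuple, Optional

IORESOURCE_IO = 0x00000100   # I/O port region (include/linux/ioport.h)
IORESOURCE_MEM = 0x00000200  # memory-mapped region


class Region(NamedTuple):
    index: int   # line number within the resource file
    start: int
    end: int     # inclusive
    flags: int

    @property
    def kind(self) -> str:
        if self.flags & IORESOURCE_IO:
            return "io"
        if self.flags & IORESOURCE_MEM:
            return "mem"
        return "other"


def parse_resource(text: str) -> List[Region]:
    """Parse the contents of /sys/bus/pci/devices/<BDF>/resource."""
    regions = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, flags = (int(field, 16) for field in fields)
        if flags == 0:  # unused slot
            continue
        regions.append(Region(index, start, end, flags))
    return regions


def find_region(regions: List[Region], address: int) -> Optional[Region]:
    """Return the region containing the address, or None."""
    for region in regions:
        if region.start <= address <= region.end:
            return region
    return None
```

With the output above, three slots are populated (one I/O port range and two memory ranges), and the address 0x3000 from the 3.1.6 IO_PAGE_FAULT falls into none of them.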

>   With the IOMMU address, it should help identify where in its PCI regions the
> driver tried to poke into.
> - use the attached driver. It should build with your -rc7 or later

I'm currently building a new kernel with it. Will report back if the bug still bites.

> - install Realtek's firmware. It will silence the "r8169 0000:05:00.0: eth0:
> unable to load firmware patch rtl_nic/rtl8168e-3.fw (-2)" message.
>   Firmware is not required as long as things work and I doubt it will change
> much per se
>   but I'd rather be safe.

Hm, unfortunately, I couldn't figure out where to get that firmware...

> - specify if you are using jumbo frames

I am not using Jumbo frames.
Comment 10 Guido Winkelmann 2012-03-25 18:59:58 UTC
(In reply to comment #8)
> - use the attached driver. It should build with your -rc7 or later

No luck with that. The network still locks up after some time.

BTW, since this is now listed as depending on 42899: On my machine, there is no visible relation between traffic patterns and these lockups. They happen totally at random, as far as I am concerned.
Comment 11 Guido Winkelmann 2012-03-25 19:13:45 UTC
Created attachment 72710 [details]
Dmesg output from Kernel 3.3 with r8169.c replaced by attachment 4
Comment 12 Francois Romieu 2012-03-29 13:03:49 UTC
(In reply to comment #9)
[...]
> Hm, unfortunately, I couldn't figure out where to get that firmware...

http://git.kernel.org/?p=linux/kernel/git/firmware/linux-firmware.git;a=tree;f=rtl_nic

Apparently there is http://packages.gentoo.org/package/sys-kernel/linux-firmware for your distro too.

-- 
Ueimor
Comment 13 Guido Winkelmann 2012-04-11 22:56:13 UTC
Still no luck, even with the firmware loaded. Attaching dmesg output.
Comment 14 Guido Winkelmann 2012-04-11 22:57:34 UTC
Created attachment 72895 [details]
Dmesg output from Kernel 3.3 with new driver and firmware loaded
Comment 15 Thomas Pilarski 2012-04-19 21:58:50 UTC
You can try the patch from the bug report 42899. I have tested it with 3.3.0, but it applies against 3.3.2 too. 
https://bugzilla.kernel.org/show_bug.cgi?id=42899#c20
Comment 16 Guido Winkelmann 2012-04-25 00:24:42 UTC
I have been using the patch (or anti-patch...) from comment #15 for a few days now with 3.3.2, and so far, the problem has not resurfaced.