Bug 42921
Summary: | r8169 ethernet stops working after about an hour of use | ||
---|---|---|---|
Product: | Drivers | Reporter: | Guido Winkelmann (guido-kern-bugs) |
Component: | Network | Assignee: | Francois Romieu (romieu) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | alan, kernel, mpagano, romieu, thomas.pi |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.2.9 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | 42899 | ||
Bug Blocks: | |||
Attachments: |
This is the output of dmesg after the problem occurred
dmesg output from 3.2.11
Dmesg output from Kernel 3.3-rc7
r8169.c as of 41de8d4cff21a2e81e3d9ff66f5f7c903f9c3ab1
Dmesg output from Kernel 3.3 with r8169.c replaced by attachment 4
Dmesg output from Kernel 3.3 with new driver and firmware loaded
Description
Guido Winkelmann
2012-03-13 13:28:18 UTC
The driver is unable to load its firmware but if the driver worked before I can only hope the firmware is not needed yet.

There are a few messages from the IOMMU just before things go bad:

[...]
AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e address=0x0000000000000000 flags=0x0000]
AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e address=0x0000000000000080 flags=0x0000]
AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e address=0x0000000000000040 flags=0x0000]
AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e address=0x00000000000000c0 flags=0x0000]
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x247/0x250()
Hardware name: GA-990FXA-UD3
NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out

Can you check:
- if they happen more or less at the same time
- if they did happen with kernel 3.0.6
- which device hides behind 00:14.4 (lspci should tell it)

Thanks.

--
Ueimor

*ouch*

v3.1.6 does not include c7c2c39be8ed4e503e987151f4599455060e219a.

Actually it does not include any v3.1..v3.2 fix. :o/

Can you give 3.3 a try ?

--
Ueimor

(In reply to comment #1)
> The driver is unable to load its firmware but if the driver worked before I can only hope the firmware is not needed yet.
>
> There are a few messages from the IOMMU just before things go bad:
> [...]
> AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e address=0x0000000000000000 flags=0x0000]
> AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e address=0x0000000000000080 flags=0x0000]
> AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e address=0x0000000000000040 flags=0x0000]
> AMD-Vi: Event logged [IO_PAGE_FAULT device=00:14.4 domain=0x000e address=0x00000000000000c0 flags=0x0000]
> ------------[ cut here ]------------
> WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x247/0x250()
> Hardware name: GA-990FXA-UD3
> NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
>
> Can you check:

Conveniently, my logrotate has been broken for the last couple of months, so it turns out yes...

> - if they happen more or less at the same time

No, the network outage happened about half an hour after that.

> - if they did happen with kernel 3.0.6

They did, but not for that same device:

Dec 28 21:54:32 tolkien kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x000000004022d070 flags=0x0030]
Dec 28 21:54:32 tolkien kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x000000004022d010 flags=0x0030]
Dec 28 21:54:32 tolkien kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x000000004022d050 flags=0x0030]
Dec 28 21:54:32 tolkien kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x000000004022d030 flags=0x0030]

(That device would be my graphics card.)

That wasn't followed by any noticeable problems, though.

With 3.1.6, I had an IO_PAGE_FAULT that did directly reference my network card:

Jan 26 13:02:52 tolkien kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x001a address=0x0000000000003000 flags=0x0050]

lspci says:

05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)

It's a bit weird: now that I'm going through my old /var/log/messages, it seems that nearly every time this happens, there's an IO_PAGE_FAULT that references device 05:00.0. 00:14.4 seems to be the big exception...

In all those other cases, the IO_PAGE_FAULT was followed by the described network outage about 8 seconds later.

> - which device hides behind 00:14.4 (lspci should tell it)

It says:

00:14.4 PCI bridge: Advanced Micro Devices [AMD] nee ATI SBx00 PCI to PCI Bridge (rev 40)

(In reply to comment #2)
> *ouch*
>
> v3.1.6 does not include c7c2c39be8ed4e503e987151f4599455060e219a.
>
> Actually it does not include any v3.1..v3.2 fix. :o/
>
> Can you give 3.3 a try ?

The bug is still present in both 3.2.11 and 3.3-rc7. Attaching the respective dmesg outputs.

Created attachment 72674 [details]
dmesg output from 3.2.11
Created attachment 72675 [details]
Dmesg output from Kernel 3.3-rc7
Created attachment 72682 [details]
r8169.c as of 41de8d4cff21a2e81e3d9ff66f5f7c903f9c3ab1
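
For anyone trying to repeat the log check discussed above (which device each IO_PAGE_FAULT refers to), a minimal Python sketch follows. It assumes the AMD-Vi message layout quoted in this report; the names `FAULT_RE`, `parse_faults`, `faults_for_device` and `SAMPLE` are made up for illustration and are not part of any kernel tool. The sample lines are copied from the comments above.

```python
import re

# Pattern for the AMD-Vi fault lines as they appear in this report.
# Other kernel versions may format the message differently (assumption).
FAULT_RE = re.compile(
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT device=(?P<device>[0-9a-f:.]+) "
    r"domain=(?P<domain>0x[0-9a-f]+) address=(?P<address>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def parse_faults(log_text):
    """Return one dict per IO_PAGE_FAULT line found in log_text."""
    faults = []
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            faults.append({
                "device": match.group("device"),
                "domain": int(match.group("domain"), 16),
                "address": int(match.group("address"), 16),
                "flags": int(match.group("flags"), 16),
            })
    return faults


def faults_for_device(log_text, device):
    """Keep only the faults logged against one PCI device, e.g. '05:00.0'."""
    return [f for f in parse_faults(log_text) if f["device"] == device]


# Two lines taken from the /var/log/messages excerpts in this report.
SAMPLE = """\
Dec 28 21:54:32 tolkien kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0016 address=0x000000004022d070 flags=0x0030]
Jan 26 13:02:52 tolkien kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x001a address=0x0000000000003000 flags=0x0050]
"""

if __name__ == "__main__":
    for fault in faults_for_device(SAMPLE, "05:00.0"):
        print(fault["device"], hex(fault["address"]), hex(fault["flags"]))
```

Feeding the whole of /var/log/messages to `faults_for_device` with the NIC's address (05:00.0 here) would list only the faults attributed to the network card.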
Ok, can you:
- send a 'cat /sys/bus/pci/devices/0000:05:00.0/resource'. With the IOMMU address, it should help identify where in its PCI regions the driver tried to poke into.
- use the attached driver. It should build with your -rc7 or later
- install Realtek's firmware. It will silence the "r8169 0000:05:00.0: eth0: unable to load firmware patch rtl_nic/rtl8168e-3.fw (-2)" message. Firmware is not required as long as things work and I doubt it will change much per se but I'd rather be safe.
- specify if you are using jumbo frames

Thanks for testing.

--
Ueimor

(In reply to comment #8)
> Ok, can you:
> - send a 'cat /sys/bus/pci/devices/0000:05:00.0/resource'.

Sure:

# cat /sys/bus/pci/devices/0000:05:00.0/resource
0x000000000000ae00 0x000000000000aeff 0x0000000000040101
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000fd8ff000 0x00000000fd8fffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000fd8f8000 0x00000000fd8fbfff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000

> With the IOMMU address, it should help identify where in its PCI regions the driver tried to poke into.
> - use the attached driver. It should build with your -rc7 or later

I'm currently building a new kernel with it. Will report back if the bug still bites.

> - install Realtek's firmware. It will silence the "r8169 0000:05:00.0: eth0: unable to load firmware patch rtl_nic/rtl8168e-3.fw (-2)" message.
> Firmware is not required as long as things work and I doubt it will change much per se but I'd rather be safe.

Hm, unfortunately, I couldn't figure out where to get that firmware...

> - specify if you are using jumbo frames

I am not using jumbo frames.

(In reply to comment #8)
> - use the attached driver. It should build with your -rc7 or later

No luck with that. The network still locks up after some time.

BTW, since this is now listed as depending on 42899: On my machine, there is no visible relation between traffic patterns and these lockups. They happen totally at random, as far as I am concerned.

Created attachment 72710 [details]
Dmesg output from Kernel 3.3 with r8169.c replaced by attachment 4 [details]

(In reply to comment #9)
[...]
> Hm, unfortunately, I couldn't figure out where to get that firmware...

http://git.kernel.org/?p=linux/kernel/git/firmware/linux-firmware.git;a=tree;f=rtl_nic

Apparently there is http://packages.gentoo.org/package/sys-kernel/linux-firmware for your distro too.

--
Ueimor

Still no luck, even with the firmware loaded. Attaching dmesg output.

Created attachment 72895 [details]
Dmesg output from Kernel 3.3 with new driver and firmware loaded
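
The suggestion to compare the IOMMU fault address against the device's PCI regions can be sketched in Python as below. This is only an illustration under stated assumptions: each line of the sysfs `resource` file is read as `start end flags`, the first five lines of the dump posted in this report are embedded as sample data, and the helper names `parse_resources` and `find_region` are hypothetical. The two flag constants are the `IORESOURCE_IO` and `IORESOURCE_MEM` bits from the kernel's `include/linux/ioport.h`. Whether a fault address (a DMA address seen by the IOMMU) is directly comparable to these ranges is a separate question that the sketch does not answer.

```python
IORESOURCE_IO = 0x00000100   # I/O port region
IORESOURCE_MEM = 0x00000200  # memory-mapped region

# First five lines of the resource dump posted in this report;
# the remaining lines there are all zero (unused slots).
RESOURCE_DUMP = """\
0x000000000000ae00 0x000000000000aeff 0x0000000000040101
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000fd8ff000 0x00000000fd8fffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000fd8f8000 0x00000000fd8fbfff 0x000000000014220c
"""


def parse_resources(text):
    """Return (index, start, end, flags) for every non-empty region."""
    regions = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue  # not a 'start end flags' line
        start, end, flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # unused slot
        regions.append((index, start, end, flags))
    return regions


def find_region(regions, address):
    """Return the index of the region containing address, or None."""
    for index, start, end, _flags in regions:
        if start <= address <= end:
            return index
    return None


if __name__ == "__main__":
    regions = parse_resources(RESOURCE_DUMP)
    for index, start, end, flags in regions:
        kind = "io" if flags & IORESOURCE_IO else "mem" if flags & IORESOURCE_MEM else "?"
        print(f"region {index}: {start:#x}-{end:#x} ({kind})")
    # Fault address logged against 05:00.0 under 3.1.6 in this report.
    print(find_region(regions, 0x3000))
```

With the values from this report, the fault address 0x3000 does not fall inside any of the three populated regions.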
You can try the patch from the bug report 42899. I have tested it with 3.3.0, but it applies against 3.3.2 too.

https://bugzilla.kernel.org/show_bug.cgi?id=42899#c20

I have been using the patch (or anti-patch...) from comment #15 for a few days now with 3.3.2, and so far, the problem has not resurfaced.