Bug 55841
| Summary: | RTL8111/8168F PCI Express reboot on 100% TX utilization | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | David Hubbard (david.c.hubbard) |
| Component: | Network | Assignee: | Francois Romieu (romieu) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | ranma+kernel, rdalek1967, romieu |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | net-next, 3.8.4-gentoo | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | full dmesg; RxConfig hack for the 8168f | | |
Description
David Hubbard
2013-03-27 13:46:56 UTC
The net-next kernel version is reported as 3.9.0-rc3+

Adding iommu=pt to the kernel command line is enough to TX at 100% utilization. http://forums.gentoo.org/viewtopic-t-955334.html was the source of the idea.

I assume the problem is still present, since iommu passthrough just changes the default policy from deny to allow.

I believe I'm seeing this as well (on 3.11):

Note how the AMD-Vi errors correlate with r8169 messages. And occasionally my network card seems to get into a bad state where I'm getting ~20% packet loss on the local network, next time it happens I can have a look if the log messages happen then as well.

Sep 2 03:15:08 nukunuku kernel: [ 5181.204609] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0018 address=0x0000000000003000 flags=0x0050]
Sep 2 03:15:26 nukunuku kernel: [ 5198.737751] ------------[ cut here ]------------
Sep 2 03:15:26 nukunuku kernel: [ 5198.737764] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x197/0x1fd()
Sep 2 03:15:26 nukunuku kernel: [ 5198.737767] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Sep 2 03:15:26 nukunuku kernel: [ 5198.737770] Modules linked in: lp parport bnep rfcomm bluetooth usb_storage asix k10temp snd_usb_audio uvcvideo snd_usbmidi_lib videobuf2_v
Sep 2 03:15:26 nukunuku kernel: [ 5198.737790] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.10.10 #66
Sep 2 03:15:26 nukunuku kernel: [ 5198.737794] Hardware name: System manufacturer System Product Name/F2A85-M LE, BIOS 5107 10/22/2012
Sep 2 03:15:26 nukunuku kernel: [ 5198.737797] 0000000000000000 ffff88043ec83d80 ffffffff8185c6c8 ffff88043ec83db8
Sep 2 03:15:26 nukunuku kernel: [ 5198.737803] ffffffff8107a8ae ffff88043ec83dc8 ffff88042d27e000 ffff88043ca21c00
Sep 2 03:15:26 nukunuku kernel: [ 5198.737807] ffff88042d27e320 0000000000000001 ffff88043ec83e18 ffffffff8107a90d
Sep 2 03:15:26 nukunuku kernel: [ 5198.737811] Call Trace:
Sep 2 03:15:26 nukunuku kernel: [ 5198.737813] <IRQ> [<ffffffff8185c6c8>] dump_stack+0x19/0x1b
Sep 2 03:15:26 nukunuku kernel: [ 5198.737825] [<ffffffff8107a8ae>] warn_slowpath_common+0x60/0x78
Sep 2 03:15:26 nukunuku kernel: [ 5198.737829] [<ffffffff8107a90d>] warn_slowpath_fmt+0x47/0x49
Sep 2 03:15:26 nukunuku kernel: [ 5198.737835] [<ffffffff8175d81d>] dev_watchdog+0x197/0x1fd
Sep 2 03:15:26 nukunuku kernel: [ 5198.737840] [<ffffffff8175d686>] ? dev_graft_qdisc+0x66/0x66
Sep 2 03:15:26 nukunuku kernel: [ 5198.737845] [<ffffffff810872b9>] call_timer_fn+0x63/0x15f
Sep 2 03:15:26 nukunuku kernel: [ 5198.737850] [<ffffffff81087761>] run_timer_softirq+0x1f1/0x252
Sep 2 03:15:26 nukunuku kernel: [ 5198.737855] [<ffffffff8175d686>] ? dev_graft_qdisc+0x66/0x66
Sep 2 03:15:26 nukunuku kernel: [ 5198.737859] [<ffffffff81080def>] __do_softirq+0x103/0x280
Sep 2 03:15:26 nukunuku kernel: [ 5198.737863] [<ffffffff8108106b>] irq_exit+0x53/0xb0
Sep 2 03:15:26 nukunuku kernel: [ 5198.737868] [<ffffffff81065bee>] smp_apic_timer_interrupt+0x86/0x94
Sep 2 03:15:26 nukunuku kernel: [ 5198.737874] [<ffffffff81869ddd>] apic_timer_interrupt+0x6d/0x80
Sep 2 03:15:26 nukunuku kernel: [ 5198.737876] <EOI> [<ffffffff81050849>] ? native_sched_clock+0x29/0x6f
Sep 2 03:15:26 nukunuku kernel: [ 5198.737886] [<ffffffff816cfb25>] ? cpuidle_enter_state+0x4d/0xa4
Sep 2 03:15:26 nukunuku kernel: [ 5198.737891] [<ffffffff816cfb1e>] ? cpuidle_enter_state+0x46/0xa4
Sep 2 03:15:26 nukunuku kernel: [ 5198.737896] [<ffffffff816cfca2>] cpuidle_idle_call+0x126/0x22a
Sep 2 03:15:26 nukunuku kernel: [ 5198.737900] [<ffffffff81051a58>] arch_cpu_idle+0x9/0x1e
Sep 2 03:15:26 nukunuku kernel: [ 5198.737906] [<ffffffff810b1ffa>] cpu_startup_entry+0x162/0x228
Sep 2 03:15:26 nukunuku kernel: [ 5198.737911] [<ffffffff81852b93>] start_secondary+0x1bf/0x1c2
Sep 2 03:15:26 nukunuku kernel: [ 5198.737914] ---[ end trace f0a1a3830fd46e08 ]---
Sep 2 03:15:26 nukunuku kernel: [ 5198.770521] r8169 0000:03:00.0 eth0: link up
Sep 2 03:39:37 nukunuku kernel: [ 6650.970740] nf_conntrack: automatic helper assignment is deprecated and it will be removed soon. Use the iptables CT target to attach helpe
Sep 2 04:04:57 nukunuku kernel: [ 8173.042344] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0018 address=0x0000000000003000 flags=0x0050]
Sep 2 04:05:20 nukunuku kernel: [ 8196.032118] r8169 0000:03:00.0 eth0: link up
Sep 2 04:42:31 nukunuku kernel: [10430.017275] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0018 address=0x0000000000003000 flags=0x0050]
Sep 2 04:43:14 nukunuku kernel: [10472.456515] r8169 0000:03:00.0 eth0: link up

Just a note on why Tobias' r8169 is on 03:00.0 while mine is on 04:00.0:

The bus number (03 versus 04) is dynamically allocated when each PCI-E root port is initialized. I have an external Radeon GPU on 01:00.0. The PCI-E slots for 02:**.* and 03:**.* are hidden/unknown to me, but the external GPU is why the r8169 is 04:00.0 on my system.

Created attachment 107631
RxConfig hack for the 8168f

(In reply to Tobias Diedrich from comment #3)
> I believe I'm seeing this as well (on 3.11):

8168f as well ?

--
Ueimor

Francois, Tobias' messages do identify his board. But he could have made this clearer:

"Hardware name: System manufacturer System Product Name/F2A85-M LE, BIOS 5107 10/22/2012"

Tobias' F2A85-M LE and my F2A85-M/CSM have identical 8168f chips.

I would like to know more about the RxConfig hack:

RTL_W32(RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);

I thought it was an 8168f firmware problem? But thanks for the tip.

(In reply to David Hubbard from comment #7)
[...]
> I thought it was an 8168f firmware problem? But thanks for the tip.

It is not even clear if the network driver / chipset is the culprit: whatever the suspect, the AMD-Vi logged message almost always looks the same. The more I ask the Oracle for similar problem reports, the less I trust the IOMMU :o/

There are two unexplained fixes:
1. "iommu=pt" (aka "disable the iommu", wonderful)
2. fetch one rx descriptor at a time (what the patch does)

--
Ueimor

Patch went mainline between v3.11 and v3.12 as 3ced8c955e74d319f3e3997f7169c79d524dfd06 ("r8169: enforce RX_MULTI_EN for the 8168f").

Thanks.

--
Ueimor
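
For readers wondering what the quoted RxConfig write amounts to: it is a bitwise OR of three driver flags written to one register. The following is a minimal standalone sketch, not the driver code itself. The bit positions are assumptions recalled from the r8169 driver source of that era and should be checked against your own kernel tree, and `rx_config_8168f` is a hypothetical helper name used only for illustration. The thread does not establish why setting the bit helps.

```c
#include <stdint.h>

/* ASSUMED bit positions; verify against the r8169 driver in your tree. */
#define RX128_INT_EN    (1u << 15)
#define RX_MULTI_EN     (1u << 14)
#define RXCFG_DMA_SHIFT 8
#define RX_DMA_BURST    (7u << RXCFG_DMA_SHIFT)

/* The extra flag must not overlap the other two, or the OR would hide it. */
_Static_assert((RX_MULTI_EN & (RX128_INT_EN | RX_DMA_BURST)) == 0,
               "RX_MULTI_EN overlaps another RxConfig field");

/*
 * Hypothetical helper: compose the RxConfig value, optionally including
 * RX_MULTI_EN. In the driver the result would be passed to
 * RTL_W32(RxConfig, ...); here it is only returned for inspection.
 */
uint32_t rx_config_8168f(int with_multi_en)
{
    uint32_t value = RX128_INT_EN | RX_DMA_BURST;

    if (with_multi_en)
        value |= RX_MULTI_EN;

    return value;
}
```

Under these assumed bit positions, `rx_config_8168f(1)` evaluates to 0xc700 and `rx_config_8168f(0)` to 0x8700, so the two values differ only in bit 14.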