Created attachment 96371 [details]
full dmesg

After reading bug #32962 I believe a similar problem has happened again, only with an rtl8168f.

The CPU has AMD-Vi (IOMMU) which, when active, prevents the reboot by page faulting the rtl8168f, but the device fails to TX or RX any packets after the fault. syslog does not report the old error messages from bug #32962.

[ 0.356526] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
[ 0.356527] AMD-Vi: Extended features: PreF PPR GT IA
[ 0.356530] AMD-Vi: Interrupt remapping enabled
[ 0.363498] AMD-Vi: Using passthrough domain for device 0000:01:00.0
[ 0.364101] AMD-Vi: Lazy IO/TLB flushing enabled
[ 4.242992] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[ 4.243348] r8169 0000:04:00.0: irq 62 for MSI/MSI-X
[ 4.243625] r8169 0000:04:00.0 eth0: RTL8168f/8111f at 0xffffc9001126e000, 50:46:5d:90:d6:76, XID 08000800 IRQ 62
[ 4.243632] r8169 0000:04:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[ 7.513449] r8169 0000:04:00.0 eth0: link down
[ 7.513527] r8169 0000:04:00.0 eth0: link down
[ 9.978426] r8169 0000:04:00.0 eth0: link up
[ 63.869892] AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x001b address=0x0000000000003000 flags=0x0050]

lspci -tnnvv
-[0000:00]-+-00.0  Advanced Micro Devices [AMD] Family 15h (Models 10h-1fh) Processor Root Complex [1022:1410]
           +-00.2  Advanced Micro Devices [AMD] Family 15h (Models 10h-1fh) I/O Memory Management Unit [1022:1419]
           +-01.0  Advanced Micro Devices [AMD] nee ATI Trinity [Radeon HD 7660D] [1002:9901]
           +-01.1  Advanced Micro Devices [AMD] nee ATI Trinity HDMI Audio Controller [1002:9902]
           +-02.0-[01]--+-00.0  Advanced Micro Devices [AMD] nee ATI Tahiti XT [Radeon HD 7970] [1002:6798]
           |            \-00.1  Advanced Micro Devices [AMD] nee ATI Tahiti XT HDMI Audio [Radeon HD 7970 Series] [1002:aaa0]
           +-10.0  Advanced Micro Devices [AMD] FCH USB XHCI Controller [1022:7812]
           +-10.1  Advanced Micro Devices [AMD] FCH USB XHCI Controller [1022:7812]
           +-11.0  Advanced Micro Devices [AMD] FCH SATA Controller [AHCI mode] [1022:7801]
           +-12.0  Advanced Micro Devices [AMD] FCH USB OHCI Controller [1022:7807]
           +-12.2  Advanced Micro Devices [AMD] FCH USB EHCI Controller [1022:7808]
           +-13.0  Advanced Micro Devices [AMD] FCH USB OHCI Controller [1022:7807]
           +-13.2  Advanced Micro Devices [AMD] FCH USB EHCI Controller [1022:7808]
           +-14.0  Advanced Micro Devices [AMD] FCH SMBus Controller [1022:780b]
           +-14.2  Advanced Micro Devices [AMD] FCH Azalia Controller [1022:780d]
           +-14.3  Advanced Micro Devices [AMD] FCH LPC Bridge [1022:780e]
           +-14.4-[02]--
           +-15.0-[03]--
           +-15.1-[04]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet controller [10ec:8168]
           +-18.0  Advanced Micro Devices [AMD] Family 15h (Models 10h-1fh) Processor Function 0 [1022:1400]
           +-18.1  Advanced Micro Devices [AMD] Family 15h (Models 10h-1fh) Processor Function 1 [1022:1401]
           +-18.2  Advanced Micro Devices [AMD] Family 15h (Models 10h-1fh) Processor Function 2 [1022:1402]
           +-18.3  Advanced Micro Devices [AMD] Family 15h (Models 10h-1fh) Processor Function 3 [1022:1403]
           +-18.4  Advanced Micro Devices [AMD] Family 15h (Models 10h-1fh) Processor Function 4 [1022:1404]
           \-18.5  Advanced Micro Devices [AMD] Family 15h (Models 10h-1fh) Processor Function 5 [1022:1405]

Steps to reproduce:
1. git clone git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git; mv linux-firmware/* /lib/firmware/
2. ifconfig eth0 up 192.168.0.4 netmask 255.255.255.0
3. nc -l -p 1234 </dev/zero
4. check that the link negotiated at 1000base-T
5. on a remote machine: nc 192.168.0.4 1234 >/dev/null
6. wait up to 10s
7. AMD-Vi page fault appears in syslog

Tested on kernel versions:
1. net-next
2. 3.8.4-gentoo
3. 3.8.2-gentoo
4. 3.8.0-gentoo
5. 3.6.6-gentoo *

* Note: 3.6.6 behaves differently: it recovers from the problem after resetting the device.
The net-next kernel version is reported as 3.9.0-rc3+
Adding iommu=pt to the kernel command line is enough to TX at 100% utilization. http://forums.gentoo.org/viewtopic-t-955334.html was the source of the idea. I assume the problem is still present, since iommu passthrough just changes the default policy from deny to allow.
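To double-check that a given boot really carried the option, the kernel command line can be inspected. This is only a minimal Python sketch: the helper name `has_iommu_pt` is hypothetical, and it assumes the option appears as its own whitespace-separated parameter, exactly as written above.

```python
def has_iommu_pt(cmdline: str) -> bool:
    """Return True if 'iommu=pt' is present as a separate kernel parameter."""
    # Kernel parameters are whitespace-separated, so compare whole tokens
    # to avoid matching unrelated options that merely contain the text.
    return "iommu=pt" in cmdline.split()


# Sample command line for illustration only (not taken from the report).
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 iommu=pt quiet"
print(has_iommu_pt(sample))
```

On a live Linux system the string would normally come from reading /proc/cmdline.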
I believe I'm seeing this as well (on 3.11):

Note how the AMD-Vi errors correlate with r8169 messages. And occasionally my network card seems to get into a bad state where I'm getting ~20% packet loss on the local network. Next time it happens I can check whether the log messages appear then as well.

Sep 2 03:15:08 nukunuku kernel: [ 5181.204609] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0018 address=0x0000000000003000 flags=0x0050]
Sep 2 03:15:26 nukunuku kernel: [ 5198.737751] ------------[ cut here ]------------
Sep 2 03:15:26 nukunuku kernel: [ 5198.737764] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x197/0x1fd()
Sep 2 03:15:26 nukunuku kernel: [ 5198.737767] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Sep 2 03:15:26 nukunuku kernel: [ 5198.737770] Modules linked in: lp parport bnep rfcomm bluetooth usb_storage asix k10temp snd_usb_audio uvcvideo snd_usbmidi_lib videobuf2_v
Sep 2 03:15:26 nukunuku kernel: [ 5198.737790] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.10.10 #66
Sep 2 03:15:26 nukunuku kernel: [ 5198.737794] Hardware name: System manufacturer System Product Name/F2A85-M LE, BIOS 5107 10/22/2012
Sep 2 03:15:26 nukunuku kernel: [ 5198.737797] 0000000000000000 ffff88043ec83d80 ffffffff8185c6c8 ffff88043ec83db8
Sep 2 03:15:26 nukunuku kernel: [ 5198.737803] ffffffff8107a8ae ffff88043ec83dc8 ffff88042d27e000 ffff88043ca21c00
Sep 2 03:15:26 nukunuku kernel: [ 5198.737807] ffff88042d27e320 0000000000000001 ffff88043ec83e18 ffffffff8107a90d
Sep 2 03:15:26 nukunuku kernel: [ 5198.737811] Call Trace:
Sep 2 03:15:26 nukunuku kernel: [ 5198.737813] <IRQ> [<ffffffff8185c6c8>] dump_stack+0x19/0x1b
Sep 2 03:15:26 nukunuku kernel: [ 5198.737825] [<ffffffff8107a8ae>] warn_slowpath_common+0x60/0x78
Sep 2 03:15:26 nukunuku kernel: [ 5198.737829] [<ffffffff8107a90d>] warn_slowpath_fmt+0x47/0x49
Sep 2 03:15:26 nukunuku kernel: [ 5198.737835] [<ffffffff8175d81d>] dev_watchdog+0x197/0x1fd
Sep 2 03:15:26 nukunuku kernel: [ 5198.737840] [<ffffffff8175d686>] ? dev_graft_qdisc+0x66/0x66
Sep 2 03:15:26 nukunuku kernel: [ 5198.737845] [<ffffffff810872b9>] call_timer_fn+0x63/0x15f
Sep 2 03:15:26 nukunuku kernel: [ 5198.737850] [<ffffffff81087761>] run_timer_softirq+0x1f1/0x252
Sep 2 03:15:26 nukunuku kernel: [ 5198.737855] [<ffffffff8175d686>] ? dev_graft_qdisc+0x66/0x66
Sep 2 03:15:26 nukunuku kernel: [ 5198.737859] [<ffffffff81080def>] __do_softirq+0x103/0x280
Sep 2 03:15:26 nukunuku kernel: [ 5198.737863] [<ffffffff8108106b>] irq_exit+0x53/0xb0
Sep 2 03:15:26 nukunuku kernel: [ 5198.737868] [<ffffffff81065bee>] smp_apic_timer_interrupt+0x86/0x94
Sep 2 03:15:26 nukunuku kernel: [ 5198.737874] [<ffffffff81869ddd>] apic_timer_interrupt+0x6d/0x80
Sep 2 03:15:26 nukunuku kernel: [ 5198.737876] <EOI> [<ffffffff81050849>] ? native_sched_clock+0x29/0x6f
Sep 2 03:15:26 nukunuku kernel: [ 5198.737886] [<ffffffff816cfb25>] ? cpuidle_enter_state+0x4d/0xa4
Sep 2 03:15:26 nukunuku kernel: [ 5198.737891] [<ffffffff816cfb1e>] ? cpuidle_enter_state+0x46/0xa4
Sep 2 03:15:26 nukunuku kernel: [ 5198.737896] [<ffffffff816cfca2>] cpuidle_idle_call+0x126/0x22a
Sep 2 03:15:26 nukunuku kernel: [ 5198.737900] [<ffffffff81051a58>] arch_cpu_idle+0x9/0x1e
Sep 2 03:15:26 nukunuku kernel: [ 5198.737906] [<ffffffff810b1ffa>] cpu_startup_entry+0x162/0x228
Sep 2 03:15:26 nukunuku kernel: [ 5198.737911] [<ffffffff81852b93>] start_secondary+0x1bf/0x1c2
Sep 2 03:15:26 nukunuku kernel: [ 5198.737914] ---[ end trace f0a1a3830fd46e08 ]---
Sep 2 03:15:26 nukunuku kernel: [ 5198.770521] r8169 0000:03:00.0 eth0: link up
Sep 2 03:39:37 nukunuku kernel: [ 6650.970740] nf_conntrack: automatic helper assignment is deprecated and it will be removed soon. Use the iptables CT target to attach helpe
Sep 2 04:04:57 nukunuku kernel: [ 8173.042344] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0018 address=0x0000000000003000 flags=0x0050]
Sep 2 04:05:20 nukunuku kernel: [ 8196.032118] r8169 0000:03:00.0 eth0: link up
Sep 2 04:42:31 nukunuku kernel: [10430.017275] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0018 address=0x0000000000003000 flags=0x0050]
Sep 2 04:43:14 nukunuku kernel: [10472.456515] r8169 0000:03:00.0 eth0: link up
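The correlation between the AMD-Vi faults and the r8169 messages can be checked mechanically. The following is a rough Python sketch, not a polished tool: the function name `fault_to_link_up_delays` and both regular expressions are assumptions written against the log format quoted in this comment. It pairs each IO_PAGE_FAULT with the next r8169 "link up" line and reports the delay between them in seconds of kernel time.

```python
import re

# Patterns written against the syslog lines quoted in this comment.
FAULT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"device=(?P<device>[0-9a-f:.]+) domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) flags=(?P<flags>0x[0-9a-f]+)\]"
)
LINK_UP_RE = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\] r8169 \S+ \S+: link up")


def fault_to_link_up_delays(lines):
    """Pair each IO_PAGE_FAULT with the next 'link up'; return delays in seconds."""
    delays = []
    pending = None  # kernel timestamp of the most recent unmatched fault
    for line in lines:
        fault = FAULT_RE.search(line)
        if fault:
            pending = float(fault.group("ts"))
            continue
        link_up = LINK_UP_RE.search(line)
        if link_up and pending is not None:
            delays.append(round(float(link_up.group("ts")) - pending, 3))
            pending = None
    return delays


# The six fault / link-up lines from the log in this comment.
LOG = [
    "Sep 2 03:15:08 nukunuku kernel: [ 5181.204609] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0018 address=0x0000000000003000 flags=0x0050]",
    "Sep 2 03:15:26 nukunuku kernel: [ 5198.770521] r8169 0000:03:00.0 eth0: link up",
    "Sep 2 04:04:57 nukunuku kernel: [ 8173.042344] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0018 address=0x0000000000003000 flags=0x0050]",
    "Sep 2 04:05:20 nukunuku kernel: [ 8196.032118] r8169 0000:03:00.0 eth0: link up",
    "Sep 2 04:42:31 nukunuku kernel: [10430.017275] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0018 address=0x0000000000003000 flags=0x0050]",
    "Sep 2 04:43:14 nukunuku kernel: [10472.456515] r8169 0000:03:00.0 eth0: link up",
]

print(fault_to_link_up_delays(LOG))
```

In this log every fault is followed by a link-up within roughly 18 to 43 seconds, which fits the driver resetting the device after the transmit watchdog fires.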
Just a note on why Tobias' r8169 is on 03:00.0 while mine is on 04:00.0: the bus number (03 versus 04) is dynamically allocated when each PCI-E root port is initialized. I have an external Radeon GPU on 01:00.0. The PCI-E slots for 02:**.* and 03:**.* are hidden/unknown to me, but the external GPU is why the r8169 is 04:00.0 on my system.
Created attachment 107631 [details] RxConfig hack for the 8168f
(In reply to Tobias Diedrich from comment #3)
> I believe I'm seeing this as well (on 3.11):

8168f as well?

-- Ueimor
Francois,

Tobias' messages do identify his board, though he could have made this clearer:

"Hardware name: System manufacturer System Product Name/F2A85-M LE, BIOS 5107 10/22/2012"

Tobias' F2A85-M LE and my F2A85-M/CSM have identical 8168f chips.

I would like to know more about the RxConfig hack:

RTL_W32(RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);

I thought it was an 8168f firmware problem? But thanks for the tip.
(In reply to David Hubbard from comment #7)
[...]
> I thought it was an 8168f firmware problem? But thanks for the tip.

It is not even clear whether the network driver / chipset is the culprit: whatever the suspect, the AMD-Vi logged message almost always looks the same. The more I ask the Oracle for similar problem reports, the less I trust the IOMMU :o/

There are two unexplained fixes:
1. "iommu=pt" (aka "disable the iommu", wonderful)
2. fetch one rx descriptor at a time (what the patch does)

-- Ueimor
Patch went mainline between v3.11 and v3.12 as 3ced8c955e74d319f3e3997f7169c79d524dfd06 ("r8169: enforce RX_MULTI_EN for the 8168f"). Thanks. -- Ueimor