[16863.998343] ------------[ cut here ]------------ [16863.998370] WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0x1bb/0x1d0( ) [16863.998378] Hardware name: [16863.998384] NETDEV WATCHDOG: eth2 (r8169): transmit queue 0 timed out [16863.998390] Modules linked in: cls_u32 cls_fw sch_sfq sch_cbq iptable_nat ipt able_mangle ipt_MASQUERADE nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf _conntrack_ftp ip_tables arptable_filter arp_tables ipv6 ath5k mac80211 ath cfg8 0211 cpufreq_ondemand processor thermal e_powersaver fan thermal_sys f71805f sha 256_generic padlock_sha padlock_aes xt_mac xt_MARK dummy via_rhine r8169 mii tul ip [16863.998472] Pid: 0, comm: swapper Not tainted 2.6.34 #7 [16863.998478] Call Trace: [16863.998489] [<c1344e9b>] ? dev_watchdog+0x1bb/0x1d0 [16863.998499] [<c1344e9b>] ? dev_watchdog+0x1bb/0x1d0 [16863.998513] [<c10348cf>] ? warn_slowpath_common+0x7f/0xf0 [16863.998523] [<c1344e9b>] ? dev_watchdog+0x1bb/0x1d0 [16863.998534] [<c103498b>] ? warn_slowpath_fmt+0x2b/0x30 [16863.998544] [<c1344e9b>] ? dev_watchdog+0x1bb/0x1d0 [16863.998558] [<c1060039>] ? tick_handle_periodic_broadcast+0x109/0x110 [16863.998572] [<c106f78c>] ? handle_edge_irq+0xbc/0x100 [16863.998582] [<c1344ce0>] ? dev_watchdog+0x0/0x1d0 [16863.998596] [<c104344d>] ? run_timer_softirq+0x20d/0x5f0 [16863.998615] [<f805ed82>] ? rtl8169_poll+0x1a2/0x200 [r8169] [16863.998630] [<c103b0ef>] ? __do_softirq+0x7f/0x220 [16863.998639] [<c106dac9>] ? handle_IRQ_event+0x49/0x130 [16863.998650] [<c103b466>] ? irq_exit+0x36/0x40 [16863.998661] [<c1004b73>] ? do_IRQ+0x43/0xb0 [16863.998670] [<c10602a4>] ? tick_program_event+0x14/0x20 [16863.998680] [<c1003229>] ? common_interrupt+0x29/0x30 [16863.998690] [<c105f18a>] ? clockevents_notify+0x2a/0x110 [16863.998706] [<f8128982>] ? acpi_idle_enter_simple+0x11d/0x137 [processor] [16863.998720] [<f812870f>] ? acpi_idle_enter_bm+0xba/0x210 [processor] [16863.998736] [<c13c405a>] ? schedule+0x1aa/0x400 [16863.998747] [<c12ef756>] ? cpuidle_idle_call+0x66/0xe0 [16863.998756] [<c1001a6c>] ? cpu_idle+0x1c/0x30 [16863.998766] [<c150d80a>] ? start_kernel+0x23a/0x2c0 [16863.998776] [<c150d1e0>] ? unknown_bootoption+0x0/0x1c0 [16863.998782] ---[ end trace 398c5f7cd9eb88bb ]---
Looks like a driver issue.
Created attachment 26964 [details] Hardware requires more delay in mdio_{read / write} This in included in post 2.6.34. Jaroslav, can you give it a try ? -- Ueimor
No, I have tested bare 2.6.34 only. I will do it.
I have been used this computer for one year and half until this message occurred. So I do not expect that fixing it I could notify sooner. Anyway, this patch seems to me to be quite bad: Original code: static int mdio_read(void __iomem *ioaddr, int reg_addr) { int i, value = -1; RTL_W32(PHYAR, 0x0 | (reg_addr & 0x1f) << 16); for (i = 20; i > 0; i--) { /* * Check if the RTL8169 has completed retrieving data from * the specified MII register. */ if (RTL_R32(PHYAR) & 0x80000000) { value = RTL_R32(PHYAR) & 0xffff; break; } udelay(25); ^^^ why to spinloop here 25us when 20 is sufficient enough } return value; } Do rather this. This would wait from 20us to 40us after event occurs. Original code waits from 25us to 45us after event occurs. static int mdio_read(void __iomem *ioaddr, int reg_addr) { int i, value = -1; RTL_W32(PHYAR, 0x0 | (reg_addr & 0x1f) << 16); for (i = 25; i > 0; i--) { /* * Check if the RTL8169 has completed retrieving data from * the specified MII register. */ if (RTL_R32(PHYAR) & 0x80000000) { value = RTL_R32(PHYAR) & 0xffff; udelay(20); break; } udelay(20); } return value; }
Jaroslav Fojtik 2010-06-28 23:11:59 : [...] > I have been used this computer for one year and half until this message > occurred. So I do not expect that fixing it I could notify sooner. Fine. There is no hurry. [...] > why to spinloop here 25us when 20 is sufficient enough It is not. See http://marc.info/?l=linux-netdev&m=127605163010938&w=2 -- Ueimor
>It is not. See http://marc.info/?l=linux-netdev&m=127605163010938&w=2 You did not understand me. When I pool device with period 25us, the operation finish could occur anytime in the middle of 25us delay. When I add 20us additional delay, the delay from operation end is random value from 20us to 45us. Is it clear now? So I propose faster pooling with period 20us - see my code. I could test even faster pooling, but I am unsure whether saving several us could help.
Optimizing mdio code is not my #1 priority, especially when I broke the code with falsely innocuous changes in the first place (looong ago). Realtek suggests that the completion needs ~100us in itself. An extra 20us guard is needed too. How much will be saved with the suggested change when compared with current -git code ? How many mdio ops ? Really _that_ often ? Not much imho. Of course you can discuss or submit it on netdev if you feel it is worth it. -- Ueimor
The proposed change would: 1, add one additional op, (100/25=4; 100/20=5) 2, save 5us machine time 3, does not prolong total timeout wait time. I have no idea how often it is used. I only hope that this in not required for every packet, because it could kill overall performance a lot.
Ok, I better understand your concern now. This code is only used for link management related operations. It should mostly be used when the interface status changes or when something goes wrong. The 8168 can use it heavily (several hundred mdio ops) but the old 8169 is an order of magnitude below. Actually it's a bit like saving 5% of the time spent filling the yearly fiscal paper. -- Ueimor
*** Bug 16303 has been marked as a duplicate of this bug. ***
I will return back to main topic and will not poison this bug report with speedup hints. After adding additional 20us delay I did not observe any defect with three RTL-8110SC/8169SC devices. I also guess that diode defect after cable unplug was the same problem, because it vanishes.