Bug 16276 - RTL-8110SC/8169SC driver crash
Summary: RTL-8110SC/8169SC driver crash
Status: RESOLVED CODE_FIX
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV4 (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Francois Romieu
URL:
Keywords:
: 16303 (view as bug list)
Depends on:
Blocks:
 
Reported: 2010-06-23 13:56 UTC by Jaroslav Fojtik
Modified: 2012-08-09 14:04 UTC (History)
2 users (show)

See Also:
Kernel Version: 2.6.34
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Hardware requires more delay in mdio_{read / write} (753 bytes, patch)
2010-06-28 21:47 UTC, Francois Romieu
Details | Diff

Description Jaroslav Fojtik 2010-06-23 13:56:17 UTC
[16863.998343] ------------[ cut here ]------------
[16863.998370] WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0x1bb/0x1d0(                         )
[16863.998378] Hardware name:
[16863.998384] NETDEV WATCHDOG: eth2 (r8169): transmit queue 0 timed out
[16863.998390] Modules linked in: cls_u32 cls_fw sch_sfq sch_cbq iptable_nat ipt                         able_mangle ipt_MASQUERADE nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf                         _conntrack_ftp ip_tables arptable_filter arp_tables ipv6 ath5k mac80211 ath cfg8                         0211 cpufreq_ondemand processor thermal e_powersaver fan thermal_sys f71805f sha                         256_generic padlock_sha padlock_aes xt_mac xt_MARK dummy via_rhine r8169 mii tul                         ip
[16863.998472] Pid: 0, comm: swapper Not tainted 2.6.34 #7
[16863.998478] Call Trace:
[16863.998489]  [<c1344e9b>] ? dev_watchdog+0x1bb/0x1d0
[16863.998499]  [<c1344e9b>] ? dev_watchdog+0x1bb/0x1d0
[16863.998513]  [<c10348cf>] ? warn_slowpath_common+0x7f/0xf0
[16863.998523]  [<c1344e9b>] ? dev_watchdog+0x1bb/0x1d0
[16863.998534]  [<c103498b>] ? warn_slowpath_fmt+0x2b/0x30
[16863.998544]  [<c1344e9b>] ? dev_watchdog+0x1bb/0x1d0
[16863.998558]  [<c1060039>] ? tick_handle_periodic_broadcast+0x109/0x110
[16863.998572]  [<c106f78c>] ? handle_edge_irq+0xbc/0x100
[16863.998582]  [<c1344ce0>] ? dev_watchdog+0x0/0x1d0
[16863.998596]  [<c104344d>] ? run_timer_softirq+0x20d/0x5f0
[16863.998615]  [<f805ed82>] ? rtl8169_poll+0x1a2/0x200 [r8169]
[16863.998630]  [<c103b0ef>] ? __do_softirq+0x7f/0x220
[16863.998639]  [<c106dac9>] ? handle_IRQ_event+0x49/0x130
[16863.998650]  [<c103b466>] ? irq_exit+0x36/0x40
[16863.998661]  [<c1004b73>] ? do_IRQ+0x43/0xb0
[16863.998670]  [<c10602a4>] ? tick_program_event+0x14/0x20
[16863.998680]  [<c1003229>] ? common_interrupt+0x29/0x30
[16863.998690]  [<c105f18a>] ? clockevents_notify+0x2a/0x110
[16863.998706]  [<f8128982>] ? acpi_idle_enter_simple+0x11d/0x137 [processor]
[16863.998720]  [<f812870f>] ? acpi_idle_enter_bm+0xba/0x210 [processor]
[16863.998736]  [<c13c405a>] ? schedule+0x1aa/0x400
[16863.998747]  [<c12ef756>] ? cpuidle_idle_call+0x66/0xe0
[16863.998756]  [<c1001a6c>] ? cpu_idle+0x1c/0x30
[16863.998766]  [<c150d80a>] ? start_kernel+0x23a/0x2c0
[16863.998776]  [<c150d1e0>] ? unknown_bootoption+0x0/0x1c0
[16863.998782] ---[ end trace 398c5f7cd9eb88bb ]---
Comment 1 Andrew Morton 2010-06-28 19:51:11 UTC
Looks like a driver issue.
Comment 2 Francois Romieu 2010-06-28 21:47:54 UTC
Created attachment 26964 [details]
Hardware requires more delay in mdio_{read / write}

This in included in post 2.6.34.

Jaroslav, can you give it a try ?

-- 
Ueimor
Comment 3 Jaroslav Fojtik 2010-06-28 22:10:56 UTC
No, I have tested bare 2.6.34 only. I will do it.
Comment 4 Jaroslav Fojtik 2010-06-28 23:11:59 UTC
I have been used this computer for one year and half until this message occurred.
So I do not expect that fixing it I could notify sooner.


Anyway, this patch seems to me to be quite bad:

Original code:
static int mdio_read(void __iomem *ioaddr, int reg_addr)
{
	int i, value = -1;

	RTL_W32(PHYAR, 0x0 | (reg_addr & 0x1f) << 16);

	for (i = 20; i > 0; i--) {
		/*
		 * Check if the RTL8169 has completed retrieving data from
		 * the specified MII register.
		 */
		if (RTL_R32(PHYAR) & 0x80000000) {
			value = RTL_R32(PHYAR) & 0xffff;
			break;
		}
		udelay(25);
                  ^^^ why to spinloop here 25us when 20 is sufficient enough
	}
	return value;
}



Do rather this. This would wait from 20us to 40us after event occurs.
Original code waits from 25us to 45us after event occurs.

static int mdio_read(void __iomem *ioaddr, int reg_addr)
{
	int i, value = -1;

	RTL_W32(PHYAR, 0x0 | (reg_addr & 0x1f) << 16);

	for (i = 25; i > 0; i--) {
		/*
		 * Check if the RTL8169 has completed retrieving data from
		 * the specified MII register.
		 */
		if (RTL_R32(PHYAR) & 0x80000000) {
			value = RTL_R32(PHYAR) & 0xffff;
                        udelay(20);
			break;
		}
		udelay(20);
	}
	return value;
}
Comment 5 Francois Romieu 2010-06-29 22:26:21 UTC
Jaroslav Fojtik   2010-06-28 23:11:59 :
[...]
> I have been used this computer for one year and half until this message
> occurred. So I do not expect that fixing it I could notify sooner.

Fine. There is no hurry.

[...]
> why to spinloop here 25us when 20 is sufficient enough

It is not. See http://marc.info/?l=linux-netdev&m=127605163010938&w=2

-- 
Ueimor
Comment 6 Jaroslav Fojtik 2010-06-29 23:29:53 UTC
>It is not. See http://marc.info/?l=linux-netdev&m=127605163010938&w=2

You did not understand me.

When I pool device with period 25us, the operation finish could occur anytime in the middle of 25us delay.

When I add 20us additional delay, the delay from operation end is random value from 20us to 45us.

Is it clear now?

So I propose faster pooling with period 20us - see my code.


I could test even faster pooling, but I am unsure whether saving several us could help.
Comment 7 Francois Romieu 2010-06-30 21:45:28 UTC
Optimizing mdio code is not my #1 priority, especially when I broke
the code with falsely innocuous changes in the first place (looong
ago).

Realtek suggests that the completion needs ~100us in itself.
An extra 20us guard is needed too.

How much will be saved with the suggested change when compared with
current -git code ? How many mdio ops ? Really _that_ often ?

Not much imho.

Of course you can discuss or submit it on netdev if you feel it is
worth it.

-- 
Ueimor
Comment 8 Jaroslav Fojtik 2010-06-30 22:27:25 UTC
The proposed change would:
  1, add one additional op,  (100/25=4; 100/20=5)
  2, save 5us machine time
  3, does not prolong total timeout wait time.

I have no idea how often it is used.
I only hope that this in not required for every packet, because it could kill overall performance a lot.
Comment 9 Francois Romieu 2010-07-01 20:12:45 UTC
Ok, I better understand your concern now.

This code is only used for link management related operations. It should
mostly be used when the interface status changes or when something goes
wrong. The 8168 can use it heavily (several hundred mdio ops) but the old
8169 is an order of magnitude below.

Actually it's a bit like saving 5% of the time spent filling the yearly
fiscal paper.

-- 
Ueimor
Comment 10 Jaroslav Fojtik 2010-07-01 23:11:03 UTC
*** Bug 16303 has been marked as a duplicate of this bug. ***
Comment 11 Jaroslav Fojtik 2010-07-01 23:25:22 UTC
I will return back to main topic and will not poison this bug report with
speedup hints.

After adding additional 20us delay I did not observe any defect with three
RTL-8110SC/8169SC devices.

I also guess that diode defect after cable unplug was the same problem, because
it vanishes.

Note You need to log in before you can comment on or make changes to this bug.