Bug 10109 - r8169 hangs after some use with WoL enabled and "PME Event Wake up" enabled
Summary: r8169 hangs after some use with WoL enabled and "PME Event Wake up" enabled
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assignee: Francois Romieu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-02-25 20:19 UTC by elsabi
Modified: 2012-05-17 15:36 UTC
CC List: 5 users

See Also:
Kernel Version: 2.6.34
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg output after the problem (30.96 KB, text/x-log)
2008-02-25 20:22 UTC, elsabi
lspci -nnvvv, after the problem (11.61 KB, text/x-log)
2008-02-25 20:27 UTC, elsabi

Description elsabi 2008-02-25 20:19:01 UTC
Latest working kernel version: Ubuntu Gutsy 2.6.22
Earliest failing kernel version:
Distribution: Ubuntu Gutsy, with a self-built 2.6.24 kernel
Hardware Environment: 
 CPU: AMD Athlon 64 X2
 Motherboard: Gigabyte GA-MA69VM-S2, BIOS updated to version 7.
 Net Card: (onboard) Realtek Semiconductor Co., Ltd. RTL-8110SC/8169SC Gigabit Ethernet [10ec:8167] (rev 10)

Software Environment: Xubuntu Gutsy. 
Problem Description:

After some time using the network this message appears in dmesg:
[  818.483753] NETDEV WATCHDOG: eth0: transmit timed out
[  818.755844] r8169: eth0: link up

After that, the network no longer works. The link LED is on, but it doesn't blink when a packet is sent, and no packets can be received.
The network card (a 1000Mbps card) is connected to a 10Mbps hub (yes, I use it for slow internet access).
When I run ethtool before the hang, the card is in 10Mbps/Half duplex mode. After the hang (that is, after the "transmit timed out" message), ethtool reports that the card is in 1000Mbps/Full duplex mode. This probably causes the connectivity loss. However, I'm not able to change that speed using ethtool (the setting simply doesn't change when I run ethtool -s ...).

Bringing the interface down and up doesn't solve the problem.
Worse than that, when I bring the interface down and then remove the r8169 module, it unloads fine, but when I try to insert it again with "modprobe r8169", the process ends with a "Killed" message. dmesg then shows "Unable to handle kernel NULL pointer dereference ..." and a backtrace.
At that point, lsmod lists the module as inserted, but it still doesn't work.

Even worse (if that is possible), when I shut down the PC and start it again, the machine no longer boots. Unplugging it from the wall for a while and plugging it back in makes no difference. It just hangs with the screen off, before the initial "beep". Sometimes, when it fails to boot, a sequence of one long beep and two short beeps can be heard (though I'm not sure that's the exact sequence). That sequence could mean "CMOS setting error" or "Monitor or video card error".
Anyway, the only thing that brings the machine back is clearing the CMOS (with the jumper on the motherboard) and rebooting. I reproduced this behavior about four times.

The failsafe BIOS settings have the "PME Event Wake up" feature disabled. I used the machine for 5 days with that feature disabled and it worked well, with no hangs over many hours. However, if I enable that feature, the driver hangs after about one hour of use, or under heavy traffic on the card. I'm not sure what exactly causes the timeout, but it seems to be the traffic.

I was confused by the relation between the CMOS memory and the network driver, so I checked the battery voltage and the power supply voltages, ran a memory test, and did the other usual checks. It seems that "PME Event Wake up" could be the key to that relation.
Also, the machine doesn't boot when I send a wake-up packet with etherwake from another machine.

I can reproduce the same behavior with and without CONFIG_R8169_NAPI.

Steps to reproduce:
Enable "PME Event Wake up" in the BIOS.
Use the network for some time and generate some heavy traffic. It will hang.
Comment 1 elsabi 2008-02-25 20:22:10 UTC
Created attachment 14995
dmesg output after the problem
Comment 2 elsabi 2008-02-25 20:27:53 UTC
Created attachment 14996
lspci -nnvvv, after the problem

Before the problem, the output differs slightly (look at PERR and Cache Line Size):

02:0f.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL-8110SC/8169SC Gigabit Ethernet [10ec:8167] (rev 10)
	Subsystem: Giga-byte Technology Unknown device [1458:e000]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 64 (8000ns min, 16000ns max), Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 23
	Region 0: I/O ports at dc00 [size=256]
	Region 1: Memory at fddff000 (32-bit, non-prefetchable) [size=256]
	[virtual] Expansion ROM at fdc00000 [disabled] [size=128K]
	Capabilities: [dc] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-
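
For readers unfamiliar with the "Status:" line of the Power Management capability above, the following is a minimal sketch of how its fields map onto the 16-bit PMCSR register, assuming the bit layout from the PCI Power Management specification (the same masks appear as PCI_PM_CTRL_* in the kernel headers). The names decode_pmcsr, print_pmcsr and struct pmcsr_fields are hypothetical helpers made up for this illustration; they are not part of lspci or of the r8169 driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* PMCSR bit layout (PCI Power Management spec). */
#define PMCSR_STATE_MASK   0x0003u /* bits 1:0   current power state */
#define PMCSR_PME_ENABLE   0x0100u /* bit 8      PME_En              */
#define PMCSR_DSEL_MASK    0x1e00u /* bits 12:9  Data_Select         */
#define PMCSR_DSEL_SHIFT   9
#define PMCSR_DSCALE_MASK  0x6000u /* bits 14:13 Data_Scale          */
#define PMCSR_DSCALE_SHIFT 13
#define PMCSR_PME_STATUS   0x8000u /* bit 15     PME_Status          */

/* Hypothetical container for the decoded fields. */
struct pmcsr_fields {
    unsigned power_state; /* 0..3, i.e. D0, D1, D2, D3hot */
    bool     pme_enable;
    unsigned data_select;
    unsigned data_scale;
    bool     pme_status;
};

/* Split a raw PMCSR value into its fields. */
static struct pmcsr_fields decode_pmcsr(uint16_t raw)
{
    struct pmcsr_fields f;

    f.power_state = raw & PMCSR_STATE_MASK;
    f.pme_enable  = (raw & PMCSR_PME_ENABLE) != 0;
    f.data_select = (raw & PMCSR_DSEL_MASK) >> PMCSR_DSEL_SHIFT;
    f.data_scale  = (raw & PMCSR_DSCALE_MASK) >> PMCSR_DSCALE_SHIFT;
    f.pme_status  = (raw & PMCSR_PME_STATUS) != 0;
    return f;
}

/* Print the fields in a layout similar to the lspci line quoted above. */
static void print_pmcsr(uint16_t raw)
{
    struct pmcsr_fields f = decode_pmcsr(raw);

    assert(f.power_state <= 3); /* guaranteed by the 2-bit mask */
    printf("Status: D%u PME-Enable%c DSel=%u DScale=%u PME%c\n",
           f.power_state, f.pme_enable ? '+' : '-',
           f.data_select, f.data_scale, f.pme_status ? '+' : '-');
}
```

Under these assumptions, a raw value of 0x0000 corresponds to the state shown in the listing: device in D0, PME generation disabled, and no PME pending.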
Comment 3 Andrew Morton 2008-02-25 20:36:11 UTC
Marking as a regression.
Comment 4 Francois Romieu 2008-02-26 12:21:26 UTC
elsabi:
[...]
> [   17.908803] PCI: Using MMCONFIG at e0000000 - efffffff
[...]
> [   58.665630] [fglrx] GART Table is not in FRAME_BUFFER range 
> [   58.665638] [fglrx] Reserve Block - 0 offset =  0X7ffb000 length = 0X5000
> [   58.665640] [fglrx] Reserve Block - 1 offset =  0X0 length = 0X1000000

Can you test again with an MMCONFIG-disabled kernel that does not load
this binary module?

-- 
Ueimor
Comment 5 Rolf Eike Beer 2008-05-20 23:10:07 UTC
I also get these nasty timeouts. For me, suspend/resume solves it. If I notice the timeout early enough, sometimes even the TCP connections survive the power cycle ;)

With recent kernels (currently 2.6.26-rc2-git) I get these messages:

NETDEV WATCHDOG: eth0: transmit timed out
------------[ cut here ]------------
WARNING: at /home/eike/repos/linux-2.6/net/sched/sch_generic.c:222 dev_watchdog+0x95/0xe7()
Modules linked in: iptable_filter ip_tables ip6table_filter ip6_tables x_tables af_packet ipv6 i915 drm cpufreq_conservative cpufreq_ondemand cpufreq_userspace cpufreq_powersave acpi_cpufreq speedstep_lib freq_table snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device twofish twofish_common cbc usbhid dm_crypt nls_utf8 ntfs ext3 jbd loop omnibook mmc_block pcmcia arc4 ecb crypto_blkcipher ohci1394 ieee1394 ac snd_hda_intel sdhci mmc_core yenta_socket rsrc_nonstatic snd_pcm i2c_i801 container battery iwl3945 firmware_class pcmcia_core button backlight output intel_agp snd_timer i2c_core agpgart snd soundcore mac80211 iTCO_wdt snd_page_alloc sr_mod cdrom serio_raw joydev r8169 cfg80211 sg sd_mod ehci_hcd uhci_hcd usbcore dm_snapshot edd dm_mod fan ata_piix libata scsi_mod dock thermal processor
Pid: 0, comm: swapper Not tainted 2.6.26-rc2-git #49
[<c011f271>] warn_on_slowpath+0x41/0x6d
[<c011f757>] ? release_console_sem+0x181/0x189
[<c02a6fc0>] ? _spin_lock_irqsave+0xc/0x11
[<c0126b65>] ? lock_timer_base+0x1f/0x3e
[<c0126c8b>] ? __mod_timer+0xa0/0xab
[<c012c8ca>] ? queue_delayed_work_on+0xa1/0xae
[<c012c905>] ? queue_delayed_work+0x1b/0x1e
[<c012c919>] ? schedule_delayed_work+0x11/0x13
[<f8af10c1>] ? rtl8169_schedule_work+0x1e/0x20 [r8169]
[<f8af1127>] ? rtl8169_tx_timeout+0x1f/0x22 [r8169]
[<c025ad75>] dev_watchdog+0x95/0xe7
[<c01266d7>] run_timer_softirq+0x133/0x193
[<c025ace0>] ? dev_watchdog+0x0/0xe7
[<c025ace0>] ? dev_watchdog+0x0/0xe7
[<c0123626>] __do_softirq+0x6d/0xd2
[<c0105895>] do_softirq+0x55/0x88
[<c014c260>] ? handle_edge_irq+0x0/0x10d
[<c0123586>] irq_exit+0x38/0x6b
[<c0105982>] do_IRQ+0xba/0xd0
[<c0104277>] common_interrupt+0x23/0x28
[<c013007b>] ? run_posix_cpu_timers+0x523/0x78e
[<f8828168>] ? acpi_idle_enter_bm+0x286/0x2f4 [processor]
[<c023db6c>] cpuidle_idle_call+0x58/0x88
[<c023db14>] ? cpuidle_idle_call+0x0/0x88
[<c0102570>] cpu_idle+0x90/0xac
[<c029abb6>] rest_init+0x4e/0x50
=======================
---[ end trace e04c42da614d215f ]---

See bug 6807, comment 40, for hardware details.
Comment 6 Alan 2009-03-24 04:52:32 UTC
Rolf, please test 2.6.28 or 2.6.29 and let me know if this bug is now stale.
Comment 7 Mike Bradley 2009-03-27 19:35:31 UTC
Confirmed virtually identical issue with 2.6.28
Comment 8 Rolf Eike Beer 2010-09-06 12:03:57 UTC
Just to keep this alive, this is from openSUSE's 2.6.34 desktop kernel:

Call Trace:
 [<c02065c3>] try_stack_unwind+0x173/0x190
 [<c02051cf>] dump_trace+0x3f/0xe0
 [<c020662b>] show_trace_log_lvl+0x4b/0x60
 [<c0206658>] show_trace+0x18/0x20
 [<c064d690>] dump_stack+0x6d/0x72
 [<c024430e>] warn_slowpath_common+0x6e/0xb0
 [<c024439b>] warn_slowpath_fmt+0x2b/0x30
 [<c059e63c>] dev_watchdog+0x1dc/0x1f0
 [<c025236d>] run_timer_softirq+0x12d/0x2c0
 [<c024a972>] __do_softirq+0xa2/0x200
 [<c024ab05>] do_softirq+0x35/0x40
 [<c024ae9d>] irq_exit+0x6d/0x70
 [<c021a323>] smp_apic_timer_interrupt+0x53/0x90
 [<c065086e>] apic_timer_interrupt+0x2a/0x30
 [<b5d98465>] 0xb5d98465
---[ end trace 2eb442aeee40a00a ]---
Comment 9 Francois Romieu 2012-05-12 11:06:33 UTC
- There is a problem with TX_BUFFS_AVAIL; see
  http://patchwork.ozlabs.org/patch/157824/. The patch is scheduled
  for mainline.
- There should be an smp_wmb() between the updates of txd->opts1 and
  tp->cur_tx in rtl8169_start_xmit.

-- 
Ueimor
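
To illustrate the second point of comment 9, here is a minimal userspace sketch of the ordering issue, not the actual r8169 code. It assumes a simplified descriptor ring in which a producer fills a descriptor and then advances a producer index. The C11 release fence stands in for the kernel's smp_wmb(), which is not available in userspace. The names tx_ring, tx_desc, tx_queue, tx_reclaim, tx_slots_free, RING_SIZE and DESC_OWN are hypothetical, and the free-slot calculation is a generic one, not the driver's TX_BUFFS_AVAIL macro.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 64u
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0,
              "ring size must be a power of two");

#define DESC_OWN 0x80000000u /* hypothetical "descriptor is ready" bit */

struct tx_desc {
    uint32_t opts1; /* ownership bit plus length, loosely modeled on txd->opts1 */
    uint32_t len;
};

struct tx_ring {
    struct tx_desc desc[RING_SIZE];
    atomic_uint cur_tx;   /* next slot the producer will fill */
    atomic_uint dirty_tx; /* next slot the consumer will reclaim */
};

/* Number of descriptors the producer may still fill. */
static unsigned tx_slots_free(struct tx_ring *r)
{
    unsigned cur = atomic_load_explicit(&r->cur_tx, memory_order_relaxed);
    unsigned dirty = atomic_load_explicit(&r->dirty_tx, memory_order_acquire);

    return RING_SIZE - (cur - dirty);
}

/* Producer: fill a descriptor, then publish it by advancing cur_tx. */
static bool tx_queue(struct tx_ring *r, uint32_t len)
{
    unsigned cur = atomic_load_explicit(&r->cur_tx, memory_order_relaxed);
    struct tx_desc *d;

    if (tx_slots_free(r) == 0)
        return false;

    d = &r->desc[cur % RING_SIZE];
    d->len = len;
    d->opts1 = DESC_OWN | len;

    /*
     * Without this fence the index update could become visible before
     * the descriptor contents. This is the userspace stand-in for the
     * smp_wmb() suggested in comment 9.
     */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&r->cur_tx, cur + 1, memory_order_relaxed);
    return true;
}

/* Consumer: reclaim one published descriptor, if any. */
static bool tx_reclaim(struct tx_ring *r, uint32_t *len_out)
{
    unsigned dirty = atomic_load_explicit(&r->dirty_tx, memory_order_relaxed);
    unsigned cur = atomic_load_explicit(&r->cur_tx, memory_order_relaxed);

    if (dirty == cur)
        return false;

    /* Pairs with the release fence in tx_queue(). */
    atomic_thread_fence(memory_order_acquire);
    *len_out = r->desc[dirty % RING_SIZE].len;
    atomic_store_explicit(&r->dirty_tx, dirty + 1, memory_order_release);
    return true;
}
```

The design point is that the consumer decides how many descriptors to process by reading the index, so the index must never be observed ahead of the descriptor writes it is meant to publish.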
