Bug 10134 - r8169 randomly hangs system
Summary: r8169 randomly hangs system
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assignee: Francois Romieu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-02-29 07:14 UTC by Maxim Radugin
Modified: 2009-06-15 22:01 UTC

See Also:
Kernel Version: 2.6.23.9
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Kernel config (38.29 KB, application/octet-stream)
2008-02-29 07:15 UTC, Maxim Radugin
dmesg output after system boot (15.09 KB, application/octet-stream)
2008-02-29 07:15 UTC, Maxim Radugin
ifconfig output (935 bytes, application/octet-stream)
2008-02-29 07:16 UTC, Maxim Radugin
complete lspci output (13.07 KB, application/octet-stream)
2008-02-29 07:16 UTC, Maxim Radugin
/proc/interrupts (650 bytes, application/octet-stream)
2008-02-29 07:16 UTC, Maxim Radugin
/proc/iomem (1.40 KB, application/octet-stream)
2008-02-29 07:17 UTC, Maxim Radugin
/proc/ioports (1.51 KB, application/octet-stream)
2008-02-29 07:17 UTC, Maxim Radugin
Force renegotiation after resume (376 bytes, patch)
2008-03-05 14:51 UTC, Francois Romieu

Description Maxim Radugin 2008-02-29 07:14:51 UTC
Distribution: 
	Debian GNU/Linux 4.0
Hardware Environment: 
	Chipset: Intel GMCH: 915GM, ICH6M: 82801FBM
	CPU: Intel Celeron M 1.50GHz
	RAM: 1GB RAM DDR2 DIMM 400/533 (Dual Channel)
	BIOS: ACPI BIOS
	Security: Atmel TPM 1.2
	I/O: Winbond W83627EHF, Oxford OX16PCI954
	Ethernet: Realtek PCI-E 8101E
	Storage: Integrated SATA and IDE controller

Software Environment:
	1) sftp
	2) netcat
	3) dhcp, tftpboot, pxelinux, nfs client/server

Problem Description:
	1) The system randomly hangs during data transfers over the network. In some cases I can send or receive 10 or more gigabytes without any problems; in others the system hangs after transferring less than 1 MB.
	2) When I boot the system over the network, the network stalls after a certain amount of data has been transferred, and then the message "NETDEV WATCHDOG: eth5: transmit timed out" appears. The network then works properly for some time before stalling again, and so on.
	3) With kernel 2.6.18 I got a soft lockup in the rtl8169_rx_interrupt function.
	
	I've also tried different revisions of the r8101 driver from Realtek. The problem is similar: the system either hangs, or an sftp transfer fails with the error message "Corrupted MAC on input".
	I also tried r8169.c from linux/kernel/git/torvalds/linux-2.6, with no luck.

If needed, I can provide more detailed information.
Comment 1 Maxim Radugin 2008-02-29 07:15:25 UTC
Created attachment 15085 [details]
Kernel config
Comment 2 Maxim Radugin 2008-02-29 07:15:54 UTC
Created attachment 15086 [details]
dmesg output after system boot
Comment 3 Maxim Radugin 2008-02-29 07:16:08 UTC
Created attachment 15087 [details]
ifconfig output
Comment 4 Maxim Radugin 2008-02-29 07:16:25 UTC
Created attachment 15088 [details]
complete lspci output
Comment 5 Maxim Radugin 2008-02-29 07:16:48 UTC
Created attachment 15089 [details]
/proc/interrupts
Comment 6 Maxim Radugin 2008-02-29 07:17:06 UTC
Created attachment 15090 [details]
/proc/iomem
Comment 7 Maxim Radugin 2008-02-29 07:17:23 UTC
Created attachment 15091 [details]
/proc/ioports
Comment 8 Francois Romieu 2008-02-29 14:06:50 UTC
The r8169 driver has undergone several changes between 2.6.23.9 and 2.6.24.

Can you give 2.6.24 a try (with/without MMCONFIG) ?

Thanks.

-- 
Ueimor
Comment 9 Maxim Radugin 2008-03-01 03:36:36 UTC
Yes, sure, I'll give it a try on Monday and post a report.
Comment 10 Maxim Radugin 2008-03-03 03:30:28 UTC
I've compiled and installed the 2.6.24.3 kernel (with PCI Access set to "Any" and MSI turned on) and tried to boot from the network. The problem is the same:

nfs: server 192.168.100.1 not responding, still trying...
nfs: server 192.168.100.1 not responding, still trying...
...
nfs: server 192.168.100.1 not responding, still trying...
NETDEV WATCHDOG: eth0: transmit timeout
r8169: eth0: link up

nfs: server 192.168.100.1 OK
nfs: server 192.168.100.1 OK
...
nfs: server 192.168.100.1 OK
Comment 11 Maxim Radugin 2008-03-03 03:48:00 UTC
With the same 2.6.24.3 kernel (without NFS support), while transferring a file using sftp, I got:

int3: 0000 [#1]
Modules linked in:

Pid: 0, comm: swapper Not tainted (2.6.24-diamond #2)
EIP: 0060:[<c0468211>] EFLAGS 00000002 CPU: 0
EIP is at ignore_int+0x1/0x50
EAX: 0001f802 EBX: f7744000 ECX: c0285950 EDX: 0000f802
ESI: f76af7cc EDI: 00000000 EBP: f774407c ESP: c0467f1c
 DS: 007b ES: 007b FS: 0000 GS: 0000 SS:0068
Processor swapper (pid: 0, ti=c0466000 task=c04332e0 task.ti=c0466000)
Stack: c01efbda 00000060 00010002 c02857db c02816c3 c0115614 0000000f f76c2ab0
       00000000 00000286 00000100 c04b5a80 c0467f64 f75b9340 00000000 00000000
       0000000e c013ae75 c043fef0 f75b9340 0000000e 0000000e c013be57 00000310
Call Trace:
 [<c01efbda>] ioread8+0x2a/0x30
 [<c02857db>] ata_bmdma_status+0xb/0x10
 [<c02816c3>] ata_interrupt+0x143/0x1c0
 [<c0115614>] activate_task+0x24/0x40
 [<c013ae75>] handle_IRQ_event+0x25/0x60
 [<c013be57>] handle_edge_irq+0x77/0xf0
 [<c0104c35>] do_IRQ+0x45/0x80
 [<c010322f>] common_interrupt+0x23/0x28
 [<c048007b>] asus_hides_smbus_hostbridge+0x20b/0x270
 [<c010162a>] default_idle+0x2a/0x40
 [<c0100edf>] cpu_idle+0x3f/0x60
 [<c0468aca>] start_kernel+0x1fa/0x280
 [<c0468360>] unknown_bootoption+0x0/0x1f0
 =======================
Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc <cc> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
EIP: [<c0468211>] ignore_int+0x1/0x50 SS:ESP 0068:c0467f1c
Kernel panic - not syncing: Fatal exception in interrupt
Comment 12 Francois Romieu 2008-03-05 14:51:19 UTC
Created attachment 15157 [details]
Force renegotiation after resume

Please try the attached patch and send the output of mii-tool after resume.

-- 
Ueimor
Comment 13 Rolf Eike Beer 2008-05-20 23:04:06 UTC
Your problem #1 looks like a dupe of bug:6807
Your problem #2 looks like a dupe of bug:10109
Comment 14 Maxim Radugin 2008-05-27 03:46:47 UTC
(In reply to comment #13)
> Your problem #1 looks like a dupe of bug:6807
> Your problem #2 looks like a dupe of bug:10109
> 

It is a dupe of bug 6807, but not of bug 10109. I don't have a PME event option in the BIOS.
Comment 15 Maxim Radugin 2008-05-27 03:53:16 UTC
(In reply to comment #12)
> Created an attachment (id=15157) [details]
> Force renegotiation after resume
> 
> Please try the attached patch and send the output of mii-tool after resume.
> 
> -- 
> Ueimor
> 

Sorry, I have had no time to apply the patch and check it. But it seems to me that the problem is in the rtl8169_rx_interrupt handling routine.
Comment 16 Francois Romieu 2008-09-10 13:46:55 UTC
Maxim, can you give 2.6.27-rc a try ? There are a few r8169 related changes
in it that could fix your problems.

Thanks in advance.

-- 
Ueimor
Comment 17 Maxim Radugin 2008-10-03 02:31:50 UTC
No luck with 2.6.27-rc8-git4.
The network becomes unusable after transferring ~300 MB, but at least the system did not hang.
There are no error messages in dmesg, even with RTL8169_DEBUG turned on.

As an experiment, we added udelay(10) to all the I/O read and write functions in the 2.6.26.2 kernel, and surprisingly the network became more stable. I think we have had problems only once in about a week of intensive network use.
Perhaps some delay is required before or after register read/write operations?
Unfortunately, I didn't find any information about this in the datasheet.
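
For illustration only, the kind of change described above might look roughly like the following. This is a userspace sketch, not the actual patch: the register file is a plain array, udelay() is a counting stub standing in for the kernel's busy-wait, and the names delayed_read8, delayed_write8, fake_regs and IO_DELAY_US are all hypothetical. Placing the delay after each access is also an assumption, since the experiment did not establish whether the delay matters before or after the access.

```c
#include <stdint.h>

#define IO_DELAY_US 10            /* delay value used in the experiment */

uint8_t fake_regs[256];           /* hypothetical stand-in for the NIC's MMIO window */
unsigned long delayed_us;         /* total microseconds "waited" by the stub */

/* Stub: the kernel's udelay() busy-waits; here it only records the request. */
void udelay(unsigned long usecs)
{
    delayed_us += usecs;
}

/* Hypothetical 8-bit register read followed by a fixed delay. */
uint8_t delayed_read8(uint8_t reg)
{
    uint8_t val = fake_regs[reg];

    udelay(IO_DELAY_US);
    return val;
}

/* Hypothetical 8-bit register write followed by a fixed delay. */
void delayed_write8(uint8_t reg, uint8_t val)
{
    fake_regs[reg] = val;
    udelay(IO_DELAY_US);
}
```

In a real driver the array accesses would be the MMIO accessors the driver already uses and udelay() would be the kernel's own. Note that a fixed 10 us delay on every register access adds busy-wait time in the interrupt path, so this is a diagnostic experiment rather than a fix.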
Comment 18 Rolf Eike Beer 2009-02-24 08:02:52 UTC
We should close this as a dupe. #10109 has IMHO nothing to do with PME at all, as I can trigger it without that, too. So both problems described here are already reported in other bugs.
Comment 19 Francois Romieu 2009-06-15 22:01:02 UTC
Fixed in 2.6.30.

-- 
Ueimor
