Bug 32962

Summary: r8169 self reboot the machine with RTL8111/8168B PCI Express Gigabit Ethernet
Product: Drivers Reporter: Enrico Tagliavini (enrico.tagliavini)
Component: Network Assignee: Francois Romieu (romieu)
Status: RESOLVED CODE_FIX    
Severity: normal CC: akpm, andyrtr, arthur.titeica, cera, fridjong, kernel, leho, liquid.acid, stuffcorpse, Tom
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 2.6.38.2 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Remove erroneous processing of always set bit (post 8168b only)
don't reset software ring indexes after disabling hardware Rx
remove erroneous processing of always set bit.

Description Enrico Tagliavini 2011-04-10 12:30:52 UTC
I just bought a Lenovo ThinkPad Edge 15. The Ethernet controller is:

09:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)
09:00.0 0200: 10ec:8168 (rev 03)

The driver in use is r8169.

The problem appears when I run a very large data transfer over the gigabit network. In the syslog I can see the driver spamming messages like:

Apr 10 14:18:38 thinkico kernel: [  160.643248] r8169 0000:09:00.0: eth0: link up
Apr 10 14:18:40 thinkico kernel: [  162.135568] r8169 0000:09:00.0: eth0: link up
Apr 10 14:18:41 thinkico kernel: [  163.594835] r8169 0000:09:00.0: eth0: link up
Apr 10 14:18:44 thinkico kernel: [  166.520487] r8169 0000:09:00.0: eth0: link up
Apr 10 14:18:44 thinkico kernel: [  166.520500] NOHZ: local_softirq_pending 08
Apr 10 14:18:45 thinkico kernel: [  167.302470] r8169 0000:09:00.0: eth0: link up
Apr 10 14:18:45 thinkico kernel: [  167.981786] r8169 0000:09:00.0: eth0: link up
Apr 10 14:18:45 thinkico kernel: [  167.981795] NOHZ: local_softirq_pending 08
Apr 10 14:18:45 thinkico kernel: [  167.982436] NOHZ: local_softirq_pending 08
Apr 10 14:18:46 thinkico kernel: [  168.411702] r8169 0000:09:00.0: eth0: link up
...
Apr 10 14:19:25 thinkico kernel: [  207.175285] r8169 0000:09:00.0: eth0: link up
Apr 10 14:19:25 thinkico kernel: [  207.443637] r8169 0000:09:00.0: eth0: link up
Apr 10 14:19:26 thinkico kernel: [  208.828129] net_ratelimit: 2 callbacks suppressed
Apr 10 14:19:26 thinkico kernel: [  208.828137] r8169 0000:09:00.0: eth0: link up
Apr 10 14:19:27 thinkico kernel: [  209.402700] r8169 0000:09:00.0: eth0: link up
Apr 10 14:19:28 thinkico kernel: [  209.995192] r8169 0000:09:00.0: eth0: link up

and after some time (say 30 seconds, but this varies with the transfer speed) the machine reboots itself. The reboot is not clean, of course.

I tested with both NFS and Samba transfers and the same thing happens, but with NFS the reboot comes earlier, I guess because the transfer speed is higher. For the record, the speed should be at least 40 MB/s; that is what I see when using Windows (sorry, just for testing) and transferring from the same Samba server. In Linux I reach less than half that speed before the machine reboots.
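To put a number on the log spam described above, the "link up" lines can be counted straight from the kernel log. This is only a minimal sketch, assuming a POSIX shell with grep; the function name count_link_up is made up for illustration and is not part of any existing tool.

```shell
# Count "link up" messages logged by r8169 for one interface.
# Reads kernel log text on stdin; the interface name defaults to eth0.
count_link_up() {
    iface="${1:-eth0}"
    # grep -c prints the number of matching lines (0 when none match);
    # "|| true" keeps the function's exit status at 0 in that case.
    grep -c "r8169 .*${iface}: link up" || true
}
```

Typical use would be `dmesg | count_link_up eth0`, run before and after a transfer to see how fast the counter grows.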
Comment 1 Enrico Tagliavini 2011-04-10 12:32:49 UTC
Oh, and for the record, my distro is Gentoo (mostly the stable branch) and the kernel is gentoo-sources 2.6.38, which is very close to vanilla AFAIK.
Comment 2 Enrico Tagliavini 2011-04-10 13:02:02 UTC
dmesg output after a fresh reboot (disconnected the AC and also the battery for 5 minutes before booting again):

enrico@thinkico ~ $ dmesg | grep r8169
[    9.526404] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[    9.526427] r8169 0000:09:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
[    9.526481] r8169 0000:09:00.0: setting latency timer to 64
[    9.526536] r8169 0000:09:00.0: irq 42 for MSI/MSI-X
[    9.526701] r8169 0000:09:00.0: eth0: RTL8168d/8111d at 0xffffc90000054000, 60:eb:69:ac:96:2a, XID 083000c0 IRQ 42
[   24.900682] r8169 0000:09:00.0: eth0: unable to apply firmware patch
[   24.902555] r8169 0000:09:00.0: eth0: link down
[   24.902566] r8169 0000:09:00.0: eth0: link down
[   26.828669] r8169 0000:09:00.0: eth0: link up


Interestingly, ifconfig shows a dropped packet right after boot.

thinkico ~ # ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 60:eb:69:ac:96:2a  
          inet addr:192.168.11.132  Bcast:192.168.11.255  Mask:255.255.255.0
          inet6 addr: fe80::62eb:69ff:feac:962a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:921 errors:0 dropped:1 overruns:0 frame:0
          TX packets:956 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:694936 (678.6 KiB)  TX bytes:170076 (166.0 KiB)
          Interrupt:42 Base address:0x4000
Comment 3 Enrico Tagliavini 2011-04-10 13:24:56 UTC
Just another update: transferring files with a PC equipped with a 100 Mbit Ethernet card (but connected to the same gigabit router) seems to work; no errors are reported.
Comment 4 Enrico Tagliavini 2011-04-10 14:29:03 UTC
I'm trying the transfer from the Sabayon 5.5 live DVD (2.6.37 kernel here) and it works. The dmesg line complaining about the firmware does not appear here, even though the files rtl_nic/rtl8168d-{1,2}.fw are not present on the live DVD.

I tried adding those firmware files to my Gentoo install; the warning is gone but the issue persists.

Looking at r8169.c shows that the firmware lines were added in the .38 kernel, which may explain why the issue is present in kernel .38 and not in .37. Can you confirm?
Comment 5 Enrico Tagliavini 2011-04-12 15:22:51 UTC
One last side note: of course I don't think the net driver itself is rebooting the machine; more likely this is some sort of BIOS protection or something similar. On the other hand, the driver produces this really weird (at least to me) output, which is almost certainly a driver regression. Feel free to disagree, of course :)
Comment 6 Bryan Kam 2011-04-28 20:16:53 UTC
I'm running Arch Linux and I encountered this sometime after upgrading from 2.6.37.3 to 2.6.37.4 of the default Arch kernel26. Apparently the problem remains also with 2.6.37.5 and 2.6.38.2. I myself have not been testing it, choosing instead to remain on kernel 2.6.37.3, but there is a bit of discussion on the Arch Linux boards:

https://bbs.archlinux.org/viewtopic.php?id=115644

The symptoms in my case were that the machine would freeze or reboot when working with large files over gigabit NFS (specifically when doing MythTV commercial flagging over NFS). I also got the "NOHZ: local_softirq_pending 08" messages logged.

Let me know if I can be of any help.
Comment 7 Enrico Tagliavini 2011-04-28 20:50:25 UTC
Adding some info that I hope will be useful for solving the issue:

# lspci -tv
-+-[0000:ff]-+-00.0  Intel Corporation Core Processor QuickPath Architecture Generic Non-core Registers
 |           +-00.1  Intel Corporation Core Processor QuickPath Architecture System Address Decoder
 |           +-02.0  Intel Corporation Core Processor QPI Link 0
 |           +-02.1  Intel Corporation Core Processor QPI Physical 0
 |           +-02.2  Intel Corporation Core Processor Reserved
 |           \-02.3  Intel Corporation Core Processor Reserved
 \-[0000:00]-+-00.0  Intel Corporation Core Processor DRAM Controller
             +-01.0-[01]--+-00.0  ATI Technologies Inc M92 [Mobility Radeon HD 4500 Series]
             |            \-00.1  ATI Technologies Inc RV710/730
             +-16.0  Intel Corporation 5 Series/3400 Series Chipset HECI Controller
             +-1a.0  Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller
             +-1b.0  Intel Corporation 5 Series/3400 Series Chipset High Definition Audio
             +-1c.0-[02]--
             +-1c.1-[03]----00.0  Intel Corporation WiFi Link 1000 Series
             +-1c.2-[04]--
             +-1c.3-[05-07]--
             +-1c.4-[08]--
             +-1c.5-[09]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
             +-1d.0  Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller
             +-1e.0-[0c]--
             +-1f.0  Intel Corporation Mobile 5 Series Chipset LPC Interface Controller
             +-1f.2  Intel Corporation 5 Series/3400 Series Chipset 4 port SATA AHCI Controller
             +-1f.3  Intel Corporation 5 Series/3400 Series Chipset SMBus Controller
             \-1f.6  Intel Corporation 5 Series/3400 Series Chipset Thermal Subsystem


thinkico ~ # cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       
  0:        120          4          3          5   IO-APIC-edge      timer
  1:       1444       1537       1492       1527   IO-APIC-edge      i8042
  8:         22         16         26         20   IO-APIC-edge      rtc0
  9:       9487       9443       9448       9478   IO-APIC-fasteoi   acpi
 12:     222530     222297     222449     222076   IO-APIC-edge      i8042
 16:       9940       9986      10018      10020   IO-APIC-fasteoi   ehci_hcd:usb1
 23:         31         31         26         33   IO-APIC-fasteoi   ehci_hcd:usb2
 41:      17211      17335      17196      17252   PCI-MSI-edge      ahci
 42:       4897       4981       5020       4997   PCI-MSI-edge      eth0
 43:      25539      28751      26826      28842   PCI-MSI-edge      iwlagn
 44:       1597       1584       1577       1581   PCI-MSI-edge      hda_intel
 45:         26         15         20         23   PCI-MSI-edge      hda_intel
 46:      67618      67593      67542      67648   PCI-MSI-edge      fglrx[0]@PCI:1:0:0
NMI:          0          0          0          0   Non-maskable interrupts
LOC:    1019675     818437     937715     898096   Local timer interrupts
SPU:          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0   Performance monitoring interrupts
IWI:          0          0          0          0   IRQ work interrupts
RES:       5386       3412       4597       2774   Rescheduling interrupts
CAL:       4705       9345       9204      10980   Function call interrupts
TLB:      27555      24531      25864      28511   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0   Machine check exceptions
MCP:         16         16         16         16   Machine check polls
ERR:          0
MIS:          0
Comment 8 Tobias Jakobi 2011-05-01 21:40:56 UTC
This is probably the same or similar to:
https://bugzilla.kernel.org/show_bug.cgi?id=29282

I'm also suffering from this with vanilla 2.6.38.4 -- I have to hotfix eth0 by disabling autoneg and forcing 100 Mbit mode. This somehow also kills duplex support, although ethtool indicates that duplex is set to 'full': when doing both up- and down-transfers at once, one of the two stalls (they seem to alternate).

Needless to say, this is really annoying.
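The exact command used for this workaround is not given in the comment. With ethtool it would typically look like the sketch below. The helper name force_100mbit and the DRY_RUN switch are assumptions made for illustration; by default the function only prints the command instead of running it.

```shell
# Print (default) or execute the ethtool call that pins a NIC to
# 100 Mbit/s full duplex with autonegotiation off.
# Set DRY_RUN=0 to really apply it; that needs root and resets the link.
force_100mbit() {
    iface="${1:-eth0}"
    # Rebuild the positional parameters as the command to run.
    set -- ethtool -s "$iface" speed 100 duplex full autoneg off
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Running `ethtool eth0` afterwards shows what speed and duplex the driver reports, which is where the 'full' duplex value mentioned above would appear.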
Comment 9 Tobias Jakobi 2011-05-01 21:55:55 UTC
lspci -vvv output:
07:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)
	Subsystem: Fujitsu Limited. Device 15b1
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 43
	Region 0: I/O ports at 5000 [size=256]
	Region 2: Memory at f0a04000 (64-bit, prefetchable) [size=4K]
	Region 4: Memory at f0a00000 (64-bit, prefetchable) [size=16K]
	[virtual] Expansion ROM at f0a20000 [disabled] [size=128K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee0100c  Data: 4181
	Capabilities: [70] Express (v2) Endpoint, MSI 01
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB
	Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
		Vector table: BAR=4 offset=00000000
		PBA: BAR=4 offset=00000800
	Capabilities: [d0] Vital Product Data
		Unknown small resource type 00, will not decode more.
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [140 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
	Kernel driver in use: r8169
	Kernel modules: r8169
Comment 10 Bryan Kam 2011-05-01 22:04:53 UTC
I'm not a programmer, but I went through the changes between Arch Linux's 2.6.37.3 and 2.6.37.4 kernels (which was when the problem occurred for me) and using git I think one of these commits might have introduced the problem:

0d672e9f8ac320c6d1ea9103db6df7f99ea20361
f60ac8e7ab7cbb413a0131d5665b053f9f386526
1519e57fe81c14bb8fa4855579f19264d1ef63b4
b5ba6d12bdac21bc0620a5089e0f24e362645efd

As I believe those were the only four commits to r8169.c between the two Arch Linux compiles in question. I could be wrong of course. But maybe someone compiling a vanilla kernel can try reverting those in git and see whether the problem persists? Or maybe run a git bisect?
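The suggestion above could be tried roughly as follows. This is a sketch only, assuming a linux-stable checkout that carries the v2.6.37.3 and v2.6.37.4 tags and that the driver lives at drivers/net/r8169.c in those trees. The helper bisect_plan is hypothetical and merely prints the git commands so they can be reviewed before running them inside the kernel tree.

```shell
# Print the git commands for narrowing down a driver regression.
# $1 = last known good tag, $2 = first known bad tag,
# $3 = path to restrict to (defaults to the r8169 driver source).
bisect_plan() {
    good="$1"; bad="$2"; path="${3:-drivers/net/r8169.c}"
    # List only the commits that touched the driver between the two tags.
    echo "git log --oneline ${good}..${bad} -- ${path}"
    # Start a bisect limited to commits touching that path.
    echo "git bisect start ${bad} ${good} -- ${path}"
}
```

With only four candidate commits, reverting them one at a time may be just as quick as a bisect.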
Comment 11 Tobias Jakobi 2011-05-01 22:52:43 UTC
I have taken a look at these commits. First of all my mac_version is 11 (RTL_GIGA_MAC_VER_11), so this excludes the following commits for _me_:
f60ac8e7ab7cbb413a0131d5665b053f9f386526 (code isn't active for my mac version)
b5ba6d12bdac21bc0620a5089e0f24e362645efd (no functional change for my mac version)
1519e57fe81c14bb8fa4855579f19264d1ef63b4 (again no functional change for my version)

So this leaves only 0d672e9f8ac320c6d1ea9103db6df7f99ea20361, I'm going to check this one.

However I don't know if this ever worked on my system at all.
Comment 12 Tobias Jakobi 2011-05-01 23:19:52 UTC
OK, so I commented out the 'netif_carrier_off(dev)' call in r8169.c and rechecked, which means putting heavy load on the ethernet device (I used two netcat pipes connecting /dev/zero and /dev/null, both upstream and downstream).

While the 'local_softirq_pending 08' messages still show up here and there (I managed to produce some of them, but they don't fill the log like before) I failed to freeze/reboot the system like before. And it was fairly easy to do this before (10 seconds of heavy load and it was gone).

I'm going to make Ivan Vecera aware of that, since he wrote the patch (which looks fine from reading the commit description).

Maybe some other people can recheck this?
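For anyone who wants to recheck, the netcat load test described above might look like the sketch below for one direction. It assumes OpenBSD-style netcat syntax ("nc -l PORT"); traditional netcat wants "nc -l -p PORT" instead. The helper stress_cmds, the port 5001 and the peer address are all hypothetical, and the function only prints the two commands.

```shell
# Print the netcat commands for one direction of the load test.
# $1 = address of the peer machine, $2 = TCP port (default 5001).
stress_cmds() {
    peer="$1"; port="${2:-5001}"
    # Run this one on the peer first so that it is listening.
    echo "nc -l ${port} > /dev/null"
    # Then run this one on the machine with the r8169 NIC.
    echo "nc ${peer} ${port} < /dev/zero"
}
```

For the opposite direction, swap the roles of the two machines and use a second port, so that both streams run at the same time.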
Comment 13 Enrico Tagliavini 2011-05-02 09:08:14 UTC
(In reply to comment #8)
> This is probably the same or similar to:
> https://bugzilla.kernel.org/show_bug.cgi?id=29282

Seems so, sorry for the duplicate. I searched for an already filed one but missed it :(
Comment 14 Ivan Vecera 2011-05-02 10:13:28 UTC
(In reply to comment #11)
> I have taken a look at these commits. First of all my mac_version is 11
> (RTL_GIGA_MAC_VER_11), so this excludes the following commits for _me_:
This is probably not right. I saw one A530, and it contains the 8168e variant, not the 8168b. I know that lspci shows 8168b, but the XID from dmesg is more representative. As I said, I saw the following in the A530's dmesg:
...
r8169 0000:07:00.0: eth0: RTL8168b/8111b at 0xffffc900110ba000, 00:23:26:8d:8e:73, XID 0c100000 IRQ 45
...
The XID 0c100000 corresponds to the newer 8168e variant, not to 8168b. Support for the 8168e variants is currently present in Dave's net-next; for your 2.6.38.x kernel this variant is unknown, and the 8168b configuration is used as a fallback.
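Since the XID is what identifies the chip, it can help to pull it out of dmesg mechanically. A minimal sketch follows, assuming a POSIX shell with sed. The function names are made up for illustration, and the lookup covers only the three XIDs that are tied to a chip family in this report (083000c0, 0c100000, 0c200000); it is not the driver's own detection table.

```shell
# Extract the XID from r8169 probe lines on stdin, one XID per line.
extract_xid() {
    sed -n 's/.*r8169 .* XID \([0-9a-fA-F]\{8\}\).*/\1/p'
}

# Map an XID to the chip family named for it in this report.
# Anything not reported here is answered with "unknown".
xid_to_chip() {
    case "$1" in
        083000c0) echo "8168d" ;;
        0c100000|0c200000) echo "8168e" ;;
        *) echo "unknown" ;;
    esac
}
```

Usage would be along the lines of `xid_to_chip "$(dmesg | extract_xid | head -n 1)"`.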
Comment 15 Tobias Jakobi 2011-05-02 10:45:03 UTC
That's my line from dmesg:
r8169 0000:07:00.0: eth0: RTL8168b/8111b at 0xffffc900100be000, 00:23:26:8e:f9:6d, XID 0c100000 IRQ 43

Concerning the mac_version. I just modified netif_info to also print out the mac_version from the struct. I haven't really checked where this is set and whether this is a fallback or not.

However you might be right. This is the full output from dmesg:
r8169 0000:07:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
r8169 0000:07:00.0: setting latency timer to 64
r8169 0000:07:00.0: (unregistered net_device): unknown MAC, using family default
r8169 0000:07:00.0: irq 43 for MSI/MSI-X
r8169 0000:07:00.0: eth0: RTL8168b/8111b (11) at 0xffffc900100f8000, 00:23:26:8e:f9:6d, XID 0c100000 IRQ 43

So this message about "using family default", that's indicating a fallback?

Anyway, thanks for looking into this! :)
Comment 16 Ivan Vecera 2011-05-02 11:00:12 UTC
(In reply to comment #15)
> That's my line from dmesg:
> r8169 0000:07:00.0: eth0: RTL8168b/8111b at 0xffffc900100be000,
> 00:23:26:8e:f9:6d, XID 0c100000 IRQ 43
> 
> Concerning the mac_version. I just modified netif_info to also print out the
> mac_version from the struct. I haven't really checked where this is set and
> whether this is a fallback or not.
> 
> However you might be right. This is the full output from dmesg:
> r8169 0000:07:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> r8169 0000:07:00.0: setting latency timer to 64
> r8169 0000:07:00.0: (unregistered net_device): unknown MAC, using family
> default
> r8169 0000:07:00.0: irq 43 for MSI/MSI-X
> r8169 0000:07:00.0: eth0: RTL8168b/8111b (11) at 0xffffc900100f8000,
> 00:23:26:8e:f9:6d, XID 0c100000 IRQ 43
> 
> So this message about "using family default", that's indicating a fallback?
>
Yes, this indicates a fallback.
Is it possible for you to test net-next kernel? (...with appropriate firmware files)
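Other affected users can check for the same fallback by looking for that message in their own kernel log. A small sketch, assuming a POSIX shell with grep; the function name r8169_uses_fallback is hypothetical.

```shell
# Succeed (exit status 0) if the kernel log on stdin shows r8169
# falling back to the family default for a chip it did not recognise.
r8169_uses_fallback() {
    grep -q 'r8169 .*unknown MAC, using family default'
}
```

For example: `dmesg | r8169_uses_fallback && echo "chip not recognised by this kernel"`.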
Comment 17 Tobias Jakobi 2011-05-02 12:58:20 UTC
(In reply to comment #16)
> Yes, this indicates a fallback.
> Is it possible for you to test net-next kernel? (...with appropriate firmware
> files)
Sure, now cloning the tree. Concerning the firmware files: Do I have to fetch these from somewhere else or are they included in the tree?
Comment 18 Enrico Tagliavini 2011-05-02 13:06:18 UTC
(In reply to comment #17)
> Sure, now cloning the tree. Concerning the firmware files: Do I have to fetch
> these from somewhere else or are they included in the tree?

I don't know which firmware files Ivan is referring to, but if they are rtl_nic/rtl8168d-{1,2}.fw you can get them from http://git.kernel.org/?p=linux/kernel/git/dwmw2/linux-firmware.git;a=summary
Comment 19 Tobias Jakobi 2011-05-02 13:09:06 UTC
Thanks, I just figured it out myself -- never needed these firmware files before.
Comment 20 Tobias Jakobi 2011-05-02 18:59:52 UTC
OK, so net-next completely fixes this for me.

The chip is now detected as:
8169 0000:07:00.0: eth0: RTL8168e/8111e at 0xffffc900101b8000, 00:23:26:8e:f9:6d, XID 0c100000 IRQ 43

The ethernet stress-test doesn't trigger any messages in dmesg anymore (and it doesn't lockup/reboot the machine).
Comment 21 Leho Kraav 2011-05-05 08:09:28 UTC
ok i'm also running into this with 2.6.38-r3. is it possible to just take the r8169 driver from net-next and run it on top of gentoo-sources?

04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)
	Subsystem: Giga-byte Technology GA-EP45-DS5 Motherboard
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 46
	Region 0: I/O ports at be00 [size=256]
	Region 2: Memory at fbbff000 (64-bit, prefetchable) [size=4K]
	Region 4: Memory at fbbf8000 (64-bit, prefetchable) [size=16K]
	[virtual] Expansion ROM at fbb00000 [disabled] [size=128K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee0f00c  Data: 4191
	Capabilities: [70] Express (v2) Endpoint, MSI 01
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend+
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB
	Capabilities: [ac] MSI-X: Enable- Count=4 Masked-
		Vector table: BAR=4 offset=00000000
		PBA: BAR=4 offset=00000800
	Capabilities: [cc] Vital Product Data
		Unknown small resource type 00, will not decode more.
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [140 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [160 v1] Device Serial Number 03-00-00-00-68-4c-e0-00
	Kernel driver in use: r8169
	Kernel modules: r8169
Comment 22 Leho Kraav 2011-05-05 13:09:48 UTC
debugging this i've become more familiar with my "beloved" realtek onboard nic (Gigabyte GA-PM55-UD2). 

for example i've now learned that the nic is capable of going insane, to the point where r8169.ko will not get a link up at all, and r8168.ko gets a 10Mbps link on a gigabit switch. in this case it happened after a reboot into an  older 2.6.34 kernel that i disabled PCI_QUIRKs on (only noteworthy difference i can think of), but i have no idea what exactly triggers that insanity. 

googling around [1] revealed that you might need a complete power-off to make the nic sane again. while i had turned off the power from the front cover switch to check exactly that a while before, i forgot that the real power switch is on the back of the PSU. a complete power cycle restored the nic's ability to connect at 1Gbps.

i mentioned r8168. i went ahead and compiled the 8.023.00 driver (dated 19.04.2011) from realtek [2]. transferring a 10GB file has now ended smoothly, with none of the error messages i previously got immediately with r8169.

 [1]: http://www.w7forums.com/realtek-onboard-lan-doesnt-work-above-10-mbps-t9501.html
 [2]: ftp://WebUser:fH7s5YL@207.232.93.28/cn/nic/r8168-8.023.00.tar.bz2
Comment 23 Tobias Jakobi 2011-05-05 13:26:05 UTC
@Leho: I'm also using Gentoo, but I just did a shallow clone of the net-next repo and copied my kernel config over. I don't think it's that easy to backport the new code to 2.6.38.4 (I think that's the kernel gentoo-sources is currently based on). I'm now living with the workaround hack until the code hits a stable kernel release. Or I might look into this again when 2.6.39 becomes stable. It's probably a whole lot easier to apply the patches from net-next against 2.6.39 than against 2.6.38...
Comment 24 Leho Kraav 2011-05-05 13:35:27 UTC
right. side note "shallow clone" was new to me so i googled it a bit [1], looks like it doesn't give that much gain.

but re realtek, looks like i will be sitting on self-maintained r8168 for the foreseeable future. i guess an ebuild would be nice to have, will look into it some time.

 [1]: http://blogs.gnome.org/simos/2009/04/18/git-clones-vs-shallow-git-clones/
Comment 25 Tobias Jakobi 2011-05-05 13:42:39 UTC
Yeah, I just mentioned the shallow clone since you probably don't need the whole history for just testing the kernel -- and it saves bandwidth too :)
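For reference, a shallow clone is an ordinary git clone limited with --depth. A sketch, with the caveat that the repository URL is not given anywhere in this report and therefore has to be passed in; the helper shallow_clone_cmd and the default directory name are assumptions, and the function only prints the command.

```shell
# Print the shallow-clone command for a kernel tree.
# $1 = repository URL (required), $2 = target directory (default net-next).
shallow_clone_cmd() {
    url="$1"; dir="${2:-net-next}"
    # --depth 1 fetches only the most recent commit, not the full history.
    echo "git clone --depth 1 ${url} ${dir}"
}
```

The price is that history-based operations such as git bisect are not possible in the resulting tree.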
Comment 26 Andreas Radke 2011-05-30 16:22:39 UTC
And one more affected user: Asus P8P67 board with its onboard NIC. Sometimes hard freezes with kernels up to 2.6.38.x, and since 2.6.39 reboots that can be reproduced under network load.

May 30 17:58:00 workstation64 kernel: r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
May 30 17:58:00 workstation64 kernel: r8169 0000:07:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
May 30 17:58:00 workstation64 kernel: r8169 0000:07:00.0: setting latency timer to 64
May 30 17:58:00 workstation64 kernel: r8169 0000:07:00.0: (unregistered net_device): unknown MAC, using family default
May 30 17:58:00 workstation64 kernel: r8169 0000:07:00.0: irq 50 for MSI/MSI-X
May 30 17:58:00 workstation64 kernel: r8169 0000:07:00.0: eth0: Features changed: 0x00004980 -> 0x00004180
May 30 17:58:00 workstation64 kernel: r8169 0000:07:00.0: eth0: RTL8168b/8111b at 0xffffc90001858000, bc:ae:c5:ab:17:22, XID 0c200000 IRQ 50


the same dmesg lines here under netload:
May 30 18:09:28 workstation64 kernel: NOHZ: local_softirq_pending 08
May 30 18:09:28 workstation64 kernel: r8169 0000:07:00.0: eth0: link up
May 30 18:09:28 workstation64 kernel: r8169 0000:07:00.0: eth0: link up

07:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)
	Subsystem: ASUSTeK Computer Inc. P8P67 Deluxe Motherboard [Realtek RTL8111E]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 50
	Region 0: I/O ports at e000 [size=256]
	Region 2: Memory at d0004000 (64-bit, prefetchable) [size=4K]
	Region 4: Memory at d0000000 (64-bit, prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000feeff00c  Data: 41b1
	Capabilities: [70] Express (v2) Endpoint, MSI 01
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB
	Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
		Vector table: BAR=4 offset=00000000
		PBA: BAR=4 offset=00000800
	Capabilities: [d0] Vital Product Data
		Unknown small resource type 00, will not decode more.
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [140 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=01
			Status:	NegoPending- InProgress-
	Capabilities: [160 v1] Device Serial Number 02-00-00-00-68-4c-e0-00
	Kernel driver in use: r8169
	Kernel modules: r8169

Any idea for a quick fix to get it working stably (kernel append line or module parameter)?

Is there anything I can do to locate the bug so that we can see a fix going into the stable tree?
Comment 27 Enrico Tagliavini 2011-05-30 16:29:39 UTC
My temporary workaround is to use the realtek r8168 driver from the realtek site
Comment 28 Andreas Radke 2011-05-31 18:09:24 UTC
Thanks. Using the realtek r8168 driver is a workaround for 2.6.39 kernels also for me.

Just testing kernel 3.0r1. So far no problems under high network load there either. This seems fixed. It would be nice if a fix could be backported to the stable .39 tree and also to .32 LTS, which also prints tons of dmesg entries but never crashed here.
Comment 29 Tobias Jakobi 2011-05-31 21:11:58 UTC
The so called "fix" would be this commit I think. So I don't think a backport is going to happen, since this is all new driver code.
Comment 31 fridjong 2011-06-16 14:36:01 UTC
Add one more affected user: Gigabyte motherboard 870A-USB3.
Comment 32 Tom Oehser 2011-06-30 03:34:29 UTC
Has this been bug-ported to 2.6.35-30.54?  Suddenly it is happening there too, I think... Happens to my Gigabyte EP45-UD3Rs (I have 2), in Natty 2.6.37+, but now also happened in Maverick :(  I'm going to try another fix: I ordered 2 Marvell NICs.
Comment 33 Leho Kraav 2011-08-19 14:07:22 UTC
my network card is still exhibiting the same craziness in 3.0.2.

r8168 8.023.00 required some Makefile regex changes to recognize 3.0 kernels, but other than that, it still seems to work.
Comment 34 Francois Romieu 2011-08-25 10:25:08 UTC
(In reply to comment #33)
> my network card is exhibiting still the same crazyness in 3.0.2.
>
> r8168 8.023.00 required some Makefile regex changes to recognized 3.0
> kernels,
> but other than that, seems to work still.

Can you try the attached patch with a recent kernel ?

I must make it chipset version dependent, but it is needed for post-8168b
chipsets, where the bit formerly known as Rx FIFO in the Rx descriptor ring
entries is now always one. The driver must not trigger the usual Rx FIFO
overflow recovery method when a different, possibly minor, Rx error / event
is signaled.

The patch may not be enough as a race sneaked in the Rx FIFO overflow
event processing from the irq handler (where the event is read in the
IRQ event register, as opposed to the aforementioned Rx desc entries).
Basically the driver resets the Rx and Tx descriptor ring pointers while
racing with the NAPI packet processing methods (*ouch*).

As a side note, I would appreciate a dmesg including the r8169 XID line, as I
need it to identify the exact revision of the 816x chipset and triage the bugs.

Thanks.

-- 
Ueimor
Comment 35 Francois Romieu 2011-08-25 10:27:44 UTC
Created attachment 70152 [details]
Remove erroneous processing of always set bit (post 8168b only)
Comment 36 Francois Romieu 2011-08-25 22:40:50 UTC
Created attachment 70312 [details]
don't reset software ring indexes after disabling  hardware Rx
Comment 37 Francois Romieu 2011-08-25 22:42:54 UTC
Created attachment 70322 [details]
remove erroneous processing of always set bit.
Comment 38 Enrico Tagliavini 2012-07-28 12:56:05 UTC
Hi all, I'm sorry for not commenting sooner, but I was unable to test the issue: the other gigabit PC died and I had no other one at hand. Now I have bought another Lenovo (ThinkPad E530) and tested again whether the issue was solved. Kernel 3.4.4 works like a charm on the Edge 15, so this bug can be marked as FIXED for me.

Thank you very very much.
Cheers :)
Comment 39 Francois Romieu 2012-07-29 17:35:21 UTC
Summary:
- Enrico
  fixed
- Bryan
  "NOHZ: local_softirq_pending 08" is fixed by
  8876d6b5f81f4e242a6660da22bbd92f17a8d058 (v3.4 .. v3.5 cycle).
- Tobias
  fixed by 8168e support
- Leho Kraav
  Gigabyte GA-PM55-UD2 appears to contain a 8168d. Depending on which
  MTU is used, fixes for it have been merged as recently as March 2012.
- A. Radke
  Asus P8P67 contains a now supported 8168e. Old 2.6.xy kernels won't
  perform well.
- Fridjong
  Gigabyte 870A-USB3 contains a 8168e as well. 

I'll close this PR as the issues herein reported should be fixed in a
recent 3.4-stable or in 3.5. If you still experience problems with
any of those, please open a new PR or attach to a recent one.

Thanks.

-- 
Ueimor
Comment 40 Leho Kraav 2012-07-29 19:05:04 UTC
Francois, thanks for the update. Out of tree r8168 has also been updated to compile with 3.4+, I'm on that right now. I'll move back to in-tree module sometimeish and report back if any fun times come up.