Bug 32962
Summary: | r8169 self reboot the machine with RTL8111/8168B PCI Express Gigabit Ethernet | ||
---|---|---|---|
Product: | Drivers | Reporter: | Enrico Tagliavini (enrico.tagliavini) |
Component: | Network | Assignee: | Francois Romieu (romieu) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | akpm, andyrtr, arthur.titeica, cera, fridjong, kernel, leho, liquid.acid, stuffcorpse, Tom |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 2.6.38.2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: | Remove erroneous processing of always set bit (post 8168b only); don't reset software ring indexes after disabling hardware Rx; remove erroneous processing of always set bit. | ||
Description
Enrico Tagliavini
2011-04-10 12:30:52 UTC
Oh and for the record, my distro is gentoo (mostly from the stable branch), the kernel is gentoo-sources 2.6.38, really near to vanilla afaik.

dmesg output after a fresh reboot (disconnected the AC and also the battery for 5 minutes before booting again):

enrico@thinkico ~ $ dmesg | grep r8169
[ 9.526404] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[ 9.526427] r8169 0000:09:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
[ 9.526481] r8169 0000:09:00.0: setting latency timer to 64
[ 9.526536] r8169 0000:09:00.0: irq 42 for MSI/MSI-X
[ 9.526701] r8169 0000:09:00.0: eth0: RTL8168d/8111d at 0xffffc90000054000, 60:eb:69:ac:96:2a, XID 083000c0 IRQ 42
[ 24.900682] r8169 0000:09:00.0: eth0: unable to apply firmware patch
[ 24.902555] r8169 0000:09:00.0: eth0: link down
[ 24.902566] r8169 0000:09:00.0: eth0: link down
[ 26.828669] r8169 0000:09:00.0: eth0: link up

It is interesting that ifconfig shows a dropped packet just after the boot.

thinkico ~ # ifconfig eth0
eth0 Link encap:Ethernet HWaddr 60:eb:69:ac:96:2a
     inet addr:192.168.11.132 Bcast:192.168.11.255 Mask:255.255.255.0
     inet6 addr: fe80::62eb:69ff:feac:962a/64 Scope:Link
     UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
     RX packets:921 errors:0 dropped:1 overruns:0 frame:0
     TX packets:956 errors:0 dropped:0 overruns:0 carrier:0
     collisions:0 txqueuelen:1000
     RX bytes:694936 (678.6 KiB) TX bytes:170076 (166.0 KiB)
     Interrupt:42 Base address:0x4000

Just another update: transferring files with a pc equipped with a 100Mbit eth card (but connected to the same gigabit router) seems to work, no error reported.

I'm trying to transfer from the sabayon 5.5 livedvd (2.6.37 kernel here) and it works. The dmesg line complaining about the firmware is not present here, but the files rtl_nic/rtl8168d-{1,2}.fw are not present on the livedvd. I tried adding those firmware files to my gentoo install: the warning is gone but the issue persists.
Looking at r8169.c shows the firmware lines were added in the .38 kernel, which may explain why the issue is present in kernel .38 and not in .37. Can you confirm?

One last side note: of course I don't think the net driver is rebooting the machine itself; more likely this is some sort of BIOS protection or something. On the other hand the driver has this really weird (at least for me) output, which quite surely is a driver regression. Feel free to disagree of course :)

I'm running Arch Linux and I encountered this sometime after upgrading from 2.6.37.3 to 2.6.37.4 of the default Arch kernel26. Apparently the problem also remains with 2.6.37.5 and 2.6.38.2. I myself have not been testing it, choosing instead to remain on kernel 2.6.37.3, but there is a bit of discussion on the Arch Linux boards: https://bbs.archlinux.org/viewtopic.php?id=115644

The symptoms in my case were that the machine would freeze or reboot when working with large files over gigabit NFS (specifically when doing MythTV commercial flagging over NFS). I also got the "NOHZ: local_softirq_pending 08" messages logged. Let me know if I can be of any help.
i add some info i hope will be usefull to solve the issue: # lspci -tv -+-[0000:ff]-+-00.0 Intel Corporation Core Processor QuickPath Architecture Generic Non-core Registers | +-00.1 Intel Corporation Core Processor QuickPath Architecture System Address Decoder | +-02.0 Intel Corporation Core Processor QPI Link 0 | +-02.1 Intel Corporation Core Processor QPI Physical 0 | +-02.2 Intel Corporation Core Processor Reserved | \-02.3 Intel Corporation Core Processor Reserved \-[0000:00]-+-00.0 Intel Corporation Core Processor DRAM Controller +-01.0-[01]--+-00.0 ATI Technologies Inc M92 [Mobility Radeon HD 4500 Series] | \-00.1 ATI Technologies Inc RV710/730 +-16.0 Intel Corporation 5 Series/3400 Series Chipset HECI Controller +-1a.0 Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller +-1b.0 Intel Corporation 5 Series/3400 Series Chipset High Definition Audio +-1c.0-[02]-- +-1c.1-[03]----00.0 Intel Corporation WiFi Link 1000 Series +-1c.2-[04]-- +-1c.3-[05-07]-- +-1c.4-[08]-- +-1c.5-[09]----00.0 Realtek Semiconductor Co., Ltd. 
RTL8111/8168B PCI Express Gigabit Ethernet controller +-1d.0 Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller +-1e.0-[0c]-- +-1f.0 Intel Corporation Mobile 5 Series Chipset LPC Interface Controller +-1f.2 Intel Corporation 5 Series/3400 Series Chipset 4 port SATA AHCI Controller +-1f.3 Intel Corporation 5 Series/3400 Series Chipset SMBus Controller \-1f.6 Intel Corporation 5 Series/3400 Series Chipset Thermal Subsystem thinkico ~ # cat /proc/interrupts CPU0 CPU1 CPU2 CPU3 0: 120 4 3 5 IO-APIC-edge timer 1: 1444 1537 1492 1527 IO-APIC-edge i8042 8: 22 16 26 20 IO-APIC-edge rtc0 9: 9487 9443 9448 9478 IO-APIC-fasteoi acpi 12: 222530 222297 222449 222076 IO-APIC-edge i8042 16: 9940 9986 10018 10020 IO-APIC-fasteoi ehci_hcd:usb1 23: 31 31 26 33 IO-APIC-fasteoi ehci_hcd:usb2 41: 17211 17335 17196 17252 PCI-MSI-edge ahci 42: 4897 4981 5020 4997 PCI-MSI-edge eth0 43: 25539 28751 26826 28842 PCI-MSI-edge iwlagn 44: 1597 1584 1577 1581 PCI-MSI-edge hda_intel 45: 26 15 20 23 PCI-MSI-edge hda_intel 46: 67618 67593 67542 67648 PCI-MSI-edge fglrx[0]@PCI:1:0:0 NMI: 0 0 0 0 Non-maskable interrupts LOC: 1019675 818437 937715 898096 Local timer interrupts SPU: 0 0 0 0 Spurious interrupts PMI: 0 0 0 0 Performance monitoring interrupts IWI: 0 0 0 0 IRQ work interrupts RES: 5386 3412 4597 2774 Rescheduling interrupts CAL: 4705 9345 9204 10980 Function call interrupts TLB: 27555 24531 25864 28511 TLB shootdowns TRM: 0 0 0 0 Thermal event interrupts THR: 0 0 0 0 Threshold APIC interrupts MCE: 0 0 0 0 Machine check exceptions MCP: 16 16 16 16 Machine check polls ERR: 0 MIS: 0 This is probably the same or similar to: https://bugzilla.kernel.org/show_bug.cgi?id=29282 I'm also suffering this with vanilla 2.6.38.4 -- I have to hotfix eth0 by disabling autoneg and forcing 100MBit mode. This somehow also kills duplex support, although its indicated by ethtool that duplex is set to 'full'. 
However when doing both up- and down-transfers either of one stalls (so they seems to alternate). Needless to say, this is really annoying. lspci -vvv output: 07:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06) Subsystem: Fujitsu Limited. Device 15b1 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 43 Region 0: I/O ports at 5000 [size=256] Region 2: Memory at f0a04000 (64-bit, prefetchable) [size=4K] Region 4: Memory at f0a00000 (64-bit, prefetchable) [size=16K] [virtual] Expansion ROM at f0a20000 [disabled] [size=128K] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000fee0100c Data: 4181 Capabilities: [70] Express (v2) Endpoint, MSI 01 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 4096 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us ClockPM+ Surprise- LLActRep- BwNot- LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Not Supported, TimeoutDis+ DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable 
De-emphasis: -6dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB Capabilities: [b0] MSI-X: Enable- Count=4 Masked- Vector table: BAR=4 offset=00000000 PBA: BAR=4 offset=00000800 Capabilities: [d0] Vital Product Data Unknown small resource type 00, will not decode more. Capabilities: [100 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn- Capabilities: [140 v1] Virtual Channel Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff Status: NegoPending- InProgress- Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00 Kernel driver in use: r8169 Kernel modules: r8169 I'm not a programmer, but I went through the changes between Arch Linux's 2.6.37.3 and 2.6.37.4 kernels (which was when the problem occurred for me) and using git I think one of these commits might have introduced the problem: 0d672e9f8ac320c6d1ea9103db6df7f99ea20361 f60ac8e7ab7cbb413a0131d5665b053f9f386526 1519e57fe81c14bb8fa4855579f19264d1ef63b4 b5ba6d12bdac21bc0620a5089e0f24e362645efd As I believe those were the only four commits to r8169.c between the two Arch Linux compiles in question. I could be wrong of course. 
But maybe someone compiling a vanilla kernel can try reverting those in git and see whether the problem persists? Or maybe run a git bisect?

I have taken a look at these commits. First of all my mac_version is 11 (RTL_GIGA_MAC_VER_11), so this excludes the following commits for _me_:
f60ac8e7ab7cbb413a0131d5665b053f9f386526 (code isn't active for my mac version)
b5ba6d12bdac21bc0620a5089e0f24e362645efd (no functional change for my mac version)
1519e57fe81c14bb8fa4855579f19264d1ef63b4 (again no functional change for my version)
So this leaves only 0d672e9f8ac320c6d1ea9103db6df7f99ea20361, I'm going to check this one. However I don't know if this ever worked on my system at all.

OK, so I commented out the 'netif_carrier_off(dev)' call in r8169.c and rechecked, which means putting heavy load on the ethernet device (I used two netcat pipes connecting /dev/zero and /dev/null, both upstream and downstream). While the 'local_softirq_pending 08' messages still show up here and there (I managed to produce some of them, but they don't fill the log like before), I failed to freeze/reboot the system. And it was fairly easy to do this before (10 seconds of heavy load and it was gone). I'm going to make Ivan Vecera aware of that, since he wrote the patch (which looks fine from reading the commit description). Maybe some other people can recheck this?

(In reply to comment #8)
> This is probably the same or similar to:
> https://bugzilla.kernel.org/show_bug.cgi?id=29282
Seems so, sorry for the duplicate, I searched for an already filed one but I missed it :(

(In reply to comment #11)
> I have taken a look at these commits. First of all my mac_version is 11
> (RTL_GIGA_MAC_VER_11), so this excludes the following commits for _me_:
This is probably not right: I saw one A530 and it has not the 8168b variant but the 8168e variant. I know that lspci shows 8168b, but the XID from dmesg is more representative. As I said, I saw the following in the A530's dmesg:
...
r8169 0000:07:00.0: eth0: RTL8168b/8111b at 0xffffc900110ba000, 00:23:26:8d:8e:73, XID 0c100000 IRQ 45
...
The XID 0c100000 corresponds to the newer 8168e variant and not to the 8168b. Support for the 8168e variants is currently present in Dave's net-next; for your 2.6.38.x kernel this variant is unknown and the 8168b configuration is used as a fallback.

That's my line from dmesg:
r8169 0000:07:00.0: eth0: RTL8168b/8111b at 0xffffc900100be000, 00:23:26:8e:f9:6d, XID 0c100000 IRQ 43

Concerning the mac_version: I just modified netif_info to also print out the mac_version from the struct. I haven't really checked where this is set and whether this is a fallback or not.

However you might be right. This is the full output from dmesg:
r8169 0000:07:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
r8169 0000:07:00.0: setting latency timer to 64
r8169 0000:07:00.0: (unregistered net_device): unknown MAC, using family default
r8169 0000:07:00.0: irq 43 for MSI/MSI-X
r8169 0000:07:00.0: eth0: RTL8168b/8111b (11) at 0xffffc900100f8000, 00:23:26:8e:f9:6d, XID 0c100000 IRQ 43

So this message about "using family default", that's indicating a fallback?

Anyway, thanks for looking into this! :)

(In reply to comment #15)
> That's my line from dmesg:
> r8169 0000:07:00.0: eth0: RTL8168b/8111b at 0xffffc900100be000,
> 00:23:26:8e:f9:6d, XID 0c100000 IRQ 43
>
> Concerning the mac_version. I just modified netif_info to also print out the
> mac_version from the struct. I haven't really checked where this is set and
> whether this is a fallback or not.
>
> However you might be right. This is the full output from dmesg:
> r8169 0000:07:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> r8169 0000:07:00.0: setting latency timer to 64
> r8169 0000:07:00.0: (unregistered net_device): unknown MAC, using family
> default
> r8169 0000:07:00.0: irq 43 for MSI/MSI-X
> r8169 0000:07:00.0: eth0: RTL8168b/8111b (11) at 0xffffc900100f8000,
> 00:23:26:8e:f9:6d, XID 0c100000 IRQ 43
>
> So this message about "using family default", that's indicating a fallback?
>
Yes, this indicates a fallback.
Is it possible for you to test net-next kernel? (...with appropriate firmware files)

(In reply to comment #16)
> Yes, this indicates a fallback.
> Is it possible for you to test net-next kernel? (...with appropriate firmware
> files)
Sure, now cloning the tree. Concerning the firmware files: Do I have to fetch these from somewhere else or are they included in the tree?

(In reply to comment #17)
> Sure, now cloning the tree. Concerning the firmware files: Do I have to fetch
> these from somewhere else or are they included in the tree?
I dunno what fw files Ivan is referring to, but if they are rtl_nic/rtl8168d-{1,2}.fw you can get them from http://git.kernel.org/?p=linux/kernel/git/dwmw2/linux-firmware.git;a=summary

Thanks, I just figured it out myself -- never needed these firmware files before.

OK, so net-next completely fixes this for me. The chip is now detected as:
8169 0000:07:00.0: eth0: RTL8168e/8111e at 0xffffc900101b8000, 00:23:26:8e:f9:6d, XID 0c100000 IRQ 43
The ethernet stress-test doesn't trigger any messages in dmesg anymore (and it doesn't lockup/reboot the machine).

ok i'm also running into this with 2.6.38-r3. is it possible to just take the r8169 driver from net-next and run it on top of gentoo-sources?

04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. 
RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03) Subsystem: Giga-byte Technology GA-EP45-DS5 Motherboard Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 46 Region 0: I/O ports at be00 [size=256] Region 2: Memory at fbbff000 (64-bit, prefetchable) [size=4K] Region 4: Memory at fbbf8000 (64-bit, prefetchable) [size=16K] [virtual] Expansion ROM at fbb00000 [disabled] [size=128K] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000fee0f00c Data: 4191 Capabilities: [70] Express (v2) Endpoint, MSI 01 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 4096 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend+ LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us ClockPM+ Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Not Supported, TimeoutDis+ DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB Capabilities: [ac] MSI-X: Enable- Count=4 
Masked- Vector table: BAR=4 offset=00000000 PBA: BAR=4 offset=00000800 Capabilities: [cc] Vital Product Data Unknown small resource type 00, will not decode more. Capabilities: [100 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn- Capabilities: [140 v1] Virtual Channel Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff Status: NegoPending- InProgress- Capabilities: [160 v1] Device Serial Number 03-00-00-00-68-4c-e0-00 Kernel driver in use: r8169 Kernel modules: r8169 debugging this i've become more familiar with my "beloved" realtek onboard nic (Gigabyte GA-PM55-UD2). for example i've now learned that the nic is capable of going insane, to the point where r8169.ko will not get a link up at all, and r8168.ko gets a 10Mbps link on a gigabit switch. in this case it happened after a reboot into an older 2.6.34 kernel that i disabled PCI_QUIRKs on (only noteworthy difference i can think of), but i have no idea what exactly triggers that insanity. googline around [1] revealed that you might need complete power off to make the nic sane again. while i had turned off the power from front cover switch to check exactly for that a while before, i forgot that the real power switch is in the back of the PSU. complete power cycle restore nic's ability to connect at 1Gbps. i mentioned r8168. 
i went ahead and compiled the 8.023.00 driver (dated 19.04.2011) from realtek [2]. transferring a 10GB file has now ended smoothly, with none of the error messages that previously showed up immediately with r8169.

[1]: http://www.w7forums.com/realtek-onboard-lan-doesnt-work-above-10-mbps-t9501.html
[2]: ftp://WebUser:fH7s5YL@207.232.93.28/cn/nic/r8168-8.023.00.tar.bz2

@Leho: I'm also using gentoo but I just did a shallow clone of the net-next repo and copied my kernel config over. I don't think it's that easy to backport the new code to 2.6.38.4 (I think that's the kernel which gentoo-sources is currently based on). I'm now living with the hackfix workaround until the code hits a stable kernel release. Or I might look into this again when 2.6.39 becomes stable. It's probably a whole lot easier to apply the patches from net-next against 2.6.39 than against 2.6.38...

right. side note: "shallow clone" was new to me so i googled it a bit [1], looks like it doesn't give that much gain. but re realtek, looks like i will be sitting on self-maintained r8168 for the foreseeable future. i guess an ebuild would be nice to have, will look into it some time.

[1]: http://blogs.gnome.org/simos/2009/04/18/git-clones-vs-shallow-git-clones/

Yeah, I just mentioned the shallow clone since you probably don't need the whole history for just testing the kernel -- and it saves bandwidth too :)

And one more affected user: Asus P8P67 board with its onboard NIC. Sometimes hard freezes with kernels up to 2.6.38.x, and since 2.6.39 reboots that can be reproduced under network load. 
May 30 17:58:00 workstation64 kernel: r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded May 30 17:58:00 workstation64 kernel: r8169 0000:07:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17 May 30 17:58:00 workstation64 kernel: r8169 0000:07:00.0: setting latency timer to 64 May 30 17:58:00 workstation64 kernel: r8169 0000:07:00.0: (unregistered net_device): unknown MAC, using family default May 30 17:58:00 workstation64 kernel: r8169 0000:07:00.0: irq 50 for MSI/MSI-X May 30 17:58:00 workstation64 kernel: r8169 0000:07:00.0: eth0: Features changed: 0x00004980 -> 0x00004180 May 30 17:58:00 workstation64 kernel: r8169 0000:07:00.0: eth0: RTL8168b/8111b at 0xffffc90001858000, bc:ae:c5:ab:17:22, XID 0c200000 IRQ 50 the same dmesg lines here under netload: May 30 18:09:28 workstation64 kernel: NOHZ: local_softirq_pending 08 May 30 18:09:28 workstation64 kernel: r8169 0000:07:00.0: eth0: link up May 30 18:09:28 workstation64 kernel: r8169 0000:07:00.0: eth0: link up 07:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06) Subsystem: ASUSTeK Computer Inc. 
P8P67 Deluxe Motherboard [Realtek RTL8111E] Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 50 Region 0: I/O ports at e000 [size=256] Region 2: Memory at d0004000 (64-bit, prefetchable) [size=4K] Region 4: Memory at d0000000 (64-bit, prefetchable) [size=16K] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000feeff00c Data: 41b1 Capabilities: [70] Express (v2) Endpoint, MSI 01 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 4096 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us ClockPM+ Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Not Supported, TimeoutDis+ DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB Capabilities: [b0] MSI-X: Enable- Count=4 Masked- Vector table: BAR=4 offset=00000000 PBA: BAR=4 offset=00000800 Capabilities: [d0] Vital Product Data Unknown small resource type 
00, will not decode more. Capabilities: [100 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn- Capabilities: [140 v1] Virtual Channel Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01 Status: NegoPending- InProgress- Capabilities: [160 v1] Device Serial Number 02-00-00-00-68-4c-e0-00 Kernel driver in use: r8169 Kernel modules: r8169 Any idea for a quick fix to get it stable working(kernel append or module parameter?) Anything I can to to locate to bug that we can see a fix going into the stable tree? My temporary workaround is to use the realtek r8168 driver from the realtek site Thanks. Using the realtek r8168 driver is a workaround for 2.6.39 kernels also for me. Just testing kernel 3.0r1. So far also no problems anymore under high network load. This seems fixed. It would be nice if a fix could be brought back to the stable .39 tree and also the .32LTS that also prints tons to dmesg entries but never crashed here. The so called "fix" would be this commit I think. So I don't think a backport is going to happen, since this is all new driver code. Should post the link as well: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=01dc7fec4025f6bb72b6b98ec88b375346b6dbbb Add one affected user. Gigagbyte Motherboard 870A-USB3. Has this been bug-ported to 2.6.35-30.54? 
Suddenly it is happening there too, I think...

Happens to my Gigabyte EP45-UD3Rs (I have 2), in Natty 2.6.37+, but now also happened in Maverick :( I'm going to try another fix: I ordered 2 Marvell NICs.

my network card is still exhibiting the same craziness in 3.0.2.

r8168 8.023.00 required some Makefile regex changes to recognize 3.0 kernels, but other than that, seems to work still.

(In reply to comment #33)
> my network card is still exhibiting the same craziness in 3.0.2.
>
> r8168 8.023.00 required some Makefile regex changes to recognize 3.0
> kernels,
> but other than that, seems to work still.
Can you try the attached patch with a recent kernel?

I must make it chipset version dependent, but it is needed for post 8168b chipsets where the bit formerly known as Rx FIFO in the Rx descriptor ring entries is now always one. The driver must not trigger the usual Rx FIFO overflow recovery method when a different, possibly minor Rx error / event is signaled.

The patch may not be enough, as a race sneaked into the Rx FIFO overflow event processing from the irq handler (where the event is read in the IRQ event register, as opposed to the aforementioned Rx desc entries). Basically the driver resets the Rx and Tx descriptor ring pointers while racing with the NAPI packet processing methods (*ouch*).

As a side note, I would appreciate dmesg including the r8169 XID line, as I need it to identify the exact revision of the 816x chipset and triage the bugs.

Thanks.

--
Ueimor

Created attachment 70152 [details]
Remove erroneous processing of always set bit (post 8168b only)
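
For readers who want the idea behind this attachment spelled out: the comment above says that on post 8168b chips the former Rx FIFO bit in the Rx descriptor is always one, so it must not trigger the overflow recovery path. The following is only a toy model in Python, not the driver's C code. The bit positions, the constant names, the action strings and the `fifo_bit_reliable` flag are all hypothetical placeholders chosen for illustration; none of them come from this report.

```python
# Toy model of Rx descriptor status handling; NOT the r8169 driver code.
# Bit positions and names below are hypothetical placeholders.
RX_FIFO_OVERFLOW = 1 << 23
RX_RUNT = 1 << 20
RX_CRC = 1 << 19


def rx_actions(status, fifo_bit_reliable):
    """Decide what to do for one Rx descriptor status word.

    fifo_bit_reliable is False for chips where the former Rx FIFO bit
    is always set, so the bit has to be ignored there.
    """
    actions = []
    if status & RX_RUNT:
        actions.append("count_length_error")
    if status & RX_CRC:
        actions.append("count_crc_error")
    # Only chips with a meaningful overflow bit may start the costly recovery.
    if fifo_bit_reliable and status & RX_FIFO_OVERFLOW:
        actions.append("fifo_overflow_recovery")
    return actions
```

With `fifo_bit_reliable=False`, a status word that carries a CRC error plus the always-set bit is only counted as a CRC error; the recovery action is never selected.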
Created attachment 70312 [details]
don't reset software ring indexes after disabling hardware Rx
Created attachment 70322 [details]
remove erroneous processing of always set bit.
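
Since the XID line is what identifies the chipset revision, and the chip name printed next to it can be a fallback (see the "unknown MAC, using family default" exchange earlier in this thread), a small script can pull both out of a dmesg dump before posting. This is a minimal sketch: the regular expression is modeled only on the probe lines quoted in this report, and the function name `summarize_r8169` is made up for illustration. It does not try to map an XID to a chip revision.

```python
import re

# Pattern modeled on the r8169 probe lines quoted in this report, e.g.
# "r8169 0000:07:00.0: eth0: RTL8168b/8111b at 0x..., <MAC>, XID 0c100000 IRQ 43"
PROBE_RE = re.compile(
    r"r8169 (?P<pci>[0-9a-f:.]+): (?P<iface>\w+): (?P<chip>RTL\S+)"
    r".* XID (?P<xid>[0-9a-f]{8}) IRQ (?P<irq>\d+)"
)

# Message the driver logged in this thread when it did not know the chip.
FALLBACK_MARKER = "unknown MAC, using family default"


def summarize_r8169(dmesg_text):
    """Return (probe_lines, fallback_seen) for a dmesg dump.

    probe_lines is a list of dicts with the keys pci, iface, chip, xid, irq.
    fallback_seen is True if the fallback message appears anywhere in the text.
    """
    probes = [m.groupdict() for m in PROBE_RE.finditer(dmesg_text)]
    return probes, FALLBACK_MARKER in dmesg_text
```

Feeding it the dmesg text as a string (for example the contents of a saved `dmesg` output file) gives the XID to quote in a report, plus a hint that the printed chip name should not be trusted when the fallback flag is set.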
Hi all, I'm sorry for not commenting anymore but I was not able to test the issue any further: the other gigabit powered PC died and I had no other one at hand. Now I bought another lenovo (thinkpad e530) and tested again if the issue was solved. kernel 3.4.4 works like a charm on the edge 15, so this bug can be marked as FIXED for me. Thank you very very much. Cheers :)

Summary:
- Enrico: fixed
- Bryan: "NOHZ: local_softirq_pending 08" is fixed by 8876d6b5f81f4e242a6660da22bbd92f17a8d058 (v3.4 .. v3.5 cycle).
- Tobias: fixed by 8168e support
- Leho Kraav: Gigabyte GA-PM55-UD2 appears to contain a 8168d. Depending on which MTU is used, fixes for it have been merged as recently as march 2012.
- A. Radke: Asus P8P67 contains a now supported 8168e. Old 2.6.xy kernel won't perform well.
- Fridjong: Gigabyte 870A-USB3 contains a 8168e as well.

I'll close this PR as the issues herein reported should be fixed in a recent 3.4-stable or in 3.5. If you still experience problems with any of those, please open a new PR or attach to a recent one.

Thanks.

--
Ueimor

Francois, thanks for the update. Out of tree r8168 has also been updated to compile with 3.4+, I'm on that right now. I'll move back to in-tree module sometimeish and report back if any fun times come up.