Bug 34172 - [r8169] Unreliable network connection
Summary: [r8169] Unreliable network connection
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 high
Assignee: Francois Romieu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-05-01 12:28 UTC by Ivan Bulatovic
Modified: 2012-02-14 15:03 UTC

See Also:
Kernel Version: 2.6.38.4
Tree: Mainline
Regression: Yes



Description Ivan Bulatovic 2011-05-01 12:28:45 UTC
On a GA-H67MA-UD2H-B3 mainboard with an integrated Realtek 8111E NIC I get constant socket timeouts and similar errors while browsing, unless the machine is running from a cold boot (so far it has never happened after a cold boot). Powering the computer down is not sufficient, though: I have to cut power to the PSU and restore it. I have WOL disabled in the BIOS. After subsequent restarts I get a bunch of link up messages in dmesg:


[  214.662469] r8169 0000:03:00.0: eth0: link up
[  215.896654] r8169 0000:03:00.0: eth0: link up
[  218.752114] r8169 0000:03:00.0: eth0: link up
[  221.972782] r8169 0000:03:00.0: eth0: link up
[  361.128443] r8169 0000:03:00.0: eth0: link up
[  400.513334] r8169 0000:03:00.0: eth0: link up
[  403.638188] r8169 0000:03:00.0: eth0: link up
[  431.180236] r8169 0000:03:00.0: PCI INT A disabled
[  447.680820] r8169 0000:03:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[  447.681335] r8169 0000:03:00.0: setting latency timer to 64
[  447.681342] r8169 0000:03:00.0: (unregistered net_device): unknown MAC, using family default
[  447.681432] r8169 0000:03:00.0: irq 49 for MSI/MSI-X
[  447.681954] r8169 0000:03:00.0: eth0: RTL8168b/8111b at 0xffffc90017884000, 1c:6f:65:ca:88:87, XID 0c900800 IRQ 49
[  458.836279] r8169 0000:03:00.0: eth0: TBI auto-negotiating
[  458.836405] r8169 0000:03:00.0: eth0: link down
[  458.836419] r8169 0000:03:00.0: eth0: link down
[  460.394052] r8169 0000:03:00.0: eth0: link up

Restarting the network service works, but only for a short period of time (a few seconds). I've tried with debug=16 (and debug=0x16, I don't know if both are valid) but I don't get any useful messages other than eth0: link up.
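A rough way to quantify the repeated link messages is to parse the dmesg lines and count every "link up" that is not preceded by a "link down" on the same interface. The sketch below is only an illustration: the sample lines are copied from the excerpt above, and the regular expression assumes the exact r8169 message format shown there.

```python
import re

# Sample lines copied from the dmesg excerpt in this report.
SAMPLE = """\
[  214.662469] r8169 0000:03:00.0: eth0: link up
[  215.896654] r8169 0000:03:00.0: eth0: link up
[  218.752114] r8169 0000:03:00.0: eth0: link up
[  221.972782] r8169 0000:03:00.0: eth0: link up
[  361.128443] r8169 0000:03:00.0: eth0: link up
[  400.513334] r8169 0000:03:00.0: eth0: link up
[  403.638188] r8169 0000:03:00.0: eth0: link up
"""

# Assumed format: "[ <seconds>] r8169 <pci address>: <iface>: link up|down"
LINK_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+r8169 \S+ (\w+): link (up|down)$")


def link_events(text):
    """Return (timestamp, interface, state) tuples for r8169 link messages."""
    events = []
    for line in text.splitlines():
        match = LINK_RE.match(line.strip())
        if match:
            events.append((float(match.group(1)), match.group(2), match.group(3)))
    return events


def repeated_link_ups(events):
    """Count 'link up' events that directly follow another 'link up'."""
    count = 0
    last_state = {}
    for _, iface, state in events:
        if state == "up" and last_state.get(iface) == "up":
            count += 1
        last_state[iface] = state
    return count


events = link_events(SAMPLE)
print(len(events), "link events,", repeated_link_ups(events), "repeated link ups")
```

On a live system the same functions could be fed the text of `dmesg` instead of the sample string.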

some more info:

Linux silverstone 2.6.38.4-INTEL #1 SMP PREEMPT Mon Apr 25 10:33:08 CEST 2011 x86_64 Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz GenuineIntel GNU/Linux

eth0      Link encap:Ethernet  HWaddr 1C:6F:65:CA:88:87  
          inet addr:192.168.5.2  Bcast:192.168.5.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1837 errors:0 dropped:1837 overruns:0 frame:1837
          TX packets:1830 errors:0 dropped:11 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1663078 (1.5 Mb)  TX bytes:301265 (294.2 Kb)
          Interrupt:49 Base address:0xc000 

03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)
        Subsystem: Giga-byte Technology GA-EP45-DS5 Motherboard
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 4 bytes
        Interrupt: pin A routed to IRQ 49
        Region 0: I/O ports at de00 [size=256]
        Region 2: Memory at fbdff000 (64-bit, prefetchable) [size=4K]
        Region 4: Memory at fbdf8000 (64-bit, prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee0f00c  Data: 41a9
        Capabilities: [70] Express (v2) Endpoint, MSI 01
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 unlimited, L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB
        Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00000800
        Capabilities: [d0] Vital Product Data
                Unknown small resource type 00, will not decode more.
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [140 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
        Kernel driver in use: r8169
        Kernel modules: r8169

I've tried with dhcpcd and with static addressing, and it happens in both cases. Sometimes I can't even ping my router, while pinging the IP of my NIC works. When I can ping my router, it sometimes drops a lot of packets (30-70%).
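The interface counters in the ifconfig output earlier in this report can be turned into a drop ratio for a quick comparison between boots. This is a minimal sketch, assuming the old net-tools ifconfig line format shown in that output; the ratio is only a rough indicator, since how a driver accounts dropped frames relative to the packet counter is not stated here.

```python
import re

# Counter lines copied from the ifconfig output in this report.
SAMPLE = """\
          RX packets:1837 errors:0 dropped:1837 overruns:0 frame:1837
          TX packets:1830 errors:0 dropped:11 overruns:0 carrier:0
"""


def parse_counters(text):
    """Map 'RX'/'TX' to a dict of the integer counters on that line."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"\s*(RX|TX) (packets:.*)", line)
        if match:
            stats[match.group(1)] = {
                key: int(value)
                for key, value in re.findall(r"(\w+):(\d+)", match.group(2))
            }
    return stats


def drop_ratio(counters):
    """Dropped count relative to the packet count (rough indicator only)."""
    if not counters["packets"]:
        return 0.0
    return counters["dropped"] / counters["packets"]


stats = parse_counters(SAMPLE)
for direction in ("RX", "TX"):
    print(direction, f"{drop_ratio(stats[direction]):.1%}")
```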

I'm running ArchLinux x86_64 with a custom-built kernel.
Comment 1 Ivan Bulatovic 2011-05-01 17:38:19 UTC
Regarding WOL, scratch that: apparently there is no way to disable WOL in the BIOS. The ErP Support option is disabled, but there is no Wake On LAN option there even though the manual suggests there is. Quote:

ErP Support
Determines whether to let the system consume less than 1W power in S5 (shutdown) state. (Default:
Disabled)
Note: When this item is set to Enabled, the following four functions will become unavailable:
PME event wake up, power on by mouse, power on by keyboard, and wake on LAN.

So this could be related to WOL. Is there any way to reset the NIC completely with a kernel parameter or by passing some option to the module?
Comment 2 Ivan Bulatovic 2011-05-02 09:43:43 UTC
Another update: the newer Windows drivers from Realtek are the cause of all of this. When I do a power cycle, boot into Linux and subsequently reboot into Linux, there are no dropped packets and the connection is just fine.

I haven't had these problems before, so I thought that this was a regression, but it looks like the Windows driver update made the connection unusable upon reboot. I've tried everything from turning off power management to switching WOL-related settings on and off, but upon reboot I get no connection at all or an unstable connection in Linux (depending on the settings in Windows).
Comment 3 Francois Romieu 2011-05-04 18:31:21 UTC
The "unknown MAC, using family default" line tells that you are using a yet
unknown (8168) chipset. The driver's default choice may work... or not.

Some support for the 8168e was added during the 2.6.39 cycle. I'd suggest
trying the last -rc, especially as it includes changes to avoid any 60s
freeze when trying to load the (optional) 8168e firmware. If things do not
improve and you see no ethical issue with non-GPL firmware, the 8168e firmware
is available at :

http://git.kernel.org/?p=linux/kernel/git/romieu/linux-firmware.git

(the firmware patches the PHY)

Your XID surprises me. If the 2.6.39-rc driver does not detect your chipset
as a 8168e/8111e, please search the lines below in the driver and remove
the "& 0x9cf0f8ff" part. 

        netif_info(tp, probe, dev, "%s at 0x%lx, %pM, XID %08x IRQ %d\n",
                   rtl_chip_info[chipset].name, dev->base_addr, dev->dev_addr,
                   (u32)(RTL_R32(TxConfig) & 0x9cf0f8ff), dev->irq);

Rationale: it will not work better but it will tell the needed bits for
future identification. :o)

Thanks.

-- 
Ueimor
Comment 4 Francois Romieu 2011-05-04 19:04:15 UTC
(talking to myself)

The firmware part is right, especially the 60s delay related part, but the
8168e bits are not 2.6.39-rc material. They are in davem's net-next tree
and they won't go into the kernel before 2.6.40.

-- 
Ueimor
Comment 5 Ivan Bulatovic 2011-05-04 23:51:02 UTC
I've already tried 2.6.39-rc5 (without success), but I will give it a shot with current mainline; I see that your patch entered mainline after rc5 (commit 953a12cc2889d1be92e80a2d0bab5ffef4942300). Thanks for the guidelines. I'll try the firmware from your tree and the driver modification, and will report back soon.
Comment 6 Ivan Bulatovic 2011-05-05 02:05:19 UTC
I was wrong about that commit; it hasn't landed in mainline yet.

OK, I've tried with rc6, no dice. I've copied r8169.c from davem's next to mainline (I had to, because rc6 doesn't recognize the firmware for r8168e yet; it is missing

#define FIRMWARE_8168E_1        "rtl_nic/rtl8168e-1.fw"
#define FIRMWARE_8168E_2        "rtl_nic/rtl8168e-2.fw"

in r8169.c).

I've tried with and without the firmware loaded (although nothing in dmesg confirms whether the firmware was loaded or not), with vanilla mainline and with the r8169.c from davem's next branch, and I still have connection problems with dropped packets.

I removed the mask from the netif_info call as instructed and I get a different XID:

[    7.722953] r8169 0000:03:00.0: eth0: RTL8168b/8111b at 0xffffc9000553a000, 1c:6f:65:ca:88:87, XID 2f900d00 IRQ 48
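As a quick cross-check of the two XID values in this report, the unmasked value above can be run through the mask from the netif_info call quoted in comment 3. This is just arithmetic on the numbers already posted here; the variable names are made up for the illustration.

```python
# Mask used in the netif_info() call quoted in comment 3.
XID_MASK = 0x9CF0F8FF
# Value printed once the mask was removed from the driver (line above).
RAW_TXCONFIG = 0x2F900D00

# What the unmodified driver prints as XID.
masked = RAW_TXCONFIG & XID_MASK
# Bits the mask normally hides, i.e. the extra information gained by removing it.
hidden = RAW_TXCONFIG & ~XID_MASK & 0xFFFFFFFF

print(f"masked XID : {masked:08x}")
print(f"hidden bits: {hidden:08x}")
```

The masked result is 0c900800, the same XID the unmodified driver printed in the original report, so both log lines describe the same register contents.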
Comment 7 Ivan Bulatovic 2011-05-15 12:40:55 UTC
I've compiled r8168 from the Realtek site (ver. 8.023) and it looks like I have an RTL8111F?

[17409.784303] r8168 Gigabit Ethernet driver 8.023.00-NAPI loaded
[17409.784337] r8168 0000:03:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[17409.784377] r8168 0000:03:00.0: setting latency timer to 64
[17409.785428] r8168 0000:03:00.0: irq 43 for MSI/MSI-X
[17409.785763] eth%d: RTL8168B/8111B at 0xffffc9001951a000, 1c:6f:65:ca:88:87, IRQ 43
[17409.966414] r8168: This product is covered by one or more of the following patents: US5,307,459, US5,434,872, US5,732,094, US6,570,884, US6,115,776, and US6,327,625.
[17409.966417] eth0: Identified chip type is 'RTL8168F/8111F'.
[17409.966418] r8168  Copyright (C) 2011  Realtek NIC software team <nicfae@realtek.com> 

There is no mention of this chip on the Realtek web page...
Comment 8 Ivan Bulatovic 2011-05-15 12:53:59 UTC
Oh, I forgot: I've compiled the kernel from davem's branch again (a few commits that entered 6 days ago looked promising), but ethtool reported that the firmware isn't loaded (N/A) even though I've copied it to /lib/firmware/rtl_nic/, which makes sense now: we don't have proper identifiers for this chip, and for a reason. Is there any chance that we can get some info from Realtek about this NIC anytime soon? The drivers from their website work properly, but since they are listed as maintainers for 8169, I hope that they will cooperate with you on this.

Thanx again Francois for your help.
Comment 9 Ivan Bulatovic 2011-05-28 12:10:36 UTC
Hm, I've just compiled r8168-8.024.00 and now the driver identifies the NIC as:

eth0: Identified chip type is 'RTL8168E-VL/8111E-VL'
Comment 10 Luca Tettamanti 2011-06-28 08:28:20 UTC
I just bought a new board (Asus M5A78L) with what appears to be the same NIC.
The symptoms are the same as those reported by Ivan, and the driver from the Realtek site also identifies the card as "RTL8168E-VL/8111E-VL".
I'm currently testing 3.0-rc4, and noticed that the driver in mainline (r8169) starts working after loading and unloading the r8168 driver from realtek.
I'm of course available for further analysis and testing on the NIC.
Comment 11 Ivan Bulatovic 2011-07-12 18:15:16 UTC
The latest patches from Hayes Wang in davem.next.r8169 fix the problem completely, probably due to

commit 04586b0fc2079f047521b26d7129c6dd2610c291

r8169: support RTL8111E-VL.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Acked-by: Francois Romieu <romieu@fr.zoreil.com>


ethtool -i eth0
driver: r8169
version: 2.3LK-NAPI
firmware-version: rtl_nic/rtl8168e-3.fw
bus-info: 0000:03:00.0


dmesg | grep r8169
[   17.832351] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[   17.833785] r8169 0000:03:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[   17.835085] r8169 0000:03:00.0: setting latency timer to 64
[   17.835181] r8169 0000:03:00.0: irq 43 for MSI/MSI-X
[   17.836525] r8169 0000:03:00.0: eth0: RTL8168evl/8111evl at 0xffffc90005b9a000, 1c:6f:65:ca:88:87, XID 0c900800 IRQ 43
[   20.944148] r8169 0000:03:00.0: eth0: link down
[   24.517060] r8169 0000:03:00.0: eth0: link up

Patches tested against 3.0.0-rc7, no connection problems at all. Hope this gets merged in mainline soon, it would be nice to see it in time for 3.1.0 :)
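For anyone who wants to script the firmware check, the `ethtool -i` output quoted above is simple "key: value" text and can be parsed as follows. This is a sketch under assumptions: the sample is the output from this comment, and treating an empty value or "N/A" as "no firmware loaded" follows the N/A report in comment 8 rather than any documented ethtool contract.

```python
# Output of `ethtool -i eth0` as quoted in this comment.
SAMPLE = """\
driver: r8169
version: 2.3LK-NAPI
firmware-version: rtl_nic/rtl8168e-3.fw
bus-info: 0000:03:00.0
"""


def parse_ethtool_info(text):
    """Split 'key: value' lines into a dict, cutting at the first colon only."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def firmware_loaded(info):
    """Assumption: an empty value or 'N/A' means no firmware was loaded."""
    return info.get("firmware-version", "") not in ("", "N/A")


info = parse_ethtool_info(SAMPLE)
print(info["driver"], info["firmware-version"], firmware_loaded(info))
```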

Thanks Francois, Hayes...
Comment 12 Francois Romieu 2011-07-14 07:35:32 UTC
(In reply to comment #11)
[...]
> Patches tested against 3.0.0-rc7, no connection problems at all. Hope this gets
> merged in mainline soon, it would be nice to see it in time for 3.1.0 :)

It is scheduled so.

Hayes sent a fix for a 8169 regression which appeared in the series.
I'll test it and formally send the resulting series to davem for inclusion
today.

-- 
Ueimor
