Bug 199549
Summary: | r8169 doesn't work for 5 minutes after boot until `NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out` | ||
---|---|---|---|
Product: | Drivers | Reporter: | ojab |
Component: | Network | Assignee: | drivers_network (drivers_network) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | bjo, hkallweit1, neilsimp1, nmr, ykonotopov |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.17-rc2 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
stacktrace in the more readable format
lspci -vvvnn -s 06:00.0 Full dmesg until stacktrace dmesg of interface hang on boot |
Description
ojab
2018-04-27 20:10:51 UTC
Created attachment 275627 [details]
stacktrace in the more readable format
Created attachment 275629 [details]
lspci -vvvnn -s 06:00.0
There are lots of different versions of this network chip. Could you please provide a full dmesg output? In addition: - Does any earlier kernel version work? (IOW: is it a regression?) - Does the vendor Linux driver (r8168) from Realtek work? - Did you check with other OS (Windows)? Created attachment 275741 [details]
Full dmesg until stacktrace
It's actually a regression, with this commit reverted 4.17-rc3 works fine: a92a08499b1f9d74eb5ed77ea04c0e42aab46e9a is the first bad commit commit a92a08499b1f9d74eb5ed77ea04c0e42aab46e9a Author: Heiner Kallweit <hkallweit1@gmail.com> Date: Mon Jan 8 21:39:13 2018 +0100 r8169: improve runtime pm in general and suspend unused ports So far rpm doesn't cover cases like unused ports which are never brought up. If they are active at probe time they remain in this state. Included in this patch: - Let the idle notification check whether we can suspend and let it schedule the suspend. This way we don't need to have calls to pm_schedule_suspend in different places. - At the end of rtl_open and rtl_init_one send an idle notification to allow suspending if the link is down. If a cable is plugged in aneg is finished before the suspend timer expires and the suspend request is cancelled. - Change rtl8169_runtime_suspend to power down the chip if the interface is down. Successfully tested on a RTL8168evl (mac version 34). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> :040000 040000 2823af078aa633b890c340e79f1591d80560ceb3 6060585cd59f01a74dbf8447ebe1b1af509a69a0 M drivers git bisect start # good: [d8a5b80568a9cb66810e75b182018e9edb68e8ff] Linux 4.15 git bisect good d8a5b80568a9cb66810e75b182018e9edb68e8ff # bad: [0adb32858b0bddf4ada5f364a84ed60b196dbcda] Linux 4.16 git bisect bad 0adb32858b0bddf4ada5f364a84ed60b196dbcda # bad: [c14376de3a1befa70d9811ca2872d47367b48767] printk: Wake klogd when passing console_lock owner git bisect bad c14376de3a1befa70d9811ca2872d47367b48767 # good: [a103950e0dd2058df5e8a8d4a915707bdcf205f0] Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 git bisect good a103950e0dd2058df5e8a8d4a915707bdcf205f0 # bad: [0542e13b5f5663ffdc18e0e028413b2cd09f426f] Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue git bisect bad 0542e13b5f5663ffdc18e0e028413b2cd09f426f # good: [8a4816cad00bf14642f0ed6043b32d29a05006ce] tg3: Add Macronix NVRAM support git bisect good 8a4816cad00bf14642f0ed6043b32d29a05006ce # bad: [e8a9d9683c8a62f917c19e57f1618363fb9ed04e] Merge branch 'bpf-libbpf-cleanups' git bisect bad e8a9d9683c8a62f917c19e57f1618363fb9ed04e # good: [4312782479fbf7c5efb9a6f19ad90f8d924055c1] net/mlx5e: IPoIB, Fix spelling mistake "functionts" -> "functions" git bisect good 4312782479fbf7c5efb9a6f19ad90f8d924055c1 # bad: [532b5aa0b41ae908842166be4d17fffa53590bf0] Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue git bisect bad 532b5aa0b41ae908842166be4d17fffa53590bf0 # bad: [398958ae48f44bb036d0fa9829cd489270bf1fc2] ipv6: Add support for non-equal-cost multipath git bisect bad 398958ae48f44bb036d0fa9829cd489270bf1fc2 # bad: [61ad64080e039dce99a7f8d89b729bbea995e2f7] Merge branch 'r8169-improve-runtime-pm' git bisect bad 61ad64080e039dce99a7f8d89b729bbea995e2f7 # good: [8d5dee21f6f01f4632c10b750709a1383eefc7aa] tipc: a couple of cleanups git bisect good 8d5dee21f6f01f4632c10b750709a1383eefc7aa # good: [8348500f80d5660af29c475e1f15d412d83564c9] tipc: add option to suppress PUBLISH events for pre-existing publications git bisect good 8348500f80d5660af29c475e1f15d412d83564c9 # good: [fdb533c30496665775867fc9dabd83e8b0308822] Merge branch 'tipc-improvements-to-group-messaging' git bisect good fdb533c30496665775867fc9dabd83e8b0308822 # good: [ef4d5fccebe9253fd2c934b7a1119c9c9f2b9306] r8169: improve runtime pm in rtl8169_check_link_status git bisect good ef4d5fccebe9253fd2c934b7a1119c9c9f2b9306 # bad: [a92a08499b1f9d74eb5ed77ea04c0e42aab46e9a] r8169: improve runtime pm in general and suspend unused ports git bisect bad a92a08499b1f9d74eb5ed77ea04c0e42aab46e9a # first bad commit: [a92a08499b1f9d74eb5ed77ea04c0e42aab46e9a] r8169: improve runtime pm in general and suspend unused ports Thanks for bisecting! With this patch the NIC runtime-suspends after 10s w/o link. In most cases the network is brought up in less time, in your case however the network is brought up >12s after loading the driver. Means it's suspended already. This shouldn't be a problem because bringing up the network interface also resumes the NIC. This seems to fail in your case. As I see it there is nothing wrong with the patch but it reveals another problem which has been existing before. Just two days ago I got a Asus H310I-Plus Mobo with the same RTL8168h chip and I can reproduce the issue. What should help as a workaround: disable Runtime PM Usually Runtime PM for PCI devices isn't enabled per default. There should be somewhere a command like: echo auto > /sys/bus/pci/devices/0000:06:00.0/power/control If you disable this command, I'd expect the issue to be gone. Root cause seems to be that the PHY isn't fully resumed yet when configuration starts. I checked the vendor driver and it has a delay for certain chip types. The following fixes the issue for me. Could you please check whether this patch fixes your network? diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index 604ae783..c7aac1fc 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -4981,6 +4981,9 @@ static void rtl_pll_power_down(struct rtl8169_private *tp) static void rtl_pll_power_up(struct rtl8169_private *tp) { rtl_generic_op(tp, tp->pll_power_ops.up); + + /* give MAC/PHY some time to resume */ + msleep(20); } static void rtl_init_pll_power_ops(struct rtl8169_private *tp) Works fine with this patch, thanks. Should be fixed with commit 3148dedfe79e ("r8169: fix powering up RTL8168h") in 4.17-rc5. Thanks, 4-17-rc5 works fine here. FWIW I am seeing a similar hang in Linux waldorf 5.7.0-3-amd64 #1 SMP Debian 5.7.17-1 (2020-08-23) x86_64 GNU/Linux Same PCI ID, I think: 05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 09) Subsystem: ASUSTeK Computer Inc. P8 series motherboard [1043:8505] Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 16 Region 0: I/O ports at d000 [size=256] Region 2: Memory at f0304000 (64-bit, prefetchable) [size=4K] Region 4: Memory at f0300000 (64-bit, prefetchable) [size=16K] Capabilities: <access denied> Kernel driver in use: r8169 Kernel modules: r8169 Seems to happen ~reliably on boot for me. Replugging the link seems to fix it. Will attach dmesg. Happy to open a new bug if you would prefer, don't really know what I'm doing. Created attachment 292441 [details]
dmesg of interface hang on boot
|