Bug 202321 - [Intel GFX CI] "Generic PHY r8169-200:00: Master/Slave resolution failed, maybe conflicting manual settings?" when suspending
Summary: [Intel GFX CI] "Generic PHY r8169-200:00: Master/Slave resolution failed, may...
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-17 12:56 UTC by Martin Peres
Modified: 2019-01-25 14:02 UTC (History)
1 user (show)

See Also:
Kernel Version: 5.0.0-rc2
Tree: Mainline
Regression: No


Attachments
Boot log (148.94 KB, text/plain)
2019-01-17 15:31 UTC, Martin Peres
Details
Kernel logs during the execution of our tests (2.79 MB, text/plain)
2019-01-17 15:32 UTC, Martin Peres
Details

Description Martin Peres 2019-01-17 12:56:50 UTC
When suspending of our machines which use a USB r8169 NIC, we sometimes get the following WARNINGS:

<3> [3403.903767] Generic PHY r8169-200:00: Master/Slave resolution failed, maybe conflicting manual settings?
<4> [3403.903801] WARNING: CPU: 0 PID: 14632 at drivers/net/phy/phy.c:728 phy_error+0x5/0x50
<4> [3403.903809] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp btusb btrtl crct10dif_pclmul btbcm btintel crc32_pclmul ghash_clmulni_intel snd_hda_intel bluetooth snd_hda_codec ecdh_generic snd_hwdep snd_hda_core i2c_i801 snd_pcm realtek mei_me r8169 mei prime_numbers i2c_hid pinctrl_geminilake pinctrl_intel
<4> [3403.903837] CPU: 0 PID: 14632 Comm: kworker/0:0 Tainted: G     U            5.0.0-rc2-CI-CI_DRM_5434+ #1
<4> [3403.903840] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
<4> [3403.903843] Workqueue: events_power_efficient phy_state_machine
<4> [3403.903848] RIP: 0010:phy_error+0x5/0x50
<4> [3403.903850] Code: 05 00 00 31 c9 bf 10 00 00 00 e9 36 a6 9f ff 66 0f 1f 44 00 00 eb de 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 55 53 48 89 fb <0f> 0b 48 8d af a8 05 00 00 31 f6 48 89 ef e8 a8 37 2a 00 48 89 ef
<4> [3403.903853] RSP: 0018:ffffc90000cb7e20 EFLAGS: 00010286
<4> [3403.903855] RAX: 0000000000000000 RBX: ffff88826c009bf8 RCX: 8372f7a100000000
<4> [3403.903857] RDX: 0000000000000000 RSI: 0000000060681f36 RDI: ffff88826c009bf8
<4> [3403.903859] RBP: ffff88826c00a1a0 R08: 0000000067060e5c R09: 0000000000000001
<4> [3403.903861] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88826c009bf8
<4> [3403.903863] R13: 0000000000000008 R14: 00000000ffffffbd R15: 0000000000000000
<4> [3403.903865] FS:  0000000000000000(0000) GS:ffff888277e00000(0000) knlGS:0000000000000000
<4> [3403.903867] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [3403.903869] CR2: 00007f81c8e9e090 CR3: 0000000005214000 CR4: 0000000000340ef0
<4> [3403.903872] Call Trace:
<4> [3403.903881]  phy_state_machine+0x85/0x250
<4> [3403.903888]  process_one_work+0x245/0x610
<4> [3403.903897]  worker_thread+0x37/0x380
<4> [3403.903902]  ? process_one_work+0x610/0x610
<4> [3403.903906]  kthread+0x119/0x130
<4> [3403.903910]  ? kthread_park+0x80/0x80
<4> [3403.903916]  ret_from_fork+0x24/0x50
<4> [3403.903929] irq event stamp: 519270
<4> [3403.903933] hardirqs last  enabled at (519269): [<ffffffff81123ca2>] vprintk_emit+0x302/0x320
<4> [3403.903937] hardirqs last disabled at (519270): [<ffffffff810019b0>] trace_hardirqs_off_thunk+0x1a/0x1c
<4> [3403.903939] softirqs last  enabled at (518802): [<ffffffff81c0033a>] __do_softirq+0x33a/0x4b9
<4> [3403.903943] softirqs last disabled at (518791): [<ffffffff810b5081>] irq_exit+0xd1/0xe0
<4> [3403.903947] WARNING: CPU: 0 PID: 14632 at drivers/net/phy/phy.c:728 phy_error+0x5/0x50
<4> [3403.903949] ---[ end trace 7c12a0a911d4fd3d ]---
<6> [3403.903970] r8169 0000:02:00.0 enp2s0: Link is Down
<6> [3403.915188] acpi LNXPOWER:06: Turning OFF
<6> [3403.918358] acpi LNXPOWER:05: Turning OFF
<6> [3403.921543] acpi LNXPOWER:04: Turning OFF
<6> [3403.924722] acpi LNXPOWER:03: Turning OFF
<6> [3403.927756] acpi LNXPOWER:02: Turning OFF
<6> [3403.930455] acpi LNXPOWER:01: Turning OFF
<6> [3403.933149] acpi LNXPOWER:00: Turning OFF
<6> [3403.934512] OOM killer enabled.
<6> [3403.934516] Restarting tasks ... done.
<6> [3403.986198] video LNXVIDEO:00: Restoring backlight state
<6> [3403.986207] PM: suspend exit
<7> [3404.038490] [drm:drm_mode_addfb2] [FB:149]
<7> [3404.093861] [drm:drm_mode_setcrtc] [CRTC:125:pipe C]
<7> [3404.093918] [drm:drm_mode_setcrtc] [CONNECTOR:135:HDMI-A-2]
<6> [3404.219085] PM: suspend entry (deep)
<6> [3404.219092] PM: Syncing filesystems ... done.
<6> [3404.231498] Freezing user space processes ... (elapsed 0.001 seconds) done.
<6> [3404.233197] OOM killer disabled.
<6> [3404.233200] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
<6> [3404.234301] printk: Suspending console(s) (use no_console_suspend to debug)
<5> [3404.260506] sd 0:0:0:0: [sda] Synchronizing SCSI cache
<4> [3404.266904] ------------[ cut here ]------------
<4> [3404.266916] called from state HALTED
<4> [3404.266941] WARNING: CPU: 1 PID: 14831 at drivers/net/phy/phy.c:849 phy_stop+0x7c/0x90
<4> [3404.266944] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp btusb btrtl crct10dif_pclmul btbcm btintel crc32_pclmul ghash_clmulni_intel snd_hda_intel bluetooth snd_hda_codec ecdh_generic snd_hwdep snd_hda_core i2c_i801 snd_pcm realtek mei_me r8169 mei prime_numbers i2c_hid pinctrl_geminilake pinctrl_intel
<4> [3404.266973] CPU: 1 PID: 14831 Comm: kworker/u8:23 Tainted: G     U  W         5.0.0-rc2-CI-CI_DRM_5434+ #1
<4> [3404.266976] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
<4> [3404.266981] Workqueue: events_unbound async_run_entry_fn
<4> [3404.266985] RIP: 0010:phy_stop+0x7c/0x90
<4> [3404.266987] Code: bb 10 05 00 00 5b 5d e9 52 fd ff ff 48 89 df e8 6a f0 ff ff eb d6 48 8b 34 c5 40 8c ed 81 48 c7 c7 a9 67 11 82 e8 e4 78 9d ff <0f> 0b 48 89 ef 5b 5d e9 08 18 2a 00 0f 0b eb 99 90 90 90 90 31 c0
<4> [3404.266990] RSP: 0018:ffffc90000997ce0 EFLAGS: 00010286
<4> [3404.266993] RAX: 0000000000000000 RBX: ffff88826c009bf8 RCX: 0000000000000000
<4> [3404.266995] RDX: 0000000000000000 RSI: ffffffff8212ef6a RDI: 00000000ffffffff
<4> [3404.266998] RBP: ffff88826c00a1a0 R08: 0000000061307cb3 R09: 0000000000000000
<4> [3404.267000] R10: ffffc90000997ce0 R11: 0000000000000000 R12: ffff88826da489e0
<4> [3404.267002] R13: 0000000000000000 R14: 0000000000000000 R15: ffff888275e0e898
<4> [3404.267005] FS:  0000000000000000(0000) GS:ffff888277e80000(0000) knlGS:0000000000000000
<4> [3404.267007] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [3404.267010] CR2: 00005583121d05b8 CR3: 0000000227dd4000 CR4: 0000000000340ee0
<4> [3404.267012] Call Trace:
<4> [3404.267022]  rtl8169_net_suspend+0x2b/0x70 [r8169]
<4> [3404.267029]  rtl8169_suspend+0x10/0x30 [r8169]
<4> [3404.267035]  pci_pm_suspend+0x6d/0x120
<4> [3404.267039]  ? pci_pm_freeze+0xc0/0xc0
<4> [3404.267045]  dpm_run_callback+0x64/0x280
<4> [3404.267051]  __device_suspend+0x12a/0x5b0
<4> [3404.267058]  ? dpm_watchdog_set+0x60/0x60
<4> [3404.267067]  async_suspend+0x15/0x90
<4> [3404.267071]  async_run_entry_fn+0x34/0x160
<4> [3404.267077]  process_one_work+0x245/0x610
<4> [3404.267086]  worker_thread+0x37/0x380
<4> [3404.267092]  ? process_one_work+0x610/0x610
<4> [3404.267095]  kthread+0x119/0x130
<4> [3404.267099]  ? kthread_park+0x80/0x80
<4> [3404.267106]  ret_from_fork+0x24/0x50
<4> [3404.267119] irq event stamp: 3640
<4> [3404.267124] hardirqs last  enabled at (3639): [<ffffffff81123ca2>] vprintk_emit+0x302/0x320
<4> [3404.267128] hardirqs last disabled at (3640): [<ffffffff810019b0>] trace_hardirqs_off_thunk+0x1a/0x1c
<4> [3404.267131] softirqs last  enabled at (2704): [<ffffffff81c0033a>] __do_softirq+0x33a/0x4b9
<4> [3404.267135] softirqs last disabled at (2697): [<ffffffff810b5081>] irq_exit+0xd1/0xe0
<4> [3404.267138] WARNING: CPU: 1 PID: 14831 at drivers/net/phy/phy.c:849 phy_stop+0x7c/0x90
<4> [3404.267141] ---[ end trace 7c12a0a911d4fd3e ]---
<5> [3404.274706] sd 0:0:0:0: [sda] Stopping disk
Comment 1 Heiner Kallweit 2019-01-17 15:01:05 UTC
Are you sure regarding USB, I only know RTL8168 family members with PCI interface.
Can you provide a full dmesg output?
Comment 2 Martin Peres 2019-01-17 15:31:51 UTC
Created attachment 280567 [details]
Boot log

Boot log for the machine.
Comment 3 Martin Peres 2019-01-17 15:32:26 UTC
Created attachment 280569 [details]
Kernel logs during the execution of our tests
Comment 4 Martin Peres 2019-01-17 15:37:10 UTC
(In reply to Heiner Kallweit from comment #1)
> Are you sure regarding USB, I only know RTL8168 family members with PCI
> interface.
> Can you provide a full dmesg output?

Sorry, I mis-remembered:

02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

I guess I should not trust my memory when caring about so many machines...

All the information about the machine can be found here: https://intel-gfx-ci.01.org/hardware/shard-glk6/.

The same problem can be seen on shard-glk7 too, which should be identical.

So far, it looks like this issue is reproducible roughly every other run of our test suite (which does ~80 suspend/resume cycles per run).

Sorry for botching the bug report, I have been context-switching too much today.
Comment 5 Heiner Kallweit 2019-01-17 19:41:36 UTC
Thanks for the feedback. The content of the relevant registers could provide a hint what's going on. Could you please apply the following and provide the output of the inserted debug statement if case error occurs again?


diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 4f61ebf84..264312137 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1746,6 +1746,7 @@ int genphy_read_status(struct phy_device *phydev)
 				return adv;
 
 			if (lpagb & LPA_1000MSFAIL) {
+				pr_info("master/slave failure: lpagb = 0x%04x adv = 0x%04x\n", lpagb, adv);
 				if (adv & CTL1000_ENABLE_MASTER)
 					phydev_err(phydev, "Master/Slave resolution failed, maybe conflicting manual settings?\n");
 				else
-- 
2.20.1
Comment 6 Martin Peres 2019-01-17 21:51:27 UTC
(In reply to Heiner Kallweit from comment #5)
> Thanks for the feedback. The content of the relevant registers could provide
> a hint what's going on. Could you please apply the following and provide the
> output of the inserted debug statement if case error occurs again?


Sure thing! I'll get this in our testing tree tomorrow :)
Comment 7 Heiner Kallweit 2019-01-24 08:24:08 UTC
(In reply to Martin Peres from comment #6)
> (In reply to Heiner Kallweit from comment #5)
> > Thanks for the feedback. The content of the relevant registers could
> provide
> > a hint what's going on. Could you please apply the following and provide
> the
> > output of the inserted debug statement if case error occurs again?
> 
> 
> Sure thing! I'll get this in our testing tree tomorrow :)

Any news here?
Comment 8 Martin Peres 2019-01-25 14:02:22 UTC
(In reply to Heiner Kallweit from comment #7)
> (In reply to Martin Peres from comment #6)
> > (In reply to Heiner Kallweit from comment #5)
> > > Thanks for the feedback. The content of the relevant registers could
> > provide
> > > a hint what's going on. Could you please apply the following and provide
> > the
> > > output of the inserted debug statement if case error occurs again?
> > 
> > 
> > Sure thing! I'll get this in our testing tree tomorrow :)
> 
> Any news here?

Sorry for the delay, I got the patch applied in our test tree (https://cgit.freedesktop.org/drm-tip/commit/?id=e3a9af5b306dbedfb2a8d84676269ad39473d644), and I'll report back next week with more results :)

At least, results should not take long to come, as this has been 100% reproducible on some machines!

Note You need to log in before you can comment on or make changes to this bug.