Bug 202321 - [Intel GFX CI] "Generic PHY r8169-200:00: Master/Slave resolution failed, maybe conflicting manual settings?" when suspending
Summary: [Intel GFX CI] "Generic PHY r8169-200:00: Master/Slave resolution failed, may...
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-17 12:56 UTC by Martin Peres
Modified: 2022-04-07 05:12 UTC (History)
2 users (show)

See Also:
Kernel Version: 5.0.0-rc2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Boot log (148.94 KB, text/plain)
2019-01-17 15:31 UTC, Martin Peres
Details
Kernel logs during the execution of our tests (2.79 MB, text/plain)
2019-01-17 15:32 UTC, Martin Peres
Details

Description Martin Peres 2019-01-17 12:56:50 UTC
When suspending of our machines which use a USB r8169 NIC, we sometimes get the following WARNINGS:

<3> [3403.903767] Generic PHY r8169-200:00: Master/Slave resolution failed, maybe conflicting manual settings?
<4> [3403.903801] WARNING: CPU: 0 PID: 14632 at drivers/net/phy/phy.c:728 phy_error+0x5/0x50
<4> [3403.903809] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp btusb btrtl crct10dif_pclmul btbcm btintel crc32_pclmul ghash_clmulni_intel snd_hda_intel bluetooth snd_hda_codec ecdh_generic snd_hwdep snd_hda_core i2c_i801 snd_pcm realtek mei_me r8169 mei prime_numbers i2c_hid pinctrl_geminilake pinctrl_intel
<4> [3403.903837] CPU: 0 PID: 14632 Comm: kworker/0:0 Tainted: G     U            5.0.0-rc2-CI-CI_DRM_5434+ #1
<4> [3403.903840] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
<4> [3403.903843] Workqueue: events_power_efficient phy_state_machine
<4> [3403.903848] RIP: 0010:phy_error+0x5/0x50
<4> [3403.903850] Code: 05 00 00 31 c9 bf 10 00 00 00 e9 36 a6 9f ff 66 0f 1f 44 00 00 eb de 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 55 53 48 89 fb <0f> 0b 48 8d af a8 05 00 00 31 f6 48 89 ef e8 a8 37 2a 00 48 89 ef
<4> [3403.903853] RSP: 0018:ffffc90000cb7e20 EFLAGS: 00010286
<4> [3403.903855] RAX: 0000000000000000 RBX: ffff88826c009bf8 RCX: 8372f7a100000000
<4> [3403.903857] RDX: 0000000000000000 RSI: 0000000060681f36 RDI: ffff88826c009bf8
<4> [3403.903859] RBP: ffff88826c00a1a0 R08: 0000000067060e5c R09: 0000000000000001
<4> [3403.903861] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88826c009bf8
<4> [3403.903863] R13: 0000000000000008 R14: 00000000ffffffbd R15: 0000000000000000
<4> [3403.903865] FS:  0000000000000000(0000) GS:ffff888277e00000(0000) knlGS:0000000000000000
<4> [3403.903867] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [3403.903869] CR2: 00007f81c8e9e090 CR3: 0000000005214000 CR4: 0000000000340ef0
<4> [3403.903872] Call Trace:
<4> [3403.903881]  phy_state_machine+0x85/0x250
<4> [3403.903888]  process_one_work+0x245/0x610
<4> [3403.903897]  worker_thread+0x37/0x380
<4> [3403.903902]  ? process_one_work+0x610/0x610
<4> [3403.903906]  kthread+0x119/0x130
<4> [3403.903910]  ? kthread_park+0x80/0x80
<4> [3403.903916]  ret_from_fork+0x24/0x50
<4> [3403.903929] irq event stamp: 519270
<4> [3403.903933] hardirqs last  enabled at (519269): [<ffffffff81123ca2>] vprintk_emit+0x302/0x320
<4> [3403.903937] hardirqs last disabled at (519270): [<ffffffff810019b0>] trace_hardirqs_off_thunk+0x1a/0x1c
<4> [3403.903939] softirqs last  enabled at (518802): [<ffffffff81c0033a>] __do_softirq+0x33a/0x4b9
<4> [3403.903943] softirqs last disabled at (518791): [<ffffffff810b5081>] irq_exit+0xd1/0xe0
<4> [3403.903947] WARNING: CPU: 0 PID: 14632 at drivers/net/phy/phy.c:728 phy_error+0x5/0x50
<4> [3403.903949] ---[ end trace 7c12a0a911d4fd3d ]---
<6> [3403.903970] r8169 0000:02:00.0 enp2s0: Link is Down
<6> [3403.915188] acpi LNXPOWER:06: Turning OFF
<6> [3403.918358] acpi LNXPOWER:05: Turning OFF
<6> [3403.921543] acpi LNXPOWER:04: Turning OFF
<6> [3403.924722] acpi LNXPOWER:03: Turning OFF
<6> [3403.927756] acpi LNXPOWER:02: Turning OFF
<6> [3403.930455] acpi LNXPOWER:01: Turning OFF
<6> [3403.933149] acpi LNXPOWER:00: Turning OFF
<6> [3403.934512] OOM killer enabled.
<6> [3403.934516] Restarting tasks ... done.
<6> [3403.986198] video LNXVIDEO:00: Restoring backlight state
<6> [3403.986207] PM: suspend exit
<7> [3404.038490] [drm:drm_mode_addfb2] [FB:149]
<7> [3404.093861] [drm:drm_mode_setcrtc] [CRTC:125:pipe C]
<7> [3404.093918] [drm:drm_mode_setcrtc] [CONNECTOR:135:HDMI-A-2]
<6> [3404.219085] PM: suspend entry (deep)
<6> [3404.219092] PM: Syncing filesystems ... done.
<6> [3404.231498] Freezing user space processes ... (elapsed 0.001 seconds) done.
<6> [3404.233197] OOM killer disabled.
<6> [3404.233200] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
<6> [3404.234301] printk: Suspending console(s) (use no_console_suspend to debug)
<5> [3404.260506] sd 0:0:0:0: [sda] Synchronizing SCSI cache
<4> [3404.266904] ------------[ cut here ]------------
<4> [3404.266916] called from state HALTED
<4> [3404.266941] WARNING: CPU: 1 PID: 14831 at drivers/net/phy/phy.c:849 phy_stop+0x7c/0x90
<4> [3404.266944] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp btusb btrtl crct10dif_pclmul btbcm btintel crc32_pclmul ghash_clmulni_intel snd_hda_intel bluetooth snd_hda_codec ecdh_generic snd_hwdep snd_hda_core i2c_i801 snd_pcm realtek mei_me r8169 mei prime_numbers i2c_hid pinctrl_geminilake pinctrl_intel
<4> [3404.266973] CPU: 1 PID: 14831 Comm: kworker/u8:23 Tainted: G     U  W         5.0.0-rc2-CI-CI_DRM_5434+ #1
<4> [3404.266976] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
<4> [3404.266981] Workqueue: events_unbound async_run_entry_fn
<4> [3404.266985] RIP: 0010:phy_stop+0x7c/0x90
<4> [3404.266987] Code: bb 10 05 00 00 5b 5d e9 52 fd ff ff 48 89 df e8 6a f0 ff ff eb d6 48 8b 34 c5 40 8c ed 81 48 c7 c7 a9 67 11 82 e8 e4 78 9d ff <0f> 0b 48 89 ef 5b 5d e9 08 18 2a 00 0f 0b eb 99 90 90 90 90 31 c0
<4> [3404.266990] RSP: 0018:ffffc90000997ce0 EFLAGS: 00010286
<4> [3404.266993] RAX: 0000000000000000 RBX: ffff88826c009bf8 RCX: 0000000000000000
<4> [3404.266995] RDX: 0000000000000000 RSI: ffffffff8212ef6a RDI: 00000000ffffffff
<4> [3404.266998] RBP: ffff88826c00a1a0 R08: 0000000061307cb3 R09: 0000000000000000
<4> [3404.267000] R10: ffffc90000997ce0 R11: 0000000000000000 R12: ffff88826da489e0
<4> [3404.267002] R13: 0000000000000000 R14: 0000000000000000 R15: ffff888275e0e898
<4> [3404.267005] FS:  0000000000000000(0000) GS:ffff888277e80000(0000) knlGS:0000000000000000
<4> [3404.267007] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [3404.267010] CR2: 00005583121d05b8 CR3: 0000000227dd4000 CR4: 0000000000340ee0
<4> [3404.267012] Call Trace:
<4> [3404.267022]  rtl8169_net_suspend+0x2b/0x70 [r8169]
<4> [3404.267029]  rtl8169_suspend+0x10/0x30 [r8169]
<4> [3404.267035]  pci_pm_suspend+0x6d/0x120
<4> [3404.267039]  ? pci_pm_freeze+0xc0/0xc0
<4> [3404.267045]  dpm_run_callback+0x64/0x280
<4> [3404.267051]  __device_suspend+0x12a/0x5b0
<4> [3404.267058]  ? dpm_watchdog_set+0x60/0x60
<4> [3404.267067]  async_suspend+0x15/0x90
<4> [3404.267071]  async_run_entry_fn+0x34/0x160
<4> [3404.267077]  process_one_work+0x245/0x610
<4> [3404.267086]  worker_thread+0x37/0x380
<4> [3404.267092]  ? process_one_work+0x610/0x610
<4> [3404.267095]  kthread+0x119/0x130
<4> [3404.267099]  ? kthread_park+0x80/0x80
<4> [3404.267106]  ret_from_fork+0x24/0x50
<4> [3404.267119] irq event stamp: 3640
<4> [3404.267124] hardirqs last  enabled at (3639): [<ffffffff81123ca2>] vprintk_emit+0x302/0x320
<4> [3404.267128] hardirqs last disabled at (3640): [<ffffffff810019b0>] trace_hardirqs_off_thunk+0x1a/0x1c
<4> [3404.267131] softirqs last  enabled at (2704): [<ffffffff81c0033a>] __do_softirq+0x33a/0x4b9
<4> [3404.267135] softirqs last disabled at (2697): [<ffffffff810b5081>] irq_exit+0xd1/0xe0
<4> [3404.267138] WARNING: CPU: 1 PID: 14831 at drivers/net/phy/phy.c:849 phy_stop+0x7c/0x90
<4> [3404.267141] ---[ end trace 7c12a0a911d4fd3e ]---
<5> [3404.274706] sd 0:0:0:0: [sda] Stopping disk
Comment 1 Heiner Kallweit 2019-01-17 15:01:05 UTC
Are you sure regarding USB, I only know RTL8168 family members with PCI interface.
Can you provide a full dmesg output?
Comment 2 Martin Peres 2019-01-17 15:31:51 UTC
Created attachment 280567 [details]
Boot log

Boot log for the machine.
Comment 3 Martin Peres 2019-01-17 15:32:26 UTC
Created attachment 280569 [details]
Kernel logs during the execution of our tests
Comment 4 Martin Peres 2019-01-17 15:37:10 UTC
(In reply to Heiner Kallweit from comment #1)
> Are you sure regarding USB, I only know RTL8168 family members with PCI
> interface.
> Can you provide a full dmesg output?

Sorry, I mis-remembered:

02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

I guess I should not trust my memory when caring about so many machines...

All the information about the machine can be found here: https://intel-gfx-ci.01.org/hardware/shard-glk6/.

The same problem can be seen on shard-glk7 too, which should be identical.

So far, it looks like this issue is reproducible roughly every other run of our test suite (which does ~80 suspend/resume cycles per run).

Sorry for botching the bug report, I have been context-switching too much today.
Comment 5 Heiner Kallweit 2019-01-17 19:41:36 UTC
Thanks for the feedback. The content of the relevant registers could provide a hint what's going on. Could you please apply the following and provide the output of the inserted debug statement if case error occurs again?


diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 4f61ebf84..264312137 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1746,6 +1746,7 @@ int genphy_read_status(struct phy_device *phydev)
 				return adv;
 
 			if (lpagb & LPA_1000MSFAIL) {
+				pr_info("master/slave failure: lpagb = 0x%04x adv = 0x%04x\n", lpagb, adv);
 				if (adv & CTL1000_ENABLE_MASTER)
 					phydev_err(phydev, "Master/Slave resolution failed, maybe conflicting manual settings?\n");
 				else
-- 
2.20.1
Comment 6 Martin Peres 2019-01-17 21:51:27 UTC
(In reply to Heiner Kallweit from comment #5)
> Thanks for the feedback. The content of the relevant registers could provide
> a hint what's going on. Could you please apply the following and provide the
> output of the inserted debug statement if case error occurs again?


Sure thing! I'll get this in our testing tree tomorrow :)
Comment 7 Heiner Kallweit 2019-01-24 08:24:08 UTC
(In reply to Martin Peres from comment #6)
> (In reply to Heiner Kallweit from comment #5)
> > Thanks for the feedback. The content of the relevant registers could
> provide
> > a hint what's going on. Could you please apply the following and provide
> the
> > output of the inserted debug statement if case error occurs again?
> 
> 
> Sure thing! I'll get this in our testing tree tomorrow :)

Any news here?
Comment 8 Martin Peres 2019-01-25 14:02:22 UTC
(In reply to Heiner Kallweit from comment #7)
> (In reply to Martin Peres from comment #6)
> > (In reply to Heiner Kallweit from comment #5)
> > > Thanks for the feedback. The content of the relevant registers could
> > provide
> > > a hint what's going on. Could you please apply the following and provide
> > the
> > > output of the inserted debug statement if case error occurs again?
> > 
> > 
> > Sure thing! I'll get this in our testing tree tomorrow :)
> 
> Any news here?

Sorry for the delay, I got the patch applied in our test tree (https://cgit.freedesktop.org/drm-tip/commit/?id=e3a9af5b306dbedfb2a8d84676269ad39473d644), and I'll report back next week with more results :)

At least, results should not take long to come, as this has been 100% reproducible on some machines!
Comment 9 Lakshminarayana Vudum 2022-04-07 05:12:33 UTC
This issue is not happening anymore, so issue is ready for closure.
Comment 10 cibuglog 2022-04-07 05:12:45 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

Note You need to log in before you can comment on or make changes to this bug.