Bug 198519

Summary:	e1000e failed to suspend
Product:	Drivers	Reporter:	Marta Löfstedt (marta.lofstedt)
Component:	Network	Assignee:	drivers_network (drivers_network)
Status:	NEW ---
Severity:	normal	CC:	jbrandeb, marta.lofstedt, uzytkownik2
Priority:	P1
Hardware:	Intel
OS:	Linux
Kernel Version:	4.15.0-rc7	Subsystem:
Regression:	No	Bisected commit-id:
Attachments:	dmesg when the issue occur; dmesg on the run after when the issue does not occur

Description Marta Löfstedt 2018-01-19 12:02:11 UTC

While running CI tests for the i915 driver, the e1000e driver failed to suspend:

<3>[  395.316767] pci_pm_suspend(): e1000e_pm_suspend+0x0/0x40 [e1000e] returns -2
<3>[  395.316772] dpm_run_callback(): pci_pm_suspend+0x0/0x140 returns -2
<3>[  395.316783] PM: Device 0000:00:19.0 failed to suspend async: error -2
<3>[  395.316884] PM: Some devices failed to suspend, or early wake event detected
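For triage: the -2 in these lines is a negated errno value, and errno 2 is ENOENT on Linux. Below is a minimal Python sketch for pulling the failing device and error name out of a dmesg dump. It assumes log lines in the format shown above. The function name `find_suspend_failures` and the regular expression are illustrative and not part of any kernel tooling.

```python
import errno
import os
import re

# Matches lines such as:
#   <3>[  395.316783] PM: Device 0000:00:19.0 failed to suspend async: error -2
# The pattern is an assumption derived from the log excerpts in this report.
_FAILED = re.compile(
    r"PM: Device (?P<dev>\S+) failed to suspend(?: async)?: error (?P<err>-?\d+)"
)


def find_suspend_failures(dmesg_text):
    """Return a list of (device, errno_name, message) for failed suspends."""
    failures = []
    for line in dmesg_text.splitlines():
        match = _FAILED.search(line)
        if match is None:
            continue
        # The kernel reports the error as a negative errno value.
        code = abs(int(match.group("err")))
        name = errno.errorcode.get(code, "UNKNOWN")
        failures.append((match.group("dev"), name, os.strerror(code)))
    return failures


if __name__ == "__main__":
    sample = (
        "<3>[  395.316783] PM: Device 0000:00:19.0 "
        "failed to suspend async: error -2"
    )
    for dev, name, msg in find_suspend_failures(sample):
        print(dev, name, msg)
```

This only identifies which device returned the error; it does not explain why e1000e_pm_suspend returned it.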

There are links to more occurrences and data in the original freedesktop bug: https://bugs.freedesktop.org/show_bug.cgi?id=104550

Also, from dmesg0.log there is:
<3>[  191.130545] e1000e 0000:00:19.0 eth0: Hardware Error

However, there is no hint of hardware errors on the exact same machine on the next run; see dmesg1.log.

Here is an example where the issue happens:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3609/fi-ivb-3520m/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html

Here is an example of the next consecutive run:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3610/fi-ivb-3520m/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html

Comment 1 Marta Löfstedt 2018-01-19 12:05:27 UTC

Created attachment 273717 [details]
dmesg when the issue occur

Comment 2 Marta Löfstedt 2018-01-19 12:06:11 UTC

Created attachment 273719 [details]
dmesg on the run after when the issue does not occur

Comment 3 Marta Löfstedt 2018-02-22 06:51:39 UTC

This was just reproduced with kernel: 4.16.0-rc2

Here are links to more data:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3816/fi-ivb-3520m/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3816/fi-ivb-3520m/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3816/fi-ivb-3520m/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b.html

Comment 4 uzytkownik2@gmail.com 2018-02-23 04:54:04 UTC

I got this after updating to 4.15.x. I suspended 23 times successfully, and then it started failing and got 'stuck' in this mode. When I ran rmmod e1000e I got:

[ 2192.366747] e1000e 0000:00:1f.6 enp0s31f6: removed PHC
[ 2192.895875] e1000e 0000:00:1f.6 enp0s31f6: Hardware Error
[ 2192.898100] e1000e: enp0s31f6 NIC Link is Down

But suspend started working again even after loading e1000e again.
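The unload/reload sequence described in this comment can be sketched in Python as follows. This is only an illustration of the workaround, not a fix. It assumes root privileges and that e1000e is built as a loadable module. The helper name `reload_driver` and its dry-run default are hypothetical. Note that unloading the driver drops the network link.

```python
import subprocess


def reload_driver(module="e1000e", dry_run=True):
    """Unload and reload a kernel module; return the commands involved.

    With dry_run=True (the default) nothing is executed, so the commands
    can be inspected first. Running for real requires root.
    """
    commands = [
        ["modprobe", "-r", module],  # unload, like the rmmod in this comment
        ["modprobe", module],        # load the module again
    ]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    for cmd in reload_driver():
        print(" ".join(cmd))
```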

Comment 5 Marta Löfstedt 2018-02-23 06:46:51 UTC

(In reply to uzytkownik2@gmail.com from comment #4)
> I got this after updating to 4.15.x. I suspended 23 times successfully, and
> then it started failing and got 'stuck' in this mode. When I ran rmmod
> e1000e I got:
> 
> [ 2192.366747] e1000e 0000:00:1f.6 enp0s31f6: removed PHC
> [ 2192.895875] e1000e 0000:00:1f.6 enp0s31f6: Hardware Error
> [ 2192.898100] e1000e: enp0s31f6 NIC Link is Down
> 
> But suspend started working again even after loading e1000e again.

Sure, but our test checks rtcwake, which fails because e1000e fails to suspend. This doesn't happen frequently, but it still does happen.

If you want to see more history of this specific machine in our lab:

https://intel-gfx-ci.01.org/tree/drm-tip/

then click fi-ivb-3520m and you'll see the results on this machine for roughly the last 80 runs.

/Marta

Comment 6 uzytkownik2@gmail.com 2018-02-23 16:29:25 UTC

(In reply to Marta Löfstedt from comment #5)
> (In reply to uzytkownik2@gmail.com from comment #4)
> > I got this after updating to 4.15.x. I suspended 23 times successfully, and
> > then it started failing and got 'stuck' in this mode. When I ran rmmod
> > e1000e I got:
> > 
> > [ 2192.366747] e1000e 0000:00:1f.6 enp0s31f6: removed PHC
> > [ 2192.895875] e1000e 0000:00:1f.6 enp0s31f6: Hardware Error
> > [ 2192.898100] e1000e: enp0s31f6 NIC Link is Down
> > 
> > But suspend started working again even after loading e1000e again.
> 
> Sure, but our test checks rtcwake, which fails because e1000e fails to
> suspend. This doesn't happen frequently, but it still does happen.
> 

I'm not sure what you're trying to correct me on. I wanted to add information:

 - That it happens 'in the wild' on user systems (namely mine) - not only in testing
 - It is an e1000e internal state problem, not, for example, a sticky HW state that wouldn't be reset by reloading the driver (admittedly, that was probably unlikely anyway).

Comment 7 Marta Löfstedt 2018-03-14 06:31:42 UTC

The issue was also reproduced on Cannonlake.

The kernel is drm-tip, based on Linux kernel 4.16.0-rc5.

From dmesg:

<3>[  451.299599] pci_pm_suspend(): e1000e_pm_suspend+0x0/0x40 [e1000e] returns -2
<3>[  451.299606] dpm_run_callback(): pci_pm_suspend+0x0/0x140 returns -2
<3>[  451.299642] PM: Device 0000:00:1f.6 failed to suspend async: error -2
<3>[  451.299714] PM: Some devices failed to suspend, or early wake event detected

More data:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3918/fi-cnl-drrs/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c.html

Comment 8 Marta Löfstedt 2018-04-03 12:49:53 UTC

This was reproduced on another CNL machine:
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_9/fi-cnl-y3/igt@kms_cursor_crc@cursor-64x64-suspend.html

<3>[  218.537930] pci_pm_suspend(): e1000e_pm_suspend+0x0/0x40 [e1000e] returns -2
<3>[  218.537936] dpm_run_callback(): pci_pm_suspend+0x0/0x140 returns -2
<3>[  218.537951] PM: Device 0000:00:1f.6 failed to suspend async: error -2
<3>[  218.538017] PM: Some devices failed to suspend, or early wake event detected

Comment 9 Marta Löfstedt 2018-04-03 12:51:41 UTC

Also, here:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_9/fi-cnl-y3/igt@kms_vblank@pipe-c-ts-continuation-suspend.html

<3>[  109.019792] pci_pm_suspend(): e1000e_pm_suspend+0x0/0x40 [e1000e] returns -2
<3>[  109.019798] dpm_run_callback(): pci_pm_suspend+0x0/0x140 returns -2
<3>[  109.019812] PM: Device 0000:00:1f.6 failed to suspend async: error -2
<3>[  109.019916] PM: Some devices failed to suspend, or early wake event detected