Bug 203175

Summary: e1000e regression with 5.0.7 Detected Hardware Unit Hang
Product: Drivers Reporter: Joe Yasi (joe.yasi)
Component: Network Assignee: drivers_network (drivers_network)
Status: RESOLVED CODE_FIX    
Severity: normal CC: antdev66, aros, jonah.bernhard, koct9i, nicolopiazzalunga, oleksandr
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 5.1.7, 5.0.21 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: attachment-7107-0.html

Description Joe Yasi 2019-04-06 04:20:07 UTC
After upgrading from 5.0.6 to 5.0.7, I now see this in dmesg on boot:
[Sat Apr  6 00:12:10 2019] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Sat Apr  6 00:12:10 2019] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Sat Apr  6 00:12:12 2019] e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:                                                                               
                             TDH                  <0>                           
                             TDT                  <1>                           
                             next_to_use          <1>                           
                             next_to_clean        <0>                           
                           buffer_info[next_to_clean]:                          
                             time_stamp           <fffba7a7>                    
                             next_to_watch        <0>
                             jiffies              <fffbb140>
                             next_to_watch.status <0>
                           MAC Status             <40080080>
                           PHY Status             <7949>
                           PHY 1000BASE-T Status  <0>
                           PHY Extended Status    <3000>
                           PCI Status             <10>
[Sat Apr  6 00:12:14 2019] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

This did not happen with 5.0.6. The ethernet card appears to still work after that message is printed in the log.
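
Editorial note on reading the dump above: TDH and TDT are the hardware head and software tail of the transmit descriptor ring, so TDH=0 with TDT=1 means one descriptor was queued and the hardware had not advanced past it when the watchdog fired. The following is a minimal sketch, not part of the driver, that parses the hex fields of such a dump and computes the gap between time_stamp and jiffies. The helper names (parse_hang_dump, jiffies_elapsed) are hypothetical, and the 32-bit wraparound is an assumption based on the 8-digit values printed in the log.

```python
import re

# Fields copied from the dump in the description (all values are hexadecimal).
DUMP = """\
TDH                  <0>
TDT                  <1>
next_to_use          <1>
next_to_clean        <0>
buffer_info[next_to_clean]:
time_stamp           <fffba7a7>
next_to_watch        <0>
jiffies              <fffbb140>
next_to_watch.status <0>
"""

# Matches "name   <hex>" lines; header lines without a <value> are skipped.
FIELD_RE = re.compile(r"^\s*(?P<name>[\w. \-]+?)\s*<(?P<value>[0-9a-fA-F]+)>\s*$")


def parse_hang_dump(text):
    """Return a dict mapping field name to its integer value (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group("name")] = int(match.group("value"), 16)
    return fields


def jiffies_elapsed(fields, bits=32):
    """Jiffies between time_stamp and the dump, assuming a counter of `bits` width."""
    return (fields["jiffies"] - fields["time_stamp"]) % (1 << bits)


fields = parse_hang_dump(DUMP)
# Descriptors queued by software but not yet consumed by hardware
# (ring wraparound is ignored in this sketch).
pending = fields["TDT"] - fields["TDH"]
print("pending descriptors:", pending)
print("jiffies since time_stamp:", jiffies_elapsed(fields))
```

For this dump the gap is 0x999 (2457) jiffies. Converting that to seconds requires the kernel's HZ setting, which the report does not state.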

Hardware is:
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
        Subsystem: ASUSTeK Computer Inc. Ethernet Connection (2) I219-V
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 145
        Region 0: Memory at df400000 (32-bit, non-prefetchable) [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: e1000e
        Kernel modules: e1000e
Comment 1 Joe Yasi 2019-04-06 04:21:19 UTC
full hardware info with Capabilities:
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
        Subsystem: ASUSTeK Computer Inc. Ethernet Connection (2) I219-V
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 145
        Region 0: Memory at df400000 (32-bit, non-prefetchable) [size=128K]
        Capabilities: [c8] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00518  Data: 0000
        Capabilities: [e0] PCI Advanced Features
                AFCap: TP+ FLR+
                AFCtrl: FLR-
                AFStatus: TP-
        Kernel driver in use: e1000e
        Kernel modules: e1000
Comment 2 Joe Yasi 2019-04-06 13:29:49 UTC
This is caused by commit 7f0a3a436e88a71b96694c029f01a9a8eade3d5d ("e1000e: fix cyclic resets at link up with active tx"). I reverted it and rebuilt e1000e.ko. The error is gone without that change.
Comment 3 Antonio 2019-04-08 09:40:33 UTC
Same problem:

kernel: e1000e 0000:00:19.0 eth2: Detected Hardware Unit Hang:
		TDH                  <0>
		TDT                  <1>
		next_to_use          <1>
		next_to_clean        <0>
	buffer_info[next_to_clean]:
		time_stamp           <fffb90d3>
		next_to_watch        <0>
		jiffies              <fffb9a01>
		next_to_watch.status <0>
	MAC Status             <80080>
	PHY Status             <7949>
	PHY 1000BASE-T Status  <800>
	PHY Extended Status    <3000>
	PCI Status             <10>
Comment 4 Nicolo' 2019-04-09 21:10:10 UTC
Same problem, reverting 7f0a3a436e88a71b96694c029f01a9a8eade3d5d fixes it.

kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
              TDH                                 <0>
              TDT                                 <1>
              next_to_use                     <1>
              next_to_clean                  <0>
            buffer_info[next_to_clean]:
              time_stamp                      <fffeab29>
              next_to_watch                 <0>
              jiffies                                <fffeadc0>
              next_to_watch.status      <0>
            MAC Status                       <80080>
            PHY Status                        <7949>
            PHY 1000BASE-T Status  <0>
            PHY Extended Status        <3000>
            PCI Status                         <10>
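
Editorial note: the three dumps so far report MAC Status values of 40080080, 80080 and 80080. The sketch below decodes the link-related bits, assuming that "MAC Status" is the raw device STATUS register and that the masks match E1000_STATUS_FD, E1000_STATUS_LU and E1000_STATUS_SPEED_* in the driver's defines.h. Both assumptions should be checked against the source of the kernel in use. The function name decode_mac_status is hypothetical.

```python
# Bit masks assumed to match E1000_STATUS_* in the e1000e driver's defines.h.
STATUS_FD = 0x00000001          # full duplex
STATUS_LU = 0x00000002          # link up
STATUS_SPEED_MASK = 0x000000C0  # two-bit speed field
SPEEDS_MBPS = {0x00: 10, 0x40: 100, 0x80: 1000}


def decode_mac_status(status):
    """Decode duplex, link and speed bits from a MAC Status value (hypothetical helper)."""
    return {
        "full_duplex": bool(status & STATUS_FD),
        "link_up": bool(status & STATUS_LU),
        "speed_mbps": SPEEDS_MBPS.get(status & STATUS_SPEED_MASK),
    }


# MAC Status values from the description, comment 3 and comment 4.
for value in (0x40080080, 0x80080, 0x80080):
    print(hex(value), decode_mac_status(value))
```

Under those assumptions, all three values decode with the speed field set to 1000 Mbps and the link-up bit clear at the moment the hang was reported.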
Comment 5 Antonio 2019-04-17 14:53:32 UTC
Kernel 5.0.8: the problem still exists.
Comment 6 Konstantin Khlebnikov 2019-04-18 08:40:19 UTC
proposed revert and fix: https://lkml.org/lkml/2019/4/17/169
Comment 7 Nicolo' 2019-04-23 15:43:17 UTC
I assume the revert hasn't made it into kernel 5.0.9, as I still have the same issue with that version.
Comment 8 Artem S. Tashkinov 2019-04-27 17:07:21 UTC
*** Bug 203447 has been marked as a duplicate of this bug. ***
Comment 9 Artem S. Tashkinov 2019-04-27 17:11:20 UTC
<rant>Now the only question is why kernel 5.0.10 has been released without the fix which means the bug has been known for three weeks (!) and no one has bothered to fix it in a timely manner.

Again, what's the point of "stable" "must update" (as GKH always puts them) x.y.Z updates when they often introduce regressions?</rant>
Comment 10 Oleksandr Natalenko 2019-04-28 10:09:24 UTC
(In reply to Artem S. Tashkinov from comment #9)
> <rant>Now the only question is why kernel 5.0.10 has been released without
> the fix which means the bug has been known for three weeks (!) and no one
> has bothered to fix it in a timely manner.
> 
> Again, what's the point of "stable" "must update" (as GKH always puts them)
> x.y.Z updates when they often introduce regressions?</rant>

1. Linus' branch goes first;
2. once the fix is identified, no one prevents you from applying patches by yourself.
Comment 11 Artem S. Tashkinov 2019-04-28 10:57:11 UTC
(In reply to Oleksandr Natalenko from comment #10)
> 1. Linus' branch goes first;
> 2. once the fix is identified, no one prevents you from applying patches by
> yourself.

<rant>I was under the impression that Linux was meant for normal people, not for geeks who prefer having sex with their PCs instead of doing that with other human beings.

I was under the impression that the word "stable" in x.y.Z releases indeed meant "stable".

Guess I've been all wrong.</rant>
Comment 12 Artem S. Tashkinov 2019-05-03 18:12:34 UTC
5.0.11 has been released without the fix.

What the hell, Konstantin? Are we waiting for something else?
Comment 13 Artem S. Tashkinov 2019-05-05 13:52:44 UTC
Needless to say, kernel 5.0.12 doesn't contain the fix either. Has it even been proposed for inclusion?
Comment 14 Antonio 2019-05-05 17:59:58 UTC
I observed that version 5.0.13 has the fix applied.
For me the problem is solved.
Thank you
Comment 15 Artem S. Tashkinov 2019-05-05 18:02:09 UTC
(In reply to anthony from comment #14)
> I observed that version 5.0.13 has the fix applied.
> For me the problem is solved.
> Thank you

Official kernel 5.0.13 doesn't contain the fix. It's most likely your distro which might have included it.
Comment 16 Antonio 2019-05-05 18:45:21 UTC
(In reply to Artem S. Tashkinov from comment #15)
> Official kernel 5.0.13 doesn't contain the fix. It's most likely your distro
> which might have included it.

I did not find a revert commit, but I saw that the source file ./drivers/net/ethernet/intel/e1000e/netdev.c had already been patched (with the patch from comment 6). I manually compiled a new kernel and saw no error on shutdown.
Comment 17 Antonio 2019-05-11 10:20:15 UTC
Kernel 5.1.1: the patch has not been applied

(It seemed to be fixed earlier. Maybe I had already applied the patch when I compiled the previous version; I don't remember.)
Comment 18 Joe Yasi 2019-05-15 14:01:52 UTC
Looks like this still hasn't been code reviewed. Konstantin, can you ping the list for a review?
Comment 19 Konstantin Khlebnikov 2019-05-16 08:00:58 UTC
Seems like patches are already queued by maintainer: 
https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/log/?h=dev-queue
Comment 20 Antonio 2019-05-31 17:45:23 UTC
Kernel 5.1.6: no patch applied...
Comment 21 Artem S. Tashkinov 2019-06-05 19:55:01 UTC
This bug affects kernel 5.1.7.

What the fucking hell is going on?
Comment 22 Artem S. Tashkinov 2019-06-05 19:56:46 UTC
(In reply to Konstantin Khlebnikov from comment #19)
> Seems like patches are already queued by maintainer: 
> https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/log/
> ?h=dev-queue

The issue was reported two fucking months ago.

What's taking you so long to apply the patch, i.e. revert the bad commit?
Comment 23 Oleksandr Natalenko 2019-06-05 20:10:32 UTC
Hi.

The patches in question are set up for the next merge window here: [1] [2]. Once the pull request is sent and accepted, you may ask for a backport to the stable branch.

Meanwhile, feel free to cherry-pick them by yourself, they can be applied on top of current 5.1 kernel without any issues.

Also note, so far the printout is considered to be harmless, thus no rush in pushing the patches is expected.

Thanks.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/?h=dev-queue&id=caff422ea81e144842bc44bab408d85ac449377b
[2] https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/?h=dev-queue&id=d17ba0f616a08f597d9348c372d89b8c0405ccf3
Comment 24 Jonah Bernhard 2019-06-06 01:13:00 UTC
It is not harmless, this bug has apparently broken suspend on my Arch machine. When I attempt to suspend, it appears to work for a moment, then the computer resumes within a few seconds and this error appears in the journal.
Comment 25 Oleksandr Natalenko 2019-06-06 05:12:30 UTC
(In reply to Jonah Bernhard from comment #24)
> It is not harmless, this bug has apparently broken suspend on my Arch
> machine. When I attempt to suspend, it appears to work for a moment, then
> the computer resumes within a few seconds and this error appears in the
> journal.

Can you confirm that applying those commits fixes the suspend for you? The printout happens to me as well on resume, but the suspend still works, and your issue might be unrelated.
Comment 26 Jonah Bernhard 2019-06-06 19:33:31 UTC
Created attachment 283129 [details]
attachment-7107-0.html

OK, I will test when I get a chance.
Comment 27 Artem S. Tashkinov 2019-06-17 19:38:11 UTC
2.5 months in: the bad commit has still not been reverted.

The just-released 5.1.11 doesn't contain the fix.
Comment 28 Jonah Bernhard 2019-06-29 01:54:18 UTC
Following up on my previous comment, it turns out that this issue was not related to my suspend problem.  Oddly enough, that was caused by a bad USB cable.  I still get the "hardware unit hang" warning on suspend, but it works normally.
Comment 29 Nicolo' 2019-07-11 16:27:09 UTC
I'm on kernel 5.2 and the issue is still present.
Is it normal that it takes so long to revert a simple commit?
Comment 30 Artem S. Tashkinov 2019-07-11 20:42:50 UTC
(In reply to Nicolo' from comment #29)
> I'm on kernel 5.2 and the issue is still present.
> Is it normal that it takes so long to revert a simple commit?

It's not normal, but it's quite usual in regard to Linux. At least some kernel developers have bothered to leave their comments here. Also, they consider this bug report "solved" since there's even a patch. Isn't it amazing?

I have valid bug reports open for over three years with zero activity from appropriate maintainers.

Get used to it.
Comment 31 Joe Yasi 2019-07-11 20:56:16 UTC
The fixes just hit mainline for 5.3. Now someone can ping stable for backports.
Comment 32 Konstantin Khlebnikov 2019-07-15 10:53:37 UTC
Queued into stable 4.4 .. 5.2
Comment 33 Artem S. Tashkinov 2019-07-21 10:43:30 UTC
Fixed in 5.2.2.