Bug 203175
Summary: | e1000e regression with 5.0.7 Detected Hardware Unit Hang | ||
---|---|---|---|
Product: | Drivers | Reporter: | Joe Yasi (joe.yasi) |
Component: | Network | Assignee: | drivers_network (drivers_network) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | antdev66, aros, jonah.bernhard, koct9i, nicolopiazzalunga, oleksandr |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 5.1.7, 5.0.21 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: | attachment-7107-0.html |
Description
Joe Yasi
2019-04-06 04:20:07 UTC
full hardware info with Capabilities:

00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
    Subsystem: ASUSTeK Computer Inc. Ethernet Connection (2) I219-V
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 145
    Region 0: Memory at df400000 (32-bit, non-prefetchable) [size=128K]
    Capabilities: [c8] Power Management version 3
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee00518 Data: 0000
    Capabilities: [e0] PCI Advanced Features
        AFCap: TP+ FLR+
        AFCtrl: FLR-
        AFStatus: TP-
    Kernel driver in use: e1000e
    Kernel modules: e1000

This is caused by 7f0a3a436e88a71b96694c029f01a9a8eade3d5d e1000e: fix cyclic resets at link up with active tx.
I reverted it and rebuilt e1000e.ko. The error is gone without that change.

same problem

kernel: e1000e 0000:00:19.0 eth2: Detected Hardware Unit Hang:
    TDH <0>
    TDT <1>
    next_to_use <1>
    next_to_clean <0>
  buffer_info[next_to_clean]:
    time_stamp <fffb90d3>
    next_to_watch <0>
    jiffies <fffb9a01>
    next_to_watch.status <0>
  MAC Status <80080>
  PHY Status <7949>
  PHY 1000BASE-T Status <800>
  PHY Extended Status <3000>
  PCI Status <10>

Same problem, reverting 7f0a3a436e88a71b96694c029f01a9a8eade3d5d fixes it.
kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
    TDH <0>
    TDT <1>
    next_to_use <1>
    next_to_clean <0>
  buffer_info[next_to_clean]:
    time_stamp <fffeab29>
    next_to_watch <0>
    jiffies <fffeadc0>
    next_to_watch.status <0>
  MAC Status <80080>
  PHY Status <7949>
  PHY 1000BASE-T Status <0>
  PHY Extended Status <3000>
  PCI Status <10>

kernel 5.0.8: problem still exists

proposed revert and fix: https://lkml.org/lkml/2019/4/17/169

I assume the revert hasn't made it into kernel 5.0.9, as I still have the same issue with that version.

*** Bug 203447 has been marked as a duplicate of this bug. ***

<rant>Now the only question is why kernel 5.0.10 has been released without the fix which means the bug has been known for three weeks (!) and no one has bothered to fix it in a timely manner.

Again, what's the point of "stable" "must update" (as GKH always puts them) x.y.Z updates when they often introduce regressions?</rant>

(In reply to Artem S. Tashkinov from comment #9)
> <rant>Now the only question is why kernel 5.0.10 has been released without
> the fix which means the bug has been known for three weeks (!) and no one
> has bothered to fix it in a timely manner.
>
> Again, what's the point of "stable" "must update" (as GKH always puts them)
> x.y.Z updates when they often introduce regressions?</rant>

1. Linus' branch goes first;
2. once the fix is identified, no one prevents you from applying patches by yourself.

(In reply to Oleksandr Natalenko from comment #10)
> 1. Linus' branch goes first;
> 2. once the fix is identified, no one prevents you from applying patches by
> yourself.

<rant>I was under the impression that Linux was meant for normal people, not for geeks who prefer having sex with their PCs instead of doing that with other human beings. I was under the impression that the word "stable" in x.y.Z releases indeed meant "stable". Guess I've been all wrong.</rant>

5.0.11 has been released without the fix. What the hell, Konstantin?
Are we waiting for something else?

Needless to say kernel 5.0.12 doesn't contain the fix either. Has it even been proposed for inclusion?

I observed that version 5.0.13 has the fix applied.
For me the problem is solved.
Thank you

(In reply to anthony from comment #14)
> I observed that version 5.0.13 has the fix applied.
> For me the problem is solved.
> Thank you

Official kernel 5.0.13 doesn't contain the fix. It's most likely your distro which might have included it.

(In reply to Artem S. Tashkinov from comment #15)
> Official kernel 5.0.13 doesn't contain the fix. It's most likely your distro
> which might have included it.

I did not find a revert commit, but I see that the source file ./drivers/net/ethernet/intel/e1000e/netdev.c has already been patched (with the patch from comment 6).
I manually compiled the new kernel and found no error at shutdown.

Kernel 5.1.1: the patch has not been applied (it had seemed to be fixed... maybe when I compiled the previous version I had already applied the patch, I don't remember)

Looks like this still hasn't been code reviewed. Konstantin, can you ping the list for a review?

Seems like patches are already queued by maintainer: https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/log/?h=dev-queue

Kernel 5.1.6: no patch applied...

This bug affects kernel 5.1.7. What the fucking hell is going on?

(In reply to Konstantin Khlebnikov from comment #19)
> Seems like patches are already queued by maintainer:
> https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/log/?h=dev-queue

The issue was reported two fucking months ago. What's taking you so long to apply the patch, i.e. revert the bad commit?

Hi. The patches in question are set up for the next merge window here: [1] [2]. Once the pull request is sent and accepted, you may ask for a backport to the stable branch. Meanwhile, feel free to cherry-pick them yourself; they can be applied on top of the current 5.1 kernel without any issues.
Also note, so far the printout is considered to be harmless, thus no rush in pushing the patches is expected.

Thanks.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/?h=dev-queue&id=caff422ea81e144842bc44bab408d85ac449377b
[2] https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/?h=dev-queue&id=d17ba0f616a08f597d9348c372d89b8c0405ccf3

It is not harmless, this bug has apparently broken suspend on my Arch machine. When I attempt to suspend, it appears to work for a moment, then the computer resumes within a few seconds and this error appears in the journal.

(In reply to Jonah Bernhard from comment #24)
> It is not harmless, this bug has apparently broken suspend on my Arch
> machine. When I attempt to suspend, it appears to work for a moment, then
> the computer resumes within a few seconds and this error appears in the
> journal.

Can you confirm that applying those commits fixes the suspend for you? The printout happens to me as well on resume, but the suspend still works, and your issue might be unrelated.

Created attachment 283129
attachment-7107-0.html

OK, I will test when I get a chance.

On Thu, Jun 6, 2019 at 1:12 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=203175
>
> --- Comment #25 from Oleksandr Natalenko (oleksandr@natalenko.name) ---
> (In reply to Jonah Bernhard from comment #24)
> > It is not harmless, this bug has apparently broken suspend on my Arch
> > machine. When I attempt to suspend, it appears to work for a moment, then
> > the computer resumes within a few seconds and this error appears in the
> > journal.
>
> Can you confirm that applying those commits fixes the suspend for you? The
> printout happens to me as well on resume, but the suspend still works, and
> your issue might be unrelated.
2.5 months in: the bad commit has still not been reverted. The just-released 5.1.11 doesn't contain the fix.

Following up on my previous comment, it turns out that this issue was not related to my suspend problem. Oddly enough, that was caused by a bad USB cable. I still get the "hardware unit hang" warning on suspend, but it works normally.

I'm on kernel 5.2 and the issue is still present.
Is it normal that it takes so long to revert a simple commit?

(In reply to Nicolo' from comment #29)
> I'm on kernel 5.2 and the issue is still present.
> Is it normal that it takes so long to revert a simple commit?

It's not normal but it's quite usual in regard to Linux. At least some kernel developers have bothered to leave their comments here. Also they consider this bug report "solved" since there's even a patch. Isn't it amazing?

I have valid bug reports open for over three years with zero activity from appropriate maintainers. Get used to it.

The fixes just hit mainline for 5.3. Now someone can ping stable for backports.

Queued into stable 4.4 .. 5.2

Fixed in 5.2.2.
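
For anyone comparing their own journal against the reports in this bug: the "Detected Hardware Unit Hang" messages quoted above all print each field as a name followed by a value in angle brackets. Below is a minimal sketch for pulling those fields out of one report and computing how many jiffies passed since the stuck descriptor was time-stamped. The helper names `parse_hang` and `stalled_ticks` are made up for illustration, and the sketch assumes every bracketed value is hexadecimal (which the `time_stamp` and `jiffies` values suggest) and does not handle jiffies wraparound.

```python
import re

HANG_MARKER = "Detected Hardware Unit Hang:"
# Each field is printed as a name followed by a value in angle brackets.
FIELD_RE = re.compile(r"([^<>]+?)\s*<([0-9a-fA-F]+)>")


def parse_hang(report):
    """Return {field name: int} for one hang report, or None if absent."""
    _, found, tail = report.partition(HANG_MARKER)
    if not found:
        return None
    fields = {}
    for name, value in FIELD_RE.findall(tail):
        # Drop section prefixes such as "buffer_info[next_to_clean]:".
        key = name.split(":")[-1].strip()
        fields[key] = int(value, 16)  # assumption: all values are hex
    return fields


def stalled_ticks(fields):
    """Jiffies elapsed between the descriptor time stamp and the report."""
    return fields["jiffies"] - fields["time_stamp"]


# The 0000:00:19.0 report quoted earlier in this bug, as a single string.
SAMPLE = (
    "kernel: e1000e 0000:00:19.0 eth2: Detected Hardware Unit Hang: "
    "TDH <0> TDT <1> next_to_use <1> next_to_clean <0> "
    "buffer_info[next_to_clean]: time_stamp <fffb90d3> next_to_watch <0> "
    "jiffies <fffb9a01> next_to_watch.status <0> MAC Status <80080> "
    "PHY Status <7949> PHY 1000BASE-T Status <800> "
    "PHY Extended Status <3000> PCI Status <10>"
)

fields = parse_hang(SAMPLE)
print(fields["TDH"], fields["TDT"], stalled_ticks(fields))  # prints: 0 1 2350
```

Converting the tick count to seconds needs the kernel's HZ setting, which is not stated anywhere in this report, so the sketch stops at raw jiffies.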