Kernel Bug Tracker – Bug 74451
Shutdown, Suspend, Hibernate Fail. I217V Ethernet Port Hangs on Reboot even after power cycle.
Last modified: 2015-06-27 20:21:02 UTC
Created attachment 133051 [details]
Combined file with dmesg and lspci output from bad and good boots
Shutdown often reboots rather than shutting down. Suspend and Hibernate sometimes reboot, never resume properly. Ethernet port does not work on reboot after any shutdown, suspend, or hibernate, even after a full power cycle. Only opening the case and pressing the motherboard Clear CMOS button brings the Ethernet Port back. When the Ethernet Port is working it stays working after a Restart without an intervening Shutdown, Suspend or Hibernate.
Hardware: ASRock Z87 Extreme4/TB4 Motherboard with Intel I217V (Rev 5) Ethernet, Core i7-4770K, onboard HD4600 Graphics. Nothing else besides RAM, SATA drives and USB devices connected.
Problem occurs with all distros and kernels I have tried including 3.14, and with all e1000e drivers I have tested including 18.104.22.168.
Attached are dmesg and lspci outputs from a good boot after a clear CMOS and from a bad boot after shutdown where the Ethernet fails to work.
Happy to provide additional logs or debugging information if given instructions.
Is this a regression? If yes, please do a git bisect. So far, have no good idea.
It is not a regression. It has been this way since I put the machine together in November with all kernels from 3.11 on. Think I even tried one somewhat earlier kernel.
The ASRock Z87 Extreme4/TB4 Motherboard probably isn't a common motherboard.
Some interaction with it/its BIOS borks something, but what?
I am no kernel person, but can provide additional logs, diagnostics, information, etc. as instructed. Can look into it more if given some hints as to how.
Created attachment 134351 [details]
dmesg output after good boot with latest 3.14.2 kernel
Created attachment 134361 [details]
dmesg output after shutdown and bad reboot with latest kernel 3.14.2
Ethernet port doesn't work after shutdown and power button pressed to reboot.
Created attachment 134371 [details]
sudo lspci -vvvxxxxnn output after good boot with latest 3.14.2 kernel
Created attachment 134381 [details]
sudo lspci -vvvxxxxnn output after shutdown and bad reboot with latest kernel 3.14.2
Okay. In the hopes of getting some help I've upgraded the motherboard firmware, and upgraded to the latest 3.14.2 kernel and separately attached dmesg and lspci output after good and bad boots. This is not a regression. Behavior was the same with all previous versions of the kernel and firmware. Maybe it's not ACPI, but then can someone advise me where to report or look for help?
I find it interesting that the dmesg output is longer after an initial good boot.
The first diff in dmesg output shows that a little more RAM is usable during the good boot:
$ diff dmesg.bad.20140429 dmesg.good.20140429
< [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000044f5fffff] usable
> [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x00000004505fffff] usable
What about a shutdown, hibernate, and suspend could twiddle the CMOS into missing some memory?
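For scale, the gap between those two e820 ranges can be worked out directly. A minimal Python sketch using only the two ranges quoted above (the function and variable names are mine, not from any tool):

```python
# Sizes of the "usable" e820 ranges quoted in the diff above.
# e820 end addresses are inclusive, hence the +1.
def range_size(start_hex: str, end_hex: str) -> int:
    return int(end_hex, 16) - int(start_hex, 16) + 1

bad_bytes = range_size("0x0000000100000000", "0x000000044f5fffff")
good_bytes = range_size("0x0000000100000000", "0x00000004505fffff")

# Difference between the good and bad boot, in MiB.
missing_mib = (good_bytes - bad_bytes) // (1024 * 1024)
print(f"bad boot is missing {missing_mib} MiB of usable RAM")
```

So the bad boot loses 16 MiB at the top of that range.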
We see more memory address range differences. Then we see some differences
after a message that MTRR variable ranges are enabled.
< [ 0.000000] 3 base 00C0000000 mask 7FC0000000 uncachable
< [ 0.000000] 4 base 00A0000000 mask 7FE0000000 uncachable
< [ 0.000000] 5 base 009F800000 mask 7FFF800000 uncachable
< [ 0.000000] 6 base 044F800000 mask 7FFF800000 uncachable
< [ 0.000000] 7 base 044F600000 mask 7FFFE00000 uncachable
> [ 0.000000] 3 base 0450000000 mask 7FFFC00000 write-back
> [ 0.000000] 4 base 0450400000 mask 7FFFE00000 write-back
> [ 0.000000] 5 base 00C0000000 mask 7FC0000000 uncachable
> [ 0.000000] 6 base 00A0000000 mask 7FE0000000 uncachable
> [ 0.000000] 7 base 009F800000 mask 7FFF800000 uncachable
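To make the MTRR entries easier to compare, here is a rough Python sketch that turns a base/mask pair into an address range. It assumes a 39-bit physical address width, which I am only inferring from the masks starting with 7F; if the CPU reports a different width the sizes would change.

```python
PHYS_BITS = 39  # assumption: inferred from masks of the form 7F........

def mtrr_size(mask_hex: str, phys_bits: int = PHYS_BITS) -> int:
    """Bytes covered by a variable-range MTRR with the given mask."""
    full = (1 << phys_bits) - 1
    return (~int(mask_hex, 16) & full) + 1

def mtrr_range(base_hex: str, mask_hex: str) -> tuple:
    """Inclusive (first, last) address covered by one base/mask pair."""
    base = int(base_hex, 16)
    return base, base + mtrr_size(mask_hex) - 1

# The entries near the top of RAM that differ between the two boots.
entries = [
    ("bad",  "044F800000", "7FFF800000", "uncachable"),
    ("bad",  "044F600000", "7FFFE00000", "uncachable"),
    ("good", "0450000000", "7FFFC00000", "write-back"),
    ("good", "0450400000", "7FFFE00000", "write-back"),
]
for boot, base, mask, kind in entries:
    first, last = mtrr_range(base, mask)
    print(f"{boot:4s} {first:#012x}-{last:#012x} {mtrr_size(mask) >> 20:4d} MiB {kind}")
```

If the 39-bit assumption holds, the good boot maps 0x450000000-0x4505fffff as write-back, while the bad boot marks 0x44f600000-0x44fffffff uncachable. That lines up with where the usable e820 range ends in each boot.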
Lots more differences with memory ranges for things. Then
< [ 0.000000] DMA zone: 25 pages reserved
> [ 0.000000] DMA zone: 27 pages reserved
Then we have to remove the timestamps for diff to be useful.
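One way to strip the timestamps before diffing, as a small Python sketch. The regular expression assumes the usual "[ seconds.microseconds]" prefix seen in the logs here; the function names are just illustrative.

```python
import difflib
import re

# Matches a leading dmesg timestamp such as "[ 0.697939] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")

def strip_timestamps(lines):
    """Return the lines with any leading timestamp and trailing newline removed."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]

def diff_dmesg(bad_lines, good_lines):
    """Unified diff of two dmesg logs, ignoring timestamps."""
    return list(difflib.unified_diff(
        strip_timestamps(bad_lines),
        strip_timestamps(good_lines),
        fromfile="bad", tofile="good", lineterm="",
    ))
```

It could be used as diff_dmesg(open("dmesg.bad.20140429").readlines(), open("dmesg.good.20140429").readlines()), then printing the returned lines.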
The good boot has messages about *BAD*gran_size
and mtrr_cleanup: can not find optimal value
More memory range differences
Then we see a missing line in the bad boot about ehci-pci
< Magic number: 10:30:712
< rtc_cmos 00:06: setting system clock to 2014-04-30 01:44:28 UTC (1398822268)
> Magic number: 10:133:813
> ehci-pci 0000:00:1a.0: hash matches
> rtc_cmos 00:06: setting system clock to 2014-04-30 01:50:52 UTC (1398822652)
The differences get harder to follow. Things seem to be happening in different orders, IRQ numbers vary.
Just grepping for e1000e though
$ grep e1000e dmesg.good.20140429
[ 0.697939] e1000e: Intel(R) PRO/1000 Network Driver - 2.3.2-k
[ 0.697941] e1000e: Copyright(c) 1999 - 2013 Intel Corporation.
[ 0.698090] e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[ 0.698104] e1000e 0000:00:19.0: irq 41 for MSI/MSI-X
[ 0.866888] e1000e 0000:00:19.0 eth0: registered PHC clock
[ 0.866890] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) bc:5f:f4:d7:c1:0b
[ 0.866891] e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
[ 0.866935] e1000e 0000:00:19.0 eth0: MAC: 11, PHY: 12, PBA No: FFFFFF-0FF
[ 2.845759] e1000e 0000:00:19.0: irq 41 for MSI/MSI-X
[ 2.949233] e1000e 0000:00:19.0: irq 41 for MSI/MSI-X
[ 5.776351] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
$ grep e1000e dmesg.bad.20140429
[ 0.701104] e1000e: Intel(R) PRO/1000 Network Driver - 2.3.2-k
[ 0.701107] e1000e: Copyright(c) 1999 - 2013 Intel Corporation.
[ 0.712900] e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[ 0.712922] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[ 1.705047] e1000e: probe of 0000:00:19.0 failed with error -3
And I notice that somehow I've reverted to an old e1000e driver. Upgrading, however, doesn't help.
Are there any more logs that I can provide or more debugging that I can do?
I'd like to look for relevant messages during the shutdown after a good boot, but don't have a clue how to make those be stored somewhere where I can find them later after a reboot.
Just to prove it wasn't the driver, I installed the latest e1000e driver. Same result.
Good boot after hitting the Clear CMOS button:
$ dmesg | grep e1000e
[ 0.696466] e1000e: module verification failed: signature and/or required key missing - tainting kernel
[ 0.696894] e1000e: Intel(R) PRO/1000 Network Driver - 22.214.171.124-NAPI
[ 0.696896] e1000e: Copyright(c) 1999 - 2014 Intel Corporation.
[ 0.697054] e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[ 0.697071] e1000e 0000:00:19.0: irq 41 for MSI/MSI-X
[ 0.871470] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) bc:5f:f4:d7:c1:0b
[ 0.871471] e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
[ 0.871515] e1000e 0000:00:19.0 eth0: MAC: 11, PHY: 12, PBA No: FFFFFF-0FF
[ 2.470241] e1000e 0000:00:19.0: irq 41 for MSI/MSI-X
[ 2.573771] e1000e 0000:00:19.0: irq 41 for MSI/MSI-X
[ 5.261046] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Bad reboot after a shutdown
$ dmesg | grep e1000e
[ 0.698100] e1000e: module verification failed: signature and/or required key missing - tainting kernel
[ 0.698434] e1000e: Intel(R) PRO/1000 Network Driver - 126.96.36.199-NAPI
[ 0.698435] e1000e: Copyright(c) 1999 - 2014 Intel Corporation.
[ 0.698569] e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[ 0.698582] e1000e 0000:00:19.0: irq 41 for MSI/MSI-X
[ 1.690634] e1000e: probe of 0000:00:19.0 failed with error -3
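The good and bad logs differ in just two places that matter: the bad boot ends with a probe failure, the good boot ends with link up. A small Python sketch that pulls those two facts out of a dmesg dump, in case it helps compare many boots. The patterns are taken from the messages quoted above; the function name and the returned dict layout are my own.

```python
import re

# Patterns copied from the e1000e messages quoted in this report.
PROBE_FAILED = re.compile(r"e1000e: probe of (\S+) failed with error (-?\d+)")
LINK_UP = re.compile(r"e1000e: (\S+) NIC Link is Up")

def summarize_e1000e(dmesg_text: str) -> dict:
    """Report whether the e1000e probe failed and whether link came up."""
    summary = {"probe_failed": None, "link_up": None}
    for line in dmesg_text.splitlines():
        match = PROBE_FAILED.search(line)
        if match:
            # (PCI address, error code as printed by the kernel)
            summary["probe_failed"] = (match.group(1), int(match.group(2)))
        match = LINK_UP.search(line)
        if match:
            summary["link_up"] = match.group(1)  # interface name
    return summary
```

The sketch only reports the error code. It does not try to interpret what -3 means inside the driver.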
Since this bug is marked "NEEDINFO" may I ask for some information or response?
I would like to provide "INFO." How can I provide more info if nobody communicates to tell me what might be needed?
The bug is not a regression. It happens with all kernels I have tested.
Perhaps it is an ethernet driver bug, but it happens with all e1000e drivers
I have tried. How could a driver work on initial boot, but then fail subsequently?
Something in the CMOS state must be different? Our kernel is borking the CMOS state somehow.
Perhaps it is not an ACPI bug, but then why do sleep and hibernate not work?
What else in our kernel could be at fault?
Most likely there is some peculiarity with the ASRock Z87 Extreme4/TB4 UEFI/BIOS.
It's my mistake for thinking it would be fun to have a board with Thunderbolt support, but if this board works fine under Windows, but not under Linux, is it fair to say the board is at fault?
Is anyone even getting these emails? It feels like a blackhole.
Mark it "WONTFIX" if it's a rare motherboard that nobody cares about.
Please don't mark "NEEDINFO" and then ignore me.
okay, it seems that there are several problems here
1. reboots instead of shutdown
2. suspend/hibernate broken (sometimes reboot)
3. ethernet card does not work after failures 1 and 2.
IMO, first of all, I suspect that the BIOS is broken when the system enters the ACPI S3/S4/S5 states, and this explains why there are so many differences in hardware settings during boot, which is apparently a BIOS problem to me.
Anyway, please run "echo shutdown > /sys/power/disk" before hibernate and see if hibernation still fails.
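To confirm the setting took effect before hibernating, /sys/power/disk can be read back: the kernel lists the available modes and puts the selected one in square brackets. A minimal Python sketch for checking that (helper names are hypothetical):

```python
import re

def selected_mode(disk_text: str):
    """Return the bracketed (currently selected) hibernation mode, or None."""
    match = re.search(r"\[(\w+)\]", disk_text)
    return match.group(1) if match else None

def available_modes(disk_text: str):
    """Return every mode listed in the file, selected or not."""
    return [word.strip("[]") for word in disk_text.split()]
```

For example, selected_mode(open("/sys/power/disk").read()) should return "shutdown" after the echo above.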
You are correct, many problems.
1. Yes. Sometimes. Sometimes shuts down but then no ethernet after powering up.
3. Onboard ethernet port, but yes. Only opening case and hitting clear CMOS button on MB reenables ethernet.
Unfortunately I lost access to the machine. It is repurposed until November.
I am sure you are correct that BIOS is broken, but it becomes philosophical.
If Windows 8 works fine and ASRock will not respond to complaints about Linux problems, is the problem BIOS, Linux, or ASRock? I will not buy another ASRock board, but I still can't run Linux on this one.
I will try "echo shutdown > /sys/power/disk" before hibernate in November.
Are bugs allowed to stay open until November?
Thank you for something to try. Hope you are still around in November.
yes, I am. :)
are you able to do the test?
I tried it. No improvement. Hibernate powered the machine off. It didn't respond to keypresses. When I pushed the power switch, it powered on, gave a message about resuming from some image, but then rebooted rather than resuming. On reboot the ethernet wasn't working. No subsequent shutdowns or reboots could get the ethernet back.
Only opening the case and hitting the Clear CMOS button allows me to boot with a working ethernet port.
Clearly a BIOS/Linux interaction problem, but Windows doesn't have a problem.
please attach the acpidump output, in both good and bad boots.
please attach the output of "cat /proc/acpi/wakeup" in both boots.
We can also check whether the system (hibernation) works without the ethernet card/driver. Preferably disable the device in the BIOS if there is such an option; if not, please remove the e1000e module completely.
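For comparing the two /proc/acpi/wakeup outputs, a rough Python sketch that reports only the devices whose S-state or enabled/disabled status differs. It assumes the usual column layout (Device, S-state, Status, Sysfs node) with a header line starting with "Device"; the function names are hypothetical.

```python
def parse_wakeup(text: str) -> dict:
    """Map device name -> (S-state, status) from /proc/acpi/wakeup text."""
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 3 or fields[0] == "Device":
            continue
        name, sstate, status = fields[0], fields[1], fields[2].lstrip("*")
        devices[name] = (sstate, status)
    return devices

def wakeup_changes(bad_text: str, good_text: str) -> dict:
    """Devices whose entry differs, as name -> (bad entry, good entry)."""
    bad, good = parse_wakeup(bad_text), parse_wakeup(good_text)
    return {
        name: (bad.get(name), good.get(name))
        for name in sorted(set(bad) | set(good))
        if bad.get(name) != good.get(name)
    }
```

An empty result would mean the wakeup configuration is identical in both boots.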
Sorry. Been traveling away from the machine in question. Will be back Wednesday and can get you output shortly thereafter.
Created attachment 169981 [details]
acpidump output after bad boot, ethernet not working
Created attachment 169991 [details]
acpidump output after good boot, ethernet working
Created attachment 170001 [details]
cat /proc/acpi/wakeup output after bad boot, ethernet not working
Created attachment 170011 [details]
cat /proc/acpi/wakeup output after good boot, ethernet working
Once I disabled Onboard LAN in BIOS/UEFI, Resuming from Suspend and Hibernate both seemed to work. It's not particularly helpful as the machine isn't much use without internet connectivity.
The Suspend and Hibernate were quick. Oddly the Resume was a little slower than I might have expected. There was a little disk activity on Resume which didn't make sense to me as Suspend should use RAM and my O/S and Swap for Hibernate are on a silent SSD. I guess something needs to tickle the hard drive on Resume just to know it's there and working.
Since your issue appears to work when the onboard NIC is disabled, it would seem to point to our NIC. I will have our guys look into the issue as well.
Jeff, I just realized that the tests I did were with the old 2.3.2 e1000e driver that installs with Ubuntu. I still had problems last April with 188.8.131.52, but you're up to 184.108.40.206 now, so let me install and test that before you have anybody do anything.
Jeff, 220.127.116.11 is just as bad. Have your guys go to town!
As the problem is gone with the NIC disabled, I'd rather treat this as a driver/hardware issue than a general suspend/resume problem.
Re-assign the bug to Jeff.
Plus, I'd suggest redoing the test without the e1000e driver loaded to see if the problem still exists.