Bug 113701 - INFO: task shutdown:1 block for more then 120 seconds
Summary: INFO: task shutdown:1 block for more then 120 seconds
Status: CLOSED UNREPRODUCIBLE
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Off (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Aaron Lu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-03-04 14:09 UTC by da_audiophile
Modified: 2016-05-17 05:27 UTC (History)
3 users (show)

See Also:
Kernel Version: 4.4.4
Subsystem:
Regression: No
Bisected commit-id:


Attachments
complete journalctl log while I was booted into 4.4.1 (19.20 KB, application/gzip)
2016-03-04 14:09 UTC, da_audiophile
Details

Description da_audiophile 2016-03-04 14:09:27 UTC
Created attachment 206801 [details]
complete journalctl log while I was booted into 4.4.1

Overview: Machine does not reliability poweroff on a shutdown under the 4.4.x series.

Work-around: Downgrading to the 4.3.6 kernel fixes the hung shutdown.

Since updating to the 4.4.x series (started with 4.4.1 but still occurs under 4.4.4), my Haswell machine fails to poweroff about consistently.  Most of the time when I issue `shutdown -h now` the machine does not shutdown but does hang displaying a loop of error which repeat every 2 min until I hold in the power button:

INFO: task shutdown:1 block for more then 120 seconds.
Not tainted 4.4.4-1-ARCH #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: task kworker/0:3:1642 block for more than 120 seconds.
Not tainted 4.4.4-1-ARCH #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

I don't see anything in `journalctl -b -1` that indicates an error or warning, so I'm thinking whatever is happening occurs after the journal logging as evident by the last few lines lines (complete journalctl is attached):
...
Feb 14 11:02:41 quartz systemd[1]: Reached target Shutdown.
Feb 14 11:02:41 quartz systemd[1]: Reached target Final Step.
Feb 14 11:02:41 quartz systemd[1]: Starting Power-Off...
Feb 14 11:02:41 quartz systemd[1]: Shutting down.
Feb 14 11:02:41 quartz systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Feb 14 11:02:41 quartz systemd-journald[186]: Journal stopped
Comment 1 Aaron Lu 2016-03-14 06:26:16 UTC
Can you please use git bisect to find the offending commit? Probably you can test v4.4 first, assume it also has problem, then you will need bisect between v4.3 and v4.4. Thanks.
Comment 2 da_audiophile 2016-03-14 20:59:52 UTC
Yes sir, I would be glad to help get to the bottom of this one.
Comment 3 da_audiophile 2016-03-14 22:48:04 UTC
OK.  I pulled from: git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git and did the following to start:

git bisect good 6a13feb9c82803e2b815eca72fa7a9f5561d7861
git bisect bad 73e7d63efb4d774883a338997943bfa59e127085

After completing the biset (allowing the machine 3 tries to shutdown successfully) git tells me:

c0d625cbb541ae68ca3c97fb62f5f6093d3382fa is the first bad commit
Comment 4 Aaron Lu 2016-03-15 01:55:47 UTC
But commit c0d625cbb541ae68ca3c97fb62f5f6093d3382fa is a merge which contains 3 TI OMAP related fixes, which doesn't seem like the right culprit...
Comment 5 da_audiophile 2016-03-15 10:38:47 UTC
Did I bisect it incorrectly?  I did have some questions about finding the commit that corresponds to 4.3.6 and 4.4.  Should I have used the stable tree: git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
Comment 6 da_audiophile 2016-03-15 13:55:44 UTC
I should also mention that while I running the builds from the bisect, the criteria I used was 3 successful shutdowns in a row.  While I was testing, I found kernels that would shutdown after systemd killed something (given its 90 sec grace period) but they did not show the "INFO: task shutdown:1 block for more then 120 seconds" error messages.  I treated them as bad since the shutdown was not more or less instantaneous as it is under v4.3.6.  I can reluctantly repeat the bisect/testing but I need to know what the endpoint is for a "good" outcome.

Is it:
1) A more or less instantaneous shutdown like I get under v4.3.6?
2) A shutdown that ends without the "INFO: task shutdown..." message even though systemd still waits 90s to kill something else?

Thanks!
Comment 7 Aaron Lu 2016-03-16 03:11:05 UTC
You should use Linus' git master tree, and please first make sure v4.3 works and v4.4 doesn't, then you can start the bisect by marking v4.3 as good and v4.4 as bad.

And BTW, which distro are you using? Is the v4.4.x kernel provided by your distro or compiled by you? And do the two kernels use the same config?
Comment 8 Aaron Lu 2016-03-16 03:21:38 UTC
OK, your log shows you are using ARCH. Did you try reporting this issue to some ARCH related mailing list or forum? I wonder if there is other ARCH users having the same issue.

BTW, by not reliably power off, did it power off or not after waiting long enough time?
Comment 9 da_audiophile 2016-03-16 18:57:32 UTC
(In reply to Aaron Lu from comment #8)
> OK, your log shows you are using ARCH. Did you try reporting this issue to
> some ARCH related mailing list or forum? I wonder if there is other ARCH
> users having the same issue.

Yes, I am using Arch.  I did in fact open a bug report with Arch[1] but no action has been taken with it yet.  Since the problem is reversible as I described, I suspect that they are watching this ticket/upstream bug.

> BTW, by not reliably power off, did it power off or not after waiting long
> enough time?

In my initial report with v4.4.1 it would not power off at all.  I would see the "INFO: task kworker/0:3:1642 block for more than 120 seconds" message for hours.  Before I begin the bisect, is my endpoint for a "good" shutdown:

1) A more or less instantaneous shutdown like I get under v4.3.6?
2) A shutdown that ends without the "INFO: task shutdown..." message even though systemd still waits 90s to kill something else?

Note that there is currently a bug open against systemd[2] that infers it might be to blame for another hung shutdown problem but that one will end once the 90 seconds systemd waits to kill processes elapses.

References:
1. https://bugs.archlinux.org/task/48180
2. https://bugs.freedesktop.org/show_bug.cgi?id=70593
Comment 10 Aaron Lu 2016-03-17 02:31:53 UTC
(In reply to da_audiophile from comment #9)
> In my initial report with v4.4.1 it would not power off at all.  I would see
> the "INFO: task kworker/0:3:1642 block for more than 120 seconds" message
> for hours.  Before I begin the bisect, is my endpoint for a "good" shutdown:
> 
> 1) A more or less instantaneous shutdown like I get under v4.3.6?
> 2) A shutdown that ends without the "INFO: task shutdown..." message even
> though systemd still waits 90s to kill something else?

Since we are dealing with kernel power off issue here, as long as the machine eventually powered off, I think it should be regarded as good.
Comment 11 da_audiophile 2016-03-17 21:15:33 UTC
Thank you, Aaron.  I am finding that the bug I am experiencing is not consistently triggered.  For example, I booted into 4.4.5 and shutdown with a 90 second hang which was systemd waiting 90 seconds to kill something twice.  The 3rd time I tried shutting down, I was presented with this bug where it hung indefinitely with the error message.  Is there any debug kernel parameter I can boot with to help me narrow this down without completing the bisect?
Comment 12 Aaron Lu 2016-03-18 07:55:10 UTC
We should probably isolate kernel poweroff problem with systemd, so please boot with init=/bin/bash added to your kernel command line and after boot, do this to see if poweroff works(as root):
# echo 1 > /proc/sys/kernel/sysrq  # enable sysrq
# echo o > /proc/sysrq-trigger     # immediately power off

The above two commands will immediately shutoff the computer without doing anything(syncing file system, shutting off user space daemons, etc.), but since we are using bash as init and nothing gets started and the filesystem is mounted read only, it shouldn't cause any harm to your system.

If poweroff works with all the kernels, then the shutoff problem probably is a systemd one.
Comment 13 da_audiophile 2016-03-18 18:51:43 UTC
@Aaron - Thank you for the suggestion.  I ran the experiment you outlined above while booted into v4.4.5 a total of 5 times in a row and in all 5 cases the system went down instantaneously as I have observed with the v4.3.6 kernel.  I can only conclude that systemd 229 + linux 4.4.x causes these problems.
Comment 14 da_audiophile 2016-03-18 19:06:46 UTC
https://github.com/systemd/systemd/issues/2868
Comment 15 Aaron Lu 2016-05-13 05:44:34 UTC
Do you still have this problem considering there should be some updates from both kernel and systemd during this period?
Comment 16 da_audiophile 2016-05-13 18:26:23 UTC
@Aaron - This isn't an issue for me on the affected box.  Probably safe to close the ticket.
Comment 17 Zhang Rui 2016-05-15 05:35:41 UTC
good to know. Bug closed.
Comment 18 Aaron Lu 2016-05-17 05:27:49 UTC
(In reply to da_audiophile from comment #16)
> @Aaron - This isn't an issue for me on the affected box.  Probably safe to
> close the ticket.

Thanks for the update.

Note You need to log in before you can comment on or make changes to this bug.