13366 – About 80% of shutdowns fail (blocking)

Status:	CLOSED INSUFFICIENT_DATA

Alias:	None

Product:	Process Management
Classification:	Unclassified
Component:	Other
Hardware:	All Linux

Importance:	P1 blocking
Assignee:	process_other

URL:
Keywords:

Depends on:
Blocks:	56331

Reported:	2009-05-23 00:58 UTC by Martin Bammer
Modified:	2013-04-09 06:23 UTC
CC List:	6 users

See Also:
Kernel Version:	2.6.30-rc6+latest git patches
Subsystem:
Regression:	No
Bisected commit-id:

Attachments
Screenshot of kernel crash output. (336.28 KB, image/jpeg) 2009-05-23 00:58 UTC, Martin Bammer
kernel configuration (85.56 KB, application/octet-stream) 2009-05-23 01:00 UTC, Martin Bammer
dmesg output (61.86 KB, text/plain) 2009-05-23 01:03 UTC, Martin Bammer
lspci -vvv output (10.50 KB, text/plain) 2009-05-23 01:04 UTC, Martin Bammer
kernel output with last patch in thread for debugging (190.38 KB, image/jpeg) 2009-05-26 21:06 UTC, Martin Bammer
Shutdown problem without umts stick (154.18 KB, image/jpeg) 2009-05-26 21:21 UTC, Martin Bammer

Description Martin Bammer 2009-05-23 00:58:24 UTC

Created attachment 21499
Screenshot of kernel crash output.

When the system shuts down or reboots, nearly every attempt hangs while killall5 is trying to kill all processes. Pressing the power button has no effect in most cases. Only the magic key sequences work.
This problem has occurred since 2.6.30-rc6. 2.6.30-rc5 had no problems shutting down or rebooting.

Comment 1 Martin Bammer 2009-05-23 01:00:28 UTC

Created attachment 21500
kernel configuration

KMS and plymouth are used on this test system.

Comment 2 Martin Bammer 2009-05-23 01:03:31 UTC

Created attachment 21501
dmesg output

Comment 3 Martin Bammer 2009-05-23 01:04:01 UTC

Created attachment 21502
lspci -vvv output

Comment 4 Zhang Rui 2009-05-25 03:16:00 UTC

At first glance I thought this might be EC related,
but given that the problem has occurred only since 2.6.30-rc6, this is not an EC problem.
Martin,
can you run git-bisect to find out which commit introduced the problem?

Comment 5 Anonymous Emailer 2009-05-26 11:50:00 UTC

Reply-To: hugh.dickins@tiscali.co.uk

A git-bisect would indeed be worthwhile; but looking through the diff
between 2.6.30-rc5 and 2.6.30-rc7 didn't show any likely candidates -
I wonder if this will turn out to be something more elusive.

Is this an Acer Aspire One?  Looks rather like it: I tried building
your kernel (on openSUSE rather than Ubuntu) on mine, and running it:
no luck reproducing your issue here.

I doubt this has got much to do with mlockall() or lru_add_drain_all()
themselves: it looks rather as if an events thread has "gone away".

Would you mind applying the hacky patch below, and posting the
screenshot from shutdown?  I assume from the fact that you posted
a photo, that nothing useful gets out to the logs: so here I'm trying
to leave just the "events/0" and "events/1" stacktraces onscreen.

--- 2.6.30-rc7/kernel/hung_task.c	2009-04-08 14:59:26.000000000 +0100
+++ linux/kernel/hung_task.c	2009-05-25 18:45:11.000000000 +0100
@@ -98,7 +98,7 @@ static void check_hung_task(struct task_
 	printk(KERN_ERR "\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
 			" disables this message.\n");
 	sched_show_task(t);
-	__debug_show_held_locks(t);
+	show_state_filter(512);
 
 	touch_nmi_watchdog();
 
--- 2.6.30-rc7/kernel/sched.c	2009-05-09 09:24:35.000000000 +0100
+++ linux/kernel/sched.c	2009-05-25 19:08:05.000000000 +0100
@@ -6514,13 +6514,14 @@ void show_state_filter(unsigned long sta
 		 * console might take alot of time:
 		 */
 		touch_nmi_watchdog();
-		if (!state_filter || (p->state & state_filter))
+		if ((state_filter == 512 && !strncmp(p->comm, "events/", 7)) ||
+		    !state_filter || (p->state & state_filter))
 			sched_show_task(p);
 	} while_each_thread(g, p);
 
 	touch_all_softlockup_watchdogs();
 
-#ifdef CONFIG_SCHED_DEBUG
+#ifdef CONFIG_SCHED_DEBUG_NOT
 	sysrq_sched_debug_show();
 #endif
 	read_unlock(&tasklist_lock);
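
For readers who want to see the patched condition in isolation, here is a minimal user-space sketch of the test that the modified show_state_filter() applies to each task. The struct task_info type and the names should_show_task() and EVENTS_FILTER are hypothetical stand-ins introduced only for this illustration; the value 512 and the "events/" prefix are taken from the patch above. It is a sketch of the logic, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the two task_struct fields the patch reads. */
struct task_info {
	long state;     /* scheduler state bits, like p->state */
	char comm[16];  /* command name, like p->comm */
};

/* Sentinel passed by the patched check_hung_task(); value from the patch. */
#define EVENTS_FILTER 512UL

/*
 * Mirrors the patched condition: when the sentinel is passed, any task
 * whose name starts with "events/" is shown regardless of its state;
 * otherwise the original rule applies (no filter, or a matching state bit).
 */
static bool should_show_task(const struct task_info *p,
			     unsigned long state_filter)
{
	assert(p != NULL);

	if (state_filter == EVENTS_FILTER &&
	    strncmp(p->comm, "events/", 7) == 0)
		return true;

	return !state_filter || ((unsigned long)p->state & state_filter) != 0;
}
```

With the sentinel, a task named "events/0" is reported even when its state is 0, while an ordinary task in state 0 is skipped, which is what keeps the screen limited to the events threads.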

Comment 6 Martin Bammer 2009-05-26 21:06:19 UTC

Created attachment 21567
kernel output with last patch in thread for debugging

I've found out that the shutdown problem only occurs when I:
1) plug in my USB UMTS stick
2) go online (with NetworkManager)
3) unplug the UMTS stick
4) shut down or reboot

Comment 7 Martin Bammer 2009-05-26 21:21:39 UTC

Created attachment 21569
Shutdown problem without umts stick

OK, disregard the claim that the shutdown problem only occurs with the UMTS stick.
I have just hit the problem without this USB device. But as you can see in the screenshot, when I use the magic key sequence to reboot the system, the network subsystem outputs something. Maybe the problem is related to the network code?

Comment 8 Martin Bammer 2009-05-26 21:38:49 UTC

One more piece of information:
I have now compared the shutdown/reboot process with rc5.
The differences are:
- with rc5 the shutdown/reboot process begins immediately
- with rc6 and rc7 the shutdown process begins after a delay of ~2 seconds
- when the shutdown/reboot process hangs, the 2 lines mentioning the network manager are missing
Hope this helps.

Comment 9 Martin Bammer 2009-05-29 18:47:04 UTC

I tried to bisect the issue. Unfortunately the bisect fails because of a reiserfs
bug after rc5 that causes the kernel to crash when it mounts the root fs.
But I've found the reason for this issue. I remembered that I compiled rc5 with
a slightly different config than rc6 and rc7. In rc6/rc7 I had enabled more debugging options and also the kdbg. Looking at the kernel messages when it stopped booting, I saw that the last output always came from kdbg.
Then I disabled kdbg and the additional debugging options in the current
master. Now the issue seems to be gone. I've rebooted the kernel several times
without any problems. I also compiled rc5 with the config of rc6/rc7 and it showed the same problems as rc6/rc7.

Comment 10 Len Brown 2009-05-29 20:54:43 UTC

> I also compiled rc5 with the config of rc6/rc7 and it
> showed the same problems as rc6/rc7

Clearing the "regression" flag, since it is now unclear if this
configuration ever worked on this machine.

Can you narrow the problem down to a single .config option?
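
One way to start narrowing is to list the options that differ between the working and the failing .config. The helper below is a minimal sketch, assuming the usual .config line forms "CONFIG_FOO=value" and "# CONFIG_FOO is not set"; the function name parse_config_line() and its buffer-based interface are hypothetical and are not part of any kernel tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Parse one line of a kernel .config file.
 * Accepts "CONFIG_FOO=value" and "# CONFIG_FOO is not set".
 * On success, stores the option name and its value ("n" for an unset
 * option) and returns true. Returns false for blank lines, other
 * comments, and lines that do not fit the supplied buffers.
 */
static bool parse_config_line(const char *line,
			      char *name, size_t name_len,
			      char *value, size_t value_len)
{
	const char *start, *end;
	size_t n;

	assert(line && name && value);

	if (strncmp(line, "# CONFIG_", 9) == 0) {
		start = line + 2;               /* skip "# " */
		end = strstr(start, " is not set");
		if (!end || value_len < 2)
			return false;
		n = (size_t)(end - start);
		if (n >= name_len)
			return false;
		memcpy(name, start, n);
		name[n] = '\0';
		value[0] = 'n';
		value[1] = '\0';
		return true;
	}

	if (strncmp(line, "CONFIG_", 7) != 0)
		return false;
	end = strchr(line, '=');
	if (!end)
		return false;
	n = (size_t)(end - line);
	if (n >= name_len)
		return false;
	memcpy(name, line, n);
	name[n] = '\0';

	/* Copy the value up to, but not including, a trailing newline. */
	n = strcspn(end + 1, "\r\n");
	if (n >= value_len)
		return false;
	memcpy(value, end + 1, n);
	value[n] = '\0';
	return true;
}
```

Feeding both config files through this line by line and printing the names whose values differ gives a candidate list; each candidate can then be toggled on its own in the working config and tested with a few shutdowns.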

Comment 11 Rafael J. Wysocki 2009-06-07 21:06:49 UTC

On Sunday 07 June 2009, Martin Bammer wrote:
> Since i disabled most of the debug options this problem has gone. IMHO
> this issue has been caused by kdbg.

Comment 12 Rafael J. Wysocki 2009-06-07 21:07:39 UTC

Dropping from the list of recent regressions as per comment #10.

Comment 13 Zhang Rui 2009-06-08 01:50:24 UTC

So, can you reproduce this bug on any earlier kernel release?
