Bug 13366

Summary: About 80% of shutdowns fail (blocking)
Product: Process Management
Reporter: Martin Bammer (mrb74)
Component: Other
Assignee: process_other
Status: CLOSED INSUFFICIENT_DATA
Severity: blocking
CC: acpi-bugzilla, alan, bgamari, lenb, rjw, rui.zhang
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.30-rc6+latest git patches
Subsystem:
Regression: No
Bisected commit-id:
Bug Depends on:    
Bug Blocks: 56331    
Attachments: Screenshot of kernel crash output.
kernel configuration
dmesg output
lspci -vvv output
kernel output with last patch in thread for debugging
Shutdown problem without umts stick

Description Martin Bammer 2009-05-23 00:58:24 UTC
Created attachment 21499 [details]
Screenshot of kernel crash output.

On shutdown or reboot, nearly every attempt hangs at the point where killall5 tries to kill all processes. Pressing the power button has no effect in most cases; only the magic key sequences work.
This problem has occurred since 2.6.30-rc6. 2.6.30-rc5 had no problems with shutting down or rebooting.
Comment 1 Martin Bammer 2009-05-23 01:00:28 UTC
Created attachment 21500 [details]
kernel configuration

KMS and plymouth are used on this test system.
Comment 2 Martin Bammer 2009-05-23 01:03:31 UTC
Created attachment 21501 [details]
dmesg output
Comment 3 Martin Bammer 2009-05-23 01:04:01 UTC
Created attachment 21502 [details]
lspci -vvv output
Comment 4 Zhang Rui 2009-05-25 03:16:00 UTC
At first glance I thought this might be EC related,
but given that the problem has only occurred since 2.6.30-rc6, this is not an EC problem.
Martin,
can you run git-bisect to find out which commit introduced the problem?
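
For readers unfamiliar with bisection, the idea behind the request can be sketched as a binary search over an ordered list of commits. This is only a simplified model, not how git itself is implemented: real git-bisect works on the commit graph rather than a flat list, and the names first_bad and is_bad are made up for this illustration. It also assumes the first commit is known good, the last is known bad, and every commit after the first bad one is bad as well.

```python
def first_bad(commits, is_bad):
    """Locate the first bad commit by repeated halving.

    Assumptions (hypothetical model, not git internals):
    commits[0] is good, commits[-1] is bad, and once a commit
    is bad all later commits are bad too.
    Returns (index_of_first_bad_commit, number_of_tests_run).
    """
    good, bad = 0, len(commits) - 1
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1          # one kernel build + boot + shutdown test
        if is_bad(commits[mid]):
            bad = mid       # breakage is at mid or earlier
        else:
            good = mid      # breakage is after mid
    return bad, steps
```

The point of the sketch is the cost: with around a thousand commits between two release candidates, roughly ten build-and-test rounds would be enough, provided each round gives a reliable good/bad answer. With a failure that only shows up in about 80% of shutdowns, each round would need several shutdown attempts before a commit can be called good.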
Comment 5 Anonymous Emailer 2009-05-26 11:50:00 UTC
Reply-To: hugh.dickins@tiscali.co.uk

A git-bisect would indeed be worthwhile; but looking through the diff
between 2.6.30-rc5 and 2.6.30-rc7 didn't show any likely candidates -
I wonder if this will turn out to be something more elusive.

Is this an Acer Aspire One?  Looks rather like it: I tried building
your kernel (on openSUSE rather than Ubuntu) on mine, and running it:
no luck reproducing your issue here.

I doubt this has got much to do with mlockall() or lru_add_drain_all()
themselves: it looks rather as if an events thread has "gone away".

Would you mind applying the hacky patch below, and posting the
screenshot from shutdown?  I assume from the fact that you posted
a photo, that nothing useful gets out to the logs: so here I'm trying
to leave just the "events/0" and "events/1" stacktraces onscreen.

--- 2.6.30-rc7/kernel/hung_task.c	2009-04-08 14:59:26.000000000 +0100
+++ linux/kernel/hung_task.c	2009-05-25 18:45:11.000000000 +0100
@@ -98,7 +98,7 @@ static void check_hung_task(struct task_
 	printk(KERN_ERR "\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
 			" disables this message.\n");
 	sched_show_task(t);
-	__debug_show_held_locks(t);
+	show_state_filter(512);
 
 	touch_nmi_watchdog();
 
--- 2.6.30-rc7/kernel/sched.c	2009-05-09 09:24:35.000000000 +0100
+++ linux/kernel/sched.c	2009-05-25 19:08:05.000000000 +0100
@@ -6514,13 +6514,14 @@ void show_state_filter(unsigned long sta
 		 * console might take alot of time:
 		 */
 		touch_nmi_watchdog();
-		if (!state_filter || (p->state & state_filter))
+		if ((state_filter == 512 && !strncmp(p->comm, "events/", 7)) ||
+		    !state_filter || (p->state & state_filter))
 			sched_show_task(p);
 	} while_each_thread(g, p);
 
 	touch_all_softlockup_watchdogs();
 
-#ifdef CONFIG_SCHED_DEBUG
+#ifdef CONFIG_SCHED_DEBUG_NOT
 	sysrq_sched_debug_show();
 #endif
 	read_unlock(&tasklist_lock);
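
The selection logic that the sched.c hunk above adds can be restated outside the kernel as a small Python model. This is only an illustration: the names should_show and EVENTS_ONLY are made up here, and treating 512 as a sentinel value rather than a real task-state mask is an assumption drawn from how the hung_task.c hunk calls show_state_filter(512).

```python
EVENTS_ONLY = 512  # value passed by the patched check_hung_task(); assumed to be a sentinel


def should_show(comm, state, state_filter):
    """Model of the patched condition in show_state_filter().

    comm         -- task name (p->comm in the kernel)
    state        -- task state bits (p->state)
    state_filter -- mask passed by the caller
    """
    return (
        # new clause: with the sentinel, dump every "events/N" thread
        (state_filter == EVENTS_ONLY and comm.startswith("events/"))
        # original behaviour: no filter means show everything
        or state_filter == 0
        # original behaviour: show tasks whose state matches the mask
        or bool(state & state_filter)
    )
```

With the sentinel, should_show("events/0", 1, 512) is true while should_show("bash", 1, 512) is false, which matches the stated goal of leaving only the events/0 and events/1 stack traces on screen.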
Comment 6 Martin Bammer 2009-05-26 21:06:19 UTC
Created attachment 21567 [details]
kernel output with last patch in thread for debugging

I've found out that the shutdown problem only occurs when I:
1) plug in my USB UMTS stick
2) go online (with network manager)
3) unplug the UMTS stick
4) shut down or reboot
Comment 7 Martin Bammer 2009-05-26 21:21:39 UTC
Created attachment 21569 [details]
Shutdown problem without umts stick

OK, forget the info that the shutdown problem only occurs with the UMTS stick.
I have just had this problem without this USB device. But as you can see in the screenshot, when I use the magic key sequence to reboot the system, the network subsystem outputs something. Maybe the problem is related to the network code?
Comment 8 Martin Bammer 2009-05-26 21:38:49 UTC
One more piece of information:
I have now compared the shutdown/reboot process with rc5.
The differences are:
- with rc5 the shutdown/reboot process begins immediately
- with rc6 and rc7 the shutdown process begins after a delay of ~2 seconds
- when the shutdown/reboot process hangs, the 2 lines mentioning the network manager are missing
Hope this helps.
Comment 9 Martin Bammer 2009-05-29 18:47:04 UTC
I tried to bisect the issue, but unfortunately that failed because of a reiserfs bug after rc5 which causes the kernel to crash when it mounts the root fs.
But I've found the reason for this issue. I remembered that I compiled rc5 with a slightly different config than rc6 and rc7. In rc6/rc7 I had enabled more debugging options and also the kdbg. Looking at the kernel messages when it stopped booting, I saw that the last output always came from kdbg.
Then I disabled kdbg and the additional debugging options in the current master. Now the issue seems to be gone. I've rebooted the kernel several times without any problems. I also compiled rc5 with the config of rc6/rc7, and it showed the same problems as rc6/rc7.
Comment 10 Len Brown 2009-05-29 20:54:43 UTC
> I also compiled rc5 with the config of rc6/rc7 and it
> showed the same problems as rc6/rc7

Clearing the "regression" flag, since it is now unclear if this
configuration ever worked on this machine.

Can you narrow the problem down to a single .config option?
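
One way to start narrowing this down is to list exactly which options differ between the working and the failing configuration. The following is a minimal sketch, assuming the usual .config line formats ("CONFIG_FOO=value" and "# CONFIG_FOO is not set"); the function names parse_config and config_diff are made up for this illustration, and treating an option that is absent from a file as "n" is a simplifying assumption.

```python
import re

# Line formats assumed for a kernel .config file
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each option name to its value; unset options map to "n"."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            opts[m.group(1)] = "n"
    return opts


def config_diff(good_text, bad_text):
    """Return {option: (value_in_good, value_in_bad)} for differing options.

    Options missing from one file are treated as "n" (assumption).
    """
    good, bad = parse_config(good_text), parse_config(bad_text)
    diff = {}
    for name in sorted(set(good) | set(bad)):
        g, b = good.get(name, "n"), bad.get(name, "n")
        if g != b:
            diff[name] = (g, b)
    return diff
```

Feeding it the contents of the two configuration files gives the candidate set. From there, the candidates can be halved on each rebuild (enable one half on top of the working config, test shutdown, repeat), keeping in mind that options with dependencies may have to be toggled together.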
Comment 11 Rafael J. Wysocki 2009-06-07 21:06:49 UTC
On Sunday 07 June 2009, Martin Bammer wrote:
> Since I disabled most of the debug options this problem has gone. IMHO
> this issue has been caused by kdbg.
Comment 12 Rafael J. Wysocki 2009-06-07 21:07:39 UTC
Dropping from the list of recent regressions as per comment #10.
Comment 13 Zhang Rui 2009-06-08 01:50:24 UTC
So can you reproduce this bug on any earlier kernel releases?