Bug 31562
Summary: | Suspend to RAM (S3) sporadically hangs | ||
---|---|---|---|
Product: | Power Management | Reporter: | Thilo-Alexander Ginkel (thilo) |
Component: | Hibernation/Suspend | Assignee: | power-management_other |
Status: | CLOSED PATCH_ALREADY_AVAILABLE | ||
Severity: | normal | CC: | lenb, rjw, rui.zhang, tj |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.38 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 7216 | ||
Attachments: |
dmesg debugging output
lspci -vvv kern.log from the successful (up to 'platform') and failed ('processor') attempts acpidump log produced by fwts (firmware test suite) (Truncated) dmesg output (tail missing) lsusb -v .config (config-2.6.35-rc3-e22bee+) |
Created attachment 51492 [details]
lspci -vvv
It looks like a problem with the graphics driver. Do you use s2ram by chance? AFAICS from /etc/pm/config.d/00sleep_module, the default (which according to a comment is 'kernel') is used and not 's2ram'. While this may be a bug in the closed-source graphics driver, some change in the kernel since 2.6.35 seems to have triggered it. I can try a bisect if that helps... > nvidia: module license 'NVIDIA' taints kernel.
please re-open if this this is still an issue
when the binary nvidia graphics driver is not present in the system.
I switched from the proprietary nvidia driver to nouveau and am still able to reproduce the issue. I also did a bisect, but the results are inconclusive (it identified some 2.6.35-rc5+ revisions are the culprit, but I told bisect that 2.6.35 works, but 2.6.36 does not, so I am confused why it even bothers to have a look at an older version (I had to skip a couple of revisions, which I could not get to boot): git bisect start # good: [9fe6206f400646a2322096b56c59891d530e8d51] Linux 2.6.35 git bisect good 9fe6206f400646a2322096b56c59891d530e8d51 # bad: [f6f94e2ab1b33f0082ac22d71f66385a60d8157f] Linux 2.6.36 git bisect bad f6f94e2ab1b33f0082ac22d71f66385a60d8157f # skip: [78417334b5cb6e1f915b8fdcc4fce3f1a1b4420c] Merge branch 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing git bisect skip 78417334b5cb6e1f915b8fdcc4fce3f1a1b4420c # skip: [0a492896ac07336c98f37ad7fab4a6387b6ada78] sparc: Really fix "console=" for serial consoles. git bisect skip 0a492896ac07336c98f37ad7fab4a6387b6ada78 # bad: [bbb7030fbb6469c46974c4736a5f90d12102f299] ACPI: expand Vista blacklist to include SP1 and SP2 git bisect bad bbb7030fbb6469c46974c4736a5f90d12102f299 # bad: [4a386c3e177ca2fbc70c9283d0b46537844763a0] Merge branch 'x86-xsave-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip git bisect bad 4a386c3e177ca2fbc70c9283d0b46537844763a0 # good: [6ba74014c1ab0e37af7de6f64b4eccbbae3cb9e7] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 git bisect good 6ba74014c1ab0e37af7de6f64b4eccbbae3cb9e7 # bad: [fc1caf6eafb30ea185720e29f7f5eccca61ecd60] Merge branch 'drm-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6 git bisect bad fc1caf6eafb30ea185720e29f7f5eccca61ecd60 # bad: [c3d1f1746b966907ba5ad2f75ddca24db8b21147] Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus git bisect bad c3d1f1746b966907ba5ad2f75ddca24db8b21147 # good: [ffd386a9a8273dcfa61705d0b349eebc7525ef87] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu git bisect good ffd386a9a8273dcfa61705d0b349eebc7525ef87 # good: [c853d945d3e0cadf60da03106f7d9bbf1346a518] MIPS: Unify the suffix of compressed vmlinux.bin git bisect good c853d945d3e0cadf60da03106f7d9bbf1346a518 # bad: [5cf65713f87775c548e3eb48dbafa32e12f28000] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid git bisect bad 5cf65713f87775c548e3eb48dbafa32e12f28000 # good: [0500e9b3f11ce84fc6ee48a3e29909145e58ba48] padata: Remove padata_get_cpumask git bisect good 0500e9b3f11ce84fc6ee48a3e29909145e58ba48 # bad: [35d03a0d9b22fdf7cba75cd41a6f0a039ba35cdd] Merge branch 'acrux' into for-linus git bisect bad 35d03a0d9b22fdf7cba75cd41a6f0a039ba35cdd # skip: [a11b3fab94d4fb67297b76d0cb81612ebbff276e] HID: hid-ids.h: Whitespace fixup, align using TABs git bisect skip a11b3fab94d4fb67297b76d0cb81612ebbff276e # good: [64b386ea16112564e0b93473e2c347125effb6b2] HID: add proper support for Elecom BM084 bluetooth mouse git bisect good 64b386ea16112564e0b93473e2c347125effb6b2 # good: [8c8b01c38a70661d663175d355fdea85ca082272] HID: ignore digitizer usage Undefined (0x00) git bisect good 8c8b01c38a70661d663175d355fdea85ca082272 # skip: [9929ffed889b4861b52ba648a44fa4add945c4d1] Merge branch 'upstream-fixes' into for-linus git bisect skip 9929ffed889b4861b52ba648a44fa4add945c4d1 # skip: [7d3d42a79519df4cd62c3aa5d9ae2d77ebbf8fab] HID: fix up Kconfig entry for ACRUX driver git bisect skip 7d3d42a79519df4cd62c3aa5d9ae2d77ebbf8fab # skip: [540010812179a16d3d00fb8363bb06ee83af25b8] HID: Add support for Conceptronic CLLRCMCE git bisect skip 540010812179a16d3d00fb8363bb06ee83af25b8 # skip: [d5e0a06f17a0ffb0eb08a5bd7b18f00af70d9a12] Revert "HID: add support for the Wacom Intuos 4 wireless" git bisect skip d5e0a06f17a0ffb0eb08a5bd7b18f00af70d9a12 # skip: [c0dbcc33c652a0646542560de29a1c3f1ab7169f] HID: add ACRUX game controller force feedback support git bisect skip c0dbcc33c652a0646542560de29a1c3f1ab7169f # skip: [fc73697bd5bf4efe8b5b2b78ec09e00f967cd055] Merge branch 'upstream' into for-linus git bisect skip fc73697bd5bf4efe8b5b2b78ec09e00f967cd055 I did some additional debugging using pm_test: freezer, devices and platform work reliably in a reproducible fashion. processors, however, fails (power LED stays lit instead of blinking and the system no longer wakes up as part of the test and requires a hard reset). Unfortunately, tracing again only yields: andromeda kernel: [ 2.177994] Magic number: 0:400:740 andromeda kernel: [ 2.177995] hash matches /raid/home/tg/kernel-2.6.38/source/drivers/base/power/main.c:577 However, I have captured the dmesg output for the failed session (from /var/log/kern.log) although it looks like the interesting part may be missing and only the previous attempts up to "platform" may have been fully recorded. Created attachment 52332 [details]
kern.log from the successful (up to 'platform') and failed ('processor') attempts
Created attachment 52342 [details]
acpidump
Created attachment 52352 [details]
log produced by fwts (firmware test suite)
I applied some debug instrumentation to my kernel (using the TRACE_RESUME framework) and the last method call that is recorded is disable_nonboot_cpus() in suspend_enter (kernel/power/suspend.c). As there is another instrumentation point right after the method call I am quite certain that this method call never returns. I'll add some further instrumentation code to figure out what is going on in that method... Just an idea, but is it guaranteed that disable_nonboot_cpus is executed on CPU0? Did some further instrumentation, which ended up in kernel/notifier.c. As soon as I compile the kernel with CONFIG_DEBUG_NOTIFIERS, the error probability decreases, but does not go down to zero. From the second consecutive suspend on, the power LED of the system no longer flashes, but instead stays dark (while suspended). Otherwise, the suspend works correctly. Eventually, the system will freeze again while suspending, HDDs powered down, power LED lit. When this happens the last statement recorded by the instrumentation is right before "return ret;" in notifier_call_chains (kernel/notifier.c) most probably called by __cpu_notify (kernel/cpu.c). To me this has two implications: 1) There seem to be invalid notifiers in this call chain 2) Some notifier being called is corrupting memory and also the stack, which is why the return ends up nowhere Now I could need some help figuring out which notifiers are called (which is not that obvious given their dynamic nature). ;-) Just some additional info: I cannot reproduce this bug if I limit the number of used CPU cores to two (using maxcpus=2) Please disregard my comment #12, which is just plain wrong: I instrumented notifier_call_chains, which is of course not exclusively called from __cpu_notify, so it is quite natural not to see any further instrumentation afterwards if the respective caller has not been instrumented. I found an error in my bisect history, which I corrected. I now ended up at: bd25f4dd6972755579d0ea50d1a5ace2e9b00d1a is the first bad commit commit bd25f4dd6972755579d0ea50d1a5ace2e9b00d1a Author: Arnd Bergmann <arnd@arndb.de> Date: Sun Jul 11 15:34:05 2010 +0200 HID: hiddev: use usb_find_interface, get rid of BKL This removes the private hiddev_table in the usbhid driver and changes it to use usb_find_interface instead. The advantage is that we can avoid the race between usb_register_dev and usb_open and no longer need the big kernel lock. This doesn't introduce race condition -- the intf pointer could be invalidated only in hiddev_disconnect() through usb_deregister_dev(), but that will block on minor_rwsem and not actually remove the device until usb_open(). Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: "Greg Kroah-Hartman" <gregkh@suse.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz> :040000 040000 4ae14b3ba486373d7a354874e9ad334858f094e3 8041ffda20ca3020a6b60d64235ae179f8186bf0 M drivers Booting with no_console_suspend also brings up a BUG mentioning a soft lockup. As my serial console breaks during suspend, I had to take photos of the error messages: https://secure.tgbyte.de/dropbox/eaghoh9M.jpg https://secure.tgbyte.de/dropbox/ecae8ieR.jpg https://secure.tgbyte.de/dropbox/vah5ooR9.jpg Unfortunately, I have so far been unable to get the complete error message in a single screen capture (as the output is either too verbose or too scarce). Created attachment 53282 [details]
(Truncated) dmesg output (tail missing)
Created attachment 53292 [details]
lsusb -v
Some update: The bug inroduced by the commit mentioned in comment #15 has similar symptoms to what I have been experiencing, but has already been fixed in the meantime. A new (pretty tedious) bisect brought up this change: | e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c is the first bad commit | commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c | Author: Tejun Heo <tj@kernel.org> | Date: Tue Jun 29 10:07:14 2010 +0200 | | workqueue: implement concurrency managed dynamic worker pool Reverting this commit reliably fixes the issue. In the meantime, I have also been able to reproduce the issue in a KVM virtual machine, i.e., it is not hardware-specific, but must probably affects suspend on any SMP system (but the probability may depend on the number of CPUs). Further details bout my system config (which is also mimicked by my VM): - lvm running on top of - dmcrypt (luks) running on top of - md raid1 Some further details about the observed symptoms: After issuing pm-suspend, it all starts with the suspend sequence hanging while shutting down non-boot CPUs (the CPU# varies from case to case): https://secure.tgbyte.de/dropbox/ueph3Ohm-1.jpg Then, after some time, the following backtrace is written: https://secure.tgbyte.de/dropbox/ueph3Ohm-2.jpg Transcript: | RIP: _raw_spin_unlock_irqrestore | Call Stack: | _set_cpus_allowed_ptr | worker_maybe_bind_and_lock | rescuer_thread | rescuer_thread | kernel_thread_helper | kthread | kernel_thread_helper | kthread After some more time, the following backtrace is printed: https://secure.tgbyte.de/dropbox/ueph3Ohm-3.jpg Transcript: | RIP: worker_maybe_bind_and_lock | Call Stack: | rescuer_thread | rescuer_thread | kernel_thread_helper | kthread | kernel_thread_helper | kthread From then on these two backtraces are printed in an alternating fashion. In another attempt to further isolate the system parameters that trigger the issue, I set up a couple of minimal systems using various combinations of dmcrypt and raid1. As it turned out, having raid1 enabled is sufficient for triggering the issue, so commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c apparently broke suspend for all raid(1?) multi-core systems. Not really, I cannot reproduce the issue on three different systems using RAID-1 with 2.6.38 and 2.6.39-rc3. Have you tried all that on the same or similar hardware? That's weird. I assume your tests machines have more than two cores? I can reproduce the issue on my physical PC (4 cores) as well as within a KVM (which is running on the same machine, but that should IMHO not make a difference considering the virtualization layer inbetween) with -smp 8. I am currently uploading my test VM and will post an update including test instructions once it is available. One of my test boxes is six-core, but none of them supports hyperthreading. Can you verify that the issue is still reproducible when hyperthreading support is disabled in the kernel config? My CPU (Intel Core 2 Quad Q8400) does not support HT either and I don't think that QEMU will provide a virtualized hyperthreaded CPU, but I can still test that, if you think it makes sense under these circumstances. Not really. :-) Please attach your .config. Created attachment 54502 [details]
.config (config-2.6.35-rc3-e22bee+)
BTW, I have run a couple of further tests. 2.6.39-rc3 no longer seems to be exhibiting the issue when testing using my VM (I have yet to run some tests with my 'real' PC). 2.6.38, however, reliably freezes in the VM (but it freezes so reliably that this may be a different issue or related to the Ubuntu kernel package - I'll build a vanilla kernel to rule out any distribution-specific bugs [any other kernels I tested were vanilla kernels]). I just verified using a 2.6.38 vanilla kernel that the lockups also happen within my virtualized environment (although for 2.6.38 the lockup detector does not trigger [although it is compiled in], i.e. the suspend sequence just hangs after "CPU i is now offline"; i < n). The lockups happen with a higher probability when both mdraid and dmcrypt are deployed (compared to mdraid only). BTW, this is my test script: -- 8< -- #!/bin/bash sysctl -w kernel.printk=8 echo core > /sys/power/pm_test for i in {1..99}; do echo "Pass $i"; uname -a; sync; pm-suspend; done -- 8< -- In case you'd like to get hold of my test environment: https://secure.tgbyte.de/dropbox/lockup-test.tar.bz2 contains the disk images for my test VM. Boot using: kvm -hda LockupTestRaid-1.qcow2 -hdb LockupTestRaid-2.qcow2 -smp 8 -m 1024 -curses To run the test, log in as root / test and run: /root/suspend-test Currently, the image only contains kernels from my bisect test (i.e., no recent kernels, but those can be easily installed). It's great that the kernel bugzilla is back. Can you please verify if the problem still exists in the latest upstream kernel? This issue has already been fixed. |
Created attachment 51482 [details] dmesg debugging output Since upgrading my system's kernel from 2.6.35 to 2.6.38 Suspend to RAM (S3) sporadically hangs with the following symptoms: The system goes through the shutdown sequence - Switches to tty1 - Turns off the HDDs ...and then just stays there, i.e., with tty1 on screen, but completely unresponsive (even to Magic SysRq). The issue occurs both with the current Ubuntu 11.04 as well as the vanilla 2.6.38 kernel. It did not occur with kernel 2.6.35. I compiled a kernel with PM debugging turned on (the resulting dmesg output is attached to this ticket), but the fingerprinting did not yield any meaningful output: andromeda kernel: [ 2.187047] Magic number: 0:400:740 andromeda kernel: [ 2.187049] hash matches /raid/home/tg/kernel-2.6.38/source/drivers/base/power/main.c:577 I'd appreciate any hints or experimental patches to further isolate the issue.