Systemd-207 receives SEGV after waking up from suspemd-to-RAM on many different machines. Maintainer says that the cause of memory corruption by Linux-3.11 kernel. Rolling back to kernel 3.10 solves the problem. Links: https://bugs.archlinux.org/task/36991#comment114444 https://bbs.archlinux.org/viewtopic.php?pid=1326769 https://bbs.archlinux.de/viewtopic.php?pid=311974
I can't reproduce the problem here, could you please bisect the offending commit? BTW, does x86_64 kernel have this problem?
(In reply to Aaron Lu from comment #1) > I can't reproduce the problem here, could you please bisect the offending > commit? > BTW, does x86_64 kernel have this problem? All known (for me) bug reports is aboit Arch Linux 3.11.1 i686 (not 64) kernel build: https://projects.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
Please help to verify if 64 bits kernel has this problem and please do git bisect to find the offending commit, thanks.
x86_64 kernel is clean. No SEGV, no memory corruptions after 10 suspends/wakes.
Can confirm with 3.11.1 on Fedora 19. Appears to affect only i686. Reported at several other places now e.g.: https://bugzilla.redhat.com/show_bug.cgi?id=1010603 https://bugs.freedesktop.org/show_bug.cgi?id=69663
I did a checkout from https://github.com/torvalds/linux.git Running ArchLinux, i686, gcc-4.8.1, systemd-207. After doing a bisect, this is what I'm seeing: 176455e: boots fine, suspend works as expected. 85eb982: systemd goes into a tailspin at boot-up, spewing out messages faster than i can read. (same kernel boots fine with busybox init) 91bdad0: same as 85eb982, above 266c13d: boots fine, but systemd fails when resuming from suspend. That's about as much as I have time for right now, hope it helps.
Thanks Jeff. So commit 266c13d failed, is it the same failure as described by Natrio? And does commit 8f0b3b7(the commit before 266c13d) have problem?
On my Thinkpad X41, commit 266c13d fails, while on commit 8f0b3b7 I cannot reproduce the problem (that is, suspending-resuming once or a few times and watching systemd crash with a general protection fault). The oldest bad commit I have found so far is d2b4a64.
Scratch that, 21884a8 is bad; 73b0cd67 doesn't seem to be affected.
A few bisects later... d5ba5b1 is the deepest bad commit I can find right now, after that (assuming the commits marked 'good' are indeed 'good'), suspend doesn't work anymore and the machine just hangs with a black (no backlight) screen. These commits are marked 'skip' in the git bisect log. More advanced git-fu seems to be called for here, or maybe someone who doesn't have this problem can pick it up from here. --8<----8<----8<--{git bisect log}--8<----8<----8<-- git bisect start # [...] # bad: [d5ba5b141dae17155d12b897f2eedeaf6800db7c] Merge branch 'acpi-pm' git bisect bad d5ba5b141dae17155d12b897f2eedeaf6800db7c # skip: [996e2569dab534089b095da86e5d01d9d954279e] Merge branch 'acpi-hotplug' git bisect skip 996e2569dab534089b095da86e5d01d9d954279e # skip: [405a1086bdd091d2d55db0ac905cd6332b35cec1] Merge branch 'pm-cpufreq' git bisect skip 405a1086bdd091d2d55db0ac905cd6332b35cec1 # good: [f5ce1572109049b90484e2bb44927cb6034c5eb1] PM / QoS: Add pm_qos and dev_pm_qos to events-power.txt git bisect good f5ce1572109049b90484e2bb44927cb6034c5eb1 # good: [173a5a4c909789fcd57d00355d2237618a3824a4] ACPI / processor: Fix potential NULL pointer dereference in acpi_processor_add() git bisect good 173a5a4c909789fcd57d00355d2237618a3824a4 # skip: [bdc8f09685a25a12f2f5282f56672ba663ecb88c] Merge branch 'acpi-assorted' git bisect skip bdc8f09685a25a12f2f5282f56672ba663ecb88c # skip: [e52cff8bdd4a30c40a7f65c7ea8f1f425f8a15eb] Merge branch 'pm-assorted' git bisect skip e52cff8bdd4a30c40a7f65c7ea8f1f425f8a15eb # good: [09d5ca804e796a20dbc50be0971b9e89e692c256] ACPI / processor: Drop unused variable from processor_perflib.c git bisect good 09d5ca804e796a20dbc50be0971b9e89e692c256 # good: [9b5c7a5a977a330ffaf83c4d383ba247c74c800f] ACPI / PM: Fix possible NULL pointer deref in acpi_pm_device_sleep_state() git bisect good 9b5c7a5a977a330ffaf83c4d383ba247c74c800f --8<----8<----8<----8<----8<----8<--
I'm on vocation from 09/27-10/07, probably no internet access so expect no response from me, sorry for the inconvenience.
Reverting 207bc11 "fixes" the problem, and 2b15af6f is definitely bad. Wheee.
Bisected to: 1c441e9 "epoll: use freezable blocking call"
I think you found it, Leo! Testing here, the previous commit (e2610b2) suspend/resume works fine, but 1c441e9 causes the SEGV in systemd. Looks like it has something to do with "freezing" epoll events when the system is suspended.
Patch 1c441e9 "epoll: use freezable blocking call" causes the kernel not to return from the epoll_wait syscall with -EINTR after every suspend cycle, it continues the poll instead. I don't see anything obvious that would cause the kernel to corrupt the process' memory, unless it freed the events pointer passed to epoll_wait. Is it possible to run systemd in valgrind, or connect strace to it (strace -ittfp `pgrep systemd`)?
Hi, Colin - I'm really not convinced that the title of this report accurately describes the problem, or that the kernel is really to blame, but i'm pretty sure it's this commit that triggers the bug. Valgrind would be great but in this case systemd is pid #1 and i think it needs to stay that way, so spawning it from valgrind won't work. I tried strace but couldn't trigger the bug with strace attached, i'm guessing it's some sort of race condition and strace slows things down just enough that the bug disappears. Maybe systemd is trying to read/write too soon, i.e. before the system is completely re-stabilized? I was able to capture the crash with gdb attached and the backtrace looks like this: http://codepad.org/at7IhtMI
I can believe its related. Before 1c441e9, every epoll_wait would return -EINTR after suspend, and then the loop in systemd's manager_loop would call all the manager_dispatch functions, reenter epoll_wait, and immediately see the time change event. After 1c441e9 epoll_wait will not return -EINTR, it will go directly to returning the time change event. That could definitely change the timing of something inside systemd or in the kernel. A few more ideas: Any chance systemd will compile with LLVM with AddressSanitizer turned on? Can you grab a /var/log/messages with kernel and systemd logs set to debug, with and without the changed reverted? I'm specifically interested if systemd logs any events between the kernel's "Restarting tasks ... done." and systemd's "Time has been changed"
(In reply to Colin Cross from comment #17) > Any chance systemd will compile with LLVM with AddressSanitizer turned on? Yes, this has been fixed now: https://bugzilla.redhat.com/show_bug.cgi?id=1010506#c15.
I have traced the problem to the DS register being set to 0 after suspend, when it should be 0x7b. If I set it back to 0x7b and continue where the fault occurred it doesn't crash.
Any more outside information to be provided to take this off NEEDINFO? This regression affects people on Fedora 19 and elsewhere.
Colin, are you looking into this one?
I'm pretty stuck on debugging it. An epoll_wait syscall during suspend is returning to userspsace with %ds set to KERNEL_DS after suspend. If I hack RESTORE_REGS to always put USER_DS into %ds it doesn't happen, but reading the saved regs from epoll_wait never sees the wrong value.
Cc'd Rafael and Peter. After 1c441e9 which changed how tasks which were sleeping inside epoll are handled across suspend / resume cycles, %ds is getting corrupted across suspend. The original behavior is that a task sleeping in epoll would be woken up to enter freezer and then on resume return with -EINTR. After the commit, such a task would stay in epoll wait across suspend / resume cycles. Any ideas? Thanks.
Colin, if we can't debug it soonish, I think we probably should revert it and try it in the next cycle. Thanks.
Any chance to get this reverted soon? I don't think it's acceptable to completely break suspend on some systems for more than a month. It makes Fedora looks very unprofessional. Thanks!
Colin, can you please submit a patch to revert the change w/ stable cc'd? We probably should try it in the next cycle once the problem is debugged. Thanks.
On Monday, October 28, 2013 02:23:39 PM bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=61781 > > --- Comment #26 from Tejun Heo <tj@kernel.org> --- > Colin, can you please submit a patch to revert the change w/ stable cc'd? We > probably should try it in the next cycle once the problem is debugged. Since the offending commit went it through my tree, I've just queued up a revert of it for 3.12 and will push it to Linus tomorrow. Thanks!
Great, thank you!
The revert patch has been merged into upstream tree. So close it. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c511851de162e8ec03d62e7d7feecbdf590d881d