Hello,

I'm running Gentoo with kernel 3.6.11, where both hibernation and suspend work. Upgrading to 3.7 or 3.8 breaks hibernation: the computer freezes. Suspend still works.

I tested by following this page:
http://www.mjmwired.net/kernel/Documentation/power/basic-pm-debugging.txt

All tests are OK except "processors" and "core" (black screen and kernel panic).

I tried to hibernate with just one core online and everything worked:

echo 0 > /sys/devices/system/cpu/cpu*/online

I don't know what other information you need.

Thanks
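A note on the command above: in bash, a redirection to a glob that matches several files is normally rejected as an "ambiguous redirect", so each CPU's online file has to be written on its own. A minimal sketch, assuming root privileges; the function name offline_secondary_cpus and its optional root-directory argument are made up here for illustration and dry runs:

```shell
# Take every CPU except cpu0 offline, one sysfs write per CPU.
# $1 (hypothetical, optional): alternative sysfs root, useful for dry runs.
offline_secondary_cpus() {
    local root="${1:-/sys/devices/system/cpu}"
    local f
    for f in "$root"/cpu[0-9]*/online; do
        [ -e "$f" ] || continue              # glob matched nothing
        case "$f" in
            */cpu0/online) continue ;;       # keep the boot CPU online
        esac
        echo 0 > "$f"
    done
}

# Usage (as root):
#   offline_secondary_cpus
```

Writing 1 to the same files brings the CPUs back online.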
Hi Esteban, Thanks for the report and the debug you've done. Can you please do a git bisect for 3.6-3.7 and find out the offending commit? Thanks.
This is the result:

033d9959ed2dc1029217d4165f80a71702dc578e is the first bad commit

Thanks
Which is a merge:

commit 033d9959ed2dc1029217d4165f80a71702dc578e
Merge: 974a847 7c6e72e
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue Oct 2 09:54:49 2012 -0700

    Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

I wonder if you can check both parents of that commit and see if they work separately?
Any chance you can capture the message from the panic? I'll try to reproduce the problem. Thanks.
(In reply to comment #4)
> Any chance you can capture the message from the panic? I'll try to reproduce
> the problem.
>
> Thanks.

No. When it happens, my screen turns off and a few seconds later I get a kernel panic. I have no log.

I saw that Arch is also affected:
https://bbs.archlinux.org/viewtopic.php?id=156276
(In reply to comment #3)
> Which is a merge:
>
> commit 033d9959ed2dc1029217d4165f80a71702dc578e
> Merge: 974a847 7c6e72e
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Tue Oct 2 09:54:49 2012 -0700
>
> Merge branch 'for-3.7' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
>
> I wonder if you can check both parents of that commit and see if they work
> separately?

I don't understand exactly what you want. Sorry, my English is bad.

I tried with workqueue.c from
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ea1abd6197d5805655da1bb589929762f4b4aa08
but I get the same problem. That is as far back in the parent's history as I can go and still compile my kernel. With earlier versions the build fails.
Hmmm... It doesn't reproduce here.
Oops, fat finger pressed enter too soon. Cc'ing Lai, who wrote a lot of the changes in that pull request.

Esteban, can you please attach your .config? Also, can you please test 7c6e72e46c9ea4a88f3f8ba96edce9db4bd48726 and, if that one is bad, start bisection from there? For the "good" starting point, 0d7614f09c1ebdbaa1599a5aba7593f147bf96ee should do (but please verify it actually works, just in case).

Thank you very much.
Created attachment 94241 [details] .config kernel
(In reply to comment #8)
> Also, can you please test 7c6e72e46c9ea4a88f3f8ba96edce9db4bd48726 and if
> that one is bad, start bisection from there?
>
> Thank you very much.

Same result, strangely: 033d9959ed2dc1029217d4165f80a71702dc578e is the first bad commit.

I'm not sure that what I did is correct. I took workqueue.c from commit 7c6e72e46c9ea4a88f3f8ba96edce9db4bd48726 and put it in kernel-3.7.2/kernel/workqueue.c, then compiled and installed 3.7.2. After that I did a git bisect against 3.6.11.

> For the "good" starting point
> 0d7614f09c1ebdbaa1599a5aba7593f147bf96ee should do (but please verify it
> actually works just in case).

I don't know how to test this, because if I put workqueue.c from 3.6-rc1 into 3.7.2, I can't compile my kernel. I'll try a git bisect with commit ea1abd6197d5805655da1bb589929762f4b4aa08, which doesn't work.
Sorry, I see that what I did was wrong. So I downloaded the tree at commit 7c6e72e46c9ea4a88f3f8ba96edce9db4bd48726, compiled the kernel, and hibernation works.
I'm confused. Are you saying the bisection was wrong or the bug report was wrong?
Sorry. The bug and the bisection are right. My comment #10 was wrong because I only replaced the workqueue.c file in kernel 3.7.2. So I have now tested with snapshots of the commits.

To summarize:

Kernel 3.7.2 gentoo: kernel panic with hibernation
Kernel 3.6.11 gentoo: hibernation works
Kernel from commit 033d9959ed2dc1029217d4165f80a71702dc578e: kernel panic with hibernation
Kernel from commit 7c6e72e46c9ea4a88f3f8ba96edce9db4bd48726: hibernation works
Kernel from commit 0d7614f09c1ebdbaa1599a5aba7593f147bf96ee: hibernation works
Hmmm.... that's unexpected, can you please test 974a847e00cf3ff1695e62b276892137893706ab? Also, can you please do the usual bisection rather than copying workqueue.c around? Thanks.
My first bisection, between the two Gentoo kernels, was done without copying workqueue.c.

git.kernel.org has changed and I can't find where to download a snapshot of commit 974a847e00cf3ff1695e62b276892137893706ab.

Which commits do you want a bisection between?

Thanks
It's probably easiest if you use a git repo for bisection.

https://www.kernel.org/pub//software/scm/git/docs/git-bisect.html

I'm mostly curious whether 974a847e00cf3ff1695e62b276892137893706ab works or not. If it doesn't, bisecting between it and the latest commit that you know works should give us a better idea of where the fault is located. If it is indeed the merge commit which introduced the problem, i.e. both 974a847e00cf3ff1695e62b276892137893706ab and 7c6e72e46c9ea4a88f3f8ba96edce9db4bd48726 work but 033d9959ed2dc1029217d4165f80a71702dc578e doesn't, then we'd be looking at a side effect of merging, which would be pretty interesting too.

Thanks.
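For reference, the bisection described above could be started roughly as follows. This is only a sketch, assuming a clone of the kernel git tree and using the commit IDs already mentioned in this report; the helper name start_bisect is made up for illustration:

```shell
# Start a bisection between a known-good and a known-bad commit.
# Must be run from the top of a git checkout.
#   $1 = commit that is known to work
#   $2 = commit that is known to fail
start_bisect() {
    local good="$1" bad="$2"
    git bisect start &&
    git bisect bad "$bad" &&
    git bisect good "$good"
}

# Example with commits from this report:
#   start_bisect 0d7614f09c1ebdbaa1599a5aba7593f147bf96ee \
#                033d9959ed2dc1029217d4165f80a71702dc578e
#
# git then checks out a commit to test. Build, boot and try to hibernate,
# then report the outcome and repeat until git names the first bad commit:
#   git bisect good     # hibernation worked
#   git bisect bad      # kernel panicked
#   git bisect reset    # when finished, return to the original branch
```

To test a single commit without bisecting, checking it out directly (git checkout followed by the commit ID) should be enough; no temporary branch is needed.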
From the git repo:

7c6e72e46c9ea4a88f3f8ba96edce9db4bd48726 works.
974a847e00cf3ff1695e62b276892137893706ab works too.
033d9959ed2dc1029217d4165f80a71702dc578e doesn't.

I'm not sure that what I do is correct, so I'll describe it. In the git repo I do:

git branch 974a84 974a847e00cf3ff1695e62b276892137893706ab
git checkout 974a84
(compile the kernel and install it)
git checkout master
git branch -d 974a84
git reset

and I start again with git branch... for every commit.

Thanks
Hi,

Is it possible to get the list of all commits applied (to workqueue.c) between 033d9959ed2dc1029217d4165f80a71702dc578e and 974a847e00cf3ff1695e62b276892137893706ab?

I can't determine which commit the workqueue.c in 974a847e00cf3ff1695e62b276892137893706ab comes from.

I'll try to apply the patches one by one to find which one causes the kernel panic. I don't know if that's a good way to proceed.

Thanks
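One possible way to get such a list is git log with a revision range and a path limit. A sketch only, assuming a clone of the kernel git tree; the helper name commits_touching is made up for illustration:

```shell
# List the commits that are reachable from $2 but not from $1 and that
# touched the file $3, oldest first. Run inside a git checkout.
commits_touching() {
    git log --oneline --reverse "$1..$2" -- "$3"
}

# Example with the commits from this report:
#   commits_touching 974a847e00cf3ff1695e62b276892137893706ab \
#                    033d9959ed2dc1029217d4165f80a71702dc578e \
#                    kernel/workqueue.c
```

Each listed commit can then be inspected with git show followed by its ID.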
OK, I found the problem:

commit 715f1300802e6eaefa85f6cfc70ae99af3d5d497 (workqueue: fix zero @delay handling of queue_delayed_work_on())

and

commit 8852aac25e79e38cc6529f20298eed154f60b574 (workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay)

** If I remove commit 715f1300802e6eaefa85f6cfc70ae99af3d5d497 from 033d9959ed2dc1029217d4165f80a71702dc578e, i.e. deleting this in workqueue.c:

    if (!delay)
        return queue_work_on(cpu, wq, &dwork->work);

and adding

    if (delay == 0)
        return queue_work(wq, &dwork->work);

hibernation works.

** In kernels 3.7.10 and 3.8.1 in Gentoo I have to remove both commit 715f1300802e6eaefa85f6cfc70ae99af3d5d497 and commit 8852aac25e79e38cc6529f20298eed154f60b574 to make hibernation work, i.e. deleting

    /*
     * If @delay is 0, queue @dwork->work immediately. This is for
     * both optimization and correctness. The earliest @timer can
     * expire is on the closest next tick and delayed_work users depend
     * on that there's no such delay when @delay is 0.
     */
    if (!delay) {
        __queue_work(cpu, wq, &dwork->work);
        return;
    }

and adding

     */
    bool queue_delayed_work(struct workqueue_struct *wq,
                            struct delayed_work *dwork, unsigned long delay)
    {
        if (delay == 0)
            return queue_work(wq, &dwork->work);

        return queue_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay);
    }
    EXPORT_SYMBOL_GPL(queue_delayed_work);

Thanks
Hello, Esteban. Thanks a lot for the bisection. I think I have an idea about what's going on. There was another case which had a similar problem. It wasn't a bug in workqueue itself but the workqueue user abusing delayed_work interface. Hmmm... we need to locate the abuser. I'll think about how to hunt it down. Thanks.
Created attachment 94521 [details] dwork-dbg.patch Can you please apply the patch on a broken kernel, try hibernation and attach the kernel log afterwards? Thanks.
I tried the patch with 033d9959ed2dc1029217d4165f80a71702dc578e, kernel 3.7.10 and kernel 3.8.1. I always get the kernel panic, and I don't have any log. I'm posting my /var/log/messages, but I don't see anything in it.

Thanks
Created attachment 94571 [details] /var/log/message
My guess is that the CPU is offline when workqueue.c calls add_timer_on(). (I have absolutely no idea why reverting 715f1300802e6eaefa85f6cfc70ae99af3d5d497 would hide this problem, which is why I'm sending this comment so late.) I think we need some checking code in workqueue.c for this purpose.
Lai, can you prep a debug patch to confirm your suspicion? I don't get how my debug patch doesn't make the hibernation succeed again when reverting 715f130080 does. Hmmm.... weird....
Created attachment 94711 [details] dwork-cpu-dbg.patch Can you please try this patch and post the kernel log? Thanks.
Created attachment 94721 [details] dwork-cpu-dbg.patch Oops, please try this one instead.
Created attachment 94731 [details] dwork-cpu-dbg.patch I'm on a roll today. Sorry. :) I misread what Lai wrote. Please try this one instead.
I think there is an error in the patch. It makes no reference to this line:

    struct cpu_workqueue_struct *cwq = get_work_cwq(&dwork->work);

so the patch can't be applied. This is the section I have in workqueue.c:

    void delayed_work_timer_fn(unsigned long __data)
    {
        struct delayed_work *dwork = (struct delayed_work *)__data;
        struct cpu_workqueue_struct *cwq = get_work_cwq(&dwork->work);

        /* should have been called from irqsafe timer with irq already off */
        __queue_work(dwork->cpu, cwq->wq, &dwork->work);
    }

I tried putting the patch's line between the two "struct" lines, but I still get a kernel panic. If I remove

    struct cpu_workqueue_struct *cwq = get_work_cwq(&dwork->work);

I can't compile my kernel.

Thanks
I applied the patch to the latest git sources and I still get a kernel panic.

Thanks
Sorry about the delay. I forgot about this. Any chance you can post the panic with the patch applied? Taking a photo of the panic would work too. Thanks.
I did many tests and I am completely lost...

For debugging, I activated all the options under "Kernel hacking". The result was that hibernation works. I don't know why, but it works.

So, as with git bisect, I removed options step by step to see which one makes hibernation work. I found that enabling "SLUB debugging on by default" (and no other option) makes hibernation work.

To debug, I then enabled all the options under "Kernel hacking" except "SLUB debugging on by default". With this combination, hibernation... works too.

The problem is that some combinations make hibernation work and others don't. I can't try all the possibilities, and I don't understand why enabling options under "Kernel hacking" makes hibernation work.

As I said, when hibernation fails, the screen turns off and then the kernel panics. I don't see any log. With some combinations I do get logs, but then hibernation works.

Do you know which options I have to enable in the kernel to see the log? Searching the internet, I found that I can get kernel messages over USB, but I don't know how to do this.

So, as I said, I'm totally lost.

Thanks
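One commonly used way to get kernel messages off a machine whose screen goes dark is netconsole, which sends them over UDP to a second machine. The following is only a sketch of how its parameter is put together, assuming netconsole is built as a module and the network driver keeps working long enough to send the messages. The helper name netconsole_arg and all addresses are placeholders, not values from this report:

```shell
# Build a netconsole parameter string in the documented form
#   netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# using the default ports 6665 (source) and 6666 (target).
#   $1 = IP of the machine under test   $2 = its network interface
#   $3 = IP of the receiving machine    $4 = MAC of the receiving machine
netconsole_arg() {
    printf 'netconsole=6665@%s/%s,6666@%s/%s\n' "$1" "$2" "$3" "$4"
}

# Example (placeholder addresses), as root on the machine under test:
#   modprobe netconsole "$(netconsole_arg 192.168.1.10 eth0 \
#                          192.168.1.20 00:11:22:33:44:55)"
# The receiving machine then listens for UDP packets on port 6666.
```

Booting with the no_console_suspend kernel parameter may also help, since it keeps the console active while the system is going down.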
Hi Esteban, Not sure if this helps, but you can try to follow Documentation/power/basic-pm-debugging.txt for some tests. I think you can start from devices, and if everything is OK, proceed to next test level. Thanks.
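For reference, the test levels in that document are selected through /sys/power/pm_test before triggering hibernation. A minimal sketch, assuming a kernel built with CONFIG_PM_DEBUG and root privileges; the helper name run_pm_test and its optional directory argument are made up for illustration and dry runs:

```shell
# Run one hibernation test at the given pm_test level.
#   $1 = test level: freezer, devices, platform, processors or core
#   $2 = (hypothetical, optional) alternative to /sys/power for dry runs
run_pm_test() {
    local level="$1"
    local power="${2:-/sys/power}"
    echo "$level" > "$power/pm_test" &&
    echo disk > "$power/state"
}

# Usage (as root), going from the lightest level to the deepest:
#   run_pm_test freezer
#   run_pm_test devices
#   run_pm_test processors
# Afterwards, switch test mode off again:
#   echo none > /sys/power/pm_test
```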
Aaron, Esteban said he did that in the Description.

The fact that the processors test fails (and the problem is not reproducible with just 1 CPU online) means that CPU offline is involved. Since the issue is not reproducible with different combinations of config/debug options, it is most likely due to a race somewhere.

So commit 033d9959ed2dc1029217d4165f80a71702dc578e may not even be the culprit; it might just change the timing of things slightly, and that might cause an *old* race to show up.

Esteban, is this reproducible with the current Linus' tree?
You can also try to play with CPU online/offline using the sysfs interface and see if you're able to trigger anything suspicious this way.
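A rough sketch of such an experiment, assuming root privileges and a CPU that exposes an online file in sysfs. The helper name toggle_cpu and its optional root-directory argument are made up for illustration and dry runs:

```shell
# Take one CPU offline and bring it back online a number of times.
#   $1 = CPU number (e.g. 1)
#   $2 = number of offline/online rounds
#   $3 = (hypothetical, optional) alternative sysfs root for dry runs
toggle_cpu() {
    local cpu="$1" rounds="$2"
    local root="${3:-/sys/devices/system/cpu}"
    local i=0
    while [ "$i" -lt "$rounds" ]; do
        echo 0 > "$root/cpu$cpu/online" || return 1
        echo 1 > "$root/cpu$cpu/online" || return 1
        i=$((i + 1))
    done
}

# Usage (as root), ideally from a text console so that any panic
# message stays visible:
#   toggle_cpu 1 100
```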
I tried the current Linus' tree, with and without the patch, and got the same result.

I tried putting CPUs online/offline. One CPU works, but two give a kernel panic.

I tried netconsole and kexec to get the log, but with no result.

I'll try the next releases to see if they work.

Thanks
(In reply to comment #37)
> Try with the current linus' tree with and without the patch and same result.
> I've try putting cpu online/offline. One cpu work but 2 give kernel panic.
> Try netconsole and kexec to have the log but no result.

Can you boot into console mode and then put a CPU offline/online? When the panic occurs, I think you may be able to see something.
Esteban, what was the newest kernel you tested?
Hi Esteban, Are you still there?
Sorry about the delay, I'm moving house. I'll run the tests next week, if I have time.

Thanks
I tried kernel 3.9.7, and hibernation and suspend both work without enabling any option under "Kernel hacking".

Thanks for your work
Thanks for the confirmation! Closing.