Bug 54331 - Hibernate not working with kernel 3.7 and 3.8
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: Hibernation/Suspend
Hardware: All Linux
Importance: P1 normal
Assignee: Rafael J. Wysocki
Reported: 2013-02-23 20:37 UTC by Esteban Taroni
Modified: 2013-06-23 21:34 UTC

Kernel Version: 3.7.9 3.8.0
Regression: Yes

Attachments
.config kernel (67.35 KB, application/octet-stream)
2013-02-28 17:57 UTC, Esteban Taroni
dwork-dbg.patch (2.19 KB, patch)
2013-03-05 01:12 UTC, Tejun Heo
/var/log/message (99.55 KB, text/plain)
2013-03-05 15:21 UTC, Esteban Taroni
dwork-cpu-dbg.patch (570 bytes, patch)
2013-03-07 16:15 UTC, Tejun Heo
dwork-cpu-dbg.patch (602 bytes, patch)
2013-03-07 16:18 UTC, Tejun Heo
dwork-cpu-dbg.patch (1.02 KB, patch)
2013-03-07 16:24 UTC, Tejun Heo

Description Esteban Taroni 2013-02-23 20:37:11 UTC
Hello,

Running Gentoo with kernel 3.6.11, hibernation and suspend both work.

Upgrading to 3.7 or 3.8 breaks hibernation: the computer freezes.
Suspend still works.

I tested following this page:
http://www.mjmwired.net/kernel/Documentation/power/basic-pm-debugging.txt
Everything is OK except the processors and core tests (black screen and kernel panic).

I've tried hibernating with just one core online and everything worked:
echo 0 > /sys/devices/system/cpu/cpu*/online
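
A note on the command above: bash rejects a glob that matches several files as a redirection target ("ambiguous redirect"), so depending on the shell the one-liner may need to be written as a loop. A minimal sketch, assuming the usual sysfs layout; the function name and the optional directory argument are illustrative and not part of the original report:

```shell
# offline_secondary_cpus [CPU_DIR]
# Write 0 to the "online" file of every CPU except cpu0.
# CPU_DIR defaults to the real sysfs path; the argument exists only so the
# sketch can be tried against a scratch directory. Needs root on a real system.
offline_secondary_cpus() {
    cpu_dir="${1:-/sys/devices/system/cpu}"
    for f in "$cpu_dir"/cpu[0-9]*/online; do
        [ -e "$f" ] || continue          # glob matched nothing
        case "$f" in
            */cpu0/online) continue ;;   # keep the boot CPU running
        esac
        echo 0 > "$f"
    done
}

# Usage (as root): offline_secondary_cpus
```

Writing 1 to the same files brings the CPUs back online.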

I don't know what other information you need.

Thx
Comment 1 Aaron Lu 2013-02-25 05:29:45 UTC
Hi Esteban,

Thanks for the report and the debugging you've done.
Can you please do a git bisect for 3.6-3.7 and find out the offending commit? Thanks.
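
For readers unfamiliar with the procedure, a bisection between the two releases might be started roughly as follows. This is only a sketch: it assumes a clone of Linus' tree that contains the v3.6 and v3.7 tags, and the helper name start_bisect is made up for illustration.

```shell
# start_bisect GOOD BAD
# Begin a bisection in the current git repository. git then checks out a
# commit roughly halfway between GOOD and BAD for testing.
start_bisect() {
    git bisect start &&
    git bisect bad "$2" &&
    git bisect good "$1"
}

# Usage, inside the kernel tree:
#   start_bisect v3.6 v3.7
#   ...build, boot and try to hibernate, then report the outcome:
#   git bisect good     # hibernation worked
#   git bisect bad      # hibernation failed
#   ...repeat until git names the first bad commit, then:
#   git bisect reset
```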
Comment 2 Esteban Taroni 2013-02-25 17:09:48 UTC
This is the result

033d9959ed2dc1029217d4165f80a71702dc578e is the first bad commit

Thx
Comment 3 Rafael J. Wysocki 2013-02-25 23:35:32 UTC
Which is a merge:

commit 033d9959ed2dc1029217d4165f80a71702dc578e
Merge: 974a847 7c6e72e
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue Oct 2 09:54:49 2012 -0700

    Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

I wonder if you can check both parents of that commit and see if they work separately?
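
Checking the parents of a merge separately could look like the sketch below. The ^1 and ^2 suffixes are standard git revision syntax for the first and second parent of a commit; the helper name is hypothetical.

```shell
# print_merge_parents COMMIT
# Print the full hashes of the first and second parent of a merge commit.
print_merge_parents() {
    git rev-parse "$1^1" "$1^2"
}

# Usage, inside the kernel tree:
#   print_merge_parents 033d9959ed2dc1029217d4165f80a71702dc578e
# Each printed hash can then be checked out (git checkout <hash>),
# built and tested on its own.
```

For the commit above, the two hashes should correspond to the abbreviated parents 974a847 and 7c6e72e shown on the "Merge:" line.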
Comment 4 Tejun Heo 2013-02-25 23:40:45 UTC
Any chance you can capture the message from the panic? I'll try to reproduce the problem.

Thanks.
Comment 5 Esteban Taroni 2013-02-26 21:38:13 UTC
(In reply to comment #4)
> Any chance you can capture the message from the panic? I'll try to reproduce
> the problem.
> 
> Thanks.

No, when it happens, my screen turns off and a few seconds later I get a kernel panic. I have no log.
I saw that Arch is also affected:
https://bbs.archlinux.org/viewtopic.php?id=156276
Comment 6 Esteban Taroni 2013-02-26 21:51:24 UTC
(In reply to comment #3)
> Which is a merge:
> 
> commit 033d9959ed2dc1029217d4165f80a71702dc578e
> Merge: 974a847 7c6e72e
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Tue Oct 2 09:54:49 2012 -0700
> 
>     Merge branch 'for-3.7' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
> 
> I wonder if you can check both parents of that commit and see if they work
> separately?

I don't understand exactly what you want.
Sorry, but my English is bad.
I've tried with workqueue.c from
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ea1abd6197d5805655da1bb589929762f4b4aa08
but I get the same problem.
This is as far back in the parents as I can go and still compile my kernel.
With anything earlier, the kernel fails to compile.
Comment 7 Tejun Heo 2013-02-27 21:47:10 UTC
Hmmm... It doesn't reproduce here.
Comment 8 Tejun Heo 2013-02-27 21:49:49 UTC
Oops, fat finger pressed enter too soon. Cc'ing Lai, who wrote a lot of the changes in that pull request. Esteban, can you please attach your .config? Also, can you please test 7c6e72e46c9ea4a88f3f8ba96edce9db4bd48726 and, if that one is bad, start bisection from there? For the "good" starting point, 0d7614f09c1ebdbaa1599a5aba7593f147bf96ee should do (but please verify it actually works just in case).

Thank you very much.
Comment 9 Esteban Taroni 2013-02-28 17:57:29 UTC
Created attachment 94241
.config kernel
Comment 10 Esteban Taroni 2013-02-28 18:24:14 UTC
(In reply to comment #8)
> Also, can you please test 7c6e72e46c9ea4a88f3f8ba96edce9db4bd48726 and if
> that one is bad, start bisection from there?
> 
> Thank you very much.

Same result. Strange?

033d9959ed2dc1029217d4165f80a71702dc578e is the first bad commit.

I don't know if what I did is right.
I took workqueue.c from commit 7c6e72e46c9ea4a88f3f8ba96edce9db4bd48726 and put it in kernel-3.7.2/kernel/workqueue.c.
I compiled and installed 3.7.2.
Then I did a git bisect against 3.6.11.

> For the "good" starting point
> 0d7614f09c1ebdbaa1599a5aba7593f147bf96ee should do (but please verify it
> actually works just in case).

I don't know how to test this, because if I put workqueue.c from 3.6-rc1 into 3.7.2, I can't compile my kernel.

I'll try a git bisect with commit ea1abd6197d5805655da1bb589929762f4b4aa08, which doesn't work.
Comment 11 Esteban Taroni 2013-02-28 20:44:12 UTC
Sorry, I see that what I did was wrong.
So I downloaded the tree at commit 7c6e72e46c9ea4a88f3f8ba96edce9db4bd48726 and compiled the kernel, and hibernation works.
Comment 12 Tejun Heo 2013-02-28 21:06:41 UTC
I'm confused. Are you saying the bisection was wrong or the bug report was wrong?
Comment 13 Esteban Taroni 2013-02-28 22:05:08 UTC
Sorry,

The bug and the bisection are right.

My comment #10 was wrong because I only replaced the workqueue.c file in kernel 3.7.2.
So I have now run tests with snapshots of the commits.

To summarize:
Kernel 3.7.2 gentoo: kernel panic with hibernation
Kernel 3.6.11 gentoo: hibernation works
Kernel from commit 033d9959ed2dc1029217d4165f80a71702dc578e: kernel panic with hibernation
Kernel from commit 7c6e72e46c9ea4a88f3f8ba96edce9db4bd48726: hibernation works
Kernel from commit 0d7614f09c1ebdbaa1599a5aba7593f147bf96ee: hibernation works
Comment 14 Tejun Heo 2013-02-28 22:22:40 UTC
Hmmm.... that's unexpected, can you please test 974a847e00cf3ff1695e62b276892137893706ab? Also, can you please do the usual bisection rather than copying workqueue.c around?

Thanks.
Comment 15 Esteban Taroni 2013-03-01 16:01:50 UTC
My first bisection, between the two Gentoo kernels, was made without copying workqueue.c.
git.kernel.org has changed and I can't find where to download a snapshot of commit 974a847e00cf3ff1695e62b276892137893706ab.
Which commits do you want a bisection between?

Thanks
Comment 16 Tejun Heo 2013-03-01 16:06:49 UTC
It's probably easiest if you use a git repo for bisection.

  https://www.kernel.org/pub//software/scm/git/docs/git-bisect.html

I'm mostly curious whether 974a847e00cf3ff1695e62b276892137893706ab works or not. If it doesn't, bisecting between it and whatever is the latest that you know which works should give us a better idea of where the fault is located. If it indeed is the merge commit which introduced the problem - ie. both 974a847e00cf3ff1695e62b276892137893706ab and 7c6e72e46c9ea4a88f3f8ba96edce9db4bd48726 work but 033d9959ed2dc1029217d4165f80a71702dc578e doesn't, then we'd be looking at a side effect of merging, which would be pretty interesting too.

Thanks.
Comment 17 Esteban Taroni 2013-03-01 20:55:12 UTC
From the git repo:

7c6e72e46c9ea4a88f3f8ba96edce9db4bd48726 works.
974a847e00cf3ff1695e62b276892137893706ab works too.
033d9959ed2dc1029217d4165f80a71702dc578e doesn't.

I'm not sure whether what I did is right, so I'll describe it.
In the git repo I do

git branch 974a84 974a847e00cf3ff1695e62b276892137893706ab
git checkout 974a84
compile kernel and install it
git checkout master
git branch -d 974a84
git reset

and I start again from git branch... for every commit.

Thanks
Comment 18 Esteban Taroni 2013-03-04 10:32:41 UTC
Hi,

Is it possible to list all the commits applied to workqueue.c between
033d9959ed2dc1029217d4165f80a71702dc578e
and
974a847e00cf3ff1695e62b276892137893706ab?

I can't determine which commit the workqueue.c in 974a847e00cf3ff1695e62b276892137893706ab comes from.

I'll try to apply the patches one by one to find which one causes the kernel panic.

I don't know if that's a good way to proceed.

Thanks
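
One way to list the commits in question is a path-limited git log over the range between the two commits. A sketch, with a made-up helper name:

```shell
# commits_touching OLD NEW PATH
# List, one per line, the commits reachable from NEW but not from OLD
# that modified PATH.
commits_touching() {
    git log --oneline "$1..$2" -- "$3"
}

# Usage, inside the kernel tree:
#   commits_touching 974a847e00cf3ff1695e62b276892137893706ab \
#                    033d9959ed2dc1029217d4165f80a71702dc578e \
#                    kernel/workqueue.c
```

Since 974a847 is the first parent of the merge, this range covers the commits that came in through the merged branch.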
Comment 19 Esteban Taroni 2013-03-04 22:27:10 UTC
OK, I found the problem:

commit 715f1300802e6eaefa85f6cfc70ae99af3d5d497 (workqueue: fix zero @delay handling of queue_delayed_work_on())
and
commit 8852aac25e79e38cc6529f20298eed154f60b574 (workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay)

**
If I remove commit 715f1300802e6eaefa85f6cfc70ae99af3d5d497 from 033d9959ed2dc1029217d4165f80a71702dc578e, i.e. by
deleting in workqueue.c
	if (!delay)
		return queue_work_on(cpu, wq, &dwork->work);

and adding
	if (delay == 0)
		return queue_work(wq, &dwork->work);

hibernation works.

**
In kernel 3.7.10 and 3.8.1 in Gentoo I have to remove both commit 715f1300802e6eaefa85f6cfc70ae99af3d5d497 and commit 8852aac25e79e38cc6529f20298eed154f60b574 to make hibernation work, i.e. by
deleting
	/*
	 * If @delay is 0, queue @dwork->work immediately. This is for
	 * both optimization and correctness. The earliest @timer can
	 * expire is on the closest next tick and delayed_work users depend
	 * on that there's no such delay when @delay is 0.
	 */
	if (!delay) {
		__queue_work(cpu, wq, &dwork->work);
		return;
	}

and adding
 */
bool queue_delayed_work(struct workqueue_struct *wq,
			struct delayed_work *dwork, unsigned long delay)
{
	if (delay == 0)
		return queue_work(wq, &dwork->work);

	return queue_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay);
}
EXPORT_SYMBOL_GPL(queue_delayed_work);

Thanks
Comment 20 Tejun Heo 2013-03-05 00:34:57 UTC
Hello, Esteban.

Thanks a lot for the bisection. I think I have an idea about what's going on. There was another case which had a similar problem. It wasn't a bug in workqueue itself but the workqueue user abusing delayed_work interface. Hmmm... we need to locate the abuser. I'll think about how to hunt it down.

Thanks.
Comment 21 Tejun Heo 2013-03-05 01:12:49 UTC
Created attachment 94521
dwork-dbg.patch

Can you please apply the patch on a broken kernel, try hibernation and attach the kernel log afterwards?

Thanks.
Comment 22 Esteban Taroni 2013-03-05 15:20:07 UTC
I tried the patch with 033d9959ed2dc1029217d4165f80a71702dc578e, kernel 3.7.10 and 3.8.1.
I always get the kernel panic.

I don't have any log.
I'm posting my /var/log/messages, but I don't see anything in it.

Thanks
Comment 23 Esteban Taroni 2013-03-05 15:21:41 UTC
Created attachment 94571
/var/log/message
Comment 24 Lai Jiangshan 2013-03-05 16:35:37 UTC
My guess is that the CPU is offline when workqueue.c calls add_timer_on() (but I have no idea why reverting 715f1300802e6eaefa85f6cfc70ae99af3d5d497 would hide this problem, which is why I'm sending this comment so late). I think we need some checking code in workqueue.c for this purpose.
Comment 25 Tejun Heo 2013-03-05 16:38:50 UTC
Lai, can you prep a debug patch to confirm your suspicion? I don't get how my debug patch doesn't make the hibernation succeed again when reverting 715f130080 does. Hmmm.... weird....
Comment 26 Tejun Heo 2013-03-07 16:15:59 UTC
Created attachment 94711
dwork-cpu-dbg.patch

Can you please try this patch and post the kernel log? Thanks.
Comment 27 Tejun Heo 2013-03-07 16:18:36 UTC
Created attachment 94721
dwork-cpu-dbg.patch

Oops, please try this one instead.
Comment 28 Tejun Heo 2013-03-07 16:24:54 UTC
Created attachment 94731
dwork-cpu-dbg.patch

I'm on a roll today. Sorry. :)

I misread what Lai wrote. Please try this one instead.
Comment 29 Esteban Taroni 2013-03-07 17:46:11 UTC
I think there is an error in the patch.
It contains no reference to this line:
struct cpu_workqueue_struct *cwq = get_work_cwq(&dwork->work);

So the patch can't be applied.

This is the section I have in workqueue.c

void delayed_work_timer_fn(unsigned long __data)
{
	struct delayed_work *dwork = (struct delayed_work *)__data;
	struct cpu_workqueue_struct *cwq = get_work_cwq(&dwork->work);

	/* should have been called from irqsafe timer with irq already off */
	__queue_work(dwork->cpu, cwq->wq, &dwork->work);
}

I've tried putting the line between the two "struct" lines, but I still get a kernel panic.
If I remove struct cpu_workqueue_struct *cwq = get_work_cwq(&dwork->work); I can't compile my kernel.

Thanks
Comment 30 Esteban Taroni 2013-03-11 17:04:15 UTC
I applied the patch to the latest git sources and I still get a kernel panic.

Thanks
Comment 31 Tejun Heo 2013-04-04 19:17:17 UTC
Sorry about the delay. I forgot about this. Any chance you can post the panic with the patch applied? Taking a photo of the panic would work too.

Thanks.
Comment 32 Esteban Taroni 2013-04-05 17:57:35 UTC
I did many tests and I am completely lost...

For debugging, I activated all the options under "Kernel hacking".
The result was that hibernation works. I don't know why, but it works.

So, as with git bisect, I removed options step by step to see which one makes hibernation work.
I found that enabling "SLUB debugging on by default" (and no other option) makes hibernation work.

To debug, I then enabled all the options in "Kernel hacking" except "SLUB debugging on by default". With this combination, hibernation....... works too.
The problem is that some combinations make hibernation work and others don't. I can't try all the possibilities, and I don't understand why enabling options in "Kernel hacking" makes hibernation work.

As I said, when hibernation fails, the screen turns off and then there is the kernel panic. I don't see any log.
With some combinations I get logs, but then hibernation works.

Do you know which options I have to enable in the kernel to see the log?
Searching the internet, I found that I can get kernel messages over USB, but I don't know how to do this.

So, as I said, I'm totally lost.

Thanks
Comment 33 Aaron Lu 2013-04-07 01:15:58 UTC
Hi Esteban,

Not sure if this helps, but you can try to follow Documentation/power/basic-pm-debugging.txt for some tests. I think you can start from devices, and if everything is OK, proceed to next test level. Thanks.
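
The test levels described in that document are selected through /sys/power/pm_test before hibernation is triggered. A rough sketch, assuming a kernel built with CONFIG_PM_DEBUG; the function name and the optional directory argument are illustrative only:

```shell
# run_pm_test LEVEL [POWER_DIR]
# Select a hibernation test level and start a hibernation cycle that stops
# at that level. POWER_DIR defaults to /sys/power; the argument exists only
# so the sketch can be tried against a scratch directory. Needs root.
run_pm_test() {
    level="$1"
    power_dir="${2:-/sys/power}"
    echo "$level" > "$power_dir/pm_test" &&
    echo disk > "$power_dir/state"
}

# Usage (as root), from the least to the most invasive level:
#   for level in freezer devices platform processors core; do
#       run_pm_test "$level" || break
#   done
#   echo none > /sys/power/pm_test    # back to normal operation
```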
Comment 35 Rafael J. Wysocki 2013-04-07 21:54:14 UTC
Aaron, Esteban said he did that in Description.

The fact that the processors test fails (and the problem is not reproducible with just 1 CPU online) means that CPU offline is involved and since the issue is not reproducible with different combinations of config/debug options, it most likely is due to a race somewhere.

So commit 033d9959ed2dc1029217d4165f80a71702dc578e may not even be the culprit; it might just change the timing of things slightly, and that might cause an *old* race to show up.

Esteban, is this reproducible with the current Linus' tree?
Comment 36 Rafael J. Wysocki 2013-04-07 21:58:41 UTC
You can also try to play with CPU online/offline using the sysfs interface and see if you're able to trigger anything suspicious this way.
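
Exercising CPU hotplug from sysfs might look like the sketch below. The helper name, the cycle count in the usage line and the optional directory argument are assumptions made for illustration:

```shell
# cycle_cpu CPU COUNT [CPU_DIR]
# Take the given CPU offline and bring it back online COUNT times.
# CPU_DIR defaults to the real sysfs path. Needs root on a real system.
cycle_cpu() {
    online="${3:-/sys/devices/system/cpu}/cpu$1/online"
    i=0
    while [ "$i" -lt "$2" ]; do
        echo 0 > "$online" || return 1
        echo 1 > "$online" || return 1
        i=$((i + 1))
    done
}

# Usage (as root): cycle_cpu 1 20, then check dmesg for warnings.
```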
Comment 37 Esteban Taroni 2013-04-10 20:13:19 UTC
I tried the current Linus' tree with and without the patch and got the same result.
I've tried putting CPUs online/offline. One CPU works, but two give a kernel panic.
I tried netconsole and kexec to get the log, but with no result.
I'll try the next releases to see if they work.

Thanks
Comment 38 Aaron Lu 2013-04-11 01:20:14 UTC
(In reply to comment #37)
> Try with the current linus' tree with and without the patch and same result.
> I've try putting cpu online/offline. One cpu work but 2 give kernel panic.
> Try netconsole and kexec to have the log but no result.

Can you boot into console mode and then take a CPU offline/online? When the panic occurs, you may be able to see something, I think.
Comment 39 Rafael J. Wysocki 2013-06-04 01:00:13 UTC
Esteban, what was the newest kernel you tested?
Comment 40 Aaron Lu 2013-06-14 07:47:31 UTC
Hi Esteban,

Are you still there?
Comment 41 Esteban Taroni 2013-06-15 03:26:20 UTC
Sorry about the delay, I'm moving.
I'm going to run the tests next week, if I have time.

Thanks
Comment 42 Esteban Taroni 2013-06-23 17:21:58 UTC
I tried kernel 3.9.7 and hibernation/suspend work without enabling any option in "Kernel hacking".

Thanks for your work
Comment 43 Rafael J. Wysocki 2013-06-23 21:34:40 UTC
Thanks for the confirmation!

Closing.
