Bug 47301 - [BISECTED]Regression 3.5 -> 3.6-rc1 commit 63d95a91, BUG at kernel/sched/core.c 1654, always kworker
[BISECTED]Regression 3.5 -> 3.6-rc1 commit 63d95a91, BUG at kernel/sched/core...
Status: CLOSED CODE_FIX
Product: Process Management
Classification: Unclassified
Component: Scheduler
All Linux
: P1 normal
Assigned To: Tejun Heo
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-09-10 08:45 UTC by Duncan
Modified: 2012-11-05 23:14 UTC (History)
5 users (show)

See Also:
Kernel Version: 3.6-rc1
Tree: Mainline
Regression: Yes


Attachments
BUG log (3.61 KB, text/plain)
2012-09-10 08:45 UTC, Duncan
Details
kernel config (67.12 KB, text/plain)
2012-09-10 08:53 UTC, Duncan
Details
rescuer-is-cpu_intensive.patch (752 bytes, patch)
2012-09-10 17:43 UTC, Tejun Heo
Details | Diff
local-wakeup-dbg.patch (2.50 KB, patch)
2012-09-10 19:31 UTC, Tejun Heo
Details | Diff
local-wakeup-dbg-1.patch (2.61 KB, patch)
2012-09-12 00:27 UTC, Tejun Heo
Details | Diff
dmesg from local-wakeup-dbg-1 (59.72 KB, text/plain)
2012-09-12 12:17 UTC, Duncan
Details
System.map (gzipped) matching attachment #79901 (dmesg from local-wakeup-dbg-1) (288.54 KB, application/octet-stream)
2012-09-12 17:26 UTC, Duncan
Details
powernow-k8-dont-muck-with-kworker.patch (4.07 KB, patch)
2012-09-12 18:22 UTC, Tejun Heo
Details | Diff
powernowk8-fix.patch (5.02 KB, patch)
2012-09-18 20:54 UTC, Tejun Heo
Details | Diff

Description Duncan 2012-09-10 08:45:18 UTC
Created attachment 79561 [details]
BUG log

BUG at kernel/sched/core.c 1654
invalid opcode: 0000 [#1] SMP

Line 1654 is in try_to_wakeup_local: BUG_ON(rq != this_rq());

bisected to 3.6 workqueue merge:

commit 63d95a91
workqueue: use @pool instead of @gcwq or @cpu where applicable

Given that the description says mechanical changes only, no function changes, maybe an alignment and field ordering change?

I don't however have an easy to trigger reproducer, tho time constraints of playing an incoming internet mp3 stream along with doing a (gentoo) system update to hit 100% usage on all 6 cores (bulldozer fx6100 CPU) seems to help, so it's not a 100% certainty that all my bisect goods are actually good and the first bad commit /might/ therefore be a few short of that.  But it's close and I THINK that's it.  Everything from there on thru 3.6-rc4 at least is bad tho.

The result is almost always a dead kernel, but I did happen to get one that still let me do magic-srq along the way and thus got the BUG log.  But I didn't have SystemMap installed, so if it would have helped...  Anyway, it's attached.

I'm setting KCFLAGS.  I've not YET tested with generic KCFLAGS but will do so.  But 3.5 was fine with the same KCFLAGS, so there's still something funny going on with that commit, even if it turns out that generic KCFLAGS doesn't trigger the problem.  Anyway, here they are (not /too/ ricer):

KCFLAGS="-march=native -pipe -O2 -frename-registers -fweb -fmerge-all-constants -fgcse-sm -fgcse-las -fgcse-after-reload -ftree-vectorize"

Hardware/software environment:

AMD "bulldozer" fx-6100 (x86_64 mode) w/ 16 gig RAM, Radeon hd6770 "Turks" 0x6758 w/ 1 gig VRAM, KMS and the native freedomware drivers, ga-990fxa-ud3 mobo, onboard Realtek ALC889 (alsa hda driver), onboard Realtek RTL8111/8168B ethernet (8169 driver), monolithic kernel (no modules), gpt and reiserfs.

Gentoo/~amd64, gcc-4.7.1, glibc-2.16.0, xorg-server-1.13.0, xf86-video-ati-6.14.6-r1, mesa-9_pre20120831-r1, kde-4.9.1.

I'll attach a kernel config momentarily, then try a generic KCFLAGS build and confirm the problem (or not) with it, tho without a 100% trigger reproducer it might take hours or days to confirm if it's clean.
Comment 1 Duncan 2012-09-10 08:53:23 UTC
Created attachment 79571 [details]
kernel config
Comment 2 Tejun Heo 2012-09-10 17:43:21 UTC
Created attachment 79591 [details]
rescuer-is-cpu_intensive.patch

Is any CPU hotplug operation going on the system? If so, can you please try the attache patch and see whether it changes the situation?

Thanks.
Comment 3 Tejun Heo 2012-09-10 19:31:20 UTC
Created attachment 79631 [details]
local-wakeup-dbg.patch

And if that doesn't work, please try the attached patch on top of any 3.6-rcX. If the patch works, it'll print out a debug message starting with "XXX" instead of triggering BUG_ON(). It doesn't really fix the issue but should help debugging it. Please trigger the problem with the patch applied and report the kernel log.

Thanks.
Comment 4 Duncan 2012-09-11 07:47:13 UTC
Re: comment #2, no deliberate CPU hotplugging going on at all.  It's a single-socket bulldozer fx6100 so physical hotplugging would be kinda difficult! =:^)

Cpufreq's active, 1400-3600 megahertz range (the top's slightly overclocked from the rated 3300, but it's stable and only exceeds the 95W power rating for occasional fractions of a second at full load), conservative, and I have all the C1E etc power saving options on in the BIOS, but unless the kernel does some hotplugging trickery with its PM and I've no evidence of that, no hotplugging at all.

Meanwhile, just had a crash with the same BUGON, running 3.6-rc5, no custom KCFLAGS at all, so I've confirmed it's not my custom KCFLAGS, now.  I wish I had a more reliable trigger; that one took long enough I was beginning to wonder if it WAS my KCFLAGS.

Re: comment #3, I was trying the unmodified KCFLAGS so haven't applied the suggested patch yet.  I'll try it next.  Hopefully it doesn't take so long to trigger this time.  (That's why I hadn't reported it before now too.  Having to wait two days plus to be reasonably sure it's a bisect good doesn't help when you're staring at 12 bisect steps!  Fortunately most of the bisect bads triggered sooner than this one but I had a long run of bisect goods that took over a week!)
Comment 5 Tejun Heo 2012-09-11 21:46:49 UTC
With the patch from #3 applied, the kernel shouldn't crash or hang (well, if it works as intended, that is) but will produce kernel message, so please check dmesg from time to time.

Thanks!
Comment 6 Duncan 2012-09-11 23:27:43 UTC
(In reply to comment #5)
> With the patch from #3 applied, the kernel shouldn't crash or hang (well, if it
> works as intended, that is) but will produce kernel message, so please check
> dmesg from time to time.

Thanks.  I was wondering about that.  Fortunately I have a log tail displayed on my system monitor panel, and when I took a look after letting the system run a few hours, three new entries of interest. =:^)  (The consistently 4-digit-max PIDs are because I have kernel/pid_max=10000 set.  Hasn't been a problem that I know of so far, but then again, I'm not sure what pid-starvation would look like, so...)


[13130.471637] XXX workqueue: foreign wakeup bug: current = 3688 (kworker/0:2) on cpu 3/08, target = 3608 (kworker/0:1) on cpu 0/01, cpu=3

[14706.114935] XXX workqueue: foreign wakeup bug: current = 3608 (kworker/0:1) on cpu 1/02, target = 887 (kworker/0:2) on cpu 0/01, cpu=1

[17105.617686] XXX workqueue: foreign wakeup bug: current = 4336 (kworker/0:0) on cpu 4/10, target = 698 (kworker/0:1) on cpu 0/01, cpu=4


Those were spread over a bit more than an hour, several hours ago.  There's no harm in letting the system continue to run, I hope?

Just got another one!

[36863.225461] XXX workqueue: foreign wakeup bug: current = 8563 (kworker/0:1) on cpu 4/10, target = 3700 (kworker/0:0) on cpu 0/01, cpu=4
Comment 7 Tejun Heo 2012-09-12 00:15:32 UTC
Hmmm, somebody seems to be mucking with ->cpus_allowed of kworkers. I have no idea why the recent changes affect this in any way yet. Let me cook up another debug patch.
Comment 8 Tejun Heo 2012-09-12 00:27:51 UTC
Created attachment 79871 [details]
local-wakeup-dbg-1.patch

Can you please test this patch instead and post full kernel log after the debug message is printed? I don't think this is something broken in workqueue. It seems like one of its users is migrating kworker thread to another CPU.
Comment 9 Lai Jiangshan 2012-09-12 03:41:03 UTC
(In reply to comment #8)
> Created an attachment (id=79871) [details]
> local-wakeup-dbg-1.patch
> 
> Can you please test this patch instead and post full kernel log after the debug
> message is printed? I don't think this is something broken in workqueue. It
> seems like one of its users is migrating kworker thread to another CPU.

I think the check in wq_worker_sleeping() is late. if the worker whose ->cpus_allowed is changed by a work item does not go to sleep until it processes tens of items, the code will blame to wrong work item. so some similar test is also needed in process_one_work().
Comment 10 Duncan 2012-09-12 12:17:53 UTC
Created attachment 79901 [details]
dmesg from local-wakeup-dbg-1

Re: comment #8, here's the full dmesg with the second dbg patch applied, after the bug triggered, as requested.
Comment 11 Tejun Heo 2012-09-12 16:29:06 UTC
Lai, we already have the check at the beginning of process_one_work(), so I think the debug patch is enough.

Duncan, we need to map the function address 0xffffffff81405d40 to symbol. Can you please attach the System.map file or redo the test with CONFIG_KALLSYMS[_ALL] = y? It's usually a good idea to have that turned on along with CONFIG_FRAME_POINTER when doing any kind of kernel debugging.

Thanks.
Comment 12 Duncan 2012-09-12 17:26:16 UTC
Created attachment 80061 [details]
System.map (gzipped) matching  attachment #79901 [details] (dmesg from local-wakeup-dbg-1)

Re: Comment #11: 
> Duncan, we need to map the function address 0xffffffff81405d40
> to symbol. Can you please attach the System.map file

I've never known what to do with System.map so originally had it turned off.  But luckily I had an idea it might come in useful with this bug and turned on installing it a few builds ago.

Here's the one that goes with the latest dmesg (Comment #10).  Gzipped so bugzy will take it.

That address appears to be do_dbs_timer.
Comment 13 Duncan 2012-09-12 17:50:12 UTC
(In reply to comment #12)
> Re: Comment #11: 
> > Duncan, we need to map the function address 0xffffffff81405d40
> > to symbol.
> 
> That address appears to be do_dbs_timer.

drivers/cpufreq/cpufreq.conservative.c 443 (on that kernel)

I mentioned cpufreq/conservative earlier.  Maybe between the hours reading the git logs and the hours reading lwn, this stuff's beginning to sink in a bit, despite no claims here to even know C, let alone at the kernel level?  Unless Lai is correct and that's not even pointing at the culprit.
Comment 14 Tejun Heo 2012-09-12 17:51:56 UTC
I found the bug. It seems to have been there for quite some time. Dunno why it's showing up now. The culprit is powernow-k8 cpufreq driver which temporarily migrates the kworker thread to cpufreq_policy->cpu. Working on the fix.

Thanks.
Comment 15 Tejun Heo 2012-09-12 18:22:13 UTC
Created attachment 80071 [details]
powernow-k8-dont-muck-with-kworker.patch

Can you please apply the attached patch along with the debug patch and see whether the problem goes away? Please enable FRAME_POINTER and KALLSYMS.

Thanks.
Comment 16 Tejun Heo 2012-09-12 18:26:40 UTC
Cc'ing Rafael. Rafael, powernow-k8 was migrating workqueue worker temporarily to another CPU which isn't allowed. It triggers BUG_ON() in scheduler if it contends on fidvid_mutex and goes to sleep. I created a patch to update it to use a separate work item to punt to the target cpu in comment #15. One question tho. cpufreq guarantees pol->cpu is up across ->target() call, right?

Thanks.
Comment 17 Rafael J. Wysocki 2012-09-12 22:09:24 UTC
What exactly do you mean by "up"?
Comment 18 Tejun Heo 2012-09-12 22:12:19 UTC
To rephrase: Does cpufreq guarantee that CPU_DOWN doesn't finish on pol->cpu while the callback is in progress?
Comment 19 Rafael J. Wysocki 2012-09-13 20:48:27 UTC
The execution of ->target() is under lock_policy_rwsem_write() and cpufreq_cpu_callback() also calls lock_policy_rwsem_write() before __cpufreq_remove_dev(), so I think the answer is "yes".  [lock_policy_rwsem_write() appears to be always called before __cpufreq_remove_dev(), for that matter.]
Comment 20 Tejun Heo 2012-09-13 20:49:30 UTC
Alrighty, thanks, Rafael. Duncan, how's the patch behaving?
Comment 21 Duncan 2012-09-14 00:27:00 UTC
(In reply to comment #20)
> Duncan, how's the patch behaving?

Looking good so far.

No further instances triggered, but I've only been running it a few hours and sometimes it'd take a couple days to trigger.  But so far, so good. =:^)
Comment 22 Tejun Heo 2012-09-14 20:58:43 UTC
How is it going? I'm fairly sure the fix is correct (that is, if it isn't broken from cpufreq's POV) and I'd like to push it to Linus before -rc6.

Thanks.
Comment 23 Duncan 2012-09-15 03:51:00 UTC
I'd say push it.

Without a 100% reproducer other than time, we don't have a 100% certainty yet, but it's certainly not making things noticeably worse here either, so push it and get as much broader testing as possible before 3.6.0. =:^)
Comment 24 Tejun Heo 2012-09-18 20:54:34 UTC
Created attachment 80501 [details]
powernowk8-fix.patch

Duncan, can you please test this one instead? Linus wants to push this version before 3.6 release so it would be great if you can verify this soonish. If possible, please update us as test progresses (e.g. after surviving the first few hours).

Thanks!
Comment 25 Duncan 2012-09-18 23:46:58 UTC
 (In reply to comment #24)
> Created an attachment (id=80501) [details]
> powernowk8-fix.patch
> 
> Duncan, can you please test this one instead? Linus wants to push this version
> before 3.6 release so it would be great if you can verify this soonish. If
> possible, please update us as test progresses (e.g. after surviving the first
> few hours).


Thanks for posting the new patch here.  I read the thread as CCed, but lost track of which one I was supposed to test.  With the post here I know specifically what to test. =:^)

To be clear, this applies with local-wakeup-dbg-1.patch still applied, correct?  I'll assume so.  And I'm not clear what necessary commit it builds on (re: the thread), but I'll pull before applying so will have the latest mainline.

I'm about to go out but will apply this and leave the system running.  When I get back I'll post an update.

Meanwhile, the don't-muck-with-kworker patch that this obsoletes continues to work without issue.  Solid as it gets.  So we have a backup, if this one fails.
Comment 26 Tejun Heo 2012-09-18 23:50:37 UTC
Yes, including local-wakeup-dbg-1.patch can't hurt.  So, 3.6-rcX + local-wakeup-dbg-1.patch + powernowk8-fix.patch.

Thanks a lot.
Comment 27 Duncan 2012-09-19 04:39:01 UTC
Looking good so far, uptime saying 4:30, on powernowk8-fix.patch. =:^)
Comment 28 Duncan 2012-09-19 17:54:35 UTC
Still seems fine, even after running a task that burned enough cycles to trigger CPU frequency changes, for awhile.  So again, not 100% sure yet, but maybe 98%, so the odds are pretty good.

Thanks. =:^)
Comment 29 Tejun Heo 2012-09-19 17:58:41 UTC
Pull request sent.  Thanks a lot for the detailed reporting and testing. :)

Resolving as CODE_FIX.
Comment 30 Florian Mickler 2012-10-15 20:44:56 UTC
A patch referencing this bug report has been merged in Linux v3.6-rc7:

commit 6889125b8b4e09c5e53e6ecab3433bed1ce198c9
Author: Tejun Heo <tj@kernel.org>
Date:   Tue Sep 18 14:24:59 2012 -0700

    cpufreq/powernow-k8: workqueue user shouldn't migrate the kworker to another CPU
Comment 31 Florian Mickler 2012-11-05 23:14:50 UTC
A patch referencing a commit referencing this bug report has been merged in Linux v3.7-rc3:

commit e4df1cbcc1f329e53a1fff7450b2229e0addff20
Author: Andreas Herrmann <andreas.herrmann3@amd.com>
Date:   Tue Oct 23 00:55:10 2012 +0200

    cpufreq / powernow-k8: Remove usage of smp_processor_id() in preemptible code

Note You need to log in before you can comment on or make changes to this bug.