Bug 11264
Summary: | Invalid op opcode in kernel/workqueue | ||
---|---|---|---|
Product: | Process Management | Reporter: | Jean-Luc Coulon (jean.luc.coulon) |
Component: | Scheduler | Assignee: | Ingo Molnar (mingo) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | oleg, venki |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.27-rc2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 11167 | ||
Attachments: | Syslog trace |
Description
Jean-Luc Coulon
2008-08-07 04:18:38 UTC
Created attachment 17123 [details]
Syslog trace
I remarked the kernel is marked as tainted:
Aug 7 13:04:57 tangerine kernel: Pid: 4082, comm: kondemand/0 Tainted: G M
2.6.27-rc2-k8-1 #1
It is *not* tainted.
J-L
On 08/07, bugme-daemon@bugzilla.kernel.org wrote: > > http://bugzilla.kernel.org/show_bug.cgi?id=11264 > > kernel BUG at kernel/workqueue.c:240! 232 int queue_delayed_work_on(int cpu, struct workqueue_struct *wq, 233 struct delayed_work *dwork, unsigned long delay) 234 { 235 int ret = 0; 236 struct timer_list *timer = &dwork->timer; 237 struct work_struct *work = &dwork->work; 238 239 if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) { 240 BUG_ON(timer_pending(timer)); ^^^^^^^^^^^^^ > Process kondemand/0 (pid: 4082, threadinfo ffff88007db08000, task > ffff88007e15a1c0) > ... > [<ffffffffa03ce79b>] ? do_dbs_timer+0x20b/0x2f0 [cpufreq_ondemand] > [<ffffffffa03ce590>] ? do_dbs_timer+0x0/0x2f0 [cpufreq_ondemand] > [<ffffffff8024c0a5>] ? run_workqueue+0x85/0x150 > [<ffffffff8024c20f>] ? worker_thread+0x9f/0x110 There are 2 do_dbs_timer()s, but this one runs on kondemand, so it lives in drivers/cpufreq/cpufreq_ondemand.c Oh. I don't understand this code, to the point I don't really know what cpufreq is, but I'd like to believe the bug is here, not in workqueue.c ;) In short, dbs_timer_exit() uses cancel_delayed_work(), it can't stop the work if it requeus itself. Even if it doesn't, this helper needs the subsequent flush_workqueue(). The code tries to solve this using dbs_info->enable which is protected by per-cpu cpu_policy_rwsem, this flag is checked by do_dbs_timer() under the same lock, but afaics we still have problems. // called under cpu_policy_rwsem __cpufreq_set_policy: __cpufreq_governor(CPUFREQ_GOV_STOP); __cpufreq_governor(CPUFREQ_GOV_START); so we are doing dbs_timer_exit: dbs_info->enable = 0; cancel_delayed_work(&dbs_info->work); dbs_timer_init(): dbs_info->enable = 1; INIT_DELAYED_WORK_DEFERRABLE(&dbs_info->work, do_dbs_timer); queue_delayed_work_on(dbs_info->work); This looks wrong. Let's suppose that dbs_info->work->timer is not pending, and dbs_info->work is queued. In that case cancel_delayed_work() does nothing. So, INIT_DELAYED_WORK_DEFERRABLE() re-initializes the pending work, this is BUG. But how it is possible to hit BUG_ON(timer_pending(timer)) above? Afaics, this is very unlikely but possible. Let's suppose that run_workqueue() has already dequeued dbs_info->work, and it is going to call work->func(). run_workqueue: dbs_timer_init: BUG_ON(get_wq_data(work) != cwq); INIT_DELAYED_WORK_DEFERRABLE(); queue_delayed_work_on(dbs_info->work); work_clear_pending(work); work->func(); work->func() is dbs_timer_exit(), it waits for cpu_policy_rwsem, takes the lock, and figures out that dbs_info->enable = 1 again. So it does queue_delayed_work_on(dbs_info->work). queue_delayed_work_on() successfully sets WORK_STRUCT_PENDING (it was cleared by work_clear_pending(), and checks timer_pending(). And the timer is pending because of the previous queue_delayed_work_on() from dbs_timer_init(). Does __cpufreq_driver_target() need cpu_policy_rwsem? If not, perhaps we can do: --- drivers/cpufreq/cpufreq_ondemand.c +++ drivers/cpufreq/cpufreq_ondemand.c @@ -446,14 +446,6 @@ static void do_dbs_timer(struct work_str delay -= jiffies % delay; - if (lock_policy_rwsem_write(cpu) < 0) - return; - - if (!dbs_info->enable) { - unlock_policy_rwsem_write(cpu); - return; - } - /* Common NORMAL_SAMPLE setup */ dbs_info->sample_type = DBS_NORMAL_SAMPLE; if (!dbs_tuners_ins.powersave_bias || @@ -470,7 +462,6 @@ static void do_dbs_timer(struct work_str CPUFREQ_RELATION_H); } queue_delayed_work_on(cpu, kondemand_wq, &dbs_info->work, delay); - unlock_policy_rwsem_write(cpu); } static inline void dbs_timer_init(struct cpu_dbs_info_s *dbs_info) @@ -490,7 +481,7 @@ static inline void dbs_timer_init(struct static inline void dbs_timer_exit(struct cpu_dbs_info_s *dbs_info) { dbs_info->enable = 0; - cancel_delayed_work(&dbs_info->work); + cancel_delayed_work_sync(&dbs_info->work); } static int cpufreq_governor_dbs(struct cpufreq_policy *policy, Or, perhaps, something like --- drivers/cpufreq/cpufreq_ondemand.c +++ drivers/cpufreq/cpufreq_ondemand.c @@ -482,7 +482,8 @@ static inline void dbs_timer_init(struct dbs_info->enable = 1; ondemand_powersave_bias_init(); dbs_info->sample_type = DBS_NORMAL_SAMPLE; - INIT_DELAYED_WORK_DEFERRABLE(&dbs_info->work, do_dbs_timer); + if (!dbs_info->work->work->func) + INIT_DELAYED_WORK_DEFERRABLE(&dbs_info->work, do_dbs_timer); queue_delayed_work_on(dbs_info->cpu, kondemand_wq, &dbs_info->work, delay); } @@ -600,6 +601,8 @@ static int __init cpufreq_gov_dbs_init(v static void __exit cpufreq_gov_dbs_exit(void) { cpufreq_unregister_governor(&cpufreq_gov_ondemand); + for_each_dbs_info(dbs_info) + cancel_work_sync(&dbs_info->work); destroy_workqueue(kondemand_wq); } Oleg. >> to the point I don't really know what cpufreq is
cpufreq is part of the system power managment. There is cpufreq daemon in the userspace who allows to adjust CPU frequency and voltage according to machine load and policy declared. On amd64, it relies on powernow_k8 module.
"ondemand" can be a policy for cpufreq management. But on my system the defaut policy is "userspace".
J-L
On 08/08, bugme-daemon@bugzilla.kernel.org wrote: > > http://bugzilla.kernel.org/show_bug.cgi?id=11264 > > ------- Comment #3 from jean.luc.coulon@gmail.com 2008-08-08 12:52 ------- > >> to the point I don't really know what cpufreq is > > cpufreq is part of the system power managment. There is cpufreq daemon in the > userspace who allows to adjust CPU frequency and voltage according to machine > load and policy declared. On amd64, it relies on powernow_k8 module. OK, thanks. > "ondemand" can be a policy for cpufreq management. But on my system the > defaut > policy is "userspace". cpufreq_userspace.c doesn't use workqueues at all, and please note that the trace has "Process kondemand/0". This means that "ondemand" must be involved. Perhaps we have another bug, but I believe cpufreq_ondemand.c is buggy too. dbs_timer_init()->INIT_DELAYED_WORK() can rewrite the pending work. But yes, it is very unlikely to hit this bug. So. As I said, I don't understand this code, but I'd suggest something like this completely untested patch for now, --- DBS/drivers/cpufreq/cpufreq_ondemand.c~1_DBS_TIMER 2008-07-30 13:12:42.000000000 +0400 +++ DBS/drivers/cpufreq/cpufreq_ondemand.c 2008-08-09 16:17:07.000000000 +0400 @@ -482,7 +482,6 @@ static inline void dbs_timer_init(struct dbs_info->enable = 1; ondemand_powersave_bias_init(); dbs_info->sample_type = DBS_NORMAL_SAMPLE; - INIT_DELAYED_WORK_DEFERRABLE(&dbs_info->work, do_dbs_timer); queue_delayed_work_on(dbs_info->cpu, kondemand_wq, &dbs_info->work, delay); } @@ -589,6 +588,13 @@ EXPORT_SYMBOL(cpufreq_gov_ondemand); static int __init cpufreq_gov_dbs_init(void) { + int cpu; + + for_each_possible_cpu(cpu) { + struct cpu_dbs_info_s *dbs_info = &per_cpu(cpu_dbs_info, cpu); + INIT_DELAYED_WORK_DEFERRABLE(&dbs_info->work, do_dbs_timer); + } + kondemand_wq = create_workqueue("kondemand"); if (!kondemand_wq) { printk(KERN_ERR "Creation of kondemand failed\n"); ------------------------------------------------------------------------------- It is not clear why cpufreq_ondemand.c creates its own kondemand_wq, it was added in 2f8a835c705794f71726eb12c06fb0f24fe07ed3 and the changelog says nothing about this. Perhaps we can kill it? --- DBS/drivers/cpufreq/cpufreq_ondemand.c~2_KILL_WQ 2008-08-09 16:17:07.000000000 +0400 +++ DBS/drivers/cpufreq/cpufreq_ondemand.c 2008-08-09 16:21:47.000000000 +0400 @@ -81,8 +81,6 @@ static unsigned int dbs_enable; /* numbe */ static DEFINE_MUTEX(dbs_mutex); -static struct workqueue_struct *kondemand_wq; - static struct dbs_tuners { unsigned int sampling_rate; unsigned int up_threshold; @@ -469,7 +467,7 @@ static void do_dbs_timer(struct work_str dbs_info->freq_lo, CPUFREQ_RELATION_H); } - queue_delayed_work_on(cpu, kondemand_wq, &dbs_info->work, delay); + schedule_delayed_work_on(cpu, &dbs_info->work, delay); unlock_policy_rwsem_write(cpu); } @@ -482,8 +480,7 @@ static inline void dbs_timer_init(struct dbs_info->enable = 1; ondemand_powersave_bias_init(); dbs_info->sample_type = DBS_NORMAL_SAMPLE; - queue_delayed_work_on(dbs_info->cpu, kondemand_wq, &dbs_info->work, - delay); + schedule_delayed_work_on(dbs_info->cpu, &dbs_info->work, delay); } static inline void dbs_timer_exit(struct cpu_dbs_info_s *dbs_info) @@ -595,18 +592,18 @@ static int __init cpufreq_gov_dbs_init(v INIT_DELAYED_WORK_DEFERRABLE(&dbs_info->work, do_dbs_timer); } - kondemand_wq = create_workqueue("kondemand"); - if (!kondemand_wq) { - printk(KERN_ERR "Creation of kondemand failed\n"); - return -EFAULT; - } return cpufreq_register_governor(&cpufreq_gov_ondemand); } static void __exit cpufreq_gov_dbs_exit(void) { + int cpu; + cpufreq_unregister_governor(&cpufreq_gov_ondemand); - destroy_workqueue(kondemand_wq); + for_each_possible_cpu(cpu) { + struct cpu_dbs_info_s *dbs_info = &per_cpu(cpu_dbs_info, cpu); + flush_work(&dbs_info->work.work); + } } ------------------------------------------------------------------------------- Sadly, this BUG is not reproducable... Oleg. If I start the system in single user the kondemand processes (there are two, probably one for each core) are NOT present. THey appear when I start cpufreqd (and load the related modules, loaded automatically but powernow-k8) : powernow_k8 14228 1 cpufreq_powersave 2304 0 cpufreq_performance 2368 0 cpufreq_ondemand 8272 1 cpufreq_conservative 8200 0 freq_table 5376 2 powernow_k8,cpufreq_ondemand processor 39160 2 powernow_k8,thermal The processes kondemand are not killeable. I tried to stop cpufreq and unload the modules before, I failed to unload powernow-k8. 4109 ? S< 0:00 [kondemand/0] 4111 ? S< 0:00 [kondemand/1] I've the same behaviour with 2.6.26. J-L Were you stressing anything related to cpufreq when you reproduced the problem. Like changing governors continuously or taking CPU offline online etc. Or was it just the kernel compilation workload. init and exit should not be called multiple times unless governor is getting changed or things like that. In normal case, ondemand just creates a per CPU workqueue and requeues work from currently running work. Well, the CPU is constantly adjusting its parameters according to the load and to the temperature. As it is quite (very) hot for the moment, I've a cpufreq rule to set the clock to 2GHz (or less) instead of 2.2GHz when the dye temperature exceeds 68C (the amd64 had diodes on the chip for temperature measurements). Under continous high load, this can conduct to some kind of regulation around this point, the CPU switching back and forth 2.2GHz <--> 2GHz. I didnt change the governor. I didnt switch one CPU offline. Could be that the high CPU temperature is causing computational errors; 68C is not terrible, but it's a bit on the high side. Google for 'prime95' and run two copies of it in Torture Test mode; use 'taskset' from schedutils to set the affinity of the two processes to one CPU only (respectively). If you can run this test without problems for 30 minutes (assuming the CPU reaches temperature), we could eliminate this possibility. This is the temperature given by the diode on the core itself. The motherboard sensor gives always a temperature less than 60C. 68C is the maximum allowed by my cpufreq parameters. When the hotest core reachs this temperature, I lower the frequency (the Vcore as well is lowered). BTW, I've not found any linux version of prime95... If I use cpuburn (2 iinstances of cpuburnK7: I've not found any binary specifically designed for k8), the Core0 temperature stabilize around 68/69C and Core1 temperature around 60C). Core0 is always 4 to 8° over Core0. Motherboard temperature sensor is about 59C. I've pretty the same results compiling a kernel with make -j4. Notify-Also : Alistair Strachan <alistair@devzero.co.uk> Notify-Also : Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> This issue appears to have been fixed. Closing. |