Bug 11264

Summary: Invalid op opcode in kernel/workqueue
Product: Process Management Reporter: Jean-Luc Coulon (jean.luc.coulon)
Component: SchedulerAssignee: Ingo Molnar (mingo)
Status: CLOSED CODE_FIX    
Severity: normal CC: oleg, venki
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.27-rc2 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 11167    
Attachments: Syslog trace

Description Jean-Luc Coulon 2008-08-07 04:18:38 UTC
Latest working kernel version: nver seen this bug before
Earliest failing kernel version: 2.6.27-rc2
Distribution: Debian Sid
Hardware Environment: athlon64 x2 4200
Software Environment: X86_64
Problem Description: Message from syslog

Steps to reproduce: dunno. This was during high load (compiling the latest 2.6.27-rc2-git1 kernel). 

There was no task failure as far as I can see.

I got :
Aug  7 13:04:57 tangerine kernel: kernel BUG at kernel/workqueue.c:240!
Aug  7 13:04:57 tangerine kernel: invalid opcode: 0000 [1] PREEMPT SMP 
Aug  7 13:04:57 tangerine kernel: CPU 0 


I will attach the whole syslog trace.

Regards

Jean-Luc
Comment 1 Jean-Luc Coulon 2008-08-07 04:20:35 UTC
Created attachment 17123 [details]
Syslog trace

I remarked the kernel is marked as tainted:

Aug  7 13:04:57 tangerine kernel: Pid: 4082, comm: kondemand/0 Tainted: G   M   
   2.6.27-rc2-k8-1 #1

It is *not* tainted. 

J-L
Comment 2 Oleg Nesterov 2008-08-08 06:11:44 UTC
On 08/07, bugme-daemon@bugzilla.kernel.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=11264
>
> kernel BUG at kernel/workqueue.c:240!

	232  int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
	233                          struct delayed_work *dwork, unsigned long delay)
	234  {
	235          int ret = 0;
	236          struct timer_list *timer = &dwork->timer;
	237          struct work_struct *work = &dwork->work;
	238
	239          if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
	240                  BUG_ON(timer_pending(timer));
	                            ^^^^^^^^^^^^^

> Process kondemand/0 (pid: 4082, threadinfo ffff88007db08000, task
> ffff88007e15a1c0)
> ...
> [<ffffffffa03ce79b>] ? do_dbs_timer+0x20b/0x2f0 [cpufreq_ondemand]
> [<ffffffffa03ce590>] ? do_dbs_timer+0x0/0x2f0 [cpufreq_ondemand]
> [<ffffffff8024c0a5>] ? run_workqueue+0x85/0x150
> [<ffffffff8024c20f>] ? worker_thread+0x9f/0x110

There are 2 do_dbs_timer()s, but this one runs on kondemand, so it lives
in drivers/cpufreq/cpufreq_ondemand.c

Oh. I don't understand this code, to the point I don't really know what
cpufreq is, but I'd like to believe the bug is here, not in workqueue.c ;)

In short, dbs_timer_exit() uses cancel_delayed_work(), it can't stop the
work if it requeus itself. Even if it doesn't, this helper needs the
subsequent flush_workqueue().

The code tries to solve this using dbs_info->enable which is protected
by per-cpu cpu_policy_rwsem, this flag is checked by do_dbs_timer() under
the same lock, but afaics we still have problems.

	// called under cpu_policy_rwsem
	__cpufreq_set_policy:

		__cpufreq_governor(CPUFREQ_GOV_STOP);
		__cpufreq_governor(CPUFREQ_GOV_START);

so we are doing

	dbs_timer_exit:
		dbs_info->enable = 0;
		cancel_delayed_work(&dbs_info->work);

	dbs_timer_init():
		dbs_info->enable = 1;
		INIT_DELAYED_WORK_DEFERRABLE(&dbs_info->work, do_dbs_timer);
		queue_delayed_work_on(dbs_info->work);
		
This looks wrong. Let's suppose that dbs_info->work->timer is not pending,
and dbs_info->work is queued. In that case cancel_delayed_work() does nothing.
So, INIT_DELAYED_WORK_DEFERRABLE() re-initializes the pending work, this
is BUG. But how it is possible to hit BUG_ON(timer_pending(timer)) above?

Afaics, this is very unlikely but possible. Let's suppose that run_workqueue()
has already dequeued dbs_info->work, and it is going to call work->func().

	run_workqueue:				dbs_timer_init:


	BUG_ON(get_wq_data(work) != cwq);
						INIT_DELAYED_WORK_DEFERRABLE();
						queue_delayed_work_on(dbs_info->work);
	work_clear_pending(work);

	work->func();

work->func() is dbs_timer_exit(), it waits for cpu_policy_rwsem, takes the
lock, and figures out that dbs_info->enable = 1 again. So it does
queue_delayed_work_on(dbs_info->work).

queue_delayed_work_on() successfully sets WORK_STRUCT_PENDING (it was cleared
by work_clear_pending(), and checks timer_pending(). And the timer is pending
because of the previous queue_delayed_work_on() from dbs_timer_init().

Does __cpufreq_driver_target() need cpu_policy_rwsem? If not, perhaps we can
do:

	--- drivers/cpufreq/cpufreq_ondemand.c
	+++ drivers/cpufreq/cpufreq_ondemand.c
	@@ -446,14 +446,6 @@ static void do_dbs_timer(struct work_str
	 
		delay -= jiffies % delay;
	 
	-	if (lock_policy_rwsem_write(cpu) < 0)
	-		return;
	-
	-	if (!dbs_info->enable) {
	-		unlock_policy_rwsem_write(cpu);
	-		return;
	-	}
	-
		/* Common NORMAL_SAMPLE setup */
		dbs_info->sample_type = DBS_NORMAL_SAMPLE;
		if (!dbs_tuners_ins.powersave_bias ||
	@@ -470,7 +462,6 @@ static void do_dbs_timer(struct work_str
						CPUFREQ_RELATION_H);
		}
		queue_delayed_work_on(cpu, kondemand_wq, &dbs_info->work, delay);
	-	unlock_policy_rwsem_write(cpu);
	 }
	 
	 static inline void dbs_timer_init(struct cpu_dbs_info_s *dbs_info)
	@@ -490,7 +481,7 @@ static inline void dbs_timer_init(struct
	 static inline void dbs_timer_exit(struct cpu_dbs_info_s *dbs_info)
	 {
		dbs_info->enable = 0;
	-	cancel_delayed_work(&dbs_info->work);
	+	cancel_delayed_work_sync(&dbs_info->work);
	 }
	 
	 static int cpufreq_governor_dbs(struct cpufreq_policy *policy,

Or, perhaps, something like

	--- drivers/cpufreq/cpufreq_ondemand.c
	+++ drivers/cpufreq/cpufreq_ondemand.c
	@@ -482,7 +482,8 @@ static inline void dbs_timer_init(struct
		dbs_info->enable = 1;
		ondemand_powersave_bias_init();
		dbs_info->sample_type = DBS_NORMAL_SAMPLE;
	-	INIT_DELAYED_WORK_DEFERRABLE(&dbs_info->work, do_dbs_timer);
	+	if (!dbs_info->work->work->func)
	+		INIT_DELAYED_WORK_DEFERRABLE(&dbs_info->work, do_dbs_timer);
		queue_delayed_work_on(dbs_info->cpu, kondemand_wq, &dbs_info->work,
				      delay);
	 }
	@@ -600,6 +601,8 @@ static int __init cpufreq_gov_dbs_init(v
	 static void __exit cpufreq_gov_dbs_exit(void)
	 {
		cpufreq_unregister_governor(&cpufreq_gov_ondemand);
	+	for_each_dbs_info(dbs_info)
	+		cancel_work_sync(&dbs_info->work);
		destroy_workqueue(kondemand_wq);
	 }
	 
Oleg.
Comment 3 Jean-Luc Coulon 2008-08-08 12:52:33 UTC
>> to the point I don't really know what cpufreq is

cpufreq is part of the system power managment. There is cpufreq daemon in the userspace who allows to adjust CPU frequency and voltage according to machine load and policy declared. On amd64, it relies on powernow_k8 module.
"ondemand" can be a policy for cpufreq management. But on my system the defaut policy is "userspace".

J-L
Comment 4 Oleg Nesterov 2008-08-09 05:31:05 UTC
On 08/08, bugme-daemon@bugzilla.kernel.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=11264
> 
> ------- Comment #3 from jean.luc.coulon@gmail.com  2008-08-08 12:52 -------
> >> to the point I don't really know what cpufreq is
> 
> cpufreq is part of the system power managment. There is cpufreq daemon in the
> userspace who allows to adjust CPU frequency and voltage according to machine
> load and policy declared. On amd64, it relies on powernow_k8 module.

OK, thanks.

> "ondemand" can be a policy for cpufreq management. But on my system the
> defaut
> policy is "userspace".

cpufreq_userspace.c doesn't use workqueues at all, and please note that
the trace has "Process kondemand/0". This means that "ondemand" must be
involved.

Perhaps we have another bug, but I believe cpufreq_ondemand.c is buggy too.
dbs_timer_init()->INIT_DELAYED_WORK() can rewrite the pending work. But yes,
it is very unlikely to hit this bug.

So. As I said, I don't understand this code, but I'd suggest something
like this completely untested patch for now,

--- DBS/drivers/cpufreq/cpufreq_ondemand.c~1_DBS_TIMER	2008-07-30 13:12:42.000000000 +0400
+++ DBS/drivers/cpufreq/cpufreq_ondemand.c	2008-08-09 16:17:07.000000000 +0400
@@ -482,7 +482,6 @@ static inline void dbs_timer_init(struct
 	dbs_info->enable = 1;
 	ondemand_powersave_bias_init();
 	dbs_info->sample_type = DBS_NORMAL_SAMPLE;
-	INIT_DELAYED_WORK_DEFERRABLE(&dbs_info->work, do_dbs_timer);
 	queue_delayed_work_on(dbs_info->cpu, kondemand_wq, &dbs_info->work,
 	                      delay);
 }
@@ -589,6 +588,13 @@ EXPORT_SYMBOL(cpufreq_gov_ondemand);
 
 static int __init cpufreq_gov_dbs_init(void)
 {
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct cpu_dbs_info_s *dbs_info = &per_cpu(cpu_dbs_info, cpu);
+		INIT_DELAYED_WORK_DEFERRABLE(&dbs_info->work, do_dbs_timer);
+	}
+
 	kondemand_wq = create_workqueue("kondemand");
 	if (!kondemand_wq) {
 		printk(KERN_ERR "Creation of kondemand failed\n");
-------------------------------------------------------------------------------

It is not clear why cpufreq_ondemand.c creates its own kondemand_wq, it was
added in 2f8a835c705794f71726eb12c06fb0f24fe07ed3 and the changelog says
nothing about this.

Perhaps we can kill it?

--- DBS/drivers/cpufreq/cpufreq_ondemand.c~2_KILL_WQ	2008-08-09 16:17:07.000000000 +0400
+++ DBS/drivers/cpufreq/cpufreq_ondemand.c	2008-08-09 16:21:47.000000000 +0400
@@ -81,8 +81,6 @@ static unsigned int dbs_enable;	/* numbe
  */
 static DEFINE_MUTEX(dbs_mutex);
 
-static struct workqueue_struct	*kondemand_wq;
-
 static struct dbs_tuners {
 	unsigned int sampling_rate;
 	unsigned int up_threshold;
@@ -469,7 +467,7 @@ static void do_dbs_timer(struct work_str
 	                        	dbs_info->freq_lo,
 	                        	CPUFREQ_RELATION_H);
 	}
-	queue_delayed_work_on(cpu, kondemand_wq, &dbs_info->work, delay);
+	schedule_delayed_work_on(cpu, &dbs_info->work, delay);
 	unlock_policy_rwsem_write(cpu);
 }
 
@@ -482,8 +480,7 @@ static inline void dbs_timer_init(struct
 	dbs_info->enable = 1;
 	ondemand_powersave_bias_init();
 	dbs_info->sample_type = DBS_NORMAL_SAMPLE;
-	queue_delayed_work_on(dbs_info->cpu, kondemand_wq, &dbs_info->work,
-	                      delay);
+	schedule_delayed_work_on(dbs_info->cpu, &dbs_info->work, delay);
 }
 
 static inline void dbs_timer_exit(struct cpu_dbs_info_s *dbs_info)
@@ -595,18 +592,18 @@ static int __init cpufreq_gov_dbs_init(v
 		INIT_DELAYED_WORK_DEFERRABLE(&dbs_info->work, do_dbs_timer);
 	}
 
-	kondemand_wq = create_workqueue("kondemand");
-	if (!kondemand_wq) {
-		printk(KERN_ERR "Creation of kondemand failed\n");
-		return -EFAULT;
-	}
 	return cpufreq_register_governor(&cpufreq_gov_ondemand);
 }
 
 static void __exit cpufreq_gov_dbs_exit(void)
 {
+	int cpu;
+
 	cpufreq_unregister_governor(&cpufreq_gov_ondemand);
-	destroy_workqueue(kondemand_wq);
+	for_each_possible_cpu(cpu) {
+		struct cpu_dbs_info_s *dbs_info = &per_cpu(cpu_dbs_info, cpu);
+		flush_work(&dbs_info->work.work);
+	}
 }
 
 
-------------------------------------------------------------------------------

Sadly, this BUG is not reproducable...

Oleg.
Comment 5 Jean-Luc Coulon 2008-08-09 07:31:21 UTC
If I start the system in single user the kondemand processes (there are two, probably one for each core) are NOT present.
THey appear when I start cpufreqd (and load the related modules, loaded automatically but powernow-k8) :

powernow_k8            14228  1 
cpufreq_powersave       2304  0 
cpufreq_performance     2368  0 
cpufreq_ondemand        8272  1 
cpufreq_conservative     8200  0 
freq_table              5376  2 powernow_k8,cpufreq_ondemand
processor              39160  2 powernow_k8,thermal

The processes kondemand are not killeable. I tried to stop cpufreq and unload the modules before, I failed to unload powernow-k8. 

4109 ?        S<     0:00 [kondemand/0]
4111 ?        S<     0:00 [kondemand/1]

I've the same behaviour with 2.6.26.

J-L
Comment 6 Venkatesh Pallipadi 2008-08-09 07:54:13 UTC
Were you stressing anything related to cpufreq when you reproduced the problem. Like changing governors continuously or taking CPU offline online etc. Or was it just the kernel compilation workload.

init and exit should not be called multiple times unless governor is getting changed or things like that. In normal case, ondemand just creates a per CPU workqueue and requeues work from currently running work.
Comment 7 Jean-Luc Coulon 2008-08-09 11:20:19 UTC
Well, the CPU is constantly adjusting its parameters according to the load and to the temperature.

As it is quite (very) hot for the moment, I've a cpufreq rule to set the clock to 2GHz (or less) instead of 2.2GHz when the dye temperature exceeds 68C (the amd64 had diodes on the chip for temperature measurements).

Under continous high load, this can conduct to some kind of regulation around this point, the CPU switching back and forth 2.2GHz <-->  2GHz.

I didnt change the governor. I didnt switch one CPU offline.
Comment 8 Alistair Strachan 2008-08-24 04:59:37 UTC
Could be that the high CPU temperature is causing computational errors; 68C is not terrible, but it's a bit on the high side.

Google for 'prime95' and run two copies of it in Torture Test mode; use 'taskset' from schedutils to set the affinity of the two processes to one CPU only (respectively). If you can run this test without problems for 30 minutes (assuming the CPU reaches temperature), we could eliminate this possibility.
Comment 9 Jean-Luc Coulon 2008-08-24 05:25:09 UTC
This is the temperature given by the diode on the core itself.
The motherboard sensor gives always a temperature less than 60C.
68C is the maximum allowed by my cpufreq parameters. When the hotest core reachs this temperature, I lower the frequency (the Vcore as well is lowered).

BTW, I've not found any linux version of prime95...

If I use cpuburn (2 iinstances of cpuburnK7: I've not found any binary specifically designed for k8), the Core0 temperature stabilize around 68/69C and Core1 temperature around 60C). Core0 is always 4 to 8° over Core0. Motherboard temperature sensor is about 59C.

I've pretty the same results compiling a kernel with make -j4.
Comment 10 Alistair Strachan 2008-08-26 07:42:45 UTC
ftp://mersenne.org/gimps/mprime2414.tar.gz
Comment 11 Rafael J. Wysocki 2008-11-09 13:20:57 UTC
Notify-Also : Alistair Strachan <alistair@devzero.co.uk>
Notify-Also : Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Comment 12 Rafael J. Wysocki 2008-11-16 10:05:32 UTC
This issue appears to have been fixed.  Closing.