Bug 51701

Summary: Workqueue: issue with singlethreaded self-freeing queues
Product: Process Management
Reporter: Andrey Isakov (andy51)
Component: Other
Assignee: Tejun Heo (tj)
Status: RESOLVED CODE_FIX
Severity: normal
CC: florian, tj
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.39.4
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments:
  A module to display the testcase
  Some logs with the testmodule+workqueue_debug
  Just adds a printk at the POI
  0001-workqueue-consider-work-function-when-searching-for-.patch

Description Andrey Isakov 2012-12-13 13:23:51 UTC
Created attachment 89121
A module to display the testcase

It seems the cmwq (concurrency-managed workqueue) code does not handle self-freeing work items well, especially when they also block.

My scenario follows:

1. Create two single-threaded workqueues (WQ1 and WQ2).
2. Enqueue a kmalloc-ed work item A on WQ1 that frees itself and [optionally] blocks on something.
3. Start queuing a bunch of work items (B, C, D, ...) on WQ2, allocating each one.
4. One of B, C, or D gets allocated in the block reused from A, so it gets the same address as A. E.g. let B = A.
5. Now cmwq tries to serialize A and B (and the following C and D, because WQ2 is single-threaded).

This causes unnecessary serialization of work items that were not supposed to be serialized. And if one of B, C, or D is supposed to unblock A, we get a deadlock.

The problem is that the cmwq code assumes that two work items are the same if they have the same address, but this is not necessarily true.
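The collision described above can be illustrated with a minimal userspace sketch. It is not the kernel code: the structures, the lookup helper `find_busy_by_address()`, and the use of one static slot to stand in for the allocator reusing A's block are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct work_item {
    void (*func)(struct work_item *);
};

struct worker {
    struct work_item *current_work; /* address of the item being executed */
};

static int a_runs, b_runs;
static void func_a(struct work_item *w) { (void)w; a_runs++; }
static void func_b(struct work_item *w) { (void)w; b_runs++; }

/* Address-only lookup: any item at a busy address counts as "the same work". */
static struct worker *find_busy_by_address(struct worker *workers, size_t n,
                                           const struct work_item *work)
{
    for (size_t i = 0; i < n; i++)
        if (workers[i].current_work == work)
            return &workers[i];
    return NULL;
}

/*
 * Replays steps 2-4 of the scenario: A starts running and frees itself, then
 * B is "allocated" at the same address. One static slot models the allocator
 * handing the freed block out again.
 */
static bool b_collides_with_a(void)
{
    static struct work_item slot;
    struct worker workers[1];

    struct work_item *a = &slot;
    a->func = func_a;
    workers[0].current_work = a; /* worker is still inside A's function, blocked */

    /* A has freed itself; the same block is now handed out for B. */
    struct work_item *b = &slot;
    b->func = func_b;
    assert(a == b); /* same address, logically different work item */

    /* The lookup reports B as already executing, so B waits behind A. */
    return find_busy_by_address(workers, 1, b) != NULL;
}
```

In this sketch `b_collides_with_a()` returns true: the address-only lookup cannot tell the recycled block from the item that is still running.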

I have made a test module to illustrate and reproduce the problem, but the reproducibility rate is only about 60-70% on my test system. There is also a debugging patch for workqueue that displays this scenario. Both are attached, along with test module logs that capture the problem.

I detected this bug on the 2.6.39.4 kernel (Android branch; the wq code is identical to the vanilla code, though). I haven't had an opportunity to check a newer kernel, but from the code it seems to affect all versions with cmwq as of now.
AFAIK, this case was not possible with the previous WQ code because each WQ had its own dedicated threads.
Comment 1 Andrey Isakov 2012-12-13 13:26:20 UTC
Created attachment 89131
Some logs with the testmodule+workqueue_debug
Comment 2 Andrey Isakov 2012-12-13 13:27:05 UTC
Created attachment 89141
Just adds a printk at the POI
Comment 3 Tejun Heo 2012-12-18 19:09:23 UTC
Created attachment 89441
0001-workqueue-consider-work-function-when-searching-for-.patch

Patch posted to lkml and applied to wq/for-3.9. While this is possible in theory, I don't think we actually have in-kernel users which can trigger this, and there hasn't been any indication of something like this happening in the whole time cmwq has been around.

The patch avoids interaction among unrelated work items but isn't complete. It can't be complete - there's no way to tell which work item instance the caller is interested in during e.g. flush_work() without hooking into the allocator. While incomplete, it limits the possibility of such deadlocks to workqueue (ab)users which create such a pseudo-dependency on themselves. That is highly unlikely, and even when it happens it should be relatively easy to pinpoint who's being silly.
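The idea behind the patch, as its title describes it, can be sketched in userspace as follows. This is not the merged kernel code: the structures, the `current_func` field name, and the helper names are hypothetical. The sketch matches a busy worker on both the item's address and its work function, and also shows the case that remains: a recycled block whose new item uses the same function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct work_item {
    void (*func)(struct work_item *);
};

struct worker {
    struct work_item *current_work;           /* address of the running item */
    void (*current_func)(struct work_item *); /* function recorded at start */
};

static int a_runs, b_runs;
static void func_a(struct work_item *w) { (void)w; a_runs++; }
static void func_b(struct work_item *w) { (void)w; b_runs++; }

/* A worker counts as busy with 'work' only if address AND function match. */
static struct worker *find_busy(struct worker *workers, size_t n,
                                const struct work_item *work)
{
    for (size_t i = 0; i < n; i++)
        if (workers[i].current_work == work &&
            workers[i].current_func == work->func)
            return &workers[i];
    return NULL;
}

/*
 * An item running old_func frees itself while still executing, and its block
 * is reused for an item running new_func. One static slot models the reuse.
 * Returns true if the new item is mistaken for the running one.
 */
static bool recycled_item_collides(void (*old_func)(struct work_item *),
                                   void (*new_func)(struct work_item *))
{
    static struct work_item slot;
    struct worker workers[1];

    assert(old_func && new_func);

    slot.func = old_func;
    workers[0].current_work = &slot;
    workers[0].current_func = old_func; /* worker is inside old_func */

    /* The old item freed itself; the block now holds the new item. */
    slot.func = new_func;

    return find_busy(workers, 1, &slot) != NULL;
}
```

Under these assumptions `recycled_item_collides(func_a, func_b)` is false, so unrelated items no longer interact, while `recycled_item_collides(func_a, func_a)` is still true, which is the self-inflicted pseudo-dependency left uncovered.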

Thanks.
Comment 4 Tejun Heo 2012-12-18 19:09:47 UTC
Resolving as CODE_FIX. Thank you.
Comment 5 Florian Mickler 2013-03-04 22:51:42 UTC
A patch referencing this bug report has been merged in Linux v3.9-rc1:

commit a2c1c57be8d9fd5b716113c8991d3d702eeacf77
Author: Tejun Heo <tj@kernel.org>
Date:   Tue Dec 18 10:35:02 2012 -0800

    workqueue: consider work function when searching for busy work items