Bug 13224
Summary: | iwl3945 -- INFO: possible irq lock inversion dependency detected | ||
---|---|---|---|
Product: | Networking | Reporter: | Miles Lane (miles.lane) |
Component: | Wireless | Assignee: | networking_wireless (networking_wireless) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | a.p.zijlstra, johannes, linville, reinette.chatre |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.30-rc4 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
New INFO text from kernel with different build options and patch applied.
Requested source code do not cancel delayed work in interrupt context dmesg output |
Description
Miles Lane
2009-05-02 20:22:30 UTC
This looks like a false positive. I'll handle this. Miles, if you can easily reproduce could you try this? (also on http://paste.pocoo.org/raw/115699/) --- kernel/timer.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) --- wireless-testing.orig/kernel/timer.c 2009-05-04 17:25:53.000000000 +0200 +++ wireless-testing/kernel/timer.c 2009-05-04 17:27:31.000000000 +0200 @@ -937,14 +937,18 @@ static inline void __run_timers(struct t */ struct lockdep_map lockdep_map = timer->lockdep_map; -#endif + /* * Couple the lock chain with the lock chain at * del_timer_sync() by acquiring the lock_map * around the fn() call here and in - * del_timer_sync(). + * del_timer_sync(). We need to disable IRQs + * to avoid false irq reordering warnings. */ + local_irq_disable(); lock_map_acquire(&lockdep_map); + local_irq_enable(); +#endif fn(data); I am currently running a build with your patch applied (2.6.30-rc4-git1). With the previous kernel, I noticed that the INFO message would reliably occur after ten or fifteen minutes of booting. Today, I thought I would try suspend/resume before the INFO message appeared. This triggered a "possible circular locking dependency" warning, also related to the iwl3945 code. Should I open a separate bug report for this, or would you like me to put the new INFO message here? Please open a separate bug report, I'll mark as duplicate or so if applicable, but I doubt it, and then I would like the driver folks to take a look at it. Also, please attach the log snippet as a text file instead of pasting it so that it doesn't get line-wrapped. I have opened a new report: http://bugzilla.kernel.org/show_bug.cgi?id=13245 Created attachment 21220 [details]
New INFO text from kernel with different build options and patch applied.
Thanks, I'll take a closer look later. Seems my initial thought about the bug was wrong. Alright, I did -- in the first version of the trace this happens: A) cancel_timer_sync: local_irq_save() lock_map_acquire(fake timer lock) B) run_softirq: lock_map_acquire(fake timer lock) And due to dependencies with priv->lock we get into the ordering issue. In the second version, we have this: A) cancel_timer_sync: local_irq_save() lock_map_acquire(fake timer lock) B) run_softirq: local_irq_disable() lock_map_acquire(fake timer lock) local_irq_enable() and we get bitten by trace_hardirqs_on() (called from local_irq_enable) which actually marks the currently held locks, which most definitely is correct but we thought we could trick it. I see one other option, but that needs a few more thoughts first. Scratch that other option. Miles, are you running exactly mainline? The report isn't making sense wrt. the code in drivers/net/wireless/iwlwifi/iwl3945-base.c, maybe you could attach that file here? Created attachment 21231 [details]
Requested source code
Hmm. I only have the patch you sent earlier applied onto 2.6.30-rc4-git1.
Thanks. I think I've got it. But some preliminary info first. the second lock's dependencies: -> (&priv->lock){-.-...} <key> -> (pci_lock){......} ops: 9888 { -> (pci_config_lock){......} ops: 9990 { -> (&rxq->lock){..-...} ops: 30981 { -> (&n->list_lock){..-...} ops: 439192 { -> (&zone->lock){..-...} ops: 50454 { -> (&base->lock){-.-...} ops: 526346 { -> (&q->lock){-.-.-.} ops: 4928318 { -> (&rq->lock){-.-.-.} ops: 2158037 { -> (&vec->lock){-.-...} ops: 4344 { -> (&rt_b->rt_runtime_lock){-.-...} ops: 482 { -> (&cpu_base->lock){-.-...} ops: 1793557 { -> (&rt_rq->rt_runtime_lock){-.-...} ops: 2710 { -> (&sig->cputimer.lock){-.-...} ops: 20901 { -> (&rq->lock/1){..-...} ops: 2084 { -> (&ep->lock){..-...} ops: 103 { -> (&cwq->lock){-.-...} ops: 65304 { -> (&priv->hcmd_lock){..-...} ops: 167 { -> (&list->lock#6){..-...} ops: 32394 { -> (&(&priv->scan_check)->timer){+.-...} ops: 19 { Now, the normal irq deadlock scenario with just two locks (I hope it won't wrap): phase CPU 1 CPU 2 --------------------------------------------- somewhere lock-irq(A) lock(B) unlock(B) unlock(A) state change lock(B) unlock(B) irq handler lock(A) unlock(A) deadlock lock-irq(A) lock(B) IRQ -----> lock(A) lock(B) Let's try that again phase CPU 1 CPU 2 ------------------------------------- somewhere lock-irq(A) lock(B) unlock(B) unlock(A) state change lock(B) unlock(B) irq handler lock(A) unlock(A) deadlock lock-irq(A) lock(B) IRQ -----> lock(A) lock(B) Now, with timer instead of lock B. Feel free to ignore "lock" and "unlock" bits, they are just how lockdep knows about this, not a real lock. A - lock T - timer phase CPU 1 CPU 2 --------------------------------------------- some place that calls cancel_timer_sync() (which is the | code) lock-irq(A) | "lock-irq"(T) | "unlock"(T) | wait(T) unlock(A) timer softirq "lock"(T) run(T) "unlock"(T) irq handler lock(A) unlock(A) Now all that again, interleaved, leading to deadlock: lock-irq(A) "lock"(T) run(T) IRQ during or maybe before run(T) --> lock(A) "lock-irq"(T) wait(T) What does it all mean? Bottom line is: you can never use cancel_timer_sync (or cancel_delayed_work[_sync] for that matter) within a spin_lock_irqsave() section! Why? Because an interrupt could come in there after spin_lock_irqsave(A) and while run(T) on CPU 1 and do spin_lock(A), but CPU 2 in this picture can never finish the critical section locked with A because it's waiting for T. As such, it is a bug in iwl3945 after all. Ah, I see the problem. iwl-scan.c: priv->rx_handlers[SCAN_COMPLETE_NOTIFICATION] = iwl_rx_scan_complete_notif; iwl_rx_scan_complete_notif: cancel_delayed_work(&priv->scan_check); iwl_irq_tasklet: spin_lock_irqsave(&priv->lock, flags); ... iwl_rx_handle ... iwl_rx_handle: ... priv->rx_handlers[pkt->hdr.cmd] (priv, rxb); ... The bug gets only reported by lockdep once the timer fires though. Created attachment 21238 [details]
do not cancel delayed work in interrupt context
Miles,
Could you please test this patch?
Thank you
Looks good here. Thanks! Hmm. Did this patch make it into the Linus tree? I just hit this again in 2.6.30-rc6-git1. I will attach the stack trace. Created attachment 21383 [details]
dmesg output
Miles, the patch was posted to linux-wireless on 15 May. I'll try to get it queued to Linus shortly. This patch can be found in Linus's 2.6.30-rc8. Can this bug be closed now? Thanks |