Bug 8736
Summary: | New TC deadlock scenario | ||
---|---|---|---|
Product: | Networking | Reporter: | Ranko Zivojnovic (ranko) |
Component: | Other | Assignee: | Arnaldo Carvalho de Melo (acme) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | ||
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.22 | Subsystem: | |
Regression: | --- | Bisected commit-id: |
Description
Ranko Zivojnovic
2007-07-11 08:49:58 UTC
Subject: Re: [Bugme-new] New: New TC deadlock scenario On Wed, 11 Jul 2007 08:45:12 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=8736 > > Summary: New TC deadlock scenario > Product: Networking > Version: 2.5 > KernelVersion: 2.6.22 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Other > AssignedTo: acme@ghostprotocols.net > ReportedBy: ranko@spidernet.net > > > Most recent kernel where this bug did not occur: > Distribution: > Hardware Environment: > Software Environment: > Problem Description: > > Here is another scenario I bumped onto - qdisc_watchdog_cancel() and > qdisc_restart() deadlock. > > CPU#0 > qdisc_watchdog() fires and gets dev->queue_lock > qdisc_run()...qdisc_restart()... > -> releases dev->queue_lock and enters dev_hard_start_xmit() > > CPU#1 > tc del qdisc dev ... > qdisc_graft()...dev_graft_qdisc()...dev_deactivate()... > -> grabs dev->queue_lock ... > qdisc_reset()...{cbq,hfsc,htb,netem,tbf}_reset()...qdisc_watchdog_cancel()... > -> hrtimer_cancel() - waiting for the qdisc_watchdog() to exit, while still > holding dev->queue_lock > > CPU#0 > dev_hard_start_xmit() returns ... > -> wants to get dev->queue_lock(!) > > DEADLOCK! > > I did not manage to get a backtrace on qdisc_watchdog stack to show them both > but nevertheless - the above looks like the only way qdisc_watchdog_cancel > could be sitting there. > > Regards, > > Ranko > > ---cut--- > SysRq : Show Regs > > Pid: 12790, comm: tc > EIP: 0060:[<c02f13f9>] CPU: 1 > EIP is at _spin_unlock_irqrestore+0x36/0x39 > EFLAGS: 00000282 Not tainted (2.6.22.SNET.Thors.htbpatch.6.lockdebug #1) > EAX: 00000000 EBX: c1d119c0 ECX: 00000000 EDX: 00000000 > ESI: 00000282 EDI: c1d11a18 EBP: 00000000 DS: 007b ES: 007b FS: 00d8 > CR0: 80050033 CR2: 008ba828 CR3: 20dc2000 CR4: 000006d0 > [<c0132ff4>] hrtimer_try_to_cancel+0x33/0x66 > [<c013007b>] kthread+0xf/0x57 > [<c0133035>] hrtimer_cancel+0xe/0x14 > [<c029daaf>] qdisc_watchdog_cancel+0x8/0x11 > [<f8b8541d>] htb_reset+0x9c/0x14b [sch_htb] > [<c029c2ad>] qdisc_reset+0x10/0x11 > [<c029c3e7>] dev_deactivate+0x27/0xa5 > [<c029d7a6>] dev_graft_qdisc+0x81/0xa5 > [<c029d7f2>] qdisc_graft+0x28/0x88 > [<c029df93>] tc_get_qdisc+0x15d/0x1e9 > [<c029de36>] tc_get_qdisc+0x0/0x1e9 > [<c0297038>] rtnetlink_rcv_msg+0x1c2/0x1f5 > [<c02a1834>] netlink_run_queue+0x96/0xfd > [<c0296e76>] rtnetlink_rcv_msg+0x0/0x1f5 > [<c0296e28>] rtnetlink_rcv+0x26/0x42 > [<c02a1d5c>] netlink_data_ready+0x12/0x54 > [<c02a0abe>] netlink_sendskb+0x1c/0x33 > [<c02a1c6b>] netlink_sendmsg+0x1f3/0x2d2 > [<c0284d06>] sock_sendmsg+0xe2/0xfd > [<c013030d>] autoremove_wake_function+0x0/0x37 > [<c013030d>] autoremove_wake_function+0x0/0x37 > [<c01d79c3>] copy_from_user+0x2d/0x59 > [<c0284e4e>] sys_sendmsg+0x12d/0x243 > [<c02f12fd>] _read_unlock_irq+0x20/0x23 > [<c013a49e>] trace_hardirqs_on+0xac/0x149 > [<c0148c78>] find_get_page+0x11/0x49 > [<c015585e>] __handle_mm_fault+0x19d/0x947 > [<c02f0f87>] _spin_unlock+0x14/0x1c > [<c01558e5>] __handle_mm_fault+0x224/0x947 > [<c028608a>] sys_socketcall+0x24f/0x271 > [<c0103370>] restore_nocheck+0x12/0x15 > [<c010329e>] sysenter_past_esp+0x5f/0x99 > ======================= > Subject: Re: [Bugme-new] New: New TC deadlock scenario
David Miller wrote:
> From: Ranko Zivojnovic <ranko@spidernet.net>
> Date: Sat, 14 Jul 2007 19:45:35 +0300
>
>
>>>>Please try reverting commit 1936502d00ae6c2aa3931c42f6cf54afaba094f2,
>>>>that should fix it.
>>>
>>>
>>>Ranko, did you get a chance to test this? I've attached the patch
>>>since it doesn't revert cleanly ..
>>
>>I have the commit reverted on my tree and have been testing it along
>>with the other changes to qdisc code I'm currently doing and it does
>>rectify the problem.
>>
>>Acked-by: Ranko Zivojnovic <ranko@spidernet.net>
>
>
> Applied, thanks everyone.
This one is a regression in 2.6.22, so the patch should also go
in -stable IMO.
Subject: Re: [Bugme-new] New: New TC deadlock scenario Patrick McHardy wrote: > Andrew Morton wrote: > >>>http://bugzilla.kernel.org/show_bug.cgi?id=8736 >>> >>>Here is another scenario I bumped onto - qdisc_watchdog_cancel() and >>>qdisc_restart() deadlock. >>> >>>[...] >>>DEADLOCK! > > > > Good catch. > > Please try reverting commit 1936502d00ae6c2aa3931c42f6cf54afaba094f2, > that should fix it. Ranko, did you get a chance to test this? I've attached the patch since it doesn't revert cleanly .. [NET_SCHED]: Revert "avoid transmit softirq on watchdog wakeup" optimization As noticed by Ranko Zivojnovic <ranko@spidernet.net>, calling qdisc_run from the timer handler can result in deadlock: > CPU#0 > > qdisc_watchdog() fires and gets dev->queue_lock > qdisc_run()...qdisc_restart()... > -> releases dev->queue_lock and enters dev_hard_start_xmit() > > CPU#1 > > tc del qdisc dev ... > qdisc_graft()...dev_graft_qdisc()...dev_deactivate()... > -> grabs dev->queue_lock ... > > qdisc_reset()...{cbq,hfsc,htb,netem,tbf}_reset()...qdisc_watchdog_cancel()... > -> hrtimer_cancel() - waiting for the qdisc_watchdog() to exit, while still > holding dev->queue_lock > > CPU#0 > > dev_hard_start_xmit() returns ... > -> wants to get dev->queue_lock(!) > > DEADLOCK! The entire optimization is a bit questionable IMO, it moves potentially large parts of NET_TX_SOFTIRQ work to TIMER_SOFTIRQ/HRTIMER_SOFTIRQ, which kind of defeats the separation of them. Signed-off-by: Patrick McHardy <kaber@trash.net> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index d92ea26..4fd0bec 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -278,11 +278,7 @@ static enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer) wd->qdisc->flags &= ~TCQ_F_THROTTLED; smp_wmb(); - if (spin_trylock(&dev->queue_lock)) { - qdisc_run(dev); - spin_unlock(&dev->queue_lock); - } else - netif_schedule(dev); + netif_schedule(dev); return HRTIMER_NORESTART; } Reply-To: davem@davemloft.net From: Patrick McHardy <kaber@trash.net> Date: Sun, 15 Jul 2007 16:21:13 +0200 > This one is a regression in 2.6.22, so the patch should also go > in -stable IMO. I have it queued up for submission to -stable. Reply-To: davem@davemloft.net From: Ranko Zivojnovic <ranko@spidernet.net> Date: Sat, 14 Jul 2007 19:45:35 +0300 > On Sat, 2007-07-14 at 17:43 +0200, Patrick McHardy wrote: > > Patrick McHardy wrote: > > > Andrew Morton wrote: > > > > > >>>http://bugzilla.kernel.org/show_bug.cgi?id=8736 > > >>> > > >>>Here is another scenario I bumped onto - qdisc_watchdog_cancel() and > > >>>qdisc_restart() deadlock. > > >>> > > >>>[...] > > >>>DEADLOCK! > > > > > > > > > > > > Good catch. > > > > > > Please try reverting commit 1936502d00ae6c2aa3931c42f6cf54afaba094f2, > > > that should fix it. > > > > > > Ranko, did you get a chance to test this? I've attached the patch > > since it doesn't revert cleanly .. > > I have the commit reverted on my tree and have been testing it along > with the other changes to qdisc code I'm currently doing and it does > rectify the problem. > > Acked-by: Ranko Zivojnovic <ranko@spidernet.net> Applied, thanks everyone. On Sat, 2007-07-14 at 17:43 +0200, Patrick McHardy wrote: > Patrick McHardy wrote: > > Andrew Morton wrote: > > > >>>http://bugzilla.kernel.org/show_bug.cgi?id=8736 > >>> > >>>Here is another scenario I bumped onto - qdisc_watchdog_cancel() and > >>>qdisc_restart() deadlock. > >>> > >>>[...] > >>>DEADLOCK! > > > > > > > > Good catch. > > > > Please try reverting commit 1936502d00ae6c2aa3931c42f6cf54afaba094f2, > > that should fix it. > > > Ranko, did you get a chance to test this? I've attached the patch > since it doesn't revert cleanly .. I have the commit reverted on my tree and have been testing it along with the other changes to qdisc code I'm currently doing and it does rectify the problem. Acked-by: Ranko Zivojnovic <ranko@spidernet.net> Fixed for a long time now. Please close. |