Bug 8934
| Summary: | System freeze when restarting network connection with Broadcom driver | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Christian Casteyde (casteyde.christian) |
| Component: | network-wireless | Assignee: | drivers_network-wireless (drivers_network-wireless) |
| Status: | REJECTED WILL_NOT_FIX | | |
| Severity: | normal | CC: | akpm, Larry.Finger, mb, mchan |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.22.5 | Subsystem: | |
| Regression: | --- | Bisected commit-id: | |
| Attachments: | .config file for 2.6.23-rc4 | | |
Description
Christian Casteyde
2007-08-24 15:35:21 UTC
> It would be useful if you could tell us whether 2.6.22 is working OK. Because if it is, this gets escalated to a post-2.6.22 regression and quite a lot more attention will be paid to it. Thanks.

OK, I've rebuilt 2.6.22.5 (without CFS and lock validation code) and it doesn't have the problem. So it is a 2.6.23-rc? regression, or at least very dependent on timing. I'm trying to see at which rc the problem appeared (with SLOB only, since pre-rc3 crashes on my laptop). I'm posting the results in a few minutes.

Some more info: the system is not completely frozen; only the ifconfig command, the keyboard, and any process that tries to access the console hang. By the way, I managed to capture the console error simply by doing the test under X11 and trying to reboot from the desktop menu: this switches to the console, but gpm fails to reopen the console and logs these two lines in syslog:

    Aug 25 00:42:55 athor /usr/sbin/gpm[3105]: O0o.oops(): [console.c(83)]:
    Aug 25 00:42:55 athor /usr/sbin/gpm[3105]: Opening console failed.

I did the test of dumping dmesg asynchronously: it worked fine even while the bug was triggered (so all disk I/O seems to keep working), but dmesg has nothing to tell. As syslog works too (gpm managed to log something), I'm sure there is no more information in it either.

Finally, I tried to reproduce the problem with the small script I gave in #1: I didn't manage to trigger it. These are really the commands used by the network init script, but there are some more commands in between (ps, grep, cut, etc.; they all work), so I think the timing is really important: the bug does not occur with a 3-second sleep, and it also does not occur if ifconfig down is done immediately after iwconfig. Maybe the bug triggers only when association with "any" has started. I'm compelled to keep using the network script to reproduce it, and I certainly only get the right timing on my laptop, with the given kernel configuration. This last thought makes me suppose that kernel timing matters as well as the CPU scheduler, since it is obviously a race condition even in -rc3. So if I don't find the smallest rc that has the problem, I'll have to put CFS into 2.6.22 and turn on deadlock validation to be in the same environment.

Well, this is a 2.6.23-rc3 regression indeed. Kernels 2.6.22.5, 2.6.22.5-cfs20, 2.6.23-rc1 and 2.6.23-rc2 are all OK, so this bug is quite recent. I'll stop regression testing for now, but you can ask me to test smaller patches if you want. There seem to have been few differences between rc2 and rc3 in the bcm wireless driver, but there were huge modifications in core net. Nevertheless, there seems to be a big problem in the bcm driver anyway: please see also http://bugzilla.kernel.org/show_bug.cgi?id=8937. I entered it because I thought it was independent of this bug, but now I'm not so sure. It seems I got a kernel panic in workqueue.c while trying to do the ifconfig eth1 down, even in 2.6.22.5-cfs20. However, I don't think this is the same bug, because:

a. The behaviour is different (here, only the console is killed; in 8937 the system crashed).
b. I can reproduce this one systematically with -rc3 (9 out of 10 network script invocations), but almost never in previous rc's or in 2.6.22.* (I restarted the network script at least 10 times with each kernel without any problem).
Reply-To: akpm@linux-foundation.org

On Sat, 25 Aug 2007 03:26:24 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=8934

Various nasty box-killing things are happening here, and http://bugzilla.kernel.org/show_bug.cgi?id=8937 might be related. Michal, 8934 (at least) is a post-2.6.22 regression - can you please add it to the list?

The only change to bcm43xx in 2.6.23-rcX reverts the result of a typo in the specifications. The only interface affected by this typo is an early-model BCM4306 - it stops functioning. It is extremely unlikely that it has anything to do with this. There are some changes in softmac, however, that might be involved. The specific examples are:

- Commit fdc8f43b5e49b64b251bb48da95193a13ac0132f, "Fix deadlock of wx_set_essid with assoc work", on July 31. Of all the changes, this is the most likely candidate as it messes with the work queues.
- Commit 4cf92a3cd9117cb90ad9441172b9f42b67ee9d26, "Fix ESSID problem", on July 9. This one seems unlikely to cause the problem.
- Commit 67c4f7aa9e64d37f32eb44d6d093b7028f1060bb, "use list_for_each_entry", on May 27. I think this is in 2.6.22.

Christian - is it possible for you to test 2.6.23-rc3 or -rc4 with fdc8f43b5e4 reverted?

Larry beat me to it -- I agree that this looks like the most likely culprit.

    commit fdc8f43b5e49b64b251bb48da95193a13ac0132f
    Author: Michael Buesch <mb@bu3sch.de>
    Date:   Tue Jul 31 20:41:04 2007 +0200

        [PATCH] softmac: Fix deadlock of wx_set_essid with assoc work

        The essid wireless extension does deadlock against the assoc mutex,
        as we don't unlock the assoc mutex when flushing the workqueue,
        which also holds the lock.

        Signed-off-by: Michael Buesch <mb@bu3sch.de>
        Signed-off-by: John W. Linville <linville@tuxdriver.com>

It is a fix. It fixes an "it will 100% always deadlock" to an "it might, unlikely, deadlock".
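For context, here is a minimal sketch of the deadlock pattern the commit message describes: flushing the shared workqueue while holding a mutex that a queued work item also needs. The module, function and variable names here are hypothetical, for illustration only; this is not the actual softmac code.

```c
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(assoc_mutex);
static struct work_struct assoc_work;

static void assoc_work_fn(struct work_struct *work)
{
        mutex_lock(&assoc_mutex);       /* blocks: the flushing thread owns it */
        /* ... association handling would go here ... */
        mutex_unlock(&assoc_mutex);
}

static void set_essid_like_path(void)
{
        mutex_lock(&assoc_mutex);
        /*
         * flush_scheduled_work() waits for assoc_work_fn() to finish,
         * but assoc_work_fn() is waiting for assoc_mutex, which is held
         * right here, so neither side can make progress.  The fix is to
         * drop the mutex (or cancel the work) before flushing.
         */
        flush_scheduled_work();
        mutex_unlock(&assoc_mutex);
}

static int __init deadlock_demo_init(void)
{
        INIT_WORK(&assoc_work, assoc_work_fn);
        schedule_work(&assoc_work);
        set_essid_like_path();          /* deadlocks if the work has not yet run */
        return 0;
}
module_init(deadlock_demo_init);
MODULE_LICENSE("GPL");
```

With this pattern the flush can only finish once the work item runs, and the work item can only run once the mutex is released, which is exactly the "100% always deadlock" situation the commit turned into a merely unlikely one.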
I have bad news and good news. First, the bad: I reverted fdc8f43b5e49b64b251bb48da95193a13ac0132f manually (the patch was rejected for an unknown reason), rebooted, and it still blocks with 2.6.23-rc3. So this is not the cause of the problem.

Now the good: I've just tested 2.6.23-rc4 vanilla (without reverting anything, therefore), and guess what, now it works. This is really fixed in -rc4; I don't know by which patch, but it is fixed (it is not just luck in my tests: I crashed -rc3 three times in a row and passed the test 5-6 times with -rc4). So good news, even if the problem is not identified. It seems to be in the intersection of the -rc3 and -rc4 diffs ;-). Quite strange, I didn't see any patch in the diff that could have fixed it, but the thing works now. Well, I'll keep paying attention to this problem in the following days (and -rc's), just in case it resurfaces. If anybody has an idea, I can apply something from -rc4 to -rc3 to test, just to be sure. Thanks.

Closing on the basis of comment 9...

Hmmm. Sorry (double sorry), it seems a little more complicated than I thought at first. As requested for 8937, I tried to reproduce 8937 with -rc4. And indeed, it reproduced, even with -rc4. The fact is it is far more difficult to trigger now, I don't know why, but I get it if I really insist. My test case is now the following:

1. On console 0 (X started on vt7, but not used), I issue the following command:

       root# for i in ` seq 1 100 ` ; do echo $i ; /etc/rc.d/rc.inet1 eth1_stop ; /etc/rc.d/rc.inet1 eth1_start ; done

   This runs until the problem occurs, generally within 13-15 iterations (instead of 1 or 2 for -rc3; in my previous tests I went only up to 6-8 iterations because I thought that was sufficient).

2. If it still survives (say 10 iterations), I do the following: switch to vt2, log in, and issue:

       christian:~$ cp dvd.iso dvd.bak &
       christian:~$ cp dvd.iso /tmp

   (/ and /tmp are on another partition than /home, so I really stress the disk.) This seems to help trigger the problem.

The behaviour is the same: console killed, disk I/O and mouse alive, power button -> init detects blocked processes (and I guess these are the logins on the unused vt's 1-6). However, I still didn't get the workqueue problem. If I wait long enough, the two iso copies complete; the files were complete after reboot.

So I started to think the problem is **always** here and may not be a regression, apart from the fact that rc3 triggered it far more easily. I therefore redid the test with 2.6.22.5, and I also managed to get it. I reopen this bug, remove the regression flag, and adjust the kernel version accordingly. I can still reproduce it "quite easily" (but by being far more aggressive, as you can see). As for 8937, I'm still convinced this is pretty much the same bug, and there is indeed a deadlock/memory problem somewhere.

By the way, I have to mention that:

- My kernels are compiled with deadlock detection and so on (.config attached), and I don't see any message on the console (if it is alive?) or in syslog telling me there is a problem. So this is a deadlock that is not caught by the detection (soft lockup, sleep inside spinlock, etc.).
- This is not a spinlock that gets stuck, because - you will laugh - the laptop is quiet. When there is an infinite loop, the fan goes crazy. This is not the case. I even doubt this is a deadlock now; the symptom is really that the console is killed.
- In the log of gpm (complaining it cannot open the console), gpm does **not** get stuck. So the lock is not blocking open on /dev/console (at first glance). gpm is terminated correctly by init at reboot.

Sorry for not seeing the bug in other kernels sooner. This may be an old problem; I only noticed it because of rc3. I'm attaching my .config to show the debug options I activated, and I'll try to reproduce with the delay before ifconfig down, as it seemed to reduce the bug frequency with -rc3 (== I must revalidate this test too).

Created attachment 12583 [details]
.config file for 2.6.23-rc4
I'm adding SLUB debugging, plus VM, linked-list, write-protect-kernel-structures and stack debugging in another test.
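A plausible mapping of those options to .config symbols (an assumption on the editor's part; the attached .config file is authoritative) would be:

```
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB_DEBUG_ON=y        # SLUB debugging, enabled at boot
CONFIG_DEBUG_VM=y             # VM debugging
CONFIG_DEBUG_LIST=y           # linked-list debugging
CONFIG_DEBUG_RODATA=y         # write-protect kernel read-only structures
CONFIG_DEBUG_STACK_USAGE=y    # stack debugging
```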
Well, I redid the same tests with nearly all debugging options (except kobject) activated, and with -rc4. It reproduced after 72 and 65 ifconfig up/downs, so it is effectively far more robust than -rc3. I have to be patient to reproduce it now. Sadly, it simply gives no more information.

However, I managed to trigger a kernel BUG at boot time twice (once yesterday and once today), which seemed similar to 8937. However, it was not an assert in workqueue: it was a blocked spinlock (still in workqueue, with the fan effectively going crazy this time). But it was obviously caused by another bug before it that scrolled off the top and that I didn't manage to capture (grrr). Today, I had a short time to see the first bug before the scrolling, and even if I didn't memorize the full stack dump, I remember some information:

- First, there was a bcmxxx_close call in the stack, so we are still really near the ifconfig down of the init script.
- Second, there was a "list_add" call, which seems to indicate that the list debugging option was useful.

I didn't get more, but I'm keeping my camera next to the laptop, and if I get it again (with a smaller font too), I'll take a snapshot before the spinlock watchdog triggers. I'm rebooting several times in the hope of getting it one more time; it seems to be a very informative stack this time.

I'm appending the call stack I got here, but I don't think it will be useful (still in workqueue.c). It was copied from a photo, and I didn't copy all the offsets since that is a pain - if it proves interesting I will complete it:

    ... (lost due to the scroll; the second time I got it I saw bcmxxx_close or something like that, and list_add) ...
    inet_ioctl+0x5d/0x80
    sock_ioctl+0xbf/0x240
    do_ioctl+0x31/0x90
    vfs_ioctl+0x73/0x240
    sys_ioctl+0x4a/0x80
    system_call+0x7e/0x83

Now the second BUG()/spinlock:

    BUG: spinlock lockup on CPU#0, events/0/5, ffff81004ff13000
    Call Trace:
    _raw_spin_lock+0x106/0x130
    wq_barrier_func+0x0/0x10
    _spin_lock_irq+0x3d/0x50
    run_workqueue+0x80/0x190
    worker_thread+0xca/0x130
    autoremove_wake_function+0x0/0x40
    worker_thread+0x0/0x130
    kthread+0x4d/0x80
    child_rip+0xa/0x12
    restore_args+0x0/0x30
    kthreadd+0xd4/0x160
    kthread+0x0/0x80
    child_rip+0x0/0x12
    -- end of dump

I just got a more complete dump. It is reported in 8937, because that one is more susceptible to the full system freeze (schedule while atomic). The ifconfig eth1 down is effectively taking place when the bug occurs, as I had noticed experimentally. I don't know what preempts that call, however, and causes the scheduling, but the bug is indeed in this call. I don't know if the console freeze is due to the same kind of problem.

I've just completed my network start/stop loop with the patches I got from 8937 and it still freezes (35 iterations, but helped by a "yes > /dev/null &" together with my file copies to change the timing). So 8934 and 8937 are really different, and 8937 seems to be multiple bugs (up to now, the 8937 schedule/lock problem is fixed by Larry's patch, but there is still a workqueue problem pending). The call stack given above is therefore useless for this bug. Moreover, I didn't get any more logs when the problem occurs, even with the workqueue instrumentation from 8937. And finally, I'm now sure that the processes init reports as "hung" are the login processes on the unused consoles (c4 to c6 if I use c1 to c3, for instance).
Maybe I should add printk's in the path identified for ifconfig down in bug 8927:

> [<ffffffff802455e0>] flush_scheduled_work+0x10/0x20
> [<ffffffff80402da1>] bcm43xx_net_stop+0x31/0x40
> [<ffffffff8050a7da>] dev_close+0x6a/0x90
> [<ffffffff80509789>] dev_change_flags+0x99/0x1c0
> [<ffffffff80558352>] devinet_ioctl+0x582/0x740

to see where it is stuck (at least from this thread, since I don't even know whether it is a deadlock with another thread)?

Well, I added printk's at each operation in net/core dev_close and in bcm43xx_net_stop, and I got this trace:

    dev_close: enter
    dev_close: deactivating device
    dev__close: scheduled poll synchronization
    dev__close: calling dev->stop
    bcm43xx_net_stop: enter
    bcm43xx_net_stop: stopping softmac

It is therefore stuck in softmac. For information, the complete trace (when ifconfig down works) is:

    dev_close: enter
    dev_close: deactivating device
    dev__close: scheduled poll synchronization
    dev__close: calling dev->stop
    bcm43xx_net_stop: enter
    bcm43xx_net_stop: stopping softmac
    bcm43xx_net_stop: disabling interrupts sync
    bcm43xx_net_stop: freeing board
    bcm43xx_net_stop: exit
    dev_close: notifying down
    dev_close: exit

So you can see where I put the traces (before each operation). I'm refining in softmac now...
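As an illustration of this step-by-step tracing, here is a rough C sketch of the kind of instrumentation described above. The placement and messages only approximate what the trace output suggests; this is not the actual net/core or bcm43xx patch, and the commented-out calls are placeholders.

```c
#include <linux/kernel.h>
#include <linux/netdevice.h>

/*
 * Sketch of the entry/step tracing described above.  Each step is
 * announced *before* it runs, so the last message printed identifies
 * the call that never returned -- here, the softmac shutdown.
 */
static int bcm43xx_net_stop_traced(struct net_device *net_dev)
{
        printk(KERN_DEBUG "bcm43xx_net_stop: enter\n");

        printk(KERN_DEBUG "bcm43xx_net_stop: stopping softmac\n");
        /* ieee80211softmac_stop(net_dev);  <-- hangs in the bad case */

        printk(KERN_DEBUG "bcm43xx_net_stop: disabling interrupts sync\n");
        /* ... disable interrupts ... */

        printk(KERN_DEBUG "bcm43xx_net_stop: freeing board\n");
        /* ... free the board ... */

        printk(KERN_DEBUG "bcm43xx_net_stop: exit\n");
        return 0;
}
```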
It is in softmac: waiting_for_scan. When all is OK:

    Sep 1 10:35:45 athor kernel: dev_close: enter
    Sep 1 10:35:45 athor kernel: dev_close: deactivating device
    Sep 1 10:35:45 athor kernel: dev__close: scheduled poll synchronization
    Sep 1 10:35:45 athor kernel: dev__close: calling dev->stop
    Sep 1 10:35:45 athor kernel: bcm43xx_net_stop: enter
    Sep 1 10:35:45 athor kernel: bcm43xx_net_stop: stopping softmac
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: enter
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: waiting_for_scan
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: sm->lock irqsave
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: canceling associnfo->work
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: canceling scaninfo->softmac_scan
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: canceling authptr->work list
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: canceling eventptr->work list
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: sm->lock irqrestore
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: flush_scheduled_work
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: sm->lock irqsave
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: list_del authptr->list
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: list_del eventptr->list
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: list_del netptr->challenge
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: sm->lock irqrestore
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: exit
    Sep 1 10:35:45 athor kernel: bcm43xx_net_stop: disabling interrupts sync
    Sep 1 10:35:45 athor kernel: bcm43xx_net_stop: freeing board
    Sep 1 10:35:45 athor kernel: bcm43xx_net_stop: exit
    Sep 1 10:35:45 athor kernel: dev_close: notifying down
    Sep 1 10:35:45 athor kernel: dev_close: exit

and, when the deadlock occurs:

    Sep 1 10:35:48 athor kernel: dev_close: enter
    Sep 1 10:35:48 athor kernel: dev_close: deactivating device
    Sep 1 10:35:48 athor kernel: dev__close: scheduled poll synchronization
    Sep 1 10:35:48 athor kernel: dev__close: calling dev->stop
    Sep 1 10:35:48 athor kernel: bcm43xx_net_stop: enter
    Sep 1 10:35:48 athor kernel: bcm43xx_net_stop: stopping softmac
    Sep 1 10:35:48 athor kernel: ieee80211softmac_clear_pending_work: enter
    Sep 1 10:35:48 athor kernel: ieee80211softmac_clear_pending_work: waiting_for_scan
    ** freeze **

So far, this correlates with what I suspected at first when reading the Slackware scripts: it is indeed a deadlock between iwconfig eth1 essid "Any" and ifconfig eth1 down (the problem does not occur when I add a sleep between these two commands). My guess is still that there is an open AP in my neighbourhood, the laptop sometimes manages to start associating with it, and if I ifconfig down just at that moment, there is a deadlock in softmac. If there is no AP, the problem doesn't occur (so it depends on external conditions), and if the system:

- is very fast: association doesn't even start, so waiting_for_scan does nothing;
- is very slow: ifconfig down occurs after the association, so there is no deadlock;
- is mine: both happen simultaneously and it deadlocks.

Btw, I found that the real test to trigger the problem now is: yes > /dev/null & followed by the "for i in ..." loop. The disk I/O is useless; it is the CPU load that matters here. But I suppose it depends on the machine's speed, so it would be useful to have a load generator that takes the CPU from 1% to 100%, just to find the exact load that makes the problem occur on other machines. For me, yes > /dev/null is perfect.

I'm not sure what function is stored in sm->wait_for_scan, so I'm unable to go deeper. There is no wait_for_scan in the bcm driver, so I guess it is still in softmac, maybe ieee80211softmac_stop_scan_implementation, but it is clearly a mess as far as locking is concerned (it's obvious, and it is already inside sm->lock; I wouldn't be surprised if there were a cross with another lock from the scanning ISR). If you want, I can pursue the investigation, but I'm not familiar enough with the kernel locking primitives to fix anything there. I still don't understand why it freezes the console and why the lock instrumentation doesn't see anything there :-(

I was wrong in the previous comment: sm->lock is not held when entering wait_for_scan_implementation. I don't understand the code of ieee80211softmac_wait_for_scan_implementation. It takes the spinlock to check whether sm->scaninfo->started is true, to know if it has to call wait_for_completion(&sm->scaninfo->finished). However, it releases the spinlock just before calling that function. So what happens if the job completes just after the spinlock is released? Does wait_for_completion wait indefinitely? If wait_for_completion looks at the started flag itself to be safe, then what is the use of the check before?

Indeed, it effectively blocks in wait_for_completion:

    ieee80211softmac_clear_pending_work: enter
    ieee80211softmac_clear_pending_work: waiting_for_scan
    ieee80211softmac_wait_for_scan: calling ieee80211softmac_wait_for_scan_implementation+0x0/0xc0
    ieee80211softmac_wait_for_scan_implementation: enter
    ieee80211softmac_wait_for_scan_implementation: calling wait_for_completion

Just for info: the mac80211 library seems free from that bug. I was unable to reproduce it with the b43 driver.

Please mark this bug as "Reject bug, WILL_NOT_FIX", as softmac is not going to get completely fixed (a partial fix for this has been submitted). The final fix will be to drop softmac and switch to mac80211.
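For reference on the locking question raised above, here is a minimal sketch of the check-then-wait pattern being described. The type and field names follow the identifiers quoted in the comments (sm->lock, sm->scaninfo->started, sm->scaninfo->finished) but are otherwise hypothetical; this is not the actual ieee80211softmac source.

```c
#include <linux/spinlock.h>
#include <linux/completion.h>

struct scaninfo {
        int started;
        struct completion finished;
};

struct softmac {
        spinlock_t lock;
        struct scaninfo *scaninfo;
};

/* Check "started" under the lock, then wait outside it. */
static void wait_for_scan_sketch(struct softmac *sm)
{
        unsigned long flags;
        int must_wait;

        spin_lock_irqsave(&sm->lock, flags);
        must_wait = sm->scaninfo->started;
        spin_unlock_irqrestore(&sm->lock, flags);

        /*
         * Dropping the lock before waiting is not itself the hazard:
         * complete() increments a counter inside struct completion, so
         * if the scan finishes right after the unlock, this
         * wait_for_completion() returns immediately.  A hang here means
         * complete(&sm->scaninfo->finished) is never called at all,
         * e.g. because the work that would signal it was cancelled or
         * is itself blocked.
         */
        if (must_wait)
                wait_for_completion(&sm->scaninfo->finished);
}
```

Under that reading, the trace above (blocked in wait_for_completion) would mean that whichever path should have signalled scaninfo->finished never ran, which is consistent with the deadlock being in softmac's scan/association teardown rather than in the spinlock itself.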