Bug 8934
| Summary: | System freeze when restarting network connection with Broadcom driver | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Christian Casteyde (casteyde.christian) |
| Component: | network-wireless | Assignee: | drivers_network-wireless (drivers_network-wireless) |
| Status: | REJECTED WILL_NOT_FIX | | |
| Severity: | normal | CC: | akpm, Larry.Finger, mb, mchan |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.22.5 | Subsystem: | |
| Regression: | --- | Bisected commit-id: | |
| Attachments: | .config file for 2.6.23-rc4 | | |
Description
Christian Casteyde
2007-08-24 15:35:21 UTC
> It would be useful if you could tell us whether 2.6.22 is working OK. Because if it is, this gets escalated to a post-2.6.22 regression and quite a lot more attention will be paid to it. Thanks.

OK, I've rebuilt 2.6.22.5 (without CFS and lock validation code) and it doesn't have the problem. So it is a 2.6.23-rc? regression, or at least very dependent on timing. I'm trying to see at which rc the problem appeared (with SLOB only, since pre-rc3 crashes on my laptop). I'm posting the results in a few minutes.

Some more info: the system is not completely frozen; only the ifconfig command, the keyboard, and any process that tries to access the console hang. By the way, I managed to capture the console error simply by doing the test under X11 and trying to reboot from the desktop menu: this switches to the console, but gpm fails to reopen the console and logs these two lines in syslog:

    Aug 25 00:42:55 athor /usr/sbin/gpm[3105]: O0o.oops(): [console.c(83)]:
    Aug 25 00:42:55 athor /usr/sbin/gpm[3105]: Opening console failed.

I did the test of dumping dmesg asynchronously: it worked fine even while the bug was triggered (so all disk I/O seems to keep working), but dmesg has nothing to tell. As syslog works too (gpm managed to log something), I'm sure there is no more information in it either.

Finally, I tried to reproduce the problem with the small script I gave in #1: I didn't manage to trigger it. These are really the commands used by the network init script, but there are some more commands in between (ps, grep, cut, etc.; they all work), so I think the timing is really important: the bug does not occur with a 3-second sleep, and it also does not occur if ifconfig down is done immediately after iwconfig. Maybe the bug triggers only when association with "any" has started. I'm compelled to keep using the network script to reproduce it, and I certainly only get the right timing on my laptop, with the given kernel configuration. This last thought makes me suppose that kernel timing matters as well as the CPU scheduler, since it is obviously a race condition even in -rc3. So if I don't find the smallest rc that has the problem, I'll have to put CFS into 2.6.22 and turn on deadlock validation to be in the same environment.

Well, this is a 2.6.23-rc3 regression indeed. Kernels 2.6.22.5, 2.6.22.5-cfs20, 2.6.23-rc1 and 2.6.23-rc2 are all OK, so this bug is quite recent. I'll stop regression testing for now, but you can ask me to test smaller patches if you want. There seem to have been few differences between rc2 and rc3 in the bcm wireless driver, but there were huge modifications in core net. Nevertheless, there seems to be a big problem in the bcm driver anyway: please see also http://bugzilla.kernel.org/show_bug.cgi?id=8937. I entered it because I thought it was independent of this bug, but now I'm not so sure. It seems I got a kernel panic in workqueue.c while trying to do the ifconfig eth1 down, even in 2.6.22.5-cfs20. However, I don't think this is the same bug, because:

a. The behaviour is different (here, only the console is killed; in 8937 the system crashed).
b. I can reproduce this one systematically with -rc3 (9 out of 10 network script invocations), but almost never in previous rc's or in 2.6.22.* (I restarted the network script at least 10 times with each kernel without any problem).
Reply-To: akpm@linux-foundation.org

On Sat, 25 Aug 2007 03:26:24 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=8934

Various nasty box-killing things are happening here, and http://bugzilla.kernel.org/show_bug.cgi?id=8937 might be related. Michal, 8934 (at least) is a post-2.6.22 regression - can you please add it to the list?

The only change to bcm43xx in 2.6.23-rcX reverts the result of a typo in the specifications. The only interface affected by this typo is an early-model BCM4306 - it stops functioning. It is extremely unlikely that it has anything to do with this. There are some changes in softmac, however, that might be involved. The specific examples are:

- Commit fdc8f43b5e49b64b251bb48da95193a13ac0132f, "Fix deadlock of wx_set_essid with assoc work", on July 31. Of all the changes, this is the most likely candidate as it messes with the work queues.
- Commit 4cf92a3cd9117cb90ad9441172b9f42b67ee9d26, "Fix ESSID problem", on July 9. This one seems unlikely to cause the problem.
- Commit 67c4f7aa9e64d37f32eb44d6d093b7028f1060bb, "use list_for_each_entry", on May 27. I think this is in 2.6.22.

Christian - is it possible for you to test 2.6.23-rc3 or -rc4 with fdc8f43b5e4 reverted?

Larry beat me to it -- I agree that this looks like the most likely culprit.

    commit fdc8f43b5e49b64b251bb48da95193a13ac0132f
    Author: Michael Buesch <mb@bu3sch.de>
    Date:   Tue Jul 31 20:41:04 2007 +0200

        [PATCH] softmac: Fix deadlock of wx_set_essid with assoc work

        The essid wireless extension does deadlock against the assoc mutex,
        as we don't unlock the assoc mutex when flushing the workqueue,
        which also holds the lock.

        Signed-off-by: Michael Buesch <mb@bu3sch.de>
        Signed-off-by: John W. Linville <linville@tuxdriver.com>

It is a fix. It fixes an "it will 100% always deadlock" to an "it might, unlikely, deadlock".
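For context, here is a minimal sketch of the deadlock pattern the commit message describes: flushing the shared workqueue while holding a mutex that a queued work item also needs. The module, function and variable names here are hypothetical, for illustration only; this is not the actual softmac code.

```c
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(assoc_mutex);
static struct work_struct assoc_work;

static void assoc_work_fn(struct work_struct *work)
{
        mutex_lock(&assoc_mutex);       /* blocks: the flushing thread owns it */
        /* ... association handling would go here ... */
        mutex_unlock(&assoc_mutex);
}

static void set_essid_like_path(void)
{
        mutex_lock(&assoc_mutex);
        /*
         * flush_scheduled_work() waits for assoc_work_fn() to finish,
         * but assoc_work_fn() is waiting for assoc_mutex, which is held
         * right here, so neither side can make progress.  The fix is to
         * drop the mutex (or cancel the work) before flushing.
         */
        flush_scheduled_work();
        mutex_unlock(&assoc_mutex);
}

static int __init deadlock_demo_init(void)
{
        INIT_WORK(&assoc_work, assoc_work_fn);
        schedule_work(&assoc_work);
        set_essid_like_path();          /* deadlocks if the work has not yet run */
        return 0;
}
module_init(deadlock_demo_init);
MODULE_LICENSE("GPL");
```

With this pattern the flush can only finish once the work item runs, and the work item can only run once the mutex is released, which is exactly the "100% always deadlock" situation the commit turned into a merely unlikely one.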
I have bad news and good news. First, the bad: I reverted fdc8f43b5e49b64b251bb48da95193a13ac0132f manually (the patch was rejected for an unknown reason), rebooted, and it still blocks with 2.6.23-rc3. So this is not the cause of the problem.

Now the good: I've just tested 2.6.23-rc4 vanilla (without reverting anything, therefore), and guess what, now it works. This is really fixed in -rc4; I don't know by which patch, but it is fixed (it is not just luck in my tests: I crashed -rc3 three times in a row and passed the test 5-6 times with -rc4). So good news, even if the problem is not identified. It seems to be in the intersection of the -rc3 and -rc4 diffs ;-). Quite strange, I didn't see any patch in the diff that could have fixed it, but the thing works now. Well, I'll keep paying attention to this problem in the following days (and -rc's), just in case it resurfaces. If anybody has an idea, I can apply something from -rc4 to -rc3 to test, just to be sure. Thanks.

Closing on the basis of comment 9...

Hmmm. Sorry (double sorry), it seems a little more complicated than I thought at first. As requested for 8937, I tried to reproduce 8937 with -rc4. And indeed, it reproduced, even with -rc4. The fact is it is far more difficult to trigger now, I don't know why, but I get it if I really insist. My test case is now the following:

1. On console 0 (X started on vt7, but not used), I issue the following command:

       root# for i in ` seq 1 100 ` ; do echo $i ; /etc/rc.d/rc.inet1 eth1_stop ; /etc/rc.d/rc.inet1 eth1_start ; done

   This runs until the problem occurs, generally within 13-15 iterations (instead of 1 or 2 for -rc3; in my previous tests I went only up to 6-8 iterations because I thought that was sufficient).

2. If it still survives (say 10 iterations), I do the following: switch to vt2, log in, and issue:

       christian:~$ cp dvd.iso dvd.bak &
       christian:~$ cp dvd.iso /tmp

   (/ and /tmp are on another partition than /home, so I really stress the disk.) This seems to help trigger the problem.

The behaviour is the same: console killed, disk I/O and mouse alive, power button -> init detects blocked processes (and I guess these are the logins on the unused vt's 1-6). However, I still didn't get the workqueue problem. If I wait long enough, the two iso copies complete; the files were complete after reboot.

So I started to think the problem is **always** here and may not be a regression, apart from the fact that rc3 triggered it far more easily. I therefore redid the test with 2.6.22.5, and I also managed to get it. I reopen this bug, remove the regression flag, and adjust the kernel version accordingly. I can still reproduce it "quite easily" (but by being far more aggressive, as you can see). As for 8937, I'm still convinced this is pretty much the same bug, and there is indeed a deadlock/memory problem somewhere.

By the way, I have to mention that:

- My kernels are compiled with deadlock detection and so on (.config attached), and I don't see any message on the console (if it is alive?) or in syslog telling me there is a problem. So this is a deadlock that is not caught by the detection (soft lockup, sleep inside spinlock, etc.).
- This is not a spinlock that gets stuck, because - you will laugh - the laptop is quiet. When there is an infinite loop, the fan goes crazy. This is not the case. I even doubt this is a deadlock now; the symptom is really that the console is killed.
- In the log of gpm (complaining it cannot open the console), gpm does **not** get stuck. So the lock is not blocking open on /dev/console (at first glance). gpm is terminated correctly by init at reboot.

Sorry for not seeing the bug in other kernels sooner. This may be an old problem; I only noticed it because of rc3. I'm attaching my .config to show the debug options I activated, and I'll try to reproduce with the delay before ifconfig down, as it seemed to reduce the bug frequency with -rc3 (== I must revalidate this test too).

Created attachment 12583 [details]
.config file for 2.6.23-rc4
I'm adding SLUB debugging, plus VM, linked-list, write-protect-kernel-structures and stack debugging in another test.
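A plausible mapping of those options to .config symbols (an assumption on the editor's part; the attached .config file is authoritative) would be:

```
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB_DEBUG_ON=y        # SLUB debugging, enabled at boot
CONFIG_DEBUG_VM=y             # VM debugging
CONFIG_DEBUG_LIST=y           # linked-list debugging
CONFIG_DEBUG_RODATA=y         # write-protect kernel read-only structures
CONFIG_DEBUG_STACK_USAGE=y    # stack debugging
```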
Well, I redid the same tests with nearly all debugging options (except kobject) activated, and with -rc4. It reproduced after 72 and 65 ifconfig up/downs, so it is effectively far more robust than -rc3. I have to be patient to reproduce it now. Sadly, it simply gives no more information.

However, I managed to trigger a kernel BUG at boot time twice (once yesterday and once today), which seemed similar to 8937. However, it was not an assert in workqueue: it was a blocked spinlock (still in workqueue, with the fan effectively going crazy this time). But it was obviously caused by another bug before it that scrolled off the top and that I didn't manage to capture (grrr). Today, I had a short time to see the first bug before the scrolling, and even if I didn't memorize the full stack dump, I remember some information:

- First, there was a bcmxxx_close call in the stack, so we are still really near the ifconfig down of the init script.
- Second, there was a "list_add" call, which seems to indicate that the list debugging option was useful.

I didn't get more, but I'm keeping my camera next to the laptop, and if I get it again (with a smaller font too), I'll take a snapshot before the spinlock watchdog triggers. I'm rebooting several times in the hope of getting it one more time; it seems to be a very informative stack this time.

I'm appending the call stack I got here, but I don't think it will be useful (still in workqueue.c). It was copied from a photo, and I didn't copy all the offsets since that is a pain - if it proves interesting I will complete it:

    ... (lost due to the scroll; the second time I got it I saw bcmxxx_close or something like that, and list_add) ...
    inet_ioctl+0x5d/0x80
    sock_ioctl+0xbf/0x240
    do_ioctl+0x31/0x90
    vfs_ioctl+0x73/0x240
    sys_ioctl+0x4a/0x80
    system_call+0x7e/0x83

Now the second BUG()/spinlock:

    BUG: spinlock lockup on CPU#0, events/0/5, ffff81004ff13000
    Call Trace:
    _raw_spin_lock+0x106/0x130
    wq_barrier_func+0x0/0x10
    _spin_lock_irq+0x3d/0x50
    run_workqueue+0x80/0x190
    worker_thread+0xca/0x130
    autoremove_wake_function+0x0/0x40
    worker_thread+0x0/0x130
    kthread+0x4d/0x80
    child_rip+0xa/0x12
    restore_args+0x0/0x30
    kthreadd+0xd4/0x160
    kthread+0x0/0x80
    child_rip+0x0/0x12
    -- end of dump

I just got a more complete dump. It is reported in 8937, because that one is more susceptible to the full system freeze (schedule while atomic). The ifconfig eth1 down is effectively taking place when the bug occurs, as I had noticed experimentally. I don't know what preempts that call, however, and causes the scheduling, but the bug is indeed in this call. I don't know if the console freeze is due to the same kind of problem.

I've just completed my network start/stop loop with the patches I got from 8937 and it still freezes (35 iterations, but helped by a "yes > /dev/null &" together with my file copies to change the timing). So 8934 and 8937 are really different, and 8937 seems to be multiple bugs (up to now, the 8937 schedule/lock problem is fixed by Larry's patch, but there is still a workqueue problem pending). The call stack given above is therefore useless for this bug. Moreover, I didn't get any more logs when the problem occurs, even with the workqueue instrumentation from 8937. And finally, I'm now sure that the processes init reports as "hung" are the login processes on the unused consoles (c4 to c6 if I use c1 to c3, for instance).
Maybe I should add printk's in the path identified for ifconfig down in bug 8927:

> [<ffffffff802455e0>] flush_scheduled_work+0x10/0x20
> [<ffffffff80402da1>] bcm43xx_net_stop+0x31/0x40
> [<ffffffff8050a7da>] dev_close+0x6a/0x90
> [<ffffffff80509789>] dev_change_flags+0x99/0x1c0
> [<ffffffff80558352>] devinet_ioctl+0x582/0x740

to see where it is stuck (at least from this thread, since I don't even know whether it is a deadlock with another thread)?

Well, I added printk's at each operation in net/core dev_close and in bcm43xx_net_stop, and I got this trace:

    dev_close: enter
    dev_close: deactivating device
    dev__close: scheduled poll synchronization
    dev__close: calling dev->stop
    bcm43xx_net_stop: enter
    bcm43xx_net_stop: stopping softmac

It is therefore stuck in softmac. For information, the complete trace (when ifconfig down works) is:

    dev_close: enter
    dev_close: deactivating device
    dev__close: scheduled poll synchronization
    dev__close: calling dev->stop
    bcm43xx_net_stop: enter
    bcm43xx_net_stop: stopping softmac
    bcm43xx_net_stop: disabling interrupts sync
    bcm43xx_net_stop: freeing board
    bcm43xx_net_stop: exit
    dev_close: notifying down
    dev_close: exit

So you can see where I put the traces (before each operation). I'm refining in softmac now...
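As an illustration of this step-by-step tracing, here is a rough C sketch of the kind of instrumentation described above. The placement and messages only approximate what the trace output suggests; this is not the actual net/core or bcm43xx patch, and the commented-out calls are placeholders.

```c
#include <linux/kernel.h>
#include <linux/netdevice.h>

/*
 * Sketch of the entry/step tracing described above.  Each step is
 * announced *before* it runs, so the last message printed identifies
 * the call that never returned -- here, the softmac shutdown.
 */
static int bcm43xx_net_stop_traced(struct net_device *net_dev)
{
        printk(KERN_DEBUG "bcm43xx_net_stop: enter\n");

        printk(KERN_DEBUG "bcm43xx_net_stop: stopping softmac\n");
        /* ieee80211softmac_stop(net_dev);  <-- hangs in the bad case */

        printk(KERN_DEBUG "bcm43xx_net_stop: disabling interrupts sync\n");
        /* ... disable interrupts ... */

        printk(KERN_DEBUG "bcm43xx_net_stop: freeing board\n");
        /* ... free the board ... */

        printk(KERN_DEBUG "bcm43xx_net_stop: exit\n");
        return 0;
}
```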
It is in softmac: waiting_for_scan. When all is OK:

    Sep 1 10:35:45 athor kernel: dev_close: enter
    Sep 1 10:35:45 athor kernel: dev_close: deactivating device
    Sep 1 10:35:45 athor kernel: dev__close: scheduled poll synchronization
    Sep 1 10:35:45 athor kernel: dev__close: calling dev->stop
    Sep 1 10:35:45 athor kernel: bcm43xx_net_stop: enter
    Sep 1 10:35:45 athor kernel: bcm43xx_net_stop: stopping softmac
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: enter
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: waiting_for_scan
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: sm->lock irqsave
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: canceling associnfo->work
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: canceling scaninfo->softmac_scan
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: canceling authptr->work list
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: canceling eventptr->work list
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: sm->lock irqrestore
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: flush_scheduled_work
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: sm->lock irqsave
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: list_del authptr->list
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: list_del eventptr->list
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: list_del netptr->challenge
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: sm->lock irqrestore
    Sep 1 10:35:45 athor kernel: ieee80211softmac_clear_pending_work: exit
    Sep 1 10:35:45 athor kernel: bcm43xx_net_stop: disabling interrupts sync
    Sep 1 10:35:45 athor kernel: bcm43xx_net_stop: freeing board
    Sep 1 10:35:45 athor kernel: bcm43xx_net_stop: exit
    Sep 1 10:35:45 athor kernel: dev_close: notifying down
    Sep 1 10:35:45 athor kernel: dev_close: exit

and, when the deadlock occurs:

    Sep 1 10:35:48 athor kernel: dev_close: enter
    Sep 1 10:35:48 athor kernel: dev_close: deactivating device
    Sep 1 10:35:48 athor kernel: dev__close: scheduled poll synchronization
    Sep 1 10:35:48 athor kernel: dev__close: calling dev->stop
    Sep 1 10:35:48 athor kernel: bcm43xx_net_stop: enter
    Sep 1 10:35:48 athor kernel: bcm43xx_net_stop: stopping softmac
    Sep 1 10:35:48 athor kernel: ieee80211softmac_clear_pending_work: enter
    Sep 1 10:35:48 athor kernel: ieee80211softmac_clear_pending_work: waiting_for_scan
    ** freeze **

So far, this correlates with what I suspected at first when reading the Slackware scripts: it is indeed a deadlock between iwconfig eth1 essid "Any" and ifconfig eth1 down (the problem does not occur when I add a sleep between these two commands). My guess is still that there is an open AP in my neighbourhood, the laptop sometimes manages to start associating with it, and if I ifconfig down just at that moment, there is a deadlock in softmac. If there is no AP, the problem doesn't occur (so it depends on external conditions), and if the system:

- is very fast: association doesn't even start, so waiting_for_scan does nothing;
- is very slow: ifconfig down occurs after the association, so there is no deadlock;
- is mine: both happen simultaneously and it deadlocks.

Btw, I found that the real test to trigger the problem now is: yes > /dev/null & followed by the "for i in ..." loop. The disk I/O is useless; it is the CPU load that matters here. But I suppose it depends on the machine's speed, so it would be useful to have a load generator that takes the CPU from 1% to 100%, just to find the exact load that makes the problem occur on other machines. For me, yes > /dev/null is perfect.

I'm not sure what function is stored in sm->wait_for_scan, so I'm unable to go deeper. There is no wait_for_scan in the bcm driver, so I guess it is still in softmac, maybe ieee80211softmac_stop_scan_implementation, but it is clearly a mess as far as locking is concerned (it's obvious, and it is already inside sm->lock; I wouldn't be surprised if there were a cross with another lock from the scanning ISR). If you want, I can pursue the investigation, but I'm not familiar enough with the kernel locking primitives to fix anything there. I still don't understand why it freezes the console and why the lock instrumentation doesn't see anything there :-(

I was wrong in the previous comment: sm->lock is not held when entering wait_for_scan_implementation. I don't understand the code of ieee80211softmac_wait_for_scan_implementation. It takes the spinlock to check whether sm->scaninfo->started is true, to know if it has to call wait_for_completion(&sm->scaninfo->finished). However, it releases the spinlock just before calling that function. So what happens if the job completes just after the spinlock is released? Does wait_for_completion wait indefinitely? If wait_for_completion looks at the started flag itself to be safe, then what is the use of the check before?

Indeed, it effectively blocks in wait_for_completion:

    ieee80211softmac_clear_pending_work: enter
    ieee80211softmac_clear_pending_work: waiting_for_scan
    ieee80211softmac_wait_for_scan: calling ieee80211softmac_wait_for_scan_implementation+0x0/0xc0
    ieee80211softmac_wait_for_scan_implementation: enter
    ieee80211softmac_wait_for_scan_implementation: calling wait_for_completion

Just for info: the mac80211 library seems free from that bug. I was unable to reproduce it with the b43 driver.

Please mark this bug as "Reject bug, WILL_NOT_FIX", as softmac is not going to get completely fixed (a partial fix for this has been submitted). The final fix will be to drop softmac and switch to mac80211.
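For reference on the locking question raised above, here is a minimal sketch of the check-then-wait pattern being described. The type and field names follow the identifiers quoted in the comments (sm->lock, sm->scaninfo->started, sm->scaninfo->finished) but are otherwise hypothetical; this is not the actual ieee80211softmac source.

```c
#include <linux/spinlock.h>
#include <linux/completion.h>

struct scaninfo {
        int started;
        struct completion finished;
};

struct softmac {
        spinlock_t lock;
        struct scaninfo *scaninfo;
};

/* Check "started" under the lock, then wait outside it. */
static void wait_for_scan_sketch(struct softmac *sm)
{
        unsigned long flags;
        int must_wait;

        spin_lock_irqsave(&sm->lock, flags);
        must_wait = sm->scaninfo->started;
        spin_unlock_irqrestore(&sm->lock, flags);

        /*
         * Dropping the lock before waiting is not itself the hazard:
         * complete() increments a counter inside struct completion, so
         * if the scan finishes right after the unlock, this
         * wait_for_completion() returns immediately.  A hang here means
         * complete(&sm->scaninfo->finished) is never called at all,
         * e.g. because the work that would signal it was cancelled or
         * is itself blocked.
         */
        if (must_wait)
                wait_for_completion(&sm->scaninfo->finished);
}
```

Under that reading, the trace above (blocked in wait_for_completion) would mean that whichever path should have signalled scaninfo->finished never ran, which is consistent with the deadlock being in softmac's scan/association teardown rather than in the spinlock itself.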