Bug 10821
Summary: | rt25xx: lock dependency warning, association failure, and kmalloc corruption | ||
---|---|---|---|
Product: | Networking | Reporter: | Christian Casteyde (casteyde.christian) |
Component: | Wireless | Assignee: | Ivo van Doorn (IvDoorn) |
Status: | RESOLVED PATCH_ALREADY_AVAILABLE | ||
Severity: | blocking | CC: | bunk, IvDoorn, linville |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.26-rc4 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 10492 | ||
Attachments: |
dmesg log just after inserting
next logs when starting wpa_supplicant QOS queue initialization Don't use mac80211 workqueue Make entry->flags and rt2x00->flags atomic free pointer corrupt with -rc6+bits.diff patch+first patch for lock Make rt2x00->flags atomic and fix race in remove_dev() Make rt2x00->flags atomic and fix race in remove_dev() - v2 dmesg output for 2.6.26-rc8+bits2.diff+work.diff Make rt2x00->flags atomic and fix race in remove_dev() - v3 dmesg output for 2.6.26-rc8+bits3.diff+work.diff dmesg output for 2.6.26 final |
Description
Christian Casteyde
2008-05-29 14:30:43 UTC
Please retest with 2.6.26-rc5 or 2.6.26-rc6 Both still fail with -rc6. At device plug in, I get: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.26-rc6 #10 ------------------------------------------------------- rt2500usb/3488 is trying to acquire lock: (rtnl_mutex){--..}, at: [<ffffffff805479d2>] rtnl_lock+0x12/0x20 but task is already holding lock: (&rt2x00dev->intf_work){--..}, at: [<ffffffff80243d52>] run_workqueue+0x122/0x220 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&rt2x00dev->intf_work){--..}: [<ffffffff80255ec2>] __lock_acquire+0xcc2/0x1120 [<ffffffff80256377>] lock_acquire+0x57/0x80 [<ffffffff80243d97>] run_workqueue+0x167/0x220 [<ffffffff802448f6>] worker_thread+0xa6/0x110 [<ffffffff802478ad>] kthread+0x4d/0x80 [<ffffffff8020c238>] child_rip+0xa/0x12 [<ffffffffffffffff>] 0xffffffffffffffff -> #1 ((name)){--..}: [<ffffffff80255ec2>] __lock_acquire+0xcc2/0x1120 [<ffffffff80256377>] lock_acquire+0x57/0x80 [<ffffffff80244555>] flush_workqueue+0x45/0x70 [<ffffffff80613936>] ieee80211_stop+0x1c6/0x3e0 [<ffffffff8053d433>] dev_close+0x63/0xb0 [<ffffffff8053d061>] dev_change_flags+0xa1/0x1d0 [<ffffffff8058fe8b>] devinet_ioctl+0x5ab/0x750 [<ffffffff805904ed>] inet_ioctl+0x6d/0x90 [<ffffffff8052f217>] sock_ioctl+0xc7/0x250 [<ffffffff8029d731>] vfs_ioctl+0x31/0x90 [<ffffffff8029d9f0>] do_vfs_ioctl+0x260/0x2e0 [<ffffffff8029daba>] sys_ioctl+0x4a/0x80 [<ffffffff8020b79b>] system_call_after_swapgs+0x7b/0x80 [<ffffffffffffffff>] 0xffffffffffffffff -> #0 (rtnl_mutex){--..}: [<ffffffff80255cf0>] __lock_acquire+0xaf0/0x1120 [<ffffffff80256377>] lock_acquire+0x57/0x80 [<ffffffff8063a7bf>] mutex_lock_nested+0x8f/0x2c0 [<ffffffff805479d2>] rtnl_lock+0x12/0x20 [<ffffffff80628f27>] ieee80211_iterate_active_interfaces+0x27/0xa0 [<ffffffff8042e7ae>] rt2x00lib_intf_scheduled+0x1e/0x20 [<ffffffff80243d9d>] run_workqueue+0x16d/0x220 [<ffffffff802448f6>] worker_thread+0xa6/0x110 [<ffffffff802478ad>] kthread+0x4d/0x80 [<ffffffff8020c238>] child_rip+0xa/0x12 [<ffffffffffffffff>] 0xffffffffffffffff other info that might help us debug this: 2 locks held by rt2500usb/3488: #0: ((name)){--..}, at: [<ffffffff80243d52>] run_workqueue+0x122/0x220 #1: (&rt2x00dev->intf_work){--..}, at: [<ffffffff80243d52>] run_workqueue+0x122/0x220 stack backtrace: Pid: 3488, comm: rt2500usb Not tainted 2.6.26-rc6 #10 Call Trace: [<ffffffff80253c13>] print_circular_bug_tail+0x83/0x90 [<ffffffff802532a9>] ? print_circular_bug_entry+0x49/0x60 [<ffffffff80255cf0>] __lock_acquire+0xaf0/0x1120 [<ffffffff805479d2>] ? rtnl_lock+0x12/0x20 [<ffffffff80256377>] lock_acquire+0x57/0x80 [<ffffffff805479d2>] ? rtnl_lock+0x12/0x20 [<ffffffff8063a7bf>] mutex_lock_nested+0x8f/0x2c0 [<ffffffff8042f240>] ? rt2x00lib_intf_scheduled_iter+0x0/0x100 [<ffffffff805479d2>] rtnl_lock+0x12/0x20 [<ffffffff80628f27>] ieee80211_iterate_active_interfaces+0x27/0xa0 [<ffffffff8042e790>] ? rt2x00lib_intf_scheduled+0x0/0x20 [<ffffffff8042e7ae>] rt2x00lib_intf_scheduled+0x1e/0x20 [<ffffffff80243d9d>] run_workqueue+0x16d/0x220 [<ffffffff802448f6>] worker_thread+0xa6/0x110 [<ffffffff80247cc0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff80244850>] ? worker_thread+0x0/0x110 [<ffffffff802478ad>] kthread+0x4d/0x80 [<ffffffff8020c238>] child_rip+0xa/0x12 [<ffffffff8020bdf3>] ? restore_args+0x0/0x30 [<ffffffff80247860>] ? kthread+0x0/0x80 [<ffffffff8020c22e>] ? child_rip+0x0/0x12 ADDRCONF(NETDEV_UP): wlan0: link is not ready After that, I can do iwlist scan now. However, association still fails with: /usr/sbin/wpa_supplicant -Bw -c/etc/wpa_supplicant.conf -Dwext -ieth1 -> wlan0: Initial auth_alg=0 wlan0: authenticate with AP 00:07:cb:55:12:b0 wlan0: authenticate with AP 00:07:cb:55:12:b0 wlan0: authenticate with AP 00:07:cb:55:12:b0 wlan0: authentication with AP 00:07:cb:55:12:b0 timed out wlan0: Initial auth_alg=0 wlan0: authenticate with AP 00:07:cb:55:12:b0 wlan0: RX authentication from 00:07:cb:55:12:b0 (alg=0 transaction=2 status=0) wlan0: authenticated wlan0: associate with AP 00:07:cb:55:12:b0 wlan0: Initial auth_alg=0 wlan0: authenticate with AP 00:07:cb:55:12:b0 wlan0: RX authentication from 00:07:cb:55:12:b0 (alg=0 transaction=2 status=0) wlan0: authenticated wlan0: associate with AP 00:07:cb:55:12:b0 wlan0: authentication frame received from 00:07:cb:55:12:b0, but not in authenticate state - ignored wlan0: authentication frame received from 00:07:cb:55:12:b0, but not in authenticate state - ignored wlan0: authentication frame received from 00:07:cb:55:12:b0, but not in authenticate state - ignored wlan0: authentication frame received from 00:07:cb:55:12:b0, but not in authenticate state - ignored wlan0: authentication frame received from 00:07:cb:55:12:b0, but not in authenticate state - ignored wlan0: authentication frame received from 00:07:cb:55:12:b0, but not in authenticate state - ignored wlan0: associate with AP 00:07:cb:55:12:b0 wlan0: associate with AP 00:07:cb:55:12:b0 wlan0: association with AP 00:07:cb:55:12:b0 timed out wlan0: association frame received from 00:07:cb:55:12:b0, but not in associate state - ignored wlan0: association frame received from 00:07:cb:55:12:b0, but not in associate state - ignored wlan0: Initial auth_alg=0 wlan0: authenticate with AP 00:07:cb:55:12:b0 wlan0: Initial auth_alg=0 wlan0: authenticate with AP 00:07:cb:55:12:b0 wlan0: association frame received from 00:07:cb:55:12:b0, but not in associate state - ignored wlan0: association frame received from 00:07:cb:55:12:b0, but not in associate state - ignored wlan0: association frame received from 00:07:cb:55:12:b0, but not in associate state - ignored wlan0: association frame received from 00:07:cb:55:12:b0, but not in associate state - ignored wlan0: association frame received from 00:07:cb:55:12:b0, but not in associate state - ignored wlan0: association frame received from 00:07:cb:55:12:b0, but not in associate state - ignored wlan0: authenticate with AP 00:07:cb:55:12:b0 wlan0: authenticate with AP 00:07:cb:55:12:b0 wlan0: authentication with AP 00:07:cb:55:12:b0 timed out wlan0: RX deauthentication from 00:07:cb:55:12:b0 (reason=15) wlan0: deauthenticated wlan0: RX deauthentication from 00:07:cb:55:12:b0 (reason=15) wlan0: RX deauthentication from 00:07:cb:55:12:b0 (reason=15) wlan0: RX deauthentication from 00:07:cb:55:12:b0 (reason=15) wlan0: RX deauthentication from 00:07:cb:55:12:b0 (reason=15) wlan0: RX deauthentication from 00:07:cb:55:12:b0 (reason=15) wlan0: RX deauthentication from 00:07:cb:55:12:b0 (reason=15) Then at device removal: usb 1-3: USB disconnect, address 4 ============================================================================= BUG kmalloc-96: Freepointer corrupt ----------------------------------------------------------------------------- INFO: Allocated in rt2x00queue_alloc_entries+0x49/0xc0 age=82828 cpu=0 pid=3519 INFO: Freed in sysfs_release+0x62/0xa0 age=82869 cpu=0 pid=3496 INFO: Slab 0xffffe20001118688 objects=24 used=13 fp=0xffff81004e2679d8 flags=0x40000000000000c3 INFO: Object 0xffff81004e267d20 @offset=3360 fp=0xffff81003dd37000 Bytes b4 0xffff81004e267d10: 95 4e 01 00 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a .N......ZZZZZZZZ Object 0xffff81004e267d20: 00 00 00 00 00 00 00 00 08 56 3b 4d 00 81 ff ff .........V;M..�� Object 0xffff81004e267d30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object 0xffff81004e267d40: 48 7d 26 4e 00 81 ff ff 70 ae d8 3d 00 81 ff ff H}&N..��p��=..�� Object 0xffff81004e267d50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object 0xffff81004e267d60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object 0xffff81004e267d70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Redzone 0xffff81004e267d80: cc cc cc cc cc cc cc cc �������� Padding 0xffff81004e267dc0: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ Pid: 191, comm: khubd Not tainted 2.6.26-rc6 #10 Call Trace: [<ffffffff80289cc9>] print_trailer+0xf9/0x160 [<ffffffff80289e4c>] object_err+0x3c/0x50 [<ffffffff8028a1ea>] check_object+0x15a/0x250 [<ffffffff8028b725>] __slab_free+0x265/0x370 [<ffffffff80430a0f>] ? rt2x00queue_uninitialize+0x1f/0x60 [<ffffffff80430a0f>] ? rt2x00queue_uninitialize+0x1f/0x60 [<ffffffff8028b9aa>] kfree+0xaa/0x120 [<ffffffff80430a0f>] rt2x00queue_uninitialize+0x1f/0x60 [<ffffffff8042e7dd>] rt2x00lib_uninitialize+0x2d/0x40 [<ffffffff8042ea91>] rt2x00lib_remove_dev+0x21/0x40 [<ffffffff80431c5a>] rt2x00usb_disconnect+0x2a/0x70 [<ffffffff80497660>] usb_unbind_interface+0x70/0xd0 [<ffffffff803ea19c>] __device_release_driver+0x6c/0xa0 [<ffffffff803ea2cb>] device_release_driver+0x2b/0x40 [<ffffffff803e941d>] bus_remove_device+0x8d/0xb0 [<ffffffff803e7cab>] device_del+0x10b/0x190 [<ffffffff80494361>] usb_disable_device+0x91/0x120 [<ffffffff8048f663>] usb_disconnect+0xa3/0x140 [<ffffffff80490ab6>] hub_thread+0x426/0xf50 [<ffffffff80247cc0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff80490690>] ? hub_thread+0x0/0xf50 [<ffffffff802478ad>] kthread+0x4d/0x80 [<ffffffff8020c238>] child_rip+0xa/0x12 [<ffffffff8020bdf3>] ? restore_args+0x0/0x30 [<ffffffff80247860>] ? kthread+0x0/0x80 [<ffffffff8020c22e>] ? child_rip+0x0/0x12 so the problems are still there. Could you test wireless-testing.git to see if the problem persists in there? Yes, I did the tests: -rc7 : unchanged, same results. wireless-testing from this morning (22/06 10h GMT+1): - the circular lock dependency at plug in is still there, but doesn't always trigger (therefore this is a race condition that triggers it). I appended the dmesg log for it. - the scan is possible, but it's still impossible to associate. Also appended. - at least, the free corruption at unplug is fixed (the most important I think for data integrity). Created attachment 16581 [details]
dmesg log just after inserting
not reproduceable at each insertion
Created attachment 16582 [details]
next logs when starting wpa_supplicant
note that the "link" led is not active, whereas it is in 2.6.25. I guess the driver is not functionnal, however scanning works.
Thanks for the test, I'll check which patches are between 2.6.26-rc6 and wireless-testing which might have solved this free corruption. The locking issue seems to be triggered by a 'ifdown' event, but you get this warning during start which is odd. You *should* be okay as long as no 'ifdown' command is given. But I have spotted a possible problem when 'ifdown' is called while rt2x00 is still busy configuring the hardware. I'll see what I can do about that. The leds are at this moment non-informative, if they don't blink it doesn't mean the driver doesn't work. I still have to investigate that issue some more, but leds aren't an high priority for me. ;) From experience, I'm not sure udev doesn't trigger some configuration scripts that could do ifdown before reconfiguring the interface. In fact it is very possible a ifdown is executed when I plug it, because to do test manually, I didn't activated the wlan0 if in my rc scripts. It is slackware, the scripts are easy to understand, I can look at it if necessary. OK for the leds :-) Created attachment 16587 [details]
QOS queue initialization
Please test attached patch to see if the segmentation fault still persists.
I don't have high hopes on this patch, but it seems to be the only patch in wireless-testing that fixed a similar crash, and we might be lucky and this works for your crash as well. :)
Handled-By : Ivo van Doorn <IvDoorn@gmail.com> Created attachment 16600 [details]
Don't use mac80211 workqueue
Above the previous patch, please try this patch as well.
This should fix the locking dependency problem
With the two patches on -rc7, I didn't managed to reproduce the lock error at plug in, but it doesn't occur very often anyway. So it may not be significant. The free pointer corruption at removal is still there however (and very reproduceable). It's still impossible to associate - and the link led is still off (whereas it is on on 2.6.25 and I associate, so I guess even if it is not used in the driver, it is a clue that the hardware indeed works for my key). Ok, I'll push the second patch as bugfix for the lockdep issue tomorrow (I'm quite confident that this patch is indeed the bugfix). But let me know if it reappears :) I'll see what I can discover on that free pointer Created attachment 16640 [details]
Make entry->flags and rt2x00->flags atomic
Could you try this patch to see if this helps in fixing the freepointer problem?
Currently my best guess is a race condition where the queues are freed twice, so far you are the only one reporting this, so it would seem rare enough for a race condition. ;)
I removed the queue patch and added our last patch (bits.diff) on -rc7, it still crashes. I append the log. Sorry ;-) Created attachment 16649 [details]
free pointer corrupt with -rc6+bits.diff patch+first patch for lock
may be this one will speak a little more?
Created attachment 16650 [details]
Make rt2x00->flags atomic and fix race in remove_dev()
Ok updated patch attached. I missed one if-statement which could perhaps lead to a race condition. Let's hope for the best. :)
with -rc8, the last patch does'nt build. I got: drivers/net/wireless/rt2x00/rt2x00dev.c: In function ‘rt2x00lib_intf_scheduled_iter’: drivers/net/wireless/rt2x00/rt2x00dev.c:434: erreur: ‘DEVICE_ENABLED_RADIO’ undeclared (first use in this function) drivers/net/wireless/rt2x00/rt2x00dev.c:434: erreur: (Each undeclared identifier is reported only once drivers/net/wireless/rt2x00/rt2x00dev.c:434: erreur: for each function it appears in.) make[4]: *** [drivers/net/wireless/rt2x00/rt2x00dev.o] Erreur 1 make[3]: *** [drivers/net/wireless/rt2x00] Erreur 2 make[2]: *** [drivers/net/wireless] Erreur 2 make[1]: *** [drivers/net] Erreur 2 make: *** [drivers] Erreur 2 CC net/ipv6/af_inet6.o make: *** Attente des tâches non terminées.... Created attachment 16691 [details]
Make rt2x00->flags atomic and fix race in remove_dev() - v2
Sorry about that, I hadn't created the patch on top of 16600 :S
This new patch should do the trick.
Created attachment 16703 [details]
dmesg output for 2.6.26-rc8+bits2.diff+work.diff
Well, sorry, it still does not work.
I appended another dmesg, it's quite different, maybe usefull.
Created attachment 16704 [details]
Make rt2x00->flags atomic and fix race in remove_dev() - v3
Argh, forgot a negation somewhere. :S
Please try updated patch.
Created attachment 16721 [details]
dmesg output for 2.6.26-rc8+bits3.diff+work.diff
sorry, still not there :-(
Created attachment 16816 [details]
dmesg output for 2.6.26 final
For the record, here is the dmesg output for final 2.6.26.
It shows the logs first when executing wpa_supplicant, which doesn't manage to associate to my AP, then the free pointer when I remove the device (after killing wpa_supplicant btw).
I managed to get the link led working fot the device, it was a change in kernel configuration (had to add support for led devices, then add led driver in rt driver). However, it effectively doesn't help.
Everything seems to work with 2.6.27-rc3 now. I managed to plug it in, associate, get an IP address, access to the Internet, and unplug the device event without ifconfig it down, without any problem or any warning from the kernel compiled with hacking options. Thanks |