Bug 50541
Summary: | Kernel OOPS in hidp when resuming from suspend | ||
---|---|---|---|
Product: | Drivers | Reporter: | karlrelton000+ubuntu |
Component: | Bluetooth | Assignee: | drivers_input-devices |
Status: | NEW --- | ||
Severity: | normal | CC: | alan, clancy.kieran+kernel, dave, szg00000 |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.5 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
Testing patch
Proposed patch to fix the problem Poposed fix full dmesg |
Description
karlrelton000+ubuntu
2012-11-13 17:04:17 UTC
Looking at the Changelog for hidp/core.c there are only a few commits that are more obviously related to this between 3.2 and 3.5 releases (although it could be a non-obvious issue, perhaps even elsewhere?). I wonder if commit 794d175698f0e78be7f2e3f4bdbe0e7cd3f2d6ae has some bearing ('Bluetooth: hidp: Stop I/O on shutdown'). It adds wake_up_interruptible(&session->report_queue) at line 748. Is it possible that this wake_up call schedules a function elsewhere to execute, but then this hidp_session function goes ahead and frees the session data structure before the other function has had time to execute? This could me just clutching at things I don't really understand, so please excuse if its a false question. Just a "me too", with a bt mouse on kernel 3.5.0 on a MacBook4,1. I am still debugging, but I now have enough evidence to show that something is trashing/losing the reference ctrl_sk->sk_wq so that it points to a bogus value rather than the value 'socket_wq' structure. This happens while the thread executing hidp_session() is waiting in the wait_event_timeout() loop towards the end of the hidp_session() procedure. So it enters the loop okay, with ctrl_sk->sk_wq in tact, but then whilst asleep something else is removing/overwriting it so the next time a wakeup is attempted the 'lookup' of ctrl_sk->wq produces a bogus value (like 'null', or '000000032'). It seems clear to me that that the 'sock' structure pointed to by ctrl_sk is being freed elsewhere so that memory is then re-used and over-written (hence ctrl_sk is no longer valid). I've tried changing the wait_event_timeout in hidp_session to bt_sock_wait_state (with surrounding lock_sock and release_sock), but that didn't work because the lock is still released while the thread waits on another. I've tried using sock_hold/sock_put pairs that I would have thought would have stopped ctrl_sk from being freed. However that doesn't seem to have worked either (which I don't understand). A bit more testing, and a bit more accurate diagnosis to report. The ctrl_sk is being orphaned in the l2cap bluetooth driver code. The orphaning sets the sk_wq to null, leading to the OOPS in the wait_event_timeout() call in hidp_session. Is there some way of marking the sock as in use so that l2cap doesn't orphan it straight away? Created attachment 88991 [details]
Testing patch
Could you please whether this patch (totally untested) makes any difference? Thanks.
I've applied the patch and tried testing it - and I can see the theory that it should work. However my first test run showed up another related failure: one that I had seen before but is rarer: the ctrl_sk (& intr_sk) are in fact orphaned even before hidp_session() comes out of its main 'while' loop. This is shown in the following trace I captured last night (with my own annotation printks instrumented in): Dec 12 20:01:52 laptop2 kernel: [ 143.776184] l2cap release start socket dea22780, sock f5a48200 Dec 12 20:01:52 laptop2 kernel: [ 143.776322] l2cap release orpan socket dea22780, sock f5a48200 Dec 12 20:01:52 laptop2 kernel: [ 143.776340] l2cap release start socket dea22f00, sock e0a19000 Dec 12 20:01:52 laptop2 kernel: [ 143.776367] l2cap release orpan socket dea22f00, sock e0a19000 Dec 12 20:01:53 laptop2 kernel: [ 143.778283] hidp_session f580e100 exited while loop ctrl_sk/wq/h e0a19000 (null) 1, intr_sk/wq/h f5a48200 (null) 1 Dec 12 20:01:53 laptop2 kernel: [ 143.778307] BUG: unable to handle kernel NULL pointer dereference at (null) Dec 12 20:01:53 laptop2 kernel: [ 143.778443] IP: [<c103d518>] __ticket_spin_lock+0x8/0x30 Dec 12 20:01:53 laptop2 kernel: [ 143.778548] *pdpt = 0000000033bf7001 *pde = 0000000000000000 Dec 12 20:01:53 laptop2 kernel: [ 143.778653] Oops: 0002 [#1] SMP Dec 12 20:01:53 laptop2 kernel: [ 143.778718] Modules linked in: nfsv4 hid_generic hid_apple hidp(OF) hid bnep rfcomm parport_pc ppdev autofs4 nfsd nfs_acl auth_rpcgss nfs fscache binfmt_misc lockd sunrpc nls_iso8859_1 dm_crypt arc4 iwldvm mac80211 snd_hda_codec_idt btusb snd_hda_intel coretemp snd_hda_codec bluetooth(OF) pcmcia joydev yenta_socket iwlwifi pcmcia_rsrc kvm snd_hwdep pcmcia_core dell_laptop dell_wmi snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event gpio_ich dcdbas snd_seq snd_timer snd_seq_device mac_hid psmouse microcode snd soundcore cfg80211 serio_raw sparse_keymap snd_page_alloc lpc_ich firewire_sbp2 lp parport i915 drm_kms_helper firewire_ohci drm firewire_core i2c_algo_bit crc_itu_t wmi video tg3 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] Pid: 3253, comm: khidpd_05ac0256 Tainted: GF W O 3.7.0-4-generic #12 Dell Inc. Latitude D630 /0KU184 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] EIP: 0060:[<c103d518>] EFLAGS: 00010092 CPU: 1 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] EIP is at __ticket_spin_lock+0x8/0x30 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] EAX: 00000000 EBX: 00000292 ECX: 00000000 EDX: 00000100 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] ESI: f2537f34 EDI: f580e100 EBP: f2537e98 ESP: f2537e98 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] CR0: 8005003b CR2: 00000000 CR3: 3588c000 CR4: 000007f0 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] DR6: ffff0ff0 DR7: 00000400 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] Process khidpd_05ac0256 (pid: 3253, ti=f2536000 task=dd8432c0 task.ti=f2536000) Dec 12 20:01:53 laptop2 kernel: [ 143.780045] Stack: Dec 12 20:01:53 laptop2 kernel: [ 143.780045] f2537ea0 c103d5b8 f2537eb4 c160e02d 00000000 f2537f34 f580e100 f2537ec8 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] c106cb2d e0a19000 f5a48200 f580e100 f2537f6c f8e8a176 f8e8c4b0 f580e100 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] e0a19000 00000000 00000001 f5a48200 00000000 00000001 dd8432c0 f580e1c0 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] Call Trace: Dec 12 20:01:53 laptop2 kernel: [ 143.780045] [<c103d5b8>] default_spin_lock_flags+0x8/0x10 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] [<c160e02d>] _raw_spin_lock_irqsave+0x2d/0x40 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] [<c106cb2d>] remove_wait_queue+0x1d/0x50 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] [<f8e8a176>] hidp_session+0x426/0xb50 [hidp] Dec 12 20:01:53 laptop2 kernel: [ 143.780045] [<c107dba0>] ? try_to_wake_up+0x190/0x190 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] [<c107dba0>] ? try_to_wake_up+0x190/0x190 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] [<c10769ee>] ? complete+0x4e/0x60 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] [<f8e89d50>] ? hidp_input_report.isra.7+0x2a0/0x2a0 [hidp] Dec 12 20:01:53 laptop2 kernel: [ 143.780045] [<c106c2d4>] kthread+0x94/0xa0 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] [<c1070000>] ? hrtimer_force_reprogram+0xd0/0xe0 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] [<c1615477>] ret_from_kernel_thread+0x1b/0x28 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] [<c106c240>] ? flush_kthread_worker+0x90/0x90 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] Code: ff ff 90 b8 cd d3 03 c1 b9 ca d3 03 c1 e9 59 ff ff ff 90 b8 d3 d3 03 c1 b9 d0 d3 03 c1 e9 49 ff ff ff 90 55 ba 00 01 00 00 89 e5 <f0> 66 0f c1 10 0f b6 ce 38 d1 74 0d 8d 74 26 00 f3 90 0f b6 10 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] EIP: [<c103d518>] __ticket_spin_lock+0x8/0x30 SS:ESP 0068:f2537e98 Dec 12 20:01:53 laptop2 kernel: [ 143.780045] CR2: 0000000000000000 Dec 12 20:01:53 laptop2 kernel: [ 143.889761] ---[ end trace f27ab55d59fbdc1c ]--- My printks in this trace show that both intr_sk and ctrl_sk are released & orphaned in the l2cap code even as hidp_session comes out of its while loop. So now anything after the while() loop is problematic: - remove_wait_queue (x2) - hidp_schedule call (which does wake_up_interruptible ...) - and the wait_event_timeout originally reported - ? I assume the fput (x2) will still be okay? not sure ... - ?session->intr_sock->sk->sk_err = EUNATCH; (and ctrl_sock equiv) these traverse the 'socket' afresh rather than relying on ctrl_sk/intr_sk variables, so presumably okay? It would seem that, on my system at least, the sock_orphan can now happen very very early, way before hidp_session has had time to clean up. Created attachment 89511 [details]
Proposed patch to fix the problem
I've worked around the issue with the attached patch.
It stores the wait_queue_heads in local variables and uses the variables to reference them. Even if the sock is 'orphaned', the wait_queue_heads should still be available - so I don't understand the need for the use of the 'sk_sleep' for every look up of the wait_queue_head.
After the main while() loop, by now the sock may already be orphaned: so a test is put in and the wait_event_timeout is only done if the sock wait_queue_head still matches the original.
Created attachment 90761 [details]
Poposed fix
This fix is I think what we need.
Calls to rcu_dereference_raw are still used to protect the waitqueue(head) data structures, but the hidp_session stores the pointer to the waitqueue_head rather than relying on looking it up from the sock structure (which may have been orphaned, and so point to null).
Testing on my system is positive so far.
This bug is highly related to https://bugzilla.kernel.org/show_bug.cgi?id=52471 In bug 52471, a 5 second penalty for the execution of hidp_session is documented. Without that penalty the race between hidp_session and whatever is causing the sock structures to be orphaned is going to be won by hidp_session. In other words, fix 52471 and this bug will not show up. Created attachment 92441 [details]
full dmesg
I have similar crash when plugging in a BT mouse (without suspend or resume). Full dmesg attached.
[ 26.920280] Call Trace:
[ 26.920311] [<ffffffff8103fa39>] default_spin_lock_flags+0x9/0x10
[ 26.920388] [<ffffffff81683bee>] _raw_spin_lock_irqsave+0x2e/0x40
[ 26.920465] [<ffffffff81076871>] finish_wait+0x41/0x80
[ 26.920531] [<ffffffffa0644d05>] hidp_session+0x5c5/0x8f0 [hidp]
[ 26.920606] [<ffffffff81087cc0>] ? try_to_wake_up+0x2a0/0x2a0
[ 26.920678] [<ffffffff81087cc0>] ? try_to_wake_up+0x2a0/0x2a0
[ 26.920750] [<ffffffff810768b0>] ? finish_wait+0x80/0x80
[ 26.920817] [<ffffffffa0644740>] ? hidp_input_report.isra.4+0x290/0x290 [hidp]
[ 26.920906] [<ffffffff81075f13>] kthread+0x93/0xa0
[ 26.920967] [<ffffffff8168ce64>] kernel_thread_helper+0x4/0x10
[ 26.921039] [<ffffffff81075e80>] ? kthread_freezable_should_stop+0x70/0x70
[ 26.921125] [<ffffffff8168ce60>] ? gs_change+0x13/0x13
Chris
Please try this bug with latest kernel image. |