Bug 50541 - Kernel OOPS in hidp when resuming from suspend
Summary: Kernel OOPS in hidp when resuming from suspend
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Bluetooth (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_input-devices
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-11-13 17:04 UTC by karlrelton000+ubuntu
Modified: 2017-03-06 20:03 UTC (History)
4 users (show)

See Also:
Kernel Version: 3.5
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Testing patch (710 bytes, patch)
2012-12-12 15:39 UTC, Jiri Kosina
Details | Diff
Proposed patch to fix the problem (1.75 KB, patch)
2012-12-20 15:54 UTC, karlrelton000+ubuntu
Details | Diff
Poposed fix (2.10 KB, patch)
2013-01-08 14:11 UTC, karlrelton000+ubuntu
Details | Diff
full dmesg (79.43 KB, application/octet-stream)
2013-02-02 18:45 UTC, chrmhoffmann
Details

Description karlrelton000+ubuntu 2012-11-13 17:04:17 UTC
Somewhere between 3.2 and 3.5 an issue with bluetooth keyboards on resuming from suspend crept in.

On a resume I now get the following:

Oct 23 15:42:34 laptop2 kernel: [10357.752131] BUG: unable to handle kernel NULL pointer dereference at   (null)
Oct 23 15:42:34 laptop2 kernel: [10357.752273] IP: [<c1037a58>] __ticket_spin_lock+0x8/0x30
Oct 23 15:42:34 laptop2 kernel: [10357.752379] *pdpt = 0000000000000000 *pde = f000ec62f000ec62 
Oct 23 15:42:34 laptop2 kernel: [10357.752485] Oops: 0002 [#1] SMP 
Oct 23 15:42:34 laptop2 kernel: [10357.752551] Modules linked in: hid_generic hid_apple hidp hid bnep rfcomm parport_pc ppdev autofs4 nfsd nfs binfmt_misc lockd fscache auth_rpcgss nfs_acl sunrpc nls_iso8859_1 dm_crypt arc4 snd_hda_codec_idt snd_hda_intel snd_hda_codec joydev coretemp pcmcia iwlwifi snd_hwdep snd_pcm yenta_socket snd_seq_midi btusb pcmcia_rsrc snd_rawmidi pcmcia_core mac80211 kvm bluetooth snd_seq_midi_event snd_seq dell_laptop dell_wmi cfg80211 snd_timer snd_seq_device microcode psmouse gpio_ich dcdbas serio_raw sparse_keymap lpc_ich mac_hid snd soundcore snd_page_alloc firewire_sbp2 lp parport firewire_ohci firewire_core crc_itu_t tg3 i915 wmi drm_kms_helper drm i2c_algo_bit video
Oct 23 15:42:34 laptop2 kernel: [10357.753834] 
Oct 23 15:42:34 laptop2 kernel: [10357.753868] Pid: 10165, comm: khidpd_05ac0256 Tainted: G        W    3.5.0-17-generic #28-Ubuntu Dell Inc. Latitude D630                   /0KU184
Oct 23 15:42:34 laptop2 kernel: [10357.754104] EIP: 0060:[<c1037a58>] EFLAGS: 00010096 CPU: 1
Oct 23 15:42:34 laptop2 kernel: [10357.754205] EIP is at __ticket_spin_lock+0x8/0x30
Oct 23 15:42:34 laptop2 kernel: [10357.754286] EAX: 00000000 EBX: 00000296 ECX: 00000000 EDX: 00000100
Oct 23 15:42:34 laptop2 kernel: [10357.754393] ESI: dfaedfa4 EDI: 00000000 EBP: dfaedefc ESP: dfaedefc
Oct 23 15:42:34 laptop2 kernel: [10357.754501]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Oct 23 15:42:34 laptop2 kernel: [10357.754594] CR0: 8005003b CR2: 00000000 CR3: 01974000 CR4: 000007f0
Oct 23 15:42:34 laptop2 kernel: [10357.754702] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Oct 23 15:42:34 laptop2 kernel: [10357.754808] DR6: ffff0ff0 DR7: 00000400
Oct 23 15:42:34 laptop2 kernel: [10357.754875] Process khidpd_05ac0256 (pid: 10165, ti=dfaec000 task=df2b0cb0 task.ti=dfaec000)
Oct 23 15:42:34 laptop2 kernel: [10357.755021] Stack:
Oct 23 15:42:34 laptop2 kernel: [10357.755058]  dfaedf04 c1037af8 dfaedf18 c15c924d dfaedf98 dfaedfa4 00000000 dfaedf2c
Oct 23 15:42:34 laptop2 kernel: [10357.755221]  c10654b3 00000000 cfb0b800 e56d0c00 dfaedfb8 f902ea9b 00000000 00000001
Oct 23 15:42:34 laptop2 kernel: [10357.755383]  df2b0cb0 cfb0b8c0 e56d1a00 e56d0c00 cfb0b890 cfb0b880 df2b0cb0 e56d1a48
Oct 23 15:42:34 laptop2 kernel: [10357.755544] Call Trace:
Oct 23 15:42:34 laptop2 kernel: [10357.755595]  [<c1037af8>] default_spin_lock_flags+0x8/0x10
Oct 23 15:42:34 laptop2 kernel: [10357.755698]  [<c15c924d>] _raw_spin_lock_irqsave+0x2d/0x40
Oct 23 15:42:34 laptop2 kernel: [10357.755800]  [<c10654b3>] finish_wait+0x33/0x70
Oct 23 15:42:34 laptop2 kernel: [10357.755885]  [<f902ea9b>] hidp_session+0x48b/0x730 [hidp]
Oct 23 15:42:34 laptop2 kernel: [10357.755985]  [<c1075e90>] ? try_to_wake_up+0x230/0x230
Oct 23 15:42:34 laptop2 kernel: [10357.756079]  [<c1075e90>] ? try_to_wake_up+0x230/0x230
Oct 23 15:42:34 laptop2 kernel: [10357.756090]  [<c10654f0>] ? finish_wait+0x70/0x70
Oct 23 15:42:34 laptop2 kernel: [10357.756090]  [<f902e610>] ? hidp_input_report.isra.4+0x2a0/0x2a0 [hidp]
Oct 23 15:42:34 laptop2 kernel: [10357.756090]  [<c1064dd2>] kthread+0x72/0x80
Oct 23 15:42:34 laptop2 kernel: [10357.756090]  [<c1064d60>] ? kthread_freezable_should_stop+0x60/0x60
Oct 23 15:42:34 laptop2 kernel: [10357.756090]  [<c15d04fe>] kernel_thread_helper+0x6/0x10
Oct 23 15:42:34 laptop2 kernel: [10357.756090] Code: b9 fa 78 03 c1 e9 59 ff ff ff 90 b8 03 79 03 c1 b9 00 79 03 c1 e9 49 ff ff ff 66 90 66 90 66 90 66 90 90 55 ba 00 01 00 00 89 e5 <f0> 66 0f c1 10 0f b6 ce 38 d1 74 0d 8d 74 26 00 f3 90 0f b6 10 
Oct 23 15:42:34 laptop2 kernel: [10357.756090] EIP: [<c1037a58>] __ticket_spin_lock+0x8/0x30 SS:ESP 0068:dfaedefc
Oct 23 15:42:34 laptop2 kernel: [10357.756090] CR2: 0000000000000000
Oct 23 15:42:34 laptop2 kernel: [10357.756090] ---[ end trace 9a445e0f7d0acd41 ]---

The issue seems to be in hidp, which makes sense.

Problem still exists in 3.7.
Comment 1 karlrelton000+ubuntu 2012-11-13 17:10:37 UTC
Looking at the Changelog for hidp/core.c there are only a few commits that are more obviously related to this between 3.2 and 3.5 releases (although it could be a non-obvious issue, perhaps even elsewhere?).

I wonder if commit 794d175698f0e78be7f2e3f4bdbe0e7cd3f2d6ae has some bearing ('Bluetooth: hidp: Stop I/O on shutdown').

It adds wake_up_interruptible(&session->report_queue) at line 748. Is it possible that this wake_up call schedules a function elsewhere to execute, but then this hidp_session function goes ahead and frees the session data structure before the other function has had time to execute?

This could me just clutching at things I don't really understand, so please excuse if its a false question.
Comment 2 Dave Vasilevsky 2012-11-23 21:39:12 UTC
Just a "me too", with a bt mouse on kernel 3.5.0 on a MacBook4,1.
Comment 3 karlrelton000+ubuntu 2012-11-29 19:35:00 UTC
I am still debugging, but I now have enough evidence to show that something is trashing/losing the reference ctrl_sk->sk_wq so that it points to a bogus value rather than the value 'socket_wq' structure.

This happens while the thread executing hidp_session() is waiting in the wait_event_timeout() loop towards the end of the hidp_session() procedure.

So it enters the loop okay, with ctrl_sk->sk_wq in tact, but then whilst asleep something else is removing/overwriting it so the next time a wakeup is attempted the 'lookup' of ctrl_sk->wq produces a bogus value (like 'null', or '000000032').
Comment 4 karlrelton000+ubuntu 2012-12-06 20:58:17 UTC
It seems clear to me that that the 'sock' structure pointed to by ctrl_sk is being freed elsewhere so that memory is then re-used and over-written (hence ctrl_sk is no longer valid).

I've tried changing the wait_event_timeout in hidp_session to bt_sock_wait_state (with surrounding lock_sock and release_sock), but that didn't work because the lock is still released while the thread waits on another.

I've tried using sock_hold/sock_put pairs that I would have thought would have stopped ctrl_sk from being freed. However that doesn't seem to have worked either (which I don't understand).
Comment 5 karlrelton000+ubuntu 2012-12-11 20:04:19 UTC
A bit more testing, and a bit more accurate diagnosis to report. The ctrl_sk is being orphaned in the l2cap bluetooth driver code. The orphaning sets the sk_wq to null, leading to the OOPS in the wait_event_timeout() call in hidp_session.

Is there some way of marking the sock as in use so that l2cap doesn't orphan it straight away?
Comment 6 Jiri Kosina 2012-12-12 15:39:02 UTC
Created attachment 88991 [details]
Testing patch

Could you please whether this patch (totally untested) makes any difference? Thanks.
Comment 7 karlrelton000+ubuntu 2012-12-13 08:34:12 UTC
I've applied the patch and tried testing it - and I can see the theory that it should work.

However my first test run showed up another related failure: one that I had seen before but is rarer: the ctrl_sk (& intr_sk) are in fact orphaned even before hidp_session() comes out of its main 'while' loop. This is shown in the following trace I captured last night (with my own annotation printks instrumented in):

Dec 12 20:01:52 laptop2 kernel: [  143.776184] l2cap  release start socket dea22780, sock f5a48200
Dec 12 20:01:52 laptop2 kernel: [  143.776322] l2cap  release orpan socket dea22780, sock f5a48200
Dec 12 20:01:52 laptop2 kernel: [  143.776340] l2cap  release start socket dea22f00, sock e0a19000
Dec 12 20:01:52 laptop2 kernel: [  143.776367] l2cap  release orpan socket dea22f00, sock e0a19000
Dec 12 20:01:53 laptop2 kernel: [  143.778283] hidp_session f580e100 exited while loop ctrl_sk/wq/h e0a19000   (null) 1, intr_sk/wq/h f5a48200   (null) 1
Dec 12 20:01:53 laptop2 kernel: [  143.778307] BUG: unable to handle kernel NULL pointer dereference at   (null)
Dec 12 20:01:53 laptop2 kernel: [  143.778443] IP: [<c103d518>] __ticket_spin_lock+0x8/0x30
Dec 12 20:01:53 laptop2 kernel: [  143.778548] *pdpt = 0000000033bf7001 *pde = 0000000000000000 
Dec 12 20:01:53 laptop2 kernel: [  143.778653] Oops: 0002 [#1] SMP 
Dec 12 20:01:53 laptop2 kernel: [  143.778718] Modules linked in: nfsv4 hid_generic hid_apple hidp(OF) hid bnep rfcomm parport_pc ppdev autofs4 nfsd nfs_acl auth_rpcgss nfs fscache binfmt_misc lockd sunrpc nls_iso8859_1 dm_crypt arc4 iwldvm mac80211 snd_hda_codec_idt btusb snd_hda_intel coretemp snd_hda_codec bluetooth(OF) pcmcia joydev yenta_socket iwlwifi pcmcia_rsrc kvm snd_hwdep pcmcia_core dell_laptop dell_wmi snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event gpio_ich dcdbas snd_seq snd_timer snd_seq_device mac_hid psmouse microcode snd soundcore cfg80211 serio_raw sparse_keymap snd_page_alloc lpc_ich firewire_sbp2 lp parport i915 drm_kms_helper firewire_ohci drm firewire_core i2c_algo_bit crc_itu_t wmi video tg3
Dec 12 20:01:53 laptop2 kernel: [  143.780045] Pid: 3253, comm: khidpd_05ac0256 Tainted: GF       W  O 3.7.0-4-generic #12 Dell Inc. Latitude D630                   /0KU184
Dec 12 20:01:53 laptop2 kernel: [  143.780045] EIP: 0060:[<c103d518>] EFLAGS: 00010092 CPU: 1
Dec 12 20:01:53 laptop2 kernel: [  143.780045] EIP is at __ticket_spin_lock+0x8/0x30
Dec 12 20:01:53 laptop2 kernel: [  143.780045] EAX: 00000000 EBX: 00000292 ECX: 00000000 EDX: 00000100
Dec 12 20:01:53 laptop2 kernel: [  143.780045] ESI: f2537f34 EDI: f580e100 EBP: f2537e98 ESP: f2537e98
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Dec 12 20:01:53 laptop2 kernel: [  143.780045] CR0: 8005003b CR2: 00000000 CR3: 3588c000 CR4: 000007f0
Dec 12 20:01:53 laptop2 kernel: [  143.780045] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Dec 12 20:01:53 laptop2 kernel: [  143.780045] DR6: ffff0ff0 DR7: 00000400
Dec 12 20:01:53 laptop2 kernel: [  143.780045] Process khidpd_05ac0256 (pid: 3253, ti=f2536000 task=dd8432c0 task.ti=f2536000)
Dec 12 20:01:53 laptop2 kernel: [  143.780045] Stack:
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  f2537ea0 c103d5b8 f2537eb4 c160e02d 00000000 f2537f34 f580e100 f2537ec8
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  c106cb2d e0a19000 f5a48200 f580e100 f2537f6c f8e8a176 f8e8c4b0 f580e100
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  e0a19000 00000000 00000001 f5a48200 00000000 00000001 dd8432c0 f580e1c0
Dec 12 20:01:53 laptop2 kernel: [  143.780045] Call Trace:
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  [<c103d5b8>] default_spin_lock_flags+0x8/0x10
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  [<c160e02d>] _raw_spin_lock_irqsave+0x2d/0x40
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  [<c106cb2d>] remove_wait_queue+0x1d/0x50
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  [<f8e8a176>] hidp_session+0x426/0xb50 [hidp]
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  [<c107dba0>] ? try_to_wake_up+0x190/0x190
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  [<c107dba0>] ? try_to_wake_up+0x190/0x190
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  [<c10769ee>] ? complete+0x4e/0x60
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  [<f8e89d50>] ? hidp_input_report.isra.7+0x2a0/0x2a0 [hidp]
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  [<c106c2d4>] kthread+0x94/0xa0
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  [<c1070000>] ? hrtimer_force_reprogram+0xd0/0xe0
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  [<c1615477>] ret_from_kernel_thread+0x1b/0x28
Dec 12 20:01:53 laptop2 kernel: [  143.780045]  [<c106c240>] ? flush_kthread_worker+0x90/0x90
Dec 12 20:01:53 laptop2 kernel: [  143.780045] Code: ff ff 90 b8 cd d3 03 c1 b9 ca d3 03 c1 e9 59 ff ff ff 90 b8 d3 d3 03 c1 b9 d0 d3 03 c1 e9 49 ff ff ff 90 55 ba 00 01 00 00 89 e5 <f0> 66 0f c1 10 0f b6 ce 38 d1 74 0d 8d 74 26 00 f3 90 0f b6 10
Dec 12 20:01:53 laptop2 kernel: [  143.780045] EIP: [<c103d518>] __ticket_spin_lock+0x8/0x30 SS:ESP 0068:f2537e98
Dec 12 20:01:53 laptop2 kernel: [  143.780045] CR2: 0000000000000000
Dec 12 20:01:53 laptop2 kernel: [  143.889761] ---[ end trace f27ab55d59fbdc1c ]---

My printks in this trace show that both intr_sk and ctrl_sk are released & orphaned in the l2cap code even as hidp_session comes out of its while loop. So now anything after the while() loop is problematic:

 - remove_wait_queue  (x2)
 - hidp_schedule call (which does wake_up_interruptible ...)
 - and the wait_event_timeout originally reported

 - ? I assume the fput (x2) will still be okay? not sure ...

 - ?session->intr_sock->sk->sk_err = EUNATCH;  (and ctrl_sock equiv)
   these traverse the 'socket' afresh rather than relying on ctrl_sk/intr_sk variables, so presumably okay?


It would seem that, on my system at least, the sock_orphan can now happen very very early, way before hidp_session has had time to clean up.
Comment 8 karlrelton000+ubuntu 2012-12-20 15:54:23 UTC
Created attachment 89511 [details]
Proposed patch to fix the problem

I've worked around the issue with the attached patch.

It stores the wait_queue_heads in local variables and uses the variables to reference them. Even if the sock is 'orphaned', the wait_queue_heads should still be available - so I don't understand the need for the use of the 'sk_sleep' for every look up of the wait_queue_head.

After the main while() loop, by now the sock may already be orphaned: so a test is put in and the wait_event_timeout is only done if the sock wait_queue_head still matches the original.
Comment 9 karlrelton000+ubuntu 2013-01-08 14:11:56 UTC
Created attachment 90761 [details]
Poposed fix

This fix is I think what we need.

Calls to rcu_dereference_raw are still used to protect the waitqueue(head) data structures, but the hidp_session stores the pointer to the waitqueue_head rather than relying on looking it up from the sock structure (which may have been orphaned, and so point to null).

Testing on my system is positive so far.
Comment 10 karlrelton000+ubuntu 2013-01-29 20:18:57 UTC
This bug is highly related to https://bugzilla.kernel.org/show_bug.cgi?id=52471

In bug 52471, a 5 second penalty for the execution of hidp_session is documented. Without that penalty the race between hidp_session and whatever is causing the sock structures to be orphaned is going to be won by hidp_session.

In other words, fix 52471 and this bug will not show up.
Comment 11 chrmhoffmann 2013-02-02 18:45:44 UTC
Created attachment 92441 [details]
full dmesg

I have similar crash when plugging in a BT mouse (without suspend or resume). Full dmesg attached.


[   26.920280] Call Trace:
[   26.920311]  [<ffffffff8103fa39>] default_spin_lock_flags+0x9/0x10
[   26.920388]  [<ffffffff81683bee>] _raw_spin_lock_irqsave+0x2e/0x40
[   26.920465]  [<ffffffff81076871>] finish_wait+0x41/0x80
[   26.920531]  [<ffffffffa0644d05>] hidp_session+0x5c5/0x8f0 [hidp]
[   26.920606]  [<ffffffff81087cc0>] ? try_to_wake_up+0x2a0/0x2a0
[   26.920678]  [<ffffffff81087cc0>] ? try_to_wake_up+0x2a0/0x2a0
[   26.920750]  [<ffffffff810768b0>] ? finish_wait+0x80/0x80
[   26.920817]  [<ffffffffa0644740>] ? hidp_input_report.isra.4+0x290/0x290 [hidp]
[   26.920906]  [<ffffffff81075f13>] kthread+0x93/0xa0
[   26.920967]  [<ffffffff8168ce64>] kernel_thread_helper+0x4/0x10
[   26.921039]  [<ffffffff81075e80>] ? kthread_freezable_should_stop+0x70/0x70
[   26.921125]  [<ffffffff8168ce60>] ? gs_change+0x13/0x13

Chris
Comment 12 Szőgyényi Gábor 2017-03-06 20:03:19 UTC
Please try this bug with latest kernel image.

Note You need to log in before you can comment on or make changes to this bug.