Created attachment 256957 [details]
alsa-info --stdout output
This bug is filed here as a follow up of the GFX's team bug filed on Freedesktop.org: https://bugs.freedesktop.org/show_bug.cgi?id=101246
We managed to hit this issue twice in 39 runs of Intel's GFX CI, by running the IGT test igt@drv_module_reload@basic-reload-inject (https://cgit.freedesktop.org/xorg/app/intel-gpu-tools/tree/tests/drv_module_reload.c).
Here is the backtrace we got:
[ 292.165356] BUG: unable to handle kernel NULL pointer dereference at 0000000000000a20
[ 292.165393] IP: call_hp_automute+0xb/0x30 [snd_hda_codec_generic]
[ 292.165407] PGD 0
[ 292.165409] P4D 0
[ 292.165426] Oops: 0000 [#1] PREEMPT SMP
[ 292.165437] Modules linked in: snd_hda_intel i915 vgem coretemp tg3 ptp pps_core lpc_ich snd_hda_codec_analog snd_hda_codec_generic snd_hda_codec snd_hwdep snd_hda_core snd_pcm prime_numbers [last unloaded: i915]
[ 292.165501] CPU: 0 PID: 148 Comm: kworker/0:3 Tainted: G U 4.12.0-rc3-CI-CI_DRM_2673+ #1
[ 292.165520] Hardware name: Dell Inc. OptiPlex 745 /0GW726, BIOS 2.3.1 05/21/2007
[ 292.165550] Workqueue: events process_unsol_events [snd_hda_core]
[ 292.165566] task: ffff88003b29a600 task.stack: ffffc900001e0000
[ 292.165582] RIP: 0010:call_hp_automute+0xb/0x30 [snd_hda_codec_generic]
[ 292.165598] RSP: 0018:ffffc900001e3dc8 EFLAGS: 00010286
[ 292.165613] RAX: 0000000000000000 RBX: ffff88002e3f19e8 RCX: 0000000000000005
[ 292.165630] RDX: ffff8800314f2550 RSI: ffff88002e3f19e8 RDI: ffff8800316c33f8
[ 292.165647] RBP: ffffc900001e3dc8 R08: ffff88003b29ae38 R09: 0000000000000000
[ 292.165664] R10: ffffc900001e3dc8 R11: 0000000000000001 R12: ffff8800316c33f8
[ 292.165684] R13: ffff8800314f2550 R14: ffff88002e2aa5a0 R15: ffff88002e2aa5a8
[ 292.165704] FS: 0000000000000000(0000) GS:ffff88003f200000(0000) knlGS:0000000000000000
[ 292.165727] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 292.165744] CR2: 0000000000000a20 CR3: 000000003a091000 CR4: 00000000000006f0
[ 292.165763] Call Trace:
[ 292.165786] call_jack_callback.isra.2+0x20/0xa0 [snd_hda_codec]
[ 292.165809] snd_hda_jack_unsol_event+0x62/0x70 [snd_hda_codec]
[ 292.165829] hda_codec_unsol_event+0x12/0x20 [snd_hda_codec]
[ 292.165848] process_unsol_events+0x5e/0x70 [snd_hda_core]
[ 292.165867] process_one_work+0x1fe/0x670
[ 292.165883] worker_thread+0x49/0x3b0
[ 292.165898] kthread+0x10f/0x150
[ 292.165910] ? process_one_work+0x670/0x670
[ 292.165924] ? kthread_create_on_node+0x40/0x40
[ 292.165940] ret_from_fork+0x27/0x40
[ 292.165955] Code: 48 8d 93 0a 05 00 00 be 04 00 00 00 eb c8 4c 89 e7 e8 3a ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 48 8b 87 68 05 00 00 55 48 89 e5 <48> 8b 80 20 0a 00 00 48 85 c0 74 04 ff d0 5d c3 e8 70 ff ff ff
[ 292.166091] RIP: call_hp_automute+0xb/0x30 [snd_hda_codec_generic] RSP: ffffc900001e3dc8
[ 292.166112] CR2: 0000000000000a20
[ 292.166125] ---[ end trace 65ddd10cfd482602 ]---
[ 292.166141] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:33
[ 292.166165] in_atomic(): 0, irqs_disabled(): 1, pid: 148, name: kworker/0:3
[ 292.166182] INFO: lockdep is turned off.
[ 292.166194] irq event stamp: 3264388
[ 292.166207] hardirqs last enabled at (3264387): [<ffffffff8187bd73>] restore_regs_and_iret+0x0/0x1d
[ 292.166232] hardirqs last disabled at (3264388): [<ffffffff8187d2af>] error_entry+0x6f/0xc0
[ 292.166256] softirqs last enabled at (3263952): [<ffffffff810839f9>] __do_softirq+0x1d9/0x4a0
[ 292.166279] softirqs last disabled at (3263931): [<ffffffff81083e39>] irq_exit+0xa9/0xc0
[ 292.166301] CPU: 0 PID: 148 Comm: kworker/0:3 Tainted: G UD 4.12.0-rc3-CI-CI_DRM_2673+ #1
[ 292.166325] Hardware name: Dell Inc. OptiPlex 745 /0GW726, BIOS 2.3.1 05/21/2007
[ 292.166356] Workqueue: events process_unsol_events [snd_hda_core]
[ 292.166374] Call Trace:
[ 292.166388] dump_stack+0x67/0x97
[ 292.166401] ___might_sleep+0x162/0x250
[ 292.166415] __might_sleep+0x45/0x80
[ 292.166429] exit_signals+0x1f/0x2a0
[ 292.166443] do_exit+0x90/0xcc0
[ 292.166456] ? kthread+0x10f/0x150
[ 292.166469] ? process_one_work+0x670/0x670
[ 292.166483] rewind_stack_do_exit+0x17/0x20
[ 292.166545] do_IRQ: 0.35 No irq handler for vector
Any ideas what could be causing this? From the look of it, it is possible that the unbinding of the i915 driver set codec->spec to NULL before the workqueue got executed, resulting in a NULL pointer dereference. If this is the problem, we might need to reference-count codec->spec.
I added the following commit to our branch that is used for our GFX CI in order to give us more information about the issue when it arises: https://cgit.freedesktop.org/drm-tip/commit/?id=a856f7ee9acc3586408f103ed56e8d2914c1e466
Sorry, forgot to say that it was for i965G, a 10-year-old Intel platform.
Hmm, what exactly is performed here? Do you unload HD-audio module before i915?
If so, the following ad hoc fix might work.
Created attachment 256963 [details]
(In reply to Takashi Iwai from comment #2)
> Hmm, what exactly is performed here? Do you unload HD-audio module before
> If so, the following ad hoc fix might work.
No, from what I understand, it is only the i915 driver which gets reloaded.
Given that i915 provides sound sinks (HDMI, DP), I would assume that the problem lies in the code handling the unregistering of the sinks.
But when i915 component is bound, it increases the refcount, so you cannot reload the i915 module unless unbound from snd-hda-intel. And the code path isn't about HDMI jack. The HDMI codec driver doesn't use the generic parser at all.
So, what do we do to get more information about this bug? I can add a commit to our CI kernel in order to dump more logs when this happens.
The way I see it, it is possible that the i915 component is unbound between the queuing of a workitem and its execution. Since the workitem does not check much, it just results in a crash.
What could be responsible for sending this work item then?
Well, could you tell the exact procedure what this CLI does? I have to understand it at first. All other information doesn't help without the understanding of the circumstance.
For example, it's crucial to know whether you do rmmod i915 or do forcibly unbind via sysfs. The latter isn't supposed to work in the current implementation.
Well, this is the exact code that is being run:
If I understand correctly, it does unload snd-hda-intel before i915, right?
At which moment is the Oops triggered?
(In reply to Takashi Iwai from comment #9)
> If I understand correctly, it does unload snd-hda-intel before i915, right?
Well, that's also what I understand from the code.
> At which moment is the Oops triggered?
I do not know. I guess I should take the machine out of CI and try running this test in a loop with an added write to the kernel logs before removing and then adding the different modules.
I could have blamed a cosmic ray, but it happened twice in 15 runs, which prompted writing this bug. I could however not reproduce it again in the past 35 runs.
OK, if you can manage to reproduce it, could you try the fix in comment 3?
I guess it's some race when an event gets triggered while unregistering the codec, and the patch should address it.
Yes, it should help. However, why don't you use an atomic value here? You can't be sure that the current CPU has the latest view on this value.
The patch is merely to check whether it's really the race of this kind :)
The in_freeing flag is an indicator for avoiding the recursive free call, thus it's not expected to be referred concurrently from other threads.
A more proper fix would be something like below, unregistering the codec reference in hdac_bus before actually freeing.
If the patch in comment 3 was confirmed, try this change instead, too.
Created attachment 257013 [details]
A better fix
I must say that this bug is quite tricky to hit... I ran more than 3.5k times the offending test in a loop on the commit that showed the problem, and one thousand time on a newer commit, and I still can't hit it...
So, that rules out testing bug fixes and leaves us with only the possibility to harden the code to be more robust against this sort of issues.
I agree that something like your better fix is needed, however, it does not account for the case where a work_item is already running. How about not scheduling this work_item if codec->in_freeing, then making sure that the current queue is empty and if not, wait for it to be done before continuing with the actual freeing of resources?
Sorry that I can't be a more useful tester, but this issue seems quite complex to hit, so we need a change in tactics...
(In reply to Martin Peres from comment #15)
> I agree that something like your better fix is needed, however, it does not
> account for the case where a work_item is already running.
I thought flush_work() cares that?
The unregister is performed before the free, thus after the unregister, there can be no unsolicited event handling for the given codec.
(In reply to Takashi Iwai from comment #16)
> (In reply to Martin Peres from comment #15)
> > I agree that something like your better fix is needed, however, it does not
> > account for the case where a work_item is already running.
> I thought flush_work() cares that?
Of course, sorry! I have not been using the work queues enough, apparently :p
> The unregister is performed before the free, thus after the unregister,
> there can be no unsolicited event handling for the given codec.
That sounds good then!
Should your patch land directly then? It seems like it is always safe, although it might delay the freeing, but that's the right thing to do.
If you would like more testing of this patch, I could drop it on our CI systems and see if it breaks anything.
Yes, I'm going to brush up and submit the patch for merging to 4.13.
(In reply to Takashi Iwai from comment #18)
> Yes, I'm going to brush up and submit the patch for merging to 4.13.
Wonderful :) Could you provide us with the patch as soon as you have it, so we can test it as much as possible in advance?
Thanks for your work, it has been pleasant to work with you and I'm sure we'll be in touch more often ;)
Created attachment 257201 [details]
The official fix merged to sound git tree
Here it is.
I close the bug now. If you still encounter the problem even with the patch, feel free to reopen. Thanks!