Bug 196045
Summary: | NULL pointer dereference at call_hp_automute+0xb/0x30 [snd_hda_codec_generic] | ||
---|---|---|---|
Product: | Drivers | Reporter: | Martin Peres (martin.peres) |
Component: | Sound(ALSA) | Assignee: | Jaroslav Kysela (perex) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | tiwai |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.12.0-rc3 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
alsa-info --stdout output
Possible fix A better fix The official fix merged to sound git tree |
Description
Martin Peres
2017-06-12 14:56:32 UTC
Sorry, forgot to say that it was for i965G, a 10-year-old Intel platform. Hmm, what exactly is performed here? Do you unload HD-audio module before i915? If so, the following ad hoc fix might work. Created attachment 256963 [details]
Possible fix
(In reply to Takashi Iwai from comment #2) > Hmm, what exactly is performed here? Do you unload HD-audio module before > i915? > If so, the following ad hoc fix might work. No, from what I understand, it is only the i915 driver which gets reloaded. Given that i915 provides sound sinks (HDMI, DP), I would assume that the problem lies in the code handling the unregistering of the sinks. But when i915 component is bound, it increases the refcount, so you cannot reload the i915 module unless unbound from snd-hda-intel. And the code path isn't about HDMI jack. The HDMI codec driver doesn't use the generic parser at all. Puzzling. So, what do we do to get more information about this bug? I can add a commit to our CI kernel in order to dump more logs when this happens. The way I see it, it is possible that the i915 component is unbound between the queuing of a workitem and its execution. Since the workitem does not check much, it just results in a crash. What could be responsible for sending this work item then? Well, could you tell the exact procedure what this CLI does? I have to understand it at first. All other information doesn't help without the understanding of the circumstance. For example, it's crucial to know whether you do rmmod i915 or do forcibly unbind via sysfs. The latter isn't supposed to work in the current implementation. Well, this is the exact code that is being run: https://cgit.freedesktop.org/xorg/app/intel-gpu-tools/tree/lib/igt_kmod.c#n274 If I understand correctly, it does unload snd-hda-intel before i915, right? At which moment is the Oops triggered? (In reply to Takashi Iwai from comment #9) > If I understand correctly, it does unload snd-hda-intel before i915, right? Well, that's also what I understand from the code. > > At which moment is the Oops triggered? I do not know. I guess I should take the machine out of CI and try running this test in a loop with an added write to the kernel logs before removing and then adding the different modules. I could have blamed a cosmic ray, but it happened twice in 15 runs, which prompted writing this bug. I could however not reproduce it again in the past 35 runs. OK, if you can manage to reproduce it, could you try the fix in comment 3? I guess it's some race when an event gets triggered while unregistering the codec, and the patch should address it. Yes, it should help. However, why don't you use an atomic value here? You can't be sure that the current CPU has the latest view on this value. The patch is merely to check whether it's really the race of this kind :) The in_freeing flag is an indicator for avoiding the recursive free call, thus it's not expected to be referred concurrently from other threads. A more proper fix would be something like below, unregistering the codec reference in hdac_bus before actually freeing. If the patch in comment 3 was confirmed, try this change instead, too. Created attachment 257013 [details]
A better fix
I must say that this bug is quite tricky to hit... I ran more than 3.5k times the offending test in a loop on the commit that showed the problem, and one thousand time on a newer commit, and I still can't hit it... So, that rules out testing bug fixes and leaves us with only the possibility to harden the code to be more robust against this sort of issues. I agree that something like your better fix is needed, however, it does not account for the case where a work_item is already running. How about not scheduling this work_item if codec->in_freeing, then making sure that the current queue is empty and if not, wait for it to be done before continuing with the actual freeing of resources? Sorry that I can't be a more useful tester, but this issue seems quite complex to hit, so we need a change in tactics... (In reply to Martin Peres from comment #15) > I agree that something like your better fix is needed, however, it does not > account for the case where a work_item is already running. I thought flush_work() cares that? The unregister is performed before the free, thus after the unregister, there can be no unsolicited event handling for the given codec. (In reply to Takashi Iwai from comment #16) > (In reply to Martin Peres from comment #15) > > I agree that something like your better fix is needed, however, it does not > > account for the case where a work_item is already running. > > I thought flush_work() cares that? Of course, sorry! I have not been using the work queues enough, apparently :p > > The unregister is performed before the free, thus after the unregister, > there can be no unsolicited event handling for the given codec. That sounds good then! Should your patch land directly then? It seems like it is always safe, although it might delay the freeing, but that's the right thing to do. If you would like more testing of this patch, I could drop it on our CI systems and see if it breaks anything. Yes, I'm going to brush up and submit the patch for merging to 4.13. (In reply to Takashi Iwai from comment #18) > Yes, I'm going to brush up and submit the patch for merging to 4.13. Wonderful :) Could you provide us with the patch as soon as you have it, so we can test it as much as possible in advance? Thanks for your work, it has been pleasant to work with you and I'm sure we'll be in touch more often ;) Created attachment 257201 [details]
The official fix merged to sound git tree
Here it is. I close the bug now. If you still encounter the problem even with the patch, feel free to reopen. Thanks! |