Bug 64891 - [snd_hda_codec] BUG: soft lockup - CPU#0 stuck for 22s (when using vgaswitcheroo)
Summary: [snd_hda_codec] BUG: soft lockup - CPU#0 stuck for 22s (when using vgaswitche...
Status: NEEDINFO
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel) (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-11-12 23:16 UTC by Kamil Bar
Modified: 2015-06-30 22:15 UTC (History)
6 users (show)

See Also:
Kernel Version: 3.12.0-1-ARCH
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Full dmesg, actual tracebacks are on the last lines and it's repeating every 22s (218.19 KB, text/plain)
2013-11-12 23:16 UTC, Kamil Bar
Details
cpuinfo (3.27 KB, text/x-log)
2013-11-12 23:17 UTC, Kamil Bar
Details
verbose lspci (12.51 KB, text/x-log)
2013-11-12 23:18 UTC, Kamil Bar
Details
dmesg, lockdep, lockdep_stats, before and after switch (69.15 KB, application/gzip)
2013-11-15 19:04 UTC, Kamil Bar
Details
dmesg of 3.12 with 3.11 hda_intel.c (83.21 KB, text/plain)
2013-11-24 01:47 UTC, Joaquín Aramendía
Details
dmesg with hotplug drivers disabled (84.10 KB, text/plain)
2013-11-25 15:07 UTC, Joaquín Aramendía
Details
bisection log (2.57 KB, text/plain)
2013-12-10 02:09 UTC, Joaquín Aramendía
Details
Dmesg blacklisting snd_hda_intel module only (72.61 KB, text/plain)
2013-12-26 20:01 UTC, Joaquín Aramendía
Details
nouveau: Possible VGA switcheroo problem fix (1.79 KB, patch)
2013-12-28 22:03 UTC, Rafael J. Wysocki
Details | Diff
3.13rc5 patched dmesg (86.76 KB, text/plain)
2013-12-29 15:10 UTC, Joaquín Aramendía
Details
dmesg after ACPIPHP patch (87.61 KB, text/plain)
2013-12-30 22:50 UTC, Joaquín Aramendía
Details
Debug patch for ACPIPHP (1.67 KB, patch)
2013-12-31 00:09 UTC, Rafael J. Wysocki
Details | Diff
acpidump for Dell Vostro 3500 (225.64 KB, text/plain)
2013-12-31 00:20 UTC, Joaquín Aramendía
Details
dmesg Debug patch (91.27 KB, text/plain)
2013-12-31 00:49 UTC, Joaquín Aramendía
Details
ACPIPHP / radeon / nouveau: Fix VGA switcheroo problem related to hotplug (8.00 KB, patch)
2013-12-31 01:06 UTC, Rafael J. Wysocki
Details | Diff
dmesg after patch from comment #38 (76.76 KB, text/plain)
2013-12-31 01:50 UTC, Joaquín Aramendía
Details
dmesg for test on comment #40 (75.96 KB, text/plain)
2013-12-31 12:49 UTC, Joaquín Aramendía
Details
Nouveau locked dmesg 3.13rc6 patched (89.16 KB, text/plain)
2014-01-02 22:19 UTC, Joaquín Aramendía
Details
dmesg showing lockpuos (radeon off) (139.14 KB, text/x-log)
2015-06-30 22:15 UTC, Tommaso Falchi Delitala
Details

Description Kamil Bar 2013-11-12 23:16:41 UTC
Created attachment 114431 [details]
Full dmesg, actual tracebacks are on the last lines and it's repeating every 22s

I'm getting soft lockup every time I use vgaswitcheroo, it started from update to 3.12 kernel. From traceback it looks like Intel HDA cannot release device, probably AMD's audio.

Attaching dmesg, and will attach lspci and cpuinfo.
Comment 1 Kamil Bar 2013-11-12 23:17:50 UTC
Created attachment 114441 [details]
cpuinfo
Comment 2 Kamil Bar 2013-11-12 23:18:08 UTC
Created attachment 114451 [details]
verbose lspci
Comment 3 Takashi Iwai 2013-11-13 08:26:48 UTC
Hm, judging from the log, the hotplug driver tries to remove the HD-audio sound device.  Did you do anything relevant?  If not, this is the biggest behavior difference.  Does the problem still happen if you unload acpiphp beforehand?

In anyway, the problem looks like a deadlock of card->files_lock spinlock.  Could you try with LOCKDEP and give the kernel log.
Comment 4 Kamil Bar 2013-11-15 19:04:23 UTC
Created attachment 114871 [details]
dmesg, lockdep, lockdep_stats, before and after switch

Attached dmesg /proc/lockdep /proc/lockdep_stats before and after switching from integrated to dedicated graphics as a zip to prevent mess.
Also I'm not sure from when because I wasn't used vgaswitcheroo from 2 weeks from now... but after switching to dedicated AMD card I cannot run xserver, maybe because of this lockup.
But actually after enabling LOCKDEP or maybe after compiling kernel from ABS with modified config, it doesn't lock my pc on boot with "vgaswitcheroo switch to integrated gpu" systemd script enabled. Also I can't say what's the effect of disabling acpiphp
Comment 5 Takashi Iwai 2013-11-17 18:02:03 UTC
Was the hotplug behavior seen in 3.11 kernel?  If not, it might be a new behavior of the hotplug stuff that really disables the inactive sound controller PCI on the fly.

In anyway, you can copy hda_intel.c from 3.11 to 3.12 and see whether it still works.  Or even 3.10 hda_intel.c should work.  The relevant changes in 3.11 was i915 power well stuff, and in 3.12, nouveau pm domain stuff.

If the old hda_intel.c works, we can concentrate on the regression there.  If it doesn't work, it implies that the breakage happened in other places.
Comment 6 Joaquín Aramendía 2013-11-23 18:55:27 UTC
I can confirm the same behaviour in kernel 3.12.1-ARCH with nouveau driver. If I blacklist the nouveau driver then the issue disappears (probably because vgaswitcheroo isn't kicking in). I remember using vgaswitcheroo in kernel 3.11 without any issues.
I'll try Takashi's suggestion and compile 3.12 with the 3.11 hda_intel.c file.
Comment 7 Joaquín Aramendía 2013-11-24 01:38:38 UTC
Compiling 3.12 with 3.11 hda_intel.c file doesn't work. The bug is still present.
Comment 8 Joaquín Aramendía 2013-11-24 01:47:55 UTC
Created attachment 115771 [details]
dmesg of 3.12 with 3.11 hda_intel.c
Comment 9 Takashi Iwai 2013-11-25 08:38:29 UTC
OK, then try to disable/blacklist the hotplug driver (acpiphp and pciephp), so that we can see whether the problem comes from the hotplug disable, or VGA switcheroo breakage itself.
Comment 10 Joaquín Aramendía 2013-11-25 15:07:05 UTC
Created attachment 115931 [details]
dmesg with hotplug drivers disabled

Making the modules fail to load by placing these lines in a modprobe.conf file:

install acpiphp /bin/false
install pciephp /bin/false

Modules did not load, but nouveau did. Have the same behaviour, the bug persists.
Comment 11 Joaquín Aramendía 2013-11-25 18:46:01 UTC
Should I bisect the kernel between 3.11 and 3.12? Is it any helpful? I can do that if asked.
Comment 12 Takashi Iwai 2013-11-26 06:38:09 UTC
Yeah, that'd be helpful, of course :)  Thanks.
Comment 13 Joaquín Aramendía 2013-11-29 03:19:18 UTC
Ok. I'll try to do it this weekend. As I'm learning here, could you give me some pointers on how to identify tis bug and no others? Which are good starting points for good/bad commits for bisection? I guess that revision containing modifications to nouveau/hda_intel/vgaswitcheroo are candidates.
Comment 14 Takashi Iwai 2013-11-29 06:54:56 UTC
It'd be safer to do a full bisection between 3.11 and 3.12 in this case, as it's not 100% clear which subsystem (or combination) caused the issue.  (The difference to partial bisection would be maybe just a couple of tests.)

There is no common solution for identifying a specific bug, unfortunately.  You must check the behavior and judge whether it's the same.

Of course, if you fall into a wrong bisection, try the partial ones including drivers/gpu, include/drm and sound/pci/hda.
Comment 15 Joaquín Aramendía 2013-12-10 02:09:42 UTC
Created attachment 117951 [details]
bisection log

Ok. It took me more than expected to get the time to bisect the issue. I hope i did it right. Here is what I did:
 - bisect -> good:v3.11; bad:v3.12

Then, as the issue resulted in a complete freeze (or almost) on desktop start with nouveau enabled, I tested each step with and without nouveau blacklisted. And try to get a similar post-mortem dmesg (with journalctl).

I'll attach the log of bisect. Seems reasonable as first bad commit is a huge merge regarding drm.
Comment 16 Takashi Iwai 2013-12-13 11:54:22 UTC
OK, changing the component to DRI...
Comment 17 Alex Deucher 2013-12-13 14:19:02 UTC
This might be related to bug 61891 (and related fdo bugs).  In that case, bbd34fcdd1b201e996235731a7c98fd5197d9e51 caused the regression.  Unfortunately, I'm not familiar enough with the acpi or hotplug code to really understand what's going on.
Comment 18 Takashi Iwai 2013-12-13 14:29:14 UTC
Rafael, any clue?
Comment 19 Joaquín Aramendía 2013-12-26 20:01:28 UTC
Created attachment 119641 [details]
Dmesg blacklisting snd_hda_intel module only

Just a little more research: blacklisting snd_hda_intel I got this dmesg. A call trace is called on gpu turn off but I'm able to continue using my laptop (no lockup). I'll try to make a debug version so that trace is something useful.

$ uname -r
3.12.6-1-ARCH
Comment 20 Joaquín Aramendía 2013-12-27 22:22:47 UTC
I'm not able to compile a debug version, at least not easily. Hope the trace is helpful as is, if not, please give me some pointers on how to improove it. Thanks.
Comment 21 Alex Deucher 2013-12-27 22:32:28 UTC
See bug 61891.  Does booting with acpiphp.disable=1 on the kernel command line in grub help?
Comment 22 Joaquín Aramendía 2013-12-28 15:12:26 UTC
Thanks for the reply, Alex.
Adding 'acpiphp.disable=1' to bootloader indeed helps.

Tested on
$ uname -r
3.12.6-1-ARCH

Last lines of dmesg shows the Nvidia card is shut down
[   25.473454] hda-intel 0000:01:00.1: Disabling via VGA-switcheroo
[   28.062014] ACPI Warning: \_SB_.PCI0.P0P1.PEGP._DSM: Argument #4 type mismatch - Found [Integer], ACPI requires [Package] (20130725/nsarguments-95)

And is confirmed by vgaswitcheroo
# cat /sys/kernel/debug/vgaswitcheroo/switch
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynOff:0000:01:00.0
2:DIS-Audio: :Off:0000:01:00.1

Then tested powering on/off with prime and it didn't work. It froze when calling:
$ DRI_PRIME=1 glxinfo | grep renderer

Couldn't catch the dmesg after that.

Indeed it seems like a duplicate of bug 61891. I can't test the patch proposed there as it's radeon specific.
Comment 23 Alex Deucher 2013-12-28 15:22:35 UTC
(In reply to Joaquín Aramendía from comment #22)
> Thanks for the reply, Alex.
> Adding 'acpiphp.disable=1' to bootloader indeed helps.
> 
> Indeed it seems like a duplicate of bug 61891. I can't test the patch
> proposed there as it's radeon specific.

We probably need a similar patch for nouveau and the DSM acpi method.
Comment 24 Joaquín Aramendía 2013-12-28 15:36:05 UTC
(In reply to Alex Deucher from comment #23)
> We probably need a similar patch for nouveau and the DSM acpi method.

Should I open a new bug report? Or change the title on this one? Let me know. I can test whatever is necessary
Comment 25 Rafael J. Wysocki 2013-12-28 21:13:13 UTC
(In reply to Joaquín Aramendía from comment #24)
> (In reply to Alex Deucher from comment #23)
> > We probably need a similar patch for nouveau and the DSM acpi method.
> 
> Should I open a new bug report? Or change the title on this one? Let me
> know. I can test whatever is necessary

Can you please attach dmesg from the failing kernel without any changes?
Comment 26 Rafael J. Wysocki 2013-12-28 22:03:58 UTC
Created attachment 120021 [details]
nouveau: Possible VGA switcheroo problem fix

If this is exactly the same issue as in bug #61891, this (untested) patch on top of

https://patchwork.kernel.org/patch/3414401/

should work for nouveau.
Comment 27 Joaquín Aramendía 2013-12-28 23:16:19 UTC
I'll test it tonight. I tried to make a similar code for nouveau but didn't work.
Comment 28 Joaquín Aramendía 2013-12-29 15:10:47 UTC
Created attachment 120071 [details]
3.13rc5 patched dmesg

(In reply to Rafael J. Wysocki from comment #26)
> nouveau: Possible VGA switcheroo problem fix
> 
> If this is exactly the same issue as in bug #61891, this (untested) patch on
> top of
> 
> https://patchwork.kernel.org/patch/3414401/
> 
> should work for nouveau.

I could test it after a struggle. This is what I did:

Patched the clean 3.13rc5 tarball with both patches and compiled it. After that adding the "acpiphp.disable=1" workarround made no difference, so that issue might be cleared.

To get to at least an interactive system I had to blacklist 'snd_hda_module' and get into 'multi-user.target' in systemd. X will freeze before I could enter my password.

I finally could get into tty and run a dmesg. I noticed that nouveau gets insmod-ed and rmmod-ed once every, roughly, a minute. After two minutes the system will lock. Unfortunatelly I couldn't catch up the dmesg before that, 'journalctl' got a crippled dmesg from previous boot (this one I'm attaching). There is a call trace regarding hotplug in there.

In a related issue: I also noticed that the 'acpiphp.disable=1' workarround won't always work on 3.12.6-ARCH, it will lock X after login. Specially when the system is recovering from a hard button shutdown (when I have to kill my laptop with the power button). Sometimes it freezes earlier than X login. (Possibly a race condition?)

Hope all this helps.
Comment 29 Rafael J. Wysocki 2013-12-29 21:12:13 UTC
(In reply to Joaquín Aramendía from comment #28)
> Created attachment 120071 [details]
> 3.13rc5 patched dmesg
> 
> (In reply to Rafael J. Wysocki from comment #26)
> > nouveau: Possible VGA switcheroo problem fix
> > 
> > If this is exactly the same issue as in bug #61891, this (untested) patch
> on
> > top of
> > 
> > https://patchwork.kernel.org/patch/3414401/
> > 
> > should work for nouveau.
> 
> I could test it after a struggle. This is what I did:
> 
> Patched the clean 3.13rc5 tarball with both patches and compiled it. After
> that adding the "acpiphp.disable=1" workarround made no difference, so that
> issue might be cleared.

That's what the patch is for, so it looks like it has worked.

I'll the nouveau workaround to https://patchwork.kernel.org/patch/3414401/ then.
Comment 30 Joaquín Aramendía 2013-12-30 22:50:58 UTC
Created attachment 120321 [details]
dmesg after ACPIPHP patch

A follow-up on this issue:

After applying Rafael's patch, the issue with the lockup is still present. After booting on 3.13rc5 kernel in 'multi-user.target' I managed to get a full dmesg.

The nouveau driver seems not to handle vgaswitcheroo well. After being disabled as DSM call mismatch error suggest, it will lock, after a minute or so it starts printing this messages:

...
[  123.322051] smeagol kernel: nouveau W[   PFIFO][0000:01:00.0] unknown intr 0x04200000, ch 2
...

Followed by timeout errors:

...
[  129.302814] smeagol kernel: nouveau E[   PFIFO][0000:01:00.0] channel 2 [DRM] unload timeout
...
[  149.888549] smeagol kernel: nouveau W[  BARCTL][0000:01:00.0] flush timeout
...

When the laptop finally responded to Ctrl+Alt+Del then a backtrace appeared regarding acpi hotplug.

And then nouveau is reloaded again! Just before I had to drown my laptop with the power button.

...
[  207.138410] smeagol kernel: nouveau  [  DEVICE][0000:01:00.0] BOOT0  : 0x0a8a80b1
[  207.138415] smeagol kernel: nouveau  [  DEVICE][0000:01:00.0] Chipset: GT218 (NVA8)
[  207.138418] smeagol kernel: nouveau  [  DEVICE][0000:01:00.0] Family : NV50
[  207.139892] smeagol kernel: nouveau  [   VBIOS][0000:01:00.0] checking PRAMIN for image...
[  207.188025] smeagol kernel: nouveau  [   VBIOS][0000:01:00.0] ... signature not found
[  207.188030] smeagol kernel: nouveau  [   VBIOS][0000:01:00.0] checking PROM for image...
[  208.466903] smeagol kernel: nouveau  [   VBIOS][0000:01:00.0] ... signature not found
[  208.466907] smeagol kernel: nouveau  [   VBIOS][0000:01:00.0] checking ACPI for image...
Comment 31 Rafael J. Wysocki 2013-12-30 23:59:12 UTC
It looks like the ACPIPHP is not sufficient in your case, because it still tries to remove some devices in response to the event signaled after _DSM execution.

We'll need some more debug info to figure out what is going on.

Please remove the patches you have applied so far and apply this one instead:

https://patchwork.kernel.org/patch/3416861/

This contains the nouveau part already.

I'll let you know what to apply in addition to that.
Comment 32 Rafael J. Wysocki 2013-12-31 00:07:25 UTC
First off, please attach the output of acpidump from that machine.
Comment 33 Rafael J. Wysocki 2013-12-31 00:09:50 UTC
Created attachment 120331 [details]
Debug patch for ACPIPHP

Second. please apply this patch (on top of the one I pointed you to previously) and try to get a dmesg output analogous to the previous one.
Comment 34 Joaquín Aramendía 2013-12-31 00:20:23 UTC
Created attachment 120361 [details]
acpidump for Dell Vostro 3500
Comment 35 Joaquín Aramendía 2013-12-31 00:49:42 UTC
Created attachment 120371 [details]
dmesg Debug patch

Hope this is what you want
Comment 36 Rafael J. Wysocki 2013-12-31 01:05:12 UTC
Yes, thanks!

It looks like on your system the device having the "switching _DSM" method has to be marked with no_hotplug.
Comment 37 Rafael J. Wysocki 2013-12-31 01:06:49 UTC
Created attachment 120381 [details]
ACPIPHP / radeon / nouveau: Fix VGA switcheroo problem related to hotplug

Please apply this patch (instead of the ones from comment #31 and #33) and check if it makes any difference.
Comment 38 Joaquín Aramendía 2013-12-31 01:18:13 UTC
(In reply to Rafael J. Wysocki from comment #36)
> It looks like on your system the device having the "switching _DSM" method
> has to be marked with no_hotplug.

Indeed, the Nvidia card has a sound card in it that is claimed by the snd_hda_intel module. That module controls both sound cards.

I'll apply patch in comment #37 and report back.
Comment 39 Joaquín Aramendía 2013-12-31 01:50:03 UTC
Created attachment 120391 [details]
dmesg after patch from comment #38

Now when I get to the tty the system is at leas responsive. The card is turned off and doing

# cat /sys/kernel/debug/vgaswitcheroo/switch
0:IGD:+:Pwr:0000:00:02.0
1:DIS-Audio: :Off:0000:01:00.1
2:DIS: :DynOff:0000:01:00.0

I noticed an order change (before 1 was the DIS card, 2 the DIS-Audio).

Then I started to toy with it, tryed

# systemctl isolate graphical.target

And it hanged. So I rebooted and tried again. This time trying a reboot after checking the vgaswitcheroo status, and it hanged with the attached dmesg. I'm not able to get into a graphical.target yet.
Comment 40 Rafael J. Wysocki 2013-12-31 01:58:43 UTC
Well, that may be a different error.  It looks like there are no hotplug events triggered by switcheroo any more which is the purpose of the patch.

Please apply the patch from comment #33 on top of the one in #37 and retest to verify that the hotplug events are really gone.
Comment 41 Joaquín Aramendía 2013-12-31 12:49:17 UTC
Created attachment 120411 [details]
dmesg for test on comment #40

Sorry for the delay.
Seems that there are some hotplug events left as per dmesg suggests:

...
[   22.292356] smeagol kernel: ACPI Warning: \_SB_.PCI0.P0P1.PEGP._DSM: Argument #4 type mismatch - Found [Integer], ACPI requires [Package] (20131115/nsarguments-95)
[   22.299084] smeagol kernel: acpiphp_glue: hotplug_event: Bus check notify on \_SB_.PCI0.P0P1
[   22.299087] smeagol kernel: acpiphp_glue: hotplug_event: re-enumerating slots under \_SB_.PCI0.P0P1
[   22.299093] smeagol kernel: pcieport 0000:00:01.0: acpiphp_check_bridge
[   22.299097] smeagol kernel: video LNXVIDEO:00: no_hotplug set
...
Comment 42 Rafael J. Wysocki 2013-12-31 20:55:22 UTC
Yes, the event is triggered, but because of the no_hotplug set for LNXVIDEO:00, it is actually ignored and that's what I meant.  This means that the patch works as intended and it should take hotplug out of the way for you.

The other issues you're seeing need to be investigated by colleagues who are more familiar with graphics and audio than I am.

Thanks a lot and happy New Year!
Comment 43 Joaquín Aramendía 2013-12-31 21:02:44 UTC
Thanks, Rafael! Hope I get it fixed (or help someone to do it, most likely).
Happy new year.
Comment 44 Joaquín Aramendía 2014-01-02 22:19:15 UTC
Created attachment 120671 [details]
Nouveau locked dmesg 3.13rc6 patched

Ok. Then I got 3.13rc6 with the patch from comment #37. Entering multi-user.target went fine. Then I tried to enter X. Here is the dmesg from that run. It took me almost 15 minutes of lockup. This time it got to print a trace. It seems that is only related to nouveau.

Hope this helps
Comment 45 Joaquín Aramendía 2014-02-15 13:04:37 UTC
Today I tested mainline kernel with no arguments and I found out that it works out-of-the-box like a charm. So no more issues related with power on/off nvidia card. I'll mark this as fixed.

Thanks to all.
Comment 46 Tommaso Falchi Delitala 2015-06-30 22:15:36 UTC
Created attachment 181501 [details]
dmesg showing lockpuos (radeon off)

I still experience lockups related to audio with discrete card disabled via vgaswitcheroo. Moreover the systems hangs at shutdown.

Kernel 4.0.6-1-ARCH
AMD Mobility Radeon HD5470 + intel integrated.


# cat /sys/kernel/debug/vgaswitcheroo/switch 
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynOff:0000:01:00.0
2:DIS-Audio: :Off:0000:01:00.1

Interesting dmesg print:

snd_hda_intel 0000:01:00.1: Disabling via VGA-switcheroo
snd_hda_intel 0000:01:00.1: Cannot lock devices!
INFO: task kworker/u16:0:6 blocked for more than 120 seconds.
      Tainted: G        W       4.0.6-1-ARCH #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:0   D ffff880132fd7ca8     0     6      2 0x00000000
Workqueue: hd-audio1 hdmi_repoll_eld [snd_hda_codec_hdmi]


See complete dmesg in attachment.

Note You need to log in before you can comment on or make changes to this bug.