Created attachment 114431 [details]
Full dmesg, actual tracebacks are on the last lines and it's repeating every 22s
I'm getting a soft lockup every time I use vgaswitcheroo; it started with the update to the 3.12 kernel. From the traceback it looks like the Intel HDA driver cannot release a device, probably the AMD card's audio.
Attaching dmesg, and will attach lspci and cpuinfo.
Created attachment 114441 [details]
Created attachment 114451 [details]
Hm, judging from the log, the hotplug driver tries to remove the HD-audio sound device. Did you do anything relevant? If not, this is the biggest behavior difference. Does the problem still happen if you unload acpiphp beforehand?
Anyway, the problem looks like a deadlock on the card->files_lock spinlock. Could you try with LOCKDEP enabled and attach the kernel log?
Created attachment 114871 [details]
dmesg, lockdep, lockdep_stats, before and after switch
Attached dmesg, /proc/lockdep and /proc/lockdep_stats, before and after switching from integrated to dedicated graphics, as a zip to prevent a mess.
Also, I'm not sure since when, because I hadn't used vgaswitcheroo for the past 2 weeks, but after switching to the dedicated AMD card I cannot run the X server, maybe because of this lockup.
But actually, after enabling LOCKDEP (or maybe after compiling the kernel from ABS with a modified config), it doesn't lock my PC on boot with the "vgaswitcheroo switch to integrated gpu" systemd script enabled. Also, I can't say what the effect of disabling acpiphp is.
Was the hotplug behavior seen in 3.11 kernel? If not, it might be a new behavior of the hotplug stuff that really disables the inactive sound controller PCI on the fly.
Anyway, you can copy hda_intel.c from 3.11 to 3.12 and see whether it still works. Even the 3.10 hda_intel.c should work. The relevant change in 3.11 was the i915 power well stuff, and in 3.12, the nouveau pm domain stuff.
If the old hda_intel.c works, we can concentrate on the regression there. If it doesn't work, it implies that the breakage happened in other places.
I can confirm the same behaviour in kernel 3.12.1-ARCH with nouveau driver. If I blacklist the nouveau driver then the issue disappears (probably because vgaswitcheroo isn't kicking in). I remember using vgaswitcheroo in kernel 3.11 without any issues.
I'll try Takashi's suggestion and compile 3.12 with the 3.11 hda_intel.c file.
Compiling 3.12 with 3.11 hda_intel.c file doesn't work. The bug is still present.
Created attachment 115771 [details]
dmesg of 3.12 with 3.11 hda_intel.c
OK, then try to disable/blacklist the hotplug driver (acpiphp and pciephp), so that we can see whether the problem comes from the hotplug disable, or VGA switcheroo breakage itself.
Created attachment 115931 [details]
dmesg with hotplug drivers disabled
Making the modules fail to load by placing these lines in a modprobe.conf file:
install acpiphp /bin/false
install pciephp /bin/false
The modules did not load, but nouveau did. I get the same behaviour; the bug persists.
Should I bisect the kernel between 3.11 and 3.12? Would that be helpful? I can do that if asked.
Yeah, that'd be helpful, of course :) Thanks.
Ok. I'll try to do it this weekend. As I'm learning here, could you give me some pointers on how to identify this bug and no others? Which are good starting points for good/bad commits for bisection? I guess that revisions containing modifications to nouveau/hda_intel/vgaswitcheroo are candidates.
It'd be safer to do a full bisection between 3.11 and 3.12 in this case, as it's not 100% clear which subsystem (or combination) caused the issue. (The difference to partial bisection would be maybe just a couple of tests.)
There is no common solution for identifying a specific bug, unfortunately. You must check the behavior and judge whether it's the same.
Of course, if you fall into a wrong bisection, try the partial ones including drivers/gpu, include/drm and sound/pci/hda.
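To make the "just a couple of tests" estimate concrete: bisection roughly halves the candidate range at each step, so the number of test builds grows with the base-2 logarithm of the number of commits. The following is only a rough sketch; the commit counts are illustrative assumptions, not the real numbers for v3.11..v3.12 or for the subdirectories mentioned above, and git's own estimate can differ slightly because history is not linear.

```python
import math

def bisect_steps(num_commits: int) -> int:
    """Approximate number of test builds needed to bisect num_commits candidates.

    Each step halves the remaining range, so this is ceil(log2(n)).
    """
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    return math.ceil(math.log2(num_commits))

# Hypothetical counts, chosen only to illustrate the scaling.
full_range = 12000     # assumed: every commit between two releases
partial_range = 2000   # assumed: commits touching only a few subsystems

full = bisect_steps(full_range)
partial = bisect_steps(partial_range)
print(f"full: {full} builds, partial: {partial} builds, extra: {full - partial}")
```

With these assumed counts the full bisection costs three more builds than the partial one, which matches the point that restricting the range saves only a few tests while risking a miss if the regression lives elsewhere.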
Created attachment 117951 [details]
Ok. It took me longer than expected to find the time to bisect the issue. I hope I did it right. Here is what I did:
- bisect -> good:v3.11; bad:v3.12
Then, as the issue resulted in a complete freeze (or almost) on desktop start with nouveau enabled, I tested each step with and without nouveau blacklisted, and tried to get a similar post-mortem dmesg (with journalctl).
I'll attach the log of bisect. Seems reasonable as first bad commit is a huge merge regarding drm.
OK, changing the component to DRI...
This might be related to bug 61891 (and related fdo bugs). In that case, bbd34fcdd1b201e996235731a7c98fd5197d9e51 caused the regression. Unfortunately, I'm not familiar enough with the acpi or hotplug code to really understand what's going on.
Rafael, any clue?
Created attachment 119641 [details]
Dmesg blacklisting snd_hda_intel module only
Just a little more research: with snd_hda_intel blacklisted I got this dmesg. A call trace is printed when the GPU is turned off, but I'm able to continue using my laptop (no lockup). I'll try to build a debug version so that the trace is more useful.
$ uname -r
I'm not able to compile a debug version, at least not easily. I hope the trace is helpful as is; if not, please give me some pointers on how to improve it. Thanks.
See bug 61891. Does booting with acpiphp.disable=1 on the kernel command line in grub help?
Thanks for the reply, Alex.
Adding 'acpiphp.disable=1' to bootloader indeed helps.
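When comparing test runs it is easy to lose track of which boots actually carried the workaround. As a minimal sketch (the helper names are made up for illustration, and the sample command line in the comment is invented), the parameter can be checked against the command line of the running kernel, which Linux exposes in /proc/cmdline:

```python
from pathlib import Path

def has_kernel_param(cmdline: str, name: str, value: str) -> bool:
    """Return True if `name=value` appears as a whole token on the command line."""
    return f"{name}={value}" in cmdline.split()

def running_with_acpiphp_disabled(path: str = "/proc/cmdline") -> bool:
    """Check the command line of the currently running kernel (Linux only)."""
    return has_kernel_param(Path(path).read_text(), "acpiphp.disable", "1")

# Example with an invented command line string:
#   has_kernel_param("root=/dev/sda1 ro acpiphp.disable=1", "acpiphp.disable", "1")
```

Matching whole whitespace-separated tokens avoids false positives from longer values such as `acpiphp.disable=10`.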
$ uname -r
The last lines of dmesg show the Nvidia card is shut down:
[ 25.473454] hda-intel 0000:01:00.1: Disabling via VGA-switcheroo
[ 28.062014] ACPI Warning: \_SB_.PCI0.P0P1.PEGP._DSM: Argument #4 type mismatch - Found [Integer], ACPI requires [Package] (20130725/nsarguments-95)
And this is confirmed by vgaswitcheroo:
# cat /sys/kernel/debug/vgaswitcheroo/switch
Then tested powering on/off with prime and it didn't work. It froze when calling:
$ DRI_PRIME=1 glxinfo | grep renderer
Couldn't catch the dmesg after that.
Indeed it seems like a duplicate of bug 61891. I can't test the patch proposed there as it's radeon specific.
(In reply to Joaquín Aramendía from comment #22)
> Thanks for the reply, Alex.
> Adding 'acpiphp.disable=1' to bootloader indeed helps.
> Indeed it seems like a duplicate of bug 61891. I can't test the patch
> proposed there as it's radeon specific.
We probably need a similar patch for nouveau and the DSM acpi method.
(In reply to Alex Deucher from comment #23)
> We probably need a similar patch for nouveau and the DSM acpi method.
Should I open a new bug report? Or change the title on this one? Let me know. I can test whatever is necessary
(In reply to Joaquín Aramendía from comment #24)
> (In reply to Alex Deucher from comment #23)
> > We probably need a similar patch for nouveau and the DSM acpi method.
> Should I open a new bug report? Or change the title on this one? Let me
> know. I can test whatever is necessary
Can you please attach dmesg from the failing kernel without any changes?
Created attachment 120021 [details]
nouveau: Possible VGA switcheroo problem fix
If this is exactly the same issue as in bug #61891, this (untested) patch on top of
https://patchwork.kernel.org/patch/3414401/
should work for nouveau.
I'll test it tonight. I tried to write similar code for nouveau, but it didn't work.
Created attachment 120071 [details]
3.13rc5 patched dmesg
(In reply to Rafael J. Wysocki from comment #26)
> nouveau: Possible VGA switcheroo problem fix
> If this is exactly the same issue as in bug #61891, this (untested) patch on
> top of
> should work for nouveau.
I could test it after a struggle. This is what I did:
I patched the clean 3.13rc5 tarball with both patches and compiled it. After that, adding the "acpiphp.disable=1" workaround made no difference, so that issue might be cleared.
To get at least an interactive system I had to blacklist 'snd_hda_module' and boot into 'multi-user.target' in systemd. X would freeze before I could enter my password.
I could finally get into a tty and run dmesg. I noticed that nouveau gets insmod-ed and rmmod-ed roughly once a minute. After two minutes the system locks up. Unfortunately I couldn't capture the dmesg before that; 'journalctl' only had a truncated dmesg from the previous boot (the one I'm attaching). There is a call trace regarding hotplug in there.
On a related issue: I also noticed that the 'acpiphp.disable=1' workaround doesn't always work on 3.12.6-ARCH; sometimes it locks X after login, especially when the system is recovering from a hard shutdown (when I have to kill my laptop with the power button). Sometimes it freezes even before the X login. (Possibly a race condition?)
Hope all this helps.
(In reply to Joaquín Aramendía from comment #28)
> Created attachment 120071 [details]
> 3.13rc5 patched dmesg
> (In reply to Rafael J. Wysocki from comment #26)
> > nouveau: Possible VGA switcheroo problem fix
> > If this is exactly the same issue as in bug #61891, this (untested) patch on
> > top of
> > https://patchwork.kernel.org/patch/3414401/
> > should work for nouveau.
> I could test it after a struggle. This is what I did:
> Patched the clean 3.13rc5 tarball with both patches and compiled it. After
> that adding the "acpiphp.disable=1" workaround made no difference, so that
> issue might be cleared.
That's what the patch is for, so it looks like it has worked.
I'll add the nouveau workaround to https://patchwork.kernel.org/patch/3414401/ then.
Created attachment 120321 [details]
dmesg after ACPIPHP patch
A follow-up on this issue:
After applying Rafael's patch, the issue with the lockup is still present. After booting on 3.13rc5 kernel in 'multi-user.target' I managed to get a full dmesg.
The nouveau driver does not seem to handle vgaswitcheroo well. After the card is disabled (as the _DSM argument mismatch warning suggests), the system locks up, and after a minute or so it starts printing these messages:
[ 123.322051] smeagol kernel: nouveau W[ PFIFO][0000:01:00.0] unknown intr 0x04200000, ch 2
Followed by timeout errors:
[ 129.302814] smeagol kernel: nouveau E[ PFIFO][0000:01:00.0] channel 2 [DRM] unload timeout
[ 149.888549] smeagol kernel: nouveau W[ BARCTL][0000:01:00.0] flush timeout
When the laptop finally responded to Ctrl+Alt+Del, a backtrace regarding ACPI hotplug appeared.
And then nouveau was loaded again, just before I had to kill my laptop with the power button:
[ 207.138410] smeagol kernel: nouveau [ DEVICE][0000:01:00.0] BOOT0 : 0x0a8a80b1
[ 207.138415] smeagol kernel: nouveau [ DEVICE][0000:01:00.0] Chipset: GT218 (NVA8)
[ 207.138418] smeagol kernel: nouveau [ DEVICE][0000:01:00.0] Family : NV50
[ 207.139892] smeagol kernel: nouveau [ VBIOS][0000:01:00.0] checking PRAMIN for image...
[ 207.188025] smeagol kernel: nouveau [ VBIOS][0000:01:00.0] ... signature not found
[ 207.188030] smeagol kernel: nouveau [ VBIOS][0000:01:00.0] checking PROM for image...
[ 208.466903] smeagol kernel: nouveau [ VBIOS][0000:01:00.0] ... signature not found
[ 208.466907] smeagol kernel: nouveau [ VBIOS][0000:01:00.0] checking ACPI for image...
It looks like the ACPIPHP patch is not sufficient in your case, because ACPIPHP still tries to remove some devices in response to the event signaled after _DSM execution.
We'll need some more debug info to figure out what is going on.
Please remove the patches you have applied so far and apply this one instead:
This contains the nouveau part already.
I'll let you know what to apply in addition to that.
First off, please attach the output of acpidump from that machine.
Created attachment 120331 [details]
Debug patch for ACPIPHP
Second, please apply this patch (on top of the one I pointed you to previously) and try to get a dmesg output analogous to the previous one.
Created attachment 120361 [details]
acpidump for Dell Vostro 3500
Created attachment 120371 [details]
dmesg Debug patch
Hope this is what you want
It looks like on your system the device having the "switching _DSM" method has to be marked with no_hotplug.
Created attachment 120381 [details]
ACPIPHP / radeon / nouveau: Fix VGA switcheroo problem related to hotplug
Please apply this patch (instead of the ones from comment #31 and #33) and check if it makes any difference.
(In reply to Rafael J. Wysocki from comment #36)
> It looks like on your system the device having the "switching _DSM" method
> has to be marked with no_hotplug.
Indeed, the Nvidia card has a sound card in it that is claimed by the snd_hda_intel module. That module controls both sound cards.
I'll apply patch in comment #37 and report back.
Created attachment 120391 [details]
dmesg after patch from comment #38
Now when I get to the tty the system is at least responsive. The card is turned off, and after running
# cat /sys/kernel/debug/vgaswitcheroo/switch
I noticed an order change (before 1 was the DIS card, 2 the DIS-Audio).
Then I started to toy with it and tried
# systemctl isolate graphical.target
and it hung. So I rebooted and tried again, this time attempting a reboot after checking the vgaswitcheroo status, and it hung with the attached dmesg. I'm not able to get into graphical.target yet.
Well, that may be a different error. It looks like there are no hotplug events triggered by switcheroo any more which is the purpose of the patch.
Please apply the patch from comment #33 on top of the one in #37 and retest to verify that the hotplug events are really gone.
Created attachment 120411 [details]
dmesg for test on comment #40
Sorry for the delay.
It seems that there are some hotplug events left, as dmesg suggests:
[ 22.292356] smeagol kernel: ACPI Warning: \_SB_.PCI0.P0P1.PEGP._DSM: Argument #4 type mismatch - Found [Integer], ACPI requires [Package] (20131115/nsarguments-95)
[ 22.299084] smeagol kernel: acpiphp_glue: hotplug_event: Bus check notify on \_SB_.PCI0.P0P1
[ 22.299087] smeagol kernel: acpiphp_glue: hotplug_event: re-enumerating slots under \_SB_.PCI0.P0P1
[ 22.299093] smeagol kernel: pcieport 0000:00:01.0: acpiphp_check_bridge
[ 22.299097] smeagol kernel: video LNXVIDEO:00: no_hotplug set
Yes, the event is triggered, but because of the no_hotplug set for LNXVIDEO:00, it is actually ignored and that's what I meant. This means that the patch works as intended and it should take hotplug out of the way for you.
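For anyone repeating this check on their own log, here is a rough sketch of how the two kinds of lines could be separated. The sample lines are copied from the dmesg excerpt quoted above; the message formats come from the debug patch used in this thread, so treating them as stable across kernels is an assumption, and `summarize` is just an illustrative helper name.

```python
import re

# Lines copied from the dmesg excerpt quoted in this thread.
SAMPLE_DMESG = r"""
[ 22.299084] smeagol kernel: acpiphp_glue: hotplug_event: Bus check notify on \_SB_.PCI0.P0P1
[ 22.299087] smeagol kernel: acpiphp_glue: hotplug_event: re-enumerating slots under \_SB_.PCI0.P0P1
[ 22.299093] smeagol kernel: pcieport 0000:00:01.0: acpiphp_check_bridge
[ 22.299097] smeagol kernel: video LNXVIDEO:00: no_hotplug set
"""

NOTIFY_RE = re.compile(r"hotplug_event: Bus check notify on (\S+)")
NO_HOTPLUG_RE = re.compile(r"(\S+): no_hotplug set")

def summarize(log: str) -> dict:
    """Collect ACPI paths that got a bus check and devices marked no_hotplug."""
    bus_checks, no_hotplug = [], []
    for line in log.splitlines():
        match = NOTIFY_RE.search(line)
        if match:
            bus_checks.append(match.group(1))
            continue
        match = NO_HOTPLUG_RE.search(line)
        if match:
            no_hotplug.append(match.group(1))
    return {"bus_checks": bus_checks, "no_hotplug": no_hotplug}

print(summarize(SAMPLE_DMESG))
```

A bus check followed by a "no_hotplug set" line for the device under that bridge corresponds to the situation described above: the notification still arrives, but the device is skipped during re-enumeration.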
The other issues you're seeing need to be investigated by colleagues who are more familiar with graphics and audio than I am.
Thanks a lot and happy New Year!
Thanks, Rafael! Hope I get it fixed (or help someone to do it, most likely).
Happy new year.
Created attachment 120671 [details]
Nouveau locked dmesg 3.13rc6 patched
Ok. Then I got 3.13rc6 with the patch from comment #37. Entering multi-user.target went fine. Then I tried to enter X. Here is the dmesg from that run. The lockup lasted almost 15 minutes. This time it managed to print a trace. It seems to be related only to nouveau.
Hope this helps
Today I tested mainline kernel with no arguments and I found out that it works out-of-the-box like a charm. So no more issues related with power on/off nvidia card. I'll mark this as fixed.
Thanks to all.
Created attachment 181501 [details]
dmesg showing lockups (radeon off)
I still experience lockups related to audio with the discrete card disabled via vgaswitcheroo. Moreover, the system hangs at shutdown.
AMD Mobility Radeon HD5470 + intel integrated.
# cat /sys/kernel/debug/vgaswitcheroo/switch
Interesting dmesg print:
snd_hda_intel 0000:01:00.1: Disabling via VGA-switcheroo
snd_hda_intel 0000:01:00.1: Cannot lock devices!
INFO: task kworker/u16:0:6 blocked for more than 120 seconds.
Tainted: G W 4.0.6-1-ARCH #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:0 D ffff880132fd7ca8 0 6 2 0x00000000
Workqueue: hd-audio1 hdmi_repoll_eld [snd_hda_codec_hdmi]
See complete dmesg in attachment.