Bug 84431 - Kernel crash when unloading radeon module for switcheroo card
Summary: Kernel crash when unloading radeon module for switcheroo card
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 high
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-09-12 20:39 UTC by Pali Rohár
Modified: 2016-08-05 19:18 UTC

See Also:
Kernel Version: all
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Fix crash after rmmod radeon on PX systems. (1.23 KB, patch)
2014-09-12 20:39 UTC, Pali Rohár
patch 1/3 (2.30 KB, patch)
2014-09-12 22:08 UTC, Alex Deucher
patch 2/3 (1.96 KB, patch)
2014-09-12 22:09 UTC, Alex Deucher
patch 3/3 (1.37 KB, patch)
2014-09-12 22:09 UTC, Alex Deucher

Description Pali Rohár 2014-09-12 20:39:24 UTC
Created attachment 149991
Fix crash after rmmod radeon on PX systems.

Calling rmmod radeon on a PX system causes a kernel crash. The reason is the function vga_switcheroo_init_domain_pm_ops(), which sets dev->pm_domain of the PCI device. After the radeon module is unloaded, dev->pm_domain still points to a vga_switcheroo function that tries to call a radeon function (which no longer exists in memory after rmmod radeon). I bet that nouveau has the same problem.

I'm attaching a simple patch that sets dev->pm_domain of the PCI device back to NULL when the radeon device is removed, so vga_switcheroo will not be called.

But I think the proper way to fix this bug, which is in vga_switcheroo, would be to add a function like "vga_switcheroo_exit_domain_pm_ops()" that sets pm_domain back to its original value (which in my case is NULL).

With my patch on a PX system I can call rmmod radeon, modprobe radeon, rmmod radeon, ... many times without any crash.
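
The save-and-restore idea described above can be sketched in user-space C. This is only an illustration under stated assumptions: the structures are simplified stand-ins for the kernel's struct device and struct dev_pm_domain, and every sketch_* name is hypothetical and does not exist in vga_switcheroo. It models an init helper that remembers the previous pm_domain pointer and an exit helper that puts it back, so that no pointer into the unloaded driver remains installed.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; not the real definitions. */
struct dev_pm_domain {
    int (*runtime_suspend)(void *dev);
};

struct device {
    struct dev_pm_domain *pm_domain;
};

/* Hypothetical per-driver state: the domain it installs and what it replaced. */
struct sketch_client {
    struct dev_pm_domain domain;
    struct dev_pm_domain *saved;
};

/* Hypothetical init helper: remember the old pm_domain, then install ours. */
static void sketch_init_domain_pm_ops(struct device *dev,
                                      struct sketch_client *client)
{
    client->saved = dev->pm_domain;
    dev->pm_domain = &client->domain;
}

/* Hypothetical exit helper: restore whatever was there before init. */
static void sketch_exit_domain_pm_ops(struct device *dev,
                                      struct sketch_client *client)
{
    dev->pm_domain = client->saved;
    client->saved = NULL;
}

/* Models a PM core path: call into the domain only if one is installed. */
static int sketch_runtime_suspend(struct device *dev)
{
    if (dev->pm_domain && dev->pm_domain->runtime_suspend)
        return dev->pm_domain->runtime_suspend(dev);
    return 0;
}

/* Stand-in for a callback that lives in the driver module. */
static int sketch_driver_suspend(void *dev)
{
    (void)dev;
    return 1;
}

/* One "modprobe" / "rmmod" cycle; returns what a later suspend call sees. */
static int sketch_load_unload_cycle(void)
{
    struct device dev = { .pm_domain = NULL };
    struct sketch_client client = {
        .domain = { .runtime_suspend = sketch_driver_suspend },
        .saved = NULL,
    };

    sketch_init_domain_pm_ops(&dev, &client);
    assert(sketch_runtime_suspend(&dev) == 1); /* driver callback reached */

    sketch_exit_domain_pm_ops(&dev, &client);
    assert(dev.pm_domain == NULL);             /* original value restored */

    return sketch_runtime_suspend(&dev);       /* no stale callback is called */
}
```

Without the exit step, dev.pm_domain would keep pointing at memory owned by the unloaded driver, which is the situation that crashes the real kernel. The sketch cannot reproduce the crash itself safely, so it only shows the restore path.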
Comment 1 Alex Deucher 2014-09-12 21:38:00 UTC
Care to generate a git patch and sign-off on it?
Comment 2 Pali Rohár 2014-09-12 21:44:35 UTC
I can, but I do not know if this is the proper way to fix it. I still think the root of the bug is in the function vga_switcheroo_init_domain_pm_ops(), which overwrites dev->pm_domain but does not restore it when the driver/device unregisters.
Comment 3 Alex Deucher 2014-09-12 22:08:47 UTC
Created attachment 150001
patch 1/3

How about this patch set?
Comment 4 Alex Deucher 2014-09-12 22:09:14 UTC
Created attachment 150011
patch 2/3
Comment 5 Alex Deucher 2014-09-12 22:09:33 UTC
Created attachment 150021
patch 3/3
Comment 6 Pali Rohár 2014-09-12 23:01:04 UTC
I tested 1/3 and 2/3 on a 3.13 kernel. As expected (because the patches do the same thing), the result is the same as with my patch: no more kernel crash. You can add my Signed-off-by.

I do not have an nvidia optimus card, so I cannot test the last patch.

Anyway, vga_switcheroo.c exports the function vga_switcheroo_init_domain_pm_optimus_hdmi_audio(), which changes dev->pm_domain too. But I do not see any driver that uses it.
Comment 7 Pali Rohár 2014-09-21 10:15:50 UTC
The function vga_switcheroo_init_domain_pm_optimus_hdmi_audio() is used in sound/pci/hda/hda_intel.c. So that driver has the same problem and causes a kernel panic on driver unload.
Comment 8 Joaquín Aramendía 2014-11-26 22:56:58 UTC
Alex, that patchset indeed got rid of that bug, but for some reason it introduced another one:
https://bugzilla.kernel.org/show_bug.cgi?id=86011

Commit 97d30fa3524ff60b43d450012abe8f961d280478 from the stable kernel tree breaks nouveau power management through vga-switcheroo.
Comment 9 Peter Wu 2016-07-15 13:27:21 UTC
(In reply to Pali Rohár from comment #7)
> The function vga_switcheroo_init_domain_pm_optimus_hdmi_audio() is used in
> sound/pci/hda/hda_intel.c. So that driver has the same problem and causes a
> kernel panic on driver unload.

A patch for this issue is queued at
http://mailman.alsa-project.org/pipermail/alsa-devel/2016-July/110125.html

Joaquín, how does 97d30fa35 break nouveau vga-switcheroo? If you load nouveau with runpm=0, then you can write OFF to debugfs' vga_switcheroo. However runpm=1 (or -1 for Optimus systems) is recommended.

I think that the original bug is fixed, so this can be marked as resolved?
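
The debugfs write mentioned above can be sketched as a small C helper. This is a minimal sketch, not a tested tool: the helper name is hypothetical, and the default path assumes debugfs is mounted at /sys/kernel/debug and that the vga_switcheroo control file is the "switch" file under vgaswitcheroo/. The helper takes the path as a parameter so it can be pointed elsewhere.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed location of the control file when debugfs is at /sys/kernel/debug. */
#define SKETCH_SWITCHEROO_PATH "/sys/kernel/debug/vgaswitcheroo/switch"

/*
 * Hypothetical helper: write one command string (for example "OFF") to the
 * given control file. Returns 0 on success and -1 if the file cannot be
 * opened or the write fails.
 */
static int sketch_write_switch_command(const char *path, const char *cmd)
{
    FILE *f;
    int ok;

    assert(path != NULL && cmd != NULL);

    f = fopen(path, "w");
    if (f == NULL)
        return -1;

    ok = fputs(cmd, f) >= 0;
    if (fclose(f) != 0)     /* buffered write errors surface on close */
        ok = 0;

    return ok ? 0 : -1;
}
```

A call such as sketch_write_switch_command(SKETCH_SWITCHEROO_PATH, "OFF") would need root privileges and, as noted above, is meant for the case where nouveau was loaded with runpm=0.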
Comment 10 Joaquín Aramendía 2016-08-05 19:18:18 UTC
> Joaquín, how does 97d30fa35 break nouveau vga-switcheroo? If you load
> nouveau with runpm=0, then you can write OFF to debugfs' vga_switcheroo.
> However runpm=1 (or -1 for Optimus systems) is recommended.

Just tested removing the nouveau module on Ubuntu 16.04 with mainline kernel v4.6.5 and it worked correctly. I also modprobed it afterwards and that worked correctly too. This bug should be marked as resolved.
