Bug 216527 - intel/iommu: regression causing nvidia driver crash in an optimus system
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: IOMMU
Hardware: Intel Linux
Importance: P1 high
Assignee: drivers_iommu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-09-24 15:24 UTC by michelesr
Modified: 2022-09-25 12:37 UTC
CC List: 3 users

See Also:
Kernel Version: 5.19.9
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Script to disable the nvidia card (379 bytes, application/x-shellscript)
2022-09-24 15:25 UTC, michelesr
Script to reenable the nvidia card (489 bytes, application/x-shellscript)
2022-09-24 15:25 UTC, michelesr
Kernel log (nvidia) (158.70 KB, text/plain)
2022-09-24 15:25 UTC, michelesr
Kernel log (nouveau) (168.91 KB, text/plain)
2022-09-24 18:42 UTC, michelesr

Description michelesr 2022-09-24 15:24:34 UTC
On some laptops (e.g. the Dell XPS 9570, which is what I used for testing), some unofficial power management solutions (including nvidia-xrun) remove the card from the system so that the "nvidia" module is not loaded automatically (e.g. by Xorg, a Wayland compositor, a web browser, or anything else that wants to use the discrete GPU). A rescan is then triggered automatically to re-add the card to the system so that it can be used.

Since kernel 5.19.9, using the graphics card (e.g. launching vkcube or simply nvidia-smi) causes an error in the proprietary nvidia driver and hangs the process. Because the process can't be killed, a normal reboot is blocked; the only ways to reboot are REISUB (if allowed) or holding the power button.

I've identified the commit that introduced the issue as https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9cd4f1434479f1ac25c440c421fbf52069079914; reverting that commit fixes the issue.

I know nvidia is not officially supported, but since this is a kernel regression (and could possibly affect something other than nvidia), I've decided to report it anyway.

I've attached:

- a kernel log with the stacktrace for the nvidia driver crashing
- example scripts used to turn the nvidia card on and off, showing how the remove and rescan methods are used together with the related PM operations. 
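
For readers without access to the attachments, the remove/rescan sequence can be sketched roughly as below. This is not the attached script: it is a minimal illustration that assumes the GPU sits at PCI address 0000:01:00.0 (a placeholder that varies per machine) and uses the kernel's standard sysfs PCI files `remove` and `rescan`. The `SYSFS_ROOT` and `GPU_ADDR` variables and the function names are hypothetical, and the related PM operations mentioned above are left out.

```shell
#!/bin/sh
# Hypothetical sketch of the remove/rescan cycle; run as root on a real system.
# SYSFS_ROOT and GPU_ADDR are illustrative names, not taken from the attachments.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
GPU_ADDR="${GPU_ADDR:-0000:01:00.0}"   # assumed PCI address of the discrete GPU

# Detach the GPU from the PCI bus so that nothing can load a driver for it.
gpu_remove() {
    dev="$SYSFS_ROOT/bus/pci/devices/$GPU_ADDR"
    if [ ! -e "$dev/remove" ]; then
        echo "device $GPU_ADDR not present" >&2
        return 1
    fi
    echo 1 > "$dev/remove"
}

# Ask the kernel to re-enumerate the PCI bus, which re-adds the GPU.
gpu_rescan() {
    echo 1 > "$SYSFS_ROOT/bus/pci/rescan"
}
```

Calling `gpu_remove`, then `gpu_rescan`, and then using the GPU (e.g. with nvidia-smi) corresponds to the sequence described in this report.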

Related Arch Linux bug tracker issue: https://bugs.archlinux.org/task/75986?project=1

Another post describing the issue: https://forum.manjaro.org/t/testing-update-2022-09-16-kernels-kde-gear-kde-frameworks-libreoffice-qemu-qt-wine/121967/12
Comment 1 michelesr 2022-09-24 15:25:12 UTC
Created attachment 301864 [details]
Script to disable the nvidia card
Comment 2 michelesr 2022-09-24 15:25:30 UTC
Created attachment 301865 [details]
Script to reenable the nvidia card
Comment 3 michelesr 2022-09-24 15:25:49 UTC
Created attachment 301866 [details]
Kernel log (nvidia)
Comment 4 michelesr 2022-09-24 18:41:30 UTC
This issue can also be partly reproduced with nouveau. Running "DRI_PRIME=1 glxinfo" takes longer than usual, and a similar stack trace appears in the kernel log (I'll attach that too); however, the process doesn't hang and terminates correctly.
Comment 5 michelesr 2022-09-24 18:42:39 UTC
Created attachment 301867 [details]
Kernel log (nouveau)
Comment 6 Lu Baolu 2022-09-25 01:30:10 UTC
Can you please try the latest mainline kernel? We recently reverted a buggy patch with the commit below:

7ebb5f8e0010 Revert "iommu/vt-d: Fix possible recursive locking in intel_iommu_init()"

This has been merged into mainline and backported to the stable v5.19.x kernels by Greg.
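
One possible way to check whether a local mainline checkout already contains this revert is sketched below. It relies on `git merge-base --is-ancestor`, which exits with status 0 when the given commit is an ancestor of HEAD. The function name and the repository path in the usage note are hypothetical. The check only applies to mainline-based trees, because a stable backport is a separate commit with a different hash.

```shell
#!/bin/sh
# tree_has_commit REPO COMMIT
# Succeeds if COMMIT is an ancestor of HEAD in the git repository at REPO.
# Fails if it is not an ancestor, or if the commit is unknown to that repository.
tree_has_commit() {
    repo=$1
    commit=$2
    git -C "$repo" merge-base --is-ancestor "$commit" HEAD 2>/dev/null
}
```

For example, `tree_has_commit ~/src/linux 7ebb5f8e0010 && echo "revert present"`, where `~/src/linux` stands in for wherever the kernel tree is checked out.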
Comment 7 michelesr 2022-09-25 12:35:59 UTC
I've tried the latest mainline (HEAD at 105a36f3694edc680f3e9318cdd3c03722e42554) and it works fine, thanks for fixing this.

Feel free to reach me via e-mail should you need to test further changes to the code.
