Bug 206571 - [Bisected] On kernel 5.5 system hangs on shutdown or reboot
Summary: [Bisected] On kernel 5.5 system hangs on shutdown or reboot
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: IOMMU (show other bugs)
Hardware: Intel Linux
: P1 normal
Assignee: drivers_iommu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-02-17 13:19 UTC by Edoardo Signorini
Modified: 2021-12-19 17:56 UTC (History)
15 users (show)

See Also:
Kernel Version: 5.5
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
A potential fix patch ask for test (4.55 KB, patch)
2020-07-16 01:34 UTC, Lu Baolu
Details | Diff

Description Edoardo Signorini 2020-02-17 13:19:12 UTC
Description:
System hangs after reaching "Reboot: Power Down" and I have to force shutdown by pressing the power button for few seconds.
I'm on a Dell XPS 7390 2-in-1 with Arch, kernel version 5.5.4.
I have this bug since version 5.5.

Steps to reproduce:
It always hangs if I suspend and wakeup the system before shutdown. It is not the only condition but I've not identified other ways to always reproduce.

Fix:
Reverting commit 6c3a44ed3c553c324845744f30bcd1d3b07d61fd "iommu/vt-d: Turn off translations at shutdown" resolves the issue.
Comment 1 Marc C 2020-05-14 20:39:28 UTC
Edoardo (and anyone else reading this).

Just chiming in with a +1 and 'me too' report for a XPS 13" 2020.  Same weird reboot/poweroff behaviour where the laptop just sits with the LED & Keyboard backlight and requires the 8second power button press to get it to turn off.  

The history of my diagnostics can be found over at @ https://bugzilla.redhat.com/show_bug.cgi?id=1834277

--

The reason I'm commenting is because the kernel bisect and related git commit (to do with "vt-d and iommu" presented a functional workaround!


Go into the BIOS of the laptop, under Virtualization, and DISABLE the option "VT for Direct I/O".

Once that is disabled, the laptop will reboot or poweroff properly from 5.5+ kernels.  I'm testing in FC31 on the aforementioned Dell XPS 13 2020 and its working great. At present, I'm running kernel 5.6.11!
Comment 2 random 2020-05-20 19:32:57 UTC
I had the same problem with Kernel 5.3.0.53. Reverting to 5.3.0.51 fixed it.
System is an ASUS A320-M-K with a AMD Ryzen 3 2200g.
Comment 3 Leho Kraav 2020-07-09 08:40:50 UTC
7390 2-in-1, same problem (obviously).

So which is worse or better: reverting the commit, or turning off BIOS option?

What is "VT for Direct I/O" useful for?
Comment 4 Mario Limonciello 2020-07-09 14:02:49 UTC
If you can; you're better off reverting the commit for now.  However while you're rebuilding your kernel I think a good data point to provide is if this is fixed in current mainline tree.  The particular code in that commit is still present (IOMMU is turned off at shutdown) but to make sure that other changes to IOMMU haven't changed it.

Regarding what that BIOS option does: It uses the IOMMU to prevent devices from being able to access the memory space of one another.  It prevents attacks such as Thunderspy.
Comment 5 Lu Baolu 2020-07-10 01:20:25 UTC
Can you please let me know the pci vendor/device ids of the integrated graphic device?
Comment 6 Leho Kraav 2020-07-10 08:14:38 UTC
leho@papaya ~ $ [-] sudo lspci -nn 
...
00:02.0 VGA compatible controller [0300]: Intel Corporation Iris Plus Graphics G7 [8086:8a52] (rev 07)
...
Comment 7 deepa 2020-07-12 21:11:22 UTC
I think this is happening because some devices don't get shutdown properly especially if in suspend power states and when the device comes back up, the IOMMU is gone and the device is possibly accessing physical addresses directly.

Can you try if the following patch works for you?

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index a3a29647272c..879194797dee 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5040,6 +5040,8 @@ void intel_iommu_shutdown(void)
 {
        struct dmar_drhd_unit *drhd;
        struct intel_iommu *iommu = NULL;
+       struct device *dev;
+       int i;

        if (no_iommu || dmar_disabled)
                return;
@@ -5047,11 +5049,21 @@ void intel_iommu_shutdown(void)
        down_write(&dmar_global_lock);

        /* Disable PMRs explicitly here. */
-       for_each_iommu(iommu, drhd)
-               iommu_disable_protect_mem_regions(iommu);
+       for_each_iommu(iommu, drhd) {
+        /*
+          * All PCI devices managed by this unit should have been destroyed.
+          */
+         if (!drhd->include_all && drhd->devices && drhd->devices_cnt) {
+                 for_each_active_dev_scope(drhd->devices,
+                                           drhd->devices_cnt, i, dev)
+                         continue;
+         }

-       /* Make sure the IOMMUs are switched off */
-       intel_disable_iommus();
+               disable_dmar_iommu(iommu);
+               free_dmar_iommu(iommu);
+
+               iommu_disable_protect_mem_regions(iommu);
+       }

        up_write(&dmar_global_lock);


If it does, the AMD IOMMU code also has a similar issue, I can figure out a clean way of doing this and post to the mailing list.

Sorry for the delay in looking into this, I did not have an account.
Mario contacted me on email directly.
Comment 8 Chih 2020-07-15 05:48:02 UTC
(In reply to deepa from comment #7)
> I think this is happening because some devices don't get shutdown properly
> especially if in suspend power states and when the device comes back up, the
> IOMMU is gone and the device is possibly accessing physical addresses
> directly.
> 
> Can you try if the following patch works for you?
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index a3a29647272c..879194797dee 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5040,6 +5040,8 @@ void intel_iommu_shutdown(void)
>  {
>         struct dmar_drhd_unit *drhd;
>         struct intel_iommu *iommu = NULL;
> +       struct device *dev;
> +       int i;
> 
>         if (no_iommu || dmar_disabled)
>                 return;
> @@ -5047,11 +5049,21 @@ void intel_iommu_shutdown(void)
>         down_write(&dmar_global_lock);
> 
>         /* Disable PMRs explicitly here. */
> -       for_each_iommu(iommu, drhd)
> -               iommu_disable_protect_mem_regions(iommu);
> +       for_each_iommu(iommu, drhd) {
> +        /*
> +          * All PCI devices managed by this unit should have been destroyed.
> +          */
> +         if (!drhd->include_all && drhd->devices && drhd->devices_cnt) {
> +                 for_each_active_dev_scope(drhd->devices,
> +                                           drhd->devices_cnt, i, dev)
> +                         continue;
> +         }
> 
> -       /* Make sure the IOMMUs are switched off */
> -       intel_disable_iommus();
> +               disable_dmar_iommu(iommu);
> +               free_dmar_iommu(iommu);
> +
> +               iommu_disable_protect_mem_regions(iommu);
> +       }
> 
>         up_write(&dmar_global_lock);
> 
> 
> If it does, the AMD IOMMU code also has a similar issue, I can figure out a
> clean way of doing this and post to the mailing list.
> 
> Sorry for the delay in looking into this, I did not have an account.
> Mario contacted me on email directly.

Please refer to the comment in https://bugzilla.kernel.org/show_bug.cgi?id=208363#c16. Unfortunately, this suggested does seem to work.
Comment 9 Lu Baolu 2020-07-15 06:36:49 UTC
Can you please check whether the test patch posted at

https://bugzilla.kernel.org/show_bug.cgi?id=208363

can help here?
Comment 10 Mario Limonciello 2020-07-15 14:57:17 UTC
Just to make it clearer, Lu means the patch in this comment on that bug: https://bugzilla.kernel.org/attachment.cgi?id=289955&action=diff
Comment 11 Lu Baolu 2020-07-16 01:34:33 UTC
Created attachment 290315 [details]
A potential fix patch ask for test

Hi, can anybody help to test the patch attached (A potential fix patch ask for test)?

Best regards,
baolu
Comment 12 François Guerraz 2020-08-19 14:43:05 UTC
I confirm that this is working with the patch that landed in 5.8.2
Comment 13 Leho Kraav 2021-12-19 17:56:49 UTC
I'm fairly certain this bug can be closed as RESOLVED, as I haven't seen this happen on my 7390 for a year or longer.

Note You need to log in before you can comment on or make changes to this bug.