Bug 208363

Summary: Restart failure with IOMMU errors
Product: ACPI
Reporter: Hsiao-Ting Wang, Tiffany (hsiaoting.wang)
Component: BIOS
Assignee: Lu Baolu (baolu.lu)
Status: CLOSED CODE_FIX
Severity: high
CC: ashok.raj, baolu.lu, gicmo, koba.ko, leho, rui.zhang, superm1, tiffany.wang
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.6
Subsystem:
Regression: No
Bisected commit-id:
Attachments: kernel log with patches as described.
A test change
A potential fix patch ask for test

Description Hsiao-Ting Wang, Tiffany 2020-06-29 03:35:21 UTC
Created attachment 289919 [details]
kernel log with patches as described.

Restart failure with IOMMU errors; we need Intel's support with the i915 driver.

We found a restart failure on the Tiger Lake (TGL) platform. After debugging, we need Intel's help with the i915 driver.

Steps to reproduce:
1. Suspend the system.
2. Press the power button to wake the system.
3. Restart the system.
4. The system hangs and cannot boot to the user login screen.


Fail Rate: 100%
TGL platform
OS: Ubuntu 20.04

Related kernel log:
Jun 3 22:00:30 kernel: [ 117.686352] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
Jun 3 22:00:30  kernel: [ 117.690538] done.
Jun 3 22:00:30  kernel: [ 117.712034] nvme nvme0: 8/0/0 default/read/poll queues
Jun 3 22:00:30 kernel: [ 117.741473] PM: suspend exit
Jun 3 22:00:39  kernel: [ 126.457779] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
Jun 3 22:00:44 kernel: [ 132.024069] rfkill: input handler enabled
Jun 3 22:00:45  kernel: [ 132.293626] sof-audio-pci 0000:00:1f.3: firmware boot complete
The fan keeps running, and the only way to power off the system is to long-press the power button.
We also tried each of the kernel parameters below (3 tests), and the reboot failure could not be reproduced:

nomodeset
i915.modeset=0
intel_iommu=igfx_off
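
To confirm which of these workarounds is actually active on a given boot, the kernel command line can be inspected. The following is a minimal Python sketch, assuming the command line is read from /proc/cmdline. The helper name `active_workarounds` is hypothetical, and only the three parameters listed above are checked.

```python
from __future__ import annotations

from pathlib import Path

# The three parameters tried in this report.
WORKAROUNDS = ("nomodeset", "i915.modeset=0", "intel_iommu=igfx_off")


def active_workarounds(cmdline: str) -> list[str]:
    """Return the workaround parameters present in a kernel command line.

    Exact token match only: a combined value such as
    intel_iommu=on,igfx_off would not be detected by this sketch.
    """
    tokens = cmdline.split()
    return [param for param in WORKAROUNDS if param in tokens]


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with.
    print(active_workarounds(Path("/proc/cmdline").read_text()))
```

Recording this alongside each test result makes it easier to tell which of the three runs a given log belongs to.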


Based on Intel's suggestion, we applied two patches:
Patch 1:
https://patchwork.freedesktop.org/patch/355270/
Patch 2:
https://patchwork.freedesktop.org/patch/365227/

Same as the previous situation:
the system restarts successfully but takes a long time to shut down.
The waiting point is still at iommu_disable_translation()@intel-iommu.c.
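
For context on why a stall can occur at this point: disabling translation amounts to clearing the Translation Enable (TE) bit in the IOMMU's global command register and then polling the global status register until the hardware reports that translation is off (TES cleared). The following is a simplified user-space model in Python, not the kernel source. The `FakeIommu` class, its `responsive` flag and the `max_polls` limit are assumptions made for illustration; only the bit position (bit 31 for TE and TES) comes from the VT-d register layout.

```python
TE_BIT = 1 << 31   # Translation Enable bit in the global command register
TES_BIT = 1 << 31  # Translation Enable Status bit in the global status register


class FakeIommu:
    """Toy stand-in for one IOMMU unit (hypothetical, for illustration only)."""

    def __init__(self, responsive: bool) -> None:
        self.responsive = responsive
        self.gcmd = TE_BIT   # translation starts out enabled
        self.gsts = TES_BIT

    def write_gcmd(self, value: int) -> None:
        self.gcmd = value
        # A healthy unit mirrors TE into TES; a wedged one never updates it.
        if self.responsive:
            self.gsts = (self.gsts & ~TES_BIT) | (value & TE_BIT)

    def read_gsts(self) -> int:
        return self.gsts


def disable_translation(iommu: FakeIommu, max_polls: int = 1000) -> bool:
    """Clear TE, then poll until TES clears. Return False if it never does."""
    iommu.write_gcmd(iommu.gcmd & ~TE_BIT)
    for _ in range(max_polls):
        if not (iommu.read_gsts() & TES_BIT):
            return True
    return False
```

With `FakeIommu(responsive=True)` the call returns on the first poll. With `responsive=False` it exhausts `max_polls`, which is the model's analogue of the long wait described above.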
Comment 1 Hsiao-Ting Wang, Tiffany 2020-06-29 06:54:02 UTC
More observation:
When we mark out (skip) the call to iommu_disable_translation()@intel-iommu.c,
reboot works and the system gets back into the OS quickly.
Comment 2 Lu Baolu 2020-06-30 02:52:14 UTC
Created attachment 289955 [details]
A test change
Comment 3 Lu Baolu 2020-06-30 02:53:19 UTC
Can anybody help test whether the attached change helps here?
Comment 4 Hsiao-Ting Wang, Tiffany 2020-06-30 06:17:53 UTC
We are checking and will update you.
Comment 5 Hsiao-Ting Wang, Tiffany 2020-06-30 07:25:41 UTC
@Lu Baolu,
We tested your test change from comment 2; the system no longer gets stuck and restarts fine.
May I know whether this change has any side effects?
We found that it also skips the IOMMU as a workaround.

Thank you for your effort.
Comment 6 Lu Baolu 2020-06-30 07:43:35 UTC
@Hsiao-Ting Wang, I'm not sure about the side effects, hence I need to do more tests before posting it upstream. Thanks a lot for your testing.
Comment 7 Ashok Raj 2020-07-01 16:00:22 UTC
Can you also test kexec with the latest patch from Baolu?
Comment 8 KobaKo 2020-07-02 05:32:16 UTC
@Ashok,
What's the difference between installing a kernel and kexec!?
The waiting point is in the kernel (iommu_disable_translation()@intel-iommu.c).
Comment 9 Mario Limonciello 2020-07-09 13:59:35 UTC
Is this possibly the same root cause as https://bugzilla.kernel.org/show_bug.cgi?id=206571 ?
Comment 10 Lu Baolu 2020-07-10 01:19:10 UTC
(In reply to Hsiao-Ting Wang from comment #5)
> @Lu Baolu,
> We test your test change on comment 2, the system would not stuck and can
> restart okay.
> May I know is there any side effect for this change?
> We found it also skip iommu as a workaround.
> 
> Thank you for your effort.

Can you please let me know the PCI vendor/device IDs of the integrated graphics device?

Best regards,
baolu
Comment 11 KobaKo 2020-07-10 01:57:28 UTC
@Baolu,
It's an Intel GPU (i915).
Comment 12 KobaKo 2020-07-10 02:16:14 UTC
vendor id/device id = 0086:9a49
Comment 13 Lu Baolu 2020-07-10 02:22:11 UTC
0086 isn't the vendor id for Intel. Can you please double check?
Comment 14 KobaKo 2020-07-10 02:35:54 UTC
Sorry, my fault. Here is the corrected information:
vendor id/device id = 8086:9a49
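
A small check like the following would catch a mistyped ID such as the one in comment 12. This is a sketch in Python; the function names `parse_pci_id` and `is_intel` are hypothetical, and the only outside fact it relies on is that 0x8086 is Intel's PCI vendor ID.

```python
from __future__ import annotations

INTEL_VENDOR_ID = 0x8086  # PCI vendor ID assigned to Intel


def parse_pci_id(text: str) -> tuple[int, int]:
    """Parse 'vvvv:dddd' (four hex digits each) into (vendor_id, device_id)."""
    vendor, sep, device = text.strip().partition(":")
    if not sep or len(vendor) != 4 or len(device) != 4:
        raise ValueError(f"expected 'vvvv:dddd', got {text!r}")
    return int(vendor, 16), int(device, 16)


def is_intel(text: str) -> bool:
    """True if the vendor half of the ID is Intel's."""
    vendor, _device = parse_pci_id(text)
    return vendor == INTEL_VENDOR_ID
```

On a running system the same vendor:device pair is shown in square brackets by `lspci -nn`.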
Comment 15 Mario Limonciello 2020-07-13 19:14:03 UTC
@KobaKo, you're not on the CC list of https://bugzilla.kernel.org/show_bug.cgi?id=206571, where a patch was proposed. Can you see if it helps for the TGL case?
Comment 16 KobaKo 2020-07-14 05:01:35 UTC
@Mario, I tried the patch and it doesn't work.
The machine still hangs in iommu_disable_translation()@intel-iommu.c.
Comment 17 Lu Baolu 2020-07-16 01:36:04 UTC
Created attachment 290317 [details]
A potential fix patch ask for test

Hi, can anybody help test the attached patch (A potential fix patch ask for test)?

Best regards,
baolu
Comment 18 KobaKo 2020-07-16 02:19:58 UTC
@Baolu,
Could you explain in more detail why gfx would be skipped when disabling TE!?
Comment 19 Lu Baolu 2020-07-16 02:36:39 UTC
@Kobako, I have explained it in the commit message. Is that sufficient for you? I am not sure whether it's the root cause of the issue reported here, hence I'm asking for some tests.
Comment 20 KobaKo 2020-07-16 02:41:36 UTC
@Baolu, 
That's nice; the information in the patch is very detailed.
Thanks
Comment 21 KobaKo 2020-07-16 03:40:28 UTC
@Baolu, 
With the test patch, I tried multiple times (around 6); the machine (TGL) didn't hang during shutdown and didn't wait a long time (*).

* With the previous patch, the machine didn't hang but waited a long time to restart.
Comment 22 Lu Baolu 2020-07-16 05:26:10 UTC
@Kobako, thanks for the testing. Can I add your Tested-by when I submit this patch to the upstream Linux kernel?
Comment 23 KobaKo 2020-07-16 06:04:16 UTC
@Baolu, 
Can we reset the IOMMU during resume!?
After the power state transition is triggered, is DMA translation in a corrupted state? If it is and you don't recover it, does DMA translation still work correctly afterwards!?
Comment 24 KobaKo 2020-07-16 06:06:47 UTC
@Baolu, 
Yes, please take it.
Tested-by: koba.ko@canonical.com
Comment 25 KobaKo 2020-07-16 06:17:15 UTC
@Baolu, 
Will you refine anything in the patch!? Is the last patch you provided the final version!?
Comment 26 Lu Baolu 2020-07-16 06:30:37 UTC
(In reply to KobaKo from comment #23)
> @Baolu, 
> Can we reset the iommu during resume!?
> After the power state transition is triggered, is the dma translation in a
> corrupted status? if it is and you don't recover it, does dma translation
> work well in the following time!?

The same thing happens during iommu suspend/resume.

https://bugzilla.kernel.org/show_bug.cgi?id=206571

The patch is awaiting testing for that issue.
Comment 27 Lu Baolu 2020-07-16 06:31:55 UTC
(In reply to KobaKo from comment #25)
> @Baolu, 
> Will you refine something in the patch!? Is the last patch you provided a
> final version!?

It should be if it passes the test for 206571. Do you have any review comments?
Comment 28 KobaKo 2020-07-16 07:48:22 UTC
(In reply to Lu Baolu from comment #26)
> (In reply to KobaKo from comment #23)
> > @Baolu, 
> > Can we reset the iommu during resume!?
> > After the power state transition is triggered, is the dma translation in a
> > corrupted status? if it is and you don't recover it, does dma translation
> > work well in the following time!?
> 
> The same thing happens during iommu suspend/resume.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=206571
> 
> The patch is aking for test for this issue.

On my side, the issue is only triggered if the system is suspended before the reboot. Does that mean the IOMMU is in a corrupted state after suspend? Should we recover/reset the IOMMU when the machine resumes from suspend!?
Comment 29 KobaKo 2020-07-17 02:12:28 UTC
@Baolu,
Would you please share the official patch once you push it upstream!?
Thanks
Comment 30 Lu Baolu 2020-07-17 02:15:58 UTC
Sure! I will update here.
Comment 31 Christian Kellner 2020-07-17 15:23:30 UTC
I put the patch into the current rawhide kernel (kernel-5.8.0-0.rc5.20200715gite9919e11e219) for a user, who currently has shutdown issues on his Dell XPS 9300 (Ice Lake IIRC), to test and they report that "Both shutdown and reboot are working."
Comment 32 Zhang Rui 2020-11-26 07:34:40 UTC
@Baolu,
what is the status of this issue?
Comment 33 Lu Baolu 2020-11-26 08:09:58 UTC
It has been upstreamed. Is this issue still there?
Comment 34 Zhang Rui 2021-01-03 14:50:26 UTC
Bug closed as the fix is already in upstream.