Bug 219363 - [BISECTED] iommu/vt-d patch causes hang on boot on 3 machines (2 Dells and a Samsung)
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: IOMMU
Hardware: All Linux
Importance: P3 high
Assignee: drivers_iommu
URL:
Keywords:
Depends on:
Blocks: 178231
Reported: 2024-10-08 20:11 UTC by Todd Brandt
Modified: 2024-10-14 22:36 UTC
CC List: 2 users

See Also:
Kernel Version: 6.12-rc1
Subsystem:
Regression: Yes
Bisected commit-id: 2031c469f8161abe74189cb74f50da224f340b71


Attachments
dmesg-boot-otcpl-dell-p3520.txt (83.40 KB, text/plain)
2024-10-08 20:13 UTC, Todd Brandt
dmesg-boot-otcpl-galaxy-book-10.txt (96.91 KB, text/plain)
2024-10-08 20:14 UTC, Todd Brandt
dmesg-boot-otcpl-dell-p5510-xeon-1.txt (87.03 KB, text/plain)
2024-10-08 20:15 UTC, Todd Brandt
dell-p3520-6.12-rc1-boot-fail.jpg (773.69 KB, image/jpeg)
2024-10-08 20:18 UTC, Todd Brandt
samsung-galaxy-6.12-rc1-boot-fail.jpg (700.60 KB, image/jpeg)
2024-10-08 20:38 UTC, Todd Brandt
config file for kernel under test (294.03 KB, text/plain)
2024-10-10 02:13 UTC, Lu Baolu

Description Todd Brandt 2024-10-08 20:11:34 UTC
Three of our systems have failed to boot since the following commit:

commit 2031c469f8161abe74189cb74f50da224f340b71 (HEAD, refs/bisect/bad)
Author: Lu Baolu <baolu.lu@linux.intel.com>
Date:   Mon Sep 2 10:27:16 2024 +0800

    iommu/vt-d: Add support for static identity domain
    
    Software determines VT-d hardware support for passthrough translation by
    inspecting the capability register. If passthrough translation is not
    supported, the device is instructed to use DMA domain for its default
    domain.
    
    Add a global static identity domain with guaranteed attach semantics for
    IOMMUs that support passthrough translation mode.
    
    The global static identity domain is a dummy domain without corresponding
    dmar_domain structure. Consequently, the device's info->domain will be
    NULL with the identity domain is attached. Refactor the code accordingly.

The machines are:

otcpl-samsung-galaxy-book-10:
os-version              : Ubuntu 22.04.5 LTS
baseboard-manufacturer  : SAMSUNG ELECTRONICS CO., LTD.
baseboard-product-name  : SM-W620NZKAXAR
baseboard-serial-number : 123490EN400015
baseboard-version       : SGL8879A3F-C01-G001-S0001+10.0.16299
bios-release-date       : 11/22/2017
bios-vendor             : American Megatrends Inc.
bios-version            : P03HAD.003.171122.WY.2239
chassis-manufacturer    : SAMSUNG ELECTRONICS CO., LTD.
chassis-serial-number   : None
chassis-version         : N/A
processor-manufacturer  : Intel(R) Corporation
processor-version       : Intel(R) Core(TM) m3-7Y30 CPU @ 1.00GHz
system-manufacturer     : SAMSUNG ELECTRONICS CO., LTD.
system-product-name     : Galaxy Book 10.6
system-serial-number    : 134RR52K500022
system-version          : P03HAD
cpucount                : 4
memtotal                : 3767776 kB
memfree                 : 1831016 kB

otcpl-dell-p5510-xeon-1:
os-version              : Ubuntu 24.04.1 LTS
baseboard-manufacturer  : Dell Inc.
baseboard-product-name  : 0W7V82
baseboard-serial-number : /2QQQN72/CN1296363I0006/
baseboard-version       : A00
bios-release-date       : 07/06/2022
bios-vendor             : Dell Inc.
bios-version            : 1.23.0
chassis-manufacturer    : Dell Inc.
chassis-serial-number   : 2QQQN72
processor-manufacturer  : Intel(R) Corporation
processor-version       : Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz
system-manufacturer     : Dell Inc.
system-product-name     : Precision 5510
system-serial-number    : 2QQQN72
cpucount                : 4
memtotal                : 16197492 kB
memfree                 : 13685292 kB

otcpl-dell-p3520:
os-version              : Ubuntu 24.04.1 LTS
baseboard-manufacturer  : Dell Inc.
baseboard-product-name  : 0167B9
baseboard-serial-number : /CGKMTC2/CN129636C5001F/
baseboard-version       : A00
bios-release-date       : 07/03/2023
bios-vendor             : Dell Inc.
bios-version            : 1.31.0
chassis-manufacturer    : Dell Inc.
chassis-serial-number   : CGKMTC2
processor-manufacturer  : Intel(R) Corporation
processor-version       : Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
system-manufacturer     : Dell Inc.
system-product-name     : Precision 3520
system-serial-number    : CGKMTC2
cpucount                : 8
memtotal                : 16264232 kB
memfree                 : 14623628 kB
Comment 1 Todd Brandt 2024-10-08 20:13:11 UTC
Created attachment 306984
dmesg-boot-otcpl-dell-p3520.txt

boot from 6.11
Comment 2 Todd Brandt 2024-10-08 20:14:26 UTC
Created attachment 306985
dmesg-boot-otcpl-galaxy-book-10.txt

boot from 6.11, error not shown
Comment 3 Todd Brandt 2024-10-08 20:15:29 UTC
Created attachment 306986
dmesg-boot-otcpl-dell-p5510-xeon-1.txt

boot from 6.11, error not shown
Comment 4 Todd Brandt 2024-10-08 20:18:58 UTC
Created attachment 306987
dell-p3520-6.12-rc1-boot-fail.jpg

Screenshot of the dmesg output at the Dell p3520 boot failure.
Comment 5 Todd Brandt 2024-10-08 20:38:16 UTC
Created attachment 306988
samsung-galaxy-6.12-rc1-boot-fail.jpg

Snapshot of the failure output on screen.
Comment 6 Lu Baolu 2024-10-10 02:13:10 UTC
Created attachment 306996
config file for kernel under test
Comment 7 Lu Baolu 2024-10-10 03:02:36 UTC
I built a kernel image with the same config file and tested it on a Skylake desktop that I have. It booted smoothly without any warnings or errors.

I also tried the same kernel binary on the other clients and servers that I can access, and I can't reproduce the problem. It appears that this boot failure is related to specific platforms?
Comment 8 Lu Baolu 2024-10-10 07:11:08 UTC
If commit 2031c469f816 ("iommu/vt-d: Add support for static identity domain") really is the cause, can you please try the change below? It's not a final solution, but it can help narrow down the root cause.

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 9f6b0780f2ef..a9b986cf9d85 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3628,9 +3628,11 @@ int prepare_domain_attach_device(struct iommu_domain *domain,
 static int intel_iommu_attach_device(struct iommu_domain *domain,
                                     struct device *dev)
 {
+       struct device_domain_info *info = dev_iommu_priv_get(dev);
        int ret;
 
-       device_block_translation(dev);
+       if (info->domain)
+               device_block_translation(dev);
 
        ret = prepare_domain_attach_device(domain, dev);
        if (ret)
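
For readers who want to see the effect of the guard in isolation, here is a minimal userspace sketch. It is not kernel code: struct mock_domain, struct mock_dev_info, mock_block_translation() and mock_attach() are hypothetical stand-ins invented for illustration, and the only assumption carried over from the commit message is that info->domain is NULL while the static identity domain (or nothing) is attached. It does not explain why the unguarded call hangs on the affected machines; it only shows which attach calls skip the teardown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dmar_domain; not the kernel definition. */
struct mock_domain {
	int id;
};

/* Hypothetical stand-in for struct device_domain_info. */
struct mock_dev_info {
	struct mock_domain *domain;	/* NULL: no dmar_domain attached */
	int block_calls;		/* number of teardowns performed */
};

/* Models device_block_translation(): drop the current translation. */
void mock_block_translation(struct mock_dev_info *info)
{
	assert(info != NULL);
	info->block_calls++;
	info->domain = NULL;
}

/* Models the guarded attach path from the diff above. */
int mock_attach(struct mock_dev_info *info, struct mock_domain *new_domain)
{
	assert(info != NULL);

	/* Only tear down when a dmar_domain is currently attached. */
	if (info->domain)
		mock_block_translation(info);

	info->domain = new_domain;
	return 0;
}
```

In this model, the first attach on a device whose domain pointer is still NULL performs no teardown, while a later attach that replaces an existing domain does. That is the only behavioral difference the two added lines introduce.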
Comment 9 Todd Brandt 2024-10-11 01:48:27 UTC
Heh, I literally just found the same thing by experimenting. I tested tweaking the regression commit and found that if I build it with these 2 lines added back in, the kernel boots again on all three machines:

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 14f1fcf17152..e48f45ab2604 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3691,9 +3691,11 @@ int prepare_domain_attach_device(struct iommu_domain *domain,
 static int intel_iommu_attach_device(struct iommu_domain *domain,
 				     struct device *dev)
 {
+	struct device_domain_info *info = dev_iommu_priv_get(dev);
 	int ret;
 
-	device_block_translation(dev);
+	if (info->domain)
+		device_block_translation(dev);
 
 	ret = prepare_domain_attach_device(domain, dev);
 	if (ret)
Comment 10 Todd Brandt 2024-10-11 03:22:10 UTC
I just applied this fix patch to the latest upstream and it works! So at least we're partway there.
Comment 11 Lu Baolu 2024-10-12 03:16:08 UTC
A fix patch has been posted:

https://lore.kernel.org/linux-iommu/20241012030720.90218-1-baolu.lu@linux.intel.com/
Comment 12 Todd Brandt 2024-10-12 05:57:22 UTC
It's looking good so far: it fixes all three machines on 6.12-rc1 and 6.12-rc2. I'm going to do one final full lab test this Sunday on 6.12-rc2 and 6.12-rc3, and if there are no other issues, we're in the clear. I'll ping you Monday.
Comment 13 Marcin M 2024-10-12 15:42:35 UTC
Patch also fixes my bug.
Comment 14 Todd Brandt 2024-10-14 22:36:22 UTC
OK, I just tested this patch on 6.12.0-rc3. It fixes all three broken machines and appears to have had no impact on any of the other systems. Looks great! You can add my Tested-by. Thanks for the fix!
