Three of our systems have failed to boot since the following commit:

commit 2031c469f8161abe74189cb74f50da224f340b71 (HEAD, refs/bisect/bad)
Author: Lu Baolu <baolu.lu@linux.intel.com>
Date:   Mon Sep 2 10:27:16 2024 +0800

    iommu/vt-d: Add support for static identity domain

    Software determines VT-d hardware support for passthrough translation by
    inspecting the capability register. If passthrough translation is not
    supported, the device is instructed to use DMA domain for its default
    domain.

    Add a global static identity domain with guaranteed attach semantics for
    IOMMUs that support passthrough translation mode.

    The global static identity domain is a dummy domain without corresponding
    dmar_domain structure. Consequently, the device's info->domain will be
    NULL with the identity domain is attached. Refactor the code accordingly.

The machines are:

otcpl-samsung-galaxy-book-10:
    os-version              : Ubuntu 22.04.5 LTS
    baseboard-manufacturer  : SAMSUNG ELECTRONICS CO., LTD.
    baseboard-product-name  : SM-W620NZKAXAR
    baseboard-serial-number : 123490EN400015
    baseboard-version       : SGL8879A3F-C01-G001-S0001+10.0.16299
    bios-release-date       : 11/22/2017
    bios-vendor             : American Megatrends Inc.
    bios-version            : P03HAD.003.171122.WY.2239
    chassis-manufacturer    : SAMSUNG ELECTRONICS CO., LTD.
    chassis-serial-number   : None
    chassis-version         : N/A
    processor-manufacturer  : Intel(R) Corporation
    processor-version       : Intel(R) Core(TM) m3-7Y30 CPU @ 1.00GHz
    system-manufacturer     : SAMSUNG ELECTRONICS CO., LTD.
    system-product-name     : Galaxy Book 10.6
    system-serial-number    : 134RR52K500022
    system-version          : P03HAD
    cpucount                : 4
    memtotal                : 3767776 kB
    memfree                 : 1831016 kB

otcpl-dell-p5510-xeon-1:
    os-version              : Ubuntu 24.04.1 LTS
    baseboard-manufacturer  : Dell Inc.
    baseboard-product-name  : 0W7V82
    baseboard-serial-number : /2QQQN72/CN1296363I0006/
    baseboard-version       : A00
    bios-release-date       : 07/06/2022
    bios-vendor             : Dell Inc.
    bios-version            : 1.23.0
    chassis-manufacturer    : Dell Inc.
    chassis-serial-number   : 2QQQN72
    processor-manufacturer  : Intel(R) Corporation
    processor-version       : Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz
    system-manufacturer     : Dell Inc.
    system-product-name     : Precision 5510
    system-serial-number    : 2QQQN72
    cpucount                : 4
    memtotal                : 16197492 kB
    memfree                 : 13685292 kB

otcpl-dell-p3520:
    os-version              : Ubuntu 24.04.1 LTS
    baseboard-manufacturer  : Dell Inc.
    baseboard-product-name  : 0167B9
    baseboard-serial-number : /CGKMTC2/CN129636C5001F/
    baseboard-version       : A00
    bios-release-date       : 07/03/2023
    bios-vendor             : Dell Inc.
    bios-version            : 1.31.0
    chassis-manufacturer    : Dell Inc.
    chassis-serial-number   : CGKMTC2
    processor-manufacturer  : Intel(R) Corporation
    processor-version       : Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
    system-manufacturer     : Dell Inc.
    system-product-name     : Precision 3520
    system-serial-number    : CGKMTC2
    cpucount                : 8
    memtotal                : 16264232 kB
    memfree                 : 14623628 kB
Created attachment 306984 [details] dmesg-boot-otcpl-dell-p3520.txt boot from 6.11
Created attachment 306985 [details] dmesg-boot-otcpl-galaxy-book-10.txt boot from 6.11, error not shown
Created attachment 306986 [details] dmesg-boot-otcpl-dell-p5510-xeon-1.txt boot from 6.11, error not shown
Created attachment 306987 [details] dell-p3520-6.12-rc1-boot-fail.jpg Screenshot of the dmesg output at the Dell p3520 boot failure.
Created attachment 306988 [details] samsung-galaxy-6.12-rc1-boot-fail.jpg snapshot of the failure output on screen
Created attachment 306996 [details] config file for kernel under test
I built a kernel image with the same config file and tested it on a Skylake desktop that I have. It booted smoothly without any warnings or errors. I also tried the same kernel binary on other clients and servers that I can access, and I can't reproduce the problem. It appears that this boot failure is specific to certain platforms?
If commit <2031c469f816> ("iommu/vt-d: Add support for static identity domain") really matters, can you please try the change below? It's not a final solution, but it can help narrow down the root cause.

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 9f6b0780f2ef..a9b986cf9d85 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3628,9 +3628,11 @@ int prepare_domain_attach_device(struct iommu_domain *domain,
 static int intel_iommu_attach_device(struct iommu_domain *domain,
 				     struct device *dev)
 {
+	struct device_domain_info *info = dev_iommu_priv_get(dev);
 	int ret;
 
-	device_block_translation(dev);
+	if (info->domain)
+		device_block_translation(dev);
 
 	ret = prepare_domain_attach_device(domain, dev);
 	if (ret)
Heh, I literally just found the same thing by experimenting. I tested tweaking the regression commit and found that if I build it with these 2 lines added back in, the kernel boots again on all three machines:

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 14f1fcf17152..e48f45ab2604 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3691,9 +3691,11 @@ int prepare_domain_attach_device(struct iommu_domain *domain,
 static int intel_iommu_attach_device(struct iommu_domain *domain,
 				     struct device *dev)
 {
+	struct device_domain_info *info = dev_iommu_priv_get(dev);
 	int ret;
 
-	device_block_translation(dev);
+	if (info->domain)
+		device_block_translation(dev);
 
 	ret = prepare_domain_attach_device(domain, dev);
 	if (ret)
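For anyone skimming the thread, the control flow of the test change can be modeled outside the kernel. This is only a toy sketch: every name in it (toy_domain, toy_dev_info, toy_block_translation, toy_attach) is hypothetical and is not a kernel structure or function. It assumes nothing more than what the commit message states, namely that info->domain is NULL while the static identity domain is attached, and it makes no claim about why the unguarded call fails on these machines.

```c
#include <stddef.h>

/* Toy stand-ins for illustration only; NOT the kernel's real types. */
struct toy_domain {
	int id;
};

struct toy_dev_info {
	struct toy_domain *domain; /* NULL models "identity domain attached" */
	int block_calls;           /* how often the block step actually ran */
};

/* Hypothetical analogue of the block step: count the call, drop the domain. */
void toy_block_translation(struct toy_dev_info *info)
{
	info->block_calls++;
	info->domain = NULL;
}

/*
 * Hypothetical analogue of the guarded attach path from the test diff:
 * the block step is skipped when no private domain is currently attached.
 */
int toy_attach(struct toy_dev_info *info, struct toy_domain *new_domain)
{
	if (info->domain)
		toy_block_translation(info);

	info->domain = new_domain;
	return 0;
}
```

Starting from a device whose domain pointer is NULL, the first toy_attach() skips the block step entirely; a second toy_attach() on the same device runs it once, because a domain is attached by then. That is the only behavioral difference the guard introduces in this model.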
I just applied this fix patch to the latest upstream and it works! So at least we're partway there.
A fix patch has been posted: https://lore.kernel.org/linux-iommu/20241012030720.90218-1-baolu.lu@linux.intel.com/
It's looking good so far: it fixes all three machines in 6.12-rc1 and 6.12-rc2. I'm going to do one final full lab test this Sunday on 6.12-rc2 and 6.12-rc3, and if there are no other issues we're in the clear. I'll ping you Monday.
Patch also fixes my bug.
OK, I just tested this patch on 6.12.0-rc3. It fixes all three broken machines and appears to have had no impact on any of the other systems. Looks great! You can add my Tested-by. Thanks for the fix!