Bug 219363
Summary: [BISECTED] iommu/vt-d patch causes hang on boot on 3 machines (2 Dells and a Samsung)
Product: Drivers
Component: IOMMU
Reporter: Todd Brandt (todd.e.brandt)
Assignee: drivers_iommu
Status: NEW
Severity: high
Priority: P3
CC: baolu.lu, bugzilla.kernel.org
Hardware: All
OS: Linux
Kernel Version: 6.12-rc1
Subsystem: (none)
Regression: Yes
Bisected commit-id: 2031c469f8161abe74189cb74f50da224f340b71
Bug Depends on: (none)
Bug Blocks: 178231
Attachments:
dmesg-boot-otcpl-dell-p3520.txt
dmesg-boot-otcpl-galaxy-book-10.txt
dmesg-boot-otcpl-dell-p5510-xeon-1.txt
dell-p3520-6.12-rc1-boot-fail.jpg
samsung-galaxy-6.12-rc1-boot-fail.jpg
config file for kernel under test
Description
Todd Brandt
2024-10-08 20:11:34 UTC
Created attachment 306984 [details]
dmesg-boot-otcpl-dell-p3520.txt
boot from 6.11
Created attachment 306985 [details]
dmesg-boot-otcpl-galaxy-book-10.txt
boot from 6.11, error not shown
Created attachment 306986 [details]
dmesg-boot-otcpl-dell-p5510-xeon-1.txt
boot from 6.11, error not shown
Created attachment 306987 [details]
dell-p3520-6.12-rc1-boot-fail.jpg
Screenshot of the dmesg output at the Dell p3520 boot failure.
Created attachment 306988 [details]
samsung-galaxy-6.12-rc1-boot-fail.jpg
snapshot of the failure output on screen
Created attachment 306996 [details]
config file for kernel under test
I built a kernel image with the same config file and tested it on a Skylake desktop that I have. It booted smoothly without any warnings or errors. I also tried the same kernel binary on other clients and servers that I can access, and I can't reproduce the problem. It appears that this boot failure is related to specific platforms?

If commit <2031c469f816> ("iommu/vt-d: Add support for static identity domain") really matters, can you please try the change below? It's not a final solution, but it can help narrow down the root cause.

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 9f6b0780f2ef..a9b986cf9d85 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3628,9 +3628,11 @@ int prepare_domain_attach_device(struct iommu_domain *domain,
 static int intel_iommu_attach_device(struct iommu_domain *domain,
 				     struct device *dev)
 {
+	struct device_domain_info *info = dev_iommu_priv_get(dev);
 	int ret;
 
-	device_block_translation(dev);
+	if (info->domain)
+		device_block_translation(dev);
 
 	ret = prepare_domain_attach_device(domain, dev);
 	if (ret)

Heh, I literally just found the same thing by experimenting. I tried tweaking the regression commit and found that if I build it with these 2 lines added back in, the kernel boots again on all three machines:

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 14f1fcf17152..e48f45ab2604 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3691,9 +3691,11 @@ int prepare_domain_attach_device(struct iommu_domain *domain,
 static int intel_iommu_attach_device(struct iommu_domain *domain,
 				     struct device *dev)
 {
+	struct device_domain_info *info = dev_iommu_priv_get(dev);
 	int ret;
 
-	device_block_translation(dev);
+	if (info->domain)
+		device_block_translation(dev);
 
 	ret = prepare_domain_attach_device(domain, dev);
 	if (ret)

I just applied this fix patch to the latest upstream and it works! So at least we're partway there.

A fix patch has been posted:
https://lore.kernel.org/linux-iommu/20241012030720.90218-1-baolu.lu@linux.intel.com/

It's looking good so far: it fixes all three machines in 6.12-rc1 and 6.12-rc2. I'm going to do one final full lab test this Sunday on 6.12-rc2 and 6.12-rc3, and if there are no other issues we're in the clear. I'll ping you Monday.

Patch also fixes my bug.

OK, I just tested this patch on 6.12.0-rc3 and it fixes all three broken machines and appears to have had no impact on any of the other systems. Looks great! You can add my Tested-by. Thanks for the fix!
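
For readers who want to see the control flow of the test change in isolation, here is a minimal user-space sketch. It is not kernel code: `toy_domain`, `toy_dev_info`, `toy_block_translation` and `toy_attach_device` are hypothetical stand-ins for the types and functions named in the diff, and the model only assumes what the diff itself shows, namely that the block step is skipped when no domain is attached yet. It says nothing about why the unconditional call hangs on the affected machines.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct iommu_domain. */
struct toy_domain {
    int id;
};

/* Hypothetical stand-in for struct device_domain_info. */
struct toy_dev_info {
    struct toy_domain *domain; /* NULL until a domain is attached */
    int block_calls;           /* how often the block step ran */
};

/* Stand-in for device_block_translation(): detach and count the call. */
static void toy_block_translation(struct toy_dev_info *info)
{
    info->block_calls++;
    info->domain = NULL;
}

/* Mirrors the patched flow: block only if a domain is already attached. */
static int toy_attach_device(struct toy_domain *domain,
                             struct toy_dev_info *info)
{
    if (info->domain)
        toy_block_translation(info);

    info->domain = domain;
    return 0;
}
```

In this toy model the first attach of a device never runs the block step, while re-attaching to a different domain runs it exactly once before the new domain is installed.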