Bug 212017

Summary: System doesn't boot with amd_iommu=off
Product: IO/Storage Reporter: Mthw (matejm98mthw)
Component: Other Assignee: io_other
Status: RESOLVED CODE_FIX    
Severity: high CC: basic89, dwmw2, jan.steffens, jordicoma22, mario.limonciello, matejm98mthw, mingo, steven, tglx
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.11 Subsystem:
Regression: Yes Bisected commit-id:

Description Mthw 2021-03-02 08:49:45 UTC
Description (from the Arch Linux bug tracker: https://bugs.archlinux.org/task/69810)

"I have a laptop with an AMD 3550H CPU and since Kernel 5.11.x it doesn't boot at all with 'amd_iommu=off' kernel parameter.
To give you more info, when I select the boot entry in systemd-boot nothing happens.
There are no error messages or anything just a blank screen (and external screen is not detected/it doesn't detect any signal from the laptop) and I need to shut down with power button.
Linux 5.10 kernels and older work correctly."
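
When comparing a working and a non-working boot entry, it can help to confirm which entry actually carries the parameter. A minimal sketch, assuming a POSIX shell; the helper name `has_kernel_param` is made up for illustration, while /proc/cmdline is the standard place to read the running kernel's command line:

```shell
# Hypothetical helper: succeed if the kernel command line contains the exact
# parameter as a whole word. Usage: has_kernel_param PARAM [CMDLINE]
# With no CMDLINE argument, the running kernel's /proc/cmdline is read.
has_kernel_param() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    # Pad with spaces so that only whole, space-separated words match.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On a system that did boot, `has_kernel_param amd_iommu=off && echo "IOMMU disabled on cmdline"` checks the running kernel; passing a second argument checks a boot entry's options string instead.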

We did a bisect and got the following results:

"
git bisect log
git bisect start
# bad: [f40ddce88593482919761f74910f42f4b84c004b] Linux 5.11
git bisect bad f40ddce88593482919761f74910f42f4b84c004b
# good: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
git bisect good 2c85ebc57b3e1817b6ce1a6b703928e113a90442
# bad: [538fcf57aaee6ad78a05f52b69a99baa22b33418] Merge branches 'acpi-scan', 'acpi-pnp' and 'acpi-sleep'
git bisect bad 538fcf57aaee6ad78a05f52b69a99baa22b33418
# bad: [d635a69dd4981cc51f90293f5f64268620ed1565] Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect bad d635a69dd4981cc51f90293f5f64268620ed1565
# good: [a1dd1d86973182458da7798a95f26cfcbea599b4] Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
git bisect good a1dd1d86973182458da7798a95f26cfcbea599b4
# good: [e5795aacd71b697c739f2d193b0e275993d93187] Merge tag 'wireless-drivers-next-2020-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
git bisect good e5795aacd71b697c739f2d193b0e275993d93187
# good: [dfefd226b0bf7c435a58d75a0ce2f9273b9825f6] mm: cleanup kstrto*() usage
git bisect good dfefd226b0bf7c435a58d75a0ce2f9273b9825f6
# good: [eb0ea74120e0f14a6d6454109153d1b4ccf210fc] Merge tag 'x86-fpu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good eb0ea74120e0f14a6d6454109153d1b4ccf210fc
# good: [22f07b86d4e580424cbeb0ce232ed30d4b5ecb95] Merge branch 'bnxt_en-improve-firmware-flashing'
git bisect good 22f07b86d4e580424cbeb0ce232ed30d4b5ecb95
# bad: [26ab12bb9d96133b7880141d68b5e01a8783de9d] iommu/hyper-v: Remove I/O-APIC ID check from hyperv_irq_remapping_select()
git bisect bad 26ab12bb9d96133b7880141d68b5e01a8783de9d
# good: [341b4a7211b6ba3a7089e1dc09ac4bd576dfb05f] x86/ioapic: Cleanup IO/APIC route entry structs
git bisect good 341b4a7211b6ba3a7089e1dc09ac4bd576dfb05f
# good: [79eb3581bcaae9b5677629d945e14da212aa76e2] iommu/vt-d: Simplify intel_irq_remapping_select()
git bisect good 79eb3581bcaae9b5677629d945e14da212aa76e2
# good: [d981059e13ffa9ed03a73472e932d070323bd057] x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it
git bisect good d981059e13ffa9ed03a73472e932d070323bd057
# bad: [2fb6acf3edfeb904505f9ba3fd01166866062591] iommu/amd: Fix union of bitfields in intcapxt support
git bisect bad 2fb6acf3edfeb904505f9ba3fd01166866062591
# bad: [aec8da04e4d71afdd4ab3025ea34a6517435f363] x86/ioapic: Correct the PCI/ISA trigger type selection
git bisect bad aec8da04e4d71afdd4ab3025ea34a6517435f363
# bad: [f36a74b9345aebaf5d325380df87a54720229d18] x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index
git bisect bad f36a74b9345aebaf5d325380df87a54720229d18
# first bad commit: [f36a74b9345aebaf5d325380df87a54720229d18] x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index
"
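
For anyone who wants to re-check this result: the lines of the log above from `git bisect start` onward can be saved to a file and fed to `git bisect replay` in a kernel tree. As a small sketch, the helper below (the name `first_bad_commit` and the file name `bisect.log` are assumptions for illustration) pulls the verdict out of such a log read on stdin:

```shell
# Hypothetical helper: print the 40-character hash named on the
# "# first bad commit:" line of a `git bisect log` read from stdin.
# Prints nothing if the log contains no verdict line.
first_bad_commit() {
    sed -n 's/^# first bad commit: \[\([0-9a-f]\{40\}\)].*/\1/p'
}
```

With the log saved as `bisect.log`, `first_bad_commit < bisect.log` prints f36a74b9345aebaf5d325380df87a54720229d18, which can then be inspected with `git show`.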

This points to the following commit:

"
git bisect bad
f36a74b9345aebaf5d325380df87a54720229d18 is the first bad commit
commit f36a74b9345aebaf5d325380df87a54720229d18
Author: David Woodhouse <dwmw@amazon.co.uk>
Date: Tue Nov 3 16:36:22 2020 +0000

x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index

In commit b643128b917 ("x86/ioapic: Use irq_find_matching_fwspec() to
find remapping irqdomain") the I/O-APIC code was changed to find its
parent irqdomain using irq_find_matching_fwspec(), but the key used
for the lookup was wrong. It shouldn't use 'ioapic' which is the index
into its own ioapics[] array. It should use the actual arbitration
ID of the I/O-APIC in question, which is mpc_ioapic_id(ioapic).

Fixes: b643128b917 ("x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain")
Reported-by: lkp <oliver.sang@intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/57adf2c305cd0c5e9d860b2f3007a7e676fd0f9f.camel@infradead.org

arch/x86/kernel/apic/io_apic.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
"

So it seems that something is wrong with this commit. Could someone look at it? If this report belongs under a different Product/Component, I apologize; this is the first time I am reporting a bug here.
Comment 1 Mthw 2021-03-04 13:38:46 UTC
Reverting this commit fixes the issue and there haven't been any side-effects yet.
Comment 2 Steven Barrett 2021-03-08 00:12:20 UTC
I tried reverting f36a74b9 in Zen Kernel and it made my own system not bootable.

System configuration affected:
Machine:   Type: Desktop System: ASUS product: All Series v: N/A serial: <superuser required> 
           Mobo: ASUSTeK model: X99-DELUXE II v: Rev 1.xx serial: <superuser required> 
           UEFI [Legacy]: American Megatrends v: 2101 date: 07/10/2019 
CPU:       Info: 8-Core Intel Core i7-5960X [MT MCP] speed: 3787 MHz min/max: 1200/3001 MHz

Maybe this commit's change is mutually exclusive between AMD and Intel systems?
Comment 3 Mthw 2021-03-08 16:45:09 UTC
(In reply to Steven Barrett from comment #2)
> 
> Maybe this commit's change is mutually exclusive between AMD and Intel
> systems?

Might be, I don't have any other hardware to test it with though.
Comment 4 Mav 2021-03-09 16:17:49 UTC
(In reply to matejm98mthw from comment #3)
> Might be, I don't have any other hardware to test it with though.

I can confirm that a ThinkPad T495s (AMD Zen+) will not boot with "amd_iommu=off" on 5.11.1 and 5.11.2 as shipped with Manjaro Linux (and it will sometimes fail to resume from suspend if I do not pass this parameter).
Comment 5 Mthw 2021-03-09 16:31:26 UTC
(In reply to Mav from comment #4)
> (In reply to matejm98mthw from comment #3)
> > Might be, I don't have any other hardware to test it with though.
> 
> I can confirm that a ThinkPad T495s (AMD Zen+) will not boot with
> "amd_iommu=off" on 5.11.1 and 5.11.2 as shipped with Manjaro Linux (and it
> will sometimes fail to resume from suspend if I do not pass this parameter).

Have you tried 5.11.3 or .4? The issue with waking up from suspend seems to be fixed now.
Comment 6 jordicoma 2021-03-14 11:26:06 UTC
I have tried the latest Arch kernel, 5.11.6, and it still doesn't boot.
I'm one of the reporters from the Arch thread https://bugs.archlinux.org/task/69810 and a kernel compiled with this commit reverted worked.

Has the change been pushed to the main branch?

I have a Ryzen 1600X (Zen 1), and I can confirm that it boots once "amd_iommu=off" is removed.
Comment 7 Mthw 2021-03-14 11:54:56 UTC
(In reply to jordicoma from comment #6)
> I have tried the latest Arch kernel, 5.11.6, and it still doesn't boot.
> I'm one of the reporters from the Arch thread
> https://bugs.archlinux.org/task/69810 and a kernel compiled with this
> commit reverted worked.
> 
> Has the change been pushed to the main branch?
> 
> I have a Ryzen 1600X (Zen 1), and I can confirm that it boots once
> "amd_iommu=off" is removed.

Yes, it's still broken. Someone has to make the change, either by reverting the commit or by finding a better way to fix the original issue.
Comment 8 David Woodhouse 2021-03-15 11:46:43 UTC
Please test https://lore.kernel.org/r/20210315111502.440451-1-dwmw2@infradead.org
Comment 9 Mthw 2021-03-15 19:42:11 UTC
I just built it on 5.12-rc3 and it fixes the issue.
Comment 10 Jan Steffens 2021-04-10 20:44:11 UTC
(In reply to David Woodhouse from comment #8)
> Please test
> https://lore.kernel.org/r/20210315111502.440451-1-dwmw2@infradead.org

Is this patch still needed after 36013e9ffc0a17eee8d3e4d92aea0dc37687760d (9f81ca8d1fd68f5697c201f26632ed622e9e462f upstream, bug 212133)?
Comment 11 Mthw 2021-04-11 07:28:09 UTC
I am currently on 5.11.11 and I haven't had this issue for a while now.