Bug 212069

Summary: Upgrade 5.10 to 5.11 causes hang on boot of Dell Inspiron 3195
Product: Drivers
Reporter: Albert Ferrero (euix)
Component: IOMMU
Assignee: drivers_iommu
Status: NEW
Severity: high
CC: aricooperman, dwmw2, euix, hi-angel
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.11
Subsystem:
Regression: Yes
Bisected commit-id:

Description Albert Ferrero 2021-03-05 07:54:20 UTC
The laptop is a Dell Inspiron 3195 with an AMD A9-9420e RADEON R5, so it uses the integrated AMD graphics. The system boots with systemd-boot into an F2FS file system.

Upgraded linux package (5.10.16.arch1-1 -> 5.11.1.arch1-1). On reboot, immediately after systemd-boot, the system appears to hang. No log entries are recorded in journalctl. I tried adding "debug ignore_loglevel earlyprintk=efi,keep log_buf_len=16M" to the systemd-boot kernel parameters, but the systemd-boot loader goes blank and nothing is displayed on the screen. This happens every time. The power button is still responsive: a quick press powers the computer off, with no long press needed, which suggests that something in the system is still running.

The workaround so far is to downgrade to the 5.10 kernel.
Comment 1 Albert Ferrero 2021-03-05 07:55:27 UTC
I am also working with the Arch Linux distribution to troubleshoot the issue; see https://bugs.archlinux.org/task/69757?project=1&order=lastedit&sort=desc
Comment 2 Albert Ferrero 2021-03-05 07:55:46 UTC
The hang also occurs with the newer 5.11.2 kernel.
Comment 3 Albert Ferrero 2021-03-05 07:56:46 UTC
git bisect good
a27dca645d2c0f31abb7858aa0e10b2fa0f2f659 is the first bad commit
commit a27dca645d2c0f31abb7858aa0e10b2fa0f2f659
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Sat Oct 24 22:35:19 2020 +0100

x86/io_apic: Cleanup trigger/polarity helpers

'trigger' and 'polarity' are used throughout the I/O-APIC code for handling
the trigger type (edge/level) and the active low/high configuration. While
there are defines for initializing these variables and struct members, they
are not used consequently and the meaning of 'trigger' and 'polarity' is
opaque and confusing at best.

Rename them to 'is_level' and 'active_low' and make them boolean in various
structs so it's entirely clear what the meaning is.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-20-dwmw2@infradead.org

arch/x86/include/asm/hw_irq.h | 6 +-
arch/x86/kernel/apic/io_apic.c | 244 +++++++++++++++++-------------------
arch/x86/pci/intel_mid_pci.c | 8 +-
drivers/iommu/amd/iommu.c | 10 +-
drivers/iommu/intel/irq_remapping.c | 9 +-
5 files changed, 130 insertions(+), 147 deletions(-)

git bisect log
git bisect start
# bad: [f40ddce88593482919761f74910f42f4b84c004b] Linux 5.11
git bisect bad f40ddce88593482919761f74910f42f4b84c004b
# good: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
git bisect good 2c85ebc57b3e1817b6ce1a6b703928e113a90442
# bad: [538fcf57aaee6ad78a05f52b69a99baa22b33418] Merge branches 'acpi-scan', 'acpi-pnp' and 'acpi-sleep'
git bisect bad 538fcf57aaee6ad78a05f52b69a99baa22b33418
# bad: [15b447361794271f4d03c04d82276a841fe06328] mm/lru: revise the comments of lru_lock
git bisect bad 15b447361794271f4d03c04d82276a841fe06328
# good: [b10733527bfd864605c33ab2e9a886eec317ec39] Merge tag 'amd-drm-next-5.11-2020-12-09' of git://people.freedesktop.org/~agd5f/linux into drm-next
git bisect good b10733527bfd864605c33ab2e9a886eec317ec39
# good: [2c075f38a708c578a752b738a45e8c26923eac2e] Merge branch 'radeon-fixes' (Radeon and amdgpu fixes)
git bisect good 2c075f38a708c578a752b738a45e8c26923eac2e
# good: [76d4acf22b4847f6c7b2f9042366fbdc3d20f578] Merge tag 'perf-kprobes-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 76d4acf22b4847f6c7b2f9042366fbdc3d20f578
# good: [dfefd226b0bf7c435a58d75a0ce2f9273b9825f6] mm: cleanup kstrto*() usage
git bisect good dfefd226b0bf7c435a58d75a0ce2f9273b9825f6
# good: [eb0ea74120e0f14a6d6454109153d1b4ccf210fc] Merge tag 'x86-fpu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good eb0ea74120e0f14a6d6454109153d1b4ccf210fc
# bad: [51130d21881d435fad5fa7f25bea77aa0ffc9a4e] x86/ioapic: Handle Extended Destination ID field in RTE
git bisect bad 51130d21881d435fad5fa7f25bea77aa0ffc9a4e
# good: [e16c8058a10ba8e38d0d1ad0b64e444b245ffdbd] PCI: vmd: Use msi_msg shadow structs
git bisect good e16c8058a10ba8e38d0d1ad0b64e444b245ffdbd
# bad: [6452ea2a323b80868ce5e6d3030e4ccbeab9dc30] x86/apic: Add select() method on vector irqdomain
git bisect bad 6452ea2a323b80868ce5e6d3030e4ccbeab9dc30
# bad: [a27dca645d2c0f31abb7858aa0e10b2fa0f2f659] x86/io_apic: Cleanup trigger/polarity helpers
git bisect bad a27dca645d2c0f31abb7858aa0e10b2fa0f2f659
# good: [41bb2115beec5e318095a89f5ad4a9c343cb21ad] x86/pci/xen: Use msi_msg shadow structs
git bisect good 41bb2115beec5e318095a89f5ad4a9c343cb21ad
# good: [0c1883c1eb9dfa3c72af6e00425eeb1eb171a03e] x86/msi: Remove msidef.h
git bisect good 0c1883c1eb9dfa3c72af6e00425eeb1eb171a03e
# first bad commit: [a27dca645d2c0f31abb7858aa0e10b2fa0f2f659] x86/io_apic: Cleanup trigger/polarity helpers
Comment 4 Albert Ferrero 2021-03-09 03:19:42 UTC
I tested with the 5.11.4 kernel and observed the same behavior: the system hangs on boot.

I also observed that booting with either acpi=off or iommu=off allows the system to boot and reach a login prompt, but neither parameter is ideal. With iommu=off, the system's performance is degraded and the kernel panics on shutdown. With acpi=off, several components (such as wifi) do not function.
Comment 5 David Woodhouse 2021-03-16 09:49:50 UTC
That commit is known broken, fixed by:

commit aec8da04e4d71afdd4ab3025ea34a6517435f363
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Tue Nov 10 15:34:32 2020 +0100

    x86/ioapic: Correct the PCI/ISA trigger type selection
    
    PCI's default trigger type is level and ISA's is edge. The recent
    refactoring made it the other way round, which went unnoticed as it seems
    only to cause havoc on some AMD systems.
    
    Make the comment and code do the right thing again.
    
    Fixes: a27dca645d2c ("x86/io_apic: Cleanup trigger/polarity helpers")
    Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
    Reported-by: Borislav Petkov <bp@alien8.de>
    Reported-by: Qian Cai <cai@redhat.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
    Cc: David Woodhouse <dwmw@amazon.co.uk>
    Link: https://lore.kernel.org/r/87d00lgu13.fsf@nanos.tec.linutronix.de

If you're bisecting and have a27dca645d2c in your tree, apply the fix manually. Otherwise you're probably bisecting the *wrong* breakage.
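
One way to follow this advice is to apply the fix to the working tree only at each bisect step, test, and then discard it before marking the step. The following is a minimal sketch, not a definitive procedure: the helper names needs_fix and apply_fix_if_needed are hypothetical, the commit ids are the two quoted in this report, and it assumes the fix cherry-picks cleanly at every step (it may conflict on some commits, in which case manual resolution is needed).

```shell
#!/bin/sh
# Sketch: test a bisect step with a known fix temporarily applied.
# Commit ids as quoted in this report.
SUSPECT=a27dca645d2c0f31abb7858aa0e10b2fa0f2f659
FIX=aec8da04e4d71afdd4ab3025ea34a6517435f363

# Hypothetical helper: succeeds when HEAD contains commit $1
# (the known breakage) but does not yet contain commit $2 (its fix).
needs_fix() {
    git merge-base --is-ancestor "$1" HEAD &&
        ! git merge-base --is-ancestor "$2" HEAD
}

# Hypothetical helper: apply the fix to the index and working tree
# only, without creating a commit, so the bisect state is untouched.
apply_fix_if_needed() {
    if needs_fix "$1" "$2"; then
        git cherry-pick --no-commit "$2"
    fi
}

# Typical use at each bisect step (build and boot test are manual):
#   apply_fix_if_needed "$SUSPECT" "$FIX"
#   ... build, install, boot the kernel ...
#   git reset --hard      # drop the temporary fix
#   git bisect good       # or: git bisect bad
```

Running git reset --hard before git bisect good or git bisect bad matters here, because bisect needs a clean tree to check out the next candidate commit.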
Comment 6 Ari 2021-03-25 12:56:53 UTC
I have the same issue on Fedora 33, which recently upgraded from 5.10.23 to 5.11.[7|8]. Both of those new 5.11 kernels fail to boot on my Dell Precision 7510.
Comment 7 Albert Ferrero 2021-03-27 01:47:29 UTC
Good news, I just tried the 5.11.10 kernel and I'm able to boot the Dell Inspiron 3195 without issues.