Bug 219247

Summary: Segfault in Qt apps running on Linux kernel 6.10.8 ARM with LPAE
Product: Memory Management
Reporter: Andrew (quark)
Component: Other
Assignee: drivers_video-other
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: michal.pecio, triad
Priority: P3
Hardware: ARM
OS: Linux
Kernel Version: 6.10
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg
gdb Kate on 6.10.1 and 6.9.12
More detailed dmesg.
dmesg
config
Script used for build

Description Andrew 2024-09-07 11:11:58 UTC
Created attachment 306828 [details]
dmesg

Trying to run LXQt on a Chromebook XE303C12 with Devuan 4 and Linux kernel 6.10.8 results in a segmentation fault (in LXQt). There are no such problems with Linux kernel 6.9.12 or earlier. With Linux kernel 6.10.8 it is possible to run Xfce4, but running Kate, for example, ends in a segmentation fault. Mesa 20.3.5, patched for partial hardware acceleration, keeps this acceleration working in Xfce4. mpv works with acceleration regardless of the Linux kernel version. dmesg does not show anything significantly new compared to the previous kernel version.
Comment 1 Artem S. Tashkinov 2024-09-07 14:09:59 UTC
One way to track it down is to bisect: https://docs.kernel.org/admin-guide/bug-bisect.html

Sadly, I've not seen anything similar recently, so you could be the only one with the issue and the only one who can unwind it.
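
For readers unfamiliar with the idea, bisection is essentially a binary search over the commits between a known-good and a known-bad kernel. The following is only a minimal sketch in Python, assuming a linear history and a hypothetical `is_bad` test callback; real `git bisect` also handles merges and skipped commits, which this sketch does not.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in a linear history.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return commits[bad]


# Hypothetical history: the regression was introduced in "c5".
history = [f"c{i}" for i in range(10)]
culprit = first_bad(history, lambda c: int(c[1:]) >= 5)
print(culprit)
```

Each call to `is_bad` corresponds to building and booting one kernel, which is why halving the range at every step matters.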
Comment 2 Andrew 2024-09-08 06:01:21 UTC
Created attachment 306829 [details]
gdb Kate on 6.10.1 and 6.9.12
Comment 3 Andrew 2024-09-08 06:03:12 UTC
Well, I compiled Linux kernel 6.10.1 and got the same problem. So the error appeared between 6.9.12 and 6.10.1. Since nothing similar has been reported recently, I assume it is related to the hardware of my system. According to ChangeLog-6.10, there were some changes in the panfrost video driver, which is used on my system.

I tried running Kate under gdb on 6.10.1 and 6.9.12; the report is attached.
Comment 4 Andrew 2024-09-09 06:23:41 UTC
Created attachment 306833 [details]
More detailed dmesg.

Errors in mwifiex also occurred with other versions of the Linux kernel; it was run several times and then worked.
It seems my assumption about panfrost was wrong; at least, it was not mentioned in the errors.
Comment 5 Artem S. Tashkinov 2024-09-09 11:03:33 UTC
You really could bisect, because there are literally only about a dozen people in the world with the same hardware configuration.
Comment 6 Andrew 2024-09-13 13:54:31 UTC
git bisect start 0129910096573d08ecb139b20e2940682f248186 bb67b270b37e8bd9c96829d58ffe758635651e90
Bisecting: a merge base must be tested
[a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6] Linux 6.9

git bisect good a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
Bisecting: 7202 revisions left to test after this (roughly 13 steps)
[33e02dc69afbd8f1b85a51d74d72f139ba4ca623] Merge tag 'sound-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

git bisect good 33e02dc69afbd8f1b85a51d74d72f139ba4ca623
Bisecting: 3570 revisions left to test after this (roughly 12 steps)
[29c73fc794c83505066ee6db893b2a83ac5fac63] Merge tag 'perf-tools-for-v6.10-1-2024-05-21' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

git bisect bad 29c73fc794c83505066ee6db893b2a83ac5fac63
Bisecting: 2006 revisions left to test after this (roughly 11 steps)
[0450d2083be6bdcd18c9535ac50c55266499b2df] Merge tag '6.10-rc-smb-fix' of git://git.samba.org/sfrench/cifs-2.6

git bisect bad 0450d2083be6bdcd18c9535ac50c55266499b2df
Bisecting: 839 revisions left to test after this (roughly 10 steps)
[b426433c03a6eb547515edbe74ebb3a90b9979dd] Merge tag 'mtd/for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux

git bisect good b426433c03a6eb547515edbe74ebb3a90b9979dd
Bisecting: 425 revisions left to test after this (roughly 9 steps)
[f0cd69b8cca6a5096463644d6dacc9f991bfa521] Merge tag 'random-6.10-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random

git bisect bad f0cd69b8cca6a5096463644d6dacc9f991bfa521
Bisecting: 246 revisions left to test after this (roughly 8 steps)
[4853f1f6ace32c68a04287353e428c4cfc3fa8ed] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux

git bisect bad 4853f1f6ace32c68a04287353e428c4cfc3fa8ed
Bisecting: 83 revisions left to test after this (roughly 6 steps)
[31456ffa7b73f2b46b07b9b863eed332d05f5c23] platform/x86: thinkpad_acpi: Use correct keycodes for volume and brightness keys

git bisect good 31456ffa7b73f2b46b07b9b863eed332d05f5c23
Bisecting: 41 revisions left to test after this (roughly 5 steps)
[484bae9e4d6acb5eec39e1ea47f9aa43f11b154d] platform/x86: Add new Dell UART backlight driver

git bisect good 484bae9e4d6acb5eec39e1ea47f9aa43f11b154d
Bisecting: 22 revisions left to test after this (roughly 4 steps)
[aff00427579d4c915ee92553f712e4c632185e6e] ARM: 9379/1: coresight: tpda: drop owner assignment

git bisect good aff00427579d4c915ee92553f712e4c632185e6e
Bisecting: 12 revisions left to test after this (roughly 4 steps)
[7b749aad1faa5bcb23b45b7126f677ab17324c40] ARM: 9393/1: mm: Use conditionals for CFI branches

7b749aad1faa5bcb23b45b7126f677ab17324c40 build error:
...
drivers/uio/uio.c: In function ‘uio_mmap_dma_coherent’:
drivers/uio/uio.c:795:16: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  795 |         addr = (void *)mem->addr;
      |
...
arm-linux-gnueabihf-ld: kernel/trace/fgraph.o: in function `unregister_ftrace_graph':
fgraph.c:(.text+0xa60): undefined reference to `ftrace_stub_graph'
arm-linux-gnueabihf-ld: fgraph.c:(.text+0xa64): undefined reference to `ftrace_stub_graph'
arm-linux-gnueabihf-ld: kernel/trace/fgraph.o: in function `.LANCHOR0':
fgraph.c:(.data+0x1c): undefined reference to `ftrace_stub_graph'
...
make[2]: *** [scripts/Makefile.vmlinux:37: vmlinux] Error 1
make[1]: *** [/aaa/ttt/kernel/Makefile:1160: vmlinux] Error 2
make[1]: *** Waiting for unfinished jobs....
...
make: *** [Makefile:240: __sub-make] Error 2

git bisect skip 7b749aad1faa5bcb23b45b7126f677ab17324c40
Bisecting: 12 revisions left to test after this (roughly 4 steps)
[393999fa96273bab8d6efb2f4724030916afd61b] ARM: 9389/2: mm: Define prototypes for all per-processor calls

git bisect good 393999fa96273bab8d6efb2f4724030916afd61b
Bisecting: 10 revisions left to test after this (roughly 3 steps)
[eebadafc3b14d9426fa9cc3ab0da0e48367c7114] ARM: 9398/1: Fix userspace enter on LPAE with CC_OPTIMIZE_FOR_SIZE=y

git bisect bad eebadafc3b14d9426fa9cc3ab0da0e48367c7114
Bisecting: 2 revisions left to test after this (roughly 2 steps)
[de7f60f0b03175ff056f18996d7e2577bc4baa65] ARM: 9357/2: Reduce the number of #ifdef CONFIG_CPU_SW_DOMAIN_PAN

git bisect good de7f60f0b03175ff056f18996d7e2577bc4baa65
Bisecting: 1 revision left to test after this (roughly 1 step)
[7af5b901e84743c608aae90cb0e429702812c324] ARM: 9358/2: Implement PAN for LPAE by TTBR0 page table walks disablement

git bisect bad 7af5b901e84743c608aae90cb0e429702812c324
7af5b901e84743c608aae90cb0e429702812c324 is the first bad commit
commit 7af5b901e84743c608aae90cb0e429702812c324
Author: Linus Walleij <linus.walleij@linaro.org>
Date:   Mon Mar 25 08:31:13 2024 +0100

    ARM: 9358/2: Implement PAN for LPAE by TTBR0 page table walks disablement

    With LPAE enabled, privileged no-access cannot be enforced using CPU
    domains as such feature is not available. This patch implements PAN
    by disabling TTBR0 page table walks while in kernel mode.

    The ARM architecture allows page table walks to be split between TTBR0
    and TTBR1. With LPAE enabled, the split is defined by a combination of
    TTBCR T0SZ and T1SZ bits. Currently, an LPAE-enabled kernel uses TTBR0
    for user addresses and TTBR1 for kernel addresses with the VMSPLIT_2G
    and VMSPLIT_3G configurations. The main advantage for the 3:1 split is
    that TTBR1 is reduced to 2 levels, so potentially faster TLB refill
    (though usually the first level entries are already cached in the TLB).

    The PAN support on LPAE-enabled kernels uses TTBR0 when running in user
    space or in kernel space during user access routines (TTBCR T0SZ and
    T1SZ are both 0). When running user accesses are disabled in kernel
    mode, TTBR0 page table walks are disabled by setting TTBCR.EPD0. TTBR1
    is used for kernel accesses (including loadable modules; anything
    covered by swapper_pg_dir) by reducing the TTBCR.T0SZ to the minimum
    (2^(32-7) = 32MB). To avoid user accesses potentially hitting stale TLB
    entries, the ASID is switched to 0 (reserved) by setting TTBCR.A1 and
    using the ASID value in TTBR1. The difference from a non-PAN kernel is
    that with the 3:1 memory split, TTBR1 always uses 3 levels of page
    tables.

    As part of the change we are using preprocessor elif definied() clauses
    so balance these clauses by converting relevant precedingt ifdef
    clauses to if defined() clauses.

    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
    Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
    Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

 arch/arm/Kconfig                            | 22 +++++++++++++--
 arch/arm/include/asm/assembler.h            |  1 +
 arch/arm/include/asm/pgtable-3level-hwdef.h |  9 ++++++
 arch/arm/include/asm/ptrace.h               |  1 +
 arch/arm/include/asm/uaccess-asm.h          | 44 ++++++++++++++++++++++++++++-
 arch/arm/include/asm/uaccess.h              | 26 ++++++++++++++++-
 arch/arm/kernel/asm-offsets.c               |  1 +
 arch/arm/kernel/suspend.c                   |  8 ++++++
 arch/arm/lib/csumpartialcopyuser.S          | 20 ++++++++++++-
 arch/arm/mm/fault.c                         | 29 +++++++++++++++++++
 10 files changed, 155 insertions(+), 6 deletions(-)

----
This does not affect mwifiex/mwifiex_sdio errors in dmesg, only segfault errors.
LPAE is enabled in my kernel config.
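
As a side note on the commit message quoted above, the "2^(32-7) = 32MB" figure follows from the relation between TTBCR.T0SZ and the size of the address range translated through TTBR0. A small Python sketch of that arithmetic, where the function name is purely illustrative and not taken from the kernel:

```python
def ttbr0_region_bytes(t0sz: int) -> int:
    """Size of the address range covered by TTBR0 for a given TTBCR.T0SZ.

    Uses the 2^(32 - T0SZ) relation quoted in the commit message.
    """
    if not 0 <= t0sz <= 7:
        raise ValueError("T0SZ is a 3-bit field (0..7)")
    return 1 << (32 - t0sz)


# T0SZ = 7 gives the smallest TTBR0 range, T0SZ = 0 the full 32-bit space.
print(ttbr0_region_bytes(7) // (1 << 20), "MiB")
print(ttbr0_region_bytes(0) // (1 << 30), "GiB")
```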
Comment 7 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-09-24 12:02:00 UTC
(In reply to Andrew from comment #6)
> git bisect [...]

Thx; can I CC you when forwarding this bug by mail? Note, this will expose your email address to the public!
Comment 8 Andrew 2024-09-25 07:40:30 UTC
Yes, you can.
Comment 9 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-09-25 12:26:57 UTC
Forward: https://lore.kernel.org/regressions/bf8288c9-13a1-47f0-9842-3b8eff37ef65@leemhuis.info/
Comment 10 Andrew 2024-09-28 05:12:09 UTC
Created attachment 306924 [details]
dmesg

Tried version 6.10.1, commit 0129910096573d08ecb139b20e2940682f248186

Changing the config to
# CONFIG_CPU_TTBR0_PAN is not set
was reverted automatically during the build.

The segfault does not occur in Qt apps with
# CONFIG_ARM_PAN is not set

The patch did not help.

Maybe I should have checked on another commit?
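
When comparing builds like this, it can help to check what actually ended up in the generated `.config`, since options may be switched back during the build as described above. The sketch below is one way to do that in Python; the sample text is a hypothetical excerpt, not copied from the attached config, and `config_state` is an illustrative helper name.

```python
import re


def config_state(config_text: str, option: str) -> str:
    """Return the option's value ("y", "m", ...), "n" if the file says it is
    explicitly not set, or "absent" if the option is not mentioned at all."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return "n"
        match = re.fullmatch(rf"{re.escape(option)}=(.*)", line)
        if match:
            return match.group(1)
    return "absent"


# Hypothetical excerpt for illustration only.
sample = """\
CONFIG_ARM_LPAE=y
# CONFIG_ARM_PAN is not set
"""

for opt in ("CONFIG_ARM_LPAE", "CONFIG_ARM_PAN", "CONFIG_CPU_TTBR0_PAN"):
    print(opt, config_state(sample, opt))
```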
Comment 11 Andrew 2024-09-28 05:12:57 UTC
Created attachment 306925 [details]
config
Comment 12 Linus Walleij 2024-11-01 14:13:29 UTC
I have obtained a Chromebook XE303C12 to see if I can replicate this problem.

What I do not understand is how you installed Devuan on the machine since it has only AMD64 and x86_64 images available on the download pages. Any hints on how to replicate your set-up?

In the meantime I have discovered that Debian (upstream to Devuan I guess) is actually disabling PAN. I don't know why they do this but I have noted that doing so increases syscall performance. Maybe there are other reasons why they do this as well?
Comment 13 Andrew 2024-11-02 09:35:40 UTC
Created attachment 307119 [details]
Script used for build

Devuan is a fork of Debian with the same architectures in the repository. The process of creating the system image is the same for any Debian-based system and was borrowed from the Kali Linux ARM build script, with minor changes to the repository links. The kernel was cross-compiled in a chrooted AMD64 Kali Linux minimal environment (this was the very first option and is still used; to install the build dependencies, lines 289-307 should be uncommented), producing a deb package of the kernel. I will attach the script that was used for the build.

The Chromebook was switched to developer mode.

You can also test with a Debian system; with a custom kernel there will probably be no difference except for the disk partition scheme and boot config. The only reason Devuan was named is that it was already installed.

Ready-made system images can be taken from https://gitlab.com/quarkscript/linarm#recoveryinstalltest-disk-images

Last compiled kernel packages with patch and without PAN:
https://drive.usercontent.google.com/download?id=1ad3jASrJmvncOsWUGn52Dpo0Hr9vY0UB&export=download
https://drive.usercontent.google.com/download?id=1k5zkXGKBq7J2zoND1WZM6I-wxqLj8zqS&export=download
Comment 14 Andrew 2024-11-03 14:56:39 UTC
Just checked the kernel on Debian 12 and got the same issue.

[   95.037173] lxqt-session: unhandled page fault (11) at 0x0048a000, code 0x207
[   95.037182] [0048a000] *pgd=48499003, *pmd=bdaf4003
[   95.037198] CPU: 0 PID: 414 Comm: lxqt-session Not tainted 6.10.1-dirty #1
[   95.037207] Hardware name: Samsung Exynos (Flattened Device Tree)
[   95.037214] PC is at 0xb51c0cf4
[   95.037224] LR is at 0xb5394215
[   95.037231] pc : [<b51c0cf4>]    lr : [<b5394215>]    psr: 200f0030
[   95.037237] sp : be8eb850  ip : 004885bc  fp : 01422c18
[   95.037244] r10: 004885bc  r9 : 00000000  r8 : be8f38c0
[   95.037250] r7 : 00000002  r6 : be8eb8b4  r5 : ffb3b4bf  r4 : 00489ffa
[   95.037256] r3 : be8eb8a0  r2 : be8ed2b0  r1 : 002380ae  r0 : 00989681
[   95.037263] Flags: nzCv  IRQs on  FIQs on  Mode USER_32  ISA Thumb  Segment user
[   95.037271] Control: 30c5387d  Table: 4784d480  DAC: fffffffd

Used system https://drive.usercontent.google.com/download?id=1q9i1FITox8Q-cmYsz-UCqsgtMyB_DYp-&export=download&authuser=0
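
The fault code in the first dmesg line above can be picked apart mechanically. The Python sketch below is a hedged illustration, assuming the usual ARM fault status layout for LPAE kernels (bit 9 marks the long-descriptor format, bit 11 marks a write, and the low six bits hold the status with the fault level in the bottom two bits); the helper name and the returned dictionary keys are my own.

```python
import re

FSR_LPAE = 1 << 9    # assumed: fault status uses the long-descriptor format
FSR_WRITE = 1 << 11  # assumed: faulting access was a write

# Upper four bits of the 6-bit status field in the long-descriptor format.
FAULT_KINDS = {0b0001: "translation", 0b0010: "access flag", 0b0011: "permission"}


def decode_fault(line: str) -> dict:
    """Parse an ARM 'unhandled page fault' dmesg line and decode its code."""
    m = re.search(r"page fault \((\d+)\) at (0x[0-9a-f]+), code (0x[0-9a-f]+)", line)
    if m is None:
        raise ValueError("not an 'unhandled page fault' line")
    code = int(m.group(3), 16)
    status = code & 0x3F
    lpae = bool(code & FSR_LPAE)
    return {
        "signal": int(m.group(1)),
        "address": int(m.group(2), 16),
        "lpae_format": lpae,
        "write": bool(code & FSR_WRITE),
        # The kind/level split below only applies to the LPAE format.
        "kind": FAULT_KINDS.get(status >> 2, "other") if lpae else None,
        "level": (status & 0x3) if lpae else None,
    }


line = "lxqt-session: unhandled page fault (11) at 0x0048a000, code 0x207"
print(decode_fault(line))
```

Under these assumptions, code 0x207 reads as a level 3 translation fault on a read access, delivered as signal 11 (SIGSEGV).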
Comment 15 Michał Pecio 2024-11-12 18:30:08 UTC
Hi Andrew,

I was pointed to this bug because I found a similar one (or the same):
https://lore.kernel.org/linux-arm-kernel/20241111233817.2f824c19@foxbook/T/#u

What helped me was running the crashing program under strace (which thankfully didn't crash) and noticing strange behavior of syscalls right before the crash. Specifically, it was cacheflush failing with EFAULT for no reason, and this was actually the direct and relatively obvious cause of gdb segfaulting a moment later.

If you can see similar cacheflush failures in your crashing applications, it's likely the same bug.

If not, knowing what your applications are doing before they crash might still offer some clue, hopefully.
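
For long traces, the check described above can be automated. Here is a minimal Python sketch that scans saved strace output for a given syscall failing with a given errno. The sample lines are hypothetical and only mimic strace's usual `name(args) = -1 ERRNO (text)` layout; they are not taken from an actual trace of this bug.

```python
import re

# Optional leading PID (as printed with "strace -f"), then name(args) = -1 ERRNO
FAILED_CALL = re.compile(r"^(?:\d+\s+)?(?P<name>\w+)\(.*\)\s+= -1 (?P<errno>E\w+)")


def failed_syscalls(trace_text: str, name: str = "cacheflush",
                    errno: str = "EFAULT") -> list:
    """Return the trace lines where syscall `name` failed with `errno`."""
    hits = []
    for raw in trace_text.splitlines():
        line = raw.strip()
        m = FAILED_CALL.match(line)
        if m and m.group("name") == name and m.group("errno") == errno:
            hits.append(line)
    return hits


# Hypothetical trace excerpt for illustration only.
sample_trace = """\
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
cacheflush(0xb6f00000, 0xb6f01000, 0) = -1 EFAULT (Bad address)
cacheflush(0xb6f02000, 0xb6f03000, 0) = 0
"""

for hit in failed_syscalls(sample_trace):
    print(hit)
```

A trace for this could be saved with something like `strace -f -o trace.txt kate` and then read into `trace_text`.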
Comment 16 Linus Walleij 2024-11-12 21:23:05 UTC
Russell wrote this patch:
https://lore.kernel.org/linux-arm-kernel/ZzMsMFNSHLOKEeEW@shell.armlinux.org.uk/

Andrew, could you see if it solves your crash too?
Comment 17 Andrew 2024-11-13 18:00:50 UTC
Yes, I can confirm. It solved my crash too.