Bug 216688

Summary: Bad page map in process with fault:filemap_fault mmap:ext4_file_mmap
Product: Memory Management
Reporter: Marcus Seyfarth (m.seyfarth)
Component: Page Allocator
Assignee: Andrew Morton (akpm)
Status: NEW
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 6.0.8
Subsystem:
Regression: No
Bisected commit-id:

Description Marcus Seyfarth 2022-11-13 22:00:06 UTC
For as long as I have owned the current system (over two years), I have seen the following kernel warning in dmesg when the system is under very high CPU and memory load for a long period of time (e.g. compiling LLVM with ThinLTO for an LTO/PGO/BOLT build that takes hours to complete):

[  +2,258700] BUG: Bad page map in process clang++  pte:34999a025 pmd:5abfdb067
[  +0,000011] page:000000009c141376 refcount:68 mapcount:-191 mapping:000000007d9e887b index:0x47ce pfn:0x34999a
[  +0,000005] memcg:ffff90a9145c8000
[  +0,000002] aops:ext4_da_aops [ext4] ino:961599 dentry name:"libLLVM-16git.so"
[  +0,000023] flags: 0xa600000000002056(referenced|uptodate|lru|workingset|private|zone=2)
[  +0,000005] raw: a600000000002056 ffffda8b1eda1908 ffffda8b1c5bb1c8 ffff90a6ba07b0d8
[  +0,000003] raw: 00000000000047ce ffff90a96c06a000 00000044ffffff40 ffff90a9145c8000
[  +0,000001] page dumped because: bad pte
[  +0,000001] addr:00007f71291cf000 vm_flags:80000075 anon_vma:0000000000000000 mapping:ffff90a6ba07b0d8 index:47ce
[  +0,000004] file:libLLVM-16git.so fault:filemap_fault mmap:ext4_file_mmap [ext4] read_folio:ext4_read_folio [ext4]
[  +0,000032] CPU: 25 PID: 67341 Comm: clang++ Tainted: G    B      O       6.0.8-3.1-cachyos-bore-lto #1 ae22c0a1d3f4e64c18fc975cbc4082d37abd50c5
[  +0,000005] Hardware name: LENOVO GAMING TF/X99-TF Gaming, BIOS CX99DE26 10/10/2020
[  +0,000001] Call Trace:
[  +0,000003]  <TASK>
[  +0,000001]  ? print_bad_pte+0x1e6/0x280
[  +0,000005]  ? unmap_page_range+0xaaf/0x12e0
[  +0,000004]  ? balance_dirty_pages_ratelimited_flags+0x233/0x10e0
[  +0,000006]  ? unmap_vmas+0xd4/0x1a0
[  +0,000003]  ? exit_mmap+0x11b/0x440
[  +0,000004]  ? __mmput+0x3b/0x180
[  +0,000005]  ? do_exit+0x4d6/0x1140
[  +0,000003]  ? do_group_exit+0x4b/0xe0
[  +0,000003]  ? __x64_sys_exit_group+0xe/0x20
[  +0,000002]  ? do_syscall_64+0x2b/0x60
[  +0,000005]  ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  +0,000006]  </TASK>
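
For readers cross-checking the report, the fields in the log above can be decoded by hand. The following is a minimal sketch, not kernel code: it assumes the standard x86-64 page table entry layout with 4 KiB pages, and it assumes that the third word of the second "raw:" line packs _mapcount in its low 32 bits and _refcount in its high 32 bits (the matching values in the log support this for this kernel). The helper names decode_pte and split_counts are hypothetical.

```python
# Decode fields of the "Bad page map" report (x86-64, 4 KiB pages assumed).
# decode_pte and split_counts are illustrative helpers, not kernel functions.

PAGE_SHIFT = 12
# Bits 12..51 of an x86-64 PTE hold the page frame number.
PTE_PFN_MASK = ((1 << 52) - 1) & ~((1 << PAGE_SHIFT) - 1)

# Low PTE bits defined by the x86-64 architecture.
PTE_FLAG_BITS = {
    0: "present",
    1: "writable",
    2: "user",
    5: "accessed",
    6: "dirty",
}


def decode_pte(pte):
    """Return (pfn, [flag names]) for a raw PTE value."""
    pfn = (pte & PTE_PFN_MASK) >> PAGE_SHIFT
    flags = [name for bit, name in sorted(PTE_FLAG_BITS.items()) if (pte >> bit) & 1]
    return pfn, flags


def split_counts(word):
    """Split a raw 64-bit word into (mapcount, refcount).

    Assumes _mapcount is the low 32 bits (signed) and _refcount the high
    32 bits. The reported mapcount is the stored _mapcount plus one.
    """
    raw_mapcount = word & 0xFFFFFFFF
    if raw_mapcount >= 1 << 31:
        raw_mapcount -= 1 << 32  # reinterpret as signed 32-bit
    refcount = word >> 32
    return raw_mapcount + 1, refcount


if __name__ == "__main__":
    pfn, flags = decode_pte(0x34999A025)
    print(f"pfn={pfn:#x} flags={flags}")
    mapcount, refcount = split_counts(0x00000044FFFFFF40)
    print(f"mapcount={mapcount} refcount={refcount}")
    # Byte offset of the affected page within the mapped file (index 0x47ce).
    print(f"file offset={0x47CE << PAGE_SHIFT:#x}")
```

Run on the values from the log, this reproduces pfn 0x34999a, mapcount -191 and refcount 68, so the dump is internally consistent: the PTE points at a present, user-accessible, read-only page whose map count has gone far below zero.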

I normally don't see this when gaming or under lighter workloads, such as an LLVM build with FullLTO or a normal kernel build. My kernel config and the patches I use can be found in my GitHub repo: https://github.com/ms178/archpkgbuilds/tree/main/packages/linux-cachyos-bore

inxi -f output:

System:
  Host: klx99 Kernel: 6.0.8-3.1-cachyos-bore-lto arch: x86_64 bits: 64
    Desktop: KDE Plasma v: 5.26.80 Distro: EndeavourOS
Machine:
  Type: Desktop System: LENOVO product: GAMING TF v: N/A
    serial: <superuser required>
  Mobo: Lenovo model: X99-TF Gaming v: G368J V1.1
    serial: <superuser required> UEFI: American Megatrends v: CX99DE26
    date: 10/10/2020
CPU:
  Info: 18-core model: Intel Xeon E5-2696 v3 bits: 64 type: MT MCP cache:
    L2: 4.5 MiB
  Speed (MHz): avg: 3000 min/max: 1200/2301 cores: 1: 3000 2: 3000 3: 3000
    4: 3000 5: 3000 6: 3000 7: 3000 8: 3000 9: 3000 10: 3000 11: 3000 12: 3000
    13: 3000 14: 3000 15: 3000 16: 3000 17: 3000 18: 3000 19: 3000 20: 3000
    21: 3000 22: 3000 23: 3000 24: 3000 25: 3000 26: 3000 27: 3000 28: 3000
    29: 3000 30: 3000 31: 3000 32: 3000 33: 3000 34: 3000 35: 3000 36: 3000
Graphics:
  Device-1: AMD Vega 10 XL/XT [Radeon RX 56/64] driver: amdgpu v: kernel
  Display: x11 server: X.Org v: 21.1.99 with: Xwayland v: 22.1.5 driver: X:
    loaded: amdgpu unloaded: modesetting dri: radeonsi gpu: amdgpu
    resolution: 2560x1440
  API: OpenGL v: 4.6 Mesa 23.0.0-devel (git-ae76bba34a) renderer: AMD
    Radeon RX Vega (vega10 LLVM 16.0.0 DRM 3.48 6.0.8-3.1-cachyos-bore-lto)
Audio:
  Device-1: Intel C610/X99 series HD Audio driver: snd_hda_intel
  Device-2: AMD Vega 10 HDMI Audio [Radeon 56/64] driver: snd_hda_intel
  Sound API: ALSA v: k6.0.8-3.1-cachyos-bore-lto running: yes
  Sound Server-1: PipeWire v: 0.3.60 running: yes
Network:
  Device-1: Intel I350 Gigabit Network driver: igb
  IF: ens1f0 state: down mac: a0:36:9f:a3:72:44
  Device-2: Intel I350 Gigabit Network driver: igb
  IF: ens1f1 state: up speed: 1000 Mbps duplex: full mac: a0:36:9f:09:3f:67
  Device-3: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet
    driver: r8169
  IF: enp7s0 state: down mac: 00:e0:4c:68:02:1c
Drives:
  Local Storage: total: 1.39 TiB used: 355.29 GiB (25.0%)
  ID-1: /dev/nvme0n1 vendor: Silicon Power model: SPCC M.2 PCIe SSD
    size: 953.87 GiB
  ID-2: /dev/sda vendor: Samsung model: SSD 860 EVO 500GB size: 465.76 GiB
Partition:
  ID-1: / size: 937.53 GiB used: 355.29 GiB (37.9%) fs: ext4
    dev: /dev/nvme0n1p2
  ID-2: /boot/efi size: 299.4 MiB used: 316 KiB (0.1%) fs: vfat
    dev: /dev/nvme0n1p1
Swap:
  ID-1: swap-1 type: zram size: 4 GiB used: 17.5 MiB (0.4%) dev: /dev/zram0
Sensors:
  System Temperatures: cpu: 50.0 C mobo: N/A gpu: amdgpu temp: 43.0 C
  Fan Speeds (RPM): N/A gpu: amdgpu fan: 66
Info:
  Processes: 572 Uptime: 2h 26m Memory: 62.66 GiB used: 6.27 GiB (10.0%)
  Shell: Zsh inxi: 3.3.23

I have seen this issue with several different CPUs in the same system, but I cannot rule out a hardware quirk.
Comment 1 Artem S. Tashkinov 2022-11-17 13:44:45 UTC
Have you ever run memtest86? If not, please do and give it at least 24 hours: https://www.memtest86.com/download.htm
Comment 2 Marcus Seyfarth 2022-11-17 14:37:26 UTC
Not yet, but it is ECC memory, and I haven't seen any issues when putting the system under stress in Windows (where the event log would show any memory errors).
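
On the Linux side, corrected and uncorrected ECC error counts are normally exposed through the EDAC sysfs interface, provided an EDAC driver for the memory controller is loaded (an assumption for this machine; nothing in this report confirms it). A minimal sketch for reading the per-controller ce_count and ue_count files follows; the function name read_edac_counts is hypothetical.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters.

    Controllers whose counter files are missing or unreadable are skipped.
    An empty result usually means no EDAC driver is loaded.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found (driver not loaded?)")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Non-zero counts after a long build would point towards memory hardware; zero counts with the driver loaded would make a RAM fault less likely, though they would not rule out other hardware causes.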
Comment 3 Marcus Seyfarth 2022-11-24 16:36:54 UTC
FWIW, this week the same system survived a 3 h 13 min LLVM-BOLT build session with sustained full CPU and high memory load without showing this error.