Bug 216688 - Bad page map in process with fault:filemap_fault mmap:ext4_file_mmap
Status: NEW
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Page Allocator
Hardware: All Linux
Importance: P1 normal
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-11-13 22:00 UTC by Marcus Seyfarth
Modified: 2022-11-24 16:36 UTC

See Also:
Kernel Version: 6.0.8
Subsystem:
Regression: No
Bisected commit-id:



Description Marcus Seyfarth 2022-11-13 22:00:06 UTC
For as long as I have owned the current system (over two years), I have seen the following kernel warning in dmesg whenever the system is under very high CPU and memory load for a long period of time (e.g. compiling LLVM with ThinLTO for an LTO/PGO/BOLT build that takes hours to complete):

[  +2,258700] BUG: Bad page map in process clang++  pte:34999a025 pmd:5abfdb067
[  +0,000011] page:000000009c141376 refcount:68 mapcount:-191 mapping:000000007d9e887b index:0x47ce pfn:0x34999a
[  +0,000005] memcg:ffff90a9145c8000
[  +0,000002] aops:ext4_da_aops [ext4] ino:961599 dentry name:"libLLVM-16git.so"
[  +0,000023] flags: 0xa600000000002056(referenced|uptodate|lru|workingset|private|zone=2)
[  +0,000005] raw: a600000000002056 ffffda8b1eda1908 ffffda8b1c5bb1c8 ffff90a6ba07b0d8
[  +0,000003] raw: 00000000000047ce ffff90a96c06a000 00000044ffffff40 ffff90a9145c8000
[  +0,000001] page dumped because: bad pte
[  +0,000001] addr:00007f71291cf000 vm_flags:80000075 anon_vma:0000000000000000 mapping:ffff90a6ba07b0d8 index:47ce
[  +0,000004] file:libLLVM-16git.so fault:filemap_fault mmap:ext4_file_mmap [ext4] read_folio:ext4_read_folio [ext4]
[  +0,000032] CPU: 25 PID: 67341 Comm: clang++ Tainted: G    B      O       6.0.8-3.1-cachyos-bore-lto #1 ae22c0a1d3f4e64c18fc975cbc4082d37abd50c5
[  +0,000005] Hardware name: LENOVO GAMING TF/X99-TF Gaming, BIOS CX99DE26 10/10/2020
[  +0,000001] Call Trace:
[  +0,000003]  <TASK>
[  +0,000001]  ? print_bad_pte+0x1e6/0x280
[  +0,000005]  ? unmap_page_range+0xaaf/0x12e0
[  +0,000004]  ? balance_dirty_pages_ratelimited_flags+0x233/0x10e0
[  +0,000006]  ? unmap_vmas+0xd4/0x1a0
[  +0,000003]  ? exit_mmap+0x11b/0x440
[  +0,000004]  ? __mmput+0x3b/0x180
[  +0,000005]  ? do_exit+0x4d6/0x1140
[  +0,000003]  ? do_group_exit+0x4b/0xe0
[  +0,000003]  ? __x64_sys_exit_group+0xe/0x20
[  +0,000002]  ? do_syscall_64+0x2b/0x60
[  +0,000005]  ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  +0,000006]  </TASK>
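
For readers triaging similar reports, here is a minimal sketch that pulls the key fields out of a "Bad page map" report and cross-checks them. It is not part of the original report. The function names and the SAMPLE string are hypothetical, and the bit layout assumes x86-64 with 4 KiB pages (bit 0 present, bit 1 writable, bit 2 user, bit 5 accessed, bit 6 dirty, frame number in bits 12..51).

```python
import re

# Assumption: x86-64 PTE layout with 4 KiB pages.
PTE_FLAGS = {0: "present", 1: "writable", 2: "user", 5: "accessed", 6: "dirty"}
PFN_MASK = ((1 << 52) - 1) & ~0xFFF  # bits 12..51 hold the page frame number

# Hypothetical sample: the relevant lines from the report, timestamps stripped.
SAMPLE = (
    "BUG: Bad page map in process clang++  pte:34999a025 pmd:5abfdb067\n"
    "page:000000009c141376 refcount:68 mapcount:-191 "
    "mapping:000000007d9e887b index:0x47ce pfn:0x34999a\n"
    "page dumped because: bad pte\n"
)


def decode_pte(pte: int) -> dict:
    """Split a raw PTE value into its frame number and the flag bits that are set."""
    flags = [name for bit, name in PTE_FLAGS.items() if pte & (1 << bit)]
    return {"pfn": (pte & PFN_MASK) >> 12, "flags": flags}


def parse_bad_page_map(log: str) -> dict:
    """Extract pte, refcount, mapcount and pfn, and check that pte and pfn agree."""
    pte = int(re.search(r"\bpte:([0-9a-f]+)", log).group(1), 16)
    page = re.search(r"refcount:(-?\d+) mapcount:(-?\d+).*?pfn:0x([0-9a-f]+)", log)
    refcount, mapcount = int(page.group(1)), int(page.group(2))
    pfn = int(page.group(3), 16)
    decoded = decode_pte(pte)
    return {
        "pte": pte,
        "pte_flags": decoded["flags"],
        "refcount": refcount,
        "mapcount": mapcount,
        "pfn": pfn,
        "pte_matches_pfn": decoded["pfn"] == pfn,
        "mapcount_negative": mapcount < 0,
    }


if __name__ == "__main__":
    for key, value in parse_bad_page_map(SAMPLE).items():
        print(f"{key}: {value}")
```

For the trace above, this decodes the PTE as a present, user, accessed, read-only mapping whose frame number matches the reported pfn. The anomaly is the negative mapcount (-191), which a mapped page should never have.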

I normally don't see this when gaming or under lighter workloads, such as an LLVM build with FullLTO or a normal kernel build. My kernel config and the patches that I use can be found in my GitHub repo: https://github.com/ms178/archpkgbuilds/tree/main/packages/linux-cachyos-bore

inxi -f output:

System:
  Host: klx99 Kernel: 6.0.8-3.1-cachyos-bore-lto arch: x86_64 bits: 64
    Desktop: KDE Plasma v: 5.26.80 Distro: EndeavourOS
Machine:
  Type: Desktop System: LENOVO product: GAMING TF v: N/A
    serial: <superuser required>
  Mobo: Lenovo model: X99-TF Gaming v: G368J V1.1
    serial: <superuser required> UEFI: American Megatrends v: CX99DE26
    date: 10/10/2020
CPU:
  Info: 18-core model: Intel Xeon E5-2696 v3 bits: 64 type: MT MCP cache:
    L2: 4.5 MiB
  Speed (MHz): avg: 3000 min/max: 1200/2301 cores: 1: 3000 2: 3000 3: 3000
    4: 3000 5: 3000 6: 3000 7: 3000 8: 3000 9: 3000 10: 3000 11: 3000 12: 3000
    13: 3000 14: 3000 15: 3000 16: 3000 17: 3000 18: 3000 19: 3000 20: 3000
    21: 3000 22: 3000 23: 3000 24: 3000 25: 3000 26: 3000 27: 3000 28: 3000
    29: 3000 30: 3000 31: 3000 32: 3000 33: 3000 34: 3000 35: 3000 36: 3000
Graphics:
  Device-1: AMD Vega 10 XL/XT [Radeon RX 56/64] driver: amdgpu v: kernel
  Display: x11 server: X.Org v: 21.1.99 with: Xwayland v: 22.1.5 driver: X:
    loaded: amdgpu unloaded: modesetting dri: radeonsi gpu: amdgpu
    resolution: 2560x1440
  API: OpenGL v: 4.6 Mesa 23.0.0-devel (git-ae76bba34a) renderer: AMD
    Radeon RX Vega (vega10 LLVM 16.0.0 DRM 3.48 6.0.8-3.1-cachyos-bore-lto)
Audio:
  Device-1: Intel C610/X99 series HD Audio driver: snd_hda_intel
  Device-2: AMD Vega 10 HDMI Audio [Radeon 56/64] driver: snd_hda_intel
  Sound API: ALSA v: k6.0.8-3.1-cachyos-bore-lto running: yes
  Sound Server-1: PipeWire v: 0.3.60 running: yes
Network:
  Device-1: Intel I350 Gigabit Network driver: igb
  IF: ens1f0 state: down mac: a0:36:9f:a3:72:44
  Device-2: Intel I350 Gigabit Network driver: igb
  IF: ens1f1 state: up speed: 1000 Mbps duplex: full mac: a0:36:9f:09:3f:67
  Device-3: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet
    driver: r8169
  IF: enp7s0 state: down mac: 00:e0:4c:68:02:1c
Drives:
  Local Storage: total: 1.39 TiB used: 355.29 GiB (25.0%)
  ID-1: /dev/nvme0n1 vendor: Silicon Power model: SPCC M.2 PCIe SSD
    size: 953.87 GiB
  ID-2: /dev/sda vendor: Samsung model: SSD 860 EVO 500GB size: 465.76 GiB
Partition:
  ID-1: / size: 937.53 GiB used: 355.29 GiB (37.9%) fs: ext4
    dev: /dev/nvme0n1p2
  ID-2: /boot/efi size: 299.4 MiB used: 316 KiB (0.1%) fs: vfat
    dev: /dev/nvme0n1p1
Swap:
  ID-1: swap-1 type: zram size: 4 GiB used: 17.5 MiB (0.4%) dev: /dev/zram0
Sensors:
  System Temperatures: cpu: 50.0 C mobo: N/A gpu: amdgpu temp: 43.0 C
  Fan Speeds (RPM): N/A gpu: amdgpu fan: 66
Info:
  Processes: 572 Uptime: 2h 26m Memory: 62.66 GiB used: 6.27 GiB (10.0%)
  Shell: Zsh inxi: 3.3.23

I have seen this issue with several different CPUs in the same system, but I cannot rule out a hardware quirk.
Comment 1 Artem S. Tashkinov 2022-11-17 13:44:45 UTC
Have you ever run memtest86? If not, please do and give it at least 24 hours: https://www.memtest86.com/download.htm
Comment 2 Marcus Seyfarth 2022-11-17 14:37:26 UTC
Not yet, but it is ECC memory, and I haven't seen any issues when putting the system under stress in Windows (the event log would show any memory errors).
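
On Linux, ECC events can also be checked directly. This is a minimal sketch, assuming an EDAC driver for the memory controller is loaded and exposes per-controller counters under /sys/devices/system/edac/mc. The function name read_edac_counts is hypothetical and not something used in this report.

```python
from pathlib import Path


def read_edac_counts(base: str = "/sys/devices/system/edac/mc") -> dict:
    """Return corrected (ce_count) and uncorrected (ue_count) ECC error
    counts for each memory controller directory found under base.

    A missing or unreadable counter is reported as None. An empty result
    means no EDAC memory controller was found (e.g. driver not loaded).
    """
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found (is an EDAC driver loaded?)")
    for controller, entry in result.items():
        print(controller, entry)
```

Non-zero counts after a long build would point towards memory hardware. Counts that stay at zero do not rule it out, since the counters only cover errors that the ECC logic detects.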
Comment 3 Marcus Seyfarth 2022-11-24 16:36:54 UTC
FWIW, this week the same system survived a 3 h 13 min LLVM-BOLT build session with sustained full CPU and high memory load without showing this error.
