Bug 212123 - kernel BUG at mm/filemap.c:1499, invalid opcode: 0000
Summary: kernel BUG at mm/filemap.c:1499, invalid opcode: 0000
Status: NEW
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other (show other bugs)
Hardware: Intel Linux
: P1 normal
Assignee: FileSystem/XFS Default Virtual Assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-03-08 07:31 UTC by Tiziano Müller
Modified: 2021-03-09 08:52 UTC (History)
0 users

See Also:
Kernel Version: 5.11.1-2.g39e04be
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Tiziano Müller 2021-03-08 07:31:39 UTC
Unfortunately I didn't have the chance yet to test this with something else than XFS, but we see this across several hosts with different Intel SSDs (SATA and NVMe, gets triggered faster on NVMe) and AMD CPUs, different mount options (discard or periodic trim) and at least since Kernel 5.7.

This makes the corresponding user space process hang and finally a hard reset of the machine is needed. What seems to trigger the issue more rapidly (besides a faster drive) is the fill state (usually >90%) of the filesystem and/or the load on the machine (those are HPC-esque nodes).

The following stacktrace also contains an error "Bad page state in process", this is not always the case.

2021-02-26T11:49:29.803049+01:00 tcopt11 kernel: [156009.300579] ------------[ cut here ]------------
2021-02-26T11:49:29.803063+01:00 tcopt11 kernel: [156009.300584] kernel BUG at mm/filemap.c:1499!
2021-02-26T11:49:29.803065+01:00 tcopt11 kernel: [156009.300591] invalid opcode: 0000 [#1] SMP NOPTI
2021-02-26T11:49:29.803066+01:00 tcopt11 kernel: [156009.305213] CPU: 51 PID: 832 Comm: kworker/51:1 Not tainted 5.11.1-2.g39e04be-default #1 openSUSE Tumbleweed (unreleased)
2021-02-26T11:49:29.803067+01:00 tcopt11 kernel: [156009.316246] Hardware name: GIGABYTE R181-Z91-00/MZ91-FS0-00, BIOS F23a 08/14/2020
2021-02-26T11:49:29.803069+01:00 tcopt11 kernel: [156009.323812] Workqueue: xfs-conv/nvme0n1p1 xfs_end_io [xfs]
2021-02-26T11:49:29.803069+01:00 tcopt11 kernel: [156009.329486] RIP: 0010:end_page_writeback+0xbb/0xc0
2021-02-26T11:49:29.803070+01:00 tcopt11 kernel: [156009.334365] Code: 34 83 e0 07 83 f8 04 75 e8 48 8b 45 08 8b 40 68 83 e8 01 83 f8 01 77 d9 48 89 ef 5d e9 fe e4 00 00 48 89 ef 5d e9 45 e4 00 00 <0f> 0b 0f 1f 00 0f 1f 44 00 00 41 54 55 48 89 fd 53 89 d3 40 84 f6
2021-02-26T11:49:29.803071+01:00 tcopt11 kernel: [156009.353203] RSP: 0018:ffffbf67cf38fd80 EFLAGS: 00010246
2021-02-26T11:49:29.803073+01:00 tcopt11 kernel: [156009.358519] RAX: 0000000000000000 RBX: 000000000000001d RCX: 0000000000000005
2021-02-26T11:49:29.803074+01:00 tcopt11 kernel: [156009.365739] RDX: 0000000000000000 RSI: ffff9b3893433130 RDI: 0000000000000000
2021-02-26T11:49:29.803074+01:00 tcopt11 kernel: [156009.372959] RBP: fffff9a5643ea1c0 R08: 0000000000006000 R09: 0000000000000001
2021-02-26T11:49:29.803075+01:00 tcopt11 kernel: [156009.380177] R10: ffff9b3e2ff6a000 R11: 0000000000000020 R12: 0000000000000000
2021-02-26T11:49:29.803095+01:00 tcopt11 kernel: [156009.387398] R13: 0000000000008000 R14: ffff9b0f0294b240 R15: fffff9a5643ea1c0
2021-02-26T11:49:29.803096+01:00 tcopt11 kernel: [156009.394617] FS:  0000000000000000(0000) GS:ffff9b160f040000(0000) knlGS:0000000000000000
2021-02-26T11:49:29.803098+01:00 tcopt11 kernel: [156009.402789] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2021-02-26T11:49:29.803099+01:00 tcopt11 kernel: [156009.408622] CR2: 0000558ea6c00870 CR3: 000000183b5b6000 CR4: 00000000003506e0
2021-02-26T11:49:29.803099+01:00 tcopt11 kernel: [156009.415841] Call Trace:
2021-02-26T11:49:29.803100+01:00 tcopt11 kernel: [156009.418383]  iomap_finish_ioend+0x127/0x250
2021-02-26T11:49:29.803101+01:00 tcopt11 kernel: [156009.422654]  iomap_finish_ioends+0x41/0x90
2021-02-26T11:49:29.803101+01:00 tcopt11 kernel: [156009.426842]  xfs_end_ioend+0x6d/0x110 [xfs]
2021-02-26T11:49:29.803102+01:00 tcopt11 kernel: [156009.431193]  xfs_end_io+0xac/0xe0 [xfs]
2021-02-26T11:49:29.803103+01:00 tcopt11 kernel: [156009.435195]  process_one_work+0x1df/0x370
2021-02-26T11:49:29.803104+01:00 tcopt11 kernel: [156009.439293]  worker_thread+0x50/0x400
2021-02-26T11:49:29.803105+01:00 tcopt11 kernel: [156009.443047]  ? process_one_work+0x370/0x370
2021-02-26T11:49:29.803106+01:00 tcopt11 kernel: [156009.447320]  kthread+0x11b/0x140
2021-02-26T11:49:29.803107+01:00 tcopt11 kernel: [156009.450638]  ? __kthread_bind_mask+0x60/0x60
2021-02-26T11:49:29.803107+01:00 tcopt11 kernel: [156009.454999]  ret_from_fork+0x22/0x30
2021-02-26T11:49:29.803108+01:00 tcopt11 kernel: [156009.458665] Modules linked in: joydev st sr_mod cdrom rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc nfs_ssc fscache fuse af_packet xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ebtable_filter ebtables iscsi_ibft iscsi_boot_sysfs ip6table_filter ip6_tables rfkill iptable_filter ip_tables x_tables bpfilter dmi_sysfs msr ipmi_ssif i40iw ib_uverbs intel_rapl_msr ib_core intel_rapl_common amd64_edac_mod edac_mce_amd xfs nls_iso8859_1 nls_cp437 kvm_amd vfat fat kvm nvme irqbypass acpi_ipmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel igb ipmi_si crypto_simd sp5100_tco cryptd glue_helper i40e pcspkr efi_pstore nvme_core ipmi_devintf k10temp i2c_piix4 dca ccp ipmi_msghandler tiny_power_button button
2021-02-26T11:49:29.803109+01:00 tcopt11 kernel: [156009.458745]  acpi_cpufreq btrfs blake2b_generic libcrc32c xor ast i2c_algo_bit drm_vram_helper raid6_pq drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm_ttm_helper xhci_pci ttm xhci_pci_renesas crc32c_intel xhci_hcd drm usbcore sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
2021-02-26T11:49:29.803116+01:00 tcopt11 kernel: [156009.574325] ---[ end trace bd7bd5d623d48cd8 ]---
2021-02-26T11:49:29.944226+01:00 tcopt11 kernel: [156009.710597] RIP: 0010:end_page_writeback+0xbb/0xc0
2021-02-26T11:49:29.944229+01:00 tcopt11 kernel: [156009.710604] Code: 34 83 e0 07 83 f8 04 75 e8 48 8b 45 08 8b 40 68 83 e8 01 83 f8 01 77 d9 48 89 ef 5d e9 fe e4 00 00 48 89 ef 5d e9 45 e4 00 00 <0f> 0b 0f 1f 00 0f 1f 44 00 00 41 54 55 48 89 fd 53 89 d3 40 84 f6
2021-02-26T11:49:29.968386+01:00 tcopt11 kernel: [156009.734336] RSP: 0018:ffffbf67cf38fd80 EFLAGS: 00010246
2021-02-26T11:49:29.968388+01:00 tcopt11 kernel: [156009.734340] RAX: 0000000000000000 RBX: 000000000000001d RCX: 0000000000000005
2021-02-26T11:49:29.982842+01:00 tcopt11 kernel: [156009.746884] RDX: 0000000000000000 RSI: ffff9b3893433130 RDI: 0000000000000000
2021-02-26T11:49:29.982845+01:00 tcopt11 kernel: [156009.746888] RBP: fffff9a5643ea1c0 R08: 0000000000006000 R09: 0000000000000001
2021-02-26T11:49:29.982846+01:00 tcopt11 kernel: [156009.746890] R10: ffff9b3e2ff6a000 R11: 0000000000000020 R12: 0000000000000000
2021-02-26T11:49:30.004527+01:00 tcopt11 kernel: [156009.768566] R13: 0000000000008000 R14: ffff9b0f0294b240 R15: fffff9a5643ea1c0
2021-02-26T11:49:30.004528+01:00 tcopt11 kernel: [156009.768569] FS:  0000000000000000(0000) GS:ffff9b160f040000(0000) knlGS:0000000000000000
2021-02-26T11:49:30.013210+01:00 tcopt11 kernel: [156009.783977] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2021-02-26T11:49:30.013212+01:00 tcopt11 kernel: [156009.783980] CR2: 0000558ea6c00870 CR3: 000000183b5b6000 CR4: 00000000003506e0
2021-02-26T11:49:32.035969+01:00 tcopt11 kernel: [156011.521366] BUG: Bad page state in process cp2k.ssmp  pfn:290fa87
2021-02-26T11:49:32.035989+01:00 tcopt11 kernel: [156011.527556] page:000000000121bb86 refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x290fa87
2021-02-26T11:49:32.035991+01:00 tcopt11 kernel: [156011.537118] flags: 0x2affff800000000()
2021-02-26T11:49:32.035992+01:00 tcopt11 kernel: [156011.540959] raw: 02affff800000000 dead000000000100 dead000000000122 0000000000000000
2021-02-26T11:49:32.035993+01:00 tcopt11 kernel: [156011.548785] raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000
2021-02-26T11:49:32.035995+01:00 tcopt11 kernel: [156011.556610] page dumped because: nonzero _refcount
2021-02-26T11:49:32.035996+01:00 tcopt11 kernel: [156011.561488] Modules linked in: joydev st sr_mod cdrom rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc nfs_ssc fscache fuse af_packet xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ebtable_filter ebtables iscsi_ibft iscsi_boot_sysfs ip6table_filter ip6_tables rfkill iptable_filter ip_tables x_tables bpfilter dmi_sysfs msr ipmi_ssif i40iw ib_uverbs intel_rapl_msr ib_core intel_rapl_common amd64_edac_mod edac_mce_amd xfs nls_iso8859_1 nls_cp437 kvm_amd vfat fat kvm nvme irqbypass acpi_ipmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel igb ipmi_si crypto_simd sp5100_tco cryptd glue_helper i40e pcspkr efi_pstore nvme_core ipmi_devintf k10temp i2c_piix4 dca ccp ipmi_msghandler tiny_power_button button
2021-02-26T11:49:32.036009+01:00 tcopt11 kernel: [156011.561566]  acpi_cpufreq btrfs blake2b_generic libcrc32c xor ast i2c_algo_bit drm_vram_helper raid6_pq drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm_ttm_helper xhci_pci ttm xhci_pci_renesas crc32c_intel xhci_hcd drm usbcore sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
2021-02-26T11:49:32.036013+01:00 tcopt11 kernel: [156011.677099] CPU: 56 PID: 89417 Comm: cp2k.ssmp Tainted: G      D           5.11.1-2.g39e04be-default #1 openSUSE Tumbleweed (unreleased)
2021-02-26T11:49:32.036015+01:00 tcopt11 kernel: [156011.689437] Hardware name: GIGABYTE R181-Z91-00/MZ91-FS0-00, BIOS F23a 08/14/2020
2021-02-26T11:49:32.036016+01:00 tcopt11 kernel: [156011.697002] Call Trace:
2021-02-26T11:49:32.036018+01:00 tcopt11 kernel: [156011.699545]  dump_stack+0x6b/0x83
2021-02-26T11:49:32.036019+01:00 tcopt11 kernel: [156011.702956]  bad_page.cold+0x9b/0xa0
2021-02-26T11:49:32.036020+01:00 tcopt11 kernel: [156011.706623]  get_page_from_freelist+0x74b/0x1380
2021-02-26T11:49:32.036021+01:00 tcopt11 kernel: [156011.711331]  __alloc_pages_nodemask+0x161/0x310
2021-02-26T11:49:32.036023+01:00 tcopt11 kernel: [156011.715947]  alloc_pages_vma+0x80/0x260
2021-02-26T11:49:32.036024+01:00 tcopt11 kernel: [156011.719873]  handle_mm_fault+0xef9/0x1960
2021-02-26T11:49:32.036025+01:00 tcopt11 kernel: [156011.723974]  ? schedule+0x46/0xb0
2021-02-26T11:49:32.036028+01:00 tcopt11 kernel: [156011.727380]  do_user_addr_fault+0x19f/0x480
2021-02-26T11:49:32.036029+01:00 tcopt11 kernel: [156011.731653]  exc_page_fault+0x5d/0x120
2021-02-26T11:49:32.036031+01:00 tcopt11 kernel: [156011.735490]  ? asm_exc_page_fault+0x8/0x30
2021-02-26T11:49:32.036034+01:00 tcopt11 kernel: [156011.739677]  asm_exc_page_fault+0x1e/0x30
2021-02-26T11:49:32.036036+01:00 tcopt11 kernel: [156011.743777] RIP: 0033:0xc9b9d8
2021-02-26T11:49:32.036039+01:00 tcopt11 kernel: [156011.746923] Code: b0 08 20 00 00 49 8d 90 08 22 08 00 4d 8d 24 08 4d 89 f7 48 89 55 a0 49 89 c6 4d 01 ec 0f 1f 44 00 00 4f 8d 8c 3d f8 dd f7 ff <43> c7 84 3c f8 fd f7 ff 01 00 00 00 48 89 d8 49 8d 79 08 49 c7 01
2021-02-26T11:49:32.036041+01:00 tcopt11 kernel: [156011.765756] RSP: 002b:00007ffe41c5df80 EFLAGS: 00010287
2021-02-26T11:49:32.036043+01:00 tcopt11 kernel: [156011.771069] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
2021-02-26T11:49:32.036045+01:00 tcopt11 kernel: [156011.778289] RDX: 000000000a5386b0 RSI: 0000000000000000 RDI: 000000000a53a6c0
2021-02-26T11:49:32.036047+01:00 tcopt11 kernel: [156011.785507] RBP: 00007ffe41c5e000 R08: 00007f9291c55b00 R09: 00007f92785bd140
2021-02-26T11:49:32.036050+01:00 tcopt11 kernel: [156011.792727] R10: ffffffffffff0c08 R11: 0000000000000070 R12: 00007f9278571010
2021-02-26T11:49:32.036053+01:00 tcopt11 kernel: [156011.799946] R13: 00007f9278571010 R14: 00007f9291c55f30 R15: 00000000000ce338
2021-02-26T11:50:18.309353+01:00 tcopt11 kernel: [156058.072317] BUG: workqueue lockup - pool cpus=51 node=0 flags=0x0 nice=0 stuck for 48s!
2021-02-26T11:50:18.309380+01:00 tcopt11 kernel: [156058.080446] Showing busy workqueues and worker pools:
2021-02-26T11:50:18.319118+01:00 tcopt11 kernel: [156058.085750] workqueue mm_percpu_wq: flags=0x8
2021-02-26T11:50:18.330121+01:00 tcopt11 kernel: [156058.090215]   pwq 102: cpus=51 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
2021-02-26T11:50:18.330126+01:00 tcopt11 kernel: [156058.097355]     pending: vmstat_update
2021-02-26T11:50:18.347571+01:00 tcopt11 kernel: [156058.101828] workqueue xfs-conv/nvme0n1p1: flags=0xc
2021-02-26T11:50:18.347576+01:00 tcopt11 kernel: [156058.106805]   pwq 102: cpus=51 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
2021-02-26T11:50:18.347577+01:00 tcopt11 kernel: [156058.113940]     in-flight: 832:xfs_end_io [xfs]
2021-02-26T11:50:18.355454+01:00 tcopt11 kernel: [156058.118877] pool 102: cpus=51 node=0 flags=0x0 nice=0 hung=48s workers=2 idle: 330
[...]
Comment 1 Tiziano Müller 2021-03-09 08:52:52 UTC
As per the discussion in #xfs, this is most likely a bug in the page cache/writeback state tracking. Changing component to Memory Management.

Note You need to log in before you can comment on or make changes to this bug.