Bug 206191
Summary: | 5.5.x, 5.4.x: PAGE FAULT crashes the system multiple times / 24h | ||
---|---|---|---|
Product: | Memory Management | Reporter: | udo (udovdh) |
Component: | Page Allocator | Assignee: | other_other |
Status: | RESOLVED INSUFFICIENT_DATA | ||
Severity: | blocking | CC: | akpm, alexdeucher, Ian.kumlien, kernel, postix, schweinefilet |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 5.5.x, 5.4.x, 5.3.x | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | crashdump; crashdump2-1; crashdump2-2; crashdump2-3; crashdump2-4; kernel config; adapted patch to 5.4.15 from https://lkml.org/lkml/diff/2019/12/10/333/1; dmesg from one rare instance where the system did not fully crash; dmesg: Null pointer; dmesg: another bug; latest kernel config | ||
Description
udo
2020-01-13 15:57:17 UTC
Firefox appears to be a trigger for this bug.

Created attachment 286787 [details]
crashdump2-1
new crashdump, more complete
Created attachment 286789 [details]
crashdump2-2
Created attachment 286791 [details]
crashdump2-3
Created attachment 286793 [details]
crashdump2-4
more complete crashdump, final photo
I managed to capture a second, more complete crashdump. Photos are attached to this bug.

What bios version are you running?

I see a pagefault, so is this a memory management bug?

(In reply to txrx from comment #7)
> What bios version are you running?

https://bugzilla.kernel.org/attachment.cgi?id=286787

This is the Call Trace for a non-crash happening: https://bugzilla.kernel.org/show_bug.cgi?id=206245

(In reply to udo from comment #0)
> Created attachment 286783 [details]
> crashdump
>
> Previously I emailed the linux-kernel list with:
>
> [ 246.911769] Call Trace:

See bug https://bugzilla.kernel.org/show_bug.cgi?id=206245

5.4.13 also has this issue.

Created attachment 286881 [details]
kernel config
5.3.18 shows a similar type of behaviour.

Can you bisect?

Sure I can, when I find the howto to help me remember. But I forgot what the last known good version was. Also, this will be a lengthy process.

But now 5.3.18 has been up for four days, so the issue in Comment 13 was perhaps a different one?

For bisect: Can we call 5.3.18 a good version and 5.4 a bad version? Can you agree with those?

I found this patch https://lkml.org/lkml/diff/2019/12/10/333/1 from this thread which looks similar (initial error messages) to what I found for this issue. I adapted the patch to 5.4.15 to test. Bisect has to wait.

Bisect between v5.3 and v5.4 has done two steps, 10 or so to go.

Created attachment 286995 [details]
adapted patch to 5.4.15 from https://lkml.org/lkml/diff/2019/12/10/333/1

Patch did not fix the core issue. Back to bisecting.

Created attachment 287017 [details]
dmesg from one rare instance where the system did not fully crash.
What can a developer tell from the info in the attachment to comment 20?

Created attachment 287097 [details]
dmesg: Null pointer
Created attachment 287225 [details]
dmesg: another bug
These bugs make it hard to bisect.
Kernel 5.5.2 also has this issue.

While bisecting I found this in dmesg:
[22803.406500] evince[29402]: segfault at 7f6af8021000 ip 00007f6b3a9c7d6b sp 00007ffd81eeb648 error 6 in libc-2.30.so[7f6b3a888000+14f000]
[22803.480177] Code: 47 20 c5 fe 7f 44 17 c0 c5 fe 7f 47 40 c5 fe 7f 44 17 a0 c5 fe 7f 47 60 c5 fe 7f 44 17 80 48 01 fa 48 83 e2 80 48 39 d1 74 ba <c5> fd 7f 01 c5 fd 7f 41 20 c5 fd 7f 41 40 c5 fd 7f 41 60 48 81 c1
[23067.538622] WARNING: CPU: 4 PID: 30 at kernel/rcu/tree.c:2211 rcu_core+0x3f6/0x450
[23067.583982] Modules linked in: fuse mq_deadline xt_MASQUERADE iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_u32 xt_multiport iptable_filter nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT it87 nf_reject_ipv6 hwmon_vid xt_tcpudp xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables msr uvcvideo videobuf2_vmalloc snd_usb_audio videobuf2_memops videobuf2_v4l2 snd_hwdep snd_usbmidi_lib videodev snd_rawmidi videobuf2_common cdc_acm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_intel_nhlt snd_hda_codec snd_hda_core snd_seq snd_seq_device snd_pcm k10temp snd_timer i2c_piix4 snd bfq evdev acpi_cpufreq binfmt_misc ip_tables x_tables amdgpu hid_generic gpu_sched aesni_intel ttm sr_mod cdrom usbhid i2c_dev autofs4
[23067.992351] CPU: 4 PID: 30 Comm: ksoftirqd/4 Tainted: G W 5.3.0git-11783-g0a3775e4f883 #7
[23068.049171] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F11 12/06/2019
[23068.107562] RIP: 0010:rcu_core+0x3f6/0x450
[23068.132087] Code: fd ff ff 48 2b 15 ea 07 f6 00 48 39 c2 7e 07 48 89 83 b0 00 00 00 48 8b 43 50 48 85 c0 75 c7 0f 0b eb c3 0f 0b e9 5a fc ff ff <0f> 0b eb b8 48 8b 15 c7 07 f6 00 48 89 93 c0 00 00 00 e9 72 ff ff
[23068.244630] RSP: 0018:ffffa6aa00203e18 EFLAGS: 00010002
[23068.275932] RAX: 000014faf2aab325 RBX: ffffa443df11ee00 RCX: ffffa4438e2cd710
[23068.318692] RDX: 0000000000000000 RSI: ffffa6aa00203e18 RDI: ffffa443df11ee50
[23068.361451] RBP: ffffa443df11ee50 R08: ffffa4438e2cd710 R09: 0000000000000100
[23068.404206] R10: 0000000000000080 R11: ffffa443df11e1f8 R12: 0000000000000246
[23068.446968] R13: ffffa443ddf884c0 R14: 0000000000000000 R15: ffffffffab005110
[23068.489727] FS: 0000000000000000(0000) GS:ffffa443df100000(0000) knlGS:0000000000000000
[23068.538215] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[23068.572640] CR2: 00003080cd8c26a0 CR3: 0000000027aca000 CR4: 00000000003406e0
[23068.615396] Call Trace:
[23068.630035] __do_softirq+0xfc/0x24c
[23068.651439] run_ksoftirqd+0x21/0x30
[23068.672842] smpboot_thread_fn+0x195/0x230
[23068.697374] ? sort_range+0x20/0x20
[23068.718259] kthread+0x10d/0x130
[23068.737578] ? kthread_create_worker_on_cpu+0x60/0x60
[23068.767843] ret_from_fork+0x22/0x40
[23068.789244] ---[ end trace 8d1fe1b346997d45 ]---
[25100.691238] Web Content[30651]: segfault at 7faba69fffe8 ip 00007fabbcd97c43 sp 00007ffff8aaf518 error 4 in libxul.so[7fabba60f000+3a97000]
5.3.0git-11783-g0a3775e4f883

Bisecting landed me on a kernel that cannot find the root fs, despite the kernel commandline not changing.

Are the many segfaults in the software a sign of this kernel issue? Why do I see these and not so many others, judging from reactions?

New bisect started between 5.2 and 5.4.

This new bisect immediately results in a kernel that does not find the rootfs and instead lists a few block devices. So how can I find a commit causing the page fault issue this way? This is with 5.3.0-rc3git-01632-gbe91233b1053; how to circumvent?

5.5.8, similar trace as at the start of this bug.

I cannot proceed with the bisect; I need help with this issue.

So what does the 'blocking' status mean at this time?

[21573.900298] traps: chrome[24334] trap int3 ip:55eec824b037 sp:7fffd37bcb50 error:0 in chrome[55eec4040000+7287000]
[21573.978083] traps: chrome[24325] trap int3 ip:55ec1b96b6ae sp:7ffef1029eb0 error:0 in chrome[55ec176e3000+7287000]
[21677.749442] ------------[ cut here ]------------
[21677.777125] WARNING: CPU: 0 PID: 9 at kernel/rcu/tree.c:2238 rcu_core+0x401/0x450
[21677.821957] Modules linked in: fuse mq_deadline ip6t_REJECT nf_reject_ipv6 xt_state ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast xt_MASQUERADE iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_u32 xt_multiport xt_tcpudp xt_conntrack it87 nf_conntrack hwmon_vid nf_defrag_ipv6 nf_defrag_ipv4 msr iptable_filter snd_usb_audio uvcvideo snd_hwdep videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_usbmidi_lib videodev videobuf2_common snd_rawmidi cdc_acm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_seq snd_seq_device i2c_piix4 k10temp snd_pcm snd_timer snd bfq evdev acpi_cpufreq binfmt_misc ip_tables x_tables sr_mod cdrom amdgpu gpu_sched aesni_intel ttm hid_generic usbhid i2c_dev autofs4
[21678.231368] CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 5.5.8 #3
[21678.267353] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F11 12/06/2019
[21678.325743] RIP: 0010:rcu_core+0x401/0x450
[21678.350271] Code: fd ff ff 48 2b 15 df db f5 00 48 39 c2 7e 07 48 89 83 b0 00 00 00 48 8b 43 50 48 85 c0 75 c7 0f 0b eb c3 0f 0b e9 4f fc ff ff <0f> 0b eb b8 48 8b 15 bc db f5 00 48 89 93 c0 00 00 00 e9 72 ff ff
[21678.462816] RSP: 0018:ffffae44c0117e18 EFLAGS: 00010006
[21678.494117] RAX: 000013b75c8178a6 RBX: ffffa06cdf020f00 RCX: ffffa06cb5a1fe90
[21678.536874] RDX: 0000000000000000 RSI: ffffae44c0117e18 RDI: ffffa06cdf020f50
[21678.579634] RBP: ffffa06cdf020f50 R08: ffffa06cb5a1fe90 R09: 0000000000000100
[21678.622393] R10: 0000000000000002 R11: 000000000001831c R12: 0000000000000246
[21678.665150] R13: ffffa06cd0e9ee00 R14: 0000000000000000 R15: 0000000000000000
[21678.707911] FS: 0000000000000000(0000) GS:ffffa06cdf000000(0000) knlGS:0000000000000000
[21678.756395] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21678.790822] CR2: 00007f3d27d3f000 CR3: 000000021ce3c000 CR4: 00000000003406f0
[21678.833577] Call Trace:
[21678.848219] __do_softirq+0xfc/0x247
[21678.869622] run_ksoftirqd+0x21/0x30
[21678.891027] smpboot_thread_fn+0x195/0x230
[21678.915559] ? sort_range+0x20/0x20
[21678.936446] kthread+0x118/0x130
[21678.955767] ? kthread_create_worker_on_cpu+0x60/0x60
[21678.986022] ret_from_fork+0x22/0x40
[21679.007427] ---[ end trace 76d0aaccffc55ecc ]---

Dist? Kernel commandline? How much memory? I have two ryzen boxes and I'm not seeing this...

We run Fedora 31 with kernel.org, git mesa.
kernel commandline:
ro root=/dev/myvg/rootlv noexec=on noexec32=on vga=0xF06 SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us acpi_enforce_resources=lax fbcon=font:VGA8x16 cgroup_disable=memory threadirqs plymouth.enable=0 rd.plymouth=0 mce=dont_log_ce panic=0 rd.lvm.vg=myvg zswap.enabled=0 rd.auto=1 audit=0 systemd.log_level=warning net.ifnames=0 clocksource=hpet rd.lvm.vg=ssdvg rd.luks.options=discard elevator=mq-deadline amdgpu.gttsize=8192 amdgpu.lockup_timeout=1000 amdgpu.gpu_recovery=1 amdgpu.noretry=0 amdgpu.ppfeaturemask=0xfffd3fff console=ttyS1,19200n8 console=tty0 fsck.repair=yes fsck.mode=force
(from extlinux.conf)
16GB of DDR 3200 RAM, latest BIOS from gigabyte

Created attachment 287819 [details]
latest kernel config
On my old AMD system I got odd bugs and crashes with "threadirqs"; try without it.

Threadirqs has been there for quite a while; it helps using the multicore capabilities. I'll nevertheless try without it soon.

It forces drivers not explicitly marked as "not working with threaded irqs" to use threaded irqs... Some drivers might have problems with it without being marked as such.

If the omission of threadirqs helps the stability, then how can we find the offending driver?

I don't know, it could be side effects - either way I'm not the right person to find it... :)

On 5.5.8 without threadirqs I still get this one.
[ 110.812821] pps pps0: source "/dev/ttyS0" added
[ 110.997022] it87: Found IT8733E chip at 0xa60, revision 3
[ 110.997071] it87: Beeping is supported
[ 115.536825] io scheduler mq-deadline registered
[ 118.741818] igb 0000:04:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 118.853958] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 132.986522] fuse: init (API version 7.31)
[ 2481.471509] ------------[ cut here ]------------
[ 2481.526833] WARNING: CPU: 6 PID: 40 at kernel/rcu/tree.c:2238 rcu_core+0x401/0x450
[ 2481.617555] Modules linked in: fuse mq_deadline xt_MASQUERADE iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_u32 xt_multiport iptable_filter nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6 it87 hwmon_vid xt_tcpudp xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables msr uvcvideo snd_hda_codec_realtek videobuf2_vmalloc videobuf2_memops snd_usb_audio videobuf2_v4l2 snd_hwdep snd_hda_codec_generic snd_usbmidi_lib videodev snd_rawmidi snd_hda_intel videobuf2_common cdc_acm snd_intel_dspcfg snd_hda_codec snd_hda_core snd_seq snd_seq_device snd_pcm snd_timer bfq snd k10temp i2c_piix4 evdev acpi_cpufreq binfmt_misc ip_tables x_tables hid_generic amdgpu aesni_intel gpu_sched sr_mod ttm cdrom usbhid i2c_dev autofs4
[ 2482.436373] CPU: 6 PID: 40 Comm: ksoftirqd/6 Not tainted 5.5.8 #3
[ 2482.509388] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F11 12/06/2019
[ 2482.626157] RIP: 0010:rcu_core+0x401/0x450
[ 2482.675216] Code: fd ff ff 48 2b 15 df db f5 00 48 39 c2 7e 07 48 89 83 b0 00 00 00 48 8b 43 50 48 85 c0 75 c7 0f 0b eb c3 0f 0b e9 4f fc ff ff <0f> 0b eb b8 48 8b 15 bc db f5 00 48 89 93 c0 00 00 00 e9 72 ff ff
[ 2482.900310] RSP: 0018:ffff9e45c025be18 EFLAGS: 00010002
[ 2482.962912] RAX: 00000241c6680a98 RBX: ffff96ffdf1a0f00 RCX: ffff96ffc27d5410
[ 2483.049472] RDX: 0000000000000000 RSI: ffff9e45c025be18 RDI: ffff96ffdf1a0f50
[ 2483.134988] RBP: ffff96ffdf1a0f50 R08: 00000241c33a2d38 R09: 0000000000000100
[ 2483.220508] R10: 0000000000000200 R11: ffff96ffca694900 R12: 0000000000000246
[ 2483.306022] R13: ffff96ffd0f9b2c0 R14: 0000000000000000 R15: ffffffffa5005100
[ 2483.391539] FS: 0000000000000000(0000) GS:ffff96ffdf180000(0000) knlGS:0000000000000000
[ 2483.488512] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2483.558408] CR2: 00007f6c977201e0 CR3: 00000003e7152000 CR4: 00000000003406e0
[ 2483.643924] Call Trace:
[ 2483.673200] __do_softirq+0xfc/0x247
[ 2483.716007] run_ksoftirqd+0x21/0x30
[ 2483.758815] smpboot_thread_fn+0x195/0x230
[ 2483.807881] ? sort_range+0x20/0x20
[ 2483.849651] kthread+0x118/0x130
[ 2483.888293] ? kthread_create_worker_on_cpu+0x60/0x60
[ 2483.948813] ret_from_fork+0x22/0x40
[ 2483.991619] ---[ end trace e3c11c0962f755af ]---

Previously the box hung hard with a different error, trying to reproduce.
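
The warnings quoted so far in this report all come from rcu_core, at kernel/rcu/tree.c:2211 on the 5.3.0git bisect kernel and at kernel/rcu/tree.c:2238 on 5.5.8. A small shell sketch for pulling such sites out of a saved dmesg capture could look like the following. `warning_sites` is a hypothetical helper name (not part of the kernel tree or of this report), and the pattern assumes the `WARNING: CPU: N PID: N at file:line symbol` layout seen in the traces above.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only: list each distinct
# "WARNING: ... at <file>:<line> <symbol+offset>" site found in dmesg
# text read from stdin, prefixed by how many times it occurred.
warning_sites() {
    # Keep only WARNING lines and reduce them to "<file>:<line> <symbol>".
    sed -n 's/.*WARNING: CPU: [0-9]* PID: [0-9]* at \([^ ]*\) \([^ ]*\).*/\1 \2/p' |
        sort |            # group identical sites together
        uniq -c |         # count each group
        sed 's/^ *//'     # drop the leading padding added by uniq -c
}
```

Typical use would be `dmesg | warning_sites`, or `warning_sites < saved-dmesg.txt` when comparing captures from different kernel versions.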
So:
echo "Code: fd ff ff 48 2b 15 df db f5 00 48 39 c2 7e 07 48 89 83 b0 00 00 00 48 8b 43 50 48 85 c0 75 c7 0f 0b eb c3 0f 0b e9 4f fc ff ff <0f> 0b eb b8 48 8b 15 bc db f5 00 48 89 93 c0 00 00 00 e9 72 ff ff" | ./scripts/decodecode
Code: fd ff ff 48 2b 15 df db f5 00 48 39 c2 7e 07 48 89 83 b0 00 00 00 48 8b 43 50 48 85 c0 75 c7 0f 0b eb c3 0f 0b e9 4f fc ff ff <0f> 0b eb b8 48 8b 15 bc db f5 00 48 89 93 c0 00 00 00 e9 72 ff ff
All code
========
   0:   fd                      std
   1:   ff                      (bad)
   2:   ff 48 2b                decl   0x2b(%rax)
   5:   15 df db f5 00          adc    $0xf5dbdf,%eax
   a:   48 39 c2                cmp    %rax,%rdx
   d:   7e 07                   jle    0x16
   f:   48 89 83 b0 00 00 00    mov    %rax,0xb0(%rbx)
  16:   48 8b 43 50             mov    0x50(%rbx),%rax
  1a:   48 85 c0                test   %rax,%rax
  1d:   75 c7                   jne    0xffffffffffffffe6
  1f:   0f 0b                   ud2
  21:   eb c3                   jmp    0xffffffffffffffe6
  23:   0f 0b                   ud2
  25:   e9 4f fc ff ff          jmpq   0xfffffffffffffc79
  2a:*  0f 0b                   ud2    <-- trapping instruction
  2c:   eb b8                   jmp    0xffffffffffffffe6
  2e:   48 8b 15 bc db f5 00    mov    0xf5dbbc(%rip),%rdx # 0xf5dbf1
  35:   48 89 93 c0 00 00 00    mov    %rdx,0xc0(%rbx)
  3c:   e9                      .byte 0xe9
  3d:   72 ff                   jb     0x3e
  3f:   ff                      .byte 0xff
Code starting with the faulting instruction
===========================================
   0:   0f 0b                   ud2
   2:   eb b8                   jmp    0xffffffffffffffbc
   4:   48 8b 15 bc db f5 00    mov    0xf5dbbc(%rip),%rdx # 0xf5dbc7
   b:   48 89 93 c0 00 00 00    mov    %rdx,0xc0(%rbx)
  12:   e9                      .byte 0xe9
  13:   72 ff                   jb     0x14
  15:   ff                      .byte 0xff

And according to this: https://mudongliang.github.io/x86/html/file_module_x86_id_318.html
It's working as intended - I don't know why though, and it's basically beyond me.

Still happens, without threadirqs. Back to bisecting?

Do you have this for a reason?
mce=dont_log_ce

(In reply to Ian Kumlien from comment #43)
> Do you have this for a reason?
> mce=dont_log_ce

I do not want to see those.

They do however indicate faults, and not all faults are corrected correctly... We have had several issues with ECC not being able to correct things.

We have no ECC.

Actually, you do, in the CPU - all caches are ECC.

I'd also check the memory if you have no ECC.
5.3.18 crashes way less. No ECC RAM because there is no DDR4 ECC at 3200. The ECC on the PCI links etc. I cannot do anything about.

And a few bisect skips and still the unbootable kernel...

What does 'blocking' status mean?

5.5.10 also suffers from this issue.

Another `git bisect` try gave me another unbootable kernel, despite skipping several commits.
git bisect bad v5.4
git bisect good v5.3
git bisect skip (several times)
So what can I do?

5.6.2 managed to stay up for ~24h; this might be a sign that things have improved. Need more testing.

Still looking good after another 24h; what commit did fix this?

I did not yet see this specific bug in 5.6.3.

I did not yet see this specific bug in 5.6.4.

I do not know what commit fixed this but it appears fixed in 5.6.x.
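
For anyone retracing the bisect attempts described in this report, the session could be sketched roughly as below. This is only an outline under stated assumptions: it assumes a kernel git tree that carries the v5.3 and v5.4 tags named above, and `bisect_begin` and `bisect_save` are hypothetical wrapper names introduced here for illustration. The `git bisect` subcommands themselves (`start`, `good`, `bad`, `skip`, `log`, `replay`, `reset`) are standard git.

```shell
#!/bin/sh
# Sketch only; wrapper names are hypothetical, not from the report.

# Start a bisect between a known-bad and a known-good ref.
#   $1 = bad ref (e.g. v5.4), $2 = good ref (e.g. v5.3)
bisect_begin() {
    git bisect start "$1" "$2"
}

# After each step: build, install, boot and test the kernel, then run
# exactly one of the following by hand:
#   git bisect good   # the crash did not occur within the test window
#   git bisect bad    # the crash occurred
#   git bisect skip   # kernel is untestable (e.g. it cannot find the rootfs)
# git bisect skip also accepts a commit range if a whole stretch of
# history turns out to be unbootable.

# Save the decisions made so far, so the session can be resumed or
# attached to a bug report.
#   $1 = output file
bisect_save() {
    git bisect log > "$1"
}

# To resume from a saved log later:
#   git bisect reset
#   git bisect replay <saved log file>
```

Since the crash here took hours to appear, saving the log after every step means that an interrupted session does not have to be started again from v5.3 and v5.4.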