Bug 206191

Summary: 5.5.x, 5.4.x: PAGE FAULT crashes the system multiple times / 24h
Product: Memory Management Reporter: udo (udovdh)
Component: Page Allocator Assignee: other_other
Status: RESOLVED INSUFFICIENT_DATA    
Severity: blocking CC: akpm, alexdeucher, Ian.kumlien, kernel, postix, schweinefilet
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.5.x, 5.4.x, 5.3.x Subsystem:
Regression: No Bisected commit-id:
Attachments: crashdump
crashdump2-1
crashdump2-2
crashdump2-3
crashdump2-4
kernel config
adapted patch to 5.4.15 from https://lkml.org/lkml/diff/2019/12/10/333/1
dmesg from one rare instance where the system did not fully crash.
dmesg: Null pointer
dmesg: another bug
latest kernel config

Description udo 2020-01-13 15:57:17 UTC
Created attachment 286783 [details]
crashdump

Previously I emailed the linux-kernel list with:

[  246.911769] Call Trace:
[  246.911777]  __do_softirq+0xfc/0x247
[  246.911783]  run_ksoftirqd+0x21/0x30
[  246.911787]  smpboot_thread_fn+0x195/0x230
[  246.911789]  ? sort_range+0x20/0x20
[  246.911792]  kthread+0x118/0x130
[  246.911795]  ? kthread_create_worker_on_cpu+0x60/0x60
[  246.911797]  ret_from_fork+0x22/0x40
[  246.911800] ---[ end trace 6f223c45fc8e7e99 ]---

See https://marc.info/?l=linux-kernel&m=157815016419071&w=2
The system continued running after this.

But now the situation is more serious: I get multiple crashes per 24h.
No info in /var/log/messages, so I attached a vt420 terminal as console, which yielded me the attached photo.
Same trace, but this time followed by an actual crash.

This is on an AMD Ryzen 5 3400G with Radeon Vega Graphics on a Gigabyte X570 AORUS PRO, running Fedora 31 with git mesa and a kernel.org kernel (5.4.9 currently).

Please fix.
Comment 1 udo 2020-01-13 16:47:44 UTC
Firefox appears to be a trigger for this bug.
Comment 2 udo 2020-01-14 03:59:30 UTC
Created attachment 286787 [details]
crashdump2-1

new crashdump, more complete
Comment 3 udo 2020-01-14 04:00:00 UTC
Created attachment 286789 [details]
crashdump2-2
Comment 4 udo 2020-01-14 04:00:28 UTC
Created attachment 286791 [details]
crashdump2-3
Comment 5 udo 2020-01-14 04:01:10 UTC
Created attachment 286793 [details]
crashdump2-4

more complete crashdump, final photo
Comment 6 udo 2020-01-14 04:03:22 UTC
I managed to capture a second, more complete crashdump. Photos are attached to this bug.
Comment 7 txrx 2020-01-15 11:36:10 UTC
What bios version are you running?
Comment 8 udo 2020-01-18 08:55:50 UTC
I see a pagefault, so is this a memory management bug?

(In reply to txrx from comment #7)
> What bios version are you running?

https://bugzilla.kernel.org/attachment.cgi?id=286787
Comment 9 udo 2020-01-18 10:59:53 UTC
This is the Call Trace for a non-crash happening: https://bugzilla.kernel.org/show_bug.cgi?id=206245
Comment 10 udo 2020-01-18 11:06:05 UTC
(In reply to udo from comment #0)
> Created attachment 286783 [details]
> crashdump
> 
> Previously I emailed the linux-kernel list with:
> 
> [  246.911769] Call Trace:

See bug https://bugzilla.kernel.org/show_bug.cgi?id=206245
Comment 11 udo 2020-01-18 13:27:59 UTC
5.4.13 also has this issue.
Comment 12 udo 2020-01-19 09:45:10 UTC
Created attachment 286881 [details]
kernel config
Comment 13 udo 2020-01-20 16:16:55 UTC
5.3.18 shows a similar type of behaviour.
Comment 14 Alex Deucher 2020-01-20 16:54:00 UTC
Can you bisect?
Comment 15 udo 2020-01-21 02:53:50 UTC
Sure, I can, once I find the howto to help me remember.
But I forgot what was the last known good version.
Also this will be a lengthy process.
Comment 16 udo 2020-01-24 17:01:23 UTC
But now 5.3.18 has been up for four days, so the issue in Comment 13 was perhaps a different one?

For bisect:
Can we call 5.3.18 a good version and 5.4 a bad version?
Can you agree with those?
Comment 17 udo 2020-01-27 13:52:23 UTC
I found this patch https://lkml.org/lkml/diff/2019/12/10/333/1 in an LKML thread whose initial error messages look similar to what I found for this issue.
I adapted the patch to 5.4.15 to test.

Bisect has to wait.
Bisect between v5.3 and v5.4 has done two steps, 10 or so to go.
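
The "two steps done, 10 or so to go" estimate follows from bisection halving the candidate range on every round. A rough sketch of that arithmetic in Python; the function name and the 4096-commit range are hypothetical and not measured from the v5.3..v5.4 history, and skipped commits can add rounds:

```python
def estimated_bisect_steps(commit_count: int) -> int:
    """Rough number of build-and-test rounds for commit_count candidate commits."""
    if commit_count < 1:
        raise ValueError("commit_count must be at least 1")
    # Each round halves the range, so the answer is ceil(log2(commit_count)).
    # (n - 1).bit_length() computes that without floating point.
    return (commit_count - 1).bit_length()


# Hypothetical range size chosen only to illustrate the arithmetic.
print(estimated_bisect_steps(4096))
```

A range of that size would need twelve rounds, which is in line with the estimate above.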
Comment 18 udo 2020-01-27 14:21:44 UTC
Created attachment 286995 [details]
adapted patch to 5.4.15 from https://lkml.org/lkml/diff/2019/12/10/333/1
Comment 19 udo 2020-01-27 15:30:40 UTC
Patch did not fix the core issue. Back to bisecting.
Comment 20 udo 2020-01-29 13:31:49 UTC
Created attachment 287017 [details]
dmesg from one rare instance where the system did not fully crash.
Comment 21 udo 2020-01-29 13:33:50 UTC
What can a developer tell from the info in the attachment to comment 20?
Comment 22 udo 2020-02-03 16:34:00 UTC
Created attachment 287097 [details]
dmesg: Null pointer
Comment 23 udo 2020-02-07 14:15:08 UTC
Created attachment 287225 [details]
dmesg: another bug

these bugs make it hard to bisect
Comment 24 udo 2020-02-07 15:27:11 UTC
kernel 5.5.2 also has this issue.
Comment 25 udo 2020-02-08 12:34:48 UTC
While bisecting I found this in dmesg:

[22803.406500] evince[29402]: segfault at 7f6af8021000 ip 00007f6b3a9c7d6b sp 00007ffd81eeb648 error 6 in libc-2.30.so[7f6b3a888000+14f000]
[22803.480177] Code: 47 20 c5 fe 7f 44 17 c0 c5 fe 7f 47 40 c5 fe 7f 44 17 a0 c5 fe 7f 47 60 c5 fe 7f 44 17 80 48 01 fa 48 83 e2 80 48 39 d1 74 ba <c5> fd 7f 01 c5 fd 7f 41 20 c5 fd 7f 41 40 c5 fd 7f 41 60 48 81 c1
[23067.538622] WARNING: CPU: 4 PID: 30 at kernel/rcu/tree.c:2211 rcu_core+0x3f6/0x450
[23067.583982] Modules linked in: fuse mq_deadline xt_MASQUERADE iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_u32 xt_multiport iptable_filter nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT it87 nf_reject_ipv6 hwmon_vid xt_tcpudp xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables msr uvcvideo videobuf2_vmalloc snd_usb_audio videobuf2_memops videobuf2_v4l2 snd_hwdep snd_usbmidi_lib videodev snd_rawmidi videobuf2_common cdc_acm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_intel_nhlt snd_hda_codec snd_hda_core snd_seq snd_seq_device snd_pcm k10temp snd_timer i2c_piix4 snd bfq evdev acpi_cpufreq binfmt_misc ip_tables x_tables amdgpu hid_generic gpu_sched aesni_intel ttm sr_mod cdrom usbhid i2c_dev autofs4
[23067.992351] CPU: 4 PID: 30 Comm: ksoftirqd/4 Tainted: G        W         5.3.0git-11783-g0a3775e4f883 #7
[23068.049171] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F11 12/06/2019
[23068.107562] RIP: 0010:rcu_core+0x3f6/0x450
[23068.132087] Code: fd ff ff 48 2b 15 ea 07 f6 00 48 39 c2 7e 07 48 89 83 b0 00 00 00 48 8b 43 50 48 85 c0 75 c7 0f 0b eb c3 0f 0b e9 5a fc ff ff <0f> 0b eb b8 48 8b 15 c7 07 f6 00 48 89 93 c0 00 00 00 e9 72 ff ff
[23068.244630] RSP: 0018:ffffa6aa00203e18 EFLAGS: 00010002
[23068.275932] RAX: 000014faf2aab325 RBX: ffffa443df11ee00 RCX: ffffa4438e2cd710
[23068.318692] RDX: 0000000000000000 RSI: ffffa6aa00203e18 RDI: ffffa443df11ee50
[23068.361451] RBP: ffffa443df11ee50 R08: ffffa4438e2cd710 R09: 0000000000000100
[23068.404206] R10: 0000000000000080 R11: ffffa443df11e1f8 R12: 0000000000000246
[23068.446968] R13: ffffa443ddf884c0 R14: 0000000000000000 R15: ffffffffab005110
[23068.489727] FS:  0000000000000000(0000) GS:ffffa443df100000(0000) knlGS:0000000000000000
[23068.538215] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[23068.572640] CR2: 00003080cd8c26a0 CR3: 0000000027aca000 CR4: 00000000003406e0
[23068.615396] Call Trace:
[23068.630035]  __do_softirq+0xfc/0x24c
[23068.651439]  run_ksoftirqd+0x21/0x30
[23068.672842]  smpboot_thread_fn+0x195/0x230
[23068.697374]  ? sort_range+0x20/0x20
[23068.718259]  kthread+0x10d/0x130
[23068.737578]  ? kthread_create_worker_on_cpu+0x60/0x60
[23068.767843]  ret_from_fork+0x22/0x40
[23068.789244] ---[ end trace 8d1fe1b346997d45 ]---
[25100.691238] Web Content[30651]: segfault at 7faba69fffe8 ip 00007fabbcd97c43 sp 00007ffff8aaf518 error 4 in libxul.so[7fabba60f000+3a97000]

5.3.0git-11783-g0a3775e4f883
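
The two segfault lines in the log above carry an x86 page-fault error code ("error 6" and "error 4"). A minimal sketch for reading it, assuming the standard low three bits (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode) and assuming the kernel prints the code in hex; the helper name and the regular expression are made up for illustration:

```python
import re

# Matches lines such as:
# [22803.406500] evince[29402]: segfault at ... ip ... sp ... error 6 in libc-2.30.so[...]
SEGFAULT_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<comm>.+?)\[(?P<pid>\d+)\]: segfault at [0-9a-f]+ "
    r"ip [0-9a-f]+ sp [0-9a-f]+ error (?P<err>[0-9a-f]+)"
)


def decode_fault(line):
    """Return the process name and decoded error bits, or None if the line does not match."""
    m = SEGFAULT_RE.match(line)
    if m is None:
        return None
    code = int(m.group("err"), 16)  # assumption: error code is printed in hex
    return {
        "comm": m.group("comm"),
        "cause": "protection violation" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
    }


line = ("[22803.406500] evince[29402]: segfault at 7f6af8021000 ip 00007f6b3a9c7d6b "
        "sp 00007ffd81eeb648 error 6 in libc-2.30.so[7f6b3a888000+14f000]")
print(decode_fault(line))
```

Both faults decode to user-mode accesses to pages that were not present: a write for evince and a read for Web Content. That alone does not say whether the kernel or the application is at fault.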
Comment 26 udo 2020-02-09 07:48:39 UTC
Bisecting landed me on a kernel that cannot find the root fs, despite the kernel command line not changing.
Comment 27 udo 2020-02-09 14:00:19 UTC
Are the many segfaults in the software a sign of this kernel issue?
Why do I see these and not so many others, judging from reactions?
Comment 28 udo 2020-02-15 16:03:41 UTC
New bisect started between 5.2 and 5.4.
Comment 29 udo 2020-02-16 15:22:51 UTC
This new bisect immediately results in a kernel that does not find the rootfs and instead lists a few block devices.
So how can I find a commit causing the page fault issue this way?
Comment 30 udo 2020-02-17 11:45:27 UTC
This is with 5.3.0-rc3git-01632-gbe91233b1053; how to circumvent?
Comment 31 udo 2020-03-07 10:39:23 UTC
5.5.8, a similar trace to the one at the start of this bug.
I cannot proceed with the bisect, I need help with this issue.
So what does the 'blocking' status mean at this time?

[21573.900298] traps: chrome[24334] trap int3 ip:55eec824b037 sp:7fffd37bcb50 error:0 in chrome[55eec4040000+7287000]
[21573.978083] traps: chrome[24325] trap int3 ip:55ec1b96b6ae sp:7ffef1029eb0 error:0 in chrome[55ec176e3000+7287000]
[21677.749442] ------------[ cut here ]------------
[21677.777125] WARNING: CPU: 0 PID: 9 at kernel/rcu/tree.c:2238 rcu_core+0x401/0x450
[21677.821957] Modules linked in: fuse mq_deadline ip6t_REJECT nf_reject_ipv6 xt_state ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast xt_MASQUERADE iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_u32 xt_multiport xt_tcpudp xt_conntrack it87 nf_conntrack hwmon_vid nf_defrag_ipv6 nf_defrag_ipv4 msr iptable_filter snd_usb_audio uvcvideo snd_hwdep videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_usbmidi_lib videodev videobuf2_common snd_rawmidi cdc_acm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_seq snd_seq_device i2c_piix4 k10temp snd_pcm snd_timer snd bfq evdev acpi_cpufreq binfmt_misc ip_tables x_tables sr_mod cdrom amdgpu gpu_sched aesni_intel ttm hid_generic usbhid i2c_dev autofs4
[21678.231368] CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 5.5.8 #3
[21678.267353] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F11 12/06/2019
[21678.325743] RIP: 0010:rcu_core+0x401/0x450
[21678.350271] Code: fd ff ff 48 2b 15 df db f5 00 48 39 c2 7e 07 48 89 83 b0 00 00 00 48 8b 43 50 48 85 c0 75 c7 0f 0b eb c3 0f 0b e9 4f fc ff ff <0f> 0b eb b8 48 8b 15 bc db f5 00 48 89 93 c0 00 00 00 e9 72 ff ff
[21678.462816] RSP: 0018:ffffae44c0117e18 EFLAGS: 00010006
[21678.494117] RAX: 000013b75c8178a6 RBX: ffffa06cdf020f00 RCX: ffffa06cb5a1fe90
[21678.536874] RDX: 0000000000000000 RSI: ffffae44c0117e18 RDI: ffffa06cdf020f50
[21678.579634] RBP: ffffa06cdf020f50 R08: ffffa06cb5a1fe90 R09: 0000000000000100
[21678.622393] R10: 0000000000000002 R11: 000000000001831c R12: 0000000000000246
[21678.665150] R13: ffffa06cd0e9ee00 R14: 0000000000000000 R15: 0000000000000000
[21678.707911] FS:  0000000000000000(0000) GS:ffffa06cdf000000(0000) knlGS:0000000000000000
[21678.756395] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21678.790822] CR2: 00007f3d27d3f000 CR3: 000000021ce3c000 CR4: 00000000003406f0
[21678.833577] Call Trace:
[21678.848219]  __do_softirq+0xfc/0x247
[21678.869622]  run_ksoftirqd+0x21/0x30
[21678.891027]  smpboot_thread_fn+0x195/0x230
[21678.915559]  ? sort_range+0x20/0x20
[21678.936446]  kthread+0x118/0x130
[21678.955767]  ? kthread_create_worker_on_cpu+0x60/0x60
[21678.986022]  ret_from_fork+0x22/0x40
[21679.007427] ---[ end trace 76d0aaccffc55ecc ]---
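
To check whether two warnings come from the same place, the WARNING line can be reduced to file, line and function. A minimal sketch, using the line above and the one from comment 25 as input; the regular expression and helper name are illustrative assumptions, not an existing tool:

```python
import re

# Matches e.g. "WARNING: CPU: 0 PID: 9 at kernel/rcu/tree.c:2238 rcu_core+0x401/0x450"
WARNING_RE = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at (?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def warning_site(dmesg_line):
    """Return (file, line, function) for a WARNING line, or None if it does not match."""
    m = WARNING_RE.search(dmesg_line)
    if m is None:
        return None
    return m.group("file"), int(m.group("line")), m.group("func")


old = warning_site("[23067.538622] WARNING: CPU: 4 PID: 30 at kernel/rcu/tree.c:2211 rcu_core+0x3f6/0x450")
new = warning_site("[21677.777125] WARNING: CPU: 0 PID: 9 at kernel/rcu/tree.c:2238 rcu_core+0x401/0x450")

# Same file and function; the line number is expected to shift between kernel builds.
print(old[0] == new[0] and old[2] == new[2])
```

Both lines report rcu_core in kernel/rcu/tree.c; only the line number and offset differ, which is expected between different kernel builds.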
Comment 32 Ian Kumlien 2020-03-07 12:14:42 UTC
Dist?
Kernel commandline?
How much memory?

I have two Ryzen boxes and I'm not seeing this...
Comment 33 udo 2020-03-07 12:20:40 UTC
We run Fedora 31 with a kernel.org kernel and git mesa.

kernel commandline:
ro root=/dev/myvg/rootlv noexec=on noexec32=on vga=0xF06 SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us acpi_enforce_resources=lax fbcon=font:VGA8x16 cgroup_disable=memory threadirqs plymouth.enable=0 rd.plymouth=0 mce=dont_log_ce panic=0 rd.lvm.vg=myvg zswap.enabled=0 rd.auto=1 audit=0 systemd.log_level=warning net.ifnames=0 clocksource=hpet rd.lvm.vg=ssdvg rd.luks.options=discard elevator=mq-deadline amdgpu.gttsize=8192 amdgpu.lockup_timeout=1000 amdgpu.gpu_recovery=1 amdgpu.noretry=0 amdgpu.ppfeaturemask=0xfffd3fff console=ttyS1,19200n8 console=tty0 fsck.repair=yes fsck.mode=force

(from extlinux.conf)

16GB of DDR 3200 RAM, latest BIOS from Gigabyte
Comment 34 udo 2020-03-07 12:34:30 UTC
Created attachment 287819 [details]
latest kernel config
Comment 35 Ian Kumlien 2020-03-07 14:35:37 UTC
On my old AMD system I got odd bugs and crashes with "threadirqs"; try without it.
Comment 36 udo 2020-03-07 14:45:23 UTC
threadirqs has been there for quite a while; it helps make use of the multicore capabilities.
I'll nevertheless try without it soon.
Comment 37 Ian Kumlien 2020-03-07 14:47:42 UTC
It forces drivers not explicitly marked as "not working with threaded irqs" to use threaded irqs...

Some drivers might have problems with it without being marked as such.
Comment 38 udo 2020-03-07 15:31:08 UTC
If the omission of threadirqs helps the stability, then how can we find the offending driver?
Comment 39 Ian Kumlien 2020-03-07 18:07:01 UTC
I don't know, it could be side effects - either way I'm not the right person to find it... :)
Comment 40 udo 2020-03-08 08:49:32 UTC
On 5.5.8 without threadirqs I still get this one.

[  110.812821] pps pps0: source "/dev/ttyS0" added
[  110.997022] it87: Found IT8733E chip at 0xa60, revision 3
[  110.997071] it87: Beeping is supported
[  115.536825] io scheduler mq-deadline registered
[  118.741818] igb 0000:04:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[  118.853958] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  132.986522] fuse: init (API version 7.31)
[ 2481.471509] ------------[ cut here ]------------
[ 2481.526833] WARNING: CPU: 6 PID: 40 at kernel/rcu/tree.c:2238 rcu_core+0x401/0x450
[ 2481.617555] Modules linked in: fuse mq_deadline xt_MASQUERADE iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_u32 xt_multiport iptable_filter nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6 it87 hwmon_vid xt_tcpudp xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables msr uvcvideo snd_hda_codec_realtek videobuf2_vmalloc videobuf2_memops snd_usb_audio videobuf2_v4l2 snd_hwdep snd_hda_codec_generic snd_usbmidi_lib videodev snd_rawmidi snd_hda_intel videobuf2_common cdc_acm snd_intel_dspcfg snd_hda_codec snd_hda_core snd_seq snd_seq_device snd_pcm snd_timer bfq snd k10temp i2c_piix4 evdev acpi_cpufreq binfmt_misc ip_tables x_tables hid_generic amdgpu aesni_intel gpu_sched sr_mod ttm cdrom usbhid i2c_dev autofs4
[ 2482.436373] CPU: 6 PID: 40 Comm: ksoftirqd/6 Not tainted 5.5.8 #3
[ 2482.509388] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F11 12/06/2019
[ 2482.626157] RIP: 0010:rcu_core+0x401/0x450
[ 2482.675216] Code: fd ff ff 48 2b 15 df db f5 00 48 39 c2 7e 07 48 89 83 b0 00 00 00 48 8b 43 50 48 85 c0 75 c7 0f 0b eb c3 0f 0b e9 4f fc ff ff <0f> 0b eb b8 48 8b 15 bc db f5 00 48 89 93 c0 00 00 00 e9 72 ff ff
[ 2482.900310] RSP: 0018:ffff9e45c025be18 EFLAGS: 00010002
[ 2482.962912] RAX: 00000241c6680a98 RBX: ffff96ffdf1a0f00 RCX: ffff96ffc27d5410
[ 2483.049472] RDX: 0000000000000000 RSI: ffff9e45c025be18 RDI: ffff96ffdf1a0f50
[ 2483.134988] RBP: ffff96ffdf1a0f50 R08: 00000241c33a2d38 R09: 0000000000000100
[ 2483.220508] R10: 0000000000000200 R11: ffff96ffca694900 R12: 0000000000000246
[ 2483.306022] R13: ffff96ffd0f9b2c0 R14: 0000000000000000 R15: ffffffffa5005100
[ 2483.391539] FS:  0000000000000000(0000) GS:ffff96ffdf180000(0000) knlGS:0000000000000000
[ 2483.488512] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2483.558408] CR2: 00007f6c977201e0 CR3: 00000003e7152000 CR4: 00000000003406e0
[ 2483.643924] Call Trace:
[ 2483.673200]  __do_softirq+0xfc/0x247
[ 2483.716007]  run_ksoftirqd+0x21/0x30
[ 2483.758815]  smpboot_thread_fn+0x195/0x230
[ 2483.807881]  ? sort_range+0x20/0x20
[ 2483.849651]  kthread+0x118/0x130
[ 2483.888293]  ? kthread_create_worker_on_cpu+0x60/0x60
[ 2483.948813]  ret_from_fork+0x22/0x40
[ 2483.991619] ---[ end trace e3c11c0962f755af ]---


Previously the box hung hard with a different error, trying to reproduce.
Comment 41 Ian Kumlien 2020-03-08 09:37:27 UTC
So:
echo "Code: fd ff ff 48 2b 15 df db f5 00 48 39 c2 7e 07 48 89 83 b0 00 00 00 48 8b 43 50 48 85 c0 75 c7 0f 0b eb c3 0f 0b e9 4f fc ff ff <0f> 0b eb b8 48 8b 15 bc db f5 00 48 89 93 c0 00 00 00 e9 72 ff ff" | ./scripts/decodecode

Code: fd ff ff 48 2b 15 df db f5 00 48 39 c2 7e 07 48 89 83 b0 00 00 00 48 8b 43 50 48 85 c0 75 c7 0f 0b eb c3 0f 0b e9 4f fc ff ff <0f> 0b eb b8 48 8b 15 bc db f5 00 48 89 93 c0 00 00 00 e9 72 ff ff
All code
========
   0:   fd                      std    
   1:   ff                      (bad)  
   2:   ff 48 2b                decl   0x2b(%rax)
   5:   15 df db f5 00          adc    $0xf5dbdf,%eax
   a:   48 39 c2                cmp    %rax,%rdx
   d:   7e 07                   jle    0x16
   f:   48 89 83 b0 00 00 00    mov    %rax,0xb0(%rbx)
  16:   48 8b 43 50             mov    0x50(%rbx),%rax
  1a:   48 85 c0                test   %rax,%rax
  1d:   75 c7                   jne    0xffffffffffffffe6
  1f:   0f 0b                   ud2    
  21:   eb c3                   jmp    0xffffffffffffffe6
  23:   0f 0b                   ud2    
  25:   e9 4f fc ff ff          jmpq   0xfffffffffffffc79
  2a:*  0f 0b                   ud2             <-- trapping instruction
  2c:   eb b8                   jmp    0xffffffffffffffe6
  2e:   48 8b 15 bc db f5 00    mov    0xf5dbbc(%rip),%rdx        # 0xf5dbf1
  35:   48 89 93 c0 00 00 00    mov    %rdx,0xc0(%rbx)
  3c:   e9                      .byte 0xe9
  3d:   72 ff                   jb     0x3e
  3f:   ff                      .byte 0xff

Code starting with the faulting instruction
===========================================
   0:   0f 0b                   ud2    
   2:   eb b8                   jmp    0xffffffffffffffbc
   4:   48 8b 15 bc db f5 00    mov    0xf5dbbc(%rip),%rdx        # 0xf5dbc7
   b:   48 89 93 c0 00 00 00    mov    %rdx,0xc0(%rbx)
  12:   e9                      .byte 0xe9
  13:   72 ff                   jb     0x14
  15:   ff                      .byte 0xff

And according to this:
https://mudongliang.github.io/x86/html/file_module_x86_id_318.html

It's working as intended - don't know why though and it's basically beyond me
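
The bytes in angle brackets on the Code: line mark where the trap happened, and here they are 0f 0b, i.e. ud2. Assuming x86 kernels of this era emit ud2 for WARN_ON()/BUG() checks, this fits the WARNING at kernel/rcu/tree.c:2238 reported in comment 40. A small Python sketch that pulls the marked bytes out of such a line without needing scripts/decodecode; the function name is hypothetical:

```python
UD2 = bytes([0x0F, 0x0B])  # x86 encoding of the ud2 instruction


def trapping_bytes(code_line, count=2):
    """Return `count` bytes starting at the byte marked <..> in a 'Code:' line."""
    tokens = code_line.split("Code:", 1)[-1].split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            # The marked token plus the bytes that follow it.
            marked = [tok.strip("<>")] + tokens[i + 1:i + count]
            return bytes(int(b, 16) for b in marked)
    return None  # no marked byte found


code = ("Code: fd ff ff 48 2b 15 df db f5 00 48 39 c2 7e 07 48 89 83 b0 00 00 00 "
        "48 8b 43 50 48 85 c0 75 c7 0f 0b eb c3 0f 0b e9 4f fc ff ff <0f> 0b eb b8 "
        "48 8b 15 bc db f5 00 48 89 93 c0 00 00 00 e9 72 ff ff")
print(trapping_bytes(code) == UD2)
```

This only identifies the trapping opcode; it does not explain why the check in rcu_core fired.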
Comment 42 udo 2020-03-08 10:28:37 UTC
Still happens, without threadirqs.
Back to bisecting?
Comment 43 Ian Kumlien 2020-03-08 10:33:57 UTC
Do you have this for a reason?
mce=dont_log_ce
Comment 44 udo 2020-03-08 10:37:10 UTC
(In reply to Ian Kumlien from comment #43)
> Do you have this for a reason?
> mce=dont_log_ce

I do not want to see those.
Comment 45 Ian Kumlien 2020-03-08 10:39:22 UTC
They do, however, indicate faults, and not all faults are corrected correctly...

We have had several issues with ECC not being able to correct things
Comment 46 udo 2020-03-08 10:57:50 UTC
We have no ECC.
Comment 47 Ian Kumlien 2020-03-08 11:04:05 UTC
Actually, you do, in the CPU: all caches are ECC.

I'd also check the memory if you have no ECC.
Comment 48 udo 2020-03-08 11:06:24 UTC
5.3.18 crashes way less.

No ECC RAM because no DDR4 ECC at 3200.
The ECC on the PCI links etc I cannot do anything about.
Comment 49 udo 2020-03-14 12:59:34 UTC
And a few bisect skips and still the unbootable kernel...
What does 'blocking' status mean?
Comment 50 udo 2020-03-20 15:49:49 UTC
5.5.10 also suffers from this issue.
Comment 51 udo 2020-03-29 12:01:35 UTC
Another `git bisect` try gave me another unbootable kernel, despite skipping several commits.

git bisect bad v5.4
git bisect good v5.3
git bisect skip (several times)

So what can I do?
Comment 52 udo 2020-04-04 11:36:19 UTC
5.6.2 managed to stay up for ~24h, this might be a sign that things have improved.
Need more testing.
Comment 53 udo 2020-04-05 14:43:50 UTC
Still looking good after another 24h; which commit fixed this?
Comment 54 udo 2020-04-15 11:17:21 UTC
I did not yet see this specific bug in 5.6.3.
Comment 55 udo 2020-04-17 13:37:33 UTC
I did not yet see this specific bug in 5.6.4.
Comment 56 udo 2020-04-18 11:47:24 UTC
I do not know what commit fixed this but it appears fixed in 5.6.x.