Bug 205017
Summary: | Ryzen 3600 CPU Lockup after BIOS update | ||
---|---|---|---|
Product: | Memory Management | Reporter: | JerryD (jvdelisle2) |
Component: | Page Allocator | Assignee: | Andrew Morton (akpm) |
Status: | RESOLVED INVALID | ||
Severity: | normal | CC: | belliash, bp, jamietre, jan.public, postix, rmalinverni, tom |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 5.3.6, 5.2.15 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | dmesg from x570 motherboard; dmesg showing some traces; Captured a photo of the console at lockup; minicom.cap as requested |
Description
JerryD
2019-09-27 02:06:22 UTC
Created attachment 285205 [details]
dmesg from x570 motherboard
Attached copy of dmesg from initial bug report on 204811.
Built and installed stable 5.3.1 kernel and it appears to be OK on F3 bios. I will reflash bios to F5 and see if problem continues or not.
The 5.3.1 Kernel with the F5b bios (latest) gave two call traces at about 160 seconds. The first was an invalid op code: 0000 [#1] SMP NOOPT......could not read anymore. I was in the process of copy and paste to this bug report when it locked so the view is blocked on the screen. Can not ssh into it.
I wanted to add that the 5.1.17 kernel just released on Fedora is also giving a pcie related error. I will post that one elsewhere.
Finally got something:
[ 247.947689] fbcon: Taking over console
[ 248.049630] Console: switching to colour frame buffer device 320x90
[ 271.013621] systemd-journald[770]: File /var/log/journal/6fc51f73be354417bc87f2b5ba180e0e/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
[ 271.213585] usb 1-3.2: reset high-speed USB device number 5 using xhci_hcd
[ 655.858401] BUG: unable to handle page fault for address: 0000000000002088
[ 655.858412] #PF: supervisor read access in kernel mode
[ 655.858415] #PF: error_code(0x0000) - not-present page
[ 655.858418] PGD 0 P4D 0
[ 655.858423] Oops: 0000 [#1] SMP NOPTI
[ 655.858427] CPU: 10 PID: 2024 Comm: dnf Not tainted 5.3.6-200.fc30.x86_64 #1
[ 655.858430] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F5b 09/17/2019
[ 655.858438] RIP: 0010:__alloc_pages_nodemask+0x132/0x340
[ 655.858442] Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 77 01 00 00 <3b> 77 08 0f 82 6e 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
[ 655.858448] RSP: 0000:ffff9d01c093bd48 EFLAGS: 00010246
[ 655.858452] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
[ 655.858456] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000002080
[ 655.858459] RBP: 0000000000100dca R08: ffffffff8ac617e0 R09: 0000000000000000
[ 655.858463] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 655.858466] R13: 0000000000100dca R14: 0000000000000081 R15: 0000000000000000
[ 655.858470] FS: 00007fe88deb8680(0000) GS:ffff8a74cea80000(0000) knlGS:0000000000000000
[ 655.858474] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 655.858478] CR2: 0000000000002088 CR3: 00000003d2ce0000 CR4: 0000000000340ee0
[ 655.858481] Call Trace:
[ 655.858487] alloc_pages_vma+0xcc/0x170
[ 655.858491] ? mem_cgroup_commit_charge+0xcb/0x1a0
[ 655.858495] __handle_mm_fault+0xa2e/0x1ab0
[ 655.858499] ? __fsnotify_parent+0x9f/0x140
[ 655.858503] handle_mm_fault+0xc4/0x1e0
[ 655.858507] do_user_addr_fault+0x1c4/0x440
[ 655.858511] do_page_fault+0x31/0x110
[ 655.858515] page_fault+0x3e/0x50
[ 655.858519] RIP: 0033:0x7fe88e110515
[ 655.858902] Code: 83 e8 01 4c 89 f1 48 2b 0d b8 f3 25 00 49 8b 56 08 48 be ab aa aa aa aa aa aa aa 48 c1 f9 04 48 0f af ce 48 8d ba 00 10 00 00 <c7> 42 24 ff ff 00 00 89 4a 20 49 89 7e 08 41 89 46 10 85 c0 0f 84
[ 655.859603] RSP: 002b:00007ffe015ce870 EFLAGS: 00010a17
[ 655.860311] RAX: 0000000000000034 RBX: 0000000000000005 RCX: 000000000000001a
[ 655.861009] RDX: 00007fe87d979000 RSI: aaaaaaaaaaaaaaab RDI: 00007fe87d97a000
[ 655.861716] RBP: 0000000000000058 R08: 00007fe88e336020 R09: 00007fe88e336020
[ 655.862419] R10: 0000000000000000 R11: 00007ffe015ce910 R12: 000000000000000a
[ 655.863467] R13: 00007fe88e335fe0 R14: 00005601be8c8470 R15: 00007fe87d978090
[ 655.864182] Modules linked in: xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables sunrpc vfat fat edac_mce_amd kvm_amd snd_hda_codec_hdmi snd_hda_codec_generic ledtrig_audio uvcvideo snd_hda_intel videobuf2_vmalloc kvm snd_usb_audio videobuf2_memops snd_hda_codec snd_hda_core snd_usbmidi_lib irqbypass snd_rawmidi videobuf2_v4l2 snd_hwdep videobuf2_common snd_seq crct10dif_pclmul snd_seq_device crc32_pclmul snd_pcm videodev snd_timer snd sp5100_tco joydev mc ghash_clmulni_intel wmi_bmof i2c_piix4 k10temp ccp soundcore acpi_cpufreq hid_logitech_hidpp hid_logitech_dj nouveau
[ 655.864203] video mxm_wmi drm_kms_helper ttm drm crc32c_intel igb dca i2c_algo_bit wmi pinctrl_amd
[ 655.868479] CR2: 0000000000002088
[ 655.869370] ---[ end trace 82a2b57327546202 ]---
[ 655.870251] RIP: 0010:__alloc_pages_nodemask+0x132/0x340
[ 655.871135] Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 77 01 00 00 <3b> 77 08 0f 82 6e 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
[ 655.872060] RSP: 0000:ffff9d01c093bd48 EFLAGS: 00010246
[ 655.872985] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
[ 655.873978] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000002080
[ 655.874900] RBP: 0000000000100dca R08: ffffffff8ac617e0 R09: 0000000000000000
[ 655.875799] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 655.876693] R13: 0000000000100dca R14: 0000000000000081 R15: 0000000000000000
[ 655.877688] FS: 00007fe88deb8680(0000) GS:ffff8a74cea80000(0000) knlGS:0000000000000000
[ 655.878571] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 655.879449] CR2: 0000000000002088 CR3: 00000003d2ce0000 CR4: 0000000000340ee0
$ uname -a
Linux quasar 5.3.6-200.fc30.x86_64 #1 SMP Mon Oct 14 13:11:01 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Created attachment 285555 [details]
dmesg showing some traces
I managed to switch to console mode and then ssh into this machine with the latest ABBA bios and latest Fedora kernel to capture this. I am suspicious of the graphics card and nouveau driver. Under the older bios I was getting a continuous AER message, so I used pci=noaer to quiet that one. My googling says that the AER issue should be fixed by the latest AGESA, so I have not tried without pci=noaer yet.
I did notice that this machine locked up instantly when I tried to go from console mode (CTRL-ALT-F3) back to the typical gnome desktop (CTRL-ALT-F1).
While in this current running state I am going to do a gcc build to stress this a little, but without any gui running.
hmm What does this mean?
[ 7.011246] ACPI Warning: SystemIO range 0x0000000000000B00-0x0000000000000B08 conflicts with OpRegion 0x0000000000000B00-0x0000000000000B0F (\GSA1.SMBI) (20190703/utaddress-204)
[ 7.011255] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
Created attachment 285557 [details]
Captured a photo of the console at lockup
Photo right after lockup.
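A quick cross-check of the first oops in the description, offered as a sketch rather than a diagnosis: the faulting address in CR2 (0x2088) sits exactly 8 bytes above RDI (0x2080), and the marked opcode bytes 3b 77 08 decode to cmp 0x8(%rdi),%esi, so the kernel appears to have been reading through a pointer holding the small bogus value 0x2080. The helper names (parse_registers, near_registers) and the 0x100 search window are illustrative assumptions; the register values are copied from that oops.

```python
import re

# Register lines copied from the first oops in the description.
OOPS = """\
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000002080
CR2: 0000000000002088 CR3: 00000003d2ce0000 CR4: 0000000000340ee0
"""


def parse_registers(text):
    """Return {name: value} for every 'NAME: <16 hex digits>' pair in text."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def near_registers(regs, fault_reg="CR2", max_offset=0x100):
    """List general registers whose value lies just below the fault address.

    A small positive offset suggests the faulting access was
    offset(%reg), i.e. a structure field read through that register.
    """
    fault = regs[fault_reg]
    hits = {}
    for name, value in regs.items():
        if name.startswith("CR"):
            continue  # control registers are not pointer candidates
        offset = fault - value
        if 0 <= offset <= max_offset:
            hits[name] = offset
    return hits


if __name__ == "__main__":
    for name, offset in near_registers(parse_registers(OOPS)).items():
        print(f"CR2 = {name} + {offset:#x}")
```

This only narrows down which register held the bad pointer; it says nothing about why the pointer was bad.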
Hmm, that __alloc_pages_nodemask() null ptr happens pretty reliably. Can you build the latest Linus kernel from here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
using your .config and see if it triggers with it too.
Looking at the asm, correlating it, it looks like it comes from:
# ./include/linux/mmzone.h:1035: if (likely(!nodes && zonelist_zone_idx(z) <= highest_zoneidx))
jne .L1471 #,
cmpl 8(%rdi), %esi # MEM[(int *)_11 + 8B], _10
jb .L1471 #,
# mm/page_alloc.c:4431: ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
and it looks like that zoneref is going up into the weeds but that's for an mm person to verify.
I am going to have to bump the bios back to get stable enough to build the kernel. I did so with 5.3 so will do so again with 5.4 latest rc. I am on business travel next week so will see what I can do before I leave.
I had got same/similar issue. Fixed it by motherboard replacement.
(In reply to Rafal Kupiec from comment #10)
> I had got same/similar issue. Fixed it by motherboard replacement.
Your new motherboard is same one I am reporting here now. It is stable under bios ver F3. With newer bios it is stable under Windows regardless of bios version. Linux kernel is not stable under newer bios. It is a kernel bug. We are trying to help identify it via this bug report.
(PS I am able to build kernel="/boot/vmlinuz-5.4.0-rc3". I Have to redo it because I muffed the .config.)
Soft lockup which repeats every few seconds using 5.4.0-rc3
https://drive.google.com/drive/folders/1AaucH1NYsmvYLJxQJ4xlIIODWC2GJ6MC?usp=sharing
This should be a folder with multiple pictures of console.
I can apply patches for testing. I am building kernel master at the moment, but would be happy to clone in anything to try or apply patches manually.
I can confirm problem remains with 5.4.0-rc4.
Unfortunately, the pics in comment #12 don't show the first oops as it has scrolled by. And the first oops is the important one.
Can you catch full dmesg over serial console or netconsole on another box?
Can you please advise how I could get a "continuous" dmesg output that can be written to a file or something? Otherwise I will try to repeat the experiment and catch it as it flies by.
The best thing to do is if you have a second machine and both machines have serial out and you connect them with a null modem cable:
https://en.wikipedia.org/wiki/Null_modem
This is explaining what you need to do on the target kernel:
https://www.kernel.org/doc/html/latest/admin-guide/serial-console.html
On the logging machine you can use minicom or so to log stuff. Here's another short text explaining what to do:
https://elinux.org/Serial_console
If you can't do serial, you can try netconsole but that also needs another machine on the same network:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/netconsole.txt
Ask if something's unclear. HTH.
(In reply to Borislav Petkov from comment #17)
> The best thing to do is if you have a second machine and both machines have
> serial out and you connect them with a null modem cable:
>
OK, I have the serial ports and NULL modem. I have installed minicom on the logging machine and it is looking for a /dev/modem
Cannot create lockfile for /dev/modem: No such file or directory
I think I have to create a dev entry, but I am not sure.
I am now running FC31 and have the kernel 5.4.0-rc5 running fine with older BIOS on the test machine. As soon as I can get the logging machine working I will bump the BIOS and see what we get.
OK, got past the minicom issue.
$ ln -s /dev/ttyUSB0 /dev/modem
Here it is from the serial console:
[ 86.732505] kernel BUG at mm/mmap.c:680!
[ 86.732518] invalid opcode: 0000 [#1] SMP NOPTI
[ 86.732520] CPU: 10 PID: 2466 Comm: gnome-terminal Not tainted 5.4.0-rc5 #1
[ 86.732521] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X5709
[ 86.732525] RIP: 0010:__vma_adjust+0x7e8/0xaf0
[ 86.732526] Code: 28 fa ff ff 4d 85 f6 0f 84 74 01 00 00 4c 89 f6 4c 89 e7 e8 b
[ 86.732527] RSP: 0018:ffffad8fc1347c80 EFLAGS: 00010206
[ 86.732529] RAX: ffff936b783b3920 RBX: 0000000000000000 RCX: 00007fda11fda000
[ 86.732530] RDX: ffff936b7ca39c40 RSI: 00007fda11fbe000 RDI: ffff936b783b3958
[ 86.732531] RBP: ffff936b7ca391f8 R08: ffff936b7ca38b20 R09: 0000000000000058
[ 86.732531] R10: 0000000000000000 R11: ffff936b783b3958 R12: ffff936b783b3900
[ 86.732532] R13: 0000000000000000 R14: ffff936b5de97cd0 R15: ffff936b783b3920
[ 86.732533] FS: 0000000000000000(0000) GS:ffff936b8ee80000(0000) knlGS:0000000
[ 86.732534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 86.732535] CR2: 00007fda12001840 CR3: 00000003f82d0000 CR4: 0000000000340ee0
[ 86.732536] Call Trace:
[ 86.732540] __split_vma+0x17e/0x190
[ 86.732541] __do_munmap+0x46f/0x4b0
[ 86.732543] mmap_region+0x232/0x660
[ 86.732545] do_mmap+0x391/0x530
[ 86.732547] ? security_mmap_file+0x7a/0xe0
[ 86.732549] vm_mmap_pgoff+0xd4/0x120
[ 86.732551] ksys_mmap_pgoff+0x1b1/0x270
[ 86.732553] do_syscall_64+0x5b/0x180
[ 86.732556] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 86.732558] RIP: 0033:0x7fda12a93056
[ 86.732559] Code: 1f 44 00 00 f3 0f 1e fa 41 f7 c1 ff 0f 00 00 75 2b 55 48 89 0
[ 86.732560] RSP: 002b:00007ffcbc189248 EFLAGS: 00000206 ORIG_RAX: 0000000000009
[ 86.732561] RAX: ffffffffffffffda RBX: 0000000000000812 RCX: 00007fda12a93056
[ 86.732561] RDX: 0000000000000001 RSI: 0000000000018000 RDI: 00007fda11fbe000
[ 86.732562] RBP: 00007fda11fbe000 R08: 0000000000000003 R09: 0000000000040000
[ 86.732563] R10: 0000000000000812 R11: 0000000000000206 R12: 00007fda12a588b0
[ 86.732564] R13: 0000000000000001 R14: 00007ffcbc189698 R15: 0000000000000003
[ 86.732565] Modules linked in: uinput xt_CHECKSUM xt_MASQUERADE nf_nat_tftp nfp
[ 86.732586] acpi_cpufreq ip_tables hid_logitech_hidpp uas usb_storage hid_loge
[ 86.732595] ---[ end trace 2f0a8b1548c04a0f ]---
[ 86.732597] RIP: 0010:__vma_adjust+0x7e8/0xaf0
[ 86.732598] Code: 28 fa ff ff 4d 85 f6 0f 84 74 01 00 00 4c 89 f6 4c 89 e7 e8 b
[ 86.732599] RSP: 0018:ffffad8fc1347c80 EFLAGS: 00010206
[ 86.732599] RAX: ffff936b783b3920 RBX: 0000000000000000 RCX: 00007fda11fda000
[ 86.732600] RDX: ffff936b7ca39c
After a restart and playing with Firefox this one just died:
Starting Hold until boot process finishes up...
[ OK ] Started Network Manager Wait Online.
[ 840.057082] BUG: kernel NULL pointer dereference, address: 0000000000000001
[ 840.057091] #PF: supervisor read access in kernel mode
[ 840.057092] #PF: error_code(0x0000) - not-present page
[ 840.057093] PGD 0 P4D 0
[ 840.057095] Oops: 0000 [#1] SMP NOPTI
[ 840.057097] CPU: 4 PID: 1804 Comm: gnome-shell Not tainted 5.4.0-rc5 #1
[ 840.057098] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X5709
[ 840.057102] RIP: 0010:irq_work_run_list+0x2b/0x70
[ 840.057104] Code: 55 41 54 55 53 9c 58 0f 1f 44 00 00 25 00 02 00 00 75 57
Can you upload the vmlinux from that 5.4.0-rc5 kernel with which you caught the oops along with full dmesg somewhere? Thx.
(In reply to Borislav Petkov from comment #22)
> Can you upload the vmlinux from that 5.4.0-rc5 kernel with which you caught
> the oops along with full dmesg somewhere?
>
> Thx.
See: https://drive.google.com/drive/folders/1dK1CpT0s61Meg3mYImbMLgL00-Ss4Zib
Three files from today.
consoleatboot: is several reboots from cold power start captured on minicom. On these the lockups are occurring before the normal bootup finishes and at different places.
5.4-rc5-dmesg-F3bios.txt: is a complete output of dmesg after changing bios back to the F3 version with all looking good in case you need all info from the kernel.
vmlinuz-5.4.0-rc5: is the kernel executable being booted. Compiled using gcc 9.2.1 (Red Hat 9.2.1-1)
Hope this helps. The kernel is failing on the AGESA 1.0.0.3ABBA bios but not on AGESA 1.0.0.3.
I see a new bios F10a has just arrived for the bios with AGESA 1.0.0.4B so I will be trying that soon and reporting back. (I also see a kernel RC6 has arrived.) I will try it a bit later.
(In reply to JerryD from comment #23)
> I see a new bios F10a has just arrived for the bios with AGESA 1.0.0.4B so I
> will be trying that soon and reporting back. (I also see a kernel RC6 has
> arrived.) I will try it a bit later.
No improvement with this latest BIOS.
(In reply to JerryD from comment #23)
> consoleatboot: is several reboots from cold power start captured on minicom.
> On these the lockups are occurring before the normal bootup finishes and at
> different places.
Ok, so far so good. Please do exactly the same but try catching the output from the very beginning until it oopses. The *important* thing you need to do before booting the target machine is to do in minicom:
Ctrl-A L
so that it says
+-----------------------------------------+
|Capture to which file?                   |
|> minicom.cap                            |
+-----------------------------------------+
then you press enter and *then* you boot the target machine, wait until it oopses and then you send me that minicom.cap file.
Because as it is, consoleatboot is cut off to the terminal width and missing important information like the whole Code: line which has the instruction bytes around rIP.
Other than that, do not change anything on the setup so that I can use the vmlinuz you uploaded earlier.
Thx.
Created attachment 285851 [details]
minicom.cap as requested
Note: I am using USB serial ports so the capture begins right after line 10. Before that was a previous attempt that just locked in the middle of the capture. Some of these failures are pretty abrupt, others will stay alive for a while. In the one captured here, at line 52 you will see a second RIP. That one just stopped midway through dumping the code. After that I noticed the gnome desktop was still responding and that's when I attempted a shutdown. It did a few steps and hung on the stop job, so ignore that one.
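The point about consoleatboot being cut off at the terminal width can be checked mechanically before uploading a capture. The sketch below assumes only what the oopses in this report show: a complete Code: line marks the byte at the faulting RIP with angle brackets (for example <3b>), while a line truncated by the terminal has no marker. The function name trapping_offset is made up for illustration, and a missing marker can also mean "Code: Bad RIP value.", so treat None as "not usable", not as proof of truncation.

```python
import re

# Complete Code: line from the first oops in the description.
FULL = (
    "[ 655.858442] Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b "
    "74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 77 "
    "01 00 00 <3b> 77 08 0f 82 6e 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 "
    "b9 00 02"
)

# Code: line from the width-limited serial capture (cut off mid-byte).
TRUNCATED = (
    "[ 86.732526] Code: 28 fa ff ff 4d 85 f6 0f 84 74 01 00 00 4c 89 f6 "
    "4c 89 e7 e8 b"
)


def trapping_offset(code_line):
    """Return the byte offset of the <xx>-marked byte, or None if absent."""
    _, _, payload = code_line.partition("Code:")
    for offset, token in enumerate(payload.split()):
        if re.fullmatch(r"<[0-9a-f]{2}>", token):
            return offset
    return None


if __name__ == "__main__":
    for label, line in (("full", FULL), ("truncated", TRUNCATED)):
        offset = trapping_offset(line)
        if offset is None:
            print(f"{label}: no marked byte, capture is not usable for decoding")
        else:
            print(f"{label}: trapping byte at offset {offset:#x}")
```

Running it over every Code: line in a minicom.cap file before attaching it would catch a width-limited capture early.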
(In reply to JerryD from comment #26)
> Created attachment 285851 [details]
> minicom.cap as requested
>
> Note: I am using USB serial ports so the capture begins right after line 10.
> Before that was a previous attempt that just locked in the middle of the
> capture. Some of these fails are pretty abrupt, others will stay alive for a
> while. The one captured here, at line 52 you will see a second RIP. That one
> just stopped mid way on dumping the code. After that I noticed the gnome
> desktop was still responding and that's when I attempted a shutdown. It did a
> few steps and hung on the stop job, so ignore that one.
To clarify, logging starts after the USB interfaces are brought up, so I watch the VGA console before this and the boot is clean. The bug manifests after I start exercising the system such as using Firefox or attempting a multithreaded gcc build.
Ok, so this looks like it fails here:
[ 123.219007] Code: 08 e8 b5 48 ea ff eb be 0f 0b 90 0f 1f 44 00 00 41 57 41 56 41 55 4c 8d 6f 78 41 54 55 53 48 83 ec 08 4c 8b 67 78 48 89 3c 24 <49> 8b 0c 24 4d 39 e5 0f 84 7b 01 00 00 4d 8d 7c 24 f0 48 8d 59 f0
All code
========
0: 08 e8 or %ch,%al
2: b5 48 mov $0x48,%ch
4: ea (bad)
5: ff (bad)
6: eb be jmp 0xffffffffffffffc6
8: 0f 0b ud2
a: 90 nop
b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
10: 41 57 push %r15
12: 41 56 push %r14
14: 41 55 push %r13
16: 4c 8d 6f 78 lea 0x78(%rdi),%r13
1a: 41 54 push %r12
1c: 55 push %rbp
1d: 53 push %rbx
1e: 48 83 ec 08 sub $0x8,%rsp
22: 4c 8b 67 78 mov 0x78(%rdi),%r12
26: 48 89 3c 24 mov %rdi,(%rsp)
2a:* 49 8b 0c 24 mov (%r12),%rcx <-- trapping instruction
2e: 4d 39 e5 cmp %r12,%r13
31: 0f 84 7b 01 00 00 je 0x1b2
37: 4d 8d 7c 24 f0 lea -0x10(%r12),%r15
3c: 48 8d 59 f0 lea -0x10(%rcx),%rbx
unlink_anon_vmas:
1: call __fentry__
.section __mcount_loc, "a",@progbits
.quad 1b
.previous
pushq %r15 #
pushq %r14 #
pushq %r13 #
# mm/rmap.c:386: list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
leaq 120(%rdi), %r13 #, _92
# mm/rmap.c:378: {
pushq %r12 #
pushq %rbp #
pushq %rbx #
subq $8, %rsp #,
# mm/rmap.c:386: list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
movq 120(%rdi), %r12 # vma_23(D)->anon_vma_chain.next, __mptr
# mm/rmap.c:378: {
movq %rdi, (%rsp) # vma, %sfp
# mm/rmap.c:386: list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
cmpq %r13, %r12 # _92, _6
movq (%r12), %rcx # MEM[(struct anon_vma_chain *)__mptr_24 + -16B].same_vma.next, tmp153
which is this last instruction and which says that one of the next pointers on this &vma->anon_vma_chain list is NULL. Leading to the NULL ptr deref.
Now, this is a different splat than the one I've seen before and you don't seem to have ECC memory so have you checked whether your DIMMs are all ok? IOW, can you try swapping them out one by one and each time booting the machine to see whether the error happens. And log it each time.
Also, before you do, build your kernel with CONFIG_DEBUG_INFO=y because it is easier to search in vmlinux with it.
Thx.
(In reply to JerryD from comment #23)
> The kernel is failing on the AGESA 1.0.0.3ABBA bios but
> not on AGESA 1.0.0.3.
Are you sure the failure doesn't happen at all with the old BIOS?
If so, you might consider not upgrading it. Or you can try reporting the issue but good luck reporting BIOS bugs...
(In reply to JerryD from comment #24)
> (In reply to JerryD from comment #23)
> > I see a new bios F10a has just arrived for the bios with AGESA 1.0.0.4B so I
> > will be trying that soon and reporting back. (I also see a kernel RC6 has
> > arrived.) I will try it a bit later.
>
> No improvement with this latest BIOS.
Check F10c
(In reply to Rafal Kupiec from comment #30)
> (In reply to JerryD from comment #24)
> > (In reply to JerryD from comment #23)
> > > I see a new bios F10a has just arrived for the bios with AGESA 1.0.0.4B so I
> > > will be trying that soon and reporting back. (I also see a kernel RC6 has
> > > arrived.) I will try it a bit later.
> > > > No improvement with this latest BIOS. > > Check F10c Just installed F10c BIOS. The system will boot, but any operations that require paging memory get into trouble. For example, when I ssh into the machine and ran 'git svn --rebase' (which fails to run) I get the following as displayed by dmesg. Note, I am still shelled in and the machine is not completely locking up. The rest of dmesg before this is clean as a whistle. [ 50.843414] fbcon: Taking over console [ 50.851866] Console: switching to colour frame buffer device 320x90 [ 2158.328293] BUG: Bad page map in process git-svn pte:ffffffff8427dd5d pmd:3f1d26067 [ 2158.328297] addr:000055ddee37e000 vm_flags:08100073 anon_vma:ffff9eee457fdaa8 mapping:0000000000000000 index:55ddee37e [ 2158.328301] file:(null) fault:0x0 mmap:0x0 readpage:0x0 [ 2158.328304] CPU: 8 PID: 35359 Comm: git-svn Not tainted 5.4.0-rc5 #1 [ 2158.328305] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F10c 11/08/2019 [ 2158.328306] Call Trace: [ 2158.328313] dump_stack+0x5c/0x80 [ 2158.328317] print_bad_pte.cold+0x6a/0xd2 [ 2158.328319] vm_normal_page+0xbe/0xd0 [ 2158.328321] unmap_page_range+0x50b/0xd10 [ 2158.328323] unmap_vmas+0x7a/0xf0 [ 2158.328326] exit_mmap+0xad/0x1a0 [ 2158.328330] mmput+0x61/0x130 [ 2158.328333] flush_old_exec+0x261/0x680 [ 2158.328335] load_elf_binary+0x32e/0x1640 [ 2158.328337] ? get_user_pages_remote+0x14f/0x240 [ 2158.328339] ? _cond_resched+0x15/0x30 [ 2158.328342] ? selinux_inode_permission+0x107/0x1c0 [ 2158.328345] search_binary_handler+0x84/0x1b0 [ 2158.328347] __do_execve_file.isra.0+0x4f0/0x850 [ 2158.328350] __x64_sys_execve+0x35/0x40 [ 2158.328353] do_syscall_64+0x5b/0x180 [ 2158.328355] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 2158.328357] RIP: 0033:0x7efd6acb744b [ 2158.328360] Code: Bad RIP value. 
[ 2158.328361] RSP: 002b:00007ffcf1c16958 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
[ 2158.328363] RAX: ffffffffffffffda RBX: 000055ddede4c680 RCX: 00007efd6acb744b
[ 2158.328364] RDX: 000055ddeef031b0 RSI: 000055ddeee473b0 RDI: 00007ffcf1c16960
[ 2158.328365] RBP: 00007ffcf1c16a50 R08: 0000000000000fff R09: 000055ddede36a3a
[ 2158.328366] R10: 0000000000000008 R11: 0000000000000202 R12: 000055ddeee473b0
[ 2158.328366] R13: 000055ddeef031b0 R14: 0000000000000004 R15: 000055ddede36a25
[ 2158.328368] Disabling lock debugging due to kernel taint
[ 2158.328798] BUG: Bad rss-counter state mm:000000006486e0cf type:MM_ANONPAGES val:1
So far I have been able to boot into Windows and it runs fine. However, today I noticed the Windows boot got clobbered, either from installing Fedora 31 or the numerous crashes I have encountered while doing a lot of disk I/O. So I will restore it and confirm Windows does indeed run fine on this F10c bios.
(In reply to Borislav Petkov from comment #29)
> (In reply to JerryD from comment #23)
> > The kernel is failing on the AGESA 1.0.0.3ABBA bios but
> > not on AGESA 1.0.0.3.
>
> Are you sure the failure doesn't happen at all with the old BIOS?
>
> If so, you might consider not upgrading it. Or you can try reporting the
> issue but good luck reporting BIOS bugs...
Yes, I can always just go back to the original BIOS; however, it is obvious to me there is a bug lurking here. I think the inconsistency we see is because the bug is related to memory management and paging, which implies the IOMMU.
I plan to do some more experiments regarding this and will report back.
(In reply to JerryD from comment #32)
> Yes, I can always just go back to the original BIOS however, it is obvious
> to me there is a bug lurking here.
How is it obvious to you?
> I think the inconsistency we see is
> because the bug is related to memory management and paging which implies
> the IOMMU
So that "implication" certainly doesn't make any sense to me.
> I plan to do some more experiments regarding this and will report
> back.
It looks to me more like a BIOS issue on that particular platform. Especially since you're the only one reporting such random corruptions. And the fact that the corruptions are random could mean a hw issue, like one of your DIMMs is faulty and the old BIOS is hiding it properly, or ...
IOW, I don't think we can yet clearly pinpoint what the issue is. At least I can't.
(In reply to Borislav Petkov from comment #33)
> (In reply to JerryD from comment #32)
> > Yes, I can always just go back to the original BIOS however, it is obvious
> > to me there is a bug lurking here.
>
> How is it obvious to you?
The lockup and kernel oops indicate a bug. They do not indicate whether the bug is software or hardware, kernel or otherwise.
> > I think the inconsistency we see is
> > because the bug is related to memory management and paging which implies
> > the IOMMU
>
> So that "implication" certainly doesn't make any sense to me.
Agreed, I was being too specific; it appears to be memory related.
> > I plan to do some more experiments regarding this and will report
> > back.
I finally got around to your suggestion to pull and swap memory modules. Pulled the second bank and rebooted and it locked as seen before. Pulled the first bank and swapped the second bank back in... booted and I could build gcc without a hitch. I have to confess I have never seen a bios "mask" a hardware problem before, but you said it and sure enough it seems to be the case. The only issue I have seen now is when I request a system restart from the gnome menu: it appears to start to do so, but stops and does not actually restart. Power off works fine.
Thank you for your helpful advice, I have learned a lot. I will monitor this for a few days before closing the bug.
Oops, wrote too soon. After about 20 minutes it ceased functioning. I am back to the original bios and will query the bios provider.
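Since the advice above was to log every boot while swapping DIMMs and BIOS versions, it may help to pull the kernel version and BIOS revision out of each captured oops automatically. The following is only a sketch: it assumes each oops contains a "CPU: ... Comm: ..." line followed by a "Hardware name: ... BIOS <rev> <date>" line, as the complete oopses in this report do, and the function name kernel_bios_pairs is invented for illustration. Width-truncated captures (where the Hardware name line is cut before the BIOS field) will simply yield no pair.

```python
import re

# Matches e.g. "Comm: dnf Not tainted 5.3.6-200.fc30.x86_64 #1"
KERNEL_RE = re.compile(r"Comm: \S+.* (\S+) #\d+")
# Matches e.g. "Hardware name: ..., BIOS F5b 09/17/2019"
BIOS_RE = re.compile(r"Hardware name: .*\bBIOS (\S+) \d{2}/\d{2}/\d{4}")

# Header lines copied from two oopses in this report.
SAMPLE = """\
[ 655.858427] CPU: 10 PID: 2024 Comm: dnf Not tainted 5.3.6-200.fc30.x86_64 #1
[ 655.858430] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F5b 09/17/2019
[ 2158.328304] CPU: 8 PID: 35359 Comm: git-svn Not tainted 5.4.0-rc5 #1
[ 2158.328305] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F10c 11/08/2019
"""


def kernel_bios_pairs(log_text):
    """Return [(kernel_version, bios_revision), ...], one per oops header."""
    pairs = []
    kernel = None
    for line in log_text.splitlines():
        match = KERNEL_RE.search(line)
        if match:
            kernel = match.group(1)
            continue
        match = BIOS_RE.search(line)
        if match and kernel is not None:
            pairs.append((kernel, match.group(1)))
            kernel = None  # wait for the next CPU: line
    return pairs


if __name__ == "__main__":
    for kernel, bios in kernel_bios_pairs(SAMPLE):
        print(f"kernel {kernel} failed on BIOS {bios}")
```

A table built this way records only failures; boots that stayed stable (such as the F3 runs) would still need to be noted by hand.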
But did you run with the original BIOS for a sufficiently long time to confirm that it really is ok and only updating to the newer BIOS is starting to cause those random corruptions?
(In reply to Borislav Petkov from comment #36)
> But did you run with the original BIOS for a sufficiently long time to
> confirm that it really is ok and only updating to the newer BIOS is starting
> to cause those random corruptions?
Yes, I used the original BIOS for several months before the new ones were released.
Also, I have been using Windows all day today on the most recent F10c bios without a glitch: two different browsers, IDE, shell, Ubuntu inside a Windows virtual machine, and email. I have also been running CPU-Z benchmarks and stress tests without an issue. Ran the stress test for about 15 minutes while doing all the other things. (Sidenote: Ryzen 3600 keeps up quite well with all these things at once.)
Yah, OEMs test only with windoze so no wonder it works under windoze. It could be some unfortunate interaction between the kernel and the newer BIOSes.
What you could do is try older kernels and try to pinpoint, if possible, whether there's a kernel version which is ok. If you establish that, then bisecting should help us narrow it down...
(In reply to Borislav Petkov from comment #38)
> Yah, OEMs test only with windoze so no wonder it works under windoze. It
> could be some unfortunate interaction between the kernel and the newer
> BIOSes.
>
> What you could do is try older kernels and try to pinpoint, if possible,
> whether there's a kernel version which is ok. If you establish that, then
> bisecting should help us narrow it down...
My thoughts exactly: back up the kernel and see if I can find something.
For what it's worth, I loaded the 5.0 kernel from Fedora koji. Running a -j10 build of gcc to exercise the system.
$ cat 5dump [ 10.437509] virbr0: port 1(virbr0-nic) entered listening state [ 10.456870] virbr0: port 1(virbr0-nic) entered disabled state [ 12.508858] usb 1-3.2: reset high-speed USB device number 5 using xhci_hcd [ 625.463440] BUG: Bad page state in process make pfn:27eddd [ 625.463443] page:ffffe09cc9fb7740 count:0 mapcount:0 mapping:0000000000000000 index:0x127 [ 625.463445] flags: 0x17fffe00080004(uptodate|swapbacked) [ 625.463447] raw: 0017fffe00080004 ffffe09cca22b508 ffffb0e557c27d40 0000000000000000 [ 625.463448] raw: 0000000000000127 0000000000000000 00000000ffffffff ffff97e70c41b000 [ 625.463448] page dumped because: page still charged to cgroup [ 625.463448] page->mem_cgroup:ffff97e70c41b000 [ 625.463449] Modules linked in: uinput devlink xt_CHECKSUM ipt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat snd_hda_codec_hdmi edac_mce_amd kvm_amd kvm uvcvideo videobuf2_vmalloc snd_hda_codec_generic irqbypass videobuf2_memops ledtrig_audio videobuf2_v4l2 videobuf2_common videodev crct10dif_pclmul snd_hda_intel crc32_pclmul snd_usb_audio pl2303 snd_hda_codec joydev media snd_hda_core snd_usbmidi_lib wmi_bmof ghash_clmulni_intel snd_rawmidi snd_hwdep snd_seq snd_seq_device bfq snd_pcm snd_timer sp5100_tco snd i2c_piix4 soundcore ccp pcc_cpufreq acpi_cpufreq nouveau [ 625.463472] video mxm_wmi drm_kms_helper ttm drm crc32c_intel igb dca i2c_algo_bit wmi pinctrl_amd fuse [ 625.463477] CPU: 4 PID: 183178 Comm: make Not tainted 5.0.0-1.fc31.x86_64 #1 [ 625.463478] Hardware name: Gigabyte Technology Co., Ltd. 
X570 AORUS ELITE/X570 AORUS ELITE, BIOS F10c 11/08/2019 [ 625.463478] Call Trace: [ 625.463484] dump_stack+0x5c/0x80 [ 625.463486] bad_page.cold+0x7f/0xb3 [ 625.463487] free_unref_page_list+0xf1/0x270 [ 625.463489] release_pages+0x1b7/0x480 [ 625.463491] tlb_flush_mmu_free+0x28/0x50 [ 625.463493] arch_tlb_finish_mmu+0x16/0xa0 [ 625.463494] tlb_finish_mmu+0x1f/0x30 [ 625.463495] exit_mmap+0xcd/0x1a0 [ 625.463497] mmput+0x61/0x130 [ 625.463498] do_exit+0x286/0xbd0 [ 625.463499] ? handle_mm_fault+0xdc/0x210 [ 625.463501] ? do_user_addr_fault+0x218/0x450 [ 625.463502] do_group_exit+0x3a/0xa0 [ 625.463503] __x64_sys_exit_group+0x14/0x20 [ 625.463504] do_syscall_64+0x5b/0x150 [ 625.463506] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 625.463508] RIP: 0033:0x7f0899974416 [ 625.463510] Code: Bad RIP value. [ 625.463511] RSP: 002b:00007ffc92578978 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 [ 625.463512] RAX: ffffffffffffffda RBX: 00007f0899a6a470 RCX: 00007f0899974416 [ 625.463512] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000 [ 625.463513] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78 [ 625.463513] R10: 00007f089938c84e R11: 0000000000000246 R12: 00007f0899a6a470 [ 625.463513] R13: 0000000000000002 R14: 00007f0899a6de68 R15: 0000000000000000 [ 625.463515] Disabling lock debugging due to kernel taint For what it’s worth, I experienced issues strikingly similar to JerryD. I have a Ryzen 3600 on MSI x570 motherboard. Initially installed Ubuntu 18.10 in a dual boot setup with Kernel 4.18 which was stable but after upgrade of the bios to address AMD issues the system became unstable and would freeze as soon as I opened any application on the desktop that put a load on the system, no matter how slight. I abandoned it for a few months due to having the dual boot option and the bios upgrade solved an issue I was having in windows. 
Recently I tried different distros and combinations with Kernel 4.19, 5.3 and 5.4 RC 4 but I experienced the same issue. My motherboard is an MSI MPG X570 and they have a beta bios at the moment (7C37vA61) which I decided to try before rolling back to a pre-issue version to test. I tried to install Manjaro with 4.19 and the same issues existed, but I decided to give Ubuntu 19.10 with 5.3 a run and for the moment it's very stable, though I have yet to give it a full stress test. Unlike with any other install, I was able to open and use Firefox and load a video, which I never could since this issue started. Once I find time, I will give it a full run. There do not appear to be any release notes with the bios update to detail any changes apart from the following limited info.
- Improved system boot up time.
- Improved PCI-E device compatibility.
Is there anything I can run on my system that would help troubleshoot this issue for you? Apologies for the limited information, I am not at your technical level but hoped it may help to see someone else with the same hardware and a similar issue.
(In reply to Tom from comment #41)
> Is there anything I can run on my system that would help troubleshoot this
> issue for you?
First of all, please open a separate bugzilla entry and add me to CC. If it turns out that your issue really is the same as Jerry's, we can merge them.
Then, boot the latest upstream kernel - 5.4-rc8 - and see if it freezes. If it does, try catching the freeze over serial, it is explained how in this bug, and upload that log to your fresh bugzilla entry.
Don't hesitate to ask if there are questions.
Thx.
Ok, I will wait until I get the serial output done before I open a new bugzilla. This is my main machine and it's needed for work purposes, so it will be the weekend before I can get to it. Thank you.
After taking a break from fighting this, I tried again with the latest vendor BIOS which had additional options in its setup. I disabled c-states and set to typical idle current and the system is stable. I would consider this a BIOS bug and not a kernel bug, though it could have been a combination of these earlier. Closing this bug report.