Bug 205017 - Ryzen 3600 CPU Lockup after BIOS update
Summary: Ryzen 3600 CPU Lockup after BIOS update
Status: RESOLVED INVALID
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Page Allocator
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-27 02:06 UTC by JerryD
Modified: 2020-03-01 18:44 UTC
CC List: 7 users

See Also:
Kernel Version: 5.3.6, 5.2.15
Tree: Fedora
Regression: No


Attachments
dmesg from x570 motherboard (88.92 KB, text/plain)
2019-09-27 02:11 UTC, JerryD
dmesg showing some traces (92.36 KB, text/plain)
2019-10-19 16:59 UTC, JerryD
Captured a photo of the console at lockup (2.21 MB, image/jpeg)
2019-10-19 17:54 UTC, JerryD
minicom.cap as requested (8.16 KB, text/plain)
2019-11-11 04:18 UTC, JerryD

Description JerryD 2019-09-27 02:06:22 UTC
Branching this Bug from 204811 comments 10 through 17:

https://bugzilla.kernel.org/show_bug.cgi?id=204811#c17

Attached dmesg from starting.

Updated my BIOS to the 1003ABB BIOS on a Gigabyte Aorus Elite X570 motherboard with a Ryzen 3600X. Before the BIOS update this machine was rock solid. Now it locks up fairly quickly.

This was on kernel 5.2.15-200.fc30.x86_64 (Fedora 30 distribution).

I will attach dmesg shortly.  I have reverted to the original BIOS, and the latest kernel, 5.2.16, works fine with it. The problematic BIOS is F5, with the latest AMD AGESA 1.0.0.3ABB.

This is a dual-boot machine with Windows, and on Windows the @BIOS utility from Gigabyte will not install, so I am only using Q-Flash. I have no overclocking, just the default memory settings.  After reverting to the F3 BIOS, no problems. I enabled XMP to set the memory and still had no problems.

I will try with XMP on F5 loaded again and 5.2.16 kernel and see what happens.

After this I will have to revert to F3 if there is still a problem, get set up to build kernel 5.3 from source, and try again.
Comment 1 JerryD 2019-09-27 02:11:38 UTC
Created attachment 285205 [details]
dmesg from x570 motherboard

Attached copy of dmesg from initial bug report on 204811.
Comment 2 JerryD 2019-09-27 15:52:15 UTC
Built and installed stable 5.3.1 kernel and it appears to be OK on F3 bios.  I will reflash bios to F5 and see if problem continues or not.
Comment 3 JerryD 2019-09-28 02:10:43 UTC
The 5.3.1 kernel with the F5b BIOS (latest) gave two call traces at about 160 seconds. The first was an invalid op code: 0000 [#1] SMP NOOPT......could not read any more. I was copying and pasting into this bug report when the machine locked up, so the view on the screen is blocked. I cannot ssh into it.

I wanted to add that the 5.1.17 kernel just released on Fedora is also giving a pcie related error. I will post that one elsewhere.
Comment 4 JerryD 2019-10-19 16:46:14 UTC
Finally got something:

[  247.947689] fbcon: Taking over console
[  248.049630] Console: switching to colour frame buffer device 320x90
[  271.013621] systemd-journald[770]: File /var/log/journal/6fc51f73be354417bc87f2b5ba180e0e/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
[  271.213585] usb 1-3.2: reset high-speed USB device number 5 using xhci_hcd
[  655.858401] BUG: unable to handle page fault for address: 0000000000002088
[  655.858412] #PF: supervisor read access in kernel mode
[  655.858415] #PF: error_code(0x0000) - not-present page
[  655.858418] PGD 0 P4D 0 
[  655.858423] Oops: 0000 [#1] SMP NOPTI
[  655.858427] CPU: 10 PID: 2024 Comm: dnf Not tainted 5.3.6-200.fc30.x86_64 #1
[  655.858430] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F5b 09/17/2019
[  655.858438] RIP: 0010:__alloc_pages_nodemask+0x132/0x340
[  655.858442] Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 77 01 00 00 <3b> 77 08 0f 82 6e 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
[  655.858448] RSP: 0000:ffff9d01c093bd48 EFLAGS: 00010246
[  655.858452] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
[  655.858456] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000002080
[  655.858459] RBP: 0000000000100dca R08: ffffffff8ac617e0 R09: 0000000000000000
[  655.858463] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  655.858466] R13: 0000000000100dca R14: 0000000000000081 R15: 0000000000000000
[  655.858470] FS:  00007fe88deb8680(0000) GS:ffff8a74cea80000(0000) knlGS:0000000000000000
[  655.858474] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  655.858478] CR2: 0000000000002088 CR3: 00000003d2ce0000 CR4: 0000000000340ee0
[  655.858481] Call Trace:
[  655.858487]  alloc_pages_vma+0xcc/0x170
[  655.858491]  ? mem_cgroup_commit_charge+0xcb/0x1a0
[  655.858495]  __handle_mm_fault+0xa2e/0x1ab0
[  655.858499]  ? __fsnotify_parent+0x9f/0x140
[  655.858503]  handle_mm_fault+0xc4/0x1e0
[  655.858507]  do_user_addr_fault+0x1c4/0x440
[  655.858511]  do_page_fault+0x31/0x110
[  655.858515]  page_fault+0x3e/0x50
[  655.858519] RIP: 0033:0x7fe88e110515
[  655.858902] Code: 83 e8 01 4c 89 f1 48 2b 0d b8 f3 25 00 49 8b 56 08 48 be ab aa aa aa aa aa aa aa 48 c1 f9 04 48 0f af ce 48 8d ba 00 10 00 00 <c7> 42 24 ff ff 00 00 89 4a 20 49 89 7e 08 41 89 46 10 85 c0 0f 84
[  655.859603] RSP: 002b:00007ffe015ce870 EFLAGS: 00010a17
[  655.860311] RAX: 0000000000000034 RBX: 0000000000000005 RCX: 000000000000001a
[  655.861009] RDX: 00007fe87d979000 RSI: aaaaaaaaaaaaaaab RDI: 00007fe87d97a000
[  655.861716] RBP: 0000000000000058 R08: 00007fe88e336020 R09: 00007fe88e336020
[  655.862419] R10: 0000000000000000 R11: 00007ffe015ce910 R12: 000000000000000a
[  655.863467] R13: 00007fe88e335fe0 R14: 00005601be8c8470 R15: 00007fe87d978090
[  655.864182] Modules linked in: xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables sunrpc vfat fat edac_mce_amd kvm_amd snd_hda_codec_hdmi snd_hda_codec_generic ledtrig_audio uvcvideo snd_hda_intel videobuf2_vmalloc kvm snd_usb_audio videobuf2_memops snd_hda_codec snd_hda_core snd_usbmidi_lib irqbypass snd_rawmidi videobuf2_v4l2 snd_hwdep videobuf2_common snd_seq crct10dif_pclmul snd_seq_device crc32_pclmul snd_pcm videodev snd_timer snd sp5100_tco joydev mc ghash_clmulni_intel wmi_bmof i2c_piix4 k10temp ccp soundcore acpi_cpufreq hid_logitech_hidpp hid_logitech_dj nouveau
[  655.864203]  video mxm_wmi drm_kms_helper ttm drm crc32c_intel igb dca i2c_algo_bit wmi pinctrl_amd
[  655.868479] CR2: 0000000000002088
[  655.869370] ---[ end trace 82a2b57327546202 ]---
[  655.870251] RIP: 0010:__alloc_pages_nodemask+0x132/0x340
[  655.871135] Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 77 01 00 00 <3b> 77 08 0f 82 6e 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
[  655.872060] RSP: 0000:ffff9d01c093bd48 EFLAGS: 00010246
[  655.872985] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
[  655.873978] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000002080
[  655.874900] RBP: 0000000000100dca R08: ffffffff8ac617e0 R09: 0000000000000000
[  655.875799] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  655.876693] R13: 0000000000100dca R14: 0000000000000081 R15: 0000000000000000
[  655.877688] FS:  00007fe88deb8680(0000) GS:ffff8a74cea80000(0000) knlGS:0000000000000000
[  655.878571] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  655.879449] CR2: 0000000000002088 CR3: 00000003d2ce0000 CR4: 0000000000340ee0


$ uname -a
Linux quasar 5.3.6-200.fc30.x86_64 #1 SMP Mon Oct 14 13:11:01 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Comment 5 JerryD 2019-10-19 16:59:27 UTC
Created attachment 285555 [details]
dmesg showing some traces

I managed to switch to console mode and then ssh into this machine with the latest ABBA BIOS and latest Fedora kernel to capture this.  I am suspicious of the graphics card and nouveau driver. Under the older BIOS I was getting a continuous AER message, so I used pci=noaer to quiet that one. My googling says that the AER issue should be fixed by the latest AGESA, so I have not tried without pci=noaer yet.

I did notice that this machine locked up instantly when I tried to go from console mode (CTRL-ALT-F3) back to the usual GNOME desktop (CTRL-ALT-F1).

While in this current running state I am going to do a gcc build to stress this a little, but without any gui running.
Comment 6 JerryD 2019-10-19 17:00:55 UTC
Hmm, what does this mean?

[    7.011246] ACPI Warning: SystemIO range 0x0000000000000B00-0x0000000000000B08 conflicts with OpRegion 0x0000000000000B00-0x0000000000000B0F (\GSA1.SMBI) (20190703/utaddress-204)
[    7.011255] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
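
One way to read this warning: a native driver has claimed an I/O port range that an ACPI operation region also covers, so both could end up touching the same ports. Below is a minimal Python sketch of that overlap check, using the two ranges from the log above. The helper name `overlap` is made up for illustration, and nothing here says whether the conflict is related to the lockups.

```python
def overlap(a, b):
    """Return the inclusive overlap of two inclusive (start, end) ranges, or None."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return (start, end) if start <= end else None


# Ranges copied from the ACPI warning above (inclusive I/O port numbers).
system_io = (0x0B00, 0x0B08)  # SystemIO range named in the warning
op_region = (0x0B00, 0x0B0F)  # ACPI OpRegion \GSA1.SMBI

shared = overlap(system_io, op_region)
print(shared is not None)              # True: the two ranges collide
print(hex(shared[0]), hex(shared[1]))  # 0xb00 0xb08
```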
Comment 7 JerryD 2019-10-19 17:54:22 UTC
Created attachment 285557 [details]
Captured a photo of the console at lockup

Photo right after lockup.
Comment 8 Borislav Petkov 2019-10-19 18:30:02 UTC
Hmm, that __alloc_pages_nodemask() null ptr happens pretty reliably. Can you build the latest Linus kernel from here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/

using your .config and see if it triggers with it too.

Looking at the asm, correlating it, it looks like it comes from:

# ./include/linux/mmzone.h:1035:        if (likely(!nodes && zonelist_zone_idx(z) <= highest_zoneidx))
        jne     .L1471  #,
        cmpl    8(%rdi), %esi   # MEM[(int *)_11 + 8B], _10
        jb      .L1471  #,
# mm/page_alloc.c:4431:         ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,

and it looks like that zoneref is going up into the weeds but that's for an mm person to verify.
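
As a cross-check of the above, the faulting address in the oops from comment 4 can be reproduced from the register dump. This is a small Python sketch under two assumptions: that the trapping bytes `3b 77 08` decode to `cmp 0x8(%rdi),%esi` (consistent with the `cmpl 8(%rdi), %esi` line above), and that kernel pointers live in the upper half of the address space as with 4-level paging. The constant and helper names are illustrative only.

```python
# Register values copied from the oops in comment 4.
RDI = 0x0000000000002080
CR2 = 0x0000000000002088  # faulting address reported by the CPU

# Assumption: `cmp 0x8(%rdi),%esi` reads memory at RDI + 8.
DISPLACEMENT = 0x8

# Assumption: with 4-level paging, kernel addresses start here.
KERNEL_SPACE_START = 0xFFFF800000000000


def looks_like_kernel_pointer(value):
    """Rough plausibility check for a kernel virtual address."""
    return value >= KERNEL_SPACE_START


fault_address = RDI + DISPLACEMENT
print(hex(fault_address))              # 0x2088
print(fault_address == CR2)            # True: matches the reported CR2
print(looks_like_kernel_pointer(RDI))  # False: RDI is not a sane zoneref pointer
```

So the pointer held in RDI was already bogus by the time the comparison ran. The sketch says nothing about how it became bogus.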
Comment 9 JerryD 2019-10-20 03:16:59 UTC
I am going to have to bump the bios back to get stable enough to build the kernel. I did so with 5.3 so will do so again with 5.4 latest rc. I am on business travel next week so will see what I can do before I leave.
Comment 10 Rafal Kupiec 2019-10-20 15:07:26 UTC
I had the same or a similar issue. Fixed it by replacing the motherboard.
Comment 11 JerryD 2019-10-20 16:10:07 UTC
(In reply to Rafal Kupiec from comment #10)
> I had the same or a similar issue. Fixed it by replacing the motherboard.

Your new motherboard is the same one I am reporting on here. It is stable under BIOS version F3. Windows is stable regardless of BIOS version. The Linux kernel is not stable under the newer BIOS. It is a kernel bug. We are trying to help identify it via this bug report.

(PS: I am able to build kernel="/boot/vmlinuz-5.4.0-rc3". I have to redo it because I muffed the .config.)
Comment 12 JerryD 2019-10-20 19:50:54 UTC
Soft lockup which repeats every few seconds using 5.4.0-rc3

https://drive.google.com/drive/folders/1AaucH1NYsmvYLJxQJ4xlIIODWC2GJ6MC?usp=sharing

This should be a folder with multiple pictures of console.
Comment 13 JerryD 2019-10-26 22:08:20 UTC
I can apply patches for testing. I am building kernel master at the moment, but would be happy to clone in anything to try or apply patches manually.
Comment 14 JerryD 2019-10-30 00:46:57 UTC
I can confirm problem remains with 5.4.0-rc4.
Comment 15 Borislav Petkov 2019-10-30 08:14:49 UTC
Unfortunately, the pics in comment #12 don't show the first oops as it has scrolled by. And the first oops is the important one. Can you catch full dmesg over serial console or netconsole on another box?
Comment 16 JerryD 2019-11-01 01:56:03 UTC
Can you please advise how I could get a "continuous" dmesg output that can be written to a file or something? Otherwise I will try to repeat the experiment and catch it as it flies by.
Comment 17 Borislav Petkov 2019-11-01 08:38:46 UTC
The best thing to do is if you have a second machine and both machines have serial out and you connect them with a null modem cable:

https://en.wikipedia.org/wiki/Null_modem

This is explaining what you need to do on the target kernel:

https://www.kernel.org/doc/html/latest/admin-guide/serial-console.html

On the logging machine you can use minicom or so to log stuff. Here's another short text explaining what to do: https://elinux.org/Serial_console

If you can't do serial, you can try netconsole but that also needs another machine on the same network:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/netconsole.txt

Ask if something's unclear.

HTH.
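
For the netconsole route, the logging machine only needs something that listens for UDP packets and appends them to a file. A minimal sketch in Python, assuming the target is set up to send to UDP port 6666 (the target port given in the netconsole documentation; adjust it if you configure a different one). The function names and the log file name are illustrative and not part of any existing tool.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket on the logging machine."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture_netconsole(sock, out_path, max_packets=None):
    """Append each received datagram to out_path, flushing after every packet.

    Runs until max_packets datagrams have arrived, or forever if it is None.
    Returns the number of datagrams written.
    """
    received = 0
    with open(out_path, "ab") as log:
        while max_packets is None or received < max_packets:
            data, _sender = sock.recvfrom(65535)
            log.write(data)
            log.flush()  # keep the file current in case the target dies mid-oops
            received += 1
    return received


# Example use on the logging machine (runs until interrupted):
#   capture_netconsole(open_listener(), "netconsole.log")
```

Start the listener before booting the target, so that the first oops is on record.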
Comment 18 JerryD 2019-11-03 21:54:50 UTC
(In reply to Borislav Petkov from comment #17)
> The best thing to do is if you have a second machine and both machines have
> serial out and you connect them with a null modem cable:
> 

OK, I have the serial ports and NULL modem. I have installed minicom on the logging machine and it is looking for a /dev/modem

  Cannot create lockfile for /dev/modem: No such file or directory

I think I have to create a dev entry, but I am not sure.

I am now running FC31 and have the kernel 5.4.0-rc5 running fine with older BIOS on the test machine.  As soon as I can get the logging machine working I will bump the BIOS and see what we get.
Comment 19 JerryD 2019-11-03 23:12:39 UTC
OK, got past the minicom issue.

$ ln -s /dev/ttyUSB0 /dev/modem
Comment 20 JerryD 2019-11-04 00:07:36 UTC
Here it is from the serial console:

[   86.732505] kernel BUG at mm/mmap.c:680!                                       
[   86.732518] invalid opcode: 0000 [#1] SMP NOPTI                                
[   86.732520] CPU: 10 PID: 2466 Comm: gnome-terminal Not tainted 5.4.0-rc5 #1    
[   86.732521] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X5709
[   86.732525] RIP: 0010:__vma_adjust+0x7e8/0xaf0                                 
[   86.732526] Code: 28 fa ff ff 4d 85 f6 0f 84 74 01 00 00 4c 89 f6 4c 89 e7 e8 b
[   86.732527] RSP: 0018:ffffad8fc1347c80 EFLAGS: 00010206                        
[   86.732529] RAX: ffff936b783b3920 RBX: 0000000000000000 RCX: 00007fda11fda000  
[   86.732530] RDX: ffff936b7ca39c40 RSI: 00007fda11fbe000 RDI: ffff936b783b3958  
[   86.732531] RBP: ffff936b7ca391f8 R08: ffff936b7ca38b20 R09: 0000000000000058  
[   86.732531] R10: 0000000000000000 R11: ffff936b783b3958 R12: ffff936b783b3900  
[   86.732532] R13: 0000000000000000 R14: ffff936b5de97cd0 R15: ffff936b783b3920  
[   86.732533] FS:  0000000000000000(0000) GS:ffff936b8ee80000(0000) knlGS:0000000
[   86.732534] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                  
[   86.732535] CR2: 00007fda12001840 CR3: 00000003f82d0000 CR4: 0000000000340ee0  
[   86.732536] Call Trace:                                                        
[   86.732540]  __split_vma+0x17e/0x190                                           
[   86.732541]  __do_munmap+0x46f/0x4b0                                           
[   86.732543]  mmap_region+0x232/0x660                                           
[   86.732545]  do_mmap+0x391/0x530                                               
[   86.732547]  ? security_mmap_file+0x7a/0xe0                                    
[   86.732549]  vm_mmap_pgoff+0xd4/0x120                                          
[   86.732551]  ksys_mmap_pgoff+0x1b1/0x270                                       
[   86.732553]  do_syscall_64+0x5b/0x180                                          
[   86.732556]  entry_SYSCALL_64_after_hwframe+0x44/0xa9                          
[   86.732558] RIP: 0033:0x7fda12a93056                                           
[   86.732559] Code: 1f 44 00 00 f3 0f 1e fa 41 f7 c1 ff 0f 00 00 75 2b 55 48 89 0
[   86.732560] RSP: 002b:00007ffcbc189248 EFLAGS: 00000206 ORIG_RAX: 0000000000009
[   86.732561] RAX: ffffffffffffffda RBX: 0000000000000812 RCX: 00007fda12a93056  
[   86.732561] RDX: 0000000000000001 RSI: 0000000000018000 RDI: 00007fda11fbe000  
[   86.732562] RBP: 00007fda11fbe000 R08: 0000000000000003 R09: 0000000000040000  
[   86.732563] R10: 0000000000000812 R11: 0000000000000206 R12: 00007fda12a588b0  
[   86.732564] R13: 0000000000000001 R14: 00007ffcbc189698 R15: 0000000000000003  
[   86.732565] Modules linked in: uinput xt_CHECKSUM xt_MASQUERADE nf_nat_tftp nfp
[   86.732586]  acpi_cpufreq ip_tables hid_logitech_hidpp uas usb_storage hid_loge
[   86.732595] ---[ end trace 2f0a8b1548c04a0f ]---                               
[   86.732597] RIP: 0010:__vma_adjust+0x7e8/0xaf0                                 
[   86.732598] Code: 28 fa ff ff 4d 85 f6 0f 84 74 01 00 00 4c 89 f6 4c 89 e7 e8 b
[   86.732599] RSP: 0018:ffffad8fc1347c80 EFLAGS: 00010206                        
[   86.732599] RAX: ffff936b783b3920 RBX: 0000000000000000 RCX: 00007fda11fda000  
[   86.732600] RDX: ffff936b7ca39c
Comment 21 JerryD 2019-11-04 00:37:44 UTC
After a restart and playing with Firefox, this one just died:

        Starting Hold until boot process finishes up...                   
[  OK  ] Started Network Manager Wait Online.                              
[  840.057082] BUG: kernel NULL pointer dereference, address: 0000000000000001
[  840.057091] #PF: supervisor read access in kernel mode                  
[  840.057092] #PF: error_code(0x0000) - not-present page                  
[  840.057093] PGD 0 P4D 0                                                 
[  840.057095] Oops: 0000 [#1] SMP NOPTI                                   
[  840.057097] CPU: 4 PID: 1804 Comm: gnome-shell Not tainted 5.4.0-rc5 #1 
[  840.057098] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X5709
[  840.057102] RIP: 0010:irq_work_run_list+0x2b/0x70                       
[  840.057104] Code: 55 41 54 55 53 9c 58 0f 1f 44 00 00 25 00 02 00 00 75 57
Comment 22 Borislav Petkov 2019-11-05 10:29:56 UTC
Can you upload the vmlinux from that 5.4.0-rc5 kernel with which you caught the oops along with full dmesg somewhere?

Thx.
Comment 23 JerryD 2019-11-06 02:48:52 UTC
(In reply to Borislav Petkov from comment #22)
> Can you upload the vmlinux from that 5.4.0-rc5 kernel with which you caught
> the oops along with full dmesg somewhere?
> 
> Thx.

See:

https://drive.google.com/drive/folders/1dK1CpT0s61Meg3mYImbMLgL00-Ss4Zib

Three files from today.

consoleatboot: is several reboots from cold power start captured on minicom. On these the lockups are occurring before the normal bootup finishes and at different places.

5.4-rc5-dmesg-F3bios.txt: is a complete output of dmesg after changing bios back to the F3 version with all looking good in case you need all info from the kernel.

vmlinuz-5.4.0-rc5: is the kernel executable being booted. Compiled using gcc 9.2.1 (Red Hat 9.2.1-1)

Hope this helps.  The kernel is failing on the AGESA 1.0.0.3ABBA bios but not on AGESA 1.0.0.3.

I see a new bios F10a has just arrived for the bios with AGESA 1.0.0.4B so I will be trying that soon and reporting back.  (I also see a kernel RC6 has arrived.)  I will try it a bit later.
Comment 24 JerryD 2019-11-07 01:52:57 UTC
(In reply to JerryD from comment #23)
> I see a new bios F10a has just arrived for the bios with AGESA 1.0.0.4B so I
> will be trying that soon and reporting back.  (I also see a kernel RC6 has
> arrived.)  I will try it a bit later.

No improvement with this latest BIOS.
Comment 25 Borislav Petkov 2019-11-10 09:41:07 UTC
(In reply to JerryD from comment #23)
> consoleatboot: is several reboots from cold power start captured on minicom.
> On these the lockups are occurring before the normal bootup finishes and at
> different places.

Ok, so far so good.

Please do exactly the same but try catching the output from the very beginning until it oopses. The *important* thing you need to do before booting the target machine is to do in minicom:

Ctrl-A L

so that it says

 +-----------------------------------------+
 |Capture to which file?                   |
 |> minicom.cap                            |
 +-----------------------------------------+

then you press enter and *then* you boot the target machine, wait until it oopses and then you send me that minicom.cap file.

Because as it is, consoleatboot is cut off to the terminal width and missing important information like the whole Code: line which has the instruction bytes around rIP.

Other than that, do not change anything on the setup so that I can use the vmlinuz you uploaded earlier.

Thx.
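
A quick way to check whether a capture was cut off at the terminal width is to look for the bracketed byte on each Code: line, which marks the instruction byte at RIP. Below is a rough Python sketch, using the truncated line from comment 20 and the complete line from comment 4 as samples. The helper name is hypothetical and the check is only a heuristic.

```python
import re

# A complete Code: line contains one byte wrapped in <>, e.g. "<3b>".
MARKER = re.compile(r"<[0-9a-f]{2}>")


def truncated_code_lines(log_text):
    """Return the Code: lines that lack the <xx> marker for the byte at RIP."""
    suspicious = []
    for line in log_text.splitlines():
        # "Code: Bad RIP value." legitimately has no marker, so skip it.
        if "Code:" not in line or "Bad RIP value" in line:
            continue
        if not MARKER.search(line):
            suspicious.append(line.strip())
    return suspicious


sample = """\
[   86.732526] Code: 28 fa ff ff 4d 85 f6 0f 84 74 01 00 00 4c 89 f6 4c 89 e7 e8 b
[  655.858442] Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 77 01 00 00 <3b> 77 08 0f 82 6e 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
[ 2158.328360] Code: Bad RIP value.
"""

for line in truncated_code_lines(sample):
    print("truncated:", line)
```

Only the first sample line is reported, since it ends before the marker.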
Comment 26 JerryD 2019-11-11 04:18:29 UTC
Created attachment 285851 [details]
minicom.cap as requested

Note: I am using USB serial ports, so the capture begins right after line 10. Before that was a previous attempt that just locked up in the middle of the capture. Some of these failures are pretty abrupt; others stay alive for a while. In the one captured here, at line 52 you will see a second RIP. That one just stopped midway through dumping the code. After that I noticed the GNOME desktop was still responding, and that's when I attempted a shutdown. It did a few steps and hung on the stop job, so ignore that one.
Comment 27 JerryD 2019-11-11 04:22:15 UTC
(In reply to JerryD from comment #26)
> Created attachment 285851 [details]
> minicom.cap as requested
> 
> Note: I am using USB serial ports so the capture begins right after line 10.
> Before that was a previous attempt that just locked in the middle of the
> capture. Some of these fails are pretty abrupt, others will stay alive for a
> while. The one captured here, at line 52 you will see a second RIP, That one
> just stopped mid way on dumping the code. After that I noticed the gnome
> desktop was still responding and that's when I attempted a shutdown. It did a
> few steps and hung on the stop job, so ignore that one.

To clarify, logging starts after the USB interfaces are brought up, so I watch the VGA console before this and the boot is clean. The bug manifests after I start exercising the system such as using Firefox or attempting a multithreaded gcc build.
Comment 28 Borislav Petkov 2019-11-11 08:53:31 UTC
Ok, so this looks like it fails here:

[ 123.219007] Code: 08 e8 b5 48 ea ff eb be 0f 0b 90 0f 1f 44 00 00 41 57 41 56 41 55 4c 8d 6f 78 41 54 55 53 48 83 ec 08 4c 8b 67 78 48 89 3c 24 <49> 8b 0c 24 4d 39 e5 0f 84 7b 01 00 00 4d 8d 7c 24 f0 48 8d 59 f0
All code
========
   0:	08 e8                	or     %ch,%al
   2:	b5 48                	mov    $0x48,%ch
   4:	ea                   	(bad)  
   5:	ff                   	(bad)  
   6:	eb be                	jmp    0xffffffffffffffc6
   8:	0f 0b                	ud2    
   a:	90                   	nop
   b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  10:	41 57                	push   %r15
  12:	41 56                	push   %r14
  14:	41 55                	push   %r13
  16:	4c 8d 6f 78          	lea    0x78(%rdi),%r13
  1a:	41 54                	push   %r12
  1c:	55                   	push   %rbp
  1d:	53                   	push   %rbx
  1e:	48 83 ec 08          	sub    $0x8,%rsp
  22:	4c 8b 67 78          	mov    0x78(%rdi),%r12
  26:	48 89 3c 24          	mov    %rdi,(%rsp)
  2a:*	49 8b 0c 24          	mov    (%r12),%rcx		<-- trapping instruction
  2e:	4d 39 e5             	cmp    %r12,%r13
  31:	0f 84 7b 01 00 00    	je     0x1b2
  37:	4d 8d 7c 24 f0       	lea    -0x10(%r12),%r15
  3c:	48 8d 59 f0          	lea    -0x10(%rcx),%rbx


unlink_anon_vmas:
1:	call	__fentry__
	.section __mcount_loc, "a",@progbits
	.quad 1b
	.previous
	pushq	%r15	#
	pushq	%r14	#
	pushq	%r13	#
# mm/rmap.c:386: 	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
	leaq	120(%rdi), %r13	#, _92
# mm/rmap.c:378: {
	pushq	%r12	#
	pushq	%rbp	#
	pushq	%rbx	#
	subq	$8, %rsp	#,
# mm/rmap.c:386: 	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
	movq	120(%rdi), %r12	# vma_23(D)->anon_vma_chain.next, __mptr
# mm/rmap.c:378: {
	movq	%rdi, (%rsp)	# vma, %sfp
# mm/rmap.c:386: 	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
	cmpq	%r13, %r12	# _92, _6
	movq	(%r12), %rcx	# MEM[(struct anon_vma_chain *)__mptr_24 + -16B].same_vma.next, tmp153

which is this last instruction and which says that one of the next pointers on this &vma->anon_vma_chain list is NULL. Leading to the NULL ptr deref.

Now, this is a different splat than the one I've seen before and you don't seem to have ECC memory so have you checked whether your DIMMs are all ok?

IOW, can you try swapping them out one by one and each time booting the machine to see whether the error happens. And log it each time. Also, before you do, build your kernel with CONFIG_DEBUG_INFO=y because it is easier to search in vmlinux with it.

Thx.
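
To tie the Code: line to the disassembly above: the byte in angle brackets is the first byte of the trapping instruction, and its position in the dump gives the offset marked with `*` in the listing. A small Python sketch that finds it, with the Code: line copied from this comment (the function name is made up for illustration):

```python
def trapping_offset(code_line):
    """Return the index of the <xx> byte within an oops Code: line, or None."""
    tokens = code_line.split("Code:", 1)[1].split()
    for offset, token in enumerate(tokens):
        if token.startswith("<") and token.endswith(">"):
            return offset
    return None


code_line = (
    "[ 123.219007] Code: 08 e8 b5 48 ea ff eb be 0f 0b 90 0f 1f 44 00 00 "
    "41 57 41 56 41 55 4c 8d 6f 78 41 54 55 53 48 83 ec 08 4c 8b 67 78 "
    "48 89 3c 24 <49> 8b 0c 24 4d 39 e5 0f 84 7b 01 00 00 4d 8d 7c 24 f0 "
    "48 8d 59 f0"
)

offset = trapping_offset(code_line)
print(hex(offset))  # 0x2a, the line marked "2a:*" in the listing above
```

This is also why a Code: line cut off at the terminal width is of little use: without the marked byte there is no way to tell which instruction trapped.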
Comment 29 Borislav Petkov 2019-11-11 09:53:24 UTC
(In reply to JerryD from comment #23)
> The kernel is failing on the AGESA 1.0.0.3ABBA bios but
> not on AGESA 1.0.0.3.

Are you sure the failure doesn't happen at all with the old BIOS?

If so, you might consider not upgrading it. Or you can try reporting the issue but good luck reporting BIOS bugs...
Comment 30 Rafal Kupiec 2019-11-11 16:20:29 UTC
(In reply to JerryD from comment #24)
> (In reply to JerryD from comment #23)
> > I see a new bios F10a has just arrived for the bios with AGESA 1.0.0.4B so I
> > will be trying that soon and reporting back.  (I also see a kernel RC6 has
> > arrived.)  I will try it a bit later.
> 
> No improvement with this latest BIOS.

Check F10c
Comment 31 JerryD 2019-11-13 03:11:45 UTC
(In reply to Rafal Kupiec from comment #30)
> (In reply to JerryD from comment #24)
> > (In reply to JerryD from comment #23)
> > > I see a new bios F10a has just arrived for the bios with AGESA 1.0.0.4B so I
> > > will be trying that soon and reporting back.  (I also see a kernel RC6 has
> > > arrived.)  I will try it a bit later.
> > 
> > No improvement with this latest BIOS.
> 
> Check F10c

Just installed F10c BIOS.  The system will boot, but any operations that require paging memory get into trouble.  For example, when I ssh into the machine and ran 'git svn --rebase' (which fails to run) I get the following as displayed by dmesg.  Note, I am still shelled in and the machine is not completely locking up. The rest of dmesg before this is clean as a whistle.

[   50.843414] fbcon: Taking over console
[   50.851866] Console: switching to colour frame buffer device 320x90
[ 2158.328293] BUG: Bad page map in process git-svn  pte:ffffffff8427dd5d pmd:3f1d26067
[ 2158.328297] addr:000055ddee37e000 vm_flags:08100073 anon_vma:ffff9eee457fdaa8 mapping:0000000000000000 index:55ddee37e
[ 2158.328301] file:(null) fault:0x0 mmap:0x0 readpage:0x0
[ 2158.328304] CPU: 8 PID: 35359 Comm: git-svn Not tainted 5.4.0-rc5 #1
[ 2158.328305] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F10c 11/08/2019
[ 2158.328306] Call Trace:
[ 2158.328313]  dump_stack+0x5c/0x80
[ 2158.328317]  print_bad_pte.cold+0x6a/0xd2
[ 2158.328319]  vm_normal_page+0xbe/0xd0
[ 2158.328321]  unmap_page_range+0x50b/0xd10
[ 2158.328323]  unmap_vmas+0x7a/0xf0
[ 2158.328326]  exit_mmap+0xad/0x1a0
[ 2158.328330]  mmput+0x61/0x130
[ 2158.328333]  flush_old_exec+0x261/0x680
[ 2158.328335]  load_elf_binary+0x32e/0x1640
[ 2158.328337]  ? get_user_pages_remote+0x14f/0x240
[ 2158.328339]  ? _cond_resched+0x15/0x30
[ 2158.328342]  ? selinux_inode_permission+0x107/0x1c0
[ 2158.328345]  search_binary_handler+0x84/0x1b0
[ 2158.328347]  __do_execve_file.isra.0+0x4f0/0x850
[ 2158.328350]  __x64_sys_execve+0x35/0x40
[ 2158.328353]  do_syscall_64+0x5b/0x180
[ 2158.328355]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2158.328357] RIP: 0033:0x7efd6acb744b
[ 2158.328360] Code: Bad RIP value.
[ 2158.328361] RSP: 002b:00007ffcf1c16958 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
[ 2158.328363] RAX: ffffffffffffffda RBX: 000055ddede4c680 RCX: 00007efd6acb744b
[ 2158.328364] RDX: 000055ddeef031b0 RSI: 000055ddeee473b0 RDI: 00007ffcf1c16960
[ 2158.328365] RBP: 00007ffcf1c16a50 R08: 0000000000000fff R09: 000055ddede36a3a
[ 2158.328366] R10: 0000000000000008 R11: 0000000000000202 R12: 000055ddeee473b0
[ 2158.328366] R13: 000055ddeef031b0 R14: 0000000000000004 R15: 000055ddede36a25
[ 2158.328368] Disabling lock debugging due to kernel taint
[ 2158.328798] BUG: Bad rss-counter state mm:000000006486e0cf type:MM_ANONPAGES val:1

So far I have been able to boot into Windows and it runs fine. However, today I noticed the Windows boot got clobbered, either from installing Fedora 31 or from the numerous crashes I have encountered while doing a lot of disk I/O.  So I will restore it and confirm Windows does indeed run fine on this F10c BIOS.
Comment 32 JerryD 2019-11-13 03:20:58 UTC
(In reply to Borislav Petkov from comment #29)
> (In reply to JerryD from comment #23)
> > The kernel is failing on the AGESA 1.0.0.3ABBA bios but
> > not on AGESA 1.0.0.3.
> 
> Are you sure the failure doesn't happen at all with the old BIOS?
> 
> If so, you might consider not upgrading it. Or you can try reporting the
> issue but good luck reporting BIOS bugs...

Yes, I can always just go back to the original BIOS; however, it is obvious to me there is a bug lurking here.  I think the inconsistency we see is because the bug is related to memory management and paging, which implies the IOMMU. I plan to do some more experiments regarding this and will report back.
Comment 33 Borislav Petkov 2019-11-13 11:12:01 UTC
(In reply to JerryD from comment #32)
> Yes, I can always just go back to the original BIOS however, it is obvious
> to me there is a bug lurking here.

How is it obvious to you?

> I think the inconsistency we see is
> because the bug  is related to memory management and paging which implies
> the IOMMU

So that "implication" certainly doesn't make any sense to me.

>  I plan to do some more experiments regarding this and will report
> back.

It looks to me more like a BIOS issue on that particular platform. Especially since you're the only one reporting such random corruptions. And the fact that the corruptions are random could mean a hw issue, like one of your DIMMs is faulty and the old BIOS is hiding it properly, or ...

IOW, I don't think we can yet clearly pinpoint what the issue is. At least I can't.
Comment 34 JerryD 2019-11-14 02:43:30 UTC
(In reply to Borislav Petkov from comment #33)
> (In reply to JerryD from comment #32)
> > Yes, I can always just go back to the original BIOS however, it is obvious
> > to me there is a bug lurking here.
> 
> How is it obvious to you?

The lockup and kernel oops indicate a bug.  It does not indicate that the bug is software or hardware, kernel or otherwise.

> 
> > I think the inconsistency we see is
> > because the bug  is related to memory management and paging which implies
> > the IOMMU
> 
> So that "implication" certainly doesn't make any sense to me.

Agreed, I was being too specific; it appears to be memory related.

> 
> >  I plan to do some more experiments regarding this and will report
> > back.

I finally got around to your suggestion to pull and swap memory modules. Pulled the second bank and rebooted, and it locked up as seen before. Pulled the first bank and swapped the second bank back in... booted, and I could build gcc without a hitch.

I have to confess I have never seen a BIOS "mask" a hardware problem before, but you said it, and sure enough it seems to be the case. The only issue I have seen now is that when I request a system restart from the GNOME menu, it appears to start to do so but stops and does not actually restart. Power off works fine.

Thank you for your helpful advice; I have learned a lot.  I will monitor this for a few days before closing the bug.
Comment 35 JerryD 2019-11-14 03:12:13 UTC
Oops, wrote too soon. After about 20 minutes it stopped working. I am back on the original BIOS and will query the BIOS provider.
Comment 36 Borislav Petkov 2019-11-14 11:56:10 UTC
But did you run with the original BIOS for a sufficiently long time to confirm that it really is ok and only updating to the newer BIOS is starting to cause those random corruptions?
Comment 37 JerryD 2019-11-16 02:51:06 UTC
(In reply to Borislav Petkov from comment #36)
> But did you run with the original BIOS for a sufficiently long time to
> confirm that it really is ok and only updating to the newer BIOS is starting
> to cause those random corruptions?

Yes, I used the original BIOS for several months before the new ones were released.

Also, I have been using Windows all day today on the most recent F10c BIOS without a glitch: two different browsers, an IDE, a shell, Ubuntu inside a Windows virtual machine, and email.  I have also been running CPU-Z benchmarks and stress tests without an issue. I ran the stress test for about 15 minutes while doing all the other things. (Side note: the Ryzen 3600 keeps up quite well with all these things at once.)
Comment 38 Borislav Petkov 2019-11-16 10:56:12 UTC
Yah, OEMs test only with windoze so no wonder it works under windoze. It could be some unfortunate interaction between the kernel and the newer BIOSes.

What you could do is try older kernels and try to pinpoint, if possible, whether there's a kernel version which is ok. If you establish that, then bisecting should help us narrow it down...
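
For illustration only, the narrowing-down idea can be sketched as a binary search over an ordered list of kernel versions. This is a minimal sketch, not the procedure git bisect implements on commits. The version list and the `is_bad` callback are hypothetical; in practice `is_bad` means "boot that kernel, run the gcc build, and see whether it locks up". It assumes the oldest version is known good, the newest is known bad, and every version after the first bad one is also bad. With failures as random as the ones in this bug, a "good" verdict needs a long enough test run to be trusted.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_bad() is True.

    Assumptions (hypothetical, for illustration):
      - versions is ordered oldest to newest
      - versions[0] is known good and versions[-1] is known bad
      - once a version is bad, all later versions are bad too
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # the first bad version is at mid or earlier
        else:
            lo = mid  # the first bad version is after mid
    return versions[hi]
```

Each round halves the remaining range, so six candidate kernels need at most three test boots.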
Comment 39 JerryD 2019-11-16 16:22:37 UTC
(In reply to Borislav Petkov from comment #38)
> Yah, OEMs test only with windoze so no wonder it works under windoze. It
> could be some unfortunate interaction between the kernel and the newer
> BIOSes.
> 
> What you could do is try older kernels and try to pinpoint, if possible,
> whether there's a kernel version which is ok. If you establish that, then
> bisecting should help us narrow it down...

My thoughts exactly: go back to older kernels and see if I can find something.
Comment 40 JerryD 2019-11-17 16:00:01 UTC
For what it's worth, I loaded the 5.0 kernel from Fedora koji and ran a -j10 build of gcc to exercise the system.

$ cat 5dump 
[   10.437509] virbr0: port 1(virbr0-nic) entered listening state
[   10.456870] virbr0: port 1(virbr0-nic) entered disabled state
[   12.508858] usb 1-3.2: reset high-speed USB device number 5 using xhci_hcd
[  625.463440] BUG: Bad page state in process make  pfn:27eddd
[  625.463443] page:ffffe09cc9fb7740 count:0 mapcount:0 mapping:0000000000000000 index:0x127
[  625.463445] flags: 0x17fffe00080004(uptodate|swapbacked)
[  625.463447] raw: 0017fffe00080004 ffffe09cca22b508 ffffb0e557c27d40 0000000000000000
[  625.463448] raw: 0000000000000127 0000000000000000 00000000ffffffff ffff97e70c41b000
[  625.463448] page dumped because: page still charged to cgroup
[  625.463448] page->mem_cgroup:ffff97e70c41b000
[  625.463449] Modules linked in: uinput devlink xt_CHECKSUM ipt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat snd_hda_codec_hdmi edac_mce_amd kvm_amd kvm uvcvideo videobuf2_vmalloc snd_hda_codec_generic irqbypass videobuf2_memops ledtrig_audio videobuf2_v4l2 videobuf2_common videodev crct10dif_pclmul snd_hda_intel crc32_pclmul snd_usb_audio pl2303 snd_hda_codec joydev media snd_hda_core snd_usbmidi_lib wmi_bmof ghash_clmulni_intel snd_rawmidi snd_hwdep snd_seq snd_seq_device bfq snd_pcm snd_timer sp5100_tco snd i2c_piix4 soundcore ccp pcc_cpufreq acpi_cpufreq nouveau
[  625.463472]  video mxm_wmi drm_kms_helper ttm drm crc32c_intel igb dca i2c_algo_bit wmi pinctrl_amd fuse
[  625.463477] CPU: 4 PID: 183178 Comm: make Not tainted 5.0.0-1.fc31.x86_64 #1
[  625.463478] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F10c 11/08/2019
[  625.463478] Call Trace:
[  625.463484]  dump_stack+0x5c/0x80
[  625.463486]  bad_page.cold+0x7f/0xb3
[  625.463487]  free_unref_page_list+0xf1/0x270
[  625.463489]  release_pages+0x1b7/0x480
[  625.463491]  tlb_flush_mmu_free+0x28/0x50
[  625.463493]  arch_tlb_finish_mmu+0x16/0xa0
[  625.463494]  tlb_finish_mmu+0x1f/0x30
[  625.463495]  exit_mmap+0xcd/0x1a0
[  625.463497]  mmput+0x61/0x130
[  625.463498]  do_exit+0x286/0xbd0
[  625.463499]  ? handle_mm_fault+0xdc/0x210
[  625.463501]  ? do_user_addr_fault+0x218/0x450
[  625.463502]  do_group_exit+0x3a/0xa0
[  625.463503]  __x64_sys_exit_group+0x14/0x20
[  625.463504]  do_syscall_64+0x5b/0x150
[  625.463506]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  625.463508] RIP: 0033:0x7f0899974416
[  625.463510] Code: Bad RIP value.
[  625.463511] RSP: 002b:00007ffc92578978 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  625.463512] RAX: ffffffffffffffda RBX: 00007f0899a6a470 RCX: 00007f0899974416
[  625.463512] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[  625.463513] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
[  625.463513] R10: 00007f089938c84e R11: 0000000000000246 R12: 00007f0899a6a470
[  625.463513] R13: 0000000000000002 R14: 00007f0899a6de68 R15: 0000000000000000
[  625.463515] Disabling lock debugging due to kernel taint
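
When comparing several such logs, it can help to pull out just the "Bad page state" reports. The following is a rough sketch that assumes the two line formats seen in the dump above ("BUG: Bad page state in process ... pfn:..." and "page dumped because: ..."). Other kernel versions may word these lines differently, and the function name is made up for this example.

```python
import re

# Patterns modeled on the lines in the dump above; other kernels may differ.
BAD_PAGE = re.compile(r"BUG: Bad page state in process (\S+)\s+pfn:([0-9a-f]+)")
REASON = re.compile(r"page dumped because: (.+)")


def summarize_bad_pages(log_text):
    """Return one dict per 'Bad page state' report found in a dmesg capture."""
    reports = []
    for line in log_text.splitlines():
        match = BAD_PAGE.search(line)
        if match:
            reports.append({
                "process": match.group(1),
                "pfn": int(match.group(2), 16),  # pfn is printed in hex
                "reason": None,
            })
            continue
        match = REASON.search(line)
        # Attach the reason to the most recent report that lacks one.
        if match and reports and reports[-1]["reason"] is None:
            reports[-1]["reason"] = match.group(1).strip()
    return reports
```

Calling `summarize_bad_pages(open("5dump").read())` on the capture above should yield a single report for process `make`.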
Comment 41 Tom 2019-11-21 16:41:22 UTC
For what it’s worth, I experienced issues strikingly similar to JerryD's.

I have a Ryzen 3600 on an MSI X570 motherboard.

I initially installed Ubuntu 18.10 with kernel 4.18 in a dual-boot setup, which was stable. After I upgraded the BIOS to address AMD issues, the system became unstable and would freeze as soon as I opened any desktop application that put a load on the system, no matter how slight. I abandoned it for a few months, since I had the dual-boot option and the BIOS upgrade had solved an issue I was having in Windows.

Recently I tried different distros and combinations with kernels 4.19, 5.3, and 5.4-rc4, but I experienced the same issue.
My motherboard is an MSI MPG X570, and they have a beta BIOS at the moment (7C37vA61), which I decided to try before rolling back to a pre-issue version to test.

I tried to install Manjaro with 4.19 and the same issues existed, but I decided to give Ubuntu 19.10 with 5.3 a run, and for the moment it's very stable, though I have yet to give it a full stress test. Unlike with any other install, I was able to open and use Firefox and load a video, which I never could since this issue started.

Once I find time, I will give it a full run. 

There do not appear to be any release notes with the BIOS update to detail the changes, apart from the following limited info.
  - Improved system boot up time.
  - Improved PCI-E device compatibility.

Is there anything I can run on my system that would help troubleshoot this issue for you?

Apologies for the limited information. I am not at your technical level, but I hoped it might help to see someone else with the same hardware and a similar issue.
Comment 42 Borislav Petkov 2019-11-21 17:23:47 UTC
(In reply to Tom from comment #41)
> Is there anything I can run on my system that would help troubleshoot this
> issue for you?

First of all, please open a separate bugzilla entry and add me to CC. If it turns out that your issue really is the same as Jerry's, we can merge them.

Then, boot the latest upstream kernel - 5.4-rc8 - and see if it freezes. If it does, try catching the freeze over serial (how to do that is explained in this bug) and upload that log to your fresh bugzilla entry.

Don't hesitate to ask if there are questions.

Thx.
Comment 43 Tom 2019-11-21 18:11:54 UTC
OK, I will wait until I have the serial output before I open a new bugzilla entry. This is my main machine and it's needed for work, so it will be the weekend before I can get to it.

Thank You
Comment 44 JerryD 2020-03-01 18:44:19 UTC
After taking a break from fighting this, I tried again with the latest vendor BIOS, which had additional options in its setup. I disabled C-states and selected typical idle current, and the system is stable. I would consider this a BIOS bug and not a kernel bug, though it could have been a combination of these earlier.

Closing this bug report.
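
For anyone wanting to check from Linux which idle states the kernel sees after a BIOS change like the one described above, here is a small sketch that reads the cpuidle sysfs directory for one CPU. It assumes the usual layout of `stateN/name` and `stateN/disable` files under `/sys/devices/system/cpu/cpu0/cpuidle`. The function name is made up for this example, and whether a BIOS-level C-state setting is reflected there depends on the idle driver in use.

```python
from pathlib import Path


def list_idle_states(cpuidle_dir):
    """Return (state, name, disabled) tuples for one CPU's cpuidle directory.

    Entries are sorted by directory name, so state10 would sort before
    state2 on a system with that many states.
    """
    states = []
    for state_dir in sorted(Path(cpuidle_dir).glob("state[0-9]*")):
        name = (state_dir / "name").read_text().strip()
        disable_file = state_dir / "disable"
        # Treat a 'disable' file containing "1" as disabled (assumption).
        disabled = disable_file.exists() and disable_file.read_text().strip() == "1"
        states.append((state_dir.name, name, disabled))
    return states
```

On a running system this would be called as `list_idle_states("/sys/devices/system/cpu/cpu0/cpuidle")`. An empty result suggests no cpuidle states are exposed for that CPU.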
