Bug 212917

Summary: Unable to handle kernel NULL pointer dereference at virtual address
Product: Memory Management
Reporter: rudi
Component: Other
Assignee: Andrew Morton (akpm)
Status: NEW ---
Severity: normal
CC: geraldogabriel, rudi, stefan.bruens
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.12
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Simple patch to downclock affected Rock Pi N10 units

Description rudi 2021-05-01 11:12:36 UTC
On a Radxa Rock Pi N10 running the current U-Boot 2021.04 and kernel 5.12 (mainline config and dtbs), recompiling the kernel with “make -j6 Image dtbs” failed with this kernel error:

[ 7422.215547] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 7422.216335] Mem abort info:
[ 7422.216583]   ESR = 0x86000006
[ 7422.216855]   EC = 0x21: IABT (current EL), IL = 32 bits
[ 7422.217322]   SET = 0, FnV = 0
[ 7422.217593]   EA = 0, S1PTW = 0
[ 7422.217870] user pgtable: 4k pages, 39-bit VAs, pgdp=000000000f3f3000
[ 7422.218438] [0000000000000000] pgd=0000000006bac003, p4d=0000000006bac003, pud=0000000006bac003, pmd=0000000000000000
[ 7422.219375] Internal error: Oops: 86000006 [#1] SMP
[ 7422.219808] Modules linked in: autofs4
[ 7422.220146] CPU: 4 PID: 9677 Comm: cc1 Not tainted 5.12.0 #1
[ 7422.220645] Hardware name: Radxa ROCK Pi N10 (DT)
[ 7422.221061] pstate: 20000085 (nzCv daIf -PAN -UAO -TCO BTYPE=--)
[ 7422.221590] pc : 0x0
[ 7422.221789] lr : __mod_lruvec_state+0x30/0x5c
[ 7422.222182] sp : ffffffc0165ab970
[ 7422.222475] x29: ffffffc0165ab970 x28: 0000000000001000 
[ 7422.222946] x27: ffffff800e561200 x26: ffffff80139638c0 
[ 7422.223416] x25: fffffffe00e20380 x24: 0000000000045a8e 
[ 7422.223886] x23: 0000000000000008 x22: 00000000ffffffff 
[ 7422.224358] x21: 0000000000000014 x20: ffffffffffffffff 
[ 7422.224828] x19: ffffff800c865c00 x18: 0000000000000000 
[ 7422.225298] x17: 0000000000000000 x16: 0000000000000000 
[ 7422.225768] x15: 0000000000000000 x14: 0000000000001000 
[ 7422.226238] x13: 0000000000000040 x12: 0000000000000000 
[ 7422.226708] x11: 000000003880d000 x10: ffffff8013ac6000 
[ 7422.227178] x9 : ffffffc0101a3678 x8 : 0000000000000300 
[ 7422.227648] x7 : 0000000000004000 x6 : 00000000000000d3 
[ 7422.228118] x5 : 000000003880c000 x4 : 0000000000000000 
[ 7422.228587] x3 : 0000000000002015 x2 : ffffffc0113543f0 
[ 7422.229057] x1 : 0000000000000024 x0 : ffffffc0e6463000 
[ 7422.229528] Call trace:
[ 7422.229748]  0x0
[ 7422.229915]  __mod_lruvec_page_state+0x5c/0x60
[ 7422.230312]  mod_lruvec_page_state+0x34/0x4c
[ 7422.230694]  clear_page_dirty_for_io+0x80/0xc4
[ 7422.231089]  mpage_submit_page+0x38/0x90
[ 7422.231438]  mpage_map_and_submit_buffers+0x21c/0x268
[ 7422.231884]  ext4_writepages+0x65c/0x908
[ 7422.232234]  do_writepages+0x40/0x80
[ 7422.232551]  __filemap_fdatawrite_range+0x64/0x94
[ 7422.232969]  filemap_flush+0x20/0x28
[ 7422.233287]  ext4_alloc_da_blocks+0x68/0x7c
[ 7422.233658]  ext4_release_file+0x30/0xd0
[ 7422.234006]  __fput+0xe8/0x1f8
[ 7422.234281]  ____fput+0x14/0x1c
[ 7422.234563]  task_work_run+0x88/0xac
[ 7422.234883]  do_notify_resume+0x214/0x2c4
[ 7422.235240]  work_pending+0xc/0x1c0
[ 7422.235557] Code: bad PC value
[ 7422.235831] ---[ end trace a189587e728b7e44 ]---
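For anyone decoding the "Mem abort info" lines above, here is a minimal Python sketch of how the ESR value splits into the printed fields. The bit positions follow the Arm ESR_ELx layout for instruction aborts; the function name decode_esr is hypothetical and this is not kernel code.

```python
def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR_ELx value for an instruction abort into fields."""
    iss = esr & 0x1FFFFFF              # ISS: bits [24:0]
    return {
        "EC": (esr >> 26) & 0x3F,      # exception class, bits [31:26]
        "IL": (esr >> 25) & 0x1,       # 1 means a 32-bit instruction
        "SET": (iss >> 11) & 0x3,      # synchronous error type, bits [12:11]
        "FnV": (iss >> 10) & 0x1,      # FAR not valid, bit 10
        "EA": (iss >> 9) & 0x1,        # external abort type, bit 9
        "S1PTW": (iss >> 7) & 0x1,     # stage 2 fault on stage 1 walk, bit 7
        "IFSC": iss & 0x3F,            # instruction fault status code, bits [5:0]
    }


fields = decode_esr(0x86000006)
print({name: hex(value) for name, value in fields.items()})
```

EC comes out as 0x21 and SET, FnV, EA and S1PTW are all zero, matching the log. IFSC 0x06 is a level 2 translation fault, which is consistent with the empty pmd entry printed for address 0.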
Comment 1 rudi 2021-05-01 11:14:06 UTC
The OS itself is a patched Ubuntu 20.04.2
Comment 2 Stefan Brüns 2021-05-26 16:14:08 UTC
I see the same, along with all kinds of other memory errors, such as userspace segfaults and "stack smashing" faults.

When I boot with the kernel command line parameter maxcpus=4, i.e. with the two A72 cores disabled, the system runs fine even under stress.

Offlining the two A72 cores via sysfs has the same effect, and bringing them back online makes the system unreliable again. This is completely repeatable.
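A rough sketch of the sysfs offlining described here, assuming the usual RK3399 numbering in which cpu4 and cpu5 are the two A72 cores. The helper name set_online and its dry_run flag are hypothetical; actually writing to these files requires root.

```python
from pathlib import Path

# Assumption: on RK3399 the A72 (big) cores are cpu4 and cpu5.
BIG_CORES = (4, 5)
CPU_ROOT = Path("/sys/devices/system/cpu")


def set_online(cpu: int, online: bool, dry_run: bool = True) -> str:
    """Write 1 or 0 to the CPU hotplug file and return the equivalent shell command."""
    path = CPU_ROOT / f"cpu{cpu}" / "online"
    value = "1" if online else "0"
    if not dry_run:
        path.write_text(value)  # needs root; takes effect immediately
    return f"echo {value} > {path}"


for cpu in BIG_CORES:
    # dry_run defaults to True, so this only prints what would be written.
    print(set_online(cpu, online=False))
```

Calling set_online(cpu, online=True, dry_run=False) brings a core back, which is the step that makes the system unreliable again in this report.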
Comment 3 rudi 2021-06-27 15:47:08 UTC
I believe this has been addressed in 5.13. I read the patch somewhere, about adjusting 1 to 2 or 2 to 1… I just can’t find the reference at the moment. Closing the bug.
Comment 5 Geraldo Nascimento 2024-06-18 03:44:35 UTC
I also own a Rock Pi N10 and, like Stefan Brüns, I have been disabling the A72 cores since forever. Lately, however, I have still been getting crashes under load. This is still very much in testing, but I downclocked the SBC through a change to rk3399pro-rock-pi-n10.dts:

-#include "rk3399-opp.dtsi"
+#include "rk3399-t-opp.dtsi"

This enforces a 1512 MHz clock on the big cores and a 1008 MHz clock on the little cores.
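To confirm that a downclocked DTB took effect, the cpufreq limits can be read back from sysfs. This is a sketch only: it assumes policy0 is the little cluster and policy4 the big one (typical for RK3399), takes the caps from the figures quoted above, and uses the hypothetical helper names over_limit and read_max_khz.

```python
from pathlib import Path

# Expected caps in kHz, from the figures quoted in this comment.
# Assumption: policy0 is the little (A53) cluster, policy4 the big (A72) one.
EXPECTED_KHZ = {"policy0": 1_008_000, "policy4": 1_512_000}
CPUFREQ_ROOT = Path("/sys/devices/system/cpu/cpufreq")


def over_limit(max_khz: dict, expected: dict = EXPECTED_KHZ) -> list:
    """Return the policies whose maximum frequency exceeds the expected cap."""
    return sorted(p for p, khz in max_khz.items() if khz > expected.get(p, khz))


def read_max_khz(root: Path = CPUFREQ_ROOT) -> dict:
    """Read cpuinfo_max_freq (reported in kHz) for every cpufreq policy present."""
    return {
        p.name: int((p / "cpuinfo_max_freq").read_text())
        for p in root.glob("policy*")
        if (p / "cpuinfo_max_freq").exists()
    }


if __name__ == "__main__":
    print(over_limit(read_max_khz()) or "all clusters within the expected caps")
```

An empty result means no cluster is allowed to run faster than the caps listed in EXPECTED_KHZ.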

With this DTB patch I can run all six cores, apparently smoothly. There is a performance penalty, but it no longer seems to crash randomly.

Still testing; happy to continue reporting.
Comment 6 Geraldo Nascimento 2024-06-18 04:01:26 UTC
Created attachment 306469
Simple patch to downclock affected Rock Pi N10 units
Comment 7 Geraldo Nascimento 2024-06-19 00:21:06 UTC
(In reply to rudi from comment #4)
> I believe this has contributed to the fix
> - https://lore.kernel.org/lkml/20210511211335.2935163-1-pgwipeout@gmail.com/
> Along with the DTS
> - https://lore.kernel.org/lkml/20210527105943.GA441@7698f5da3a10/
> - https://lore.kernel.org/lkml/20210527122911.GA1640@7698f5da3a10/

Rudi, by the way, I'm running the Rock Pi N10 with an active cooling fan. I was aware that it would crash once it hit the 80 °C cpufreq threshold, so I run with the "performance" cpufreq governor and active cooling.

And I still see the crashes on Linux 6.10-rc4.

The only thing that helped was downclocking the CPU to RK3399-T frequencies. No more crashes so far.

By the way, I'm on Gentoo on this box, so it's easy to crash it with "emerge -av --emptytree @world", which triggers a recompilation of all packages. Even with distcc helping it used to crash. That was with the two big A72 cores turned off; with them on, it would fail exactly as Stefan Brüns described (stack smashing detected, segfaults, etc.). Even with only the little cores it would still crash under high load, sometimes within a few hours, sometimes after more than a day.

Now, with the CPU downclocked, the SBC seems to be running fine under high load with all six cores, big and little, online.