Bug 215389
Created attachment 300115 [details]
kernel .config (5.15.10, PowerMac G4 DP)
Probably hard to track. Any chance to bisect the issue?
Bisecting will take some time. I'll report back as soon as I have any findings.
I was able to easily reproduce this on 5.15.13, however not on 5.16-rc8. But on 5.16-rc8 I got this the 3rd time I ran the glibc testsuite:
[...]
watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/u4:7:32566]
Modules linked in: auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc ghash_generic gf128mul gcm ccm algif_aead des_generic libdes ctr cbc ecb algif_skcipher aes_generic libaes cmac sha512_generic sha1_generic sha1_powerpc md5 md5_ppc md4 b43legacy mac80211 libarc4 snd_aoa_codec_tas snd_aoa_fabric_layout snd_aoa cfg80211 rfkill evdev mac_hid therm_windtunnel firewire_ohci firewire_core crc_itu_t sr_mod cdrom snd_aoa_i2sbus snd_aoa_soundbus snd_pcm snd_timer snd ohci_pci soundcore radeon ohci_hcd ehci_pci ehci_hcd hwmon i2c_algo_bit drm_ttm_helper ttm ssb drm_kms_helper pcmcia pcmcia_core usbcore 8250_pci syscopyarea sysfillrect sysimgblt usb_common 8250 8250_base serial_mctrl_gpio fb_sys_fops pkcs8_key_parser fuse drm drm_panel_orientation_quirks configfs
CPU: 1 PID: 32566 Comm: kworker/u4:7 Not tainted 5.16.0-rc8-PowerMacG4 #1
Workqueue: zswap1 compact_page_work
NIP: c0078730 LR: c0078724 CTR: 00000000
REGS: f698dd40 TRAP: 0900 Not tainted (5.16.0-rc8-PowerMacG4)
MSR: 00009032 <EE,ME,IR,DR,RI> CR: 44008242 XER: 20000000
GPR00: c01856c8 f698de00 ca20b540 00000001 d4c73ffc 00000000 de0bd0bc aaaaaaaa
GPR08: aaaaaaaa 00000000 ffffffff 00000004 84002242 00000000 c00553fc 00000001
GPR16: 00000002 d4c73fc0 c0980000 002ec02c 00000040 d4c7300c d4c7302e c19c4bc0
GPR24: c19c4bc0 c0185d74 ef0d0040 d4c73008 d4c74a4c 0000007f de0bd000 d4c74a54
NIP [c0078730] arch_write_lock+0x28/0x3c
LR [c0078724] arch_write_lock+0x1c/0x3c
Call Trace:
[f698de00] [c0185d74] release_z3fold_page_locked+0x0/0x44 (unreliable)
[f698de20] [c01856c8] do_compact_page+0x334/0x508
[f698de80] [c004f354] process_one_work+0x1d4/0x288
[f698dec0] [c004f814] worker_thread+0x1b8/0x260
[f698df00] [c0055514] kthread+0x118/0x11c
[f698df30] [c0016268] ret_from_kernel_thread+0x5c/0x64
Instruction dump:
39610020 4bfa7668 9421ffe0 7c0802a6 90010024 93e1001c 7c7f1b78 7fe3fb78
4bffff0d 2c030000 41a20014 813f0000 <2c090000> 4182ffe8 4bfffff4 39610020
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 32566 Comm: kworker/u4:7 Tainted: G L 5.16.0-rc8-PowerMacG4 #1
Workqueue: zswap1 compact_page_work
Call Trace:
[f698dbb0] [c03e7f04] dump_stack_lvl+0x60/0x80 (unreliable)
[f698dbd0] [c0037734] panic+0x128/0x30c
[f698dc30] [c00c6334] watchdog_nmi_enable+0x0/0x10
[f698dc70] [c0097fc8] __hrtimer_run_queues+0xf0/0x154
[f698dcb0] [c0098b7c] hrtimer_interrupt+0xf8/0x25c
[f698dcf0] [c000d70c] timer_interrupt+0x20c/0x294
[f698dd30] [c0004a50] Decrementer_virt+0x100/0x104
--- interrupt: 900 at arch_write_lock+0x28/0x3c
NIP: c0078730 LR: c0078724 CTR: 00000000
REGS: f698dd40 TRAP: 0900 Tainted: G L (5.16.0-rc8-PowerMacG4)
MSR: 00009032 <EE,ME,IR,DR,RI> CR: 44008242 XER: 20000000
GPR00: c01856c8 f698de00 ca20b540 00000001 d4c73ffc 00000000 de0bd0bc aaaaaaaa
GPR08: aaaaaaaa 00000000 ffffffff 00000004 84002242 00000000 c00553fc 00000001
GPR16: 00000002 d4c73fc0 c0980000 002ec02c 00000040 d4c7300c d4c7302e c19c4bc0
GPR24: c19c4bc0 c0185d74 ef0d0040 d4c73008 d4c74a4c 0000007f de0bd000 d4c74a54
NIP [c0078730] arch_write_lock+0x28/0x3c
LR [c0078724] arch_write_lock+0x1c/0x3c
--- interrupt: 900
[f698de00] [c0185d74] release_z3fold_page_locked+0x0/0x44 (unreliable)
[f698de20] [c01856c8] do_compact_page+0x334/0x508
[f698de80] [c004f354] process_one_work+0x1d4/0x288
[f698dec0] [c004f814] worker_thread+0x1b8/0x260
[f698df00] [c0055514] kthread+0x118/0x11c
[f698df30] [c0016268] ret_from_kernel_thread+0x5c/0x64
Rebooting in 40 seconds..
Which is interesting because on bug #213837 my not yet finished bisect is also giving hints z3fold may be the problem...
Next I'll check whether the issue is reproducible on 5.15.x when I use zbud or zsmalloc for zswap instead of z3fold.
Ok, with zswap lzo/zbud I also get this memory corruption on 5.15.13. So most probably it's not lzo/z3fold but something else. I'll start a bisect then...
Created attachment 300318 [details]
bisect.log
Ok, finally got it. Interesting find:
# git bisect bad
db972a3787d12b1ce9ba7a31ec376d8a79e04c47 is the first bad commit
commit db972a3787d12b1ce9ba7a31ec376d8a79e04c47
Author: Christophe Leroy <christophe.leroy@csgroup.eu>
Date: Tue Dec 8 05:24:19 2020 +0000
    powerpc/powermac: Fix low_sleep_handler with CONFIG_VMAP_STACK
    low_sleep_handler() can't restore the context from standard stack because the stack can hardly be accessed with MMU OFF.
    Store everything in a global storage area instead of storing a pointer to the stack in that global storage area.
    To avoid a complete churn of the function, still use r1 as the pointer to the storage area during restore.
    Fixes: cd08f109e262 ("powerpc/32s: Enable CONFIG_VMAP_STACK")
    Reported-by: Giuseppe Sacco <giuseppe@sguazz.it>
    Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
    Tested-by: Giuseppe Sacco <giuseppe@sguazz.it>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    Link: https://lore.kernel.org/r/e3e0d8042a3ba75cb4a9546c19c408b5b5b28994.1607404931.git.christophe.leroy@csgroup.eu
arch/powerpc/platforms/Kconfig.cputype | 2 +-
arch/powerpc/platforms/powermac/sleep.S | 132 ++++++++++++++------------------
2 files changed, 60 insertions(+), 74 deletions(-)
Interesting ... Though confusing.
Looking closer, in fact that might be a false positive. The huge difference with that bad commit is that:
- Before the commit, the kernel is built _without_ CONFIG_VMAP_STACK
- After the commit, the kernel is built _with_ CONFIG_VMAP_STACK
Would you be able to perform the following tests:
- Disable VMAP_STACK and see if the problem still occurs.
- Disable ADB_PMU and see if the problem still occurs.
With the version which precedes the bad commit, can you disable ADB_PMU and enable VMAP_STACK and see what happens?
Created attachment 300354 [details]
dmesg (5.10-rc2 with ADB_PMU disabled, PowerMac G4 DP)
Took a little time but I double checked the results (one time using distcc '-j8 -l2', one time native '-j3') to be sure:
ADB_PMU disabled, VMAP_STACK disabled ... "neverending build"
ADB_PMU enabled, VMAP_STACK disabled ... works ok
ADB_PMU disabled, VMAP_STACK enabled ... "neverending build"
ADB_PMU enabled, VMAP_STACK enabled ... memory corruption
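The four combinations above can be written down as .config fragments, which makes it harder to mix up which kernel was built with what. This is only a minimal sketch, assuming the usual .config syntax ("CONFIG_X=y" / "# CONFIG_X is not set"); the helper names config_fragment and test_matrix are made up for illustration, and each fragment would still have to be merged into a full .config and resolved by Kconfig before building.

```python
from itertools import product

# Option names come from the thread; everything else is illustrative.
OPTIONS = ("ADB_PMU", "VMAP_STACK")


def config_fragment(settings):
    """Render a {option: enabled} mapping in .config syntax."""
    lines = []
    for name, enabled in settings.items():
        if enabled:
            lines.append(f"CONFIG_{name}=y")
        else:
            lines.append(f"# CONFIG_{name} is not set")
    return "\n".join(lines)


def test_matrix(options=OPTIONS):
    """Return every on/off combination of the given options."""
    return [
        dict(zip(options, values))
        for values in product((False, True), repeat=len(options))
    ]


if __name__ == "__main__":
    for settings in test_matrix():
        print(config_fragment(settings))
        print("---")
```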
The version used was git db972a3787d12b1ce9ba7a31ec376d8a79e04c47, which is the one checked out before the last 'git bisect bad' ends the bisect.
The "neverending builds" happen when I run this kernel with ADB_PMU disabled. The G4 runs for several hours building (?) without reaching the glibc test stage. With ADB_PMU enabled I get a pass or memory corruption much earlier.
Also, without ADB_PMU I get a kernel panic when rebooting or shutting down the G4. The G4 does not reboot or power off in this case either; I need to switch it off manually.
Thanks for the tests.
I'm not surprised that the system doesn't power off or reboot without ADB_PMU because the PMU manages power. The "neverending build" is maybe because the PMU also manages the RTC clock and without it you get inconsistent time?
Anyway, it looks like there is indeed something linked to VMAP_STACK.
I'm wondering whether you could be running out of vmalloc space. I initially thought you were using KASAN, but it seems not according to your .config.
Could you try reducing CONFIG_LOWMEM_SIZE to 0x28000000 for instance and see if the memory corruption still happens? To do this you'll need CONFIG_ADVANCED_OPTIONS and CONFIG_LOWMEM_SIZE_BOOL.
(In reply to Christophe Leroy from comment #10)
> I'm wondering whether you could be running out of vmalloc space. I initially
> thought you were using KASAN, but it seems not according to your .config.
Correct, I was not using KASAN. I use it only for testing -rc kernels or when I am particularly wary. This memory corruption I noticed during regular usage. Seems running the kernel with slub_debug=FZP page_poison=1 is a good thing. ;)
> Could you try reducing CONFIG_LOWMEM_SIZE to 0x28000000 for instance and see
> if the memory corruption still happens ?
Thanks, that did the trick! With CONFIG_LOWMEM_SIZE=0x28000000 the memory corruption is gone on VMAP_STACK enabled kernels. Tested it additionally on current 5.16.4 where this works too.
Created attachment 300774 [details]
dmesg (5.18-rc3, PowerMac G4 DP)
Another try running the glibc-2.34 testsuite on kernel 5.18-rc3. It looks like the problem is still there.
[...]
pagealloc: memory corruption
fffdfff0: 00 00 00 00 ....
CPU: 0 PID: 21222 Comm: install Not tainted 5.18.0-rc3-PMacG4 #5
Call Trace:
[f8085a70] [c06e8820] dump_stack_lvl+0x80/0xc0 (unreliable)
[f8085a90] [c02c0b2c] __kernel_unpoison_pages+0x1c0/0x204
[f8085ae0] [c02a4cb0] get_page_from_freelist+0xcb4/0xeb0
[f8085ba0] [c02a5754] __alloc_pages+0x184/0x11b4
[f8085c70] [c0230d50] __filemap_get_folio+0x224/0x598
[f8085cf0] [c0240ebc] pagecache_get_page+0x20/0x88
[f8085d10] [c04e2600] prepare_pages+0xf8/0x358
[f8085d60] [c04e4e54] btrfs_buffered_write+0x334/0x850
[f8085e20] [c04ea598] btrfs_do_write_iter+0x3a8/0x768
[f8085e80] [c02ee25c] vfs_write+0x364/0x488
[f8085f00] [c02ee52c] ksys_write+0x78/0x128
[f8085f30] [c001e1a8] ret_from_syscall+0x0/0x2c
--- interrupt: c00 at 0x5c5d08
NIP: 005c5d08 LR: 005c5ce0 CTR: c0289c9c
REGS: f8085f40 TRAP: 0c00 Not tainted (5.18.0-rc3-PMacG4)
MSR: 0000f932 <EE,PR,FP,ME,IR,DR,RI> CR: 28022464 XER: 20000000
GPR00: 00000004 af820720 a7ced760 00000006 a77aa000 00020000 00000000 00000000
GPR08: 00120000 a77a9000 00000008 403d77ca 403d7497 0077fff4 00000000 00020000
GPR16: 00000000 af8208c8 00020000 00000000 00000000 af821d2b a77aa000 00020000
GPR24: 00000000 00000006 7ff00000 00000006 a77aa000 00020000 006c7ff4 00020000
NIP [005c5d08] 0x5c5d08
LR [005c5ce0] 0x5c5ce0
--- interrupt: c00
page:ef4c4ec4 refcount:1 mapcount:0 mapping:00000000 index:0x1 pfn:0x31069
flags: 0x80000000(zone=2)
raw: 80000000 00000100 00000122 00000000 00000001 00000000 ffffffff 00000001
raw: 00000000
page dumped because: pagealloc: corrupted page detail
Created attachment 300775 [details]
kernel .config (5.18-rc3, PowerMac G4 DP)
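As a reading aid for the page dump above, the pfn can be turned into a physical address and compared with the lowmem boundary. This is a rough sketch under assumptions that are not stated in the dump itself: 4 KiB pages, RAM starting at physical address 0, and a lowmem size of 0x30000000 (which I believe is the ppc32 Kconfig default for CONFIG_LOWMEM_SIZE, but please check your own .config). The function names are hypothetical.

```python
PAGE_SHIFT = 12                   # assumption: 4 KiB pages
DEFAULT_LOWMEM_SIZE = 0x30000000  # assumption: default CONFIG_LOWMEM_SIZE


def pfn_to_phys(pfn, page_shift=PAGE_SHIFT):
    """Physical address of the first byte of page frame `pfn`.

    Assumes RAM starts at physical address 0.
    """
    return pfn << page_shift


def is_above_lowmem(pfn, lowmem_size=DEFAULT_LOWMEM_SIZE):
    """True if the page frame lies at or above the lowmem boundary."""
    return pfn_to_phys(pfn) >= lowmem_size


if __name__ == "__main__":
    pfn = 0x31069  # value printed as "pfn:" in the page dump
    phys = pfn_to_phys(pfn)
    print(f"pfn {pfn:#x} -> phys {phys:#010x} ({phys / 2**20:.1f} MiB)")
    print("above lowmem boundary:", is_above_lowmem(pfn))
```

Under these assumptions the reported page sits above the lowmem boundary, which would fit the zone=2 in the dump if zone 2 is the highmem zone on this configuration.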
Do you mean it still happens with the default values, or does it also happen with the reduced CONFIG_LOWMEM_SIZE?
It definitely still happens with the default values. I can test with the reduced CONFIG_LOWMEM_SIZE next week and report back.
Created attachment 300929 [details]
dmesg (5.18-rc6, CONFIG_LOWMEM_SIZE=0x28000000, PowerMac G4 DP)
(In reply to Christophe Leroy from comment #14)
> Do you mean it still happens with the default values, or it also happens
> with the reduced CONFIG_LOWMEM_SIZE ?
Turns out the memory corruption also happens with the reduced CONFIG_LOWMEM_SIZE=0x28000000. Tested again on v5.18-rc6, both with CONFIG_LOWMEM_SIZE=0x28000000 and without.
Created attachment 300930 [details]
kernel .config (5.18-rc6, CONFIG_LOWMEM_SIZE=0x28000000, PowerMac G4 DP)
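To make the effect of the CONFIG_LOWMEM_SIZE change more concrete, here is a small back-of-the-envelope calculation. It is a sketch, assuming the kernel's linear mapping starts at 0xc0000000 (PAGE_OFFSET, an assumption here) and that the default lowmem size is 0x30000000; the helper names are invented. The space above lowmem is shared by vmalloc, ioremap, highmem mappings and other fixed areas, so the real vmalloc area is smaller than the number printed.

```python
PAGE_OFFSET = 0xC0000000    # assumption: start of the kernel linear mapping
ADDRESS_SPACE_TOP = 1 << 32  # 32-bit virtual address space
MIB = 1 << 20


def lowmem_virt_end(lowmem_size):
    """Virtual address where the linear (lowmem) mapping ends."""
    return PAGE_OFFSET + lowmem_size


def space_above_lowmem(lowmem_size):
    """Bytes of kernel virtual address space left above lowmem."""
    return ADDRESS_SPACE_TOP - lowmem_virt_end(lowmem_size)


if __name__ == "__main__":
    for size in (0x30000000, 0x28000000):
        print(
            f"LOWMEM_SIZE={size:#x}: lowmem ends at "
            f"{lowmem_virt_end(size):#x}, "
            f"{space_above_lowmem(size) // MIB} MiB above it"
        )
```

With these assumptions, reducing lowmem from 0x30000000 to 0x28000000 moves the end of the linear mapping from 0xf0000000 down to 0xe8000000 and leaves 128 MiB more address space above it.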
Ok, and another problem during building via distcc on the G4, still with LOWMEM_SIZE=0x28000000 (kernel v5.17.6).
[...]
Oops: Kernel stack overflow, sig: 11 [#1]
BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac
Modules linked in: auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc ghash_generic gf128mul gcm ccm algif_aead des_generic libdes ctr cbc ecb algif_skcipher aes_generic libaes cmac sha512_generic sha1_generic sha1_powerpc md5 md5_ppc md4 hid_generic b43legacy usbhid mac80211 hid libarc4 cfg80211 snd_aoa_codec_tas rfkill snd_aoa_fabric_layout snd_aoa evdev mac_hid therm_windtunnel firewire_ohci firewire_core crc_itu_t sr_mod cdrom ohci_pci 8250_pci radeon snd_aoa_i2sbus ohci_hcd snd_aoa_soundbus ssb snd_pcm ehci_pci snd_timer pcmcia snd soundcore pcmcia_core hwmon 8250 ehci_hcd i2c_algo_bit 8250_base drm_ttm_helper serial_mctrl_gpio ttm drm_kms_helper usbcore syscopyarea sysfillrect sysimgblt usb_common fb_sys_fops pkcs8_key_parser fuse drm drm_panel_orientation_quirks configfs
CPU: 0 PID: 24122 Comm: sh Not tainted 5.17.6-gentoo-PMacG4 #1
NIP: c0018614 LR: 00000000 CTR: c103cbe0
REGS: e7fe9f50 TRAP: 0000 Not tainted (5.17.6-gentoo-PMacG4)
MSR: 00001030 <ME,IR,DR> CR: 00000001 XER: c000e234
GPR00: a78bfe90 80002288 00000000 d6a5e1a0 e991de60 0068c6c4 a7a3ff98 c1099000
GPR08: 00000000 e991dec0 d6a5e1a0 80002288 005900d0 0068fff4 00000000 00000007
GPR16: 00000029 00000007 00bc44b0 a7ddafe8 a78bfe90 fffff000 00000000 00000000
GPR24: 005900d0 0068c6c4 c0dcc7a0 c1402b48 caa899c0 c103cbe0 c4ce9400 c08fe234
NIP [c0018614] interrupt_return+0x17c/0x190
LR [00000000] 0x0
Call Trace:
Instruction dump:
40860018 7ccff120 80c10028 80010010 80210014 4c000064 7ccff120 7d3043a6
392100c0 80c10028 80010010 80210014 <91210000> 7d3042a6 4c000064 7c000828
---[ end trace 0000000000000000 ]---
@Christophe: Would it be helpful for these issues to try a KASAN build?
Yes, KASAN can bring some additional input. Maybe start with CONFIG_KFENCE, it is lighter than KASAN.
For the above problem, maybe CONFIG_DEBUG_STACKOVERFLOW can help.
DEBUG_STACKOVERFLOW and KFENCE have been enabled already in the builds I did here (see the attached kernel .config). However, if I enable (inline) KASAN the kernel won't boot at all. I get dropped into the OpenFirmware console with:
[...]
Finalizing device tree... using OF tree (promptr=ff847240)
Invalid memory access at %SRR0: 40000000 %SRR1: 00000000
Increasing the stack size (CONFIG_THREAD_SHIFT) might avoid the stack overflows and allow you to debug the original issue in isolation.
Created attachment 300977 [details]
dmesg (5.18-rc6, CONFIG_LOWMEM_SIZE=0x28000000, outline KASAN, PowerMac G4 DP)
I increased THREAD_SHIFT to 14 and used outline KASAN still with CONFIG_LOWMEM_SIZE=0x28000000. The memory corruption output looks slightly different (but not much):
[...]
pagealloc: memory corruption
f5fcfff0: 00 00 00 00 ....
CPU: 1 PID: 29742 Comm: ld.so.1 Not tainted 5.18.0-rc6-PMacG4 #7
Call Trace:
[eea3ba90] [c09890d4] dump_stack_lvl+0x80/0xc0 (unreliable)
[eea3bab0] [c03cce40] __kernel_unpoison_pages+0x208/0x250
[eea3bb00] [c03a2e48] post_alloc_hook+0x108/0x144
[eea3bb30] [c03a66e0] get_page_from_freelist+0x9d4/0x12dc
[eea3bc70] [c03a7ad0] __alloc_pages+0x23c/0x1570
[eea3bde0] [c0379c8c] handle_mm_fault+0x610/0x1240
[eea3bed0] [c002e2d4] ___do_page_fault+0x19c/0x850
[eea3bf10] [c002ebbc] do_page_fault+0x28/0x5c
[eea3bf30] [c000433c] DataAccess_virt+0x124/0x17c
--- interrupt: 300 at 0x6fe0338c
NIP: 6fe0338c LR: 6fe032c4 CTR: 6fe033e0
REGS: eea3bf40 TRAP: 0300 Not tainted (5.18.0-rc6-PMacG4)
MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 48002262 XER: 20000000
DAR: 046a5000 DSISR: 42000000
GPR00: 6ffbcb94 afc45940 a7c95560 046a4fe4 8a000000 000127e0 03e59a8b 00000003
GPR08: 046a5004 046a5000 04621cfc 6fe03170 6fe032c4 6ffece34 00000000 6ffef34d
GPR16: 02dea020 04416750 00000003 01f8cbec 02de9fa0 01f8c660 00000000 00000000
GPR24: afc45aa0 6ffef37c afc45a18 04678c7c 0007630c 04678c7c 6ff76ff4 045f5990
NIP [6fe0338c] 0x6fe0338c
LR [6fe032c4] 0x6fe032c4
--- interrupt: 300
page:e739d6ec refcount:1 mapcount:0 mapping:00000000 index:0x1 pfn:0x290a3
flags: 0x80000000(zone=2)
raw: 80000000 00000100 00000122 00000000 00000001 00000000 ffffffff 00000001
raw: 00000000
page dumped because: pagealloc: corrupted page details
[...]
With THREAD_SHIFT=14 the stack issue does not show up.
A kernel with inline KASAN and otherwise the same setup won't boot and shows me this at the OpenFirmware prompt:
[...]
Finalizing device tree... using OF tree (promptr=ff847240)
Invalid memory access at %SRR0: 40000000 %SRR1: 00000000
Created attachment 300978 [details]
kernel .config (5.18-rc6, CONFIG_LOWMEM_SIZE=0x28000000, outline KASAN, PowerMac G4 DP)
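For readers unfamiliar with THREAD_SHIFT: the kernel stack size is a power of two, with THREAD_SHIFT as the exponent (THREAD_SIZE = 1 << THREAD_SHIFT). A tiny sketch of what the values tried in this thread mean in bytes; the function name is made up for illustration.

```python
def thread_size_bytes(thread_shift):
    """Kernel stack size in bytes for a given THREAD_SHIFT."""
    return 1 << thread_shift


if __name__ == "__main__":
    for shift in (13, 14, 15):
        size = thread_size_bytes(shift)
        print(f"THREAD_SHIFT={shift}: {size} bytes ({size // 1024} KiB)")
```

So moving from 13 to 14, as done for the KASAN run above, doubles the per-thread kernel stack.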
Seems like with inline KASAN your kernel is far too big compared to what we support at the moment:
c2468000 T __end_rodata
c2800000 T __init_begin
c2800000 T _sinittext
c2801644 T prom_init
The init text section is beyond the 32 MB boundary, which means that prom_init and other functions are no longer called directly but via a trampoline.
c000000c <__start>:
c000000c: 2c 05 00 00 cmpwi r5,0
c0000010: 41 82 00 1c beq c000002c <__start+0x20>
c0000014: 42 9f 00 05 bcl 20,4*cr7+so,c0000018 <__start+0xc>
c0000018: 7d 08 02 a6 mflr r8
c000001c: 3d 08 00 00 addis r8,r8,0
c0000020: 39 08 ff e8 addi r8,r8,-24
c0000024: 48 00 38 e5 bl c0003908 <setup_disp_bat+0x30>
...
c0003908: 3d 80 c2 80 lis r12,-15744
c000390c: 39 8c 16 44 addi r12,r12,5700
c0003910: 7d 89 03 a6 mtctr r12
c0003914: 4e 80 04 20 bctr
And it cannot work because at that time the kernel is not yet relocated to its final location.
There was the same problem with PPC64 and it was fixed by 24d33ac5b8ff ("powerpc/64s: Make prom_init require RELOCATABLE"). Don't know if a similar approach could work.
The Kernel stack overflow looks odd. The value of R1 is wrong and LR is NULL. Don't know how we ended up here, but probably not by a real stack overflow.
Note that THREAD_SHIFT is set to 14 when using KASAN:
config THREAD_SHIFT
    int "Thread shift" if EXPERT
    range 13 15
    default "15" if PPC_256K_PAGES
    default "14" if PPC64
    default "14" if KASAN
    default "13"
    help
      Used to define the stack size. The default is almost always what you want. Only change this if you know what you are doing.
I opened a new bug for the stack issue which contains a bit more data. Hopefully the output is of some help (see bug #216041).
Created attachment 301302 [details]
dmesg (5.19-rc4, PowerMac G4 DP)
Re-tried on v5.19-rc4 (without additional patches) + KFENCE. My findings so far:
1. Memory corruption still persists.
2. Even without KASAN I need THREAD_SHIFT=14 or else I get the stack overflow from bug #216041.
3. Memory corruption also happens with CONFIG_LOWMEM_SIZE=0x28000000.
4. But the "neverending build" issue mentioned in comment #9 is gone (be it with default .config or CONFIG_LOWMEM_SIZE=0x28000000).
[...]
pagealloc: memory corruption
fffdfff0: 00 00 00 00 ....
CPU: 0 PID: 29136 Comm: localedef Not tainted 5.19.0-rc4-PMacG4 #3
Call Trace:
[f39b3c20] [c05eb9c0] dump_stack_lvl+0x60/0x90 (unreliable)
[f39b3c40] [c0232fb0] __kernel_unpoison_pages+0x1a8/0x1ec
[f39b3c90] [c02170dc] get_page_from_freelist+0xc20/0xe70
[f39b3d50] [c0217bdc] __alloc_pages+0x18c/0xe80
[f39b3e10] [c01f46b4] wp_page_copy+0x214/0xa1c
[f39b3e80] [c01fa0b8] handle_mm_fault+0x720/0xd64
[f39b3f00] [c00215dc] do_page_fault+0x1d4/0x830
[f39b3f30] [c000433c] DataAccess_virt+0x124/0x17c
--- interrupt: 300 at 0x669410
NIP: 00669410 LR: 006693e4 CTR: 00000000
REGS: f39b3f40 TRAP: 0300 Not tainted (5.19.0-rc4-PMacG4)
MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 84002462 XER: 20000000
DAR: a7a3cce8 DSISR: 0a000000
GPR00: 0066961c afd34060 a7bd3000 01a069bc 01b76d60 00000009 a4e0c05a 0005ccd8
GPR08: 01b76140 a7a3cce8 a7a43e44 400a713a 44002862 0068fe34 01b8d730 00000001
GPR16: 00000000 01a069bc 01a069f8 01a06990 01b8d170 01a06894 0000000f 00000009
GPR24: 01b76d60 a4e0c05a 0000018d a7ad9f00 a79e0010 000041cb 00697cdc 01a069bc
NIP [00669410] 0x669410
LR [006693e4] 0x6693e4
--- interrupt: 300
page:ef4bd80c refcount:1 mapcount:0 mapping:00000000 index:0x1 pfn:0x310ab
flags: 0x80000000(zone=2)
raw: 80000000 00000100 00000122 00000000 00000001 00000000 ffffffff 00000001
raw: 00000000
page dumped because: pagealloc: corrupted page details
Created attachment 301303 [details]
kernel .config (5.19-rc4, PowerMac G4 DP)
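The trampoline disassembly quoted earlier in this thread (the `bl` at c0000024 and the `lis`/`addi` pair at c0003908) can be checked with a little arithmetic. The sketch below assumes the standard PowerPC relative branch reach of roughly +/-32 MiB (a 24-bit word offset, sign-extended) and treats the immediates exactly as objdump prints them; the function names are hypothetical and only meant as a reading aid.

```python
BRANCH_MIN = -0x2000000  # lowest byte offset a relative branch can encode
BRANCH_MAX = 0x1FFFFFC   # highest byte offset a relative branch can encode


def within_branch_reach(src, dst):
    """True if `dst` can be reached from `src` by one relative branch."""
    offset = dst - src
    return offset % 4 == 0 and BRANCH_MIN <= offset <= BRANCH_MAX


def lis_addi(hi_imm, lo_imm):
    """Absolute 32-bit address built by `lis rX,hi` then `addi rX,rX,lo`.

    Both immediates are signed 16-bit values as shown by objdump.
    """
    upper = (hi_imm & 0xFFFF) << 16
    return (upper + lo_imm) & 0xFFFFFFFF


if __name__ == "__main__":
    bl_site = 0xC0000024     # address of the `bl` in __start
    trampoline = 0xC0003908  # target of that `bl`
    prom_init = 0xC2801644   # from the symbol list in the thread

    print("trampoline reachable:", within_branch_reach(bl_site, trampoline))
    print("prom_init reachable: ", within_branch_reach(bl_site, prom_init))
    print(f"trampoline jumps to: {lis_addi(-15744, 5700):#x}")
```

The trampoline loads an absolute link-time address into CTR, which matches the explanation that it cannot work before the kernel has been relocated to its final location.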
It's a bit of a stab in the dark, but can you try turning preempt off?
ie. CONFIG_PREEMPT_NONE=y
(In reply to Michael Ellerman from comment #30)
> It's a bit of a stab in the dark, but can you try turning preempt off?
>
> ie. CONFIG_PREEMPT_NONE=y
Just tested that. Backtrace looks a little different but not much.
[...]
pagealloc: memory corruption
fffdfff0: 00 00 00 00 ....
CPU: 0 PID: 29086 Comm: localedef Not tainted 5.19.0-rc4-PMacG4 #2
Call Trace:
[f397bc90] [c05eb280] dump_stack_lvl+0x60/0x90 (unreliable)
[f397bcb0] [c0233128] __kernel_unpoison_pages+0x1a8/0x1ec
[f397bd00] [c02172ec] get_page_from_freelist+0xc20/0xe70
[f397bdc0] [c0217de0] __alloc_pages+0x180/0xe98
[f397be80] [c01fa164] handle_mm_fault+0x450/0xd64
[f397bf00] [c00215d8] do_page_fault+0x1d0/0x82c
[f397bf30] [c000433c] DataAccess_virt+0x124/0x17c
--- interrupt: 300 at 0x83f1b8
NIP: 0083f1b8 LR: 0083e25c CTR: 00000000
REGS: f397bf40 TRAP: 0300 Not tainted (5.19.0-rc4-PMacG4)
MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 88224462 XER: 00000000
DAR: 01232b3c DSISR: 42000000
GPR00: 00840220 af9416c0 a7ca4000 01231b50 00000fe0 00000005 01232b38 00000000
GPR08: 00000ff1 01231b48 0000f4c9 008422b0 01067408 00a2fe34 00000070 01231b50
GPR16: 00000000 00000000 00000000 00000007 0000003f 009ba23c 01067010 009ba79c
GPR24: 00000062 009bdac8 000000fe 009ba79c 00000fe0 009ba764 009b9ff4 00000ff0
NIP [0083f1b8] 0x83f1b8
LR [0083e25c] 0x83e25c
--- interrupt: 300
page:ef4bd80c refcount:1 mapcount:0 mapping:00000000 index:0x1 pfn:0x310ab
flags: 0x80000000(zone=2)
raw: 80000000 00000100 00000122 00000000 00000001 00000000 ffffffff 00000001
raw: 00000000
page dumped because: pagealloc: corrupted page details
The interesting thing is that the memory corruption always seems to happen in the last stage of installing, after building is done, when copying the binaries over from the build directory to the target directory:
[...]
if test -r /var/tmp/portage/sys-libs/glibc-2.34-r13/image//usr/include/gnu/stubs-32.h && cmp -s /var/tmp/portage/sys-libs/glibc-2.34-r13/work/build-ppc-powerpc-unknown-linux-gnu-nptl/stubs.h /var/tmp/portage/sys-libs/glibc-2.34-r13/image//usr/include/gnu/stubs-32.h; \
then echo 'stubs.h unchanged'; \
else /usr/lib/portage/python3.10/ebuild-helpers/xattr/install -c -m 644 /var/tmp/portage/sys-libs/glibc-2.34-r13/work/build-ppc-powerpc-unknown-linux-gnu-nptl/stubs.h /var/tmp/portage/sys-libs/glibc-2.34-r13/image//usr/include/gnu/stubs-32.h; fi
rm -f /var/tmp/portage/sys-libs/glibc-2.34-r13/work/build-ppc-powerpc-unknown-linux-gnu-nptl/stubs.h
make[1]: Leaving directory '/var/tmp/portage/sys-libs/glibc-2.34-r13/work/glibc-2.34'
>>> Completed installing sys-libs/glibc-2.34-r13 into >>> /var/tmp/portage/sys-libs/glibc-2.34-r13/image
* Final size of build directory: 635640 KiB (620.7 MiB)
* Final size of installed tree: 109892 KiB (107.3 MiB)
making executable: /usr/lib/libc.so
compressme : 44.96% ( 3.80 KiB => 1.71 KiB, compressme.zst)
[...]
/var/tmp/portage/sys-libs/glibc-2.34-r13/image/usr/share/doc/glibc-2.34-r13/NEWS : 33.98% ( 315 KiB => 107 KiB, /var/tmp/portage/sys-libs/glibc-2.34-r13/image/usr/share/doc/glibc-2.34-r13/NEWS.zst)
strip: powerpc-unknown-linux-gnu-strip --strip-unneeded -N __gentoo_check_ldflags__ -R .comment -R .GCC.command.line -R .note.gnu.gold-version
/usr/lib/crt1.o /usr/lib/Mcrt1.o /usr/lib/gcrt1.o /usr/lib/Scrt1.o
[...]
/lib/ld.so.1 /usr/lib/audit/sotruss-lib.so /usr/bin/pldd
installsources: rsyncing source files
rsync: [sender] link_stat "/var/tmp/portage/sys-libs/glibc-2.34-r13/work/glibc-2.34/iconv/charmap-kw.gperf" failed: No such file or directory (2)
rsync: [sender] link_stat "/var/tmp/portage/sys-libs/glibc-2.34-r13/work/glibc-2.34/locale/charmap-kw.gperf" failed: No such file or directory (2)
rsync: [sender] link_stat "/var/tmp/portage/sys-libs/glibc-2.34-r13/work/glibc-2.34/locale/locfile-kw.gperf" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1326) [sender=3.2.4]
>>> Installing (1 of 1) sys-libs/glibc-2.34-r13::gentoo
* Defaulting /etc/host.conf:multi to on
* Last-minute run tests with ./ld.so.1 in /lib ...
[...]
(In reply to Michael Ellerman from comment #30)
> It's a bit of a stab in the dark, but can you try turning preempt off?
>
> ie. CONFIG_PREEMPT_NONE=y
Looks like your intuition was not bad at all. ;) CONFIG_PREEMPT_NONE=y had no effect, but when I disable SMP altogether ('# CONFIG_SMP is not set') I get no memory corruption and also no stack overflow issues. Also, no special treatment with Advanced Options or setting THREAD_SHIFT manually was necessary.
The G4 just does fine, albeit with only 1 of its 2 CPUs when SMP is disabled. For testing I did 6 of these glibc testsuite builds in a row without getting issues. With SMP enabled I get memory corruption or a stack overflow at the 1st build almost all of the time.
Created attachment 301337 [details]
dmesg (5.19-rc5, outline KASAN, PowerMac G4 DP)
Re-tested on 5.19-rc5 + https://patchwork.ozlabs.org/project/linuxppc-dev/patch/2ee707512b8b212b079b877f4ceb525a1606a3fb.1656655567.git.christophe.leroy@csgroup.eu/
I can run the kernel with outline KASAN, default THREAD_SHIFT and without advanced options necessary. Also I don't get the stack issue (bug #216041) any longer.
However as long as CONFIG_SMP=y (CONFIG_NR_CPUS=2) is set I still get the memory corruption:
[...]
pagealloc: memory corruption
f5fcfff0: 00 00 00 00 ....
CPU: 1 PID: 27635 Comm: estrip Not tainted 5.19.0-rc5-PMacG4+ #1
Call Trace:
[f380b9b0] [c0829ebc] dump_stack_lvl+0x60/0x90 (unreliable)
[f380b9d0] [c0307528] __kernel_unpoison_pages+0x1d8/0x220
[f380ba20] [c02dd3bc] post_alloc_hook+0x108/0x144
[f380ba50] [c02e0a70] get_page_from_freelist+0x9e0/0x1278
[f380bb90] [c02e1e04] __alloc_pages+0x250/0x1078
[f380bcf0] [c02af098] wp_page_copy+0x128/0xdb8
[f380bde0] [c02b6fdc] handle_mm_fault+0x954/0x1138
[f380bed0] [c0029938] ___do_page_fault+0x250/0x84c
[f380bf10] [c002a168] do_page_fault+0x28/0x5c
[f380bf30] [c000433c] DataAccess_virt+0x124/0x17c
--- interrupt: 300 at 0x65b734
NIP: 0065b734 LR: 0065b708 CTR: 00354600
REGS: f380bf40 TRAP: 0300 Not tainted (5.19.0-rc5-PMacG4+)
MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 82222420 XER: 00000000
DAR: 026fcea0 DSISR: 0a000000
GPR00: 00000000 afbd5250 a7b0c560 026bb5f0 0269deac 026bb628 696e6f64 026fcea0
GPR08: 00000000 00000000 00000000 00354600 42222420 0071fff4 026af620 0072243c
GPR16: 00723b50 007223a4 026b1770 026ec8a0 007222e4 0269de70 02700920 00000001
GPR24: 00721e9c 00721eb8 0072082c 00000000 afbd52ec 00000000 0072608c 00000000
NIP [0065b734] 0x65b734
LR [0065b708] 0x65b708
--- interrupt: 300
page:ef4bd6ec refcount:1 mapcount:0 mapping:00000000 index:0x1 pfn:0x310a3
flags: 0x80000000(zone=2)
raw: 80000000 00000100 00000122 00000000 00000001 00000000 ffffffff 00000001
raw: 00000000
page dumped because: pagealloc: corrupted page details
Created attachment 301639 [details]
dmesg (6.0-rc2, outline KASAN, PowerMac G4 DP)
Getting a more interesting backtrace with v6.0.0-rc2 + outline KASAN:
[...]
BUG: KASAN: slab-out-of-bounds in handle_mm_fault+0x27c/0x10f4 Read of size 4 at addr c32edd48 by task cc1plus/1230 CPU: 1 PID: 1230 Comm: cc1plus Tainted: G T 6.0.0-rc2-PMacG4 #5 Call Trace: [f4d2bd40] [c0864cc4] dump_stack_lvl+0x60/0xa4 (unreliable) [f4d2bd60] [c032b8d8] print_report+0x30c/0x688 [f4d2bdb0] [c032befc] kasan_report+0xe4/0x214 [f4d2be00] [c02ce4d8] handle_mm_fault+0x27c/0x10f4 [f4d2bed0] [c002cc98] ___do_page_fault+0x25c/0x8d0 [f4d2bf10] [c002d560] do_page_fault+0x28/0x6c [f4d2bf30] [c000433c] DataAccess_virt+0x124/0x17c --- interrupt: 300 at 0xfa9c0c0 NIP: 0fa9c0c0 LR: 1066b838 CTR: 0fa9bea4 REGS: f4d2bf40 TRAP: 0300 Tainted: G T (6.0.0-rc2-PMacG4) MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 24022828 XER: 20000000 DAR: 9a352014 DSISR: 42000000 GPR00: 1066b828 af869c10 a7dd1ba0 9a352000 00000000 00000018 9a352018 00000000 GPR08: 11c30000 0fb89a88 099aec30 0fa9bea4 88022444 11c3d4e0 00000001 af869e78 GPR16: 9afeeed0 9afedd60 10cef83c 115fdfd0 af869e80 11603030 9afeeed0 00000002 GPR24: 9afeeed0 9b611f60 115fdfd0 a0c82c30 9afeeed0 00000005 0000006e 9a352000 NIP [0fa9c0c0] 0xfa9c0c0 LR [1066b838] 0x1066b838 --- interrupt: 300 Allocated by task 1: __kasan_slab_alloc+0xd0/0x134 kmem_cache_alloc+0x21c/0x66c __kernfs_new_node+0xe8/0x354 kernfs_new_node+0x84/0xfc __kernfs_create_file+0x50/0x204 sysfs_add_file_mode_ns+0xf4/0x1f0 internal_create_group+0x1f0/0x620 btrfs_init_sysfs+0x264/0x350 init_btrfs_fs+0x24/0x280 do_one_initcall+0xc0/0x34c kernel_init_freeable+0x2c0/0x400 kernel_init+0x28/0x178 ret_from_kernel_thread+0x5c/0x64 The buggy address belongs to the object at c32edd50 which belongs to the cache kernfs_node_cache of size 88 The buggy address is located 8 bytes to the left of 88-byte region [c32edd50, c32edda8) The buggy address belongs to the physical page: page:eee4a954 refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0x32ed flags: 0x200(slab|zone=0) raw: 00000200 00000100 00000122 c1852520 00000000 001e003c ffffffff 00000001 raw: 00000000 page 
dumped because: kasan: bad access detected Memory state around the buggy address: c32edc00: 00 00 fc fc fc fc fc fc 00 00 00 00 00 00 00 00 c32edc80: 00 00 00 fc fc fc fc fc fc 00 00 00 00 00 00 00 >c32edd00: 00 00 00 00 fc fc fc fc fc fc 00 00 00 00 00 00 ^ c32edd80: 00 00 00 00 00 fc fc fc fc fc fc 00 00 00 00 00 c32ede00: 00 00 00 00 00 00 fc fc fc fc fc fc 00 00 00 00 ================================================================== Disabling lock debugging due to kernel taint get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc [...] get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc get_swap_device: Bad swap file entry 64cccccc _swap_info_get: Bad swap file entry 64cccccc BUG: Bad page map in process cc1plus pte:cccccccc pmd:032ed000 addr:9a352000 vm_flags:00100073 anon_vma:c5933ee8 mapping:00000000 index:9a352 file:(null) fault:0x0 mmap:0x0 read_folio:0x0 CPU: 0 PID: 1230 Comm: cc1plus Tainted: G B W T 6.0.0-rc2-PMacG4 #5 Call Trace: [f4d2b9b0] [c0864cc4] dump_stack_lvl+0x60/0xa4 (unreliable) [f4d2b9d0] [c02c5bc4] print_bad_pte+0x2e8/0x364 [f4d2ba60] [c02c9c3c] unmap_page_range+0x964/0xb78 [f4d2bb20] [c02ca590] unmap_vmas+0x168/0x2d4 [f4d2bbd0] [c02d8af0] exit_mmap+0x11c/0x2dc [f4d2bca0] [c005e8f4] 
mmput+0xa0/0x254 [f4d2bcd0] [c006e1b4] do_exit+0x430/0xe08 [f4d2bd50] [c006ed88] do_group_exit+0x68/0x11c [f4d2bd80] [c0086818] get_signal+0xbfc/0xc50 [f4d2be30] [c000edf8] do_notify_resume+0xf0/0x540 [f4d2bf10] [c0019cfc] interrupt_exit_user_prepare_main+0x7c/0xd0 [f4d2bf30] [c00234ac] interrupt_return+0x14/0x190 --- interrupt: 300 at 0xfa9c0c0 NIP: 0fa9c0c0 LR: 1066b838 CTR: 0fa9bea4 REGS: f4d2bf40 TRAP: 0300 Tainted: G B W T (6.0.0-rc2-PMacG4) MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 24022828 XER: 20000000 DAR: 9a352014 DSISR: 42000000 GPR00: 1066b828 af869c10 a7dd1ba0 9a352000 00000000 00000018 9a352018 00000000 GPR08: 11c30000 0fb89a88 099aec30 0fa9bea4 88022444 11c3d4e0 00000001 af869e78 GPR16: 9afeeed0 9afedd60 10cef83c 115fdfd0 af869e80 11603030 9afeeed0 00000002 GPR24: 9afeeed0 9b611f60 115fdfd0 a0c82c30 9afeeed0 00000005 0000006e 9a352000 NIP [0fa9c0c0] 0xfa9c0c0 LR [1066b838] 0x1066b838 --- interrupt: 300 _swap_info_get: Bad swap file entry 64cccccc BUG: Bad page map in process cc1plus pte:cccccccc pmd:032ed000 addr:9a353000 vm_flags:00100073 anon_vma:c5933ee8 mapping:00000000 index:9a353 file:(null) fault:0x0 mmap:0x0 read_folio:0x0 CPU: 0 PID: 1230 Comm: cc1plus Tainted: G B W T 6.0.0-rc2-PMacG4 #5 Call Trace: [f4d2b9b0] [c0864cc4] dump_stack_lvl+0x60/0xa4 (unreliable) [f4d2b9d0] [c02c5bc4] print_bad_pte+0x2e8/0x364 [f4d2ba60] [c02c9c3c] unmap_page_range+0x964/0xb78 [f4d2bb20] [c02ca590] unmap_vmas+0x168/0x2d4 [f4d2bbd0] [c02d8af0] exit_mmap+0x11c/0x2dc [f4d2bca0] [c005e8f4] mmput+0xa0/0x254 [f4d2bcd0] [c006e1b4] do_exit+0x430/0xe08 [f4d2bd50] [c006ed88] do_group_exit+0x68/0x11c [f4d2bd80] [c0086818] get_signal+0xbfc/0xc50 [f4d2be30] [c000edf8] do_notify_resume+0xf0/0x540 [f4d2bf10] [c0019cfc] interrupt_exit_user_prepare_main+0x7c/0xd0 [f4d2bf30] [c00234ac] interrupt_return+0x14/0x190 --- interrupt: 300 at 0xfa9c0c0 NIP: 0fa9c0c0 LR: 1066b838 CTR: 0fa9bea4 REGS: f4d2bf40 TRAP: 0300 Tainted: G B W T (6.0.0-rc2-PMacG4) MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 
24022828 XER: 20000000 DAR: 9a352014 DSISR: 42000000 GPR00: 1066b828 af869c10 a7dd1ba0 9a352000 00000000 00000018 9a352018 00000000 GPR08: 11c30000 0fb89a88 099aec30 0fa9bea4 88022444 11c3d4e0 00000001 af869e78 GPR16: 9afeeed0 9afedd60 10cef83c 115fdfd0 af869e80 11603030 9afeeed0 00000002 GPR24: 9afeeed0 9b611f60 115fdfd0 a0c82c30 9afeeed0 00000005 0000006e 9a352000 NIP [0fa9c0c0] 0xfa9c0c0 LR [1066b838] 0x1066b838 --- interrupt: 300 BUG: Bad page map in process cc1plus pte:00000001 pmd:032ed000 page:eedd8000 refcount:1 mapcount:-1 mapping:00000000 index:0x0 pfn:0x0 flags: 0x1000(reserved|zone=0) raw: 00001000 eedd8004 eedd8004 00000000 00000000 00000000 fffffffe 00000001 raw: 00000000 page dumped because: bad pte addr:9a354000 vm_flags:00100073 anon_vma:c5933ee8 mapping:00000000 index:9a354 file:(null) fault:0x0 mmap:0x0 read_folio:0x0 CPU: 0 PID: 1230 Comm: cc1plus Tainted: G B W T 6.0.0-rc2-PMacG4 #5 Call Trace: [f4d2b9b0] [c0864cc4] dump_stack_lvl+0x60/0xa4 (unreliable) [f4d2b9d0] [c02c5bc4] print_bad_pte+0x2e8/0x364 [f4d2ba60] [c02c9974] unmap_page_range+0x69c/0xb78 [f4d2bb20] [c02ca590] unmap_vmas+0x168/0x2d4 [f4d2bbd0] [c02d8af0] exit_mmap+0x11c/0x2dc [f4d2bca0] [c005e8f4] mmput+0xa0/0x254 [f4d2bcd0] [c006e1b4] do_exit+0x430/0xe08 [f4d2bd50] [c006ed88] do_group_exit+0x68/0x11c [f4d2bd80] [c0086818] get_signal+0xbfc/0xc50 [f4d2be30] [c000edf8] do_notify_resume+0xf0/0x540 [f4d2bf10] [c0019cfc] interrupt_exit_user_prepare_main+0x7c/0xd0 [f4d2bf30] [c00234ac] interrupt_return+0x14/0x190 --- interrupt: 300 at 0xfa9c0c0 NIP: 0fa9c0c0 LR: 1066b838 CTR: 0fa9bea4 REGS: f4d2bf40 TRAP: 0300 Tainted: G B W T (6.0.0-rc2-PMacG4) MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 24022828 XER: 20000000 DAR: 9a352014 DSISR: 42000000 GPR00: 1066b828 af869c10 a7dd1ba0 9a352000 00000000 00000018 9a352018 00000000 GPR08: 11c30000 0fb89a88 099aec30 0fa9bea4 88022444 11c3d4e0 00000001 af869e78 GPR16: 9afeeed0 9afedd60 10cef83c 115fdfd0 af869e80 11603030 9afeeed0 00000002 GPR24: 9afeeed0 
9b611f60 115fdfd0 a0c82c30 9afeeed0 00000005 0000006e 9a352000 NIP [0fa9c0c0] 0xfa9c0c0 LR [1066b838] 0x1066b838 --- interrupt: 300 _swap_info_get: Bad swap file entry 14c32d92 BUG: Bad page map in process cc1plus pte:c32d9228 pmd:032ed000 addr:9a356000 vm_flags:00100073 anon_vma:c5933ee8 mapping:00000000 index:9a356 file:(null) fault:0x0 mmap:0x0 read_folio:0x0 CPU: 0 PID: 1230 Comm: cc1plus Tainted: G B W T 6.0.0-rc2-PMacG4 #5 Call Trace: [f4d2b9b0] [c0864cc4] dump_stack_lvl+0x60/0xa4 (unreliable) [f4d2b9d0] [c02c5bc4] print_bad_pte+0x2e8/0x364 [f4d2ba60] [c02c9c3c] unmap_page_range+0x964/0xb78 [f4d2bb20] [c02ca590] unmap_vmas+0x168/0x2d4 [f4d2bbd0] [c02d8af0] exit_mmap+0x11c/0x2dc [f4d2bca0] [c005e8f4] mmput+0xa0/0x254 [f4d2bcd0] [c006e1b4] do_exit+0x430/0xe08 [f4d2bd50] [c006ed88] do_group_exit+0x68/0x11c [f4d2bd80] [c0086818] get_signal+0xbfc/0xc50 [f4d2be30] [c000edf8] do_notify_resume+0xf0/0x540 [f4d2bf10] [c0019cfc] interrupt_exit_user_prepare_main+0x7c/0xd0 [f4d2bf30] [c00234ac] interrupt_return+0x14/0x190 --- interrupt: 300 at 0xfa9c0c0 NIP: 0fa9c0c0 LR: 1066b838 CTR: 0fa9bea4 REGS: f4d2bf40 TRAP: 0300 Tainted: G B W T (6.0.0-rc2-PMacG4) MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 24022828 XER: 20000000 DAR: 9a352014 DSISR: 42000000 GPR00: 1066b828 af869c10 a7dd1ba0 9a352000 00000000 00000018 9a352018 00000000 GPR08: 11c30000 0fb89a88 099aec30 0fa9bea4 88022444 11c3d4e0 00000001 af869e78 GPR16: 9afeeed0 9afedd60 10cef83c 115fdfd0 af869e80 11603030 9afeeed0 00000002 GPR24: 9afeeed0 9b611f60 115fdfd0 a0c82c30 9afeeed0 00000005 0000006e 9a352000 NIP [0fa9c0c0] 0xfa9c0c0 LR [1066b838] 0x1066b838 --- interrupt: 300 _swap_info_get: Bad swap file entry 60c0e0fa BUG: Bad page map in process cc1plus pte:c0e0fac0 pmd:032ed000 addr:9a357000 vm_flags:00100073 anon_vma:c5933ee8 mapping:00000000 index:9a357 file:(null) fault:0x0 mmap:0x0 read_folio:0x0 CPU: 0 PID: 1230 Comm: cc1plus Tainted: G B W T 6.0.0-rc2-PMacG4 #5 Call Trace: [f4d2b9b0] [c0864cc4] 
dump_stack_lvl+0x60/0xa4 (unreliable) [f4d2b9d0] [c02c5bc4] print_bad_pte+0x2e8/0x364 [f4d2ba60] [c02c9c3c] unmap_page_range+0x964/0xb78 [f4d2bb20] [c02ca590] unmap_vmas+0x168/0x2d4 [f4d2bbd0] [c02d8af0] exit_mmap+0x11c/0x2dc [f4d2bca0] [c005e8f4] mmput+0xa0/0x254 [f4d2bcd0] [c006e1b4] do_exit+0x430/0xe08 [f4d2bd50] [c006ed88] do_group_exit+0x68/0x11c [f4d2bd80] [c0086818] get_signal+0xbfc/0xc50 [f4d2be30] [c000edf8] do_notify_resume+0xf0/0x540 [f4d2bf10] [c0019cfc] interrupt_exit_user_prepare_main+0x7c/0xd0 [f4d2bf30] [c00234ac] interrupt_return+0x14/0x190 --- interrupt: 300 at 0xfa9c0c0 NIP: 0fa9c0c0 LR: 1066b838 CTR: 0fa9bea4 REGS: f4d2bf40 TRAP: 0300 Tainted: G B W T (6.0.0-rc2-PMacG4) MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 24022828 XER: 20000000 DAR: 9a352014 DSISR: 42000000 GPR00: 1066b828 af869c10 a7dd1ba0 9a352000 00000000 00000018 9a352018 00000000 GPR08: 11c30000 0fb89a88 099aec30 0fa9bea4 88022444 11c3d4e0 00000001 af869e78 GPR16: 9afeeed0 9afedd60 10cef83c 115fdfd0 af869e80 11603030 9afeeed0 00000002 GPR24: 9afeeed0 9b611f60 115fdfd0 a0c82c30 9afeeed0 00000005 0000006e 9a352000 NIP [0fa9c0c0] 0xfa9c0c0 LR [1066b838] 0x1066b838 --- interrupt: 300 _swap_info_get: Bad swap file entry 50c32ed0 BUG: Bad page map in process cc1plus pte:c32ed0a0 pmd:032ed000 addr:9a358000 vm_flags:00100073 anon_vma:c5933ee8 mapping:00000000 index:9a358 file:(null) fault:0x0 mmap:0x0 read_folio:0x0 CPU: 0 PID: 1230 Comm: cc1plus Tainted: G B W T 6.0.0-rc2-PMacG4 #5 Call Trace: [f4d2b9b0] [c0864cc4] dump_stack_lvl+0x60/0xa4 (unreliable) [f4d2b9d0] [c02c5bc4] print_bad_pte+0x2e8/0x364 [f4d2ba60] [c02c9c3c] unmap_page_range+0x964/0xb78 [f4d2bb20] [c02ca590] unmap_vmas+0x168/0x2d4 [f4d2bbd0] [c02d8af0] exit_mmap+0x11c/0x2dc [f4d2bca0] [c005e8f4] mmput+0xa0/0x254 [f4d2bcd0] [c006e1b4] do_exit+0x430/0xe08 [f4d2bd50] [c006ed88] do_group_exit+0x68/0x11c [f4d2bd80] [c0086818] get_signal+0xbfc/0xc50 [f4d2be30] [c000edf8] do_notify_resume+0xf0/0x540 [f4d2bf10] [c0019cfc] 
interrupt_exit_user_prepare_main+0x7c/0xd0 [f4d2bf30] [c00234ac] interrupt_return+0x14/0x190 --- interrupt: 300 at 0xfa9c0c0 NIP: 0fa9c0c0 LR: 1066b838 CTR: 0fa9bea4 REGS: f4d2bf40 TRAP: 0300 Tainted: G B W T (6.0.0-rc2-PMacG4) MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 24022828 XER: 20000000 DAR: 9a352014 DSISR: 42000000 GPR00: 1066b828 af869c10 a7dd1ba0 9a352000 00000000 00000018 9a352018 00000000 GPR08: 11c30000 0fb89a88 099aec30 0fa9bea4 88022444 11c3d4e0 00000001 af869e78 GPR16: 9afeeed0 9afedd60 10cef83c 115fdfd0 af869e80 11603030 9afeeed0 00000002 GPR24: 9afeeed0 9b611f60 115fdfd0 a0c82c30 9afeeed0 00000005 0000006e 9a352000 NIP [0fa9c0c0] 0xfa9c0c0 LR [1066b838] 0x1066b838 --- interrupt: 300 _swap_info_get: Bad swap file entry 4c2a45bd BUG: Bad page map in process cc1plus pte:2a45bd98 pmd:032ed000 addr:9a35c000 vm_flags:00100073 anon_vma:c5933ee8 mapping:00000000 index:9a35c file:(null) fault:0x0 mmap:0x0 read_folio:0x0 CPU: 0 PID: 1230 Comm: cc1plus Tainted: G B W T 6.0.0-rc2-PMacG4 #5 Call Trace: [f4d2b9b0] [c0864cc4] dump_stack_lvl+0x60/0xa4 (unreliable) [f4d2b9d0] [c02c5bc4] print_bad_pte+0x2e8/0x364 [f4d2ba60] [c02c9c3c] unmap_page_range+0x964/0xb78 [f4d2bb20] [c02ca590] unmap_vmas+0x168/0x2d4 [f4d2bbd0] [c02d8af0] exit_mmap+0x11c/0x2dc [f4d2bca0] [c005e8f4] mmput+0xa0/0x254 [f4d2bcd0] [c006e1b4] do_exit+0x430/0xe08 [f4d2bd50] [c006ed88] do_group_exit+0x68/0x11c [f4d2bd80] [c0086818] get_signal+0xbfc/0xc50 [f4d2be30] [c000edf8] do_notify_resume+0xf0/0x540 [f4d2bf10] [c0019cfc] interrupt_exit_user_prepare_main+0x7c/0xd0 [f4d2bf30] [c00234ac] interrupt_return+0x14/0x190 --- interrupt: 300 at 0xfa9c0c0 NIP: 0fa9c0c0 LR: 1066b838 CTR: 0fa9bea4 REGS: f4d2bf40 TRAP: 0300 Tainted: G B W T (6.0.0-rc2-PMacG4) MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 24022828 XER: 20000000 DAR: 9a352014 DSISR: 42000000 GPR00: 1066b828 af869c10 a7dd1ba0 9a352000 00000000 00000018 9a352018 00000000 GPR08: 11c30000 0fb89a88 099aec30 0fa9bea4 88022444 11c3d4e0 00000001 af869e78 GPR16: 
9afeeed0 9afedd60 10cef83c 115fdfd0 af869e80 11603030 9afeeed0 00000002 GPR24: 9afeeed0 9b611f60 115fdfd0 a0c82c30 9afeeed0 00000005 0000006e 9a352000 NIP [0fa9c0c0] 0xfa9c0c0 LR [1066b838] 0x1066b838 --- interrupt: 300 BUG: Unable to handle kernel data access on read at 0x09fcbaf8 Faulting instruction address: 0xc02c99a8 Oops: Kernel access of bad area, sig: 11 [#1] BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac Modules linked in: auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc hid_generic usbhid hid b43legacy mac80211 snd_aoa_codec_tas libarc4 snd_aoa_fabric_layout snd_aoa cfg80211 rfkill evdev mac_hid firewire_ohci therm_windtunnel firewire_core sr_mod cdrom crc_itu_t snd_aoa_i2sbus snd_aoa_soundbus snd_pcm snd_timer snd 8250_pci soundcore ssb pcmcia pcmcia_core 8250 8250_base serial_mctrl_gpio ohci_pci radeon hwmon ohci_hcd ehci_pci i2c_algo_bit drm_ttm_helper ttm drm_display_helper ehci_hcd drm_kms_helper syscopyarea sysfillrect usbcore sysimgblt fb_sys_fops usb_common fuse drm drm_panel_orientation_quirks configfs CPU: 0 PID: 1230 Comm: cc1plus Tainted: G B W T 6.0.0-rc2-PMacG4 #5 NIP: c02c99a8 LR: c02c99a8 CTR: 00000000 REGS: f4d2b9a0 TRAP: 0300 Tainted: G B W T (6.0.0-rc2-PMacG4) MSR: 00009032 <EE,ME,IR,DR,RI> CR: 24d88838 XER: 20000000 DAR: 09fcbaf8 DSISR: 40000000 GPR00: 00000000 f4d2ba60 c3ff0020 00000000 00000000 00000000 00000000 00000000 GPR08: 00000000 00000000 00000000 00000000 00000000 11c3d4e0 f4d2bad0 00000000 GPR16: c7d11008 fe9a5752 c0de15e0 f4d2bb50 f4d2bab0 c3626ac8 00000000 c16ed525 GPR24: 9a400000 fffffffd 00000000 f4d2bc10 09fcbaf4 09fcbaf4 9a35f000 c32edd78 NIP [c02c99a8] unmap_page_range+0x6d0/0xb78 LR [c02c99a8] unmap_page_range+0x6d0/0xb78 Call Trace: [f4d2ba60] [c02c99a8] unmap_page_range+0x6d0/0xb78 (unreliable) [f4d2bb20] [c02ca590] unmap_vmas+0x168/0x2d4 [f4d2bbd0] [c02d8af0] exit_mmap+0x11c/0x2dc [f4d2bca0] [c005e8f4] mmput+0xa0/0x254 [f4d2bcd0] [c006e1b4] do_exit+0x430/0xe08 [f4d2bd50] [c006ed88] do_group_exit+0x68/0x11c 
[f4d2bd80] [c0086818] get_signal+0xbfc/0xc50 [f4d2be30] [c000edf8] do_notify_resume+0xf0/0x540 [f4d2bf10] [c0019cfc] interrupt_exit_user_prepare_main+0x7c/0xd0 [f4d2bf30] [c00234ac] interrupt_return+0x14/0x190 --- interrupt: 300 at 0xfa9c0c0 NIP: 0fa9c0c0 LR: 1066b838 CTR: 0fa9bea4 REGS: f4d2bf40 TRAP: 0300 Tainted: G B W T (6.0.0-rc2-PMacG4) MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 24022828 XER: 20000000 DAR: 9a352014 DSISR: 42000000 GPR00: 1066b828 af869c10 a7dd1ba0 9a352000 00000000 00000018 9a352018 00000000 GPR08: 11c30000 0fb89a88 099aec30 0fa9bea4 88022444 11c3d4e0 00000001 af869e78 GPR16: 9afeeed0 9afedd60 10cef83c 115fdfd0 af869e80 11603030 9afeeed0 00000002 GPR24: 9afeeed0 9b611f60 115fdfd0 a0c82c30 9afeeed0 00000005 0000006e 9a352000 NIP [0fa9c0c0] 0xfa9c0c0 LR [1066b838] 0x1066b838 --- interrupt: 300 Instruction dump: 7ecfb378 82410014 82c10018 4bffff04 3d40c170 578901be 83aa5580 1d290024 7fbd4a14 387d0004 7fbceb78 48063245 <813d0004> 712a0001 40820304 7f83e378 ---[ end trace 0000000000000000 ]--- Fixing recursive fault but reboot is needed! I deleted about 120.000 lines of "get_swap_device: Bad swap file entry 64cccccc" in the kernel dmesg to make it more compact. swap partition is 8192 MiB large at /dev/sdb6. Created attachment 301640 [details]
kernel .config (6.0-rc2, PowerMac G4 DP)
It would be nice to give it another try with KCSAN enabled. To get KCSAN on powerpc/32, apply the following series: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=354731
Created attachment 304308 [details]
dmesg (6.3.3, KCSAN, PowerMac G4 DP)
Thanks for taking another look into this, Christophe!
Applied the patches on top of 6.3.3 and these are my findings so far:
1. KCSAN works fine on my G4 and passes self tests.
2. It does not generate any additional output when I hit the "pagealloc: memory corruption".
3. With CONFIG_KCSAN_WEAK_MEMORY=y my G4 won't finish booting. Early boot works and the screen shows some dmesg output, but booting gets stuck there and never reaches the console. I also don't get any netconsole output with CONFIG_KCSAN_WEAK_MEMORY=y.
4. As soon as I set CONFIG_KCSAN_EARLY_ENABLE=y dmesg shows plenty of data races!
netconsole output and kernel .config attached.
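When comparing the KCSAN runs above, it can help to check mechanically which of the three options a given .config actually sets. A minimal sketch, assuming the usual .config syntax ("CONFIG_X=y" and "# CONFIG_X is not set"); the sample text is hypothetical and only mirrors the combination that booted here, it is not copied from the attached .config:

```python
def parse_kconfig(text):
    """Map option name -> value; options marked 'is not set' map to 'n'."""
    suffix = " is not set"
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # "# CONFIG_FOO is not set" -> "CONFIG_FOO"
            opts[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            opts[key] = value
    return opts


# Hypothetical fragment, not the attached .config.
sample = """\
CONFIG_KCSAN=y
CONFIG_KCSAN_EARLY_ENABLE=y
# CONFIG_KCSAN_WEAK_MEMORY is not set
"""

opts = parse_kconfig(sample)
for name in ("CONFIG_KCSAN", "CONFIG_KCSAN_EARLY_ENABLE", "CONFIG_KCSAN_WEAK_MEMORY"):
    print(name, opts.get(name, "missing"))
```

The same helper could be pointed at CONFIG_VMAP_STACK when switching between the configurations tested later in this thread.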
To provoke the memory corruption, 'stress' is a good tool. stress -m2 --vm-bytes 915M provokes the corruption easily, and --vm-bytes 915M is small enough not to provoke the OOM killer on my G4 DP with its 2 CPUs and 2 GiB RAM.
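The sizing behind that choice is plain arithmetic. A rough sketch, assuming stress treats the M suffix as MiB and that each of the two vm workers allocates the full --vm-bytes amount (both are assumptions about the tool, not something shown in the logs):

```python
def vm_footprint_mib(workers, vm_bytes_mib):
    """Total memory the vm workers try to touch, in MiB."""
    return workers * vm_bytes_mib


ram_mib = 2 * 1024                  # 2 GiB RAM on the G4 DP
used = vm_footprint_mib(2, 915)     # stress -m2 --vm-bytes 915M
headroom = ram_mib - used           # what is left for kernel and userspace

print(used, headroom)
```

Under those assumptions the two workers cover most of RAM while still leaving a couple of hundred MiB, with the 8192 MiB swap partition behind that.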
Created attachment 304309 [details]
kernel .config (6.3.3, PowerMac G4 DP)
No change with 6.4-rc4, only additional data "page_type: 0xffffffff()" is shown: [...] pagealloc: memory corruption 06fe3258: 00 00 00 00 .... CPU: 1 PID: 397 Comm: stress Tainted: G W T 6.4.0-rc3-PMacG4-dirty #1 Hardware name: PowerMac3,6 7455 0x80010303 PowerMac Call Trace: [f2a35c70] [c0eea17c] dump_stack_lvl+0x60/0xa4 (unreliable) [f2a35c90] [c0eea1d8] dump_stack+0x18/0x30 [f2a35ca0] [c0360f90] __kernel_unpoison_pages+0x234/0x288 [f2a35ce0] [c033fdf4] get_page_from_freelist+0xd90/0x10d8 [f2a35d90] [c0340978] __alloc_pages+0x138/0xdd8 [f2a35e40] [c0315b80] handle_mm_fault+0xab8/0x15e0 [f2a35ed0] [c003a3d4] ___do_page_fault+0x320/0x8c4 [f2a35f10] [c003abe0] do_page_fault+0x28/0x80 [f2a35f30] [c000433c] DataAccess_virt+0x124/0x17c --- interrupt: 300 at 0xaf30d8 NIP: 00af30d8 LR: 00af30b4 CTR: 00000000 REGS: f2a35f40 TRAP: 0300 Tainted: G W T (6.4.0-rc3-PMacG4-dirty) MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 20882464 XER: 00000000 DAR: 8f7a3010 DSISR: 42000000 GPR00: 00af30b4 af9c9cb0 a7cd2740 6e97d010 39300000 20224462 00000000 00a10264 GPR08: 20e27000 20e26000 00000000 4062ceda 20882462 00b0fff4 00000000 00000000 GPR16: 00000000 00000002 00000000 0000005a 40802462 80002462 40002462 00b100a4 GPR24: ffffffff ffffffff 39300000 00000000 00000000 6e97d010 00b17d64 00001000 NIP [00af30d8] 0xaf30d8 LR [00af30b4] 0xaf30b4 --- interrupt: 300 page:e314e657 refcount:1 mapcount:0 mapping:00000000 index:0x1 pfn:0x31065 flags: 0x80000000(zone=2) page_type: 0xffffffff() raw: 80000000 00000100 00000122 00000000 00000001 00000000 ffffffff 00000001 raw: 00000000 page dumped because: pagealloc: corrupted page details Created attachment 305297 [details]
dmesg (5.5-rc5, PowerMac G4 DP)
Re-visiting this bug as it's reproducible on v6.6-rc7.
This time I tried the other way round. CONFIG_VMAP_STACK was added for ppc with commit cd08f109e26231b279bcc0388428afcac6408ec6 (at about kernel v5.5-rc5 time). So I did a git checkout cd08f109e26231b279bcc0388428afcac6408ec6 and started from there with a further reduced kernel .config.
I added two additional patches to get the G4 to boot with VMAP_STACK enabled: 4119622 "powerpc/32s: Fix kasan_early_hash_table() for CONFIG_VMAP_STACK" and 232ca1e "powerpc/32s: Fix DSI and ISI exceptions for CONFIG_VMAP_STACK".
Then I burdened the memory subsystem with "stress -c 2 --vm 2 --vm-bytes 896M" as before and hit the issue in less than 20 sec. Not hitting the issue means my G4 runs "stress -c 2 --vm 2 --vm-bytes 896M" for about half an hour without side effects.
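For longer test series the hit/no-hit decision can be made by scanning the captured dmesg or netconsole text for the corruption marker. A small sketch, assuming the log has already been saved as text; the helper name is made up for illustration and the sample lines are taken from the 5.15.10 dmesg in this report:

```python
import re
from typing import List

# Marker printed when the page poisoning check fails on allocation.
MARKER = re.compile(r"pagealloc: memory corruption")


def corruption_hits(dmesg_text: str) -> List[str]:
    """Return the log lines that report a page poisoning failure."""
    return [line for line in dmesg_text.splitlines() if MARKER.search(line)]


sample = """\
[ 5503.973022] pagealloc: memory corruption
[ 5503.973226] fffdfff0: 00 00 00 00 ....
[ 5504.080511] page dumped because: pagealloc: corrupted page details
"""

hits = corruption_hits(sample)
print(len(hits), "hit(s)")
```

Only the "memory corruption" line is counted, so the trailing "corrupted page details" line of the same report does not count as a second hit.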
So it looks like the issue was here from the start when CONFIG_VMAP_STACK was added for ppc. (see dmesg)
I don't hit the issue when:
1. nr_cpus=1 is set + VMAP_STACK enabled
2. VMAP_STACK disabled
Setting LOWMEM_SIZE to 0x28000000 does not seem to have an effect on it.
This bug really plays hard to get... I'll do further KCSAN checks in recent kernels and open separate issues if KCSAN digs up something useful.
Created attachment 305299 [details]
kernel .config (5.5-rc5, PowerMac G4 DP)
Created attachment 300113 [details] dmesg (5.15.10, PowerMac G4 DP) Happens at running the glibc-2.33 testsuite on my G4 DP. [...] [ 5503.973022] pagealloc: memory corruption [ 5503.973226] fffdfff0: 00 00 00 00 .... [ 5503.973469] CPU: 0 PID: 15826 Comm: ld.so.1 Tainted: G W 5.15.10-gentoo-PowerMacG4 #3 [ 5503.973791] Call Trace: [ 5503.973849] [f61edc20] [c03e8644] dump_stack_lvl+0x60/0x80 (unreliable) [ 5503.974096] [f61edc40] [c016ece8] __kernel_unpoison_pages+0x13c/0x174 [ 5503.974320] [f61edc90] [c015aa64] post_alloc_hook+0x60/0xb4 [ 5503.974511] [f61edcb0] [c015aadc] prep_new_page+0x24/0x5c [ 5503.974687] [f61edcd0] [c015be14] get_page_from_freelist+0x26c/0x548 [ 5503.974898] [f61edd50] [c015c5d8] __alloc_pages+0xc8/0x7a4 [ 5503.975080] [f61eddf0] [c0146470] alloc_zeroed_user_highpage_movable.constprop.0+0x18/0x48 [ 5503.975358] [f61ede10] [c01467a8] wp_page_copy+0x58/0x4a4 [ 5503.975534] [f61ede80] [c0149df4] handle_mm_fault+0x72c/0x864 [ 5503.975725] [f61edf00] [c001a9dc] do_page_fault+0x578/0x6c8 [ 5503.975919] [f61edf30] [c000424c] DataAccess_virt+0xd4/0xe4 [ 5503.976102] --- interrupt: 300 at 0x6ffc5eb0 [ 5503.976228] NIP: 6ffc5eb0 LR: 6ffc5e84 CTR: c0335cb0 [ 5503.976383] REGS: f61edf40 TRAP: 0300 Tainted: G W (5.15.10-gentoo-PowerMacG4) [ 5503.976684] MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 840022c8 XER: 20000000 [ 5503.976929] DAR: a78032e4 DSISR: 0a000000 GPR00: 6ffc60bc af9a9650 a7a15550 0064c9ac 00896b60 00000009 bcecbe5c 001282d4 GPR08: 00899280 a78032e4 a7809068 f61edf30 240022c2 6ffece34 008a1a90 00000001 GPR16: 00000000 0064c9ac 0064c9e8 0064c980 008a1830 0064b8f4 0000000f 00000009 GPR24: 00896b60 bcecbe5c 000002c6 a7828774 a76db010 000083a7 6fff4cdc 0064c9ac [ 5504.008476] NIP [6ffc5eb0] 0x6ffc5eb0 [ 5504.018630] LR [6ffc5e84] 0x6ffc5e84 [ 5504.028738] --- interrupt: 300 [ 5504.038956] page:ef4c8e34 refcount:1 mapcount:0 mapping:00000000 index:0x1 pfn:0x31065 [ 5504.049340] flags: 0x80000000(zone=2) [ 5504.059763] raw: 80000000 00000100 
00000122 00000000 00000001 00000000 ffffffff 00000001 [ 5504.070297] raw: 00000000 [ 5504.080511] page dumped because: pagealloc: corrupted page details The machine stays usable afterwards. Happened also a 2nd time after a reboot, again at building glibc-2.33 and running testsuite: [...] [ 2946.948834] pagealloc: memory corruption [ 2946.949078] fffcfff0: 00 00 00 00 .... [ 2946.949419] CPU: 1 PID: 31318 Comm: ld.so.1 Tainted: G W 5.15.10-gentoo-PowerMacG4 #3 [ 2946.949753] Call Trace: [ 2946.949814] [f5c21b00] [c03e8644] dump_stack_lvl+0x60/0x80 (unreliable) [ 2946.950054] [f5c21b20] [c016ece8] __kernel_unpoison_pages+0x13c/0x174 [ 2946.950281] [f5c21b70] [c015aa64] post_alloc_hook+0x60/0xb4 [ 2946.950476] [f5c21b90] [c015aadc] prep_new_page+0x24/0x5c [ 2946.950651] [f5c21bb0] [c015be14] get_page_from_freelist+0x26c/0x548 [ 2946.950865] [f5c21c30] [c015c5d8] __alloc_pages+0xc8/0x7a4 [ 2946.951053] [f5c21cd0] [c011f6d4] pagecache_get_page+0x184/0x1fc [ 2946.951259] [f5c21d30] [c029fd34] prepare_pages+0x80/0x14c [ 2946.951442] [f5c21d80] [c02a28dc] btrfs_buffered_write+0x2b8/0x54c [ 2946.951653] [f5c21e20] [c02a4700] btrfs_file_write_iter+0x340/0x368 [ 2946.951876] [f5c21e70] [c01892fc] vfs_write+0x18c/0x1dc [ 2946.952057] [f5c21ef0] [c0189484] ksys_write+0x74/0xb8 [ 2946.952231] [f5c21f30] [c0015098] ret_from_syscall+0x0/0x28 [ 2946.952420] --- interrupt: c00 at 0x6fecc128 [ 2946.952547] NIP: 6fecc128 LR: 6fecc100 CTR: 00000001 [ 2946.952704] REGS: f5c21f40 TRAP: 0c00 Tainted: G W (5.15.10-gentoo-PowerMacG4) [ 2946.953008] MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 24022448 XER: 00000000 [ 2946.953267] GPR00: 00000004 afad5d90 a7b83550 00000009 afad5e9c 00002000 00000000 6fecbfe8 GPR08: 0000d032 402c551a 402c5409 f5c21f30 84022448 6ffeee28 007889b0 afad8070 GPR16: afad7fa0 afad8008 00000000 00000000 00008000 00000008 00976000 001c5bcc GPR24: 00000000 afad5e9c 00002000 00000009 afad7e9c 00000000 6ffbaff4 afad5e9c [ 2946.975430] NIP [6fecc128] 0x6fecc128 [ 2946.985730] LR 
[6fecc100] 0x6fecc100 [ 2946.995992] --- interrupt: c00 [ 2947.006198] page:ef4c8e34 refcount:1 mapcount:0 mapping:00000000 index:0x1 pfn:0x31065 [ 2947.016579] flags: 0x80000000(zone=2) [ 2947.026946] raw: 80000000 00000100 00000122 00000000 00000001 00000000 ffffffff 00000001 [ 2947.037712] raw: 00000000 [ 2947.048178] page dumped because: pagealloc: corrupted page details
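The page dumps from 5.15.10 and from 6.4.0-rc3 both report pfn:0x31065. A quick sketch to turn that page frame number into a physical address, assuming the 4K page size shown in the oops banner (PAGE_SIZE=4K); the comparison value 0x28000000 is simply the LOWMEM_SIZE that was tried in this thread, not a statement about what the other kernels used:

```python
PAGE_SHIFT = 12  # PAGE_SIZE=4K, as printed in the oops banner
MIB = 1024 * 1024


def pfn_to_phys(pfn):
    """Physical byte address of the first byte of a page frame."""
    return pfn << PAGE_SHIFT


pfn = 0x31065                       # from "pfn:0x31065" in the page dumps
phys = pfn_to_phys(pfn)
lowmem_size = 0x28000000            # LOWMEM_SIZE value tried in this thread

print(hex(phys), phys / MIB, "MiB")
print("above 0x28000000:", phys >= lowmem_size)
```

This only locates the frame in physical memory; it says nothing about which code wrote to it.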