Created attachment 301065 [details]
dmesg (5.18.0, PowerMac G4 DP), case 1

The attached v5.18 kernel .config triggers this stack overflow pretty easily on my G4 DP. It does not show up on every boot, but it does very often. The overflow usually happens after the radeon drm gets loaded.

I don't know whether it's the same issue as bug #207129, but here the KASAN + debug output looks more useful.

[...]
[drm] radeon: irq initialized.
[drm] Loading R300 Microcode
Loading firmware: radeon/R300_cp.bin
do_IRQ: stack overflow: 1984
CPU: 0 PID: 126 Comm: systemd-udevd Not tainted 5.18.0-gentoo-PMacG4 #1
Call Trace:
Oops: Kernel stack overflow, sig: 11 [#1]
BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac
Modules linked in: sr_mod cdrom radeon(+) ohci_pci(+) hwmon i2c_algo_bit drm_ttm_helper ttm drm_dp_helper snd_aoa_i2sbus snd_aoa_soundbus snd_pcm ehci_pci snd_timer ohci_hcd snd ssb ehci_hcd 8250_pci soundcore drm_kms_helper pcmcia 8250 pcmcia_core syscopyarea usbcore sysfillrect 8250_base sysimgblt serial_mctrl_gpio fb_sys_fops usb_common pkcs8_key_parser fuse drm drm_panel_orientation_quirks configfs
CPU: 0 PID: 126 Comm: systemd-udevd Not tainted 5.18.0-gentoo-PMacG4 #1
NIP: c02e5558 LR: c07eb3bc CTR: c07f46a8
REGS: e7fe9f50 TRAP: 0000 Not tainted (5.18.0-gentoo-PMacG4)
MSR: 00001032 <ME,IR,DR,RI> CR: 44a14824 XER: 20000000
GPR00: c07eb3bc eaa1c000 c26baea0 eaa1c0a0 00000008 00000000 c07eb3bc eaa1c010
GPR08: eaa1c0a8 04f3f3f3 f1f1f1f1 c07f4c84 44a14824 0080f7e4 00000005 00000010
GPR16: 00000025 eaa1c154 eaa1c158 c0dbad64 00000020 fd543810 eaa1c0a0 eaa1c29e
GPR24: c0dbad44 c0db8740 05ffffff fd543802 eaa1c150 c0c9a3c0 eaa1c0a0 c0c9a3c0
NIP [c02e5558] kasan_check_range+0xc/0x2b4
LR [c07eb3bc] format_decode+0x80/0x604
Call Trace:
[eaa1c000] [c07eb3bc] format_decode+0x80/0x604 (unreliable)
[eaa1c070] [c07f4dac] vsnprintf+0x128/0x938
[eaa1c110] [c07f5788] sprintf+0xa0/0xc0
[eaa1c180] [c0154c1c] __sprint_symbol.constprop.0+0x170/0x198
[eaa1c230] [c07ee71c] symbol_string+0xf8/0x260
[eaa1c430] [c07f46d0] pointer+0x15c/0x710
[eaa1c4b0] [c07f4fbc] vsnprintf+0x338/0x938
[eaa1c550] [c00e8fa0] vprintk_store+0x2a8/0x678
[eaa1c690] [c00e94e4] vprintk_emit+0x174/0x378
[eaa1c6d0] [c00ea008] _printk+0x9c/0xc0
[eaa1c750] [c000ca94] show_stack+0x21c/0x260
[eaa1c7a0] [c07d0bd4] dump_stack_lvl+0x60/0x90
[eaa1c7c0] [c0009234] __do_IRQ+0x170/0x174
[eaa1c800] [c0009258] do_IRQ+0x20/0x34
[eaa1c820] [c00045b4] HardwareInterrupt_virt+0x108/0x10c
--- interrupt: 500 at finish_task_switch.isra.0+0x130/0x3a8
NIP: c00a3c9c LR: c00a3c88 CTR: c036560c
REGS: eaa1c830 TRAP: 0500 Not tainted (5.18.0-gentoo-PMacG4)
MSR: 0220b032 <VEC,EE,FP,ME,IR,DR,RI> CR: 22882848 XER: 20000000
GPR00: c00a3c88 eaa1c8e0 c26baea0 e6dcf2a0 c0c59b28 ea18df50 c002a268 c2fd1470
GPR08: 00000003 0200b032 00000000 00000001 22004868 0080f7e4 92e60efc c88833b4
GPR16: 00000000 c88833c0 00000003 25c84000 00000006 e6dcf82c d82c68a0 00000000
GPR24: 25c84000 00000000 c2fd0bc0 c0c59b2c c114b2a0 e6dcf2a0 00000000 00000000
NIP [c00a3c9c] finish_task_switch.isra.0+0x130/0x3a8
LR [c00a3c88] finish_task_switch.isra.0+0x11c/0x3a8
--- interrupt: 500
[eaa1c920] [c0c59b2c] __schedule+0x3f0/0x9dc
[eaa1c9b0] [c0c5a18c] schedule+0x74/0x13c
[eaa1c9d0] [c0c5a2e4] io_schedule+0x54/0x8c
[eaa1c9f0] [c0c5af0c] bit_wait_io+0x18/0x94
[eaa1ca10] [c0c5a8a0] __wait_on_bit+0x100/0x28c
[eaa1ca60] [c0c5aaf4] out_of_line_wait_on_bit+0xc8/0xf0
[eaa1cae0] [c0494d38] ext4_read_bh+0x184/0x1a8
[eaa1cb10] [c0428bd4] __read_extent_tree_block+0x1a4/0x2d0
[eaa1cb50] [c042ade4] ext4_find_extent+0x270/0x5a4
[eaa1cbb0] [c04302b8] ext4_ext_map_blocks+0x11c/0x1d8c
[eaa1cdd0] [c0452f14] ext4_map_blocks+0x3f0/0x950
[eaa1ce90] [c0454890] ext4_getblk+0x2e8/0x3cc
[eaa1cf30] [c0454988] ext4_bread+0x14/0x110
[eaa1cf50] [c047d538] __ext4_read_dirblock+0x4c/0x52c
[eaa1cf90] [c047da8c] dx_probe+0x74/0x95c
[eaa1cff0] [c0480ed0] __ext4_find_entry+0x6c0/0x9b4
[eaa1d120] [c04812d4] ext4_lookup+0x110/0x3c0
[eaa1d1d0] [c0326524] __lookup_slow+0x10c/0x25c
[eaa1d240] [c032d1f8] walk_component+0x220/0x30c
[eaa1d2d0] [c032d6a0] link_path_walk.part.0.constprop.0+0x3bc/0x608
usb usb6: New USB device found, idVendor=1d6b, idProduct=0001, bcdDevice= 5.18
[eaa1d380] [c032e384] path_openat+0x1b4/0x1648
usb usb6: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[eaa1d490] [c03310e8] do_file_open_root+0x168/0x288
[eaa1d5c0] [c03095f8] file_open_root+0x150/0x234
usb usb6: Product: OHCI PCI host controller
usb usb6: Manufacturer: Linux 5.18.0-gentoo-PMacG4 ohci_hcd
[eaa1d640] [c037e688] kernel_read_file_from_path_initns+0x140/0x204
[eaa1d6d0] [c08f7614] _request_firmware+0x970/0xae8
usb usb6: SerialNumber: 0001:10:1b.1
[eaa1d7c0] [c08f77d8] request_firmware+0x4c/0x78
[eaa1d7e0] [befd63e0] r100_cp_init+0x590/0x608 [radeon]
[eaa1d810] [befe0310] r300_startup.constprop.0+0x3b0/0x458 [radeon]
[eaa1d840] [befe0854] r300_init+0x19c/0x368 [radeon]
[eaa1d860] [bef79e3c] radeon_device_init+0x930/0x10ac [radeon]
[eaa1d8c0] [bef7c0b0] radeon_driver_load_kms+0xf4/0x2b8 [radeon]
[eaa1d900] [bec82498] drm_dev_register+0x15c/0x3ac [drm]
[eaa1d940] [bef77620] radeon_pci_probe+0x13c/0x178 [radeon]
[eaa1d970] [c0829080] pci_device_probe+0x100/0x238
[eaa1d9a0] [c08da164] really_probe.part.0+0x108/0x428
[eaa1d9d0] [c08da578] __driver_probe_device+0xf4/0x1ac
[eaa1d9f0] [c08da69c] driver_probe_device+0x6c/0x150
[eaa1da20] [c08db100] __driver_attach+0xec/0x200
[eaa1da50] [c08d63e8] bus_for_each_dev+0xf8/0x164
[eaa1dac0] [c08d8dec] bus_add_driver+0x274/0x31c
[eaa1db00] [c08dbee4] driver_register+0x114/0x258
[eaa1db30] [c0007b84] do_one_initcall+0xb0/0x318
[eaa1dc00] [c014eff4] do_init_module+0xfc/0x3e0
[eaa1dc30] [c0152aec] load_module+0x36d8/0x3d1c
[eaa1de60] [c01534bc] sys_finit_module+0x114/0x170
[eaa1df40] [c001f1a8] ret_from_syscall+0x0/0x2c
--- interrupt: c00 at 0xa7e17acc
NIP: a7e17acc LR: 003d8538 CTR: a7d9dac0
REGS: eaa1df50 TRAP: 0c00 Not tainted (5.18.0-gentoo-PMacG4)
MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 24222448 XER: 00000000
GPR00: 00000161 afd91980 a7f5a560 00000018 003e41a8 00000000 00000018 00000000
GPR08: 00000000 00000008 00000000 a7e8a44c a7d209ec 0080f7e4 00000000 00000000
GPR16: 009cea00 00000000 0aba9500 009cea00 00000000 afd91b3c 00968570 00000000
GPR24: 009cea00 00000007 009cf6c0 003e41a8 00020000 00000000 00404cb8 009cea00
NIP [a7e17acc] 0xa7e17acc
LR [003d8538] 0x3d8538
--- interrupt: c00
Instruction dump:
80010024 8361000c 83c10018 83810010 7c0803a6 83a10014 83e1001c 38210020
4e800020 2c040000 41820260 7d032214 <9421fff0> 7c034040 41810154 3d40b000
---[ end trace 0000000000000000 ]---
Created attachment 301066 [details] dmesg (5.18.0, PowerMac G4 DP), case 2
Created attachment 301067 [details] kernel .config (5.18.0, PowerMac G4 DP)
I can't see any issue, other than your CONFIG_THREAD_SHIFT is set to 13. It should be 14 by default, see https://elixir.bootlin.com/linux/v5.18/source/arch/powerpc/Kconfig#L769 Is there any reason why you set it to 13 ?
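For reference, a minimal sketch of what the THREAD_SHIFT value means for the stack size, assuming the usual relation THREAD_SIZE = 1 << THREAD_SHIFT and a stack that is aligned to THREAD_SIZE. The helper names `thread_size` and `stack_base` are made up for illustration and are not kernel functions; the addresses used in the note below are frame addresses from the trace in the first comment.

```c
#include <assert.h>
#include <stdint.h>

/* Stack size in bytes for a given thread shift: THREAD_SIZE = 1 << THREAD_SHIFT. */
uint32_t thread_size(unsigned int thread_shift)
{
	assert(thread_shift < 32);	/* keep the shift well-defined on 32 bits */
	return (uint32_t)1 << thread_shift;
}

/* Lowest address of the stack area containing sp (assumes THREAD_SIZE alignment). */
uint32_t stack_base(uint32_t sp, unsigned int thread_shift)
{
	return sp & ~(thread_size(thread_shift) - 1);
}
```

With a shift of 13, `thread_size()` gives 8192 bytes, and both the outermost frame in the trace (0xeaa1df40) and the frame where the Oops hit (0xeaa1c000) map to the same base, 0xeaa1c000. That is consistent with the whole trace living in a single 8 KiB stack. A shift of 14 would give 16384 bytes.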
Setting it higher is probably a good idea, but there really isn't a safe limit with KASAN. At least if KASAN_STACK is active, running with KASAN always carries a risk of running into stack overflow issues.

One thing that sticks out is that there is an interrupt on the same stack as the task, in

[eaa1c800] [c0009258] do_IRQ+0x20/0x34
[eaa1c820] [c00045b4] HardwareInterrupt_virt+0x108/0x10c
[eaa1c920] [c0c59b2c] __schedule+0x3f0/0x9dc
[eaa1c9b0] [c0c5a18c] schedule+0x74/0x13c

It looks like on ppc32, as of 547db12fd8a0 ("powerpc/32: Use vmapped stacks for interrupts"), you have either VMAP_STACK (to detect stack overflows) or IRQ stacks (to make them less likely). I think you really want both instead, and to allocate the IRQ stacks from vmalloc space as well.

The ext4 read path is a bit wasteful with KASAN enabled, using 1776 bytes from ext4_lookup to ext4_read_bh, but not excessively so.
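The 1776-byte figure can be reproduced from the frame addresses in the trace. The sketch below assumes that each bracketed stack address marks the start of the named function's frame and that the next entry up the trace marks its end; the array and helper names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Frame addresses copied from the trace, innermost first:
 * ext4_read_bh ... ext4_lookup, followed by __lookup_slow, whose frame
 * address marks the upper end of ext4_lookup's frame.
 */
const uint32_t ext4_path_sp[] = {
	0xeaa1cae0, /* ext4_read_bh */
	0xeaa1cb10, /* __read_extent_tree_block */
	0xeaa1cb50, /* ext4_find_extent */
	0xeaa1cbb0, /* ext4_ext_map_blocks */
	0xeaa1cdd0, /* ext4_map_blocks */
	0xeaa1ce90, /* ext4_getblk */
	0xeaa1cf30, /* ext4_bread */
	0xeaa1cf50, /* __ext4_read_dirblock */
	0xeaa1cf90, /* dx_probe */
	0xeaa1cff0, /* __ext4_find_entry */
	0xeaa1d120, /* ext4_lookup */
	0xeaa1d1d0, /* __lookup_slow (upper bound only) */
};
const size_t ext4_path_len = sizeof(ext4_path_sp) / sizeof(ext4_path_sp[0]);

/* Total bytes covered by the frames sp[0] .. sp[n - 2]. */
uint32_t frames_total(const uint32_t *sp, size_t n)
{
	assert(sp != NULL);
	return n < 2 ? 0 : sp[n - 1] - sp[0];
}

/* Size of the largest single frame, taking frame i as sp[i + 1] - sp[i]. */
uint32_t largest_frame(const uint32_t *sp, size_t n)
{
	uint32_t max = 0;

	assert(sp != NULL);
	for (size_t i = 0; i + 1 < n; i++) {
		uint32_t size = sp[i + 1] - sp[i];

		if (size > max)
			max = size;
	}
	return max;
}
```

Under that assumption, `frames_total(ext4_path_sp, ext4_path_len)` comes out at 1776 bytes, and the largest single contributor is ext4_ext_map_blocks with a 544-byte frame.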
There is an interrupt that needs to be looked at a bit more closely:

[eaa1c7a0] [c07d0bd4] dump_stack_lvl+0x60/0x90
[eaa1c7c0] [c0009234] __do_IRQ+0x170/0x174
[eaa1c800] [c0009258] do_IRQ+0x20/0x34
[eaa1c820] [c00045b4] HardwareInterrupt_virt+0x108/0x10c

The interesting part is __do_IRQ():

void __do_IRQ(struct pt_regs *regs)
{
	struct pt_regs *old_regs = set_irq_regs(regs);
	void *cursp, *irqsp, *sirqsp;

	/* Switch to the irq stack to handle this */
	cursp = (void *)(current_stack_pointer & ~(THREAD_SIZE - 1));
	irqsp = hardirq_ctx[raw_smp_processor_id()];
	sirqsp = softirq_ctx[raw_smp_processor_id()];

	check_stack_overflow();

	/* Already there ? */
	if (unlikely(cursp == irqsp || cursp == sirqsp)) {
		__do_irq(regs);
		set_irq_regs(old_regs);
		return;
	}
	/* Switch stack and call */
	call_do_irq(regs, irqsp);

	set_irq_regs(old_regs);
}

The dump_stack() we see in the call trace is from check_stack_overflow(), following the message "do_IRQ: stack overflow: 1984", because the stack dropped below 0xeaa1c800.

The check_stack_overflow() function emits a warning and a stack dump when CONFIG_DEBUG_STACKOVERFLOW is selected and less than 2 kbytes remain available on the stack. But here we get an Oops when the stack reaches 0xeaa1c000. It seems the 2 kbytes limit is not enough to properly perform the stack dump.

Commit 547db12fd8a0 ("powerpc/32: Use vmapped stacks for interrupts") doesn't remove IRQ stacks. It changes the IRQ stack allocation from kmalloc to vmalloc.

Here we are still on the original stack. The switch to the IRQ stack is performed by call_do_irq().
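The numbers in the warning can be checked with a small userspace model of the threshold test. This is only a sketch: it assumes the check masks the stack pointer with THREAD_SIZE - 1 and compares the result against a fixed limit, as the "1984" in the message suggests. `stack_left` and `stack_overflow_warning` are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes left below sp in a stack of thread_size bytes (power of two, aligned). */
uint32_t stack_left(uint32_t sp, uint32_t thread_size)
{
	assert(thread_size != 0 && (thread_size & (thread_size - 1)) == 0);
	return sp & (thread_size - 1);
}

/* True when fewer than limit bytes remain, i.e. when the warning would fire. */
bool stack_overflow_warning(uint32_t sp, uint32_t thread_size, uint32_t limit)
{
	return stack_left(sp, thread_size) < limit;
}
```

With the frame address shown for __do_IRQ in the trace (0xeaa1c7c0) and an 8192-byte stack, `stack_left()` returns 1984, which matches the value in the "do_IRQ: stack overflow" message and is below the 2048-byte limit. The stack dump then has to fit in those 1984 bytes, and the trace shows the printk path from dump_stack_lvl down to format_decode consuming all of them before hitting 0xeaa1c000.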
(In reply to Christophe Leroy from comment #3)
> I can't see any issue, other than your CONFIG_THREAD_SHIFT is set to 13.
>
> It should be 14 by default, see
> https://elixir.bootlin.com/linux/v5.18/source/arch/powerpc/Kconfig#L769
>
> Is there any reason why you set it to 13 ?

I was not aware of setting it to a custom value. I thought 13 was the default on ppc32, which gets overridden to 14 if I select KASAN? I'll make sure to double-check this on future builds.

The only advanced option I did set is CONFIG_LOWMEM_SIZE=0x28000000 (see bug #215389).
Created attachment 301129 [details]
kernel .config (5.19-rc1, Outline KASAN + patches, PowerMac G4 DP)

I tried to reinvestigate this issue with a KASAN build of v5.19-rc1, but it seems it's not quite there yet. I applied the 2 patches "powerpc-kasan-Force-thread-size-increase-with-KASAN" and "v2-powerpc-irq-Increase-stack_overflow-detection-limit-when-KASAN-is-enabled" on top of v5.19-rc1, but I get a non-booting kernel. The kernel starts to boot but then gets stuck on a white screen reading "done found display: /pci@f0000000/ATY,AlteracParent@10/ATY,Alterac_B@1, opening..."

A kernel with the same config but with KFENCE instead of KASAN boots fine (see bug #216095).
Reinvestigated this issue with a KASAN build of v6.0.0-rc2 and it's looking good so far! No stack overflow at boot across about 10 reboots. Outline KASAN also seems to work fine.

I'll keep an eye on this and close the bug if I don't see it in the next few kernel releases.
The two patches mentioned in comment #7 were merged as:

3e8635fb2e07 ("powerpc/kasan: Force thread size increase with KASAN")
https://git.kernel.org/torvalds/c/3e8635fb2e072672cbc650989ffedf8300ad67fb

41f20d6db2b6 ("powerpc/irq: Increase stack_overflow detection limit when KASAN is enabled")
https://git.kernel.org/torvalds/c/41f20d6db2b64677225bb0b97df956241c353ef8

The boot failure with v5.19-rc1 might have been some other issue?

I'll close this for now, please reopen if you see this again.