Bug 216041 - Stack overflow at boot (do_IRQ: stack overflow: 1984) on a PowerMac G4 DP, KASAN debug build
Status: CLOSED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: PPC-32
Hardware: PPC-32 Linux
Importance: P1 normal
Assignee: platform_ppc-32
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-05-28 11:43 UTC by Erhard F.
Modified: 2023-04-12 00:47 UTC
CC List: 3 users

See Also:
Kernel Version: 5.18.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (5.18.0, PowerMac G4 DP), case 1 (36.59 KB, text/plain)
2022-05-28 11:43 UTC, Erhard F.
dmesg (5.18.0, PowerMac G4 DP), case 2 (36.05 KB, text/plain)
2022-05-28 11:43 UTC, Erhard F.
kernel .config (5.18.0, PowerMac G4 DP) (108.85 KB, text/plain)
2022-05-28 11:44 UTC, Erhard F.
kernel .config (5.19-rc1, Outline KASAN + patches, PowerMac G4 DP) (109.38 KB, text/plain)
2022-06-08 22:54 UTC, Erhard F.
attachment-25616-0.html (168 bytes, text/html)
2023-04-12 00:47 UTC, Christophe Leroy

Description Erhard F. 2022-05-28 11:43:04 UTC
Created attachment 301065
dmesg (5.18.0, PowerMac G4 DP), case 1

The attached v5.18 kernel .config triggers this stack overflow pretty easily on my G4 DP. It does not show up on every boot, but it does very often. The overflow usually happens after the radeon drm gets loaded.

I don't know whether it's the same issue as bug #207129, but here the KASAN + debug output looks more useful.

[...]
[drm] radeon: irq initialized.
[drm] Loading R300 Microcode
Loading firmware: radeon/R300_cp.bin
do_IRQ: stack overflow: 1984
CPU: 0 PID: 126 Comm: systemd-udevd Not tainted 5.18.0-gentoo-PMacG4 #1
Call Trace:
Oops: Kernel stack overflow, sig: 11 [#1]
BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac
Modules linked in: sr_mod cdrom radeon(+) ohci_pci(+) hwmon i2c_algo_bit drm_ttm_helper ttm drm_dp_helper snd_aoa_i2sbus snd_aoa_soundbus snd_pcm ehci_pci snd_timer ohci_hcd snd ssb ehci_hcd 8250_pci soundcore drm_kms_helper pcmcia 8250 pcmcia_core syscopyarea usbcore sysfillrect 8250_base sysimgblt serial_mctrl_gpio fb_sys_fops usb_common pkcs8_key_parser fuse drm drm_panel_orientation_quirks configfs
CPU: 0 PID: 126 Comm: systemd-udevd Not tainted 5.18.0-gentoo-PMacG4 #1
NIP:  c02e5558 LR: c07eb3bc CTR: c07f46a8
REGS: e7fe9f50 TRAP: 0000   Not tainted  (5.18.0-gentoo-PMacG4)
MSR:  00001032 <ME,IR,DR,RI>  CR: 44a14824  XER: 20000000

GPR00: c07eb3bc eaa1c000 c26baea0 eaa1c0a0 00000008 00000000 c07eb3bc eaa1c010 
GPR08: eaa1c0a8 04f3f3f3 f1f1f1f1 c07f4c84 44a14824 0080f7e4 00000005 00000010 
GPR16: 00000025 eaa1c154 eaa1c158 c0dbad64 00000020 fd543810 eaa1c0a0 eaa1c29e 
GPR24: c0dbad44 c0db8740 05ffffff fd543802 eaa1c150 c0c9a3c0 eaa1c0a0 c0c9a3c0 
NIP [c02e5558] kasan_check_range+0xc/0x2b4
LR [c07eb3bc] format_decode+0x80/0x604
Call Trace:
[eaa1c000] [c07eb3bc] format_decode+0x80/0x604 (unreliable)
[eaa1c070] [c07f4dac] vsnprintf+0x128/0x938
[eaa1c110] [c07f5788] sprintf+0xa0/0xc0
[eaa1c180] [c0154c1c] __sprint_symbol.constprop.0+0x170/0x198
[eaa1c230] [c07ee71c] symbol_string+0xf8/0x260
[eaa1c430] [c07f46d0] pointer+0x15c/0x710
[eaa1c4b0] [c07f4fbc] vsnprintf+0x338/0x938
[eaa1c550] [c00e8fa0] vprintk_store+0x2a8/0x678
[eaa1c690] [c00e94e4] vprintk_emit+0x174/0x378
[eaa1c6d0] [c00ea008] _printk+0x9c/0xc0
[eaa1c750] [c000ca94] show_stack+0x21c/0x260
[eaa1c7a0] [c07d0bd4] dump_stack_lvl+0x60/0x90
[eaa1c7c0] [c0009234] __do_IRQ+0x170/0x174
[eaa1c800] [c0009258] do_IRQ+0x20/0x34
[eaa1c820] [c00045b4] HardwareInterrupt_virt+0x108/0x10c
--- interrupt: 500 at finish_task_switch.isra.0+0x130/0x3a8
NIP:  c00a3c9c LR: c00a3c88 CTR: c036560c
REGS: eaa1c830 TRAP: 0500   Not tainted  (5.18.0-gentoo-PMacG4)
MSR:  0220b032 <VEC,EE,FP,ME,IR,DR,RI>  CR: 22882848  XER: 20000000

GPR00: c00a3c88 eaa1c8e0 c26baea0 e6dcf2a0 c0c59b28 ea18df50 c002a268 c2fd1470 
GPR08: 00000003 0200b032 00000000 00000001 22004868 0080f7e4 92e60efc c88833b4 
GPR16: 00000000 c88833c0 00000003 25c84000 00000006 e6dcf82c d82c68a0 00000000 
GPR24: 25c84000 00000000 c2fd0bc0 c0c59b2c c114b2a0 e6dcf2a0 00000000 00000000 
NIP [c00a3c9c] finish_task_switch.isra.0+0x130/0x3a8
LR [c00a3c88] finish_task_switch.isra.0+0x11c/0x3a8
--- interrupt: 500
[eaa1c920] [c0c59b2c] __schedule+0x3f0/0x9dc
[eaa1c9b0] [c0c5a18c] schedule+0x74/0x13c
[eaa1c9d0] [c0c5a2e4] io_schedule+0x54/0x8c
[eaa1c9f0] [c0c5af0c] bit_wait_io+0x18/0x94
[eaa1ca10] [c0c5a8a0] __wait_on_bit+0x100/0x28c
[eaa1ca60] [c0c5aaf4] out_of_line_wait_on_bit+0xc8/0xf0
[eaa1cae0] [c0494d38] ext4_read_bh+0x184/0x1a8
[eaa1cb10] [c0428bd4] __read_extent_tree_block+0x1a4/0x2d0
[eaa1cb50] [c042ade4] ext4_find_extent+0x270/0x5a4
[eaa1cbb0] [c04302b8] ext4_ext_map_blocks+0x11c/0x1d8c
[eaa1cdd0] [c0452f14] ext4_map_blocks+0x3f0/0x950
[eaa1ce90] [c0454890] ext4_getblk+0x2e8/0x3cc
[eaa1cf30] [c0454988] ext4_bread+0x14/0x110
[eaa1cf50] [c047d538] __ext4_read_dirblock+0x4c/0x52c
[eaa1cf90] [c047da8c] dx_probe+0x74/0x95c
[eaa1cff0] [c0480ed0] __ext4_find_entry+0x6c0/0x9b4
[eaa1d120] [c04812d4] ext4_lookup+0x110/0x3c0
[eaa1d1d0] [c0326524] __lookup_slow+0x10c/0x25c
[eaa1d240] [c032d1f8] walk_component+0x220/0x30c
[eaa1d2d0] [c032d6a0] link_path_walk.part.0.constprop.0+0x3bc/0x608
usb usb6: New USB device found, idVendor=1d6b, idProduct=0001, bcdDevice= 5.18

[eaa1d380] [c032e384] path_openat+0x1b4/0x1648
usb usb6: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[eaa1d490] [c03310e8] do_file_open_root+0x168/0x288
[eaa1d5c0] [c03095f8] file_open_root+0x150/0x234
usb usb6: Product: OHCI PCI host controller

usb usb6: Manufacturer: Linux 5.18.0-gentoo-PMacG4 ohci_hcd
[eaa1d640] [c037e688] kernel_read_file_from_path_initns+0x140/0x204
[eaa1d6d0] [c08f7614] _request_firmware+0x970/0xae8
usb usb6: SerialNumber: 0001:10:1b.1

[eaa1d7c0] [c08f77d8] request_firmware+0x4c/0x78
[eaa1d7e0] [befd63e0] r100_cp_init+0x590/0x608 [radeon]
[eaa1d810] [befe0310] r300_startup.constprop.0+0x3b0/0x458 [radeon]
[eaa1d840] [befe0854] r300_init+0x19c/0x368 [radeon]
[eaa1d860] [bef79e3c] radeon_device_init+0x930/0x10ac [radeon]
[eaa1d8c0] [bef7c0b0] radeon_driver_load_kms+0xf4/0x2b8 [radeon]
[eaa1d900] [bec82498] drm_dev_register+0x15c/0x3ac [drm]
[eaa1d940] [bef77620] radeon_pci_probe+0x13c/0x178 [radeon]
[eaa1d970] [c0829080] pci_device_probe+0x100/0x238
[eaa1d9a0] [c08da164] really_probe.part.0+0x108/0x428
[eaa1d9d0] [c08da578] __driver_probe_device+0xf4/0x1ac
[eaa1d9f0] [c08da69c] driver_probe_device+0x6c/0x150
[eaa1da20] [c08db100] __driver_attach+0xec/0x200
[eaa1da50] [c08d63e8] bus_for_each_dev+0xf8/0x164
[eaa1dac0] [c08d8dec] bus_add_driver+0x274/0x31c
[eaa1db00] [c08dbee4] driver_register+0x114/0x258
[eaa1db30] [c0007b84] do_one_initcall+0xb0/0x318
[eaa1dc00] [c014eff4] do_init_module+0xfc/0x3e0
[eaa1dc30] [c0152aec] load_module+0x36d8/0x3d1c
[eaa1de60] [c01534bc] sys_finit_module+0x114/0x170
[eaa1df40] [c001f1a8] ret_from_syscall+0x0/0x2c
--- interrupt: c00 at 0xa7e17acc
NIP:  a7e17acc LR: 003d8538 CTR: a7d9dac0
REGS: eaa1df50 TRAP: 0c00   Not tainted  (5.18.0-gentoo-PMacG4)
MSR:  0000d032 <EE,PR,ME,IR,DR,RI>  CR: 24222448  XER: 00000000

GPR00: 00000161 afd91980 a7f5a560 00000018 003e41a8 00000000 00000018 00000000 
GPR08: 00000000 00000008 00000000 a7e8a44c a7d209ec 0080f7e4 00000000 00000000 
GPR16: 009cea00 00000000 0aba9500 009cea00 00000000 afd91b3c 00968570 00000000 
GPR24: 009cea00 00000007 009cf6c0 003e41a8 00020000 00000000 00404cb8 009cea00 
NIP [a7e17acc] 0xa7e17acc
LR [003d8538] 0x3d8538
--- interrupt: c00
Instruction dump:
80010024 8361000c 83c10018 83810010 7c0803a6 83a10014 83e1001c 38210020 
4e800020 2c040000 41820260 7d032214 <9421fff0> 7c034040 41810154 3d40b000 
---[ end trace 0000000000000000 ]---
Comment 1 Erhard F. 2022-05-28 11:43:31 UTC
Created attachment 301066
dmesg (5.18.0, PowerMac G4 DP), case 2
Comment 2 Erhard F. 2022-05-28 11:44:54 UTC
Created attachment 301067
kernel .config (5.18.0, PowerMac G4 DP)
Comment 3 Christophe Leroy 2022-05-28 17:59:59 UTC
I can't see any issue, other than your CONFIG_THREAD_SHIFT is set to 13.

It should be 14 by default, see https://elixir.bootlin.com/linux/v5.18/source/arch/powerpc/Kconfig#L769

Is there any reason why you set it to 13 ?
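
For reference, the stack size follows from the shift as THREAD_SIZE = 1 << THREAD_SHIFT, so the two values discussed here differ by a factor of two. A minimal sketch of that arithmetic (the helper name is hypothetical and is not kernel code):

```c
#include <stdint.h>

/*
 * Hypothetical helper for illustration only: the kernel stack size in
 * bytes for a given CONFIG_THREAD_SHIFT, i.e. 1 << THREAD_SHIFT.
 */
static inline uint32_t thread_size_from_shift(unsigned int shift)
{
	return (uint32_t)1 << shift;
}
```

With this, thread_size_from_shift(13) is 8192 bytes (8 KiB) and thread_size_from_shift(14) is 16384 bytes (16 KiB).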
Comment 4 Arnd Bergmann 2022-05-28 18:50:12 UTC
Setting it higher is probably a good idea, but there really isn't a safe limit with KASAN. At least if KASAN_STACK is active, running with KASAN always carries a risk of running into stack overflow issues.

One thing that sticks out is that there is an interrupt on the same stack as the task, in:

[eaa1c800] [c0009258] do_IRQ+0x20/0x34
[eaa1c820] [c00045b4] HardwareInterrupt_virt+0x108/0x10c
[eaa1c920] [c0c59b2c] __schedule+0x3f0/0x9dc
[eaa1c9b0] [c0c5a18c] schedule+0x74/0x13c


It looks like on ppc32, as of 547db12fd8a0 ("powerpc/32: Use vmapped stacks for interrupts"), you have either VMAP_STACK (to detect stack overflows) or IRQ stacks (to make them less likely). I think you really want both instead, with the IRQ stacks allocated from vmalloc space as well.

The ext4 read path is a bit wasteful with KASAN enabled, using 1776 bytes from ext4_lookup to ext4_read_bh, but not excessively so.
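
The 1776-byte figure can be reproduced from the bracketed stack addresses in the call trace. The sketch below assumes that the first bracketed value on each line is the stack pointer of that function's frame, so a frame extends from its own value up to the value listed for its caller. The helper name is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: bytes of stack between two backtrace entries.
 * caller_sp is the bracketed value of the (older) frame higher up the
 * stack, callee_sp the bracketed value of the (newer) frame lower down.
 */
static inline uint32_t stack_span(uint32_t caller_sp, uint32_t callee_sp)
{
	return caller_sp - callee_sp;
}
```

Taking 0xeaa1d1d0 (__lookup_slow, the caller of ext4_lookup) and 0xeaa1cae0 (ext4_read_bh) from the trace, stack_span(0xeaa1d1d0, 0xeaa1cae0) gives 0x6f0, which is 1776 bytes. Under the same assumption, the frame of ext4_lookup alone is stack_span(0xeaa1d1d0, 0xeaa1d120), or 176 bytes.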
Comment 5 Christophe Leroy 2022-05-29 08:06:31 UTC
There is an interrupt, and that needs to be looked at a bit more deeply:

[eaa1c7a0] [c07d0bd4] dump_stack_lvl+0x60/0x90
[eaa1c7c0] [c0009234] __do_IRQ+0x170/0x174
[eaa1c800] [c0009258] do_IRQ+0x20/0x34
[eaa1c820] [c00045b4] HardwareInterrupt_virt+0x108/0x10c

The interesting part is __do_IRQ() :

void __do_IRQ(struct pt_regs *regs)
{
	struct pt_regs *old_regs = set_irq_regs(regs);
	void *cursp, *irqsp, *sirqsp;

	/* Switch to the irq stack to handle this */
	cursp = (void *)(current_stack_pointer & ~(THREAD_SIZE - 1));
	irqsp = hardirq_ctx[raw_smp_processor_id()];
	sirqsp = softirq_ctx[raw_smp_processor_id()];

	check_stack_overflow();

	/* Already there ? */
	if (unlikely(cursp == irqsp || cursp == sirqsp)) {
		__do_irq(regs);
		set_irq_regs(old_regs);
		return;
	}
	/* Switch stack and call */
	call_do_irq(regs, irqsp);

	set_irq_regs(old_regs);
}

The dump_stack() we see in the call trace is from check_stack_overflow(), following the message "do_IRQ: stack overflow: 1984", because the stack dropped below 0xeaa1c800.

The check_stack_overflow() function emits a warning and a stack dump when CONFIG_DEBUG_STACKOVERFLOW is selected and less than 2 kbytes remain available on the stack.

But here we get an Oops when the stack reaches 0xeaa1c000. It seems the 2 kbytes limit is not enough to properly perform the stack dump.

Commit 547db12fd8a0 ("powerpc/32: Use vmapped stacks for interrupts") doesn't remove IRQ stacks. It changes the IRQ stack allocation from kmalloc to vmalloc.

Here we are still on the original stack. The switch to the IRQ stack is performed by call_do_irq().
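
The numbers in the message and the trace can be checked with a small model of the threshold test. This is only a sketch based on the description above, not the kernel source: the SKETCH_* names and the helpers are hypothetical, the shift of 13 is the value from the reporter's .config, and the 2048-byte limit is the "2 kbytes" mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_THREAD_SHIFT 13u                         /* assumed: reporter's .config */
#define SKETCH_THREAD_SIZE  (1u << SKETCH_THREAD_SHIFT) /* 8192 bytes */
#define SKETCH_STACK_LIMIT  2048u                       /* assumed: the 2 kbytes limit */

/* Bytes left between sp and the base of its THREAD_SIZE-aligned stack. */
static inline uint32_t stack_bytes_left(uint32_t sp)
{
	return sp & (SKETCH_THREAD_SIZE - 1);
}

/* Lowest address of the stack containing sp (same mask as cursp in __do_IRQ()). */
static inline uint32_t stack_base(uint32_t sp)
{
	return sp & ~(SKETCH_THREAD_SIZE - 1);
}

/* True when fewer than SKETCH_STACK_LIMIT bytes remain below sp. */
static inline bool stack_warning_fires(uint32_t sp)
{
	return stack_bytes_left(sp) < SKETCH_STACK_LIMIT;
}
```

For the __do_IRQ frame at 0xeaa1c7c0, stack_bytes_left() gives 1984, which matches the value printed in the message, and stack_base() gives 0xeaa1c000, the address where the Oops occurs. The dump therefore started with only 1984 bytes of headroom, and the trace shows the printk path consuming all of it.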
Comment 6 Erhard F. 2022-05-29 13:08:03 UTC
(In reply to Christophe Leroy from comment #3)
> I can't see any issue, other than your CONFIG_THREAD_SHIFT is set to 13.
> 
> It should be 14 by default, see
> https://elixir.bootlin.com/linux/v5.18/source/arch/powerpc/Kconfig#L769
> 
> Is there any reason why you set it to 13 ?
I was not aware of setting it to a custom value. I thought 13 is the default on ppc32, which gets overridden to 14 if I select KASAN?

But I'll make sure to double-check this on future builds. The only advanced option I set is CONFIG_LOWMEM_SIZE=0x28000000 (see bug #215389).
Comment 7 Erhard F. 2022-06-08 22:54:59 UTC
Created attachment 301129
kernel .config (5.19-rc1, Outline KASAN + patches, PowerMac G4 DP)

I tried to reinvestigate this issue with a KASAN build of v5.19-rc1, but it seems it's not quite there.

I applied the 2 patches "powerpc-kasan-Force-thread-size-increase-with-KASAN" and "v2-powerpc-irq-Increase-stack_overflow-detection-limit-when-KASAN-is-enabled" on top of v5.19-rc1, but I get a non-booting kernel. The kernel starts to boot but gets stuck on a white screen reading

"done
found display: /pci@f0000000/ATY,AlteracParent@10/ATY,Alterac_B@1, opening..."

Kernel with same config but with KFENCE instead of KASAN boots fine (see bug #216095).
Comment 8 Erhard F. 2022-08-23 21:31:54 UTC
I reinvestigated this issue with a KASAN build of v6.0.0-rc2 and it's looking good so far! No stack overflow at boot after about 10 reboots. Outline KASAN also seems to work fine.

I'll keep an eye on this and close here if I don't see it in the next few kernel releases.
Comment 9 Michael Ellerman 2023-04-12 00:47:30 UTC
The two patches mentioned in comment #7 were merged as:

3e8635fb2e07 ("powerpc/kasan: Force thread size increase with KASAN")

https://git.kernel.org/torvalds/c/3e8635fb2e072672cbc650989ffedf8300ad67fb

41f20d6db2b6 ("powerpc/irq: Increase stack_overflow detection limit when KASAN is enabled")

https://git.kernel.org/torvalds/c/41f20d6db2b64677225bb0b97df956241c353ef8


The boot failure with v5.19-rc1 might have been some other issue?

I'll close this for now, please reopen if you see this again.
Comment 10 Christophe Leroy 2023-04-12 00:47:50 UTC
Created attachment 304123
attachment-25616-0.html

I'm away from the office until April 24th
