Bug 213837

Summary: [bisected] "Kernel panic - not syncing: corrupted stack end detected inside scheduler" at building via distcc on a G5
Product: Platform Specific/Hardware
Reporter: Erhard F. (erhard_f)
Component: PPC-64
Assignee: platform_ppc-64
Status: NEEDINFO ---
Severity: normal
CC: davem, michael, pablo, platform_ppc-64
Priority: P1
Hardware: PPC-64
OS: Linux
Kernel Version: 5.13.4
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg (5.13.4, PowerMac G5 11,2)
kernel .config (5.13.4, PowerMac G5 11,2)
dmesg (5.14-rc6, PowerMac G5 11,2)
kernel .config (5.14-rc6, PowerMac G5 11,2)
kernel .config (5.14.1, PowerMac G5 11,2)
dmesg (5.15-rc2 + patch, PowerMac G5 11,2)
System.map (5.15-rc2 + patch, PowerMac G5 11,2)
kernel .config (5.15-rc2 + CONFIG_THREAD_SHIFT=15, PowerMac G5 11,2)
System.map (5.15-rc2 + patch + CONFIG_THREAD_SHIFT=15, PowerMac G5 11,2)
dmesg (5.15-rc2 + patch, PowerMac G5 11,2) #1
dmesg (5.16-rc2 + patch, PowerMac G5 11,2)
bisect.log
kernel .config (5.17-rc4, PowerMac G5 11,2)
dmesg (5.17-rc4 + patch, PowerMac G5 11,2)
System.map (5.17-rc7 + patch, PowerMac G5 11,2)
kernel .config (5.17-rc7, PowerMac G5 11,2)
dmesg (5.17-rc7 + patch, PowerMac G5 11,2)

Description Erhard F. 2021-07-23 20:00:41 UTC
Created attachment 298017 [details]
dmesg (5.13.4, PowerMac G5 11,2)

Happens when building larger projects on my G5 via distcc. Time to failure is about 3-10 minutes. Kernel 5.10.x does not show this problem. Probably connected to bug #213079.

[..]
Call Trace:
Kernel panic - not syncing: corrupted stack end detected inside scheduler
CPU: 1 PID: 11467 Comm: powerpc64-unkno Tainted: G        W         5.13.4-PowerMacG5+ #2
[c00000003e79ea80] [c000000000541c90] .dump_stack+0xe0/0x13c (unreliable)
[c00000003e79eb20] [c00000000006813c] .panic+0x168/0x430
[c00000003e79ebd0] [c00000000080a4b0] .__schedule+0x80/0x840
[c00000003e79ecb0] [c00000000080adbc] .preempt_schedule_common+0x28/0x48
[c00000003e79ed30] [c00000000080ae0c] .__cond_resched+0x30/0x4c
[c00000003e79edb0] [c0000000001c6d80] .mempool_alloc+0x38/0x198
[c00000003e79ee90] [c00000000049a444] .bio_alloc_bioset+0x94/0x174
[c00000003e79ef40] [c00000000049a544] .bio_clone_fast+0x20/0x7c
[c00000003e79efd0] [c00000000049a60c] .bio_split+0x6c/0xc4
[c00000003e79f060] [c0000000004a7018] .__blk_queue_split+0x120/0x474
[c00000003e79f160] [c0000000004adc30] .blk_mq_submit_bio+0x88/0x524
[c00000003e79f250] [c0000000004a0e30] .submit_bio_noacct+0xc4/0x26c
[c00000003e79f340] [c000000000355bec] .ext4_io_submit+0x5c/0x70
[c00000003e79f3c0] [c000000000355f08] .ext4_bio_write_page+0x2f4/0x480
[c00000003e79f480] [c000000000334b84] .mpage_submit_page+0x70/0xa0
[c00000003e79f500] [c00000000033b09c] .ext4_writepages+0xcc4/0xe5c
[c00000003e79f7b0] [c0000000001cf214] .do_writepages+0x54/0xa0
[c00000003e79f830] [c0000000001c3ab8] .__filemap_fdatawrite_range+0xc0/0xfc
[c00000003e79f930] [c000000000337f34] .ext4_alloc_da_blocks+0xf4/0x100
[c00000003e79f9b0] [c000000000328594] .ext4_release_file+0x24/0xd8
[c00000003e79fa40] [c00000000026ea5c] .__fput+0x12c/0x270
[c00000003e79fae0] [c00000000008eb40] .task_work_run+0xa0/0xc0
[c00000003e79fb70] [c00000000006e284] .do_exit+0x55c/0xa6c
[c00000003e79fc60] [c00000000006e824] .do_group_exit+0x50/0xb0
[c00000003e79fcf0] [c00000000006e898] .__wake_up_parent+0x0/0x34
[c00000003e79fd60] [c000000000021540] .system_call_exception+0x1b4/0x1ec
[c00000003e79fe10] [c00000000000b9c4] system_call_common+0xe4/0x214
--- interrupt: c00 at 0x3fffadc46aa8
NIP:  00003fffadc46aa8 LR: 00003fffadba6d04 CTR: 0000000000000000
REGS: c00000003e79fe80 TRAP: 0c00   Tainted: G        W          (5.13.4-PowerMacG5+)
MSR:  900000000200f032 <SF,HV,VEC,EE,PR,FP,ME,IR,DR,RI>  CR: 22000482  XER: 00000000
IRQMASK: 0 
GPR00: 00000000000000ea 00003ffff3f1ae50 00003fffadd65300 0000000000000000 
GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
GPR12: 0000000000000000 00003fffadeddc30 000000014af075a0 000000012379b0f0 
GPR16: 000000012947ec38 00003fffd7d95cd8 000000012947eb28 000000000000002f 
GPR20: 0000000000000000 00003fffadd5fff8 0000000000000001 00003fffadd5ea58 
GPR24: 0000000000000000 0000000000000000 0000000000000003 0000000000000001 
GPR28: 0000000000000000 00003fffaded6c50 fffffffffffff000 0000000000000000 
NIP [00003fffadc46aa8] 0x3fffadc46aa8
LR [00003fffadba6d04] 0x3fffadba6d04
--- interrupt: c00
Rebooting in 40 seconds..
Comment 1 Erhard F. 2021-07-23 20:01:15 UTC
Created attachment 298019 [details]
kernel .config (5.13.4, PowerMac G5 11,2)
Comment 2 Erhard F. 2021-08-20 16:34:22 UTC
Created attachment 298393 [details]
dmesg (5.14-rc6, PowerMac G5 11,2)

Also happens on 5.14-rc6:

[...]
Kernel panic - not syncing: corrupted stack end detected inside scheduler
CPU: 1 PID: 32354 Comm: powerpc64-unkno Tainted: G        W         5.14.0-rc6-PowerMacG5+ #3
Call Trace:
[c000000062feb4c0] [c00000000054de04] .dump_stack_lvl+0x98/0xe0 (unreliable)
[c000000062feb550] [c000000000068f4c] .panic+0x160/0x40c
[c000000062feb600] [c000000000818d3c] .__schedule+0x7c/0x840
[c000000062feb6d0] [c00000000081964c] .preempt_schedule_common+0x28/0x48
[c000000062feb750] [c00000000081969c] .__cond_resched+0x30/0x4c
[c000000062feb7d0] [c0000000004ee8f8] .copy_page_to_iter+0xbc/0x32c
[c000000062feb8a0] [c0000000001c9c20] .filemap_read+0x574/0x618
[c000000062feba60] [c000000000333c18] .ext4_file_read_iter+0xb8/0x11c
[c000000062febb00] [c000000000275488] .new_sync_read+0x94/0xe0
[c000000062febc00] [c000000000276c2c] .vfs_read+0x128/0x12c
[c000000062febca0] [c000000000276fc4] .ksys_read+0x78/0xc4
[c000000062febd60] [c000000000022808] .system_call_exception+0x1a4/0x1dc
[c000000062febe10] [c00000000000b4cc] system_call_common+0xec/0x250
--- interrupt: c00 at 0x3fff999e0cd0
NIP:  00003fff999e0cd0 LR: 00000001039d3660 CTR: 0000000000000000
REGS: c000000062febe80 TRAP: 0c00   Tainted: G        W          (5.14.0-rc6-PowerMacG5+)
MSR:  900000000200f032 <SF,HV,VEC,EE,PR,FP,ME,IR,DR,RI>  CR: 24000442  XER: 00000000
IRQMASK: 0 
GPR00: 0000000000000003 00003ffff4e23690 00003fff99a0df00 0000000000000004 
GPR04: 00003fff99699010 00000000000583ca 00003fff999c1320 0000000000000000 
GPR08: 00003fff999c12e0 0000000000000000 0000000000000000 0000000000000000 
GPR12: 0000000000000000 00003fff99a93c20 000000011fbcf950 000000017149a1d0 
GPR16: 00000001039dec38 00003ffff4e23b78 00000001039deb28 00003ffff4e239c8 
GPR20: 00003ffff4e23d80 ffffffffffffffff 000000011fbce540 0000000000000000 
GPR24: 000000011fbcf930 000000011fbcfa90 0000000000000005 00003ffff4e238e0 
GPR28: 0000000103a268e8 0000000000000004 00003fff99699010 00000000000583ca 
NIP [00003fff999e0cd0] 0x3fff999e0cd0
LR [00000001039d3660] 0x1039d3660
--- interrupt: c00
Rebooting in 40 seconds..
Comment 3 Erhard F. 2021-08-20 16:35:11 UTC
Created attachment 298395 [details]
kernel .config (5.14-rc6, PowerMac G5 11,2)
Comment 4 Erhard F. 2021-09-05 14:11:17 UTC
Checked whether this really has something to do with bug #213079 by copying the root partition to a regular HDD and using that one instead. As the issue still happens, it seems these are two separate bugs.

[...]
Kernel panic - not syncing: corrupted stack end detected inside scheduler
CPU: 1 PID: 1509 Comm: powerpc64-unkno Tainted: G        W         5.14.1-PowerMacG5+ #2
Call Trace:
[c0000000386434c0] [c00000000054cd64] .dump_stack_lvl+0x98/0xe0 (unreliable)
[c000000038643550] [c000000000068ab8] .panic+0x160/0x40c
[c000000038643600] [c00000000081202c] .__schedule+0x7c/0x840
[c0000000386436d0] [c00000000081293c] .preempt_schedule_common+0x28/0x48
[c000000038643750] [c00000000081298c] .__cond_resched+0x30/0x4c
[c0000000386437d0] [c0000000004edf18] .copy_page_to_iter+0xbc/0x32c
[c0000000386438a0] [c0000000001c99d8] .filemap_read+0x574/0x618
[c000000038643a60] [c00000000033182c] .ext4_file_read_iter+0xb8/0x11c
[c000000038643b00] [c000000000272f1c] .new_sync_read+0x94/0xe0
[c000000038643c00] [c0000000002746c0] .vfs_read+0x128/0x12c
[c000000038643ca0] [c000000000274a58] .ksys_read+0x78/0xc4
[c000000038643d60] [c000000000022808] .system_call_exception+0x1a4/0x1dc
[c000000038643e10] [c00000000000b4cc] system_call_common+0xec/0x250
--- interrupt: c00 at 0x3fffbc477cd0
NIP:  00003fffbc477cd0 LR: 000000011c413660 CTR: 0000000000000000
REGS: c000000038643e80 TRAP: 0c00   Tainted: G        W          (5.14.1-PowerMacG5+)
MSR:  900000000200f032 <SF,HV,VEC,EE,PR,FP,ME,IR,DR,RI>  CR: 24000422  XER: 00000000
IRQMASK: 0 
GPR00: 0000000000000003 00003fffd3c43d70 00003fffbc4a4f00 0000000000000004 
GPR04: 00003fffbbfac010 00000000001e7697 00003fffbc458320 0000000000000000 
GPR08: 00003fffbc4582e0 0000000000000000 0000000000000000 0000000000000000 
GPR12: 0000000000000000 00003fffbc54ec20 00000001470b79c0 0000000157c21760 
GPR16: 000000011c41ec38 00003fffd3c44258 000000011c41eb28 00003fffd3c440a8 
GPR20: 00003fffd3c44460 ffffffffffffffff 00000001470b6dd0 0000000000000000 
GPR24: 00000001470b77f0 00000001470b7d30 0000000000000005 00003fffd3c43fc0 
GPR28: 000000011c4668e8 0000000000000004 00003fffbbfac010 00000000001e7697 
NIP [00003fffbc477cd0] 0x3fffbc477cd0
LR [000000011c413660] 0x11c413660
--- interrupt: c00
Rebooting in 40 seconds..
Comment 5 Erhard F. 2021-09-05 14:15:28 UTC
Created attachment 298671 [details]
kernel .config (5.14.1, PowerMac G5 11,2)
Comment 6 mpe 2021-09-08 12:54:59 UTC
bugzilla-daemon@bugzilla.kernel.org writes:
> https://bugzilla.kernel.org/show_bug.cgi?id=213837
>
> Erhard F. (erhard_f@mailbox.org) changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>            See Also|https://bugzilla.kernel.org |
>                    |/show_bug.cgi?id=213079     |
>
> --- Comment #4 from Erhard F. (erhard_f@mailbox.org) ---
> Checked out whether this has really something to do with bug #213079 or not
> by
> copying this root partition to a regular HDD and use that one instead. As the
> issue still happens it seems these are two separate bugs.
>
> [...]
> Kernel panic - not syncing: corrupted stack end detected inside scheduler

Can you try this patch, it might help us work out what is corrupting the
stack.

cheers

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c4462c454ab9..07bfa25c1b48 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5490,8 +5490,14 @@ static noinline void __schedule_bug(struct task_struct *prev)
 static inline void schedule_debug(struct task_struct *prev, bool preempt)
 {
 #ifdef CONFIG_SCHED_STACK_END_CHECK
-	if (task_stack_end_corrupted(prev))
+	if (task_stack_end_corrupted(prev)) {
+		char *start = (char *)end_of_stack(prev);
+		pr_err("stack corrupted? stack end = 0x%px\n", end_of_stack(prev));
+		print_hex_dump(KERN_ERR, "stack: ", DUMP_PREFIX_ADDRESS, 16, 4,
+			       start - SZ_1K, THREAD_SIZE + SZ_1K, true);
+
 		panic("corrupted stack end detected inside scheduler\n");
+	}
 
 	if (task_scs_end_corrupted(prev))
 		panic("corrupted shadow stack detected inside scheduler\n");
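
A rough sketch of which addresses the print_hex_dump() call in this patch covers, in case it helps when reading the resulting output. It assumes THREAD_SIZE is 1 << CONFIG_THREAD_SHIFT with a shift of 14 (the value the failing kernels in this report were later said to use); the helper name dump_window is made up for illustration. With the stack end reported in comment 7 it gives a window from c000000029fdbc00 up to (but not including) c000000029fe0000, which matches the first and last dump lines posted there.

```python
# Sketch only: address window covered by the debug patch's print_hex_dump().
# Assumption: THREAD_SIZE == 1 << CONFIG_THREAD_SHIFT, shift 14 by default.

SZ_1K = 1024


def dump_window(stack_end, thread_shift=14):
    """Return (first, last_exclusive) addresses of the hex dump."""
    thread_size = 1 << thread_shift
    first = stack_end - SZ_1K        # "start - SZ_1K" in the patch
    length = thread_size + SZ_1K     # "THREAD_SIZE + SZ_1K" in the patch
    return first, first + length


if __name__ == "__main__":
    first, last = dump_window(0xC000000029FDC000)
    print(hex(first), hex(last))
```

So the first 1 KiB of the dump is memory below the stack, and everything from the "stack end" address upwards is the stack itself.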
Comment 7 Erhard F. 2021-09-22 15:33:30 UTC
Created attachment 298919 [details]
dmesg (5.15-rc2 + patch, PowerMac G5 11,2)

(In reply to mpe from comment #6)
> Can you try this patch, it might help us work out what is corrupting the
> stack.
With your patch applied to recent v5.15-rc2 the output looks like this:

[...]
stack corrupted? stack end = 0xc000000029fdc000
stack: c000000029fdbc00: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
stack: c000000029fdbc10: 00000ddc 7c000010 cccccccc cccccccc  ....|...........
stack: c000000029fdbc20: 29fc4e41 673d4bb3 5a5a5a5a 5a5a5a5a  ).NAg=K.ZZZZZZZZ
stack: c000000029fdbc30: cccccccc cccccccc 00000ddc 8e000010  ................
stack: c000000029fdbc40: cccccccc cccccccc 41fc4e41 673d41a3  ........A.NAg=A.
stack: c000000029fdbc50: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
stack: c000000029fdbc60: 00000ddc 8e00000c cccccccc cccccccc  ................
stack: c000000029fdbc70: 79fc4e41 673d4dab 5a5a5a5a 5a5a5a5a  y.NAg=M.ZZZZZZZZ
stack: c000000029fdbc80: cccccccc cccccccc 00000ddc 90000008  ................
stack: c000000029fdbc90: cccccccc cccccccc 91fc4e41 673d4573  ..........NAg=Es
stack: c000000029fdbca0: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
stack: c000000029fdbcb0: 00000dd7 ac000016 cccccccc cccccccc  ................
stack: c000000029fdbcc0: c9fc4e41 673d4203 5a5a5a5a 5a5a5a5a  ..NAg=B.ZZZZZZZZ
stack: c000000029fdbcd0: cccccccc cccccccc 00000ddc 6c000004  ............l...
stack: c000000029fdbce0: cccccccc cccccccc e1fc4e41 673d474b  ..........NAg=GK
stack: c000000029fdbcf0: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
stack: c000000029fdbd00: 00000ddc 88000000 cccccccc cccccccc  ................
stack: c000000029fdbd10: 19fd4e41 673d4143 5a5a5a5a 5a5a5a5a  ..NAg=ACZZZZZZZZ
[...]
stack: c000000029fdffd0: 00000000 00000000 00000000 00000000  ................
stack: c000000029fdffe0: 00000000 00000000 00000000 00000000  ................
stack: c000000029fdfff0: 00000000 00000000 00000000 00000000  ................
Kernel panic - not syncing: corrupted stack end detected inside scheduler
CPU: 0 PID: 686 Comm: kworker/u4:0 Tainted: G        W         5.15.0-rc2-PowerMacG5+ #2
Workqueue: writeback .wb_workfn (flush-254:1)
Call Trace:
[c000000029fdf400] [c0000000005532c8] .dump_stack_lvl+0x98/0xe0 (unreliable)
[c000000029fdf490] [c000000000069534] .panic+0x14c/0x3e8
[c000000029fdf540] [c00000000081d598] .__schedule+0xc0/0x874
[c000000029fdf610] [c00000000081de98] .preempt_schedule_common+0x28/0x48
[c000000029fdf690] [c00000000081dee4] .__cond_resched+0x2c/0x50
[c000000029fdf700] [c0000000002b31b8] .writeback_sb_inodes+0x328/0x4c8
[c000000029fdf880] [c0000000002b33e8] .__writeback_inodes_wb+0x90/0xcc
[c000000029fdf930] [c0000000002b3650] .wb_writeback+0x22c/0x3c8
[c000000029fdfa50] [c0000000002b45a8] .wb_workfn+0x380/0x460
[c000000029fdfbb0] [c00000000008b300] .process_one_work+0x31c/0x4ec
[c000000029fdfca0] [c00000000008b950] .worker_thread+0x1d4/0x290
[c000000029fdfd60] [c000000000093b0c] .kthread+0x124/0x12c
[c000000029fdfe10] [c00000000000bce0] .ret_from_kernel_thread+0x58/0x60
Rebooting in 40 seconds..

Can't make much sense out of it but hopefully you can. ;) For the full trace please have a look at the attached kernel dmesg (via netconsole).
Comment 8 mpe 2021-09-23 14:05:19 UTC
bugzilla-daemon@bugzilla.kernel.org writes:
> https://bugzilla.kernel.org/show_bug.cgi?id=213837
>
> --- Comment #7 from Erhard F. (erhard_f@mailbox.org) ---
> Created attachment 298919 [details]
>   --> https://bugzilla.kernel.org/attachment.cgi?id=298919&action=edit
> dmesg (5.15-rc2 + patch, PowerMac G5 11,2)
>
> (In reply to mpe from comment #6)
>> Can you try this patch, it might help us work out what is corrupting the
>> stack.
> With your patch applied to recent v5.15-rc2 the output looks like this:
>
> [...]
> stack corrupted? stack end = 0xc000000029fdc000
> stack: c000000029fdbc00: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
...

> Can't make much sense out of it but hopefully you can. ;)

Thanks. Obvious isn't it? ;)

  stack corrupted? stack end = 0xc000000029fdc000
  stack: c000000029fdbc00: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbc10: 00000ddc 7c000010 cccccccc cccccccc  ....|...........
  stack: c000000029fdbc20: 29fc4e41 673d4bb3 5a5a5a5a 5a5a5a5a  ).NAg=K.ZZZZZZZZ
  stack: c000000029fdbc30: cccccccc cccccccc 00000ddc 8e000010  ................
  stack: c000000029fdbc40: cccccccc cccccccc 41fc4e41 673d41a3  ........A.NAg=A.
  stack: c000000029fdbc50: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbc60: 00000ddc 8e00000c cccccccc cccccccc  ................
  stack: c000000029fdbc70: 79fc4e41 673d4dab 5a5a5a5a 5a5a5a5a  y.NAg=M.ZZZZZZZZ
  stack: c000000029fdbc80: cccccccc cccccccc 00000ddc 90000008  ................
  stack: c000000029fdbc90: cccccccc cccccccc 91fc4e41 673d4573  ..........NAg=Es
  stack: c000000029fdbca0: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbcb0: 00000dd7 ac000016 cccccccc cccccccc  ................
  stack: c000000029fdbcc0: c9fc4e41 673d4203 5a5a5a5a 5a5a5a5a  ..NAg=B.ZZZZZZZZ
  stack: c000000029fdbcd0: cccccccc cccccccc 00000ddc 6c000004  ............l...
  stack: c000000029fdbce0: cccccccc cccccccc e1fc4e41 673d474b  ..........NAg=GK
  stack: c000000029fdbcf0: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbd00: 00000ddc 88000000 cccccccc cccccccc  ................
  stack: c000000029fdbd10: 19fd4e41 673d4143 5a5a5a5a 5a5a5a5a  ..NAg=ACZZZZZZZZ
  stack: c000000029fdbd20: cccccccc cccccccc 00000ddb 6c00000e  ............l...
  stack: c000000029fdbd30: cccccccc cccccccc 31fd4e41 673d4f43  ........1.NAg=OC
  stack: c000000029fdbd40: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbd50: 00000ddc 8e000008 cccccccc cccccccc  ................
  stack: c000000029fdbd60: 69fd4e41 673d407b 5a5a5a5a 5a5a5a5a  i.NAg=@{ZZZZZZZZ
  stack: c000000029fdbd70: cccccccc cccccccc 00000ddc 92000008  ................
  stack: c000000029fdbd80: cccccccc cccccccc 81fd4e41 673d4633  ..........NAg=F3
  stack: c000000029fdbd90: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbda0: 00000ddb 42000018 cccccccc cccccccc  ....B...........
  stack: c000000029fdbdb0: b9fd4e41 673d42fb 5a5a5a5a 5a5a5a5a  ..NAg=B.ZZZZZZZZ
  stack: c000000029fdbdc0: cccccccc cccccccc 00000ddc 7e000018  ............~...
  stack: c000000029fdbdd0: cccccccc cccccccc d1fd4e41 673d4a1b  ..........NAg=J.
  stack: c000000029fdbde0: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbdf0: 00000ddc 8e000004 cccccccc cccccccc  ................
  stack: c000000029fdbe00: 09fe4e41 673d4ee3 5a5a5a5a 5a5a5a5a  ..NAg=N.ZZZZZZZZ
  stack: c000000029fdbe10: cccccccc cccccccc 00000dd9 7200001c  ............r...
  stack: c000000029fdbe20: cccccccc cccccccc 21fe4e41 673d4fa3  ........!.NAg=O.

That's slab data.

It's not clear what the actual data is, but because you booted with
slub_debug=FZP we can see the red zones and poison.

The cccccccc is SLUB_RED_ACTIVE, and 5a5a5a5a is POISON_INUSE (see poison.h)
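
For anyone who wants to label a long dump mechanically, here is a minimal sketch that tags each 32-bit word of a "stack:" line with the matching debug pattern. The byte values are the ones defined in include/linux/poison.h; the function names and the line parsing are only an illustration and assume the dump layout produced by the patch from comment 6.

```python
# Sketch: tag the words of one "stack: <addr>: w0 w1 w2 w3  <ascii>" line.
# Pattern bytes as defined in include/linux/poison.h.
SLAB_PATTERNS = {
    0xBB: "SLUB_RED_INACTIVE",
    0xCC: "SLUB_RED_ACTIVE",
    0x5A: "POISON_INUSE",
    0x6B: "POISON_FREE",
    0xA5: "POISON_END",
}


def classify_word(word_hex):
    """Return the pattern name if all four bytes are one known pattern byte."""
    data = bytes.fromhex(word_hex)
    if len(set(data)) == 1:
        return SLAB_PATTERNS.get(data[0], "data")
    return "data"


def classify_line(line):
    """Return (address, [label per word]) for one dump line."""
    fields = line.split()
    addr = int(fields[1].rstrip(":"), 16)
    return addr, [classify_word(w) for w in fields[2:6]]


if __name__ == "__main__":
    sample = ("stack: c000000029fdbc00: 5a5a5a5a 5a5a5a5a "
              "cccccccc cccccccc  ZZZZZZZZ........")
    print(classify_line(sample))
```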


  stack: c000000029fdbe30: c0000000 29fdbeb0 cccccccc cccccccc  ....)...........

But then here we have an obvious pointer (big endian FTW).

And it points nearby, just slightly higher in memory, so that looks
suspiciously like a stack back chain pointer. There's more similar
values if you look further.
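
The "points nearby, just slightly higher" test can be sketched as a small filter over the dump. This is a heuristic, not a real unwinder: it assumes big-endian 64-bit values, 16-byte aligned stack frames, and an arbitrary 4 KiB limit on how far a back chain may point; all names are made up for illustration.

```python
# Heuristic sketch: find 64-bit slots that look like stack back chain pointers.
# Assumptions: big-endian words, 16-byte aligned frames, frames < max_frame.


def parse_dump_line(line):
    """Split one 'stack: <addr>: w0 w1 w2 w3' line into two (slot, value) pairs."""
    fields = line.split()
    addr = int(fields[1].rstrip(":"), 16)
    w = [int(x, 16) for x in fields[2:6]]
    return [(addr, (w[0] << 32) | w[1]), (addr + 8, (w[2] << 32) | w[3])]


def back_chain_candidates(lines, max_frame=0x1000):
    """Return (slot, value) pairs where value points slightly above slot."""
    hits = []
    for line in lines:
        for slot, value in parse_dump_line(line):
            if slot < value <= slot + max_frame and value % 16 == 0:
                hits.append((slot, value))
    return hits


if __name__ == "__main__":
    lines = [
        "stack: c000000029fdbc00: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........",
        "stack: c000000029fdbe30: c0000000 29fdbeb0 cccccccc cccccccc  ....)...........",
    ]
    for slot, value in back_chain_candidates(lines):
        print(f"{slot:016x} -> {value:016x}")
```

Any hit at an address below the reported stack end is a hint that stack frames were written past the end of the stack.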

But we shouldn't be seeing the stack yet, it's meant to start (end) at
c000000029fdc000 ...

  stack: c000000029fdbe40: 00000ddc 94000000 cccccccc cccccccc  ................
  stack: c000000029fdbe50: 59fe4e41 673d4933 5a5a5a5a 5a5a5a5a  Y.NAg=I3ZZZZZZZZ
  stack: c000000029fdbe60: cccccccc cccccccc 00000dd9 60000024  ............`..$
  stack: c000000029fdbe70: cccccccc cccccccc 71fe4e41 673d416b  ........q.NAg=Ak
  stack: c000000029fdbe80: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbe90: 00000ddc 6000000c cccccccc cccccccc  ....`...........
  stack: c000000029fdbea0: c0000000 29fdbf20 00000000 00000002  ....).. ........
  stack: c000000029fdbeb0: c0000000 29fdbf30 00000ddc 7e00001c  ....)..0....~...    <---
  stack: c000000029fdbec0: c0000000 29fdbf40 c1fe4e41 673d4723  ....)..@..NAg=G#
  stack: c000000029fdbed0: 5a5a5a5a 5a5a5a5a cccccccc cccccccc  ZZZZZZZZ........
  stack: c000000029fdbee0: c0000000 29fdbf60 cccccccc cccccccc  ....)..`........
  stack: c000000029fdbef0: c0000000 29fdbf70 5a5a5a5a 5a5a5a5a  ....)..pZZZZZZZZ
  stack: c000000029fdbf00: cccccccc cccccccc 00000ddc 60000010  ............`...
  stack: c000000029fdbf10: c0000000 29fdbf90 00000000 00000002  ....)...........
  stack: c000000029fdbf20: c0000000 29fdbf01 001d3029 96167689  ....).....0)..v.
  stack: c000000029fdbf30: c0000000 29fdbfc0 c0000004 7f6f1800  ....)........o..    <---
  stack: c000000029fdbf40: c0000000 29fdbfc0 5a5a5a5a 5a5a5a5a  ....)...ZZZZZZZZ
  stack: c000000029fdbf50: c0000000 000ea33c 00000000 00000000  .......<........
  stack: c000000029fdbf60: c0000000 29fdbfe0 c0000000 05cdb700  ....)...........
  stack: c000000029fdbf70: c0000000 29fdbff0 cccccccc cccccccc  ....)...........
  stack: c000000029fdbf80: c0000000 000ea33c 00000000 00328780  .......<.....2..
  stack: c000000029fdbf90: c0000000 29fdc010 001d3029 96167689  ....).....0)..v.
  stack: c000000029fdbfa0: c0000000 29fdc020 00000000 000008e4  ....).. ........
  stack: c000000029fdbfb0: 00000000 00000201 001d3029 96167689  ..........0)..v.
  stack: c000000029fdbfc0: c0000000 29fdc040 cccccccc cccccccc  ....)..@........    <---
  stack: c000000029fdbfd0: c0000000 000c2344 001d3029 96167689  ......#D..0)..v.
  stack: c000000029fdbfe0: c0000000 29fdc001 001d3029 96167689  ....).....0)..v.
  stack: c000000029fdbff0: c0000000 29fdc080 00000088 554c539a  ....).......ULS.

... which is here:

  stack: c000000029fdc000: c0000000 000c1d9c 001d3029 96167689  ..........0)..v.
  stack: c000000029fdc010: c0000000 29fdc0d0 c0000004 7f6f1700  ....)........o..
  stack: c000000029fdc020: c0000000 29fdc0a0 c0000000 05cdb580  ....)...........
  stack: c000000029fdc030: c0000000 29fdc0b0 c0000004 7f6f1700  ....)........o..
  stack: c000000029fdc040: c0000000 29fdc0c0 00000000 00000001  ....)...........


So it looks like you have actually overrun your stack, rather than
something else clobbering your stack.

Can you attach your System.map for that exact kernel? We might be able
to work out what functions we were in when we overran.
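
As a sketch of how such a lookup could be done once System.map is available: values in the dump such as c0000000000ea33c look like kernel text addresses, and each can be resolved to the nearest preceding symbol. The three map entries below are invented placeholders, not taken from the real System.map of this kernel, and the helper names are hypothetical.

```python
# Sketch: resolve an address to "symbol+offset" using System.map-style lines.
# The sample entries are HYPOTHETICAL placeholders, not real symbols.
import bisect

SAMPLE_MAP = """\
c0000000000c1d00 T example_func_a
c0000000000c2300 T example_func_b
c0000000000ea300 t example_func_c
"""


def load_system_map(text):
    """Parse '<hex addr> <type> <name>' lines into a sorted list."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return 'name+0xoff' for the last symbol at or below addr, else None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return f"{name}+{addr - base:#x}"


if __name__ == "__main__":
    syms = load_system_map(SAMPLE_MAP)
    print(resolve(syms, 0xC0000000000EA33C))
```

With the real System.map loaded instead of SAMPLE_MAP, the same call would name the functions whose return addresses sit below the stack end.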

You could also try changing CONFIG_THREAD_SHIFT to 15, that might keep
the system running a bit longer and give us some other clues.

cheers
Comment 9 Erhard F. 2021-09-23 16:29:32 UTC
Created attachment 298933 [details]
System.map (5.15-rc2 + patch, PowerMac G5 11,2)

(In reply to mpe from comment #8)
> So it looks like you have actually overrun your stack, rather than
> something else clobbering your stack.
> 
> Can you attach your System.map for that exact kernel? We might be able
> to work out what functions we were in when we overran.
> 
> You could also try changing CONFIG_THREAD_SHIFT to 15, that might keep
> the system running a bit longer and give us some other clues.
> 
> cheers
Hm, interesting...

What I do to trigger this bug is build llvm-12 on the G5 via distcc (on the other side is a 16-core Opteron) with MAKEOPTS="-j10 -l3". As the G5 has 16 GiB RAM, the build runs in a zstd-compressed ext2 filesystem on a zram device (/sbin/zram-init -d1 -s2 -azstd -text2 -orelatime -m1777 -Lvar_tmp_dir 49152 /var/tmp). Most of the time the bug is triggered very shortly after the actual build starts via meson. At that point the build directory /var/tmp/portage occupies about 800 MiB.

Also, sometimes I don't get a proper stack trace via netconsole, only this:
BUG: unable to handle kernel data access on write at 0xc000000037c82040
BUG: unable to handle kernel data access on write at 0xc000000037c80000

Please find the relevant System.map attached. I'll do another kernel build with CONFIG_THREAD_SHIFT=15 and see if anything changes.

Thanks for investigating this!
Comment 10 Erhard F. 2021-09-24 22:16:42 UTC
Created attachment 298959 [details]
kernel .config (5.15-rc2 + CONFIG_THREAD_SHIFT=15, PowerMac G5 11,2)

(In reply to mpe from comment #8)
> You could also try changing CONFIG_THREAD_SHIFT to 15, that might keep
> the system running a bit longer and give us some other clues.
The stack seems just large enough with CONFIG_THREAD_SHIFT=15 to not run into this bug. I let the G5 build stuff via distcc in the zram disk for a day without an issue. With CONFIG_THREAD_SHIFT=14 I hit the bug within minutes.

Just for completeness I'll upload the System.map and kernel .config with CONFIG_THREAD_SHIFT=15.
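
For reference, a small sketch of what the two settings mean in bytes. It assumes THREAD_SIZE is 1 << CONFIG_THREAD_SHIFT and that kernel stacks are aligned to THREAD_SIZE (which is consistent with the stack end addresses in the dumps here, but is an assumption all the same); the function names are made up.

```python
# Sketch: kernel stack size and bounds for a given CONFIG_THREAD_SHIFT.
# Assumptions: THREAD_SIZE == 1 << shift, stacks aligned to THREAD_SIZE.


def thread_size(thread_shift):
    """Stack size in bytes for the given shift."""
    return 1 << thread_shift


def stack_bounds(sp, thread_shift):
    """Return (stack_end, stack_top) of the stack containing address sp."""
    size = thread_size(thread_shift)
    end = sp & ~(size - 1)      # lowest address; the stack grows down to it
    return end, end + size


if __name__ == "__main__":
    for shift in (14, 15):
        print(shift, thread_size(shift) // 1024, "KiB")
    end, top = stack_bounds(0xC000000029FDF400, 14)
    print(hex(end), hex(top))
```

So CONFIG_THREAD_SHIFT=15 doubles the stack from 16 KiB to 32 KiB, which would explain why the overrun no longer shows up with it.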
Comment 11 Erhard F. 2021-09-24 22:17:49 UTC
Created attachment 298961 [details]
System.map (5.15-rc2 + patch + CONFIG_THREAD_SHIFT=15, PowerMac G5 11,2)
Comment 12 Erhard F. 2021-09-24 22:25:08 UTC
Created attachment 298963 [details]
dmesg (5.15-rc2 + patch, PowerMac G5 11,2) #1

The last stack trace with CONFIG_THREAD_SHIFT=14, however, did reveal a bit more data:

[...]
stack: c0000000023effb0: 00000000 28022284 00000000 00000000  ....(.".........
stack: c0000000023effc0: 00000000 00000c00 00003fff 94c95000  ..........?...P.
stack: c0000000023effd0: 00000000 42000000 ffffffff ffffffea  ....B...........
stack: c0000000023effe0: 00000000 00000000 00000000 00000000  ................
stack: c0000000023efff0: 00000000 00000000 00000000 00000000  ................
Kernel panic - not syncing: corrupted stack end detected inside scheduler
CPU: 1 PID: 2652 Comm: cc1plus Tainted: G        W         5.15.0-rc2-PowerMacG5+ #4
Call Trace:
[c0000000023ef7f0] [c0000000005532d8] .dump_stack_lvl+0x98/0xe0 (unreliable)
[c0000000023ef880] [c000000000069538] .panic+0x14c/0x3e8
[c0000000023ef930] [c00000000081d5a0] .__schedule+0xc0/0x874
[c0000000023efa00] [c00000000081dea0] .preempt_schedule_common+0x28/0x48
[c0000000023efa80] [c00000000081deec] .__cond_resched+0x2c/0x50
[c0000000023efaf0] [c00000000029579c] .dput+0x40/0x218
[c0000000023efba0] [c000000000285204] .path_put+0x1c/0x34
[c0000000023efc20] [c00000000027eab8] .do_readlinkat+0xdc/0x124
[c0000000023efcf0] [c00000000027f310] .__se_sys_readlink+0x20/0x30
[c0000000023efd60] [c000000000022850] .system_call_exception+0x1ac/0x1e4
[c0000000023efe10] [c00000000000b4cc] system_call_common+0xec/0x250
--- interrupt: c00 at 0x3fff95335500
NIP:  00003fff95335500 LR: 00003fff95273f6c CTR: 0000000000000000
REGS: c0000000023efe80 TRAP: 0c00   Tainted: G        W          (5.15.0-rc2-PowerMacG5+)
MSR:  900000000200f032 <SF,HV,VEC,EE,PR,FP,ME,IR,DR,RI>  CR: 28022284  XER: 00000000
IRQMASK: 0 
GPR00: 0000000000000055 00003fffd5fb3600 00003fff95424300 00003fffd5fb4040 
GPR04: 00003fffd5fb3b20 00000000000003ff 0000000062697473 000000002bcf8581 
GPR08: ffffffffd4307a80 0000000000000000 0000000000000000 0000000000000000 
GPR12: 0000000000000000 00003fff95946430 00000000000003ff 00003fffd5fb4030 
GPR16: 00003fffd5fb3b20 0000000000000000 00003fffd5fb3700 0000000000000000 
GPR20: 000000000000002f 00003fffd5fb3b10 00003fff9593f7b8 00003fffd5fb4080 
GPR24: 0000000000000004 00003fffd5fb3710 00003fffd5fb3b20 00003fffd5fb4040 
GPR28: 00003fffd5fb4538 00003fffd5fb4084 00003fffd5fb4040 0000000000000000 
NIP [00003fff95335500] 0x3fff95335500
LR [00003fff95273f6c] 0x3fff95273f6c
--- interrupt: c00
Rebooting in 40 seconds..

And another one:
[...]
stack: c00000002cd77fb0: 00000000 28042822 00000000 00000000  ....(.("........
stack: c00000002cd77fc0: 00000000 00000c00 00003fff 9b2dd000  ..........?..-..
stack: c00000002cd77fd0: 00000000 42000000 00000000 00000000  ....B...........
stack: c00000002cd77fe0: 00000000 00000000 00000000 00000000  ................
stack: c00000002cd77ff0: 00000000 00000000 00000000 00000000  ................
Kernel panic - not syncing: corrupted stack end detected inside scheduler
CPU: 1 PID: 2713 Comm: cc1plus Tainted: G        W         5.15.0-rc2-PowerMacG5+ #4
Call Trace:
[c00000002cd76dd0] [c0000000005532d8] .dump_stack_lvl+0x98/0xe0 (unreliable)
[c00000002cd76e60] [c000000000069538] .panic+0x14c/0x3e8
[c00000002cd76f10] [c00000000081d5a0] .__schedule+0xc0/0x874
[c00000002cd76fe0] [c00000000081dea0] .preempt_schedule_common+0x28/0x48
[c00000002cd77060] [c00000000081deec] .__cond_resched+0x2c/0x50
[c00000002cd770d0] [c000000000327848] .__ext4_handle_dirty_metadata+0x24/0x214
[c00000002cd771a0] [c000000000352a6c] .ext4_mb_mark_diskspace_used+0x3e0/0x41c
[c00000002cd77290] [c000000000355ae4] .ext4_mb_new_blocks+0x580/0xe10
[c00000002cd773b0] [c00000000033b2f0] .ext4_ind_map_blocks+0x63c/0xb28
[c00000002cd775a0] [c000000000342bf4] .ext4_map_blocks+0x37c/0x588
[c00000002cd77680] [c000000000342e64] ._ext4_get_block+0x64/0xec
[c00000002cd77730] [c0000000002c36a4] .__block_write_begin_int+0x188/0x4a4
[c00000002cd77850] [c0000000003480f8] .ext4_write_begin+0x2a8/0x3d0
[c00000002cd77970] [c0000000001c9170] .generic_perform_write+0xb8/0x1f4
[c00000002cd77a60] [c000000000333d68] .ext4_buffered_write_iter+0xb8/0x154
[c00000002cd77b00] [c000000000277a14] .new_sync_write+0x94/0xe8
[c00000002cd77c00] [c000000000278d6c] .vfs_write+0x13c/0x140
[c00000002cd77ca0] [c000000000278eb4] .ksys_write+0x78/0xc4
[c00000002cd77d60] [c000000000022850] .system_call_exception+0x1ac/0x1e4
[c00000002cd77e10] [c00000000000b4cc] system_call_common+0xec/0x250
--- interrupt: c00 at 0x3fff9b804b00
NIP:  00003fff9b804b00 LR: 00003fff9b780d04 CTR: 0000000000000000
REGS: c00000002cd77e80 TRAP: 0c00   Tainted: G        W          (5.15.0-rc2-PowerMacG5+)
MSR:  900000000200d032 <SF,HV,VEC,EE,PR,ME,IR,DR,RI>  CR: 28042822  XER: 00000000
IRQMASK: 0 
GPR00: 0000000000000004 00003fffe21d3000 00003fff9b8f6300 0000000000000001 
GPR04: 0000000026979780 0000000000001000 0000000000000001 0000000000000036 
GPR08: 000000002697a780 0000000000000000 0000000000000000 0000000000000000 
GPR12: 0000000000000000 00003fff9be17430 00000000100f1858 0000000010063a50 
GPR16: 00000000100639f8 0000000010063968 000000012e83eb28 0000000000000000 
GPR20: ffffffffffffffff 0000000011a5b890 0000000000000001 00000000114e5160 
GPR24: 00000000114e5170 0000000000000000 0000000026979780 0000000000001000 
GPR28: 0000000000001000 00003fff9b8f0920 0000000026979780 0000000000001000 
NIP [00003fff9b804b00] 0x3fff9b804b00
LR [00003fff9b780d04] 0x3fff9b780d04
--- interrupt: c00
Rebooting in 40 seconds..
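
A rough way to see how much stack each function in such a trace uses is to subtract consecutive stack pointer values (the first bracketed column). The sketch below does that for the first lines of the second trace above. It is only an approximation: the first entry is marked unreliable by the kernel itself, and the helper names are invented for illustration.

```python
# Sketch: approximate per-frame stack usage from "[sp] [lr] symbol" trace lines.
import re

FRAME_RE = re.compile(r"\[([0-9a-f]{16})\] \[([0-9a-f]{16})\] (\S+)")


def parse_trace(text):
    """Return a list of (sp, return_address, symbol) tuples."""
    return [(int(m.group(1), 16), int(m.group(2), 16), m.group(3))
            for m in FRAME_RE.finditer(text)]


def frame_sizes(frames):
    """Frame size is the distance from one frame's sp to the next frame's sp."""
    return [(sym, next_sp - sp)
            for (sp, _, sym), (next_sp, _, _) in zip(frames, frames[1:])]


if __name__ == "__main__":
    trace = """\
[c00000002cd76dd0] [c0000000005532d8] .dump_stack_lvl+0x98/0xe0 (unreliable)
[c00000002cd76e60] [c000000000069538] .panic+0x14c/0x3e8
[c00000002cd76f10] [c00000000081d5a0] .__schedule+0xc0/0x874
[c00000002cd76fe0] [c00000000081dea0] .preempt_schedule_common+0x28/0x48
"""
    for sym, size in frame_sizes(parse_trace(trace)):
        print(f"{size:5d}  {sym}")
```

Summing the sizes over a whole trace gives the stack depth at the time of the panic, which can be compared against the 16 KiB available with CONFIG_THREAD_SHIFT=14.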
Comment 13 Erhard F. 2021-11-28 14:20:25 UTC
Created attachment 299755 [details]
dmesg (5.16-rc2 + patch, PowerMac G5 11,2)

Still happens with 5.16-rc2, but I get a slightly different error message this time. Also, this crash happened earlier, not while building via distcc but while unpacking the tar.gz archive to be built with tar + pigz:

[...]
stack: c000000005b0e600: 00000000 00000003 c0000000 00105b2c  ..............[,
stack: c000000005b0e610: c0000000 05b0e6a0 0031faa1 bd74990f  .........1...t..
stack: c000000005b0e620: c0000000 00104f50 00000000 00000006  ......OP........
stack: c000000005b0e630: c0000000 05b0e6a0 0031faa1 bd74990f  .........1...t..
kernel tried to execute exec-protected page (c000000005b0bbe0) - exploit attempt? (uid: 0)
stack: c000000005b0e640: c0000000 00000001 c0000000 022f2a08  ............./*.

The last 2 lines were not in the netconsole.log; they were only visible on the screen of the frozen G5, so I added them manually.
Comment 14 Erhard F. 2022-01-22 00:07:33 UTC
Created attachment 300297 [details]
bisect.log

Finally did a bisect which revealed the following commit:

 # git bisect good
c2c11289021dfacec1658b2019faab10e12f383a is the first bad commit
commit c2c11289021dfacec1658b2019faab10e12f383a
Merge: 63bef48fd6c9 ef516e8625dd
Author: David S. Miller <davem@davemloft.net>
Date:   Tue Apr 7 18:08:06 2020 -0700

    Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
    
    Pablo Neira Ayuso says:
    
    ====================
    Netfilter fixes for net
    
    The following patchset contains Netfilter fixes for net, they are:
    
    1) Fix spurious overlap condition in the rbtree tree, from Stefano Brivio.
    
    2) Fix possible uninitialized pointer dereference in nft_lookup.
    
    3) IDLETIMER v1 target matches the Android layout, from
       Maciej Zenczykowski.
    
    4) Dangling pointer in nf_tables_set_alloc_name, from Eric Dumazet.
    
    5) Fix RCU warning splat in ipset find_set_type(), from Amol Grover.
    
    6) Report EOPNOTSUPP on unsupported set flags and object types in sets.
    
    7) Add NFT_SET_CONCAT flag to provide consistent error reporting
       when users defines set with ranges in concatenations in old kernels.
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>

 include/net/netfilter/nf_tables.h           |  2 +-
 include/uapi/linux/netfilter/nf_tables.h    |  2 ++
 include/uapi/linux/netfilter/xt_IDLETIMER.h |  1 +
 net/netfilter/ipset/ip_set_core.c           |  3 ++-
 net/netfilter/nf_tables_api.c               |  7 ++++---
 net/netfilter/nft_lookup.c                  | 12 +++++++-----
 net/netfilter/nft_set_bitmap.c              |  1 -
 net/netfilter/nft_set_rbtree.c              | 23 +++++++++++------------
 net/netfilter/xt_IDLETIMER.c                |  3 +++
 9 files changed, 31 insertions(+), 23 deletions(-)
Comment 15 Erhard F. 2022-01-22 00:14:18 UTC
It may look a bit odd at first that this commit would cause memory corruption while building stuff, but as I do the builds via distcc on another host (sources are fetched via NFS from this host too), it seems possible.

The problem is that the 'bad' commit is a merge, and reverting it on v5.16.2 for a test via git revert -m1 c2c11289021dfacec1658b2019faab10e12f383a gets me some merge conflicts which I don't know how to resolve properly.
Comment 16 Erhard F. 2022-02-19 23:29:22 UTC
Created attachment 300486 [details]
kernel .config (5.17-rc4, PowerMac G5 11,2)
Comment 17 Erhard F. 2022-02-19 23:31:40 UTC
Created attachment 300487 [details]
dmesg (5.17-rc4 + patch, PowerMac G5 11,2)

Still an issue on 5.17-rc4.

Stacktrace looks a bit more interesting this time.
Comment 18 Erhard F. 2022-03-13 11:32:33 UTC
Created attachment 300561 [details]
System.map (5.17-rc7 + patch, PowerMac G5 11,2)
Comment 19 Erhard F. 2022-03-13 11:33:17 UTC
Created attachment 300562 [details]
kernel .config (5.17-rc7, PowerMac G5 11,2)
Comment 20 Erhard F. 2022-03-13 11:34:15 UTC
Created attachment 300563 [details]
dmesg (5.17-rc7 + patch, PowerMac G5 11,2)
Comment 21 Erhard F. 2022-07-07 10:27:19 UTC
(Luckily) I am no longer able to reproduce this. Re-tested on 5.19-rc5.

I'll keep an eye on it and will close here if it stays like that for the next few stable kernels.