Running Talos Linux with PREEMPT_RT 6.12.13 kernel. The system crashes once or twice per day with this issue (rt.c:1035). The system is running more or less idle when the crash happens.

[19450.947613] ------------[ cut here ]------------
[19450.947614] kernel BUG at kernel/sched/rt.c:1035!
[19450.947619] Oops: invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI
[19450.947622] CPU: 4 UID: 0 PID: 0 Comm: swapper/4 Not tainted 6.12.13-talos #1
[19450.947625] Hardware name: HPE ProLiant DL110 Gen11/ProLiant DL110 Gen11, BIOS 2.44 01/17/2025
[19450.947626] RIP: 0010:dequeue_rt_stack+0x359/0x3b0
[19450.947631] Code: d4 08 00 00 8b 83 ec fe ff ff 39 f0 0f 8d d7 fe ff ff 0f 0b e9 28 fd ff ff 0f 0b e9 71 fd ff ff e8 bc 3a 08 00 e9 94 fe ff ff <0f> 0b 48 89 de 48 c7 c7 a0 a6 4b 96 e8 06 02 96 00 e9 4e fe ff ff
[19450.947632] RSP: 0018:ffffbcf180450e28 EFLAGS: 00010046
[19450.947634] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000001
[19450.947636] RDX: 0000000000000004 RSI: 0000000000000009 RDI: ffff9f2973e23180
[19450.947636] RBP: ffff9f67be231900 R08: 0000000000000000 R09: 0000000000000000
[19450.947637] R10: 0000000000000000 R11: ffffbcf180450ff8 R12: ffff9f2973e23180
[19450.947638] R13: 0000000000000009 R14: 0000000000031900 R15: 0000000000000009
[19450.947639] FS: 0000000000000000(0000) GS:ffff9f67be200000(0000) knlGS:0000000000000000
[19450.947640] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19450.947642] CR2: 00007f2bc3d49680 CR3: 000000013083e002 CR4: 0000000000f72ef0
[19450.947643] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[19450.947644] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[19450.947644] PKRU: 55555554
[19450.947645] Call Trace:
[19450.947646] <IRQ>
[19450.947648] ? __die_body.cold+0x19/0x26
[19450.947652] ? die+0x2e/0x50
[19450.947655] ? do_trap+0xca/0x110
[19450.947657] ? do_error_trap+0x6a/0x90
[19450.947659] ? dequeue_rt_stack+0x359/0x3b0
[19450.947660] ? exc_invalid_op+0x50/0x70
[19450.947662] ? dequeue_rt_stack+0x359/0x3b0
[19450.947663] ? asm_exc_invalid_op+0x1a/0x20
[19450.947668] ? dequeue_rt_stack+0x359/0x3b0
[19450.947670] enqueue_task_rt+0xae/0x4b0
[19450.947671] enqueue_task+0x32/0x130
[19450.947674] ttwu_do_activate.isra.0+0x6e/0x1e0
[19450.947677] try_to_wake_up+0x1f9/0x630
[19450.947679] ? __hrtimer_run_queues+0x158/0x2d0
[19450.947682] __handle_irq_event_percpu+0x83/0x1a0
[19450.947685] handle_irq_event+0x46/0x90
[19450.947688] handle_edge_irq+0x9b/0x2a0
[19450.947690] __common_interrupt+0x4c/0xd0
[19450.947692] common_interrupt+0x8f/0xc0
[19450.947694] </IRQ>
[19450.947695] <TASK>
[19450.947695] asm_common_interrupt+0x26/0x40
[19450.947698] RIP: 0010:cpuidle_enter_state+0xd3/0x6b0
[19450.947700] Code: 00 00 e8 a0 21 99 fe e8 6b ed ff ff 49 89 c6 0f 1f 44 00 00 31 ff e8 4c 0b 98 fe 45 84 ff 0f 85 13 02 00 00 fb 0f 1f 44 00 00 <45> 85 ed 0f 88 e3 01 00 00 4d 63 e5 49 83 fc 0a 0f 83 7e 04 00 00
[19450.947701] RSP: 0018:ffffbcf18015fe70 EFLAGS: 00000246
[19450.947702] RAX: ffff9f67be200000 RBX: ffff9f296eae7400 RCX: 0000000000000000
[19450.947703] RDX: 000011b0c6d299b6 RSI: 0000000062762762 RDI: 0000000000000000
[19450.947704] RBP: ffffffff96785d40 R08: 0000000000000008 R09: 00000000001b5030
[19450.947704] R10: 00000000000007d0 R11: 0000000000000002 R12: 0000000000000001
[19450.947705] R13: 0000000000000001 R14: 000011b0c6d299b6 R15: 0000000000000000
[19450.947708] cpuidle_enter+0x2d/0x40
[19450.947710] do_idle+0x1c5/0x220
[19450.947712] cpu_startup_entry+0x29/0x30
[19450.947714] start_secondary+0x106/0x120
[19450.947717] common_startup_64+0x13e/0x148
[19450.947720] </TASK>
[19450.947720] Modules linked in: iavf libeth vrf vfio_pci vfio_pci_core vfio_iommu_type1 vfio mlx5_ib mlx5_core ice igb mlxfw nvme i2c_algo_bit libie hpilo
[19450.947729] ---[ end trace 0000000000000000 ]---
[19451.274535] RIP: 0010:dequeue_rt_stack+0x359/0x3b0
[19451.274544] Code: d4 08 00 00 8b 83 ec fe ff ff 39 f0 0f 8d d7 fe ff ff 0f 0b e9 28 fd ff ff 0f 0b e9 71 fd ff ff e8 bc 3a 08 00 e9 94 fe ff ff <0f> 0b 48 89 de 48 c7 c7 a0 a6 4b 96 e8 06 02 96 00 e9 4e fe ff ff
[19451.274546] RSP: 0018:ffffbcf180450e28 EFLAGS: 00010046
[19451.274549] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000001
[19451.274550] RDX: 0000000000000004 RSI: 0000000000000009 RDI: ffff9f2973e23180
[19451.274551] RBP: ffff9f67be231900 R08: 0000000000000000 R09: 0000000000000000
[19451.274552] R10: 0000000000000000 R11: ffffbcf180450ff8 R12: ffff9f2973e23180
[19451.274553] R13: 0000000000000009 R14: 0000000000031900 R15: 0000000000000009
[19451.274554] FS: 0000000000000000(0000) GS:ffff9f67be200000(0000) knlGS:0000000000000000
[19451.274556] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19451.274557] CR2: 00007f2bc3d49680 CR3: 000000013083e002 CR4: 0000000000f72ef0
[19451.274558] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[19451.274559] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[19451.274560] PKRU: 55555554
[19451.274562] Kernel panic - not syncing: Fatal exception in interrupt
[19452.324942] Shutting down cpus with NMI
[19452.324959] Kernel Offset: 0x12600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Got a Failure (00000010) on a CHIF transaction
Note: this may be related to https://bugzilla.kernel.org/show_bug.cgi?id=219920, which is another crash on the same (or a very similar) system.
After building a kernel with CONFIG_DEBUG_ATOMIC_SLEEP=y and updating to 6.12.18, I always see a "psi: inconsistent task state" message right before the crash.

[27811.789630] psi: inconsistent task state! task=16461:operator cpu=6 psi_flags=10 clear=14 set=0
[27811.808097] ------------[ cut here ]------------
[27811.808099] kernel BUG at kernel/sched/rt.c:1035!
[27811.808106] Oops: invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI
[27811.808109] CPU: 6 UID: 0 PID: 159 Comm: rcuog/16 Not tainted 6.12.18-mxie #1
[27811.808111] Hardware name: HPE ProLiant DL110 Gen11/ProLiant DL110 Gen11, BIOS 2.16 03/01/2024
[27811.808113] RIP: 0010:dequeue_rt_stack+0x359/0x3b0