Bug 219919 - Kernel panic in RT 6.12.13
Status: NEW
Alias: None
Product: Process Management
Classification: Unclassified
Component: Scheduler
Hardware: Intel Linux
Importance: P3 normal
Assignee: Ingo Molnar
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-03-24 11:09 UTC by Ismo Puustinen
Modified: 2025-04-22 07:51 UTC

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Ismo Puustinen 2025-03-24 11:09:42 UTC
Running Talos Linux with a PREEMPT_RT 6.12.13 kernel. The system crashes once or twice per day with a kernel BUG at kernel/sched/rt.c:1035. The system is more or less idle when the crash happens.

[19450.947613] ------------[ cut here ]------------
[19450.947614] kernel BUG at kernel/sched/rt.c:1035!
[19450.947619] Oops: invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI
[19450.947622] CPU: 4 UID: 0 PID: 0 Comm: swapper/4 Not tainted 6.12.13-talos #1
[19450.947625] Hardware name: HPE ProLiant DL110 Gen11/ProLiant DL110 Gen11, BIOS 2.44 01/17/2025
[19450.947626] RIP: 0010:dequeue_rt_stack+0x359/0x3b0
[19450.947631] Code: d4 08 00 00 8b 83 ec fe ff ff 39 f0 0f 8d d7 fe ff ff 0f 0b e9 28 fd ff ff 0f 0b e9 71 fd ff ff e8 bc 3a 08 00 e9 94 fe ff ff <0f> 0b 48 89 de 48 c7 c7 a0 a6 4b 96 e8 06 02 96 00 e9 4e fe ff ff
[19450.947632] RSP: 0018:ffffbcf180450e28 EFLAGS: 00010046
[19450.947634] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000001
[19450.947636] RDX: 0000000000000004 RSI: 0000000000000009 RDI: ffff9f2973e23180
[19450.947636] RBP: ffff9f67be231900 R08: 0000000000000000 R09: 0000000000000000
[19450.947637] R10: 0000000000000000 R11: ffffbcf180450ff8 R12: ffff9f2973e23180
[19450.947638] R13: 0000000000000009 R14: 0000000000031900 R15: 0000000000000009
[19450.947639] FS:  0000000000000000(0000) GS:ffff9f67be200000(0000) knlGS:0000000000000000
[19450.947640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19450.947642] CR2: 00007f2bc3d49680 CR3: 000000013083e002 CR4: 0000000000f72ef0
[19450.947643] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[19450.947644] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[19450.947644] PKRU: 55555554
[19450.947645] Call Trace:
[19450.947646]  <IRQ>
[19450.947648]  ? __die_body.cold+0x19/0x26
[19450.947652]  ? die+0x2e/0x50
[19450.947655]  ? do_trap+0xca/0x110
[19450.947657]  ? do_error_trap+0x6a/0x90
[19450.947659]  ? dequeue_rt_stack+0x359/0x3b0
[19450.947660]  ? exc_invalid_op+0x50/0x70
[19450.947662]  ? dequeue_rt_stack+0x359/0x3b0
[19450.947663]  ? asm_exc_invalid_op+0x1a/0x20
[19450.947668]  ? dequeue_rt_stack+0x359/0x3b0
[19450.947670]  enqueue_task_rt+0xae/0x4b0
[19450.947671]  enqueue_task+0x32/0x130
[19450.947674]  ttwu_do_activate.isra.0+0x6e/0x1e0
[19450.947677]  try_to_wake_up+0x1f9/0x630
[19450.947679]  ? __hrtimer_run_queues+0x158/0x2d0
[19450.947682]  __handle_irq_event_percpu+0x83/0x1a0
[19450.947685]  handle_irq_event+0x46/0x90
[19450.947688]  handle_edge_irq+0x9b/0x2a0
[19450.947690]  __common_interrupt+0x4c/0xd0
[19450.947692]  common_interrupt+0x8f/0xc0
[19450.947694]  </IRQ>
[19450.947695]  <TASK>
[19450.947695]  asm_common_interrupt+0x26/0x40
[19450.947698] RIP: 0010:cpuidle_enter_state+0xd3/0x6b0
[19450.947700] Code: 00 00 e8 a0 21 99 fe e8 6b ed ff ff 49 89 c6 0f 1f 44 00 00 31 ff e8 4c 0b 98 fe 45 84 ff 0f 85 13 02 00 00 fb 0f 1f 44 00 00 <45> 85 ed 0f 88 e3 01 00 00 4d 63 e5 49 83 fc 0a 0f 83 7e 04 00 00
[19450.947701] RSP: 0018:ffffbcf18015fe70 EFLAGS: 00000246
[19450.947702] RAX: ffff9f67be200000 RBX: ffff9f296eae7400 RCX: 0000000000000000
[19450.947703] RDX: 000011b0c6d299b6 RSI: 0000000062762762 RDI: 0000000000000000
[19450.947704] RBP: ffffffff96785d40 R08: 0000000000000008 R09: 00000000001b5030
[19450.947704] R10: 00000000000007d0 R11: 0000000000000002 R12: 0000000000000001
[19450.947705] R13: 0000000000000001 R14: 000011b0c6d299b6 R15: 0000000000000000
[19450.947708]  cpuidle_enter+0x2d/0x40
[19450.947710]  do_idle+0x1c5/0x220
[19450.947712]  cpu_startup_entry+0x29/0x30
[19450.947714]  start_secondary+0x106/0x120
[19450.947717]  common_startup_64+0x13e/0x148
[19450.947720]  </TASK>
[19450.947720] Modules linked in: iavf libeth vrf vfio_pci vfio_pci_core vfio_iommu_type1 vfio mlx5_ib mlx5_core ice igb mlxfw nvme i2c_algo_bit libie hpilo
[19450.947729] ---[ end trace 0000000000000000 ]---
[19451.274535] RIP: 0010:dequeue_rt_stack+0x359/0x3b0
[19451.274544] Code: d4 08 00 00 8b 83 ec fe ff ff 39 f0 0f 8d d7 fe ff ff 0f 0b e9 28 fd ff ff 0f 0b e9 71 fd ff ff e8 bc 3a 08 00 e9 94 fe ff ff <0f> 0b 48 89 de 48 c7 c7 a0 a6 4b 96 e8 06 02 96 00 e9 4e fe ff ff
[19451.274546] RSP: 0018:ffffbcf180450e28 EFLAGS: 00010046
[19451.274549] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000001
[19451.274550] RDX: 0000000000000004 RSI: 0000000000000009 RDI: ffff9f2973e23180
[19451.274551] RBP: ffff9f67be231900 R08: 0000000000000000 R09: 0000000000000000
[19451.274552] R10: 0000000000000000 R11: ffffbcf180450ff8 R12: ffff9f2973e23180
[19451.274553] R13: 0000000000000009 R14: 0000000000031900 R15: 0000000000000009
[19451.274554] FS:  0000000000000000(0000) GS:ffff9f67be200000(0000) knlGS:0000000000000000
[19451.274556] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19451.274557] CR2: 00007f2bc3d49680 CR3: 000000013083e002 CR4: 0000000000f72ef0
[19451.274558] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[19451.274559] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[19451.274560] PKRU: 55555554
[19451.274562] Kernel panic - not syncing: Fatal exception in interrupt
[19452.324942] Shutting down cpus with NMI
[19452.324959] Kernel Offset: 0x12600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Got a Failure (00000010) on a CHIF transaction
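
For triage, the key fields of an oops like the one above can be pulled out mechanically. The following is a minimal sketch and not part of the original report: the function name `summarize_oops` and the regular expressions are illustrative assumptions, written to match only the line shapes seen in this log.

```python
import re

# Illustrative patterns (assumptions): they match only the line shapes
# that appear in this report, not every possible oops format.
BUG_RE = re.compile(r"kernel BUG at (?P<file>\S+):(?P<line>\d+)!")
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:(?P<sym>[\w.]+)\+(?P<off>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)
KOFF_RE = re.compile(r"Kernel Offset: (?P<off>0x[0-9a-f]+) from (?P<base>0x[0-9a-f]+)")


def summarize_oops(log: str) -> dict:
    """Return the BUG location, first RIP symbol and kernel offset, if present."""
    summary = {}
    m = BUG_RE.search(log)
    if m:
        summary["bug_at"] = (m["file"], int(m["line"]))
    m = RIP_RE.search(log)
    if m:
        # (symbol, offset into the function, function size)
        summary["rip"] = (m["sym"], int(m["off"], 16), int(m["size"], 16))
    m = KOFF_RE.search(log)
    if m:
        summary["kaslr_offset"] = int(m["off"], 16)
    return summary


# Three lines copied from the log above.
sample = """\
[19450.947614] kernel BUG at kernel/sched/rt.c:1035!
[19450.947626] RIP: 0010:dequeue_rt_stack+0x359/0x3b0
[19452.324959] Kernel Offset: 0x12600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
"""
print(summarize_oops(sample))
```

The extracted symbol and offset (`dequeue_rt_stack+0x359/0x3b0`) are in the form accepted by the kernel tree's `scripts/faddr2line`, which could map them to a source line given a matching `vmlinux` with debug info.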
Comment 1 Ismo Puustinen 2025-03-24 12:22:22 UTC
Note: this may be related to https://bugzilla.kernel.org/show_bug.cgi?id=219920, which is another crash on the same (or a very similar) system.
Comment 2 Ismo Puustinen 2025-04-22 07:51:32 UTC
After building a kernel with CONFIG_DEBUG_ATOMIC_SLEEP=y and updating to 6.12.18, I always see a "psi: inconsistent task state" message right before the crash.


[27811.789630] psi: inconsistent task state! task=16461:operator cpu=6 psi_flags=10 clear=14 set=0
[27811.808097] ------------[ cut here ]------------
[27811.808099] kernel BUG at kernel/sched/rt.c:1035!
[27811.808106] Oops: invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI
[27811.808109] CPU: 6 UID: 0 PID: 159 Comm: rcuog/16 Not tainted 6.12.18-mxie #1
[27811.808111] Hardware name: HPE ProLiant DL110 Gen11/ProLiant DL110 Gen11, BIOS 2.16 03/01/2024
[27811.808113] RIP: 0010:dequeue_rt_stack+0x359/0x3b0
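
The `psi_flags`, `clear` and `set` fields in the PSI message are bitmasks, so decoding them may help narrow down which task state went out of sync. The sketch below is not from the original report. It assumes the fields are printed in hexadecimal and that the bit layout matches `include/linux/psi_types.h` in 6.12-era kernels; both assumptions should be checked against the tree actually in use. The helper name `decode_psi_flags` is hypothetical.

```python
# Bit values ASSUMED from include/linux/psi_types.h in 6.12-era kernels;
# verify against the kernel tree you are actually running.
PSI_TASK_FLAGS = {
    0x01: "TSK_IOWAIT",
    0x02: "TSK_MEMSTALL",
    0x04: "TSK_RUNNING",
    0x08: "TSK_MEMSTALL_RUNNING",
    0x10: "TSK_ONCPU",
}


def decode_psi_flags(value: int) -> list[str]:
    """Return the names of all known PSI task-state bits set in value."""
    return [name for bit, name in PSI_TASK_FLAGS.items() if value & bit]


# Fields from the message above, parsed as hex (assumption).
psi_flags = int("10", 16)
clear = int("14", 16)

print("psi_flags:", decode_psi_flags(psi_flags))
print("clear:    ", decode_psi_flags(clear))
# Bits the kernel asked to clear that were not set on the task.
print("not set:  ", decode_psi_flags(clear & ~psi_flags))
```

Under that assumed layout, the message would mean the kernel tried to clear both TSK_RUNNING and TSK_ONCPU on a task for which PSI had only TSK_ONCPU recorded. That would suggest the task's runnable state was already out of sync before the BUG in rt.c fired, but this reading depends on the assumptions stated above.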
