My first kernel report... Platform: Embedded device with Intel/Altera SOCFPGA (Cyclone V, Arm dual core), MCV Board, with included MMC. Software: Buildroot custom build with Linux up to version 6.6, got this problem with 6.5, 6.1, 5.1 The system works reliable since quite some time (~1-2 years, I don't mean continuously :) ). In the last days I was observing repeatable hangs on reboot, ONLY on firmware upgrade. I believe to have seen this earlier, but didn't pay much attention. Because the system is powered over POE, I use a "safe" partitioning: p1: uboot p2: exfat containing one folder ("system") containing kernel, squashfs root, fpga config p3: exfat user partition Boot handling: -boot with initramfs/init script -mount squashfs ro from p2 -prepare an overlay fs over entire root!, persistent on p3/overlay -switch_root using prepared overlay Relatively simple firmware update using swupdate: -uboot img update (always, although not really necessary; might contain specific soc pin muxer settings). -prepare a new "update" folder (on p2) with all components -rename "system" to "old" and "update" to "system" -reboot (swupdate) Sometimes it just hangs on reboot code! Logged in over serial, I saw "mmc0: card 0001 removed" as last message Waiting some time, I see "task swupdate-progre:181 blocked for more than 245 seconds." It looks like the MMC is shut down while still having pending writes!? I added "dump_stack" call to "__mmc_stop_host" (mmc core.c) and "mmc_remove_card" (bus.c) It took some retries (repeated swu update over web interface), hier are the traces: [ 295.529922] CPU: 1 PID: 181 Comm: swupdate-progre Tainted: G O 6.5.13 #4 [ 295.538045] Hardware name: Altera SOCFPGA [ 295.542069] unwind_backtrace from show_stack+0x18/0x1c [ 295.547332] show_stack from dump_stack_lvl+0x40/0x4c [ 295.552411] dump_stack_lvl from __mmc_stop_host+0x14/0x48 [ 295.557919] __mmc_stop_host from mmc_host_classdev_shutdown+0x14/0x1c [ 295.564465] mmc_host_classdev_shutdown from device_shutdown+0x14c/0x230 [ 295.571191] device_shutdown from kernel_restart+0x40/0x98 [ 295.576703] kernel_restart from __do_sys_reboot+0x10c/0x200 [ 295.582378] __do_sys_reboot from ret_fast_syscall+0x0/0x54 [ 295.587961] Exception stack(0xf0985fa8 to 0xf0985ff0) [ 295.593019] 5fa0: 00013131 00000004 fee1dead 28121969 01234567 00000010 [ 295.601193] 5fc0: 00013131 00000004 00000000 00000058 00000000 00013124 00000004 00000064 [ 295.609363] 5fe0: b6ecbe3c bec921bc 0001147c b6ecbe5c [ 295.768587] mmc0: card 0001 removed [ 295.772110] CPU: 1 PID: 44 Comm: kworker/1:3 Tainted: G O 6.5.13 #4 [ 295.779769] Hardware name: Altera SOCFPGA [ 295.783783] Workqueue: events_freezable mmc_rescan [ 295.788599] unwind_backtrace from show_stack+0x18/0x1c [ 295.793851] show_stack from dump_stack_lvl+0x40/0x4c [ 295.798930] dump_stack_lvl from mmc_remove_card+0x80/0xb0 [ 295.804436] mmc_remove_card from mmc_detect+0x44/0x74 [ 295.809591] mmc_detect from mmc_rescan+0xa8/0x350 [ 295.814396] mmc_rescan from process_one_work+0x224/0x4e8 [ 295.819821] process_one_work from worker_thread+0x58/0x554 [ 295.825420] worker_thread from kthread+0xe8/0x104 [ 295.830243] kthread from ret_from_fork+0x14/0x28 [ 295.834964] Exception stack(0xf11a5fb0 to 0xf11a5ff8) [ 295.840018] 5fa0: 00000000 00000000 00000000 00000000 [ 295.848191] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 295.856362] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000 I am not a Kernel expert, hope someone can help with the right hints. Thanks, Gabriel
Pulling the plug... everything is updated. Waiting... I get further traces (not from me): [ 612.280732] INFO: task swupdate-progre:181 blocked for more than 245 seconds. [ 612.287905] Tainted: G O 6.5.13 #4 [ 612.293262] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 612.301125] task:swupdate-progre state:D stack:0 pid:181 ppid:175 flags:0x00000004 [ 612.309498] __schedule from schedule+0x58/0xa8 [ 612.314108] schedule from schedule_timeout+0xe4/0x160 [ 612.319279] schedule_timeout from __wait_for_common+0x90/0x1b0 [ 612.325249] __wait_for_common from __flush_work+0x194/0x324 [ 612.330982] __flush_work from __cancel_work_timer+0x11c/0x1fc [ 612.336851] __cancel_work_timer from mmc_host_classdev_shutdown+0x14/0x1c [ 612.343774] mmc_host_classdev_shutdown from device_shutdown+0x14c/0x230 [ 612.350510] device_shutdown from kernel_restart+0x40/0x98 [ 612.356042] kernel_restart from __do_sys_reboot+0x10c/0x200 [ 612.361742] __do_sys_reboot from ret_fast_syscall+0x0/0x54 [ 612.367334] Exception stack(0xf0985fa8 to 0xf0985ff0) [ 612.372414] 5fa0: 00013131 00000004 fee1dead 28121969 01234567 00000010 [ 612.380594] 5fc0: 00013131 00000004 00000000 00000058 00000000 00013124 00000004 00000064 [ 612.388787] 5fe0: b6ecbe3c bec921bc 0001147c b6ecbe5c Tried to login on console!: [ 1103.890654] INFO: task login:205 blocked for more than 491 seconds. [ 1103.896930] Tainted: G O 6.5.13 #4 [ 1103.902285] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1103.910131] task:login state:D stack:0 pid:205 ppid:1 flags:0x00000004 [ 1103.918519] __schedule from schedule+0x58/0xa8 [ 1103.923098] schedule from io_schedule+0x3c/0x4c [ 1103.927770] io_schedule from bit_wait_io+0x14/0x5c [ 1103.932683] bit_wait_io from __wait_on_bit+0x7c/0x13c [ 1103.937893] __wait_on_bit from out_of_line_wait_on_bit+0x78/0x9c [ 1103.944021] out_of_line_wait_on_bit from __ext4_find_entry+0x2d4/0x61c [ 1103.950699] __ext4_find_entry from ext4_lookup+0xa8/0x24c [ 1103.956217] ext4_lookup from __lookup_slow+0x8c/0x134 [ 1103.961428] __lookup_slow from lookup_one_unlocked+0xa4/0xb8 [ 1103.967209] lookup_one_unlocked from ovl_lookup_single+0x54/0x334 [ 1103.973443] ovl_lookup_single from ovl_lookup_layer+0x144/0x188 [ 1103.979487] ovl_lookup_layer from ovl_lookup+0x114/0xd18 [ 1103.984907] ovl_lookup from __lookup_slow+0x8c/0x134 [ 1103.990006] __lookup_slow from walk_component+0xf4/0x164 [ 1103.995429] walk_component from path_lookupat+0x7c/0x1a4 [ 1104.000865] path_lookupat from filename_lookup+0xbc/0x154 [ 1104.006367] filename_lookup from user_path_at_empty+0x4c/0x64 [ 1104.012233] user_path_at_empty from do_faccessat+0x120/0x2b4 [ 1104.018029] do_faccessat from ret_fast_syscall+0x0/0x54 [ 1104.023360] Exception stack(0xf097dfa8 to 0xf097dff0) [ 1104.028442] dfa0: b6faab8c 000cf1a8 000b9d1e 00000000 00000000 00000001 [ 1104.036619] dfc0: b6faab8c 000cf1a8 00000000 00000021 00000000 00000003 000ce6f0 000b9cbc [ 1104.044816] dfe0: 000ce3e8 be96ac5c 0001d060 b6f27b30 [ 1104.049897] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
Sorry, I made a mistake: "exfat" should read "ext4" everywhere (p2, p3)...
I think isolated the problem... nothing to do with the write/update. mmc_rescan (polling) during shutdown hangs: Alive fails (timeout?), triggers a card remove, and mmc_remove_card hangs in device_del() call. __mmc_stop_host (reboot context) also hangs in cancel_delayed_work_sync() call The "rescan_disable" flag in mmc/core.c usually helps, but it's too late. I could easily force the bug (mmc/core.c) by delaying the rescan_disable flag. -at the end of mmc_rescan, requeue for faster polling: out: if (host->caps & MMC_CAP_NEEDS_POLL) mmc_schedule_delayed_work(&host->detect, HZ / 10); //gc: poll often -in __mmc_stop_host(), add a delay to allow mm_rescan to enter void __mmc_stop_host(struct mmc_host *host) { pr_info("__mmc_stop_host\n"); msleep(100); //gc: force bug! .... Then just reboot from console.