Bug 14430
| Summary: | sync() hangs in bdi_sched_wait | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Petr Vandrovec (petr) |
| Component: | Block Layer | Assignee: | Jens Axboe (axboe) |
| Status: | CLOSED UNREPRODUCIBLE | | |
| Severity: | normal | CC: | rjw, shiwenlu518 |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.32-rc5 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 14230 | | |
| Attachments: | kernel log | | |
Created attachment 23449 [details]
kernel log

Hello, when I upgraded from 2.6.31 to 2.6.32-rc5, syslinux once hung on me in sync(). I have tried to reproduce it since, but have not been able to yet.

INFO: task syslinux:4439 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syslinux D ffff8800033d5e40 0 4439 2283 0x00020000
BUG: using smp_processor_id() in preemptible [00000000] code: khungtaskd/400
caller is show_stack_log_lvl+0x29/0x150
Pid: 400, comm: khungtaskd Not tainted 2.6.32-rc5-64-00021-g758646f #31
Call Trace:
 [<ffffffff8125ac39>] debug_smp_processor_id+0xd9/0xe0
 [<ffffffff8100e9f9>] show_stack_log_lvl+0x29/0x150
 [<ffffffff8100fdc7>] show_stack+0x17/0x20
 [<ffffffff8104f9c5>] sched_show_task+0x95/0x100
 [<ffffffff810c9cc9>] watchdog+0x269/0x2c0
 [<ffffffff810c9ad3>] ? watchdog+0x73/0x2c0
 [<ffffffff810c9a60>] ? watchdog+0x0/0x2c0
 [<ffffffff810836ce>] kthread+0x8e/0xa0
 [<ffffffff8100cf5a>] child_rip+0xa/0x20
 [<ffffffff810512ea>] ? finish_task_switch+0x7a/0xe0
 [<ffffffff81451916>] ? _spin_unlock_irq+0x36/0x60
 [<ffffffff8100c8c0>] ? restore_args+0x0/0x30
 [<ffffffff81083640>] ? kthread+0x0/0xa0
 [<ffffffff8100cf50>] ? child_rip+0x0/0x20
 ffff8800d7b2fdd8 0000000000000046 0000000000000000 ffffffff810969f9
 ffff8800d7b2ffd8 ffff8800d7b2e000 ffff8800d7b2ffd8 ffff8800d7b2e000
 000000000000f468 0000000000000000 ffff880116f62390 ffffffff816770a0
Call Trace:
 [<ffffffff810969f9>] ? trace_hardirqs_off_caller+0x29/0xc0
 [<ffffffff81455109>] ? sub_preempt_count+0xa9/0xf0
 [<ffffffff81170969>] bdi_sched_wait+0x9/0x10
 [<ffffffff8144f19a>] __wait_on_bit+0x5a/0x90
 [<ffffffff81170960>] ? bdi_sched_wait+0x0/0x10
 [<ffffffff81170960>] ? bdi_sched_wait+0x0/0x10
 [<ffffffff8144f244>] out_of_line_wait_on_bit+0x74/0x90
 [<ffffffff81083b80>] ? wake_bit_function+0x0/0x40
 [<ffffffff8117094e>] ? bdi_queue_work+0xbe/0xd0
 [<ffffffff811709ea>] sync_inodes_sb+0x7a/0x170
 [<ffffffff81175e82>] __sync_filesystem+0x22/0x60
 [<ffffffff81175f8a>] sync_filesystems+0xca/0x130
 [<ffffffff8117604c>] sys_sync+0x1c/0x40
 [<ffffffff8103e0e8>] cstar_dispatch+0x7/0x2c
2 locks held by syslinux/4439:
 #0: (mutex){+.+...}, at: [<ffffffff81175ede>] sync_filesystems+0x1e/0x130
 #1: (&type->s_umount_key#30){++++..}, at: [<ffffffff81175f68>] sync_filesystems+0xa8/0x130

petr-dev3:/tmp$ cat /proc/mounts | cut -d' ' -f3 | sort | uniq -c
 1 binfmt_misc
 1 cifs
 1 debugfs
 1 devpts
 7 ext3
 1 fusectl
 51 nfs
 1 nfsd
 1 proc
 1 rootfs
 1 rpc_pipefs
 1 sysfs
 3 tmpfs
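The trace ends with the task parked in bdi_sched_wait() via out_of_line_wait_on_bit(): sync_inodes_sb() has queued a writeback work item for the backing device (bdi_queue_work() in the trace) and is waiting for the per-bdi flusher thread to complete and release it. Below is a minimal sketch of that wait, assuming the 2.6.32-era fs/fs-writeback.c layout; the struct, the WS_USED_B bit number and the wait_for_flusher() wrapper are simplified illustrations, not the exact kernel source.

```c
#include <linux/wait.h>
#include <linux/sched.h>

/* Simplified stand-in for the on-stack work item that sync_inodes_sb()
 * hands to the flusher thread; the real struct also carries the
 * writeback arguments. */
struct bdi_work_sketch {
	unsigned long state;	/* WS_USED_B is set while the flusher owns it */
};

#define WS_USED_B 0		/* assumed bit number */

/* Action callback for wait_on_bit(): simply sleep until the flusher
 * thread wakes the bit waitqueue. */
static int bdi_sched_wait(void *word)
{
	schedule();
	return 0;
}

/* Roughly what sync_inodes_sb() does after bdi_queue_work(): block
 * until the flusher clears the "in use" bit on the work item.  If the
 * flusher never picks the work up, or never finishes it, this wait
 * never returns -- which is where the hung task above is sitting
 * (out_of_line_wait_on_bit -> bdi_sched_wait). */
static void wait_for_flusher(struct bdi_work_sketch *work)
{
	wait_on_bit(&work->state, WS_USED_B, bdi_sched_wait,
		    TASK_UNINTERRUPTIBLE);
}
```

A flusher thread that never services or never completes the queued work would therefore leave the caller in D state indefinitely, which would be consistent with the idle-disk observation in the comments below.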
Petr, was IO still going on at this point, or was the disk idle? I ask because we have seen a bug here where the fs sync ends up basically doing sync writes to every dirty inode, instead of just waiting on the previously submitted async write. And that tends to go really slow. So it could be that very bug, hence my question. If the disk is idle (and the network too, if this is an NFS mount it is waiting on), then it's likely another bug. I'll code up a debug patch that we can merge to help fix these issues, since it's hard to collect data afterwards.

---

As far as I can tell the disk was idle, and NFS did not complain about any retries either. I believe that 'ps axf | grep D' reported only syslinux and no one else; operations like 'df' worked (so NFS & friends seemed to be OK), and only 'sync' run from the command line got hung. The box was in this state for over 15 minutes until I decided to alt-sysrq-b it... I'll try reproducing it, but I have no idea how successful I'll be - I've done the same build 3 times since the hang, and they all succeeded without any hangs.

---

I kept running 'while true; do sync; done' for a day while doing regular work. Once sync took more than 120 seconds (which I find quite surprising, given that sync was supposed to run back-to-back), but it completed successfully before I noticed. So for now it is not reproducible.

[235441.210125] INFO: task sync:13249 blocked for more than 120 seconds.
[235441.210276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[235441.210425] sync D ffff88012cdd5e40 0 13249 5520 0x00000000
[235441.210609] BUG: using smp_processor_id() in preemptible [00000000] code: khungtaskd/400
[235441.210761] caller is show_stack_log_lvl+0x29/0x150
[235441.210866] Pid: 400, comm: khungtaskd Not tainted 2.6.32-rc5-64-00021-g758646f #31
[235441.211008] Call Trace:
[235441.211084]  [<ffffffff8125ac39>] debug_smp_processor_id+0xd9/0xe0
[235441.211208]  [<ffffffff8100e9f9>] show_stack_log_lvl+0x29/0x150
[235441.211328]  [<ffffffff8100fdc7>] show_stack+0x17/0x20
[235441.211438]  [<ffffffff8104f9c5>] sched_show_task+0x95/0x100
[235441.211556]  [<ffffffff810c9cc9>] watchdog+0x269/0x2c0
[235441.211666]  [<ffffffff810c9ad3>] ? watchdog+0x73/0x2c0
[235441.211776]  [<ffffffff810c9a60>] ? watchdog+0x0/0x2c0
[235441.211886]  [<ffffffff810836ce>] kthread+0x8e/0xa0
[235441.211992]  [<ffffffff8100cf5a>] child_rip+0xa/0x20
[235441.212096]  [<ffffffff810512ea>] ? finish_task_switch+0x7a/0xe0
[235441.212216]  [<ffffffff81451916>] ? _spin_unlock_irq+0x36/0x60
[235441.212335]  [<ffffffff8100c8c0>] ? restore_args+0x0/0x30
[235441.212450]  [<ffffffff81083640>] ? kthread+0x0/0xa0
[235441.212554]  [<ffffffff8100cf50>] ? child_rip+0x0/0x20
[235441.212662]  ffff88020133ddd8 0000000000000046 0000000000000000 ffffffff810969f9
[235441.212890]  ffff88020133dfd8 ffff88020133c000 ffff88020133dfd8 ffff88020133c000
[235441.213117]  000000000000f468 0000000000000002 ffff88006a36c720 ffff88011ba88000
[235441.213346] Call Trace:
[235441.213418]  [<ffffffff810969f9>] ? trace_hardirqs_off_caller+0x29/0xc0
[235441.213548]  [<ffffffff81455109>] ? sub_preempt_count+0xa9/0xf0
[235441.213699]  [<ffffffff81170969>] bdi_sched_wait+0x9/0x10
[235441.213703]  [<ffffffff8144f19a>] __wait_on_bit+0x5a/0x90
[235441.213706]  [<ffffffff81170960>] ? bdi_sched_wait+0x0/0x10
[235441.213710]  [<ffffffff81170960>] ? bdi_sched_wait+0x0/0x10
[235441.213713]  [<ffffffff8144f244>] out_of_line_wait_on_bit+0x74/0x90
[235441.213717]  [<ffffffff81083b80>] ? wake_bit_function+0x0/0x40
[235441.213721]  [<ffffffff8117092c>] ? bdi_queue_work+0x9c/0xd0
[235441.213724]  [<ffffffff811709ea>] sync_inodes_sb+0x7a/0x170
[235441.213728]  [<ffffffff81175e82>] __sync_filesystem+0x22/0x60
[235441.213732]  [<ffffffff81175f8a>] sync_filesystems+0xca/0x130
[235441.213735]  [<ffffffff8117604c>] sys_sync+0x1c/0x40
[235441.213740]  [<ffffffff8100be6b>] system_call_fastpath+0x16/0x1b
[235441.213744] 2 locks held by sync/13249:
[235441.213746]  #0: (mutex){+.+...}, at: [<ffffffff81175ede>] sync_filesystems+0x1e/0x130
[235441.213752]  #1: (&type->s_umount_key#30){++++..}, at: [<ffffffff81175f68>] sync_filesystems+0xa8/0x130

---

OK, that is helpful (in that something weird reproduces). I'll do the debug patch as soon as I can.

---

I have not seen it for some time, so I guess it went away.
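For reference, the "sync writes to every dirty inode, instead of just waiting on the previously submitted async write" behaviour mentioned in the first reply maps onto the two passes sys_sync() makes over each filesystem in 2.6.32. The following is a rough sketch of that per-filesystem step, simplified from fs/sync.c's __sync_filesystem(); quota syncing and some checks are omitted, and the function name here is illustrative.

```c
#include <linux/fs.h>
#include <linux/writeback.h>

/* Simplified per-filesystem step of sys_sync() in 2.6.32.  It is called
 * twice by sync_filesystems(): first with wait == 0, then with
 * wait == 1 (the call visible in both traces in this report). */
static int sync_one_filesystem(struct super_block *sb, int wait)
{
	if (wait)
		/* Second pass, WB_SYNC_ALL: wait until every dirty inode
		 * is on disk.  Ideally this mostly waits on the I/O the
		 * first pass already submitted; the slow case described
		 * in the first reply is when it ends up issuing
		 * synchronous writes per inode instead. */
		sync_inodes_sb(sb);
	else
		/* First pass, WB_SYNC_NONE: kick off asynchronous
		 * writeback without waiting for it. */
		writeback_inodes_sb(sb);

	if (sb->s_op->sync_fs)
		sb->s_op->sync_fs(sb, wait);

	/* Then flush the underlying block device as well
	 * (__sync_blockdev() is local to fs/sync.c). */
	return __sync_blockdev(sb->s_bdev, wait);
}
```

Neither the slow-sync case nor another writeback bug was ever confirmed here, which is why the report was eventually closed as unreproducible.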