When writing files over NFS, nfsd locks up in D state and I get the attached backtraces in dmesg. Once nfsd has killed the request that triggered the lockup, the problem does not happen again until the system has been rebooted (tested over a period of 20 days). I'm not yet sure whether it happens on the first write request or only on some special ones, but I don't really want to reboot my file server and let it hang that often. However, if you can tell me what to look for or which debugging features to use, I'll be happy to try them. I also haven't yet tested whether this can be reproduced when writing files directly (without NFS), but I think the system would lock up during boot in that case, since I'm running a mysqld slave on it. This started happening around Linux 3.2 or 3.3. I've attached the whole dmesg output (including bootup, in case this is hardware-related).

Arch Linux x86_64
Linux 3.4 (config [1])
nfs-utils 1.2.5

[1]: https://projects.archlinux.org/svntogit/packages.git/plain/trunk/config.x86_64?h=packages/linux&id=f8c396e6ca1e29c3da83431f64074f8c127e1af3
Created attachment 73412 [details] dmesg with backtrace
If you can reproduce this, can you make the deadlock happen, then run 'echo w >/proc/sysrq-trigger' and attach the resulting dmesg output here? Thanks!
Created attachment 73459 [details] output of echo w > /proc/sysrq-trigger

This is with Linux 3.0.32 instead of 3.4, because 3.4 had lots of bugs that caused me to switch all systems away from it (these might get their own reports some day unless someone else takes care of them first). If you think it really makes a difference, I can switch back to 3.4 for testing.
Hmm, thanks for the data. So the jbd2 thread (which is blocking the rest) is waiting for the following nfsd thread to finish its transaction:

[  128.487611] nfsd            D 0000000000000000     0   750      2 0x00000000
[  128.487613]  ffff880076833050 0000000000000046 ffff880076832f50 ffffffff00000000
[  128.487615]  ffff8800790ea040 ffff880076833fd8 ffff880076833fd8 ffff880076833fd8
[  128.487618]  ffffffff8189b020 ffff8800790ea040 0000000179880938 ffff880079880848
[  128.487620] Call Trace:
[  128.487624]  [<ffffffff81204569>] ? queue_unplugged+0x59/0x110
[  128.487627]  [<ffffffff81187fa0>] ? __wait_on_buffer+0x30/0x30
[  128.487629]  [<ffffffff81421d8f>] schedule+0x3f/0x60
[  128.487631]  [<ffffffff81421e3f>] io_schedule+0x8f/0xd0
[  128.487633]  [<ffffffff81187fae>] sleep_on_buffer+0xe/0x20
[  128.487634]  [<ffffffff814225a0>] __wait_on_bit+0x60/0x90
[  128.487637]  [<ffffffff81187fa0>] ? __wait_on_buffer+0x30/0x30
[  128.487639]  [<ffffffff8142264c>] out_of_line_wait_on_bit+0x7c/0x90
[  128.487641]  [<ffffffff810819f0>] ? autoremove_wake_function+0x40/0x40
[  128.487643]  [<ffffffff81187f9e>] __wait_on_buffer+0x2e/0x30
[  128.487648]  [<ffffffffa01744d3>] ext4_mb_init_cache+0x203/0x9c0 [ext4]
[  128.487651]  [<ffffffff81106e40>] ? __lru_cache_add+0x90/0xb0
[  128.487656]  [<ffffffffa017665e>] ext4_mb_init_group+0x10e/0x210 [ext4]
[  128.487660]  [<ffffffffa0176876>] ext4_mb_good_group+0x116/0x130 [ext4]
[  128.487665]  [<ffffffffa0178a3b>] ext4_mb_regular_allocator+0x19b/0x420 [ext4]
[  128.487669]  [<ffffffffa017620d>] ? ext4_mb_normalize_request+0x20d/0x500 [ext4]
[  128.487674]  [<ffffffffa017a40e>] ext4_mb_new_blocks+0x42e/0x5d0 [ext4]
[  128.487678]  [<ffffffffa0148558>] ext4_alloc_branch+0x528/0x670 [ext4]
[  128.487681]  [<ffffffff81118129>] ? zone_statistics+0x99/0xc0
[  128.487686]  [<ffffffffa014ba78>] ext4_ind_map_blocks+0x328/0x7c0 [ext4]
[  128.487690]  [<ffffffffa014bfea>] ext4_map_blocks+0xda/0x1f0 [ext4]
[  128.487694]  [<ffffffffa014c1a6>] _ext4_get_block+0xa6/0x160 [ext4]
[  128.487698]  [<ffffffffa014c2c6>] ext4_get_block+0x16/0x20 [ext4]
[  128.487701]  [<ffffffff8118a427>] __block_write_begin+0x1c7/0x590
[  128.487705]  [<ffffffffa014c2b0>] ? noalloc_get_block_write+0x30/0x30 [ext4]
[  128.487709]  [<ffffffffa014d9c2>] ext4_write_begin+0x162/0x390 [ext4]
[  128.487711]  [<ffffffff810fa739>] generic_file_buffered_write+0x109/0x260
[  128.487714]  [<ffffffff810fb8d5>] __generic_file_aio_write+0x245/0x460
[  128.487716]  [<ffffffff8142350d>] ? __mutex_lock_slowpath+0x25d/0x350
[  128.487718]  [<ffffffff810fbb5e>] generic_file_aio_write+0x6e/0xe0
[  128.487722]  [<ffffffffa0143dff>] ext4_file_write+0xaf/0x260 [ext4]
[  128.487724]  [<ffffffff811745e9>] ? iget_locked+0x89/0x180
[  128.487727]  [<ffffffffa06293b0>] ? _fh_update.isra.8.part.9+0x60/0x60 [nfsd]
[  128.487729]  [<ffffffff81174902>] ? iput+0x42/0x1c0
[  128.487733]  [<ffffffffa0143d50>] ? ext4_llseek+0x110/0x110 [ext4]
[  128.487735]  [<ffffffff8115a9d2>] do_sync_readv_writev+0xd2/0x110
[  128.487738]  [<ffffffffa06293b0>] ? _fh_update.isra.8.part.9+0x60/0x60 [nfsd]
[  128.487741]  [<ffffffff811449b9>] ? __kmalloc+0x39/0x1a0
[  128.487743]  [<ffffffff811e33ec>] ? security_file_permission+0x2c/0xb0
[  128.487745]  [<ffffffff8115a0f1>] ? rw_verify_area+0x61/0xf0
[  128.487747]  [<ffffffff8115aca4>] do_readv_writev+0xd4/0x1e0
[  128.487751]  [<ffffffffa0143a5f>] ? ext4_file_open+0x6f/0x1e0 [ext4]
[  128.487753]  [<ffffffff8115ade5>] vfs_writev+0x35/0x60
[  128.487757]  [<ffffffffa062a83b>] nfsd_vfs_write.isra.9+0xeb/0x3e0 [nfsd]
[  128.487759]  [<ffffffff811588df>] ? dentry_open+0x4f/0x90
[  128.487762]  [<ffffffffa062b690>] ? nfsd_open+0xa0/0x1a0 [nfsd]
[  128.487766]  [<ffffffffa062ce08>] nfsd_write+0xf8/0x110 [nfsd]
[  128.487770]  [<ffffffffa063476b>] nfsd3_proc_write+0xbb/0x150 [nfsd]
[  128.487773]  [<ffffffffa0626a4e>] nfsd_dispatch+0xfe/0x240 [nfsd]
[  128.487777]  [<ffffffffa054597b>] svc_process+0x4bb/0x840 [sunrpc]
[  128.487779]  [<ffffffff8107143b>] ? recalc_sigpending+0x1b/0x50
[  128.487782]  [<ffffffffa06260c2>] nfsd+0xc2/0x160 [nfsd]
[  128.487785]  [<ffffffffa0626000>] ? 0xffffffffa0625fff
[  128.487786]  [<ffffffff8108102c>] kthread+0x8c/0xa0
[  128.487788]  [<ffffffff814266a4>] kernel_thread_helper+0x4/0x10
[  128.487791]  [<ffffffff81080fa0>] ? kthread_worker_fn+0x180/0x180
[  128.487792]  [<ffffffff814266a0>] ? gs_change+0x13/0x13

That thread is waiting for a block bitmap to be read from disk. The question is why the read takes so long (or never actually finishes). For a start, can you switch the IO scheduler to deadline? You can do that with:

echo "deadline" >/sys/block/<dev>/queue/scheduler

for all disks, then reproduce the problem and post the output of 'echo w > /proc/sysrq-trigger' here again. Thanks.
I'm already using deadline, should I try cfq/noop or is that not worth trying?
OK, so try with NOOP to make things as simple as possible. Thanks!
Created attachment 73465 [details] output of echo w > /proc/sysrq-trigger with noop
Hum, it looks exactly the same as the previous attempt. I'm starting to suspect some problem with plugging or bio list handling. Since it seems you can trigger the lockup pretty quickly, can you gather blktrace data for me? The filesystem is on a device-mapper device, so you'll need to run something like:

blktrace -d /dev/dm-2 -d /dev/sd? ...

where you include all SCSI disks composing the dm device. Then just tar.gz the created files after the lockup happens and attach them here. Hopefully we'll be able to see where the read got stuck. Thanks!
Created attachment 73471 [details] blktrace files

I'm running two RAID 1 arrays (one with 2 disks, one with 3), both in a single LVM volume group which contains one big LV (dm-2). I started blktrace before the lockup happened and killed it with ^C a few seconds after. I can also let it run until nfsd gives up, if you want that.
I have to say I'm puzzled. Everything looks normal: the IO is submitted and completed, with no sign of any hang. Only the last request, a read, has been submitted to disk but not yet completed, though that could be just a coincidence. To check what was being read, can you attach the output of dumpe2fs /dev/dm-2 here? When the lockup happens, those nfsd threads are sitting there doing nothing for a long time, am I right?
Created attachment 73480 [details] output of dumpe2fs /dev/dm-2 (In reply to comment #10) > I have to say I'm puzzled. Everything looks normal, the IO is submitted and > completed, no sign of any hang. Just the last request - read - has been only > submitted to disk and not yet completed but that can be just a coincidence. > To > check what has been read, can you attach here output of dumpe2fs /dev/dm-2? I've run it without the hanging nfsd, is that correct? > When the lockup happens, those nfsd threads are sitting there doing nothing > for > a long time, am I right? Yeah. All nfs activity is frozen, but local system activity on the same file system works just fine.
Oh, that explains it. I wanted blktrace around the moment when nfsd hangs. So just start blktrace and let it run until nfsd hangs... Thanks.
(In reply to comment #12) > Oh, that explains it. I wanted blktrace around the moment when nfsd hangs. So > just start blktrace and let it run until nfsd hangs... Thanks. I meant, I ran dumpe2fs without nfsd hanging. As for blktrace, I ran that before nfsd hung and waited a few seconds after it hung before I killed it.
OK, something strange is going on. The block trace ends like:

253,2  0     3716  19.264276270  751  Q  R 485490688 + 8 [nfsd]
9,0    0     3699  19.264279134  751  A  R 527540608 + 8 <- (253,2) 485490688
9,0    0     3700  19.264280391  751  Q  R 527540608 + 8 [nfsd]
8,32   0    14793  19.264283254  751  A  R 527540672 + 8 <- (8,33) 527540608
8,32   0    14794  19.264284162  751  Q  R 527540672 + 8 [nfsd]
8,32   0    14795  19.264286537  751  G  R 527540672 + 8 [nfsd]
8,32   0    14796  19.264288073  751  P  N [nfsd]
8,32   0    14797  19.264290727  751  I  R 527540672 + 8 [nfsd]
8,32   0    14798  19.264292264  751  U  N [nfsd] 1
8,32   0    14799  19.264293940  751  D  R 527540672 + 8 [nfsd]

We see a request to read 8 sectors starting at 485490688 has been issued to /dev/dm-2. I've checked that against the dumpe2fs output, and indeed that is the block bitmap for group 1852. That request has gone through md0 to sdc, where it has been dispatched to the disk driver for processing. All is well, but for some reason the read never completed, which is why nfsd hangs. So this doesn't look like a problem in the filesystem or block layer, but rather somewhere in the driver or hardware. Can you check what happens if you just run:

dd if=/dev/dm-2 of=/dev/null bs=512 count=8 skip=485490688

Does the command finish? And what if you run it after nfsd has hung? If it does finish in both cases, can you gather the blktrace of a hang once more like before, so that I can check whether we hang waiting for the same block the second time as well? Thanks!
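For anyone following along, the sector-to-group mapping above can be reproduced with plain shell arithmetic. This is just an illustration of the calculation, assuming a 4k filesystem block size and 32768 blocks per group (both visible in the dumpe2fs output):

```shell
# Map the stuck request's start sector (from blktrace) back to an
# ext4 block group. Assumes 4k fs blocks and 32768 blocks per group.
sector=485490688
fs_block=$((sector / 8))      # 8 x 512-byte sectors per 4k block
group=$((fs_block / 32768))   # 32768 blocks per group
echo "fs block $fs_block is in group $group"
# -> fs block 60686336 is in group 1852
```

Block 60686336 is exactly the first block of group 1852, which is where a non-flex_bg ext3-style layout keeps the group's block bitmap.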
Created attachment 73488 [details] 2nd blktrace (In reply to comment #14) > dd if=/dev/dm-2 of=/dev/null bs=512 count=8 skip=485490688 Finishes before and after the hang. I've attached another blktrace, but I let it run for like 20 seconds after nfsd hung up this time instead of only a few seconds.
Grumble, it's the same, just for a different allocation group:

253,2  0     6514  25.375884394  744  Q  R 707526656 + 8 [nfsd]
9,0    0     5451  25.375887257  744  A  R 749576576 + 8 <- (253,2) 707526656
9,0    0     5452  25.375888584  744  Q  R 749576576 + 8 [nfsd]
8,16   0    17311  25.375891308  744  A  R 749576639 + 8 <- (8,17) 749576576
8,16   0    17312  25.375892077  744  Q  R 749576639 + 8 [nfsd]
8,16   0    17313  25.375894381  744  G  R 749576639 + 8 [nfsd]
8,16   0    17314  25.375895918  744  P  N [nfsd]
8,16   0    17315  25.375898572  744  I  R 749576639 + 8 [nfsd]
8,16   0    17316  25.375900038  744  U  N [nfsd] 1
8,16   0    17317  25.375901715  744  D  R 749576639 + 8 [nfsd]

Let's try the following:

for (( i = 0; i < 25293; i++ )); do
    dd if=/dev/dm-2 of=/dev/null bs=4k count=1 skip=$(($i*32768))
done

Again run it before and after the hang. If it finishes even after the hang, check whether nfsd is still hung.
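As a side note, plain shell arithmetic places this second stuck sector in another group, and shows that Jan's read loop (one 4k block at the start of each of the 25293 block groups) covers it. Block and group sizes are assumed from the dumpe2fs output, as before:

```shell
# Translate the second stuck sector (from the new blktrace) to its
# block group; 8 x 512-byte sectors per 4k block, 32768 blocks per group.
sector=707526656
fs_block=$((sector / 8))
group=$((fs_block / 32768))
echo "group $group"   # well inside the 25293 groups the dd loop reads
# -> group 2699
```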
It finishes in both cases. I ran it directly after nfs hung, and when the loop completed, nfs was running again. Normally it takes ~2 minutes until nfs unfreezes, but in this case it was faster.
OK, thanks for testing. So you say the problem resolves itself even when you don't run dd, just it takes longer? I'm more and more inclined to blame some hardware strangeness... Also originally you said the problem started happening around 3.3 or 3.4 but now you are testing with 3.0. So do you have any kernel where you don't see the problem?
(In reply to comment #18)
> OK, thanks for testing. So you say the problem resolves itself even when you
> don't run dd, just it takes longer? I'm more and more inclined to blame some
> hardware strangeness... Also originally you said the problem started
> happening around 3.3 or 3.4 but now you are testing with 3.0. So do you have
> any kernel where you don't see the problem?

So far I can *not* reproduce the problem with 3.0.17 (transferred 8 GB of data). However, with 3.0.32 I can reproduce it every time by copying a 20 MB+ file. If you want, I can test a few kernels between .17 and .32, but that's going to take some time.

FYI: I do notice short hiccups on the client, but those are most likely just because the disks aren't fast enough to handle the write requests. I'm running Gbit ethernet; reading is obviously faster than writing, and writing is slower than the link limit, so the network buffers fill up. I don't consider this a bug.
Ah, that's interesting to know! It would be great if you could narrow down which -stable update introduced the problem (that should take about 4 tests); then I can have a look at the patches in that release and see whether something catches my eye.
Finally got around to testing. Looks like it breaks from 3.0.22 -> .23. I've used the Arch Linux packages for testing and noticed that there were a few config changes [1] at that time. Could those be related? I've already tried booting .23 with transparent_hugepage=never, but that didn't help.

[1]: https://projects.archlinux.org/svntogit/packages.git/log/trunk?h=packages/linux-lts
Hum, I'm not sure how to actually find out what has changed in that package; the reference you gave shows just packaging changes. Looking at the changes between 3.0.22 and 3.0.23, there are two commits:

genirq: Unmask oneshot irqs when thread was not woken
genirq: Handle pending irqs in irq_startup()

which could be related (and one of them actually did cause problems). Those problems should supposedly be fixed in 3.0.24, but maybe they are not. So can you test 3.0.23 with those two patches reverted? And can you test whether you see the hang with 3.0.24 as well? Thanks!
Alright, after bisecting and trying lots of kernels, I believe it's not a code change causing this, but rather a change [1] in Arch's config file. 3.0.22 with the ext3 module works fine, but if I apply that patch so that ext4 is used for ext3 file systems, it hangs. 3.0.23 with the ext3 module works too. Since I don't know if it ever worked with that option enabled, I don't think bisecting will help. Any ideas?

[1]: https://projects.archlinux.org/svntogit/packages.git/commit/trunk?h=packages/linux-lts&id=7c53d3c600983af5a021315b1ef27ed6d4807170
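For reference, a minimal sketch of what that kind of config change does. I'm assuming the option in this kernel series is named CONFIG_EXT4_USE_FOR_EXT23 (the symbol name is my assumption, not confirmed from the linked commit), and the fragment below is illustrative, not Arch's actual config:

```shell
# Illustrative config fragment (assumed option name): with
# CONFIG_EXT4_USE_FOR_EXT23=y the ext4 driver registers itself for the
# ext2/ext3 filesystem types, so "mount -t ext3" takes the ext4 code
# path and ext3.ko is never used for the mount.
config='
CONFIG_EXT3_FS=m
CONFIG_EXT4_FS=m
CONFIG_EXT4_USE_FOR_EXT23=y
'
# On a running kernel the same grep works against /proc/config.gz
# (zgrep), if CONFIG_IKCONFIG_PROC is enabled.
echo "$config" | grep EXT4_USE_FOR
# -> CONFIG_EXT4_USE_FOR_EXT23=y
```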
Thanks a lot for trying out the kernels. I don't see how using ext4 instead of ext3 could make the difference we observed - i.e. a submitted read request not completing. I guess I'll try using ext4 to mount an ext3 filesystem and beat it up a bit, to see whether I can reproduce the problem locally for further investigation.
I've recently switched my file server to xfs and can no longer reproduce this. Since I seem to have been the only one with this issue, I'm closing this bug.