Bug 43292 - jbd2 lockup with ext3 and nfs
Summary: jbd2 lockup with ext3 and nfs
Status: RESOLVED UNREPRODUCIBLE
Alias: None
Product: File System
Classification: Unclassified
Component: ext3
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: fs_ext3@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-05-24 15:54 UTC by Florian Pritz
Modified: 2012-11-02 15:46 UTC
CC List: 2 users

See Also:
Kernel Version: 3.4
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg with backtrace (76.87 KB, text/plain)
2012-05-26 20:31 UTC, Florian Pritz
output of echo w > /proc/sysrq-trigger (66.02 KB, text/plain)
2012-05-29 21:44 UTC, Florian Pritz
output of echo w > /proc/sysrq-trigger with noop (67.39 KB, text/plain)
2012-05-30 09:47 UTC, Florian Pritz
blktrace files (226.01 KB, application/x-gzip)
2012-05-30 21:20 UTC, Florian Pritz
output of dumpe2fs /dev/dm-2 (1.84 KB, text/plain)
2012-05-31 21:40 UTC, Florian Pritz
2nd blktrace (471.51 KB, application/x-gzip)
2012-06-01 15:40 UTC, Florian Pritz

Description Florian Pritz 2012-05-24 15:54:29 UTC
When using nfs and writing some files, nfsd locks up in D state and I get the attached backtraces in dmesg. Once nfsd has killed the request that triggered the lockup, the problem does not happen again until the system has been rebooted. (Tested over a period of 20 days)

I'm not yet sure if it happens on the first write request or only on some special ones, but I don't really want to reboot my file server and let it hang that often. However, if you can tell me what to look for or what debugging features I should use, I'll be happy to try them.

I also haven't yet tested whether this can be reproduced when writing files directly (without nfs), but I think it would lock up during boot in that case, since I'm running a mysqld slave on the system.

This started happening around linux 3.2 or 3.3.

I've attached the whole dmesg output (including bootup, in case this is related to the hardware).

Arch Linux x86_64
Linux 3.4 (config [1])
nfs-utils 1.2.5


[1]: https://projects.archlinux.org/svntogit/packages.git/plain/trunk/config.x86_64?h=packages/linux&id=f8c396e6ca1e29c3da83431f64074f8c127e1af3
Comment 1 Florian Pritz 2012-05-26 20:31:25 UTC
Created attachment 73412 [details]
dmesg with backtrace
Comment 2 Jan Kara 2012-05-29 14:04:10 UTC
If you can reproduce this, can you make the deadlock happen, then run 'echo w >/proc/sysrq-trigger', and attach here output from dmesg? Thanks!
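
Something along these lines should capture it (a rough sketch; run as root, and the sysrq enable step may not even be needed on your setup):

echo 1 > /proc/sys/kernel/sysrq    # make sure the sysrq interface is enabled
echo w > /proc/sysrq-trigger       # dump all tasks blocked in D state
dmesg > sysrq-w.txt                # save the backtraces to attach here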
Comment 3 Florian Pritz 2012-05-29 21:44:13 UTC
Created attachment 73459 [details]
output of echo w > /proc/sysrq-trigger

This is with linux 3.0.32 instead of 3.4 because 3.4 had lots of bugs that caused me to switch all systems away from it. (These might get their own reports some day unless someone else takes care of them first)

If you think it really makes a difference I can switch back to 3.4 for testing.
Comment 4 Jan Kara 2012-05-30 09:18:30 UTC
Hmm, thanks for the data. So the jbd2 thread (which is blocking the rest) is waiting for the following nfsd thread to finish a transaction:
[  128.487611] nfsd            D 0000000000000000     0   750      2 0x00000000
[  128.487613]  ffff880076833050 0000000000000046 ffff880076832f50 ffffffff00000000
[  128.487615]  ffff8800790ea040 ffff880076833fd8 ffff880076833fd8 ffff880076833fd8
[  128.487618]  ffffffff8189b020 ffff8800790ea040 0000000179880938 ffff880079880848
[  128.487620] Call Trace:
[  128.487624]  [<ffffffff81204569>] ? queue_unplugged+0x59/0x110
[  128.487627]  [<ffffffff81187fa0>] ? __wait_on_buffer+0x30/0x30
[  128.487629]  [<ffffffff81421d8f>] schedule+0x3f/0x60
[  128.487631]  [<ffffffff81421e3f>] io_schedule+0x8f/0xd0
[  128.487633]  [<ffffffff81187fae>] sleep_on_buffer+0xe/0x20
[  128.487634]  [<ffffffff814225a0>] __wait_on_bit+0x60/0x90
[  128.487637]  [<ffffffff81187fa0>] ? __wait_on_buffer+0x30/0x30
[  128.487639]  [<ffffffff8142264c>] out_of_line_wait_on_bit+0x7c/0x90
[  128.487641]  [<ffffffff810819f0>] ? autoremove_wake_function+0x40/0x40
[  128.487643]  [<ffffffff81187f9e>] __wait_on_buffer+0x2e/0x30
[  128.487648]  [<ffffffffa01744d3>] ext4_mb_init_cache+0x203/0x9c0 [ext4]
[  128.487651]  [<ffffffff81106e40>] ? __lru_cache_add+0x90/0xb0
[  128.487656]  [<ffffffffa017665e>] ext4_mb_init_group+0x10e/0x210 [ext4]
[  128.487660]  [<ffffffffa0176876>] ext4_mb_good_group+0x116/0x130 [ext4]
[  128.487665]  [<ffffffffa0178a3b>] ext4_mb_regular_allocator+0x19b/0x420 [ext4]
[  128.487669]  [<ffffffffa017620d>] ? ext4_mb_normalize_request+0x20d/0x500 [ext4]
[  128.487674]  [<ffffffffa017a40e>] ext4_mb_new_blocks+0x42e/0x5d0 [ext4]
[  128.487678]  [<ffffffffa0148558>] ext4_alloc_branch+0x528/0x670 [ext4]
[  128.487681]  [<ffffffff81118129>] ? zone_statistics+0x99/0xc0
[  128.487686]  [<ffffffffa014ba78>] ext4_ind_map_blocks+0x328/0x7c0 [ext4]
[  128.487690]  [<ffffffffa014bfea>] ext4_map_blocks+0xda/0x1f0 [ext4]
[  128.487694]  [<ffffffffa014c1a6>] _ext4_get_block+0xa6/0x160 [ext4]
[  128.487698]  [<ffffffffa014c2c6>] ext4_get_block+0x16/0x20 [ext4]
[  128.487701]  [<ffffffff8118a427>] __block_write_begin+0x1c7/0x590
[  128.487705]  [<ffffffffa014c2b0>] ? noalloc_get_block_write+0x30/0x30 [ext4]
[  128.487709]  [<ffffffffa014d9c2>] ext4_write_begin+0x162/0x390 [ext4]
[  128.487711]  [<ffffffff810fa739>] generic_file_buffered_write+0x109/0x260
[  128.487714]  [<ffffffff810fb8d5>] __generic_file_aio_write+0x245/0x460
[  128.487716]  [<ffffffff8142350d>] ? __mutex_lock_slowpath+0x25d/0x350
[  128.487718]  [<ffffffff810fbb5e>] generic_file_aio_write+0x6e/0xe0
[  128.487722]  [<ffffffffa0143dff>] ext4_file_write+0xaf/0x260 [ext4]
[  128.487724]  [<ffffffff811745e9>] ? iget_locked+0x89/0x180
[  128.487727]  [<ffffffffa06293b0>] ? _fh_update.isra.8.part.9+0x60/0x60 [nfsd]
[  128.487729]  [<ffffffff81174902>] ? iput+0x42/0x1c0
[  128.487733]  [<ffffffffa0143d50>] ? ext4_llseek+0x110/0x110 [ext4]
[  128.487735]  [<ffffffff8115a9d2>] do_sync_readv_writev+0xd2/0x110
[  128.487738]  [<ffffffffa06293b0>] ? _fh_update.isra.8.part.9+0x60/0x60 [nfsd]
[  128.487741]  [<ffffffff811449b9>] ? __kmalloc+0x39/0x1a0
[  128.487743]  [<ffffffff811e33ec>] ? security_file_permission+0x2c/0xb0
[  128.487745]  [<ffffffff8115a0f1>] ? rw_verify_area+0x61/0xf0
[  128.487747]  [<ffffffff8115aca4>] do_readv_writev+0xd4/0x1e0
[  128.487751]  [<ffffffffa0143a5f>] ? ext4_file_open+0x6f/0x1e0 [ext4]
[  128.487753]  [<ffffffff8115ade5>] vfs_writev+0x35/0x60
[  128.487757]  [<ffffffffa062a83b>] nfsd_vfs_write.isra.9+0xeb/0x3e0 [nfsd]
[  128.487759]  [<ffffffff811588df>] ? dentry_open+0x4f/0x90
[  128.487762]  [<ffffffffa062b690>] ? nfsd_open+0xa0/0x1a0 [nfsd]
[  128.487766]  [<ffffffffa062ce08>] nfsd_write+0xf8/0x110 [nfsd]
[  128.487770]  [<ffffffffa063476b>] nfsd3_proc_write+0xbb/0x150 [nfsd]
[  128.487773]  [<ffffffffa0626a4e>] nfsd_dispatch+0xfe/0x240 [nfsd]
[  128.487777]  [<ffffffffa054597b>] svc_process+0x4bb/0x840 [sunrpc]
[  128.487779]  [<ffffffff8107143b>] ? recalc_sigpending+0x1b/0x50
[  128.487782]  [<ffffffffa06260c2>] nfsd+0xc2/0x160 [nfsd]
[  128.487785]  [<ffffffffa0626000>] ? 0xffffffffa0625fff
[  128.487786]  [<ffffffff8108102c>] kthread+0x8c/0xa0
[  128.487788]  [<ffffffff814266a4>] kernel_thread_helper+0x4/0x10
[  128.487791]  [<ffffffff81080fa0>] ? kthread_worker_fn+0x180/0x180
[  128.487792]  [<ffffffff814266a0>] ? gs_change+0x13/0x13

That thread is waiting for a block bitmap to be read from disk. The question is why the read would take so long (or never actually finish).

For a start, can you switch the IO scheduler to deadline? You can do that with:
echo "deadline" >/sys/block/<dev>/queue/scheduler
for all disks, then reproduce the problem and post the output of echo w > /proc/sysrq-trigger here again? Thanks.
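
Something like this loop should cover all the disks at once (a sketch; the sd* glob is just an example, adjust it to your device names):

for f in /sys/block/sd*/queue/scheduler; do
    echo deadline > "$f"
    cat "$f"     # the active scheduler is shown in brackets
done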
Comment 5 Florian Pritz 2012-05-30 09:24:44 UTC
I'm already using deadline. Should I try cfq/noop, or is that not worth trying?
Comment 6 Jan Kara 2012-05-30 09:41:29 UTC
OK, so try with NOOP to make things as simple as possible. Thanks!
Comment 7 Florian Pritz 2012-05-30 09:47:27 UTC
Created attachment 73465 [details]
output of echo w > /proc/sysrq-trigger with noop
Comment 8 Jan Kara 2012-05-30 13:33:29 UTC
Hum, it looks exactly the same as the previous attempt. I'm starting to suspect some problem with plugging or bio list handling. Since it looks like you can trigger the lockup pretty quickly, can you gather blktrace data for me? The filesystem is on a device-mapper device, so you'll need to run something like:
blktrace -d /dev/dm-2 -d /dev/sd? ...
where you include all the SCSI disks that make up the dm device. Then just tar.gz the created files after the lockup happens and attach them here. Hopefully we'll be able to see where the read got stuck. Thanks!
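
For example, something like this (sdb/sdc are placeholders for whatever disks back dm-2 on your box):

mkdir -p blktrace-out
blktrace -D blktrace-out -d /dev/dm-2 -d /dev/sdb -d /dev/sdc
# reproduce the hang, stop blktrace with ^C, then:
tar czf blktrace.tar.gz blktrace-out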
Comment 9 Florian Pritz 2012-05-30 21:20:32 UTC
Created attachment 73471 [details]
blktrace files

I'm running two RAID 1 arrays with 2 and 3 disks, both in a single LVM volume group that contains one big LV (dm-2).

I started blktrace before the lockup happened and killed it with ^C a few seconds afterwards. I can also let it run until nfsd gives up, if you want that.
Comment 10 Jan Kara 2012-05-31 20:16:27 UTC
I have to say I'm puzzled. Everything looks normal: the IO is submitted and completed, with no sign of any hang. Only the last request, a read, has been submitted to disk and not yet completed, but that could just be a coincidence. To check what was being read, can you attach the output of dumpe2fs /dev/dm-2 here?

When the lockup happens, those nfsd threads are sitting there doing nothing for a long time, am I right?
Comment 11 Florian Pritz 2012-05-31 21:40:38 UTC
Created attachment 73480 [details]
output of dumpe2fs /dev/dm-2

(In reply to comment #10)
> I have to say I'm puzzled. Everything looks normal: the IO is submitted and
> completed, with no sign of any hang. Only the last request, a read, has been
> submitted to disk and not yet completed, but that could just be a coincidence.
> To check what was being read, can you attach the output of dumpe2fs /dev/dm-2
> here?

I've run it without the hanging nfsd, is that correct?

> When the lockup happens, those nfsd threads are sitting there doing nothing
> for a long time, am I right?

Yeah. All nfs activity is frozen, but local system activity on the same file system works just fine.
Comment 12 Jan Kara 2012-05-31 22:44:19 UTC
Oh, that explains it. I wanted blktrace around the moment when nfsd hangs. So just start blktrace and let it run until nfsd hangs... Thanks.
Comment 13 Florian Pritz 2012-05-31 22:49:02 UTC
(In reply to comment #12)
> Oh, that explains it. I wanted blktrace around the moment when nfsd hangs. So
> just start blktrace and let it run until nfsd hangs... Thanks.

I meant, I ran dumpe2fs without nfsd hanging. As for blktrace, I ran that before nfsd hung and waited a few seconds after it hung before I killed it.
Comment 14 Jan Kara 2012-06-01 15:24:37 UTC
OK, something strange is going on. The blktrace ends like this:
253,2    0     3716    19.264276270   751  Q   R 485490688 + 8 [nfsd]
  9,0    0     3699    19.264279134   751  A   R 527540608 + 8 <- (253,2) 485490688
  9,0    0     3700    19.264280391   751  Q   R 527540608 + 8 [nfsd]
  8,32   0    14793    19.264283254   751  A   R 527540672 + 8 <- (8,33) 527540608
  8,32   0    14794    19.264284162   751  Q   R 527540672 + 8 [nfsd]
  8,32   0    14795    19.264286537   751  G   R 527540672 + 8 [nfsd]
  8,32   0    14796    19.264288073   751  P   N [nfsd]
  8,32   0    14797    19.264290727   751  I   R 527540672 + 8 [nfsd]
  8,32   0    14798    19.264292264   751  U   N [nfsd] 1
  8,32   0    14799    19.264293940   751  D   R 527540672 + 8 [nfsd]

We see a request to read 8 sectors starting at sector 485490688 has been issued to /dev/dm-2. I've checked that against the dumpe2fs output, and indeed that is the block bitmap for group 1852. The request has gone through md-0 to sdc, where it has been dispatched to the disk driver for processing. All is well, but for some reason the read never completes, which is why nfsd hangs.

So this doesn't look like a problem in the filesystem or block layer, but rather somewhere in the driver or hardware. Can you check what happens if you just run:

dd if=/dev/dm-2 of=/dev/null bs=512 count=8 skip=485490688

Does the command finish? And what if you run it after nfsd has hung? If it finishes in both cases, can you gather a blktrace of the hang once more, like before, so that I can check whether we end up waiting for the same block the second time as well? Thanks!
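
For reference, the offset follows from the usual 4 KiB-block layout: 485490688 sectors / 8 sectors per block = block 60686336 = 1852 * 32768, i.e. the first block of group 1852, which (per your dumpe2fs output) is where its block bitmap lives. A sketch for computing the dd offset of any other group, assuming 4 KiB blocks and 32768 blocks per group:

GROUP=1852
BLOCK=$((GROUP * 32768))   # first block of the group (block bitmap area)
SECTOR=$((BLOCK * 8))      # 8 x 512-byte sectors per 4 KiB block
dd if=/dev/dm-2 of=/dev/null bs=512 count=8 skip=$SECTOR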
Comment 15 Florian Pritz 2012-06-01 15:40:50 UTC
Created attachment 73488 [details]
2nd blktrace

(In reply to comment #14)
> dd if=/dev/dm-2 of=/dev/null bs=512 count=8 skip=485490688

Finishes before and after the hang. 

I've attached another blktrace, but I let it run for like 20 seconds after nfsd hung up this time instead of only a few seconds.
Comment 16 Jan Kara 2012-06-01 16:26:59 UTC
Grumble, it's the same, just for a different allocation group:
253,2    0     6514    25.375884394   744  Q   R 707526656 + 8 [nfsd]
  9,0    0     5451    25.375887257   744  A   R 749576576 + 8 <- (253,2) 707526656
  9,0    0     5452    25.375888584   744  Q   R 749576576 + 8 [nfsd]
  8,16   0    17311    25.375891308   744  A   R 749576639 + 8 <- (8,17) 749576576
  8,16   0    17312    25.375892077   744  Q   R 749576639 + 8 [nfsd]
  8,16   0    17313    25.375894381   744  G   R 749576639 + 8 [nfsd]
  8,16   0    17314    25.375895918   744  P   N [nfsd]
  8,16   0    17315    25.375898572   744  I   R 749576639 + 8 [nfsd]
  8,16   0    17316    25.375900038   744  U   N [nfsd] 1
  8,16   0    17317    25.375901715   744  D   R 749576639 + 8 [nfsd]

Let's try the following:
for (( i = 0; i < 25293; i++ )); do
  dd if=/dev/dm-2 of=/dev/null bs=4k count=1 skip=$(($i*32768))
done

Again run it before and after the hang. If it finishes even after the hang, check whether nfsd is still hung.
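
The loop just reads the first 4 KiB block of each of the 25293 block groups (skip is in units of bs, so $i*32768 addresses the first block of group $i). If it turns out to stall, a variant like this might show which group it stops at (only a sketch):

for (( i = 0; i < 25293; i++ )); do
    t0=$(date +%s%N)
    dd if=/dev/dm-2 of=/dev/null bs=4k count=1 skip=$((i * 32768)) 2>/dev/null
    t1=$(date +%s%N)
    # report any group whose first block takes more than a second to read
    if (( t1 - t0 > 1000000000 )); then
        echo "group $i: $(( (t1 - t0) / 1000000 )) ms"
    fi
done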
Comment 17 Florian Pritz 2012-06-03 20:16:10 UTC
It finishes in both cases.

I ran it directly after nfs hung, and when the loop completed, nfs was running again. Normally it takes ~2 minutes until nfs unfreezes, but in this case it was faster.
Comment 18 Jan Kara 2012-06-04 10:19:33 UTC
OK, thanks for testing. So you're saying the problem resolves itself even when you don't run dd, it just takes longer? I'm more and more inclined to blame some hardware strangeness... Also, originally you said the problem started happening around 3.3 or 3.4, but now you are testing with 3.0. So is there any kernel where you don't see the problem?
Comment 19 Florian Pritz 2012-06-06 21:56:31 UTC
(In reply to comment #18)
> OK, thanks for testing. So you're saying the problem resolves itself even when
> you don't run dd, it just takes longer? I'm more and more inclined to blame
> some hardware strangeness... Also, originally you said the problem started
> happening around 3.3 or 3.4, but now you are testing with 3.0. So is there any
> kernel where you don't see the problem?

So far I can *not* reproduce the problem with 3.0.17 (transferred 8 GB of data). However, with 3.0.32 I can reproduce it every time by copying a 20 MB+ file.

If you want, I can test a few kernels in between .17 and .32, but that's going to take some time.

FYI: I do notice short hiccups on the client, but those are most likely just because the disks aren't fast enough to handle the write requests. I'm on Gbit Ethernet, and since writing to disk is slower than the link speed, the network buffers fill up. I don't consider this a bug.
Comment 20 Jan Kara 2012-06-07 09:29:00 UTC
Ah, that's interesting to know! It would be great if you could narrow down in which -stable update the problem got introduced (that should take about 4 tests), and then I can have a look at the patches in that release and see whether something catches my eye.
Comment 21 Florian Pritz 2012-06-22 20:45:13 UTC
Finally got around to testing. Looks like it breaks going from 3.0.22 to .23.

I used the Arch Linux packages for testing and noticed that there were a few config changes [1] around that time. Could those be related? I've already tried booting .23 with transparent_hugepage=never, but that didn't help.

[1]: https://projects.archlinux.org/svntogit/packages.git/log/trunk?h=packages/linux-lts
Comment 22 Jan Kara 2012-06-27 13:49:47 UTC
Hum, I'm not sure how to actually find out what has changed in that package. The reference you gave shows only packaging changes. Looking at the changes between 3.0.22 and 3.0.23, there are these commits:
genirq: Unmask oneshot irqs when thread was not woken
genirq: Handle pending irqs in irq_startup()
which could be related (and one of them actually did cause problems). Those problems should supposedly be fixed in 3.0.24, but maybe they are not.

So can you test 3.0.23 with those two patches reverted? And can you test whether you see the hang with 3.0.24 as well? Thanks!
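
If it helps, the commits should be easy to find and revert in a linux-stable checkout (a sketch; the hashes are whatever git prints for your tree):

git log --oneline v3.0.22..v3.0.23 | grep -i genirq
# note the two hashes, then on a 3.0.23 tree:
git revert <hash1> <hash2>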
Comment 23 Florian Pritz 2012-07-08 14:44:49 UTC
Alright, after bisecting and trying lots of kernels, I believe it's not a code change that's causing this, but rather a change [1] in Arch's config file.

3.0.22 with the ext3 module works fine, but if I apply that patch to use ext4 for ext3 file systems, it hangs. 
3.0.23 with the ext3 module works too.

Since I don't know if it ever worked with that option enabled, I don't think bisecting will help. Any ideas?

[1]: https://projects.archlinux.org/svntogit/packages.git/commit/trunk?h=packages/linux-lts&id=7c53d3c600983af5a021315b1ef27ed6d4807170
Comment 24 Jan Kara 2012-07-10 13:12:45 UTC
Thanks a lot for trying out the kernels. I don't see how using ext4 instead of ext3 could make the difference we observed, i.e. a submitted read request never completing. I guess I'll try to use ext4 to mount an ext3 filesystem and beat it up a bit to see whether I can reproduce the problem locally for further investigation.
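
Probably something along these lines (just a sketch; the device, mount point and write load are placeholders):

mkfs.ext3 /dev/vdb
mount -t ext4 /dev/vdb /mnt/test        # ext4 driver on an ext3 filesystem
for i in $(seq 1 100); do
    dd if=/dev/zero of=/mnt/test/file$i bs=1M count=50 conv=fsync
done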
Comment 25 Florian Pritz 2012-11-02 15:46:21 UTC
I've recently switched my file server to xfs and can no longer reproduce this. Since I seem to have been the only one with this issue, I'm closing this bug.
