Bug 200981

Summary: hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy
Product: File System Reporter: Alexander Y. Fomichev (git.user)
Component: XFSAssignee: FileSystem/XFS Default Virtual Assignee (filesystem_xfs)
Status: NEW ---    
Severity: normal CC: david
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.18.5 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg
config
dmesg showing all locks held in the system
missing sb_end_intwrite: proof of concept

Description Alexander Y. Fomichev 2018-08-30 14:32:35 UTC
Created attachment 278203 [details]
dmesg

kernel: vanilla 4.18.5
gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

More or less reproducible for me using next sequence:

- on host:
  create LV of appropriate size (20g in my case)
  mkfs.xfs -m reflink=1 /dev/data/LV
  mount /dev/data/LV /mnt/
  run kvm VM with qcow2 image (/mnt/disk)    

- inside vm:
  sysbench --test=fileio --file-total-size=9G prepare

- on host:
  cp --reflink=always disk disk.b

- inside vm: 
  sysbench --test=fileio --file-total-size=9G --file-test-mode=seqwr --max-time=6000 --max-requests=0 --threads=16 run

Some time after i/o on /dev/data/LV fall to zero and fs become completely unavailable and then I see a bunch of records:

[ 2580.058205] INFO: task worker:6343 blocked for more than 120 seconds.
[ 2580.064719]       Not tainted 4.18.5 #1
[ 2580.068614] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2580.076496] worker          D    0  6343      1 0x00000000
[ 2580.082034] Call Trace:
[ 2580.084532]  ? __schedule+0x386/0xc50
[ 2580.088248]  ? xlog_grant_head_wait+0xa3/0x3a0
[ 2580.092741]  schedule+0x2f/0x90
[ 2580.095932]  xlog_grant_head_wait+0x53/0x3a0
[ 2580.100256]  xlog_grant_head_check+0xb3/0x160
[ 2580.104662]  xfs_log_reserve+0x108/0x3f0
[ 2580.108682]  xfs_trans_reserve+0x1b4/0x2b0
[ 2580.112948]  xfs_trans_alloc+0xbe/0x220
[ 2580.116952]  xfs_vn_update_time+0xcb/0x2b0
[ 2580.121220]  ? current_time+0x4d/0x90
[ 2580.125047]  file_update_time+0xe0/0x120
[ 2580.129139]  xfs_file_aio_write_checks+0x14f/0x2d0
[ 2580.134099]  xfs_file_dio_aio_write+0xcc/0x420
[ 2580.138715]  xfs_file_write_iter+0x7b/0xa0
[ 2580.142978]  do_iter_readv_writev+0x139/0x190
[ 2580.147502]  do_iter_write+0x7f/0x1c0
[ 2580.151329]  vfs_writev+0x98/0x110
[ 2580.154907]  ? lock_acquire+0x8e/0x230
[ 2580.158823]  ? __fget+0x5/0x200
[ 2580.162131]  ? do_pwritev+0x9c/0xe0
[ 2580.165782]  ? __fget_light+0x51/0x60
[ 2580.169614]  do_pwritev+0x9c/0xe0
[ 2580.173095]  do_syscall_64+0x5a/0x190
[ 2580.176922]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2580.182138] RIP: 0033:0x7fe1937b784a
[ 2580.185836] Code: Bad RIP value.
[ 2580.189239] RSP: 002b:00007fe05e1f5850 EFLAGS: 00000246 ORIG_RAX: 0000000000000128
[ 2580.197058] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 00007fe1937b784a
[ 2580.204361] RDX: 000000000000001e RSI: 0000564bf126f200 RDI: 0000000000000014
[ 2580.211660] RBP: 0000564bf126f200 R08: 0000000000000000 R09: 0000000000000000
[ 2580.218968] R10: 00000000dccf0000 R11: 0000000000000246 R12: 000000000000001e
[ 2580.226265] R13: 00000000dccf0000 R14: 0000564bf13312a0 R15: 00007fe05e9f67a0

Full dmesg attached
Comment 1 Alexander Y. Fomichev 2018-08-30 14:34:16 UTC
Created attachment 278205 [details]
config
Comment 2 Dave Chinner 2018-08-31 01:26:46 UTC
On Thu, Aug 30, 2018 at 02:32:35PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=200981
> 
> kernel: vanilla 4.18.5
> gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
> Copyright (C) 2017 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> 
> More or less reproducible for me using next sequence:
> 
> - on host:
>   create LV of appropriate size (20g in my case)
>   mkfs.xfs -m reflink=1 /dev/data/LV
>   mount /dev/data/LV /mnt/
>   run kvm VM with qcow2 image (/mnt/disk)    
> 
> - inside vm:
>   sysbench --test=fileio --file-total-size=9G prepare
> 
> - on host:
>   cp --reflink=always disk disk.b
> 
> - inside vm: 
>   sysbench --test=fileio --file-total-size=9G --file-test-mode=seqwr
> --max-time=6000 --max-requests=0 --threads=16 run
> 
> Some time after i/o on /dev/data/LV fall to zero and fs become completely
> unavailable and then I see a bunch of records:

The first error is this:

[ 2212.046108] ================================================
[ 2212.051809] WARNING: lock held when returning to user space!
[ 2212.057511] 4.18.5 #1 Not tainted
[ 2212.060864] ------------------------------------------------
[ 2212.066564] worker/6123 is leaving the kernel with locks still held!
[ 2212.072961] 1 lock held by worker/6123:
[ 2212.076835]  #0: 000000009eab4f1b (sb_internal#2){.+.+}, at: xfs_trans_alloc+0x17c/0x220

Which happens 5 minutes before the hung processes start being
reported. Looks like something has gone wrong and an error path has
leaked a transaction.

Can you see if commit dcbd44f79986 ("xfs: fix transaction leak on
remote attr set/remove failure") addresses the problem you are
seeing?

-Dave.
Comment 3 Alexander Y. Fomichev 2018-08-31 07:39:52 UTC
> Can you see if commit dcbd44f79986 ("xfs: fix transaction leak on
remote attr set/remove failure") addresses the problem you are
seeing?

I've just tried but see no difference.
Warning still here and vm become blocked some time after.
( It worth noting I've CONFIG_LOCKDEP turned on )
Comment 4 Dave Chinner 2018-09-01 05:01:17 UTC
Ok, so the obvious fix doesn't help.

Any idea why lockdep is not giving you the full stack trace of where the lock that was leaked to userspace was first accessed? Without that, we have no idea which of the 70-odd calls to xfs_trans_alloc() in the XFS code the lockdep warning refers to. At minimum, we need to know what syscall is triggering the warning so we know what code to look at.

Also, can you reproduce this on older kernels? If not, could you bisect it?

-Dave.
Comment 5 Alexander Y. Fomichev 2018-09-06 05:36:37 UTC
Created attachment 278345 [details]
dmesg showing all locks held in the system
Comment 6 Alexander Y. Fomichev 2018-09-06 05:37:51 UTC
Created attachment 278347 [details]
missing sb_end_intwrite: proof of concept
Comment 7 Alexander Y. Fomichev 2018-09-06 05:39:18 UTC
Ok, it looks like a warning is a result of missing sb_end_intwrite in
xfs_reflink_allocate_cow, namely in a sequence xfs_trans_alloc -> goto retry -> goto convert. With proof-of-concept patch (attached) warning is gone but deadlock is not. Though now I have a list of held locks (dmesg attached).
Did you mean taintad kernel  will not show locks held?
Comment 8 Dave Chinner 2018-09-06 14:25:59 UTC
(In reply to Alexander Y.  Fomichev from comment #7)
> Ok, it looks like a warning is a result of missing y in
> xfs_reflink_allocate_cow, namely in a sequence xfs_trans_alloc -> goto retry
> -> goto convert.

Yup, I see the issue. Nice work finding this!

> With proof-of-concept patch (attached) warning is gone but
> deadlock is not.

The code is not missing a sb_end_intwrite() call - it's missing a xfs_trans_cancel() call. That's why it still deadlocks even though you shut up the warning. I'll have a closer look in the morning to see if there's any other paths that can leak the transaction and write a patch to fix it.

Thanks again for tracking it down, Alexander :)

Cheers,

Dave.
Comment 9 Alexander Y. Fomichev 2018-09-06 16:52:24 UTC
I confirm. With xfs_trans_cancel() lock is gone and works like a charm. Thanks.