Created attachment 277895 [details] Hang of kernel 4.14.62 The attached file shows backtraces that occur in quick succession and ultimately lead to a complete server hang. The backtraces seem related to inode reclamation. For example the first backtrace: Aug 16 02:33:30 hpmicroserver kernel: INFO: task khugepaged:28 blocked for more than 120 seconds. Aug 16 02:33:30 hpmicroserver kernel: Not tainted 4.14.62-1-lts #1 Aug 16 02:33:30 hpmicroserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 16 02:33:30 hpmicroserver kernel: khugepaged D 0 28 2 0x80000000 Aug 16 02:33:30 hpmicroserver kernel: Call Trace: Aug 16 02:33:30 hpmicroserver kernel: ? __schedule+0x284/0x860 Aug 16 02:33:30 hpmicroserver kernel: schedule+0x28/0x80 Aug 16 02:33:30 hpmicroserver kernel: schedule_timeout+0x292/0x370 Aug 16 02:33:30 hpmicroserver kernel: ? check_preempt_curr+0x62/0x90 Aug 16 02:33:30 hpmicroserver kernel: wait_for_completion+0xaf/0x140 Aug 16 02:33:30 hpmicroserver kernel: ? wake_up_q+0x70/0x70 Aug 16 02:33:30 hpmicroserver kernel: flush_work+0x116/0x1d0 Aug 16 02:33:30 hpmicroserver kernel: ? worker_detach_from_pool+0xa0/0xa0 Aug 16 02:33:30 hpmicroserver kernel: xlog_cil_force_lsn+0x78/0x210 [xfs] Aug 16 02:33:30 hpmicroserver kernel: ? enqueue_task_fair+0x5a/0x500 Aug 16 02:33:30 hpmicroserver kernel: ? native_sched_clock+0x37/0x90 Aug 16 02:33:30 hpmicroserver kernel: ? __switch_to_asm+0x40/0x70 Aug 16 02:33:30 hpmicroserver kernel: _xfs_log_force_lsn+0x71/0x340 [xfs] Aug 16 02:33:30 hpmicroserver kernel: ? try_to_wake_up+0x54/0x4b0 Aug 16 02:33:30 hpmicroserver kernel: ? update_group_capacity+0x27/0x1e0 Aug 16 02:33:30 hpmicroserver kernel: ? xfs_reclaim_inode+0xe3/0x340 [xfs] Aug 16 02:33:30 hpmicroserver kernel: __xfs_iunpin_wait+0xa7/0x160 [xfs] Aug 16 02:33:30 hpmicroserver kernel: ? bit_waitqueue+0x30/0x30 Aug 16 02:33:30 hpmicroserver kernel: xfs_reclaim_inode+0xe3/0x340 [xfs] Aug 16 02:33:30 hpmicroserver kernel: xfs_reclaim_inodes_ag+0x1b1/0x300 [xfs] Aug 16 02:33:30 hpmicroserver kernel: xfs_reclaim_inodes_nr+0x31/0x40 [xfs] Aug 16 02:33:30 hpmicroserver kernel: super_cache_scan+0x152/0x1a0 Aug 16 02:33:30 hpmicroserver kernel: shrink_slab.part.45+0x1e8/0x3c0 Aug 16 02:33:30 hpmicroserver kernel: shrink_node+0x123/0x310 Aug 16 02:33:30 hpmicroserver kernel: do_try_to_free_pages+0xc3/0x330 Aug 16 02:33:30 hpmicroserver kernel: try_to_free_pages+0xf4/0x1b0 Aug 16 02:33:30 hpmicroserver kernel: __alloc_pages_slowpath+0x3e4/0xd80 Aug 16 02:33:30 hpmicroserver kernel: ? __switch_to+0x170/0x4b0 Aug 16 02:33:30 hpmicroserver kernel: ? __switch_to_asm+0x34/0x70 Aug 16 02:33:30 hpmicroserver kernel: ? __switch_to_asm+0x34/0x70 Aug 16 02:33:30 hpmicroserver kernel: ? __switch_to_asm+0x40/0x70 Aug 16 02:33:30 hpmicroserver kernel: __alloc_pages_nodemask+0x226/0x240 Aug 16 02:33:30 hpmicroserver kernel: khugepaged_alloc_page+0x17/0x50 Aug 16 02:33:30 hpmicroserver kernel: khugepaged+0xbcb/0x2120 Aug 16 02:33:30 hpmicroserver kernel: ? wait_woken+0x80/0x80 Aug 16 02:33:30 hpmicroserver kernel: ? collapse_shmem+0xb90/0xb90 Aug 16 02:33:30 hpmicroserver kernel: kthread+0x119/0x130 Aug 16 02:33:30 hpmicroserver kernel: ? __kthread_parkme+0xa0/0xa0 Aug 16 02:33:30 hpmicroserver kernel: ret_from_fork+0x22/0x40 The problem has occurred twice so far. Once in kernel 4.14.52 and once in 4.14.62. I am using the Arch Linux LTS kernel so it is more or less the vanilla kernel. The server has two XFS filesystems. One is a 30TB XFS/LUKS/RAID setup, the other one is a plain 4TB XFS volume.
Created attachment 277897 [details] Hang of kernel 4.14.52
Memory reclaim is stuck waiting for log forces to complete. The log force is stuck in both cases waiting for iclog IO completion here: Aug 04 04:33:45 hpmicroserver kernel: INFO: task kworker/1:2:30142 blocked for more than 120 seconds. Aug 04 04:33:45 hpmicroserver kernel: Not tainted 4.14.52-1-lts #1 Aug 04 04:33:45 hpmicroserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 04 04:33:45 hpmicroserver kernel: kworker/1:2 D 0 30142 2 0x80000000 Aug 04 04:33:45 hpmicroserver kernel: Workqueue: xfs-cil/dm-4 xlog_cil_push_work [xfs] Aug 04 04:33:45 hpmicroserver kernel: Call Trace: Aug 04 04:33:45 hpmicroserver kernel: ? __schedule+0x284/0x870 Aug 04 04:33:45 hpmicroserver kernel: ? wake_up_q+0x70/0x70 Aug 04 04:33:45 hpmicroserver kernel: schedule+0x28/0x80 Aug 04 04:33:45 hpmicroserver kernel: xlog_state_get_iclog_space+0x105/0x2c0 [xfs] Aug 04 04:33:45 hpmicroserver kernel: ? wake_up_q+0x70/0x70 Aug 04 04:33:45 hpmicroserver kernel: xlog_write+0x163/0x6e0 [xfs] Aug 04 04:33:45 hpmicroserver kernel: xlog_cil_push+0x21d/0x3c0 [xfs] Aug 04 04:33:45 hpmicroserver kernel: process_one_work+0x1e9/0x3b0 Aug 04 04:33:45 hpmicroserver kernel: worker_thread+0x21f/0x3f0 Aug 04 04:33:45 hpmicroserver kernel: ? pwq_unbound_release_workfn+0xc0/0xc0 Aug 04 04:33:45 hpmicroserver kernel: kthread+0x119/0x130 Aug 04 04:33:45 hpmicroserver kernel: ? __kthread_parkme+0xa0/0xa0 Aug 04 04:33:45 hpmicroserver kernel: ret_from_fork+0x22/0x40 This indicates the log force is in progress, and it is waiting for log buffer space to become available to continue writing the in-memory journal to disk. This means that all log buffers are currently under IO, and the journal is waiting for IO completion to occur before it can issue more IO to the journal to make further progress. This typically occurs when there is an IO problem in the layers below XFS. Without seeing the full logs, or knowing any other details about your system, hardware or storage I can't say any more than this. -Dave.
Hello Dave Thanks for the reply. I will upload a complete journald leading to one of the crashes. It seems the problem now occurs on a daily basis. The trigger is the deletion of old backups. I have a directory structure of this kind: Backups Machine1 Day1 Day2 ... Machine2 Day1 Day2 ... The files in the daily directories are hardlinked, so a minimum amount of space is wasted. However subdirectories still consume inodes, so lots of them are in use. [root@hpmicroserver ~]# df -i Filesystem Inodes IUsed IFree IUse% Mounted on /dev/mapper/halvar 390701504 12805954 377895550 4% /mnt/halvar /dev/mapper/faxe 2929890048 24276605 2905613443 1% /mnt/faxe The only unusual thing I had to configure is "libata.force=noncq" since the SATA controller had otherwise problems when talking to the SSD. Any help (maybe a workaround) is appreciated.
Created attachment 277905 [details] journald output leading to the 4.14.62 hang The output covers everything from booting the server to the XFS backtraces. The delete_backups.sh script is called 3 minutes before the server is stuck. There was no other activity on the server.
Ok, so the hung task warnings should up 3m30s after the delete script starts, then there's a second, smaller set almost exactly 120s after the first which is a repeat of some of the warnings from the first set that had not resolved themselves. The thing I note is that the log push that is "hung" waiting for journal buffer space is reported in the first set of warnings but not the second set, andthe second set only contains 2 tasks, not the 7 that are in the first set. Further, I note that kcryptd (i.e. dm-crypt) is one of the tasks that is hung, so there's an encrypted filesystem configured - is it the XFS filesystem files are being deleted from? Finally, after the second set of warnings, there are no more warnings, so whatever is occurred is temporary and the filesystem is not actually hung. i.e. there's no direct evidence in that trace that there was a complete system hang. However, there is evidence of a potential problem if your XFS filesystem is hosted on dm-crypt volumes. i.e. this: Aug 16 02:33:30 hpmicroserver kernel: Workqueue: kcryptd kcryptd_crypt [dm_crypt] Aug 16 02:33:30 hpmicroserver kernel: Call Trace: Aug 16 02:33:30 hpmicroserver kernel: ? __schedule+0x284/0x860 Aug 16 02:33:30 hpmicroserver kernel: schedule+0x28/0x80 Aug 16 02:33:30 hpmicroserver kernel: schedule_timeout+0x292/0x370 Aug 16 02:33:30 hpmicroserver kernel: ? check_preempt_curr+0x62/0x90 Aug 16 02:33:30 hpmicroserver kernel: wait_for_completion+0xaf/0x140 Aug 16 02:33:30 hpmicroserver kernel: ? wake_up_q+0x70/0x70 Aug 16 02:33:30 hpmicroserver kernel: flush_work+0x116/0x1d0 Aug 16 02:33:30 hpmicroserver kernel: ? worker_detach_from_pool+0xa0/0xa0 Aug 16 02:33:30 hpmicroserver kernel: xlog_cil_force_lsn+0x78/0x210 [xfs] Aug 16 02:33:30 hpmicroserver kernel: _xfs_log_force_lsn+0x71/0x340 [xfs] Aug 16 02:33:30 hpmicroserver kernel: ? xfs_reclaim_inode+0xe3/0x340 [xfs] Aug 16 02:33:30 hpmicroserver kernel: __xfs_iunpin_wait+0xa7/0x160 [xfs] Aug 16 02:33:30 hpmicroserver kernel: ? bit_waitqueue+0x30/0x30 Aug 16 02:33:30 hpmicroserver kernel: xfs_reclaim_inode+0xe3/0x340 [xfs] Aug 16 02:33:30 hpmicroserver kernel: xfs_reclaim_inodes_ag+0x1b1/0x300 [xfs] Aug 16 02:33:30 hpmicroserver kernel: xfs_reclaim_inodes_nr+0x31/0x40 [xfs] Aug 16 02:33:30 hpmicroserver kernel: super_cache_scan+0x152/0x1a0 Aug 16 02:33:30 hpmicroserver kernel: shrink_slab.part.45+0x1e8/0x3c0 Aug 16 02:33:30 hpmicroserver kernel: shrink_node+0x123/0x310 Aug 16 02:33:30 hpmicroserver kernel: do_try_to_free_pages+0xc3/0x330 Aug 16 02:33:30 hpmicroserver kernel: try_to_free_pages+0xf4/0x1b0 Aug 16 02:33:30 hpmicroserver kernel: __alloc_pages_slowpath+0x3e4/0xd80 Aug 16 02:33:30 hpmicroserver kernel: __alloc_pages_nodemask+0x226/0x240 Aug 16 02:33:30 hpmicroserver kernel: new_slab+0x2f3/0x620 Aug 16 02:33:30 hpmicroserver kernel: ___slab_alloc+0x322/0x4a0 Aug 16 02:33:30 hpmicroserver kernel: ? __alloc_pages_slowpath+0xd4d/0xd80 Aug 16 02:33:30 hpmicroserver kernel: ? init_crypt+0x7f/0xd0 [xts] Aug 16 02:33:30 hpmicroserver kernel: __slab_alloc+0x1c/0x30 Aug 16 02:33:30 hpmicroserver kernel: __kmalloc+0x18e/0x1f0 Aug 16 02:33:30 hpmicroserver kernel: init_crypt+0x7f/0xd0 [xts] Aug 16 02:33:30 hpmicroserver kernel: encrypt+0x15/0x20 [xts] Aug 16 02:33:30 hpmicroserver kernel: crypt_convert+0x954/0xec0 [dm_crypt] Aug 16 02:33:30 hpmicroserver kernel: ? bio_alloc_bioset+0x132/0x1e0 Aug 16 02:33:30 hpmicroserver kernel: kcryptd_crypt+0x2b8/0x370 [dm_crypt] Aug 16 02:33:30 hpmicroserver kernel: process_one_work+0x1e9/0x3b0 Aug 16 02:33:30 hpmicroserver kernel: worker_thread+0x2b/0x3f0 Aug 16 02:33:30 hpmicroserver kernel: ? pwq_unbound_release_workfn+0xc0/0xc0 Aug 16 02:33:30 hpmicroserver kernel: kthread+0x119/0x130 Aug 16 02:33:30 hpmicroserver kernel: ? __kthread_parkme+0xa0/0xa0 Au This appears to be a potential deadlock via incorrect memory allocation contexts in dm-crypt. i.e. the crypto code it uses is doing GFP_KERNEL allocations while setting up the encryption context which allows it to get stuck in a filesystem that can't make progress until the encryption completes. . i.e. the dm-crypt/crypto allocation context should probably be GFP_NOIO to prevent memory reclaim recursion into contexts that might be already be dependent on dm-crypt making progress (i.e. filesystems).... This isn't really looking like an XFS issue at this point.... -Dave.
Hello Dave You are right! Both XFS filesystems are on top of dm-crypt: * 30TB: XFS/dm-crypt/dm-raid * 4TB: XFS/dm-crypt So everytime a hang occurred, dm-crypt was definitely involved. Do you know any dm-crypt developers that could be CCed so that the investigation can continue? Or should I change Product/Component of the bug? Regards, Peter.
I'll poke some dm people here @ red hat. Thanks, -Eric
(In reply to Dave Chinner from comment #5) > Finally, after the second set of warnings, there are no more warnings, so > whatever is occurred is temporary and the filesystem is not actually hung. > i.e. there's no direct evidence in that trace that there was a complete > system hang. However, there is evidence of a potential problem if your XFS > filesystem is hosted on dm-crypt volumes. > > i.e. this: > > Aug 16 02:33:30 hpmicroserver kernel: Workqueue: kcryptd kcryptd_crypt > [dm_crypt] > Aug 16 02:33:30 hpmicroserver kernel: Call Trace: ... > Aug 16 02:33:30 hpmicroserver kernel: ? init_crypt+0x7f/0xd0 [xts] > Aug 16 02:33:30 hpmicroserver kernel: __slab_alloc+0x1c/0x30 > Aug 16 02:33:30 hpmicroserver kernel: __kmalloc+0x18e/0x1f0 > Aug 16 02:33:30 hpmicroserver kernel: init_crypt+0x7f/0xd0 [xts] > Aug 16 02:33:30 hpmicroserver kernel: encrypt+0x15/0x20 [xts] > Aug 16 02:33:30 hpmicroserver kernel: crypt_convert+0x954/0xec0 [dm_crypt] > Aug 16 02:33:30 hpmicroserver kernel: ? bio_alloc_bioset+0x132/0x1e0 > Aug 16 02:33:30 hpmicroserver kernel: kcryptd_crypt+0x2b8/0x370 [dm_crypt] > Aug 16 02:33:30 hpmicroserver kernel: process_one_work+0x1e9/0x3b0 > Aug 16 02:33:30 hpmicroserver kernel: worker_thread+0x2b/0x3f0 > Aug 16 02:33:30 hpmicroserver kernel: ? pwq_unbound_release_workfn+0xc0/0xc0 > Aug 16 02:33:30 hpmicroserver kernel: kthread+0x119/0x130 > Aug 16 02:33:30 hpmicroserver kernel: ? __kthread_parkme+0xa0/0xa0 > Au > > This appears to be a potential deadlock via incorrect memory allocation > contexts in dm-crypt. i.e. the crypto code it uses is doing GFP_KERNEL > allocations while setting up the encryption context which allows it to get > stuck in a filesystem that can't make progress until the encryption > completes. . i.e. the dm-crypt/crypto allocation context should probably be > GFP_NOIO to prevent memory reclaim recursion into contexts that might be > already be dependent on dm-crypt making progress (i.e. filesystems).... So problematic call chain is: crypt_convert -> encrypt -> init_crypt -> __kmalloc crypto:xts.c:encrypt is: static int encrypt(struct skcipher_request *req) { return do_encrypt(req, init_crypt(req, encrypt_done)); } There are no gfp flags passed in. SO yes, to be able to work for all callers init_crypt() should be changed from GFP_KERNEL to GFO_NOIO. init_crypt() does the allocation with: gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ? GFP_KERNEL : GFP_ATOMIC; So another option would be for DM-crypt to _not_ set CRYPTO_TFM_REQ_MAY_SLEEP in struct skcipher_request *req's base.flags (I wonder if it defaults to setting it?) The crypto code gets opaque quite quickly.. not yet sure where the relevant dm-crypt code is that would be able to ensure CRYPTO_TFM_REQ_MAY_SLEEP is _not_ set in skcipher_request req->base.flags In any case, it really does seem to make sense to change xts.c:init_crypt() to use GDP_NOIO instead of GFP_KERNEL.
(In reply to Mike Snitzer from comment #8) > init_crypt() does the allocation with: > gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ? GFP_KERNEL : > GFP_ATOMIC; > > So another option would be for DM-crypt to _not_ set > CRYPTO_TFM_REQ_MAY_SLEEP in struct skcipher_request *req's base.flags (I > wonder if it defaults to setting it?) > > The crypto code gets opaque quite quickly.. not yet sure where the relevant > dm-crypt code is that would be able to ensure CRYPTO_TFM_REQ_MAY_SLEEP is > _not_ set in skcipher_request req->base.flags drivers/md/dm-crypt.c:crypt_alloc_req_skcipher() is calling skcipher_request_set_callback() with CRYPTO_TFM_REQ_MAY_SLEEP set. But there are quite a few other modes that also set CRYPTO_TFM_REQ_MAY_SLEEP. And most crypto code uses GFP_KERNEL if that is set. In any case, it would seem commit f1c131b45410a ("crypto: xts - Convert to skcipher") is what ushered in the use of GFP_KERNEL underneath dm-crypt's IO path. > In any case, it really does seem to make sense to change xts.c:init_crypt() > to use GDP_NOIO instead of GFP_KERNEL. I still think switching to GFP_NOIO here needs further thought and review by the crypto/XTS developers.
Mike, perhaps another flag like CRYPTO_TFM_REQ_BLKIO could be added to tell the crypto subsystem that it's operating under GFP_NOIO allocation constraints? I mean, it obviously can sleep in the kcryptd context, but being able to sleep doesn't imply anything about the operation's allowable memory allocation context. I guess that Herbert Xu is the person to talk to here - ah, you've already cc'd him. :) -Dave.
Created attachment 278027 [details] a patch for this bug A patch that adds memalloc_noio_save/memalloc_noio_restore around the encryption code.
(In reply to Mikulas Patocka from comment #11) > Created attachment 278027 [details] > a patch for this bug > > A patch that adds memalloc_noio_save/memalloc_noio_restore around the > encryption code. Very good idea Mikulas. I had forgotten about memalloc_noio_save/memalloc_noio_restore Certainly ideal to avoid having to so deeply train the crypto code about block's forward-progress requirements.
(In reply to Mike Snitzer from comment #12) > (In reply to Mikulas Patocka from comment #11) > > Created attachment 278027 [details] > > a patch for this bug > > > > A patch that adds memalloc_noio_save/memalloc_noio_restore around the > > encryption code. > > Very good idea Mikulas. I had forgotten about > memalloc_noio_save/memalloc_noio_restore > > Certainly ideal to avoid having to so deeply train the crypto code about > block's forward-progress requirements. But, as we discussed, this only works if the cipher code doesn't offload encryption to another thread. So needs more thought and coordination with Herbert, etc.
This issue should be fixed with the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.19&id=432061b3da64e488be3403124a72a9250bbe96d4 I'll be sending it to Linus for inclusion in either 4.19-rc3 or 4.19-rc4 Please feel free to test it.
@Mike Snitzer Could you please also provide a patch that can be applied against 4.18? I'd prefer not to run my server on a 4.19 prerelease version. This part of the patch currently fails in 4.18: --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -332,7 +332,7 @@ static int crypt_iv_essiv_init(struct crypt_config *cc) int err; desc->tfm = essiv->hash_tfm; - desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP; + desc->flags = 0; err = crypto_shash_digest(desc, cc->key, cc->key_size, essiv->salt); shash_desc_zero(desc); Can it be simply skipped (the assignment to desc->flags in function crypt_iv_essiv_init() does not exist in 4.18)? Regards, Peter.
(In reply to Peter Klotz from comment #15) > @Mike Snitzer > > Could you please also provide a patch that can be applied against 4.18? > I'd prefer not to run my server on a 4.19 prerelease version. > > This part of the patch currently fails in 4.18: > > --- a/drivers/md/dm-crypt.c > +++ b/drivers/md/dm-crypt.c > @@ -332,7 +332,7 @@ static int crypt_iv_essiv_init(struct crypt_config *cc) > int err; > > desc->tfm = essiv->hash_tfm; > - desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP; > + desc->flags = 0; > > err = crypto_shash_digest(desc, cc->key, cc->key_size, essiv->salt); > shash_desc_zero(desc); > > Can it be simply skipped (the assignment to desc->flags in function > crypt_iv_essiv_init() does not exist in 4.18)? > > Regards, Peter. You just need to remove all assignments that set CRYPTO_TFM_REQ_MAY_SLEEP