Bug 199589 - Deadlock during memory reclaim path involving sysfs and MD-Raid layers
Summary: Deadlock during memory reclaim path involving sysfs and MD-Raid layers
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: SysFS
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Greg Kroah-Hartman
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-05-02 08:32 UTC by Bruno Faccini
Modified: 2018-05-04 06:55 UTC
CC List: 1 user

See Also:
Kernel Version: 3.x, 4.x
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Patch (2.60 KB, text/plain)
2018-05-02 08:32 UTC, Bruno Faccini

Description Bruno Faccini 2018-05-02 08:32:03 UTC
Created attachment 275721 [details]
Patch

Hello,
We have recently triggered a deadlock with the kernel versions used in RHEL 7.x distros, when running Lustre over MD-Raid devices.
The whole story can be found at https://jira.hpdd.intel.com/browse/LU-10709.
Based on my analysis of live and forced crash dumps, the deadlock scenario can be described as follows.

A user-land thread accesses some tunable in sysfs and, while holding sysfs_mutex, triggers a memory allocation when allocating a new inode via alloc_inode().
Since the inode allocation is done with GFP_KERNEL, the registered memory shrinkers are allowed to run and thus to start new filesystem operations, or to block while doing so because a concurrent thread is already performing the same operation and holds the corresponding lock.
Either way, whichever thread starts a filesystem operation eventually involves the MD-Raid layer and its device-specific service threads. These can in turn block on the already-held sysfs_mutex if an automatic recovery or a manual check has been started for the MD device concerned and its in-progress/completion status needs to be reported through sysfs_notify().
Hence the deadlock: no further operations can be started on the affected device.
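As I understand the crash dumps, the dependency cycle looks roughly as follows (3.10-era function names, call paths heavily abbreviated):

/*
 * Thread A: user-land access to a sysfs tunable
 *   sysfs_lookup()
 *     mutex_lock(&sysfs_mutex)            <- acquires sysfs_mutex
 *     sysfs_get_inode()
 *       iget_locked() -> alloc_inode()    <- GFP_KERNEL allocation
 *         direct reclaim -> FS shrinkers  <- may issue I/O to the MD device
 *           ... ends up waiting on an MD service thread
 *
 * Thread B: MD recovery/check service thread
 *   md_check_recovery() / md_do_sync()
 *     sysfs_notify()                      <- reports progress/completion
 *       mutex_lock(&sysfs_mutex)          <- blocks: held by Thread A
 *
 * Thread A transitively waits on Thread B, while Thread B waits on the
 * sysfs_mutex held by Thread A: deadlock.
 */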

I have been able to get rid of this problem by applying the following patch to the 3.10.x kernel series shipped with RHEL 7.x distros:
===================================================================================================
bfaccini-mac02:crash-master bfaccini$ cat ~/Documents/JIRAs/LU-10709/sysfs_alloc_inode_GFP_NOFS.patch
As part of LU-10709 problem/deadlock analysis, it has been
found that user-land processes intensively using sysfs
can cause a deadlock if, in doing so, memory reclaim is
triggered and, as part of it, FS-specific shrinkers are run
that directly/indirectly involve layers (like MD/Raid)
also relying on sysfs.
To fix this, sysfs inode allocation must no longer use
the generic/GFP_KERNEL way but must be done as GFP_NOFS
to prevent any FS operations from interfering during
possible reclaim.

Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>

--- orig/fs/inode.c	2017-09-09 07:06:42.000000000 +0000
+++ bfi/fs/inode.c	2018-03-14 09:24:48.533380200 +0000
@@ -73,7 +73,7 @@ struct inodes_stat_t inodes_stat;
 static DEFINE_PER_CPU(unsigned int, nr_inodes);
 static DEFINE_PER_CPU(unsigned int, nr_unused);
 
-static struct kmem_cache *inode_cachep __read_mostly;
+struct kmem_cache *inode_cachep __read_mostly;
 
 static int get_nr_inodes(void)
 {
--- orig/fs/sysfs/sysfs.h	2017-09-09 07:06:42.000000000 +0000
+++ bfi/fs/sysfs/sysfs.h	2018-03-14 09:24:48.534380233 +0000
@@ -211,6 +211,8 @@ static inline void __sysfs_put(struct sy
  */
 struct inode *sysfs_get_inode(struct super_block *sb, struct sysfs_dirent *sd);
 void sysfs_evict_inode(struct inode *inode);
+extern struct kmem_cache *inode_cachep;
+struct inode *sysfs_alloc_inode(struct super_block *sb);
 int sysfs_sd_setattr(struct sysfs_dirent *sd, struct iattr *iattr);
 int sysfs_permission(struct inode *inode, int mask);
 int sysfs_setattr(struct dentry *dentry, struct iattr *iattr);
--- orig/fs/sysfs/mount.c	2017-09-09 07:06:42.000000000 +0000
+++ bfi/fs/sysfs/mount.c	2018-03-14 09:24:48.534380233 +0000
@@ -31,6 +31,7 @@ static const struct super_operations sys
 	.statfs		= simple_statfs,
 	.drop_inode	= generic_delete_inode,
 	.evict_inode	= sysfs_evict_inode,
+	.alloc_inode	= sysfs_alloc_inode,
 };
 
 struct sysfs_dirent sysfs_root = {
--- orig/fs/sysfs/inode.c	2017-09-09 07:06:42.000000000 +0000
+++ bfi/fs/sysfs/inode.c	2018-03-14 09:24:48.534380233 +0000
@@ -314,6 +314,17 @@ void sysfs_evict_inode(struct inode *ino
 	sysfs_put(sd);
 }
 
+/*
+ * As a new inode allocation occurs with sysfs_mutex held and memory reclaim
+ * can be triggered doing so, this needs to happen with FS operations disabled
+ * to avoid any deadlock between shrinkers and FS/device layers doing
+ * extensive use of sysfs (like MD/Raid) as part of their operations.
+ */
+struct inode *sysfs_alloc_inode(struct super_block *sb)
+{
+	return kmem_cache_alloc(inode_cachep, GFP_NOFS);
+}
+
 int sysfs_hash_and_remove(struct sysfs_dirent *dir_sd, const void *ns, const char *name)
 {
 	struct sysfs_addrm_cxt acxt;
===================================================================================================
which forces new sysfs inode allocation to be done with GFP_NOFS instead of the GFP_KERNEL used previously.
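For reference, in the 3.10-era include/linux/gfp.h the two masks differ only by the __GFP_FS bit, and it is exactly that bit which keeps direct reclaim out of filesystem (and hence MD/sysfs) code:

#define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_NOFS	(__GFP_WAIT | __GFP_IO)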

After browsing recent 3.x/4.x kernel source code, I believe the problem is still there, but now in kernfs instead of sysfs, since the latter now uses the former's methods internally; the same potential deadlock seems to exist around kernfs_mutex.
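On those newer kernels a fix along the same lines could avoid exporting inode_cachep altogether by using the scoped reclaim API (memalloc_nofs_save()/memalloc_nofs_restore(), available since v4.11) around the inode allocation done under kernfs_mutex. A minimal, untested sketch; kernfs_iget_nofs() is a hypothetical wrapper name:

#include <linux/sched/mm.h>	/* memalloc_nofs_save/restore */
#include "kernfs-internal.h"	/* kernfs_get_inode() */

/*
 * Hypothetical wrapper: every allocation made while the NOFS section is
 * active behaves as GFP_NOFS, so direct reclaim cannot re-enter FS/MD
 * paths that may in turn block on kernfs_mutex.
 */
static struct inode *kernfs_iget_nofs(struct super_block *sb,
				      struct kernfs_node *kn)
{
	unsigned int nofs_flags;
	struct inode *inode;

	nofs_flags = memalloc_nofs_save();
	inode = kernfs_get_inode(sb, kn);
	memalloc_nofs_restore(nofs_flags);

	return inode;
}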

Thanks in advance for any answer/help on this. Best regards,
Bruno (bruno.faccini@intel.com).
Comment 1 Greg Kroah-Hartman 2018-05-02 11:31:49 UTC
On Wed, May 02, 2018 at 08:32:03AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> We have recently triggered a deadlock with the kernel versions used in
> RHEL 7.x distros, when running Lustre over MD-Raid devices.

Please contact Red Hat for RHEL specific questions.  You are paying for
the support, why not use it? :)
Comment 2 Bruno Faccini 2018-05-04 06:55:51 UTC
> Please contact Red Hat for RHEL specific questions.  You are paying for
> the support, why not use it? :)
We do, but the intent of this bug report is to let you know about what we have experienced, since the problem still seems to be present in the current stable/-rc 3.x/4.x branches.
