Bug 107301

Summary: system hang during ext4 xattr operation
Product: File System Reporter: Mehdi Abaakouk (sileht)
Component: ext4Assignee: fs_ext4 (fs_ext4)
Status: NEEDINFO ---    
Severity: high CC: adilger.kernelbugzilla, agruen, erwan, jack, laurent, pmhahn, sage, szg00000, tytso
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.2.3 3.19 3.16 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg received via netconsole before the system hang
attachment-4900-0.html
Remove ext4 mbcache
xattr -l on random ceph files
add "no_mbcache
Multiple NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [smbd:8822]

Description Mehdi Abaakouk 2015-11-05 13:54:24 UTC
Created attachment 192191 [details]
dmesg received via netconsole before the system hang

Hi,

We are running a ceph cluster on ext4 filesystem, we recently got a hardware failure, the ceph recovery process provokes a huge amount of data write on all our ext4 filesystems (~40 disks).

Now, we are experienced random nodes hang, we catch some partial backtrace (that can be found on the ceph bug tracker). And recently we got the full dmesg log via netconsole (attached to this BZ).

When the freeze occurs, it seems ceph processes lockup all the CPUs, each CPUs backtrace is related to a xattr operation. 

bug report on ceph side: http://tracker.ceph.com/issues/13662

We have some nodes on debian and some other on ubuntu, we tried kernels 3.16, 3.19, 4.2.3. The issue occurs with all of them. 

Cheers,
Comment 1 Jan Kara 2015-11-05 17:02:57 UTC
Hum, from the stack traces in the log it seems mb_cache_entry_alloc() is racing with other operations on the LRU and restarting all the time. Another possibility is that there are lots of entries in the LRU and it takes a long time to scan (if I remember right ceph is a heavy user of xattrs).

The first problem would be easily fixed by adding cond_resched() at appropriate place, the second problem would require more intrusive changes in how LRU is handled.

Can you reproduce the issue?
Comment 2 Mehdi Abaakouk 2015-11-05 17:41:54 UTC
Created attachment 192211 [details]
attachment-4900-0.html

Yes ceph heavy uses xattr

I don't have a 'step by step to reproduce' list. But I have the issue ~4-5 times per day on a ceph cluster since the first incident.

On November 5, 2015 6:02:57 PM GMT+01:00, bugzilla-daemon@bugzilla.kernel.org wrote:
>https://bugzilla.kernel.org/show_bug.cgi?id=107301
>
>Jan Kara <jack@suse.cz> changed:
>
>           What    |Removed                     |Added
>----------------------------------------------------------------------------
>             Status|NEW                         |NEEDINFO
>
>--- Comment #1 from Jan Kara <jack@suse.cz> ---
>Hum, from the stack traces in the log it seems mb_cache_entry_alloc()
>is racing
>with other operations on the LRU and restarting all the time. Another
>possibility is that there are lots of entries in the LRU and it takes a
>long
>time to scan (if I remember right ceph is a heavy user of xattrs).
>
>The first problem would be easily fixed by adding cond_resched() at
>appropriate
>place, the second problem would require more intrusive changes in how
>LRU is
>handled.
>
>Can you reproduce the issue?
>
>-- 
>You are receiving this mail because:
>You reported the bug.
Comment 3 Theodore Tso 2015-11-06 02:43:17 UTC
If Ceph is using xattrs that are all different for each inode, the mbcache is not really useful at all.   So I wonder if we should have a mount option which disables the mbcache.   This will prove the scalability of the xattr functions, and should be a pretty big performance win for Ceph.
Comment 5 Laurent GUERBY 2015-11-07 15:24:34 UTC
My reading of the mbcache code is that

cache->c_max_entries = bucket_count << 4

is limiting to on average 16 entries per bucket. For ext4 bucket_bits is 10 so 1024 buckets.

Once the cache is full which is very likely situation given ceph use of xattr with differing values all ceph threads will compete for the global mbcache locks on *each* xattr operation.

Am I correct here?

ceph is designed to have one process managing one non shared device the mbcache design is very likely limiting scalability here by having global locks. Plus probably has a locking bug.
Comment 6 Laurent GUERBY 2015-11-07 17:58:55 UTC
Created attachment 192351 [details]
Remove ext4 mbcache

Would this (untested/mechanical) patch be ok to test mbcache removal?
Comment 7 Jan Kara 2015-11-09 08:50:58 UTC
Yeah, that would test what the performance would look like without mbcache. BTW: Among xattrs ceph is using how many of them are the same? Mbcache is a win for the common case where xattrs are mostly used for ACLs or SE Linux labels and thus the reuse is big (mbcache is essentially a deduplication layer for xattrs). 

And yes, it is in a need of some updates to meet current scalability demands (I don't think what you hit is a bug as such, rather an inefficiency that becomes lethal at your scale) - other users than ceph occasionally report issues as well. Probably we should track things per-fs, not globally, hook each per-fs mbcache into the shrinker framework and don't introduce artificial upper bounds on the number of entries and instead let natural memory pressure deal with it.

For now I'm not convinced adding a mount option to disable mbcache is the right way to go. Rather we should make it lightweight enough that it doesn't add too much overhead for the cases where it is not needed. With the mount option there is always the trouble that someone has to know it to turn it on/off and sometimes there even isn't a good choice as you can have heavy xattr reuse for some inodes and also quite a few unique xattrs for other inodes...
Comment 8 Laurent GUERBY 2015-11-09 09:24:04 UTC
Created attachment 192491 [details]
xattr -l on random ceph files

I ran xattr on various ceph files on our cluster and there are small ones that look mostly constant (or small value set), but the big ones "user.ceph._" are clearly all differents (beginning common but there's probably hash or something inside near the end).
Comment 9 Laurent GUERBY 2015-11-09 10:11:07 UTC
Note: ceph files are all big in our use case since we use it for rbd, about 4 Mbyte each, so two or three additional xattr blocks per file won't change disk usage in volume for us. For people using ceph as object store and with their own use of xattr the situation could be different.
Comment 10 Sage Weil 2015-11-09 14:31:37 UTC
The most common Ceph xattr is ~270 bytes and always has a unique value.  There are also a couple that are a few bytes each (and always the same), but i would expect the little ones get inlined in the inode?

Anyway, it sounds like disabling mbcache would be a win!
Comment 11 Jan Kara 2015-11-09 15:22:23 UTC
Actually if most of inodes have xattrs like this, then the biggest win would probably be to create filesystem with 512-byte inodes where the ~270-byte xattr will fit into the inode as well. Then you save on extra block lookup etc.

WRT disabling mbcache, it seems disabling it would really help your usecase. However I'm not a big fan of mount options for disabling/enabling various functionality in the filesystem as it blows up a test matrix and tends to be a usability issue as well (people don't know the option or don't know when to enable / disable it). So I'd prefer if we just fixed mbcache...
Comment 12 Andreas Dilger 2015-11-09 18:13:16 UTC
Created attachment 192551 [details]
add "no_mbcache

We have a patch for Lustre ldiskfs (ext4 modified with Lustre patches) to disable mbcache, since it has similar performance impact for Lustre servers, and provides no value because the xattrs Lustre stores on each file are unique and cannot be shared.  

Canonical patch location:
http://git.hpdd.intel.com/fs/lustre-release.git/blob_plain/HEAD:/ldiskfs/kernel_patches/patches/rhel7/ext4-disable-mb-cache.patch

While the referenced patch is for RHEL 7, it is small enough to port easily to the upstream kernel.

This patch adds a "no_mbcache" mount option, which Lustre automatically adds to the filesystem options when the servers are mounted.  There was a patch to improve mbcache performance in ext4 by making the cache per-sb, but that doesn't improve the contention within a filesystem, and doesn't avoid the fact that mbcache provides absolutely no value for Lustre (or Ceph, it seems).  Doing no work at all is better than doing it somewhat more efficiently.

The only way I could see this working automatically is if mbcache disabled itself after having a low/zero cache hit ratio within a certain number of inserts, and if not finding any shared xattr blocks when reading from disk.  In the meantime, I think having a mount option is a viable alternative for those people who know better than the auto-detection heuristics.
Comment 13 Laurent GUERBY 2015-11-10 11:16:37 UTC
We restarted all of our cluster machines with ubuntu 3.19 + my ext4 mbcache removal patch: so far no data loss, no new freeze and no visible change in performance (production too noisy to detect small changes).
Comment 14 Theodore Tso 2015-11-11 02:37:59 UTC
In response to Jan's concerns in comment #11, I wonder if we can use some hueristics to decide when to use mbcache and when to skip it.   So if there are ways that we can identify certain xattr's by name or type as being almost always unique, then if there are any of those xattrs in the external xattr block, there's no point using the mbcache.   If we had such a function, we could also use it to bias putting the unique xattrs in the extended inode space, which might increase the chance that the external xattr block is more likely to be shareable.

The final thing I'll note is that because of the mbcache locking, it can turn into a real scalability bottlencheck for ext4, and it's not clear this is a soluble problem.   So if you are running on a high-CPU count machine, and you are using xattrs, and (for example) the SELinux xattr is so piggy that it won't fit in your 256-byte inodes, things can get pretty painful.
Comment 15 Jan Kara 2015-11-11 10:28:17 UTC
So I guess there are two separate issues:
1) mbcache scales poorly - that is worth addressing regardless of whether ceph / lustre really need it or not since as you mention there are cases where mbcache helps and scalability is an issue.
2) some usecases do not benefit from mbache much (or at all) and so we could possibly have a heuristic to disable mbcache altogether.

My current feeling is that if mbcache is implemented properly, then the overhead of it is a hash-table insertion / deletion when creating / removing xattr and that should be pretty cheap compared to all the other work we do to create external xattr (although cache line contention could be an issue here to some degree).
Comment 16 Andreas Dilger 2015-11-11 11:37:33 UTC
I think that regardless of how efficient mbcache can be made, it is not as efficient (both CPU and RAM) as not using it at all in cases where it is not providing any benefit.  I'm not against improving the efficiency, but I think it still makes sense to have an option to disable it completely. Since mbcache is already a per-sb cache, having a per-sb mount option makes sense and doesn't interfere with other improvements to mbcache.
Comment 17 Jan Kara 2015-11-11 14:53:10 UTC
Ultimately it is Ted's call but your argument is like (randomly taken out of top of my head): "Proper locking to protect from hole punching costs us some performance and our workload doesn't use hole punching so let's create mount option to disable the locking". Sure it can be done and it will benefit your workload but how much you gain? And how many users need this? This has to be weighted against the cost of the new mount option in terms of testing and usability (and code complexity but that is fairly small in this case so I'm not that concerned).
Comment 18 Laurent GUERBY 2015-11-18 20:48:49 UTC
Should I formally submit the mbcache removal patch?
Comment 19 Andreas Dilger 2015-11-19 00:49:16 UTC
Jan, your example is a red herring, because the removal of mbcache at most affects the performance and space efficiency, not correctness. 

I agree that making mbcache more efficient is a good goal, but I don't think that ot should block the landing of this patch. Rather, it makes sense to land the patch and we can still fix mbcache performance if anyone actually wants to use it. It will also serve as the basis for any heuristic to turn mbcache on and off dynamically instead of (or in addition to) the mount option.

Laurent, I'd be greatful if you pushed the patch upstream for the latest kernel as I'm traveling this week.
Comment 20 Theodore Tso 2015-11-19 15:11:22 UTC
Apologies for not responding right away.  I'm currently on vacation.
So this also means that if someone sends me a patch right away, I'm
not likely to have time to look at it for a week or two.

As far as trying to make mbcache more scalable, this would be great,
but I suspect it's not going to be very easy because it requires a
global data structure, which is fundamentally hard to scale.  I can
imagine some schemes that sacrifice some of mbcache's ability to spot
duplicate xattr sets --- for example, you could just use a trylock,
and if there is lock contention just give up on detecting duplicate
xattrs.

Or maybe we could use some hueristics based on the xattr type/key to
decide whether using mbcache is likely to be successful.  So if we
know that Posix ACL's are very likely to be shared, and ceph xattrs
are very unlikely to be shared, this could be used to decide whether
or not to use mbcache automatically, without requiring a mount option.

Even better would be one where for unknonw xattr type/key
combinations, we use some learning algorithm which determines after
using mbcache for N instances of that xattr, if the percentage of
cache hits is too low, we stop using mbcache for that type/key
combination.
Comment 21 Andreas Gruenbacher 2015-12-10 02:51:52 UTC
As far as ceph is concerned, are there reasons for not using 512-byte inodes? In-inode xattrs are generally much faster.
Comment 22 Theodore Tso 2015-12-10 16:51:43 UTC
Using 512 byte inodes for Ceph is a really good idea.

Before folks go there, can people give Jan Kara's mbcache rewrite a try?

http://thread.gmane.org/gmane.comp.file-systems.ext4/51094

I'd really appreciate comments and any performance numbers you could give me on your workloads.

Thanks!!
Comment 23 Erwan Velu 2016-03-31 13:53:44 UTC
I'm making a short review on that issue that Ceph user could trigger.

Regarding the current state, it seems that Jan got his work to be present in the 4.6.x series.

Laurent, Medhi, could you make a try with Jan's patch to confirm that its ok for you too ?

Testing the 4.6-rc1 would be ideal as we could tell Ceph's users then that 4.6 is fixing the issue you had.

Thanks,
Comment 24 Philipp Matthias Hahn 2016-04-08 10:09:28 UTC
Created attachment 212131 [details]
Multiple NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [smbd:8822]

We have a similar situation with Samba4, which also heavily uses xattr to store the NTACLs.

From OCR (may contain errors):

[281537.032878] Call Trace:
[281537.032882]  [<ffffffffa0117c7c>] ? mb_cache_entry_get+0x1ac/0x1f0 [mbcache]
[281537.032883]  [<ffffffff81208f1f>] ? __find_get_block+0xef/0x120
[281537.032889]  [<ffffffffa033de70>] ? ext4_xattr_block_set+0x80/0xa50 [ext4]
[281537.032890]  [<ffffffff81208f1f>] ? __find_get_block+0xef/0x120
[281537.032896]  [<ffffffffa033d1da>] ? ext4_xattr_set_entrg+0x2a/0x350 [ext4]
[281537.032901]  [<ffffffffa033f3e6>] ? ext4_xattr_set_handle+0x376/0x4d0 [ext4]
[281537.032907]  [<ffffffffa033f614>] ? ext4_xattr_set+0xd4/0x130 [ext4]
[281537.032909]  [<ffffffff811f917e>] ? generic_setxattr+0x6e/0xa0
[281537.032910]  [<ffffffff811f9d91>] ? __ufs_setxattr_noperm+0x71/0x1d0
[281537.032912]  [<ffffffff811f9fb4>] ? ufs_setxattr+0xc4/0xd0
[281537.032914]  [<ffffffff811fa0f4>] ? setxattr+0x134/0x1f0
[281537.032916]  [<ffffffff811e1301>] ? filenane_lookup+0x31/0xd0
[281537.032917]  [<ffffffff811e50bc>] ? user_path_at_emptg+0x60/0xc0
[281537.032918]  [<ffffffff811d7473>] ? __sb_start_urite+0x53/0x100
[281537.032919]  [<ffffffff811fa240>] ? path_setxattr+0x90/0xc0
[281537.032921]  [<ffffffff811fa314>] ? SgS_setxattr+0x14/0x20
[281537.032922]  [<ffffffff8159de72>] ? system_call_fast_compare_end+0x0/0x6b
[281537.032933] Code: f0 0f c1 07 89 02 c1 ea 10 66 39 02 75 01 03 0f b7 f2 b8 00 80 00 00 0f b7 0f 41 89 08 41 31 d0 41 81 e0 f
e ff 00 00 74 10 f3 90 <83> e8 01 75 e7 0f 1f 80 00 00 00 00 eb d9 0f b7 f1 e8 08 75 ff
Comment 25 Jan Kara 2016-04-08 20:46:12 UTC
Phillip, I see you are using some variant of 4.1 kernel. As comment 23 mentions, fixes for the problem have landed in 4.6-rc1. So is it possible for you to test that kernel?
Comment 26 Philipp Matthias Hahn 2016-04-12 12:52:24 UTC
(In reply to Jan Kara from comment #25)
> Phillip, I see you are using some variant of 4.1 kernel. As comment 23
> mentions, fixes for the problem have landed in 4.6-rc1. So is it possible
> for you to test that kernel?

As that problem happend in production, I'm unwilling to install an -rc kernel there.
Currently I don't have the time to try to reproduce it in my development environment, but that may change if that lockup happens again.
Can you be more specific on which change in 4.6-rc1 you're referencing?
Comment 27 Jan Kara 2016-04-13 09:07:35 UTC
(In reply to Philipp Matthias Hahn from comment #26)
> Can you be more specific on which change in 4.6-rc1 you're referencing?

Well, it is a series of changes, commits:
f9a61eb4e2471c56a63cd804c7474128138c38ac
82939d7999dfc1f1998c4b1c12e2f19edbdff272
be0726d33cb8f411945884664924bed3cb8c70ee
ecd1e64412d5242b8afdef58a714bab3c5464f79
c2f3140fe2eceb3a6c1615b2648b9471544881c6
f0c8b46238db9d51ef9ea0858259958d0c601cec
7a2508e1b657cfc7e1371550f88c7a7bc4288f32
2335d05f3a83f5290ec28c1ed30c1c742a37edc9
dc8d5e565f00c9442fa1cbf9acc115475628527c
3fd164629d25b04f291a79a013dcc7ce1a301269
6048c64b26097a0ffbd966866b599f990e674e9b