Created attachment 192191 [details]
dmesg received via netconsole before the system hang
We are running a ceph cluster on ext4 filesystem, we recently got a hardware failure, the ceph recovery process provokes a huge amount of data write on all our ext4 filesystems (~40 disks).
Now, we are experienced random nodes hang, we catch some partial backtrace (that can be found on the ceph bug tracker). And recently we got the full dmesg log via netconsole (attached to this BZ).
When the freeze occurs, it seems ceph processes lockup all the CPUs, each CPUs backtrace is related to a xattr operation.
bug report on ceph side: http://tracker.ceph.com/issues/13662
We have some nodes on debian and some other on ubuntu, we tried kernels 3.16, 3.19, 4.2.3. The issue occurs with all of them.
Hum, from the stack traces in the log it seems mb_cache_entry_alloc() is racing with other operations on the LRU and restarting all the time. Another possibility is that there are lots of entries in the LRU and it takes a long time to scan (if I remember right ceph is a heavy user of xattrs).
The first problem would be easily fixed by adding cond_resched() at appropriate place, the second problem would require more intrusive changes in how LRU is handled.
Can you reproduce the issue?
Created attachment 192211 [details]
Yes ceph heavy uses xattr
I don't have a 'step by step to reproduce' list. But I have the issue ~4-5 times per day on a ceph cluster since the first incident.
On November 5, 2015 6:02:57 PM GMT+01:00, firstname.lastname@example.org wrote:
>Jan Kara <email@example.com> changed:
> What |Removed |Added
> Status|NEW |NEEDINFO
>--- Comment #1 from Jan Kara <firstname.lastname@example.org> ---
>Hum, from the stack traces in the log it seems mb_cache_entry_alloc()
>with other operations on the LRU and restarting all the time. Another
>possibility is that there are lots of entries in the LRU and it takes a
>time to scan (if I remember right ceph is a heavy user of xattrs).
>The first problem would be easily fixed by adding cond_resched() at
>place, the second problem would require more intrusive changes in how
>Can you reproduce the issue?
>You are receiving this mail because:
>You reported the bug.
If Ceph is using xattrs that are all different for each inode, the mbcache is not really useful at all. So I wonder if we should have a mount option which disables the mbcache. This will prove the scalability of the xattr functions, and should be a pretty big performance win for Ceph.
Don't know if this is still around:
My reading of the mbcache code is that
cache->c_max_entries = bucket_count << 4
is limiting to on average 16 entries per bucket. For ext4 bucket_bits is 10 so 1024 buckets.
Once the cache is full which is very likely situation given ceph use of xattr with differing values all ceph threads will compete for the global mbcache locks on *each* xattr operation.
Am I correct here?
ceph is designed to have one process managing one non shared device the mbcache design is very likely limiting scalability here by having global locks. Plus probably has a locking bug.
Created attachment 192351 [details]
Remove ext4 mbcache
Would this (untested/mechanical) patch be ok to test mbcache removal?
Yeah, that would test what the performance would look like without mbcache. BTW: Among xattrs ceph is using how many of them are the same? Mbcache is a win for the common case where xattrs are mostly used for ACLs or SE Linux labels and thus the reuse is big (mbcache is essentially a deduplication layer for xattrs).
And yes, it is in a need of some updates to meet current scalability demands (I don't think what you hit is a bug as such, rather an inefficiency that becomes lethal at your scale) - other users than ceph occasionally report issues as well. Probably we should track things per-fs, not globally, hook each per-fs mbcache into the shrinker framework and don't introduce artificial upper bounds on the number of entries and instead let natural memory pressure deal with it.
For now I'm not convinced adding a mount option to disable mbcache is the right way to go. Rather we should make it lightweight enough that it doesn't add too much overhead for the cases where it is not needed. With the mount option there is always the trouble that someone has to know it to turn it on/off and sometimes there even isn't a good choice as you can have heavy xattr reuse for some inodes and also quite a few unique xattrs for other inodes...
Created attachment 192491 [details]
xattr -l on random ceph files
I ran xattr on various ceph files on our cluster and there are small ones that look mostly constant (or small value set), but the big ones "user.ceph._" are clearly all differents (beginning common but there's probably hash or something inside near the end).
Note: ceph files are all big in our use case since we use it for rbd, about 4 Mbyte each, so two or three additional xattr blocks per file won't change disk usage in volume for us. For people using ceph as object store and with their own use of xattr the situation could be different.
The most common Ceph xattr is ~270 bytes and always has a unique value. There are also a couple that are a few bytes each (and always the same), but i would expect the little ones get inlined in the inode?
Anyway, it sounds like disabling mbcache would be a win!
Actually if most of inodes have xattrs like this, then the biggest win would probably be to create filesystem with 512-byte inodes where the ~270-byte xattr will fit into the inode as well. Then you save on extra block lookup etc.
WRT disabling mbcache, it seems disabling it would really help your usecase. However I'm not a big fan of mount options for disabling/enabling various functionality in the filesystem as it blows up a test matrix and tends to be a usability issue as well (people don't know the option or don't know when to enable / disable it). So I'd prefer if we just fixed mbcache...
Created attachment 192551 [details]
We have a patch for Lustre ldiskfs (ext4 modified with Lustre patches) to disable mbcache, since it has similar performance impact for Lustre servers, and provides no value because the xattrs Lustre stores on each file are unique and cannot be shared.
Canonical patch location:
While the referenced patch is for RHEL 7, it is small enough to port easily to the upstream kernel.
This patch adds a "no_mbcache" mount option, which Lustre automatically adds to the filesystem options when the servers are mounted. There was a patch to improve mbcache performance in ext4 by making the cache per-sb, but that doesn't improve the contention within a filesystem, and doesn't avoid the fact that mbcache provides absolutely no value for Lustre (or Ceph, it seems). Doing no work at all is better than doing it somewhat more efficiently.
The only way I could see this working automatically is if mbcache disabled itself after having a low/zero cache hit ratio within a certain number of inserts, and if not finding any shared xattr blocks when reading from disk. In the meantime, I think having a mount option is a viable alternative for those people who know better than the auto-detection heuristics.
We restarted all of our cluster machines with ubuntu 3.19 + my ext4 mbcache removal patch: so far no data loss, no new freeze and no visible change in performance (production too noisy to detect small changes).
In response to Jan's concerns in comment #11, I wonder if we can use some hueristics to decide when to use mbcache and when to skip it. So if there are ways that we can identify certain xattr's by name or type as being almost always unique, then if there are any of those xattrs in the external xattr block, there's no point using the mbcache. If we had such a function, we could also use it to bias putting the unique xattrs in the extended inode space, which might increase the chance that the external xattr block is more likely to be shareable.
The final thing I'll note is that because of the mbcache locking, it can turn into a real scalability bottlencheck for ext4, and it's not clear this is a soluble problem. So if you are running on a high-CPU count machine, and you are using xattrs, and (for example) the SELinux xattr is so piggy that it won't fit in your 256-byte inodes, things can get pretty painful.
So I guess there are two separate issues:
1) mbcache scales poorly - that is worth addressing regardless of whether ceph / lustre really need it or not since as you mention there are cases where mbcache helps and scalability is an issue.
2) some usecases do not benefit from mbache much (or at all) and so we could possibly have a heuristic to disable mbcache altogether.
My current feeling is that if mbcache is implemented properly, then the overhead of it is a hash-table insertion / deletion when creating / removing xattr and that should be pretty cheap compared to all the other work we do to create external xattr (although cache line contention could be an issue here to some degree).
I think that regardless of how efficient mbcache can be made, it is not as efficient (both CPU and RAM) as not using it at all in cases where it is not providing any benefit. I'm not against improving the efficiency, but I think it still makes sense to have an option to disable it completely. Since mbcache is already a per-sb cache, having a per-sb mount option makes sense and doesn't interfere with other improvements to mbcache.
Ultimately it is Ted's call but your argument is like (randomly taken out of top of my head): "Proper locking to protect from hole punching costs us some performance and our workload doesn't use hole punching so let's create mount option to disable the locking". Sure it can be done and it will benefit your workload but how much you gain? And how many users need this? This has to be weighted against the cost of the new mount option in terms of testing and usability (and code complexity but that is fairly small in this case so I'm not that concerned).
Should I formally submit the mbcache removal patch?
Jan, your example is a red herring, because the removal of mbcache at most affects the performance and space efficiency, not correctness.
I agree that making mbcache more efficient is a good goal, but I don't think that ot should block the landing of this patch. Rather, it makes sense to land the patch and we can still fix mbcache performance if anyone actually wants to use it. It will also serve as the basis for any heuristic to turn mbcache on and off dynamically instead of (or in addition to) the mount option.
Laurent, I'd be greatful if you pushed the patch upstream for the latest kernel as I'm traveling this week.
Apologies for not responding right away. I'm currently on vacation.
So this also means that if someone sends me a patch right away, I'm
not likely to have time to look at it for a week or two.
As far as trying to make mbcache more scalable, this would be great,
but I suspect it's not going to be very easy because it requires a
global data structure, which is fundamentally hard to scale. I can
imagine some schemes that sacrifice some of mbcache's ability to spot
duplicate xattr sets --- for example, you could just use a trylock,
and if there is lock contention just give up on detecting duplicate
Or maybe we could use some hueristics based on the xattr type/key to
decide whether using mbcache is likely to be successful. So if we
know that Posix ACL's are very likely to be shared, and ceph xattrs
are very unlikely to be shared, this could be used to decide whether
or not to use mbcache automatically, without requiring a mount option.
Even better would be one where for unknonw xattr type/key
combinations, we use some learning algorithm which determines after
using mbcache for N instances of that xattr, if the percentage of
cache hits is too low, we stop using mbcache for that type/key
As far as ceph is concerned, are there reasons for not using 512-byte inodes? In-inode xattrs are generally much faster.
Using 512 byte inodes for Ceph is a really good idea.
Before folks go there, can people give Jan Kara's mbcache rewrite a try?
I'd really appreciate comments and any performance numbers you could give me on your workloads.
I'm making a short review on that issue that Ceph user could trigger.
Regarding the current state, it seems that Jan got his work to be present in the 4.6.x series.
Laurent, Medhi, could you make a try with Jan's patch to confirm that its ok for you too ?
Testing the 4.6-rc1 would be ideal as we could tell Ceph's users then that 4.6 is fixing the issue you had.
Created attachment 212131 [details]
Multiple NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [smbd:8822]
We have a similar situation with Samba4, which also heavily uses xattr to store the NTACLs.
From OCR (may contain errors):
[281537.032878] Call Trace:
[281537.032882] [<ffffffffa0117c7c>] ? mb_cache_entry_get+0x1ac/0x1f0 [mbcache]
[281537.032883] [<ffffffff81208f1f>] ? __find_get_block+0xef/0x120
[281537.032889] [<ffffffffa033de70>] ? ext4_xattr_block_set+0x80/0xa50 [ext4]
[281537.032890] [<ffffffff81208f1f>] ? __find_get_block+0xef/0x120
[281537.032896] [<ffffffffa033d1da>] ? ext4_xattr_set_entrg+0x2a/0x350 [ext4]
[281537.032901] [<ffffffffa033f3e6>] ? ext4_xattr_set_handle+0x376/0x4d0 [ext4]
[281537.032907] [<ffffffffa033f614>] ? ext4_xattr_set+0xd4/0x130 [ext4]
[281537.032909] [<ffffffff811f917e>] ? generic_setxattr+0x6e/0xa0
[281537.032910] [<ffffffff811f9d91>] ? __ufs_setxattr_noperm+0x71/0x1d0
[281537.032912] [<ffffffff811f9fb4>] ? ufs_setxattr+0xc4/0xd0
[281537.032914] [<ffffffff811fa0f4>] ? setxattr+0x134/0x1f0
[281537.032916] [<ffffffff811e1301>] ? filenane_lookup+0x31/0xd0
[281537.032917] [<ffffffff811e50bc>] ? user_path_at_emptg+0x60/0xc0
[281537.032918] [<ffffffff811d7473>] ? __sb_start_urite+0x53/0x100
[281537.032919] [<ffffffff811fa240>] ? path_setxattr+0x90/0xc0
[281537.032921] [<ffffffff811fa314>] ? SgS_setxattr+0x14/0x20
[281537.032922] [<ffffffff8159de72>] ? system_call_fast_compare_end+0x0/0x6b
[281537.032933] Code: f0 0f c1 07 89 02 c1 ea 10 66 39 02 75 01 03 0f b7 f2 b8 00 80 00 00 0f b7 0f 41 89 08 41 31 d0 41 81 e0 f
e ff 00 00 74 10 f3 90 <83> e8 01 75 e7 0f 1f 80 00 00 00 00 eb d9 0f b7 f1 e8 08 75 ff
Phillip, I see you are using some variant of 4.1 kernel. As comment 23 mentions, fixes for the problem have landed in 4.6-rc1. So is it possible for you to test that kernel?
(In reply to Jan Kara from comment #25)
> Phillip, I see you are using some variant of 4.1 kernel. As comment 23
> mentions, fixes for the problem have landed in 4.6-rc1. So is it possible
> for you to test that kernel?
As that problem happend in production, I'm unwilling to install an -rc kernel there.
Currently I don't have the time to try to reproduce it in my development environment, but that may change if that lockup happens again.
Can you be more specific on which change in 4.6-rc1 you're referencing?
(In reply to Philipp Matthias Hahn from comment #26)
> Can you be more specific on which change in 4.6-rc1 you're referencing?
Well, it is a series of changes, commits: