The commit 0f12156dff2862ac54235fc72703f18770769042 ("memcg: enable accounting for file lock caches") added memcg accounting for the file lock related caches; however, it turned out to cause a performance regression [1]: the synthetic benchmark will-it-scale drops ~34% in throughput. Therefore this accounting was reverted [1]. At the same time, unaccounted file locks may be abused to evade memcg limits [2]. I'm filing this bug to resolve both the performance issue and the missing accounting (or at least to collect and track the available info).

I ran the attached test script and got the following data:

kernel  cgroup  metric       std      rel. improvement [%]
k1      0       3.05892e+07  7812.89   -7.22589
k1      1       2.50059e+07  9194.39  -24.15951
k2      0       3.29717e+07  21999.1    0.00000  (baseline)
k2      1       32943691     13125.1   -0.08495

The metric is taken directly from the will-it-scale lock1 report (higher is better).

k2 = 5.18.0-2.g3352b92-default (openSUSE TW stable kernel)
k1 = 5.18.0-202205261821.g76c743f-default (ditto with revert of 3754707bcc3)
cgroup = 0 means the benchmark runs in the root memcg
cgroup = 1 means the benchmark runs in a memcg of depth 1 (child of root)

My system had 48 CPUs and the benchmark ran in 24 parallel instances (that's less parallelism than in the kernel test robot's report [1], so my measured relative drop may be less pronounced).

[1] https://lore.kernel.org/lkml/20210907150757.GE17617@xsang-OptiPlex-9020/
[2] https://github.com/kata-containers/kata-containers/issues/3373
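For context, the change in question is tiny -- the commit essentially adds SLAB_ACCOUNT to the creation flags of the two file lock caches in fs/locks.c. A paraphrased sketch from memory of the commit (not a verbatim quote):

  /* fs/locks.c -- gist of 0f12156dff28: SLAB_ACCOUNT makes every
   * allocation from these caches charged to the allocator's memcg */
  static int __init filelock_init(void)
  {
          flctx_cache = kmem_cache_create("file_lock_ctx",
                          sizeof(struct file_lock_context), 0,
                          SLAB_PANIC | SLAB_ACCOUNT, NULL);

          filelock_cache = kmem_cache_create("file_lock_cache",
                          sizeof(struct file_lock), 0,
                          SLAB_PANIC | SLAB_ACCOUNT, NULL);
          /* ... */
  }

So the regression comes entirely from the generic per-object accounting machinery that SLAB_ACCOUNT opts into, not from anything specific to file locks.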
Created attachment 301062 [details]
flamegraph of baseline measurement, 5.18.0-2.g3352b92-default, cgroup=0
Created attachment 301063 [details]
flamegraph of accounted measurement, 5.18.0-202205261821.g76c743f-default, cgroup=0

Some quick insights:
- locks_alloc_lock is not inlined in the patched kernel (that's weird with just the SLAB_ACCOUNT flag added; it seems a different compiler version sneaked into my rebuild)
- get_obj_cgroup_from_current adds ~2% of samples; some more time is unattributed to the new functions
- no obvious contention on global locks
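For reference, a simplified sketch of what get_obj_cgroup_from_current does in the ~v5.18 hot path (paraphrased from memory of mm/memcontrol.c, details elided) -- every accounted allocation pays for this lookup:

  /* Simplified sketch, not verbatim v5.18 code: walk from the current
   * task's memcg towards the root to find a live objcg, under RCU,
   * taking a reference on the one that is found. */
  struct obj_cgroup *get_obj_cgroup_from_current(void)
  {
          struct obj_cgroup *objcg = NULL;
          struct mem_cgroup *memcg;

          rcu_read_lock();
          memcg = mem_cgroup_from_task(current);
          for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
                  objcg = rcu_dereference(memcg->objcg);
                  if (objcg && obj_cgroup_tryget(objcg))
                          break;
                  objcg = NULL;
          }
          rcu_read_unlock();

          return objcg;
  }

That matches the profile: the cost is the repeated per-allocation lookup and refcounting, not contention.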
Created attachment 301064 [details]
the wrapper of the will-it-scale test referred to in comment 0
(In reply to Michal Koutný from comment #2)
> - locks_alloc_lock is not inlined in the patched kernel (that's weird with
> just the SLAB_ACCOUNT flag added; it seems a different compiler version
> sneaked into my rebuild)

Fixed baseline with the same compiler:

k3      0       32307750     8484.23    0.00000

Disabling kmem accounting on the kernel cmdline eliminates the regression, as expected:

k1      0 nokmem 3.25532e+07 15937.9    0.75972

One more observation: the regression reported in comment 0 with cgroup=0 is caused by systemd transiently enabling the memory controller (and hence unsealing memcg_kmem_enabled_key); in theory a nokmem-like result should be achievable with the patched kernel too.

Curated perf-report entries with the patched kernel and a 1-level memcg (overall locks_alloc_lock goes up +12%, cf. the -24% regression):

children  self
   4.97%  4.66%  [kernel.vmlinux]  [k] mod_objcg_state
   3.58%  1.78%  [kernel.vmlinux]  [k] get_obj_cgroup_from_current
   1.82%  1.54%  [kernel.vmlinux]  [k] obj_cgroup_charge
   1.67%  1.52%  [kernel.vmlinux]  [k] refill_obj_stock
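To illustrate why nokmem (and a never-enabled memory controller) avoid the cost: the per-object accounting is gated by a static branch, roughly like this (sketch of the upstream pattern; maybe_charge is a hypothetical helper for illustration):

  /* With cgroup.memory=nokmem, or as long as no non-root memcg was
   * ever created, memcg_kmem_enabled_key stays disabled and
   * SLAB_ACCOUNT allocations take the plain fast path. Once systemd
   * enables the memory controller the key is flipped and (in v5.18)
   * is never flipped back, even after the controller is disabled. */
  DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);

  static inline bool memcg_kmem_enabled(void)
  {
          return static_branch_likely(&memcg_kmem_enabled_key);
  }

  static inline struct obj_cgroup *maybe_charge(struct kmem_cache *s)
  {
          /* hypothetical helper, for illustration only */
          if (!memcg_kmem_enabled() || !(s->flags & SLAB_ACCOUNT))
                  return NULL;                    /* no accounting overhead */
          return get_obj_cgroup_from_current();   /* the costly path */
  }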
Created attachment 301095 [details]
memcontrol: Cache objcg in task_struct

Quick idea of caching the objcg pointer in task_struct in order to save its repeated evaluation.

> k3  0  32307750     8484.23    0.00000   // baseline
> k5  0  3.08505e+07  10538     -4.5105    // acct+cache
> k5  1  2.51642e+07  12502.2  -22.111
> k4  0  30653597     22081     -5.1200    // acct
> k4  1  2.49577e+07  37243    -22.75000

The improvement is visible (above noise) but nothing spectacular. The crucial question (wrt this particular file lock cache) is how much this regression manifests with real workloads (contemplation suggests it's amortized by many other operations).
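The gist of the idea, as a minimal sketch (not the attached patch itself; objcg_cache and current_objcg_cached are made-up names here, and the invalidation on cgroup migration/exit is left out):

  /* Minimal sketch of the caching idea under those assumptions. */
  struct task_struct {
          /* ... existing fields ... */
          struct obj_cgroup *objcg_cache;   /* hypothetical field name */
  };

  static inline struct obj_cgroup *current_objcg_cached(void)
  {
          struct obj_cgroup *objcg = READ_ONCE(current->objcg_cache);

          /* Fast path: reuse the previously resolved objcg if still live. */
          if (objcg && obj_cgroup_tryget(objcg))
                  return objcg;

          /* Slow path: full memcg-tree lookup, cache the result. */
          objcg = get_obj_cgroup_from_current();
          WRITE_ONCE(current->objcg_cache, objcg);
          return objcg;
  }

This trades the per-allocation tree walk for a single tryget in the common case, which is consistent with the modest (~1-2 pp) improvement measured above.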