Bug 211329 - blkg_alloc memory leak on ppc64le
Summary: blkg_alloc memory leak on ppc64le
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Block Layer
Hardware: PPC-64 Linux
Importance: P1 normal
Assignee: FileSystem/XFS Default Virtual Assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-01-24 23:02 UTC by Cameron Berkenpas
Modified: 2021-01-31 06:44 UTC
CC List: 2 users

See Also:
Kernel Version: >= 5.10.10
Subsystem:
Regression: No
Bisected commit-id:


Description Cameron Berkenpas 2021-01-24 23:02:25 UTC
Not necessarily ppc64le specific, but so far I'm only seeing this on my Talos II system with XFS, and on none of my amd64 systems. 5.10.10 is the only 5.10.x series kernel tested so far. The issue may have started earlier.

These are the only leaks so far with the box up for several hours. Not sure how serious this is.

Specs:
2x 18 Core POWER9
512 GB memory

The XFS filesystems on this system:
df -h / /home
Filesystem      Size  Used Avail Use% Mounted on
/dev/md1        1.8T   59G  1.8T   4% /
/dev/bcache0     14T  634G   14T   5% /home

cat /sys/kernel/debug/kmemleak
unreferenced object 0xc0000011c63af400 (size 512):
  comm "worker", pid 7351, jiffies 4295245272 (age 21394.586s)
  hex dump (first 32 bytes):
    c0 e0 58 3c 00 00 00 c0 08 f4 3a c6 11 00 00 c0  ..X<......:.....
    08 f4 3a c6 11 00 00 c0 18 2a 3a c6 11 00 00 c0  ..:......*:.....
  backtrace:
    [<000000005f1fe84c>] blkg_alloc+0x58/0x260
    [<00000000bb469d61>] blkg_create+0x3b0/0x570
    [<000000007d35bf0d>] bio_associate_blkg_from_css+0x318/0x480
    [<00000000a4cfa6ed>] bio_associate_blkg+0x44/0xb0
    [<0000000014c40666>] cached_dev_submit_bio+0x140/0x1090
    [<000000001e375f40>] submit_bio_noacct+0x12c/0x5e0
    [<000000005d621ecf>] submit_bio+0x5c/0x270
    [<000000000d4d6bf5>] iomap_readahead+0xdc/0x230
    [<00000000b0093137>] xfs_vm_readahead+0x28/0x40
    [<00000000c7837a39>] read_pages+0xcc/0x370
    [<00000000ace2d2cc>] page_cache_ra_unbounded+0x1a4/0x280
    [<00000000ea5f8116>] generic_file_buffered_read+0x4cc/0xbd0
    [<00000000ce5a2b3b>] xfs_file_buffered_aio_read+0x70/0x130
    [<0000000019bddea7>] xfs_file_read_iter+0xa0/0x150
    [<0000000082a5c085>] new_sync_read+0x14c/0x1d0
    [<00000000abee86d0>] vfs_read+0x1a0/0x210
unreferenced object 0xc00000001772a840 (size 64):
  comm "worker", pid 7351, jiffies 4295245272 (age 21394.586s)
  hex dump (first 32 bytes):
    dc 5f 00 00 00 00 00 00 50 97 80 00 00 00 00 c0  ._......P.......
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000006e666d7e>] percpu_ref_init+0x7c/0x150
    [<00000000ac923962>] blkg_alloc+0x84/0x260
    [<00000000bb469d61>] blkg_create+0x3b0/0x570
    [<000000007d35bf0d>] bio_associate_blkg_from_css+0x318/0x480
    [<00000000a4cfa6ed>] bio_associate_blkg+0x44/0xb0
    [<0000000014c40666>] cached_dev_submit_bio+0x140/0x1090
    [<000000001e375f40>] submit_bio_noacct+0x12c/0x5e0
    [<000000005d621ecf>] submit_bio+0x5c/0x270
    [<000000000d4d6bf5>] iomap_readahead+0xdc/0x230
    [<00000000b0093137>] xfs_vm_readahead+0x28/0x40
    [<00000000c7837a39>] read_pages+0xcc/0x370
    [<00000000ace2d2cc>] page_cache_ra_unbounded+0x1a4/0x280
    [<00000000ea5f8116>] generic_file_buffered_read+0x4cc/0xbd0
    [<00000000ce5a2b3b>] xfs_file_buffered_aio_read+0x70/0x130
    [<0000000019bddea7>] xfs_file_read_iter+0xa0/0x150
    [<0000000082a5c085>] new_sync_read+0x14c/0x1d0
Comment 1 Cameron Berkenpas 2021-01-24 23:16:35 UTC
This also has not happened (yet) on my 2nd ppc64le box. Exact same kernel configuration.

Specs:
Raptor CS Blackbird motherboard
1x 8 core POWER9
64 GB of memory

df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb3       470G  149G  321G  32% /
Comment 2 Eric Sandeen 2021-01-25 00:20:21 UTC
At first glance, the allocation in question (blkg_alloc) is in the block/cgroup code, not XFS.

What leads you to believe that this is unique to XFS?
Comment 3 Cameron Berkenpas 2021-01-25 00:38:57 UTC
My poor interpretation of the stacktraces apparently. My bad!

I stopped and started one of the LXC containers and another 2 leaks were detected. Which product/component would it fall under? Virtualization?
Comment 4 Eric Sandeen 2021-01-25 01:56:06 UTC
Depends where the leaks are. Are they all from blkg_alloc? If so, it'd be block layer. I'm not sure which (if any) component is appropriate, perhaps IO/Storage.

-Eric
Comment 5 Cameron Berkenpas 2021-01-25 02:15:54 UTC
Yes, they're all blkg_alloc.
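
A quick way to check this claim against a kmemleak report is to count how many leaked objects have the function anywhere in their backtrace. Note that in the report above the second object's top frame is percpu_ref_init, with blkg_alloc as its caller, so matching only the first frame would miss it. The following is a minimal sketch; the function name count_leaks_by_symbol is hypothetical, and it assumes the report format shown in the description.

```shell
# Hypothetical helper: count leaked objects whose backtrace mentions a symbol.
# $1 = symbol name (e.g. blkg_alloc)
# $2 = report file (defaults to the live kmemleak debugfs file, which needs root)
count_leaks_by_symbol() {
    symbol="$1"
    report="${2:-/sys/kernel/debug/kmemleak}"
    awk -v sym="$symbol" '
        # Each leaked object starts with an "unreferenced object" header line.
        /^unreferenced object/ { total++; seen = 0 }
        # Match "symbol+offset" so blkg_alloc does not match blkg_create.
        index($0, sym "+") && !seen { hits++; seen = 1 }
        END {
            printf "%d of %d leaked objects have %s in their backtrace\n",
                   hits, total, sym
        }
    ' "$report"
}
```

For example, `count_leaks_by_symbol blkg_alloc saved-report.txt` can be run on a saved copy of the report, where saved-report.txt is a placeholder file name.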
Comment 6 Cameron Berkenpas 2021-01-30 16:32:43 UTC
The issue appears to be transient. Leaks are detected, and after a clear and a re-scan, leaks are no longer present. I'm thinking these are false positives.
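
The clear and re-scan described above can be sketched with the kmemleak debugfs commands. One caveat from the kernel's kmemleak documentation: `clear` marks the currently reported objects so that they are not reported again, so an empty report after a clear plus scan only shows that no new suspects appeared. Re-scanning without clearing is the more direct test of whether the original reports were transient. The function name kmemleak_rescan is hypothetical, and the sketch needs root and a mounted debugfs when run against the real file.

```shell
# Hypothetical helper: optionally clear, then trigger a scan and print the report.
# $1 = kmemleak control file (defaults to the debugfs path)
# $2 = "yes" to clear the current suspects first (default: no)
kmemleak_rescan() {
    kmemleak="${1:-/sys/kernel/debug/kmemleak}"
    do_clear="${2:-no}"
    if [ "$do_clear" = "yes" ]; then
        # Marks currently reported objects so they are not reported again.
        echo clear > "$kmemleak" || return 1
    fi
    # Ask kmemleak to scan memory now instead of waiting for the periodic scan.
    echo scan > "$kmemleak" || return 1
    cat "$kmemleak"
}
```

Running `kmemleak_rescan` with no arguments re-scans without clearing; `kmemleak_rescan /sys/kernel/debug/kmemleak yes` reproduces the clear-then-scan sequence from this comment.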
Comment 7 kch 2021-01-31 06:44:35 UTC
It also seems to me that you are using bcache. Did you try to bisect the problem?

Or did you try to reproduce it on a memory-backed null_blk device with XFS, to make sure it is not the device, before coming to the conclusion that XFS is the problem?
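
A rough sketch of such a reproduction attempt follows. It is only an assumption about how the test might be set up: the mount point, device size, and file size are placeholders, and the helper names run and repro_on_null_blk are hypothetical. The null_blk parameters used (nr_devices, memory_backed, gb) are the documented module parameters. The real run is destructive to /dev/nullb0 and needs root, so DRY_RUN=1 prints the commands instead of executing them.

```shell
# With DRY_RUN=1, print each command instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical reproduction: XFS on a memory-backed null_blk device, no bcache.
repro_on_null_blk() {
    mnt="${1:-/mnt/nullb-test}"   # placeholder mount point
    run modprobe null_blk nr_devices=1 memory_backed=1 gb=4 || return 1
    run mkfs.xfs -f /dev/nullb0 || return 1
    run mkdir -p "$mnt" || return 1
    run mount /dev/nullb0 "$mnt" || return 1
    # Write a file, drop the page cache, then read it back so the reads
    # go through the buffered read / readahead path seen in the backtraces.
    run dd if=/dev/zero of="$mnt/testfile" bs=1M count=1024 || return 1
    run sh -c "echo 3 > /proc/sys/vm/drop_caches" || return 1
    run dd if="$mnt/testfile" of=/dev/null bs=1M || return 1
    # Trigger a kmemleak scan and show any suspects.
    run sh -c "echo scan > /sys/kernel/debug/kmemleak" || return 1
    run cat /sys/kernel/debug/kmemleak
}
```

If the blkg_alloc reports still appear with this setup, bcache is less likely to be involved; if they do not, that points back toward the bcache device on /home.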
