Bug 218039

Summary: Memleaks in offset_ctx->xa (shmem)
Product: Memory Management Reporter: vladbu
Component: OtherAssignee: Andrew Morton (akpm)
Status: NEW ---    
Severity: normal CC: bagasdotme, bcodding, cel, chuck.lever, regressions
Priority: P3    
Hardware: All   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:

Description vladbu 2023-10-23 07:06:34 UTC
We have been getting memleaks in offset_ctx->xa in our networking tests:

unreferenced object 0xffff8881004cd080 (size 576):
  comm "systemd", pid 1, jiffies 4294893373 (age 1992.864s)
  hex dump (first 32 bytes):
    00 00 06 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    38 5c 7c 02 81 88 ff ff 98 d0 4c 00 81 88 ff ff  8\|.......L.....
  backtrace:
    [<000000000f554608>] xas_alloc+0x306/0x430
    [<0000000075537d52>] xas_create+0x4b4/0xc80
    [<00000000a927aab2>] xas_store+0x73/0x1680
    [<0000000020a61203>] __xa_alloc+0x1d8/0x2d0
    [<00000000ae300af2>] __xa_alloc_cyclic+0xf1/0x310
    [<000000001032332c>] simple_offset_add+0xd8/0x170
    [<0000000073229fad>] shmem_mknod+0xbf/0x180
    [<00000000242520ce>] vfs_mknod+0x3b0/0x5c0
    [<000000001ef218dd>] unix_bind+0x2c2/0xdb0
    [<0000000009b9a8dd>] __sys_bind+0x127/0x1e0
    [<000000003c949fbb>] __x64_sys_bind+0x6e/0xb0
    [<00000000b8a767c7>] do_syscall_64+0x3d/0x90
    [<000000006132ae0d>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

Memleak trace points to some syscall performed by systemd and none of our tests do anything more advanced with it than 'systemctl restart ovs-vswitchd'. Basically it is a setup with Fedora and an upstream kernel that executes bunch of network offload tests with Open vSwitch, iproute2 tc, Linux bridge, etc.

It looks like those may be caused by recent commit 6faddda69f62 ("libfs: Add directory operations for stable offsets") but we don't have a proper reproduction, just sometimes arbitrary getting the memleak complains during/after the regression run.
Comment 1 Bagas Sanjaya 2023-10-24 07:48:04 UTC
(In reply to vladbu from comment #0)
> We have been getting memleaks in offset_ctx->xa in our networking tests:
> 
> unreferenced object 0xffff8881004cd080 (size 576):
>   comm "systemd", pid 1, jiffies 4294893373 (age 1992.864s)
>   hex dump (first 32 bytes):
>     00 00 06 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     38 5c 7c 02 81 88 ff ff 98 d0 4c 00 81 88 ff ff  8\|.......L.....
>   backtrace:
>     [<000000000f554608>] xas_alloc+0x306/0x430
>     [<0000000075537d52>] xas_create+0x4b4/0xc80
>     [<00000000a927aab2>] xas_store+0x73/0x1680
>     [<0000000020a61203>] __xa_alloc+0x1d8/0x2d0
>     [<00000000ae300af2>] __xa_alloc_cyclic+0xf1/0x310
>     [<000000001032332c>] simple_offset_add+0xd8/0x170
>     [<0000000073229fad>] shmem_mknod+0xbf/0x180
>     [<00000000242520ce>] vfs_mknod+0x3b0/0x5c0
>     [<000000001ef218dd>] unix_bind+0x2c2/0xdb0
>     [<0000000009b9a8dd>] __sys_bind+0x127/0x1e0
>     [<000000003c949fbb>] __x64_sys_bind+0x6e/0xb0
>     [<00000000b8a767c7>] do_syscall_64+0x3d/0x90
>     [<000000006132ae0d>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> 
> Memleak trace points to some syscall performed by systemd and none of our
> tests do anything more advanced with it than 'systemctl restart
> ovs-vswitchd'. Basically it is a setup with Fedora and an upstream kernel
> that executes bunch of network offload tests with Open vSwitch, iproute2 tc,
> Linux bridge, etc.
> 
> It looks like those may be caused by recent commit 6faddda69f62 ("libfs: Add
> directory operations for stable offsets") but we don't have a proper
> reproduction, just sometimes arbitrary getting the memleak complains
> during/after the regression run.

Does reverting 6faddda69f62 fix this issue?
Comment 2 vladbu 2023-10-26 10:51:59 UTC
(In reply to Bagas Sanjaya from comment #1)
> (In reply to vladbu from comment #0)
> > We have been getting memleaks in offset_ctx->xa in our networking tests:
> > 
> > unreferenced object 0xffff8881004cd080 (size 576):
> >   comm "systemd", pid 1, jiffies 4294893373 (age 1992.864s)
> >   hex dump (first 32 bytes):
> >     00 00 06 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> >     38 5c 7c 02 81 88 ff ff 98 d0 4c 00 81 88 ff ff  8\|.......L.....
> >   backtrace:
> >     [<000000000f554608>] xas_alloc+0x306/0x430
> >     [<0000000075537d52>] xas_create+0x4b4/0xc80
> >     [<00000000a927aab2>] xas_store+0x73/0x1680
> >     [<0000000020a61203>] __xa_alloc+0x1d8/0x2d0
> >     [<00000000ae300af2>] __xa_alloc_cyclic+0xf1/0x310
> >     [<000000001032332c>] simple_offset_add+0xd8/0x170
> >     [<0000000073229fad>] shmem_mknod+0xbf/0x180
> >     [<00000000242520ce>] vfs_mknod+0x3b0/0x5c0
> >     [<000000001ef218dd>] unix_bind+0x2c2/0xdb0
> >     [<0000000009b9a8dd>] __sys_bind+0x127/0x1e0
> >     [<000000003c949fbb>] __x64_sys_bind+0x6e/0xb0
> >     [<00000000b8a767c7>] do_syscall_64+0x3d/0x90
> >     [<000000006132ae0d>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > 
> > Memleak trace points to some syscall performed by systemd and none of our
> > tests do anything more advanced with it than 'systemctl restart
> > ovs-vswitchd'. Basically it is a setup with Fedora and an upstream kernel
> > that executes bunch of network offload tests with Open vSwitch, iproute2
> tc,
> > Linux bridge, etc.
> > 
> > It looks like those may be caused by recent commit 6faddda69f62 ("libfs:
> Add
> > directory operations for stable offsets") but we don't have a proper
> > reproduction, just sometimes arbitrary getting the memleak complains
> > during/after the regression run.
> 
> Does reverting 6faddda69f62 fix this issue?

Unless I'm missing something neither simple_offset_add() function nor offset_ctx->xa xarray existed before 6faddda69f62, so reverting it would obviously fix the issue.
Comment 3 Chuck Lever 2023-10-31 20:47:16 UTC
Reverting 6faddda69f62 has unwanted side-effects, not the least of which is re-introducing a bug in the way tmpfs handles directory offsets.
Comment 4 Chuck Lever 2023-10-31 20:57:23 UTC
Here is a copy of ovs-vswitchd.service from github:

> [Unit]
> Description=Open vSwitch Forwarding Unit
> After=ovsdb-server.service network-pre.target systemd-udev-settle.service
> Before=network.target network.service
> Requires=ovsdb-server.service
> ReloadPropagatedFrom=ovsdb-server.service
> AssertPathIsReadWrite=/run/openvswitch/db.sock
> PartOf=openvswitch.service
> 
> [Service]
> Type=forking
> PIDFile=/run/openvswitch/ovs-vswitchd.pid
> Restart=on-failure
> Environment=XDG_RUNTIME_DIR=/run/openvswitch
> EnvironmentFile=/etc/openvswitch/default.conf
> EnvironmentFile=-/etc/sysconfig/openvswitch
> EnvironmentFile=-/run/openvswitch.useropts
> LimitSTACK=2M
> @begin_dpdk@
> ExecStartPre=-/bin/sh -c '/usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages'
> ExecStartPre=-/usr/bin/chmod 0775 /dev/hugepages
> @end_dpdk@
> ExecStart=/usr/share/openvswitch/scripts/ovs-ctl \
>           --no-ovsdb-server --no-monitor --system-id=random \
>           ${OVS_USER_OPT} \
>           start $OPTIONS
> ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop
> ExecReload=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server \
>           --no-monitor --system-id=random \
>           ${OVS_USER_OPT} \
>           restart $OPTIONS
> TimeoutSec=300

Vlad, have you tried running any of these ovs-ctl commands individually? My working theory is that one of these is doing something unexpected with a tmpfs file which exposes a missing kfree() in the shmemfs code that 6faddda69f62 converted to use simple directory offsets.
Comment 5 Chuck Lever 2023-11-01 16:17:04 UTC
Ilya Maximets has broken down ovs-ctl even further:

https://mail.openvswitch.org/pipermail/ovs-discuss/2023-November/052783.html

One or more of these individual steps could be the culprit.
Comment 6 vladbu 2023-11-01 18:39:06 UTC
(In reply to Chuck Lever from comment #4)
> Here is a copy of ovs-vswitchd.service from github:
> 
> > [Unit]
> > Description=Open vSwitch Forwarding Unit
> > After=ovsdb-server.service network-pre.target systemd-udev-settle.service
> > Before=network.target network.service
> > Requires=ovsdb-server.service
> > ReloadPropagatedFrom=ovsdb-server.service
> > AssertPathIsReadWrite=/run/openvswitch/db.sock
> > PartOf=openvswitch.service
> > 
> > [Service]
> > Type=forking
> > PIDFile=/run/openvswitch/ovs-vswitchd.pid
> > Restart=on-failure
> > Environment=XDG_RUNTIME_DIR=/run/openvswitch
> > EnvironmentFile=/etc/openvswitch/default.conf
> > EnvironmentFile=-/etc/sysconfig/openvswitch
> > EnvironmentFile=-/run/openvswitch.useropts
> > LimitSTACK=2M
> > @begin_dpdk@
> > ExecStartPre=-/bin/sh -c '/usr/bin/chown :$${OVS_USER_ID##*:}
> /dev/hugepages'
> > ExecStartPre=-/usr/bin/chmod 0775 /dev/hugepages
> > @end_dpdk@
> > ExecStart=/usr/share/openvswitch/scripts/ovs-ctl \
> >           --no-ovsdb-server --no-monitor --system-id=random \
> >           ${OVS_USER_OPT} \
> >           start $OPTIONS
> > ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop
> > ExecReload=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server \
> >           --no-monitor --system-id=random \
> >           ${OVS_USER_OPT} \
> >           restart $OPTIONS
> > TimeoutSec=300
> 
> Vlad, have you tried running any of these ovs-ctl commands individually? My
> working theory is that one of these is doing something unexpected with a
> tmpfs file which exposes a missing kfree() in the shmemfs code that
> 6faddda69f62 converted to use simple directory offsets.

Just starting/restarting/reloading/stopping ovs-vswitchd service doesn't reproduce the issue for me.
Comment 7 Chuck Lever 2023-11-03 14:54:03 UTC
At this point I cannot proceed without a reproducer. Vlad, please continue trying to identify a trigger for this issue. I expect as v6.6 is adopted by leading-edge distros, someone else might encounter this issue and be able to provide a more specific trigger.
Comment 8 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-11-22 09:02:42 UTC
(In reply to Chuck Lever from comment #7)
> At this point I cannot proceed without a reproducer. Vlad, please continue
> trying to identify a trigger for this issue. 

Vlad, any progress?

FWIW, I noticed that Chuck wrote a change that contains a fixes tag pointing to the culprit of this regression ( https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=master&id=796432efab1e372d404e7a71cc6891a53f105051 ). I wonder if that in some way might be relevant for this problem.
Comment 9 vladbu 2023-11-22 09:16:46 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #8)
> (In reply to Chuck Lever from comment #7)
> > At this point I cannot proceed without a reproducer. Vlad, please continue
> > trying to identify a trigger for this issue. 
> 
> Vlad, any progress?

No. As Chuck suggested in his previous comment: 
I expect as v6.6 is adopted by leading-edge distros, someone else might encounter this issue and be able to provide a more specific trigger.
Comment 10 Chuck Lever 2023-11-22 14:34:47 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #8)
> FWIW, I noticed that Chuck wrote a change that contains a fixes tag pointing
> to the culprit of this regression (
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> ?h=master&id=796432efab1e372d404e7a71cc6891a53f105051 ). I wonder if that in
> some way might be relevant for this problem.

That's possible, but not probable. Worth a try, though.