Bug 217366

Summary: large pause when opening a file descriptor which is a power of 2
Product: Linux
Reporter: Patrick Hemmer (phemmer+kernel)
Component: Kernel
Assignee: Virtual assignee for kernel bugs (linux-kernel)
Status: NEW
Severity: normal
Flags: mricon: bugbot+
Priority: P3
Hardware: All
OS: Linux
Kernel Version: 6.2.8
Subsystem: FILESYSTEMS (VFS and infrastructure)
Regression: No
Bisected commit-id:
Attachments: issue demonstration code

Description Patrick Hemmer 2023-04-24 23:16:37 UTC
Created attachment 304186 [details]
issue demonstration code

When running a threaded program, and opening a file descriptor that is a power of 2 (starting at 64), the call takes a very long time to complete. Normally such a call takes less than 2us. However with this issue, I've seen the call take up to around 50ms. Additionally this only happens the first time, and not subsequent times that file descriptor is used. I'm guessing there might be some expansion of some internal data structures going on. But I cannot see why this process would take so long.

Attached is a simple program that can reproduce the issue. The following is example output:

...
FD=62 duration=2695
FD=63 duration=2726
FD=64 duration=12565293
FD=64 duration=3837 (second attempt)
FD=65 duration=1523
FD=66 duration=1533
...
FD=126 duration=1463
FD=127 duration=1402
FD=128 duration=24755063
FD=128 duration=3186 (second attempt)
FD=129 duration=1433
FD=130 duration=1413
...
FD=254 duration=1363
FD=255 duration=1493
FD=256 duration=7602777
FD=256 duration=1573 (second attempt)
FD=257 duration=1392
FD=258 duration=1363
...

I've reproduced the issue on systems running multiple different 4.19 kernels, and on a system running 6.2.8.
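
The attachment itself is not reproduced here; a minimal sketch of this kind of reproducer (hypothetical code, not the attached program; durations print in nanoseconds, which matches the magnitudes above) would look like:

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void *idle_thread(void *arg)
{
	(void)arg;
	pause();	/* a second thread is what makes the program "threaded" */
	return NULL;
}

int main(void)
{
	pthread_t tid;
	pthread_create(&tid, NULL, idle_thread, NULL);

	for (int i = 0; i < 300; i++) {
		struct timespec a, b;

		clock_gettime(CLOCK_MONOTONIC, &a);
		int fd = open("/dev/null", O_RDONLY);
		clock_gettime(CLOCK_MONOTONIC, &b);

		long ns = (b.tv_sec - a.tv_sec) * 1000000000L +
			  (b.tv_nsec - a.tv_nsec);
		printf("FD=%d duration=%ld\n", fd, ns);
		/* descriptors are left open on purpose so the table keeps growing */
	}
	return 0;
}

Build with cc -O2 -pthread.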
Comment 1 Bugspray Bot 2023-04-26 19:19:40 UTC
Matthew Wilcox <willy@infradead.org> writes:

On Wed, Apr 26, 2023 at 05:58:06PM +0000, Kernel.org Bugbot wrote:
> When running a threaded program, and opening a file descriptor that
> is a power of 2 (starting at 64), the call takes a very long time to
> complete. Normally such a call takes less than 2us. However with this
> issue, I've seen the call take up to around 50ms. Additionally this only
> happens the first time, and not subsequent times that file descriptor is
> used. I'm guessing there might be some expansion of some internal data
> structures going on. But I cannot see why this process would take so long.

Because we allocate a new block of memory and then memcpy() the old
block of memory into it.  This isn't surprising behaviour to me.
I don't think there's much we can do to change it (Allocating a
segmented array of file descriptors has previously been vetoed by
people who have programs with a million file descriptors).  Is it
causing you problems?

(via https://msgid.link/ZEl34WthS8UNJnNd@casper.infradead.org)
Comment 2 Bugspray Bot 2023-04-26 19:52:22 UTC
Al Viro <viro@zeniv.linux.org.uk> replies to comment #1:

On Wed, Apr 26, 2023 at 08:13:37PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 26, 2023 at 05:58:06PM +0000, Kernel.org Bugbot wrote:
> > When running a threaded program, and opening a file descriptor that
> > is a power of 2 (starting at 64), the call takes a very long time to
> > complete. Normally such a call takes less than 2us. However with this
> > issue, I've seen the call take up to around 50ms. Additionally this only
> > happens the first time, and not subsequent times that file descriptor is
> > used. I'm guessing there might be some expansion of some internal data
> > structures going on. But I cannot see why this process would take so long.
> 
> Because we allocate a new block of memory and then memcpy() the old
> block of memory into it.  This isn't surprising behaviour to me.
> I don't think there's much we can do to change it (Allocating a
> segmented array of file descriptors has previously been vetoed by
> people who have programs with a million file descriptors).  Is it
> causing you problems?

FWIW, I suspect that this is not so much allocation + memcpy.
        /* make sure all fd_install() have seen resize_in_progress
	 * or have finished their rcu_read_lock_sched() section.
	 */
	if (atomic_read(&files->count) > 1)
		synchronize_rcu();

in expand_fdtable() is a likelier source of delays.
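
For context, that check sits in the middle of the resize path. A condensed paraphrase of expand_fdtable() (not verbatim fs/file.c; details vary between kernel versions):

static int expand_fdtable(struct files_struct *files, unsigned int nr)
{
	struct fdtable *new_fdt, *cur_fdt;

	spin_unlock(&files->file_lock);
	new_fdt = alloc_fdtable(nr);		/* allocate the larger table */

	/* make sure all fd_install() have seen resize_in_progress
	 * or have finished their rcu_read_lock_sched() section.
	 */
	if (atomic_read(&files->count) > 1)
		synchronize_rcu();		/* the multi-millisecond wait */

	spin_lock(&files->file_lock);
	if (!new_fdt)
		return -ENOMEM;

	cur_fdt = files_fdtable(files);
	copy_fdtable(new_fdt, cur_fdt);		/* the memcpy() step */
	rcu_assign_pointer(files->fdt, new_fdt);
	if (cur_fdt != &files->fdtab)
		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
	/* coupled with smp_rmb() in fd_install() */
	smp_wmb();
	return 1;
}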

(via https://msgid.link/20230426194628.GU3390869@ZenIV)
Comment 3 Bugspray Bot 2023-04-26 19:57:52 UTC
Matthew Wilcox <willy@infradead.org> replies to comment #2:

On Wed, Apr 26, 2023 at 08:46:28PM +0100, Al Viro wrote:
> On Wed, Apr 26, 2023 at 08:13:37PM +0100, Matthew Wilcox wrote:
> > On Wed, Apr 26, 2023 at 05:58:06PM +0000, Kernel.org Bugbot wrote:
> > > When running a threaded program, and opening a file descriptor that
> > > is a power of 2 (starting at 64), the call takes a very long time to
> > > complete. Normally such a call takes less than 2us. However with this
> > > issue, I've seen the call take up to around 50ms. Additionally this only
> > > happens the first time, and not subsequent times that file descriptor is
> > > used. I'm guessing there might be some expansion of some internal data
> > > structures going on. But I cannot see why this process would take so
> long.
> > 
> > Because we allocate a new block of memory and then memcpy() the old
> > block of memory into it.  This isn't surprising behaviour to me.
> > I don't think there's much we can do to change it (Allocating a
> > segmented array of file descriptors has previously been vetoed by
> > people who have programs with a million file descriptors).  Is it
> > causing you problems?
> 
> FWIW, I suspect that this is not so much allocation + memcpy.
>         /* make sure all fd_install() have seen resize_in_progress
>        * or have finished their rcu_read_lock_sched() section.
>        */
>       if (atomic_read(&files->count) > 1)
>               synchronize_rcu();
> 
> in expand_fdtable() is a likelier source of delays.

Perhaps?  The delay seemed to be roughly doubling with the test program,
so I assumed it was primarily the memcpy() cost for the reporter's
system:

FD=64 duration=12565293
FD=128 duration=24755063
FD=256 duration=7602777

... although now I've pasted it, I see my brain skipped one digit, so
256 was faster than 64, not about twice as slow as 128.

(via https://msgid.link/ZEmB1bOJFz72K0ho@casper.infradead.org)
Comment 4 Bugspray Bot 2023-04-26 20:03:22 UTC
Al Viro <viro@zeniv.linux.org.uk> replies to comment #2:

On Wed, Apr 26, 2023 at 08:46:28PM +0100, Al Viro wrote:
> On Wed, Apr 26, 2023 at 08:13:37PM +0100, Matthew Wilcox wrote:
> > On Wed, Apr 26, 2023 at 05:58:06PM +0000, Kernel.org Bugbot wrote:
> > > When running a threaded program, and opening a file descriptor that
> > > is a power of 2 (starting at 64), the call takes a very long time to
> > > complete. Normally such a call takes less than 2us. However with this
> > > issue, I've seen the call take up to around 50ms. Additionally this only
> > > happens the first time, and not subsequent times that file descriptor is
> > > used. I'm guessing there might be some expansion of some internal data
> > > structures going on. But I cannot see why this process would take so
> long.
> > 
> > Because we allocate a new block of memory and then memcpy() the old
> > block of memory into it.  This isn't surprising behaviour to me.
> > I don't think there's much we can do to change it (Allocating a
> > segmented array of file descriptors has previously been vetoed by
> > people who have programs with a million file descriptors).  Is it
> > causing you problems?
> 
> FWIW, I suspect that this is not so much allocation + memcpy.
>         /* make sure all fd_install() have seen resize_in_progress
>        * or have finished their rcu_read_lock_sched() section.
>        */
>       if (atomic_read(&files->count) > 1)
>               synchronize_rcu();
> 
> in expand_fdtable() is a likelier source of delays.

	A bit more background: we want to avoid grabbing ->file_lock in
fd_install() if at all possible.  After all, we have already claimed
the slot (back when we'd allocated the descriptor) and nobody else is
allowed to shove a file reference there.

	Which is fine, except for the fact that expansion of the descriptor
table needs to allocate a new array, copy the old one into it and replace
the old one with it.

	A lockless fd_install() might overlap with the "copy" step in
the above, and the file reference it stores could end up getting lost.  So
in fd_install() we check whether a resize is in progress before deciding to
go for the lockless path.  Which means that on the resize side we need to
mark the descriptor table as getting resized, then wait long enough for
all threads already in fd_install() to get through.
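
In code terms, the install side looks roughly like this (a condensed paraphrase of fd_install(), not the verbatim kernel source):

void fd_install(unsigned int fd, struct file *file)
{
	struct files_struct *files = current->files;
	struct fdtable *fdt;

	rcu_read_lock_sched();
	if (unlikely(files->resize_in_progress)) {
		/* a resize is running: fall back to the locked slow path */
		rcu_read_unlock_sched();
		spin_lock(&files->file_lock);
		fdt = files_fdtable(files);
		rcu_assign_pointer(fdt->fd[fd], file);
		spin_unlock(&files->file_lock);
		return;
	}
	/* fast path: the slot is already ours, no lock needed */
	smp_rmb();	/* pairs with smp_wmb() after the table swap */
	fdt = rcu_dereference_sched(files->fdt);
	rcu_assign_pointer(fdt->fd[fd], file);
	rcu_read_unlock_sched();
}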

(via https://msgid.link/20230426195821.GV3390869@ZenIV)
Comment 5 Bugspray Bot 2023-04-26 20:36:06 UTC
Al Viro <viro@zeniv.linux.org.uk> replies to comment #3:

On Wed, Apr 26, 2023 at 08:56:05PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 26, 2023 at 08:46:28PM +0100, Al Viro wrote:
> > On Wed, Apr 26, 2023 at 08:13:37PM +0100, Matthew Wilcox wrote:
> > > On Wed, Apr 26, 2023 at 05:58:06PM +0000, Kernel.org Bugbot wrote:
> > > > When running a threaded program, and opening a file descriptor that
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	This.  The cost of allocation and copy doesn't depend upon the number
of threads; synchronize_rcu() call, OTOH, is conditional upon that being more
than 1.

> > > > is a power of 2 (starting at 64), the call takes a very long time to
> > > > complete. Normally such a call takes less than 2us. However with this
> > > > issue, I've seen the call take up to around 50ms. Additionally this
> only
> > > > happens the first time, and not subsequent times that file descriptor
> is
> > > > used. I'm guessing there might be some expansion of some internal data
> > > > structures going on. But I cannot see why this process would take so
> long.
> > > 
> > > Because we allocate a new block of memory and then memcpy() the old
> > > block of memory into it.  This isn't surprising behaviour to me.
> > > I don't think there's much we can do to change it (Allocating a
> > > segmented array of file descriptors has previously been vetoed by
> > > people who have programs with a million file descriptors).  Is it
> > > causing you problems?
> > 
> > FWIW, I suspect that this is not so much allocation + memcpy.
> >         /* make sure all fd_install() have seen resize_in_progress
> >      * or have finished their rcu_read_lock_sched() section.
> >      */
> >     if (atomic_read(&files->count) > 1)
> >             synchronize_rcu();
> > 
> > in expand_fdtable() is a likelier source of delays.
> 
> Perhaps?  The delay seemed to be roughly doubling with the test program,
> so I assumed it was primarily the memcpy() cost for the reporter's
> system:
> 
> FD=64 duration=12565293
> FD=128 duration=24755063
> FD=256 duration=7602777
> 
> ... although now I've pasted it, I see my brain skipped one digit, so
> 256 was faster than 64, not about twice as slow as 128.

(via https://msgid.link/20230426203305.GW3390869@ZenIV)
Comment 6 Patrick Hemmer 2023-04-26 23:40:50 UTC
(In reply to Bugbot from comment #1)
> Matthew Wilcox <willy@infradead.org> writes:
> 
> On Wed, Apr 26, 2023 at 05:58:06PM +0000, Kernel.org Bugbot wrote:
> > When running a threaded program, and opening a file descriptor that
> > is a power of 2 (starting at 64), the call takes a very long time to
> > complete. Normally such a call takes less than 2us. However with this
> > issue, I've seen the call take up to around 50ms. Additionally this only
> > happens the first time, and not subsequent times that file descriptor is
> > used. I'm guessing there might be some expansion of some internal data
> > structures going on. But I cannot see why this process would take so long.
> 
> Because we allocate a new block of memory and then memcpy() the old
> block of memory into it.  This isn't surprising behaviour to me.
> I don't think there's much we can do to change it (Allocating a
> segmented array of file descriptors has previously been vetoed by
> people who have programs with a million file descriptors).  Is it
> causing you problems?

Yes. I'm using sockets for IPC. Specifically haproxy with its SPOE protocol. Low latency is important. Normally a call (including optional connect if a new connection is needed) will easily complete in under 100us. So I want to set a timeout of 1ms to avoid blocking traffic. However because this issue effectively randomly pops up, that 1ms timeout is too low, and the issue can actually impact multiple in-flight requests because haproxy tries to share that one IPC connection for them all. But if I raise the timeout (and I'd have to raise it to something like 100ms, as I've seen delays up to 47ms in just light testing), then I run the risk of significantly impacting traffic if there is a legitimate slowdown. While a low timeout and the occasional failure is probably the better of the two options, I'd prefer not to fail at all.
Comment 7 Bugspray Bot 2023-04-27 09:33:09 UTC
Christian Brauner <brauner@kernel.org> replies to comment #6:

On Wed, Apr 26, 2023 at 11:42:01PM +0000, Kernel.org Bugbot wrote:
> phemmer+kernel writes via Kernel.org Bugzilla:
> 
> (In reply to Bugbot from comment #1)
> > Matthew Wilcox <willy@infradead.org> writes:
> > 
> > On Wed, Apr 26, 2023 at 05:58:06PM +0000, Kernel.org Bugbot wrote:
> > > When running a threaded program, and opening a file descriptor that
> > > is a power of 2 (starting at 64), the call takes a very long time to
> > > complete. Normally such a call takes less than 2us. However with this
> > > issue, I've seen the call take up to around 50ms. Additionally this only
> > > happens the first time, and not subsequent times that file descriptor is
> > > used. I'm guessing there might be some expansion of some internal data
> > > structures going on. But I cannot see why this process would take so
> long.
> > 
> > Because we allocate a new block of memory and then memcpy() the old
> > block of memory into it.  This isn't surprising behaviour to me.
> > I don't think there's much we can do to change it (Allocating a
> > segmented array of file descriptors has previously been vetoed by
> > people who have programs with a million file descriptors).  Is it
> > causing you problems?
> 
> Yes. I'm using using sockets for IPC. Specifically haproxy with its SPOE
> protocol. Low latency is important. Normally a call (including optional
> connect if a new connection is needed) will easily complete in under 100us.
> So I want to set a timeout of 1ms to avoid blocking traffic. However because
> this issue effectively randomly pops up, that 1ms timeout is too low, and the
> issue can actually impact multiple in-flight requests because haproxy tries
> to share that one IPC connection for them all. But if I raise the timeout
> (and I'd have to raise it to something like 100ms, as I've seen delays up to
> 47ms in just light testing), then I run the risk of significantly impacting
> traffic if there is a legitimate slowdown. While a low timeout and the
> occasional failure is probably the better of the two options, I'd prefer not
> to fail at all.

I wonder if you could use io_uring for this. The problem sounds a lot
like it could be solved by using the fixed file descriptor feature. The
async and linking operation nature of it might be rather valuable for
this as well...
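
For illustration, a rough sketch of the fixed-file idea with liburing (hypothetical helper names; setup, error handling and the exact calls depend on the liburing version in use):

#include <liburing.h>

/* Run once: register the IPC socket so later operations address it by a
 * fixed index instead of going through the fd table. */
int setup_fixed(struct io_uring *ring, int ipc_sock)
{
	int fds[1] = { ipc_sock };

	if (io_uring_queue_init(32, ring, 0) < 0)
		return -1;
	/* pin the socket as fixed file index 0 */
	return io_uring_register_files(ring, fds, 1);
}

/* Send on the registered socket; "0" below is the fixed-file index,
 * not a regular file descriptor. */
int send_fixed(struct io_uring *ring, const void *buf, size_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int res;

	if (!sqe)
		return -1;
	io_uring_prep_send(sqe, 0, buf, len, 0);
	sqe->flags |= IOSQE_FIXED_FILE;

	io_uring_submit(ring);
	if (io_uring_wait_cqe(ring, &cqe) < 0)
		return -1;
	res = cqe->res;		/* bytes sent, or -errno */
	io_uring_cqe_seen(ring, cqe);
	return res;
}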

(via https://msgid.link/20230427-arbeit-davor-6b1b48bc555a@brauner)
Comment 8 Bugspray Bot 2023-04-27 18:23:33 UTC
Matthew Wilcox <willy@infradead.org> replies to comment #6:

On Wed, Apr 26, 2023 at 11:42:01PM +0000, Kernel.org Bugbot wrote:
> Yes. I'm using using sockets for IPC. Specifically haproxy with its
> SPOE protocol. Low latency is important. Normally a call (including
> optional connect if a new connection is needed) will easily complete
> in under 100us. So I want to set a timeout of 1ms to avoid blocking
> traffic. However because this issue effectively randomly pops up,
> that 1ms timeout is too low, and the issue can actually impact
> multiple in-flight requests because haproxy tries to share that one
> IPC connection for them all. But if I raise the timeout (and I'd have
> to raise it to something like 100ms, as I've seen delays up to 47ms in
> just light testing), then I run the risk of significantly impacting
> traffic if there is a legitimate slowdown. While a low timeout and
> the occasional failure is probably the better of the two options,
> I'd prefer not to fail at all.

A quick workaround for this might be to use dup2() to open a newfd
that is larger than you think your process will ever use.  ulimit -n
is 1024 (on my system), so choosing 1023 might be a good idea.
It'll waste a little memory, but ensures the fd array will never need to
expand.
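
A sketch of that workaround (hypothetical helper name; assumes the default RLIMIT_NOFILE of 1024 unless getrlimit() says otherwise; run once at startup, before any threads are spawned):

#include <sys/resource.h>
#include <unistd.h>

static int prewarm_fd_table(void)
{
	struct rlimit rl;
	int high = 1023;	/* default assumption: ulimit -n == 1024 */

	if (getrlimit(RLIMIT_NOFILE, &rl) == 0 &&
	    rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur > 1)
		high = (int)rl.rlim_cur - 1;

	/* Duplicate any open descriptor to the highest usable slot and keep
	 * it open for the life of the process; the table it pins is the
	 * "little memory" mentioned above, and it never has to grow again. */
	return dup2(STDIN_FILENO, high);
}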

(via https://msgid.link/ZEq8iSl985aqEy4%2B@casper.infradead.org)
Comment 9 Patrick Hemmer 2023-04-28 00:33:53 UTC
(In reply to Bugbot from comment #8)
> Matthew Wilcox <willy@infradead.org> replies to comment #6:
> 
> On Wed, Apr 26, 2023 at 11:42:01PM +0000, Kernel.org Bugbot wrote:
> > Yes. I'm using using sockets for IPC. Specifically haproxy with its
> > SPOE protocol. Low latency is important. Normally a call (including
> > optional connect if a new connection is needed) will easily complete
> > in under 100us. So I want to set a timeout of 1ms to avoid blocking
> > traffic. However because this issue effectively randomly pops up,
> > that 1ms timeout is too low, and the issue can actually impact
> > multiple in-flight requests because haproxy tries to share that one
> > IPC connection for them all. But if I raise the timeout (and I'd have
> > to raise it to something like 100ms, as I've seen delays up to 47ms in
> > just light testing), then I run the risk of significantly impacting
> > traffic if there is a legitimate slowdown. While a low timeout and
> > the occasional failure is probably the better of the two options,
> > I'd prefer not to fail at all.
> 
> A quick workaround for this might be to use dup2() to open a newfd
> that is larger than you think your process will ever use.  ulimit -n
> is 1024 (on my system), so choosing 1023 might be a good idea.
> It'll waste a little memory, but ensures the fd array will never need to
> expand.

That's a good idea. I was originally considering opening a bunch of file descriptors one by one. But if it will grow even while skipping all the FDs in between, then that seems like it should work. At least for the app which I control. I don't know that it'd be a welcome change on the haproxy side though. And both sides would need it to completely alleviate the issue.