Bug 217620

Summary: RCU stalls with wireguard over bonding over igb on Linux 6.3.0+
Product: Linux Reporter: Manuel 'satmd' Leiner (manuel.leiner)
Component: Kernel Assignee: Virtual assignee for kernel bugs (linux-kernel)
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: bp, Jason, sam
Priority: P3    
Hardware: All   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:

Description Manuel 'satmd' Leiner 2023-07-01 12:40:19 UTC
I've spent the last week debugging a problem with upgrading my kernel from 6.2.8 to 6.3.8 (the problem now also occurs with 6.4.0).

The lengthy and detailed bug report, with all aspects of the git bisect, is at
https://bugs.gentoo.org/909066

A summary:
- if I do not configure wg0, the kernel does not hang
- if I use a kernel older than commit fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c, it does not hang

To my naive eye, the commit touches code that seems unrelated to the problem.

The hardware is a Dell PowerEdge R620 running Gentoo ~amd64.

I have so far ruled out the following, since each was the same across all kernels:
- dracut (used to generate the initramfs)
- linux-firmware
- CPU microcode

It's been a long time since I was seriously involved in software development, and I have been even less involved in kernel development.

Gentoo maintainers recommended that I open a bug upstream, so here I am.

I currently have no idea how to make progress, but I'm willing to try things.
Comment 1 Manuel 'satmd' Leiner 2023-07-01 14:14:58 UTC
(In reply to Manuel 'satmd' Leiner from comment #0)

I've just successfully built v6.4 with fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c reverted.
Comment 2 Manuel 'satmd' Leiner 2023-07-01 14:15:28 UTC
(In reply to Manuel 'satmd' Leiner from comment #1)

... and it seems to be running stably.
Comment 3 Borislav Petkov 2023-07-02 08:37:03 UTC
Can you boot plain 6.4 once and 6.4 with the patch reverted once, adding "debug ignore_loglevel log_buf_len=16M" to the kernel command line in both cases, and upload the full dmesg from both?

Thx.
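
For anyone reproducing this, it can help to confirm that the requested parameters actually reached the running kernel before capturing dmesg. Below is a minimal sketch in Python. The helper names `missing_params` and `check_running_kernel` are hypothetical, and the sketch assumes the usual Linux `/proc/cmdline` file is readable. Only the three parameters come from the request above.

```python
# Parameters requested in this comment for both test boots.
REQUESTED = ("debug", "ignore_loglevel", "log_buf_len=16M")


def missing_params(cmdline, requested=REQUESTED):
    """Return the requested parameters that do not appear in cmdline.

    cmdline is the whitespace-separated kernel command line as a string.
    Matching is exact per token, so "log_buf_len=8M" does not satisfy
    "log_buf_len=16M".
    """
    present = set(cmdline.split())
    return [param for param in requested if param not in present]


def check_running_kernel(path="/proc/cmdline"):
    """Read the running kernel's command line and report missing parameters.

    Assumption: path points at the standard procfs file on Linux.
    """
    with open(path) as handle:
        return missing_params(handle.read())
```

Calling `check_running_kernel()` on the booted machine returns an empty list when all three parameters are present; anything it returns still needs to be added to the bootloader configuration.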
Comment 4 Manuel 'satmd' Leiner 2023-07-02 14:40:25 UTC
I will:
- add those cmdline arguments permanently until the bug is resolved
- test plain v6.4 again
- test v6.4 with reverted fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c
- test v6.4 with patch 54d5e4329efe0d1dba8b4a58720d29493926bed0

I will have to test those during European night time and when my health allows for it. This may take a day or two. I did a lot of tests during daytime and have to give my users a bit of rest too.
Comment 5 Manuel 'satmd' Leiner 2023-07-02 15:45:54 UTC
Small change of plans:

After talking to Jason, I will do things in this order:
- test v6.4 with patch 54d5e4329efe0d1dba8b4a58720d29493926bed0
- test plain v6.4 again
- test v6.4 with reverted fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c

all with the adjusted kernel arguments.
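
The three source trees in this plan could be prepared roughly as sketched below. This is only an illustration: the function name `plan` and the variant labels are made up, and it assumes a kernel git checkout in which tag v6.4 and both commit objects are already available locally (the patch commit may first have to be fetched from wherever it lives). The sketch only prints the git commands; it does not run them.

```python
import shlex

# Commit IDs as given in this bug report.
REVERT_SHA = "fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c"
PATCH_SHA = "54d5e4329efe0d1dba8b4a58720d29493926bed0"


def plan(base="v6.4"):
    """Return (label, commands) pairs in the order given in this comment.

    Each command is an argument list suitable for subprocess.run().
    Assumption: both commits are reachable in the local repository.
    """
    checkout = ["git", "checkout", "--detach", base]
    return [
        ("patched", [checkout, ["git", "cherry-pick", PATCH_SHA]]),
        ("plain", [checkout]),
        ("reverted", [checkout, ["git", "revert", "--no-edit", REVERT_SHA]]),
    ]


def show(base="v6.4"):
    """Print the commands for each variant without executing anything."""
    for label, commands in plan(base):
        print(f"# {label}")
        for command in commands:
            print(shlex.join(command))
```

Each variant is then built and booted with the same kernel arguments, so the only difference between runs is the tree itself.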
Comment 6 Manuel 'satmd' Leiner 2023-07-02 20:15:28 UTC
The patch 54d5e4329efe0d1dba8b4a58720d29493926bed0 allows me to successfully boot v6.4.

I'd prefer to skip the other tests if we can agree that they are no longer needed. :)
Comment 7 Manuel 'satmd' Leiner 2023-07-02 20:18:08 UTC
Your patch works for me. Tested-by: Manuel Leiner <manuel.leiner@gmx.de>
Comment 8 Jason A. Donenfeld 2023-07-03 11:49:05 UTC
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=7387943fa35516f6f8017a3b0e9ce48a3bef9faa

The fix hit the net tree. Will be in the next stable.