Bug 219121

Summary: 6.10 regression: System freezing/locking up during high IO usage.
Product: File System
Reporter: kzd (kzd)
Component: btrfs
Assignee: BTRFS virtual assignee (fs_btrfs)
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: fdmanana, lilydjwg, me, octavia.togami+kernelbug, regressions, tad
Priority: P3
Hardware: AMD
OS: Linux
Kernel Version: 6.10
Subsystem:
Regression: Yes
Bisected commit-id: f1d97e76915285013037c487d9513ab763005286

Description kzd 2024-08-02 21:20:19 UTC
When using programs that can generate a high amount of IO, such as emerge, xz, or restic, I have been encountering frequent system-wide freezes with an unresponsive UI/cursor, often requiring me to reset my PC to get back to a working state without waiting an excessive amount of time.

Downgrading back to 6.9.10 removes the issue entirely.
The issue has been observed since 6.10 through 6.10.2.

Unfortunately I am not sure how best to produce logs or bisect, as I imagine the differences between 6.9.10 and 6.10 are quite sizeable. I am also unsure whether the issue is hardware-specific (I'm using an AMD Ryzen 7950X3D CPU).

Since this is seemingly IO-related, I'll mention that I am using btrfs with zstd compression as my filesystem setup.
Comment 1 Artem S. Tashkinov 2024-08-03 08:46:48 UTC
Do you have swap enabled?

Does disabling it help (sudo swapoff -a)?
Comment 2 kzd 2024-08-03 18:55:45 UTC
(In reply to Artem S. Tashkinov from comment #1)
> Do you have swap enabled?
> 
> Does disabling it help (sudo swapoff -a)?

No swap enabled on this system; 64 GB RAM available.
Comment 3 Artem S. Tashkinov 2024-08-04 11:59:36 UTC
Are there any warning messages in `dmesg`?

What can be seen if you leave `top` running in situations like this?
Comment 4 Artem S. Tashkinov 2024-08-04 12:00:33 UTC
Lastly, could you bisect?

https://docs.kernel.org/admin-guide/bug-bisect.html
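For reference, the bisect workflow from that guide boils down to marking a known-good and known-bad release and letting git narrow down the midpoints. The sketch below is a self-contained toy demo of that mechanic, using a throwaway repo with a deliberately "broken" commit standing in for the kernel regression; with a real kernel tree the endpoints would instead be `git bisect good v6.9.10` and `git bisect bad v6.10`, and each step would mean building and booting the checked-out kernel.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

# Five commits; commit 3 introduces the "bug" (marker file contents change).
for i in 1 2 3 4 5; do
    echo "ok, commit $i" > state
    [ "$i" -ge 3 ] && echo "broken, commit $i" > state
    git add state
    git commit -qm "commit $i"
done

# bad = tip, good = oldest; for the kernel: git bisect start v6.10 v6.9.10
git bisect start HEAD HEAD~4

# At each step, "test the kernel". Here that's just inspecting the marker
# file; in the real bisect it's building, booting, and trying to reproduce.
until git bisect log | grep -q 'first bad commit'; do
    if grep -q broken state; then git bisect bad; else git bisect good; fi
done > /dev/null

git bisect log | grep 'first bad commit'
```

Run `git bisect reset` afterwards to return the tree to its original branch.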
Comment 5 kzd 2024-08-04 19:07:38 UTC
(In reply to Artem S. Tashkinov from comment #3)
> Are there any warning messages in `dmesg`?
> 
> What can be seen if you leave `top` running in situations like this?

Nothing left over in dmesg from what I can tell, but it'd be hard to check since the entire system freezes, possibly including network/YouTube playback.

I usually use gotop but assume top would show very similar results, with just the program that was triggering the issue near the top of my process list. The window freezes while it happens, as the system only updates the GUI very intermittently during the total freezes.


(In reply to Artem S. Tashkinov from comment #4)
> Lastly, could you bisect?
> 
> https://docs.kernel.org/admin-guide/bug-bisect.html

It'll take a while, but I can take a stab at it, assuming I can reliably recreate the issue now that my backup has completed successfully on 6.9.10, and assuming restic backups without much to do (just IO scanning) can trigger it.

Since I know the emerge for restic caused it during xz usage, I can also give that a whirl as a triggering use case. Each test takes time and building kernels will of course take a while, so it may be some time before I can report back with a specific commit.
Comment 6 kzd 2024-08-04 21:12:26 UTC
(In reply to Artem S. Tashkinov from comment #4)
> Lastly, could you bisect?
> 
> https://docs.kernel.org/admin-guide/bug-bisect.html

To add to this, it would be good to know of a test that can reliably peg IO, as my current restic setup no longer causes enough load to trigger the issue when I went to sanity-check on 6.10.3 after a few bisect steps.

Also, I can't seem to figure out a good emerge target that would do the same, despite the restic emerge clearly getting hung up on xz when I was on 6.10.2 the day I made this report.

I'll leave my system running 6.10.3+ to see if I encounter the issue again, but since my main backups are only sizeable on a monthly basis, I won't have restic as a viable testing target for some time.
Comment 7 tad 2024-08-09 00:52:26 UTC
the easiest reproducer I've found is anything reading a file larger than RAM size, e.g. running zstdmt on a virtual machine image
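A minimal sketch of that kind of reproducer (assumed file names; the size here is deliberately tiny so the commands are safe to try as-is, but for the actual reproducer the file must exceed installed RAM, e.g. 64G+ on the reporter's machine, so that reading it forces page-cache reclaim):

```shell
# Toy sketch of the "read a file larger than RAM" reproducer.
# Scale SIZE_MB past your RAM for the real test; watch for kswapd0
# pegging a CPU and the desktop freezing while the compression runs.
tmp=$(mktemp -d)
cd "$tmp"
SIZE_MB=64
dd if=/dev/urandom of=big.img bs=1M count="$SIZE_MB" status=none

if command -v zstd >/dev/null 2>&1; then
    # zstdmt, as mentioned above, is shorthand for `zstd -T0`
    # (multi-threaded compression across all cores).
    zstd -q -T0 -f big.img -o big.img.zst
fi

ls -l big.img*
```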

as a per-distro note, I reported this on RHBZ before finding this report:
https://bugzilla.redhat.com/show_bug.cgi?id=2303810
Comment 8 tad 2024-08-09 01:07:14 UTC
this appears to be this known regression from June: https://lore.kernel.org/regressions/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/
Comment 9 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-08-09 08:19:16 UTC
(In reply to tad from comment #8)
> this appears to be this known issue from june regression:
> https://lore.kernel.org/regressions/
> CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/

I already suspected that, but it should be fixed in the latest kernels. Still, you seem to have similar symptoms, yes?
Comment 10 Filipe David Manana 2024-08-09 11:28:09 UTC
It's possibly the extent map shrinker slowing down memory allocations.

I'm reworking it to make it more efficient; it will probably take a couple of weeks until it's done.

You can try this patch to see if it makes things better for your workload:

https://gist.githubusercontent.com/fdmanana/a03a3b737f29a83434e3ca1b1b3cd5e6/raw/260f9179d760810f0027ceba0dc4e7b0196a32ce/gistfile1.txt
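For anyone following along who hasn't tried a test patch before, the suggested workflow amounts to downloading the gist into the root of the kernel source tree, applying it with `patch -p1`, and rebuilding (roughly `curl -fsSL <gist URL> | patch -p1` from the tree root). Below is a self-contained toy sketch of that mechanic, with made-up file contents standing in for the kernel tree and the patch:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Build a toy unified diff in the kernel's usual a/ b/ path style:
mkdir -p a/fs/btrfs b/fs/btrfs
printf 'shrink aggressively\n' > a/fs/btrfs/extent_map.c
printf 'shrink lazily\n'       > b/fs/btrfs/extent_map.c
diff -u a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c > fix.patch || true

# And a toy "kernel tree" to apply it to:
mkdir -p src/fs/btrfs
printf 'shrink aggressively\n' > src/fs/btrfs/extent_map.c
cd src
patch -p1 < ../fix.patch   # -p1 strips the leading a/ or b/ path component
cat fs/btrfs/extent_map.c
```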
Comment 11 tad 2024-08-09 17:38:09 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #9)
> I suspected that already, but that should be fixed with the latest kernels.
> But you seem to have similar symptoms, yes?

yes, I confirmed 6.10.3 had those three patches from the original list, and indeed still see kswapd0 hitting 100% usage

(In reply to Filipe David Manana from comment #10)
> You can try this patch to see if it makes things better for your workload:

Yes, this helps a lot thank you!
here is my zstd qcow image case:

6.9.12: real: 1m5s, user: 55s, sys: 1m6s
6.10.3: real: 1m55s, user: 56s, sys: 1m56s, clock froze/jumped 18 times
6.10.3+patch: real: 1m9s, user: 56s, sys: 1m10s
Comment 12 Filipe David Manana 2024-08-09 23:42:21 UTC
(In reply to tad from comment #11)
> (In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from
> comment #9)
> > I suspected that already, but that should be fixed with the latest kernels.
> > But you seem to have similar symptoms, yes?
> 
> yes, I confirmed 6.10.3 had those three patches from the original list, and
> indeed still see kswapd0 hitting 100% usage
> 
> (In reply to Filipe David Manana from comment #10)
> > You can try this patch to see if it makes things better for your workload:
> 
> Yes, this helps a lot thank you!
> here is my zstd qcow image case:
> 
> 6.9.12: real: 1m5s, user: 55s, sys: 1m6s
> 6.10.3: real: 1m55s, user: 56s, sys: 1m56s, clock froze/jumped 18 times
> 6.10.3+patch: real: 1m9s, user: 56s, sys, 1m10s

Thanks!

May I ask you to test the following slightly different patch too?

https://gist.githubusercontent.com/fdmanana/0ed635cf727eb764fa1739dd5e4f7e66/raw/bcd83a7969ccbaee6fc71bf51cb0312b5f424517/gistfile1.txt

If that also fixes the regression for you, I'll add a changelog to it, send it to the mailing list, and merge it into the for-next branch for inclusion into 6.10 stable.
Comment 13 octavia.togami+kernelbug 2024-08-10 02:40:05 UTC
> May I ask you to test the following slightly different patch too?
> 
> https://gist.githubusercontent.com/fdmanana/0ed635cf727eb764fa1739dd5e4f7e66/
> raw/bcd83a7969ccbaee6fc71bf51cb0312b5f424517/gistfile1.txt

I tried this patch and it fixes the issue on my end. Thanks for the quick fix.
Comment 14 tad 2024-08-10 04:44:09 UTC
my drive died after compiling the new patch

here are some fresh numbers, but they can't be compared to the previous ones since I rebuilt on a different drive with a different test file

6.9.12: real: 49s, user: 1m2s, sys: 55s
6.10.3: real: 1m33s, user: 59s, sys: 1m39s, clock froze/jumped 14 times
6.10.3+new patch: real: 1m17s, user: 59s, sys: 1m24s

it does work, though; thank you again!
Comment 16 Artem S. Tashkinov 2024-08-16 13:12:18 UTC
Reassigning to btrfs.
Comment 17 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-08-16 13:24:48 UTC
(In reply to Artem S. Tashkinov from comment #16)
> Reassigning to btrfs.

TWIMC (in case anyone stumbles here): a change to improve things was mainlined yesterday: https://git.kernel.org/torvalds/c/ae1e766f623f7a2a889a0b09eb076dd9a60efbe9

If you are still having trouble, please open a new ticket and afterwards drop a link to it here.