Bug 219121
| Summary: | 6.10 regression: System freezing/locking up during high IO usage. | | |
|---|---|---|---|
| Product: | File System | Reporter: | kzd (kzd) |
| Component: | btrfs | Assignee: | BTRFS virtual assignee (fs_btrfs) |
| Status: | RESOLVED PATCH_ALREADY_AVAILABLE | | |
| Severity: | normal | CC: | fdmanana, lilydjwg, me, octavia.togami+kernelbug, regressions, tad |
| Priority: | P3 | | |
| Hardware: | AMD | | |
| OS: | Linux | | |
| Kernel Version: | 6.10 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | f1d97e76915285013037c487d9513ab763005286 |
Description

kzd 2024-08-02 21:20:19 UTC
Artem S. Tashkinov:
Do you have swap enabled? Does disabling it help (`sudo swapoff -a`)?

kzd:
(In reply to Artem S. Tashkinov from comment #1)
> Do you have swap enabled?
>
> Does disabling it help (sudo swapoff -a)?

No swap enabled on this system; 64 GB RAM available.

Artem S. Tashkinov:
Are there any warning messages in `dmesg`? What can be seen if you leave `top` running in situations like this?

Artem S. Tashkinov:
Lastly, could you bisect? https://docs.kernel.org/admin-guide/bug-bisect.html

kzd:
(In reply to Artem S. Tashkinov from comment #3)
> Are there any warning messages in `dmesg`?
>
> What can be seen if you leave `top` running in situations like this?

Nothing leftover in dmesg from what I can tell, but it would be hard to check as the entire system freezes, possibly network/YouTube playback included. I usually use gotop, but I assume top would show very similar results, with just the program that was triggering the issue near the top of my processes. The window would be frozen while it happens, as the system only updates the GUI very intermittently during the total freezes.

kzd:
(In reply to Artem S. Tashkinov from comment #4)
> Lastly, could you bisect?
>
> https://docs.kernel.org/admin-guide/bug-bisect.html

It'll take a while, but I can make a stab at it, assuming I can reliably recreate the issue after my backup was successful on 6.9.10, and assuming restic backups without much to do (just IO scanning) can trigger it. Since I know the emerge for restic caused it during xz usage, I can also give that a whirl as a triggering use case. Each test takes time and building kernels will of course take a while, so it may be some time before I can report back on a specific commit.

kzd:
(In reply to Artem S. Tashkinov from comment #4)
> Lastly, could you bisect?
>
> https://docs.kernel.org/admin-guide/bug-bisect.html

To add to this, a good test that can reliably peg IO would be good to know of, as my current setup with restic is no longer causing enough load to trigger the issue when I went to sanity check on 6.10.3 after a few bisects. I also can't figure out a good emerge target that should do the same, despite the restic emerge clearly getting hung up on xz when I was on 6.10.2 the day I made this report. I'll leave my system running 6.10.3+ to see if I encounter the issue again, but since my main backups are only sizeable on a monthly basis, I won't have restic as a viable testing target for some time.
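[Editor's note: a typical bisection run following the linked guide looks roughly like the sketch below. The good/bad endpoints come from this report (6.9 good, 6.10 bad); the paths and workload are illustrative assumptions, not taken from the thread.]

```bash
# Sketch of a kernel bisect for this regression, assuming a local clone of
# the mainline tree; v6.9 is known good and v6.10 known bad per this report.
cd ~/src/linux
git bisect start
git bisect bad v6.10
git bisect good v6.9

# For each candidate git checks out: build, install, and boot it, then run
# the triggering IO workload (e.g. a restic backup or a large emerge) and
# record the outcome with `git bisect good` or `git bisect bad`.
make -j"$(nproc)" && sudo make modules_install install

# Once git prints the first bad commit, clean up:
git bisect reset
```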
The easiest reproducer I've found is anything reading a file larger than RAM size, e.g. zstdmt a virtual machine image.

Per distro note, I reported this on RHBZ before finding this report: https://bugzilla.redhat.com/show_bug.cgi?id=2303810

tad:
This appears to be this known issue from the June regression: https://lore.kernel.org/regressions/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/

The Linux kernel's regression tracker (Thorsten Leemhuis):
(In reply to tad from comment #8)
> this appears to be this known issue from june regression:
> https://lore.kernel.org/regressions/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/

I suspected that already, but that should be fixed with the latest kernels. But you seem to have similar symptoms, yes?

Filipe David Manana:
It's possibly the extent map shrinker slowing down memory allocations. I'm reworking it to make it more efficient; that will probably take a couple of weeks to finish.

You can try this patch to see if it makes things better for your workload: https://gist.githubusercontent.com/fdmanana/a03a3b737f29a83434e3ca1b1b3cd5e6/raw/260f9179d760810f0027ceba0dc4e7b0196a32ce/gistfile1.txt

tad:
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #9)
> I suspected that already, but that should be fixed with the latest kernels.
> But you seem to have similar symptoms, yes?

Yes, I confirmed 6.10.3 had those three patches from the original list, and indeed I still see kswapd0 hitting 100% usage.

(In reply to Filipe David Manana from comment #10)
> You can try this patch to see if it makes things better for your workload:

Yes, this helps a lot, thank you! Here is my zstd qcow image case:

6.9.12: real: 1m5s, user: 55s, sys: 1m6s
6.10.3: real: 1m55s, user: 56s, sys: 1m56s, clock froze/jumped 18 times
6.10.3+patch: real: 1m9s, user: 56s, sys: 1m10s
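[Editor's note: a minimal form of the reproducer and timing run described above might look like the sketch below; the image path is an assumption, and any file larger than RAM on the affected btrfs filesystem should do.]

```bash
# Stream-read a file larger than RAM through zstd (zstdmt == zstd -T0) and
# time it; compare real/user/sys across kernel versions as done above.
time zstd -T0 -c /var/lib/libvirt/images/vm.qcow2 > /dev/null

# In a second terminal, watch for kswapd0 pegging a core during the run:
top -o %CPU
```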
Filipe David Manana:
(In reply to tad from comment #11)
> Yes, this helps a lot, thank you! Here is my zstd qcow image case:
>
> 6.9.12: real: 1m5s, user: 55s, sys: 1m6s
> 6.10.3: real: 1m55s, user: 56s, sys: 1m56s, clock froze/jumped 18 times
> 6.10.3+patch: real: 1m9s, user: 56s, sys: 1m10s

Thanks! May I ask you to test the following slightly different patch too? https://gist.githubusercontent.com/fdmanana/0ed635cf727eb764fa1739dd5e4f7e66/raw/bcd83a7969ccbaee6fc71bf51cb0312b5f424517/gistfile1.txt

If that also fixes the regression for you, I'll add a changelog to it, send it to the mailing list, and merge it into the for-next branch for inclusion into 6.10 stable.

> May I ask you to test the following slightly different patch too?
>
> https://gist.githubusercontent.com/fdmanana/0ed635cf727eb764fa1739dd5e4f7e66/raw/bcd83a7969ccbaee6fc71bf51cb0312b5f424517/gistfile1.txt

I tried this patch and it fixes the issue on my end. Thanks for the quick fix.
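[Editor's note: for anyone repeating this test, fetching and applying one of the gist patches above to a kernel tree goes roughly as follows. The source directory is an assumption; the URL is the one from Filipe's comment.]

```bash
cd ~/src/linux
curl -L -o extent-map-shrinker.patch \
  'https://gist.githubusercontent.com/fdmanana/0ed635cf727eb764fa1739dd5e4f7e66/raw/bcd83a7969ccbaee6fc71bf51cb0312b5f424517/gistfile1.txt'
git apply --check extent-map-shrinker.patch   # dry run; reports any conflicts
git apply extent-map-shrinker.patch
make -j"$(nproc)" && sudo make modules_install install
```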
tad:
My drive died after compiling the new patch, so here are some fresh numbers; they can't be compared to the previous ones since I rebuilt on a different drive and test file:

6.9.12: real: 49s, user: 1m2s, sys: 55s
6.10.3: real: 1m33s, user: 59s, sys: 1m39s, clock froze/jumped 14 times
6.10.3+new patch: real: 1m17s, user: 59s, sys: 1m24s

It does work, however. Thank you again!

The issue is most likely related to this bug: https://forum.garudalinux.org/t/btrfs-cleaner-and-updatedb-running-at-the-same-time-causing-high-system-load-and-massive-lags-due-to-swapping/38541/6

Artem S. Tashkinov:
Reassigning to btrfs.

The Linux kernel's regression tracker (Thorsten Leemhuis):
(In reply to Artem S. Tashkinov from comment #16)
> Reassigning to btrfs.

TWIMC (in case anyone stumbles here): a change to improve things was mainlined yesterday: https://git.kernel.org/torvalds/c/ae1e766f623f7a2a889a0b09eb076dd9a60efbe9

If you are still having trouble, please open a new ticket and afterwards drop a link to it here.
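[Editor's note: to check whether that mainline change has reached a given stable kernel, one approach (assuming a local clone of the stable tree, whose branches follow the linux-X.Y.y naming) is to search the branch log for the upstream commit id, since stable backports cite it in their commit messages.]

```bash
# Look for a backport of the mainline commit in the 6.10 stable branch.
git log --oneline --grep=ae1e766f623f7a2a889a0b09eb076dd9a60efbe9 \
    origin/linux-6.10.y
```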