Created attachment 304291 [details]
dmesg

After updating from 6.2.x to 6.3.x, vmalloc error messages started to appear in the dmesg.

# free
               total        used        free      shared  buff/cache   available
Mem:        16183724     1473068      205664       33472    14504992    14335700
Swap:       16777212      703596    16073616

(zswap enabled)
(In reply to a1bert from comment #0)
> Created attachment 304291 [details]
> dmesg
>
> after updating from 6.2.x to 6.3.x, vmalloc error messages started to appear
> in the dmesg
>
> # free
>                total        used        free      shared  buff/cache   available
> Mem:        16183724     1473068      205664       33472    14504992    14335700
> Swap:       16777212      703596    16073616
>
> (zswap enabled)

What is your computer's setup? Can you bisect this to find the culprit?
It is a small home server/gateway (NAS/lxc/qemu/DVR/nfs/backup):

/dev/md0 on / type ext4 (rw,noatime,nodiratime,errors=remount-ro)
/dev/mapper/sopa-motion on /data/motion type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=128,swidth=256,noquota)
/dev/sdb3 on /mnt/raid1 type btrfs (rw,noatime,compress=zstd:15,space_cache,skip_balance,subvolid=5,subvol=/)
/dev/sdb3 on /data/backup type btrfs (rw,noatime,compress=zstd:15,space_cache,skip_balance,subvolid=12088,subvol=/@backup)
/dev/sdb3 on /data/sopa type btrfs (rw,noatime,compress=zstd:15,space_cache,skip_balance,subvolid=8323,subvol=/sopa)
/dev/sdb3 on /data/www type btrfs (rw,noatime,compress=zstd:15,space_cache,skip_balance,subvolid=8163,subvol=/www)
/dev/sdb3 on /data/tftp type btrfs (rw,noatime,compress=zstd:15,space_cache,skip_balance,subvolid=8164,subvol=/tftp)
/dev/sdb3 on /data/media type btrfs (rw,noatime,compress=zstd:15,space_cache,skip_balance,subvolid=8165,subvol=/media)
/dev/sdb3 on /data/nfs type btrfs (rw,noatime,compress=zstd:15,space_cache,skip_balance,subvolid=1021,subvol=/nfs)
/dev/sdb3 on /data/lxc type btrfs (rw,noatime,compress=zstd:15,space_cache,skip_balance,subvolid=9264,subvol=/lxc)
/dev/sdb3 on /data/libvirt type btrfs (rw,noatime,compress=zstd:15,space_cache,skip_balance,subvolid=8126,subvol=/libvirt)
/dev/sdb3 on /home type btrfs (rw,noatime,compress=zstd:15,space_cache,skip_balance,subvolid=12042,subvol=/@home)

btrfs sub list /home | wc -l
495

Overall:
    Device size:                   4.49TiB
    Device allocated:              4.40TiB
    Device unallocated:           93.94GiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                          4.23TiB
    Free (estimated):            134.98GiB      (min: 134.98GiB)
    Free (statfs, df):           134.98GiB
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,RAID1: Size:2.19TiB, Used:2.11TiB (96.08%)
   /dev/sdb3       2.19TiB
   /dev/sda3       2.19TiB

Metadata,RAID1: Size:7.00GiB, Used:6.26GiB (89.41%)
   /dev/sdb3       7.00GiB
   /dev/sda3       7.00GiB

System,RAID1: Size:32.00MiB, Used:416.00KiB (1.27%)
   /dev/sdb3      32.00MiB
   /dev/sda3      32.00MiB

Unallocated:
   /dev/sdb3      46.97GiB
   /dev/sda3      46.97GiB

(sorry, I cannot bisect)
(In reply to a1bert from comment #2)
> (sorry, I cannot bisect)

With a bit of luck, the btrfs maintainer (who is known to look into bugzilla reports) might have an idea about the cause, or about how to find it without a bisection; or somebody else might run into this and bisect the problem. But be warned: if neither happens, there is a decent chance that this won't be fixed.
(In reply to a1bert from comment #2)
> (sorry, I cannot bisect)

See Documentation/admin-guide/bug-bisect.rst for how to perform a bisection. Remember: if you'd like to see a regression like this one fixed, you'll have to bisect.
This is not a regression. Memory allocation failures can happen for various reasons and depend on the actual state of the system: how fragmented the memory is, and whether virtual mapping slots are available. What could affect it between 6.2 and 6.3 is some internal memory allocator strategy or even an indirect change, but that's speculative territory.

The report is from inside zstd_alloc_workspace, which calls kmalloc (with fallback to vmalloc). The allocated size is 2097152 bytes (exactly 2 MiB), and per comment 2 you're using compress=zstd:15. Level 15 is indeed the most memory hungry.

The workspaces are either preallocated or allocated on demand, and the on-demand allocations are what can lead to the warnings. If an on-demand allocation fails, the thread waits until a workspace is free, so the failure is already handled and the fix is to avoid the allocation warnings.
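The behaviour described above (preallocated workspaces, best-effort on-demand allocation, and waiting for a free workspace when the allocation fails) can be illustrated with a toy userspace model. This is only a sketch under stated assumptions: it is not the btrfs implementation, and `WorkspacePool`, `allocator` and `failed_allocations` are hypothetical names introduced for illustration. Only the 2097152-byte size comes from the report.

```python
import threading
from collections import deque

WORKSPACE_SIZE = 2 * 1024 * 1024  # 2097152 bytes, the size quoted from the report


class WorkspacePool:
    """Toy model (not btrfs code): preallocated plus on-demand workspaces."""

    def __init__(self, preallocated, allocator):
        self._cond = threading.Condition()
        # Workspaces created up front, so at least these are always reusable.
        self._free = deque(bytearray(WORKSPACE_SIZE) for _ in range(preallocated))
        # Hypothetical callback: returns a buffer, or None if allocation fails.
        self._allocator = allocator
        self.failed_allocations = 0

    def get(self):
        # Fast path: reuse a workspace that is currently free.
        with self._cond:
            if self._free:
                return self._free.popleft()
        # No free workspace: try to allocate a new one on demand.
        workspace = self._allocator(WORKSPACE_SIZE)
        if workspace is not None:
            return workspace
        # Allocation failed. This is not fatal: wait until another
        # user returns a workspace, then take that one.
        with self._cond:
            self.failed_allocations += 1
            while not self._free:
                self._cond.wait()
            return self._free.popleft()

    def put(self, workspace):
        # Return a workspace and wake one waiter, if any.
        with self._cond:
            self._free.append(workspace)
            self._cond.notify()
```

In this model a failed on-demand allocation only delays the caller until `put()` is called elsewhere, which is why a warning printed at the point of failure adds noise rather than signalling a real error.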
Hi! I reported a similar issue only a couple of weeks ago. At that time I could reliably trigger the vmalloc error within a few minutes by running bees. This only happened on kernel 6.3.x.

This system has 24GB of RAM. On kernels <6.3, all is good even if I load several VMs, large databases, run compilations, etc., at the same time as bees. On 6.3, if I run bees without any other service, the vmalloc error happens even though there is 15-20GB of free RAM and I trigger /proc/sys/vm/compact_memory.

I am not saying it is a fault in Btrfs; it is probably somewhere else in the kernel. I was able to reproduce the issue in a QEMU VM.

https://lore.kernel.org/all/d11418b6-38e5-eb78-1537-c39245dc0b78@tnonline.net/T/
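Since fragmentation was raised as a possible factor and free RAM alone does not show it, one way to look at it is /proc/buddyinfo, which lists the number of free blocks per allocation order for each zone. The following is a minimal sketch of a parser; the function names are made up for illustration. It assumes 4 KiB pages, in which case a physically contiguous 2 MiB block corresponds to order 9. Whether this number correlates with the vmalloc errors in this report is not established here.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order]} from /proc/buddyinfo text."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected line shape: "Node 0, zone Normal c0 c1 c2 ..."
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(count) for count in parts[4:]]
    return zones


def free_blocks_at_or_above(counts, order):
    """Count free blocks of at least the given order in one zone."""
    return sum(counts[order:])


if __name__ == "__main__":
    with open("/proc/buddyinfo") as f:
        for (node, zone), counts in parse_buddyinfo(f.read()).items():
            # Order 9 is 2 MiB only under the 4 KiB page size assumption.
            print(node, zone, free_blocks_at_or_above(counts, 9))
```

Running it before and after writing to /proc/sys/vm/compact_memory would show whether compaction actually produced more high-order free blocks.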
Thanks for the pointer. The cause and symptoms are the same, only in a different place (ioctl). We can add the NOWARN flag, but something might be going on in MM that would be of interest to the developers.