By default, llama.cpp uses CUDA's allocator, which creates pinned pages (via `cudaMallocHost`). Running on the latest official 6.11 kernel results in a permanent memory leak after each invocation: `free -m` reports more and more memory used, with no active process actually holding that memory. Similarly, `nr_foll_pin_acquired` and `nr_foll_pin_released` in `/proc/vmstat` are horribly imbalanced. See the llama.cpp discussion at https://github.com/ggerganov/llama.cpp/issues/9988 and the report to NVIDIA at https://forums.developer.nvidia.com/t/memory-leak-on-kernel-6-11-0-when-using-cudamallochost/308691. A patch was proposed in https://lore.kernel.org/lkml/87y12ibbew.fsf@nvdebian.thelocal/T/#ma3aebfc4d8aa152d2c0439bedf0a4862d2510185, but it doesn't seem to have been applied in the 6.12 RCs or in mainline, so I wanted to create a bug to make sure this is tracked.
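For reference, here is a minimal standalone sketch (my own, not taken from the llama.cpp issue) of the allocation pattern that triggers the leak; the 1 GiB size is arbitrary, and it should build with `nvcc repro.cu -o repro`:

```c
/* Reproducer sketch: allocate pinned (page-locked) host memory via the CUDA
 * runtime, touch it, free it, and exit. On an affected 6.11 kernel, repeated
 * runs grow the "used" figure in `free -m` and leave nr_foll_pin_acquired
 * ahead of nr_foll_pin_released in /proc/vmstat. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    const size_t size = 1UL << 30; /* 1 GiB, arbitrary */
    void *buf = NULL;

    cudaError_t err = cudaMallocHost(&buf, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost: %s\n", cudaGetErrorString(err));
        return 1;
    }
    memset(buf, 0, size); /* populate the pages so they are actually pinned */

    err = cudaFreeHost(buf);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFreeHost: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Running this in a loop and watching `free -m` between iterations should make the leak obvious on an affected kernel.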
One thing in the explanation doesn't make sense to me: I have 64 GiB of RAM, and even on a freshly booted machine only about 4 GiB is in use. The maximum llama.cpp allocates is ~16 GiB. So it's a bit strange to be hitting an issue that supposedly occurs only under "low memory conditions".
I've confirmed the patch fixes the issue on 6.11. I can't get 6.12 to boot for some reason, so I haven't been able to double-check there.
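In case it helps anyone else verifying, this is a small helper of my own (hypothetical, not part of the patch or either report) for reading the two counters from `/proc/vmstat`; on a fixed kernel the imbalance should drop back toward zero once pinned allocations are freed:

```c
/* Parse /proc/vmstat and report the pin counter imbalance (in pages). */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f) { perror("fopen /proc/vmstat"); return 1; }

    unsigned long long acquired = 0, released = 0, val;
    char key[64];
    while (fscanf(f, "%63s %llu", key, &val) == 2) {
        if (strcmp(key, "nr_foll_pin_acquired") == 0) acquired = val;
        else if (strcmp(key, "nr_foll_pin_released") == 0) released = val;
    }
    fclose(f);

    printf("nr_foll_pin_acquired: %llu\n", acquired);
    printf("nr_foll_pin_released: %llu\n", released);
    printf("still pinned (pages): %llu\n", acquired - released);
    return 0;
}
```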