After upgrading to kernel version 6.4.0 from 6.3.9, I noticed frequent but random crashes in a user space program. After a lot of reduction, I have come up with the following reproducer program:

$ uname -a
Linux jacob 6.4.1-gentoo #1 SMP PREEMPT_DYNAMIC Sat Jul 1 19:02:42 EDT 2023 x86_64 AMD Ryzen 9 7950X3D 16-Core Processor AuthenticAMD GNU/Linux
$ cat repro.c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/wait.h>
#include <unistd.h>

void *threadSafeAlloc(size_t n) {
    static size_t end_index = 0;
    static char buffer[1 << 25];
    size_t start_index = __atomic_load_n(&end_index, __ATOMIC_SEQ_CST);
    while (1) {
        if (start_index + n > sizeof(buffer)) _exit(1);
        if (__atomic_compare_exchange_n(&end_index, &start_index, start_index + n, 1,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
            return buffer + start_index;
    }
}

int thread(void *arg) {
    size_t i;
    size_t n = 1 << 7;
    char *items;
    (void)arg;
    while (1) {
        items = threadSafeAlloc(n);
        for (i = 0; i != n; i += 1) items[i] = '@';
        for (i = 0; i != n; i += 1) if (items[i] != '@') _exit(2);
    }
}

int main(void) {
    static size_t stacks[2][1 << 9];
    size_t i;
    for (i = 0; i != 2; i += 1)
        clone(&thread, &stacks[i] + 1, CLONE_THREAD | CLONE_VM | CLONE_SIGHAND, NULL);
    while (1) {
        if (fork() == 0) _exit(0);
        (void)wait(NULL);
    }
}
$ cc repro.c
$ ./a.out
$ echo $?
2

After tuning the various parameters for my computer, exit code 2, which indicates that memory corruption was detected, occurs approximately 99% of the time. Exit code 1, which occurs approximately 1% of the time, means it ran out of statically-allocated memory before reproducing the issue, and increasing the memory usage any more only leads to diminishing returns. There is also something like a 0.1% chance that it segfaults due to memory corruption elsewhere than in the statically-allocated buffer.
With this reproducer in hand, I was able to perform the following bisection:

git bisect start
# status: waiting for both good and bad commits
# bad: [6995e2de6891c724bfeb2db33d7b87775f913ad1] Linux 6.4
git bisect bad 6995e2de6891c724bfeb2db33d7b87775f913ad1
# status: waiting for good commit(s), bad commit known
# good: [457391b0380335d5e9a5babdec90ac53928b23b4] Linux 6.3
git bisect good 457391b0380335d5e9a5babdec90ac53928b23b4
# good: [d42b1c47570eb2ed818dc3fe94b2678124af109d] Merge tag 'devicetree-for-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
git bisect good d42b1c47570eb2ed818dc3fe94b2678124af109d
# bad: [58390c8ce1bddb6c623f62e7ed36383e7fa5c02f] Merge tag 'iommu-updates-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
git bisect bad 58390c8ce1bddb6c623f62e7ed36383e7fa5c02f
# good: [888d3c9f7f3ae44101a3fd76528d3dd6f96e9fd0] Merge tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
git bisect good 888d3c9f7f3ae44101a3fd76528d3dd6f96e9fd0
# bad: [86e98ed15b3e34460d1b3095bd119b6fac11841c] Merge tag 'cgroup-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
git bisect bad 86e98ed15b3e34460d1b3095bd119b6fac11841c
# bad: [7fa8a8ee9400fe8ec188426e40e481717bc5e924] Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect bad 7fa8a8ee9400fe8ec188426e40e481717bc5e924
# bad: [0120dd6e4e202e19a0e011e486fb2da40a5ea279] zram: make zram_bio_discard more self-contained
git bisect bad 0120dd6e4e202e19a0e011e486fb2da40a5ea279
# good: [fce0b4213edb960859dcc65ea414c8efb11948e1] mm/page_alloc: add helper for checking if check_pages_enabled
git bisect good fce0b4213edb960859dcc65ea414c8efb11948e1
# bad: [59f876fb9d68a4d8c20305d7a7a0daf4ee9478a8] mm: avoid passing 0 to __ffs()
git bisect bad 59f876fb9d68a4d8c20305d7a7a0daf4ee9478a8
# good: [0050d7f5ee532f92e8ab1efcec6547bfac527973] afs: split afs_pagecache_valid() out of afs_validate()
git bisect good 0050d7f5ee532f92e8ab1efcec6547bfac527973
# good: [2ac0af1b66e3b66307f53b1cc446514308ec466d] mm: fall back to mmap_lock if vma->anon_vma is not yet set
git bisect good 2ac0af1b66e3b66307f53b1cc446514308ec466d
# skip: [0d2ebf9c3f7822e7ba3e4792ea3b6b19aa2da34a] mm/mmap: free vm_area_struct without call_rcu in exit_mmap
git bisect skip 0d2ebf9c3f7822e7ba3e4792ea3b6b19aa2da34a
# skip: [70d4cbc80c88251de0a5b3e8df3275901f1fa99a] powerc/mm: try VMA lock-based page fault handling first
git bisect skip 70d4cbc80c88251de0a5b3e8df3275901f1fa99a
# good: [444eeb17437a0ef526c606e9141a415d3b7dfddd] mm: prevent userfaults to be handled under per-vma lock
git bisect good 444eeb17437a0ef526c606e9141a415d3b7dfddd
# bad: [e06f47a16573decc57498f2d02f9af3bb3e84cf2] s390/mm: try VMA lock-based page fault handling first
git bisect bad e06f47a16573decc57498f2d02f9af3bb3e84cf2
# skip: [0bff0aaea03e2a3ed6bfa302155cca8a432a1829] x86/mm: try VMA lock-based page fault handling first
git bisect skip 0bff0aaea03e2a3ed6bfa302155cca8a432a1829
# skip: [cd7f176aea5f5929a09a91c661a26912cc995d1b] arm64/mm: try VMA lock-based page fault handling first
git bisect skip cd7f176aea5f5929a09a91c661a26912cc995d1b
# good: [52f238653e452e0fda61e880f263a173d219acd1] mm: introduce per-VMA lock statistics
git bisect good 52f238653e452e0fda61e880f263a173d219acd1
# bad: [c7f8f31c00d187a2c71a241c7f2bd6aa102a4e6f] mm: separate vma->lock from vm_area_struct
git bisect bad c7f8f31c00d187a2c71a241c7f2bd6aa102a4e6f
# only skipped commits left to test
# possible first bad commit: [c7f8f31c00d187a2c71a241c7f2bd6aa102a4e6f] mm: separate vma->lock from vm_area_struct
# possible first bad commit: [0d2ebf9c3f7822e7ba3e4792ea3b6b19aa2da34a] mm/mmap: free vm_area_struct without call_rcu in exit_mmap
# possible first bad commit: [70d4cbc80c88251de0a5b3e8df3275901f1fa99a] powerc/mm: try VMA lock-based page fault handling first
# possible first bad commit: [cd7f176aea5f5929a09a91c661a26912cc995d1b] arm64/mm: try VMA lock-based page fault handling first
# possible first bad commit: [0bff0aaea03e2a3ed6bfa302155cca8a432a1829] x86/mm: try VMA lock-based page fault handling first

I do not usually see any kernel log output while running the program, just occasional logs about user space segfaults.
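For anyone who wants to check the exit-code rates quoted in the report on their own machine, a small harness along the following lines might help. This is only a sketch and not part of the original report: the names classify_status, tally_runs, print_tally and the outcome enum are hypothetical, and the only facts taken from the report are that exit code 2 means corruption was detected, exit code 1 means the static buffer ran out, and a signal means a crash elsewhere.

```c
/* Sketch: run a reproducer binary repeatedly and tally how each run ended.
 * All names here are hypothetical; only the meaning of exit codes 1 and 2
 * comes from the report. */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum outcome {
    OUTCOME_CORRUPTION, /* exit code 2: a byte read back was not '@' */
    OUTCOME_EXHAUSTED,  /* exit code 1: static buffer ran out first */
    OUTCOME_SIGNALED,   /* killed by a signal, e.g. SIGSEGV */
    OUTCOME_OTHER,      /* anything else, including exec failure */
    OUTCOME_COUNT
};

/* Map a status from waitpid() onto the outcomes described in the report. */
enum outcome classify_status(int status)
{
    if (WIFSIGNALED(status))
        return OUTCOME_SIGNALED;
    if (WIFEXITED(status)) {
        if (WEXITSTATUS(status) == 2)
            return OUTCOME_CORRUPTION;
        if (WEXITSTATUS(status) == 1)
            return OUTCOME_EXHAUSTED;
    }
    return OUTCOME_OTHER;
}

/* Run `path` `runs` times, adding one to counts[outcome] per run.
 * Returns 0 on success, -1 if fork() or waitpid() fails. */
int tally_runs(const char *path, int runs, unsigned counts[OUTCOME_COUNT])
{
    assert(runs > 0);
    for (int i = 0; i < runs; i++) {
        int status;
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            execl(path, path, (char *)NULL);
            _exit(127); /* exec failed; the parent counts this as OTHER */
        }
        if (waitpid(pid, &status, 0) != pid)
            return -1;
        counts[classify_status(status)]++;
    }
    return 0;
}

/* Print one line per outcome. */
void print_tally(const unsigned counts[OUTCOME_COUNT])
{
    printf("corruption detected (exit 2): %u\n", counts[OUTCOME_CORRUPTION]);
    printf("buffer exhausted (exit 1):    %u\n", counts[OUTCOME_EXHAUSTED]);
    printf("killed by signal:             %u\n", counts[OUTCOME_SIGNALED]);
    printf("other:                        %u\n", counts[OUTCOME_OTHER]);
}
```

Calling tally_runs("./a.out", 1000, counts) on a zero-initialized counts array and then print_tally(counts) from a small main() would give per-outcome totals to compare against the rough 99% / 1% / 0.1% split mentioned above. The actual proportions will depend on the machine and on the tuned parameters.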
Could you report this on the mailing list and CC the commit author?
(In reply to Sam James from comment #1)
> Could you report this on the mailing list and CC the commit author?

(see https://www.kernel.org/doc/html/v6.4/admin-guide/reporting-issues.html)
#regzbot monitor: https://lore.kernel.org/all/facbfec3-837a-51ed-85fa-31021c17d6ef@gmail.com
Might be related to https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/ - might be worth trying to revert the bad commit identified there.
(In reply to Michal Suchánek from comment #4)
> Might be related to
> https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
> - might be worth trying to revert the bad commit identified there.

I can confirm that v6.4 with 0bff0aaea03e2a3ed6bfa302155cca8a432a1829 reverted no longer causes any memory corruption with either my reproducer or the original program.
Temporary workaround fix posted at: https://lore.kernel.org/all/20230703182150.2193578-1-surenb@google.com/
(In reply to Holger Hoffstätte from comment #6)
> Temporary workaround fix posted at:
> https://lore.kernel.org/all/20230703182150.2193578-1-surenb@google.com/

<sigh> ..and of course it doesn't work as expected:
https://lore.kernel.org/all/c2cc745a-22f0-90df-59b0-2abd961cd829@redhat.com/
The discussion continues at https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/.
Everything I've seen suggests 6.4.3 should be fine.
(In reply to Michal Suchánek from comment #4)
> Might be related to
> https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/

(In reply to Sam James from comment #9)
> Everything I've seen suggests 6.4.3 should be fine.

Yes, go builds fine now too.
(In reply to Jiri Slaby from comment #10)
> (In reply to Michal Suchánek from comment #4)
> > Might be related to
> > https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
>
> (In reply to Sam James from comment #9)
> > Everything I've seen suggests 6.4.3 should be fine.
>
> Yes, go builds fine now too.

The build system I originally encountered this issue with also works again with CONFIG_PER_VMA_LOCK=y on 6.4.3.
I think this can be closed now.