In my environment, there is a low-probability coredump when running the `timeout -k 9 5 perf top --fields overhead,cpu,com,symbol` command. I reproduced it on both the stable-5.10 and stable-6.6 kernel branches.

(gdb) bt
#0  __strcmp_evex () at ../sysdeps/x86_64/multiarch/strcmp-evex.S:314
#1  0x0000560fa0ef3904 in sort__comm_collapse () at util/sort.c:202
#2  0x0000560fa0efb9ec in hist_entry__collapse (left=left@entry=0x7f5bf4000fd0, right=right@entry=0x7f5bf401d180) at util/hist.c:1312
#3  0x0000560fa0efc5ce in hists__collapse_insert_entry (he=0x7f5bf401d180, root=0x560fdadf6cb8, hists=<optimized out>) at util/hist.c:1620
#4  hists__collapse_resort (hists=hists@entry=0x560fdadf6c80, prog=prog@entry=0x0) at util/hist.c:1704
#5  0x0000560fa0dd99e2 in perf_top__resort_hists (t=t@entry=0x7ffc542dba40) at builtin-top.c:303
#6  0x0000560fa0ddb36f in perf_top__print_sym_table (top=0x7ffc542dba40) at builtin-top.c:350
#7  display_thread (arg=0x7ffc542dba40) at builtin-top.c:700
#8  0x00007f5c0ecabf46 in start_thread (arg=<optimized out>) at pthread_create.c:444
#9  0x00007f5c0ed2b20c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

(gdb) info register
rax            0x69400000          1765801984
rbx            0x7f5bf4019680      140032912496256
rcx            0x560fdadf6c90      94626096573584
rdx            0x0                 0
rsi            0x7f5bf4019680      140032912496256
rdi            0x7f5bf4294         34187723412
rbp            0x7f5c0c734b40      0x7f5c0c734b40
...

Register rdi (right->comm->comm_str->str) is invalid. I suspect the coredump is caused by concurrent reads and writes of comm_str->str:

read:
  display_thread
    perf_top__print_sym_table
      perf_top__resort_hists
        ...
          hist_entry__collapse
            comm__str

write:
  process_thread
    ordered_events__flush
      ...
        __thread__set_comm
          comm__override

In addition, I also tested on the master branch, but could not reproduce it there. I suspect the newer code on master only changes the timing; from reading the code, I believe the problem still exists. Thanks.
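To make the suspected race concrete, here is how the two paths look in the stable-5.10 sources as I read them (abbreviated from tools/perf/util/comm.c; the comments are mine, and the exact code differs between branches):

/* Reader: reached from hist_entry__collapse() on display_thread,
 * with no lock protecting comm->comm_str: */
const char *comm__str(const struct comm *comm)
{
        return comm->comm_str->str;
}

/* Writer: reached from ____thread__set_comm() on process_thread.
 * The old comm_str is put before the pointer is swapped, and dropping
 * the last reference frees the string the reader may still be comparing: */
int comm__override(struct comm *comm, const char *str, u64 timestamp, bool exec)
{
        struct comm_str *new, *old = comm->comm_str;

        new = comm_str__findnew(str, &comm_str_root);
        if (!new)
                return -ENOMEM;

        comm_str__put(old);     /* last put frees old->str */
        comm->comm_str = new;   /* unsynchronized with the reader above */
        comm->start = timestamp;
        if (exec)
                comm->exec = true;

        return 0;
}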
Created attachment 308147 [details]
perf binary compiled with libasan, based on the 5.10 kernel
I can't reproduce it on the master branch, but the patch you sent is better suited to the master branch, so I need more time to adapt it to the stable-5.10 branch. Also, I went through the code today, but I still haven't confirmed how the two threads avoid concurrent reads and writes, so is the problem logically still there? Looking forward to your reply. Thank you.
My patch just adds a missing read lock:
https://lore.kernel.org/lkml/20250519224645.1810891-1-irogers@google.com/

My understanding is that a struct comm is associated with, and owned by, a struct thread. There may be multiple comm events, and those are held in a linked list on the thread. Each struct comm mainly holds a string, comm_str, which is a reference-counted string used to de-duplicate comm strings (the memory saving from this may not be worth the energy). Generally the comm_str never changes for a given struct comm; the exception is when a placeholder value is first set. The comm_set bool on thread tracks this; it is not my code and I'm not quite following its logic.

Changes to the list of struct comm on a thread (comm_list) must be done under the comm_lock; my changes help enforce this at compile time with -Wthread-safety. The only call that changes the comm_str in a comm (comm__override) is made in ____thread__set_comm, and this is done with thread__comm_lock held exclusively. There could still be a use of a struct comm while comm__override is called on that same struct comm, because the lock guards the list and not the list members.

Things that could be better: the thread owns the struct comm, but as the comm is passed around without the thread, this ownership connection is lost and there is a possible use-after-free; address sanitizer (-fsanitize=address) should help catch this. The histogram code has had poor memory-safety discipline [1] and this appears to be another crash caused by it. Longer term I think this kind of code probably belongs in python: the C UI code is massively clunky, and improvements to the python API mean the code can be clean and free from memory misuse [2].

The crash certainly looks like a use-after-free, so please pursue -fsanitize=address, which should be fast enough for the bug to reproduce and to gather the necessary data.

[1] https://lore.kernel.org/r/20250307061250.320849-2-namhyung@kernel.org
[2] https://lore.kernel.org/lkml/20250519195148.1708988-8-irogers@google.com/
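For anyone following along, the reference-counted, de-duplicated string being discussed looks roughly like this on stable-5.10 (abbreviated from tools/perf/util/comm.c and comm.h; master has since reworked parts of this, and the comments are mine):

struct comm_str {
        char *str;              /* the comm name itself */
        struct rb_node rb_node; /* node in the global comm_str_root tree, keyed by str */
        refcount_t refcnt;      /* shared by every struct comm using this name */
};

struct comm {
        struct comm_str *comm_str;      /* only ever swapped by comm__override() */
        u64 start;
        struct list_head list;          /* lives on thread->comm_list under comm_lock */
        bool exec;
};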
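The shape of the fix is to give readers the shared side of the same lock that the writers already hold exclusively. A rough sketch, not the literal patch (see the lore link above for the real change; the reader below is simplified):

/* Writers already take the lock exclusively: */
static int __thread__set_comm(struct thread *thread, const char *str,
                              u64 timestamp, bool exec)
{
        int ret;

        down_write(thread__comm_lock(thread));
        ret = ____thread__set_comm(thread, str, timestamp, exec);
        up_write(thread__comm_lock(thread));

        return ret;
}

/* The missing piece was the shared lock on the read side, roughly: */
const char *thread__comm_str(struct thread *thread)
{
        const char *str;

        down_read(thread__comm_lock(thread));
        str = comm__str(thread__comm(thread));
        up_read(thread__comm_lock(thread));

        return str;
}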