Bug 220096 - perf top command probability coredump
Summary: perf top command probability coredump
Status: NEW
Alias: None
Product: Linux
Classification: Unclassified
Component: Kernel (show other bugs)
Hardware: All Linux
: P3 high
Assignee: Tools.Other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-05-07 02:12 UTC by Fei Lang
Modified: 2025-05-20 16:13 UTC (History)
2 users (show)

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Compile the perf binary with libasan based on the 5.10 kernel. (15.23 KB, text/plain)
2025-05-20 09:32 UTC, Fei Lang
Details

Description Fei Lang 2025-05-07 02:12:14 UTC
In my env,low probability coredump when run the `timeout -k 9 5 perf top --fields overhead,cpu,com,symbol` command. I reproduced it in both the stable-5.10 and stable-6.6 kernel branches.

(gdb) bt
#0  __strcmp_evex () at ../sysdeps/x86_64/multiarch/strcmp-evex.S:314
#1  0x0000560fa0ef3904 in sort.comm_collapse () at util/sort.c:202
#2  0x0000560fa0efb9ec in hist_entry__collapse (left=left@entry=0x7f5bf4000fd0, right=right@entry=0x7f5bf401d180) at util/hist.c:1312
#3  0x0000560fa0efc5ce in hists__collapse_insert_entry (he=0x7f5bf401d180, root=0x560fdadf6cb8, hists=<optimized out>) at util/hist.c:1620
#4  hists__collapse_resort (hists=hists@entry=0x560fdadf6c80, prog=prog@entry=0x0) at util/hist.c:1704
#5  0x0000560fa0dd99e2 in perf_top__resort_hists (t=t@entry=0x7ffc542dba40) at builtin-top.c:303
#6  0x0000560fa0ddb36f in perf_top__print_sym_table (top=0x7ffc542dba40) at builtin-top.c:350
#7  display_thread (arg=0x7ffc542dba40) at builtin-top.c:700
#8  0x00007f5c0ecabf46 in start_thread (arg=<optimized out>) at pthread_create.c:444
#9  0x00007f5c0ed2b20c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

(gdb) info register
rax            0x69400000          1765801984
rbx            0x7f5bf4019680      140032912496256
rcx            0x560fdadf6c90      94626096573584
rdx            0x0                 0
rsi            0x7f5bf4019680      140032912496256
rdi            0x7f5bf4294         34187723412
rbp            0x7f5c0c734b40      0x7f5c0c734b40
...

Register rdi(right->comm->comm_str->str) is invalid.

I guess the coredump is caused by concurrent reads and writes comm_str->str.

read:
display_thread
  perf_top__print_sym_table
    perf_top__resort_hists
     ...
       hist_entry__collapse
         comm__str
write:
process_thread
  ordered_events__flush
   ...
     __thread__set_comm
       comm__override

  In addition,I also tested it in the master branch,but it didn't reproduce it.I guess the latest code of master branch may affect the time sequence.By analyzing the code,I think the problem still exists.

Thanks.
Comment 1 Fei Lang 2025-05-20 09:32:54 UTC
Created attachment 308147 [details]
Compile the perf binary with libasan based on the 5.10 kernel.
Comment 2 Fei Lang 2025-05-20 13:02:57 UTC
  I cann't reproduce in the master branch,but the patch you sent is more suitable for the master branch. So i need more time to adapt to the stable-5.10 branch.
  Also, I went through the code today, but I still haven't confirmed how two threads avoid the problem of concurrent reads and writes, so is the problem logically still there?
  Look forward to your reply. Thank you.
Comment 3 Ian Rogers 2025-05-20 16:13:50 UTC
My patch just adds a missing read lock:
https://lore.kernel.org/lkml/20250519224645.1810891-1-irogers@google.com/

My understanding is that the struct comm is associated/owned by a struct thread. There may be multiple comm events and those are held in a link list on the thread. Each struct comm is mainly holding a string, comm_str, which is a reference counted string done to de-duplicate comm strings - the memory saving of this may not be worth the energy. Generally the comm_str doesn't ever change for a struct comm, the exception is when a place holder value is first set. The comm_set bool on thread is tracking this - not my code and I'm not quite following its logic. Changes to the list of struct comm in thread (comm_list) must be done under the comm_lock, my changes help enforce this at compile time with -Wthread-safety. The only call to change the comm_str in comm (comm__override) is made in ____thread__set_comm and this is done with thread__comm_lock exclusively held. There could be a use of a struct comm and comm__override be called for the same struct comm, the lock is guarding the list and not the list members.

Things that can be better. The thread owns struct comm but as the comm is passed around without the thread, this connection is lost and there is a possible use after free - address sanitizer (-fsanitize=address) should help catch this. The histogram code has had poor memory safety discipline [1] and this appears to be another crash caused by it. Longer term I think this kind of stuff probably belongs being written in python. The C UI stuff is massively clunky and improvements to the python API mean the code can be clean and free from memory misuse [2].

The crash certainly looks like a use-after-free so please pursue -fsanitize=address which should be fast enough for the bug to reproduce and to gather the necessary data.

[1] https://lore.kernel.org/r/20250307061250.320849-2-namhyung@kernel.org
[2] https://lore.kernel.org/lkml/20250519195148.1708988-8-irogers@google.com/

Note You need to log in before you can comment on or make changes to this bug.