Bug 215694

Summary: KernelShark: Locks up when plotting more than 16 CPUs
Product: Tools
Reporter: Steven Rostedt (rostedt)
Component: Trace-cmd/Kernelshark
Assignee: Default virtual assignee for Trace-cmd and kernelshark (tools_tracecmd_kernelshark)
Status: NEW
Severity: normal
CC: ykaradzhov
Priority: P1
Hardware: All
OS: Linux
Kernel Version: N/A
Subsystem:
Regression: No
Bisected commit-id:
Attachments: kernelshark: Release input_mutex on not finding record

Description Steven Rostedt 2022-03-16 15:24:36 UTC
Created attachment 300579
kernelshark: Release input_mutex on not finding record

I have a trace.dat file that has 35 CPUs with data in it (a trace from a box with 128 CPUs). When selecting another CPU over the 16 already selected, KernelShark hangs.

Actually, I looked into it and found that tepdata_get_latency() has a path where it will exit without releasing the stream->input_mutex.
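
The kind of bug described above can be sketched in isolation. This is a minimal illustration, not the KernelShark code: the names demo_stream, demo_read_at() and demo_get_latency() are hypothetical stand-ins, and the only point is that every exit path taken after the lock goes through a single unlock.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a stream object; not the KernelShark type. */
struct demo_stream {
	pthread_mutex_t input_mutex;
	const long long *records;	/* timestamps of the available records */
	size_t n_records;
};

/* Hypothetical lookup: returns NULL when no record exists at idx. */
static const long long *demo_read_at(const struct demo_stream *s, size_t idx)
{
	return idx < s->n_records ? &s->records[idx] : NULL;
}

/*
 * Stores the time delta between records a and b in *delta.
 * Returns 0 on success, -1 if either record cannot be found.
 */
int demo_get_latency(struct demo_stream *s, size_t a, size_t b,
		     long long *delta)
{
	const long long *ra, *rb;
	int ret = -1;

	pthread_mutex_lock(&s->input_mutex);

	ra = demo_read_at(s, a);
	rb = demo_read_at(s, b);
	if (!ra || !rb)
		goto out;	/* a bare "return" here would leak the lock */

	*delta = *rb - *ra;
	ret = 0;
out:
	pthread_mutex_unlock(&s->input_mutex);
	return ret;
}
```

With a bare return in the failure branch, the next caller that takes input_mutex would block forever, which would look like the hang reported here.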
Comment 1 Yordan Karadzhov 2022-03-17 16:47:23 UTC
Hi Steven,

I pushed your fix for the releasing of the mutex. This was definitely a bug in KernelShark; however, maybe we have to investigate a bit further and understand why 'tracecmd_read_at()' failed on this particular file. The fact that we discovered the mutex bug now means it never happened before.

thanks,
Y.
Comment 2 Steven Rostedt 2022-03-17 19:12:08 UTC
On Thu, 17 Mar 2022 16:47:23 +0000
bugzilla-daemon@kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=215694
> 
> Yordan Karadzhov (ykaradzhov@vmware.com) changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |ykaradzhov@vmware.com
> 
> --- Comment #1 from Yordan Karadzhov (ykaradzhov@vmware.com) ---
> Hi Steven,
> 
> I pushed your fix for the releasing of the mutex. This was definitely a bug
> in KernelShark; however, maybe we have to investigate a bit further and
> understand why 'tracecmd_read_at()' failed on this particular file. The fact
> that we discovered the mutex bug now means it never happened before.
> 

I believe it's a bug due to the compression. I added a breakpoint here,
and have tracked it down to:

tracecmd_read_at() {
   find_and_peek_event()

which has:

	/* find the cpu that this offset exists in */
	for (cpu = 0; cpu < handle->cpus; cpu++) {
		if (offset >= handle->cpu_data[cpu].file_offset &&
		    offset < handle->cpu_data[cpu].file_offset +
		    handle->cpu_data[cpu].file_size)
			break;
	}

And since all cpu_data[] appear to have the same file_offset of zero, it
can match any one!
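
The effect can be reproduced in a standalone sketch. The struct and function names below (demo_cpu_data, demo_find_cpu) are hypothetical and simplified; only the range test is copied from the loop quoted above.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-CPU bookkeeping. */
struct demo_cpu_data {
	unsigned long long file_offset;
	unsigned long long file_size;
};

/*
 * Same range test as the loop above: returns the first CPU whose
 * [file_offset, file_offset + file_size) range contains offset,
 * or ncpus if no range matches.
 */
int demo_find_cpu(const struct demo_cpu_data *cpu_data, int ncpus,
		  unsigned long long offset)
{
	int cpu;

	for (cpu = 0; cpu < ncpus; cpu++) {
		if (offset >= cpu_data[cpu].file_offset &&
		    offset < cpu_data[cpu].file_offset +
		    cpu_data[cpu].file_size)
			break;
	}
	return cpu;
}
```

With disjoint ranges such as {0,100}, {100,100}, {200,100}, offset 150 resolves to CPU 1. If every entry has file_offset 0, any offset below 100 resolves to CPU 0 no matter which CPU the record actually came from.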

Tzvetomir, I thought offsets were all unique. Is this not the case anymore?
If so, tracecmd_read_at() can never work, as it uses record->offset to
retrieve the record again.

-- Steve