Bug 215694 - KernelShark: Locks up when plotting more than 16 CPUs
Status: NEW
Alias: None
Product: Tools
Classification: Unclassified
Component: Trace-cmd/Kernelshark
Hardware: All Linux
Importance: P1 normal
Assignee: Default virtual assignee for Trace-cmd and kernelshark
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-03-16 15:24 UTC by Steven Rostedt
Modified: 2022-03-17 19:12 UTC

See Also:
Kernel Version: N/A
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernelshark: Release input_mutex on not finding record (1.09 KB, application/mbox)
2022-03-16 15:24 UTC, Steven Rostedt

Description Steven Rostedt 2022-03-16 15:24:36 UTC
Created attachment 300579
kernelshark: Release input_mutex on not finding record

I have a trace.dat file with data from 35 CPUs (a trace taken on a box with 128 CPUs). When I select one more CPU to plot beyond the 16 already selected, KernelShark hangs.

I looked into it and found that tepdata_get_latency() has a path where it returns without releasing stream->input_mutex.
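
The pattern behind this kind of hang can be sketched as follows. This is only an illustration, not the attached patch: `demo_stream`, `demo_record`, `demo_read_at()` and `demo_get_latency()` are hypothetical stand-ins, and the only thing taken from the report is that a lookup can fail while `input_mutex` is held. Routing every exit through a single unlock label makes it hard to leave the mutex locked on the failure path.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real KernelShark types. */
struct demo_stream {
	pthread_mutex_t input_mutex;
};

struct demo_record {
	long long latency;
};

/* Hypothetical lookup that can fail, like a read at a bad offset. */
static struct demo_record *demo_read_at(struct demo_record *table,
					size_t n, size_t idx)
{
	return idx < n ? &table[idx] : NULL;
}

/* Returns 0 and fills *latency on success, -1 if no record was found. */
int demo_get_latency(struct demo_stream *stream, struct demo_record *table,
		     size_t n, size_t idx, long long *latency)
{
	struct demo_record *rec;
	int ret = -1;

	pthread_mutex_lock(&stream->input_mutex);

	rec = demo_read_at(table, n, idx);
	if (!rec)
		goto out;	/* the failure path must also unlock */

	*latency = rec->latency;
	ret = 0;
out:
	pthread_mutex_unlock(&stream->input_mutex);
	return ret;
}
```

If the `goto out` were a plain `return -1`, the next caller to take `input_mutex` would block forever, which matches the observed lock-up.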
Comment 1 Yordan Karadzhov 2022-03-17 16:47:23 UTC
Hi Steven,

I pushed your fix for releasing the mutex. This was definitely a bug in KernelShark. However, we may need to investigate a bit further and understand why 'tracecmd_read_at()' failed on this particular file. The fact that we only discovered the mutex bug now suggests that this failure had never happened before.

thanks,
Y.
Comment 2 Steven Rostedt 2022-03-17 19:12:08 UTC

I believe it's a bug caused by compression. I added a breakpoint and
tracked it down to:

tracecmd_read_at() {
   find_and_peek_event()

which has:

	/* find the cpu that this offset exists in */
	for (cpu = 0; cpu < handle->cpus; cpu++) {
		if (offset >= handle->cpu_data[cpu].file_offset &&
		    offset < handle->cpu_data[cpu].file_offset +
		    handle->cpu_data[cpu].file_size)
			break;
	}

And since all cpu_data[] appear to have the same file_offset of zero, it
can match any one!
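
A minimal sketch of why this breaks the lookup, assuming a reduced, hypothetical `demo_cpu_data` struct and a `demo_find_cpu()` helper that applies the same range test as the loop above (neither is the real trace-cmd code):

```c
/* Hypothetical reduced version of the per-CPU bookkeeping. */
struct demo_cpu_data {
	unsigned long long file_offset;
	unsigned long long file_size;
};

/* Same range test as the quoted loop; returns the first match, or -1. */
int demo_find_cpu(const struct demo_cpu_data *cpu_data, int cpus,
		  unsigned long long offset)
{
	int cpu;

	for (cpu = 0; cpu < cpus; cpu++) {
		if (offset >= cpu_data[cpu].file_offset &&
		    offset < cpu_data[cpu].file_offset +
		    cpu_data[cpu].file_size)
			return cpu;
	}
	return -1;
}
```

With disjoint ranges such as {0, 100}, {100, 100}, {200, 100}, offset 150 resolves only to CPU 1. If every file_offset is zero, for example {0, 100}, {0, 200}, {0, 300}, then offset 50 falls inside all three ranges and the loop returns CPU 0 no matter which CPU the record actually came from.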

Tzvetomir, I thought the offsets were all unique. Is that no longer the case?
If so, tracecmd_read_at() can never work, as it uses record->offset to
retrieve the record again.

-- Steve
