Bug 217367 - trace-cmd emits corrupted v7 trace.dat files
Summary: trace-cmd emits corrupted v7 trace.dat files
Status: RESOLVED CODE_FIX
Alias: None
Product: Tools
Classification: Unclassified
Component: Trace-cmd/Kernelshark
Hardware: All Linux
Importance: P3 normal
Assignee: Default virtual assignee for Trace-cmd and kernelshark
 
Reported: 2023-04-25 16:48 UTC by Douglas RAILLARD
Modified: 2023-06-07 19:36 UTC
CC List: 2 users

Regression: No


Description Douglas RAILLARD 2023-04-25 16:48:55 UTC
The documentation states:

  BUFFER: id 3, size vary
      ...
      For each CPU of the above count:
         ...
         <8 bytes> long long unsigned integer, size of the trace data for this CPU.


For the last buffer, the size encoded in the BUFFER option is N but should be N + 4 (at least when using zstd compression; I haven't tried none or zlib yet).

For all the other buffers, the size reported in the option is correct.

The extra data needed is actually present in the file, so the issue is really just the BUFFER option containing a wrong value for the last buffer.

If the size is not adjusted, the following happens:
* It is not possible to read the last chunk, as the chunk header states N but the data can only provide N - 4 (since the span set by the option was too small).
* Even if we make the read succeed by reading 4 bytes less than asked, zstd fails to decompress the data.

I suppose the reading part of libtraceevent does not bounds-check the ring buffer itself, and the extra 4 bytes happen to always be there for zstd to consume. 

Fixing the encoding bug in libtraceevent without a version bump will make older files unreadable, as parsers will expect the extra 4 bytes to be available and will attempt to read past EOF. Therefore I would suggest amending the documentation to state that the size of the last buffer needs to be adjusted by 4.

I'm not sure whether the issue is tied to the last BUFFER option appearing in the header or to the last ring buffer appearing in the file, as they are the same in practice at the moment, but that would need to be clearly stated as well.
Comment 1 Douglas RAILLARD 2023-04-25 17:15:19 UTC
Note that it is entirely possible that all sizes are wrongly encoded, since page alignment adds lots of padding. Adding 4 to the size of every buffer never makes it overlap with the next section in my test file.
Comment 2 Douglas RAILLARD 2023-04-26 13:03:03 UTC
After a bit of investigation it seems that the issue is in libtracecmd:

tracecmd_compress_copy_from() writes the chunked compressed output. The value of its output parameter "write_size" is later recorded in the size field of the BUFFER option by out_write_cpu_data():

	endian8 = convert_endian_8(handle, data_files[i].write_size);
	if (do_write_check(handle, &endian8, 8))


Inside tracecmd_compress_copy_from(), the "wsize" variable is incremented for each chunk:

	/* data + compress header */
	wsize += (size + 8);

But it is not incremented at the beginning when writing the number of chunks:

	/* save the initial offset and write 0 as initial chunk count */
	offset = lseek(handle->fd, 0, SEEK_CUR);
	write_fd(handle->fd, &chunks, 4);

So if that is correct, the size read from the BUFFER option should be incremented by 4 when decoding, but only if compression is used.
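
To illustrate, my reading of the on-disk layout produced for each compressed per-CPU block is the following (the order of the two 4-byte fields in the per-chunk header is an assumption on my part, not something verified here):

  <4 bytes> unsigned integer, number of chunks (not counted in "wsize")
  For each chunk:
     <4 bytes> compressed size of the chunk data
     <4 bytes> uncompressed size of the chunk data
     <data> compressed chunk data

Each chunk contributes "size + 8" to "wsize", so the 4-byte chunk count at the start is the part missing from the reported size.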
Comment 3 Tzvetomir Stoyanov (VMware) 2023-04-26 13:52:46 UTC
You are right, the actual size in the case of a compressed BUFFER is +4 bytes. I think this can be fixed in tracecmd_compress_copy_from() by just adding 4 bytes to wsize at the beginning, before the loop. This should not break the existing readers, and there is no need to bump the version. I think that Steven has tests for v7 files and can verify that change. Do you want to submit a patch?
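
For illustration, the suggested change could look roughly like this inside tracecmd_compress_copy_from() (a sketch of the suggestion above, not the patch that was eventually merged):

	/* save the initial offset and write 0 as initial chunk count */
	offset = lseek(handle->fd, 0, SEEK_CUR);
	write_fd(handle->fd, &chunks, 4);
	/* account for the 4-byte chunk count in the reported write size */
	wsize += 4;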
Comment 4 Douglas RAILLARD 2023-04-26 14:08:05 UTC
> This should not break the existing readers

It _will_ break existing readers. If there is no way of telling whether the value is fixed or not (e.g. a bump in the version number or an extra flag), how can we know the actual size of the data?

The consumer code I have is forced to add the +4 manually, and uses that to make a reader for each buffer out of the main reader. If I make a sub-reader of N + 4 bytes where the file/memory buffer/mmapped file only contains N, this _will_ error, rightfully, in the same way it errors now without the workaround. For any buffer that is not the last one, it will also make them possibly overlap and lead the code to think that there is one more chunk that is corrupted (only the first 4 bytes of that "last chunk" would be visible).

Please don't introduce a change that will force downstream code into broken bounds checking. I'd much rather have a documentation update, especially since the condition is easy to document (+4 if compression != none). Otherwise the documentation will have to state "the size might need to be increased by 4, or by 0, based on unobservable state".
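
For illustration, the documented rule could then be applied on the reader side with something as small as this (adjust_cpu_data_size() is a made-up helper for this sketch, not part of any existing API):

        /* Sketch of the reader-side adjustment: the per-CPU size stored in
         * the BUFFER option is 4 bytes short of the data actually written
         * when compression is used, because the initial 4-byte chunk count
         * is not accounted for by the writer. */
        static unsigned long long adjust_cpu_data_size(unsigned long long option_size,
                                                       int compressed)
        {
                return compressed ? option_size + 4 : option_size;
        }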

Note that this bug was found while implementing the Rust trace.dat parser, so the situation I described is not hypothetical.
Comment 5 Douglas RAILLARD 2023-04-26 14:11:01 UTC
Well, all that unless you are ready to declare all currently created trace.dat v7 files corrupted and unsupported in any future revisions of the tools, starting from the point where the change to libtracecmd is merged. That would mostly work for our use case, as we don't have that much data in v7 format anyway, but it's not ideal.
Comment 6 Steven Rostedt 2023-04-26 14:30:31 UTC
I haven't taken a full look at the issue, but one solution (to avoid having to bump to v8) is to add a trace option (the flag you mention) that states that this file has the updated changes. That is, if the trace.dat file has the fix, then the option will exist to tell trace-cmd to read it differently. This means we also need to keep the old way of reading the trace.dat file.
Comment 7 Steven Rostedt 2023-05-31 08:51:50 UTC
I don't even know why wsize was calculated. We do an lseek there, and could have simply done:

        /* write chunks count */
        write_fd(handle->fd, &chunks, 4);
        end_offset = lseek(handle->fd, 0, SEEK_END);
        if (end_offset == (off_t)-1)
                return -1;

        if (read_size)
                *read_size = rsize;
        if (write_size)
                *write_size = end_offset - offset;

And not bothered with a wsize at all.

Now, actually, all trace.dat files that currently use compression are corrupted by this. They do not match the documentation. So fixing it may make your reader unable to read older trace.dat files, but that's because the implementation was broken, not because v7 is broken. As you pointed out, the documentation says one thing, but the implementation does something else. So I'm for fixing it by changing the result of tracecmd_compress_copy_from(). I tried it, and older trace-cmd can still read it, even if it is by sheer luck. I guarantee backward compatibility of the libraries and trace-cmd, but not for external parsers. You could either just consider all current v7 compressed files broken, or you could check whether a file has this bug by comparing the recorded offset with the next offset.

I know you were making a libtraceevent Rust library, but I guess you are now making a libtracecmd Rust library. Unfortunately, the only answer I have here is to fix it as Tzvetomir suggested.

Anyway, I'm going to fix it by having it return the correct write size (+4), as libtracecmd can handle this. If I could see what code you have, I could possibly suggest workarounds to know whether you need to add the +4 or not.
Comment 8 Steven Rostedt 2023-05-31 09:50:36 UTC
Looking more into the code, the reason trace-cmd doesn't break with this is that it doesn't rely on the size of the section when it reads the compressed portion. It gets the offset from the section header, then looks at the chunks and reads the file based on the chunk headers. These are correct, and everything "just works".

init_cpu_zpage() gets the offset from cpu_data->file_offset, sets the file location (lseek), and then calls tracecmd_load_chunks_info() to read the chunks. This logic only looks at the chunk sizes. I put in code to see if it goes beyond the file_size of the cpu_data, and sure enough it does.

Could your parser do the same? This would make it work just like libtracecmd does.
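
For illustration, a chunk-driven reader could look roughly like this; read_u32(), read_bytes() and decompress_chunk() are made-up helpers for this sketch, and the field order inside the 8-byte chunk header is assumed, not verified:

        #include <stdint.h>

        /* Sketch: walk the compressed per-CPU data using the chunk headers
         * themselves rather than trusting the size from the BUFFER option.
         * The fd is assumed to already be positioned at the per-CPU data
         * offset taken from the section header. */
        static void read_cpu_chunks(int fd)
        {
                uint32_t chunks = read_u32(fd);              /* 4-byte chunk count */

                for (uint32_t i = 0; i < chunks; i++) {
                        uint32_t comp_size   = read_u32(fd); /* compressed chunk size */
                        uint32_t uncomp_size = read_u32(fd); /* uncompressed chunk size */
                        void *buf = read_bytes(fd, comp_size);

                        decompress_chunk(buf, comp_size, uncomp_size);
                }
        }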
Comment 9 Douglas RAILLARD 2023-05-31 10:41:00 UTC
It does work if I pass no size information for the compressed flyrecord, as my code will also count chunks and return EOF when the count reaches 0.

I think I'll leave a link to this issue and a comment in the code explaining why the size is ignored for compressed buffers, and if someone stumbles upon that in 10 years they can just start using the size. No one will probably care about old traces at that point. It's a shame we lose bounds checking for the chunk reading code, but it's not the end of the world.
Comment 10 Steven Rostedt 2023-06-01 03:56:28 UTC
You could make sure that the compressed sections are always within the size + 4. It shouldn't hurt if the +4 isn't needed, right?
Comment 11 Douglas RAILLARD 2023-06-01 09:41:01 UTC
It's not really possible. The size will be used for the resulting slice, but it's also used by the code that loads the data into memory. If you ask for +4 on the last section, you will be met with an error, as the offset will be past the end of the file. The only sane options are:
1) We document the format as implemented by trace-cmd, which means that if compression is used we need to do +4, +0 otherwise.
2) We fix the issue in trace-cmd and declare existing files corrupted. Then the decision to support them belongs to the parser implementer. In our case, we can do that easily by dropping a bounds check.

I'd much rather lose the check than introduce some sort of fuzzy check that will result in issues like this one down the line: https://bugzilla.kernel.org/show_bug.cgi?id=210173
Comment 12 Steven Rostedt 2023-06-07 19:36:19 UTC
Fixed by https://git.kernel.org/pub/scm/utils/trace-cmd/trace-cmd.git/commit/?id=4b9213298599e91682129294cd317ec780785947

("libtracecmd: Fix tracecmd_compress_copy_from() write size return")
