Bug 218834 - [regression] [bisected] commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 corrupts data sent via pseudoterminal device
Summary: [regression] [bisected] commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 corru...
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Linux
Classification: Unclassified
Component: Kernel (show other bugs)
Hardware: All Linux
: P3 normal
Assignee: Virtual assignee for kernel bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-05-13 16:25 UTC by vkrevs
Modified: 2024-07-01 11:05 UTC (History)
1 user (show)

See Also:
Kernel Version:
Subsystem:
Regression: Yes
Bisected commit-id: 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0


Attachments
bisection logs and repro test program (2.48 MB, application/gzip)
2024-05-13 16:25 UTC, vkrevs
Details

Description vkrevs 2024-05-13 16:25:06 UTC
Created attachment 306290 [details]
bisection logs and repro test program

Hello. There appears to be a regression in pseudoterminal support introduced by commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0.

We have a wrapper program that uses a pseudoterminal to transform data fed into input/read from output of a child process. Transformed data is written to the master device, the child process, whose stdin/stdout are mapped to the slave device, reads data from parent by reading from its stdin, processes that data and sends it back to parent by writing the response to its stdout, and then data from the child process is read by the wrapper from pseudoterminal master descriptor.

This used to work fine on various Linux distros such as Amazon Linux2, SLES 12/15, openSUSE 15, RHEL 8/9 with pre 6.0 kernels. However, running our regression suite on Amazon Linux 2023 which uses a 6.x kernel revealed regression. Sometimes, when a lot of data is written by the parent process into the master device's descriptor, some data is lost or corrupted when the child process is reading it from the slave device's descriptor. We've verified that this is a kernel regression introduced via commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 by performing kernel bisection as described in https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html#introguide-bissbs. Bisection was performed using openSUSE 15.5 in VirtualBox (so the kernels had the tainted flag set) but the issue can also be reproduced outside of VirtualBox.

Bisection logs and a repro test case used to perform bisection are in the repro+bisect.zip attachment.

-----------------------------------------------------------------------------------
The test script, ptybug.sh, pipes a large plain text file containing 1027800 plain text records separated by newlines into the wrapper program. Each record is a 0-padded record number padded with spaces to 80 bytes.

The wrapper program sends the unmodified data read from its stdin to the child program via a pseudoterminal, reads the output from the child and prints it on its stdout. Wrapper's stdout is redirected to a file which is then compared with expected output.

The child process reads exactly 1000000 records from its stdin, strips off trailing whitespace, and prints the record on its stdout.

Both the wrapper and the child process mirror the data written into the pseudoterminal master descriptor/read from the pseudoterminal slave descriptor into separate files - which are identical on kernels prior to commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 and which differ on kernels that include that commit.

-----------------------------------------------------------------------------------
How to reproduce:

tar xvf repro+bisect.tar.zstd
cd repro
./ptybug.sh

Expected output (on 5.19.0-rc1* kernels prior to commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0)
rm -f ptybug ptybug_child gen  *.o
cc     ptybug.c   -o ptybug
cc     ptybug_child.c   -o ptybug_child
cc     gen.c   -o gen
Test run 1 ...
Test run 2 ...
Test run 3 ...
Test run 4 ...
Test run 5 ...
PASS

Actual/bad output (on 5.19.0-rc1* starting from commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 and later kernels, including latest)
rm -f ptybug ptybug_child gen  *.o
cc     ptybug.c   -o ptybug
cc     ptybug_child.c   -o ptybug_child
cc     gen.c   -o gen
Test run 1 ...
FAIL Actual and expected output does not match (gen1027800)
FAIL


To see the difference between data written by the master into the pseudoterminal master descriptor and data read by the child from the pseudoterminal slace descriptor (stripping off trailing spaces) run:
diff -u gen1027800_sysin.txt.nospace gen1027800_a_out.txt.copy |more

One possible output is below (every time you run this on a kernel starting from commit 6bb6fa6908ebd3cb4e14cd4f0ce272ec885d2eb0 the results are sligtly different). As you can see lines corresponding to records 17861-17862 have been erroneourly replaced with lines 17858-17859, and so on.

--- gen1027800_expected.txt     2024-04-26 14:39:22.287121535 +0100
+++ gen1027800_a_out.txt.copy   2024-05-13 15:52:48.915952662 +0100
@@ -17858,8 +17858,8 @@
 000000000000017858
 000000000000017859
 000000000000017860
-000000000000017861
-000000000000017862
+000000000000017858
+000000000000017859
 000000000000017863
 000000000000017864
 000000000000017865
@@ -51261,7 +51261,7 @@
 000000000000051261
 000000000000051262
 000000000000051263
-000000000000051264
+0000000000000512651264
 000000000000051265
 000000000000051266
 000000000000051267
@@ -104576,9 +104576,9 @@
 000000000000104576
 000000000000104577
 000000000000104578
-000000000000104579
-000000000000104580
-000000000000104581
+000000000000104576
+000000000000104577
+000000000000104578
 000000000000104582
 000000000000104583
 000000000000104584
@@ -110897,8 +110897,8 @@
 000000000000110897
 000000000000110898
 000000000000110899
-000000000000110900
-000000000000110901
+000000000000110898
+000000000000110899
 000000000000110902
 000000000000110903
 000000000000110904
@@ -279673,9 +279673,9 @@
 000000000000279673
 000000000000279674
 000000000000279675
-000000000000279676
-000000000000279677
-000000000000279678
Comment 1 Artem S. Tashkinov 2024-05-14 07:26:44 UTC
Find the person who made the commit.

Post your findings to LKML and CC that person.

This bug report will die here.
Comment 2 vkrevs 2024-05-14 08:22:13 UTC
(In reply to Artem S. Tashkinov from comment #1)
> Find the person who made the commit.
> 
> Post your findings to LKML and CC that person.
> 
> This bug report will die here.

Already done by someone else:
https://lore.kernel.org/lkml/ZkLv3wn62jqMVpSF@archie.me/T/#u
https://lore.kernel.org/regressions/ZkLv3wn62jqMVpSF@archie.me/T/#u
https://marc.info/?l=linux-serial&m=171566263310483&w=2
Comment 3 vkrevs 2024-05-14 09:27:39 UTC
In reply to:

stty -a -F /dev/pts/1
speed 38400 baud; rows 57; columns 217; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>; eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R; werase = ^W; lnext = ^V; discard = ^O; min = 1; time = 0;
-parenb -parodd -cmspar cs8 -hupcl -cstopb cread -clocal -crtscts
-ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl ixon ixoff -iuclc -ixany -imaxbel iutf8
opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt echoctl echoke -flusho -extproc
Comment 4 Artem S. Tashkinov 2024-05-17 19:15:08 UTC
Fixed by:

[PATCH] tty: n_tty: Fix buffer offsets when looked ahead is used

https://lore.kernel.org/lkml/20240514140429.12087-1-ilpo.jarvinen@linux.intel.com/T/

Should land in 6.9.2.
Comment 5 vkrevs 2024-07-01 11:05:23 UTC
The above patch landed in 6.9.6, 6.6.35, 6.1.95 stable kernels.
6.9.6: https://lwn.net/Articles/979263/
6.6.35: https://lwn.net/Articles/979262/
6.1.95: https://lwn.net/Articles/979261/

Note You need to log in before you can comment on or make changes to this bug.