Bug 218937

Summary: TCP connection frozen on sender and receiver. No retries beyond 1.
Product: Networking Reporter: joyson (joysonanuit)
Component: Other Assignee: Stephen Hemminger (stephen)
Status: NEW ---    
Severity: normal    
Priority: P3    
Hardware: All   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:
Attachments: sender pcap screenshot
receiver pcap screenshot
attachment-22439-0.html

Description joyson 2024-06-05 06:26:40 UTC
Created attachment 306413 [details]
sender pcap screenshot

Hi,

I am facing an issue with TCP. At a random point during a transfer, the sender stops retrying and the receiver stops ACKing. Our previous kernel was 2.6 and the current kernel is 5.4. The sequence of events is as follows.

sender:
sends a few packets of data.
misses a few ACKs.
retries.
does not get an ACK.
stops.

receiver:
receives the packets.
ACKs only a few of them.
does not ACK the remaining packets.

There is a timeout at the receiver end which forces the socket to be closed.
When this erroneous socket reaches the end of the timeout, it sends a FIN that ACKs all the data it has received (including the packets it never ACKed and the sender was waiting on).

The sender answers this FIN with an RST.
Comment 1 joyson 2024-06-05 06:27:44 UTC
Created attachment 306414 [details]
receiver pcap screenshot
Comment 2 Artem S. Tashkinov 2024-06-05 10:45:20 UTC
Is this reproducible in 6.9.3 or 6.6.32? It's highly unlikely anyone will help unless you run something more supported/modern.
Comment 3 joyson 2024-06-05 11:15:18 UTC
Thanks, Artem.
This is for a company's commercial product, so the kernel is fixed at 5.4 and we cannot try a newer one. That is also why I attached screenshots of the pcap: I cannot share the full pcap for confidentiality reasons.
We would mostly like to understand the possible causes and how to fix them, and whether any optimisations/changes between 2.6 and 5.4 are playing a role.
Comment 4 Stephen Hemminger 2024-06-05 16:06:05 UTC
Are both the sender and receiver Linux? Are there any middle boxes or firewalls in the way?

It looks like there might be an MTU mismatch or non-functional TSO in the NIC.
The packet that gets stuck is larger than 1500.
Comment 5 joyson 2024-06-07 06:50:33 UTC
Created attachment 306435 [details]
attachment-22439-0.html

Sender and receiver are both Linux. MTU on both sides is 7020.
In the same setup, if we run our app on 2.6.38, we do not see this issue.
We see it only when we run the same app on 5.4.254.
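
For anyone who wants to confirm the MTU from code rather than from the command line, here is a minimal C sketch using the SIOCGIFMTU ioctl. The helper name iface_mtu is hypothetical and not part of our app; the interface name passed in is whatever the link under test is called.

```c
#define _DEFAULT_SOURCE          /* expose struct ifreq fields under strict -std= modes */
#include <assert.h>
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: return the MTU of the named interface, or -1 on error. */
int iface_mtu(const char *ifname)
{
    struct ifreq ifr;
    int fd, rc;

    assert(ifname != NULL);

    /* Any socket works as a handle for interface ioctls. */
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);   /* stays NUL-terminated */

    rc = ioctl(fd, SIOCGIFMTU, &ifr);
    close(fd);
    return rc < 0 ? -1 : ifr.ifr_mtu;
}
```

Calling iface_mtu() with the interface name on both hosts should return the same value if the MTUs really match.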

Comment 6 joyson 2024-06-10 10:03:51 UTC
Some updates on this.

We tried a few things, and the last two showed good results.

The code was like this:

sender: sends all data, then calls close(fd)
receiver: reads all data and, when read() returns 0, calls close(fd)

If close() is commented out at the sender, there is no socket freeze or data loss.
If close() is replaced with shutdown(), there is no socket freeze or data loss.

It looks like close() on the sender was the problem.
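
A minimal sketch of the shutdown-based variant, for reference. It assumes shutdown(fd, SHUT_WR) is the call that replaces close() and that the sender closes only after it has seen the receiver's FIN; the comment above does not say which shutdown mode was used. All function names here (send_all, drain_until_eof, shutdown_demo) are hypothetical, and the demo runs over loopback in a single process with a small payload, so it only illustrates the call order and does not reproduce the freeze.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes. Returns 0 or -1. */
static int send_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read until the peer's FIN shows up as read() == 0. Returns bytes read or -1. */
static long drain_until_eof(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n < 0)
            return -1;
        if (n == 0)
            return total;
        total += n;
    }
}

/* Hypothetical loopback demo: returns the bytes the receiver saw, or -1. */
long shutdown_demo(size_t len)
{
    char payload[4096];
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    int lfd, sfd = -1, rfd = -1;
    long got = -1;

    /* Keep the payload small so one process can play both roles without blocking. */
    assert(len <= sizeof(payload));
    memset(payload, 'x', sizeof(payload));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* let the kernel pick a free port */

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0)
        goto out;

    sfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sfd < 0 || connect(sfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    rfd = accept(lfd, NULL, NULL);
    if (rfd < 0)
        goto out;

    /* Sender: send everything, then half-close instead of calling close(). */
    if (send_all(sfd, payload, len) < 0 || shutdown(sfd, SHUT_WR) < 0)
        goto out;

    /* Receiver: read until read() returns 0, then close. */
    got = drain_until_eof(rfd);
    close(rfd);
    rfd = -1;

    /* Sender: wait for the receiver's FIN (EOF) before the final close(). */
    if (drain_until_eof(sfd) != 0)
        got = -1;
out:
    if (rfd >= 0)
        close(rfd);
    if (sfd >= 0)
        close(sfd);
    close(lfd);
    return got;
}
```

With this ordering the sender's socket stays open until the receiver has read everything and closed its own side, which matches the two variants that worked for us.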