Bug 202997

Summary: High UDP traffic results in packet receive errors and system-wide UDP failure
Product: Networking    Reporter: sagar
Component: IPV4    Assignee: Stephen Hemminger (stephen)
Status: RESOLVED INVALID    
Severity: normal    
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.18.19-041819-generic    Subsystem:
Regression: No    Bisected commit-id:
Attachments: C file that demonstrates the problem

Description sagar 2019-03-21 22:42:41 UTC
Created attachment 281959 [details]
C file that demonstrates the problem

**OS**: Ubuntu 18.04, Ubuntu 18.10
**Kernel**: 4.18.18, 5.0.3
**Hardware**: Razer Blade 15 2018, Google Cloud Platform


Repeatedly sending large (65 KB) datagrams to a UDP socket seemingly depletes the kernel buffers. This results in numerous "packet receive errors" in netstat and /proc/net/udp (buffer errors, however, are not incremented). While the receiver sockets are left open, no other UDP traffic is processed (for example, web browsing stops working).

The attached test, client.c, demonstrates this failure. It intentionally uses waits/whiles to make the failure more evident. The only way to recover UDP functionality is to kill the test. 

Alternatively, closing the receiver socket and rebinding it on every iteration mitigates this system-wide failure, and the test can remain running. Lines 59-61 in client.c show this.


Neither the sender nor the receiver socket reports any errors (even when the test is modified to call recv).


I also ran this on the Windows Subsystem for Linux without any trouble.
Comment 1 Stephen Hemminger 2019-03-22 15:34:42 UTC
Packets greater than 1440 bytes end up getting fragmented. Fragmentation is unreliable in the face of packet loss: if one fragment is dropped, the other parts of the datagram being reassembled must be held until a timeout.

There are changes in recent kernels that make fragment handling more strict and disallow some types of fragmentation attacks.

WSL runs slower, so the sender will be slower.
Comment 2 sagar 2019-03-22 16:13:34 UTC
> If one packet is dropped the other parts of the reassembled data must be held around until timeout.

What defines this timeout? Right now it seems like the only way to recover is to close the receiving socket (or terminate the process).
Comment 3 sagar 2019-03-22 18:12:58 UTC
Hey Stephen, just following up. The system-wide failure only occurs when I increase the rmem and wmem values by running the following commands. 

sysctl net.core.rmem_max 2>/dev/null 1>/dev/null && sudo sysctl -w net.core.rmem_max=1610612736 1>/dev/null 2>/dev/null

sysctl net.core.rmem_default 2>/dev/null 1>/dev/null && sudo sysctl -w net.core.rmem_default=1610612736 1>/dev/null 2>/dev/null

sysctl net.core.wmem_max 2>/dev/null 1>/dev/null && sudo sysctl -w net.core.wmem_max=1610612736 1>/dev/null 2>/dev/null

sysctl net.core.wmem_default 2>/dev/null 1>/dev/null && sudo sysctl -w net.core.wmem_default=1610612736 1>/dev/null 2>/dev/null


Without doing that (or on a fresh reboot of my Razer Blade 15), the packet drops still occur, but they don't result in a system-wide UDP failure when the sockets are left open.

Can you help explain this or suggest a better way to handle higher UDP traffic?
Comment 4 sagar 2019-04-12 18:10:21 UTC
Using smaller values for net.core alleviates this issue for the time being.