With the default gc_thresh values and a busy /16 network attached, the neighbour cache can overflow. Nothing indicates that this has happened, and it does impact TCP on localhost.

Test setup:
* about 16k simulated IP/MAC pairs on one interface
* web server behind a second interface
* routing between the two
* HTTP benchmark from the 16k IPs to the web server
* to verify localhost connectivity, a netperf instance is run on localhost like so: 'netperf -D 1 -l 600 127.0.0.1'

Result: The kernel has to learn the 16k IP/MAC combinations. As soon as gc_thresh3 is hit, netperf stalls, and no syslog/kernel message indicates the problem. The only indication is log entries like this: "net_ratelimit: 1464 callbacks suppressed". No other messages are logged.
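
For anyone trying to reproduce this, here is a minimal sketch for watching how close the IPv4 neighbour table is to gc_thresh3. It assumes a Linux host with the usual /proc/net/arp and /proc/sys/net/ipv4/neigh/default/gc_thresh* files. The function names are made up for illustration, and /proc/net/arp may not reflect every entry the kernel counts against the limit, so treat the numbers as a rough guide only.

```python
from pathlib import Path

# Standard Linux procfs locations (assumed present on the test host).
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
ARP_TABLE = Path("/proc/net/arp")


def count_arp_entries(arp_text: str) -> int:
    """Count entries in /proc/net/arp-style text; the first line is a header."""
    lines = [ln for ln in arp_text.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)


def headroom(entries: int, gc_thresh3: int) -> int:
    """Entries that can still be added before the hard limit is reached."""
    return gc_thresh3 - entries


def read_thresholds(neigh_dir: Path = NEIGH_DIR) -> dict:
    """Read gc_thresh1..3 from the given sysctl directory."""
    names = ("gc_thresh1", "gc_thresh2", "gc_thresh3")
    return {name: int((neigh_dir / name).read_text()) for name in names}


def report() -> str:
    """One-line summary of current usage (hypothetical helper, reads /proc)."""
    thresholds = read_thresholds()
    entries = count_arp_entries(ARP_TABLE.read_text())
    left = headroom(entries, thresholds["gc_thresh3"])
    return f"entries={entries} gc_thresh3={thresholds['gc_thresh3']} headroom={left}"
```

Calling print(report()) in a loop while the HTTP benchmark runs should show the headroom shrinking towards zero around the time netperf stalls, if the overflow is indeed the trigger.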
What do you mean by "netperf stalls"? When gc_thresh3 is hit, netperf should get EINVAL and then stop, unless it ignores the syscall return value.
I used netperf only to make the problem easy to reproduce. Every *established* local TCP connection (through the lo interface) seems to be affected. The TCP connection seems to stall; netstat shows that the send queue of the netperf server process fills up. Traces (systemtap) on the processes show that poll is not reporting the socket as having data. Neither the sender nor the receiver side gets an EINVAL from its syscalls. tcpdump on lo shows a distinct "gap" between a single TCP packet and the ACK for it. Sometimes the gap is 20 seconds, sometimes much more.
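
In case it helps others measure the same gaps, here is a rough sketch that scans a text capture for long pauses between consecutive packets. It assumes the capture was taken with something like 'tcpdump -tt -n -i lo', so that each packet line starts with an epoch timestamp. find_gaps is a name I made up, and the one-second threshold is arbitrary.

```python
def find_gaps(lines, min_gap=1.0):
    """Return (start_ts, gap_seconds) for each pause of at least min_gap seconds.

    Each packet line is expected to begin with a floating-point epoch
    timestamp, as printed by 'tcpdump -tt'. Other lines are skipped.
    """
    gaps = []
    prev = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            ts = float(fields[0])
        except ValueError:
            continue  # continuation or status line, not a packet
        if prev is not None and ts - prev >= min_gap:
            gaps.append((prev, ts - prev))
        prev = ts
    return gaps
```

Feeding it the lines of a saved capture (for example, open("lo.txt") for a hypothetical file name) gives the timestamp where each stall began and how long it lasted, which can then be lined up against the time the neighbour table filled.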
Hmm, interesting. Loopback traffic should not need a neigh entry at all. How many concurrent TCP connections do you have? Did you see any memory pressure? Thanks.