Bug 61681
| Summary: | Incoming TCP4 connections fail to start, don't get past SYN_RECV and then quickly disappear | | |
|---|---|---|---|
| Product: | Networking | Reporter: | Dave (dcrooke) |
| Component: | IPV4 | Assignee: | Stephen Hemminger (stephen) |
| Status: | NEW --- | | |
| Severity: | normal | CC: | alan, dcrooke, eric.dumazet, nealcardwell, szg00000 |
| Priority: | P1 | | |
| Hardware: | IA-64 | | |
| OS: | Linux | | |
| Kernel Version: | 3.4.57 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Dave
2013-09-19 16:41:56 UTC
Really, Linux just copies the sequence number received in the SYN message back to the SYNACK message. No wraparound issue is involved here. Make sure the server really sends a SYNACK message. It might drop your SYN packet for valid reasons. nstat should help to understand why. netfilter / tcp conntrack might be the problem. Are you using it?

Hi Eric, thanks for the quick reply. I have no way to reproduce the problem, but it's definitely not firewall related and we are not using any of the filters you mention. iptables is blank. Port 80 is not firewalled by Amazon.

Some people have reported a malformed SYN_ACK due to a NAT device using its own (single) IP on the inside to talk to the Linux server, but this is apparently caused by the NAT having to quickly recycle source port numbers because it uses a single IP. Linux will apparently return the ACK sequence number for the previous connection, which is understandable. Amazon EC2 only NATs the server VM IP; the external Internet IP and port from the upstream client are passed in to us as the source. Thus it seems unlikely in this case that the problem is due to port number re-use.

The traffic level on the server was low when the problem occurred, perhaps 10 requests per second at the most, and the server had plenty of file descriptors, Apache children, RAM, CPU, etc. The server had been up for a few months, and no config changes were made to the OS or Apache. I can't figure it out, but hopefully it won't recur :)

From logs, can you quantify exactly how long the machine was up when this problem happened? Interesting bugs can happen at 24 days and 49 days, due to 32-bit millisecond-based jiffies values flipping sign, wrapping around, overflowing, etc.
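As a rough illustration of where the 24-day and 49-day figures come from, the sketch below converts 2^31 and 2^32 ticks into days. It assumes a tick rate of 1000 per second (a millisecond-based counter, as described above); the tick rate on a given kernel depends on its configuration, and the function name is made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_until(ticks, hz=1000):
    """Days taken to count `ticks` ticks at `hz` ticks per second.

    hz=1000 is an assumption matching a millisecond-based counter.
    """
    return ticks / hz / SECONDS_PER_DAY


# A signed 32-bit view of the counter flips sign after 2**31 ticks.
sign_flip_days = days_until(2**31)

# An unsigned 32-bit counter wraps back to zero after 2**32 ticks.
wrap_days = days_until(2**32)

print(f"sign flip after {sign_flip_days:.2f} days")  # 24.86
print(f"wraparound after {wrap_days:.2f} days")      # 49.71
```

With a different tick rate the thresholds scale accordingly, for example `days_until(2**32, hz=250)` for a 250 Hz counter.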
Unfortunately, the logs had rolled, but I found this (not my systems so I am not super familiar):

[xx@xxxx log]$ stat dmesg.old
  File: `dmesg.old'
  Size: 10320  Blocks: 24  IO Block: 4096  regular file
Device: ca01h/51713d  Inode: 1380  Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 0/ root)  Gid: ( 0/ root)
Access: 2013-06-17 14:22:46.495339112 -0500
Modify: 2013-06-17 14:22:46.499339112 -0500
Change: 2013-09-18 11:17:58.625811219 -0500
[bf@cake-app1 log]$ stat dmesg
  File: `dmesg'
  Size: 10320  Blocks: 24  IO Block: 4096  regular file
Device: ca01h/51713d  Inode: 17  Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 0/ root)  Gid: ( 0/ root)
Access: 2013-09-18 11:17:58.625811219 -0500
Modify: 2013-09-18 11:17:58.649811219 -0500
Change: 2013-09-18 11:17:58.649811219 -0500
[xx@xxxx log]$

I make this just under 93 days.
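The uptime estimate can be checked from the two `Modify` timestamps in the `stat` output. This is only a sketch: it assumes that the modification time of `dmesg.old` marks one boot and that of `dmesg` marks the next, and it drops the fractional seconds.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S %z"

# Modify time of dmesg.old, taken here as the earlier boot (assumption).
boot = datetime.strptime("2013-06-17 14:22:46 -0500", FMT)

# Modify time of dmesg, taken here as the following boot (assumption).
next_boot = datetime.strptime("2013-09-18 11:17:58 -0500", FMT)

uptime = next_boot - boot
uptime_days = uptime.total_seconds() / 86400

print(uptime)                # 92 days, 20:55:12
print(f"{uptime_days:.2f}")  # 92.87
```

Under those assumptions the interval is roughly 92.9 days, well past both the 2^31 ms (about 24.86 days) and 2^32 ms (about 49.71 days) points of a 32-bit millisecond counter.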