Bug 18822

Summary: TCP communication gets blocked, then reset
Product: Networking
Reporter: Fred Baumgarten (dc6iq)
Component: IPV4
Assignee: Stephen Hemminger (stephen)
Status: RESOLVED DOCUMENTED
Severity: normal
CC: alan, eric.dumazet
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.32-24-generic #43-Ubuntu
Subsystem:
Regression: No
Bisected commit-id:
Attachments: both machine dumps
tar file holding the last 300 lines of the broken tcp communication.
sack=0 on host bacula
both machines with sack = 0
netstat -s before and after problem on both hosts

Description Fred Baumgarten 2010-09-20 10:39:43 UTC
Created attachment 30682 [details]
both machine dumps

Having freshly installed a bacula server, I could never get a backup from my working machine to the bacula server to complete.

The connection transmits approximately 3 GiB, then locks up and is reset.

The transmission gets stuck at a certain point, after which the bacula server no longer replies to retransmitted packets on the IPv4 stack. The retry count on disks (my working machine) rises to 13, then the connection is gone. The bacula server then tries to send a push packet (after the keepalive timer runs out) and gets the final RST packet from disks, because the connection no longer exists there.

In the attachment you will find the tcpdumps from both machines. Note that the dump on the sending machine dropped some packets.

This might help: disks is running a 64-bit kernel whereas bacula is running a 32-bit one. I haven't looked into the option bits very closely, but it looks like there is a problem hidden there:

last ack being ok:
09:32:56.142876 IP bacula.elkenet.bacula-sd > disks.elkenet.50766: Flags [.], ack 2005754588, win 9582, options [nop,nop,TS val 21825875 ecr 4530083], length 0
next ack packet:
09:32:56.144763 IP bacula.elkenet.bacula-sd > disks.elkenet.50766: Flags [.], ack 2005773412, win 9308, options [nop,nop,TS val 21825876 ecr 4530083,nop,nop,sack 1 {2005774860:2005776308}], length 0
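
The two ACKs above can be read with a little arithmetic. The following is only an illustrative sketch: `sack_gap` is a hypothetical helper, not part of any tool, and it ignores 32-bit sequence number wraparound. It takes the cumulative ack and the left and right edges of the SACK block from the second packet and reports how many bytes are missing in front of the SACKed block and how many bytes the block covers.

```shell
#!/bin/sh
# sack_gap ACK SACK_LEFT SACK_RIGHT  (hypothetical helper, illustration only)
# Assumes all three numbers come from the same tcpdump line and that the
# sequence space does not wrap between them.
sack_gap() {
    ack=$1 left=$2 right=$3
    # Bytes between the cumulative ack and the SACK block: the hole.
    # Bytes inside the SACK block: data received out of order.
    echo "missing=$((left - ack)) sacked=$((right - left))"
}

# Values taken from the "next ack packet" line above.
sack_gap 2005773412 2005774860 2005776308
```

With the values from the trace this prints `missing=1448 sacked=1448`, i.e. the receiver is reporting a hole of 1448 bytes followed by 1448 bytes that did arrive.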

root@disks:~# uname -a
Linux disks 2.6.32-24-generic #43-Ubuntu SMP Thu Sep 16 14:58:24 UTC 2010 x86_64 GNU/Linux

root@bacula:~# uname -a
Linux bacula 2.6.32-24-generic-pae #43-Ubuntu SMP Thu Sep 16 15:30:27 UTC 2010 i686 GNU/Linux

Doing a 20GB backup on a debian server works fine

server:~# uname -a
Linux server 2.6.32-5-486 #1 Sat Sep 18 01:43:00 UTC 2010 i686 GNU/Linux

Doing a 26 GB backup from a 32-bit Ubuntu machine works fine as well. Maybe it's a 64-bit issue...

root@elke:~# uname -a
Linux elke 2.6.32-24-generic #43-Ubuntu SMP Thu Sep 16 14:17:33 UTC 2010 i686 GNU/Linux

If any further input is required, just let me know...
Comment 1 Eric Dumazet 2010-09-20 17:45:20 UTC
Could you try to get a tcpdump without filter drops?

You could record to a pcap file, then replay it and use the tail command:

tcpdump -i eth0 port 9103 -w file.pcap

...

tcpdump -r file.pcap | tail -n 300


You could also try disabling SACK and see what happens:

echo 0 >/proc/sys/net/ipv4/tcp_sack
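
If the setting should only be changed for the duration of one test, a small wrapper can save and restore the previous value. This is a minimal sketch, not an established tool: `with_sack_disabled` and the `SACK_FILE` variable are hypothetical names, and writing to the real sysctl file requires root. Since SACK is negotiated during the TCP handshake, the wrapped command should open a fresh connection.

```shell
#!/bin/sh
# SACK_FILE is a hypothetical override so the sketch can be tried on a scratch file.
SACK_FILE=${SACK_FILE:-/proc/sys/net/ipv4/tcp_sack}

# with_sack_disabled COMMAND [ARGS...]  (hypothetical helper)
# Saves the current value, writes 0, runs the command, then restores the value.
with_sack_disabled() {
    saved=$(cat "$SACK_FILE") || return 1
    echo 0 > "$SACK_FILE" || return 1
    "$@"
    status=$?
    # Put the original setting back, whatever the command returned.
    echo "$saved" > "$SACK_FILE"
    return "$status"
}

# Example (as root): with_sack_disabled tcpdump -i eth0 port 9103 -w file.pcap
```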
Comment 2 Fred Baumgarten 2010-09-24 05:50:58 UTC
Hi Eric !

Thanks for investigating this bug. I am sorry for not replying earlier, but I am a bit busy right now. I did the recording (attached in the new tgz file). I have now started another job with tcp_sack = 0 and will post the result later. The last few times the connection broke down at 20 GB and at 2 GB.
Comment 3 Fred Baumgarten 2010-09-24 05:53:27 UTC
Created attachment 31232 [details]
tar file holding the last 300 lines of the broken tcp communication.
Comment 4 Fred Baumgarten 2010-09-24 10:48:45 UTC
Created attachment 31262 [details]
sack=0 on host bacula
Comment 5 Fred Baumgarten 2010-09-24 11:19:09 UTC
Created attachment 31272 [details]
both machines with sack = 0
Comment 6 Fred Baumgarten 2010-09-24 11:25:15 UTC
I tried some more configurations

# echo 0 >/proc/sys/net/ipv4/tcp_sack on machine bacula (32 bit)

restarted all bacula servers

-> https://bugzilla.kernel.org/attachment.cgi?id=31262

tcp_sack set to 0 on both machines still fails.

-> https://bugzilla.kernel.org/attachment.cgi?id=31272

Hope this helps...

machine disks has

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Pentium(R) Dual-Core  CPU      E5200  @ 2.50GHz
stepping        : 6
cpu MHz         : 2500.000
cache size      : 2048 KB
[...]

machine bacula has

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 28
model name      : Intel(R) Atom(TM) CPU  330   @ 1.60GHz
stepping        : 2
cpu MHz         : 1600.047
cache size      : 512 KB
[...]

Maybe I'll try to set up 64 bit env for bacula as well...
Comment 7 Eric Dumazet 2010-09-24 12:54:52 UTC
OK, could you report "netstat -s" changes ?

netstat -s >before
<transfer>
netstat -s >after

diff after before
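
Where the plain diff is hard to read, the two snapshots can also be reduced to per-counter deltas. The following is a rough sketch under stated assumptions: `netstat_delta` is a hypothetical helper, section headers are assumed to start in column one, and the counter is assumed to be the first integer on each indented line (a counter whose name itself contains a digit would be misread).

```shell
#!/bin/sh
# netstat_delta BEFORE AFTER  (hypothetical helper)
# Prints "<delta><TAB><section> <counter text>" for every counter that changed.
netstat_delta() {
    awk '
        # Section headers such as "Tcp:" or "TcpExt:" start in column one.
        /^[^ \t]/ { section = $0; next }
        # Counter lines are indented; take the first integer as the value.
        match($0, /[0-9]+/) {
            value = substr($0, RSTART, RLENGTH) + 0
            # Key is the section plus the line with the value replaced by "N".
            key = section " " substr($0, 1, RSTART - 1) "N" substr($0, RSTART + RLENGTH)
            gsub(/[ \t]+/, " ", key)
            if (FNR == NR) { before[key] = value; next }
            delta = value - before[key]
            if (delta != 0) printf "%+.0f\t%s\n", delta, key
        }
    ' "$1" "$2"
}

# Example: netstat_delta before after
```

Counters that only exist in the second snapshot are reported with their full value, because a missing entry in the first snapshot is treated as zero.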
Comment 8 Fred Baumgarten 2010-09-24 21:28:43 UTC
Created attachment 31442 [details]
netstat -s before and after problem on both hosts
Comment 9 Fred Baumgarten 2010-09-24 21:30:46 UTC
Hi Eric !

It's funny for me to see that my netstat(8) got improved like that :-) The German manual page still doesn't mention that feature; it looks like no one continued my work there...

Thanks again for your work.