Bug 8043
Summary: | curious communication breakage with e1000 and NBT | ||
---|---|---|---|
Product: | Drivers | Reporter: | Wolf Wiegand (wiegand) |
Component: | Network | Assignee: | Jesse Brandeburg (jbrandeb) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | bunk, jbrandeb, jeffrey.t.kirsher, okir |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.18.7 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | tcpdump trace, patch to fix tcp zero csum | ||
Description
Wolf Wiegand
2007-02-20 04:20:57 UTC
Created attachment 10470 [details]
tcpdump trace
Might be related to SNAP:
14:00:37.469645 IP (tos 0x0, ttl 30, id 29939, offset 0, flags [none], proto:
TCP (6), length: 95) winxp.olb.test.45080 > master.olb.test.netbios-ssn: P
53998:54053(55) ack 6120349 win 1450
>>> NBT Session Packet
NBT Session Message
Flags=0x0
Length=51 (0x33)
WARNING: Short packet. Try increasing the snap length by 13
Will you please try the latest standalone driver from http://e1000.sf.net, as it is much newer than the driver in the kernels you are using? It should work on any kernel that is not 2.6.20. I'm curious whether this has something to do with the CRC stripping changes. If you get us a more detailed description of how to reproduce this, I can have our lab try it.

Hi, thanks for your response. I've rerun the tests with the latest driver, which made no difference:

# modinfo e1000 | grep version
version: 7.3.20
srcversion: C7395247572355AB0396F9B

I've run some more tests with tcpdump and ethereal. The following packets are being sent when the problem occurs (i.e., the file being transferred at that moment cannot be read later on):

10.200.7.230 == linux server, 10.200.7.231 == FreeDOS client:

336.189987 10.200.7.230 10.200.7.231 NBSS [TCP Window Full] NBSS Continuation Message
336.282886 10.200.7.231 10.200.7.230 TCP 39952 > netbios-ssn [ACK] Seq=1076691 Ack=195971747 Win=1450 Len=0
336.283383 10.200.7.230 10.200.7.231 NBSS [TCP Window Full] NBSS Continuation Message
336.506917 10.200.7.230 10.200.7.231 NBSS [TCP Retransmission] NBSS Continuation Message
336.946934 10.200.7.230 10.200.7.231 NBSS [TCP Retransmission] NBSS Continuation Message
337.826998 10.200.7.230 10.200.7.231 NBSS [TCP Retransmission] NBSS Continuation Message
339.587090 10.200.7.230 10.200.7.231 NBSS [TCP Retransmission] NBSS Continuation Message
343.107283 10.200.7.230 10.200.7.231 NBSS [TCP Retransmission] NBSS Continuation Message
350.147649 10.200.7.230 10.200.7.231 NBSS [TCP Retransmission] NBSS Continuation Message
364.028368 10.200.7.230 10.200.7.231 NBSS [TCP Retransmission] NBSS Continuation Message
380.519862 10.200.7.231 10.200.7.230 SMB Read Raw Request, FID: 0x1a65
380.558958 10.200.7.230 10.200.7.231 TCP [TCP Window Full] netbios-ssn > 39952 [ACK] Seq=195973197 Ack=1076746 Win=5840 Len=0
385.559216 10.200.7.230 10.200.7.231 ARP Who has 10.200.7.231? Tell 10.200.7.230
385.559343 10.200.7.231 10.200.7.230 ARP 10.200.7.231 is at 00:30:05:19:8e:04
392.179830 10.200.7.230 10.200.7.231 NBSS [TCP Retransmission] NBSS Continuation Message
392.180049 10.200.7.231 10.200.7.230 TCP [TCP ZeroWindow] 39952 > netbios-ssn [ACK] Seq=1076746 Ack=195973197 Win=0 Len=0
392.180091 10.200.7.231 10.200.7.230 TCP [TCP Window Update] 39952 > netbios-ssn [ACK] Seq=1076746 Ack=195973197 Win=1450 Len=0
392.180517 10.200.7.230 10.200.7.231 NBSS NBSS Continuation Message
392.180593 10.200.7.230 10.200.7.231 NBSS [TCP Window Full] NBSS Continuation Message
392.181931 10.200.7.231 10.200.7.230 TCP 39952 > netbios-ssn [ACK] Seq=1076746 Ack=195974647 Win=1450 Len=0
392.182366 10.200.7.230 10.200.7.231 NBSS NBSS Continuation Message
392.182415 10.200.7.230 10.200.7.231 NBSS [TCP Window Full] NBSS Continuation Message

The [TCP Window Full] messages don't seem to be a problem; these also occur when using the realtek card.

Steps to reproduce this:
- Get http://bitz150.bitz.briteline.de/undis3c.img and dd it onto a floppy disc (pxe boot is also possible), then boot from this disc. It contains FreeDOS.
- Configure a DHCP server to give out an ip address to the client.
- During boot, some errors will probably occur because the configured share will not be present. Override the given values with the name of a samba server and a share on it.
- When you end up at the command prompt, try to copy a large folder off the network share onto the local hard drive. The hard drive has to be pre-formatted, as the disc contains no tools for this.

Unfortunately, the problem only occurs with some clients. At the moment, we can reproduce this on a client where lspci shows the following:

0000:00:00.0 Host bridge: Intel Corp. 82810E DC-133 GMCH [Graphics Memory Controller Hub] (rev 03)
0000:00:01.0 VGA compatible controller: Intel Corp. 82810E DC-133 CGC [Chipset Graphics Controller] (rev 03)
0000:00:1e.0 PCI bridge: Intel Corp. 82801 PCI Bridge (rev 05)
0000:00:1f.0 ISA bridge: Intel Corp. 82801BA ISA Bridge (LPC) (rev 05)
0000:00:1f.1 IDE interface: Intel Corp. 82801BA IDE U100 (rev 05)
0000:00:1f.2 USB Controller: Intel Corp. 82801BA/BAM USB (Hub #1) (rev 05)
0000:00:1f.3 SMBus: Intel Corp. 82801BA/BAM SMBus (rev 05)
0000:00:1f.4 USB Controller: Intel Corp. 82801BA/BAM USB (Hub #2) (rev 05)
0000:00:1f.5 Multimedia audio controller: Intel Corp. 82801BA/BAM AC'97 Audio (rev 05)
0000:01:08.0 Ethernet controller: Intel Corp. 82801BA/BAM/CA/CAM Ethernet Controller (rev 03)

Regarding the SNAP message "WARNING: Short packet. Try increasing the snap length by 13": this seems to be a tcpdump display issue only. When using '-s 180' or so, it no longer happens.

Turn off tcp window scaling; you might have a broken router or peer?

We already tried that (it was pretty much the first thing we did), and this did not help.

I just rechecked this by establishing a direct connection between server and client using a crosslink cable; the same problem still shows up.

Concerning the client network driver, the driver on the DOS disk we are using is a generic 3com driver (3Com Universal NDIS driver v1.00) which is supposed to work with any network card. We've replaced the driver with an e100b driver (running strings on it reveals "Intel(R) PRO/100 Network Connection Driver v4.57 112304"), which did not help.

This may be a lengthy process, but it might be interesting to try to regress through our drivers from 7.3.20 back to 7.0.33 or even back to the last version that's roughly the same as in 2.4.32 (5.4.11) and seeing if the 2.6 kernel works correctly with any of these drivers.

It may help to attach a raw tcpdump (tcpdump -s 512 -w /tmp/somefile) - there are some peculiarities in the dump.

The pattern looks similar in both cases. However, the dump in attachment #5 [details] is too dumbed down to actually look at sequence numbers etc.
- client max window seems to be set at 1450, and Linux server transmits using a segment size of half that window - 725.
- master retransmits the same packet all over again, and client ignores it.
- client sends another readraw request
- server retransmits old segment plus additional data, and now the client groks it.

Looking at the dump from attachment #1 [details]:

master -> client: 6064241:6064966(725) ack 53650 win 5840
This seems to be an old packet. The next packet we see comes 0.2 seconds later, but notice the huge difference in sequence numbers - the send sequence differs by about 1.6MB!

master -> client: 6120349:6121074(725) ack 53998 win 5840
this is repeated several times

client -> master: 53998:54053(55) ack 6120349 win 1450
this is the read request
The ACK shows that the client hasn't processed any of the reply packets sent above.

master -> client: 6121799:6121799(0) ack 54053 win 5840
empty ACK of read request

master -> client: 6120349:6121074(725) ack 54053 win 5840
that same old packet again

client -> master: 54053:54053(0) ack 6121799 win 0
whoops - now it ACKs that old segment, but it seems the following segment was already sent *and* received. Note that the client advertises a zero window here. There's a rather funky TCP stack at work here...

client -> master: 54053:54053(0) ack 6121799 win 1450
Client reopens TCP window. Apparently it needed a little break to process these two segments in the queue.

master -> client: 6121799:6122524(725) ack 54053 win 5840
now we go on sending more data

In summary, the TCP exchange here is highly unusual, but it should continue after this. The fact that it doesn't would mean (to me) that the client's TCP stack is terminally confused.

Here's a funny theory: for some strange reason, the e1000 driver retransmits an old packet which should have been purged from the TX ring long ago. Client's TCP stack says "omigosh" and things go downhill from here.
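As a rough check of the arithmetic in the walkthrough above, the following Python sketch works out how many 725-byte segments the client's ACK jump covers. The function name `acked_segments` is made up for illustration; the numbers are copied from the trace lines quoted above, and nothing here touches a real capture file.

```python
from typing import Tuple

CLIENT_WINDOW = 1450   # window advertised by the client in the trace
SEGMENT_LEN = 725      # segment size the server uses (half the window)


def acked_segments(ack_before: int, ack_after: int, segment_len: int) -> Tuple[int, int]:
    """Return (whole segments, leftover bytes) covered by an ACK advance."""
    return divmod(ack_after - ack_before, segment_len)


# The read request carried "ack 6120349"; the later zero-window ACK
# carried "ack 6121799".
segments, leftover = acked_segments(6120349, 6121799, SEGMENT_LEN)
bytes_acked = 6121799 - 6120349

print(segments, leftover, bytes_acked == CLIENT_WINDOW)  # prints: 2 0 True
```

The ACK advance is exactly two segments, which is also exactly one full client window of 1450 bytes. That is at least consistent with the client advertising a zero window at that point and reopening it once both queued segments were processed.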
Having a raw tcpdump (with some packets before and after the hang) would help to check that theory.

Wolf, did you have any luck getting a tcpdump -s 512 -w /tmp/dumpfile while having this problem?

ping.

Sorry for the delay. I've uploaded the raw tcpdump to http://bitz150.bitz.briteline.de/tcpdump.out.s512.filtered

(In reply to comment #7)
> This may be a lengthy process, but it might be interesting to try to regress
> through our drivers from 7.3.20 back to 7.0.33 or even back to the last
> version that's roughly the same as in 2.4.32 (5.4.11) and seeing if the 2.6
> kernel works correctly with any of these drivers.

This problem also occurs with versions 7.0.33 and 5.7.6. Version 5.4.11 wouldn't compile on kernel 2.6.14.7, and I was not able to make the necessary changes in the source code to compile it.

I believe we actually fixed this bug in e1000. I'm not sure if a kernel patch was pushed to do the same. The problem is that the e1000 hardware was inserting or misinterpreting an incorrect checksum for packets with 0x0000 checksum. I'll see if I can dig up the patch, as I assume this is still occurring on current kernels.

Created attachment 18184 [details]
patch to fix tcp zero csum
This patch was only compile-tested, but it has undergone extensive testing in our out-of-tree drivers.
Not sure which hardware you had (didn't look, sorry), but the same patch is likely needed for e1000 as well. Jeff Kirsher will probably post both to netdev soon, hopefully for inclusion in 2.6.28.
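For readers unfamiliar with why a 0x0000 checksum is a special case, here is a minimal Python sketch of the ones' complement Internet checksum (RFC 1071). It is not the attached patch and does not model the e1000 hardware; it only illustrates that a TCP checksum can legitimately compute to 0x0000, and that a receiver summing the data accepts both 0x0000 and 0xFFFF in the checksum field. The function names are made up for this example, and the TCP pseudo-header is left out for brevity.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones' complement checksum of data, as a 16-bit integer."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold carries back into the low 16 bits (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def verifies(data_with_checksum: bytes) -> bool:
    """Receiver-side check: summing data plus checksum field must yield 0."""
    return internet_checksum(data_with_checksum) == 0


# Illustrative payload whose 16-bit words sum to 0xFFFF, so the
# computed checksum is 0x0000.
payload = b"\x12\x34\xed\xcb"

print(hex(internet_checksum(payload)))    # prints: 0x0
print(verifies(payload + b"\x00\x00"))    # prints: True
print(verifies(payload + b"\xff\xff"))    # prints: True
print(verifies(payload + b"\x00\x01"))    # prints: False
```

UDP over IPv4 reserves a checksum field of zero to mean "no checksum" and transmits a computed zero as 0xFFFF; TCP has no such rule, so a TCP segment carrying 0x0000 is valid and anything that treats it as "checksum missing" risks corrupting or dropping it.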