Overview:

First, I'm not positive this is a network-related bug; it seems unlikely, but based on the debugging I've done so far, this seemed like the best place to start.

When I view the contents of files on an NFS export served over my Realtek card, the files show 0x00 and 0xff bytes (and, depending on the size of the file, the occasional random other byte). This happens to ALL files on the export, and the files on the NFS server itself (the machine with the RTL card) have the proper contents on disk, so nothing is corrupt on the source filesystem. It also happens to ALL clients that connect (other Linux machines, other kernels, even a Windows machine). The other file properties (ACL, owner, times, size, etc.) are all correct, and nothing other than the contents of the file itself appears to be affected. There are no spurious retransmits, duplicates, or other anomalies, and the returned data is even the correct length for the file.

The reason I opened this as a network issue is that the NFS server has two network cards: the Realtek card is on the 'local' network, and an E1000 card is on the 'external' network. If I change the exports (and associated files) to allow the other network card, mount through it, and cat the same file, the contents are correct. So it seems related to the network implementation itself, even though nothing else in the packets is broken.

I tried reverting to the 4.19 series kernels on the box (this is a Slackware box), specifically 4.19.111 (the latest as of this writing), and the share worked fine over the Realtek card as before. I then compiled the 5.4.1 kernel using the default Slackware configs and installed it, and the problem reappeared. So I've narrowed it down to 'something' between 4.19.111 and 5.4.1.

Steps to reproduce:

1. Run a 32-bit 5.4.1 or later kernel (I'm currently running 5.4.25 from Slackware).
2. Export an NFS share.
3. Mount the NFS share from another machine.
4. Cat a text (or other) file and observe that the contents aren't correct.

Actual Results:

The file contents are incorrect. Not 'corrupt', but not really related to the actual contents at all; in this case, 0x00 and 0xff bytes.

0000   00 00 00 00 00 00 00 00  ff ff ff ff ff ff ff ff   ........ ........
0010   00 00 00 00 00 00 00 00  ff ff ff ff ff ff ff ff   ........ ........
0020   00 00 00 00 00 00 00 00  ff ff ff ff ff ff         ........ ......

Expected Results:

The contents of the file as present on the NFS server's disk.

0000   54 68 69 73 20 69 73 20  61 20 74 65 73 74 2e 20   This is  a test. 
0010   49 74 20 73 68 6f 75 6c  64 20 62 65 20 6a 75 73   It shoul d be jus
0020   74 20 73 6f 6d 65 20 74  65 78 74 2e 20 0a         t some t ext. .

Build Date & Hardware:

Kernels 4.19.65 and 4.19.111 were built from source yesterday; 5.4.1 was built from source today. This is a 32-bit machine, not 64.

Motherboard: Shuttle Inc. SH55J/SH55J, BIOS 080015 06/18/2010
CPU: Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz (family: 0x6, model: 0x25, stepping: 0x2)
eth0: r8169 0000:02:00.0 eth0: RTL8168e/8111e, 80:ee:73:06:b5:89, XID 2c1, IRQ 27
      r8169 0000:02:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
      RTL8211DN Gigabit Ethernet r8169-200:00: attached PHY driver [RTL8211DN Gigabit Ethernet] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
eth1: e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
      e1000 0000:03:09.0 eth1: (PCI:33MHz:32-bit) 00:1b:21:79:35:57
      e1000 0000:03:09.0 eth1: Intel(R) PRO/1000 Network Connection

NFS Mount:

/www from 10.0.0.3:/www
Flags: rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.0.0.3,mountvers=3,mountport=34737,mountproto=udp,local_lock=none,addr=10.0.0.3

Additional Builds and Platforms:

Does NOT seem to happen on any 4.19.x kernels.
DOES affect 5.4.1 and 5.4.25.
Does NOT seem to apply to the E1000 card, only the RTL8169 card.

Additional Info:

This is all using the same NFS tools and other applications on the server; the only things that changed were the kernel and kernel modules. I can provide packet captures if they are needed, but I'd rather not expose the IPs and hostnames publicly. I can also provide kernel build configs or output from various commands if needed. I've also tried mounting soft, with sync, over UDP, with vers 4, 4.1, and 4.2, and with smaller rsize/wsize (based on internet research); none of that changed the behavior. I have not tested this by swapping out (or removing) the hardware, but I have 3 machines all configured the same with the same hardware, and it happens on each.
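For anyone trying to reproduce, a minimal setup looks roughly like this; the export path, network, and addresses below are placeholders, not my real config:

# On the server (32-bit 5.4.x kernel, traffic going over the r8169 NIC):
echo '/share 10.0.0.0/24(ro,sync,no_subtree_check)' >> /etc/exports
exportfs -ra

# On any client: mount the share and inspect a known file
mount -t nfs -o ro 10.0.0.135:/share /mnt/tmp
hexdump -C /mnt/tmp/test.txt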
If the NIC were mangling packets, then I'd expect checksum errors on the receiving side. What do the network statistics on the tx and rx sides say? Any errors?

Further questions:
- Do you use jumbo packets?
- Do you use NFS over TCP or UDP?
- You say you use a 32-bit kernel. Based on the specs of the Shuttle system (chipset, CPU) it should be 64-bit capable. Did you try a 64-bit kernel? The issue may be related to the 32-bit kernel, otherwise I would expect many more reports like this.

The best way forward would be bisecting the issue to find the offending commit. Can you do that?
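For the statistics, something like the following on both sender and receiver should do (eth0 being the respective interface; which counters are available depends on the driver):

# Kernel-level counters for the interface (errors, drops, overruns)
ip -s link show eth0
# Driver/hardware counters, where the driver exposes them
ethtool -S eth0
# Protocol-level counters, e.g. TCP/UDP checksum errors
nstat -az | grep -i csum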
I totally agree about the checksums. I would have expected far more corruption in the rest of the packet (and subsequent retransmits) than just the file payload, so it's very strange. My only link to it being network-related is that the behavior changes if I swap to a different NIC (and consequently a different IP address).

- I have not explicitly enabled jumbo packets, and a quick ping test confirms the MTU limit:

[root@vulpes ~] ping -M do -s 8972 10.0.0.4
PING 10.0.0.4 (10.0.0.4) 8972(9000) bytes of data.
ping: local error: message too long, mtu=1500

- I'm using the default, which as I understand it is TCP. nfsstat -m after mounting shows proto=tcp, so that's confirmed. I have also tried specifying proto=udp (though I forgot to capture the nfsstat output), and that didn't work either.

- I have not tried a 64-bit kernel yet. Given that everything on the box is compiled for 32-bit, I'm not sure how well just swapping the kernel out would go, but I can certainly give that a shot. If that doesn't work, I may be able to find a live CD/stick I can briefly configure NFS on and try that. It'll take me a little while to try these suggestions. I would also expect the issue to be 32-bit related. I'd love to be running 64-bit Slackware on these boxes, but backing up and reinstalling active machines never seemed like an exciting prospect, so I've been lazy; this may be the thing that finally makes me do it.

I've been trying to do exactly as you suggest: pulling in different kernels, compiling, and giving them a shot. Unfortunately I wasn't able to find a way in the web UI to diff between 4.19.111 and 5.4.1 limited to just the drivers/net/ethernet/realtek directory, and there are probably enough changes elsewhere in the kernel that I'd be overwhelmed. If you can point me in the right direction for tracing the history (or how the kernel moves between mainline branches like that), I'll download more kernels and see if I can find the offending commit. I'll try to report my 64-bit findings this weekend when I have some time to sink my teeth into it.
git bisect can be used to identify the offending commit. There's a lot of documentation online on how bisecting with git works. You'd have to clone the repo git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git; the two start points (one good, one bad) you have identified already.
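Roughly, the workflow looks like this (the tags are examples based on the versions you mentioned; at each step you build, boot, test NFS, and tell git the result):

git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
# Optional: list what changed in the realtek driver between the two releases
git log --oneline v4.19..v5.4 -- drivers/net/ethernet/realtek/
# Start the bisection with a known-bad and a known-good point
git bisect start
git bisect bad v5.4.1
git bisect good v4.19.111
# Build, boot and test the revision git checks out, then mark it:
git bisect good    # or: git bisect bad
# Repeat until git names the first bad commit, then clean up:
git bisect reset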
Okay, so I was able to confirm at least that it is indeed in the 32-bit kernel only and NOT in the 64-bit one. I grabbed a spare drive, took the machine out of service for the day, and now have it dual-booting (different partitions) slackware-current with the same kernel (5.4.25) in 32-bit and 64-bit flavors. So now I've got a good test bed to play with various kernels on the 32-bit side without mucking up my existing install.

Also while testing, I noticed that the bogus contents seem to vary between attempts on this install. It could be a bad pointer, or some inode issue grabbing stuff off the wrong part of the disk; I'm not sure. Here's the output (10.0.0.135 being the NFS box under scrutiny here):

[root@testbox /mnt] mount -t nfs -o ro 10.0.0.135:/share /mnt/tmp
[root@testbox /mnt] cat /mnt/tmp/test.txt
[root@testbox /mnt]
[root@testbox /mnt] hexdump -v /mnt/tmp/test.txt
0000000 0000 0000 0000 0000 ffff ffff ffff ffff
0000010 0000 0000 0000 0000 ffff ffff ffff ffff
0000020 0000 0000 0000 0000 ffff
000002a
[root@testbox /mnt] umount /mnt/tmp
[root@testbox /mnt] mount -t nfs -o ro 10.0.0.135:/share /mnt/tmp
[root@testbox /mnt] cat /mnt/tmp/test.txt
This is a test. You should see some text.
[root@testbox /mnt] hexdump -v /mnt/tmp/test.txt
0000000 6854 7369 6920 2073 2061 6574 7473 202e
0000010 6f59 2075 6873 756f 646c 7320 6565 7320
0000020 6d6f 2065 6574 7478 0a2e
000002a
[root@testbox /mnt]

And at least once during all this testing the file came back containing the string "libXxf86vm - Extension library for the X". I'll get on to bisecting the commits and rebuilding the kernel a few times. It's an older machine so each build takes a few hours; I'll try to tackle that tomorrow.
Great, thanks a lot for your efforts! Very few people are still running 32-bit x86, which would explain why the issue went unnoticed for quite some time.
In addition to bisecting, you could check whether the following fixes the issue. It's related to one of the very few changes I think could cause this misbehavior on 32-bit.

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index d2eef3754..789f2cdf1 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -5440,6 +5440,8 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (sizeof(dma_addr_t) > 4 && tp->mac_version >= RTL_GIGA_MAC_VER_18 &&
 	    !dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
 		dev->features |= NETIF_F_HIGHDMA;
+	else
+		dma_set_mask(&pdev->dev, DMA_BIT_MASK(32));
 
 	rtl_init_rxcfg(tp);
-- 
2.25.2
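Independent of the patch, it would be interesting to see which offload/DMA features the kernel actually reports for the interface on the affected system, e.g.:

# Show the relevant offload/DMA feature flags for the device
ethtool -k eth0 | grep -E 'highdma|scatter-gather|tcp-segmentation-offload|tx-checksumming'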
Sorry it took a while to get back to you, but compiling the kernel 20 times on an i3 with 8GB of RAM and a slow old SATA drive takes a while. Partway through, I decided to look for a way to cross-compile on another box to speed things up (which got me playing with distcc, which didn't work well for me). Fortunately, it turned out I could build the 32-bit kernel on a 64-bit Ubuntu machine by passing ARCH=i386 to make (for oldconfig as well as the bzImage/modules builds). That, combined with -j8 on a much faster machine, sped things up considerably, but it didn't occur to me until Tuesday of this week.

After much trial and error, here is the offending commit according to git bisect:

93681cd7d94f83903cb3f0f95433d10c28a7e9a5 is the first bad commit
commit 93681cd7d94f83903cb3f0f95433d10c28a7e9a5
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date:   Fri Jul 26 21:51:36 2019 +0200

    r8169: enable HW csum and TSO

    Enable HW csum and TSO per default except on known buggy chip versions.
    Realtek confirmed that RTL8168evl has a HW issue with TSO.

    Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 1a234fa156ea927adcc2ee8ebd198a40d9bb1431 f908be887d71a850513f4f91f04660ee8c498de9 M	drivers

I took a look at the commit and it just turns on a few flags, so it doesn't seem to have much specific impact by itself, but maybe something it enables has problems on the 32-bit architecture. Because I didn't entirely trust the result, I also built the commit immediately before it (~1) and that commit specifically; sure enough, the commit before was fine, and that commit is the one that triggers the strange behavior. Here's the bisect log in case it helps, though I suspect it won't matter much. I'll try your patch against the "bad" commit above and see what happens; expect some results later tonight.
git bisect start
# good: [cc4c818b2219c58af5f0ca59f3e9f02c48bc0b65] Linux 4.19.65
git bisect good cc4c818b2219c58af5f0ca59f3e9f02c48bc0b65
# bad: [79438f37a69a80822333c86acb06a71abce1bce3] Linux 5.4.1
git bisect bad 79438f37a69a80822333c86acb06a71abce1bce3
# good: [84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d] Linux 4.19
git bisect good 84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d
# good: [14cfbdac6680bc87f6b9f6196d4d4ec044a01481] Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
git bisect good 14cfbdac6680bc87f6b9f6196d4d4ec044a01481
# good: [168869492e7009b6861b615f1d030c99bc805e83] docs: kbuild: fix build with pdf and fix some minor issues
git bisect good 168869492e7009b6861b615f1d030c99bc805e83
# good: [e444d51b14c4795074f485c79debd234931f0e49] Merge tag 'tty-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
git bisect good e444d51b14c4795074f485c79debd234931f0e49
# bad: [574cc4539762561d96b456dbc0544d8898bd4c6e] Merge tag 'drm-next-2019-09-18' of git://anongit.freedesktop.org/drm/drm
git bisect bad 574cc4539762561d96b456dbc0544d8898bd4c6e
# bad: [81160dda9a7aad13c04e78bb2cfd3c4630e3afab] Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect bad 81160dda9a7aad13c04e78bb2cfd3c4630e3afab
# bad: [1e46c09ec10049a9e366153b32e41cc557383fdb] Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
git bisect bad 1e46c09ec10049a9e366153b32e41cc557383fdb
# bad: [20981a1e6b708536493e10ba90016ae28a9f3749] net: hns3: prevent unnecessary MAC TNL interrupt
git bisect bad 20981a1e6b708536493e10ba90016ae28a9f3749
# bad: [f5c977eed725a000dd4efa6d56e86d88085d0b1b] nfp: flower: detect potential pre-tunnel rules
git bisect bad f5c977eed725a000dd4efa6d56e86d88085d0b1b
# bad: [164f0de315ce3ea703f431cd01fd533e3f3fe358] Merge branch 'mlxsw-Test-coverage-for-DSCP-leftover-fix'
git bisect bad 164f0de315ce3ea703f431cd01fd533e3f3fe358
# good: [cd8ae2073963eff4c318d0a1f0e91fc35f0c6a83] sis900: add support for ethtool's EEPROM dump
git bisect good cd8ae2073963eff4c318d0a1f0e91fc35f0c6a83
# bad: [ef14358546b115d9ea4108f21c6badd97334ad18] r8169: make use of xmit_more
git bisect bad ef14358546b115d9ea4108f21c6badd97334ad18
# bad: [1c822948fa600d2d2e7926640279b16d1f4a7423] net: hns3: add debug messages to identify eth down cause
git bisect bad 1c822948fa600d2d2e7926640279b16d1f4a7423
# bad: [93681cd7d94f83903cb3f0f95433d10c28a7e9a5] r8169: enable HW csum and TSO
git bisect bad 93681cd7d94f83903cb3f0f95433d10c28a7e9a5
# good: [91c6bfb8315b05d0c3aa8c2cfc172888d05e31d5] mlx4/en_netdev: allow offloading VXLAN over VLAN
git bisect good 91c6bfb8315b05d0c3aa8c2cfc172888d05e31d5
# good: [0170d594ded8297760ca9d8eb7eb6a9aff378059] r8169: set GSO size and segment limits
git bisect good 0170d594ded8297760ca9d8eb7eb6a9aff378059
# good: [96ea772ef24114e6aa3ce39599474ae794be625f] r8169: remove r8169_csum_workaround
git bisect good 96ea772ef24114e6aa3ce39599474ae794be625f
# first bad commit: [93681cd7d94f83903cb3f0f95433d10c28a7e9a5] r8169: enable HW csum and TSO
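For anyone wanting to reproduce the build setup: an ARCH=i386 kernel build on a 64-bit box can be done roughly like this (the config path and -j value are placeholders; a 64-bit toolchain with 32-bit support is assumed):

# Reuse the 32-bit machine's kernel config, then build kernel and modules for i386
cp /path/to/config-32bit .config
make ARCH=i386 oldconfig
make ARCH=i386 -j8 bzImage modules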
Interesting, thanks for testing. Right, the offending commit just enables TSO per default, so it most likely doesn't cause the issue, it merely reveals it.

What you can do: on newer 32-bit kernels, simply disable TSO in a boot script or similar:

ethtool -K eth0 sg off tso off

Then your system should be stable again.

To get an idea about the root cause of the issue: you could manually enable TSO on older kernels to check whether they suffer from the issue as well:

ethtool -K eth0 sg on tso on

If certain older versions are fine with TSO enabled, then bisecting with TSO manually enabled should provide a hint.
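On Slackware, making the workaround persistent could be as simple as something like the following (interface name and the path to ethtool may differ on your system):

# e.g. in /etc/rc.d/rc.local: disable SG/TSO on the Realtek NIC at boot
if [ -x /usr/sbin/ethtool ]; then
    /usr/sbin/ethtool -K eth0 sg off tso off
fi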
I had tried turning off TSO before, and that didn't seem to make a difference (it came up as possibly related in a similar report about partial corruption). However, I hadn't tried (or even known about) scatter-gather. So I first ran just ethtool -K eth0 tso off by itself, and that did not resolve the issue (FWIW, remembering to flush the NFS caches on the client after each change is important for testing; see the note below). But when I disabled SG, without explicitly disabling TSO, it works. I did notice from the ethtool output that disabling SG also disables TCP segmentation offload (TSO, I presume), so maybe that helps narrow it down a little.

I still have 4.19.111 and 4.19.65 around and bootable, so I can boot into those, turn SG on, and see what happens. If either of those is still a 'good' kernel with SG enabled, I can then bisect toward this broken behavior and see if I can find a commit for you. I'll keep digging in whatever free time I find this week(end) and let you know what I've found.

For now, disabling SG and TSO seems to make it work, so I'll put that in a boot script right before it brings up my interfaces. That'll at least let me go back to having things 'normal'. Thanks for helping track this down! Hopefully I can find a related commit for you. I'll know more tomorrow.
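In case it helps anyone else testing this: flushing the client-side NFS cache between reads can be done roughly like this (share and mount point as in my earlier output):

# Either drop the client's caches so the next read really goes over the wire...
sync
echo 3 > /proc/sys/vm/drop_caches
# ...or simply remount the share
umount /mnt/tmp && mount -t nfs -o ro 10.0.0.135:/share /mnt/tmp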
Could you please test the following experimental patch on top of 5.4? It *might* improve SG handling, but so far it's an experiment only.

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 791d99b9e..ba822e70d 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4101,11 +4101,14 @@ static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb,
 			goto err_out;
 		}
 
-		txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
-		txd->opts2 = cpu_to_le32(opts[1]);
+		tp->tx_skb[entry].len = len;
 		txd->addr = cpu_to_le64(mapping);
+		txd->opts2 = cpu_to_le32(opts[1]);
 
-		tp->tx_skb[entry].len = len;
+		/* Complete memory writes before releasing descriptor */
+		dma_wmb();
+
+		txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
 	}
 
 	if (cur_frag) {
-- 
2.26.0
I applied that on top of the 5.4.28 release of the kernel and put it on the machine. It didn't seem to help at all: the byte count of the NFS read reply was correct, but the data in this case was all 0xaa bytes. Disabling SG again fixed the issue (after flushing the NFS cache), so I'd say the patch had no real effect. Since it's an experimental patch, I'll reboot to my stock kernel for now. If you've got other patches you'd like me to test, I'd be happy to.

I'm still playing with SG and TSO on the older kernels to see what happens, and will then bisect based on that. It's not a fast process, but I keep poking at it when I have time away from work.

Since I'm not as familiar with the kernel as some other folks: is there a special path the file data takes, as opposed to the headers, when being sent via NFS? If the headers and everything else are correct and only the data is wrong, perhaps there's some critical path or pointer involved in attaching the NFS payload that is mishandled by one particular function? Maybe something memory-mapped or the like, since we're dealing with files from disk? Just a thought.
The fact that the headers are correct but the data is not seems, to me, to match the suspicion that this is related to TCP Segmentation Offload (TSO). A TSO-capable NIC accepts a chunk of data that is too large (size > MTU) to be transmitted over the link in one frame, splits it into MTU-sized segments itself, and automatically replicates the supplied headers onto each segment. So the headers take a different path than the bulk payload, which is handed to the NIC as scatter-gather fragments via DMA; that is exactly the part that appears to go wrong here. Hence, this may have nothing to do with NFS...
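If you want to see TSO in action, capture on the sending side: packets are handed to tcpdump before the NIC segments them, so the capture shows TCP packets far larger than the MTU, while a capture of the same transfer on the receiving side only ever shows MTU-sized segments. Something like this while reading a file over the mount (port 2049 is NFS, interface name is an example):

tcpdump -ni eth0 -v tcp port 2049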
(In reply to timo from comment #12)
>
> Hence, this may have nothing to do with NFS...

I would tend to agree, but NFS is my only use case in this particular situation. With the level of strange corruption going on, I would expect it to affect ssh, rsync, web requests, DNS, and just about everything else the box serves over that NIC, but it doesn't seem to: I can reliably open web pages, download large files, ssh in, and do everything else normally aside from NFS. I can certainly try to capture some packets from large transfers and the like to see whether something similar is happening that I just can't spot. Do you think scp'ing a rather large file and checking its hash afterwards would be a good indicator, or is there a simpler test that might make it show up?

The other reason I feel NFS is involved is that, as best I can tell, it only seems to be corrupting the payload of the NFS transmission; the file permissions and other data inside the packet look perfectly fine. Since TSO only deals with the TCP headers and considers everything else the body, that might mean something else is going on there. Again, just my suspicions. If you have any additional tests I can run or try out, I'll be quite happy to do so; I'd personally like to know what the issue is. I've got new Intel NICs I could swap in for the onboard RTL ones, but I'd enjoy solving the problem much more than working around it.
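In case it's useful, the scp test I have in mind would be roughly the following (file size and the client hostname are placeholders):

# Create a file large enough to exercise TSO/SG, copy it over the Realtek
# interface, and compare checksums on both ends
dd if=/dev/urandom of=/tmp/bigfile bs=1M count=500
sha256sum /tmp/bigfile
scp /tmp/bigfile client:/tmp/bigfile
ssh client sha256sum /tmp/bigfile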