Bug 206891

Summary: NFS corruption over RTL card
Product: Networking
Reporter: Greg Neujahr (keetfox)
Component: Other
Assignee: Stephen Hemminger (stephen)
Status: NEW
Severity: normal
CC: hkallweit1, timo
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 5.4.1+
Regression: No

Description Greg Neujahr 2020-03-20 00:38:38 UTC
Overview:
First, I'm not positive this is a network-related bug; it seems unlikely, but it's where the debugging I've done points, so I opened it here since it seems like the best place to start.

When I view the contents of files on an NFS share served over my Realtek card, the files show 0x00 and 0xFF bytes (and a single random other byte here or there, depending on the size of the file). This happens to ALL files on the export, and the files on the actual NFS server (the machine with the RTL card) are correct and have proper contents, so they are not corrupt on the source file system. It also happens to ALL clients that connect (other Linux machines, other kernels, even a Windows machine). The other file properties (ACL, owner, times, size, etc.) are all correct, and nothing other than the contents of the files appears to be affected. There are no spurious retransmits, duplicates, or other anomalies. The data is even the correct length for the contents of the file.

The reason I opened this as a network issue is that the NFS server has two network cards. The Realtek card is on the 'local' network, and an E1000 card is on the 'external' network. If I instead change the exports and associated files to allow the other network card, mount it, and then cat the same file, the contents are correct, as they should be. So it seems related to the network implementation itself, even though nothing else in the packets is broken.

I tried reverting back to the 4.19 series kernels on the box (this is a slackware box), specifically 4.19.111 (the latest as of this writing) and was able to make it work just fine over the realtek card as before. I compiled the 5.4.1 kernel using the default slackware configs and installed that, and the problem reappeared. So, I've narrowed it down to 'something' between 4.19.111 and 5.4.1. 

Steps to reproduce
1. Run a 32-bit kernel, 5.4.1 or later (I'm running 5.4.25 from Slackware currently)
2. Export an NFS share.
3. Mount the NFS share from another machine.
4. Cat a text (or other) file and observe the contents aren't correct.
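
For example, a minimal version of steps 2 and 3 (the export path, client subnet, and mount point here are placeholders):

  # on the NFS server
  echo '/share 10.0.0.0/24(ro,sync,no_subtree_check)' >> /etc/exports
  exportfs -ra
  # on a client
  mount -t nfs -o ro 10.0.0.135:/share /mnt/tmp
  hexdump -C /mnt/tmp/test.txt   # compare against the copy on the server's local disk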

Actual Results
The file contents are incorrect. Not merely 'corrupted': they aren't really related to the actual contents at all. In this case, 0x00 and 0xff bytes.

0000  00 00 00 00 00 00 00 00  ff ff ff ff ff ff ff ff   ........ ........
0010  00 00 00 00 00 00 00 00  ff ff ff ff ff ff ff ff   ........ ........
0020  00 00 00 00 00 00 00 00  ff ff ff ff ff ff         ........ ......


Expected Results
The contents of the file as present on the NFS server's disk.

0000  54 68 69 73 20 69 73 20  61 20 74 65 73 74 2e 20   This is  a test. 
0010  49 74 20 73 68 6f 75 6c  64 20 62 65 20 6a 75 73   It shoul d be jus
0020  74 20 73 6f 6d 65 20 74  65 78 74 2e 20 0a         t some t ext. .


Build Date & Hardware
Kernels 4.19.65 and 4.19.111 were built from source yesterday; 5.4.1 was built from source today.
This is a 32 bit machine, not 64.

Motherboard: Shuttle Inc. SH55J/SH55J, BIOS 080015  06/18/2010
CPU: Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz (family: 0x6, model: 0x25, stepping: 0x2)
eth0: 
  r8169 0000:02:00.0 eth0: RTL8168e/8111e, 80:ee:73:06:b5:89, XID 2c1, IRQ 27
  r8169 0000:02:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
  RTL8211DN Gigabit Ethernet r8169-200:00: attached PHY driver [RTL8211DN Gigabit Ethernet] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
eth1:
  e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
  e1000 0000:03:09.0 eth1: (PCI:33MHz:32-bit) 00:1b:21:79:35:57
  e1000 0000:03:09.0 eth1: Intel(R) PRO/1000 Network Connection
NFS Mount:
  /www from 10.0.0.3:/www
  Flags: rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.0.0.3,mountvers=3,mountport=34737,mountproto=udp,local_lock=none,addr=10.0.0.3

Additional Builds and Platforms:
Does NOT seem to happen on any 4.19.x kernels.
DOES appear to affect 5.4.1 and 5.4.25.
Does NOT seem to affect the E1000 card, only the RTL8169 card.

Additional Info
This is all using the same nfs tools and other applications on the server. The only items that changed were the kernel and kernel modules. I can provide packet captures if they are needed, but I'd rather not expose the IP and hostnames publicly. I can also provide kernel build configs if needed, or output from various commands.

I've also tried mounting soft, with sync, with udp, with vers 4, 4.1, and 4.2, and with smaller rsize/wsize values (based on internet research).
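
For example, variations along these lines (a sketch of the kinds of options tried, not the exact command lines):

  mount -t nfs -o vers=3,proto=udp,soft,sync,rsize=8192,wsize=8192 10.0.0.3:/www /www
  mount -t nfs -o vers=4.2,soft,rsize=32768,wsize=32768 10.0.0.3:/www /www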

I have not tested this by changing the hardware parts out (or removing them), but I have 3 machines all configured the same with the same hardware, and it happens on each.
Comment 1 Heiner Kallweit 2020-03-20 07:48:42 UTC
If the NIC were messing up the packets, I'd expect checksum errors on the receiving side. What do the network statistics on the tx and rx sides say? Any errors?
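
For example (generic commands, not statistics from this system):

  ip -s link show eth0            # RX/TX error and drop counters
  ethtool -S eth0 | grep -i err   # driver-specific error counters, where the NIC exposes them
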
Further questions:
- Do you use jumbo packets?
- Do you use NFS over TCP or UDP?
- You say you use a 32 bit kernel. Based on the specs of the Shuttle system (chip set, CPU) it should be 64 bit capable. Did you try with a 64 bit kernel?

The issue may be related to the 32 bit kernel; otherwise I would expect many more reports like this. The best way forward would be bisecting to find the offending commit. Can you do that?
Comment 2 Greg Neujahr 2020-03-20 12:17:56 UTC
I totally agree about the checksums. I would have expected far more corruption in the rest of the packet (and subsequent retransmits) than just the file payload, so it's very strange. My only link to it being network related is that the behavior changes if I swap to a different NIC (and consequently a different IP address).

- I have not specifically enabled jumbo packets, and a quick ping test confirms the MTU limit:
[root@vulpes ~] ping -M do -s 8972 10.0.0.4
PING 10.0.0.4 (10.0.0.4) 8972(9000) bytes of data.
ping: local error: message too long, mtu=1500

- I'm using the default, which as I understand it is TCP. nfsstat -m after mounting shows proto=tcp, so that's confirmed. I have tried specifying proto=udp as well (though I forgot to capture the nfsstat output), and that didn't help either.

- I have not tried a 64 bit kernel yet. Given that everything on the box is compiled for 32 bit, I'm not sure how well just swapping the kernel out would work, but I can certainly give it a shot. If that doesn't work, I may be able to find a live CD/stick I can briefly configure NFS on and try that. It'll take me a little while to work through these suggestions.

I agree; I'd also expect many more reports if this weren't limited to 32 bit. I'd love to be running 64 bit Slackware on the boxes, but backing up and reinstalling didn't seem like an exciting prospect on active machines, so I've been lazy. This may be the thing that finally makes me do it.

I've been trying to do exactly as you suggest: pulling in different kernels, compiling, and giving it a shot. Unfortunately I wasn't able to find a way in the web UI to diff between 4.19.111 and 5.4.1 (limited to the drivers/net/ethernet/realtek directory) to see what might have changed, and there are probably enough changes elsewhere in the kernel that I'd be overwhelmed. If you can point me in the right direction on the history linkage (how the kernel moves between mainline branches like that), I'll download more kernels and see if I can find the offending commit.

I'll try to report my findings on 64-bit this weekend when I have some time to sink my teeth into it.
Comment 3 Heiner Kallweit 2020-03-20 16:59:15 UTC
git bisect can be used to identify the offending commit. There's a lot of documentation online how bisecting with git works.
You'd have to clone the repo git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git; the two starting points (one good, one bad) you have already identified.
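
Roughly, using the two points you mentioned (exact build and install steps depend on your distro):

  git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
  cd linux
  git bisect start
  git bisect bad v5.4.1        # known bad
  git bisect good v4.19.111    # known good
  # build, install and boot the commit git suggests, test NFS, then mark it:
  git bisect good              # or: git bisect bad
  # repeat until git reports the first bad commit, then:
  git bisect log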
Comment 4 Greg Neujahr 2020-03-20 23:07:16 UTC
Okay, so I was able to confirm at least that it is indeed in the 32 bit kernel only and NOT in the 64 bit. I grabbed a spare drive, took the machine out of service for the day, and now have it dual booting (different partitions) slackware-current with the same kernel (5.4.25) in 32 and 64 bit flavors. So now I've got a good test bed to play with various kernels on the 32bit side without mucking up my existing install.

Also, while testing I noticed that the bogus contents of the file seem to vary in this install. It could be a bad pointer, or some inode issue grabbing data from the wrong part of the disk; I'm not sure. Here's the output (10.0.0.135 being the NFS box under scrutiny here).

[root@testbox /mnt] mount -t nfs -o ro 10.0.0.135:/share /mnt/tmp
[root@testbox /mnt] cat /mnt/tmp/test.txt
                        [root@testbox /mnt]
[root@testbox /mnt] hexdump -v /mnt/tmp/test.txt
0000000 0000 0000 0000 0000 ffff ffff ffff ffff
0000010 0000 0000 0000 0000 ffff ffff ffff ffff
0000020 0000 0000 0000 0000 ffff
000002a
[root@testbox /mnt] umount /mnt/tmp
[root@testbox /mnt] mount -t nfs -o ro 10.0.0.135:/share /mnt/tmp
[root@testbox /mnt] cat /mnt/tmp/test.txt
This is a test. You should see some text.
[root@testbox /mnt] hexdump -v /mnt/tmp/test.txt
0000000 6854 7369 6920 2073 2061 6574 7473 202e
0000010 6f59 2075 6873 756f 646c 7320 6565 7320
0000020 6d6f 2065 6574 7478 0a2e
000002a
[root@testbox /mnt]

And at least once during all this testing it had the string "libXxf86vm - Extension library for the X".

I'll get on to bisecting the commits and rebuilding the kernel a few times. It's an older machine so it takes a few hours. I'll try to tackle that tomorrow.
Comment 5 Heiner Kallweit 2020-03-20 23:30:33 UTC
Great, thanks a lot for your efforts! Very few people are still running 32 bit x86, this would explain why the issue went unnoticed for quite some time.
Comment 6 Heiner Kallweit 2020-03-20 23:50:21 UTC
In addition to bisecting, you could check whether the following fixes the issue.
It relates to one of the very few changes that I think could cause this misbehavior on 32 bit.

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index d2eef3754..789f2cdf1 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -5440,6 +5440,8 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (sizeof(dma_addr_t) > 4 && tp->mac_version >= RTL_GIGA_MAC_VER_18 &&
 	    !dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
 		dev->features |= NETIF_F_HIGHDMA;
+	else
+		dma_set_mask(&pdev->dev, DMA_BIT_MASK(32));
 
 	rtl_init_rxcfg(tp);
 
-- 
2.25.2
Comment 7 Greg Neujahr 2020-03-25 22:06:43 UTC
Sorry it took a while to get back to you, but compiling the kernel 20 times on an i3 with 8GB of RAM and a slow old SATA drive takes a while. Partway through, I decided to look for a way to cross-compile on another box to make it faster (which got me playing with distcc, and that didn't work well for me). Fortunately, it turned out I could cross-compile on a 64-bit Ubuntu machine by passing ARCH="i386" to make (for make oldconfig and the bzImage/modules builds). That, combined with -j8 on a much faster machine, sped things up considerably, but it didn't occur to me until Tuesday of this week.
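
For reference, the cross-build amounted to roughly this (the staging path for modules is a placeholder):

  make ARCH=i386 oldconfig
  make ARCH=i386 -j8 bzImage modules
  make ARCH=i386 INSTALL_MOD_PATH=/tmp/staging modules_install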

After much trial and error, here is the offending commit, according to git bisect.

93681cd7d94f83903cb3f0f95433d10c28a7e9a5 is the first bad commit
commit 93681cd7d94f83903cb3f0f95433d10c28a7e9a5
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date:   Fri Jul 26 21:51:36 2019 +0200

    r8169: enable HW csum and TSO

    Enable HW csum and TSO per default except on known buggy chip versions.
    Realtek confirmed that RTL8168evl has a HW issue with TSO.

    Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 1a234fa156ea927adcc2ee8ebd198a40d9bb1431 f908be887d71a850513f4f91f04660ee8c498de9 M      drivers

I took a look at the commit and it just turns on a few feature flags, so it doesn't seem to do much by itself, but maybe one of the code paths it enables has issues on the 32 bit architecture. Because I didn't entirely trust the bisect, I also built the commit immediately before it (~1) and that commit itself. Sure enough, the commit before was fine, and the flagged commit is indeed what triggers the strange behavior.
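
In other words, the extra sanity check was roughly:

  git checkout 93681cd7d94f~1   # parent of the flagged commit: behaves fine
  # build, boot, test
  git checkout 93681cd7d94f     # the flagged commit itself: shows the corruption
  # build, boot, test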

Here's the bisect log, in case it helps any, but I suspect it won't matter that much.

I'll try your patch against the "bad" commit above and see what happens. Expect me to give some results of that later tonight.

git bisect start
# good: [cc4c818b2219c58af5f0ca59f3e9f02c48bc0b65] Linux 4.19.65
git bisect good cc4c818b2219c58af5f0ca59f3e9f02c48bc0b65
# bad: [79438f37a69a80822333c86acb06a71abce1bce3] Linux 5.4.1
git bisect bad 79438f37a69a80822333c86acb06a71abce1bce3
# good: [84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d] Linux 4.19
git bisect good 84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d
# good: [14cfbdac6680bc87f6b9f6196d4d4ec044a01481] Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
git bisect good 14cfbdac6680bc87f6b9f6196d4d4ec044a01481
# good: [168869492e7009b6861b615f1d030c99bc805e83] docs: kbuild: fix build with pdf and fix some minor issues
git bisect good 168869492e7009b6861b615f1d030c99bc805e83
# good: [e444d51b14c4795074f485c79debd234931f0e49] Merge tag 'tty-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
git bisect good e444d51b14c4795074f485c79debd234931f0e49
# bad: [574cc4539762561d96b456dbc0544d8898bd4c6e] Merge tag 'drm-next-2019-09-18' of git://anongit.freedesktop.org/drm/drm
git bisect bad 574cc4539762561d96b456dbc0544d8898bd4c6e
# bad: [81160dda9a7aad13c04e78bb2cfd3c4630e3afab] Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect bad 81160dda9a7aad13c04e78bb2cfd3c4630e3afab
# bad: [1e46c09ec10049a9e366153b32e41cc557383fdb] Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
git bisect bad 1e46c09ec10049a9e366153b32e41cc557383fdb
# bad: [20981a1e6b708536493e10ba90016ae28a9f3749] net: hns3: prevent unnecessary MAC TNL interrupt
git bisect bad 20981a1e6b708536493e10ba90016ae28a9f3749
# bad: [f5c977eed725a000dd4efa6d56e86d88085d0b1b] nfp: flower: detect potential pre-tunnel rules
git bisect bad f5c977eed725a000dd4efa6d56e86d88085d0b1b
# bad: [164f0de315ce3ea703f431cd01fd533e3f3fe358] Merge branch 'mlxsw-Test-coverage-for-DSCP-leftover-fix'
git bisect bad 164f0de315ce3ea703f431cd01fd533e3f3fe358
# good: [cd8ae2073963eff4c318d0a1f0e91fc35f0c6a83] sis900: add support for ethtool's EEPROM dump
git bisect good cd8ae2073963eff4c318d0a1f0e91fc35f0c6a83
# bad: [ef14358546b115d9ea4108f21c6badd97334ad18] r8169: make use of xmit_more
git bisect bad ef14358546b115d9ea4108f21c6badd97334ad18
# bad: [1c822948fa600d2d2e7926640279b16d1f4a7423] net: hns3: add debug messages to identify eth down cause
git bisect bad 1c822948fa600d2d2e7926640279b16d1f4a7423
# bad: [93681cd7d94f83903cb3f0f95433d10c28a7e9a5] r8169: enable HW csum and TSO
git bisect bad 93681cd7d94f83903cb3f0f95433d10c28a7e9a5
# good: [91c6bfb8315b05d0c3aa8c2cfc172888d05e31d5] mlx4/en_netdev: allow offloading VXLAN over VLAN
git bisect good 91c6bfb8315b05d0c3aa8c2cfc172888d05e31d5
# good: [0170d594ded8297760ca9d8eb7eb6a9aff378059] r8169: set GSO size and segment limits
git bisect good 0170d594ded8297760ca9d8eb7eb6a9aff378059
# good: [96ea772ef24114e6aa3ce39599474ae794be625f] r8169: remove r8169_csum_workaround
git bisect good 96ea772ef24114e6aa3ce39599474ae794be625f
# first bad commit: [93681cd7d94f83903cb3f0f95433d10c28a7e9a5] r8169: enable HW csum and TSO
Comment 8 Heiner Kallweit 2020-03-25 23:06:32 UTC
Interesting, thanks for testing. Right, the offending commit just enables TSO per default. Therefore it doesn't seem to cause the issue itself; it just reveals it.
What you can do:

On newer 32 bit kernels, just disable TSO in a boot script or similar:
ethtool -K eth0 sg off tso off
Then your system should be stable again.
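
For example, on Slackware something like this could go in /etc/rc.d/rc.local (interface name and script location are assumptions about your setup):

  # persist the r8169 workaround across reboots
  ethtool -K eth0 sg off tso off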

To get an idea about the root cause of the issue:
You could manually enable TSO on older kernels to check whether they also suffer from the issue: ethtool -K eth0 sg on tso on
If certain older versions are fine with TSO enabled, then bisecting with TSO manually enabled should provide a hint.
Comment 9 Greg Neujahr 2020-03-26 00:47:58 UTC
I had tried turning off TSO before, and that didn't seem to make a difference (it came up as relevant in a similar issue about partial corruption). However, I didn't try (or even know about) scatter-gather. So I tried just

ethtool -K eth0 tso off

by itself, and it didn't resolve the issue (FWIW, remembering to flush the NFS cache on the receiving client after changing settings is important for testing). But when I disabled SG without explicitly disabling TSO, it works. I did notice from the ethtool output, though, that disabling SG also seems to disable TCP segmentation offload (TSO, I presume). So maybe that'll help narrow it down a little.
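
In other words, the test loop is roughly this (flushing via drop_caches is just one way to force the client to re-read over the wire):

  # on the server: toggle the offload and confirm what actually changed
  ethtool -K eth0 sg off
  ethtool -k eth0 | grep -E 'scatter-gather|tcp-segmentation-offload'
  # on the client: drop cached pages so the next read really goes over NFS
  sync; echo 3 > /proc/sys/vm/drop_caches
  hexdump -C /mnt/tmp/test.txt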

I do have 4.19.111 and 4.19.65 still around and bootable, so I can boot into those and try turning SG on and see what happens. If either of those is a 'good' kernel, I can then bisect up to this broken commit and see if I can find the offending change for you.

I'll keep digging in the free time I find this week(end) and let you know what I've found. For now, disabling SG and TSO seems to make it work, so I'll put that in a boot script right before it brings up my interfaces. That'll at least let me go back to having things 'normal'.

Thanks for helping find this out! Hopefully I can find a related commit for you. I'll know more tomorrow.
Comment 10 Heiner Kallweit 2020-03-28 21:57:54 UTC
Could you please test the following experimental patch on top of 5.4?
It *might* improve SG handling, but so far it's an experiment only.

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 791d99b9e..ba822e70d 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4101,11 +4101,14 @@ static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb,
 			goto err_out;
 		}
 
-		txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
-		txd->opts2 = cpu_to_le32(opts[1]);
+		tp->tx_skb[entry].len = len;
 		txd->addr = cpu_to_le64(mapping);
+		txd->opts2 = cpu_to_le32(opts[1]);
 
-		tp->tx_skb[entry].len = len;
+		/* Complete memory writes before releasing descriptor */
+		dma_wmb();
+
+		txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
 	}
 
 	if (cur_frag) {
-- 
2.26.0
Comment 11 Greg Neujahr 2020-03-31 13:24:16 UTC
I applied that on top of the 5.4.28 release of the kernel and put it on the machine. It didn't seem to help at all. The byte count of the NFS read reply was correct, but the data in this case was all 0xaa bytes. Disabling SG again fixed the issue (after flushing the NFS cache), so I'd say the patch didn't have much of an effect. Since it's an experimental patch, I'll reboot to my stock kernel for now.

If you've got other patches you'd like me to test, I'd be happy to. I'm still playing with SG and TSO on older kernels to see what happens and then bisect into that. It's not a fast process, but I'm still poking at it when I have time away from work.

Since I'm not as familiar with the kernel as some other folks: is there a special path the file data follows, versus the headers, when it is sent via NFS? If the headers all come out correct but just the data is wrong, perhaps there's some critical pointer being passed in when the NFS payload is added that gets mishandled by one particular function? Maybe something memory-mapped or the like, since we're dealing with files from disk. Just a thought.
Comment 12 timo 2020-04-03 19:30:10 UTC
The fact that the headers are correct but the data is not seems, to me, to match the suspicion that this is related to TCP Segmentation Offload (TSO).

A TSO-enabled NIC accepts a chunk of data that is too large (size > MTU) to be transmitted in a single frame over the link, splits it into separate MTU-sized segments, and automatically replicates the supplied protocol headers onto each segment.

Hence, this may have nothing to do with NFS...
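
One way to observe this from userspace (a generic sketch): with TSO/GSO active, a capture taken on the sending host records 'packets' larger than the MTU, because the capture point sits above where the NIC does the segmentation.

  tcpdump -ni eth0 'tcp and greater 2000'   # sender-side frames above the 1500-byte MTU point to TSO/GSO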
Comment 13 Greg Neujahr 2020-04-08 12:17:43 UTC
(In reply to timo from comment #12)
> 
> Hence, this may have nothing to do with NFS...

I would tend to agree, but NFS is my only use case in this particular situation. Given the level of strange corruption going on, I would expect it to affect ssh, rsync, web requests, DNS, and just about everything else the box serves over that NIC, but it doesn't seem to. I can reliably open web pages, download large files, ssh, and do everything else normally aside from NFS. I can certainly try to capture packets from large transfers and the like to see if something similar is happening that I haven't spotted. Would you think that scp'ing a rather large file and comparing its hashes afterwards would be a good indicator? Or is there a simpler test that might manifest it?
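
Something like this is what I had in mind (a sketch; 'otherhost' is a placeholder):

  dd if=/dev/urandom of=/tmp/big.bin bs=1M count=512
  sha256sum /tmp/big.bin
  scp /tmp/big.bin otherhost:/tmp/big.bin
  ssh otherhost sha256sum /tmp/big.bin   # compare the two hashes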

The other reason I feel NFS is involved is that only the payload of the NFS transmission seems to be corrupted (as best I can tell). The file permissions and other data inside the packet look perfectly fine. Since TSO only deals with the TCP headers and treats everything else as the body, that might mean something else is going on here.

Again, just my suspicions. If you have any additional tests I can run or try out, I'll be quite happy to do so. I'd personally like to know what the issue is. I've got new Intel NICs I could swap in to replace the onboard RTL ones, but I'd enjoy solving the problem much more than working around it.