I have a heterogeneous environment with machines ranging from SunOS 4.1.4 using NFS version 2 to machines with modern Ubuntu 20.04 using version 4.2. All of my NFS servers run Ubuntu 20.04 with self-compiled kernels. The kernels are configured essentially the same as Ubuntu's low-latency kernel, except that they are fully tickless and carry none of Canonical's patches. On version 5.6.15, NFS appeared to work fine. When I attempted to upgrade to 5.7.0, clients got permission denied on file systems exported in /etc/exports. exportfs showed the file systems exported to the clients, but the clients still couldn't mount them. This behavior persists with 5.7.1 and 5.7.2, making 5.7.x useless on NFS servers. All of my servers are Intel-based (i7-6700K or i7-6850K), so I do not know whether this bug is hardware-specific.
Based on the symptoms, it does sound like an upstream kernel bug. But I don't have an explanation off the top of my head. Are any of your clients using kerberos for NFS? Is every client failing to mount every time, or is this intermittent or only affecting some clients? Could I see the contents of /etc/exports on the server? On a quick skim of nfsd commits in the range v5.6.15..v5.7, the only one that struck me as possibly relevant is 65286b883c6d "nfsd: export upcalls must not return ESTALE when mountd is down". If it's possible, it would be interesting to test kernels just before and after that commit (65286b883c6d^ and 65286b883c6d). A network trace showing the failure might also be useful. (So, something like: tcpdump -s0 -wtmp.pcap -ieth0, then attempt the mount and wait for the failure, then kill tcpdump and attach tmp.pcap to this bug. Or you can email it to me if it might contain confidential information.)
Created attachment 289683
/etc/exports for ice, the machine which exports the mail spool as /mail

This is the exports file for the machine that is failing hard.
I am not using Kerberos at all; I do use NIS for distributing authentication information across my network. Of the three machines exporting NFS shares (one virtual and two physical hosts), the virtual machine is working even under 5.7.2. Of the two physical machines, one exports the /home directories. After going to 5.7.2, this worked for all clients except one, which is running julinux (basically a re-themed Ubuntu). After I ran exportfs -a on the server exporting /home, julinux was able to mount the share. But on the server exporting /mail (which contains the mail spool and also a Bayesian filter database), NONE of the client machines could mount the shares after upgrading to 5.7.2, and even after trying another exportfs -a they still could not mount. A reboot did not help. Booting back into 5.6.0 made it work again.
I tried a 5.7.4 kernel on the NFS machines today and the same machines worked that worked before and the same machines didn't work that didn't. So I did some troubleshooting to see what was different between the machines that worked and those that did not and what I found is there seems to be some weird interaction between systemd and the kernel. With 5.6.0 kernel all machines work reliably, with 5.7.4, as with 5.7.2, one machine works, one is flaky requiring that I manually run an exportfs, and one does not work even if I do that. So I tried a manual mount from the machine that doesn't work and let it time out to see what error I received. The error I received was connection to mountd timed out. I'm not running firewalls on the server, they are behind a hardware firewall from the outside world. So not a firewall issue. I checked and rpc.mountd was not running, I tried to start it but it failed with a dependency error. So I did a systemctl list-dependencies nfs-mountd.service and it shows me nfs-server.service is not running, so I started it manually, started nfs-mountd service and then everything worked. But for some reason nfs-server is not starting automatically on boot even though it is enabled in systemd and does when running kernel 5.6.0 and earlier.
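The checks described in the comment above can be collected into a small helper. This is only a sketch: the unit names nfs-server.service and nfs-mountd.service are the ones named in this thread, while the helper names run and nfs_diag and the DRY_RUN variable are made up for illustration. It assumes systemctl, journalctl and rpcinfo are installed on the server.

```shell
#!/bin/sh
# Units named in this thread; adjust if your distro names them differently.
UNITS="nfs-server.service nfs-mountd.service"

# run CMD...: print the command, then execute it unless DRY_RUN=1.
# (Hypothetical helper, not part of nfs-utils or systemd.)
run() {
    printf '+ %s\n' "$*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# nfs_diag: collect the state of the NFS server units for the current boot.
nfs_diag() {
    for unit in $UNITS; do
        run systemctl is-enabled "$unit"                    # enabled at boot?
        run systemctl status --no-pager "$unit"             # running or failed?
        run systemctl list-dependencies --no-pager "$unit"  # what it needs first
        run journalctl -b --no-pager -u "$unit"             # log lines since boot
    done
    run rpcinfo -p localhost    # is mountd registered with rpcbind?
}
```

Setting DRY_RUN=1 before calling nfs_diag only prints the commands, which is a cheap way to review them before running them as root on a production server.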
I've updated the bug to reflect the fact that this also affects 5.7.4. I would like to know whether any work has been done in 5.7.6 that would justify trying it.
The one commit in that range that looks relevant is 65286b883c6de6 "nfsd: export upcalls must not return ESTALE when mountd is down". Though I don't see why that would cause the server to not start on boot. Is there anything in the logs explaining why it failed to start?
I was not able to find any cause in the logs that seemed remotely relevant. Nothing from the kernel in dmesg or kern.log, nothing from systemd in syslog, etc. systemctl list-dependencies lists a number of things for nfs-server.service which it did not show when I last booted. So I am going to build 5.7.6 and try it; this time I will save all the log files so we can go back and look after the fact. If you have any specific suggestions for things to look for, please do let me know.
Well, it's a bit of a long shot, but if you could test 65286b883c6de6, that'd be interesting. (E.g., build 5.7.4 with 65286b883c6de6 reverted, and see if that fixes it.)
Can you tell me where I can find 65286b883c6de6? Is this something I just use patch to revert? Sorry, I'm not used to working on kernels at this level, but I'm always happy to learn and/or contribute.
Also, are there any worthwhile hacking / debugging options that would be good to enable to possibly provide more insight? One thing I was going to try, if it fails to start this time, is to check all the systemd dependencies to see whether something else that systemd needs running first is causing it not to start.
Eh, that patch doesn't actually revert cleanly.

If you're comfortable building your own kernel, you could get the tree just before that commit with:

    git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
    cd linux-2.6
    git checkout 65286b883c6de6^

then build a kernel, install, and test. Then for extra credit, "git checkout 65286b883c6de6" and try that one.

But, best method for building and installing a kernel may depend on your distro.

Yes, trying to figure out why systemd's not starting things would be useful.
Ok, I normally build with make -j8 bindeb-pkg, (8 core CPU). And while I can manually install a kernel, using deb package makes it easier and cleaner to remove and does everything automagically.
I am confused, both git checkout's are same number...
(In reply to Robert Dinse from comment #13)
> I am confused, both git checkout's are same number...

The caret means "parent commit", so if 65286b883c6de6^ is good but 65286b883c6de6 is buggy, that suggests it was 65286b883c6de6 that introduced the bug.
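To make the caret notation concrete, here is a minimal sketch that builds a throwaway repository with two commits and resolves the second commit's parent. Nothing in it touches the kernel tree; the file name, commit messages and demo identity are made up for illustration.

```shell
#!/bin/sh
# Throwaway repository to show what "<commit>^" resolves to.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.invalid   # demo identity, local to this repo
git config user.name demo

echo one > file
git add file
git commit -q -m "first commit"
first=$(git rev-parse HEAD)

echo two > file
git commit -q -am "second commit"
second=$(git rev-parse HEAD)

# "$second^" names the parent of the second commit, i.e. the first commit.
parent=$(git rev-parse "$second^")
echo "second: $second"
echo "parent: $parent"
```

The two checkouts suggested in comment #11 therefore select two adjacent trees: one without the suspect change and one with it.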
The kernel package from the git tree fails to build for me. It builds the actual kernel and modules, but when it goes to build the bindeb-pkg it fails:

    make[2]: *** [debian/rules:6: build] Error 2
    dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2
    make[1]: *** [scripts/Makefile.package:83: bindeb-pkg] Error 2
    make: *** [Makefile:1417: bindeb-pkg] Error 2
Thanks for the explanation, unfortunately build is failing.
I was able to get it to build with default config but not with my config for some reason, but having done that I discovered that is 5.6-rc6 and because of that can eliminate that commit as the issue. The final release of 5.6 worked as far as NFS, it did not fail until 5.7.0.
(In reply to Robert Dinse from comment #17)
> I was able to get it to build with default config but not with my
> config for some reason, but having done that I discovered that is
> 5.6-rc6 and because of that can eliminate that commit as the issue.
>
> The final release of 5.6 worked as far as NFS, it did not fail until
> 5.7.0.

Commit 65286b883c6 is in v5.7 but not v5.6, so if v5.6 is good, that doesn't rule out that commit as a candidate.
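One way a tree can report an older version while its commit first ships in a newer release is a topic branch that forks before the older release and is merged only afterwards. The sketch below reproduces that shape in a toy repository; the tags demo-rc, demo-old and demo-new are invented stand-ins, and this is an assumption about the history consistent with the reporter seeing 5.6-rc6, not something checked against the real tree.

```shell
#!/bin/sh
# Toy history: a fix branches off before "demo-old" is tagged and is
# merged only afterwards, so it is contained in "demo-new" alone.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.invalid
git config user.name demo

# commit NAME: create NAME.txt and commit it (hypothetical helper).
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit base
git tag demo-rc                          # stands in for a release candidate
main=$(git symbolic-ref --short HEAD)    # default branch name varies

git checkout -q -b topic
commit fix
fix=$(git rev-parse HEAD)

git checkout -q "$main"
commit release
git tag demo-old                         # older release, without the fix

git merge -q --no-ff -m "merge topic" topic
git tag demo-new                         # newer release, with the fix

# is_in COMMIT TAG: print yes if COMMIT is an ancestor of TAG, else no.
is_in() { if git merge-base --is-ancestor "$1" "$2"; then echo yes; else echo no; fi; }
in_old=$(is_in "$fix" demo-old)
in_new=$(is_in "$fix" demo-new)
echo "in demo-old: $in_old, in demo-new: $in_new"
```

In a clone of the kernel tree, the same check would be "git merge-base --is-ancestor 65286b883c6d v5.6" followed by "echo $?" (0 means the commit is contained), and "git describe --contains 65286b883c6d" names the first tag that contains it.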
When I built it had a version number of 5.6rc6 not 5.7rc6, but I can't build with a usable configuration, it builds in the default but errors out with a workable conf. Haven't been able to isolate exactly what in the conf breaks it yet.
I was unable to get the git tree to build with a usable configuration; it would build with the default config but not with a config that actually had everything I needed. So I decided I would build 5.7.6 and give it a try, fully expecting it not to work, like 5.7.0-5.7.4. I had planned to look at all of nfs-server's systemd dependencies to see whether one of them was actually broken, which would explain why I found nothing related to knfsd in the logs. But to my amazement, 5.7.6 worked perfectly: all three NFS servers came up clean and fast, so it appears that something changed between 5.7.4 and 5.7.6 that resolved it.
Well, maybe not entirely fixed. I am getting some of these in dmesg:

[ 166.661495] NFS4: Couldn't follow remote path
[ 169.781559] NFS4: Couldn't follow remote path
[ 172.901550] NFS4: Couldn't follow remote path
[ 180.021612] NFS4: Couldn't follow remote path
[ 191.151760] NFS4: Couldn't follow remote path
[ 204.261968] NFS4: Couldn't follow remote path

I can't find anything obviously not working, though.
The last errors can be ignored. I have found that they are caused by the fact that my nfs-utils are not current enough to use the new mount system calls provided in 5.7, so I need to update them. It seems to have no functional consequence.
This is still a problem in 5.7.9. I just booted one of our NFS servers into it to test: most clients successfully remounted the exported partition, but one did not, and when I tried to mount it manually it told me permission denied. Doing an exportfs -a on the server did not fix it, but exportfs -r did.
This seems completely fixed in 5.8 now so I'm going to close this ticket.