Bug 208157 - permission denied even though /etc/exports allows seemingly random
Summary: permission denied even though /etc/exports allows seemingly random
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: x86-64 Linux
Importance: P1 high
Assignee: bfields
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-06-13 22:26 UTC by Robert Dinse
Modified: 2020-08-09 01:27 UTC

See Also:
Kernel Version: 5.7.0 - 5.7.4
Subsystem:
Regression: No
Bisected commit-id:


Attachments
/etc/exports for ice the machine which exports the mail spool as /mail (3.15 KB, text/plain)
2020-06-15 21:00 UTC, Robert Dinse

Description Robert Dinse 2020-06-13 22:26:56 UTC
I have a heterogeneous environment with machines ranging from SunOS 4.1.4 using NFS version 2 to machines with modern Ubuntu 20.04 using version 4.2.  All of my NFS servers run Ubuntu 20.04 but with self-compiled kernels.  The kernels are configured essentially the same as Ubuntu low latency, except that they are fully tickless and do not have any of Canonical's patches.  On version 5.6.15, NFS appeared to work fine.  When I attempted to upgrade to 5.7.0, clients got permission denied on file systems exported in /etc/exports.  exportfs showed the file systems exported to the clients, but the clients still couldn't mount them.  This behavior persists with 5.7.1 and 5.7.2, making 5.7.x useless on NFS servers.  All of my servers are Intel based (i7-6700k or i7-6850k), so I do not know whether this bug is hardware specific.
Comment 1 bfields 2020-06-14 17:05:04 UTC
Based on the symptoms, it does sound like an upstream kernel bug.  But I don't have an explanation off the top of my head.

Are any of your clients using kerberos for NFS?

Is every client failing to mount every time, or is this intermittent or only affecting some clients?

Could I see the contents of /etc/exports on the server?

On a quick skim of nfsd commits in the range v5.6.15..v5.7, the only one that struck me as possibly relevant is 65286b883c6d "nfsd: export upcalls must not return ESTALE when mountd is down".  If it's possible, it would be interesting to test kernels just before and after that commit (65286b883c6d^ and 65286b883c6d).

A network trace showing the failure might also be useful.  (So, something like: tcpdump -s0 -wtmp.pcap -ieth0, then attempt the mount and wait for the failure, then kill tcpdump and attach tmp.pcap to this bug.  Or you can email it to me if it might contain confidential information.)
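
The capture suggested above can be wrapped in a few lines of shell. This is only a sketch: the interface name eth0 and the file name tmp.pcap are the example values from this comment, while the IFACE and PCAP variables and the capture_command function are names made up for illustration.

```shell
# Sketch of the capture suggested in comment 1.
# IFACE and PCAP are illustrative variables; the defaults are the
# example values from the comment and will likely differ on a real server.
IFACE="${IFACE:-eth0}"
PCAP="${PCAP:-tmp.pcap}"

# Print the tcpdump command instead of running it, so it can be reviewed first.
#   -s0  capture whole packets rather than truncating them
#   -w   write the raw packets to a file
#   -i   listen on the given interface
capture_command() {
    printf 'tcpdump -s0 -w %s -i %s\n' "$PCAP" "$IFACE"
}

capture_command
```

Run the printed command as root on the server, attempt the mount from a client, wait for the failure, then stop tcpdump with Ctrl-C and attach the resulting file.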
Comment 2 Robert Dinse 2020-06-15 21:00:49 UTC
Created attachment 289683
/etc/exports for ice the machine which exports the mail spool as /mail

This is the exports file for the machine that is failing hard.
Comment 3 Robert Dinse 2020-06-15 21:07:16 UTC
I am not using Kerberos at all; I do use NIS for distributing authentication information across my network.

Of three machines, one virtual and two physical hosts, exporting NFS shares,
the one on a virtual machine is working even under 5.7.2.

Of the two physical machines, one exports /home directories. After going to
5.7.2, this worked on all clients except one, which is running julinux
(basically a re-themed Ubuntu).

After I ran exportfs -a on the server exporting /home, julinux was able to mount the share.

On the server exporting /mail (which contains the mail spool and also a Bayesian filter database), however, NONE of the client machines could mount the
shares after upgrading to 5.7.2, and even after trying another exportfs -a they
still could not mount.  A reboot did not help.  Booting back into 5.6.0 made it
work again.
Comment 4 Robert Dinse 2020-06-20 07:59:03 UTC
I tried a 5.7.4 kernel on the NFS machines today and the same machines worked that worked before and the same machines didn't work that didn't.  So I did some troubleshooting to see what was different between the machines that worked and those that did not and what I found is there seems to be some weird interaction between systemd and the kernel.

With the 5.6.0 kernel all machines work reliably.  With 5.7.4, as with 5.7.2, one machine works, one is flaky and requires that I manually run exportfs, and one does not work even if I do that.

So I tried a manual mount from the machine that doesn't work and let it time out
to see what error I received.  The error was that the connection to mountd
timed out.  I'm not running firewalls on the servers; they sit behind a hardware
firewall that separates them from the outside world.  So this is not a firewall issue.

I checked and rpc.mountd was not running.  I tried to start it, but it failed with
a dependency error.  So I ran systemctl list-dependencies nfs-mountd.service, which showed me that nfs-server.service was not running.  I started it manually, then started nfs-mountd.service, and everything worked.  But for some reason nfs-server is not starting automatically on boot, even though it is enabled in systemd and does start when running kernel 5.6.0 and earlier.
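
The checks described in this comment can be collected into one helper. A minimal sketch, assuming a systemd-based server that uses the nfs-server.service and nfs-mountd.service unit names mentioned above; the function name nfs_unit_report is hypothetical.

```shell
# Hypothetical helper: summarize the state of one NFS-related unit.
# Uses only standard systemctl subcommands.
nfs_unit_report() {
    unit=$1
    echo "== $unit =="
    # Is the unit configured to start at boot?
    systemctl is-enabled "$unit"
    # Is it running right now?
    systemctl is-active "$unit"
    # Which units does it pull in? A failed one here can block startup.
    systemctl --no-pager list-dependencies "$unit"
}
```

It would be called as nfs_unit_report nfs-server.service and nfs_unit_report nfs-mountd.service right after booting the failing kernel. The messages a unit logged during the current boot can then be read with journalctl -b -u nfs-server.service.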
Comment 5 Robert Dinse 2020-06-24 23:01:18 UTC
I've updated the kernel version to reflect the fact that this also affects 5.7.4.
I would like to know whether any work has gone into 5.7.6 that would justify trying it.
Comment 6 bfields 2020-06-25 20:30:38 UTC
The one commit in that range that looks relevant is

65286b883c6de6 "nfsd: export upcalls must not return ESTALE when mountd is down"

Though I don't see why that would cause the server to not start on boot.  Is there anything in the logs explaining why it failed to start?
Comment 7 Robert Dinse 2020-06-25 21:33:26 UTC
I was not able to find any cause in the logs that seemed remotely relevant.  Nothing from the kernel in dmesg or kern.log, nothing from systemd in syslog, etc.  systemctl list-dependencies lists a number of things for nfs-server.service
which it did not show when I last booted.  So I am going to build 5.7.6 and
try it.  This time I will save all the log files so we can go back and look after
the fact.  If you have any specific suggestions for things to look for, please do
let me know.
Comment 8 bfields 2020-06-25 21:39:01 UTC
Well, it's a bit of a long shot, but if you could test 65286b883c6de6, that'd be interesting.  (E.g., build 5.7.4 with 65286b883c6de6 reverted, and see if that fixes it.)
Comment 9 Robert Dinse 2020-06-25 22:00:26 UTC
Can you tell me where I can find 65286b883c6de6?  Is this something I just use
patch to revert?  Sorry, I'm not used to working on kernels at this level, but I'm
always happy to learn and/or contribute.
Comment 10 Robert Dinse 2020-06-25 22:02:08 UTC
Also, are there any worthwhile hacking / debugging options that would be good to enable to possibly provide more insight?  One thing I was going to try, if it fails to start this time, is to check all the systemd dependencies to see whether
something else that systemd needs running first is what is causing it not to
start.
Comment 11 bfields 2020-06-25 22:31:17 UTC
Eh, that patch doesn't actually revert cleanly.

If you're comfortable building your own kernel, you could get the tree just before that commit with:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
cd linux-2.6
git checkout 65286b883c6de6^

then build a kernel, install, and test.  Then for extra credit, "git checkout 65286b883c6de6" and try that one.

But the best method for building and installing a kernel may depend on your distro.

Yes, trying to figure out why systemd's not starting things would be useful.
Comment 12 Robert Dinse 2020-06-25 23:07:34 UTC
OK, I normally build with make -j8 bindeb-pkg (8-core CPU).
And while I can manually install a kernel, using a deb package makes it
easier and cleaner to remove, and does everything automagically.
Comment 13 Robert Dinse 2020-06-25 23:28:14 UTC
I am confused; both git checkouts use the same number...

Comment 14 bfields 2020-06-26 01:24:30 UTC
(In reply to Robert Dinse from comment #13)
>      I am confused, both git checkout's are same number...

The caret means "parent commit", so if 65286b883c6de6^ is good but 65286b883c6de6 is buggy, that suggests it was 65286b883c6de6 that introduced the bug.
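
The caret notation can be tried out safely in a throwaway repository. The sketch below creates two empty commits and resolves the second one with a trailing ^; the repository, the commit messages, and the helper g are made up for the demonstration and have nothing to do with the kernel tree.

```shell
# Throwaway repository; nothing here touches a kernel tree.
repo=$(mktemp -d)
# Helper that runs git inside the demo repository with a fixed identity.
g() { git -C "$repo" -c user.name=demo -c user.email=demo@example.com "$@"; }

git init -q "$repo"
g commit -q --allow-empty -m "parent"
parent=$(g rev-parse HEAD)
g commit -q --allow-empty -m "child"
child=$(g rev-parse HEAD)

# "<commit>^" resolves to the first parent of <commit>.
resolved=$(g rev-parse "$child^")
echo "child:  $child"
echo "child^: $resolved"
```

The second line printed is the hash of the "parent" commit, which is why checking out 65286b883c6de6^ gives the tree as it was immediately before 65286b883c6de6 was applied.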
Comment 15 Robert Dinse 2020-06-26 01:33:45 UTC
     The kernel from the git tree fails to package for me.  It builds the
actual kernel and modules, but when it goes on to the bindeb-pkg step it
fails:

make[2]: *** [debian/rules:6: build] Error 2
dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2
make[1]: *** [scripts/Makefile.package:83: bindeb-pkg] Error 2
make: *** [Makefile:1417: bindeb-pkg] Error 2


Comment 16 Robert Dinse 2020-06-26 01:34:21 UTC
Thanks for the explanation.  Unfortunately, the build is failing.

Comment 17 Robert Dinse 2020-06-26 07:42:41 UTC
     I was able to get it to build with the default config but not with my
config for some reason.  Having done that, I discovered that the resulting
kernel is 5.6-rc6, and because of that I can eliminate that commit as the issue.

     The final release of 5.6 worked as far as NFS is concerned; it did not
fail until 5.7.0.

Comment 18 bfields 2020-06-26 16:52:30 UTC
(In reply to Robert Dinse from comment #17)
>      I was able to get it to build with default config but not with my
> config for some reason, but having done that I discovered that is
> 5.6-rc6 and because of that can eliminate that commit as the issue.
> 
>      The final release of 5.6 worked as far as NFS, it did not fail until
> 5.7.0.

Commit 65286b883c6 is in v5.7 but not v5.6, so if v5.6 is good, that doesn't rule out that commit as a candidate.
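
One possible explanation for the 5.6-rc6 version string, offered as an assumption rather than something confirmed in this thread, is that the commit was written on a branch based on a 5.6 release candidate and only merged after 5.6 was released. The sketch below reproduces that shape in a throwaway repository; the tag names rc, older, and newer and the helper g are invented stand-ins for v5.6-rc6, v5.6, and v5.7.

```shell
# Throwaway repository that mimics "based on an rc, merged after the release".
repo=$(mktemp -d)
g() { git -C "$repo" -c user.name=demo -c user.email=demo@example.com "$@"; }

git init -q "$repo"
g commit -q --allow-empty -m "release candidate"
g tag rc                                   # stand-in for v5.6-rc6
mainline=$(g symbolic-ref --short HEAD)    # default branch name varies by git version
g checkout -q -b topic
g commit -q --allow-empty -m "fix"         # stand-in for the nfsd commit
fix=$(g rev-parse HEAD)
g checkout -q "$mainline"
g commit -q --allow-empty -m "older release"
g tag older                                # stand-in for v5.6
g merge -q -m "merge topic" topic
g tag newer                                # stand-in for v5.7

# Which tags contain the fix?
g tag --contains "$fix"
# Exit status 0 means "is an ancestor"; non-zero means it is not.
g merge-base --is-ancestor "$fix" older && echo "in older" || echo "not in older"
g merge-base --is-ancestor "$fix" newer && echo "in newer" || echo "not in newer"
```

In a real kernel clone the same questions would be asked with git tag --contains 65286b883c6de6 and git merge-base --is-ancestor 65286b883c6de6 v5.6. The point of the sketch is that the version a checkout reports reflects where the branch started, not the first release that includes the commit.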
Comment 19 Robert Dinse 2020-06-26 20:01:12 UTC
      When I built it, it had a version number of 5.6-rc6, not 5.7-rc6, but I
can't build it with a usable configuration.  It builds with the default config
but errors out with a workable one.  I haven't been able to isolate exactly what
in the config breaks it yet.

Comment 20 Robert Dinse 2020-06-27 08:36:54 UTC
I was unable to get the git tree to build with a usable configuration; it would build with the default config but not with a config that actually had everything I needed.

So I decided I would build 5.7.6 and give it a try, fully expecting it not to
work like 5.7.0-5.7.4, and I had planned to look at all of nfs-server's systemd
dependencies and see if perhaps it was actually one of them that was broken and
that would explain why I found nothing related to knfsd in the logs.

But to my amazement, 5.7.6 worked perfectly, all three NFS servers came up clean
and fast so it appears that something changed between 5.7.4 and 5.7.6 that
resolved it.
Comment 21 Robert Dinse 2020-06-27 09:07:36 UTC
Well, maybe not entirely fixed.  I am getting some of these in dmesg:

[  166.661495] NFS4: Couldn't follow remote path
[  169.781559] NFS4: Couldn't follow remote path
[  172.901550] NFS4: Couldn't follow remote path
[  180.021612] NFS4: Couldn't follow remote path
[  191.151760] NFS4: Couldn't follow remote path
[  204.261968] NFS4: Couldn't follow remote path


I can't find anything obviously not working though.
Comment 22 Robert Dinse 2020-06-28 01:16:52 UTC
The last errors can be ignored.  I have found that they are caused by the fact that my nfs-utils are not current enough to use the new mount system calls provided in 5.7, and I need to update them.  It seems to have no functional consequence.
Comment 23 Robert Dinse 2020-07-23 08:33:29 UTC
This is still a problem in 5.7.9.  I just booted one of our NFS servers into it
to test.  Most clients successfully remounted the exported partition, but one did
not, and when I tried to mount it manually it told me permission denied.  Doing an exportfs -a on the server did not fix it, but exportfs -r did.
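
For reference, exportfs -a exports everything listed in /etc/exports, while exportfs -r re-exports everything and also removes export-table entries that are no longer valid. The sketch below wraps the refresh in a helper; the function name refresh_exports is hypothetical, and the thread does not establish why -r helped where -a did not.

```shell
# Hypothetical helper: refresh the export table and show the result.
# Must be run as root on the NFS server.
refresh_exports() {
    # -r re-exports everything and drops entries that are no longer valid
    exportfs -r || return 1
    # -v lists the current exports together with their options
    exportfs -v
}
```

After running refresh_exports on the server, the affected client can retry its mount.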
Comment 24 Robert Dinse 2020-08-09 01:27:11 UTC
This seems completely fixed in 5.8 now so I'm going to close this ticket.
