Bug 24912

Summary: kernel silently ignores 'read-only' option for "bind,ro" mounts
Product: File System Reporter: Sergey Kondakov (virtuousfox)
Component: VFSAssignee: fs_vfs
Status: REOPENED ---    
Severity: normal CC: alan, imwellcushtymelike, jch, ryao, sstewartgallus00
Priority: P1    
Hardware: All   
OS: Linux   
URL: https://patchwork.kernel.org/patch/56569/
Kernel Version: 3.12 Subsystem:
Regression: No Bisected commit-id:

Description Sergey Kondakov 2010-12-15 06:55:03 UTC
as written in http://lwn.net/Articles/281157/ , you still have to issue two commands:
    mount --bind /vital_data /untrusted_container/vital_data
    mount -o remount,ro /untrusted_container/vital_data
instead of just one:
    mount -o bind,ro /vital_data /untrusted_container/vital_data

and worse - you have to create two entries into fstab per each bind-mount to get some directories binded in read-only.

not long ago even `mount` utility was confused by this ridiculous behaviour of kernel [http://thread.gmane.org/gmane.linux.utilities.util-linux-ng/2771] and reported bind-mount with "ro" option while in reality it was in "rw".
now it fixed in `mount` and it shows 'rw' but kernel itself still just ignores 'ro' option unless it passed with redundant command/line in fstab.

patch from bug URL fixes the issue and makes kernel properly obey (few last lines need to be tweaked a little to apply).

they say [http://www.linuxquestions.org/questions/linux-general-1/how-do-i-create-a-read-only-bind-mount-in-fstab-778248/] Fedora, probably, have incorporated the fix but mainstream kernel still is not. what the hell, isn't Linus himself uses Fedora ?

this is pain in an ass to set up proper logical paths for fileserver at boot-time and an unexpected behaviour. please, do something with it :(
Comment 1 Ken Sharp 2013-12-10 23:28:58 UTC
It would nice to be told why this was marked as obsolete rather than just going ahead and doing it.

Has bind changed? Has it been removed? Anything.............?
Comment 2 Alan 2013-12-11 11:48:02 UTC
It was filed against a kernel we no longer care about. If it is still present in current kernels please update the kernel version accordingly.

Bugzilla is not however used as a work queue so if you want to change this send patches to the linux filesystem lists.
Comment 3 Ken Sharp 2013-12-11 12:23:18 UTC
It's true of ALL kernels. I cannot change the version as I do not have admin access.
Comment 4 Sergey Kondakov 2013-12-11 15:31:27 UTC
Isn't "version" imply "starting from X and all the way up" ? Because changing it for every single update over years-persistent bugs would be fantastically ridiculous.

And it is a "work queue", at least for issues, if not for improvements. The one that may be not directed to someone in particular, but anyone's unwillingness to fix some clearly incorrect behaviour does not justify its denial for everyone.

Especially for such a thing, for the introduction of which quite a fuss was made at the time (read-only binds, that is).
Comment 5 Alan 2013-12-11 15:53:29 UTC
A lot of reports get filed and then fixed without anyone even realising, or are invalidated simply by other changes so old ones get closed, they are mostly just noise and their time cost exceeds their value in actually being able to find things that do need sorting out.

They are still in the database so searches for things like a specific CD drive or a specific oops pattern can still find them.

But if you want it to happen then it's sat in bugzilla since 2010, it's not going to suddenely happen. If you want it to happen then I'd do two things

- verify if the change is in Fedora as you claim
- either grab the Fedora patch or forward put the upstream patch
- test it
- submit it.
Comment 6 Steven Stewart-Gallus 2014-07-27 23:37:50 UTC
This bug seems worse than it might appear. I don't think it's possible to work around this bug by remounting filesystems if one is working inside a container.

For example, the following sandbox fails silently due to this bug and working around the bug by remounting fails with EPERM.

#define _GNU_SOURCE

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <mntent.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <linux/capability.h>
#include <linux/sched.h>

#define SHELL "./bin/sh"
#define RUNTIME_NAME "sandbox"

#define ARRAY_SIZE(A) (sizeof A / sizeof A[0u])

static char * const shell_arguments[] = { (char *)SHELL, NULL };
static char * const shell_environment[] = {  NULL };

static char fstab_config[] =
      "# <file system>	<mount point>	<type>	<options>\n"
      "proc	proc	proc	ro,nodev,noexec,nosuid\n"
      "sys	sys	sysfs	ro,nodev,noexec,nosuid\n"
      "\n"
      "/dev	dev	none	rbind\n"
      "\n"
      "tmpfs	run	tmpfs	none\n"
      "tmpfs	var	tmpfs	none\n"
      "\n"
      "# Allow connecting to X11\n"
      "/tmp	tmp	none	ro,bind\n"
      "\n"
      "/etc	etc	none	ro,bind\n"
      "\n"
      "/lib	lib	none	ro,bind\n"
      "/lib32	lib32	none	ro,bind\n"
      "/lib64	lib64	none	ro,bind\n"
      "/lib64	lib64	none	ro,bind\n"
      "\n"
      "/bin	bin	none	ro,bind\n"
      "/sbin	sbin	none	ro,bind\n"
      "/usr	usr	none	ro,bind\n";

struct mount {
    char const * source;
    char const * target;
    mode_t mode;
    char const * filesystem_type;
    unsigned long mountflags;
    char const *data;
};

int main(void)
{
    int errnum;

    if (-1 == unshare(CLONE_NEWUSER /* Needed to do the rest of the unsharing */

                      /* Prevent signals, ptracing of other processes */
                      | CLONE_NEWPID

                      /* With chroot prevent messing with user files */
                      | CLONE_NEWNS

                      /* The rest are kind of meh but might as well be done */
                      | CLONE_NEWIPC | CLONE_NEWUTS | CLONE_NEWNET)) {
        perror("unshare");
        return EXIT_FAILURE;
    }

    /* Fork to allow for multithreading and to make the shell less
     * buggy.
     */
    {
        pid_t child = fork();
        if (-1 == child) {
            perror("fork");
            return EXIT_FAILURE;
        }

        if (child != 0) {
            siginfo_t info;
            do {
                errnum = -1 == waitid(P_PID, child, &info, WEXITED) ? errno : 0;
            } while (EINTR == errnum);
            if (errnum != 0) {
                assert(errnum != EINVAL);
                assert(errnum != ECHILD);
                assert(false);
            }
            return info.si_status;
        }
    }

    FILE * tmp = tmpfile();
    if (NULL == tmp) {
        perror("tmpfile");
        return EXIT_FAILURE;
    }

    {
        size_t bytes_to_write = sizeof fstab_config - 1u;
        if (fwrite(fstab_config, 1u, bytes_to_write, tmp) != sizeof fstab_config - 1u) {
            perror("fwrite");
            return EXIT_FAILURE;
        }
    }

    char tmppath[] = "/proc/self/fd/XXXXXXXXXXX";
    sprintf(tmppath, "/proc/self/fd/%i", fileno(tmp));

    FILE * fstab = setmntent(tmppath, "r");
    if (NULL == fstab) {
        perror("setmtent");
        return EXIT_FAILURE;
    }

    if (EOF == fclose(tmp)) {
        perror("fclose");
        return EXIT_FAILURE;
    }

    {
        char const *runtime_dir_string = getenv("XDG_RUNTIME_DIR");
        if (NULL == runtime_dir_string) {
            /* TODO: Fallback somewhere smart */
            fputs("missing XDG_RUNTIME_DIR\n", stderr);
            return EXIT_FAILURE;
        }

        if (-1 == chdir(runtime_dir_string)) {
            perror("chdir");
            return EXIT_FAILURE;
        }
    }

    if (-1 == mkdir(RUNTIME_NAME, S_IRWXU)) {
        errnum = errno;
        if (errnum != EEXIST) {
            perror("mkdir");
            return EXIT_FAILURE;
        }
    }

    if (-1 == mount(NULL, RUNTIME_NAME, "tmpfs", 0, NULL)) {
        perror("mount");
        return EXIT_FAILURE;
    }

    if (-1 == chdir(RUNTIME_NAME)) {
        perror("chdir");
        return EXIT_FAILURE;
    }

    for (;;) {
        errno = 0;
        struct mntent * entry = getmntent(fstab);
        if (NULL == entry) {
            errnum = errno;
            if (errnum != 0) {
                perror("getmntent");
                return EXIT_FAILURE;
            }

            break;
        }

        enum {
            BIND,
            RBIND,
            REMOUNT,
            RO,
            RW,
            SUID,
            NOSUID,
            NODEV,
            NOEXEC
        };
        char * const token[] = {
            [BIND] = "bind",
            [RBIND] = "rbind",
            [REMOUNT] = "remount",
            [RO] = MNTOPT_RO,
            [RW] = MNTOPT_RW,
            [SUID] = MNTOPT_SUID,
            [NOSUID] = MNTOPT_NOSUID,
            [NODEV] = "nodev",
            [NOEXEC] = "noexec",
            NULL
        };
        bool bind = false;
        bool rec = false;
        bool remount = false;
        bool readonly = false;
        bool readwrite = false;
        bool suid = true;
        bool dev = true;
        bool exec = true;
        char *value = NULL;
        {
            char *subopts = entry->mnt_opts;
            if (0 == strcmp("none", subopts)) {
                goto mount;
            }

            while (*subopts != '\0') {
                switch (getsubopt(&subopts, token, &value)) {
                case BIND:
                    bind = true;
                    break;

                case RBIND:
                    bind = true;
                    rec = true;
                    break;

                case REMOUNT:
                    remount = true;
                    break;

                case RO:
                    readonly = true;
                    break;

                case RW:
                    readwrite = true;
                    break;

                case SUID:
                    suid = true;
                    break;

                case NOSUID:
                    suid = false;
                    break;

                case NODEV:
                    dev = false;
                    break;

                case NOEXEC:
                    exec = false;
                    break;

                default:
                    fprintf(stderr, "No match found for token: /%s/\n", value);
                    goto mount;
                }
            }
        }
    mount:
        if (readwrite && readonly) {
            fprintf(stderr, "Only one of '%s' and '%s' can be specified\n",
                    token[RO], token[RW]);
            return EXIT_FAILURE;
        }

        unsigned long mountflags = 0;

        if (bind) {
            mountflags |= MS_BIND;
        }

        if (rec) {
            mountflags |= MS_REC;
        }

        if (remount) {
            mountflags |= MS_REMOUNT;
        }

        if (readonly) {
            mountflags |= MS_RDONLY;
        }

        if (!suid) {
            mountflags |= MS_NOSUID;
        }

        if (!dev) {
            mountflags |= MS_NODEV;
        }

        if (!exec) {
            mountflags |= MS_NOEXEC;
        }


        if (-1 == mkdir(entry->mnt_dir, S_IRWXU)) {
            errnum = errno;
            if (errnum != EEXIST) {
                perror("mkdir");
                return EXIT_FAILURE;
            }
        }

        if (-1 == mount(0 == strcmp("none", entry->mnt_fsname) ? NULL : entry->mnt_fsname, entry->mnt_dir,
                        entry->mnt_type, mountflags,
                        value)) {
            perror("mount");
            return EXIT_FAILURE;
        }
    }

    if (endmntent(fstab) != 1) {
        perror("endmntent");
        return EXIT_FAILURE;
    }

    if (-1 == chdir("..")) {
        perror("chdir");
        return EXIT_FAILURE;
    }

    if (-1 == chdir(RUNTIME_NAME)) {
        perror("chdir");
        return EXIT_FAILURE;
    }

    if (-1 == chroot("./")) {
        perror("chroot");
        return EXIT_FAILURE;
    }

    if (-1 == chdir("/")) {
        perror("chdir");
        return EXIT_FAILURE;
    }

    /* Prevent future privilege gains */
    if (-1 == prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
        perror("prctl");
        return EXIT_FAILURE;
    }

    /* Drop all privileges I might possibly have. I'm not sure I need
     * to do this and I probably can do this in a better way. */
    static unsigned long const capability[] = {
        CAP_CHOWN,
        CAP_DAC_OVERRIDE,
        CAP_DAC_READ_SEARCH,
        CAP_FOWNER,
        CAP_FSETID,
        CAP_KILL,
        CAP_SETGID,
        CAP_SETUID,
        CAP_SETPCAP,
        CAP_LINUX_IMMUTABLE,
        CAP_NET_BIND_SERVICE,
        CAP_NET_BROADCAST,
        CAP_NET_ADMIN,
        CAP_NET_RAW,
        CAP_IPC_LOCK,
        CAP_IPC_OWNER,
        CAP_SYS_MODULE,
        CAP_SYS_RAWIO,
        CAP_SYS_CHROOT,
        CAP_SYS_PTRACE,
        CAP_SYS_PACCT,
        CAP_SYS_ADMIN,
        CAP_SYS_BOOT,
        CAP_SYS_NICE,
        CAP_SYS_RESOURCE,
        CAP_SYS_TIME,
        CAP_SYS_TTY_CONFIG,
        CAP_MKNOD,
        CAP_LEASE,
        CAP_AUDIT_WRITE,
        CAP_AUDIT_CONTROL,
        CAP_SETFCAP,
        CAP_MAC_OVERRIDE,
        CAP_MAC_ADMIN,
        CAP_SYSLOG,
        CAP_WAKE_ALARM
    };

    for (size_t ii = 0u; ii < ARRAY_SIZE(capability); ++ii) {
        if (-1 == prctl(PR_CAPBSET_DROP, capability[ii], 0, 0, 0)) {
            perror("prctl");
            return EXIT_FAILURE;
        }
    }

    /* Favor other processes over this process hierarchy. Only
     * superuser may lower priorities so this is not stoppable. This
     * also makes the process hierarchy nicer for the OOM killer.
     */
    if (-1 == setpriority(PRIO_PROCESS, 0, getpriority(PRIO_PROCESS, 0) + 1)) {
        perror("setpriority");
        return EXIT_FAILURE;
    }

    execve(SHELL, shell_arguments, shell_environment);
    perror("execve");
    return EXIT_FAILURE;
}
Comment 7 Steven Stewart-Gallus 2014-07-29 05:41:30 UTC
A hacky alternative is to use overlayfs. For example, "overlayfs	lib	overlayfs	ro,lowerdir=/lib,upperdir=/lib".
Comment 8 Steven Stewart-Gallus 2014-07-30 17:04:24 UTC
Unfortunately, overlayfs doesn't work as an alternative for readonly recursive bind mounts.
Comment 9 Richard Yao 2014-08-02 05:38:45 UTC
I sent a patch to the LKML to fix this:

https://lkml.org/lkml/2014/8/1/563

You might be able to workaround the problem by fetching the sources for your kernel, applying the patch, recompiling and installing the new kernel binary. You would lose whatever kernel signing RedHat might have by doing this though.
Comment 10 Richard Yao 2014-08-02 05:49:30 UTC
I just spotted the patch in the URL. It is functionally identical to what I sent to the list. It would have been the right fix, but it was never merged. I stumbled across this bug report in some casual Google searches after I had written that patch, so it had been written independently.
Comment 11 Steven Stewart-Gallus 2014-12-21 02:18:30 UTC
while the remount workaround does not work for recursive bind mounts it does seem to work for shared bind mounts which are actually better than recursive bind mounts in many ways.
Comment 12 Steven Stewart-Gallus 2014-12-21 02:25:34 UTC
NVM. I must have made a mistake and this doesn't seem to work for shared bind mounts either.
Comment 13 Steven Stewart-Gallus 2014-12-21 02:53:22 UTC
The remount workaround does not seem to work for a bind to a subdirectory of a mount and not a direct bind.