Created attachment 306818 [details]
ftrace with ceph filter

Hi,

some of our customers (Proxmox VE) are seeing file corruption when accessing content located on CephFS via the in-kernel Ceph client [0,1]. We managed to reproduce this regression on kernels up to the latest 6.11-rc6. Accessing the same content on the CephFS via the FUSE client, or via the in-kernel client on older kernels (Ubuntu kernel based on v6.5), does not show file corruption.

Unfortunately the corruption is hard to reproduce; seemingly only a small subset of files is affected. However, once a file is affected, the issue is persistent and can easily be reproduced.

Bisection with the reproducer points to this commit:

"92b6cc5d: netfs: Add iov_iters to (sub)requests to describe various buffers"

Description of the issue:

A file was copied from the local filesystem to CephFS via:
```
cp /tmp/proxmox-backup-server_3.2-1.iso /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso
```

* sha256sum on the local filesystem:
  `1d19698e8f7e769cf0a0dcc7ba0018ef5416c5ec495d5e61313f9c84a4237607 /tmp/proxmox-backup-server_3.2-1.iso`
* sha256sum on CephFS with a kernel up to the above commit:
  `1d19698e8f7e769cf0a0dcc7ba0018ef5416c5ec495d5e61313f9c84a4237607 /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso`
* sha256sum on CephFS with a kernel after the above commit:
  `89ad3620bf7b1e0913b534516cfbe48580efbaec944b79951e2c14e5e551f736 /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso`
* Removing and/or re-copying the file does not change anything; the corrupt checksum remains the same.
* Accessing the same file from different clients gives consistent results: clients with the above patch applied show the incorrect checksum, clients without it show the correct one.
* The issue persists across reboots of the Ceph cluster and/or the clients.
* The file is indeed corrupt after reading, as verified with `cmp -b`. Interestingly, the first 4M contain the correct data, while the following 4M are read as all zeros, which differs from the original data.
* The issue is related to the readahead size: mounting the CephFS with `rasize=0` makes the issue disappear, and the same is true for sizes up to 128k. (Please note that the ranges initially reported on the mailing list [3] are not correct: for rasize in [0..128k] the file is not corrupted.)

Attached please find an ftrace capture with "*ceph*" as filter, taken on the latest kernel 6.11-rc6 while performing:
```
dd if=/mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso of=/tmp/test.out bs=8M count=1
```
The relevant part is shown by task `dd-26192`.

Please let me know if I can provide further information or debug output in order to narrow down the issue.

[0] https://forum.proxmox.com/threads/78340/post-676129
[1] https://forum.proxmox.com/threads/149249/
[2] https://forum.proxmox.com/threads/151291/
[3] https://lore.kernel.org/lkml/db686d0c-2f27-47c8-8c14-26969433b13b@proxmox.com/
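For convenience, a condensed version of the reproducer described above (a sketch only; the paths and the affected ISO are the ones from this report):
```
# Copy the file onto CephFS and note the checksum of the source
cp /tmp/proxmox-backup-server_3.2-1.iso /mnt/pve/cephfs/
sha256sum /tmp/proxmox-backup-server_3.2-1.iso

# Drop caches so the read goes through the kernel client again
sysctl vm.drop_caches=3
sha256sum /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso   # differs on affected kernels

# Locate the corruption; here the mismatch starts at the 4M boundary
cmp -b /tmp/proxmox-backup-server_3.2-1.iso /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso
```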
Is local caching through cachefiles being used? What are the mount parameters?
This turned up an error after 37 ops:
```
/root/xfstests-dev/ltp/fsx -d -N 10000 -S 40 -P /tmp /xfstest.test/112.0
```
and then the system got a NULL-pointer oops in __rmqueue_pcplist when I tried running it again.
(In reply to David Howells from comment #1)
> Is local caching through cachefiles being used?
>
> What are the mount parameters?

No caching layer as described in [0] is active. Output of:
```
$ cat /proc/fs/fscache/{caches,cookies,requests,stats,volumes}
CACHE    REF   VOLS  OBJS  ACCES S NAME
======== ===== ===== ===== ===== = ===============
COOKIE   VOLUME   REF ACT ACC S FL DEF
======== ======== === === === = == ================
REQUEST  OR REF FL ERR  OPS COVERAGE
======== == === == ==== === =========
Netfs  : DR=0 RA=140 RF=0 WB=0 WBZ=0
Netfs  : BW=0 WT=0 DW=0 WP=0
Netfs  : ZR=0 sh=0 sk=0
Netfs  : DL=548 ds=548 df=0 di=0
Netfs  : RD=0 rs=0 rf=0
Netfs  : UL=0 us=0 uf=0
Netfs  : WR=0 ws=0 wf=0
Netfs  : rr=0 sr=0 wsc=0
-- FS-Cache statistics --
Cookies: n=0 v=0 vcol=0 voom=0
Acquire: n=0 ok=0 oom=0
LRU    : n=0 exp=0 rmv=0 drp=0 at=0
Invals : n=0
Updates: n=0 rsz=0 rsn=0
Relinqs: n=0 rtr=0 drop=0
NoSpace: nwr=0 ncr=0 cull=0
IO     : rd=0 wr=0 mis=0
VOLUME   REF   nCOOK ACC FL CACHE           KEY
======== ===== ===== === == =============== ================
```

Also, disabling caching by setting `client_cache_size` to 0 and `client_oc` to false, as found in [1], did not change the corrupted read behavior.

The following fstab entry has been used by a user affected by the corruption:
```
10.10.10.14,10.10.10.13,10.10.10.12:/ /mnt/cephfs/hdd-ceph-fs ceph name={username},secret={secret},fs=hdd-ceph-fs,_netdev 0 0
```

My local reproducer uses a systemd.mount unit to mount the CephFS, with the ceph.conf as attached:
```
# /run/systemd/system/mnt-pve-cephfs.mount
[Unit]
Description=/mnt/pve/cephfs
DefaultDependencies=no
Requires=system.slice
Wants=network-online.target
Before=umount.target remote-fs.target
After=systemd-journald.socket system.slice network.target -.mount remote-fs-pre.target network-online.target
Conflicts=umount.target

[Mount]
Where=/mnt/pve/cephfs
What=172.16.0.2,172.16.0.3,172.16.0.4:/
Type=ceph
Options=name=admin,secretfile=/etc/pve/priv/ceph/cephfs.secret,conf=/etc/pve/ceph.conf,fs=cephfs
```

[0] https://www.kernel.org/doc/html/latest/filesystems/caching/fscache.html
[1] https://docs.ceph.com/en/latest/cephfs/client-config-ref/#client-config-reference
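In case it is useful, the mount parameters the kernel actually applied on my test client can be double-checked like this (just a sanity check; as far as I can tell, `rasize` only appears in the option string when it was set explicitly):
```
# Show the ceph mount together with the effective mount options
grep ' ceph ' /proc/mounts
```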
Created attachment 306827 [details]
ceph conf
Okay, the oops that fsx turned up was in a different set of patches, so it is irrelevant to this report.

Can you capture some netfs tracing? If you can turn on:
```
for i in read sreq rreq failure write write_iter folio; do
    echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_$i/enable
done
```
this will show what's going on inside netfslib. Note that if you look at it, a lot of the lines have a request ID and subrequest index in the form of "R=<request_id>[<subreq_index>]".
Created attachment 306860 [details]
netfs tracing dd

Trace generated while running:
```
sysctl vm.drop_caches=3
dd if=/mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso of=/tmp/test.out bs=8M count=1
```
with the default 8M value for `rasize`. This leads to the first 4M containing the correct data, while the latter 4M are read back as all zeros.
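For completeness, the full capture sequence used for the attached trace looked roughly like this (a sketch; the trace events are the ones listed in the previous comment, the output file name is arbitrary):
```
# Enable the requested netfs trace events
for i in read sreq rreq failure write write_iter folio; do
    echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_$i/enable
done

# Stream the trace into a file while the reproducer runs
cat /sys/kernel/debug/tracing/trace_pipe > /tmp/netfs-trace.txt &
TRACE_PID=$!
sysctl vm.drop_caches=3
dd if=/mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso of=/tmp/test.out bs=8M count=1
kill "$TRACE_PID"
```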
Created attachment 306861 [details]
netfs tracing sha256sum

Trace generated while running:
```
sysctl vm.drop_caches=3
sha256sum /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso
```
with the default 8M value for `rasize`, leading to the corrupt checksum.

Note: this and the above trace were taken on Linux v6.11-rc6. If you need additional output, e.g. with different values for `rasize`, please let me know.
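Should traces for other readahead sizes be useful, such a run would look roughly like this (a sketch; the mount parameters are the ones from my systemd unit earlier in this report, only `rasize` is varied):
```
# Remount with an explicit readahead size (here 128k, the largest value
# for which the file still read back correctly in my tests)
umount /mnt/pve/cephfs
mount -t ceph 172.16.0.2,172.16.0.3,172.16.0.4:/ /mnt/pve/cephfs \
      -o name=admin,secretfile=/etc/pve/priv/ceph/cephfs.secret,conf=/etc/pve/ceph.conf,fs=cephfs,rasize=131072

# Re-run the read with cold caches
sysctl vm.drop_caches=3
sha256sum /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso
```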
Further testing with the reproducer on the current mainline kernel indicates that the issue might be fixed there. Bisecting for the fix points to ee4cdf7b ("netfs: Speed up buffered reading").

Could this additional information help to narrow down the part that fixes the CephFS issue, so that the fix can be backported to current stable?
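For reference, the bisection for the fix was run roughly along these lines (a sketch only; `git bisect` with custom terms, marking v6.11 as still broken and the then-current mainline head, here written as `origin/master`, as fixed):
```
# Bisect for the commit that *fixes* the corruption rather than the one
# that introduced it, using custom bisect terms
git bisect start --term-old=broken --term-new=fixed
git bisect broken v6.11
git bisect fixed origin/master

# At each step: build and boot the kernel, re-run the reproducer,
# then mark the result accordingly
sysctl vm.drop_caches=3
sha256sum /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso
git bisect fixed    # or: git bisect broken
```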
Created attachment 306948 [details]
Patch for stable (6.11.1)

I am happy to report a possible fix for this issue: the attached patch fixes the corruption for me on current stable as well as on the current Proxmox VE kernel, which is based on the Ubuntu kernel 6.8.12.

Is this patch complete, and if so, can it be included in stable?