Bug 219237
| Summary: | [REGRESSION]: cephfs: file corruption when reading content via in-kernel ceph client | | |
|---|---|---|---|
| Product: | File System | Reporter: | Christian Ebner (c.ebner) |
| Component: | Other | Assignee: | fs_other |
| Status: | NEW --- | | |
| Severity: | high | CC: | c.ebner, dhowells, mail+kernel-bugzilla |
| Priority: | P3 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | 92b6cc5d |
| Attachments: | ftrace with ceph filter, ceph conf, netfs tracing dd, netfs tracing sha256sum, Patch for stable (6.11.1) | | |
Description
Christian Ebner
2024-09-04 14:47:58 UTC
Is local caching through cachefiles being used?

What are the mount parameters?

This turned up an error after 37 ops: `/root/xfstests-dev/ltp/fsx -d -N 10000 -S 40 -P /tmp /xfstest.test/112.0` and then the system got a NULL-pointer oops in __rmqueue_pcplist when I tried running it again.

(In reply to David Howells from comment #1)
> Is local caching through cachefiles being used?
>
> What are the mount parameters?

There is no caching layer as described in [0] active. Output of:
```
$ cat /proc/fs/fscache/{caches,cookies,requests,stats,volumes}
CACHE    REF   VOLS  OBJS  ACCES S NAME
======== ===== ===== ===== ===== = ===============
COOKIE   VOLUME   REF ACT ACC S FL DEF
======== ======== === === === = == ================
REQUEST  OR REF FL ERR  OPS COVERAGE
======== == === == ==== === =========
Netfs  : DR=0 RA=140 RF=0 WB=0 WBZ=0
Netfs  : BW=0 WT=0 DW=0 WP=0
Netfs  : ZR=0 sh=0 sk=0
Netfs  : DL=548 ds=548 df=0 di=0
Netfs  : RD=0 rs=0 rf=0
Netfs  : UL=0 us=0 uf=0
Netfs  : WR=0 ws=0 wf=0
Netfs  : rr=0 sr=0 wsc=0
-- FS-Cache statistics --
Cookies: n=0 v=0 vcol=0 voom=0
Acquire: n=0 ok=0 oom=0
LRU    : n=0 exp=0 rmv=0 drp=0 at=0
Invals : n=0
Updates: n=0 rsz=0 rsn=0
Relinqs: n=0 rtr=0 drop=0
NoSpace: nwr=0 ncr=0 cull=0
IO     : rd=0 wr=0 mis=0
VOLUME   REF   nCOOK ACC FL CACHE           KEY
======== ===== ===== === == =============== ================
```
Also, disabling caching by setting `client_cache_size` to 0 and `client_oc` to false, as found in [1], did not change the corrupted read behavior.

The following fstab entry has been used by a user affected by the corruption:
```
10.10.10.14,10.10.10.13,10.10.10.12:/ /mnt/cephfs/hdd-ceph-fs ceph name={username},secret={secret},fs=hdd-ceph-fs,_netdev 0 0
```

My local reproducer uses a systemd.mount unit to mount the cephfs, with the ceph.conf as attached:
```
# /run/systemd/system/mnt-pve-cephfs.mount
[Unit]
Description=/mnt/pve/cephfs
DefaultDependencies=no
Requires=system.slice
Wants=network-online.target
Before=umount.target remote-fs.target
After=systemd-journald.socket system.slice network.target -.mount remote-fs-pre.target network-online.target
Conflicts=umount.target

[Mount]
Where=/mnt/pve/cephfs
What=172.16.0.2,172.16.0.3,172.16.0.4:/
Type=ceph
Options=name=admin,secretfile=/etc/pve/priv/ceph/cephfs.secret,conf=/etc/pve/ceph.conf,fs=cephfs
```

[0] https://www.kernel.org/doc/html/latest/filesystems/caching/fscache.html
[1] https://docs.ceph.com/en/latest/cephfs/client-config-ref/#client-config-reference

Created attachment 306827 [details]
ceph conf
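
For reference, the systemd unit above corresponds to roughly the following plain mount(8) invocation; this is only a sketch, with the device string and options copied from the unit's What= and Options= lines and the mount point assumed to already exist:
```
# Rough mount(8) equivalent of the systemd mount unit above (sketch only).
# Monitors, credentials and ceph.conf path are the ones from the unit.
mount -t ceph 172.16.0.2,172.16.0.3,172.16.0.4:/ /mnt/pve/cephfs \
    -o name=admin,secretfile=/etc/pve/priv/ceph/cephfs.secret,conf=/etc/pve/ceph.conf,fs=cephfs
```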
Okay, the oops fsx was turning up was in a different set of patches, so it is irrelevant to this report.

Can you capture some netfs tracing? If you can turn on:
```
for i in read sreq rreq failure write write_iter folio; do
    echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_$i/enable
done
```
this will show what's going on inside netfslib. Note that if you look at it, a lot of the lines have a request ID and subrequest index in the form "R=<request_id>[<subreq_index>]".

Created attachment 306860 [details]
netfs tracing dd
Trace generated while running:
```
sysctl vm.drop_caches=3
dd if=/mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso of=/tmp/test.out bs=8M count=1
```
with the default 8M value for `rasize`. This leads to the first 4M containing the correct data, while the latter 4M come back as all zeros.
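
As a side note, here is a minimal sketch of how such a trace could be collected and the zero-filled region confirmed, assuming the netfs_* events were enabled as described above; the output path /tmp/netfs-dd.trace is just an example, and bash is assumed for the process substitution:
```
# Clear the ftrace ring buffer, run the reproducer, then save the trace.
echo > /sys/kernel/debug/tracing/trace
sysctl vm.drop_caches=3
dd if=/mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso of=/tmp/test.out bs=8M count=1
cat /sys/kernel/debug/tracing/trace > /tmp/netfs-dd.trace

# Check whether the second 4M of the 8M read really came back as all zeros.
dd if=/tmp/test.out bs=4M skip=1 count=1 status=none \
    | cmp -s - <(head -c 4194304 /dev/zero) && echo "second 4M is all zeros"
```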
Created attachment 306861 [details]
netfs tracing sha256sum
Trace generated while running:
```
sysctl vm.drop_caches=3
sha256sum /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso
```
with the default 8M value for `rasize`, leading to a corrupt checksum.
Note: this and the above were performed on Linux v6.11-rc6. If you also need different output, e.g. with different values for `rasize`, please let me know.
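
In case it is useful, here is a sketch of how a different `rasize` could be tested, reusing the mount parameters from the systemd unit earlier in this report; the 4M value below is an arbitrary example, not one that was actually run:
```
# Remount with a smaller readahead window (rasize is in bytes) and re-run
# the checksum reproducer; 4194304 (4M) is an example value only.
umount /mnt/pve/cephfs
mount -t ceph 172.16.0.2,172.16.0.3,172.16.0.4:/ /mnt/pve/cephfs \
    -o name=admin,secretfile=/etc/pve/priv/ceph/cephfs.secret,conf=/etc/pve/ceph.conf,fs=cephfs,rasize=4194304
sysctl vm.drop_caches=3
sha256sum /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso
```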
Further testing with the reproducer on the current mainline kernel shows that the issue might be fixed. Bisection of the possible fix points to ee4cdf7b ("netfs: Speed up buffered reading"). Could this additional information help narrow down the part that fixes the cephfs issue so that the fix can be backported to current stable?

Created attachment 306948 [details]
Patch for stable (6.11.1)
I am happy to report a possible fix for this issue:
The attached patch fixes the issue for me on current stable as well as on the current Proxmox VE kernel, which is based on the Ubuntu 6.8.12 kernel.
Is this patch complete, and if so, can it be included in stable?
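
For context, a rough sketch of how the fix-bisection mentioned above could be driven with git bisect's custom terms; the endpoint commits below are placeholders rather than the ones actually used:
```
# Bisecting for a fix rather than a regression: kernels that still corrupt
# reads are marked "broken", kernels that read correctly are marked "fixed".
# Endpoints are placeholders; the bisection described above converged on
# ee4cdf7b ("netfs: Speed up buffered reading").
git bisect start --term-old=broken --term-new=fixed
git bisect broken <last-kernel-that-corrupts>    # placeholder
git bisect fixed <first-kernel-that-reads-ok>    # placeholder
# At each step: build and boot the kernel, run the dd/sha256sum reproducer,
# then mark the result with "git bisect broken" or "git bisect fixed".
```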