Bug 215673
Summary:            fsx dio testing get BAD DATA on NFS
Product:            File System
Component:          NFS
Status:             NEW ---
Severity:           normal
Priority:           P1
Hardware:           All
OS:                 Linux
Kernel Version:     Linux v5.17-rc6+
Subsystem:
Regression:         No
Bisected commit-id:
Reporter:           Zorro Lang (zlang)
Assignee:           Trond Myklebust (trondmy)
CC:                 anna, chuck.lever, regressions
Attachments:        fsx bad file dump, fsx good file dump, fsx output log,
                    fsx operations running log, .config file

Description
Zorro Lang 2022-03-11 05:15:02 UTC

Created attachment 300551: fsx bad file dump
Created attachment 300552: fsx good file dump
Created attachment 300553: fsx output log
Created attachment 300554: fsx operations running log
Created attachment 300555: .config file

The underlying filesystem is xfs:

meta-data=/dev/vda2              isize=512    agcount=4, agsize=327680 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1    inobtcount=1
data     =                       bsize=4096   blocks=1310720, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

NFS is mounted without any specified mount options, all by default.
nfs-utils 2.5.4

This sounds a lot like a regression, but it's not totally clear from the description. Am I right in assuming that 5.16 worked fine? Or did even some of the 5.17-rc kernels work fine?

Anna Schumaker (comment #8):

I'm assuming this is against a Linux server? Do you know if the client is using NFS v4.2 and if it has CONFIG_NFS_v4_2_READ_PLUS enabled?

The fsx output looks a lot like the output we sometimes get when using v4.2 READ_PLUS, which is why we have it locked behind the config option with the message "This is intended for developers only" for now (I'm working on patches to address this, but they're not ready yet).

Zorro Lang (comment #9):

(In reply to Anna Schumaker from comment #8)
> I'm assuming this is against a Linux server? Do you know if the client is
> using NFS v4.2 and if it has CONFIG_NFS_v4_2_READ_PLUS enabled?
>
> The fsx output looks a lot like the output we sometimes get when using v4.2
> READ_PLUS, which is why we have it locked behind the config option with the
> message "This is intended for developers only" for now (I'm working on
> patches to address this, but they're not ready yet)

Hi Anna,

The .config file is in the attachments; you can check more details in it. I just found "CONFIG_NFS_V4_2_READ_PLUS=y" in it, so it might be that issue. I'll give it a test by disabling this configuration.

Thanks,
Zorro

(In reply to Zorro Lang from comment #9)
> Hi Anna,
>
> The .config file is in the attachments; you can check more details in it. I
> just found "CONFIG_NFS_V4_2_READ_PLUS=y" in it, so it might be that issue.
> I'll give it a test by disabling this configuration.
>
> Thanks,
> Zorro

Hi Anna,

I think you're right: after I disabled CONFIG_NFS_V4_2_READ_PLUS, I can't reproduce this bug anymore. So after you fix it, I'd like to verify that the fix works.

Thanks,
Zorro
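
As a minimal sketch of the check Zorro describes (assuming a standard kernel source tree containing the attached .config; menuconfig works just as well):

    # Confirm the option is currently enabled in the client's .config
    grep CONFIG_NFS_V4_2_READ_PLUS .config

    # Disable the option, refresh dependent symbols, and rebuild
    scripts/config --file .config --disable NFS_V4_2_READ_PLUS
    make olddefconfig
    make -j"$(nproc)"

With the rebuilt kernel the client should stop issuing READ_PLUS and fall back to plain READ, which matches the retest result above.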

I ran the fsx command above overnight.

....
skipping zero length punch hole
skipping zero length punch hole
skipping zero size read
^Csignal 2
testcalls = 124721256

[cel@morisot ~]$ mountstats --rpc
Stats for klimt:/export/fast mounted on /mnt:

RPC statistics:
  267015341 RPC requests sent, 267015341 RPC replies received (0 XIDs not found)
  average backlog queue length: 0

GETATTR: 161259261 ops (60%)
    avg bytes sent per op: 199      avg bytes received per op: 240
    backlog wait: 0.003842    RTT: 0.152617    total execute time: 0.168519 (milliseconds)
READ_PLUS: 46936965 ops (17%)
    avg bytes sent per op: 216      avg bytes received per op: 4465
    backlog wait: 0.012698    RTT: 0.200121    total execute time: 0.237943 (milliseconds)
WRITE: 27575688 ops (10%)
    avg bytes sent per op: 33432    avg bytes received per op: 141
    backlog wait: 0.011856    RTT: 0.720103    total execute time: 0.739979 (milliseconds)
SETATTR: 12061698 ops (4%)
    avg bytes sent per op: 247      avg bytes received per op: 264
    backlog wait: 0.004311    RTT: 0.223075    total execute time: 0.239609 (milliseconds)
ALLOCATE: 9592654 ops (3%)
    avg bytes sent per op: 236      avg bytes received per op: 168
    backlog wait: 0.003911    RTT: 0.205160    total execute time: 0.220310 (milliseconds)
DEALLOCATE: 9588990 ops (3%)
    avg bytes sent per op: 236      avg bytes received per op: 168
    backlog wait: 0.003670    RTT: 0.200074    total execute time: 0.214885 (milliseconds)
....

So... the mount is NFSv4.2, the backing filesystem is XFS, and the client is clearly generating READ_PLUS and hole punches. I haven't been able to reproduce a problem so far with a Linux 5.19.0-rc7-00042-g94d75868c355 NFS server (no changes to the current READ_PLUS code -- this is just what is ready for 5.20).
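
For anyone checking the same two conditions on their own client (an NFSv4.2 mount and READ_PLUS traffic), a minimal sketch using the nfs-utils tools already shown above; the /mnt mount point is an assumption matching the mountstats output:

    # Show the negotiated protocol version (look for vers=4.2) and mount options
    nfsstat -m

    # Per-operation RPC counters for one mount; READ_PLUS is an NFSv4.2-only operation
    mountstats --rpc /mnt

mountstats parses /proc/self/mountstats, so the raw per-operation counters are available there even without nfs-utils installed.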