Bug 208883
Summary: | CIFS: kernel BUG at fs/cachefiles/rdwr.c:715! | ||
---|---|---|---|
Product: | File System | Reporter: | Xiaoli Feng (fengxiaoli0714) |
Component: | Other | Assignee: | fs_other |
Status: | NEW --- | ||
Severity: | normal | CC: | cebtenzzre, daire, dhowells, dwysocha, flo, ignaciofelipe, maxim.doucet, smfrench, tiwai |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 5.8.0-rc7+ | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | A test fix patch |
Description
Xiaoli Feng
2020-08-12 04:59:30 UTC
This is also easily reproduced with NFS (the simple unit test below hits the oops on at least 5.8-rc3), so it's a general fscache issue with the current implementation. The fscache-iter rewrite (see https://lkml.org/lkml/2020/8/3/960) replaces all this code; once that is merged, this problem will no longer be possible.

Setting NFS vers=3:
1. On NFS client, install and enable cachefilesd
2. On NFS client, mount -o vers=3,fsc 127.0.0.1:/export/dir1 /mnt/dir1
3. On NFS client, dd if=/dev/zero of=/mnt/dir1/file1.bin bs=4096 count=1
4. On NFS client, echo 3 > /proc/sys/vm/drop_caches
5. On NFS client, dd if=/mnt/dir1/file1.bin of=/dev/null (read into fscache)
6. On NFS client, umount /mnt/dir1
7. On NFS client, mount -o vers=3,fsc 127.0.0.1:/export/dir1 /mnt/dir1
8. On NFS client, repeat steps 4-5 (read from fscache)

FWIW, I'm pretty sure this started happening after this patch was merged: https://www.spinics.net/lists/linux-fsdevel/msg150048.html So if you want to test upstream, test something from before that was merged.

Reassigning to "other" for fs (since it is fscache related and there doesn't seem to be an fscache component in bugzilla).

I guess this isn't about the change in comment 2; that should already have been addressed by commit c5f9d9db83d9 ("cachefiles: Fix corruption of the return value in cachefiles_read_or_alloc_pages()"). The bug we see now is more likely a regression in 5.8: the readpages aops have been converted to readahead, so the ASSERT() NULL check on readpages in cachefiles fires falsely. Just dropping those ASSERT() lines should suffice.

The downstream bug report confirmed the fix: https://bugzilla.opensuse.org/show_bug.cgi?id=1175245

Created attachment 292169
A test fix patch
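The NFS reproduction steps above can be scripted. The following is a minimal Python sketch that is not part of the original report: it assumes cachefilesd is already installed and running (step 1), that it is run as root on the NFS client, and that the export and mount point from the steps already exist. The helper names (`repro_steps`, `mount_cmd`, and so on) are hypothetical. By default it only prints the commands; on an affected kernel the final read is expected to hit the BUG, so only pass `--run` on a disposable test machine.

```python
#!/usr/bin/env python3
"""Hypothetical dry-run reproducer for the NFSv3 + fscache steps in this report."""
import shlex
import subprocess
import sys
from typing import List

# Values taken from the reproduction steps; adjust for your own setup.
EXPORT = "127.0.0.1:/export/dir1"
MOUNT_POINT = "/mnt/dir1"
TEST_FILE = f"{MOUNT_POINT}/file1.bin"


def mount_cmd() -> List[str]:
    # Steps 2 and 7: NFSv3 mount with fscache enabled.
    return ["mount", "-o", "vers=3,fsc", EXPORT, MOUNT_POINT]


def drop_caches_cmd() -> List[str]:
    # Step 4: the redirect into /proc needs a shell.
    return ["sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"]


def read_cmd() -> List[str]:
    # Step 5: read the file back (into fscache first, from fscache later).
    return ["dd", f"if={TEST_FILE}", "of=/dev/null"]


def repro_steps() -> List[List[str]]:
    """Return steps 2-8 as argv lists; step 1 (cachefilesd) is assumed done."""
    return [
        mount_cmd(),                                                      # 2
        ["dd", "if=/dev/zero", f"of={TEST_FILE}", "bs=4096", "count=1"],  # 3
        drop_caches_cmd(),                                                # 4
        read_cmd(),                                                       # 5
        ["umount", MOUNT_POINT],                                          # 6
        mount_cmd(),                                                      # 7
        drop_caches_cmd(),                                                # 8: repeat 4
        read_cmd(),                                                       # 8: repeat 5
    ]


def main(argv: List[str]) -> None:
    execute = "--run" in argv
    for cmd in repro_steps():
        print(shlex.join(cmd))
        if execute:
            # check=True stops at the first command that exits non-zero.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved as, say, `repro.py`, running it without arguments prints the eight commands for review; `python3 repro.py --run` executes them in order.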
I've been reproducing this bug with every released kernel from 5.8.0-rc7 to 5.8.11, and with 5.9-rc4. I've compiled my own kernel from the 5.8.11 sources with the patch from Takashi Iwai applied, and that kernel works like a charm, without problems. My configuration is an NFS share mounted using FS-Cache and cachefilesd (with a dedicated SSD as the cache device). Is it planned to add this patch to the 5.8 or 5.9 kernel?

We have had a similar experience: from v5.8 onwards, cachefiles/fscache with NFS is practically unusable without this patch. It takes seconds for us to hit the assert with production workloads.

I can confirm that 5.10.10 hits the assertion at fs/cachefiles/rdwr.c:716, and 5.10.13 does not. Kernels 5.10.11 and later include commit 76e2b0b ("cachefiles: Drop superfluous readpages aops NULL check"), so I think this bug can be closed?

To clarify, I'm talking about Arch Linux commit hashes and version numbers; I don't know what has to happen upstream before this is closed. The 5.11 release?

For the record, db58465f1121 ("cachefiles: Drop superfluous readpages aops NULL check") [1] is the upstream commit, which is on the road to 5.11.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=db58465f1121086b524be80be39d1fedbe5387f3
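The reasoning behind db58465f1121 can be illustrated with a small userspace model. This Python sketch is not kernel code: `AddressSpaceOps` is a hypothetical subset of the kernel's `struct address_space_operations`, `KernelBug` stands in for the BUG() raised by a failed cachefiles ASSERT(), and the two backing-filesystem tables are made-up stand-ins for a filesystem before and after the 5.8 readpages-to-readahead conversion. It shows why an assertion that `readpages` is non-NULL fires once a backing filesystem provides only `readahead`, and that dropping the check removes the false positive.

```python
"""Userspace model (not kernel code) of the superfluous readpages NULL check."""
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AddressSpaceOps:
    """Hypothetical subset of struct address_space_operations (model only)."""
    readpage: Optional[Callable] = None
    readpages: Optional[Callable] = None
    readahead: Optional[Callable] = None


class KernelBug(Exception):
    """Stands in for the BUG() that a failed cachefiles ASSERT() triggers."""


def _stub(*args) -> int:
    return 0


# Made-up backing filesystems before and after the 5.8 readahead conversion.
BACKING_PRE_5_8 = AddressSpaceOps(readpage=_stub, readpages=_stub)
BACKING_5_8 = AddressSpaceOps(readpage=_stub, readahead=_stub)


def cachefiles_check(aops: AddressSpaceOps, fixed: bool) -> str:
    """Model of the readpages sanity check on the cachefiles backing file."""
    if not fixed and aops.readpages is None:
        # Pre-fix: the ASSERT() on ->readpages fires even though the backing
        # filesystem has simply moved to ->readahead.
        raise KernelBug("kernel BUG at fs/cachefiles/rdwr.c")
    # Fixed (db58465f1121): the readpages NULL check is gone.
    return "ok"


if __name__ == "__main__":
    for name, aops in [("pre-5.8", BACKING_PRE_5_8), ("5.8+", BACKING_5_8)]:
        for fixed in (False, True):
            try:
                result = cachefiles_check(aops, fixed)
            except KernelBug as exc:
                result = f"BUG: {exc}"
            print(f"{name:8} fixed={fixed!s:5} -> {result}")
```

The model deliberately covers only the readpages check discussed in this report; it says nothing about the rest of the cachefiles read path.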