Bug 208883

Summary: CIFS: kernel BUG at fs/cachefiles/rdwr.c:715!
Product: File System Reporter: Xiaoli Feng (fengxiaoli0714)
Component: OtherAssignee: fs_other
Status: NEW ---    
Severity: normal CC: cebtenzzre, daire, dhowells, dwysocha, flo, ignaciofelipe, maxim.doucet, smfrench, tiwai
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.8.0-rc7+ Subsystem:
Regression: No Bisected commit-id:
Attachments: A test fix patch

Description Xiaoli Feng 2020-08-12 04:59:30 UTC
Test fscache for cifs. Ther server and client are both linux upstream 5.8.0-rc7+. But the server doesn't set CONFIG_CIFS_FSCACHE. And the client set "CONFIG_CIFS_FSCACHE=y". Mount the share with fsc in client. Create a 1G file in the montpoint. Then execute several same commands to cp this file to another file. Kernel happen panic.

[  793.247925] CacheFiles: Loaded
[  793.264815] FS-Cache: Cache "mycache" added (type cachefiles)
[  793.265787] CacheFiles: File cache on dm-0 registered
[  811.695641] FS-Cache: Duplicate cookie detected
[  811.696463] FS-Cache: O-cookie c=000000004ea7828d [p=00000000af9640d5 fl=218 nc=0 na=0]
[  811.697283] FS-Cache: O-cookie d=0000000000000000 n=0000000000000000
[  811.697993] FS-Cache: O-key=[8] '684860262fd3cc97'
[  811.698461] FS-Cache: N-cookie c=00000000293a8201 [p=00000000af9640d5 fl=2 nc=0 na=1]
[  811.699182] FS-Cache: N-cookie d=000000002b3a168b n=0000000001e83ef6
[  811.699789] FS-Cache: N-key=[8] '684860262fd3cc97'
[  817.300732] CacheFiles: 
[  817.318429] CacheFiles: Assertion failed
[  817.319111] ------------[ cut here ]------------
[  817.320349] kernel BUG at fs/cachefiles/rdwr.c:715!
[  817.321026] invalid opcode: 0000 [#1] SMP PTI
[  817.321342] CPU: 0 PID: 5478 Comm: cp Kdump: loaded Not tainted 5.8.0-rc7test+ #3
[  817.321342] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[  817.321342] RIP: 0010:cachefiles_read_or_alloc_pages.cold.20+0x18/0x4e [cachefiles]
[  817.321342] Code: 4d c1 3d ca 48 c7 c7 c0 57 b6 c0 e8 41 c1 3d ca 0f 0b 48 c7 c7 50 4b b6 c0 e8 33 c1 3d ca 48 c7 c7 c0 57 b6 c0 e8 27 c1 3d ca <0f> 0b 48 c7 c7 50 4b b6 c0 e8 19 c1 3d ca 48 c7 c7 c0 57 b6 c0 e8
[  817.321342] RSP: 0018:ffffa476808bf9f0 EFLAGS: 00010246
[  817.321342] RAX: 000000000000001c RBX: 0000000000000000 RCX: 0000000000000027
[  817.321342] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff8b299bc17fc8
[  817.321342] RBP: ffff8b2990af84e8 R08: 000000000000023d R09: 000000000000002b
[  817.321342] R10: 6f69747265737341 R11: 64656c696166206e R12: ffff8b293058d3c0
[  817.321342] R13: ffff8b299643b5b0 R14: ffffa476808bfcb8 R15: ffff8b2929f0f00c
[  817.321342] FS:  00007f7a81742800(0000) GS:ffff8b299bc00000(0000) knlGS:0000000000000000
[  817.321342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  817.321342] CR2: 00007f993caa1d20 CR3: 00000000b9026000 CR4: 00000000000006f0
[  817.321342] Call Trace:
[  817.321342]  ? ___slab_alloc+0x3a1/0x560
[  817.321342]  ? xas_alloc+0x97/0xc0
[  817.321342]  ? fscache_run_op.isra.13+0x55/0xb0 [fscache]
[  817.321342]  __fscache_read_or_alloc_pages+0x212/0x2c0 [fscache]
[  817.321342]  __cifs_readpages_from_fscache+0x5a/0x1e0 [cifs]
[  817.321342]  cifs_readpages+0x9d/0x960 [cifs]
[  817.321342]  read_pages+0x205/0x270
[  817.321342]  page_cache_readahead_unbounded+0x1b9/0x240
[  817.321342]  generic_file_buffered_read+0x1da/0xb70
[  817.321342]  cifs_strict_readv+0xdf/0x100 [cifs]
[  817.321342]  new_sync_read+0x114/0x1a0
[  817.321342]  vfs_read+0x151/0x180
[  817.321342]  ksys_read+0x59/0xd0
[  817.321342]  do_syscall_64+0x4c/0x90
[  817.321342]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  817.321342] RIP: 0033:0x7f7a80c1e805
[  817.321342] Code: fe ff ff 50 48 8d 3d 7a cb 09 00 e8 65 01 02 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 55 4d 2d 00 8b 00 85 c0 75 0f 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 53 c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89
[  817.321342] RSP: 002b:00007ffe27e6b308 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  817.321342] RAX: ffffffffffffffda RBX: 0000000000100000 RCX: 00007f7a80c1e805
[  817.321342] RDX: 0000000000100000 RSI: 00007f7a815ea000 RDI: 0000000000000003
[  817.321342] RBP: 00007ffe27e6b700 R08: 0000000000100000 R09: 0000000000000000
[  817.321342] R10: 00007ffe27e6b4b8 R11: 0000000000000246 R12: 0000000000000000
[  817.321342] R13: 00007f7a816ea000 R14: 0000000027e6b700 R15: 0000000000000000
[  817.321342] Modules linked in: cachefiles md4 sha512_ssse3 sha512_generic cmac nls_utf8 cifs libarc4 dns_resolver fscache libdes rfkill sunrpc snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg cirrus snd_hda_codec drm_kms_helper snd_hda_core syscopyarea snd_hwdep sysfillrect snd_seq sysimgblt fb_sys_fops snd_seq_device cec snd_pcm drm snd_timer snd soundcore pcspkr joydev virtio_balloon i2c_piix4 ip_tables xfs libcrc32c ata_generic virtio_net ata_piix net_failover virtio_blk failover libata serio_raw dm_mirror dm_region_hash dm_log dm_mod



Reproduce:
server:
# cat  /etc/samba/smb.conf
[cifs]
path=/mnt/cifs
writeable=yes

client:
# systemctl start cachefilesd
# mount //$SERVER/cifs  ~/cifs -o user=root,password=$MYPASSWORD,fsc
# dd if=/dev/urandom of file.txt bs=1M count=1024
# cp -f file.txt ~/cifs/file.txt
# cp -f ~/cifs/file.txt ~/cifs/file1.txt
Repeat this command several times. Then panic.
Comment 1 Dave Wysochanski 2020-08-12 16:20:11 UTC
This is also easily reproduced with NFS too (simple unit test below hits the oops on 5.8-rc3 at least), so it's a general fscache issue with current implementation.

The fscache-iter rewrite (see https://lkml.org/lkml/2020/8/3/960) replaces all this code.  Once that is merged, this problem will no longer be possible.


Setting NFS vers=3
1. On NFS client, install and enable cachefilesd
2. On NFS client, mount -o vers=3,fsc 127.0.0.1:/export/dir1 /mnt/dir1
3. On NFS client, dd if=/dev/zero of=/mnt/dir1/file1.bin bs=4096 count=1
4. On NFS client, echo 3 > /proc/sys/vm/drop_caches
5. On NFS client, dd if=/mnt/dir1/file1.bin of=/dev/null (read into fscache)
6. On NFS client, umount /mnt/dir1
7. On NFS client, mount -o vers=3,fsc 127.0.0.1:/export/dir1 /mnt/dir1
8. On NFS client, repeat steps 4-5 (read from fscache)
Comment 2 Dave Wysochanski 2020-08-12 16:30:34 UTC
FWIW, I'm pretty sure this happens after this patch was merged:
https://www.spinics.net/lists/linux-fsdevel/msg150048.html

So if you want to test upstream test something before that was merged.
Comment 3 Steve French 2020-08-13 02:22:16 UTC
reassign to "other" for fs (since it is fscache related and there doesn't seem to be an fscache component in bugzilla)
Comment 4 Takashi Iwai 2020-08-26 09:33:21 UTC
I guess this isn't about the change in comment 2; it should have been already addressed by commit c5f9d9db83d9 ("cachefiles: Fix corruption of the return value in cachefiles_read_or_alloc_pages()").

The bug we we see now is more likely a regression in 5.8; namely, the readpages aops have been converted to readahead, hence the NULL check with ASSERT() in cachefiles hits falsely.  Just dropping those ASSERT() lines should suffice.

The downstream bug report confirmed the fix:
  https://bugzilla.opensuse.org/show_bug.cgi?id=1175245
Comment 5 Takashi Iwai 2020-08-26 09:34:59 UTC
Created attachment 292169 [details]
A test fix patch
Comment 6 nacho 2020-09-25 22:32:03 UTC
I've been reproducing this bug with every released kernel from  5.8.0-rc7 to 5.8.11 and 5.9-rc4

I've compiled my own kernel using code from 5.8.11 and applying patch from Takashi iwai.

Kernel including proposed patch works like a charm. without problems.

My configuration is an NFS share mounted using FS-CACHE and cachefilesd (using dedicated SSD as cache device)

Is planned to add this patch to 5.8 or 5.9 kernel code?
Comment 7 Daire Byrne 2020-10-01 07:50:43 UTC
We have had a similar experience, cachefiles/fscache with NFS is practically unusable without this patch from v5.8 onwards. It takes seconds for us to hit the assert with production workloads.
Comment 8 Cebtenzzre 2021-02-06 01:44:34 UTC
I can confirm that 5.10.10 hits the assertion at fs/cachefiles/rdwr.c:716, and 5.10.13 does not. Kernels 5.10.11 and later include commit 76e2b0b "cachefiles: Drop superfluous readpages aops NULL check", so I think this bug can be closed?
Comment 9 Cebtenzzre 2021-02-06 02:00:16 UTC
To clarify, I'm talking about Arch Linux commit hashes and version numbers - I don't know what has to happen upstream before this is closed. 5.11 release?
Comment 10 Florian Schmaus 2021-02-09 10:12:47 UTC
For the record, db58465f1121 ("cachefiles: Drop superfluous readpages aops NULL check")[1] is the upstream commit, which is on the road to 5.11.

1: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=db58465f1121086b524be80be39d1fedbe5387f3