Bug 202883

Summary: sometime dead lock in getdents64
Product: File System
Reporter: Jiqun Li (jiqun.li)
Component: f2fs
Assignee: Default virtual assignee for f2fs (filesystem_f2fs)
Status: RESOLVED CODE_FIX
Severity: normal
CC: chao
Priority: P1
Hardware: All
OS: Linux
Kernel Version: f2fs-dev
Subsystem:
Regression: No
Bisected commit-id:

Description Jiqun Li 2019-03-12 06:41:08 UTC
A deadlock occasionally occurs when one process makes the SYS_getdents64 system call while another process calls fsync().

Reproduced while running monkey on Android 9.0.

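1. task 9785 holds sbi->cp_rwsem and is waiting in lock_page()
2. task 10349 holds mm_sem (mm->mmap_sem) and is waiting for sbi->cp_rwsem
3. task 9709 holds the page lock (lock_page()) and is waiting for mm_sem

Each task is waiting for a lock held by another task in the list, so this is a deadlock.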
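
The three statements above can be checked mechanically as a wait-for graph. The following is only a minimal sketch, not kernel code: the `wait_graph` type, its helper functions, and the task indices are hypothetical names made up for illustration, and the edges simply take the three observations in this report at face value.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: nodes are tasks; an edge waiter -> holder means
 * "waiter is blocked on a lock that holder currently owns". */
enum { MAX_TASKS = 8 };

struct wait_graph {
    int ntasks;
    bool waits_for[MAX_TASKS][MAX_TASKS];
};

static void wait_graph_init(struct wait_graph *g, int ntasks)
{
    assert(ntasks > 0 && ntasks <= MAX_TASKS);
    g->ntasks = ntasks;
    for (int i = 0; i < MAX_TASKS; i++)
        for (int j = 0; j < MAX_TASKS; j++)
            g->waits_for[i][j] = false;
}

static void wait_graph_add(struct wait_graph *g, int waiter, int holder)
{
    assert(waiter >= 0 && waiter < g->ntasks);
    assert(holder >= 0 && holder < g->ntasks);
    g->waits_for[waiter][holder] = true;
}

/* Depth-first search. state: 0 = unvisited, 1 = on the DFS stack, 2 = done.
 * Reaching a node that is still on the stack means the graph has a cycle. */
static bool dfs_finds_cycle(const struct wait_graph *g, int node, int *state)
{
    state[node] = 1;
    for (int next = 0; next < g->ntasks; next++) {
        if (!g->waits_for[node][next])
            continue;
        if (state[next] == 1)
            return true;
        if (state[next] == 0 && dfs_finds_cycle(g, next, state))
            return true;
    }
    state[node] = 2;
    return false;
}

static bool wait_graph_has_deadlock(const struct wait_graph *g)
{
    int state[MAX_TASKS] = {0};
    for (int i = 0; i < g->ntasks; i++)
        if (state[i] == 0 && dfs_finds_cycle(g, i, state))
            return true;
    return false;
}

/* Index 0 = task 9785, 1 = task 10349, 2 = task 9709 (from the report).
 * The flag models whether the readdir task holds the page lock while it
 * touches user memory. */
static bool reported_scenario_deadlocks(bool reader_holds_page_lock)
{
    struct wait_graph g;
    wait_graph_init(&g, 3);
    wait_graph_add(&g, 1, 0); /* 10349 waits for cp_rwsem held by 9785 */
    wait_graph_add(&g, 2, 1); /* 9709 waits for mm_sem held by 10349 */
    if (reader_holds_page_lock)
        wait_graph_add(&g, 0, 2); /* 9785 waits for the page lock held by 9709 */
    return wait_graph_has_deadlock(&g);
}
```

Calling `reported_scenario_deadlocks(true)` models the reported state and finds the cycle. Passing `false` removes the page-lock edge, which is the edge that the readdir path contributes, and the cycle disappears.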

The task stacks, as shown by the crash tool, are as follows:

crash_arm64> bt ffffffc03c354080
PID: 9785   TASK: ffffffc03c354080  CPU: 1   COMMAND: "RxIoScheduler-3"
#0 [ffffffc01b50f8a0] __switch_to at ffffff8008086d88
#1 [ffffffc01b50f8c0] __schedule at ffffff8008a90840
#2 [ffffffc01b50f920] schedule at ffffff8008a90f2c
#3 [ffffffc01b50f940] schedule_timeout at ffffff8008a940ec
#4 [ffffffc01b50f9f0] io_schedule_timeout at ffffff8008a90544
#5 [ffffffc01b50fa20] bit_wait_io at ffffff8008a913b4
#6 [ffffffc01b50fa40] __wait_on_bit_lock at ffffff8008a91574
>> #7 [ffffffc01b50fac0] __lock_page at ffffff80081b11e8
#8 [ffffffc01b50fb30] f2fs_sync_node_pages at ffffff8008387d08
#9 [ffffffc01b50fc50] f2fs_write_checkpoint at ffffff8008376620
#10 [ffffffc01b50fd30] f2fs_sync_fs at ffffff800836b030
#11 [ffffffc01b50fda0] f2fs_do_sync_file at ffffff8008358a30
#12 [ffffffc01b50fe50] f2fs_sync_file at ffffff80083591c4
#13 [ffffffc01b50fe90] sys_fsync at ffffff800824cffc




crash-arm64> bt 10349
PID: 10349  TASK: ffffffc018b83080  CPU: 1   COMMAND: "BUGLY_ASYNC_UPL"
#0 [ffffffc01f8cf9a0] __switch_to at ffffff8008086d88
#1 [ffffffc01f8cf9c0] __schedule at ffffff8008a90840
#2 [ffffffc01f8cfa20] schedule at ffffff8008a90f2c
>> #3 [ffffffc01f8cfa40] rwsem_down_read_failed at ffffff8008a93afc
#4 [ffffffc01f8cfab0] down_read at ffffff8008a93360
#5 [ffffffc01f8cfad0] __do_map_lock at ffffff800837e758
#6 [ffffffc01f8cfb00] f2fs_vm_page_mkwrite at ffffff8008359c14
#7 [ffffffc01f8cfb90] do_page_mkwrite at ffffff80081e09cc
#8 [ffffffc01f8cfc10] do_wp_page at ffffff80081e2de8
#9 [ffffffc01f8cfcb0] handle_mm_fault at ffffff80081e5228
#10 [ffffffc01f8cfd80] do_page_fault at ffffff800809d05c
#11 [ffffffc01f8cfdf0] do_mem_abort at ffffff8008081570
#12 [ffffffc01f8cfed0] el0_da at ffffff8008085650
     PC: 00000033  LR: 00000000  SP: 00000000  PSTATE: ffffffffffffffff


crash-arm64> bt 9709
PID: 9709   TASK: ffffffc03e7f3080  CPU: 1   COMMAND: "IntentService[A"
#0 [ffffffc001e677b0] __switch_to at ffffff8008086d88
#1 [ffffffc001e677d0] __schedule at ffffff8008a90840
#2 [ffffffc001e67830] schedule at ffffff8008a90f2c
>> #3 [ffffffc001e67850] rwsem_down_read_failed at ffffff8008a93afc
#4 [ffffffc001e678c0] down_read at ffffff8008a93360
#5 [ffffffc001e678e0] do_page_fault at ffffff800809ceb8
#6 [ffffffc001e67950] do_translation_fault at ffffff800809d250
#7 [ffffffc001e67980] do_mem_abort at ffffff8008081570
>> #8 [ffffffc001e67b80] el1_ia at ffffff8008084fc4
     PC: ffffff8008274114  [compat_filldir64+120]
     LR: ffffff80083584d4  [f2fs_fill_dentries+448]
     SP: ffffffc001e67b80  PSTATE: 80400145
    X29: ffffffc001e67b80  X28: 0000000000000000  X27: 000000000000001a
    X26: 00000000000093d7  X25: ffffffc070d52480  X24: 0000000000000008
    X23: 0000000000000028  X22: 00000000d43dfd60  X21: ffffffc001e67e90
    X20: 0000000000000011  X19: ffffff80093a4000  X18: 0000000000000000
    X17: 0000000000000000  X16: 0000000000000000  X15: 0000000000000000
    X14: ffffffffffffffff  X13: 0000000000000008  X12: 0101010101010101
    X11: 7f7f7f7f7f7f7f7f  X10: 6a6a6a6a6a6a6a6a   X9: 7f7f7f7f7f7f7f7f
     X8: 0000000080808000   X7: ffffff800827409c   X6: 0000000080808000
     X5: 0000000000000008   X4: 00000000000093d7   X3: 000000000000001a
     X2: 0000000000000011   X1: ffffffc070d52480   X0: 0000000000800238
>> #9 [ffffffc001e67be0] f2fs_fill_dentries at ffffff80083584d0
#10 [ffffffc001e67ca0] f2fs_read_inline_dir at ffffff8008372ca4
#11 [ffffffc001e67d20] f2fs_readdir at ffffff80083588a0
#12 [ffffffc001e67de0] iterate_dir at ffffff8008228e90
#13 [ffffffc001e67e30] compat_sys_getdents64 at ffffff8008278894
#14 [ffffffc001e67ed0] __sys_trace at ffffff8008085b48
     PC: 0000003c  LR: 00000000  SP: 00000000  PSTATE: 000000d9
    X12: f48a02ff X11: d4678960 X10: d43dfc00  X9: d4678ae4
     X8: 00000058  X7: d4678994  X6: d43de800  X5: 000000d9
     X4: d43dfc0c  X3: d43dfc10  X2: d46799c8  X1: 00000000
     X0: 00001068
Comment 1 Jiqun Li 2019-03-12 06:53:05 UTC
For task 9709, do_page_fault() is triggered by __put_user_unaligned() in compat_filldir64():

static int compat_filldir64(struct dir_context *ctx, const char *name,
			    int namlen, loff_t offset, u64 ino,
			    unsigned int d_type)
{
        .....
	if (dirent) {
		if (__put_user_unaligned(offset, &dirent->d_off))
			goto efault;
	}


dirent points into a local array in user space. When the kernel accesses dirent->d_off,
do_page_fault() is triggered to allocate the real memory behind it.

dirent is a valid user-space address.
Comment 2 Chao Yu 2019-03-12 07:51:36 UTC
Hi Jiqun,

I just sent a patch for that; could you try it?

[PATCH] f2fs: fix to avoid deadlock in f2fs_read_inline_dir()

The solution is very similar to what we did in the other patch below:
we simply drop the lock, since it is unneeded.

f2fs: no need to take page lock in readdir

https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git/commit/?h=dev-test&id=f3d42f6e5087a4b3a52ee4008734af93519dec06
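
The idea of dropping the lock before the step that can fault can be illustrated with a small user-space model. This is a hedged sketch, not the actual patch: `page_model`, `fill_entries`, `read_dir_lock_held` and `read_dir_lock_dropped` are hypothetical names, the "lock" is just a flag, and `fill_entries` stands in for the dentry-filling step that writes to user memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a page holding inline directory data. */
struct page_model {
    bool locked;
    char data[32];
};

/* Stand-in for the step that copies entries out and may fault.
 * Returns true if it ran while the page lock was still held. */
typedef bool (*fill_fn)(const struct page_model *page, char *dst, size_t len);

static bool fill_entries(const struct page_model *page, char *dst, size_t len)
{
    size_t n = len < sizeof(page->data) ? len : sizeof(page->data);
    memcpy(dst, page->data, n);
    return page->locked;
}

/* Problematic shape: the lock is held across the step that may fault. */
static bool read_dir_lock_held(struct page_model *page, fill_fn fill,
                               char *dst, size_t len)
{
    assert(page != NULL && fill != NULL);
    page->locked = true;
    bool filled_under_lock = fill(page, dst, len);
    page->locked = false;
    return filled_under_lock;
}

/* Assumed shape of the fix: release the lock before the faulting step. */
static bool read_dir_lock_dropped(struct page_model *page, fill_fn fill,
                                  char *dst, size_t len)
{
    assert(page != NULL && fill != NULL);
    page->locked = true;  /* held only while the page is looked up */
    page->locked = false; /* dropped before anything that can fault */
    return fill(page, dst, len);
}
```

In the first variant the fill step observes the lock as held, which is the window in which a page fault can join the lock cycle described in this report. In the second variant the fill step runs with the lock already released. Whether releasing it is safe in the real code depends on other serialization of readdir, which this model does not capture.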
Comment 3 Jiqun Li 2019-03-12 07:54:46 UTC
thanks, I got it
Comment 4 Jiqun Li 2019-03-12 07:55:03 UTC
I will try it
Comment 5 Chao Yu 2019-03-16 08:08:21 UTC
The fix has been merged; closing this issue.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=aadcef64b22f668c1a107b86d3521d9cac915c24