Hi,

When we run a reboot stress test on a device, we occasionally hit the
following kernel crash:

[   42.035226] c6 Unable to handle kernel NULL pointer dereference at virtual address 0000000a
[   43.437464] c6  __list_del_entry_valid+0xc/0xd8
[   43.441962] c6  f2fs_destroy_node_manager+0x218/0x398
[   43.446984] c6  f2fs_put_super+0x19c/0x2b8
[   43.451052] c6  generic_shutdown_super+0x70/0xf8
[   43.455635] c6  kill_block_super+0x2c/0x5c
[   43.459702] c6  kill_f2fs_super+0xac/0xd8
[   43.463684] c6  deactivate_locked_super+0x5c/0x124
[   43.468442] c6  deactivate_super+0x5c/0x68
[   43.472512] c6  cleanup_mnt+0x9c/0x118
[   43.476231] c6  __cleanup_mnt+0x1c/0x28
[   43.480043] c6  task_work_run+0x88/0xa8
[   43.483850] c6  do_notify_resume+0x39c/0x1c88
[   43.488174] c6  work_pending+0x8/0x14

The code at the crash point is in f2fs/node.c, f2fs_destroy_node_manager():

	while ((found = __gang_lookup_nat_cache(nm_i,
					nid, NATVEC_SIZE, natvec))) {
		unsigned idx;

		nid = nat_get_nid(natvec[found - 1]) + 1;
		for (idx = 0; idx < found; idx++) {
			spin_lock(&nm_i->nat_list_lock);
>			list_del(&natvec[idx]->list);
			spin_unlock(&nm_i->nat_list_lock);
			__del_from_nat_cache(nm_i, natvec[idx]);
		}
	}

It crashes because the nat entry in natvec[idx] is an invalid pointer, or its
list member has a NULL next pointer. We have encountered this issue several
times on both Android Q and R.

My analysis so far:

1. The current nat entry can be found on the stack; note the bogus value "a"
   marked below:

   ffffff800806b8d0: ffffffc0af33cbc0 ffffffc0af4869a0
>  ffffff800806b8e0: ffffffc0f49baa00 000000000000000a
   ffffff800806b8f0: ffffffc0af33c040 ffffffc0c69f0e20
   ffffff800806b900: ffffffc0c695abc0 ffffffc01e2a4460

2. These invalid entries can be found in the nat_root radix tree of
   f2fs_nm_info.

3. I have reviewed the code around nat_tree_lock and found no clues.

Please let me know if you need any other information. Thanks a lot.
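For reference, __gang_lookup_nat_cache() is just a thin wrapper over the radix
tree lookup (a sketch based on my reading of the f2fs source of that era;
check your tree for the exact shape), so every pointer it returns comes
straight out of nat_root:

	static unsigned int __gang_lookup_nat_cache(struct f2fs_nm_info *nm_i,
					nid_t start, unsigned int nr,
					struct nat_entry **ep)
	{
		/* fills ep[0..ret-1] from nat_root and returns the count */
		return radix_tree_gang_lookup(&nm_i->nat_root, (void **)ep,
								start, nr);
	}

So if natvec[idx] is garbage, the bad pointer was already present in nat_root,
or the tree was walked without sufficient locking.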
Hi,

I checked the code of 4.14.193, and I don't have any clue about why this can
happen. I don't remember any such corruption occurring on the nat entry list,
because all of its updates are done under nat_tree_lock; let me know if I
missed something.

Do you apply any private patches on top of 4.14.193?
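For context, this is roughly the removal path the crashing loop calls into (a
sketch from my reading of a 4.19-era tree, since the backtrace shows
nat_list_lock; exact helper names may differ in your backport). The entry is
deleted from nat_root and freed in one step, with the caller holding
nat_tree_lock:

	static void __del_from_nat_cache(struct f2fs_nm_info *nm_i,
						struct nat_entry *e)
	{
		radix_tree_delete(&nm_i->nat_root, nat_get_nid(e));
		nm_i->nat_cnt--;
		/* returns the entry to nat_entry_slab */
		__free_nat_entry(e);
	}

So a freed entry should never linger in the tree; if one does, either the
radix_tree_delete() was skipped somewhere or the tree was modified without
the lock.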
(In reply to Chao Yu from comment #1)
> Hi,
>
> I checked the code of 4.14.193, and I don't have any clue about why this
> can happen. I don't remember any such corruption occurring on the nat entry
> list, because all of its updates are done under nat_tree_lock; let me know
> if I missed something.
>
> Do you apply any private patches on top of 4.14.193?

Hi Chao,

Thanks for your reply. I have checked my codebase; there are no other private
patches in the current version.

I find that the local variables natvec & setvec in f2fs_destroy_node_manager
may be initialized to 0xaa / 0xaaaaaaaaaaaaaaaa, like this:

void f2fs_destroy_node_manager(struct f2fs_sb_info *sbi)
{
	struct f2fs_nm_info *nm_i = NM_I(sbi);
	struct free_nid *i, *next_i;
	struct nat_entry *natvec[NATVEC_SIZE];
	struct nat_entry_set *setvec[SETVEC_SIZE];

Disassembly:

crash_arm64> dis f2fs_destroy_node_manager
0xffffff800842e2a8 <f2fs_destroy_node_manager>:         stp x29, x30, [sp,#-96]!
0xffffff800842e2ac <f2fs_destroy_node_manager+4>:       stp x28, x27, [sp,#16]
0xffffff800842e2b0 <f2fs_destroy_node_manager+8>:       stp x26, x25, [sp,#32]
0xffffff800842e2b4 <f2fs_destroy_node_manager+12>:      stp x24, x23, [sp,#48]
0xffffff800842e2b8 <f2fs_destroy_node_manager+16>:      stp x22, x21, [sp,#64]
0xffffff800842e2bc <f2fs_destroy_node_manager+20>:      stp x20, x19, [sp,#80]
0xffffff800842e2c0 <f2fs_destroy_node_manager+24>:      mov x29, sp
0xffffff800842e2c4 <f2fs_destroy_node_manager+28>:      sub sp, sp, #0x320
0xffffff800842e2c8 <f2fs_destroy_node_manager+32>:      adrp x8, 0xffffff800947e000 <xt_connlimit_locks+768>
0xffffff800842e2cc <f2fs_destroy_node_manager+36>:      ldr x8, [x8,#264]
0xffffff800842e2d0 <f2fs_destroy_node_manager+40>:      mov x27, x0
0xffffff800842e2d4 <f2fs_destroy_node_manager+44>:      str x8, [x29,#-16]
0xffffff800842e2d8 <f2fs_destroy_node_manager+48>:      nop
0xffffff800842e2dc <f2fs_destroy_node_manager+52>:      ldr x20, [x27,#112]
0xffffff800842e2e0 <f2fs_destroy_node_manager+56>:      add x0, sp, #0x110
0xffffff800842e2e4 <f2fs_destroy_node_manager+60>:      mov w1, #0xaa                   // #170
0xffffff800842e2e8 <f2fs_destroy_node_manager+64>:      mov w2, #0x200                  // #512
0xffffff800842e2ec <f2fs_destroy_node_manager+68>:      bl 0xffffff8008be6b80 <__memset>
0xffffff800842e2f0 <f2fs_destroy_node_manager+72>:      mov x8, #0xaaaaaaaaaaaaaaaa     // #-6148914691236517206
0xffffff800842e2f4 <f2fs_destroy_node_manager+76>:      stp x8, x8, [sp,#256]
0xffffff800842e2f8 <f2fs_destroy_node_manager+80>:      stp x8, x8, [sp,#240]
0xffffff800842e2fc <f2fs_destroy_node_manager+84>:      stp x8, x8, [sp,#224]
0xffffff800842e300 <f2fs_destroy_node_manager+88>:      stp x8, x8, [sp,#208]
0xffffff800842e304 <f2fs_destroy_node_manager+92>:      stp x8, x8, [sp,#192]
0xffffff800842e308 <f2fs_destroy_node_manager+96>:      stp x8, x8, [sp,#176]
0xffffff800842e30c <f2fs_destroy_node_manager+100>:     stp x8, x8, [sp,#160]
0xffffff800842e310 <f2fs_destroy_node_manager+104>:     stp x8, x8, [sp,#144]
0xffffff800842e314 <f2fs_destroy_node_manager+108>:     stp x8, x8, [sp,#128]
0xffffff800842e318 <f2fs_destroy_node_manager+112>:     stp x8, x8, [sp,#112]
0xffffff800842e31c <f2fs_destroy_node_manager+116>:     stp x8, x8, [sp,#96]
0xffffff800842e320 <f2fs_destroy_node_manager+120>:     stp x8, x8, [sp,#80]
0xffffff800842e324 <f2fs_destroy_node_manager+124>:     stp x8, x8, [sp,#64]
0xffffff800842e328 <f2fs_destroy_node_manager+128>:     stp x8, x8, [sp,#48]
0xffffff800842e32c <f2fs_destroy_node_manager+132>:     stp x8, x8, [sp,#32]
0xffffff800842e330 <f2fs_destroy_node_manager+136>:     stp x8, x8, [sp,#16]

I am not sure whether this is the root cause, because these invalid entries
can be found in the nat_root radix tree of f2fs_nm_info.

Thanks!
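A side note on the 0xaa fill: that pattern is most likely clang's
-ftrivial-auto-var-init=pattern (CONFIG_INIT_STACK_ALL on Android kernels),
which pre-fills uninitialized locals with 0xAA bytes. The compiler effectively
inserts something like the illustration below, which matches the __memset and
the stp x8, x8 sequence in the disassembly, so the pattern in slots at index
@found and beyond is expected and harmless by itself:

	/* illustration only: what pattern auto-init effectively does,
	 * not actual f2fs code */
	struct nat_entry *natvec[NATVEC_SIZE];
	struct nat_entry_set *setvec[SETVEC_SIZE];

	memset(natvec, 0xaa, sizeof(natvec));	/* compiler-inserted pre-fill */
	memset(setvec, 0xaa, sizeof(setvec));	/* compiler-inserted pre-fill */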
nm_i->nat_list_lock was introduced in 4.19; are you sure your codebase is 4.14.193?
(In reply to Zhiguo.Niu from comment #2)
> Hi Chao,
>
> Thanks for your reply. I have checked my codebase; there are no other
> private patches in the current version.
>
> I find that the local variables natvec & setvec in f2fs_destroy_node_manager
> may be initialized to 0xaa / 0xaaaaaaaaaaaaaaaa, like this:
>
> void f2fs_destroy_node_manager(struct f2fs_sb_info *sbi)
> {
> 	struct f2fs_nm_info *nm_i = NM_I(sbi);
> 	struct free_nid *i, *next_i;
> 	struct nat_entry *natvec[NATVEC_SIZE];
> 	struct nat_entry_set *setvec[SETVEC_SIZE];

I don't think so. The natvec array is filled in by __gang_lookup_nat_cache(),
so natvec[0 .. found - 1] will be valid, and the "destroy nat cache" loop
never accesses the natvec array out of range.

Can you please check whether @found is valid (it should be less than or equal
to NATVEC_SIZE)?

BTW, one possible cause could be stack overflow, but would that really happen
during umount()?
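If it helps, a minimal (hypothetical, untested) debugging hunk for
f2fs_destroy_node_manager() could catch both a bogus @found and a
pattern-initialized slot before the list_del():

	while ((found = __gang_lookup_nat_cache(nm_i,
					nid, NATVEC_SIZE, natvec))) {
		unsigned idx;

		/* @found must never exceed the vector size we passed in */
		f2fs_bug_on(sbi, found > NATVEC_SIZE);

		nid = nat_get_nid(natvec[found - 1]) + 1;
		for (idx = 0; idx < found; idx++) {
			/* catch a slot still holding the 0xaa.. init pattern */
			f2fs_bug_on(sbi, (unsigned long)natvec[idx] ==
						0xaaaaaaaaaaaaaaaaUL);
			spin_lock(&nm_i->nat_list_lock);
			list_del(&natvec[idx]->list);
			spin_unlock(&nm_i->nat_list_lock);
			__del_from_nat_cache(nm_i, natvec[idx]);
		}
	}

If either check fires, that would at least narrow down whether the stale
pointer comes from the gang lookup itself or from a corrupted entry inside
nat_root.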