Bug 219022

Summary: UBIFS/ext4: Deadlock happens while getting other inodes in the inode evicting process under inode lru traversing context
Product: File System
Reporter: Zhihao Cheng (chengzhihao1)
Component: Other
Assignee: fs_other
Status: NEW
Severity: normal
Priority: P3
Hardware: All
OS: Linux
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:
Attachments: diff, a.c, diff_2, a_2.c

Description Zhihao Cheng 2024-07-10 02:07:10 UTC
Created attachment 306553 [details]
diff

Description:
The inode reclaiming process (see prune_icache_sb()) first collects all reclaimable inodes and marks them with the I_FREEING flag. From that point on, any other process that tries to get one of these inodes blocks (see find_inode_fast()) until the reclaiming process destroys the inodes.
In the deleted-inode writing function ubifs_jnl_write_inode(), UBIFS holds BASEHD's wbuf->io_mutex while traversing all xattr inodes. This can race with the inode reclaiming process, which may itself try to lock BASEHD's wbuf->io_mutex in the inode evicting function. The result is an ABBA deadlock, as follows:
                                                                                    
 1. File A has inode ia and an xattr (with inode ixa); regular file B has inode ib and an xattr.
 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa
 3. Then the following three processes run like this:
                                                                                    
        PA                PB                        PC                              
                echo 2 > /proc/sys/vm/drop_caches                                   
                // ib and ia are added into lru, lru->ixa->ib->ia
                 shrink_slab                                                        
                  prune_icache_sb                                                   
                   list_lru_walk_one                                                
                    inode_lru_isolate                                               
                     ixa->inode->i_state |= I_FREEING // set inode state            
                    inode_lru_isolate                                               
                     __iget(ib)                                                     
                     spin_unlock(&ib->i_lock)                                       
                     spin_unlock(lru_lock)                                          
                                                   rm file B                        
                                                    ib->nlink = 0                   
                                                    iput(ib)                        
 rm file A                                                                          
  iput(ia)                                                                          
   ubifs_evict_inode(ia)                                                            
    ubifs_jnl_delete_inode(ia)                                                      
     ubifs_jnl_write_inode(ia)
      make_reservation(BASEHD) // Lock wbuf->io_mutex                            
      ubifs_iget(ixa->i_ino)                                                     
       iget_locked                                                               
        find_inode_fast                                                          
         __wait_on_freeing_inode(ixa)                                            
          |          iput(ib) // ib->nlink is 0, do evict                        
          |           ubifs_evict_inode                                          
          |            ubifs_jnl_delete_inode(ib)                                
          ↓             ubifs_jnl_write_inode                                    
     ABBA deadlock ←-----make_reservation(BASEHD)                                
                   dispose_list // cannot be executed by prune_icache_sb         
                    wake_up_bit(&inode->i_state)                                 
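
The lock ordering in the diagram above can be imitated in user space. What follows is only a toy sketch, not kernel code: `io_mutex` stands in for BASEHD's wbuf->io_mutex, the boolean `ixa_freeing` stands in for I_FREEING on ixa, and the function names (`pa_unlink_a`, `pb_shrinker`, `model_abba_deadlock`) are made up for illustration. PC's part (dropping the last link of B) is folded into PB. Timeouts replace the infinite waits so that the program terminates and can report that both sides were stuck.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Toy stand-ins (assumptions), not kernel objects. */
static pthread_mutex_t io_mutex = PTHREAD_MUTEX_INITIALIZER;   /* wbuf->io_mutex */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER; /* guards the flags */
static pthread_cond_t state_cv = PTHREAD_COND_INITIALIZER;
static bool ixa_freeing, pa_holds_io, pb_gave_up;
static bool pa_stuck, pb_stuck;

/* Absolute CLOCK_REALTIME deadline `ms` milliseconds from now. */
static struct timespec deadline_ms(long ms)
{
    struct timespec ts;

    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += ms / 1000;
    ts.tv_nsec += (ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {
        ts.tv_sec++;
        ts.tv_nsec -= 1000000000L;
    }
    return ts;
}

/* PA: rm file A -> takes io_mutex, then waits for ixa to stop being freed. */
static void *pa_unlink_a(void *arg)
{
    struct timespec ts;

    (void)arg;
    pthread_mutex_lock(&state_lock);
    while (!ixa_freeing)                /* let PB isolate ixa first */
        pthread_cond_wait(&state_cv, &state_lock);
    pthread_mutex_unlock(&state_lock);

    pthread_mutex_lock(&io_mutex);      /* make_reservation(BASEHD) */

    pthread_mutex_lock(&state_lock);
    pa_holds_io = true;
    pthread_cond_broadcast(&state_cv);
    ts = deadline_ms(200);
    while (ixa_freeing && !pa_stuck) {  /* __wait_on_freeing_inode(ixa) */
        if (pthread_cond_timedwait(&state_cv, &state_lock, &ts) == ETIMEDOUT)
            pa_stuck = true;
    }
    while (!pb_gave_up)                 /* keep io_mutex until PB saw the hang */
        pthread_cond_wait(&state_cv, &state_lock);
    pthread_mutex_unlock(&state_lock);

    pthread_mutex_unlock(&io_mutex);
    return NULL;
}

/* PB: shrinker marks ixa as freeing, then evicts ib, which needs io_mutex. */
static void *pb_shrinker(void *arg)
{
    struct timespec ts;

    (void)arg;
    pthread_mutex_lock(&state_lock);
    ixa_freeing = true;                 /* inode_lru_isolate(ixa) */
    pthread_cond_broadcast(&state_cv);
    while (!pa_holds_io)                /* PA is now inside the journal code */
        pthread_cond_wait(&state_cv, &state_lock);
    pthread_mutex_unlock(&state_lock);

    ts = deadline_ms(200);
    if (pthread_mutex_timedlock(&io_mutex, &ts) == ETIMEDOUT)
        pb_stuck = true;                /* iput(ib) -> make_reservation(BASEHD) */
    else
        pthread_mutex_unlock(&io_mutex);

    /* dispose_list() would clear ixa_freeing here; the model never gets that far. */
    pthread_mutex_lock(&state_lock);
    pb_gave_up = true;
    pthread_cond_broadcast(&state_cv);
    pthread_mutex_unlock(&state_lock);
    return NULL;
}

/* Returns 1 if both sides timed out waiting on each other. Call once. */
int model_abba_deadlock(void)
{
    pthread_t pa, pb;
    int rc;

    rc = pthread_create(&pa, NULL, pa_unlink_a, NULL);
    assert(rc == 0);
    rc = pthread_create(&pb, NULL, pb_shrinker, NULL);
    assert(rc == 0);
    pthread_join(pa, NULL);
    pthread_join(pb, NULL);
    return pa_stuck && pb_stuck;
}
```

Called once from a `main()` and built with `gcc -pthread`, `model_abba_deadlock()` returns 1: PA holds the mutex while waiting for the flag, and PB, the only party that could clear the flag, is waiting for the mutex.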
                                                                                 

Reproducer:
CONFIG_DETECT_HUNG_TASK=y
CONFIG_MTD_NAND_NANDSIM=m
CONFIG_MTD_UBI=m
CONFIG_UBIFS_FS=m

-smp 1 // single core, so that all inodes are put into the same lru list

1. Apply diff and compile kernel
2. gcc -oaa a.c -lpthread
3. ./aa

[   45.237580] Add 66 lru
[   46.255128] Add 67 lru
[   46.257548] Add 65 lru
[   46.258042] <1545> try isolate 66
[   46.258735] add inode 66 into dispose list
[   46.259552] <1545> try isolate 67
[   46.262521] wait unlink file_b
[   47.246328] wait unlink file_b done
[   47.247295] wait unlink file_a
[   49.239449] <1504> wait inode 66 I_FREEEING
[   49.292337] wait unlink file_a done
[   49.293586] <1545> prepare to relase ib
[   76.623348] INFO: task aa:1504 blocked for more than 15 seconds.
[   76.624921]       Not tainted 6.10.0-rc7-00026-g828740a7415c-dirty #680
[   76.626583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[   76.628535] task:aa              state:D stack:0     pid:1504  tgid:1504  ppid:1404   flags:0x00000002
[   76.628549] Call Trace:
[   76.628553]  <TASK>
[   76.628561]  __schedule+0x591/0x1270
[   76.628593]  schedule+0x37/0x160
[   76.628600]  __wait_on_freeing_inode+0xd4/0x130
[   76.628611]  ? __pfx_wake_bit_function+0x10/0x10
[   76.628623]  find_inode_fast+0xde/0x1b0
[   76.628632]  iget_locked+0x7d/0x360
[   76.628643]  ubifs_iget+0x4f/0x7b0 [ubifs]
[   76.628725]  ubifs_jnl_write_inode+0x1da/0x790 [ubifs]
[   76.628794]  ? xas_load+0x15/0x200
[   76.628803]  ? xa_load+0x93/0xf0
[   76.628811]  ? __inode_wait_for_writeback+0x8b/0x120
[   76.628821]  ubifs_jnl_delete_inode+0x50/0x190 [ubifs]
[   76.628888]  ubifs_evict_inode+0x161/0x1e0 [ubifs]
[   76.628963]  evict+0x12c/0x2e0
[   76.628971]  iput+0x21e/0x3b0
[   76.628978]  do_unlinkat+0x167/0x4a0
[   76.628989]  __x64_sys_unlink+0x3b/0x60
[   76.628997]  x64_sys_call+0x24de/0x4560
[   76.629007]  do_syscall_64+0xa7/0x230
[   76.629018]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   76.629029] RIP: 0033:0x7ff513501b77
[   76.629049] RSP: 002b:00007ffdf6d8f288 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
[   76.629058] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff513501b77
[   76.629063] RDX: 00007ff5137d1740 RSI: 00007ff50c0008c0 RDI: 0000000000400ead
[   76.629067] RBP: 00007ffdf6d8f3b0 R08: 00007ff513e07700 R09: 0000000000000000
[   76.629071] R10: 0000000000000004 R11: 0000000000000206 R12: 0000000000400850
[   76.629075] R13: 00007ffdf6d8f490 R14: 0000000000000000 R15: 0000000000000000
[   76.629082]  </TASK>
[   76.629085] INFO: task bb:1545 blocked for more than 15 seconds.
[   76.630599]       Not tainted 6.10.0-rc7-00026-g828740a7415c-dirty #680
[   76.632262] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[   76.634185] task:bb              state:D stack:0     pid:1545  tgid:1504  ppid:1404   flags:0x00004002
[   76.634194] Call Trace:
[   76.634197]  <TASK>
[   76.634201]  __schedule+0x591/0x1270
[   76.634208]  ? stack_depot_save+0x16/0x30
[   76.634240]  schedule+0x37/0x160
[   76.634246]  schedule_preempt_disabled+0x25/0x50
[   76.634253]  __mutex_lock.constprop.0+0x4a3/0x990
[   76.634262]  ? __link_object+0x194/0x240
[   76.634274]  __mutex_lock_slowpath+0x1f/0x30
[   76.634281]  mutex_lock+0x56/0x70
[   76.634289]  make_reservation+0xe1/0xb00 [ubifs]
[   76.634357]  ubifs_jnl_write_inode+0x10a/0x790 [ubifs]
[   76.634425]  ? prb_read_valid+0x23/0x40
[   76.634434]  ? console_unlock+0x5c/0x180
[   76.634439]  ? __irq_work_queue_local+0x51/0x1b0
[   76.634447]  ? xas_load+0x15/0x200
[   76.634455]  ? xa_load+0x93/0xf0
[   76.634463]  ? __inode_wait_for_writeback+0x8b/0x120
[   76.634471]  ubifs_jnl_delete_inode+0x50/0x190 [ubifs]
[   76.634537]  ubifs_evict_inode+0x161/0x1e0 [ubifs]
[   76.634604]  evict+0x12c/0x2e0
[   76.634611]  iput+0x21e/0x3b0
[   76.634619]  inode_lru_isolate+0x424/0x540
[   76.634628]  __list_lru_walk_one+0xc8/0x300
[   76.634635]  ? __pfx_inode_lru_isolate+0x10/0x10
[   76.634645]  ? __pfx_inode_lru_isolate+0x10/0x10
[   76.634652]  list_lru_walk_one+0x6d/0xb0
[   76.634660]  prune_icache_sb+0x52/0x90
[   76.634670]  super_cache_scan+0x14a/0x200
[   76.634679]  do_shrink_slab+0x1c8/0x5c0
[   76.634691]  shrink_slab+0x5c1/0x7b0
[   76.634702]  drop_slab+0xc9/0x1a0
[   76.634711]  drop_caches_sysctl_handler+0xd2/0x140
[   76.634721]  proc_sys_call_handler+0x1d0/0x2f0
[   76.634730]  proc_sys_write+0x1b/0x30
[   76.634736]  vfs_write+0x243/0x6c0
[   76.634748]  ksys_write+0x7f/0x170
[   76.634756]  __x64_sys_write+0x21/0x30
[   76.634765]  x64_sys_call+0x4531/0x4560
[   76.634773]  do_syscall_64+0xa7/0x230
[   76.634782]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Comment 1 Zhihao Cheng 2024-07-10 02:07:47 UTC
Created attachment 306554 [details]
a.c
Comment 2 Zhihao Cheng 2024-07-11 12:29:40 UTC
Description(ext4):
     1. File A has inode i_reg and an ea inode i_ea
     2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
     3. Then the following two processes run like this:
    
            PA                              PB
     echo 2 > /proc/sys/vm/drop_caches
      shrink_slab
       prune_dcache_sb
       // i_reg is added into lru, lru->i_ea->i_reg
       prune_icache_sb
        list_lru_walk_one
         inode_lru_isolate
          i_ea->i_state |= I_FREEING // set inode state
         inode_lru_isolate
          __iget(i_reg)
          spin_unlock(&i_reg->i_lock)
          spin_unlock(lru_lock)
                                         rm file A
                                          i_reg->nlink = 0
          iput(i_reg) // i_reg->nlink is 0, do evict
           ext4_evict_inode
            ext4_xattr_delete_inode
             ext4_xattr_inode_dec_ref_all
              ext4_xattr_inode_iget
               ext4_iget(i_ea->i_ino)
                iget_locked
                 find_inode_fast
                  __wait_on_freeing_inode(i_ea) ----→ AA deadlock
        dispose_list // cannot be executed by prune_icache_sb
         wake_up_bit(&i_ea->i_state)
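
The ext4 case needs no second lock: the context that marked i_ea as freeing is the same one that later waits for the flag to be cleared. Below is a toy user-space sketch of that self-wait, not kernel code. `struct toy_inode`, `toy_wait_on_freeing()` and `model_aa_deadlock()` are hypothetical names, and a timeout replaces the infinite sleep so that the function can return.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Toy stand-in for an inode; only the I_FREEING bit is modelled. */
struct toy_inode {
    bool freeing;
};

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t state_cv = PTHREAD_COND_INITIALIZER;

/* Models __wait_on_freeing_inode(); returns true if the wait timed out. */
static bool toy_wait_on_freeing(struct toy_inode *inode, long timeout_ms)
{
    struct timespec ts;
    bool timed_out = false;

    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += timeout_ms / 1000;
    ts.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {
        ts.tv_sec++;
        ts.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&state_lock);
    while (inode->freeing && !timed_out) {
        int rc = pthread_cond_timedwait(&state_cv, &state_lock, &ts);

        assert(rc == 0 || rc == ETIMEDOUT);
        if (rc == ETIMEDOUT)
            timed_out = true;
    }
    pthread_mutex_unlock(&state_lock);
    return timed_out;
}

/* Single context playing the shrinker (PA in the diagram). Returns 1 if stuck. */
int model_aa_deadlock(void)
{
    struct toy_inode i_ea = { .freeing = false };
    bool stuck;

    i_ea.freeing = true;    /* inode_lru_isolate(i_ea): queued for dispose_list */

    /* iput(i_reg) -> ext4_evict_inode -> ... -> iget on i_ea */
    stuck = toy_wait_on_freeing(&i_ea, 100);

    /* dispose_list(): reached here only because the model times out */
    i_ea.freeing = false;
    return stuck;
}
```

`model_aa_deadlock()` returns 1 after roughly 100 ms, because nothing other than the caller itself could ever clear `freeing`.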

Reproducer:
CONFIG_DETECT_HUNG_TASK=y
CONFIG_EXT4_FS=y

rootfs=xfs (not ext4)
-smp 1 // single core, so that all inodes are put into the same lru list

1. Apply diff_2 and compile kernel
2. gcc -oaa a_2.c -lpthread
3. ./aa

[   68.021371] Add 14 lru
[   68.035222] Add 13 lru
[   68.035744] <1164> try isolate 14 0 0
[   68.036507] add inode 14 into dispose list
[   68.037317] <1164> try isolate 13 0 1
[   68.042992] wait unlink file_a
[   69.027971] wait unlink file_a done
[   69.028901] <1164> prepare to relase ia
[   69.030480] <1164> get xattr 14
[   69.031318] <1164> wait inode 14 I_FREEEING
[   92.432041] INFO: task bb:1164 blocked for more than 15 seconds.
[   92.433606]       Not tainted 6.10.0-rc7-00038-gbf66a9390c24-dirty #715
[   92.435248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[   92.437157] task:bb              state:D stack:0     pid:1164  tgid:1028  ppid:890    flags:0x00004000
[   92.437171] Call Trace:
[   92.437175]  <TASK>
[   92.437182]  __schedule+0x591/0x1270
[   92.437232]  schedule+0x37/0x160
[   92.437238]  __wait_on_freeing_inode+0xd4/0x130
[   92.437250]  ? __pfx_wake_bit_function+0x10/0x10
[   92.437262]  find_inode_fast+0xde/0x1b0
[   92.437271]  iget_locked+0x7d/0x360
[   92.437279]  __ext4_iget+0x19e/0x1710
[   92.437289]  ? __wake_up_klogd+0x69/0xe0
[   92.437300]  ? vprintk_emit+0x2fd/0x470
[   92.437307]  ext4_xattr_inode_iget+0x4a/0x1a0
[   92.437315]  ext4_xattr_inode_dec_ref_all+0xce/0x580
[   92.437324]  ext4_xattr_delete_inode+0x445/0x580
[   92.437333]  ext4_evict_inode+0x33b/0xa90
[   92.437343]  evict+0x12c/0x2e0
[   92.437351]  iput+0x21e/0x3b0
[   92.437359]  inode_lru_isolate+0x407/0x520
[   92.437368]  __list_lru_walk_one+0xc8/0x300
[   92.437375]  ? __pfx_inode_lru_isolate+0x10/0x10
[   92.437385]  ? __pfx_inode_lru_isolate+0x10/0x10
[   92.437392]  list_lru_walk_one+0x6d/0xb0
[   92.437416]  prune_icache_sb+0x52/0x90
[   92.437426]  super_cache_scan+0x14a/0x200
[   92.437435]  do_shrink_slab+0x1c8/0x5c0
[   92.437446]  shrink_slab+0x5c1/0x7b0
[   92.437458]  drop_slab+0xc9/0x1a0
[   92.437466]  drop_caches_sysctl_handler+0xd2/0x140
[   92.437476]  proc_sys_call_handler+0x1d0/0x2f0
[   92.437485]  proc_sys_write+0x1b/0x30
[   92.437492]  vfs_write+0x243/0x6c0
[   92.437503]  ksys_write+0x7f/0x170
[   92.437512]  __x64_sys_write+0x21/0x30
[   92.437521]  x64_sys_call+0x4531/0x4560
[   92.437531]  do_syscall_64+0xa7/0x230
[   92.437542]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Comment 3 Zhihao Cheng 2024-07-11 12:29:50 UTC
Created attachment 306560 [details]
diff_2
Comment 4 Zhihao Cheng 2024-07-11 12:30:02 UTC
Created attachment 306561 [details]
a_2.c