Bug 218935

Summary: UBIFS: problem report: about lpt LEB scanning failed (no issue)
Product: File System    Reporter: Zhihao Cheng (chengzhihao1)
Component: Other        Assignee: fs_other
Status: NEW
Severity: normal
Priority: P3
Hardware: All
OS: Linux
Kernel Version:         Subsystem:
Regression: No          Bisected commit-id:
Attachments: test.sh

Description Zhihao Cheng 2024-06-05 02:32:12 UTC
Problem description

Recently I was testing UBIFS with fsstress on a NOR flash (simulated by mtdram, 64M size, 16K PEB, which means big LPT mode for UBIFS). The utilization of one CPU (the one running the fsstress program) was 100%, and the fsstress program could not be killed. The fsstress program is stuck in an endless loop:

do_commit -> ubifs_lpt_start_commit:

  while (need_write_all(c)) {
    mutex_unlock(&c->lp_mutex);
    err = lpt_gc(c);
    if (err)
      return err;
    mutex_lock(&c->lp_mutex);
  }
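
For reference, the two helpers driving this loop look roughly like the following (a sketch paraphrased from fs/ubifs/lpt_commit.c; exact code differs across kernel versions). need_write_all() keeps the loop going while too little LPT space is reclaimable, and it only counts a LEB as reclaimable once its free + dirty space covers the whole LEB; lpt_gc() always garbage-collects the LEB with the most dirty space:

static int need_write_all(struct ubifs_info *c)
{
  long long free = 0;
  int i;

  for (i = 0; i < c->lpt_lebs; i++) {
    if (i + c->lpt_first == c->nhead_lnum)
      free += c->leb_size - c->nhead_offs;
    else if (c->ltab[i].free == c->leb_size)
      free += c->leb_size;
    else if (c->ltab[i].free + c->ltab[i].dirty == c->leb_size)
      free += c->leb_size; /* reclaimable only if dirty covers the rest */
  }
  /* less than twice the size of the LPT left: keep garbage-collecting */
  if (free <= c->lpt_sz * 2)
    return 1;
  return 0;
}

static int lpt_gc(struct ubifs_info *c)
{
  int i, lnum = -1, dirty = 0;

  mutex_lock(&c->lp_mutex);
  for (i = 0; i < c->lpt_lebs; i++) {
    if (i + c->lpt_first == c->nhead_lnum ||
        c->ltab[i].free + c->ltab[i].dirty == c->leb_size)
      continue;
    /* pick the LEB with the most dirty space; if lpt_gc_lnum() fails
       to dirty all of a LEB, that LEB keeps winning here forever */
    if (c->ltab[i].dirty > dirty) {
      dirty = c->ltab[i].dirty;
      lnum = i + c->lpt_first;
    }
  }
  mutex_unlock(&c->lp_mutex);
  if (lnum == -1)
    return -ENOSPC;
  return lpt_gc_lnum(c, lnum);
}

With LEB 8's dirty count stuck below c->leb_size, LEB 8 is never treated as reclaimable, lpt_gc() keeps selecting it, and the loop never exits.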

Then I found that lpt_gc_lnum() handles the same LEB (lnum 8) every time, and that c->ltab[i].dirty for LEB 8 does not equal c->leb_size after lpt_gc_lnum() is invoked, so need_write_all() never becomes false. After analyzing the lpt nodes on LEB 8, it turned out that lpt_gc_lnum() returns early, before scanning all the lpt nodes. LEB 8 is dumped as follows (partial):

[  104.740309] LEB 8:14383 len 13, nnode num 31,
[  104.740689] dirty 1
[  104.740905] LEB 8:14396 len 13, nnode num 7,
[  104.741277] dirty 1
[  104.741486] LEB 8:14409 len 13, nnode num 1,
[  104.741870] dirty 1
[  104.742078] LEB 8:14422 len 16, pnode num 745
[  104.742475] dirty 1
[  104.742682] B type 8 0
[  104.742925] LEB 8:14438, pad 2 bytes min_io_size 8
[  104.743301] LEB 8:14440, free 1368 bytes  // Actually, the remaining 1368 bytes are not 0xff; the scanning function (dump_lpt_leb) parses the lpt nodes in a wrong way
[  104.743674] (pid 1095) finish dumping LEB 8

The binary image for LEB 8 (partial; the first line below starts at offset 0x3840 = 14400):

 00003840: 6a e4 60 cf 91 b1 f3 82 03 17 59 11 40 ac b9 fc 99 11 83 c3 83 03 ff ff 90 6e c3 ec 04 f3 26 a1  j.`.......Y.@............n....&.
 00003860: bf 09 41 a2 6f 94 15 09 58 ee 5f ce 97 7e 09 b8 86 a0 d8 2c 62 3b 47 37 62 e5 e8 59 86 be 82 fe  ..A.o...X._..~.....,b;G7b..Y....
 00003880: 17 6d 63 95 ce 80 76 6e ad e6 44 af f6 43 06 ab 41 28 04 99 72 1f 31 91 cb 96 b1 ef 43 6e 22 2c  .mc...vn..D..C..A(..r.1....Cn",
 000038a0: 26 57 d0 9c b5 76 8b 08 1d fc 41 07 8c ba 26 3b 45 e1 7b 23 de d5 19 63 f3 6c e8 95 b7 02 5a 89  &W...v....A...&;E.{#...c.l....Z.
 000038c0: 83 81 0e 72 7c 4b 59 a3 c4 c0 e1 e5 22 7c 27 8d 85 ad c2 93 25 ac 5b 32 c8 02 07 2f 24 f9 e0 f6  ...r|KY....."|'.....%.[2.../$...
 000038e0: e3 87 f2 bb 62 23 d5 e4 2e b7 8c 41 61 43 2a a4 2f ce 92 4f 62 47 88 a2 11 a6 51 1f da 51 e7 a4  ....b#.....AaC*./..ObG....Q..Q..


Let's parse the above data the way lpt_gc_lnum() does.

The nnode (num 1) is at 8:14409~14421, and the corresponding data is '17 59 11 40 ac b9 fc 99 11 83 c3 83 03'. The type field is the lower UBIFS_LPT_TYPE_BITS (4) bits of '0x11' according to ubifs_pack_nnode(); the data looks good and can be parsed as an nnode. The next 2 bytes (8:14422~14423) are 0xff, because lpt data is written to flash with an alignment of 8 bytes (see write_cnodes). After modifying lpt_gc_lnum() to make UBIFS skip these 2 bytes of 0xff, UBIFS could parse all the lpt nodes in LEB 8. Without that change, however, UBIFS parses these 2 bytes (0xff) as the crc field of a pnode at 8:14422~14437, and the crc16 computed over that pnode happens to be exactly 0xffff, so the area 8:14422~14437 is accepted as a pnode, and the remaining lpt nodes cannot be parsed because the parsing offset is now wrong.
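
For reference, the acceptance check that the two 0xff bytes slip through looks roughly like this (a sketch paraphrased from is_a_node() in fs/ubifs/lpt_commit.c; exact code differs across kernel versions). Note that the node length is implied by the type bits alone; nothing else ties the candidate node to the real node boundaries:

static int is_a_node(const struct ubifs_info *c, uint8_t *buf, int len)
{
  uint8_t *addr = buf + UBIFS_LPT_CRC_BYTES;
  int pos = 0, node_type, node_len;
  uint16_t crc, calc_crc;

  if (len < UBIFS_LPT_CRC_BYTES + (UBIFS_LPT_TYPE_BITS + 7) / 8)
    return 0;
  node_type = ubifs_unpack_bits(c, &addr, &pos, UBIFS_LPT_TYPE_BITS);
  if (node_type == UBIFS_LPT_NOT_A_NODE)
    return 0;
  node_len = get_lpt_node_len(c, node_type); /* length implied by type */
  if (!node_len || node_len > len)
    return 0;
  pos = 0;
  addr = buf;
  crc = ubifs_unpack_bits(c, &addr, &pos, UBIFS_LPT_CRC_BITS);
  calc_crc = crc16(-1, buf + UBIFS_LPT_CRC_BYTES,
                   node_len - UBIFS_LPT_CRC_BYTES);
  /* At 8:14422 the stored "crc" is the 0xff 0xff padding (0xffff), the
     lower 4 bits of the next byte (0x90) give node_type 0, i.e. pnode,
     and crc16() over the following 14 bytes happens to be 0xffff too,
     so the padding is accepted as the start of a 16-byte pnode. */
  if (crc != calc_crc)
    return 0;
  return 1;
}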


Why can it happen?

The root cause is that the lpt area's disk layout is kept simple: an LPT node carries no length field. It would be better if UBIFS had a length field in each LPT node. Without one, it is possible for the crc16 check to pass both for offset_A~offset_B (node X) and for offset_A+2~offset_C (node Y).
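
For illustration only, the missing piece described above could look like this hypothetical header (this is NOT the real on-disk format, and no layout change is actually being proposed here):

/* HYPOTHETICAL sketch, not the real UBIFS LPT format: with an explicit
   length field, a scanner could cross-check a candidate node's claimed
   extent against where the previous node ended, instead of trusting a
   crc16 match alone. */
struct lpt_node_hdr {
  uint16_t crc;  /* crc16 over the rest of the node */
  uint8_t  type; /* packed into UBIFS_LPT_TYPE_BITS in the real format */
  uint16_t len;  /* explicit node length: the field UBIFS lacks */
} __packed;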


Will it happen on a NAND flash?

In theory, I would say 'yes'. But I never hit it after testing for a whole day. I guess that because min_io_size for NAND is (at least) 512, the run of padding bytes (0xff) is rarely shorter than 3 bytes, so it is hard to reproduce a case where the crc16 check passes both for offset_A~offset_B (node X) and for offset_A+2~offset_C (node Y).
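
For intuition about the odds, here is a self-contained userspace sketch (an illustration, not kernel code: the crc16() below mirrors the reflected 0x8005 polynomial of lib/crc16.c, the 0xffff seed matches the crc16(-1, ...) call in the LPT code, and the 14-byte window size is taken from the 16-byte pnode in this report minus its 2 crc bytes):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* bitwise crc16, same reflected 0x8005 (0xa001) polynomial as lib/crc16.c */
static uint16_t crc16(uint16_t crc, const uint8_t *buf, size_t len)
{
  while (len--) {
    crc ^= *buf++;
    for (int i = 0; i < 8; i++)
      crc = (crc >> 1) ^ ((crc & 1) ? 0xa001 : 0);
  }
  return crc;
}

int main(void)
{
  enum { NODE_LEN = 14 };     /* crc-covered part of a 16-byte pnode */
  const long trials = 1L << 24;
  uint8_t buf[NODE_LEN];
  long hits = 0;

  srand(1);
  for (long t = 0; t < trials; t++) {
    for (int i = 0; i < NODE_LEN; i++)
      buf[i] = rand() & 0xff;
    /* a fake node is accepted when the 0xffff "crc" read from padding
       matches crc16 over the bytes that follow: ~2^-16 per attempt */
    if (crc16(0xffff, buf, NODE_LEN) == 0xffff)
      hits++;
  }
  printf("%ld of %ld random windows pass (expected ~%.0f)\n",
         hits, trials, trials / 65536.0);
  return 0;
}

So roughly 1 in 65536 candidate offsets passes the check; a NOR setup with min_io_size 8 produces short pads (and thus such candidate offsets) constantly, while NAND's larger min_io_size makes a 2-byte pad rare.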


How to reproduce it?

1. You can mount the problem image (disk.tar.gz) directly with the script mount.sh.

2. You can generate a problem image with the script test.sh (when you see a hung task warning, or the utilization of one CPU reaches 100%, the problem has occurred).

PS: I report the problem as 'no issue', because I don't think we can fix it without modifying the disk layout. I think it is just a design nit, and there is no need to fix it. I just want people to know about the problem in case someone meets it one day.
Comment 1 Zhihao Cheng 2024-06-05 03:00:42 UTC
Created attachment 306411 [details]
test.sh