Bug 206443 - general protection fault in ext4 during simultaneous online resize and write operations
Summary: general protection fault in ext4 during simultaneous online resize and write ...
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: ext4 (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-02-06 19:16 UTC by Suraj
Modified: 2020-06-29 04:27 UTC (History)
5 users (show)

See Also:
Kernel Version: 5.5
Tree: Mainline
Regression: No


Attachments
proposed_fix.patch (2.80 KB, text/plain)
2020-02-06 19:16 UTC, Suraj
Details
testcase (292 bytes, text/plain)
2020-02-06 19:17 UTC, Suraj
Details
ext4_mb_load_buddy_gfp.trace (4.17 KB, text/plain)
2020-02-06 19:19 UTC, Suraj
Details
ext4_free_blocks.trace (4.11 KB, text/plain)
2020-02-06 19:19 UTC, Suraj
Details
ext4_free_inode.trace (3.78 KB, text/plain)
2020-02-06 19:20 UTC, Suraj
Details
__ext4_new_inode.trace (4.03 KB, text/plain)
2020-02-06 19:21 UTC, Suraj
Details
s_group_info.patch (2.35 KB, patch)
2020-02-07 23:22 UTC, Suraj
Details | Diff
s_flex_group.patch (10.25 KB, patch)
2020-02-07 23:23 UTC, Suraj
Details | Diff

Description Suraj 2020-02-06 19:16:37 UTC
Created attachment 287189 [details]
proposed_fix.patch

While writing to an ext4 file system partition during simultaneous online resize a general protection fault was encountered.

Reproducer:

truncate -s 100G /tmp/foo
sudo bash -c 'while true; do dd if=/dev/zero of=/mnt/xxx bs=1M count=1; sync; rm /mnt/xxx; done' &
while true; do mkfs.ext4 -b 1024 -E resize=26213883 /tmp/foo 2096635 -F; sudo mount -o loop /tmp/foo /mnt; sudo resize2fs /dev/loop0 26213883; sudo umount /mnt; done

The following call trace was observed:

[  886.837106] RIP: 0010:ext4_get_group_desc+0x46/0xa0 [ext4]
[  886.844922] Code: 41 8b 8a a8 00 00 00 41 89 f1 41 8b 42 38 41 d3 e9 49 8b 4a 70 83 e8 01 45 89 c8 21 f0 4a 8b 0c c1 48 85 c9 74 30 49 0f af 02 <48> 03 41 28 48 85 d2 74 03 48 89 0a f3 c3 41 89 f0 48 c7 c1 b8 47
[  886.857215] RSP: 0018:ffffc9000018f7d0 EFLAGS: 00010202
[  886.860998] RAX: 0000000000000040 RBX: ffff8887d634a000 RCX: 6f26075a7d3c6d9e
[  886.865578] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8887d634a000
[  886.870148] RBP: ffff8887d634c000 R08: 0000000000000000 R09: 0000000000000000
[  886.874731] R10: ffff8887d634c000 R11: 0000000000000000 R12: 0000000000000001
[  886.879306] R13: 0000000000000000 R14: ffff8887d634c000 R15: ffff8887d0b32000
[  886.883881] FS:  0000000000000000(0000) GS:ffff8887dfa00000(0000) knlGS:0000000000000000
[  886.890293] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  886.894293] CR2: 00007fc03aa34f30 CR3: 000000000200a006 CR4: 00000000007606e0
[  886.898875] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  886.903522] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  886.908267] PKRU: 55555554
[  886.911046] Call Trace:
[  886.913708]  ext4_read_block_bitmap_nowait+0x2a/0x600 [ext4]
[  886.917718]  ext4_read_block_bitmap+0x14/0x50 [ext4]
[  886.921408]  ext4_mb_mark_diskspace_used+0x58/0x380 [ext4]
[  886.925317]  ext4_mb_new_blocks+0x2a4/0x720 [ext4]
[  886.928928]  ? ext4_find_extent+0x295/0x2e0 [ext4]
[  886.932543]  ext4_ext_map_blocks+0xa60/0xd70 [ext4]
[  886.936191]  ? __lock_page_killable+0x240/0x260
[  886.939698]  ext4_map_blocks+0x3ae/0x5d0 [ext4]
[  886.943205]  ext4_writepages+0x7bc/0xe70 [ext4]
[  886.946712]  ? nvme_queue_rq+0x4d8/0xa90 [nvme]
[  886.950207]  ? __update_load_avg_cfs_rq+0x12b/0x2b0
[  886.953846]  ? __update_load_avg_cfs_rq+0x12b/0x2b0
[  886.957491]  ? do_writepages+0x4b/0xe0
[  886.960673]  ? ext4_mark_inode_dirty+0x1d0/0x1d0 [ext4]
[  886.964458]  do_writepages+0x4b/0xe0
[  886.967566]  ? enqueue_task_fair+0xa8/0xa40
[  886.970914]  ? __writeback_single_inode+0x3d/0x320
[  886.974521]  __writeback_single_inode+0x3d/0x320
[  886.978052]  ? ttwu_do_wakeup+0x19/0x140
[  886.981288]  writeback_sb_inodes+0x1b5/0x490
[  886.984675]  __writeback_inodes_wb+0x5d/0xb0
[  886.988069]  wb_writeback+0x265/0x2f0
[  886.991208]  ? wb_workfn+0x33f/0x400
[  886.994315]  wb_workfn+0x33f/0x400
[  886.997341]  process_one_work+0x195/0x380
[  887.000623]  worker_thread+0x30/0x390
[  887.003759]  ? process_one_work+0x380/0x380
[  887.007112]  kthread+0x113/0x130
[  887.010065]  ? kthread_park+0x90/0x90
[  887.013199]  ret_from_fork+0x35/0x40
[  887.016305] Modules linked in: loop ext4 crc16 mbcache jbd2 sunrpc mousedev evdev psmouse button ena ip_tables x_tables xfs libcrc32c crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd nvme cryptd glue_helper nvme_core dm_mirror dm_region_hash dm_log dm_mod dax ipv6 crc_ccitt nf_defrag_ipv6 autofs4
[  887.035407] ---[ end trace dc25e57808972176 ]---
[  887.038960] RIP: 0010:ext4_get_group_desc+0x46/0xa0 [ext4]
[  887.042867] Code: 41 8b 8a a8 00 00 00 41 89 f1 41 8b 42 38 41 d3 e9 49 8b 4a 70 83 e8 01 45 89 c8 21 f0 4a 8b 0c c1 48 85 c9 74 30 49 0f af 02 <48> 03 41 28 48 85 d2 74 03 48 89 0a f3 c3 41 89 f0 48 c7 c1 b8 47
[  887.055189] RSP: 0018:ffffc9000018f7d0 EFLAGS: 00010202
[  887.058992] RAX: 0000000000000040 RBX: ffff8887d634a000 RCX: 6f26075a7d3c6d9e
[  887.063592] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8887d634a000
[  887.068197] RBP: ffff8887d634c000 R08: 0000000000000000 R09: 0000000000000000
[  887.072795] R10: ffff8887d634c000 R11: 0000000000000000 R12: 0000000000000001
[  887.077399] R13: 0000000000000000 R14: ffff8887d634c000 R15: ffff8887d0b32000
[  887.081996] FS:  0000000000000000(0000) GS:ffff8887dfa00000(0000) knlGS:0000000000000000
[  887.088436] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  887.092447] CR2: 00007fc03aa34f30 CR3: 000000000200a006 CR4: 00000000007606e0
[  887.097060] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  887.101651] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  887.106252] PKRU: 55555554

Progress:

This looked likely to be a use after free in ext4_get_group_desc() of EXT4_SB(sb)->s_group_desc while it was being resized in add_new_gdb():
o_group_desc = EXT4_SB(sb)->s_group_desc;
{snip}
EXT4_SB(sb)->s_group_desc = n_group_desc;
EXT4_SB(sb)->s_gdb_count++;
kvfree(o_group_desc);

Proposed fix:

The attached patch was proposed as a fix to use rcu locking around the access to s_group_desc.
A test run with this patch was done and while the initial problem was no longer encountered. New call traces (attached) were encountered this time.
Comment 1 Suraj 2020-02-06 19:17:38 UTC
Created attachment 287191 [details]
testcase
Comment 2 Suraj 2020-02-06 19:19:07 UTC
Created attachment 287193 [details]
ext4_mb_load_buddy_gfp.trace

Call trace in ext4_mb_load_buddy_gfp
Comment 3 Suraj 2020-02-06 19:19:51 UTC
Created attachment 287195 [details]
ext4_free_blocks.trace

Call trace in ext4_free_blocks
Comment 4 Suraj 2020-02-06 19:20:31 UTC
Created attachment 287197 [details]
ext4_free_inode.trace

Call trace in ext4_free_inode
Comment 5 Suraj 2020-02-06 19:21:02 UTC
Created attachment 287199 [details]
__ext4_new_inode.trace

Call trace in __ext4_new_inode
Comment 6 Suraj 2020-02-06 19:25:53 UTC
Initial bug was hit reliably (~95% of the time) within 30 minutes.
The following traces only occurred on ~50% of runs and some times taking up to 5 hours to hit.
Comment 7 Suraj 2020-02-07 06:05:36 UTC
The other crashes look to be related to access sbi->s_flex_groups and sbi->s_group_info which are reallocated in the same way that s_group_desc is on resize.
Comment 8 Suraj 2020-02-07 23:22:22 UTC
I've attached 2 additional patches which seem to resolve the s_group_info and s_flex_group issues in the same way as the s_group_desc reallocation was resolved. That is to used rcu to protect access to the pointer.
Comment 9 Suraj 2020-02-07 23:22:52 UTC
Created attachment 287249 [details]
s_group_info.patch
Comment 10 Suraj 2020-02-07 23:23:35 UTC
Created attachment 287251 [details]
s_flex_group.patch
Comment 11 Theodore Tso 2020-02-15 20:43:29 UTC
Hi Suraj,

The two patches s_group_info.patch and s_flex_group.patch appear to use the helper function rcu_sbi_array_dereference() without defining it?
Comment 12 Theodore Tso 2020-02-16 01:46:49 UTC
I've posted a proposed improvement[1] to the first proposed patch[2] on LKML.

[1] https://lore.kernel.org/r/20200215233817.GA670792@mit.edu
[2] https://bugzilla.kernel.org/attachment.cgi?id=287189

Suraj, please note that your patches are whitespace damaged, and are lacking the Developer's Certification of Origin.   In the future, it would save me a lot of time you take a look at the Submitting Patches[3] instructions from the kernel documentation.

[3] https://www.kernel.org/doc/html/latest/process/submitting-patches.html

You can either use e-mail to linux-ext4@vger.kernel.org or attach patches to a Bugzilla entry. although the former is certainly preferred.  It's better to send a proposal to the linux-ext4@vger.kernel.org, since that way the patch can also get tracked via patchwork[4], and on lore.kernel.org, as in [1] above.

[4] http://patchwork.ozlabs.org/project/linux-ext4/list/
Comment 13 Suraj 2020-02-19 19:30:40 UTC
Your approach using call_rcu is very similar to something I was also exploring so that looks good to me.

Apologies, I didn't realise that patches attached to the BZ had to be mailing list ready.

There is also the same issue present with the resizing of the s_group_info and s_flex_groups arrays which are addressed in the following patches which I have posted to the EXT4 mailing list:
[1/3] ext4: introduce macro sbi_array_rcu_deref() to access rcu protected fields [1]
[2/3] ext4: fix potential race between s_group_info online resizing and access [2]
[3/3] ext4: fix potential race between s_flex_groups online resizing and access [3]

[1] http://patchwork.ozlabs.org/patch/1240507/
[2] http://patchwork.ozlabs.org/patch/1240506/
[3] http://patchwork.ozlabs.org/patch/1240509/
Comment 14 Theodore Tso 2020-02-20 04:26:23 UTC
Patches to BZ don't have to be perfect, or mailing list ready.  But it would be nice if they actually applied (e.g., not be white-space damaged) and if they actually compiled (not be missing macro definitions).  :-)

In my experience, bugzilla is good for collecting data when we are trying to root-cause a problem.    But it's a lot more work to look at a bug in BZ, since we have to download it first.   Where as if it is sent to the mailing list, it's a lot easier to review it and to send back comments.

For that matter, it's fine to send patches to the mailing list that aren't ready to be applied.   Using a [PATCH RFC] subject prefix is a good way to make that clear; Linus Torvalds has been known to post patches with "Warning!  I haven't even tried to compile it yet"; this is just to show the approach I'm thinking of.   What's important is to make sure expectations are set for why the patch is being sent to the list or being uploaded to BZ.

Thanks for your work on this bug!

Note You need to log in before you can comment on or make changes to this bug.