Bug 219548 - ZRAM: the kernel crashes when storing an EXT4 file system in a ZRAM device
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P3 high
Assignee: drivers_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-12-02 08:56 UTC by Yu Huabing
Modified: 2024-12-19 10:59 UTC
CC List: 4 users

See Also:
Kernel Version: all
Subsystem:
Regression: No
Bisected commit-id:


Description Yu Huabing 2024-12-02 08:56:58 UTC
modprobe zram num_devices=1
echo 512M > /sys/block/zram0/disksize
mkfs.ext4 /dev/zram0
mount /dev/zram0 /tmp
Many processes then write files under the directory "/tmp". Under a severe shortage of memory, the kernel crashes.

Linux 5.4, function __zram_bvec_write() in the file "drivers/block/zram/zram_drv.c":
(1) Compress the data in a page. Assume the compressed data length is 100 bytes.
ret = zcomp_compress(zstrm, src, &comp_len);
(2) Call zs_malloc() without the flag ___GFP_DIRECT_RECLAIM. The attempt to allocate an object from the 112-byte size class of the zs_pool fails.
	if (!handle)
		handle = zs_malloc(zram->mem_pool, comp_len,
				__GFP_KSWAPD_RECLAIM |
				__GFP_NOWARN |
				__GFP_HIGHMEM |
				__GFP_MOVABLE);
(3) Call zs_malloc() again, this time with the flag ___GFP_DIRECT_RECLAIM (it is part of GFP_NOIO). An object is successfully allocated from the 112-byte size class of the zs_pool.
	if (!handle) {
		zcomp_stream_put(zram->comp);
		atomic64_inc(&zram->stats.writestall);
		handle = zs_malloc(zram->mem_pool, comp_len,
				GFP_NOIO | __GFP_HIGHMEM |
				__GFP_MOVABLE);
		if (handle)
			goto compress_again;
		return -ENOMEM;
	}
(4) Compress the data in the page again. This physical page stores EXT4 metadata, and processes writing files have changed that metadata in the meantime, so the length of the compressed data changes. Assume the compressed data length is now 200 bytes.
ret = zcomp_compress(zstrm, src, &comp_len);
(5) Write the 200 bytes of compressed data into the 112-byte object. The copy runs past the end of the object and overwrites the next object.
	dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);

	src = zstrm->buffer;
	if (comp_len == PAGE_SIZE)
		src = kmap_atomic(page);
	memcpy(dst, src, comp_len);
	if (comp_len == PAGE_SIZE)
		kunmap_atomic(src);
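
Steps (1) to (5) can be modelled in userspace. The sketch below is not zsmalloc code: toy_size_class() assumes size classes spaced 16 bytes apart, which matches the 100-byte to 112-byte example in this report but ignores handle overhead and class merging, and toy_overrun() is a hypothetical helper that reports how many bytes the memcpy() in step (5) would write past the end of the object.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: size classes are spaced 16 bytes apart. Real zsmalloc also
 * reserves room for a handle and merges classes; this is only a model. */
#define TOY_CLASS_DELTA 16u

/* Round a compressed length up to the toy size class that would hold it. */
size_t toy_size_class(size_t comp_len)
{
	assert(comp_len > 0);
	return (comp_len + TOY_CLASS_DELTA - 1) / TOY_CLASS_DELTA * TOY_CLASS_DELTA;
}

/* Number of bytes written past the end of an object of obj_size bytes
 * when copy_len bytes are copied into it (0 if the copy fits). */
size_t toy_overrun(size_t obj_size, size_t copy_len)
{
	return copy_len > obj_size ? copy_len - obj_size : 0;
}
```

With the numbers from this report, toy_size_class(100) is 112 and toy_overrun(toy_size_class(100), 200) is 88, so in this model 88 bytes of the neighbouring object would be overwritten.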

When a ZRAM device is created, the driver sets the flag BDI_CAP_STABLE_WRITES.
zram_add()
	zram->disk->queue->backing_dev_info->capabilities |=
			(BDI_CAP_STABLE_WRITES | BDI_CAP_SYNCHRONOUS_IO);

If the flag BDI_CAP_STABLE_WRITES is set for the storage device, a write to a page of a file in the EXT4 file system first waits for any in-progress write-back of that page to the storage device to complete.
ext4_write_begin()
	lock_page(page);
	if (page->mapping != mapping) {
		/* The page got truncated from under us */
		unlock_page(page);
		put_page(page);
		ext4_journal_stop(handle);
		goto retry_grab;
	}
	/* In case writeback began while the page was unlocked */
	wait_for_stable_page(page);

void wait_for_stable_page(struct page *page)
{
	if (bdi_cap_stable_pages_required(inode_to_bdi(page->mapping->host)))
		wait_on_page_writeback(page);
}

static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
{
	return bdi->capabilities & BDI_CAP_STABLE_WRITES;
}

The flag BDI_CAP_STABLE_WRITES only takes effect for file data, not for the metadata of the file system. While a physical page storing EXT4 metadata is being written back to the ZRAM device, a process writing a file can still change that metadata.
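
That a small metadata change alters the compressed length is easy to reproduce outside the kernel. The following sketch uses a toy run-length encoder, toy_rle_len(), as a hypothetical stand-in for the real zcomp backend. toy_len_after_update() is also hypothetical: it plays the role of one zcomp_compress() pass over a 4096-byte bitmap page, either before or after a concurrent writer has set one allocation bit.

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u

/* Length of a toy run-length encoding: every run costs 2 output bytes
 * (count, value) and a run is capped at 255 input bytes. */
size_t toy_rle_len(const unsigned char *buf, size_t n)
{
	size_t out = 0, i = 0;

	while (i < n) {
		size_t run = 1;

		while (i + run < n && run < 255 && buf[i + run] == buf[i])
			run++;
		out += 2;
		i += run;
	}
	return out;
}

/* Compress an all-zero "bitmap" page. If set_bit is nonzero, one bit is
 * set first, the way a block allocation would modify the bitmap. */
size_t toy_len_after_update(int set_bit)
{
	unsigned char page[TOY_PAGE_SIZE];

	memset(page, 0, sizeof(page));
	if (set_bit)
		page[0] |= 0x01;
	return toy_rle_len(page, sizeof(page));
}
```

In this model the untouched page encodes to 34 bytes and the page with one bit set encodes to 36 bytes. A real compressor will give different numbers, but the point carries over: an object sized for the first result is not guaranteed to hold the second.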

All Linux versions that support ZRAM devices have this bug.
Comment 1 Yu Huabing 2024-12-11 06:30:25 UTC
Analysis of the Linux 6.12 code.

fs/ext4/mballoc.c
/* Modify block bitmap and group descriptor */
static int
ext4_mb_mark_context(handle_t *handle, struct super_block *sb, bool state,
		     ext4_group_t group, ext4_grpblk_t blkoff,
		     ext4_grpblk_t len, int flags, ext4_grpblk_t *ret_changed)
{
	...
	bitmap_bh = ext4_read_block_bitmap(sb, group);
	...
	gdp = ext4_get_group_desc(sb, group, &gdp_bh);
	...
	if (state) {
		mb_set_bits(bitmap_bh->b_data, blkoff, len);
		ext4_free_group_clusters_set(sb, gdp,
			ext4_free_group_clusters(sb, gdp) - changed);
	} else {
		mb_clear_bits(bitmap_bh->b_data, blkoff, len);
		ext4_free_group_clusters_set(sb, gdp,
			ext4_free_group_clusters(sb, gdp) + changed);
	}

	ext4_block_bitmap_csum_set(sb, gdp, bitmap_bh);
	ext4_group_desc_csum_set(sb, group, gdp);
	...

	err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
	if (err)
		goto out_err;
	err = ext4_handle_dirty_metadata(handle, NULL, gdp_bh);
	if (err)
		goto out_err;
	...
}

ext4_read_block_bitmap() -> ext4_read_block_bitmap_nowait()
fs/ext4/balloc.c
struct buffer_head *
ext4_read_block_bitmap_nowait(struct super_block *sb, ext4_group_t block_group,
			      bool ignore_locked)
{
	...
	desc = ext4_get_group_desc(sb, block_group, NULL);
	...
	bitmap_blk = ext4_block_bitmap(sb, desc);
	...
	bh = sb_getblk(sb, bitmap_blk);
	...
	if (bitmap_uptodate(bh))
		goto verify;

	lock_buffer(bh);
	if (bitmap_uptodate(bh)) {
		unlock_buffer(bh);
		goto verify;
	}
	...
}

/* Write the dirty block buffers of the block device back to the storage device */
__writeback_single_inode() -> do_writepages()
-> mapping->a_ops->writepages() = blkdev_writepages()
-> write_cache_pages() -> block_write_full_folio() -> __block_write_full_folio()
fs/buffer.c
int __block_write_full_folio(struct inode *inode, struct folio *folio,
			get_block_t *get_block, struct writeback_control *wbc)
{
	...
	do {
		if (!buffer_mapped(bh))
			continue;
		if (wbc->sync_mode != WB_SYNC_NONE) {
			lock_buffer(bh);
		} else if (!trylock_buffer(bh)) { /* lock the block buffer */
			folio_redirty_for_writepage(wbc, folio);
			continue;
		}
		if (test_clear_buffer_dirty(bh)) {
			mark_buffer_async_write_endio(bh,
				end_buffer_async_write);/* When write-back completes, call end_buffer_async_write() to unlock the block buffer and clear the PG_writeback flag for the page */
		} else {
			unlock_buffer(bh);
		}
	} while ((bh = bh->b_this_page) != head);

	...
	folio_start_writeback(folio);/* call folio_test_set_writeback(folio) to set the PG_writeback flag for the page */

	do {
		struct buffer_head *next = bh->b_this_page;
		if (buffer_async_write(bh)) {
			submit_bh_wbc(REQ_OP_WRITE | write_flags, bh,
				      inode->i_write_hint, wbc);
			nr_underway++;
		}
		bh = next;
	} while (bh != head);
	folio_unlock(folio);
	...
}

When EXT4 prepares to modify the block bitmap and the group descriptor, it neither locks the block buffers nor waits for the completion of a write-back to the storage device that has already started.
When a block buffer is written back to the storage device, the block buffer is locked and the write-back flag is set for the page. Upon completion of the write-back, the block buffer is unlocked and the write-back flag is cleared.
The block device has the flag BLK_FEAT_STABLE_WRITES set, but EXT4 still does not wait for an already started write-back to complete before it modifies its metadata.
I found this bug in Linux 5.4, and I am sure that Linux 6.12 has it as well.
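
The asymmetry described above can be written down as a tiny truth-table model. This is a sketch of the behaviour as this report describes it, not kernel code: toy_writer_waits() and toy_modify_races() are hypothetical names, and the three booleans stand for the device's stable-writes flag, the page's write-back flag, and whether the page holds metadata.

```c
#include <stdbool.h>

/* Models wait_for_stable_page(): only the file-data path checks the
 * stable-writes flag and waits for an in-flight write-back. */
bool toy_writer_waits(bool bdev_stable, bool under_writeback, bool is_metadata)
{
	return !is_metadata && bdev_stable && under_writeback;
}

/* A modification races with write-back when write-back is in flight
 * and nothing made the writer wait for it. */
bool toy_modify_races(bool bdev_stable, bool under_writeback, bool is_metadata)
{
	return under_writeback &&
	       !toy_writer_waits(bdev_stable, under_writeback, is_metadata);
}
```

For a stable-writes device with write-back in flight, toy_modify_races() is false for file data and true for metadata, which is the window in which the page contents can change between the two compression passes.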
Comment 2 Yu Huabing 2024-12-19 10:59:42 UTC
I confirm that our company has disabled the journaling feature of the EXT4 file system.
mkfs.ext4 -O ^has_journal /dev/zram0
mount /dev/zram0 /tmp
If the journaling feature of the EXT4 file system is disabled, a metadata block buffer is not locked while it is modified, so the modification can run concurrently with the write-back of that metadata block buffer to the storage device.

mkfs.ext4 /dev/zram0 or mkfs.ext4 -O has_journal /dev/zram0
mount /dev/zram0 /tmp
With the commands above, the journaling feature of the EXT4 file system is enabled.
When a transaction is committed, a copy of each block buffer contained in the transaction is made and written to the journal, which ensures that the copy is not modified during the journaling process.
As a result, writing a metadata block buffer back to the storage device and modifying that metadata block buffer are not executed concurrently.

The suggestions are as follows.
(1) Add a safeguard to the function __zram_bvec_write() of the ZRAM device driver to prevent the kernel crash: if the length of the data after the second compression differs from the length after the first compression, print a warning message and reallocate an object from the compression pool.
(2) Add a note to the document "admin-guide/blockdev/zram.rst": "While the content of a physical page is being written back to the ZRAM device, the content of the physical page must not be modified. Therefore, when storing an EXT4 file system in a ZRAM device, the journaling feature must not be disabled."
(3) When an EXT4 file system is mounted, report an error if the STABLE_WRITES flag is set for the storage device and the journaling feature of the EXT4 file system is disabled.
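
Suggestions (1) and (3) boil down to two small checks. The sketch below shows them as plain userspace C under loud assumptions: enum toy_action, toy_check_recompress() and toy_mount_allowed() are hypothetical names that do not exist in the kernel, and a real patch would have to fit into the existing retry logic of __zram_bvec_write() and the EXT4 mount path.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* What the write path should do after the second compression pass. */
enum toy_action { TOY_STORE, TOY_REALLOC };

/* Suggestion (1): never copy into an object that was sized for a
 * different compressed length. Warn and ask for a fresh allocation. */
enum toy_action toy_check_recompress(size_t first_len, size_t second_len)
{
	if (second_len != first_len) {
		fprintf(stderr,
			"toy zram: compressed length changed from %zu to %zu, reallocating\n",
			first_len, second_len);
		return TOY_REALLOC;
	}
	return TOY_STORE;
}

/* Suggestion (3): refuse the mount when the device requires stable
 * writes but the file system has no journal. */
bool toy_mount_allowed(bool stable_writes, bool has_journal)
{
	return !(stable_writes && !has_journal);
}
```

With the lengths from the description, toy_check_recompress(100, 200) returns TOY_REALLOC instead of letting 200 bytes be copied into a 112-byte object, and toy_mount_allowed(true, false) returns false for the journal-less setup from Comment 2.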
