    modprobe zram num_devices=1
    echo 512M > /sys/block/zram0/disksize
    mkfs.ext4 /dev/zram0
    mount /dev/zram0 /tmp

Many processes write files under the directory "/tmp". When memory is
severely short, the kernel crashes.

Linux 5.4: function __zram_bvec_write() in the file
"drivers/block/zram/zram_drv.c"

(1) Compress the data in a page. Assume the compressed data length is
    100 bytes.

    ret = zcomp_compress(zstrm, src, &comp_len);

(2) Call zs_malloc() without the flag ___GFP_DIRECT_RECLAIM. It fails to
    allocate an object from the 112-byte size class in the zs_pool.

    if (!handle)
        handle = zs_malloc(zram->mem_pool, comp_len,
                __GFP_KSWAPD_RECLAIM |
                __GFP_NOWARN |
                __GFP_HIGHMEM |
                __GFP_MOVABLE);

(3) Call zs_malloc() with the flag ___GFP_DIRECT_RECLAIM (included in
    GFP_NOIO). It successfully allocates an object from the 112-byte
    size class in the zs_pool.

    if (!handle) {
        zcomp_stream_put(zram->comp);
        atomic64_inc(&zram->stats.writestall);
        handle = zs_malloc(zram->mem_pool, comp_len,
                GFP_NOIO | __GFP_HIGHMEM |
                __GFP_MOVABLE);
        if (handle)
            goto compress_again;
        return -ENOMEM;
    }

(4) Compress the data in the page again. This physical page stores
    metadata of the EXT4 file system, and processes writing files have
    changed that metadata in the meantime, so the length of the
    compressed data changes. Assume the compressed data length is now
    200 bytes.

    ret = zcomp_compress(zstrm, src, &comp_len);

(5) Writing 200 bytes of compressed data into the 112-byte object
    overwrites the next object.

    dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);

    src = zstrm->buffer;
    if (comp_len == PAGE_SIZE)
        src = kmap_atomic(page);
    memcpy(dst, src, comp_len);
    if (comp_len == PAGE_SIZE)
        kunmap_atomic(src);

When a ZRAM device is created, the flag BDI_CAP_STABLE_WRITES is set.
zram_add()

    zram->disk->queue->backing_dev_info->capabilities |=
            (BDI_CAP_STABLE_WRITES | BDI_CAP_SYNCHRONOUS_IO);

If the flag BDI_CAP_STABLE_WRITES is set for the storage device, then
before a page of a file in the EXT4 file system is written, the writer
first waits for any write-back of this page to the storage device to
complete.

ext4_write_begin()

    lock_page(page);
    if (page->mapping != mapping) {
        /* The page got truncated from under us */
        unlock_page(page);
        put_page(page);
        ext4_journal_stop(handle);
        goto retry_grab;
    }
    /* In case writeback began while the page was unlocked */
    wait_for_stable_page(page);

    void wait_for_stable_page(struct page *page)
    {
        if (bdi_cap_stable_pages_required(inode_to_bdi(page->mapping->host)))
            wait_on_page_writeback(page);
    }

    static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
    {
        return bdi->capabilities & BDI_CAP_STABLE_WRITES;
    }

The flag BDI_CAP_STABLE_WRITES only takes effect for file data, not for
the metadata of the file system. While a physical page storing EXT4
metadata is being written back to the ZRAM device, a process writing a
file can change that EXT4 metadata.

All Linux versions that support ZRAM devices have this bug.
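The arithmetic behind steps (1) to (5) of the __zram_bvec_write()
walk-through can be sketched in user space. This is only an illustrative
model, not kernel code: the 16-byte spacing between size classes and the
helper names toy_size_class() and toy_overflow_bytes() are assumptions
made for the example, and real zsmalloc additionally merges classes and
accounts for a handle.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed spacing between size classes (illustrative value). */
#define TOY_CLASS_DELTA 16

/* Round a request up to the capacity of its (toy) size class. */
static size_t toy_size_class(size_t len)
{
    assert(len > 0);
    return (len + TOY_CLASS_DELTA - 1) / TOY_CLASS_DELTA * TOY_CLASS_DELTA;
}

/*
 * Number of bytes written past the end of an object that was sized for
 * first_len but later receives second_len bytes of compressed data.
 * Returns 0 when the second result still fits.
 */
static size_t toy_overflow_bytes(size_t first_len, size_t second_len)
{
    size_t capacity = toy_size_class(first_len);

    return second_len > capacity ? second_len - capacity : 0;
}
```

With the lengths assumed in the report, toy_size_class(100) gives the
112-byte class, and toy_overflow_bytes(100, 200) gives 88 bytes that
would land in the neighbouring object.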
Analysis of the code of Linux 6.12.

fs/ext4/mballoc.c

    /* Modify block bitmap and group descriptor */
    static int
    ext4_mb_mark_context(handle_t *handle, struct super_block *sb, bool state,
                 ext4_group_t group, ext4_grpblk_t blkoff,
                 ext4_grpblk_t len, int flags, ext4_grpblk_t *ret_changed)
    {
        ...
        bitmap_bh = ext4_read_block_bitmap(sb, group);
        ...
        gdp = ext4_get_group_desc(sb, group, &gdp_bh);
        ...
        if (state) {
            mb_set_bits(bitmap_bh->b_data, blkoff, len);
            ext4_free_group_clusters_set(sb, gdp,
                ext4_free_group_clusters(sb, gdp) - changed);
        } else {
            mb_clear_bits(bitmap_bh->b_data, blkoff, len);
            ext4_free_group_clusters_set(sb, gdp,
                ext4_free_group_clusters(sb, gdp) + changed);
        }

        ext4_block_bitmap_csum_set(sb, gdp, bitmap_bh);
        ext4_group_desc_csum_set(sb, group, gdp);
        ...
        err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
        if (err)
            goto out_err;
        err = ext4_handle_dirty_metadata(handle, NULL, gdp_bh);
        if (err)
            goto out_err;
        ...
    }

ext4_read_block_bitmap()
  -> ext4_read_block_bitmap_nowait()

fs/ext4/balloc.c

    struct buffer_head *
    ext4_read_block_bitmap_nowait(struct super_block *sb,
                      ext4_group_t block_group,
                      bool ignore_locked)
    {
        ...
        desc = ext4_get_group_desc(sb, block_group, NULL);
        ...
        bitmap_blk = ext4_block_bitmap(sb, desc);
        ...
        bh = sb_getblk(sb, bitmap_blk);
        ...
        if (bitmap_uptodate(bh))
            goto verify;

        lock_buffer(bh);
        if (bitmap_uptodate(bh)) {
            unlock_buffer(bh);
            goto verify;
        }
        ...
    }

/* Write the dirty block buffers of the block device back to the storage device */
__writeback_single_inode()
  -> do_writepages()
    -> mapping->a_ops->writepages() = blkdev_writepages()
      -> write_cache_pages()
        -> block_write_full_folio()
          -> __block_write_full_folio()

fs/buffer.c

    int __block_write_full_folio(struct inode *inode, struct folio *folio,
                get_block_t *get_block, struct writeback_control *wbc)
    {
        ...
        do {
            if (!buffer_mapped(bh))
                continue;
            /* lock the block buffer */
            if (wbc->sync_mode != WB_SYNC_NONE) {
                lock_buffer(bh);
            } else if (!trylock_buffer(bh)) {
                folio_redirty_for_writepage(wbc, folio);
                continue;
            }
            if (test_clear_buffer_dirty(bh)) {
                /*
                 * When write-back completes, call
                 * end_buffer_async_write() to unlock the block
                 * buffer and clear the PG_writeback flag for
                 * the page
                 */
                mark_buffer_async_write_endio(bh,
                    end_buffer_async_write);
            } else {
                unlock_buffer(bh);
            }
        } while ((bh = bh->b_this_page) != head);
        ...
        /*
         * call folio_test_set_writeback(folio) to set the
         * PG_writeback flag for the page
         */
        folio_start_writeback(folio);

        do {
            struct buffer_head *next = bh->b_this_page;
            if (buffer_async_write(bh)) {
                submit_bh_wbc(REQ_OP_WRITE | write_flags, bh,
                          inode->i_write_hint, wbc);
                nr_underway++;
            }
            bh = next;
        } while (bh != head);
        folio_unlock(folio);
        ...
    }

When EXT4 prepares to modify the block bitmap and the group descriptor,
it neither locks the block buffers nor waits for an already started
write-back to the storage device to complete.

When a block buffer is written back to the storage device, the
write-back path locks the block buffer and sets the write-back flag for
the page. When the write-back completes, it unlocks the block buffer and
clears the write-back flag for the page.

The block device has the flag BLK_FEAT_STABLE_WRITES set, but when EXT4
prepares to modify its metadata, it does not wait for an already started
write-back to the storage device to complete.

I found this bug in Linux 5.4. I am sure that Linux 6.12 has this bug
as well.
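The interaction described above can be illustrated with a small,
single-threaded user-space model. It is a sketch under stated
assumptions, not kernel code: struct toy_buffer and all toy_*()
functions are hypothetical names, the "device" copies the block in two
halves to make the window visible, and toy_modify_checked() only shows
what respecting the lock and write-back state would look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16

/* Toy stand-in for a metadata block buffer and its page state. */
struct toy_buffer {
    unsigned char data[TOY_BLOCK_SIZE];
    bool locked;        /* models the buffer lock */
    bool writeback;     /* models the page write-back flag */
    size_t copied;      /* bytes the "device" has consumed so far */
};

static void toy_init(struct toy_buffer *b, unsigned char fill)
{
    memset(b->data, fill, sizeof(b->data));
    b->locked = b->writeback = false;
    b->copied = 0;
}

/* Write-back side: lock the buffer and mark the page under write-back. */
static void toy_start_writeback(struct toy_buffer *b)
{
    b->locked = b->writeback = true;
    b->copied = 0;
}

/* The "device" consumes the next n bytes of the buffer into dst. */
static void toy_device_copy(struct toy_buffer *b, unsigned char *dst, size_t n)
{
    assert(b->writeback);
    if (b->copied + n > TOY_BLOCK_SIZE)
        n = TOY_BLOCK_SIZE - b->copied;
    memcpy(dst + b->copied, b->data + b->copied, n);
    b->copied += n;
}

/* Completion: unlock the buffer and clear the write-back flag. */
static void toy_end_writeback(struct toy_buffer *b)
{
    b->locked = b->writeback = false;
}

/* Modifier that ignores the lock and write-back state. */
static void toy_modify_unlocked(struct toy_buffer *b, unsigned char v)
{
    memset(b->data, v, sizeof(b->data));
}

/* Modifier that refuses to touch a buffer that is being written back. */
static bool toy_modify_checked(struct toy_buffer *b, unsigned char v)
{
    if (b->locked || b->writeback)
        return false;   /* a real caller would wait and retry */
    memset(b->data, v, sizeof(b->data));
    return true;
}
```

If toy_modify_unlocked() runs between two toy_device_copy() calls, the
device ends up with a block whose first half is old and whose second
half is new, which is the kind of mid-write change that lets the
compressed length differ between two compression passes.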
I confirm that our company has disabled the journaling feature of the
EXT4 file system.

    mkfs.ext4 -O ^has_journal /dev/zram0
    mount /dev/zram0 /tmp

If the journaling feature of the EXT4 file system is disabled, a
metadata block buffer is not locked while it is being modified, so the
modification can run concurrently with writing that metadata block
buffer back to the storage device.

    mkfs.ext4 /dev/zram0
or
    mkfs.ext4 -O has_journal /dev/zram0

    mount /dev/zram0 /tmp

These commands enable the journaling feature of the EXT4 file system.
When a transaction is committed, a copy of each block buffer contained
in the transaction is made and written to the journal, which ensures
that the copy is not modified during the journaling process. Writing a
metadata block buffer back to the storage device and modifying that
metadata block buffer are then not executed concurrently.

The suggestions are as follows.

(1) Add a protection measure to the function __zram_bvec_write() of the
    ZRAM device driver to prevent the kernel crash: if the length of the
    data after the second compression differs from the length after the
    first compression, print a warning message and reallocate an object
    from the compression pool.

(2) Add a note to the document "admin-guide/blockdev/zram.rst":
    "When writing the content of a physical page back to the ZRAM
    device, the content of the physical page cannot be modified.
    Therefore, when storing an EXT4 file system in a ZRAM device, the
    journaling feature cannot be disabled."

(3) When mounting an EXT4 file system, if the STABLE_WRITES flag is set
    for the storage device and the journaling feature of the EXT4 file
    system is disabled, report an error.
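Suggestion (1) could look roughly like the following user-space sketch.
It is not a patch for __zram_bvec_write(): toy_compress() is a
hypothetical run-length encoder standing in for zcomp_compress(),
malloc()/free() stand in for the compression pool, and the mutate
callback is an assumption used to simulate the page changing while the
slow allocation sleeps. Locking and stream handling in the real driver
are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct toy_object {
    size_t capacity;        /* size the object was allocated for */
    unsigned char *mem;
};

/* Hypothetical compressor: (run length, value) pairs; dst needs 2 * n bytes. */
static size_t toy_compress(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0, i = 0;

    while (i < n) {
        size_t run = 1;

        while (i + run < n && src[i + run] == src[i] && run < 255)
            run++;
        dst[out++] = (unsigned char)run;
        dst[out++] = src[i];
        i += run;
    }
    return out;
}

/* Example mutator (hypothetical): flips one byte in the middle of the page. */
static void toy_touch_one_byte(unsigned char *page, size_t n)
{
    page[n / 2] ^= 0xff;
}

/*
 * Store a page with the length guard. Returns the number of
 * reallocations that were needed, or -1 on allocation failure.
 */
static int toy_store(unsigned char *page, size_t n, struct toy_object *obj,
                     void (*mutate)(unsigned char *page, size_t n))
{
    unsigned char *scratch;
    size_t len;
    int reallocs = 0;

    assert(n > 0);
    scratch = malloc(2 * n);
    if (!scratch)
        return -1;

    /* First compression decides the object size. */
    len = toy_compress(page, n, scratch);
    obj->capacity = len;
    obj->mem = malloc(len);
    if (!obj->mem) {
        free(scratch);
        return -1;
    }

    /* Simulated window: the page changes before the second compression. */
    if (mutate)
        mutate(page, n);

    for (;;) {
        /* Corresponds to the "compress_again" pass. */
        len = toy_compress(page, n, scratch);
        if (len == obj->capacity)
            break;
        fprintf(stderr, "toy_store: compressed length changed %zu -> %zu\n",
                obj->capacity, len);
        free(obj->mem);
        obj->mem = malloc(len);
        if (!obj->mem) {
            free(scratch);
            return -1;
        }
        obj->capacity = len;
        reallocs++;
    }

    memcpy(obj->mem, scratch, len);    /* now guaranteed to fit */
    free(scratch);
    return reallocs;
}
```

The loop re-compresses after every reallocation because, in the real
driver, the page could change again while the new allocation is in
progress; the copy only happens once the length and the object size
agree.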