Bug 215557 - zram can corrupt data
Summary: zram can corrupt data
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: PPC-64 Linux
Importance: P1 high
Assignee: drivers_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-02-01 02:19 UTC by Luke-Jr
Modified: 2022-12-05 09:54 UTC
CC List: 1 user

See Also:
Kernel Version: 4.19.206
Subsystem:
Regression: No
Bisected commit-id:


Attachments
snippet of strace log (3.61 MB, application/zstd)
2022-02-01 19:08 UTC, Luke-Jr

Description Luke-Jr 2022-02-01 02:19:45 UTC
I use ext4 on zram for my temp directories, and on rare occasions things get corrupted. Using ext4 on a normal disk works fine in the same scenarios.

I haven't managed to figure out what exactly is going on, but I do have a 157 GB strace log of it happening.

One scenario that fairly reliably reproduces it is building 3 copies of binutils in parallel. About half the time, /var/tmp/portage/cross-i686-w64-mingw32/binutils-2.37_p1-r2/work/build/binutils/.deps/stabs.Po ends up truncated, and one of the builds fails.

The only other scenario I've seen it happen in (much less reproducible) is running Bitcoin functional tests. In that case, however, the ext4 structure itself got corrupted, and Linux was unable to recover (the affected directories became unusable until reboot).

I suspect it's probably a threading-related issue, but it's plausible it could be page-size related (I *think* I'm using 64k pages), though in the latter case I would expect it to be much more common.
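[Editor's note: the page-size guess above is straightforward to verify; a quick check using standard utilities (assuming a POSIX userland) is:]

```shell
# Report the kernel's page size in bytes; ppc64 kernels are commonly
# built with CONFIG_PPC_64K_PAGES, which yields 65536 here.
getconf PAGESIZE
```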
Comment 1 dave.rodgman 2022-02-01 10:24:51 UTC
Could you please post the output of running zramctl --output-all, and of mount? A snippet from the strace log showing zram being exercised might also be useful. Thanks.
Comment 2 Luke-Jr 2022-02-01 18:56:14 UTC
$ zramctl --output-all
NAME       DISKSIZE  DATA COMPR ALGORITHM STREAMS ZERO-PAGES TOTAL MEM-LIMIT MEM-USED MIGRATED MOUNTPOINT
/dev/zram2    62.5G 19.8G  7.1G zstd           64       5644  7.2G        0B    20.6G   502.9K /var/tmp
/dev/zram1    62.5G 16.1G  3.2G zstd           64      30131  3.2G        0B    26.6G    31.9K /tmp
/dev/zram0      16G 15.9G  1.9G zstd           64      19365    2G        0B     2.3G    61.5K [SWAP]
$ mount | grep zram
/dev/zram1 on /tmp type ext4 (rw,discard,stripe=16)
/dev/zram2 on /var/tmp type ext4 (rw,discard,stripe=16)
Comment 3 Luke-Jr 2022-02-01 19:08:29 UTC
Created attachment 300376 [details]
snippet of strace log

Here are 50,000 lines from maybe the last 5 GB or so of the log.
Comment 4 dave.rodgman 2022-02-02 09:30:09 UTC
Since you are able to reproduce this, it may be helpful to try reproducing under a few slightly different environments to isolate it a bit.

As a starting point, are you able to try reproducing with (a) a different compression algorithm (e.g., lz4, lzo); (b) a different filesystem (e.g., XFS, btrfs); and (c) all but one core turned off (e.g. echo 0 > /sys/devices/system/cpu/cpu<X>/online)?

This may help narrow it down somewhat.
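[Editor's note: the CPU-offlining step in (c) can be scripted over all secondary CPUs via the standard sysfs hotplug interface. A dry-run sketch (it prints the writes rather than performing them; drop the leading "echo" and run as root to actually offline the CPUs, and write 1 instead of 0 to bring them back):]

```shell
# Iterate over every CPU except cpu0 (which often cannot be offlined)
# and print the sysfs write that would take it offline.
for cpu in /sys/devices/system/cpu/cpu[1-9]*; do
    # Skip entries without a hotplug control file (e.g. cpufreq dirs).
    [ -e "$cpu/online" ] || continue
    echo "echo 0 > $cpu/online"
done
```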
Comment 5 Luke-Jr 2022-12-05 09:54:52 UTC
Regarding (a), still having this issue with lz4.

I've configured my system to use btrfs whenever I reboot again. No idea how soon that will be.
