Created attachment 135751 [details] kernel log, ext4 fs params. ext4 lazyinit fails with corruption when competing with mdadm RAID5 reconstruction. In the _default_ scenario, mdadm does not fully initialize a new RAID5 array; instead it starts the array degraded and rebuilds onto the last drive as a spare. Adding to the load, the filesystem sat on an encrypted LUKS volume (at /dev/mapper/private). The order of events: 1) Create a new 4x4TB RAID5 array via mdadm. The volume starts degraded and begins rebuilding while allowing access. 2) Layer a new encrypted LUKS volume on top: /dev/md1 -> /dev/mapper/private. 3) Create the ext4 filesystem with the 64bit and sparse_super2 features. 4) Mount the volume with journal_async_commit. 5) Start a slow rsync in the background, averaging 5 MB/s. Please see the attached log.
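For reference, the steps above can be sketched as the following command sequence. The device names (/dev/sd[b-e]), array name (/dev/md1), and mount point are assumptions, not taken from the report; adjust them to the actual hardware. These commands are destructive and should never be run against disks holding data.

```shell
# 1) Create the 4x4TB RAID5 array; by default mdadm starts it degraded
#    and reconstructs parity onto the last device as a rebuilding spare.
mdadm --create /dev/md1 --level=5 --raid-devices=4 /dev/sd[b-e]

# 2) Layer LUKS on top of the md device.
cryptsetup luksFormat /dev/md1
cryptsetup open /dev/md1 private          # -> /dev/mapper/private

# 3) Create the ext4 filesystem with the reported feature flags.
mkfs.ext4 -O 64bit,sparse_super2 /dev/mapper/private

# 4) Mount with journal_async_commit (lazyinit runs in the background
#    by default and competes with the RAID rebuild).
mount -o journal_async_commit /dev/mapper/private /mnt/private

# 5) Apply a slow (~5 MB/s) background write load while the rebuild
#    and lazy initialization are both still in progress.
rsync -a /source/ /mnt/private/ &
```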
Is this something you can reliably reproduce? The log doesn't tell us anything useful, and it's not clear whether the problem is with the dm-crypt (i.e., LUKS) layer or with the ext4 layer. All the log tells us is that we are waiting forever for a block I/O operation to finish in the jbd2 commit thread, and this is causing the lazyinit thread to give a soft lockup warning (meaning that two minutes have gone by without any forward progress taking place). I suspect the problem is an interaction between the dm-crypt and md raid5 code, which is being tickled by the I/O patterns that you've described. But before we kick this over to the device mapper developers, the first question is whether you can reliably reproduce the problem.
I cannot reproduce it. A quick glance at the stack made me fear that this was not an obvious ext4 problem but an interaction bug between block layers. Still, I thought I should submit it in case it proved useful... it's hard to easily reproduce errors that require building 16TB arrays on production servers.