Bug 205197
Summary: | kernel BUG at fs/ext4/extents_status.c:884 | ||
---|---|---|---|
Product: | File System | Reporter: | Arnaud Bétrémieux (arnaud) |
Component: | ext4 | Assignee: | fs_ext4 (fs_ext4) |
Status: | RESOLVED INVALID | ||
Severity: | normal | CC: | antony.ambrose, tytso |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 5.3.5 | Subsystem: | |
Regression: | No | Bisected commit-id: |
Description
Arnaud Bétrémieux
2019-10-15 12:38:54 UTC
It looks like the journal inode is corrupted, but it shouldn't have BUG'ed on you. Can you reproduce this crash? If so, does this fairly simple patch cause it not to BUG? (It will still fail to mount, but it shouldn't crash.)

    diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
    index f203bf989a4c..d83b325fb54b 100644
    --- a/fs/ext4/extents.c
    +++ b/fs/ext4/extents.c
    @@ -375,7 +375,7 @@ static int ext4_valid_extent(struct inode *inode, struct ext4_extent *ext)
     	 * - zero length
     	 * - overflow/wrap-around
     	 */
    -	if (lblock + len <= lblock)
    +	if (lblock + (ext4_lblk_t) len <= lblock)
     		return 0;
     	return ext4_data_block_valid(EXT4_SB(inode->i_sb), block, len);
     }

Apologies if this is whitespace damaged, but it's a fairly simple edit to apply, and I'm currently on a chromebook so I can't easily get a patch uploaded into bugzilla.

Sorry for the delay. I can confirm that although the partition still does not mount, there is indeed no "BUG" with this patch applied.

It's been pointed out to me that the patch in #1 should have been a no-op, since a signed integer gets converted to be unsigned before it is added to an unsigned int. Can you confirm that without the patch, you can still reliably reproduce the failure?

I just tried it with the same kernel I used at the time of the bug report, and no, I can't reproduce the failure anymore. I'm not sure what changed… sorry! Strangely, I'm pretty sure I did test with and without the patch and it all seemed to work at the time (BUG with no patch, no BUG with patch). The partition is automounted, so maybe there was an auto-fsck at some point. I should have thought of removing the automount to keep things testable.

Working with a 5.4.233 kernel on an aarch64 (Qualcomm/Android) platform, we get the same error. I am able to reliably reproduce this problem even after applying the patch in #1. Could you please let me know what additional information is required? As the partition is FBE encrypted, I am not able to look at the hex dump to check the nature of the corruption.
The reason why no one has paid much attention to it is because the bug is reported against a very old kernel, and upstream developers generally only worry about the upstream kernel. Companies which insist on using old stable kernels need to either engage paid support (e.g., contacting Red Hat if you are using RHEL, etc.) or have their own kernel developers on staff to debug the problem. Upstream developers are volunteers who don't have the time to provide free support to companies that are using old kernels.

In general, at the minimum we ask kernel engineers working on these kernels to try to reproduce the problem on the latest upstream kernel, and if they can't.... maybe they should work on using a newer upstream kernel, or they should figure out how to backport fixes to old LTS kernels.

Also, it seems... weird.... that you can't look at the hex dump. The kernel is able to mount the file system, so you have access to the encryption key, or at least, to a block device which has the encryption key set up by your user space. So you should be able to run e2fsck -fn /dev/hdXX. This would help provide a hint to the nature of the corruption, so that we could try to reproduce the problem on an upstream kernel.

But what we really don't have time to do is to hand-hold users who don't know how to run fsck, apply kernel patches, or run test kernels. If you can let us know what you actually can do, perhaps we might bend the rules and try to give you some debugging help. But it will only be on a best-effort basis, and when we have time, since after all, we're volunteers....

Thank you for the response. I understand now why there was not much attention to this issue. Sorry for providing minimal information in the first communication. We have backported the interesting changes from upstream (~70 of them) and could still see the problem. I reported the issue against the old kernel for continuity.
The original issue was also seen while mounting an encrypted SD card; we have seen it on an encrypted volume as well, but on onboard storage. I thought it was logical to continue the discussion here, as you had given some debugging hints and the issue did not progress because the original reporter could no longer reproduce the problem, whereas we could even after backporting the change. I will file bugs against the latest kernel in future. Thanks for the hint.

The issue can be reproduced with a sequence in which we interrupt the power. In our decade of experience working with ext4, we have never seen a power-loss sequence corrupt an ext4 volume so badly that it could not be mounted. That was the main reason to report the issue to the community experts. Of course we have some paid support and also in-house kernel engineers, but I thought it better to also report to the community experts, as the old bug is still open and we have a reliable reproduction. My current assumption is that we have a problem either with our sequence or with the handling of encrypted ext4 partitions.

Regarding our know-how and tooling, we can work with the hex dump, understand the ext4 disk layout, and use e2fsprogs to debug the problem. Hence, we expect only some debugging hints and direction, and hopefully we can solve the issue together. As the device resets cyclically, we could not hook into the device and get the /dev/sdXX. The existing tooling only gets the encrypted data. We will try to resolve this situation, get the hex dump somehow, and provide more details on the nature of the corruption along with the fsck output.

One of the things I'd recommend doing is grabbing a compressed raw e2image dump. See the e2image man page for the -r or the -Q option. It's not hard to build e2image for Android.
At one point I had added support for building e2image in the AOSP build files (although this might have been before the AOSP build system got updated, so it might require some minor work on your side; still, it's really not hard to build an AOSP image with e2image and debugfs enabled, and if you're trying to do file system debugging on Android, this is a Really Good idea.)

Sorry for the very late reply. We have worked on this issue further and understand that it happens when an ongoing encryption is interrupted. On the next boot, when the system tries to mount the partition, which is in a partially encrypted state, it hits a BUG_ON. This is fixed in AOSP by implementing logic to identify a partition whose encryption was interrupted. This is not an ext4 bug. Thanks for all the hints. I will close this bug.

I realized that I am not the one who created this bug. I will close the other one, which I created: https://bugzilla.kernel.org/show_bug.cgi?id=218596

*** Bug 218596 has been marked as a duplicate of this bug. ***