Bug 216559
Summary: | btrfs crash root mount RAID0 | |
---|---|---|---
Product: | File System | Reporter: | Viktor Kuzmin (kvaster)
Component: | btrfs | Assignee: | BTRFS virtual assignee (fs_btrfs)
Status: | RESOLVED CODE_FIX | |
Severity: | high | CC: | dsterba, jbowler, regressions, wqu
Priority: | P1 | |
Hardware: | All | |
OS: | Linux | |
Kernel Version: | 6.0.0 | Subsystem: |
Regression: | No | Bisected commit-id: |
Attachments: | crash-1, crash-2, crash-3, crash-4, crash-5, crash-6 | |
Description
Viktor Kuzmin
2022-10-08 20:41:32 UTC
You mean this change? https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ac0677348f3c2

And could you please share the full log with the DIVIDE by ZERO error?

Created attachment 302972 [details]
crash-1
Created attachment 302973 [details]
crash-2
Created attachment 302974 [details]
crash-3
Created attachment 302975 [details]
crash-4
Created attachment 302976 [details]
crash-5
Created attachment 302977 [details]
crash-6
Yes, I'm talking about exactly this commit. I've attached screenshots of the kernel crash from the remote KVM. Unfortunately I have no plain-text full log. I have no more problems after reverting this commit. The problem is with:

    stripe_nr = div_u64(stripe_nr, map->sub_stripes);

It seems that map->sub_stripes is zero in my case.

I believe it was some older mkfs that caused the sub_stripes to be zero in your chunk items. Normally I would prefer to make the tree-checker reject such older (and invalid) chunk items. But you had better mount with an older kernel and run a balance to get rid of such old chunks first, just in case we go the reject path.

This server has been running for 5 years already, I think. I will run a balance and recheck. Thanks!

You can always verify whether you have such offending chunk items with the following (it can be executed on a mounted device):

    # btrfs ins dump-tree -t chunk <device>

The target field is "sub_stripes":

    item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15783 itemsize 112
        length 268435456 owner 2 stripe_len 65536 type METADATA|RAID1
        io_align 65536 io_width 65536 sector_size 4096
        num_stripes 2 sub_stripes 1
            stripe 0 devid 1 offset 30408704
            dev_uuid 7eec3a5e-6463-4c4b-a2c8-716abd5b08f5
            stripe 1 devid 2 offset 9437184
            dev_uuid f4e381b7-e378-497d-974d-0a8e7f7e71a7

Although sub_stripes really only makes sense for RAID10 (where it should be 2), for all other profiles it should be 1, no matter what. If you see something like "sub_stripes 0", that chunk should be balanced. Once no chunk items with "sub_stripes 0" remain, it should be safe to use the 6.0 kernel.

Thanks. This command showed me chunks with "sub_stripes 0", and they are gone after "btrfs balance start --full-balance --bg /".

(In reply to Viktor Kuzmin from comment #12)
> And they are gone after "btrfs balance start --full-balance --bg /".

Well, good for you, but one question remains: might others fall into this trap?
It sounds like it; the kernel should ideally be modified to handle this situation. Or not?

I have 12 servers with a btrfs RAID0 disk setup. Nine of them were set up at various times more than a year ago, and all of them have this problem: some chunks had 'sub_stripes 0'. I think others may also fall into this trap.

I'm surprised to find that I had already submitted a patch for this same problem back in March 2022. But at that time I had neither a real-world report nor a known progs version that produced such 0 sub_stripes. Could you provide the history of the filesystems that still have the 0 sub_stripes values? I'm particularly interested in which btrfs-progs version is causing this problem.

I see the same issue. In my case it is on the RAID root and two other non-RAID btrfs partitions. A more recently created btrfs partition does not have the problem.

In all three cases there are exactly three "sub_stripes 0" items; they are contiguous item numbers and they are right at the start, either items 2,3,4 or 3,4,5 (1-based).

I'm using gentoo (I believe from the screenshots @Viktor is too) and I'm running the dev (~) release, so I may have been using a mkfs.btrfs that never hit the standard world. Nevertheless I regard the bug as a showstopper; it apparently can't be fixed from a running 6.0.x (or 6.1.x?) system.

This is my gentoo bug: https://bugs.gentoo.org/878023

If the mkfs.btrfs version is recorded in the FS I can readily retrieve it; however, I am doing the balance on all three affected FSs, so I hope the problem will disappear from my sight :-)

(In reply to John Bowler from comment #16)
> I see the same issue. In my case it is on the RAID root and two other
> non-RAID btrfs partitions. A more recently created btrfs partition does not
> have the problem.
>
> In all three cases there are exactly three "sub_stripes 0" items, they are
> contiguous item numbers and they are right at the start; either items 2,3,4
> or 3,4,5 (1-based).
>
> I'm using gentoo (I believe from the screenshots @Viktor is too) and I'm
> running the dev (~) release so I may have been using a mkfs.btrfs that never
> hit the standard world. Nevertheless I regard the bug as a showstopper; it
> apparently can't be fixed from a running 6.0.x (or 6.1.x?) system.
>
> This is my gentoo bug: https://bugs.gentoo.org/878023
>
> If the mkfs.btrfs version is recorded in the FS I can readily retrieve it,
> however I am doing the balance on all three affected FSs so I hope the
> problem will disappear from my sight :-)

Or you can try this patch: https://patchwork.kernel.org/project/linux-btrfs/patch/90e84962486d7ab5a8bca92e329fe3ee6864680f.1666312963.git.wqu@suse.com/

This should make btrfs properly handle such older chunk items. It will be backported to v6.0 (the only affected release AFAIK).

> This should make btrfs properly handle such older chunk items.
Alas I no longer have a test case - I ran the btrfs balance on all my current file systems.
I can confirm that the workaround works; I now have a functional 6.0.3 with all file systems mounted.
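The verification described earlier (dump the chunk tree and look for the invalid field) is easy to script. A sketch, assuming btrfs-progs is installed and using `/dev/sda2` as a stand-in for the real device (`btrfs ins` is shorthand for `btrfs inspect-internal`):

```shell
# Count chunk items carrying the invalid value. A non-zero count means
# the filesystem should be balanced before booting an unpatched 6.0 kernel.
# Requires root; works on a mounted filesystem. /dev/sda2 is an example path.
btrfs inspect-internal dump-tree -t chunk /dev/sda2 | grep -c 'sub_stripes 0'
```

Note that `grep -c` prints `0` and exits non-zero when nothing matches, which is the desired outcome here.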
Patch has been queued for merge and will appear in a near-future 6.0.x stable release. Thanks for the reports and fix.
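The defensive idea behind the fix can be sketched in a few lines. This is an illustrative sketch only, not the kernel's actual code (I believe the upstream patch instead derives the value from the per-profile RAID parameters rather than trusting the on-disk field): the point is that the untrusted on-disk `sub_stripes` must be normalized before it ever reaches `div_u64()`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the defensive fix; the function name is
 * illustrative, not a kernel identifier. Old mkfs.btrfs versions wrote
 * sub_stripes == 0 for non-RAID10 chunks, and 6.0's stripe math divides
 * by map->sub_stripes. Clamping the on-disk value to at least 1 when
 * building the in-memory chunk map makes that division safe for such
 * legacy chunk items.
 */
static uint16_t sanitize_sub_stripes(uint16_t on_disk_value)
{
    /* RAID10 chunks legitimately use 2; every other profile should be 1. */
    return on_disk_value ? on_disk_value : 1;
}
```

With this normalization in place, a legacy `sub_stripes 0` chunk item no longer needs a balance before booting the fixed kernel, though balancing remains the clean-up path for the stale on-disk values.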