Bug 217758 - Btrfs scrub on single SSD reports nonexistent errors on Linux 6.4
Status: NEW
Product: File System
Classification: Unclassified
Component: btrfs
Hardware: All Linux
Importance: P3 high
Assignee: BTRFS virtual assignee
Reported: 2023-08-03 13:37 UTC by Olivier Wuillemin
Modified: 2023-12-12 15:21 UTC
Regression: No

Attachments
Log file with btrfs scrub "errors" (144.59 KB, application/octet-stream), 2023-08-04 10:18 UTC, Olivier Wuillemin
Extra debug output for scrub errors (2.49 KB, patch), 2023-08-12 01:37 UTC, Qu Wenruo
log file with patch applied (131.65 KB, text/plain), 2023-08-12 08:44 UTC, Zhaoyang Zhang
dmesg output (74.32 KB, text/plain), 2023-10-14 20:39 UTC, Peter Gerber

Description Olivier Wuillemin 2023-08-03 13:37:57 UTC
My distribution is Manjaro 23.0.0; the CPU is an AMD A10.
The system disk is a PNY_CS900 120 GB SSD.

With Linux 6.4.6.1-MANJARO, a scrub of the root filesystem on the system disk reports errors and sometimes freezes the system:

sudo btrfs scrub status /
UUID:             53e62983-fed7-46ed-b97c-23d19af4f26f
Scrub started:    Tue Aug  1 13:26:14 2023
Status:           running
Duration:         0:03:21
Time left:        1:12:50
ETA:              Tue Aug  1 14:42:25 2023
Total to scrub:   85.39GiB
Bytes scrubbed:   3.75GiB  (4.40%)
Rate:             19.13MiB/s
Error summary:    read=80
  Corrected:      34
  Uncorrectable:  46
  Unverified:     0

In this case, this report is the last one before the system froze.
Scrub does not always freeze the system, but when it does, only low-level functionality such as ping or ps still works.

If I downgrade the Linux kernel to 6.1.41-1-MANJARO, there are no errors:

sudo btrfs scrub status /
UUID:             53e62983-fed7-46ed-b97c-23d19af4f26f
Scrub started:    Thu Aug  3 14:17:11 2023
Status:           finished
Duration:         0:04:41
Total to scrub:   66.72GiB
Rate:             225.62MiB/s
Error summary:    no errors found

Ask me if you want more details.
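
To compare kernels without reading each report by eye, the "Error summary" line can be pulled out of the status output. This is a minimal sketch that assumes the output format shown in this report; `scrub_error_summary` is a helper name made up here, and the heredocs contain lines copied from the two reports above.

```shell
#!/bin/sh
# Hypothetical helper: prints the value of the "Error summary" line
# from `btrfs scrub status` output read on stdin.
scrub_error_summary() {
    awk -F': *' '/^Error summary:/ { print $2 }'
}

# Lines copied from the 6.4.6.1-MANJARO report above.
summary_64=$(scrub_error_summary <<'EOF'
Status:           running
Error summary:    read=80
  Corrected:      34
  Uncorrectable:  46
EOF
)

# Lines copied from the 6.1.41-1-MANJARO report above.
summary_61=$(scrub_error_summary <<'EOF'
Status:           finished
Error summary:    no errors found
EOF
)

echo "6.4: $summary_64"
echo "6.1: $summary_61"

# On a live system the same helper would be fed directly:
#   sudo btrfs scrub status / | scrub_error_summary
```

The helper only extracts text; whether a given summary counts as a real failure is still a judgment for the reader, since this bug is about errors that appear to be false alerts.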
Comment 1 Bagas Sanjaya 2023-08-03 14:00:23 UTC
(In reply to Olivier Wuillemin from comment #0)
Can you bisect between v6.1 and v6.4? See Documentation/admin-guide/bug-bisect.rst in the kernel sources for instructions.
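
For reporters who have not bisected before, the start of such a session can be sketched as follows. It is not a substitute for bug-bisect.rst: it assumes a clone of the mainline kernel tree in the current directory, and the build, install, and reboot steps are left as comments because they are distribution-specific. The `run` wrapper, the `DRY_RUN` variable, and `bisect_begin` are illustrative helpers, not part of git; by default the commands are only printed.

```shell
#!/bin/sh
# Sketch of a bisect session between the two tags named above.
# Assumption: run from inside a clone of the mainline kernel tree.
# DRY_RUN, run() and bisect_begin() are hypothetical helpers.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"        # print the command instead of executing it
    else
        "$@"
    fi
}

bisect_begin() {
    good=$1
    bad=$2
    run git bisect start
    run git bisect good "$good"
    run git bisect bad "$bad"
}

bisect_begin v6.1 v6.4

# After each step: build and boot the kernel that git checked out, run
#   sudo btrfs scrub start -B /
# then mark the result and repeat until git names the first bad commit:
#   git bisect good     # scrub finished with no errors
#   git bisect bad      # scrub reported errors or the system hung
#   git bisect reset    # when finished, return to the original branch
```

Setting `DRY_RUN=0` makes the wrapper execute the commands instead of printing them.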
Comment 2 Olivier Wuillemin 2023-08-03 21:50:59 UTC
Scrub returns no errors on kernel versions up to 6.3.13-2.

Scrub returns errors on versions 6.4.6-1 and 6.5.0rc3-1.
Comment 3 Qu Wenruo 2023-08-03 23:00:45 UTC
If the system freezes, it may be a kernel crash; please provide the dmesg output if possible.

You may want to set up netconsole to catch the dying messages on another machine:

https://docs.kernel.org/networking/netconsole.html
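
As a rough sketch of that setup: per the linked documentation, the module takes a parameter of the form src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac. All addresses, the interface name, the MAC address, and the `netconsole_param` helper below are placeholders invented for illustration; substitute values for your own network.

```shell
#!/bin/sh
# Hypothetical helper: assembles the netconsole= module parameter.
# Arguments: src-port src-ip device tgt-port tgt-ip tgt-mac
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Placeholder values: the machine being scrubbed is 192.168.1.10 (eth0),
# the machine collecting the log is 192.168.1.20.
param=$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 00:11:22:33:44:55)

# Printed rather than executed, so the command can be reviewed first.
echo "sudo modprobe netconsole netconsole=$param"

# On the collecting machine, listen for the UDP messages before starting
# the scrub, for example (netcat option syntax differs between variants):
#   nc -u -l 6666
```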
Comment 4 Olivier Wuillemin 2023-08-04 10:18:35 UTC
Created attachment 304776 [details]
Log file with btrfs scrub "errors"

This log is the most representative of the errors encountered. As the computer is not located where I live, I rebooted it the next morning to run a memory test. I don't remember whether the computer was frozen at that time.
Comment 5 Olivier Wuillemin 2023-08-04 10:32:26 UTC
I didn't notice an actual kernel crash.

As the bug is reproducible, does not cause damage, and switching kernels is easy on Manjaro, I can run a 6.4 or 6.5 kernel to make the scrub log more verbose, if such options exist. But I'm not able to recompile a kernel myself.
Comment 6 Qu Wenruo 2023-08-06 10:52:41 UTC
If it's not a crash but something like a hang, the dmesg would still help us locate the cause.

There is a branch with performance fixes, but I'm afraid you have to compile it yourself:
https://github.com/adam900710/linux/tree/scrub_testing
Comment 7 Zhaoyang Zhang 2023-08-10 13:41:30 UTC
Hi, I hit the same problem; here is the dmesg:
http://fars.ee/j6S5
Comment 8 Qu Wenruo 2023-08-11 01:13:02 UTC
Thanks for the dmesg; it looks related to a chunk mapping error.

This at least means it's a false alert, but I'm still unsure what's going on.

In the cases I can think of, it looks like a chunk is being cleaned up while it's also being scrubbed.
This happens when a chunk is going to be deleted in the current transaction, but in the previous transaction there are still extents in it.

This means the target fs is also under at least some workload.
If that's the case, I can make the scrub process skip the whole chunk, but this should be rare as we would mark the block group read-only during scrub, thus it should be deleted half way...

@zhaoyang, if you can reproduce the problem reliably, would you mind compiling the btrfs module to test some debug patches?
Comment 9 Zhaoyang Zhang 2023-08-11 03:58:13 UTC
Sure. It's quite reproducible with newer kernels. What should I do to compile the debug patches?
Comment 10 Qu Wenruo 2023-08-12 01:37:53 UTC
Created attachment 304824 [details]
Extra debug output for scrub errors

Here is the new debug patch for the scrub error.

The current debugging goal is to check whether the block group is removed halfway through the scrub.
Comment 11 Zhaoyang Zhang 2023-08-12 08:44:10 UTC
Created attachment 304832 [details]
log file with patch applied
Comment 12 Peter Gerber 2023-10-14 20:39:40 UTC
Created attachment 305220 [details]
dmesg output
Comment 13 Peter Gerber 2023-10-14 20:41:47 UTC
I think I may be seeing the same issue.

I see the issue frequently in combination with this kernel message:

kernel BUG at include/linux/scatterlist.h:115!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI

See attachment above.
Comment 14 Peter Gerber 2023-10-14 21:14:08 UTC
Thinking about it, I believe I've only seen this error with filesystems converted from ext4. I'm not sure this means much, as most filesystems on this machine have been converted. Anyway, I converted a FS using `btrfs-convert --csum xxhash` and it looks like I can reproduce this error.

I uploaded the converted image at https://gitlab.com/pgerber/btrfs-bug/-/raw/main/image.zst. I hope it helps diagnose this issue.
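
For anyone trying the uploaded image, one possible way to scrub it is sketched below. The assumptions are loud ones: that image.zst has already been downloaded and decompresses to a raw btrfs filesystem image, and that the placeholder paths `IMG` and `MNT` suit your machine. `repro_steps` is a helper invented here; it only prints the commands. Since this thread reports hangs and a kernel BUG, a throwaway VM is the safer place to run them.

```shell
#!/bin/sh
# Sketch only. Assumptions: image.zst is in the current directory and
# decompresses to a raw btrfs filesystem image; IMG and MNT are
# placeholder paths chosen for this example.
IMG=${IMG:-image.img}
MNT=${MNT:-/mnt/btrfs-repro}

# Hypothetical helper: prints the reproduction commands in order.
repro_steps() {
    cat <<EOF
zstd -d image.zst -o $IMG
sudo mkdir -p $MNT
sudo mount -o loop $IMG $MNT
sudo btrfs scrub start -B $MNT
sudo dmesg | grep -i btrfs
sudo umount $MNT
EOF
}

repro_steps    # prints the commands; review them, then run them by hand
```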
Comment 15 Peter Gerber 2023-10-14 21:31:00 UTC
I should probably mention the btrfs-progs version I used:

user@disp8532:~$ dpkg -l btrfs-progs
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-===============================================
ii  btrfs-progs    6.2-1        amd64        Checksumming Copy on Write Filesystem utilities


user@disp8532:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 12 (bookworm)
Release:	12
Codename:	bookworm
Comment 16 Qu Wenruo 2023-10-14 22:39:51 UTC
Checking the debug output, it looks like something is wrong in the bio mapping part, as we got the following message:

> unable to find logical 17708773376 length 4096

Another thing: the debug output confirmed it's not a block group being deleted halfway, but more like a use-after-free.
I have received some internal reports about it but have been unable to pin it down yet.

I'll keep you updated when there is some progress.
Comment 17 Olivier Wuillemin 2023-12-12 15:21:06 UTC
I confirm that the system partition was converted from ext4 in 2021 Q1.

Today I switched to kernel 6.6.5.3 and started a scrub on the system partition.
The scrub stayed in the "running" state at 0 MB/s with 0 bytes scrubbed.
After a short time, the whole system locked up and I had to reset the server.

I fell back to Linux 6.1.66.1 and scrubbed the system partition successfully.