Bug 66151

Summary: BTRFS scrub hung
Product: File System
Reporter: Mace Moneta (moneta.mace)
Component: btrfs
Assignee: Josef Bacik (josef)
Status: RESOLVED DOCUMENTED
Severity: normal
CC: cyrond, dsterba, xavier
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.12.1-2.fc21.x86_64
Regression: No

Description Mace Moneta 2013-11-29 13:36:47 UTC
# btrfs scrub status /
scrub status for 02184910-5849-489f-b970-3ea35912a7af
        scrub started at Tue Nov 26 21:30:01 2013, running for 15 seconds
        total bytes scrubbed: 5.38GiB with 0 errors

# btrfs scrub cancel /
ERROR: scrub cancel failed on /: not running

# btrfs scrub resume /
ERROR: scrub is already running.
To cancel use 'btrfs scrub cancel /'.
To see the status use 'btrfs scrub status [-d] /'.

# btrfs scrub status /
scrub status for 02184910-5849-489f-b970-3ea35912a7af
        scrub started at Tue Nov 26 21:30:01 2013, running for 15 seconds
        total bytes scrubbed: 5.38GiB with 0 errors

Notes:
1. Using btrfs-progs-3.12-1.fc20.x86_64
2. The BTRFS filesystem spans two drives, with data in RAID0 and metadata in RAID1.
Comment 1 Xavier Bassery 2013-11-29 14:55:45 UTC
Here are the instructions given earlier on IRC to work around a similar issue (http://pastebin.ca/2477224):

Could you check the status files /var/lib/btrfs/scrub.status.#hash and see whether anything is wrong with access to them?
When no scrub is actually running, it should be safe to delete any of those.
Hope this helps.
Comment 2 Mace Moneta 2013-11-29 15:14:16 UTC
The files were accessible, and I was able to read them without error:

# ls -l /var/lib/btrfs/scrub.status.*
-rw-------  1 root root 809 Nov 26 21:30 /var/lib/btrfs/scrub.status.02184910-5849-489f-b970-3ea35912a7af
-rw-------. 1 root root 419 Jun 24 08:38 /var/lib/btrfs/scrub.status.057239ee-1cc7-44b2-8fa3-714661dfa7fe
-rw-------  1 root root 423 Sep 21 14:29 /var/lib/btrfs/scrub.status.457cd8ba-f193-44d0-b386-b6ca2b80de3c
-rw-------  1 root root 831 Sep 22 06:39 /var/lib/btrfs/scrub.status.c98d1ee1-f28f-41d6-b7c2-32e7ae073651

I removed them and was able to run a new scrub:

# btrfs scrub status /
scrub status for 02184910-5849-489f-b970-3ea35912a7af
        scrub started at Fri Nov 29 10:08:43 2013 and finished after 70 seconds
        total bytes scrubbed: 26.08GiB with 0 errors

There was no mention of the scrub status file in the btrfs man page or Wiki.
Comment 3 Mace Moneta 2013-11-29 15:24:26 UTC
The status file dated Nov 26 21:30:

# cat /var/lib/btrfs/scrub.status.02184910-5849-489f-b970-3ea35912a7af
 
scrub status:1
02184910-5849-489f-b970-3ea35912a7af:1|data_extents_scrubbed:39619|tree_extents_scrubbed:40121|data_bytes_scrubbed:2233184256|tree_bytes_scrubbed:657342464|read_errors:0|csum_errors:0|verify_errors:0|no_csum:20333|csum_discards:4044|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:12409176064|t_start:1385519401|t_resumed:0|duration:15|canceled:0|finished:0
02184910-5849-489f-b970-3ea35912a7af:2|data_extents_scrubbed:39602|tree_extents_scrubbed:40121|data_bytes_scrubbed:2234720256|tree_bytes_scrubbed:657342464|read_errors:0|csum_errors:0|verify_errors:0|no_csum:20334|csum_discards:4090|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:12389253120|t_start:1385519401|t_resumed:0|duration:15|canceled:0|finished:0
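The status file above appears to consist of a `scrub status:1` header followed by one line per device: `<fsid>:<devid>`, then `|`-separated `key:value` pairs. The Python sketch below parses that layout. The format is inferred only from this sample, not from the btrfs-progs source, and the function names are hypothetical. Both records still carry `canceled:0` and `finished:0`, which is presumably why btrfs-progs keeps treating the scrub as running even though the kernel reports none.

```python
from typing import Dict, List


def parse_scrub_status(text: str) -> Dict[str, Dict[str, int]]:
    """Parse a scrub.status file (layout inferred from the sample in comment 3).

    Returns a mapping "<fsid>:<devid>" -> {field_name: integer_value}.
    """
    records: Dict[str, Dict[str, int]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the "scrub status:1" header (it has no "|").
        if not line or "|" not in line:
            continue
        # Split on "|" first: the device key itself contains a ":".
        device, *fields = line.split("|")
        stats: Dict[str, int] = {}
        for field in fields:
            key, _, value = field.partition(":")
            stats[key] = int(value)
        records[device] = stats
    return records


def unfinished_devices(records: Dict[str, Dict[str, int]]) -> List[str]:
    """Devices whose record is neither finished nor canceled."""
    return [dev for dev, s in records.items()
            if s.get("finished") == 0 and s.get("canceled") == 0]


def total_bytes_scrubbed(records: Dict[str, Dict[str, int]]) -> int:
    """Sum data and tree bytes over all device records."""
    return sum(s.get("data_bytes_scrubbed", 0) + s.get("tree_bytes_scrubbed", 0)
               for s in records.values())


if __name__ == "__main__":
    # Abbreviated copy of the two records above (most fields omitted).
    fsid = "02184910-5849-489f-b970-3ea35912a7af"
    sample = (
        "scrub status:1\n"
        f"{fsid}:1|data_bytes_scrubbed:2233184256|tree_bytes_scrubbed:657342464"
        "|canceled:0|finished:0\n"
        f"{fsid}:2|data_bytes_scrubbed:2234720256|tree_bytes_scrubbed:657342464"
        "|canceled:0|finished:0\n"
    )
    recs = parse_scrub_status(sample)
    print(unfinished_devices(recs))
    total = total_bytes_scrubbed(recs)
    print(total, f"{total / 2**30:.2f} GiB")
```

Summing data_bytes_scrubbed and tree_bytes_scrubbed over both devices gives 5,782,589,440 bytes (about 5.39 GiB), in line with the 5.38GiB shown by `btrfs scrub status` in the description.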
Comment 4 Ruben Kelevra 2014-03-23 14:48:09 UTC
Happened here too. I have a three-device RAID5 setup for both metadata and data.

[root@delling ~]# btrfs scrub status /data_root/
scrub status for 9ad4f49f-a9cf-40d3-9938-7baadd12aa3f
	scrub started at Tue Mar 18 13:06:14 2014, running for 323 seconds
	total bytes scrubbed: 1.46GiB with 0 errors


[root@delling ~]# btrfs scrub start /data_root/
ERROR: scrub is already running.
To cancel use 'btrfs scrub cancel /data_root/'.
To see the status use 'btrfs scrub status [-d] /data_root/'.

[root@delling ~]# btrfs scrub resume /data_root/
ERROR: scrub is already running.
To cancel use 'btrfs scrub cancel /data_root/'.
To see the status use 'btrfs scrub status [-d] /data_root/'.

[root@delling ~]# btrfs scrub cancel /data_root/
ERROR: scrub cancel failed on /data_root/: not running

[root@delling ~]# btrfs scrub status /data_root/
scrub status for 9ad4f49f-a9cf-40d3-9938-7baadd12aa3f
	scrub started at Tue Mar 18 13:06:14 2014, running for 323 seconds
	total bytes scrubbed: 1.46GiB with 0 errors
Comment 5 Ruben Kelevra 2014-03-23 14:49:36 UTC
Running "rm /var/lib/btrfs/scrub.status.*" solved it, but this is not documented on the wiki.


[root@delling ~]# btrfs scrub status /data_root/
scrub status for 9ad4f49f-a9cf-40d3-9938-7baadd12aa3f
	no stats available
	total bytes scrubbed: 0.00 with 0 errors
Comment 6 Ruben Kelevra 2014-03-23 14:50:27 UTC
Since scrub start prints a pid, the status information should be deleted if that pid is not found.

[root@delling ~]# btrfs scrub start /data_root/
scrub started on /data_root/, fsid 9ad4f49f-a9cf-40d3-9938-7baadd12aa3f (pid=31130)
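The proposal above can be sketched as a liveness check on a stored pid. This is hypothetical: as comment 8 points out, btrfs-progs does not record the pid, so the `recorded_pid` argument below assumes a status format that stores it. On POSIX systems, `os.kill(pid, 0)` delivers no signal but fails with ESRCH when no such process exists.

```python
import errno
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid exists (POSIX only)."""
    # pid 0 or negative would address a process group, not one process.
    if pid <= 0:
        return False
    try:
        # Signal 0 performs error checking only; nothing is delivered.
        os.kill(pid, 0)
    except OSError as exc:
        if exc.errno == errno.ESRCH:  # no such process
            return False
        if exc.errno == errno.EPERM:  # exists, but owned by another user
            return True
        raise
    return True


def status_is_stale(recorded_pid: int) -> bool:
    """Hypothetical: a 'running' record whose writer process has exited."""
    return not pid_is_alive(recorded_pid)
```

One caveat of this design: pids are reused, so a stale record could match an unrelated live process. A robust check would likely also need to compare something like the process start time.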
Comment 7 David Sterba 2014-03-24 11:34:38 UTC
Newer btrfsprogs have the -f option to force start

$ ./btrfs scrub start --help
...
    -f     force to skip checking whether scrub has started/resumed in userspace
           this is useful when scrub stats record file is damaged
Comment 8 David Sterba 2014-03-24 11:38:04 UTC
(In reply to Ruben Kelevra from comment #6)
> Since scrub start prints a pid, the status information should be deleted if
> that pid is not found.

The pid is not recorded anywhere to do the check.
Comment 9 Ruben Kelevra 2014-03-24 12:14:29 UTC
(In reply to David Sterba from comment #7)
> Newer btrfsprogs have the -f option to force start
> 
> $ ./btrfs scrub start --help
> ...
>     -f     force to skip checking whether scrub has started/resumed in
> userspace
>            this is useful when scrub stats record file is damaged
What does "newer" mean? I'm using Arch Linux with the latest updates on Linux 3.13.6, and I don't have this option.


(In reply to David Sterba from comment #8)
> (In reply to Ruben Kelevra from comment #6)
> > Since scrub start prints a pid, the status information should be deleted if
> > that pid is not found.
> 
> The pid is not recorded anywhere to do the check.
Maybe it should? :)