Bug 103081 - btrfs-image gets stuck imaging a disk, 100% CPU, looping over bad file descriptor
Summary: btrfs-image gets stuck imaging a disk, 100% CPU, looping over bad file descri...
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Josef Bacik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-08-18 15:39 UTC by Timothy Miller
Modified: 2015-08-25 17:20 UTC (History)
2 users (show)

See Also:
Kernel Version: 4.1.4
Tree: Mainline
Regression: No


Attachments

Description Timothy Miller 2015-08-18 15:39:37 UTC
I'm using btrfs-image with the following options:

btrfs-image -c9 -t4 -w /dev/sdb ~/btrfs_sdb.z

It spent a lot of time writing the output file, and it's currently at 1672450048 bytes for a 4-drive RAID1 of 1TB drives.  At some point, it stopped doing I/O entirely and is now just using 100% CPU, which has been going on for at least 30 minutes now.  I think it's stuck, so below is info from the debugger.

Note that although I told it to use four threads, I was watching the CPU and I/O load, and at no point did I see it use any of the worker threads.  And as another note, there is nothing that makes it clear whether or not you have to image separate block devices separately if they're in a RAID1 configuration.

(gdb) info threads
  Id   Target Id         Frame
  5    Thread 0x7f4da6c8a700 (LWP 4363) "btrfs-image" pthread_cond_wait () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  4    Thread 0x7f4da6489700 (LWP 4364) "btrfs-image" pthread_cond_wait () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  3    Thread 0x7f4da5c88700 (LWP 4365) "btrfs-image" pthread_cond_wait () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  2    Thread 0x7f4da5487700 (LWP 4366) "btrfs-image" pthread_cond_wait () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
* 1    Thread 0x7f4da7aa58c0 (LWP 4362) "btrfs-image" 0x00007f4da70326f3 in pread64 () at ../sysdeps/unix/syscall-template.S:81

(gdb) thread 1
[Switching to thread 1 (Thread 0x7f4da7aa58c0 (LWP 4362))]
#0  0x00007f4da70326f3 in pread64 () at ../sysdeps/unix/syscall-template.S:81
81      in ../sysdeps/unix/syscall-template.S
(gdb) where
#0  0x00007f4da70326f3 in pread64 () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000433978 in pread64 (__offset=<optimized out>, __nbytes=<optimized out>, __buf=<optimized out>, __fd=<optimized out>) at /usr/include/bits/unistd.h:117
#2  read_data_extent (md=<optimized out>, async=<optimized out>) at btrfs-image.c:908
#3  flush_pending (md=0x7fff5e4e58e0, done=23238559, done@entry=0) at btrfs-image.c:963
#4  0x0000000000433c08 in add_extent (start=1128534638592, size=262144, md=0x7fff5e4e58e0, data=1) at btrfs-image.c:1042
#5  0x000000000043618f in copy_space_cache (root=0x15db980, path=0x1545890, metadump=0x7fff5e4e58e0) at btrfs-image.c:1220
#6  create_metadump (input=<optimized out>, out=<optimized out>, num_threads=4, compress_level=<optimized out>, sanitize=<optimized out>, walk_trees=<optimized out>) at btrfs-image.c:1403
#7  0x000000000040821f in main (argc=2, argv=0x7fff5e4e5c78) at btrfs-image.c:2825

(gdb) thread 2
[Switching to thread 2 (Thread 0x7f4da5487700 (LWP 4366))]
#0  pthread_cond_wait () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185     ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) where
#0  pthread_cond_wait () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000000431ceb in dump_worker (data=0x7fff5e4e58e0) at btrfs-image.c:648
#2  0x00007f4da702b204 in start_thread (arg=0x7f4da5487700) at pthread_create.c:310
#3  0x00007f4da6d7257d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

(gdb) thread 3
[Switching to thread 3 (Thread 0x7f4da5c88700 (LWP 4365))]
#0  pthread_cond_wait () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185     in ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
(gdb) where
#0  pthread_cond_wait () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000000431ceb in dump_worker (data=0x7fff5e4e58e0) at btrfs-image.c:648
#2  0x00007f4da702b204 in start_thread (arg=0x7f4da5c88700) at pthread_create.c:310
#3  0x00007f4da6d7257d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

(gdb) thread 4
[Switching to thread 4 (Thread 0x7f4da6489700 (LWP 4364))]
#0  pthread_cond_wait () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185     in ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
(gdb) where
#0  pthread_cond_wait () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000000431ceb in dump_worker (data=0x7fff5e4e58e0) at btrfs-image.c:648
#2  0x00007f4da702b204 in start_thread (arg=0x7f4da6489700) at pthread_create.c:310
#3  0x00007f4da6d7257d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

(gdb) thread 5
[Switching to thread 5 (Thread 0x7f4da6c8a700 (LWP 4363))]
#0  pthread_cond_wait () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185     in ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
(gdb) where
#0  pthread_cond_wait () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000000431ceb in dump_worker (data=0x7fff5e4e58e0) at btrfs-image.c:648
#2  0x00007f4da702b204 in start_thread (arg=0x7f4da6c8a700) at pthread_create.c:310
#3  0x00007f4da6d7257d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109


I attached strace, and I see this over and over again:

pread(-1, 0x16297a0, 65536, 310867132416) = -1 EBADF (Bad file descriptor)
pread(5, "\0", 1, 314051657727)         = 1
pread(-1, 0x16297a0, 65536, 310867132416) = -1 EBADF (Bad file descriptor)
pread(5, "\0", 1, 314051657727)         = 1
pread(-1, 0x16297a0, 65536, 310867132416) = -1 EBADF (Bad file descriptor)
pread(5, "\0", 1, 314051657727)         = 1
pread(-1, 0x16297a0, 65536, 310867132416) = -1 EBADF (Bad file descriptor)
pread(5, "\0", 1, 314051657727)         = 1
pread(-1, 0x16297a0, 65536, 310867132416) = -1 EBADF (Bad file descriptor)
pread(5, "\0", 1, 314051657727)         = 1
pread(-1, 0x16297a0, 65536, 310867132416) = -1 EBADF (Bad file descriptor)
pread(5, "\0", 1, 314051657727)         = 1
pread(-1, 0x16297a0, 65536, 310867132416) = -1 EBADF (Bad file descriptor)
pread(5, "\0", 1, 314051657727)         = 1
pread(-1, 0x16297a0, 65536, 310867132416) = -1 EBADF (Bad file descriptor)
pread(5, "\0", 1, 314051657727)         = 1
pread(-1, 0x16297a0, 65536, 310867132416) = -1 EBADF (Bad file descriptor)
pread(5, "\0", 1, 314051657727)         = 1
pread(-1, 0x16297a0, 65536, 310867132416) = -1 EBADF (Bad file descriptor)


There's nothing in dmesg about any of this.
Comment 1 Qu Wenruo 2015-08-19 01:27:36 UTC
Hi Timothy,

Curious about whether all the 4 devices are online?
Seems one of the device is not online?
And btrfs-image fails to handle such case as it record wrong(-1, error return value from open), and cause the error.

BTW that's a bug of btrfs-progs, not sure if is appropriate to address a kernel bug report here.

Thanks,
Qu
Comment 2 David Sterba 2015-08-25 17:20:53 UTC
Thanks for the fix, added to the devel branch.

Note You need to log in before you can comment on or make changes to this bug.