I made the mistake of storing a VirtualBox 40G .vdi file on a btrfs partition (kernel 3.10). I have managed to copy it to a file marked chattr +C (I read the wiki *NOW*), but I can't delete the original file -- rm hangs the system. (I actually have the original file and another made with cp --reflink of it, while trying to diagnose the issue. I need to rm both of them) rm starts out with lots of hard disk activity, then that stops. I eventually get the attached kernel BUG, then a little bit later it panics due to out of memory. Photo: https://www.dropbox.com/s/8oxdaw8ux53y08g/PA150006.JPG
	for (i = 0; i < num_pages; i++) {
		p = alloc_page(GFP_ATOMIC);
		BUG_ON(!p);	/* no free page for you */
		attach_extent_buffer_page(new, p);
		WARN_ON(PageDirty(p));
		SetPageUptodate(p);
		new->pages[i] = p;
	}

The number of extents itself should not be a problem; I'm testing with million-extent files (qcow2 images). Probably the file deletion tries to grab a lot of memory that is not available on the system (qgroups are turned on, so it grabs more memory), so you may want to delete the file in pieces, i.e. truncate it backwards in small steps. This should limit the amount of memory needed and still delete the files.

---
#!/bin/sh
# Shrink a file in 128 MiB steps before removing it, so that each
# step frees only a limited number of extents.

if [ -z "$1" ]; then
	echo "no file"
	exit 1
fi

file="$1"
size=$(stat --format='%s' "$file")
step=$((128*1024*1024))
next=$(($size - $step))

# Truncate backwards until less than one step remains.
while [ $next -gt 0 ]; do
	echo "trunc to $next"
	truncate -s$next -- "$file"
	next=$(($next - $step))
done

rm -- "$file"
I will try, David. I started working on this because VirtualBox locked up and I/O to this file was very slow. The last time I could get a reliable number, there were over 40,000 extents. https://btrfs.wiki.kernel.org/index.php/Gotchas implies that 10,000 extents is a big problem. Is that outdated? Am I looking at a problem that copying this to a file made with chattr +C will NOT fix?
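For anyone wanting to compare extent counts like the ones quoted above, here is a minimal sketch. It assumes filefrag (from e2fsprogs) is installed and that its summary line has the form "NAME: N extents found"; the helper name extent_count and the file name disk.vdi are only illustrative and do not come from this report.

```shell
#!/bin/sh
# extent_count: read `filefrag FILE` output on stdin and print the number
# of extents from its summary line ("NAME: N extents found").
# Assumption: the count is the third field from the end of that line,
# which also holds when the file name contains spaces.
extent_count() {
	awk '/extents? found/ { print $(NF-2) }'
}

# Example (hypothetical file name):
#   filefrag disk.vdi | extent_count
```

The result is only a number to compare against the figures in this thread; it does not say whether a given count is actually harmful.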
FWIW the machine this was on has 8GB RAM and was in single-user mode.
That script worked, slowly. It would trim a few GB and then the system would hang again. Eventually I just tar'd up what I needed, deleted the subvolume, and restored from tar. I might have gotten there eventually, but I had sat through a dozen reboots already.
I just ran into a very similar situation, on the same kernel version but on i386 instead of amd64. The file in question was far smaller. I tried to rm it. Since there was a snapshot active, it should have returned just about immediately, but instead it took down the entire server. How can I help?
This file was only 5GB. It had 27475 extents, and was created by restore (from the dump/restore program - a restoresymtable file)
Per the thread in the mailing list, others have seen this as well. Perhaps it is related to qgroups, perhaps not, but for the moment it seems that is a possible culprit. http://thread.gmane.org/gmane.comp.file-systems.btrfs/29252/focus=29259
On a system where this has caused a hang once, after btrfs quota disable and a reboot, doing the same action did not cause any problem at all. Seems increasingly likely to be a quota problem. I'm not sure what more info could be needed; this is still flagged NEEDINFO but I don't know what more info could be supplied?
How do you have qgroups setup? I'd like to try and reproduce this locally and to do that I need to know your qgroup setup.
I did nothing other than basically:

btrfs sub cre a
btrfs sub cre b
btrfs sub cre c
btrfs quota enable

I only turned them on so I could get disk-usage data for each subvolume via btrfs qg show. I did not set any limits or create anything else. I believe I did turn on quotas before loading data on the FS. All data was loaded into subvolumes. Does that help?
I can't seem to reproduce, everything looks fine to me. Can you try to rm the file and run slabtop in another window and see what floats to the top? Use the 'c' command when it loads up to sort by cache size, and then just copy+paste whatever comes to the top for a long time.
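Since the machine tends to hang before anything can be copied out of an interactive slabtop, a non-interactive capture may be easier. This is only a sketch: it relies on slabtop's -o (print once and exit) and -s c (sort by cache size) options, while the function name log_slabs, the SLAB_CMD variable, and the log path are made up for illustration. Reading slab statistics normally requires root.

```shell
#!/bin/sh
# Command used for one snapshot: -o prints once and exits,
# -s c sorts by cache size (same as pressing 'c' interactively).
SLAB_CMD=${SLAB_CMD:-"slabtop -o -s c"}

# log_slabs LOGFILE COUNT INTERVAL
# Append COUNT timestamped snapshots (top 20 lines each) to LOGFILE,
# sleeping INTERVAL seconds between snapshots.
log_slabs() {
	log=$1
	count=$2
	interval=$3
	i=0
	while [ "$i" -lt "$count" ]; do
		{
			date
			$SLAB_CMD | head -n 20
		} >> "$log"
		i=$((i + 1))
		sleep "$interval"
	done
}

# Example (hypothetical path): start logging, then run the rm elsewhere.
#   log_slabs /root/slab.log 600 1 &
```

Writing the log to a filesystem other than the affected btrfs volume is probably wise, so that whatever was captured is still readable after a reboot.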
Hi Josef, I will try to write some code on Monday to duplicate this and see what I can find for you. John
FYI - I eventually changed to zfs and cannot write the code to duplicate this. I am curious whether it was ever resolved. FWIW, I did hear from others with similar issues.
This is a semi-automated bugzilla cleanup, report is against an old kernel version. If the problem still happens, please open a new bug. Thanks.