Bug 68411 - general protection fault in read_extent_buffer (from readdir) (possibly corrupt fs)
Status: RESOLVED OBSOLETE
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: Josef Bacik
URL:
Keywords:
Duplicates: 63701
Depends on:
Blocks:
 
Reported: 2014-01-09 20:42 UTC by Zack Weinberg
Modified: 2022-10-03 13:26 UTC (History)
CC List: 2 users

See Also:
Kernel Version: 3.12.6
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel oops log (5.26 KB, text/plain)
2014-01-09 20:42 UTC, Zack Weinberg

Description Zack Weinberg 2014-01-09 20:42:22 UTC
Created attachment 121431
kernel oops log

On a btrfs filesystem spanning two whole physical devices and used for general-purpose data storage, a filesystem walk crashed in readdir() with this kernel stack trace:

[<ffffffffa04dd568>] ? read_extent_buffer+0xc8/0x120 [btrfs]
[<ffffffffa04c2510>] ? btrfs_get_extent+0x910/0x990 [btrfs]
[<ffffffffa04d99b8>] ? __do_readpage+0x398/0x780 [btrfs]
[<ffffffffa04c1c00>] ? btrfs_real_readdir+0x550/0x550 [btrfs]
[<ffffffffa04da142>] ? __extent_readpages.constprop.43+0x2d2/0x2f0 [btrfs]
[<ffffffffa04c1c00>] ? btrfs_real_readdir+0x550/0x550 [btrfs]
[<ffffffffa04dbde2>] ? extent_readpages+0x182/0x190 [btrfs]
[<ffffffffa04c1c00>] ? btrfs_real_readdir+0x550/0x550 [btrfs]
[<ffffffff811554ad>] ? alloc_pages_current+0x9d/0x160
[<ffffffff8111dbc3>] ? __do_page_cache_readahead+0x193/0x240
[<ffffffff8111e07a>] ? ondemand_readahead+0x14a/0x280
[<ffffffff81113986>] ? generic_file_aio_read+0x4a6/0x6f0
[<ffffffff8114090f>] ? mmap_region+0x15f/0x5e0
[<ffffffff81172a07>] ? do_sync_read+0x57/0x90
[<ffffffff81172f94>] ? vfs_read+0x94/0x160
[<ffffffff81173a83>] ? SyS_read+0x43/0xa0
[<ffffffff81499b39>] ? system_call_fastpath+0x16/0x1b

(The full oops report is attached.) After this, any attempt to access the file system hung in D-state, including umount, and the computer had to be power-cycled.  Upon reboot into single-user mode, I observed this message in the kernel log:

> BTRFS debug (device sda): unlinked 6 orphans

(The OS boots off /dev/sdb1, which is ext4, and the btrfs volume is /dev/sda + /dev/sdc.)

I then ran btrfsck --repair /dev/sda as root, which printed:

| enabling repair mode
| Checking filesystem on /dev/sda
| UUID: ec93d2c2-7937-40f8-aaa6-c20c9775d93a
| checking extents
| checking free space cache
| cache and super generation don't match, space cache will be invalidated
| checking fs roots
| root 258 inode 4493802 errors 400, nbytes wrong
| root 258 inode 4509858 errors 400, nbytes wrong
| root 258 inode 4510014 errors 400, nbytes wrong
| root 258 inode 4838894 errors 400, nbytes wrong
| root 258 inode 4838895 errors 400, nbytes wrong
| found 41852229430 bytes used err is 1
| total csum bytes: 619630328
| total tree bytes: 3216027648
| total fs tree bytes: 2342981632
| total extent tree bytes: 135536640
| btree space waste bytes: 767795634
| file data blocks allocated: 1744289230848
|  referenced 631766474752
| Btrfs v3.12

So that doesn't sound so bad, but a second btrfsck --repair prints *exactly the same thing*, i.e. the errors have not actually been repaired!
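
For anyone who wants to compare two fsck runs or feed the affected inodes into another tool, here is a minimal Python sketch that pulls the per-inode error lines out of output like the above. The function and type names are hypothetical, the line format is assumed from this one report only, and the error code is kept as a raw string because its encoding is not confirmed here.

```python
import re
from typing import List, NamedTuple


class InodeError(NamedTuple):
    root: int      # subvolume/root id, e.g. 258
    inode: int     # inode number within that root
    errors: str    # error code exactly as printed (kept as text)
    detail: str    # trailing description, e.g. "nbytes wrong"


# Matches lines such as "root 258 inode 4493802 errors 400, nbytes wrong",
# with or without a leading "| " quoting prefix.
LINE_RE = re.compile(r"root (\d+) inode (\d+) errors (\w+)(?:,\s*(.*))?$")


def parse_fsck_errors(output: str) -> List[InodeError]:
    """Return one InodeError per matching line of fsck output."""
    results = []
    for line in output.splitlines():
        m = LINE_RE.search(line.strip())
        if m:
            results.append(
                InodeError(int(m.group(1)), int(m.group(2)), m.group(3), m.group(4) or "")
            )
    return results
```

Running this on the output of both repair passes and comparing the two lists would confirm that the same inodes are reported each time.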

The kernel is Debian's packaged 3.12.6. In this circumstance I am *not* able to compile and test any other kernel, because the only available space on which to do so is the damaged filesystem. However, I am happy to do any other tests which may be helpful, and would also be delighted to receive instructions on how to fix the filesystem properly.
Comment 1 David Sterba 2014-01-13 18:09:31 UTC
This is a duplicate of bug 63701, the patch is on the way.

I don't see the code in fsck to fix that. As an ugly manual fix you can try to locate the files

$ btrfs inspect ino 4493802 /path
(and same for the rest from the fsck output)

and truncate them to 0 if you can afford to lose/recreate the files

$ truncate -s 0 /path/to/file-4493802

(This worked on the reproducer I have at disposal.)
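
A minimal Python sketch of that manual workaround, for several inodes at once. It assumes the abbreviated command above is `btrfs inspect-internal inode-resolve` spelled out, and that the path argument must lie inside the affected subvolume (root 258 here). The function names and the mount point are hypothetical, and truncating destroys the file contents, so use it only on files you can afford to lose.

```python
import os
import subprocess
from typing import List


def inode_resolve_cmd(inode: int, path_in_subvol: str) -> List[str]:
    """Build the argv for resolving an inode number to its path(s)."""
    return ["btrfs", "inspect-internal", "inode-resolve", str(inode), path_in_subvol]


def resolve_inode(inode: int, path_in_subvol: str) -> List[str]:
    """Run the resolver (needs root and a mounted filesystem); one path per line."""
    out = subprocess.run(
        inode_resolve_cmd(inode, path_in_subvol),
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def truncate_to_zero(path: str) -> None:
    """Equivalent of `truncate -s 0 path`: discards the file's contents."""
    os.truncate(path, 0)
```

Usage would look like `for p in resolve_inode(4493802, "/mnt/data"): truncate_to_zero(p)`, where `/mnt/data` stands in for wherever the subvolume is mounted.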
Comment 2 Zack Weinberg 2014-01-13 19:01:00 UTC
Thanks.  I am traveling right now but I will try that on Friday.
Comment 3 Zack Weinberg 2014-01-17 16:31:29 UTC
(In reply to David Sterba from comment #1)
> As an ugly manual fix you can try to locate the files
> 
> $ btrfs inspect ino 4493802 /path
> (and same for the rest from the fsck output)
> 
> and truncate them to 0 if you can afford to lose/recreate the files
> 
> $ truncate -s 0 /path/to/file-4493802

This did *not* clear up the errors.  I still get the same failures from btrfs check (after unmounting the filesystem).  I also tried mounting the fs again and running a scrub, which found no errors and did not make the problem go away.

I am now trying a force clear of the space cache and unlinking the offending inodes.  Will update on results later today.
Comment 4 Zack Weinberg 2014-01-17 21:53:30 UTC
Truncating *and* unlinking all the affected files seems to have cleared the problem.

Also, now that I know which files were affected, I can say that the cause was significantly different from bug 63701 (although it may well be the same error-in-the-code).  My computer did not experience a power failure at any time in recent memory, *until* I had to forcibly power it off as a *response* to the crash I posted (because `umount` following the crash got stuck in D-state).  I am prepared to believe that the damaged files were secondary to that force-poweroff, though; they had all been modified shortly before the crash.
Comment 5 David Sterba 2014-01-23 12:57:00 UTC
(In reply to Zack Weinberg from comment #3)
> (In reply to David Sterba from comment #1)
> > As an ugly manual fix you can try to locate the files
> > 
> > $ btrfs inspect ino 4493802 /path
> > (and same for the rest from the fsck output)
> > 
> > and truncate them to 0 if you can afford to lose/recreate the files
> > 
> > $ truncate -s 0 /path/to/file-4493802
> 
> This did *not* clear up the errors.  I still get the same failures from
> btrfs check (after unmounting the filesystem).

fsck is not yet able to fix this error; the point of truncating the file was to stop the crashing, which apparently worked.

> I also tried mounting the fs
> again and running a scrub, which found no errors and did not make the
> problem go away.

This is not something scrub could fix; it only verifies the checksums. The bug is a structural inconsistency in the extent items and has to be fixed as such.
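
To illustrate why a checksum pass can come back clean while a structural check fails, here is a toy model in Python. It is not btrfs's on-disk format: the classes and field names are invented for illustration, and `zlib.crc32` stands in for whatever checksum the filesystem really uses.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class ToyExtent:
    data: bytes
    csum: int  # checksum stored alongside the data


@dataclass
class ToyInode:
    nbytes: int               # recorded total of allocated bytes
    extents: List[ToyExtent]


def make_extent(data: bytes) -> ToyExtent:
    """Create an extent whose stored checksum matches its data."""
    return ToyExtent(data, zlib.crc32(data))


def scrub_ok(inode: ToyInode) -> bool:
    """Scrub-like check: every extent's data matches its stored checksum."""
    return all(zlib.crc32(e.data) == e.csum for e in inode.extents)


def structure_ok(inode: ToyInode) -> bool:
    """fsck-like check: the recorded byte count agrees with the extents."""
    return inode.nbytes == sum(len(e.data) for e in inode.extents)
```

An inode such as `ToyInode(nbytes=4096, extents=[make_extent(b"abc")])` passes `scrub_ok` because every block is intact, yet fails `structure_ok` because the bookkeeping disagrees with the extents.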

> I am now trying a force clear of the space cache and unlinking the offending
> inodes.  Will update on results later today.

Space cache should not be affected by this, though I'm not completely sure.
Comment 6 David Sterba 2014-01-23 13:03:21 UTC
(In reply to Zack Weinberg from comment #4)
> Truncating *and* unlinking all the affected files seems to have cleared the
> problem.

Yeah. I hadn't noticed before that even if the file is 0 in size, it can still have an incorrect nbytes, the field that tracks the space actually allocated.
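
The difference between a file's logical size and its allocated space can be seen from user space with `stat`. A minimal Python sketch follows; the function name is hypothetical, and it is an assumption here that the allocated-block count the kernel reports for a btrfs file is derived from the inode's nbytes.

```python
import os
from typing import Tuple


def size_and_allocated(path: str) -> Tuple[int, int]:
    """Return (logical size in bytes, allocated bytes) for a file.

    st_blocks counts 512-byte units on Linux, independent of the
    filesystem's own block size.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```

A healthy file truncated to zero would normally report `(0, 0)`; a zero-length file that still shows allocated bytes would be worth a closer look.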

> Also, now that I know which files were affected, I can say that the cause
> was significantly different from bug 63701 (although it may well be the same
> error-in-the-code).

AFAIK the bug is in the code, unrelated to crash failures. Until recently we didn't know the real cause, so the reports may point to different areas.

> I am prepared to believe that the damaged files were
> secondary to that force-poweroff, though; they had all been modified shortly
> before the crash.

This sounds correct.
Comment 7 Zack Weinberg 2014-01-23 14:52:32 UTC
(In reply to David Sterba from comment #5)
> (In reply to Zack Weinberg from comment #3)
> > (In reply to David Sterba from comment #1)
> > > $ truncate -s 0 /path/to/file-4493802
> > 
> > This did *not* clear up the errors.  I still get the same failures from
> > btrfs check (after unmounting the filesystem).
> 
> > fsck is not yet able to fix this error; the point of truncating the file was
> > to stop the crashing, which apparently worked.

Ah, I misunderstood you.  I thought the truncate would either remove the inconsistency altogether or convert it into something fsck could fix.

I would have preferred not to mount the damaged filesystem again until I got a clean result from 'btrfs check'; as is, I did have to mount it to do some of the manual repair actions but I didn't do anything with it other than that.  So I can't really say whether or not the crash would have recurred.

In any case I only got the one crash, and we're agreed that the damage was caused by the crash rather than the other way around.  This doesn't leave me feeling very good about my chances for stability in the future.  How's that patch coming?

> > I am now trying a force clear of the space cache and unlinking the offending
> > inodes.  Will update on results later today.
> 
> Space cache should not be affected by this, I'm not completely sure here,
> though.

There were *also* a bunch of complaints about mismatched space cache generation numbers in the fsck reports.  A mount with nospace_cache,clear_cache seems to have corrected that.  (clear_cache by itself did not.)
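
For reference, a small Python sketch of the mount invocation described above. The function name and the mount point are hypothetical; the device and the two mount options are the ones named in this report.

```python
from typing import List


def clear_space_cache_mount_cmd(device: str, mount_point: str) -> List[str]:
    """Build the argv for a one-off mount that disables and clears the space cache."""
    options = "nospace_cache,clear_cache"
    return ["mount", "-t", "btrfs", "-o", options, device, mount_point]
```

For example, `subprocess.run(clear_space_cache_mount_cmd("/dev/sda", "/mnt"), check=True)` as root, followed by a normal unmount and remount.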
Comment 8 David Sterba 2014-01-31 17:44:32 UTC
(In reply to Zack Weinberg from comment #7)
> How's that patch coming?

The runtime fix is in the 3.14 pull:

Btrfs: don't use ram_bytes for uncompressed inline items
Comment 9 Zack Weinberg 2014-02-11 16:29:06 UTC
Do you still need more information from me in order to address the kernel bug(s)?  Should I file a new bug (and if so, where?) re the missing fsck features?
Comment 10 David Sterba 2014-03-21 10:51:28 UTC
*** Bug 63701 has been marked as a duplicate of this bug. ***
Comment 11 David Sterba 2022-10-03 13:26:45 UTC
This is a semi-automated bugzilla cleanup, report is against an old kernel version. If the problem still happens, please open a new bug. Thanks.
