Bug 12563 - btrfs: oops while running fsstress on compressed file system
Status: CLOSED INVALID
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs
Hardware: All Linux
Importance: P1 normal
Assignee: fs_btrfs@kernel-bugs.osdl.org
Reported: 2009-01-28 10:51 UTC by Eric Whitney
Modified: 2009-03-04 10:55 UTC
Kernel Version: 2.6.29-rc2


Attachments
Fix (1.07 KB, patch)
2009-01-30 13:27 UTC, Chris Mason

Description Eric Whitney 2009-01-28 10:51:35 UTC
Latest working kernel version: Has never fully passed test
Earliest failing kernel version: Prior to mainline merge

Distribution: uname -a:
Linux bl465cb.lnx.usa.hp.com 2.6.29-rc2-enw #1 SMP Sat Jan 17 14:19:15 EST 2009 x86_64 GNU/Linux

Hardware Environment:

Dual-socket, quad-core x86_64 Intel and AMD systems with backplane RAID

representative btrfs-show:
Label: none  uuid: 2bdf12bf-8b27-49fd-970b-b2c1196e9f5d
	Total devices 6 FS bytes used 31.90MB
	devid    5 size 68.33GB used 1.01GB path /dev/cciss/c1d4
	devid    1 size 68.33GB used 1.02GB path /dev/cciss/c1d0
	devid    6 size 68.33GB used 1.01GB path /dev/cciss/c1d5
	devid    3 size 68.33GB used 2.00GB path /dev/cciss/c1d2
	devid    4 size 68.33GB used 2.00GB path /dev/cciss/c1d3
	devid    2 size 68.33GB used 1.00GB path /dev/cciss/c1d1


Software Environment: autotest client modified to run fsstress against btrfs

fs mounted as: /dev/cciss/c1d5 on /mnt type btrfs (rw,compress)

Problem Description:  fsstress forces a characteristic oops within seconds to minutes when run against a compressed btrfs file system.  The oops occurs with high reliability on both single-device and multi-device file systems.

This is the same bug Chris and I were looking at in late December.

(This bug also reproduces reliably using btrfs-unstable as of 28 Jan 09;
commit a717531942f488209dded30f6bc648167bcefa72)

oops from 2.6.29-rc2: (I/O errors at mount time not believed to be a factor, and not visible in latest btrfs-unstable test)

Jan 21 17:34:51 bl465cb kernel: [   83.424862] device fsid fd49278bbf12df2b-5d9f6e19c1b20b97 <6>devid 1 transid 5 /dev/cciss/c1d0
Jan 21 17:34:51 bl465cb kernel: [   83.428681] device fsid fd49278bbf12df2b-5d9f6e19c1b20b97 <6>devid 2 transid 5 /dev/cciss/c1d1
Jan 21 17:34:51 bl465cb kernel: [   83.432291] device fsid fd49278bbf12df2b-5d9f6e19c1b20b97 <6>devid 3 transid 5 /dev/cciss/c1d2
Jan 21 17:34:51 bl465cb kernel: [   83.435876] device fsid fd49278bbf12df2b-5d9f6e19c1b20b97 <6>devid 4 transid 5 /dev/cciss/c1d3
Jan 21 17:34:51 bl465cb kernel: [   83.439453] device fsid fd49278bbf12df2b-5d9f6e19c1b20b97 <6>devid 5 transid 5 /dev/cciss/c1d4
Jan 21 17:34:51 bl465cb kernel: [   83.443111] device fsid fd49278bbf12df2b-5d9f6e19c1b20b97 <6>devid 6 transid 5 /dev/cciss/c1d5
Jan 21 17:35:04 bl465cb kernel: [   96.456646] device fsid fd49278bbf12df2b-5d9f6e19c1b20b97 <6>devid 6 transid 9 /dev/cciss/c1d5
Jan 21 17:35:04 bl465cb kernel: [   96.480410] btrfs: use compression
Jan 21 17:36:13 bl465cb kernel: [  165.720340] end_request: I/O error, dev cciss/c1d5, sector 131072
Jan 21 17:36:13 bl465cb kernel: [  165.727463] btrfs: disabling barriers on dev /dev/cciss/c1d5
Jan 21 17:36:13 bl465cb kernel: [  165.729823] end_request: I/O error, dev cciss/c1d4, sector 131072
Jan 21 17:36:13 bl465cb kernel: [  165.732255] btrfs: disabling barriers on dev /dev/cciss/c1d4
Jan 21 17:36:13 bl465cb kernel: [  165.735461] end_request: I/O error, dev cciss/c1d3, sector 131072
Jan 21 17:36:13 bl465cb kernel: [  165.738159] btrfs: disabling barriers on dev /dev/cciss/c1d3
Jan 21 17:36:13 bl465cb kernel: [  165.740376] end_request: I/O error, dev cciss/c1d2, sector 131072
Jan 21 17:36:13 bl465cb kernel: [  165.742764] btrfs: disabling barriers on dev /dev/cciss/c1d2
Jan 21 17:36:13 bl465cb kernel: [  165.745014] end_request: I/O error, dev cciss/c1d1, sector 131072
Jan 21 17:36:13 bl465cb kernel: [  165.747855] btrfs: disabling barriers on dev /dev/cciss/c1d1
Jan 21 17:36:13 bl465cb kernel: [  165.751349] end_request: I/O error, dev cciss/c1d0, sector 131072
Jan 21 17:36:13 bl465cb kernel: [  165.756045] btrfs: disabling barriers on dev /dev/cciss/c1d0
Jan 21 18:41:20 bl465cb kernel: [ 4072.370454] stack segment: 0000 [#1] SMP 
Jan 21 18:41:20 bl465cb kernel: [ 4072.374009] last sysfs file: /sys/devices/pci0000:40/0000:40:12.0/0000:52:00.0/class
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] CPU 1 
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] Modules linked in: iptable_filter ip_tables x_tables parport_pc lp parport loop ipmi_devintf ipmi_si ipmi_msghandler pcspkr psmouse i2c_piix4 serio_raw shpchp pci_hotplug i2c_core container button ipv6 evdev usbhid ext3 hid jbd mbcache cciss scsi_mod bnx2 ehci_hcd ohci_hcd uhci_hcd usbcore thermal processor fan thermal_sys fuse
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] Pid: 5207, comm: btrfs-delalloc- Not tainted 2.6.29-rc2-enw #1
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] RIP: 0010:[<ffffffff803c8f83>]  [<ffffffff803c8f83>] fill_window+0x143/0x480
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] RSP: 0018:ffff880425013c80  EFLAGS: 00010212
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] RAX: 0000000000001000 RBX: 0000000000001000 RCX: b6e3880000000000
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] RBP: b6e3880000000000 R08: 0000000000000000 R09: 0000000000000000
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] R10: 0000000000000010 R11: ffffc200139774bc R12: 0000000000000003
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] R13: 0000000000000001 R14: 0000000000000003 R15: 000000000000d0fb
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] FS:  00007f85a58506e0(0000) GS:ffff88042eccfb00(0000) knlGS:0000000000000000
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] CR2: 00007f85a52f95f0 CR3: 00000004299c5000 CR4: 00000000000006e0
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] Process btrfs-delalloc- (pid: 5207, threadinfo ffff880425012000, task ffff88042d828610)
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] Stack:
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145]  0000000200000282 0000000000008000 b6e3880000000000 0000000000000015
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145]  ffffc20013977afc ffffc20013977000 ffff88022e427e60 00008000803ca89c
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145]  000000000000cc08 ffffffff80539700 0000000000000011 ffffc200139779b0
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145] Call Trace:
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145]  [<ffffffff803c99ae>] ? deflate_fast+0x24e/0x2e0
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145]  [<ffffffff803c9ce2>] ? zlib_deflate+0x112/0x330
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145]  [<ffffffff803719d8>] ? btrfs_zlib_compress_pages+0x158/0x3b0
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145]  [<ffffffff8034e820>] ? compress_file_range+0x3f0/0x4f0
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145]  [<ffffffff8050aa5c>] ? thread_return+0x38/0x6fc
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145]  [<ffffffff8034e94e>] ? async_cow_start+0x2e/0x50
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145]  [<ffffffff80368401>] ? worker_loop+0x61/0x160
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145]  [<ffffffff803683a0>] ? worker_loop+0x0/0x160
Jan 21 18:41:20 bl465cb kernel: [ 4072.377145]  [<ffffffff802525ab>] ? kthread+0x4b/0x80
Jan 21 18:41:20 bl465cb kernel: [ 4072.484395]  [<ffffffff8020d23a>] ? child_rip+0xa/0x20
Jan 21 18:41:20 bl465cb kernel: [ 4072.484395]  [<ffffffff80252560>] ? kthread+0x0/0x80
Jan 21 18:41:20 bl465cb kernel: [ 4072.484395]  [<ffffffff8020d230>] ? child_rip+0x0/0x20
Jan 21 18:41:20 bl465cb kernel: [ 4072.484395] Code: 8b 5c 24 78 89 5c 24 7c 81 7c 24 7c b0 15 00 00 b8 b0 15 00 00 0f 46 44 24 7c 29 44 24 7c 83 f8 0f 89 44 24 74 0f 8e ec 00 00 00 <0f> b6 45 00 0f b6 55 01 49 8d 04 07 48 8d 14 10 48 03 44 24 40 
Jan 21 18:41:20 bl465cb kernel: [ 4072.494396] RIP  [<ffffffff803c8f83>] fill_window+0x143/0x480
Jan 21 18:41:20 bl465cb kernel: [ 4072.504469]  RSP <ffff880425013c80>
Jan 21 18:41:20 bl465cb kernel: [ 4072.507517] ---[ end trace 7c7026a226679987 ]---
Jan 21 18:41:20 bl465cb kernel: [ 4072.512052] note: btrfs-delalloc-[5207] exited with preempt_count 1


Steps to reproduce:
1) mkfs
2) mount with compression enabled (-o compress)
3) run fsstress
Comment 1 Chris Mason 2009-01-30 13:27:10 UTC
Created attachment 20047
Fix

The problem is that the workload causes dirty pages past i_size to be sent down.  The compression code doesn't properly skip the work in this case, and its bad math makes it call find_get_pages on pages that don't exist.

Since I've "fixed" this once already, I'll send the patch to you for testing before I push it out this time.
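
As a rough illustration of the kind of check described above (this is not the attached patch, whose contents are not reproduced here), the sketch below clamps a dirty byte range to i_size before deciding how many pages to look up. The helper names naive_len and pages_to_compress, the inclusive "end" convention, and the 4 KiB page size are assumptions made for the example; none of this is actual btrfs code.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* assumed 4 KiB pages */

/*
 * Hypothetical "bad math": clamp the end of the range to i_size, then
 * subtract start without checking that start is still below the clamped
 * end. If the whole range lies past i_size, the unsigned subtraction
 * wraps around to a huge length.
 */
static uint64_t naive_len(uint64_t start, uint64_t end, uint64_t i_size)
{
    uint64_t actual_end = end + 1;          /* make the end exclusive */

    if (i_size < actual_end)
        actual_end = i_size;
    return actual_end - start;              /* may wrap */
}

/*
 * Hypothetical guarded version: returns the number of pages in
 * [start, end] (end inclusive) that lie below i_size, or 0 when the
 * range starts at or beyond i_size and there is nothing to compress.
 */
static uint64_t pages_to_compress(uint64_t start, uint64_t end,
                                  uint64_t i_size)
{
    uint64_t actual_end = end + 1;          /* make the end exclusive */

    assert(end >= start);
    if (i_size < actual_end)
        actual_end = i_size;
    if (start >= actual_end)
        return 0;                           /* range is past i_size */

    /* index of last page minus index of first page, plus one */
    return ((actual_end - 1) >> SKETCH_PAGE_SHIFT)
           - (start >> SKETCH_PAGE_SHIFT) + 1;
}
```

With i_size at 5000 and a dirty range of 8192..12287, naive_len wraps to a very large unsigned value, while pages_to_compress returns 0 and the caller can skip the range entirely.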
Comment 2 Eric Whitney 2009-02-03 10:10:25 UTC
After a large number of test runs using a btrfs-unstable kernel to which your patch has been applied, there's good news and (perhaps) bad news.

The good news is that the patch does appear to fix the bug: it has not occurred in over 20 test runs on four different hardware configurations.

The bad news is that another oops, which I'd seen only once in unmodified 2.6.29-rc2, now occurs with high reliability on two of those four hardware configurations (but not at all on the other two).  I've filed bug #12625 to track that issue.  I've also noticed a couple of I/O stall cases with this patch, FWIW.
Comment 3 Eric Whitney 2009-02-10 15:07:38 UTC
This bug has not appeared in 2.6.29-rc4 testing, though not all fsstress test runs on compressed btrfs filesystems succeeded.  Of eight test runs, two succeeded, four failed almost immediately with null pointer dereferences per bug #12625, and two appeared to enter an I/O stalled state midway through.

The fix in -rc4 is likely good, with at least four of eight runs either completing or making it well past the point where this bug would originally occur.
Comment 4 Eric Whitney 2009-02-24 15:39:00 UTC
This bug has not appeared in 2.6.29-rc5 or -rc6 testing, with a number of fsstress runs completing successfully on the same test configurations on which it was originally discovered.  (Other bugs have been seen, however, with more fsstress runs failing than succeeding.)

I think this bug can be marked closed.
Comment 5 Eric Whitney 2009-03-04 10:55:00 UTC
Marking bug as closed per Chris Mason.
