Bug 13369

Summary: kernel BUG at fs/ext4/inode.c:3123
Product: File System Reporter: Chris Komner (ckom26)
Component: ext4Assignee: Eric Sandeen (sandeen)
Status: RESOLVED CODE_FIX    
Severity: blocking CC: nao.nakashima, sandeen, tytso
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.30-rc6 Subsystem:
Regression: No Bisected commit-id:
Attachments: Kernel logs of the bug, had to pull the plug to get the second one since the box wouldn't reboot normally.
dmesg output of the bug
yet another snip
kernel .config

Description Chris Komner 2009-05-23 12:04:51 UTC
Being haunted by a consistent bug in 2.6.29.[234] that kept pdflush stuck at 100% cpu, I fetched and compiled 2.6.30-rc6. Now, I keep getting "kernel BUG at fs/ext4/inode.c:3123!" (usually when I fire up ktorrent). In the attached kernel logs you can see 2 different occurences of the bug.

A nearby single core cpu, low end box has not exhibited a single bug or corruption for 3 months now, since I've converted to ext4, so I tried the "nosmp" kernel boot parameter on the dual core laptop in question. Lo and behold, the bug didn't rear its ugly head.

All partitions are mounted with "nodelalloc".
Comment 1 Chris Komner 2009-05-23 12:07:40 UTC
Created attachment 21506 [details]
Kernel logs of the bug, had to pull the plug to get the second one since the box wouldn't reboot normally.
Comment 2 Eric Sandeen 2009-05-23 15:06:18 UTC
This is:
static int ext4_normal_writepage(struct page *page,
                                struct writeback_control *wbc)
{
...
        if (page_has_buffers(page)) {
                /* if page has buffers it should all be mapped
                 * and allocated. If there are not buffers attached
                 * to the page we know the page is dirty but it lost
                 * buffers. That means that at some moment in time
                 * after write_begin() / write_end() has been called
                 * all buffers have been clean and thus they must have been
                 * written at least once. So they are all mapped and we can
                 * happily proceed with mapping them and writing the page.
                 */
                BUG_ON(walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
                                        ext4_bh_unmapped_or_delay));
        }

Does ktorrent preallocate space?  You might try turning that off to see if it goes away, as a hint.  I can't look at this much today, it's my birthday and I'm going to have some non-computer fun.  ;)

If you want, you could modify the test in ext4_normal_writepage() at this spot to something like:

        if (page_has_buffers(page)) {
                /*
                 * (big comment deleted)
                 */
                if (walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
                                        ext4_bh_unmapped_or_delay)) {
                        struct buffer_head *bh;

                        bh = (struct buffer_head *)page_private(page);
                        printk("bh state 0x%x\n", bh->b_state);
                        BUG();
                }
        }

so that we'd see the actual buffer head state that was wrong.  (this assumes you have page-sized blocks; most likely true, if you have 4k blocks, which is the mkfs default)
Comment 3 Chris Komner 2009-05-23 21:40:22 UTC
Space preallocation is turned off in ktorrent. The modification didn't help much, but then again the block size is 1k. Additional info from dumpe2fs:
Filesystem volume name:   STORAGE
Last mounted on:          <not available>
Filesystem UUID:          7d9c4f5a-c7fe-4b2d-a02a-dc095ba40e7c
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              10799616
Block count:              172787076
Reserved block count:     0
Free blocks:              8744435
Free inodes:              10795146
First block:              1
Block size:               1024
Fragment size:            1024
Reserved GDT blocks:      256
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         512
Inode blocks per group:   128
Flex block group size:    16
Filesystem created:       Tue Apr 28 14:51:31 2009
Last mount time:          Sat May 23 23:11:38 2009
Last write time:          Sat May 23 23:11:38 2009
Mount count:              3
Maximum mount count:      400
Last checked:             Sat May 23 10:17:43 2009
Check interval:           34560000 (13 months, 1 week, 3 days)
Next check after:         Sun Jun 27 10:17:43 2010
Lifetime writes:          5176 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      71ca129e-0529-45c7-944c-03f3ca7ddd83
Journal backup:           inode blocks
Journal size:             32M


Oh, and happy birthday!
Comment 4 Chris Komner 2009-05-23 21:46:29 UTC
Created attachment 21510 [details]
dmesg output of the bug

kernel tree bumped to -rc6-git6
Comment 5 Theodore Tso 2009-05-24 02:52:48 UTC
So I'm glad you're using a filesystem with a 1k blocksize since it provides a good test case for us --- and we don't do enough testing for the case where the blocksize != pagesize.   But you do realize that the ext4 filesystem will be far more efficient with the larger 4k blocksize, right?

I'm not able to reproduce the problem on a filesystem with 1k blocksize running fsx, fsstress, and dbench using a 2.6.30-rc6 kernel with the full ext4 patch queue, using a netbook with an dual-core N270 Atom processor.   I'm retrying with a stock 2.6.30-rc6 kernel now...
Comment 6 Theodore Tso 2009-05-24 05:08:05 UTC
I've tried replicating using fsstress, fsx, and dbench using a stock 2.6.30-rc6 on a dual-core netbook and a 1k block filesystem; no luck.

I tried replicating using fsstress, fsx, and dbench using 2.6.30-rc6 + the ext4 patch queue on a x61s Lenovo laptop (Intel Core 2 Duo L7500), again with a 1k block filesystem.  No luck.  I'm now trying to do a bittorrent download using transmission onto the 1k block filesystem.

How easily can you reproduce the problem on your system?
Comment 7 Chris Komner 2009-05-24 12:02:05 UTC
The 1k blocksize was decided accidentally by an obsolete and forgotten /etc/mke2fs.conf, and ATM I can't back up the data and reformat 4k.

The reproduction of the problem is as easy as it gets: I just launch ktorrent and wait. Of note might be that after the bug occurs, normal reboot procedure hangs, the kernel seems to sync and remount read-only with the magic sysrq combos, but fails to reboot or poweroff and there's always some loss of data in the partition.

The latest dmesg snip follows attached.
Comment 8 Chris Komner 2009-05-24 12:03:48 UTC
Created attachment 21512 [details]
yet another snip
Comment 9 Eric Sandeen 2009-05-24 16:32:18 UTC
So the modification w/ the printk didn't print anything?  Must not have been the first bh on the page that was wrong, then, I didn't anticipate 1k blocks, I'll come up with another patch later that will print the state for all bh's on the page.

-Eric
Comment 10 Aneesh Kumar K.V 2009-05-24 18:27:55 UTC
With data=journal and blocksize= 1k I am seeing some problem explained
here

http://bugzilla.kernel.org/show_bug.cgi?id=12624

-aneesh
Comment 11 Aneesh Kumar K.V 2009-05-24 18:28:23 UTC
On Sun, May 24, 2009 at 02:52:48AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13369
> 
> 
> Theodore Tso <tytso@mit.edu> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |tytso@mit.edu
> 
> 
> 
> 
> --- Comment #5 from Theodore Tso <tytso@mit.edu>  2009-05-24 02:52:48 ---
> So I'm glad you're using a filesystem with a 1k blocksize since it provides a
> good test case for us --- and we don't do enough testing for the case where
> the
> blocksize != pagesize.   But you do realize that the ext4 filesystem will be
> far more efficient with the larger 4k blocksize, right?
> 
> I'm not able to reproduce the problem on a filesystem with 1k blocksize
> running
> fsx, fsstress, and dbench using a 2.6.30-rc6 kernel with the full ext4 patch
> queue, using a netbook with an dual-core N270 Atom processor.   I'm retrying
> with a stock 2.6.30-rc6 kernel now...
> 

It is with nodelalloc. I am also finding some unmapped buffer_heads in
nodelalloc mode with data=journal. See the latest update for bug

http://bugzilla.kernel.org/show_bug.cgi?id=12624

-aneesh
Comment 12 Aneesh Kumar K.V 2009-05-24 19:20:34 UTC
Can you try with the patch series found at

http://article.gmane.org/gmane.comp.file-systems.ext4/12962
(Threading is not working for gmane currently so the two patches
you may need are
http://thread.gmane.org/gmane.linux.file-systems/31465
http://thread.gmane.org/gmane.comp.file-systems.ext4/12960
)

I guess mmap and truncate combination can result in we missing pagefault
when we write to second block in the same page. That would mean we may
have to do block allocation in writepage.

-aneesh
Comment 13 Chris Komner 2009-05-24 20:43:22 UTC
I applied both patches, recompiled and rebooted. A few seconds after ktorrent started, I got this:

kernel BUG at fs/ext4/inode.c:3123!
invalid opcode: 0000 [#1] PREEMPT SMP 
last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
Modules linked in: nvidia(P) oss_usb oss_hdaudio osscore aes_i586 aes_generic ipt_MASQUERADE ipt_REDIRECT xt_multiport xt_tcpudp xt_conntrack xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables uvcvideo videodev v4l1_compat uhci_hcd sr_mod ohci1394 sdhci_pci sdhci ieee1394 cdrom mmc_core ehci_hcd ac button battery sg i2c_i801 usbcore i2c_core psmouse ata_piix evdev arc4 ecb iwl3945 iwlcore mac80211 led_class cfg80211 fuse dell_laptop backlight rfkill b44 dcdbas mii rtc_cmos rtc_core rtc_lib [last unloaded: soundcore]

Pid: 1060, comm: kjournald2 Tainted: P           (2.6.30-rc6-git6 #2) Vostro 1500                     
EIP: 0060:[<c01ffd13>] EFLAGS: 00010202 CPU: 1
EIP is at ext4_normal_writepage+0x183/0x190
EAX: 00000001 EBX: c28cb8e0 ECX: 00001000 EDX: 00000000
ESI: 00001000 EDI: 00000000 EBP: f756be00 ESP: f756bdd4
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process kjournald2 (pid: 1060, ti=f756a000 task=f6d33390 task.ti=f756a000)
Stack:
 00001000 00000000 c01fc0f0 8000082d 0000000c 00000000 f756bf1c 10541259
 eb4d7d30 00000002 eb4d7d30 f756bea0 c01780d7 10541259 c28cb8e0 c01789b9
 00000000 00000007 10541259 00000000 00000001 eb4d7d30 c01780c0 f74fc0a8
Call Trace:
 [<c01fc0f0>] ? ext4_bh_unmapped_or_delay+0x0/0x50
 [<c01780d7>] ? __writepage+0x17/0x60
 [<c01789b9>] ? write_cache_pages+0x1a9/0x3f0
 [<c01780c0>] ? __writepage+0x0/0x60
 [<c01ffb90>] ? ext4_normal_writepage+0x0/0x190
 [<c0178c27>] ? generic_writepages+0x27/0x50
 [<c02228d2>] ? jbd2_journal_commit_transaction+0x3b2/0x12b0
 [<c013eeaf>] ? try_to_del_timer_sync+0x4f/0x70
 [<c0227a87>] ? kjournald2+0xd7/0x230
 [<c014c430>] ? autoremove_wake_function+0x0/0x60
 [<c02279b0>] ? kjournald2+0x0/0x230
 [<c014c002>] ? kthread+0x52/0xa0
 [<c014bfb0>] ? kthread+0x0/0xa0
 [<c0103d67>] ? kernel_thread_helper+0x7/0x10
Code: 00 00 00 75 21 8b 4c 24 18 89 d8 83 c4 20 ba b0 17 20 c0 5b 5e 5f c9 e9 fc e5 fc ff f3 90 8b 4f 48 e9 b4 fe ff ff e8 6d 30 f3 ff <0f> 0b eb fe 0f 0b eb fe 90 8d 74 26 00 55 89 e5 57 56 89 ce 53 
EIP: [<c01ffd13>] ext4_normal_writepage+0x183/0x190 SS:ESP 0068:f756bdd4
---[ end trace d3629642779515bf ]---


In between, I had booted the stock archlinux kernel, version 2.6.29.4 and I couldn't reproduce the bug; instead I got these errors, which might be odd because I had just e2fsck'ed the device:
ext4: find_group_flex failed, fallback succeeded dir 572486
ext4: find_group_flex failed, fallback succeeded dir 572439
ext4: find_group_flex failed, fallback succeeded dir 572469
ext4: find_group_flex failed, fallback succeeded dir 572470
ext4: find_group_flex failed, fallback succeeded dir 572478
ext4: find_group_flex failed, fallback succeeded dir 572444
ext4: find_group_flex failed, fallback succeeded dir 572477
ext4: find_group_flex failed, fallback succeeded dir 572447
ext4: find_group_flex failed, fallback succeeded dir 572482
ext4: find_group_flex failed, fallback succeeded dir 572446

Fsck is always needed for this partition, after the inode.c bug and the mandatory cold reboot, otherwise it aborts:
ext4_abort called.
EXT4-fs error (device sda9): ext4_put_super: Couldn't clean up the journal

My kernel .config follows.
Comment 14 Chris Komner 2009-05-24 20:44:26 UTC
Created attachment 21520 [details]
kernel .config
Comment 15 Aneesh Kumar K.V 2009-05-26 13:16:29 UTC
Will you be able to try these patches. They are on top
of stable patch from the ext4 patch queue found at 

http://repo.or.cz/w/ext4-patch-queue.git

Apply the patches upto stable-boundary-undo.patch

Then apply the below five patches. Apply them in the below order.

page_cache_size.patch
1.patch
2.patch
unmap_or_unwritten
ext4_writepage

-aneesh
Comment 16 Chris Komner 2009-05-27 21:32:01 UTC
Patches applied 24 hours ago and no bugs whatsoever, so far. Thanks.
Comment 17 Eric Sandeen 2009-08-18 17:38:22 UTC
Assigning to myself just so I can close.
Comment 18 Eric Sandeen 2009-08-18 17:38:50 UTC
It sounds like this is fixed, based on comment #16.  If this shows up again, please do re-open.

Thanks,
-Eric