Bug 11688

Summary: Oops while writing on a newly ext4 formated 1TB partition on a external USB drive
Product: File System Reporter: Vedran Furač (vedranf)
Component: ext3Assignee: Jan Kara (jack)
Status: RESOLVED CODE_FIX    
Severity: high CC: vedranf
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.27-rc7 Subsystem:
Regression: --- Bisected commit-id:
Attachments: Fix for the bug in ext4_write_begin and ext4_da_write_begin

Description Vedran Furač 2008-10-02 18:38:53 UTC
Latest working kernel version: ?
Earliest failing kernel version: 2.6.27-rc7
Distribution: Debian
Hardware Environment: Maxtor external USB drive (1TB)
Software Environment: Debian
Problem Description: Oops while writing on a newly ext4 formated 1TB partition on a external USB drive

Steps to reproduce: Copy ~30GB to a ... (happened twice in a row)

Oct  3 02:50:46 pollux kernel: [ 6028.969850] ------------[ cut here ]------------
Oct  3 02:50:46 pollux kernel: [ 6028.969875] kernel BUG at fs/jbd/transaction.c:275!
Oct  3 02:50:46 pollux kernel: [ 6028.969892] invalid opcode: 0000 [#1] SMP
Oct  3 02:50:46 pollux kernel: [ 6028.969912] Modules linked in: ext4dev jbd2 crc16 fuse omfs jfs xfs reiserfs ext2 hfs hfsplus vfat fat iso$
Oct  3 02:50:46 pollux kernel: [ 6028.970425]
Oct  3 02:50:46 pollux kernel: [ 6028.970438] Pid: 5892, comm: kio_file Not tainted (2.6.27-rc7-686 #1)
Oct  3 02:50:46 pollux kernel: [ 6028.970458] EIP: 0060:[<f895bb01>] EFLAGS: 00210287 CPU: 0
Oct  3 02:50:46 pollux kernel: [ 6028.970487] EIP is at journal_start+0x2f/0xa9 [jbd]
Oct  3 02:50:46 pollux kernel: [ 6028.970505] EAX: f1c378c0 EBX: d42c33ec ECX: f72f1e00 EDX: 00000012
Oct  3 02:50:46 pollux kernel: [ 6028.970525] ESI: f717ca00 EDI: 00000022 EBP: 00000012 ESP: e963bc20
Oct  3 02:50:46 pollux kernel: [ 6028.970545]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Oct  3 02:50:46 pollux kernel: [ 6028.970564] Process kio_file (pid: 5892, ti=e963a000 task=f19e4720 task.ti=e963a000)
Oct  3 02:50:46 pollux kernel: [ 6028.970585] Stack: d42c33ec f77c7e48 f77c7e48 00000022 e963bc6c f89979aa f77c7e48 f77c7e48
Oct  3 02:50:46 pollux kernel: [ 6028.970631]        00000022 c01a143e c01884d5 f77c7e50 c0188a22 f2553e48 00000000 00000080
Oct  3 02:50:46 pollux kernel: [ 6028.970676]        00000080 c0188c3d 00000080 f6fe8e70 f26c3e50 000007d0 c035d07c 00000080
Oct  3 02:50:46 pollux kernel: [ 6028.970721] Call Trace:
Oct  3 02:50:46 pollux kernel: [ 6028.970744]  [<f89979aa>] ext3_dquot_drop+0x29/0x5d [ext3]
Oct  3 02:50:46 pollux kernel: [ 6028.970780]  [<c01a143e>] vfs_dq_drop+0x35/0x36
Oct  3 02:50:46 pollux kernel: [ 6028.970804]  [<c01884d5>] clear_inode+0x53/0xb3
Oct  3 02:50:46 pollux kernel: [ 6028.970824]  [<c0188a22>] dispose_list+0x46/0xca
Oct  3 02:50:46 pollux kernel: [ 6028.970843]  [<c0188c3d>] shrink_icache_memory+0x197/0x1bf
Oct  3 02:50:46 pollux kernel: [ 6028.970865]  [<c0161ba9>] shrink_slab+0xd3/0x132
Oct  3 02:50:46 pollux kernel: [ 6028.970887]  [<c0161e45>] try_to_free_pages+0x23d/0x347
Oct  3 02:50:46 pollux kernel: [ 6028.970907]  [<c0160c4f>] isolate_pages_global+0x0/0x42
Oct  3 02:50:46 pollux kernel: [ 6028.970928]  [<c015dc5b>] __alloc_pages_internal+0x213/0x34e
Oct  3 02:50:46 pollux kernel: [ 6028.970950]  [<c0159472>] __grab_cache_page+0x2d/0x60
Oct  3 02:50:46 pollux kernel: [ 6028.970970]  [<f96a765a>] ext4_da_write_begin+0x58/0xe9 [ext4dev]
Oct  3 02:50:46 pollux kernel: [ 6028.971010]  [<c0159bc8>] generic_file_buffered_write+0xef/0x556
Oct  3 02:50:46 pollux kernel: [ 6028.971036]  [<c01be51e>] cap_inode_need_killpriv+0x25/0x35
Oct  3 02:50:46 pollux kernel: [ 6028.971060]  [<c01bf8a9>] security_inode_need_killpriv+0xc/0xd
Oct  3 02:50:46 pollux kernel: [ 6028.971082]  [<c0158d60>] file_remove_suid+0x18/0x47
Oct  3 02:50:46 pollux kernel: [ 6028.971105]  [<c015a69b>] __generic_file_aio_write_nolock+0x465/0x4c8
Oct  3 02:50:46 pollux kernel: [ 6028.971127]  [<c0118a91>] kunmap_atomic+0x67/0x8a
Oct  3 02:50:46 pollux kernel: [ 6028.971150]  [<f898ea73>] ext3_readpages+0x0/0x15 [ext3]
Oct  3 02:50:46 pollux kernel: [ 6028.971178]  [<c02c19d3>] _spin_lock+0x5/0x7
Oct  3 02:50:46 pollux kernel: [ 6028.971200]  [<c018b69b>] mnt_drop_write+0x5a/0xbb
Oct  3 02:50:46 pollux kernel: [ 6028.971221]  [<c015a750>] generic_file_aio_write+0x52/0xa9
Oct  3 02:50:46 pollux kernel: [ 6028.971242]  [<f96a2055>] ext4_file_write+0xa7/0x11d [ext4dev]
Oct  3 02:50:46 pollux kernel: [ 6028.971273]  [<c017817c>] do_sync_write+0xbf/0x100
Oct  3 02:50:46 pollux kernel: [ 6028.971295]  [<c01d774f>] __freed_request+0x67/0xd4
Oct  3 02:50:46 pollux kernel: [ 6028.971317]  [<c01331fc>] autoremove_wake_function+0x0/0x2d
Oct  3 02:50:46 pollux kernel: [ 6028.971343]  [<f8898068>] ide_intr+0x198/0x1a3 [ide_core]
Oct  3 02:50:46 pollux kernel: [ 6028.971381]  [<c01bf8d0>] security_file_permission+0xc/0xd
Oct  3 02:50:46 pollux kernel: [ 6028.971403]  [<c01780bd>] do_sync_write+0x0/0x100
Oct  3 02:50:46 pollux kernel: [ 6028.971422]  [<c01788e6>] vfs_write+0x83/0x120
Oct  3 02:50:46 pollux kernel: [ 6028.971442]  [<c0178d99>] sys_write+0x3c/0x63
Oct  3 02:50:46 pollux kernel: [ 6028.971461]  [<c01037f3>] sysenter_do_call+0x12/0x2f
Oct  3 02:50:46 pollux kernel: [ 6028.971482]  =======================
Oct  3 02:50:46 pollux kernel: [ 6028.971496] Code: 56 89 c6 53 bb e2 ff ff ff 83 ec 04 64 a1 00 00 3c c0 8b 80 60 03 00 00 85 f6 89 04 24 7$
Oct  3 02:50:46 pollux kernel: [ 6028.971668] EIP: [<f895bb01>] journal_start+0x2f/0xa9 [jbd] SS:ESP 0068:e963bc20
Oct  3 02:50:46 pollux kernel: [ 6028.971940] ---[ end trace 8adcfbc820a80e03 ]---
Comment 1 Vedran Furač 2008-10-08 17:43:22 UTC
More readable version:

[ 5015.394427] ------------[ cut here ]------------
[ 5015.394434] kernel BUG at fs/jbd/transaction.c:275!
[ 5015.394436] invalid opcode: 0000 [#1] SMP
[ 5015.394439] Modules linked in: isofs zlib_inflate udf crc_itu_t ext4dev jbd2 crc16 nls_utf8 ntfs nls_base radeon drm tun autofs4 ipv6 video output ac battery nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc quota_v2 fuse it87 hwmon_vid eeprom tuner_simple tuner_types tuner snd_cs4232 snd_opl3_lib snd_hwdep snd_cs4231_lib snd_mpu401 snd_mpu401_uart ns558 snd_seq_midi gameport snd_rawmidi bttv snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm_oss rtc_cmos snd_mixer_oss videodev v4l1_compat ir_common rtc_core compat_ioctl32 i2c_algo_bit rtc_lib v4l2_common videobuf_dma_sg videobuf_core snd_bt87x btcx_risc tveeprom snd_pcm snd_seq_dummy serio_raw pcspkr psmouse shpchp snd_seq_oss snd_seq_midi_event snd_seq snd_timer snd_seq_device pci_hotplug snd soundcore snd_page_alloc parport_pc parport button i2c_nforce2 nvidia_agp i2c_core agpgart evdev joydev sd_mod ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod ide_cd_mod cdrom ide_disk ata_generic libata dock usbhid hid ff_memless 8139cp amd74xx usb_storage scsi_mod floppy ide_pci_generic 8139too mii ide_core ohci_hcd ehci_hcd usbcore thermal processor fan thermal_sys
[ 5015.394510]
[ 5015.394514] Pid: 6978, comm: kio_file Tainted: G        W (2.6.27-rc7-686 #1)
[ 5015.394517] EIP: 0060:[<f90b5b01>] EFLAGS: 00210287 CPU: 0
[ 5015.394530] EIP is at journal_start+0x2f/0xa9 [jbd]
[ 5015.394533] EAX: f4591160 EBX: f5d413ec ECX: f70a0800 EDX: 00000012
[ 5015.394536] ESI: f714d000 EDI: 0000004a EBP: 00000012 ESP: f71d5c20
[ 5015.394539]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 5015.394542] Process kio_file (pid: 6978, ti=f71d4000 task=f34fcd00 task.ti=f71d4000)
[ 5015.394544] Stack: f5d413ec f77a0c9c f77a0c9c 0000004a f71d5c6c f911d9aa f77a0c9c f77a0c9c
[ 5015.394551]        0000004a c01a143e c01884d5 f77a0ca4 c0188a22 f5c824ec 00000000 00000080
[ 5015.394557]        00000080 c0188c3d 00000080 f7688e50 f0aac76c 00000c1c c035d07c 00000080
[ 5015.394563] Call Trace:
[ 5015.394566]  [<f911d9aa>] ext3_dquot_drop+0x29/0x5d [ext3]
[ 5015.394585]  [<c01a143e>] vfs_dq_drop+0x35/0x36
[ 5015.394594]  [<c01884d5>] clear_inode+0x53/0xb3
[ 5015.394598]  [<c0188a22>] dispose_list+0x46/0xca
[ 5015.394602]  [<c0188c3d>] shrink_icache_memory+0x197/0x1bf
[ 5015.394605]  [<c0161ba9>] shrink_slab+0xd3/0x132
[ 5015.394612]  [<c0161e45>] try_to_free_pages+0x23d/0x347
[ 5015.394616]  [<c0160c4f>] isolate_pages_global+0x0/0x42
[ 5015.394620]  [<c015dc5b>] __alloc_pages_internal+0x213/0x34e
[ 5015.394624]  [<c0159472>] __grab_cache_page+0x2d/0x60
[ 5015.394627]  [<f958a65a>] ext4_da_write_begin+0x58/0xe9 [ext4dev]
[ 5015.394649]  [<c0159bc8>] generic_file_buffered_write+0xef/0x556
[ 5015.394657]  [<c01be51e>] cap_inode_need_killpriv+0x25/0x35
[ 5015.394663]  [<c01bf8a9>] security_inode_need_killpriv+0xc/0xd
[ 5015.394668]  [<c0158d60>] file_remove_suid+0x18/0x47
[ 5015.394674]  [<c015a69b>] __generic_file_aio_write_nolock+0x465/0x4c8
[ 5015.394678]  [<c0118a91>] kunmap_atomic+0x67/0x8a
[ 5015.394684]  [<c02c19d3>] _spin_lock+0x5/0x7
[ 5015.394691]  [<c0188f44>] touch_atime+0xe/0xce
[ 5015.394694]  [<c015a750>] generic_file_aio_write+0x52/0xa9
[ 5015.394698]  [<f9585055>] ext4_file_write+0xa7/0x11d [ext4dev]
[ 5015.394711]  [<c017817c>] do_sync_write+0xbf/0x100
[ 5015.394717]  [<c01331fc>] autoremove_wake_function+0x0/0x2d
[ 5015.394724]  [<c01bf8d0>] security_file_permission+0xc/0xd
[ 5015.394728]  [<c01780bd>] do_sync_write+0x0/0x100
[ 5015.394732]  [<c01788e6>] vfs_write+0x83/0x120
[ 5015.394735]  [<c0178d99>] sys_write+0x3c/0x63
[ 5015.394739]  [<c01037f3>] sysenter_do_call+0x12/0x2f
[ 5015.394744]  =======================
[ 5015.394746] Code: 56 89 c6 53 bb e2 ff ff ff 83 ec 04 64 a1 00 00 3c c0 8b 80 60 03 00 00 85 f6 89 04 24 74 7e 85 c0 74 11 89 c3 8b 00 39 30 74 04 <0f> 0b eb fe ff 43 08 eb 69 a1 68 ee 0b f9 ba 50 00 00 00 bb f4
[ 5015.394776] EIP: [<f90b5b01>] journal_start+0x2f/0xa9 [jbd] SS:ESP 0068:f71d5c20
[ 5015.394789] ---[ end trace c6daa70af0b08ca0 ]---
Comment 2 Theodore Tso 2008-11-06 12:27:45 UTC
Can you replicate this on a newer kernel?  This looks like a failure in ext3/jbd code, actually, triggered when the kernel was trying to free pages to get back some memory.
Comment 3 Theodore Tso 2009-01-17 16:24:07 UTC
Hi, any more information about this bug? How much memory do you have?

What ext3 filesystems do you have mounted?  The oops is in ext3 code:


handle_t *journal_start(journal_t *journal, int nblocks)
{
	handle_t *handle = journal_current_handle();
	int err;

	if (!journal)
		return ERR_PTR(-EROFS);

	if (handle) {
		J_ASSERT(handle->h_transaction->t_journal == journal); <----
		handle->h_ref++;
		return handle;
	}

It seems to be related ext3's quota code -- and again it seems to be related to an attempt to free memory when the system is under memory pressure.  This looks very much like a quota and/or ext3 bug.

 
Comment 4 Vedran Furač 2009-01-18 15:39:05 UTC
Sure, I'm waiting for 2.6.28 to enter debian so I can test. I have 1.7GB of memory, of that more than 1GB is free (excluding cache). All other filesystems are ext3 including the source (ext3->ext4). Quota is enabled but not set for any user. I disabled it (init.d/quota stop) and tried writing (ext3->ext4) again and oops didn't happen. Will test more, but this could be the issue. What I don't understand is why it doesn't happen when I'm writing to ext3.
Comment 5 Theodore Tso 2009-01-18 16:56:49 UTC
Ext4's delayed allocation will cause more memory pressure, because blocks aren't as aggressively pushed out to disk.   How much memory do you have in your system, how many ext3/ext4 filesystems do you have mounted, how many of them are active (and how active) versus totally quiescent, and how big is the journal for those filesystems?
Comment 6 Vedran Furač 2009-01-18 17:18:15 UTC
                total       used       free     shared    buffers     cached
Mem:       1773      1730       43        0            23           1185
-/+ buffers/cache:   521       1252
Swap:            0          0          0

/dev/mapper/vg-root on / type ext3 (rw,errors=remount-ro)
/dev/mapper/vg-home on /home type ext3 (rw,usrquota)
/dev/mapper/vg-var on /var type ext3 (rw)
/dev/sda3 on /media/disk type ext4dev (rw,nosuid,nodev,uhelper=hal)
/dev/sda2 on /media/disk-1 type ext3 (rw,nosuid,nodev,uhelper=hal,data=ordered)

Copying from /home to /media/disk so only these are active if I understand you correctly. 

# debugfs -R "stat <8>" /dev/mapper/vg-home                               
debugfs 1.41.3 (12-Oct-2008)
Inode: 8   Type: regular    Mode:  0600   Flags: 0x0
Generation: 0    Version: 0x00000000
User:     0   Group:     0   Size: 134217728
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 262416
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x487a96ce -- Mon Jul 14 01:59:10 2008
atime: 0x00000000 -- Thu Jan  1 01:00:00 1970
mtime: 0x487a96ce -- Mon Jul 14 01:59:10 2008
Size of extra inode fields: 0

# debugfs -R "stat <8>" /dev/sda3                                           
debugfs 1.41.3 (12-Oct-2008)
Inode: 8   Type: regular    Mode:  0600   Flags: 0x80000
Generation: 0    Version: 0x00000000
User:     0   Group:     0   Size: 134217728
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 262144
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x48ee6d44 -- Thu Oct  9 22:44:52 2008
atime: 0x00000000 -- Thu Jan  1 01:00:00 1970
mtime: 0x48ee6d44 -- Thu Oct  9 22:44:52 2008
Size of extra inode fields: 0
Comment 7 Theodore Tso 2009-01-18 18:28:12 UTC
So you only have 2 megabytes of memory....  Yeah, this sounds like a low-memory pressure situation.   Ext4's delayed allocation will definitely use more memory, and it looks like you are tickling an ext3 memory pressure bug.   Is there a reason why you aren't mounting your filesystems with noatime?   That's undoubtedly responsible for the ext3 journal activity, and I suspect mounting /home with noatime will make the problem go away.  It might also be interesting disabling quotas on /home, to see if the problem goes away or moves to something else.

What is happening is that when ext4 starts writing, it needs memory for the page cache.  To get that space, the VM system is trying to get space, so it calls among other things, shrink_icache_memory, to shrink the inode cache, which then calls clear_inode, which ended up calling ext3_dquot_drop, through the quota code.  Ext3_dquot_drop() calls journal_start(), which then trips over a sanity check.   In any case, given what you've told me, I'm even more convinced this is an ext3 and/or a quota system bug, which is getting triggered to severe memory pressure caused by ext4 and the fact that you have a very tiny amount of memory.
Comment 8 Vedran Furač 2009-01-19 00:50:56 UTC
Errr, that was output of $ free -m, so it's ~1.7GB of memory.
Regarding noatime... I didn't change debian defaults. 
Disabling quotas (/etc/init.d/quota stop) seems to fix the problem. I copied some 70GB and oops didn't happen.

I need to test it more,  but I'm 90% sure it's quota related bug. You can assign this bug to someone who is in charge of ext3 and quota.
Comment 9 Jan Kara 2009-01-20 09:18:42 UTC
Ted, I've looked into this and it is a bug in ext4. It should call grab_cache_page_write_begin() with AOP_FLAG_NOFS since it cannot handle recursion into the filesystem for freeing memory. The problem is that at the time grab_cache_page_write_begin() is called, a transaction is already started - this is what quota code complains about (because it also needs to start a transaction but current->journal_info is already occupied by ext4).

If we wanted to avoid AOP_FLAG_NOFS we could get the page, start a transaction, lock and check the page but that's going to be a bit nasty.
Comment 10 Jan Kara 2009-02-05 08:30:38 UTC
Created attachment 20124 [details]
Fix for the bug in ext4_write_begin and ext4_da_write_begin

This patch should fix the problem. Testing welcome ;)
Comment 11 Jan Kara 2009-02-24 08:22:03 UTC
OK, the patch was merged into ext4-tree. Closing the bug.