Bug 60786

Summary: xfstest generic/068 failure - kernel BUG at fs/jbd2/transaction.c:2143!
Product: File System Reporter: Eric Whitney (enwlinux)
Component: ext4Assignee: fs_ext4 (fs_ext4)
Status: RESOLVED CODE_FIX    
Severity: normal CC: tytso
Priority: P1    
Hardware: ARM   
OS: Linux   
Kernel Version: 3.11-rc6 Subsystem:
Regression: No Bisected commit-id:
Attachments: kernel trace from Pandaboard running 3.11-rc6

Description Eric Whitney 2013-08-23 23:51:09 UTC
Created attachment 107296 [details]
kernel trace from Pandaboard running 3.11-rc6

Xfstest generic/068 fails roughly 40% of the time when run on a Pandaboard ES (ARM) against a test filesystem mounted with the data=journal option.  In this case, an assertion fails in the kernel: "kernel BUG at fs/jbd2/transaction.c:2143!" (full trace attached).  This is a long-standing bug which may have become more visible in recent releases;  it's now become a notable annoyance during routine regression testing.  Recovering from the bug requires rebooting and power-cycling the Pandaboard.

The same bug has been seen on x86-64 in previous releases / release candidates, but appears to occur at much lower frequency.  It did not occur once in 20 trials on an x86-64 KVM guest configured similarly to the Pandaboard test system running 3.11-rc6 (although another BUG, reported separately, did).

It looks like Jan Kara may have made the most recent comments related to this and another similar bug: http://www.spinics.net/lists/linux-ext4/msg36858.html.
(Note that the statement currently at line 2143 in transaction.c was at line 1986 in 3.8.)

Test system hardware configuration:

Pandaboard ES (2 core TI OMAP4 ARM CPU, 1 GB memory)
1 TB Western Digital SATA III disk in USB2-attached external enclosure
Test file systems located in three 5.37 GB partitions on USB-attached disk


Test system software configuration:

Mainline kernel 3.11-rc6 on Ubuntu 12.10 base
e2fsprogs, master branch 93061ea0f4d9f94579d09480f17dd47169c82081
xfsprogs, master branch 20526770e76013587c6148fef9e30ac85332fa65
xfstests, master branch 54d6adf7c98500a177f0507c5905ed6dfb2fa888
xfstests-bld, master branch 723c23e1b2756b911dbb50bbd08c0afa5b836e3b
Comment 1 Theodore Tso 2014-03-12 20:44:05 UTC
As I mentioned on the list, I'm able to reproduce this very easily these days by running xfstests generic/068 with data=journal.

With some additional debugging code, it looks like this is a race between writeback and truncate.

[   36.049876] jbd2_journal_dirty_metadata: vdc-8: bad jh for block 66150: transaction (f545d980, 55), jh->b_transaction (  (null), 0), jh->b_next_transaction (  (null), 0), jlist 0
[   36.049892] ------------[ cut here ]------------
[   36.049894] kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2147!
[   36.049896] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[   36.049897] Modules linked in:
[   36.049899] CPU: 0 PID: 3127 Comm: fstest Not tainted 3.14.0-rc2-00026-gaa2b10b-dirty #1679
[   36.049900] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   36.049901] task: f4b6a910 ti: f5bfc000 task.ti: f5bfc000
[   36.049902] EIP: 0060:[<c02e1421>] EFLAGS: 00010206 CPU: 0
[   36.049906] EIP is at jbd2_journal_invalidatepage+0x208/0x279
[   36.049907] EAX: 00e2c025 EBX: f45cc4c8 ECX: 00000000 EDX: 00000001
[   36.049908] ESI: f221e800 EDI: f7817f1c EBP: f5bfdcdc ESP: f5bfdca8
[   36.049908]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[   36.049909] CR0: 80050033 CR2: b759e000 CR3: 3484a000 CR4: 000006f0
[   36.049915] Stack:
[   36.049917]  00000001 f45cc4c8 00001000 f45cc4c8 00001000 00000000 00000001 00000000
[   36.049919]  f221eb54 f221e814 c02a395d f7817f1c 00001000 f5bfdcfc c02a3955 00001000
[   36.049921]  f221e800 00000000 c02a395d c025bd12 f7817f1c f5bfdd04 c02a396a f5bfdd14
[   36.049921] Call Trace:
[   36.049925]  [<c02a395d>] ? __ext4_journalled_invalidatepage+0xc4/0xc4
[   36.049927]  [<c02a3955>] __ext4_journalled_invalidatepage+0xbc/0xc4
[   36.049928]  [<c02a395d>] ? __ext4_journalled_invalidatepage+0xc4/0xc4
[   36.049932]  [<c025bd12>] ? buffer_check_dirty_writeback+0x74/0x74
[   36.049933]  [<c02a396a>] ext4_journalled_invalidatepage+0xd/0x22
[   36.049936]  [<c020a7f6>] do_invalidatepage+0x22/0x26
[   36.049937]  [<c020a849>] truncate_inode_page+0x4f/0x77
[   36.049939]  [<c020a9aa>] truncate_inode_pages_range+0x108/0x2b7
[   36.049940]  [<c020ab6a>] truncate_inode_pages+0x11/0x15
[   36.049942]  [<c020abc5>] truncate_pagecache+0x57/0x73
[   36.049943]  [<c02a962c>] ext4_setattr+0x403/0x4a9
[   36.049946]  [<c024b9d3>] notify_change+0x1b7/0x295
[   36.049949]  [<c0236693>] do_truncate+0x65/0x84
[   36.049951]  [<c023654f>] ? do_dentry_open+0x157/0x1f4
[   36.049953]  [<c0241cbf>] do_last+0x795/0x87f
[   36.049954]  [<c0241fa2>] path_openat+0x1f9/0x426
[   36.049958]  [<c018fdcd>] ? sched_clock_cpu+0xc6/0xe7
[   36.049959]  [<c0243000>] do_filp_open+0x2c/0x65
[   36.049962]  [<c0237098>] do_sys_open+0x59/0xce
[   36.049965]  [<c07b81a4>] ? setup_singlestep+0x9/0xf0
[   36.049967]  [<c023712a>] SyS_open+0x1d/0x1f
[   36.049968]  [<c07bc846>] sysenter_do_call+0x12/0x38
[   36.049981] Code: e0 ff ff 8b 45 f0 e8 26 50 4d 00 8b 55 cc eb 05 ba 01 00 00 00 89 d8 89 55 cc e8 61 de ff ff 8b 03 8b 55 cc a9 00 00 40 00 74 02 <0f> 0b f0 80 23 df f0 80 23 f7 f0 80 23 bf f0 80 63 01 fd f0 80
[   36.049984] EIP: [<c02e1421>] jbd2_journal_invalidatepage+0x208/0x279 SS:ESP 0068:f5bfdca8
[   36.049986] ---[ end trace bc4c15dc0d962c6e ]---
[   36.115606] ------------[ cut here ]------------
[   36.116785] WARNING: CPU: 1 PID: 714 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1359 jbd2_journal_dirty_metadata+0x16c/0x1d0()
[   36.119764] Modules linked in:
[   36.120601] CPU: 1 PID: 714 Comm: kworker/u8:2 Tainted: G      D      3.14.0-rc2-00026-gaa2b10b-dirty #1679
[   36.123024] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   36.124502] Workqueue: writeback bdi_writeback_workfn (flush-254:32)
[   36.126168]  c01aa168 00000000 f505db80 c07af560 00000000 f505db98 c0168591 c02e0fc6
[   36.128457]  f46044b0 f545d980 f221eb10 f505dba8 c016861b 00000009 00000000 f505dbc8
[   36.130665]  c02e0fc6 00000000 f221e800 f45cc4c8 f45cc4c8 f545c7d8 00001000 f505dbe4
[   36.132851] Call Trace:
[   36.133595]  [<c01aa168>] ? console_unlock+0x355/0x382
[   36.134886]  [<c07af560>] dump_stack+0x48/0x60
[   36.136018]  [<c0168591>] warn_slowpath_common+0x66/0x7d
[   36.137387]  [<c02e0fc6>] ? jbd2_journal_dirty_metadata+0x16c/0x1d0
[   36.138919]  [<c016861b>] warn_slowpath_null+0x14/0x18
[   36.139876]  [<c02e0fc6>] jbd2_journal_dirty_metadata+0x16c/0x1d0
[   36.140812]  [<c02c6c4c>] __ext4_handle_dirty_metadata+0xd4/0x180
[   36.142349]  [<c02a3a46>] write_end_fn+0x3b/0x52
[   36.143540]  [<c02a55eb>] ext4_walk_page_buffers+0x4e/0x6a
[   36.144918]  [<c02a648e>] ext4_writepage+0x2f8/0x363
[   36.146172]  [<c02a3a0b>] ? ext4_bh_delay_or_unwritten+0x21/0x21
[   36.147709]  [<c02a397f>] ? ext4_journalled_invalidatepage+0x22/0x22
[   36.149307]  [<c02071a0>] ? dec_bdi_stat+0x46/0x4d
[   36.150539]  [<c02a64f9>] ? ext4_writepage+0x363/0x363
[   36.151829]  [<c02a6509>] __writepage+0x10/0x2e
[   36.152974]  [<c0207807>] write_cache_pages+0x1f8/0x2f7
[   36.154325]  [<c02a64f9>] ? ext4_writepage+0x363/0x363
[   36.155621]  [<c02a77e5>] ext4_writepages+0xb7/0x513
[   36.156894]  [<c01383e7>] ? paravirt_sched_clock+0x9/0xd
[   36.158224]  [<c0138b9f>] ? sched_clock+0x9/0xc
[   36.159371]  [<c018fac4>] ? sched_clock_local+0x11/0xfa
[   36.160714]  [<c018fe38>] ? local_clock+0x18/0x22
[   36.161911]  [<c019cb8a>] ? lock_release_holdtime.part.26+0x65/0x72
[   36.163524]  [<c0208769>] do_writepages+0x1c/0x29
[   36.164711]  [<c0255ab4>] __writeback_single_inode+0x78/0x2d6
[   36.166155]  [<c0256859>] writeback_sb_inodes+0x19c/0x2ee
[   36.167536]  [<c0256a17>] __writeback_inodes_wb+0x6c/0x95
[   36.168881]  [<c0256b68>] wb_writeback+0x128/0x2e7
[   36.169773]  [<c019e523>] ? trace_hardirqs_on+0xb/0xd
[   36.170738]  [<c0256ec1>] bdi_writeback_workfn+0x85/0x31f
[   36.172055]  [<c017cf45>] ? process_one_work+0x149/0x3ad
[   36.173379]  [<c017cfdf>] process_one_work+0x1e3/0x3ad
[   36.174670]  [<c017da00>] worker_thread+0x12c/0x1c9
[   36.175873]  [<c017d8d4>] ? rescuer_thread+0x221/0x221
[   36.177195]  [<c01823ba>] kthread+0x93/0x98
[   36.178239]  [<c019e523>] ? trace_hardirqs_on+0xb/0xd
[   36.179514]  [<c07bc7b7>] ret_from_kernel_thread+0x1b/0x28
[   36.180921]  [<c0182327>] ? __kthread_parkme+0x55/0x55
[   36.182206] ---[ end trace bc4c15dc0d962c6f ]---
[   36.183387] ------------[ cut here ]------------
[   36.184568] WARNING: CPU: 1 PID: 714 at /usr/projects/linux/ext4/fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0xf6/0x180()
[   36.187507] Modules linked in:
[   36.188317] CPU: 1 PID: 714 Comm: kworker/u8:2 Tainted: G      D W    3.14.0-rc2-00026-gaa2b10b-dirty #1679
[   36.190760] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   36.192191] Workqueue: writeback bdi_writeback_workfn (flush-254:32)
[   36.193870]  c01aa168 00000000 f505dba0 c07af560 00000000 f505dbb8 c0168591 c02c6c6e
[   36.196040]  f45cc4c8 f545c7d8 ffffffea f505dbc8 c016861b 00000009 00000000 f505dbe4
[   36.198229]  c02c6c6e 0000040f c07f8cb4 f45cc4c8 f545c7d8 00001000 f505dbfc c02a3a46
[   36.199931] Call Trace:
[   36.200322]  [<c01aa168>] ? console_unlock+0x355/0x382
[   36.201518]  [<c07af560>] dump_stack+0x48/0x60
[   36.202613]  [<c0168591>] warn_slowpath_common+0x66/0x7d
[   36.203972]  [<c02c6c6e>] ? __ext4_handle_dirty_metadata+0xf6/0x180
[   36.205496]  [<c016861b>] warn_slowpath_null+0x14/0x18
[   36.206773]  [<c02c6c6e>] __ext4_handle_dirty_metadata+0xf6/0x180
[   36.208269]  [<c02a3a46>] write_end_fn+0x3b/0x52
[   36.209423]  [<c02a55eb>] ext4_walk_page_buffers+0x4e/0x6a
[   36.210807]  [<c02a648e>] ext4_writepage+0x2f8/0x363
[   36.212023]  [<c02a3a0b>] ? ext4_bh_delay_or_unwritten+0x21/0x21
[   36.213519]  [<c02a397f>] ? ext4_journalled_invalidatepage+0x22/0x22
[   36.215066]  [<c02071a0>] ? dec_bdi_stat+0x46/0x4d
[   36.216246]  [<c02a64f9>] ? ext4_writepage+0x363/0x363
[   36.217573]  [<c02a6509>] __writepage+0x10/0x2e
[   36.218692]  [<c0207807>] write_cache_pages+0x1f8/0x2f7
[   36.219975]  [<c02a64f9>] ? ext4_writepage+0x363/0x363
[   36.221262]  [<c02a77e5>] ext4_writepages+0xb7/0x513
[   36.222472]  [<c01383e7>] ? paravirt_sched_clock+0x9/0xd
[   36.223823]  [<c0138b9f>] ? sched_clock+0x9/0xc
[   36.224935]  [<c018fac4>] ? sched_clock_local+0x11/0xfa
[   36.226215]  [<c018fe38>] ? local_clock+0x18/0x22
[   36.227397]  [<c019cb8a>] ? lock_release_holdtime.part.26+0x65/0x72
[   36.228919]  [<c0208769>] do_writepages+0x1c/0x29
[   36.229744]  [<c0255ab4>] __writeback_single_inode+0x78/0x2d6
[   36.230848]  [<c0256859>] writeback_sb_inodes+0x19c/0x2ee
[   36.232194]  [<c0256a17>] __writeback_inodes_wb+0x6c/0x95
[   36.233565]  [<c0256b68>] wb_writeback+0x128/0x2e7
[   36.234734]  [<c019e523>] ? trace_hardirqs_on+0xb/0xd
[   36.235968]  [<c0256ec1>] bdi_writeback_workfn+0x85/0x31f
[   36.237337]  [<c017cf45>] ? process_one_work+0x149/0x3ad
[   36.238632]  [<c017cfdf>] process_one_work+0x1e3/0x3ad
[   36.239896]  [<c017da00>] worker_thread+0x12c/0x1c9
[   36.241120]  [<c017d8d4>] ? rescuer_thread+0x221/0x221
[   36.242395]  [<c01823ba>] kthread+0x93/0x98
[   36.243467]  [<c019e523>] ? trace_hardirqs_on+0xb/0xd
[   36.244738]  [<c07bc7b7>] ret_from_kernel_thread+0x1b/0x28
[   36.246084]  [<c0182327>] ? __kthread_parkme+0x55/0x55
[   36.247369] ---[ end trace bc4c15dc0d962c70 ]---
[   36.248505] EXT4-fs: write_end_fn:1039: aborting transaction: error 22 in __ext4_handle_dirty_metadata
[   36.250813] EXT4: jbd2_journal_dirty_metadata failed: handle type 2 started at line 1753, credits 9/8, errcode -22
[   36.253304] EXT4-fs error (device vdc) in __ext4_journalled_writepage:1776: error 22
Comment 2 Eric Whitney 2015-07-22 18:32:22 UTC
A patch included in 4.2-rc1 has resolved this problem: "ext4: fix race between truncate and __ext4_journalled_writepage()" (bdf96838aea6)

The method used to force the bug to occur in the past was to make repeated runs of generic/068 on a test filesystem mounted with the data=journal option.  The test system would typically hang with a few tens of runs at the most.

1000 test runs completed normally both on x86_64 and 32 bit ARM running 4.2-rc1 kernels.  Regular regression runs containing generic/068 have also completed normally using 4.2-rc2 and -rc3 kernels on the same architectures.

Bugzilla #60788 describes the same bug.
Comment 3 Eric Whitney 2015-07-22 18:37:27 UTC
*** Bug 60788 has been marked as a duplicate of this bug. ***