Bug 34642

Summary: kjournald failed
Product: File System Reporter: O01eg (o01eg)
Component: OtherAssignee: Jan Kara (jack)
Status: RESOLVED CODE_FIX    
Severity: normal CC: akpm, alan, kernel, olaf
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.36-3.2 Subsystem:
Regression: No Bisected commit-id:
Attachments: /proc/config.gz
Debugging patch for assertion failure in journal_commit_transaction
Patch fixing assertion failure in JBD

Description O01eg 2011-05-06 20:24:18 UTC
Created attachment 56882 [details]
/proc/config.gz

Sometimes I get such messages in kernel:

May  6 23:43:55 o01eg kernel: [  495.167733] ------------[ cut here ]------------
May  6 23:43:55 o01eg kernel: [  495.167738] kernel BUG at fs/jbd/commit.c:529!
May  6 23:43:55 o01eg kernel: [  495.167741] invalid opcode: 0000 [#1] PREEMPT SMP 
May  6 23:43:55 o01eg kernel: [  495.167746] last sysfs file: /sys/devices/platform/it87.656/temp2_input
May  6 23:43:55 o01eg kernel: [  495.167749] CPU 2 
May  6 23:43:55 o01eg kernel: [  495.167750] Modules linked in: ppp_deflate zlib_deflate bsd_comp ppp_async option usb_wwan oss_hdaudio osscore vhba vboxnetadp vboxnetflt vboxdrv tun nvidiafb ppdev parport_pc fb_ddc parport vgastate
May  6 23:43:55 o01eg kernel: [  495.167770] 
May  6 23:43:55 o01eg kernel: [  495.167774] Pid: 1909, comm: kjournald Not tainted 2.6.36-gentoo-r5 #1 P35T-DS3P/P35T-DS3P
May  6 23:43:55 o01eg kernel: [  495.167777] RIP: 0010:[<ffffffff811788a3>]  [<ffffffff811788a3>] journal_commit_transaction+0x94d/0x1262
May  6 23:43:55 o01eg kernel: [  495.167787] RSP: 0018:ffff88007c57fd30  EFLAGS: 00010202
May  6 23:43:55 o01eg kernel: [  495.167790] RAX: 0000000000000023 RBX: ffff880071ac5bc0 RCX: ffff88007c57e010
May  6 23:43:55 o01eg kernel: [  495.167794] RDX: 0000000000000100 RSI: ffff88007f46c468 RDI: ffff88007e6a6c24
May  6 23:43:55 o01eg kernel: [  495.167797] RBP: ffff88007c57fe50 R08: 0000000000000000 R09: ffff880001b0d560
May  6 23:43:55 o01eg kernel: [  495.167800] R10: 0000000000000002 R11: 0000000000000002 R12: ffff88007e6a6c00
May  6 23:43:55 o01eg kernel: [  495.167803] R13: ffff88007e6a6d4c R14: ffff88007eac3000 R15: 0000000000000000
May  6 23:43:55 o01eg kernel: [  495.167807] FS:  0000000000000000(0000) GS:ffff880001b00000(0000) knlGS:0000000000000000
May  6 23:43:55 o01eg kernel: [  495.167811] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
May  6 23:43:55 o01eg kernel: [  495.167814] CR2: 00000000015c9660 CR3: 000000006549b000 CR4: 00000000000006e0
May  6 23:43:55 o01eg kernel: [  495.167817] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May  6 23:43:55 o01eg kernel: [  495.167820] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May  6 23:43:55 o01eg kernel: [  495.167824] Process kjournald (pid: 1909, threadinfo ffff88007c57e000, task ffff88007e7fc7b0)
May  6 23:43:55 o01eg kernel: [  495.167827] Stack:
May  6 23:43:55 o01eg kernel: [  495.167829]  ffff880001b12540 ffff88007f8c47b0 ffff88007c57e010 000000734a4bcd0b
May  6 23:43:55 o01eg kernel: [  495.167833] <0> ffff88007c57e000 ffff88007c57e000 ffff88007eac3000 ffff88007e6a6d4c
May  6 23:43:55 o01eg kernel: [  495.167839] <0> ffff88007e6a6c00 0000000000000121 ffff88007e6a6c24 ffff88007c57e010
May  6 23:43:55 o01eg kernel: [  495.167845] Call Trace:
May  6 23:43:55 o01eg kernel: [  495.167852]  [<ffffffff81061551>] ? lock_timer_base+0x27/0x4d
May  6 23:43:55 o01eg kernel: [  495.167857]  [<ffffffff81061820>] ? try_to_del_timer_sync+0x7b/0x85
May  6 23:43:55 o01eg kernel: [  495.167862]  [<ffffffff8117bb05>] kjournald+0xe6/0x22c
May  6 23:43:55 o01eg kernel: [  495.167867]  [<ffffffff8106d995>] ? autoremove_wake_function+0x0/0x34
May  6 23:43:55 o01eg kernel: [  495.167872]  [<ffffffff8117ba1f>] ? kjournald+0x0/0x22c
May  6 23:43:55 o01eg kernel: [  495.167875]  [<ffffffff8106d58e>] kthread+0x7d/0x85
May  6 23:43:55 o01eg kernel: [  495.167881]  [<ffffffff81026754>] kernel_thread_helper+0x4/0x10
May  6 23:43:55 o01eg kernel: [  495.167885]  [<ffffffff8106d511>] ? kthread+0x0/0x85
May  6 23:43:55 o01eg kernel: [  495.167889]  [<ffffffff81026750>] ? kernel_thread_helper+0x0/0x10
May  6 23:43:55 o01eg kernel: [  495.167892] Code: 48 c7 85 58 ff ff ff 00 00 00 00 45 31 f6 c7 85 48 ff ff ff 00 00 00 00 48 c7 85 50 ff ff ff 00 00 00 00 4d 89 ef e9 8d 02 00 00 <0f> 0b eb fe 41 f6 04 24 02 74 24 49 8b 45 00 f0 80 60 02 ef 4c 
May  6 23:43:55 o01eg kernel: [  495.167934] RIP  [<ffffffff811788a3>] journal_commit_transaction+0x94d/0x1262
May  6 23:43:55 o01eg kernel: [  495.167939]  RSP <ffff88007c57fd30>
May  6 23:43:55 o01eg kernel: [  495.167943] ---[ end trace 51ef317e59fa350a ]---

After this any proccess freezes (state D) which has wrote to the one ext3 partition mounted with option "defaults,data=ordered". Other partition are not affected. Read operation is possible.
Comment 1 O01eg 2011-05-06 20:41:47 UTC
With new kernel:
[  430.603855] ------------[ cut here ]------------
[  430.603865] kernel BUG at fs/jbd/commit.c:505!
[  430.603869] invalid opcode: 0000 [#1] PREEMPT SMP 
[  430.603878] last sysfs file: /sys/devices/platform/it87.656/temp3_alarm
[  430.603882] CPU 0 
[  430.603885] Modules linked in: ppp_deflate zlib_deflate bsd_comp ppp_async option usb_wwan netconsole oss_hdaudio osscore tun ppdev parport_pc parport
[  430.603913] 
[  430.603918] Pid: 1634, comm: kjournald Tainted: G        W   2.6.38.3 #1 Gigabyte Technology Co., Ltd. P35T-DS3P/P35T-DS3P
[  430.603931] RIP: 0010:[<ffffffff8118435e>]  [<ffffffff8118435e>] journal_commit_transaction+0x94d/0x1213
[  430.603944] RSP: 0018:ffff88007ba57d30  EFLAGS: 00010202
[  430.603948] RAX: 0000000000000012 RBX: ffff880063b4d200 RCX: ffff88007ba56010
[  430.603952] RDX: 0000000000000100 RSI: ffff88007cc28458 RDI: ffff88007bb27024
[  430.603956] RBP: ffff88007ba57e50 R08: 0000000000000000 R09: 0000000000000000
[  430.603960] R10: 0000000000000000 R11: ffff88007fc0c5d0 R12: ffff88007bb27000
[  430.603964] R13: ffff88007bb2714c R14: ffff88007bf71800 R15: 0000000000000000
[  430.603969] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[  430.603974] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  430.603978] CR2: 00007fc6cdd51000 CR3: 0000000063900000 CR4: 00000000000006f0
[  430.603982] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  430.603986] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  430.603990] Process kjournald (pid: 1634, threadinfo ffff88007ba56000, task ffff88007d232da0)
[  430.603994] Stack:
[  430.603997]  ffff88007ba56000 ffff88007ba57fd8 ffff88007ba56010 0000006441fd3500
[  430.604072]  ffff88007ba56000 ffff88007ba56000 ffff88007bf71800 ffff88007bb2714c
[  430.604082]  ffff88007bb27000 ffff88007bb27024 ffff880000000091 ffff88007ba56010
[  430.604092] Call Trace:
[  430.604099]  [<ffffffff81073405>] ? autoremove_wake_function+0x0/0x34
[  430.604105]  [<ffffffff81066a5f>] ? lock_timer_base+0x27/0x4d
[  430.604110]  [<ffffffff81066d2e>] ? try_to_del_timer_sync+0x7b/0x85
[  430.604116]  [<ffffffff81187559>] kjournald+0xe6/0x22c
[  430.604121]  [<ffffffff81073405>] ? autoremove_wake_function+0x0/0x34
[  430.604127]  [<ffffffff81187473>] ? kjournald+0x0/0x22c
[  430.604132]  [<ffffffff81073006>] kthread+0x7d/0x85
[  430.604139]  [<ffffffff8102b754>] kernel_thread_helper+0x4/0x10
[  430.604201]  [<ffffffff81072f89>] ? kthread+0x0/0x85
[  430.604207]  [<ffffffff8102b750>] ? kernel_thread_helper+0x0/0x10
[  430.604210] Code: 48 c7 85 58 ff ff ff 00 00 00 00 45 31 f6 c7 85 48 ff ff ff 00 00 00 00 48 c7 85 50 ff ff ff 00 00 00 00 4d 89 ef e9 8d 02 00 00 <0f> 0b eb fe 41 f6 04 24 02 74 24 49 8b 45 00 f0 80 60 02 f7 4c 
[  430.604689] RIP  [<ffffffff8118435e>] journal_commit_transaction+0x94d/0x1213
[  430.604696]  RSP <ffff88007ba57d30>
[  430.604721] ---[ end trace 9a427e3322e84bd6 ]---
Comment 2 Jan Kara 2011-05-09 11:37:30 UTC
Thanks for report! OK, so we wrongly estimated number of needed credits (the assertion is that number of buffers in BJ_Metadata list is larger than the number of remaining transaction credits). I guess the commit 523334ba508a8baaf5fc9f15fbad9ed04f334f48 should have fixed this (merged in 2.6.39-rc3). So can you possibly try latest 2.6.39-rc kernel? Thanks.
Comment 3 O01eg 2011-05-10 14:32:48 UTC
Thanks. It's look like fixed.
Comment 4 O01eg 2011-05-11 16:37:25 UTC
No, it isn't. Bug wasn't fixed but it is caused more rarely.

[91087.583431] ------------[ cut here ]------------
[91087.583452] kernel BUG at fs/jbd/commit.c:503!
[91087.583464] invalid opcode: 0000 [#1] PREEMPT SMP 
[91087.583482] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda10/uevent
[91087.583500] CPU 3 
[91087.583507] Modules linked in: ppp_deflate zlib_deflate bsd_comp ppp_async option usb_wwan ipt_LOG xt_state iptable_filter ipt_MASQUERADE ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_DSCP xt_dscp xt_string xt_NFQUEUE xt_multiport xt_mark xt_hashlimit xt_conntrack xt_connmark nf_conntrack oss_hdaudio osscore tun ppdev parport_pc parport
[91087.583644] 
[91087.583651] Pid: 1643, comm: kjournald Tainted: G   M    W   2.6.39-rc6 #1 Gigabyte Technology Co., Ltd. P35T-DS3P/P35T-DS3P
[91087.583678] RIP: 0010:[<ffffffff8118497b>]  [<ffffffff8118497b>] journal_commit_transaction+0x972/0x1267
[91087.583703] RSP: 0018:ffff88007b06dcf0  EFLAGS: 00010202
[91087.583714] RAX: 0000000000000012 RBX: ffff880035334380 RCX: ffff88007b06c010
[91087.583728] RDX: 0000000000000100 RSI: ffff88007cc28468 RDI: ffff88007b8d8824
[91087.583742] RBP: ffff88007b06de50 R08: ffff880035334380 R09: ffff880018a2bba8
[91087.583757] R10: 0000000000011640 R11: 0007ffffffffffff R12: ffff88007b8d8800
[91087.583770] R13: ffff88007b8d894c R14: ffff88007b82b000 R15: 0000000000000000
[91087.583784] FS:  0000000000000000(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000
[91087.583799] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[91087.583813] CR2: 00007f2f59680fc0 CR3: 0000000006c64000 CR4: 00000000000006e0
[91087.583828] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[91087.583841] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[91087.583855] Process kjournald (pid: 1643, threadinfo ffff88007b06c000, task ffff88007d160f20)
[91087.583870] Stack:
[91087.583876]  0000000000000003 0000000000000000 ffff88007b06de50 ffff88007b06c010
[91087.583898]  ffff88007b06c000 ffff88007b06c010 ffff88007b06c010 000052d7fb0d8a8c
[91087.583921]  ffff88007b06c000 ffff88007b8d894c ffff88007b82b000 ffff88007b8d8800
[91087.583943] Call Trace:
[91087.583955]  [<ffffffff81072071>] ? wake_up_bit+0x25/0x25
[91087.583968]  [<ffffffff810653e3>] ? lock_timer_base+0x27/0x4d
[91087.583982]  [<ffffffff810656b2>] ? try_to_del_timer_sync+0x7b/0x85
[91087.583996]  [<ffffffff81187bb0>] kjournald+0xe6/0x22c
[91087.584003]  [<ffffffff81072071>] ? wake_up_bit+0x25/0x25
[91087.584003]  [<ffffffff81187aca>] ? commit_timeout+0xb/0xb
[91087.584003]  [<ffffffff81071c72>] kthread+0x7d/0x85
[91087.584003]  [<ffffffff815cc954>] kernel_thread_helper+0x4/0x10
[91087.584003]  [<ffffffff81071bf5>] ? kthread_worker_fn+0x14c/0x14c
[91087.584003]  [<ffffffff815cc950>] ? gs_change+0xb/0xb
[91087.584003] Code: 48 c7 85 28 ff ff ff 00 00 00 00 45 31 f6 c7 85 20 ff ff ff 00 00 00 00 48 c7 85 18 ff ff ff 00 00 00 00 4d 89 ef e9 8c 02 00 00 <0f> 0b eb fe 41 f6 04 24 02 74 24 49 8b 45 00 f0 80 60 02 f7 4c 
[91087.584003] RIP  [<ffffffff8118497b>] journal_commit_transaction+0x972/0x1267
[91087.584003]  RSP <ffff88007b06dcf0>
[91087.599334] ---[ end trace a30b2d7091e47ff4 ]---
Comment 5 Jan Kara 2011-05-11 17:04:58 UTC
Drat, so probably there's a similar problem elsewhere. What kind of load do you use (because I haven't seen a report like this in ages, so you must do something special).
Comment 6 O01eg 2011-05-11 18:28:38 UTC
It started unexpected and I also didn't get this bug before.
In first causes it was linked with thunderbird activity. I try to check volume (to find badblocks), try to backup ~/.thunderbird/ and restore them (to delete wrong inodes) but it wasn't help.
After you fix thunderbird could work entire day.
P.S. I missed, I use "data=journal" option for this volume.
Comment 7 Jan Kara 2011-05-12 07:55:10 UTC
(In reply to comment #6)
> P.S. I missed, I use "data=journal" option for this volume.
Ah, that explains a lot! data=journal is not used/tested very much and uses rather different code paths. I'll have a look.
Comment 8 Olaf Hering 2012-02-15 15:53:08 UTC
This report indicates that data=journal has still issues with 3.2
https://bugzilla.novell.com/show_bug.cgi?id=747122
Comment 9 Jan Kara 2012-05-14 08:38:15 UTC
Umm, I forgot there's this bug open and updated only https://bugzilla.novell.com/show_bug.cgi?id=747122. So there's commit b22570d9abb3d844e65c15c8bc0d57a78129e3b4 which fixed bug in data=journal mode handling causing exactly the reported assertion failure (Olaf's failures with older kernels were exactly because of this problem). OTOH 3.2 should already have that commit so it may not be all. I'm just checking with Olaf whether he is really able to reproduce with 3.2 and can gather some debug data.

I'll attach the debugging patch here so if there's somebody who can reproduce the issue, please run with the debugging patch and report. Thanks.
Comment 10 Jan Kara 2012-05-14 08:42:22 UTC
Created attachment 73288 [details]
Debugging patch for assertion failure in journal_commit_transaction

The patch should work with pretty much any kernel version modulo some minor context changes...
Comment 11 Jan Kara 2012-09-19 14:38:53 UTC
Created attachment 80581 [details]
Patch fixing assertion failure in JBD

This patch was confirmed to fix the issue. I'll push it to Linus in the next merge window.
Comment 12 Jan Kara 2012-09-19 14:39:28 UTC
Closing the bug.