Distribution: Debian sid (unstable) Hardware Environment: qemu x86 Software Environment: Minimal Debian sid Problem Description: Mounting one of the attached intentionally corrupted ext3 filesystems and accessing it causes a soft lockup. I believe they probably are the same bug (both happen in __log_wait_for_space()), but the backtraces vary quite a bit so I attach two test cases. My kernel is 2.6.25.8 + three patches posted by Duane Griffin for #10882, namely (I hope I understand the git terminology right, at least you can grep lkml archives for these): - 6ae4ecf..8019bf2 - 7712682..bc030f4 - 0b8cf80..ea0236b Steps to reproduce: 1. bunzip2 one of the filesystem images 2. mount fs.img /mnt 3. ls /mnt
Created attachment 16603 [details] Test case 1, hdb.329, bzip2 compressed Back trace: fstest:~# mount /dev/hdb /mnt [ 20.221730] EXT3-fs warning: feature flags set on rev 0 fs, running e2fsck is recommended [ 20.227197] kjournald starting. Commit interval 5 seconds [ 20.228444] EXT3-fs warning (device hdb): ext3_update_dynamic_rev: updating to rev 1 because of new feature flag, running e2fsck is recommended [ 20.231276] EXT3 FS on hdb, internal journal [ 20.231879] EXT3-fs: mounted filesystem with ordered data mode. fstest:~# cd /mnt fstest:/mnt# ls [ 91.272751] BUG: soft lockup - CPU#0 stuck for 61s! [ls:644] [ 91.272751] [ 91.272751] Pid: 644, comm: ls Not tainted (2.6.25.8 #4) [ 91.272751] EIP: 0060:[<c057c3e0>] EFLAGS: 00000246 CPU: 0 [ 91.272751] EIP is at mutex_lock_nested+0x180/0x1f0 [ 91.272751] EAX: c79dd578 EBX: c79dd558 ECX: c7ac0000 EDX: c7ac0000 [ 91.272751] ESI: 00000246 EDI: c7a34ea0 EBP: c7ac1e68 ESP: c7ac1e30 [ 91.272751] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 91.272751] CR0: 8005003b CR2: b7f6e280 CR3: 07ab8000 CR4: 00000690 [ 91.272751] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 91.272751] DR6: 00000000 DR7: 00000000 [ 91.272751] [<c0321207>] ? __log_wait_for_space+0x67/0xb0 [ 91.272751] [<c0321207>] ? __log_wait_for_space+0x67/0xb0 [ 91.272751] [<c0321207>] __log_wait_for_space+0x67/0xb0 [ 91.272751] [<c031ec94>] start_this_handle+0xc4/0x390 [ 91.272751] [<c0272894>] ? filldir64+0xa4/0xf0 [ 91.272751] [<c031f0e9>] journal_start+0xa9/0x110 [ 91.272751] [<c02f59c9>] ext3_journal_start_sb+0x29/0x50 [ 91.272751] [<c02727f0>] ? filldir64+0x0/0xf0 [ 91.272751] [<c02eb3bc>] ext3_dirty_inode+0x2c/0x80 [ 91.272751] [<c02840fc>] __mark_inode_dirty+0x2c/0x190 [ 91.272751] [<c0272aef>] ? vfs_readdir+0x4f/0xa0 [ 91.272751] [<c0272aef>] ? vfs_readdir+0x4f/0xa0 [ 91.272751] [<c0279d14>] touch_atime+0xc4/0x110 [ 91.272751] [<c0272b31>] vfs_readdir+0x91/0xa0 [ 91.272751] [<c02727f0>] ? filldir64+0x0/0xf0 [ 91.272751] [<c0272ba9>] sys_getdents64+0x69/0xb0 [ 91.272751] [<c0202f1e>] syscall_call+0x7/0xb [ 91.272751] =======================
Created attachment 16605 [details] Test case 2, hdb.516, bzip2 compressed Backtraces I've seen (3 different): -------------------- [ 893.458977] kjournald starting. Commit interval 5 seconds [ 893.463750] EXT3 FS on hdb, internal journal [ 893.464313] EXT3-fs: mounted filesystem with ordered data mode. [ 893.482219] EXT3-fs error (device hdb): ext3_lookup: bad inode number: 1517 [ 958.851735] BUG: soft lockup - CPU#0 stuck for 61s! [find:5604] [ 958.851735] [ 958.851735] Pid: 5604, comm: find Not tainted (2.6.25.8 #2) [ 958.851735] EIP: 0060:[<c0320c78>] EFLAGS: 00000246 CPU: 0 [ 958.851735] EIP is at cleanup_journal_tail+0xa8/0x100 [ 958.851735] EAX: c7bdc000 EBX: c4019000 ECX: 00000000 EDX: 00000001 [ 958.851735] ESI: 00002001 EDI: 00000009 EBP: c400bd3c ESP: c400bd28 [ 958.851735] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 958.851735] CR0: 8005003b CR2: b7e48270 CR3: 04007000 CR4: 00000690 [ 958.851735] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 958.851735] DR6: 00000000 DR7: 00000000 [ 958.851735] [<c0320e45>] log_do_checkpoint+0x135/0x420 [ 958.851735] [<c023110d>] ? run_hrtimer_pending+0xfd/0x130 [ 958.851735] [<c023f82a>] ? softlockup_tick+0xaa/0x120 [ 958.851735] [<c057d933>] ? _spin_lock_irq+0x33/0x40 [ 958.851735] [<c057d77d>] ? _spin_unlock_irq+0x1d/0x30 [ 958.851735] [<c02234f7>] ? run_timer_softirq+0x157/0x1d0 [ 958.851735] [<c023721e>] ? tick_periodic+0x2e/0x80 [ 958.851735] [<c0237282>] ? tick_handle_periodic+0x12/0x70 [ 958.851735] [<c021f275>] ? __do_softirq+0x85/0xa0 [ 958.851735] [<c021f3ba>] ? irq_exit+0x2a/0x80 [ 958.851735] [<c0204e46>] ? do_IRQ+0x46/0xa0 [ 958.851735] [<c02030df>] ? common_interrupt+0x23/0x28 [ 958.851735] [<c057c378>] ? mutex_lock_nested+0x188/0x1f0 [ 958.851735] [<c0321197>] ? __log_wait_for_space+0x67/0xb0 [ 958.851735] [<c0321197>] ? __log_wait_for_space+0x67/0xb0 [ 958.851735] [<c057d8b2>] ? _spin_lock+0x32/0x40 [ 958.851735] [<c03211c7>] __log_wait_for_space+0x97/0xb0 [ 958.851735] [<c031ec24>] start_this_handle+0xc4/0x390 [ 958.851735] [<c0272894>] ? filldir64+0xa4/0xf0 [ 958.851735] [<c031f079>] journal_start+0xa9/0x110 [ 958.851735] [<c02f5959>] ext3_journal_start_sb+0x29/0x50 [ 958.851735] [<c02727f0>] ? filldir64+0x0/0xf0 [ 958.851735] [<c02eb30c>] ext3_dirty_inode+0x2c/0x80 [ 958.851735] [<c02840fc>] __mark_inode_dirty+0x2c/0x190 [ 958.851735] [<c0272aef>] ? vfs_readdir+0x4f/0xa0 [ 958.851735] [<c0272aef>] ? vfs_readdir+0x4f/0xa0 [ 958.851735] [<c0279d14>] touch_atime+0xc4/0x110 [ 958.851735] [<c0272b31>] vfs_readdir+0x91/0xa0 [ 958.851735] [<c02727f0>] ? filldir64+0x0/0xf0 [ 958.851735] [<c0272ba9>] sys_getdents64+0x69/0xb0 [ 958.851735] [<c0202f1e>] syscall_call+0x7/0xb [ 958.851735] ======================= -------------------- -------------------- [ 22.254565] kjournald starting. Commit interval 5 seconds [ 22.259806] Clocksource tsc unstable (delta = 400048540 ns) [ 22.265069] EXT3 FS on hdb, internal journal [ 22.265468] EXT3-fs: mounted filesystem with ordered data mode. [ 22.281437] EXT3-fs error (device hdb): ext3_lookup: bad inode number: 1517 [ 86.979690] BUG: soft lockup - CPU#0 stuck for 61s! [find:661] [ 86.979690] [ 86.979690] Pid: 661, comm: find Not tainted (2.6.25.8 #4) [ 86.979690] EIP: 0060:[<c0481a75>] EFLAGS: 00000246 CPU: 0 [ 86.979690] EIP is at _raw_spin_unlock+0x35/0x70 [ 86.979690] EAX: c7ab2000 EBX: c7aa6014 ECX: c0320cf0 EDX: 00000001 [ 86.979690] ESI: 00002001 EDI: 00000009 EBP: c5eabd14 ESP: c5eabd10 [ 86.979690] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 86.979690] CR0: 80050033 CR2: b7fed270 CR3: 05eae000 CR4: 00000690 [ 86.979690] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 86.979690] DR6: 00000000 DR7: 00000000 [ 86.979690] [<c057d85d>] _spin_unlock+0x1d/0x20 [ 86.979690] [<c0320cf0>] cleanup_journal_tail+0xb0/0x100 [ 86.979690] [<c0320eb5>] log_do_checkpoint+0x135/0x420 [ 86.979690] [<c023110d>] ? run_hrtimer_pending+0xfd/0x130 [ 86.979690] [<c023f82a>] ? softlockup_tick+0xaa/0x120 [ 86.979690] [<c057d9a3>] ? _spin_lock_irq+0x33/0x40 [ 86.979690] [<c057d7ed>] ? _spin_unlock_irq+0x1d/0x30 [ 86.979690] [<c02234f7>] ? run_timer_softirq+0x157/0x1d0 [ 86.979690] [<c023721e>] ? tick_periodic+0x2e/0x80 [ 86.979690] [<c0237282>] ? tick_handle_periodic+0x12/0x70 [ 86.979690] [<c021f275>] ? __do_softirq+0x85/0xa0 [ 86.979690] [<c021f3ba>] ? irq_exit+0x2a/0x80 [ 86.979690] [<c0204e46>] ? do_IRQ+0x46/0xa0 [ 86.979690] [<c02030df>] ? common_interrupt+0x23/0x28 [ 86.979690] [<c057c3e8>] ? mutex_lock_nested+0x188/0x1f0 [ 86.979690] [<c0321207>] ? __log_wait_for_space+0x67/0xb0 [ 86.979690] [<c0321207>] ? __log_wait_for_space+0x67/0xb0 [ 86.979690] [<c057d922>] ? _spin_lock+0x32/0x40 [ 86.979690] [<c0321237>] __log_wait_for_space+0x97/0xb0 [ 86.979690] [<c031ec94>] start_this_handle+0xc4/0x390 [ 86.979690] [<c0272894>] ? filldir64+0xa4/0xf0 [ 86.979690] [<c031f0e9>] journal_start+0xa9/0x110 [ 86.979690] [<c02f59c9>] ext3_journal_start_sb+0x29/0x50 [ 86.979690] [<c02727f0>] ? filldir64+0x0/0xf0 [ 86.979690] [<c02eb3bc>] ext3_dirty_inode+0x2c/0x80 [ 86.979690] [<c02840fc>] __mark_inode_dirty+0x2c/0x190 [ 86.979690] [<c0272aef>] ? vfs_readdir+0x4f/0xa0 [ 86.979690] [<c0272aef>] ? vfs_readdir+0x4f/0xa0 [ 86.979690] [<c0279d14>] touch_atime+0xc4/0x110 [ 86.979690] [<c0272b31>] vfs_readdir+0x91/0xa0 [ 86.979690] [<c02727f0>] ? filldir64+0x0/0xf0 [ 86.979690] [<c0272ba9>] sys_getdents64+0x69/0xb0 [ 86.979690] [<c0202f1e>] syscall_call+0x7/0xb [ 86.979690] ======================= -------------------- -------------------- fstest:~# mount /dev/hdb /mnt [ 22.012017] kjournald starting. Commit interval 5 seconds [ 22.015344] EXT3-fs warning: mounting fs with errors, running e2fsck is recommended [ 22.025326] EXT3 FS on hdb, internal journal [ 22.025884] EXT3-fs: recovery complete. [ 22.026439] EXT3-fs: mounted filesystem with ordered data mode. fstest:~# cd /mnt fstest:/mnt# ls [ 89.130150] BUG: soft lockup - CPU#0 stuck for 61s! [ls:644] [ 89.130150] [ 89.130150] Pid: 644, comm: ls Not tainted (2.6.25.8 #4) [ 89.130150] EIP: 0060:[<c0321207>] EFLAGS: 00000286 CPU: 0 [ 89.130150] EIP is at __log_wait_for_space+0x67/0xb0 [ 89.130150] EAX: 22222222 EBX: 00000100 ECX: 00000000 EDX: c7abe000 [ 89.130150] ESI: c7abb400 EDI: c7abb414 EBP: c7abfe80 ESP: c7abfe70 [ 89.130150] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 89.130150] CR0: 8005003b CR2: b7fee280 CR3: 07ad5000 CR4: 00000690 [ 89.130150] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 89.130150] DR6: 00000000 DR7: 00000000 [ 89.130150] [<c031ec94>] start_this_handle+0xc4/0x390 [ 89.130150] [<c0272894>] ? filldir64+0xa4/0xf0 [ 89.130150] [<c031f0e9>] journal_start+0xa9/0x110 [ 89.130150] [<c02f59c9>] ext3_journal_start_sb+0x29/0x50 [ 89.130150] [<c02727f0>] ? filldir64+0x0/0xf0 [ 89.130150] [<c02eb3bc>] ext3_dirty_inode+0x2c/0x80 [ 89.130150] [<c02840fc>] __mark_inode_dirty+0x2c/0x190 [ 89.130150] [<c0272aef>] ? vfs_readdir+0x4f/0xa0 [ 89.130150] [<c0272aef>] ? vfs_readdir+0x4f/0xa0 [ 89.130150] [<c0279d14>] touch_atime+0xc4/0x110 [ 89.130150] [<c0272b31>] vfs_readdir+0x91/0xa0 [ 89.130150] [<c02727f0>] ? filldir64+0x0/0xf0 [ 89.130150] [<c0272ba9>] sys_getdents64+0x69/0xb0 [ 89.130150] [<c0202f1e>] syscall_call+0x7/0xb [ 89.130150] ======================= --------------------
I'm unable to reproduce this on recent git, with or without my patches. It may have already been fixed, but it does take a while (maybe ~5s) before it errors out, so it may just be differences in our test environments. Did you try leaving it to see if it eventually completed? Would you be able to test with 2.6.26-rc7?
OK, that's weird, I continued testing and just saw it happen. It seems to be transient. Most of the time it terminates with an error and an aborted journal after a short while (~5s), but occasionally it hangs seemingly indefinitely (>5m). Cranking up the debugging, we get this continually repeated in the log: [ 100.444274] (fs/jbd/checkpoint.c, 305): log_do_checkpoint: Start checkpoint [ 100.630135] (fs/jbd/checkpoint.c, 313): log_do_checkpoint: cleanup_journal_tail returned 1 Which suggests the loop in __log_wait_for_space is not terminating. When it terminates the above is interrupted by: [ 100.060124] (fs/jbd/journal.c, 192): kjournald: kjournald wakes [ 100.063089] (fs/jbd/journal.c, 200): kjournald: woke because of timeout [ 100.064640] (fs/jbd/journal.c, 144): kjournald: commit_sequence=8, commit_request=9 [ 100.067558] (fs/jbd/journal.c, 147): kjournald: OK, requests differ [ 100.069126] (fs/jbd/commit.c, 312): journal_commit_transaction: super block updated ...etc
Created attachment 16624 [details] Kernel config, v2.6.26-rc8, gzip compressed I can still reproduce this on 2.6.26-rc8. Sometimes the backtrace shows mutex_lock_nested. I don't know enough about kernel internals, but could it be that preemption makes this bug more evasive? I use CONFIG_PREEMPT_NONE in my test setup and have so far been able to reproduce this 100%. I'll attach to this bug the the .config with which I can reproduce the lockup. Here's 2.6.26-rc8 output after waiting for a few minutes: -------------------- # mount /dev/hdb /mnt [ 24.657984] EXT3-fs warning: feature flags set on rev 0 fs, running e2fsck is recommended [ 24.670579] kjournald starting. Commit interval 5 seconds [ 24.673276] EXT3-fs warning (device hdb): ext3_update_dynamic_rev: updating to rev 1 because of new feature flag, running e2fsck is recommended [ 24.677074] EXT3 FS on hdb, internal journal [ 24.678111] EXT3-fs: mounted filesystem with ordered data mode. # ls /mnt [ 93.424204] BUG: soft lockup - CPU#0 stuck for 61s! [ls:668] [ 93.424204] [ 93.424204] Pid: 668, comm: ls Not tainted (2.6.26-rc8 #2) [ 93.424204] EIP: 0060:[<c048799e>] EFLAGS: 00000202 CPU: 0 [ 93.424204] EIP is at _raw_spin_lock+0x2e/0x110 [ 93.424204] EAX: c78eb280 EBX: c7ade814 ECX: 00000000 EDX: 00000000 [ 93.424204] ESI: c7ade800 EDI: c7ade814 EBP: c7ae7e50 ESP: c7ae7e28 [ 93.424204] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 93.424204] CR0: 8005003b CR2: b7f78280 CR3: 07ae4000 CR4: 00000690 [ 93.424204] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 93.424204] DR6: 00000000 DR7: 00000000 [ 93.424204] [<c057d0b8>] ? mutex_lock_nested+0x188/0x1f0 [ 93.424204] [<c0325e87>] ? __log_wait_for_space+0x67/0xb0 [ 93.424204] [<c0325e87>] ? __log_wait_for_space+0x67/0xb0 [ 93.424204] [<c057e642>] _spin_lock+0x32/0x40 [ 93.424204] [<c0325e8e>] ? __log_wait_for_space+0x6e/0xb0 [ 93.424204] [<c0325e8e>] __log_wait_for_space+0x6e/0xb0 [ 93.424204] [<c0323914>] start_this_handle+0xc4/0x390 [ 93.424204] [<c0274bc4>] ? filldir64+0xa4/0xf0 [ 93.424204] [<c0323d69>] journal_start+0xa9/0x110 [ 93.424204] [<c02f9e99>] ext3_journal_start_sb+0x29/0x50 [ 93.424204] [<c0274b20>] ? filldir64+0x0/0xf0 [ 93.424204] [<c02ef75c>] ext3_dirty_inode+0x2c/0x80 [ 93.424204] [<c028765c>] __mark_inode_dirty+0x2c/0x190 [ 93.424204] [<c057e642>] ? _spin_lock+0x32/0x40 [ 93.424204] [<c027c2ac>] touch_atime+0xbc/0x110 [ 93.424204] [<c0274e61>] vfs_readdir+0x91/0xa0 [ 93.424204] [<c0274b20>] ? filldir64+0x0/0xf0 [ 93.424204] [<c0274ed9>] sys_getdents64+0x69/0xb0 [ 93.424204] [<c02030ba>] syscall_call+0x7/0xb [ 93.424204] ======================= [ 158.917557] BUG: soft lockup - CPU#0 stuck for 61s! [ls:668] [ 158.917557] [ 158.917557] Pid: 668, comm: ls Not tainted (2.6.26-rc8 #2) [ 158.917557] EIP: 0060:[<c0239a00>] EFLAGS: 00000246 CPU: 0 [ 158.917557] EIP is at lock_acquire+0x0/0x10 [ 158.917557] EAX: c7ade824 EBX: c7ade814 ECX: 00000000 EDX: 00000000 [ 158.917557] ESI: c7ade800 EDI: c7ade814 EBP: c7ae7e68 ESP: c7ae7e54 [ 158.917557] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 158.917557] CR0: 8005003b CR2: b7f78280 CR3: 07ae4000 CR4: 00000690 [ 158.917557] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 158.917557] DR6: 00000000 DR7: 00000000 [ 158.917557] [<c057e63b>] ? _spin_lock+0x2b/0x40 [ 158.917557] [<c0325e8e>] ? __log_wait_for_space+0x6e/0xb0 [ 158.917557] [<c0325e8e>] __log_wait_for_space+0x6e/0xb0 [ 158.917557] [<c0323914>] start_this_handle+0xc4/0x390 [ 158.917557] [<c0274bc4>] ? filldir64+0xa4/0xf0 [ 158.917557] [<c0323d69>] journal_start+0xa9/0x110 [ 158.917557] [<c02f9e99>] ext3_journal_start_sb+0x29/0x50 [ 158.917557] [<c0274b20>] ? filldir64+0x0/0xf0 [ 158.917557] [<c02ef75c>] ext3_dirty_inode+0x2c/0x80 [ 158.917557] [<c028765c>] __mark_inode_dirty+0x2c/0x190 [ 158.917557] [<c057e642>] ? _spin_lock+0x32/0x40 [ 158.917557] [<c027c2ac>] touch_atime+0xbc/0x110 [ 158.917557] [<c0274e61>] vfs_readdir+0x91/0xa0 [ 158.917557] [<c0274b20>] ? filldir64+0x0/0xf0 [ 158.917557] [<c0274ed9>] sys_getdents64+0x69/0xb0 [ 158.917557] [<c02030ba>] syscall_call+0x7/0xb [ 158.917557] ======================= [ 224.420909] BUG: soft lockup - CPU#0 stuck for 61s! [ls:668] [ 224.420909] [ 224.420909] Pid: 668, comm: ls Not tainted (2.6.26-rc8 #2) [ 224.420909] EIP: 0060:[<c03267f0>] EFLAGS: 00000246 CPU: 0 [ 224.420909] EIP is at __log_space_left+0x0/0x30 [ 224.420909] EAX: c7ade800 EBX: 00000100 ECX: c057d488 EDX: 00000001 [ 224.420909] ESI: c7ade800 EDI: c7ade814 EBP: c7ae7e80 ESP: c7ae7e6c [ 224.420909] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 224.420909] CR0: 8005003b CR2: b7f78280 CR3: 07ae4000 CR4: 00000690 [ 224.420909] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 224.420909] DR6: 00000000 DR7: 00000000 [ 224.420909] [<c0325e67>] ? __log_wait_for_space+0x47/0xb0 [ 224.420909] [<c0323914>] start_this_handle+0xc4/0x390 [ 224.420909] [<c0274bc4>] ? filldir64+0xa4/0xf0 [ 224.420909] [<c0323d69>] journal_start+0xa9/0x110 [ 224.420909] [<c02f9e99>] ext3_journal_start_sb+0x29/0x50 [ 224.420909] [<c0274b20>] ? filldir64+0x0/0xf0 [ 224.420909] [<c02ef75c>] ext3_dirty_inode+0x2c/0x80 [ 224.420909] [<c028765c>] __mark_inode_dirty+0x2c/0x190 [ 224.420909] [<c057e642>] ? _spin_lock+0x32/0x40 [ 224.420909] [<c027c2ac>] touch_atime+0xbc/0x110 [ 224.420909] [<c0274e61>] vfs_readdir+0x91/0xa0 [ 224.420909] [<c0274b20>] ? filldir64+0x0/0xf0 [ 224.420909] [<c0274ed9>] sys_getdents64+0x69/0xb0 [ 224.420909] [<c02030ba>] syscall_call+0x7/0xb [ 224.420909] ======================= [ 289.914263] BUG: soft lockup - CPU#0 stuck for 61s! [ls:668] [ 289.914263] [ 289.914263] Pid: 668, comm: ls Not tainted (2.6.26-rc8 #2) [ 289.914263] EIP: 0060:[<c04878f8>] EFLAGS: 00000246 CPU: 0 [ 289.914263] EIP is at _raw_spin_unlock+0x48/0x70 [ 289.914263] EAX: c78eb280 EBX: c7ade814 ECX: 00000000 EDX: 00000001 [ 289.914263] ESI: 10000001 EDI: 00000009 EBP: c7ae7d14 ESP: c7ae7d10 [ 289.914263] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 289.914263] CR0: 8005003b CR2: b7f78280 CR3: 07ae4000 CR4: 00000690 [ 289.914263] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 289.914263] DR6: 00000000 DR7: 00000000 [ 289.914263] [<c057e57d>] _spin_unlock+0x1d/0x20 [ 289.914263] [<c0325970>] cleanup_journal_tail+0xb0/0x100 [ 289.914263] [<c0325a17>] log_do_checkpoint+0x17/0x420 [ 289.914263] [<c0231bad>] ? run_hrtimer_pending+0xfd/0x130 [ 289.914263] [<c057e6c3>] ? _spin_lock_irq+0x33/0x40 [ 289.914263] [<c057e50d>] ? _spin_unlock_irq+0x1d/0x30 [ 289.914263] [<c02240e7>] ? run_timer_softirq+0x157/0x1d0 [ 289.914263] [<c023832e>] ? tick_periodic+0x2e/0x80 [ 289.914263] [<c0238392>] ? tick_handle_periodic+0x12/0x70 [ 289.914263] [<c021fe75>] ? __do_softirq+0x85/0xa0 [ 289.914263] [<c021ffba>] ? irq_exit+0x2a/0x80 [ 289.914263] [<c0205066>] ? do_IRQ+0x46/0xa0 [ 289.914263] [<c020325b>] ? common_interrupt+0x23/0x28 [ 289.914263] [<c057d0b8>] ? mutex_lock_nested+0x188/0x1f0 [ 289.914263] [<c0325e87>] ? __log_wait_for_space+0x67/0xb0 [ 289.914263] [<c0325e87>] ? __log_wait_for_space+0x67/0xb0 [ 289.914263] [<c057e642>] ? _spin_lock+0x32/0x40 [ 289.914263] [<c0325eb7>] __log_wait_for_space+0x97/0xb0 [ 289.914263] [<c0323914>] start_this_handle+0xc4/0x390 [ 289.914263] [<c0274bc4>] ? filldir64+0xa4/0xf0 [ 289.914263] [<c0323d69>] journal_start+0xa9/0x110 [ 289.914263] [<c02f9e99>] ext3_journal_start_sb+0x29/0x50 [ 289.914263] [<c0274b20>] ? filldir64+0x0/0xf0 [ 289.914263] [<c02ef75c>] ext3_dirty_inode+0x2c/0x80 [ 289.914263] [<c028765c>] __mark_inode_dirty+0x2c/0x190 [ 289.914263] [<c057e642>] ? _spin_lock+0x32/0x40 [ 289.914263] [<c027c2ac>] touch_atime+0xbc/0x110 [ 289.914263] [<c0274e61>] vfs_readdir+0x91/0xa0 [ 289.914263] [<c0274b20>] ? filldir64+0x0/0xf0 [ 289.914263] [<c0274ed9>] sys_getdents64+0x69/0xb0 [ 289.914263] [<c02030ba>] syscall_call+0x7/0xb [ 289.914263] ======================= [ 355.417616] BUG: soft lockup - CPU#0 stuck for 61s! [ls:668] [ 355.417616] [ 355.417616] Pid: 668, comm: ls Not tainted (2.6.26-rc8 #2) [ 355.417616] EIP: 0060:[<c03258c0>] EFLAGS: 00000246 CPU: 0 [ 355.417616] EIP is at cleanup_journal_tail+0x0/0x100 [ 355.417616] EAX: c7ade800 EBX: 00000100 ECX: 00000000 EDX: 00000001 [ 355.417616] ESI: c7ade800 EDI: c7ade814 EBP: c7ae7e68 ESP: c7ae7d40 [ 355.417616] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 355.417616] CR0: 8005003b CR2: b7f78280 CR3: 07ae4000 CR4: 00000690 [ 355.417616] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 355.417616] DR6: 00000000 DR7: 00000000 [ 355.417616] [<c0325b35>] ? log_do_checkpoint+0x135/0x420 [ 355.417616] [<c057e6c3>] ? _spin_lock_irq+0x33/0x40 [ 355.417616] [<c057e50d>] ? _spin_unlock_irq+0x1d/0x30 [ 355.417616] [<c0231bad>] ? run_hrtimer_pending+0xfd/0x130 [ 355.417616] [<c057e6c3>] ? _spin_lock_irq+0x33/0x40 [ 355.417616] [<c057e50d>] ? _spin_unlock_irq+0x1d/0x30 [ 355.417616] [<c02240e7>] ? run_timer_softirq+0x157/0x1d0 [ 355.417616] [<c023832e>] ? tick_periodic+0x2e/0x80 [ 355.417616] [<c0238392>] ? tick_handle_periodic+0x12/0x70 [ 355.417616] [<c021fe75>] ? __do_softirq+0x85/0xa0 [ 355.417616] [<c021ffba>] ? irq_exit+0x2a/0x80 [ 355.417616] [<c057d0b8>] ? mutex_lock_nested+0x188/0x1f0 [ 355.417616] [<c0325e87>] ? __log_wait_for_space+0x67/0xb0 [ 355.417616] [<c0325e87>] ? __log_wait_for_space+0x67/0xb0 [ 355.417616] [<c057e642>] ? _spin_lock+0x32/0x40 [ 355.417616] [<c0325eb7>] __log_wait_for_space+0x97/0xb0 [ 355.417616] [<c0323914>] start_this_handle+0xc4/0x390 [ 355.417616] [<c0274bc4>] ? filldir64+0xa4/0xf0 [ 355.417616] [<c0323d69>] journal_start+0xa9/0x110 [ 355.417616] [<c02f9e99>] ext3_journal_start_sb+0x29/0x50 [ 355.417616] [<c0274b20>] ? filldir64+0x0/0xf0 [ 355.417616] [<c02ef75c>] ext3_dirty_inode+0x2c/0x80 [ 355.417616] [<c028765c>] __mark_inode_dirty+0x2c/0x190 [ 355.417616] [<c057e642>] ? _spin_lock+0x32/0x40 [ 355.417616] [<c027c2ac>] touch_atime+0xbc/0x110 [ 355.417616] [<c0274e61>] vfs_readdir+0x91/0xa0 [ 355.417616] [<c0274b20>] ? filldir64+0x0/0xf0 [ 355.417616] [<c0274ed9>] sys_getdents64+0x69/0xb0 [ 355.417616] [<c02030ba>] syscall_call+0x7/0xb [ 355.417616] ======================= [ 420.910969] BUG: soft lockup - CPU#0 stuck for 61s! [ls:668] [ 420.910969] [ 420.910969] Pid: 668, comm: ls Not tainted (2.6.26-rc8 #2) [ 420.910969] EIP: 0060:[<c0325e72>] EFLAGS: 00000246 CPU: 0 [ 420.910969] EIP is at __log_wait_for_space+0x52/0xb0 [ 420.910969] EAX: 00000000 EBX: 00000100 ECX: c057d488 EDX: f00003df [ 420.910969] ESI: c7ade800 EDI: c7ade814 EBP: c7ae7e80 ESP: c7ae7e70 [ 420.910969] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 420.910969] CR0: 8005003b CR2: b7f78280 CR3: 07ae4000 CR4: 00000690 [ 420.910969] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 420.910969] DR6: 00000000 DR7: 00000000 [ 420.910969] [<c0323914>] start_this_handle+0xc4/0x390 [ 420.910969] [<c0274bc4>] ? filldir64+0xa4/0xf0 [ 420.910969] [<c0323d69>] journal_start+0xa9/0x110 [ 420.910969] [<c02f9e99>] ext3_journal_start_sb+0x29/0x50 [ 420.910969] [<c0274b20>] ? filldir64+0x0/0xf0 [ 420.910969] [<c02ef75c>] ext3_dirty_inode+0x2c/0x80 [ 420.910969] [<c028765c>] __mark_inode_dirty+0x2c/0x190 [ 420.910969] [<c057e642>] ? _spin_lock+0x32/0x40 [ 420.910969] [<c027c2ac>] touch_atime+0xbc/0x110 [ 420.910969] [<c0274e61>] vfs_readdir+0x91/0xa0 [ 420.910969] [<c0274b20>] ? filldir64+0x0/0xf0 [ 420.910969] [<c0274ed9>] sys_getdents64+0x69/0xb0 [ 420.910969] [<c02030ba>] syscall_call+0x7/0xb [ 420.910969] ======================= [ 486.414322] BUG: soft lockup - CPU#0 stuck for 61s! [ls:668] [ 486.414322] [ 486.414322] Pid: 668, comm: ls Not tainted (2.6.26-rc8 #2) [ 486.414322] EIP: 0060:[<c0239a00>] EFLAGS: 00000246 CPU: 0 [ 486.414322] EIP is at lock_acquire+0x0/0x10 [ 486.414322] EAX: c7ade9d0 EBX: c7ade9c0 ECX: 00000000 EDX: 00000000 [ 486.414322] ESI: c7ade800 EDI: c7ade814 EBP: c7ae7d3c ESP: c7ae7d28 [ 486.414322] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 486.414322] CR0: 8005003b CR2: b7f78280 CR3: 07ae4000 CR4: 00000690 [ 486.414322] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 486.414322] DR6: 00000000 DR7: 00000000 [ 486.414322] [<c057e63b>] ? _spin_lock+0x2b/0x40 [ 486.414322] [<c0325a37>] ? log_do_checkpoint+0x37/0x420 [ 486.414322] [<c0325a37>] log_do_checkpoint+0x37/0x420 [ 486.414322] [<c0231bad>] ? run_hrtimer_pending+0xfd/0x130 [ 486.414322] [<c057e6c3>] ? _spin_lock_irq+0x33/0x40 [ 486.414322] [<c057e50d>] ? _spin_unlock_irq+0x1d/0x30 [ 486.414322] [<c02240e7>] ? run_timer_softirq+0x157/0x1d0 [ 486.414322] [<c023832e>] ? tick_periodic+0x2e/0x80 [ 486.414322] [<c0238392>] ? tick_handle_periodic+0x12/0x70 [ 486.414322] [<c021fe75>] ? __do_softirq+0x85/0xa0 [ 486.414322] [<c021ffba>] ? irq_exit+0x2a/0x80 [ 486.414322] [<c0205066>] ? do_IRQ+0x46/0xa0 [ 486.414322] [<c020325b>] ? common_interrupt+0x23/0x28 [ 486.414322] [<c057d0b8>] ? mutex_lock_nested+0x188/0x1f0 [ 486.414322] [<c0325e87>] ? __log_wait_for_space+0x67/0xb0 [ 486.414322] [<c0325e87>] ? __log_wait_for_space+0x67/0xb0 [ 486.414322] [<c057e642>] ? _spin_lock+0x32/0x40 [ 486.414322] [<c0325eb7>] __log_wait_for_space+0x97/0xb0 [ 486.414322] [<c0323914>] start_this_handle+0xc4/0x390 [ 486.414322] [<c0274bc4>] ? filldir64+0xa4/0xf0 [ 486.414322] [<c0323d69>] journal_start+0xa9/0x110 [ 486.414322] [<c02f9e99>] ext3_journal_start_sb+0x29/0x50 [ 486.414322] [<c0274b20>] ? filldir64+0x0/0xf0 [ 486.414322] [<c02ef75c>] ext3_dirty_inode+0x2c/0x80 [ 486.414322] [<c028765c>] __mark_inode_dirty+0x2c/0x190 [ 486.414322] [<c057e642>] ? _spin_lock+0x32/0x40 [ 486.414322] [<c027c2ac>] touch_atime+0xbc/0x110 [ 486.414322] [<c0274e61>] vfs_readdir+0x91/0xa0 [ 486.414322] [<c0274b20>] ? filldir64+0x0/0xf0 [ 486.414322] [<c0274ed9>] sys_getdents64+0x69/0xb0 [ 486.414322] [<c02030ba>] syscall_call+0x7/0xb [ 486.414322] ======================= [ 551.907675] BUG: soft lockup - CPU#0 stuck for 61s! [ls:668] [ 551.907675] [ 551.907675] Pid: 668, comm: ls Not tainted (2.6.26-rc8 #2) [ 551.907675] EIP: 0060:[<c04878b0>] EFLAGS: 00000287 CPU: 0 [ 551.907675] EIP is at _raw_spin_unlock+0x0/0x70 [ 551.907675] EAX: c7ade814 EBX: c7ade814 ECX: c0325eb0 EDX: 00000001 [ 551.907675] ESI: c7ade800 EDI: c7ade814 EBP: c7ae7e68 ESP: c7ae7e60 [ 551.907675] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 551.907675] CR0: 8005003b CR2: b7f78280 CR3: 07ae4000 CR4: 00000690 [ 551.907675] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 551.907675] DR6: 00000000 DR7: 00000000 [ 551.907675] [<c057e57d>] ? _spin_unlock+0x1d/0x20 [ 551.907675] [<c0325eb0>] __log_wait_for_space+0x90/0xb0 [ 551.907675] [<c0323914>] start_this_handle+0xc4/0x390 [ 551.907675] [<c0274bc4>] ? filldir64+0xa4/0xf0 [ 551.907675] [<c0323d69>] journal_start+0xa9/0x110 [ 551.907675] [<c02f9e99>] ext3_journal_start_sb+0x29/0x50 [ 551.907675] [<c0274b20>] ? filldir64+0x0/0xf0 [ 551.907675] [<c02ef75c>] ext3_dirty_inode+0x2c/0x80 [ 551.907675] [<c028765c>] __mark_inode_dirty+0x2c/0x190 [ 551.907675] [<c057e642>] ? _spin_lock+0x32/0x40 [ 551.907675] [<c027c2ac>] touch_atime+0xbc/0x110 [ 551.907675] [<c0274e61>] vfs_readdir+0x91/0xa0 [ 551.907675] [<c0274b20>] ? filldir64+0x0/0xf0 [ 551.907675] [<c0274ed9>] sys_getdents64+0x69/0xb0 [ 551.907675] [<c02030ba>] syscall_call+0x7/0xb [ 551.907675] ======================= --------------------
Forgot to say, in the above comment I meant v2.6.26-rc8 + the latest patches to fix the problems in #10882.
Created attachment 16815 [details] jbd-abort-instead-of-waiting-for-nonexistent-transactions.patch I finally got some time to look at this and think I have a fix. Could you please test the attached patch and let me know whether it fixes things for you?
Created attachment 16818 [details] jbd-abort-instead-of-waiting-for-nonexistent-transactions.patch Whoops, got the locking order wrong in that first one. Try this instead.
Ok, finally found some time to test your patch with 2.6.26.1, and it seems to fix the cases I attached (but I didn't test that it still works with non-broken filesystems :). I found some new crashes however with 2.6.26.1, I'll file separate reports for them.
Actually the new crash I found seems to be one from #10882, test case hdb.25.softlockup. I'll go and reopen that bug.
Thanks for the testing, I'll submit the patch. Is it OK to add your tested-by to the changelog, BTW? Note the fix for test case hdb.25.softlockup from #10882 went into mainline during the 2.6.27 merge window, so you may want to test with 2.6.27-rc1 before re-opening it.
Ah, good to know, I'll reclose the bug then :) It's OK to add tested-by: me. Thanks.