Bug 51571 - Assertion of j_running_transaction on jbd2_journal_flush()
Summary: Assertion of j_running_transaction on jbd2_journal_flush()
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: ext4 (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Jan Kara
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-12-11 09:26 UTC by yjwsignal
Modified: 2012-12-13 12:41 UTC (History)
1 user (show)

See Also:
Kernel Version: 3.4.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Patch fixing assertion failure in jbd2_journal_flush() (2.09 KB, patch)
2012-12-12 00:10 UTC, Jan Kara
Details | Diff

Description yjwsignal 2012-12-11 09:26:14 UTC
This BUG is occurred at the remount operations.

[  214.601983] kernel BUG at fs/jbd2/journal.c : 1769

       int jbd2_journal_flush(journal_t *journal)
       {
           < SNIP >
line:1767  mutex_unlock(&journal->j_checkpoint_mutex);
line:1768  write_lock(&journal->j_state_lock);
line:1769  J_ASSERT(!journal->j_running_transaction);
           < SNIP >
       }

The call-stacks of this Procecss-A are below :
  jbd2_journal_flush+0xF4/0x164
  ext4_mark_recovery_complete
  ext4_remount
  do_remount_sb

And, on this situation, the variable of journal is like below:
  j_flags = 0x38
  j_superblock->s_sequence = 0x049F
  j_tail_sequence = 0x049F
  j_transaction_sequence = 0x04A0
  j_commit_sequence = 0x049E
  j_commit_request = 0x049E
  j_head = 0x25DE
  j_tail = 0x25DE
  j_free = 0x7FFF
  j_barrier_count = 0x1
  j_last_sync_writer = 0x0570
  j_running_transaction->t_tid = 0x49F
  j_running_transaction->t_state = T_RUNNING
  j_running_transaction->t_start_time = 0x00000031F66EEAB4
  j_running_transaction->t_updates = 0

And, The call-stacks of other Process-B are below :

   __schedule+0x634
   start_this_handle+0x164
   jbd2__journal_start+0x98
   ext4_journal_start_sb+0x1a4
   ext4_rename+0x38
   vfs_rename+0x28c

i.e The process-B is scheduled at "wait_event()" of the below "start_this_handle()". 

static int start_this_handle(journal_t *journal, handle_t *handle,
           gfp_t gfp_mask)
{
  < SNIP >
  /* Wait on the journal's transaction barrier if necessary */
  if (journal->j_barrier_count) {
    read_unlock(&journal->j_state_lock);
    wait_event(journal->j_wait_transaction_locked,
        journal->j_barrier_count == 0);
    goto repeat;
  }

  if (!journal->j_running_transaction) {
    read_unlock(&journal->j_state_lock);
    if (!new_transaction)
      goto alloc_transaction;
    write_lock(&journal->j_state_lock);
    if (!journal->j_running_transaction) {
      jbd2_get_transaction(journal, new_transaction);
      new_transaction = NULL;
    }
    write_unlock(&journal->j_state_lock);
    goto repeat;
  }
  < SNIP >
}

Would you tell me about the reason why this bug is occurred??

Thanks
Comment 1 Jan Kara 2012-12-12 00:01:04 UTC
Thanks for detailed report! I presume you have a crash dump when you could provide such details? Can you have a look at what journal->j_running_transaction->t_handle_count is? I can see the transaction has been started some 14 ms ago. That's not too long but still plenty in terms of CPU time.

What I think is happening is:
Process A                                    Process B
start_this_handle().
  if (journal->j_barrier_count) # false
  if (!journal->j_running_transaction) { #true
    read_unlock(&journal->j_state_lock);
                                           jbd2_journal_lock_updates()
                                           jbd2_journal_flush()
                                             write_lock(&journal->j_state_lock);
                                             if (journal->j_running_transaction) {
                                               #false
                                             ... wait for committing trans ...
                                             write_unlock(&journal->j_state_lock);
    ...
    write_lock(&journal->j_state_lock);
    jbd2_get_transaction(journal, new_transaction); # Sets j_running_transaction
    write_unlock(&journal->j_state_lock);
    goto repeat; # eventually blocks on j_barrier_count > 0

I will attach here a patch that should fix this...
Comment 2 Jan Kara 2012-12-12 00:10:56 UTC
Created attachment 88941 [details]
Patch fixing assertion failure in jbd2_journal_flush()

Can you test this patch please?
Comment 3 yjwsignal 2012-12-12 04:26:28 UTC
Dear Jan Kara.

journal->j_running_transaction->t_handle_count is 0x0.

I also agree with your opinion. Thank you.

But, this bug is rarely happened.
ie, It is very difficult to reproduce this bug again.

But, as you memtioned, I also think that this bug has connection with
j_barrier_count at this race situation.

I wonder that this patch has the possibility to be applyed to the jbd2 at the next release of kernel.

Thank you.
Comment 4 Jan Kara 2012-12-12 15:18:18 UTC
Thanks, t_handle_count == 0 confirms the scenario I described likely happened. I'll push the fix upstream.
Comment 5 Jan Kara 2012-12-12 15:19:16 UTC
Patch sent.
Comment 6 yjwsignal 2012-12-13 00:42:16 UTC
Dear Jan Kara

Thank your for your help, sincerely.

Besides, I want to know where is handled your patch, or by whom. please.
and at which kernel site. etc.

Thank you.
Comment 7 Jan Kara 2012-12-13 12:41:12 UTC
I sent the fix to ext4 maintainer and linux-ext4 mailing list - see e.g. http://lists.openwall.net/linux-ext4/2012/12/12/11

Note You need to log in before you can comment on or make changes to this bug.