This BUG is occurred at the remount operations. [ 214.601983] kernel BUG at fs/jbd2/journal.c : 1769 int jbd2_journal_flush(journal_t *journal) { < SNIP > line:1767 mutex_unlock(&journal->j_checkpoint_mutex); line:1768 write_lock(&journal->j_state_lock); line:1769 J_ASSERT(!journal->j_running_transaction); < SNIP > } The call-stacks of this Procecss-A are below : jbd2_journal_flush+0xF4/0x164 ext4_mark_recovery_complete ext4_remount do_remount_sb And, on this situation, the variable of journal is like below: j_flags = 0x38 j_superblock->s_sequence = 0x049F j_tail_sequence = 0x049F j_transaction_sequence = 0x04A0 j_commit_sequence = 0x049E j_commit_request = 0x049E j_head = 0x25DE j_tail = 0x25DE j_free = 0x7FFF j_barrier_count = 0x1 j_last_sync_writer = 0x0570 j_running_transaction->t_tid = 0x49F j_running_transaction->t_state = T_RUNNING j_running_transaction->t_start_time = 0x00000031F66EEAB4 j_running_transaction->t_updates = 0 And, The call-stacks of other Process-B are below : __schedule+0x634 start_this_handle+0x164 jbd2__journal_start+0x98 ext4_journal_start_sb+0x1a4 ext4_rename+0x38 vfs_rename+0x28c i.e The process-B is scheduled at "wait_event()" of the below "start_this_handle()". static int start_this_handle(journal_t *journal, handle_t *handle, gfp_t gfp_mask) { < SNIP > /* Wait on the journal's transaction barrier if necessary */ if (journal->j_barrier_count) { read_unlock(&journal->j_state_lock); wait_event(journal->j_wait_transaction_locked, journal->j_barrier_count == 0); goto repeat; } if (!journal->j_running_transaction) { read_unlock(&journal->j_state_lock); if (!new_transaction) goto alloc_transaction; write_lock(&journal->j_state_lock); if (!journal->j_running_transaction) { jbd2_get_transaction(journal, new_transaction); new_transaction = NULL; } write_unlock(&journal->j_state_lock); goto repeat; } < SNIP > } Would you tell me about the reason why this bug is occurred?? Thanks
Thanks for detailed report! I presume you have a crash dump when you could provide such details? Can you have a look at what journal->j_running_transaction->t_handle_count is? I can see the transaction has been started some 14 ms ago. That's not too long but still plenty in terms of CPU time. What I think is happening is: Process A Process B start_this_handle(). if (journal->j_barrier_count) # false if (!journal->j_running_transaction) { #true read_unlock(&journal->j_state_lock); jbd2_journal_lock_updates() jbd2_journal_flush() write_lock(&journal->j_state_lock); if (journal->j_running_transaction) { #false ... wait for committing trans ... write_unlock(&journal->j_state_lock); ... write_lock(&journal->j_state_lock); jbd2_get_transaction(journal, new_transaction); # Sets j_running_transaction write_unlock(&journal->j_state_lock); goto repeat; # eventually blocks on j_barrier_count > 0 I will attach here a patch that should fix this...
Created attachment 88941 [details] Patch fixing assertion failure in jbd2_journal_flush() Can you test this patch please?
Dear Jan Kara. journal->j_running_transaction->t_handle_count is 0x0. I also agree with your opinion. Thank you. But, this bug is rarely happened. ie, It is very difficult to reproduce this bug again. But, as you memtioned, I also think that this bug has connection with j_barrier_count at this race situation. I wonder that this patch has the possibility to be applyed to the jbd2 at the next release of kernel. Thank you.
Thanks, t_handle_count == 0 confirms the scenario I described likely happened. I'll push the fix upstream.
Patch sent.
Dear Jan Kara Thank your for your help, sincerely. Besides, I want to know where is handled your patch, or by whom. please. and at which kernel site. etc. Thank you.
I sent the fix to ext4 maintainer and linux-ext4 mailing list - see e.g. http://lists.openwall.net/linux-ext4/2012/12/12/11