Bug 9817 - oops in jbd2_journal_commit_transaction after journal abort
Summary: oops in jbd2_journal_commit_transaction after journal abort
Status: REJECTED UNREPRODUCIBLE
Alias: None
Product: File System
Classification: Unclassified
Component: ext4 (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-01-25 14:22 UTC by Eric Sandeen
Modified: 2009-01-17 18:14 UTC (History)
3 users (show)

See Also:
Kernel Version: 2.6.24-0.174.rc8.git7
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
more dmesg (51 bytes, text/plain)
2008-01-25 14:24 UTC, Eric Sandeen
Details
local copy (12.16 KB, text/plain)
2008-01-25 14:24 UTC, Eric Sandeen
Details

Description Eric Sandeen 2008-01-25 14:22:09 UTC
This was after firewire went south, but probably still should not have oopsed the box:

firewire_core: Failed to add device.
firewire_sbp2: released fw5.0
firewire_sbp2: error status: 0:9
firewire_core: created new fw device fw1 (0 config rom retries, S100)
firewire_core: phy config: card 0, new root=ffc1, gap_count=5
firewire_core: BM lock failed, making local node (ffc0) root.
firewire_core: phy config: card 0, new root=ffc0, gap_count=5
firewire_core: phy config: card 0, new root=ffc1, gap_count=5
firewire_sbp2: error status: 0:9
firewire_sbp2: error status: 0:9
firewire_sbp2: error status: 0:9
firewire_sbp2: error status: 0:9
firewire_sbp2: error status: 0:9
firewire_sbp2: failed to reconnect to fw6.1
firewire_sbp2: logged in to fw6.1 LUN 0000 (0 retries)

....

Aborting journal on device md0.
EXT4-fs error (device md0) in ext4_dirty_inode: IO failure
EXT4-fs error (device md0) in ext4_reserve_inode_write: Journal has aborted
EXT4-fs error (device md0) in ext4_mb_free_blocks: Journal has aborted
EXT4-fs error (device md0) in ext4_write_begin: Journal has aborted
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
 [<ffffffff8831a17e>] :jbd2:jbd2_journal_commit_transaction+0xdc8/0x1392
PGD 22a9fd067 PUD 22a9c1067 PMD 0 
Oops: 0002 [1] SMP 
CPU 3 
Modules linked in: raid0 ipt_MASQUERADE iptable_nat nf_nat bridge rfcomm l2cap bluetooth autofs4 nfs lockd nfs_acl sunrpc nf_conntrack_ipv4 ipt_REJECT iptable_filter ip_tables nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables x_tables cpufreq_ondemand ext4dev jbd2 crc16 loop dm_multipath radeon drm ipv6 kvm_amd kvm floppy snd_hda_intel rtc_cmos snd_seq_dummy pcspkr snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device shpchp snd_pcm_oss snd_mixer_oss snd_pcm snd_timer forcedeth snd_page_alloc snd_hwdep snd soundcore i2c_nforce2 i2c_core button sg sr_mod cdrom firewire_sbp2 firewire_ohci firewire_core crc_itu_t pata_amd dm_snapshot dm_zero dm_mirror dm_mod sata_nv pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 4033, comm: kjournald2 Not tainted 2.6.24-0.174.rc8.git7.fc9.fw #1
RIP: 0010:[<ffffffff8831a17e>]  [<ffffffff8831a17e>] :jbd2:jbd2_journal_commit_transaction+0xdc8/0x1392
RSP: 0018:ffff810227c67de0  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000ffffffff
RDX: 0000000000000000 RSI: ffff81023bca3cd0 RDI: ffff81021e4aa000
RBP: ffff81021e4aa000 R08: 0000000000000002 R09: ffff8101c751c150
R10: ffffe2000aa786a0 R11: 00000000fffffffa R12: 0000000000000000
R13: ffff81023bca3cd0 R14: 0000000000000000 R15: 0000000000000000
FS:  00002aaaaaadcf60(0000) GS:ffff81023fc01d48(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000022d997000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kjournald2 (pid: 4033, threadinfo ffff810227c66000, task ffff81021c4be000)
Stack:  0000000000000000 0000000000000000 ffff81021c466000 0000000000000000
 0000000800000000 00000000ffffffff ffff810227c67eb0 ffffffff8159ad98
 0000000000000001 0000000000000005 000000000000047f 0000000000000000
Call Trace:
 [<ffffffff8104ae86>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8831e6d1>] :jbd2:kjournald2+0x134/0x380
 [<ffffffff8104ae86>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8831e59d>] :jbd2:kjournald2+0x0/0x380
 [<ffffffff8104ad66>] kthread+0x47/0x75
 [<ffffffff81275240>] trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff8100cde8>] child_rip+0xa/0x12
 [<ffffffff8100c4ff>] restore_args+0x0/0x30
 [<ffffffff8104ad1f>] kthread+0x0/0x75
 [<ffffffff8100cdde>] child_rip+0x0/0x12


Code: f0 0f ba 33 01 be 3e 01 00 00 48 c7 c7 8e f8 31 88 e8 31 58 
RIP  [<ffffffff8831a17e>] :jbd2:jbd2_journal_commit_transaction+0xdc8/0x1392
 RSP <ffff810227c67de0>
CR2: 0000000000000000
---[ end trace c3e02ae5f2e18995 ]---
ext4_abort called.
EXT4-fs error (device md0): ext4_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
Comment 1 Eric Sandeen 2008-01-25 14:24:00 UTC
Created attachment 14579 [details]
more dmesg

more dmesg
Comment 2 Eric Sandeen 2008-01-25 14:24:39 UTC
Created attachment 14580 [details]
local copy
Comment 3 Jarod Wilson 2008-01-26 13:59:57 UTC
Can't recall at the moment whether or not the kernel had Stefan's latest sbp2 login/reconnect fixes or not. My thought is not. Should be reasonably easy to reproduce the situation sometime next week though.
Comment 4 Jarod Wilson 2008-01-26 14:05:58 UTC
Oops. By "latest", I meant the latest ones already in linux1394-git as of right now, not the ones just posted to lkml earlier today. However, those latest ones look like they could be relevant to the firewire side of this issue too.
Comment 5 Stefan Richter 2008-01-26 14:20:51 UTC
From the attachment:
> firewire_core: created new fw device fw0 (0 config rom retries, S400)
> firewire_core: created new fw device fw1 (0 config rom retries, S100)
> firewire_core: phy config: card 0, new root=ffc1, gap_count=5
> firewire_core: created new fw device fw2 (0 config rom retries, S400)
> firewire_core: created new fw device fw3 (0 config rom retries, S400)
> firewire_core: phy config: card 1, new root=ffc1, gap_count=5
> firewire_core: created new fw device fw4 (0 config rom retries, S800)
...
> firewire_core: created new fw device fw5 (0 config rom retries, S800)
...
> firewire_core: created new fw device fw6 (0 config rom retries, S800)

How many FireWire devices have actually been on the bus?  And is it true that one of them was a 100 Mbit/s device?

Jarod and I did some improvements regarding firewire-sbp2's login and reconnection behavior lately.  However, there is still something to do in the area of reconnection while IO is going on (bug 9734).

That said, it's always good if a filesystem doesn't oops if a connection suddenly goes away.  So, if you filesystem guys are interested in stabilizing this, just take a USB disk, do some filesystem I/O, and unplug during the I/O...
Comment 6 Eric Sandeen 2008-01-26 14:30:53 UTC
Agreed; we shouldn't have oopsed.  Still working out where/when/why we did.
Comment 7 Stefan Richter 2008-01-26 15:06:02 UTC
BTW, I just filed bug 9828 for the "WARNING: at fs/sysfs/dir.c:424 sysfs_add_one()" in the attachment.  This warning is unrelated to this bug here.
Comment 8 Jarod Wilson 2008-01-26 16:44:25 UTC
(In reply to comment #5)
> From the attachment:
> > firewire_core: created new fw device fw0 (0 config rom retries, S400)
> > firewire_core: created new fw device fw1 (0 config rom retries, S100)
> > firewire_core: phy config: card 0, new root=ffc1, gap_count=5
> > firewire_core: created new fw device fw2 (0 config rom retries, S400)
> > firewire_core: created new fw device fw3 (0 config rom retries, S400)
> > firewire_core: phy config: card 1, new root=ffc1, gap_count=5
> > firewire_core: created new fw device fw4 (0 config rom retries, S800)
> ...
> > firewire_core: created new fw device fw5 (0 config rom retries, S800)
> ...
> > firewire_core: created new fw device fw6 (0 config rom retries, S800)
> 
> How many FireWire devices have actually been on the bus?  And is it true that
> one of them was a 100 Mbit/s device?

This is one of my test boxes, with 3 firewire controllers in it. The S100 device (fw1) is a dv camera hooked to the first card (fw0), the second controller (fw2) has a webcam (fw3) hooked to it. fw4 is a FW800 card, with two FW800 drives (fw5 and fw6) hooked up to it.

> Jarod and I did some improvements regarding firewire-sbp2's login and
> reconnection behavior lately.  However, there is still something to do in the
> area of reconnection while IO is going on (bug 9734).

I plan to dig into this one more next week and see if I can reproduce with the latest and greatest linux1394-git tree and with some of the pending/under review patches as well.
Comment 9 Stefan Richter 2008-01-26 16:59:34 UTC
A few "error status: 0:9", then "failed to reconnect", then "logged in" should happen less often with patch "try to increase reconnect_hold".  The target refused reconnect but accepted re-login because fw-sbp2 was already too late with its first reconnect attempt.
Comment 10 Stefan Richter 2008-01-26 17:01:20 UTC
PS:  As noted in the changelog of the patch, the patch is ineffective with Initio and Prolific bridges which don't give us extended reconnect_hold.
Comment 11 Natalie Protasevich 2008-06-02 15:25:23 UTC
Any updates on this problem, testing with latest kernel?
Thanks.
Comment 12 Theodore Tso 2009-01-17 18:14:51 UTC
This bug report so old, and I've seen multiple cases where we don't oops after the journal is aborted, so I'm going to close this bug. If you can reproduce this with a newer kernel, please open a new bug.  Thanks!!

Note You need to log in before you can comment on or make changes to this bug.