Bug 10128 - fw_core_handle_bus_reset - BUG at lib/list_debug.c:33
Summary: fw_core_handle_bus_reset - BUG at lib/list_debug.c:33
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: IEEE1394 (show other bugs)
Hardware: All Linux
: P1 high
Assignee: Stefan Richter
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-02-28 02:18 UTC by Stefan Richter
Modified: 2008-11-29 04:47 UTC (History)
1 user (show)

See Also:
Kernel Version: all
Subsystem:
Regression: ---
Bisected commit-id:


Attachments

Description Stefan Richter 2008-02-28 02:18:21 UTC
From http://bugzilla.kernel.org/show_bug.cgi?id=9870#c2 :

VT6307 OHCI 1.0 controller, Iogear hub, Fire-i camera cause bus resets which end up in

kernel BUG at lib/list_debug.c:33!
invalid opcode: 0000 [1] SMP 
CPU 0 
Modules linked in: radeon drm ipt_MASQUERADE iptable_nat nf_nat bridge rfcomm
l2cap bluetooth autofs4 sunrpc nf_conntrack_ipv4 ipt_REJECT iptable_filter
ip_tables nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp ip6t_ipv6header
ip6t_REJECT ip6table_filter ip6_tables x_tables ipv6 cpufreq_ondemand
dm_multipath firewire_sbp2 parport_pc parport floppy snd_intel8x0
snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
pcspkr firewire_ohci firewire_core snd_seq_device serio_raw snd_pcm_oss
crc_itu_t snd_mixer_oss k8temp hwmon snd_pcm snd_timer snd soundcore
snd_page_alloc forcedeth i2c_nforce2 i2c_core button sr_mod sg cdrom pata_amd
dm_snapshot dm_zero dm_mirror dm_mod shpchp sata_sil libata sd_mod scsi_mod
ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.24-9.fc9 #1
RIP: 0010:[<ffffffff8113259a>]  [<ffffffff8113259a>] __list_add+0x47/0x5b
RSP: 0018:ffffffff8155adf0  EFLAGS: 00010086
RAX: 0000000000000079 RBX: ffff81003e9cc6e8 RCX: ffff81003d53c198
RDX: ffffffff813a37a0 RSI: 0000000000000001 RDI: ffffffff813a92a0
RBP: ffff81003d53c198 R08: ffffffff813a92c0 R09: ffff810001005900
R10: 000000000000a7b9 R11: 0000000000000000 R12: ffff81003e0e0000
R13: 0000000000000000 R14: 000000000000003f R15: ffff81003d53c220
FS:  0000000040a00950(0000) GS:ffffffff813ee000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000647000 CR3: 000000003c844000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81498000, task ffffffff813a37a0)
Stack:  ffff81003e0e0000 ffffffff88193fbb ffffffff813a4008 ffffffff00000002
 ffff81003d53c1a8 ffff81003e0e0648 0000000100012a4e ffff81003e0e064c
 ffff81003e0e0640 0000000300000000 ffff81003e0e04a8 0000000000000246
Call Trace:
 <IRQ>  [<ffffffff88193fbb>]
:firewire_core:fw_core_handle_bus_reset+0x67a/0x758
 [<ffffffff81276950>] _spin_unlock_irqrestore+0x3e/0x44
 [<ffffffff8103d32c>] tasklet_action+0x2e/0xb0
 [<ffffffff8103d35b>] tasklet_action+0x5d/0xb0
 [<ffffffff8103d23d>] __do_softirq+0x5e/0xe0
 [<ffffffff81053c99>] lock_release_holdtime+0x27/0x49
 [<ffffffff8100d0cc>] call_softirq+0x1c/0x28
 [<ffffffff8100e6ce>] do_softirq+0x31/0x86
 [<ffffffff8103d19b>] irq_exit+0x4e/0x92
 [<ffffffff8100e861>] do_IRQ+0x13e/0x161
 [<ffffffff8100b066>] default_idle+0x0/0x51
 [<ffffffff8100b066>] default_idle+0x0/0x51
 [<ffffffff8100c456>] ret_from_intr+0x0/0xf
 <EOI>  [<ffffffff8101ce78>] lapic_next_event+0x0/0xa
 [<ffffffff8100b066>] default_idle+0x0/0x51
 [<ffffffff8100b09d>] default_idle+0x37/0x51
 [<ffffffff8100b09b>] default_idle+0x35/0x51
 [<ffffffff8100b155>] cpu_idle+0x9e/0xc6
 [<ffffffff814a2b2b>] start_kernel+0x301/0x30d
 [<ffffffff814a211d>] _sinittext+0x11d/0x124


Code: 0f 0b eb fe 48 89 7e 08 48 89 37 48 89 57 08 48 89 3a 5a c3 
RIP  [<ffffffff8113259a>] __list_add+0x47/0x5b
 RSP <ffffffff8155adf0>
---[ end trace d4a4f763a33e3c92 ]---
Kernel panic - not syncing: Aiee, killing interrupt handler!

See also
http://bugzilla.kernel.org/show_bug.cgi?id=9870#c23
http://bugzilla.kernel.org/show_bug.cgi?id=9870#c24
http://bugzilla.kernel.org/show_bug.cgi?id=9870#c26
http://bugzilla.kernel.org/show_bug.cgi?id=9870#c28
Comment 1 Stefan Richter 2008-03-03 11:37:32 UTC
The following comment in update_tree() may be of significance:

    /* assert(node0->port_count == node1->port_count) */

It is easy to see that we will get list corruption if this isn't true.  I think the condition is not guaranteed to be true anymore if there were one or more self ID complete events with corrupt selfID data between two valid selfID complete events.

What is not easy to see is how to bail out if the condition isn't true.  The simplest thing to do would be to report all old nodes as lost and all current nodes as newly added.  But at the moment I don't even know how to do that.
Comment 2 Stefan Richter 2008-03-04 07:46:23 UTC
Downstream bug: https://bugzilla.redhat.com/show_bug.cgi?id=434866
"kernel panic with firewire module on Asus A7N8X-E motherboard"
Comment 3 Jarod Wilson 2008-03-19 08:02:45 UTC
Something I'm guessing is highly relevant that I noticed after inducing this panic by simply connecting a problematic jmicron controller into a pcilynx card in a box running a sniffer is that the panic happens after exactly 255 bus resets.
Comment 4 Jarod Wilson 2008-03-19 11:23:41 UTC
Still digging, but it does NOT appear to be the case that node0->port_count != node1->port_count. However, I am seeing a card->self_id_count that is always 0.

...
firewire_core: phy config: card 0, new root=ffc0, gap_count=7
firewire_core: sid changed, card sid: 0, sid: 4
firewire_core: n0pc: 2, n1pc: 2
firewire_core: n0pc: 6, n1pc: 6
firewire_core: n0pc: 3, n1pc: 3
firewire_core: phy config: card 0, new root=ffc0, gap_count=7
firewire_core: sid changed, card sid: 0, sid: 4
firewire_core: n0pc: 2, n1pc: 2
firewire_core: n0pc: 6, n1pc: 6
firewire_core: n0pc: 3, n1pc: 3
firewire_core: phy config: card 0, new root=ffc0, gap_count=7
firewire_core: sid changed, card sid: 0, sid: 4
firewire_core: n0pc: 2, n1pc: 2
firewire_core: n0pc: 6, n1pc: 6
firewire_core: n0pc: 3, n1pc: 3
firewire_core: n0pc: 2, n1pc: 2
firewire_core: n0pc: 6, n1pc: 6
list_add corruption. prev->next should be next (ffffffff81585e20), but was ffff810076c47d30. (prev=ffff810076c47d30).
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:33!
invalid opcode: 0000 [1] SMP DEBUG_PAGEALLOC
CPU 0 
Modules linked in: firewire_ohci firewire_core crc_itu_t rfcomm l2cap bluetooth autofs4 fuse sunrpc ipv6 nf_conntrack_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT iptable_filter ip_tables x_tables cpufreq_ondemand acpi_cpufreq freq_table dm_mirror dm_multipath dm_mod snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm 3c59x snd_timer mii iTCO_wdt snd_page_alloc wmi tg3 button iTCO_vendor_support snd_hwdep i82975x_edac snd serio_raw parport_pc edac_core soundcore parport sr_mod pcspkr cdrom floppy sg ata_piix ahci libata sd_mod scsi_mod raid456 async_xor async_memcpy async_tx xor raid1 ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: crc_itu_t]
Pid: 0, comm: swapper Not tainted 2.6.25-0.124.rc6.fc9 #1
RIP: 0010:[<ffffffff8113ad3e>]  [<ffffffff8113ad3e>] __list_add+0x47/0x5b
RSP: 0018:ffffffff81585d50  EFLAGS: 00010082
RAX: 0000000000000079 RBX: 0000000000000004 RCX: 0000000000000079
RDX: ffffffff813cf7a0 RSI: 0000000000000001 RDI: ffffffff813d6800
RBP: ffffffff81585d50 R08: ffffffff813d6820 R09: 0000000000000082
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
R13: ffff810076c47d20 R14: ffff810076979738 R15: ffff810079942090
FS:  0000000000000000(0000) GS:ffffffff81414000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f63c40221c8 CR3: 0000000000201000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff814be000, task ffffffff813cf7a0)
Stack:  ffffffff81585e60 ffffffff881ee1ac ffffffff81585d90 0000000281012136
 ffffffff81012136 ffffffff81585e20 ffff8100799426a0 ffffffff00000003
 ffffffff0000003f ffff810076979748 ffff8100799426e8 00000001813cf7a0
Call Trace:
 <IRQ>  [<ffffffff881ee1ac>] :firewire_core:fw_core_handle_bus_reset+0x699/0x773
 [<ffffffff81012136>] ? native_sched_clock+0x50/0x6d
 [<ffffffff88212605>] ? :firewire_ohci:bus_reset_tasklet+0x1d4/0x417
 [<ffffffff88212839>] :firewire_ohci:bus_reset_tasklet+0x408/0x417
 [<ffffffff8100c5ef>] ? restore_args+0x0/0x30
 [<ffffffff810383b7>] tasklet_action+0x5e/0xb2
 [<ffffffff81038264>] __do_softirq+0x5f/0xe3
 [<ffffffff8100d1bc>] call_softirq+0x1c/0x28
 [<ffffffff8100eab7>] do_softirq+0x39/0x8a
 [<ffffffff810381c4>] irq_exit+0x4e/0x8f
 [<ffffffff8100ec76>] do_IRQ+0x144/0x166
 [<ffffffff8100b0e4>] ? mwait_idle+0x0/0x50
 [<ffffffff8100c546>] ret_from_intr+0x0/0xf
 <EOI>  [<ffffffff8100b129>] ? mwait_idle+0x45/0x50
 [<ffffffff8100a72f>] ? enter_idle+0x22/0x24
 [<ffffffff8100b071>] ? cpu_idle+0xa4/0xce
 [<ffffffff81289ba6>] ? rest_init+0x5a/0x5c


Code: 81 48 89 c6 31 c0 e8 f7 8e ef ff 0f 0b eb fe 4c 8b 00 49 39 f0 74 18 48 89 c1 4c 89 c2 48 c7 c7 32 52 38 81 31 c0 e8 d7 8e ef ff <0f> 0b eb fe 48 89 7e 08 48 89 37 48 89 57 08 48 89 3a c9 c3 55 
RIP  [<ffffffff8113ad3e>] __list_add+0x47/0x5b
 RSP <ffffffff81585d50>
BUG: scheduling while atomic: swapper/0/0x10000100
INFO: lockdep is turned off.
irq event stamp: 69333309
hardirqs last  enabled at (69333308): [<ffffffff81299514>] _spin_unlock_irqrestore+0x3f/0x47
hardirqs last disabled at (69333309): [<ffffffff812993ad>] _spin_lock_irqsave+0x19/0x6a
softirqs last  enabled at (69333296): [<ffffffff810382df>] __do_softirq+0xda/0xe3
softirqs last disabled at (69333299): [<ffffffff8100d1bc>] call_softirq+0x1c/0x28
CPU 0:
Modules linked in: firewire_ohci firewire_core crc_itu_t rfcomm l2cap bluetooth autofs4 fuse sunrpc ipv6 nf_conntrack_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT iptable_filter ip_tables x_tables cpufreq_ondemand acpi_cpufreq freq_table dm_mirror dm_multipath dm_mod snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm 3c59x snd_timer mii iTCO_wdt snd_page_alloc wmi tg3 button iTCO_vendor_support snd_hwdep i82975x_edac snd serio_raw parport_pc edac_core soundcore parport sr_mod pcspkr cdrom floppy sg ata_piix ahci libata sd_mod scsi_mod raid456 async_xor async_memcpy async_tx xor raid1 ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: crc_itu_t]
Pid: 0, comm: swapper Tainted: G      D  2.6.25-0.124.rc6.fc9 #1
RIP: 0010:[<ffffffff8100b129>]  [<ffffffff8100b129>] mwait_idle+0x45/0x50
RSP: 0018:ffffffff814bff48  EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffffff814bff48 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00000000000000d2 RDI: ffff81000100a440
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: ffffffff814bfeb8 R12: 0000000000000046
R13: ffffffff814bfee8 R14: ffffffff813de100 R15: ffffffff813cf7a0
FS:  0000000000000000(0000) GS:ffffffff81414000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f63c40221c8 CR3: 0000000000201000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
 [<ffffffff8100a72f>] ? enter_idle+0x22/0x24
 [<ffffffff8100b071>] ? cpu_idle+0xa4/0xce
 [<ffffffff81289ba6>] ? rest_init+0x5a/0x5c
Comment 5 Jarod Wilson 2008-03-19 11:52:23 UTC
Forgot all about having already tracked this down into update_tree... Pulling over the code snippet from bug 9870, comment #28:

                for (i = 0; i < node0->port_count; i++) {
                        if (node0->ports[i] && node1->ports[i]) {
                                /*
                                 * This port didn't change, queue the
                                 * connected node for further
                                 * investigation.
                                 */
                                if (node0->ports[i]->color == card->color)
                                        continue;
                                list_add_tail(&node0->ports[i]->link, &list0);
                                list_add_tail(&node1->ports[i]->link, &list1);

Seems to be one of those list_add_tail calls leading to the panic.
Comment 6 Stefan Richter 2008-03-19 14:39:24 UTC
Some patches posted:

"firewire: insist on successive self ID complete events"
http://thread.gmane.org/gmane.linux.kernel.firewire.devel/11841/focus=11848

"firewire: fw-ohci: add self ID error check"
http://thread.gmane.org/gmane.linux.kernel.firewire.devel/11841/focus=11847

"firewire: fw-ohci: catch self_id_count == 0"
http://thread.gmane.org/gmane.linux.kernel.firewire.devel/11841/focus=11849

Jarod says these patches don't help.
Comment 7 Jarod Wilson 2008-03-19 14:44:46 UTC
Looks like we're stuck looping with firewire-core initiating the bus resets. Will drill down further later tonight or tomorrow to try to figure out who keeps calling fw_core_initiate_bus_reset...
Comment 8 Jarod Wilson 2008-03-21 20:23:09 UTC
The via vt6307 + iogear + unibrain fire-i combination no longer continuously panics until bus reset with this patch:

https://sourceforge.net/mailarchive/forum.php?thread_name=tkrat.47c2348a94015e02%40s5r6.in-berlin.de&forum_name=linux1394-devel

I'm assuming it'll fix the case w/the jmicron controller as well (will check Monday), and might possibly help nforce2 controllers. Could still be a legit problem with handling lots of bus resets, but that patch is definitely an improvement.
Comment 9 Jarod Wilson 2008-04-15 08:29:22 UTC
The fix referenced in comment #8 did help the jmicron card, but wasn't the entire fix. It also required:

http://git.kernel.org/?p=linux/kernel/git/ieee1394/linux1394-2.6.git;a=commit;h=1012835d4950d999a2f5854074b8c9da27769e99

Still need to see if this panic is reproduceable via a forced bus reset storm...
Comment 10 Stefan Richter 2008-10-16 14:14:29 UTC
Is this bug dead yet?
Comment 11 Stefan Richter 2008-11-29 04:47:10 UTC
Since we had several fixes in the topology code and added more consistency checks of the self ID buffer, I assume this is fixed by now.

Note You need to log in before you can comment on or make changes to this bug.