From http://bugzilla.kernel.org/show_bug.cgi?id=9870#c2 : VT6307 OHCI 1.0 controller, Iogear hub, Fire-i camera cause bus resets which end up in kernel BUG at lib/list_debug.c:33! invalid opcode: 0000 [1] SMP CPU 0 Modules linked in: radeon drm ipt_MASQUERADE iptable_nat nf_nat bridge rfcomm l2cap bluetooth autofs4 sunrpc nf_conntrack_ipv4 ipt_REJECT iptable_filter ip_tables nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables x_tables ipv6 cpufreq_ondemand dm_multipath firewire_sbp2 parport_pc parport floppy snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq pcspkr firewire_ohci firewire_core snd_seq_device serio_raw snd_pcm_oss crc_itu_t snd_mixer_oss k8temp hwmon snd_pcm snd_timer snd soundcore snd_page_alloc forcedeth i2c_nforce2 i2c_core button sr_mod sg cdrom pata_amd dm_snapshot dm_zero dm_mirror dm_mod shpchp sata_sil libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd Pid: 0, comm: swapper Not tainted 2.6.24-9.fc9 #1 RIP: 0010:[<ffffffff8113259a>] [<ffffffff8113259a>] __list_add+0x47/0x5b RSP: 0018:ffffffff8155adf0 EFLAGS: 00010086 RAX: 0000000000000079 RBX: ffff81003e9cc6e8 RCX: ffff81003d53c198 RDX: ffffffff813a37a0 RSI: 0000000000000001 RDI: ffffffff813a92a0 RBP: ffff81003d53c198 R08: ffffffff813a92c0 R09: ffff810001005900 R10: 000000000000a7b9 R11: 0000000000000000 R12: ffff81003e0e0000 R13: 0000000000000000 R14: 000000000000003f R15: ffff81003d53c220 FS: 0000000040a00950(0000) GS:ffffffff813ee000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000647000 CR3: 000000003c844000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffffffff81498000, task ffffffff813a37a0) Stack: ffff81003e0e0000 ffffffff88193fbb ffffffff813a4008 ffffffff00000002 ffff81003d53c1a8 ffff81003e0e0648 0000000100012a4e ffff81003e0e064c ffff81003e0e0640 0000000300000000 ffff81003e0e04a8 0000000000000246 Call Trace: <IRQ> [<ffffffff88193fbb>] :firewire_core:fw_core_handle_bus_reset+0x67a/0x758 [<ffffffff81276950>] _spin_unlock_irqrestore+0x3e/0x44 [<ffffffff8103d32c>] tasklet_action+0x2e/0xb0 [<ffffffff8103d35b>] tasklet_action+0x5d/0xb0 [<ffffffff8103d23d>] __do_softirq+0x5e/0xe0 [<ffffffff81053c99>] lock_release_holdtime+0x27/0x49 [<ffffffff8100d0cc>] call_softirq+0x1c/0x28 [<ffffffff8100e6ce>] do_softirq+0x31/0x86 [<ffffffff8103d19b>] irq_exit+0x4e/0x92 [<ffffffff8100e861>] do_IRQ+0x13e/0x161 [<ffffffff8100b066>] default_idle+0x0/0x51 [<ffffffff8100b066>] default_idle+0x0/0x51 [<ffffffff8100c456>] ret_from_intr+0x0/0xf <EOI> [<ffffffff8101ce78>] lapic_next_event+0x0/0xa [<ffffffff8100b066>] default_idle+0x0/0x51 [<ffffffff8100b09d>] default_idle+0x37/0x51 [<ffffffff8100b09b>] default_idle+0x35/0x51 [<ffffffff8100b155>] cpu_idle+0x9e/0xc6 [<ffffffff814a2b2b>] start_kernel+0x301/0x30d [<ffffffff814a211d>] _sinittext+0x11d/0x124 Code: 0f 0b eb fe 48 89 7e 08 48 89 37 48 89 57 08 48 89 3a 5a c3 RIP [<ffffffff8113259a>] __list_add+0x47/0x5b RSP <ffffffff8155adf0> ---[ end trace d4a4f763a33e3c92 ]--- Kernel panic - not syncing: Aiee, killing interrupt handler! See also http://bugzilla.kernel.org/show_bug.cgi?id=9870#c23 http://bugzilla.kernel.org/show_bug.cgi?id=9870#c24 http://bugzilla.kernel.org/show_bug.cgi?id=9870#c26 http://bugzilla.kernel.org/show_bug.cgi?id=9870#c28
The following comment in update_tree() may be of significance: /* assert(node0->port_count == node1->port_count) */ It is easy to see that we will get list corruption if this isn't true. I think the condition is not guaranteed to be true anymore if there were one or more self ID complete events with corrupt selfID data between two valid selfID complete events. What is not easy to see is how to bail out if the condition isn't true. The simplest thing to do would be to report all old nodes as lost and all current nodes as newly added. But at the moment I don't even know how to do that.
Downstream bug: https://bugzilla.redhat.com/show_bug.cgi?id=434866 "kernel panic with firewire module on Asus A7N8X-E motherboard"
Something I'm guessing is highly relevant that I noticed after inducing this panic by simply connecting a problematic jmicron controller into a pcilynx card in a box running a sniffer is that the panic happens after exactly 255 bus resets.
Still digging, but it does NOT appear to be the case that node0->port_count != node1->port_count. However, I am seeing a card->self_id_count that is always 0. ... firewire_core: phy config: card 0, new root=ffc0, gap_count=7 firewire_core: sid changed, card sid: 0, sid: 4 firewire_core: n0pc: 2, n1pc: 2 firewire_core: n0pc: 6, n1pc: 6 firewire_core: n0pc: 3, n1pc: 3 firewire_core: phy config: card 0, new root=ffc0, gap_count=7 firewire_core: sid changed, card sid: 0, sid: 4 firewire_core: n0pc: 2, n1pc: 2 firewire_core: n0pc: 6, n1pc: 6 firewire_core: n0pc: 3, n1pc: 3 firewire_core: phy config: card 0, new root=ffc0, gap_count=7 firewire_core: sid changed, card sid: 0, sid: 4 firewire_core: n0pc: 2, n1pc: 2 firewire_core: n0pc: 6, n1pc: 6 firewire_core: n0pc: 3, n1pc: 3 firewire_core: n0pc: 2, n1pc: 2 firewire_core: n0pc: 6, n1pc: 6 list_add corruption. prev->next should be next (ffffffff81585e20), but was ffff810076c47d30. (prev=ffff810076c47d30). ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:33! invalid opcode: 0000 [1] SMP DEBUG_PAGEALLOC CPU 0 Modules linked in: firewire_ohci firewire_core crc_itu_t rfcomm l2cap bluetooth autofs4 fuse sunrpc ipv6 nf_conntrack_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT iptable_filter ip_tables x_tables cpufreq_ondemand acpi_cpufreq freq_table dm_mirror dm_multipath dm_mod snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm 3c59x snd_timer mii iTCO_wdt snd_page_alloc wmi tg3 button iTCO_vendor_support snd_hwdep i82975x_edac snd serio_raw parport_pc edac_core soundcore parport sr_mod pcspkr cdrom floppy sg ata_piix ahci libata sd_mod scsi_mod raid456 async_xor async_memcpy async_tx xor raid1 ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: crc_itu_t] Pid: 0, comm: swapper Not tainted 2.6.25-0.124.rc6.fc9 #1 RIP: 0010:[<ffffffff8113ad3e>] [<ffffffff8113ad3e>] __list_add+0x47/0x5b RSP: 0018:ffffffff81585d50 EFLAGS: 00010082 RAX: 0000000000000079 RBX: 0000000000000004 RCX: 0000000000000079 RDX: ffffffff813cf7a0 RSI: 0000000000000001 RDI: ffffffff813d6800 RBP: ffffffff81585d50 R08: ffffffff813d6820 R09: 0000000000000082 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: ffff810076c47d20 R14: ffff810076979738 R15: ffff810079942090 FS: 0000000000000000(0000) GS:ffffffff81414000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00007f63c40221c8 CR3: 0000000000201000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffffffff814be000, task ffffffff813cf7a0) Stack: ffffffff81585e60 ffffffff881ee1ac ffffffff81585d90 0000000281012136 ffffffff81012136 ffffffff81585e20 ffff8100799426a0 ffffffff00000003 ffffffff0000003f ffff810076979748 ffff8100799426e8 00000001813cf7a0 Call Trace: <IRQ> [<ffffffff881ee1ac>] :firewire_core:fw_core_handle_bus_reset+0x699/0x773 [<ffffffff81012136>] ? native_sched_clock+0x50/0x6d [<ffffffff88212605>] ? :firewire_ohci:bus_reset_tasklet+0x1d4/0x417 [<ffffffff88212839>] :firewire_ohci:bus_reset_tasklet+0x408/0x417 [<ffffffff8100c5ef>] ? restore_args+0x0/0x30 [<ffffffff810383b7>] tasklet_action+0x5e/0xb2 [<ffffffff81038264>] __do_softirq+0x5f/0xe3 [<ffffffff8100d1bc>] call_softirq+0x1c/0x28 [<ffffffff8100eab7>] do_softirq+0x39/0x8a [<ffffffff810381c4>] irq_exit+0x4e/0x8f [<ffffffff8100ec76>] do_IRQ+0x144/0x166 [<ffffffff8100b0e4>] ? mwait_idle+0x0/0x50 [<ffffffff8100c546>] ret_from_intr+0x0/0xf <EOI> [<ffffffff8100b129>] ? mwait_idle+0x45/0x50 [<ffffffff8100a72f>] ? enter_idle+0x22/0x24 [<ffffffff8100b071>] ? cpu_idle+0xa4/0xce [<ffffffff81289ba6>] ? rest_init+0x5a/0x5c Code: 81 48 89 c6 31 c0 e8 f7 8e ef ff 0f 0b eb fe 4c 8b 00 49 39 f0 74 18 48 89 c1 4c 89 c2 48 c7 c7 32 52 38 81 31 c0 e8 d7 8e ef ff <0f> 0b eb fe 48 89 7e 08 48 89 37 48 89 57 08 48 89 3a c9 c3 55 RIP [<ffffffff8113ad3e>] __list_add+0x47/0x5b RSP <ffffffff81585d50> BUG: scheduling while atomic: swapper/0/0x10000100 INFO: lockdep is turned off. irq event stamp: 69333309 hardirqs last enabled at (69333308): [<ffffffff81299514>] _spin_unlock_irqrestore+0x3f/0x47 hardirqs last disabled at (69333309): [<ffffffff812993ad>] _spin_lock_irqsave+0x19/0x6a softirqs last enabled at (69333296): [<ffffffff810382df>] __do_softirq+0xda/0xe3 softirqs last disabled at (69333299): [<ffffffff8100d1bc>] call_softirq+0x1c/0x28 CPU 0: Modules linked in: firewire_ohci firewire_core crc_itu_t rfcomm l2cap bluetooth autofs4 fuse sunrpc ipv6 nf_conntrack_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT iptable_filter ip_tables x_tables cpufreq_ondemand acpi_cpufreq freq_table dm_mirror dm_multipath dm_mod snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm 3c59x snd_timer mii iTCO_wdt snd_page_alloc wmi tg3 button iTCO_vendor_support snd_hwdep i82975x_edac snd serio_raw parport_pc edac_core soundcore parport sr_mod pcspkr cdrom floppy sg ata_piix ahci libata sd_mod scsi_mod raid456 async_xor async_memcpy async_tx xor raid1 ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: crc_itu_t] Pid: 0, comm: swapper Tainted: G D 2.6.25-0.124.rc6.fc9 #1 RIP: 0010:[<ffffffff8100b129>] [<ffffffff8100b129>] mwait_idle+0x45/0x50 RSP: 0018:ffffffff814bff48 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffffffff814bff48 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 00000000000000d2 RDI: ffff81000100a440 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: ffffffff814bfeb8 R12: 0000000000000046 R13: ffffffff814bfee8 R14: ffffffff813de100 R15: ffffffff813cf7a0 FS: 0000000000000000(0000) GS:ffffffff81414000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00007f63c40221c8 CR3: 0000000000201000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: [<ffffffff8100a72f>] ? enter_idle+0x22/0x24 [<ffffffff8100b071>] ? cpu_idle+0xa4/0xce [<ffffffff81289ba6>] ? rest_init+0x5a/0x5c
Forgot all about having already tracked this down into update_tree... Pulling over the code snippet from bug 9870, comment #28: for (i = 0; i < node0->port_count; i++) { if (node0->ports[i] && node1->ports[i]) { /* * This port didn't change, queue the * connected node for further * investigation. */ if (node0->ports[i]->color == card->color) continue; list_add_tail(&node0->ports[i]->link, &list0); list_add_tail(&node1->ports[i]->link, &list1); Seems to be one of those list_add_tail calls leading to the panic.
Some patches posted: "firewire: insist on successive self ID complete events" http://thread.gmane.org/gmane.linux.kernel.firewire.devel/11841/focus=11848 "firewire: fw-ohci: add self ID error check" http://thread.gmane.org/gmane.linux.kernel.firewire.devel/11841/focus=11847 "firewire: fw-ohci: catch self_id_count == 0" http://thread.gmane.org/gmane.linux.kernel.firewire.devel/11841/focus=11849 Jarod says these patches don't help.
Looks like we're stuck looping with firewire-core initiating the bus resets. Will drill down further later tonight or tomorrow to try to figure out who keeps calling fw_core_initiate_bus_reset...
The via vt6307 + iogear + unibrain fire-i combination no longer continuously panics until bus reset with this patch: https://sourceforge.net/mailarchive/forum.php?thread_name=tkrat.47c2348a94015e02%40s5r6.in-berlin.de&forum_name=linux1394-devel I'm assuming it'll fix the case w/the jmicron controller as well (will check Monday), and might possibly help nforce2 controllers. Could still be a legit problem with handling lots of bus resets, but that patch is definitely an improvement.
The fix referenced in comment #8 did help the jmicron card, but wasn't the entire fix. It also required: http://git.kernel.org/?p=linux/kernel/git/ieee1394/linux1394-2.6.git;a=commit;h=1012835d4950d999a2f5854074b8c9da27769e99 Still need to see if this panic is reproduceable via a forced bus reset storm...
Is this bug dead yet?
Since we had several fixes in the topology code and added more consistency checks of the self ID buffer, I assume this is fixed by now.