Bug 10128
Summary: | fw_core_handle_bus_reset - BUG at lib/list_debug.c:33 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Stefan Richter (stefanr) |
Component: | IEEE1394 | Assignee: | Stefan Richter (stefanr) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | jarod |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | all | Subsystem: | |
Regression: | --- | Bisected commit-id: |
Description
Stefan Richter
2008-02-28 02:18:21 UTC
The following comment in update_tree() may be of significance: /* assert(node0->port_count == node1->port_count) */ It is easy to see that we will get list corruption if this isn't true. I think the condition is not guaranteed to be true anymore if there were one or more self ID complete events with corrupt selfID data between two valid selfID complete events. What is not easy to see is how to bail out if the condition isn't true. The simplest thing to do would be to report all old nodes as lost and all current nodes as newly added. But at the moment I don't even know how to do that. Downstream bug: https://bugzilla.redhat.com/show_bug.cgi?id=434866 "kernel panic with firewire module on Asus A7N8X-E motherboard" Something I'm guessing is highly relevant that I noticed after inducing this panic by simply connecting a problematic jmicron controller into a pcilynx card in a box running a sniffer is that the panic happens after exactly 255 bus resets. Still digging, but it does NOT appear to be the case that node0->port_count != node1->port_count. However, I am seeing a card->self_id_count that is always 0. ... firewire_core: phy config: card 0, new root=ffc0, gap_count=7 firewire_core: sid changed, card sid: 0, sid: 4 firewire_core: n0pc: 2, n1pc: 2 firewire_core: n0pc: 6, n1pc: 6 firewire_core: n0pc: 3, n1pc: 3 firewire_core: phy config: card 0, new root=ffc0, gap_count=7 firewire_core: sid changed, card sid: 0, sid: 4 firewire_core: n0pc: 2, n1pc: 2 firewire_core: n0pc: 6, n1pc: 6 firewire_core: n0pc: 3, n1pc: 3 firewire_core: phy config: card 0, new root=ffc0, gap_count=7 firewire_core: sid changed, card sid: 0, sid: 4 firewire_core: n0pc: 2, n1pc: 2 firewire_core: n0pc: 6, n1pc: 6 firewire_core: n0pc: 3, n1pc: 3 firewire_core: n0pc: 2, n1pc: 2 firewire_core: n0pc: 6, n1pc: 6 list_add corruption. prev->next should be next (ffffffff81585e20), but was ffff810076c47d30. (prev=ffff810076c47d30). ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:33! invalid opcode: 0000 [1] SMP DEBUG_PAGEALLOC CPU 0 Modules linked in: firewire_ohci firewire_core crc_itu_t rfcomm l2cap bluetooth autofs4 fuse sunrpc ipv6 nf_conntrack_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT iptable_filter ip_tables x_tables cpufreq_ondemand acpi_cpufreq freq_table dm_mirror dm_multipath dm_mod snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm 3c59x snd_timer mii iTCO_wdt snd_page_alloc wmi tg3 button iTCO_vendor_support snd_hwdep i82975x_edac snd serio_raw parport_pc edac_core soundcore parport sr_mod pcspkr cdrom floppy sg ata_piix ahci libata sd_mod scsi_mod raid456 async_xor async_memcpy async_tx xor raid1 ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: crc_itu_t] Pid: 0, comm: swapper Not tainted 2.6.25-0.124.rc6.fc9 #1 RIP: 0010:[<ffffffff8113ad3e>] [<ffffffff8113ad3e>] __list_add+0x47/0x5b RSP: 0018:ffffffff81585d50 EFLAGS: 00010082 RAX: 0000000000000079 RBX: 0000000000000004 RCX: 0000000000000079 RDX: ffffffff813cf7a0 RSI: 0000000000000001 RDI: ffffffff813d6800 RBP: ffffffff81585d50 R08: ffffffff813d6820 R09: 0000000000000082 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: ffff810076c47d20 R14: ffff810076979738 R15: ffff810079942090 FS: 0000000000000000(0000) GS:ffffffff81414000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00007f63c40221c8 CR3: 0000000000201000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffffffff814be000, task ffffffff813cf7a0) Stack: ffffffff81585e60 ffffffff881ee1ac ffffffff81585d90 0000000281012136 ffffffff81012136 ffffffff81585e20 ffff8100799426a0 ffffffff00000003 ffffffff0000003f ffff810076979748 ffff8100799426e8 00000001813cf7a0 Call Trace: <IRQ> [<ffffffff881ee1ac>] :firewire_core:fw_core_handle_bus_reset+0x699/0x773 [<ffffffff81012136>] ? native_sched_clock+0x50/0x6d [<ffffffff88212605>] ? :firewire_ohci:bus_reset_tasklet+0x1d4/0x417 [<ffffffff88212839>] :firewire_ohci:bus_reset_tasklet+0x408/0x417 [<ffffffff8100c5ef>] ? restore_args+0x0/0x30 [<ffffffff810383b7>] tasklet_action+0x5e/0xb2 [<ffffffff81038264>] __do_softirq+0x5f/0xe3 [<ffffffff8100d1bc>] call_softirq+0x1c/0x28 [<ffffffff8100eab7>] do_softirq+0x39/0x8a [<ffffffff810381c4>] irq_exit+0x4e/0x8f [<ffffffff8100ec76>] do_IRQ+0x144/0x166 [<ffffffff8100b0e4>] ? mwait_idle+0x0/0x50 [<ffffffff8100c546>] ret_from_intr+0x0/0xf <EOI> [<ffffffff8100b129>] ? mwait_idle+0x45/0x50 [<ffffffff8100a72f>] ? enter_idle+0x22/0x24 [<ffffffff8100b071>] ? cpu_idle+0xa4/0xce [<ffffffff81289ba6>] ? rest_init+0x5a/0x5c Code: 81 48 89 c6 31 c0 e8 f7 8e ef ff 0f 0b eb fe 4c 8b 00 49 39 f0 74 18 48 89 c1 4c 89 c2 48 c7 c7 32 52 38 81 31 c0 e8 d7 8e ef ff <0f> 0b eb fe 48 89 7e 08 48 89 37 48 89 57 08 48 89 3a c9 c3 55 RIP [<ffffffff8113ad3e>] __list_add+0x47/0x5b RSP <ffffffff81585d50> BUG: scheduling while atomic: swapper/0/0x10000100 INFO: lockdep is turned off. irq event stamp: 69333309 hardirqs last enabled at (69333308): [<ffffffff81299514>] _spin_unlock_irqrestore+0x3f/0x47 hardirqs last disabled at (69333309): [<ffffffff812993ad>] _spin_lock_irqsave+0x19/0x6a softirqs last enabled at (69333296): [<ffffffff810382df>] __do_softirq+0xda/0xe3 softirqs last disabled at (69333299): [<ffffffff8100d1bc>] call_softirq+0x1c/0x28 CPU 0: Modules linked in: firewire_ohci firewire_core crc_itu_t rfcomm l2cap bluetooth autofs4 fuse sunrpc ipv6 nf_conntrack_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT iptable_filter ip_tables x_tables cpufreq_ondemand acpi_cpufreq freq_table dm_mirror dm_multipath dm_mod snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm 3c59x snd_timer mii iTCO_wdt snd_page_alloc wmi tg3 button iTCO_vendor_support snd_hwdep i82975x_edac snd serio_raw parport_pc edac_core soundcore parport sr_mod pcspkr cdrom floppy sg ata_piix ahci libata sd_mod scsi_mod raid456 async_xor async_memcpy async_tx xor raid1 ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: crc_itu_t] Pid: 0, comm: swapper Tainted: G D 2.6.25-0.124.rc6.fc9 #1 RIP: 0010:[<ffffffff8100b129>] [<ffffffff8100b129>] mwait_idle+0x45/0x50 RSP: 0018:ffffffff814bff48 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffffffff814bff48 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 00000000000000d2 RDI: ffff81000100a440 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: ffffffff814bfeb8 R12: 0000000000000046 R13: ffffffff814bfee8 R14: ffffffff813de100 R15: ffffffff813cf7a0 FS: 0000000000000000(0000) GS:ffffffff81414000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00007f63c40221c8 CR3: 0000000000201000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: [<ffffffff8100a72f>] ? enter_idle+0x22/0x24 [<ffffffff8100b071>] ? cpu_idle+0xa4/0xce [<ffffffff81289ba6>] ? rest_init+0x5a/0x5c Forgot all about having already tracked this down into update_tree... Pulling over the code snippet from bug 9870, comment #28: for (i = 0; i < node0->port_count; i++) { if (node0->ports[i] && node1->ports[i]) { /* * This port didn't change, queue the * connected node for further * investigation. */ if (node0->ports[i]->color == card->color) continue; list_add_tail(&node0->ports[i]->link, &list0); list_add_tail(&node1->ports[i]->link, &list1); Seems to be one of those list_add_tail calls leading to the panic. Some patches posted: "firewire: insist on successive self ID complete events" http://thread.gmane.org/gmane.linux.kernel.firewire.devel/11841/focus=11848 "firewire: fw-ohci: add self ID error check" http://thread.gmane.org/gmane.linux.kernel.firewire.devel/11841/focus=11847 "firewire: fw-ohci: catch self_id_count == 0" http://thread.gmane.org/gmane.linux.kernel.firewire.devel/11841/focus=11849 Jarod says these patches don't help. Looks like we're stuck looping with firewire-core initiating the bus resets. Will drill down further later tonight or tomorrow to try to figure out who keeps calling fw_core_initiate_bus_reset... The via vt6307 + iogear + unibrain fire-i combination no longer continuously panics until bus reset with this patch: https://sourceforge.net/mailarchive/forum.php?thread_name=tkrat.47c2348a94015e02%40s5r6.in-berlin.de&forum_name=linux1394-devel I'm assuming it'll fix the case w/the jmicron controller as well (will check Monday), and might possibly help nforce2 controllers. Could still be a legit problem with handling lots of bus resets, but that patch is definitely an improvement. The fix referenced in comment #8 did help the jmicron card, but wasn't the entire fix. It also required: http://git.kernel.org/?p=linux/kernel/git/ieee1394/linux1394-2.6.git;a=commit;h=1012835d4950d999a2f5854074b8c9da27769e99 Still need to see if this panic is reproduceable via a forced bus reset storm... Is this bug dead yet? Since we had several fixes in the topology code and added more consistency checks of the self ID buffer, I assume this is fixed by now. |