Most recent kernel where this bug did not occur: 2.6.22.9 Distribution: Gentoo Hardware Environment: Sparc64 (Blade 100 UltraSPARC 2e) Software Environment: Problem Description: I found a kernel oops in /var/log/messages after booting both my Blade 100s. The oops appears to have triggered by firewire_ohci, but I'm no kernel hacker and could be mistaken. Here is the text of one of them. The one from the other machine is identical except for the timestamps and machine name. Oct 13 20:26:04 succubus OOPS: Bogus kernel PC [0000000000000000] in fault handler Oct 13 20:26:04 succubus OOPS: RPC [0000000010068cd0] Oct 13 20:26:04 succubus RPC: <ar_context_add_page+0xd8/0x160 [firewire_ohci]> Oct 13 20:26:04 succubus OOPS: Fault was to vaddr[1004e000] Oct 13 20:26:04 succubus Call Trace: Oct 13 20:26:04 succubus [00000000004076f4] sparc64_realfault_common+0x18/0x20 Oct 13 20:26:04 succubus [0000000010068cd0] ar_context_add_page+0xd8/0x160 [firewire_ohci] Oct 13 20:26:04 succubus [0000000010068d90] ar_context_init+0x38/0x60 [firewire_ohci] Oct 13 20:26:04 succubus [000000001006ac50] pci_probe+0xf8/0x340 [firewire_ohci] Oct 13 20:26:04 succubus [00000000005299bc] pci_device_probe+0x64/0xa0 Oct 13 20:26:04 succubus [0000000000550a28] driver_probe_device+0x90/0x1c0 Oct 13 20:26:04 succubus [0000000000550bc0] __driver_attach+0x68/0x80 Oct 13 20:26:04 succubus [000000000054fe5c] bus_for_each_dev+0x44/0x80 Oct 13 20:26:04 succubus [0000000000550218] bus_add_driver+0x80/0x1c0 Oct 13 20:26:04 succubus [0000000000529b7c] __pci_register_driver+0x44/0xa0 Oct 13 20:26:04 succubus [000000000047d5cc] sys_init_module+0x134/0x1400 Oct 13 20:26:04 succubus [0000000000406094] linux_sparc_syscall32+0x3c/0x40 Oct 13 20:26:04 succubus [00000000000133d8] 0x133e0 Oct 13 20:26:04 succubus Unable to handle kernel NULL pointer dereference Oct 13 20:26:04 succubus tsk->{mm,active_mm}->context = 00000000000001f9 Oct 13 20:26:04 succubus tsk->{mm,active_mm}->pgd = fffff8004f9f8000 Oct 13 20:26:04 succubus \|/ ____ \|/ Oct 13 20:26:04 succubus "@'/ .. \`@" Oct 13 20:26:04 succubus /_| \__/ |_\ Oct 13 20:26:04 succubus \__U_/ Oct 13 20:26:04 succubus modprobe(648): Oops [#1] Oct 13 20:26:04 succubus TSTATE: 0000001911009601 TPC: 0000000000000000 TNPC: 0000000000000004 Y: 00000000 Not tainted Oct 13 20:26:04 succubus TPC: <0x8> Oct 13 20:26:04 succubus g0: 0000000000470a40 g1: 00000000005d2b28 g2: 0000000000000000 g3: 0000000000000000 Oct 13 20:26:04 succubus g4: fffff8004f393480 g5: 0000000000000098 g6: fffff8004fa74000 g7: 00c06500180065c0 Oct 13 20:26:04 succubus o0: fffff8004effb870 o1: 00000000c0650000 o2: 0000000000002000 o3: 0000000000000000 Oct 13 20:26:04 succubus o4: 0000000000000004 o5: 000001ff00420000 sp: fffff8004fa76e71 ret_pc: 0000000010068cd0 Oct 13 20:26:04 succubus RPC: <ar_context_add_page+0xd8/0x160 [firewire_ohci]> Oct 13 20:26:04 succubus l0: fffff8004fba4000 l1: 00000000c0650000 l2: fffff8004effb870 l3: fffff8004f608d90 Oct 13 20:26:04 succubus l4: 000000000063cc00 l5: 0000000000ff0000 l6: fffff8004fa776e0 l7: fffffffffffffff0 Oct 13 20:26:04 succubus i0: 000000000000ff00 i1: 0000000010069cc0 i2: fffff8004f608d90 i3: 0000000000000000 Oct 13 20:26:04 succubus i4: 0000000000000040 i5: 0000000000000000 i6: fffff8004fa76f31 i7: 0000000010068d90 Oct 13 20:26:04 succubus I7: <ar_context_init+0x38/0x60 [firewire_ohci]> Oct 13 20:26:04 succubus Caller[0000000010068d90]: ar_context_init+0x38/0x60 [firewire_ohci] Oct 13 20:26:04 succubus Caller[000000001006ac50]: pci_probe+0xf8/0x340 [firewire_ohci] Oct 13 20:26:04 succubus Caller[00000000005299bc]: pci_device_probe+0x64/0xa0 Oct 13 20:26:04 succubus Caller[0000000000550a28]: driver_probe_device+0x90/0x1c0 Oct 13 20:26:04 succubus Caller[0000000000550bc0]: __driver_attach+0x68/0x80 Oct 13 20:26:04 succubus Caller[000000000054fe5c]: bus_for_each_dev+0x44/0x80 Oct 13 20:26:04 succubus Caller[0000000000550218]: bus_add_driver+0x80/0x1c0 Oct 13 20:26:04 succubus Caller[0000000000529b7c]: __pci_register_driver+0x44/0xa0 Oct 13 20:26:04 succubus Caller[000000000047d5cc]: sys_init_module+0x134/0x1400 Oct 13 20:26:04 succubus Caller[0000000000406094]: linux_sparc_syscall32+0x3c/0x40 Oct 13 20:26:04 succubus Caller[00000000000133d8]: 0x133e0 Oct 13 20:26:04 succubus Instruction DUMP: 01000000 01000000 01000000 <1>Unable to handle kernel NULL pointer dereference Oct 13 20:26:04 succubus tsk->{mm,active_mm}->context = 00000000000001f9 Oct 13 20:26:04 succubus tsk->{mm,active_mm}->pgd = fffff8004f9f8000 Oct 13 20:26:04 succubus \|/ ____ \|/ Oct 13 20:26:04 succubus "@'/ .. \`@" Oct 13 20:26:04 succubus /_| \__/ |_\ Oct 13 20:26:04 succubus \__U_/ Oct 13 20:26:04 succubus modprobe(648): Oops [#2] Oct 13 20:26:04 succubus TSTATE: 0000004480009605 TPC: 0000000000428a5c TNPC: 0000000000428a60 Y: 00000000 Tainted: G D Oct 13 20:26:04 succubus TPC: <instruction_dump+0x44/0xa0> Oct 13 20:26:04 succubus g0: 78745f6164645f70 g1: 000000000063def0 g2: 0000000000000000 g3: 000000000063dc00 Oct 13 20:26:04 succubus g4: fffff8004f393480 g5: 0000000000000098 g6: fffff8004fa74000 g7: 000000000063def0 Oct 13 20:26:04 succubus o0: 00000000006050d0 o1: 000000000000003c o2: 0000000001000000 o3: 000000000000003e Oct 13 20:26:04 succubus o4: 0000000010068d90 o5: 000000000068d168 sp: fffff8004fa76aa1 ret_pc: 0000000000428a80 Oct 13 20:26:04 succubus RPC: <instruction_dump+0x68/0xa0> Oct 13 20:26:04 succubus l0: 0000000000000000 l1: 0000000000605000 l2: 0000000000000000 l3: 0000001911009601 Oct 13 20:26:04 succubus l4: 0000000000000000 l5: 000000000063dc00 l6: 000000000063dc00 l7: 0000000000000000 Oct 13 20:26:04 succubus i0: 0000000000000000 i1: 00000000000133e0 i2: fffff8004fa775d0 i3: 0000000000000000 Oct 13 20:26:04 succubus i4: 00000000000000ff i5: 000000000000000b i6: fffff8004fa76b61 i7: 00000000004294b8 Oct 13 20:26:04 succubus I7: <die_if_kernel+0x160/0x2c0> Oct 13 20:26:04 succubus Caller[00000000004294b8]: die_if_kernel+0x160/0x2c0 Oct 13 20:26:04 succubus Caller[000000000044d130]: unhandled_fault+0x78/0xe0 Oct 13 20:26:04 succubus Caller[00000000004076f4]: sparc64_realfault_common+0x18/0x20 Oct 13 20:26:04 succubus Caller[0000000010068cd0]: ar_context_add_page+0xd8/0x160 [firewire_ohci] Oct 13 20:26:04 succubus Caller[0000000010068d90]: ar_context_init+0x38/0x60 [firewire_ohci] Oct 13 20:26:04 succubus Caller[000000001006ac50]: pci_probe+0xf8/0x340 [firewire_ohci] Oct 13 20:26:04 succubus Caller[00000000005299bc]: pci_device_probe+0x64/0xa0 Oct 13 20:26:04 succubus Caller[0000000000550a28]: driver_probe_device+0x90/0x1c0 Oct 13 20:26:04 succubus Caller[0000000000550bc0]: __driver_attach+0x68/0x80 Oct 13 20:26:04 succubus Caller[000000000054fe5c]: bus_for_each_dev+0x44/0x80 Oct 13 20:26:04 succubus Caller[0000000000550218]: bus_add_driver+0x80/0x1c0 Oct 13 20:26:04 succubus Caller[0000000000529b7c]: __pci_register_driver+0x44/0xa0 Oct 13 20:26:04 succubus Caller[000000000047d5cc]: sys_init_module+0x134/0x1400 Oct 13 20:26:04 succubus Caller[0000000000406094]: linux_sparc_syscall32+0x3c/0x40 Oct 13 20:26:04 succubus Caller[00000000000133d8]: 0x133e0 Oct 13 20:26:04 succubus Instruction DUMP: 80a42000 12480005 9610203e <d4060000> 4000cf10 a0102001 833c2000 901460d0 83287002
Reply-To: stefanr@s5r6.in-berlin.de For the linux1394-devel and sparclinux mailinglists to see: Since Linux 2.6.23, the firewire-ohci driver oopses on startup when it sets up its AR DMA contexts (AR = asynchronous reception). See quote from bugzilla below. The last modification that directly affected firewire-ohci's AR DMA was already before 2.6.22 (firewire: fix async reception on big endian machines). http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=history;f=drivers/firewire/fw-ohci.c Patches to fw-ohci.c in 2.6.23 which are not in 2.6.22.9: firewire: missing newline in printk firewire: fw-ohci: flush MMIO write before msleep firewire: fw-ohci: dma_free_coherent needs IRQs enabled They are all obviously unrelated to the bug. bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=9160 > > Summary: kernel oops in firewire_ohci > Product: Drivers > Version: 2.5 > KernelVersion: 2.6.23.1 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: IEEE1394 > AssignedTo: drivers_ieee1394@kernel-bugs.osdl.org > ReportedBy: seraph@xs4all.nl > > > Most recent kernel where this bug did not occur: 2.6.22.9 > Distribution: Gentoo > Hardware Environment: Sparc64 (Blade 100 UltraSPARC 2e) > Software Environment: > Problem Description: > > I found a kernel oops in /var/log/messages after booting both my Blade 100s. > The oops appears to have triggered by firewire_ohci, but I'm no kernel hacker > and could be mistaken. > > Here is the text of one of them. The one from the other machine is identical > except for the timestamps and machine name. > > Oct 13 20:26:04 succubus OOPS: Bogus kernel PC [0000000000000000] in fault > handler > Oct 13 20:26:04 succubus OOPS: RPC [0000000010068cd0] > Oct 13 20:26:04 succubus RPC: <ar_context_add_page+0xd8/0x160 > [firewire_ohci]> > Oct 13 20:26:04 succubus OOPS: Fault was to vaddr[1004e000] > Oct 13 20:26:04 succubus Call Trace: > Oct 13 20:26:04 succubus [00000000004076f4] > sparc64_realfault_common+0x18/0x20 > Oct 13 20:26:04 succubus [0000000010068cd0] ar_context_add_page+0xd8/0x160 > [firewire_ohci] > Oct 13 20:26:04 succubus [0000000010068d90] ar_context_init+0x38/0x60 > [firewire_ohci] > Oct 13 20:26:04 succubus [000000001006ac50] pci_probe+0xf8/0x340 > [firewire_ohci] > Oct 13 20:26:04 succubus [00000000005299bc] pci_device_probe+0x64/0xa0 > Oct 13 20:26:04 succubus [0000000000550a28] driver_probe_device+0x90/0x1c0 > Oct 13 20:26:04 succubus [0000000000550bc0] __driver_attach+0x68/0x80 > Oct 13 20:26:04 succubus [000000000054fe5c] bus_for_each_dev+0x44/0x80 > Oct 13 20:26:04 succubus [0000000000550218] bus_add_driver+0x80/0x1c0 > Oct 13 20:26:04 succubus [0000000000529b7c] __pci_register_driver+0x44/0xa0 > Oct 13 20:26:04 succubus [000000000047d5cc] sys_init_module+0x134/0x1400 > Oct 13 20:26:04 succubus [0000000000406094] linux_sparc_syscall32+0x3c/0x40 > Oct 13 20:26:04 succubus [00000000000133d8] 0x133e0 > Oct 13 20:26:04 succubus Unable to handle kernel NULL pointer dereference > Oct 13 20:26:04 succubus tsk->{mm,active_mm}->context = 00000000000001f9 > Oct 13 20:26:04 succubus tsk->{mm,active_mm}->pgd = fffff8004f9f8000 > Oct 13 20:26:04 succubus \|/ ____ \|/ > Oct 13 20:26:04 succubus "@'/ .. \`@" > Oct 13 20:26:04 succubus /_| \__/ |_\ > Oct 13 20:26:04 succubus \__U_/ > Oct 13 20:26:04 succubus modprobe(648): Oops [#1] > Oct 13 20:26:04 succubus TSTATE: 0000001911009601 TPC: 0000000000000000 TNPC: > 0000000000000004 Y: 00000000 Not tainted > Oct 13 20:26:04 succubus TPC: <0x8> > Oct 13 20:26:04 succubus g0: 0000000000470a40 g1: 00000000005d2b28 g2: > 0000000000000000 g3: 0000000000000000 > Oct 13 20:26:04 succubus g4: fffff8004f393480 g5: 0000000000000098 g6: > fffff8004fa74000 g7: 00c06500180065c0 > Oct 13 20:26:04 succubus o0: fffff8004effb870 o1: 00000000c0650000 o2: > 0000000000002000 o3: 0000000000000000 > Oct 13 20:26:04 succubus o4: 0000000000000004 o5: 000001ff00420000 sp: > fffff8004fa76e71 ret_pc: 0000000010068cd0 > Oct 13 20:26:04 succubus RPC: <ar_context_add_page+0xd8/0x160 > [firewire_ohci]> > Oct 13 20:26:04 succubus l0: fffff8004fba4000 l1: 00000000c0650000 l2: > fffff8004effb870 l3: fffff8004f608d90 > Oct 13 20:26:04 succubus l4: 000000000063cc00 l5: 0000000000ff0000 l6: > fffff8004fa776e0 l7: fffffffffffffff0 > Oct 13 20:26:04 succubus i0: 000000000000ff00 i1: 0000000010069cc0 i2: > fffff8004f608d90 i3: 0000000000000000 > Oct 13 20:26:04 succubus i4: 0000000000000040 i5: 0000000000000000 i6: > fffff8004fa76f31 i7: 0000000010068d90 > Oct 13 20:26:04 succubus I7: <ar_context_init+0x38/0x60 [firewire_ohci]> > Oct 13 20:26:04 succubus Caller[0000000010068d90]: ar_context_init+0x38/0x60 > [firewire_ohci] > Oct 13 20:26:04 succubus Caller[000000001006ac50]: pci_probe+0xf8/0x340 > [firewire_ohci] > Oct 13 20:26:04 succubus Caller[00000000005299bc]: pci_device_probe+0x64/0xa0 > Oct 13 20:26:04 succubus Caller[0000000000550a28]: > driver_probe_device+0x90/0x1c0 > Oct 13 20:26:04 succubus Caller[0000000000550bc0]: __driver_attach+0x68/0x80 > Oct 13 20:26:04 succubus Caller[000000000054fe5c]: bus_for_each_dev+0x44/0x80 > Oct 13 20:26:04 succubus Caller[0000000000550218]: bus_add_driver+0x80/0x1c0 > Oct 13 20:26:04 succubus Caller[0000000000529b7c]: > __pci_register_driver+0x44/0xa0 > Oct 13 20:26:04 succubus Caller[000000000047d5cc]: > sys_init_module+0x134/0x1400 > Oct 13 20:26:04 succubus Caller[0000000000406094]: > linux_sparc_syscall32+0x3c/0x40 > Oct 13 20:26:04 succubus Caller[00000000000133d8]: 0x133e0 > Oct 13 20:26:04 succubus Instruction DUMP: 01000000 01000000 01000000 > <1>Unable to handle kernel NULL pointer dereference > Oct 13 20:26:04 succubus tsk->{mm,active_mm}->context = 00000000000001f9 > Oct 13 20:26:04 succubus tsk->{mm,active_mm}->pgd = fffff8004f9f8000 > Oct 13 20:26:04 succubus \|/ ____ \|/ > Oct 13 20:26:04 succubus "@'/ .. \`@" > Oct 13 20:26:04 succubus /_| \__/ |_\ > Oct 13 20:26:04 succubus \__U_/ > Oct 13 20:26:04 succubus modprobe(648): Oops [#2] > Oct 13 20:26:04 succubus TSTATE: 0000004480009605 TPC: 0000000000428a5c TNPC: > 0000000000428a60 Y: 00000000 Tainted: G D > Oct 13 20:26:04 succubus TPC: <instruction_dump+0x44/0xa0> > Oct 13 20:26:04 succubus g0: 78745f6164645f70 g1: 000000000063def0 g2: > 0000000000000000 g3: 000000000063dc00 > Oct 13 20:26:04 succubus g4: fffff8004f393480 g5: 0000000000000098 g6: > fffff8004fa74000 g7: 000000000063def0 > Oct 13 20:26:04 succubus o0: 00000000006050d0 o1: 000000000000003c o2: > 0000000001000000 o3: 000000000000003e > Oct 13 20:26:04 succubus o4: 0000000010068d90 o5: 000000000068d168 sp: > fffff8004fa76aa1 ret_pc: 0000000000428a80 > Oct 13 20:26:04 succubus RPC: <instruction_dump+0x68/0xa0> > Oct 13 20:26:04 succubus l0: 0000000000000000 l1: 0000000000605000 l2: > 0000000000000000 l3: 0000001911009601 > Oct 13 20:26:04 succubus l4: 0000000000000000 l5: 000000000063dc00 l6: > 000000000063dc00 l7: 0000000000000000 > Oct 13 20:26:04 succubus i0: 0000000000000000 i1: 00000000000133e0 i2: > fffff8004fa775d0 i3: 0000000000000000 > Oct 13 20:26:04 succubus i4: 00000000000000ff i5: 000000000000000b i6: > fffff8004fa76b61 i7: 00000000004294b8 > Oct 13 20:26:04 succubus I7: <die_if_kernel+0x160/0x2c0> > Oct 13 20:26:04 succubus Caller[00000000004294b8]: die_if_kernel+0x160/0x2c0 > Oct 13 20:26:04 succubus Caller[000000000044d130]: unhandled_fault+0x78/0xe0 > Oct 13 20:26:04 succubus Caller[00000000004076f4]: > sparc64_realfault_common+0x18/0x20 > Oct 13 20:26:04 succubus Caller[0000000010068cd0]: > ar_context_add_page+0xd8/0x160 [firewire_ohci] > Oct 13 20:26:04 succubus Caller[0000000010068d90]: ar_context_init+0x38/0x60 > [firewire_ohci] > Oct 13 20:26:04 succubus Caller[000000001006ac50]: pci_probe+0xf8/0x340 > [firewire_ohci] > Oct 13 20:26:04 succubus Caller[00000000005299bc]: pci_device_probe+0x64/0xa0 > Oct 13 20:26:04 succubus Caller[0000000000550a28]: > driver_probe_device+0x90/0x1c0 > Oct 13 20:26:04 succubus Caller[0000000000550bc0]: __driver_attach+0x68/0x80 > Oct 13 20:26:04 succubus Caller[000000000054fe5c]: bus_for_each_dev+0x44/0x80 > Oct 13 20:26:04 succubus Caller[0000000000550218]: bus_add_driver+0x80/0x1c0 > Oct 13 20:26:04 succubus Caller[0000000000529b7c]: > __pci_register_driver+0x44/0xa0 > Oct 13 20:26:04 succubus Caller[000000000047d5cc]: > sys_init_module+0x134/0x1400 > Oct 13 20:26:04 succubus Caller[0000000000406094]: > linux_sparc_syscall32+0x3c/0x40 > Oct 13 20:26:04 succubus Caller[00000000000133d8]: 0x133e0 > Oct 13 20:26:04 succubus Instruction DUMP: 80a42000 12480005 9610203e > <d4060000> 4000cf10 a0102001 833c2000 901460d0 83287002 > >
Did you actually enable the new firewire drivers (firewire-ohci etc., as opposed to the old drivers ohci1394 etc.) already when you ran Linux 2.6.22.9?
Oh yes, I've been using the new drivers ever since they appeared in the stable tree. Everything was just fine up to and including 2.6.22.9, I checked my old logs. Could it be Sparc-specific? I also use the new drivers on my laptop (Pentium 4 Mobile) and it seems perfectly fine with the 2.6.23.1 kernel.
Created attachment 13181 [details] firewire: fw-ohci: fix DMA mapping for ar context Could you apply and test this patch? The diff is against a development version but it should be applicable to 2.6.23(.1) with only harmless line offsets.
Created attachment 13182 [details] firewire: fw-ohci: fix DMA mapping for ar context Oops, the previous version of the patch had a silly error.
Created attachment 13183 [details] firewire: fw-ohci: optimize initialization of ar context Since you have hardware which actually does something when dma_sync_* is called, it would be nice if you would also test this small optimization patch.
PS: The fix has to be applied before the optimization can be applied.
Well, I'm afraid the patch didn't help. Here is the oops I got after booting with it applied: Oct 16 20:56:31 succubus OOPS: Bogus kernel PC [0000000000000000] in fault handler Oct 16 20:56:31 succubus OOPS: RPC [00000000100a2cec] Oct 16 20:56:31 succubus RPC: <ar_context_add_page+0xf4/0x180 [firewire_ohci]> Oct 16 20:56:31 succubus OOPS: Fault was to vaddr[10072000] Oct 16 20:56:31 succubus Call Trace: Oct 16 20:56:31 succubus [00000000004076f4] sparc64_realfault_common+0x18/0x20 Oct 16 20:56:31 succubus [00000000100a2cec] ar_context_add_page+0xf4/0x180 [firewire_ohci] Oct 16 20:56:31 succubus [00000000100a2db0] ar_context_init+0x38/0x60 [firewire_ohci] Oct 16 20:56:31 succubus [00000000100a4c70] pci_probe+0xf8/0x340 [firewire_ohci] Oct 16 20:56:31 succubus [00000000005299bc] pci_device_probe+0x64/0xa0 Oct 16 20:56:31 succubus [0000000000550a28] driver_probe_device+0x90/0x1c0 Oct 16 20:56:31 succubus [0000000000550bc0] __driver_attach+0x68/0x80 Oct 16 20:56:31 succubus [000000000054fe5c] bus_for_each_dev+0x44/0x80 Oct 16 20:56:31 succubus [0000000000550218] bus_add_driver+0x80/0x1c0 Oct 16 20:56:31 succubus [0000000000529b7c] __pci_register_driver+0x44/0xa0 Oct 16 20:56:31 succubus [000000000047d5cc] sys_init_module+0x134/0x1400 Oct 16 20:56:31 succubus [0000000000406094] linux_sparc_syscall32+0x3c/0x40 Oct 16 20:56:31 succubus [00000000000133d8] 0x133e0 Oct 16 20:56:31 succubus Unable to handle kernel NULL pointer dereference Oct 16 20:56:31 succubus tsk->{mm,active_mm}->context = 00000000000001f5 Oct 16 20:56:31 succubus tsk->{mm,active_mm}->pgd = fffff8004e92c000 Oct 16 20:56:31 succubus \|/ ____ \|/ Oct 16 20:56:31 succubus "@'/ .. \`@" Oct 16 20:56:31 succubus /_| \__/ |_\ Oct 16 20:56:31 succubus \__U_/ Oct 16 20:56:31 succubus modprobe(641): Oops [#1] Oct 16 20:56:31 succubus TSTATE: 0000004411009601 TPC: 0000000000000000 TNPC: 0000000000000004 Y: 00000000 Not tainted Oct 16 20:56:31 succubus TPC: <0x8> Oct 16 20:56:31 succubus g0: fffff8004eb775d0 g1: 0000000000000000 g2: 00000000005d2b28 g3: 00c06500180065c0 Oct 16 20:56:31 succubus g4: fffff8004ebdd180 g5: 00000000000000b8 g6: fffff8004eb74000 g7: 0000000000000000 Oct 16 20:56:31 succubus o0: fffff8004effb870 o1: 00000000c0650000 o2: 0000000000002000 o3: 0000000000000000 Oct 16 20:56:31 succubus o4: 0000000000000004 o5: 0000000000006500 sp: fffff8004eb76e71 ret_pc: 00000000100a2cec Oct 16 20:56:31 succubus RPC: <ar_context_add_page+0xf4/0x180 [firewire_ohci]> Oct 16 20:56:31 succubus l0: 000000000000ff00 l1: fffff8004f8c2000 l2: fffff8004effb870 l3: 00000000c0650000 Oct 16 20:56:31 succubus l4: fffff8004f664d90 l5: 000000000063cc00 l6: 0000000000ff0000 l7: fffffffffffffff0 Oct 16 20:56:31 succubus i0: 00000000c0650000 i1: 00000000100a3ce0 i2: fffff8004f664d90 i3: 0000000000000000 Oct 16 20:56:31 succubus i4: 0000000000000040 i5: 0000000000000000 i6: fffff8004eb76f31 i7: 00000000100a2db0 Oct 16 20:56:31 succubus I7: <ar_context_init+0x38/0x60 [firewire_ohci]> Oct 16 20:56:31 succubus Caller[00000000100a2db0]: ar_context_init+0x38/0x60 [firewire_ohci] Oct 16 20:56:31 succubus Caller[00000000100a4c70]: pci_probe+0xf8/0x340 [firewire_ohci] Oct 16 20:56:31 succubus Caller[00000000005299bc]: pci_device_probe+0x64/0xa0 Oct 16 20:56:31 succubus Caller[0000000000550a28]: driver_probe_device+0x90/0x1c0 Oct 16 20:56:31 succubus Caller[0000000000550bc0]: __driver_attach+0x68/0x80 Oct 16 20:56:31 succubus Caller[000000000054fe5c]: bus_for_each_dev+0x44/0x80 Oct 16 20:56:31 succubus Caller[0000000000550218]: bus_add_driver+0x80/0x1c0 Oct 16 20:56:31 succubus Caller[0000000000529b7c]: __pci_register_driver+0x44/0xa0 Oct 16 20:56:31 succubus Caller[000000000047d5cc]: sys_init_module+0x134/0x1400 Oct 16 20:56:31 succubus Caller[0000000000406094]: linux_sparc_syscall32+0x3c/0x40 Oct 16 20:56:31 succubus Caller[00000000000133d8]: 0x133e0 Oct 16 20:56:31 succubus Instruction DUMP: 01000000 01000000 01000000 <1>Unable to handle kernel NULL pointer dereference Oct 16 20:56:31 succubus tsk->{mm,active_mm}->context = 00000000000001f5 Oct 16 20:56:31 succubus tsk->{mm,active_mm}->pgd = fffff8004e92c000 Oct 16 20:56:31 succubus \|/ ____ \|/ Oct 16 20:56:31 succubus "@'/ .. \`@" Oct 16 20:56:31 succubus /_| \__/ |_\ Oct 16 20:56:31 succubus \__U_/ Oct 16 20:56:31 succubus modprobe(641): Oops [#2] Oct 16 20:56:31 succubus TSTATE: 0000004480009605 TPC: 0000000000428a5c TNPC: 0000000000428a60 Y: 00000000 Tainted: G D Oct 16 20:56:31 succubus TPC: <instruction_dump+0x44/0xa0> Oct 16 20:56:31 succubus g0: 78745f6164645f70 g1: 000000000063def0 g2: 0000000000000000 g3: 000000000063dc00 Oct 16 20:56:31 succubus g4: fffff8004ebdd180 g5: 00000000000000b8 g6: fffff8004eb74000 g7: 000000000063def0 Oct 16 20:56:31 succubus o0: 00000000006050d0 o1: 000000000000003c o2: 0000000001000000 o3: 000000000000003e Oct 16 20:56:31 succubus o4: 00000000100a2db0 o5: 000000000068d168 sp: fffff8004eb76aa1 ret_pc: 0000000000428a80 Oct 16 20:56:31 succubus RPC: <instruction_dump+0x68/0xa0> Oct 16 20:56:31 succubus l0: 0000000000000000 l1: 0000000000605000 l2: 0000000000000000 l3: 0000004411009601 Oct 16 20:56:31 succubus l4: 0000000000000000 l5: 000000000063dc00 l6: 000000000063dc00 l7: 0000000000000000 Oct 16 20:56:31 succubus i0: 0000000000000000 i1: 00000000000133e0 i2: fffff8004eb775d0 i3: 0000000000000000 Oct 16 20:56:31 succubus i4: 00000000000000ff i5: 000000000000000b i6: fffff8004eb76b61 i7: 00000000004294b8 Oct 16 20:56:31 succubus I7: <die_if_kernel+0x160/0x2c0> Oct 16 20:56:31 succubus Caller[00000000004294b8]: die_if_kernel+0x160/0x2c0 Oct 16 20:56:31 succubus Caller[000000000044d130]: unhandled_fault+0x78/0xe0 Oct 16 20:56:31 succubus Caller[00000000004076f4]: sparc64_realfault_common+0x18/0x20 Oct 16 20:56:31 succubus Caller[00000000100a2cec]: ar_context_add_page+0xf4/0x180 [firewire_ohci] Oct 16 20:56:31 succubus Caller[00000000100a2db0]: ar_context_init+0x38/0x60 [firewire_ohci] Oct 16 20:56:31 succubus Caller[00000000100a4c70]: pci_probe+0xf8/0x340 [firewire_ohci] Oct 16 20:56:31 succubus Caller[00000000005299bc]: pci_device_probe+0x64/0xa0 Oct 16 20:56:31 succubus Caller[0000000000550a28]: driver_probe_device+0x90/0x1c0 Oct 16 20:56:31 succubus Caller[0000000000550bc0]: __driver_attach+0x68/0x80 Oct 16 20:56:31 succubus Caller[000000000054fe5c]: bus_for_each_dev+0x44/0x80 Oct 16 20:56:31 succubus Caller[0000000000550218]: bus_add_driver+0x80/0x1c0 Oct 16 20:56:31 succubus Caller[0000000000529b7c]: __pci_register_driver+0x44/0xa0 Oct 16 20:56:31 succubus Caller[000000000047d5cc]: sys_init_module+0x134/0x1400 Oct 16 20:56:31 succubus Caller[0000000000406094]: linux_sparc_syscall32+0x3c/0x40 Oct 16 20:56:31 succubus Caller[00000000000133d8]: 0x133e0 Oct 16 20:56:31 succubus Instruction DUMP: 80a42000 12480005 9610203e <d4060000> 4000cf10 a0102001 833c2000 901460d0 83287002 As for the optimization code, I'd be happy to test it for you (once the bug is fixed), but I can do little more than smoke-test it.
Created attachment 13184 [details] test 1 Please try this, which merely adds printk()s to show where things go wrong. And after that...
Created attachment 13185 [details] test 2 ...please try this here on top of "test 1". Then I hope the problem occurs at a different place. If so, I think know what I'll have to do.
Well, here's what happens with test 1 applied: Oct 16 22:44:06 succubus firewire_ohci: - - - 1 - - - Oct 16 22:44:06 succubus firewire_ohci: - - - 2 - - - Oct 16 22:44:06 succubus firewire_ohci: - - - 3 - - - Oct 16 22:44:06 succubus firewire_ohci: - - - 4 - - - Oct 16 22:44:06 succubus firewire_ohci: - - - 5 - - - Oct 16 22:44:06 succubus OOPS: Bogus kernel PC [0000000000000000] in fault handler Oct 16 22:44:06 succubus OOPS: RPC [0000000010098d0c] Oct 16 22:44:06 succubus RPC: <ar_context_add_page+0x114/0x1c0 [firewire_ohci]> Oct 16 22:44:06 succubus OOPS: Fault was to vaddr[1009a000] [snipped the rest]
And here you have it with test 2 applied on top of test 1: Oct 16 22:58:20 succubus firewire_ohci: - - - 1 - - - Oct 16 22:58:20 succubus firewire_ohci: - - - 2 - - - Oct 16 22:58:20 succubus firewire_ohci: - - - 3 - - - Oct 16 22:58:20 succubus firewire_ohci: - - - 4 - - - Oct 16 22:58:20 succubus firewire_ohci: - - - 5 - - - Oct 16 22:58:20 succubus OOPS: Bogus kernel PC [0000000000000000] in fault handler Oct 16 22:58:20 succubus OOPS: RPC [000000001007ad28] Oct 16 22:58:20 succubus RPC: <ar_context_add_page+0x130/0x1c0 [firewire_ohci]> Oct 16 22:58:20 succubus OOPS: Fault was to vaddr[1004e000] [snipped the rest] Mm, looks pretty much the same to me, apart from that line starting with RPC:
Hmm, I didn't expect both to coke at the same place. (The change from 0x114 to 0x130 are obviously due to the machine code size of the single function call that was added by test 2.) Plan B: Convert the ar buffers to a different kind of DMA mapping. The ar buffers are accessed in wrong ways in other places in fw-ohci too, so better fix it from the ground up. I hope I will have time for this tomorrow. If that goes wrong: Plan C -- ask sparc64 folks what their kernel is trying to tell us... Thanks so far.
You're welcome. I use the driver more out of curiosity than anything, so it's not critical to me, though I would like to see it fixed regardless. I have only one actual firewire device, an external disk, and that one never worked with Linux at all. I mostly just play with IP over firewire, or using Oracle's "endpoint" software to turn my laptop into a makeshift disk. If you have a fix for me to try, just drop me a line. If you prefer, I can give you my Jabber, AIM, Yahoo or MSN messenger contact info. I'll be happy to test your optimization code too.
I don't wanna sound pushy, but is anything still being done with this?
No, or not actively... I would need to specifically ask at sparclinux for help with this (comment 13 plan C). On the other hand, there are currently some people reworking the isochronous reception code of firewire-ohci (which is per se unrelated to asynchronous reception whose setup is making trouble here) and it looks a lot like the IR code will be converted to use coherent DMA allocations. Similarly, I believe it makes sense to convert AR to use coherent DMA allocations as well. Then all of the dma_sync_*() calls will go away. (Comment 13 plan B. Minus this strange "tomorrow" thing.) One other thing we could try before that would be to assemble a patch series which rolls the firewire code from 2.6.23.y back to how it was in 2.6.22. With a bisection in such a series we might be able to learn more about the bug here too. PS, about comment 14: Both IP over FireWire and Endpoint don't work with the new alternative firewire drivers at all yet, and won't do so anytime soon as far as I can tell from our development progress, or lack thereof.
> One other thing we could try before that would be to assemble a patch > series which rolls the firewire code from 2.6.23.y back to how it was > in 2.6.22. I'm gonna attempt to do this tonight. There should only few patches in there: http://me.in-berlin.de/~s5r6/linux1394/merged/in_2.6.23/ minus ???-ieee1394-*.patch and minus [fms]*.patch and minus most or all of ???-fw-sbp2-*.patch.
FWIW, another report of the same bug was made against 2.6.23-rc3: http://www.mail-archive.com/sparclinux@vger.kernel.org/msg01697.html The poster also suspected dma_sync_single_for_device to be at fault. There were no replies.
Well, like I said, no rush. I was just curious as there were no updates for nearly two months. Based on your comments, I reverted to the old FireWire stack for the time being so I can test IP over FireWire between my webserver and my database. Of course I'd still be glad to test any code for you. And now it appears my sungem interface is also not working in 2.6.23... Sigh, it's always something, isn't it?
Created attachment 14086 [details] revert firewire from 2.6.23 to 2.6.22 This patch reverts the whole firewire subsystem in 2.6.23 to how it was in 2.6.22. Please apply to 2.6.23(.y) and test-load firewire-ohci. I only compile-tested the patch but I think it should work (to the extent to that the firewire-subsystem worked in 2.6.22, which was its debut in mainline).
Created attachment 14087 [details] revert a few selected 2.6.23 patches This is a series of 5 patches: patches/revert-560a-firewire-fw-ohci-ignore-failure-of-pci_set_power_state-fix-suspend-regression.patch patches/revert-555-firewire-fix-unloading-of-fw-ohci-while-devices-are-attached.patch patches/revert-530-firewire-fw-ohci-dma_free_coherent-needs-irqs-enabled.patch patches/revert-522-firewire-fw-ohci-fix-scheduling-while-atomic.patch patches/revert-521-firewire-fw-ohci-flush-mmio-write-before-msleep.patch patches/series These are applicable on top of clean 2.6.23(.y) in the order as listed. The series file is only needed if you want to use quilt to apply and unapply these patches one after another. This test would only be interesting if the previous sledge hammer patch actually fixes the oops (which I seriously doubt after having collected this smaller patch series here).
My bet is: - Attachment 14086 [details] (comment 20) will not fix the oops, - the sparc architecture has been broken after 2.6.22 in such a way that dma_sync_*() cannot be used on a device whose probe method hasn't been finished. But let's wait for your test result.
I forgot: Independently of the result of that test, another interesting point is whether the bug still exists in 2.6.24(-rc*).
Well, if the layout of menuconfig is any indication, sparc code has been a mess for some time... I've been tempted to try and clean it up, but I'm afraid I'd only make it worse. ^^ Ah well, applying your patch from #20 now. And I suppose I better get that cross-compiler running again while I'm at it, compiling on the Blade 100 just takes too damn long... I'll keep you updated.
Oh, a little something I spotted while compiling: CC [M] drivers/firewire/fw-cdev.o drivers/firewire/fw-cdev.c: In function 'ioctl_queue_iso': drivers/firewire/fw-cdev.c:725: warning: passing argument 2 of 'access_ok' makes pointer from integer without a cast
> drivers/firewire/fw-cdev.c: In function 'ioctl_queue_iso': > drivers/firewire/fw-cdev.c:725: warning: passing argument 2 of 'access_ok' > makes pointer from integer without a cast I got a slightly different sparse warning but not a compiler warning on i386. It has been fixed in 2.6.24-rc1 though: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=1ccc9147f6a063c42fef67acff34de18435a4a6b
Well, as you already suspected, the patch from #20 did not fix the problem. It still oopses, causing problems with a couple of other modules as well. An interesting point is that network does work with this kernel, while it didn't with non-oopsing 2.6.23 kernels... Mm, wonder if another module is causing the failure... Cross compiler works again, I'll try 2.6.24-rc5 next.
I just tried 2.6.24-rc5. It also oopses just like the rest.
Thanks a lot for taking the time to test thus far. One more question just to be safe. You wrote "Most recent kernel where this bug did not occur: 2.6.22.9" in the description. Are you sure you already had firewire-ohci enabled under 2.6.22.y?
Yes, I had it enabled, but at the time I was not using it for anything at all. I just retested it to make sure. It 'works' in so far that it doesn't oops on loading. However, trying to use it results in hanging my machine. So I guess it was actually already broken in 2.6.22.9, I just didn't notice it.
> It 'works' in so far that it doesn't oops on loading. However, trying > to use it results in hanging my machine. So I guess it was actually > already broken in 2.6.22.9, I just didn't notice it. Well, it was already broken but in different ways. The oops in the pci_probe path is IMO quite specific. I will ask around whether this is something to be fixed in architecture code or driver core code, or whether the dma_sync_* has to be removed from that path.
Current discussion at sparclinux, Cc'd to linux-kernel and linux1394-devel: http://lkml.org/lkml/2007/12/17/413
Could you try this patch by David Miller? posting: http://lkml.org/lkml/2007/12/20/37 raw diff: http://lkml.org/lkml/diff/2007/12/20/37/1
Comment on attachment 14086 [details] revert firewire from 2.6.23 to 2.6.22 Daveid Miller's fix has been confirmed on Sun Blade 2000. See LKML link in the previous comment.
I assume that the problem is now fixed in Linus' tree. Bug fix commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=36bb61346d9e64b55285f27363e93a6e96f2abba Merge commit (a day after 2.6.24-rc6): http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a4c80d2ae2cac531c6655f75658dae02c488abc7 Please reopen if I'm mistaken and the same issue is still present after the fix.