Bug 9160 - kernel oops in firewire_ohci (sparc64 DMA mapping)
Summary: kernel oops in firewire_ohci (sparc64 DMA mapping)
Status: CLOSED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: SPARC64 (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Stefan Richter
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-10-13 12:44 UTC by seraph@xs4all.nl
Modified: 2007-12-22 05:06 UTC (History)
1 user (show)

See Also:
Kernel Version: 2.6.23.1
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
firewire: fw-ohci: fix DMA mapping for ar context (1.45 KB, patch)
2007-10-16 10:29 UTC, Stefan Richter
Details | Diff
firewire: fw-ohci: fix DMA mapping for ar context (847 bytes, patch)
2007-10-16 10:41 UTC, Stefan Richter
Details | Diff
firewire: fw-ohci: optimize initialization of ar context (2.03 KB, patch)
2007-10-16 11:04 UTC, Stefan Richter
Details | Diff
test 1 (1.28 KB, patch)
2007-10-16 13:01 UTC, Stefan Richter
Details | Diff
test 2 (354 bytes, patch)
2007-10-16 13:03 UTC, Stefan Richter
Details | Diff
revert firewire from 2.6.23 to 2.6.22 (50.32 KB, patch)
2007-12-17 10:14 UTC, Stefan Richter
Details | Diff
revert a few selected 2.6.23 patches (1.73 KB, application/octet-stream)
2007-12-17 10:19 UTC, Stefan Richter
Details

Description seraph@xs4all.nl 2007-10-13 12:44:11 UTC
Most recent kernel where this bug did not occur: 2.6.22.9
Distribution: Gentoo
Hardware Environment: Sparc64 (Blade 100 UltraSPARC 2e)
Software Environment:
Problem Description:

I found a kernel oops in /var/log/messages after booting both my Blade 100s. The oops appears to have triggered by firewire_ohci, but I'm no kernel hacker and could be mistaken.

Here is the text of one of them. The one from the other machine is identical except for the timestamps and machine name.

Oct 13 20:26:04 succubus OOPS: Bogus kernel PC [0000000000000000] in fault handler
Oct 13 20:26:04 succubus OOPS: RPC [0000000010068cd0]
Oct 13 20:26:04 succubus RPC: <ar_context_add_page+0xd8/0x160 [firewire_ohci]>
Oct 13 20:26:04 succubus OOPS: Fault was to vaddr[1004e000]
Oct 13 20:26:04 succubus Call Trace:
Oct 13 20:26:04 succubus [00000000004076f4] sparc64_realfault_common+0x18/0x20
Oct 13 20:26:04 succubus [0000000010068cd0] ar_context_add_page+0xd8/0x160 [firewire_ohci]
Oct 13 20:26:04 succubus [0000000010068d90] ar_context_init+0x38/0x60 [firewire_ohci]
Oct 13 20:26:04 succubus [000000001006ac50] pci_probe+0xf8/0x340 [firewire_ohci]
Oct 13 20:26:04 succubus [00000000005299bc] pci_device_probe+0x64/0xa0
Oct 13 20:26:04 succubus [0000000000550a28] driver_probe_device+0x90/0x1c0
Oct 13 20:26:04 succubus [0000000000550bc0] __driver_attach+0x68/0x80
Oct 13 20:26:04 succubus [000000000054fe5c] bus_for_each_dev+0x44/0x80
Oct 13 20:26:04 succubus [0000000000550218] bus_add_driver+0x80/0x1c0
Oct 13 20:26:04 succubus [0000000000529b7c] __pci_register_driver+0x44/0xa0
Oct 13 20:26:04 succubus [000000000047d5cc] sys_init_module+0x134/0x1400
Oct 13 20:26:04 succubus [0000000000406094] linux_sparc_syscall32+0x3c/0x40
Oct 13 20:26:04 succubus [00000000000133d8] 0x133e0
Oct 13 20:26:04 succubus Unable to handle kernel NULL pointer dereference
Oct 13 20:26:04 succubus tsk->{mm,active_mm}->context = 00000000000001f9
Oct 13 20:26:04 succubus tsk->{mm,active_mm}->pgd = fffff8004f9f8000
Oct 13 20:26:04 succubus \|/ ____ \|/
Oct 13 20:26:04 succubus "@'/ .. \`@"
Oct 13 20:26:04 succubus /_| \__/ |_\
Oct 13 20:26:04 succubus \__U_/
Oct 13 20:26:04 succubus modprobe(648): Oops [#1]
Oct 13 20:26:04 succubus TSTATE: 0000001911009601 TPC: 0000000000000000 TNPC: 0000000000000004 Y: 00000000    Not tainted
Oct 13 20:26:04 succubus TPC: <0x8>
Oct 13 20:26:04 succubus g0: 0000000000470a40 g1: 00000000005d2b28 g2: 0000000000000000 g3: 0000000000000000
Oct 13 20:26:04 succubus g4: fffff8004f393480 g5: 0000000000000098 g6: fffff8004fa74000 g7: 00c06500180065c0
Oct 13 20:26:04 succubus o0: fffff8004effb870 o1: 00000000c0650000 o2: 0000000000002000 o3: 0000000000000000
Oct 13 20:26:04 succubus o4: 0000000000000004 o5: 000001ff00420000 sp: fffff8004fa76e71 ret_pc: 0000000010068cd0
Oct 13 20:26:04 succubus RPC: <ar_context_add_page+0xd8/0x160 [firewire_ohci]>
Oct 13 20:26:04 succubus l0: fffff8004fba4000 l1: 00000000c0650000 l2: fffff8004effb870 l3: fffff8004f608d90
Oct 13 20:26:04 succubus l4: 000000000063cc00 l5: 0000000000ff0000 l6: fffff8004fa776e0 l7: fffffffffffffff0
Oct 13 20:26:04 succubus i0: 000000000000ff00 i1: 0000000010069cc0 i2: fffff8004f608d90 i3: 0000000000000000
Oct 13 20:26:04 succubus i4: 0000000000000040 i5: 0000000000000000 i6: fffff8004fa76f31 i7: 0000000010068d90
Oct 13 20:26:04 succubus I7: <ar_context_init+0x38/0x60 [firewire_ohci]>
Oct 13 20:26:04 succubus Caller[0000000010068d90]: ar_context_init+0x38/0x60 [firewire_ohci]
Oct 13 20:26:04 succubus Caller[000000001006ac50]: pci_probe+0xf8/0x340 [firewire_ohci]
Oct 13 20:26:04 succubus Caller[00000000005299bc]: pci_device_probe+0x64/0xa0
Oct 13 20:26:04 succubus Caller[0000000000550a28]: driver_probe_device+0x90/0x1c0
Oct 13 20:26:04 succubus Caller[0000000000550bc0]: __driver_attach+0x68/0x80
Oct 13 20:26:04 succubus Caller[000000000054fe5c]: bus_for_each_dev+0x44/0x80
Oct 13 20:26:04 succubus Caller[0000000000550218]: bus_add_driver+0x80/0x1c0
Oct 13 20:26:04 succubus Caller[0000000000529b7c]: __pci_register_driver+0x44/0xa0
Oct 13 20:26:04 succubus Caller[000000000047d5cc]: sys_init_module+0x134/0x1400
Oct 13 20:26:04 succubus Caller[0000000000406094]: linux_sparc_syscall32+0x3c/0x40
Oct 13 20:26:04 succubus Caller[00000000000133d8]: 0x133e0
Oct 13 20:26:04 succubus Instruction DUMP: 01000000  01000000  01000000 <1>Unable to handle kernel NULL pointer dereference
Oct 13 20:26:04 succubus tsk->{mm,active_mm}->context = 00000000000001f9
Oct 13 20:26:04 succubus tsk->{mm,active_mm}->pgd = fffff8004f9f8000
Oct 13 20:26:04 succubus \|/ ____ \|/
Oct 13 20:26:04 succubus "@'/ .. \`@"
Oct 13 20:26:04 succubus /_| \__/ |_\
Oct 13 20:26:04 succubus \__U_/
Oct 13 20:26:04 succubus modprobe(648): Oops [#2]
Oct 13 20:26:04 succubus TSTATE: 0000004480009605 TPC: 0000000000428a5c TNPC: 0000000000428a60 Y: 00000000    Tainted: G      D
Oct 13 20:26:04 succubus TPC: <instruction_dump+0x44/0xa0>
Oct 13 20:26:04 succubus g0: 78745f6164645f70 g1: 000000000063def0 g2: 0000000000000000 g3: 000000000063dc00
Oct 13 20:26:04 succubus g4: fffff8004f393480 g5: 0000000000000098 g6: fffff8004fa74000 g7: 000000000063def0
Oct 13 20:26:04 succubus o0: 00000000006050d0 o1: 000000000000003c o2: 0000000001000000 o3: 000000000000003e
Oct 13 20:26:04 succubus o4: 0000000010068d90 o5: 000000000068d168 sp: fffff8004fa76aa1 ret_pc: 0000000000428a80
Oct 13 20:26:04 succubus RPC: <instruction_dump+0x68/0xa0>
Oct 13 20:26:04 succubus l0: 0000000000000000 l1: 0000000000605000 l2: 0000000000000000 l3: 0000001911009601
Oct 13 20:26:04 succubus l4: 0000000000000000 l5: 000000000063dc00 l6: 000000000063dc00 l7: 0000000000000000
Oct 13 20:26:04 succubus i0: 0000000000000000 i1: 00000000000133e0 i2: fffff8004fa775d0 i3: 0000000000000000
Oct 13 20:26:04 succubus i4: 00000000000000ff i5: 000000000000000b i6: fffff8004fa76b61 i7: 00000000004294b8
Oct 13 20:26:04 succubus I7: <die_if_kernel+0x160/0x2c0>
Oct 13 20:26:04 succubus Caller[00000000004294b8]: die_if_kernel+0x160/0x2c0
Oct 13 20:26:04 succubus Caller[000000000044d130]: unhandled_fault+0x78/0xe0
Oct 13 20:26:04 succubus Caller[00000000004076f4]: sparc64_realfault_common+0x18/0x20
Oct 13 20:26:04 succubus Caller[0000000010068cd0]: ar_context_add_page+0xd8/0x160 [firewire_ohci]
Oct 13 20:26:04 succubus Caller[0000000010068d90]: ar_context_init+0x38/0x60 [firewire_ohci]
Oct 13 20:26:04 succubus Caller[000000001006ac50]: pci_probe+0xf8/0x340 [firewire_ohci]
Oct 13 20:26:04 succubus Caller[00000000005299bc]: pci_device_probe+0x64/0xa0
Oct 13 20:26:04 succubus Caller[0000000000550a28]: driver_probe_device+0x90/0x1c0
Oct 13 20:26:04 succubus Caller[0000000000550bc0]: __driver_attach+0x68/0x80
Oct 13 20:26:04 succubus Caller[000000000054fe5c]: bus_for_each_dev+0x44/0x80
Oct 13 20:26:04 succubus Caller[0000000000550218]: bus_add_driver+0x80/0x1c0
Oct 13 20:26:04 succubus Caller[0000000000529b7c]: __pci_register_driver+0x44/0xa0
Oct 13 20:26:04 succubus Caller[000000000047d5cc]: sys_init_module+0x134/0x1400
Oct 13 20:26:04 succubus Caller[0000000000406094]: linux_sparc_syscall32+0x3c/0x40
Oct 13 20:26:04 succubus Caller[00000000000133d8]: 0x133e0
Oct 13 20:26:04 succubus Instruction DUMP: 80a42000  12480005  9610203e <d4060000> 4000cf10  a0102001  833c2000  901460d0  83287002
Comment 1 Anonymous Emailer 2007-10-13 13:16:34 UTC
Reply-To: stefanr@s5r6.in-berlin.de

For the linux1394-devel and sparclinux mailinglists to see:

Since Linux 2.6.23, the firewire-ohci driver oopses on startup when it
sets up its AR DMA contexts (AR = asynchronous reception).  See quote
from bugzilla below.

The last modification that directly affected firewire-ohci's AR DMA was
already before 2.6.22 (firewire: fix async reception on big endian
machines).
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=history;f=drivers/firewire/fw-ohci.c

Patches to fw-ohci.c in 2.6.23 which are not in 2.6.22.9:

	firewire: missing newline in printk
	firewire: fw-ohci: flush MMIO write before msleep
	firewire: fw-ohci: dma_free_coherent needs IRQs enabled

They are all obviously unrelated to the bug.

bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=9160
> 
>            Summary: kernel oops in firewire_ohci
>            Product: Drivers
>            Version: 2.5
>      KernelVersion: 2.6.23.1
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IEEE1394
>         AssignedTo: drivers_ieee1394@kernel-bugs.osdl.org
>         ReportedBy: seraph@xs4all.nl
> 
> 
> Most recent kernel where this bug did not occur: 2.6.22.9
> Distribution: Gentoo
> Hardware Environment: Sparc64 (Blade 100 UltraSPARC 2e)
> Software Environment:
> Problem Description:
> 
> I found a kernel oops in /var/log/messages after booting both my Blade 100s.
> The oops appears to have triggered by firewire_ohci, but I'm no kernel hacker
> and could be mistaken.
> 
> Here is the text of one of them. The one from the other machine is identical
> except for the timestamps and machine name.
> 
> Oct 13 20:26:04 succubus OOPS: Bogus kernel PC [0000000000000000] in fault
> handler
> Oct 13 20:26:04 succubus OOPS: RPC [0000000010068cd0]
> Oct 13 20:26:04 succubus RPC: <ar_context_add_page+0xd8/0x160
> [firewire_ohci]>
> Oct 13 20:26:04 succubus OOPS: Fault was to vaddr[1004e000]
> Oct 13 20:26:04 succubus Call Trace:
> Oct 13 20:26:04 succubus [00000000004076f4]
> sparc64_realfault_common+0x18/0x20
> Oct 13 20:26:04 succubus [0000000010068cd0] ar_context_add_page+0xd8/0x160
> [firewire_ohci]
> Oct 13 20:26:04 succubus [0000000010068d90] ar_context_init+0x38/0x60
> [firewire_ohci]
> Oct 13 20:26:04 succubus [000000001006ac50] pci_probe+0xf8/0x340
> [firewire_ohci]
> Oct 13 20:26:04 succubus [00000000005299bc] pci_device_probe+0x64/0xa0
> Oct 13 20:26:04 succubus [0000000000550a28] driver_probe_device+0x90/0x1c0
> Oct 13 20:26:04 succubus [0000000000550bc0] __driver_attach+0x68/0x80
> Oct 13 20:26:04 succubus [000000000054fe5c] bus_for_each_dev+0x44/0x80
> Oct 13 20:26:04 succubus [0000000000550218] bus_add_driver+0x80/0x1c0
> Oct 13 20:26:04 succubus [0000000000529b7c] __pci_register_driver+0x44/0xa0
> Oct 13 20:26:04 succubus [000000000047d5cc] sys_init_module+0x134/0x1400
> Oct 13 20:26:04 succubus [0000000000406094] linux_sparc_syscall32+0x3c/0x40
> Oct 13 20:26:04 succubus [00000000000133d8] 0x133e0
> Oct 13 20:26:04 succubus Unable to handle kernel NULL pointer dereference
> Oct 13 20:26:04 succubus tsk->{mm,active_mm}->context = 00000000000001f9
> Oct 13 20:26:04 succubus tsk->{mm,active_mm}->pgd = fffff8004f9f8000
> Oct 13 20:26:04 succubus \|/ ____ \|/
> Oct 13 20:26:04 succubus "@'/ .. \`@"
> Oct 13 20:26:04 succubus /_| \__/ |_\
> Oct 13 20:26:04 succubus \__U_/
> Oct 13 20:26:04 succubus modprobe(648): Oops [#1]
> Oct 13 20:26:04 succubus TSTATE: 0000001911009601 TPC: 0000000000000000 TNPC:
> 0000000000000004 Y: 00000000    Not tainted
> Oct 13 20:26:04 succubus TPC: <0x8>
> Oct 13 20:26:04 succubus g0: 0000000000470a40 g1: 00000000005d2b28 g2:
> 0000000000000000 g3: 0000000000000000
> Oct 13 20:26:04 succubus g4: fffff8004f393480 g5: 0000000000000098 g6:
> fffff8004fa74000 g7: 00c06500180065c0
> Oct 13 20:26:04 succubus o0: fffff8004effb870 o1: 00000000c0650000 o2:
> 0000000000002000 o3: 0000000000000000
> Oct 13 20:26:04 succubus o4: 0000000000000004 o5: 000001ff00420000 sp:
> fffff8004fa76e71 ret_pc: 0000000010068cd0
> Oct 13 20:26:04 succubus RPC: <ar_context_add_page+0xd8/0x160
> [firewire_ohci]>
> Oct 13 20:26:04 succubus l0: fffff8004fba4000 l1: 00000000c0650000 l2:
> fffff8004effb870 l3: fffff8004f608d90
> Oct 13 20:26:04 succubus l4: 000000000063cc00 l5: 0000000000ff0000 l6:
> fffff8004fa776e0 l7: fffffffffffffff0
> Oct 13 20:26:04 succubus i0: 000000000000ff00 i1: 0000000010069cc0 i2:
> fffff8004f608d90 i3: 0000000000000000
> Oct 13 20:26:04 succubus i4: 0000000000000040 i5: 0000000000000000 i6:
> fffff8004fa76f31 i7: 0000000010068d90
> Oct 13 20:26:04 succubus I7: <ar_context_init+0x38/0x60 [firewire_ohci]>
> Oct 13 20:26:04 succubus Caller[0000000010068d90]: ar_context_init+0x38/0x60
> [firewire_ohci]
> Oct 13 20:26:04 succubus Caller[000000001006ac50]: pci_probe+0xf8/0x340
> [firewire_ohci]
> Oct 13 20:26:04 succubus Caller[00000000005299bc]: pci_device_probe+0x64/0xa0
> Oct 13 20:26:04 succubus Caller[0000000000550a28]:
> driver_probe_device+0x90/0x1c0
> Oct 13 20:26:04 succubus Caller[0000000000550bc0]: __driver_attach+0x68/0x80
> Oct 13 20:26:04 succubus Caller[000000000054fe5c]: bus_for_each_dev+0x44/0x80
> Oct 13 20:26:04 succubus Caller[0000000000550218]: bus_add_driver+0x80/0x1c0
> Oct 13 20:26:04 succubus Caller[0000000000529b7c]:
> __pci_register_driver+0x44/0xa0
> Oct 13 20:26:04 succubus Caller[000000000047d5cc]:
> sys_init_module+0x134/0x1400
> Oct 13 20:26:04 succubus Caller[0000000000406094]:
> linux_sparc_syscall32+0x3c/0x40
> Oct 13 20:26:04 succubus Caller[00000000000133d8]: 0x133e0
> Oct 13 20:26:04 succubus Instruction DUMP: 01000000  01000000  01000000
> <1>Unable to handle kernel NULL pointer dereference
> Oct 13 20:26:04 succubus tsk->{mm,active_mm}->context = 00000000000001f9
> Oct 13 20:26:04 succubus tsk->{mm,active_mm}->pgd = fffff8004f9f8000
> Oct 13 20:26:04 succubus \|/ ____ \|/
> Oct 13 20:26:04 succubus "@'/ .. \`@"
> Oct 13 20:26:04 succubus /_| \__/ |_\
> Oct 13 20:26:04 succubus \__U_/
> Oct 13 20:26:04 succubus modprobe(648): Oops [#2]
> Oct 13 20:26:04 succubus TSTATE: 0000004480009605 TPC: 0000000000428a5c TNPC:
> 0000000000428a60 Y: 00000000    Tainted: G      D
> Oct 13 20:26:04 succubus TPC: <instruction_dump+0x44/0xa0>
> Oct 13 20:26:04 succubus g0: 78745f6164645f70 g1: 000000000063def0 g2:
> 0000000000000000 g3: 000000000063dc00
> Oct 13 20:26:04 succubus g4: fffff8004f393480 g5: 0000000000000098 g6:
> fffff8004fa74000 g7: 000000000063def0
> Oct 13 20:26:04 succubus o0: 00000000006050d0 o1: 000000000000003c o2:
> 0000000001000000 o3: 000000000000003e
> Oct 13 20:26:04 succubus o4: 0000000010068d90 o5: 000000000068d168 sp:
> fffff8004fa76aa1 ret_pc: 0000000000428a80
> Oct 13 20:26:04 succubus RPC: <instruction_dump+0x68/0xa0>
> Oct 13 20:26:04 succubus l0: 0000000000000000 l1: 0000000000605000 l2:
> 0000000000000000 l3: 0000001911009601
> Oct 13 20:26:04 succubus l4: 0000000000000000 l5: 000000000063dc00 l6:
> 000000000063dc00 l7: 0000000000000000
> Oct 13 20:26:04 succubus i0: 0000000000000000 i1: 00000000000133e0 i2:
> fffff8004fa775d0 i3: 0000000000000000
> Oct 13 20:26:04 succubus i4: 00000000000000ff i5: 000000000000000b i6:
> fffff8004fa76b61 i7: 00000000004294b8
> Oct 13 20:26:04 succubus I7: <die_if_kernel+0x160/0x2c0>
> Oct 13 20:26:04 succubus Caller[00000000004294b8]: die_if_kernel+0x160/0x2c0
> Oct 13 20:26:04 succubus Caller[000000000044d130]: unhandled_fault+0x78/0xe0
> Oct 13 20:26:04 succubus Caller[00000000004076f4]:
> sparc64_realfault_common+0x18/0x20
> Oct 13 20:26:04 succubus Caller[0000000010068cd0]:
> ar_context_add_page+0xd8/0x160 [firewire_ohci]
> Oct 13 20:26:04 succubus Caller[0000000010068d90]: ar_context_init+0x38/0x60
> [firewire_ohci]
> Oct 13 20:26:04 succubus Caller[000000001006ac50]: pci_probe+0xf8/0x340
> [firewire_ohci]
> Oct 13 20:26:04 succubus Caller[00000000005299bc]: pci_device_probe+0x64/0xa0
> Oct 13 20:26:04 succubus Caller[0000000000550a28]:
> driver_probe_device+0x90/0x1c0
> Oct 13 20:26:04 succubus Caller[0000000000550bc0]: __driver_attach+0x68/0x80
> Oct 13 20:26:04 succubus Caller[000000000054fe5c]: bus_for_each_dev+0x44/0x80
> Oct 13 20:26:04 succubus Caller[0000000000550218]: bus_add_driver+0x80/0x1c0
> Oct 13 20:26:04 succubus Caller[0000000000529b7c]:
> __pci_register_driver+0x44/0xa0
> Oct 13 20:26:04 succubus Caller[000000000047d5cc]:
> sys_init_module+0x134/0x1400
> Oct 13 20:26:04 succubus Caller[0000000000406094]:
> linux_sparc_syscall32+0x3c/0x40
> Oct 13 20:26:04 succubus Caller[00000000000133d8]: 0x133e0
> Oct 13 20:26:04 succubus Instruction DUMP: 80a42000  12480005  9610203e
> <d4060000> 4000cf10  a0102001  833c2000  901460d0  83287002
> 
> 
Comment 2 Stefan Richter 2007-10-13 13:26:10 UTC
Did you actually enable the new firewire drivers (firewire-ohci etc., as opposed to the old drivers ohci1394 etc.) already when you ran Linux 2.6.22.9?
Comment 3 seraph@xs4all.nl 2007-10-13 13:58:26 UTC
Oh yes, I've been using the new drivers ever since they appeared in the stable tree. Everything was just fine up to and including 2.6.22.9, I checked my old logs.

Could it be Sparc-specific? I also use the new drivers on my laptop (Pentium 4 Mobile) and it seems perfectly fine with the 2.6.23.1 kernel.
Comment 4 Stefan Richter 2007-10-16 10:29:36 UTC
Created attachment 13181 [details]
firewire: fw-ohci: fix DMA mapping for ar context

Could you apply and test this patch?  The diff is against a development version but it should be applicable to 2.6.23(.1) with only harmless line offsets.
Comment 5 Stefan Richter 2007-10-16 10:41:49 UTC
Created attachment 13182 [details]
firewire: fw-ohci: fix DMA mapping for ar context

Oops, the previous version of the patch had a silly error.
Comment 6 Stefan Richter 2007-10-16 11:04:07 UTC
Created attachment 13183 [details]
firewire: fw-ohci: optimize initialization of ar context

Since you have hardware which actually does something when dma_sync_* is called, it would be nice if you would also test this small optimization patch.
Comment 7 Stefan Richter 2007-10-16 11:05:33 UTC
PS:  The fix has to be applied before the optimization can be applied.
Comment 8 seraph@xs4all.nl 2007-10-16 12:07:28 UTC
Well, I'm afraid the patch didn't help. Here is the oops I got after booting with it applied:

Oct 16 20:56:31 succubus OOPS: Bogus kernel PC [0000000000000000] in fault handler
Oct 16 20:56:31 succubus OOPS: RPC [00000000100a2cec]
Oct 16 20:56:31 succubus RPC: <ar_context_add_page+0xf4/0x180 [firewire_ohci]>
Oct 16 20:56:31 succubus OOPS: Fault was to vaddr[10072000]
Oct 16 20:56:31 succubus Call Trace:
Oct 16 20:56:31 succubus [00000000004076f4] sparc64_realfault_common+0x18/0x20
Oct 16 20:56:31 succubus [00000000100a2cec] ar_context_add_page+0xf4/0x180 [firewire_ohci]
Oct 16 20:56:31 succubus [00000000100a2db0] ar_context_init+0x38/0x60 [firewire_ohci]
Oct 16 20:56:31 succubus [00000000100a4c70] pci_probe+0xf8/0x340 [firewire_ohci]
Oct 16 20:56:31 succubus [00000000005299bc] pci_device_probe+0x64/0xa0
Oct 16 20:56:31 succubus [0000000000550a28] driver_probe_device+0x90/0x1c0
Oct 16 20:56:31 succubus [0000000000550bc0] __driver_attach+0x68/0x80
Oct 16 20:56:31 succubus [000000000054fe5c] bus_for_each_dev+0x44/0x80
Oct 16 20:56:31 succubus [0000000000550218] bus_add_driver+0x80/0x1c0
Oct 16 20:56:31 succubus [0000000000529b7c] __pci_register_driver+0x44/0xa0
Oct 16 20:56:31 succubus [000000000047d5cc] sys_init_module+0x134/0x1400
Oct 16 20:56:31 succubus [0000000000406094] linux_sparc_syscall32+0x3c/0x40
Oct 16 20:56:31 succubus [00000000000133d8] 0x133e0
Oct 16 20:56:31 succubus Unable to handle kernel NULL pointer dereference
Oct 16 20:56:31 succubus tsk->{mm,active_mm}->context = 00000000000001f5
Oct 16 20:56:31 succubus tsk->{mm,active_mm}->pgd = fffff8004e92c000
Oct 16 20:56:31 succubus \|/ ____ \|/
Oct 16 20:56:31 succubus "@'/ .. \`@"
Oct 16 20:56:31 succubus /_| \__/ |_\
Oct 16 20:56:31 succubus \__U_/
Oct 16 20:56:31 succubus modprobe(641): Oops [#1]
Oct 16 20:56:31 succubus TSTATE: 0000004411009601 TPC: 0000000000000000 TNPC: 0000000000000004 Y: 00000000    Not tainted
Oct 16 20:56:31 succubus TPC: <0x8>
Oct 16 20:56:31 succubus g0: fffff8004eb775d0 g1: 0000000000000000 g2: 00000000005d2b28 g3: 00c06500180065c0
Oct 16 20:56:31 succubus g4: fffff8004ebdd180 g5: 00000000000000b8 g6: fffff8004eb74000 g7: 0000000000000000
Oct 16 20:56:31 succubus o0: fffff8004effb870 o1: 00000000c0650000 o2: 0000000000002000 o3: 0000000000000000
Oct 16 20:56:31 succubus o4: 0000000000000004 o5: 0000000000006500 sp: fffff8004eb76e71 ret_pc: 00000000100a2cec
Oct 16 20:56:31 succubus RPC: <ar_context_add_page+0xf4/0x180 [firewire_ohci]>
Oct 16 20:56:31 succubus l0: 000000000000ff00 l1: fffff8004f8c2000 l2: fffff8004effb870 l3: 00000000c0650000
Oct 16 20:56:31 succubus l4: fffff8004f664d90 l5: 000000000063cc00 l6: 0000000000ff0000 l7: fffffffffffffff0
Oct 16 20:56:31 succubus i0: 00000000c0650000 i1: 00000000100a3ce0 i2: fffff8004f664d90 i3: 0000000000000000
Oct 16 20:56:31 succubus i4: 0000000000000040 i5: 0000000000000000 i6: fffff8004eb76f31 i7: 00000000100a2db0
Oct 16 20:56:31 succubus I7: <ar_context_init+0x38/0x60 [firewire_ohci]>
Oct 16 20:56:31 succubus Caller[00000000100a2db0]: ar_context_init+0x38/0x60 [firewire_ohci]
Oct 16 20:56:31 succubus Caller[00000000100a4c70]: pci_probe+0xf8/0x340 [firewire_ohci]
Oct 16 20:56:31 succubus Caller[00000000005299bc]: pci_device_probe+0x64/0xa0
Oct 16 20:56:31 succubus Caller[0000000000550a28]: driver_probe_device+0x90/0x1c0
Oct 16 20:56:31 succubus Caller[0000000000550bc0]: __driver_attach+0x68/0x80
Oct 16 20:56:31 succubus Caller[000000000054fe5c]: bus_for_each_dev+0x44/0x80
Oct 16 20:56:31 succubus Caller[0000000000550218]: bus_add_driver+0x80/0x1c0
Oct 16 20:56:31 succubus Caller[0000000000529b7c]: __pci_register_driver+0x44/0xa0
Oct 16 20:56:31 succubus Caller[000000000047d5cc]: sys_init_module+0x134/0x1400
Oct 16 20:56:31 succubus Caller[0000000000406094]: linux_sparc_syscall32+0x3c/0x40
Oct 16 20:56:31 succubus Caller[00000000000133d8]: 0x133e0
Oct 16 20:56:31 succubus Instruction DUMP: 01000000  01000000  01000000 <1>Unable to handle kernel NULL pointer dereference
Oct 16 20:56:31 succubus tsk->{mm,active_mm}->context = 00000000000001f5
Oct 16 20:56:31 succubus tsk->{mm,active_mm}->pgd = fffff8004e92c000
Oct 16 20:56:31 succubus \|/ ____ \|/
Oct 16 20:56:31 succubus "@'/ .. \`@"
Oct 16 20:56:31 succubus /_| \__/ |_\
Oct 16 20:56:31 succubus \__U_/
Oct 16 20:56:31 succubus modprobe(641): Oops [#2]
Oct 16 20:56:31 succubus TSTATE: 0000004480009605 TPC: 0000000000428a5c TNPC: 0000000000428a60 Y: 00000000    Tainted: G      D
Oct 16 20:56:31 succubus TPC: <instruction_dump+0x44/0xa0>
Oct 16 20:56:31 succubus g0: 78745f6164645f70 g1: 000000000063def0 g2: 0000000000000000 g3: 000000000063dc00
Oct 16 20:56:31 succubus g4: fffff8004ebdd180 g5: 00000000000000b8 g6: fffff8004eb74000 g7: 000000000063def0
Oct 16 20:56:31 succubus o0: 00000000006050d0 o1: 000000000000003c o2: 0000000001000000 o3: 000000000000003e
Oct 16 20:56:31 succubus o4: 00000000100a2db0 o5: 000000000068d168 sp: fffff8004eb76aa1 ret_pc: 0000000000428a80
Oct 16 20:56:31 succubus RPC: <instruction_dump+0x68/0xa0>
Oct 16 20:56:31 succubus l0: 0000000000000000 l1: 0000000000605000 l2: 0000000000000000 l3: 0000004411009601
Oct 16 20:56:31 succubus l4: 0000000000000000 l5: 000000000063dc00 l6: 000000000063dc00 l7: 0000000000000000
Oct 16 20:56:31 succubus i0: 0000000000000000 i1: 00000000000133e0 i2: fffff8004eb775d0 i3: 0000000000000000
Oct 16 20:56:31 succubus i4: 00000000000000ff i5: 000000000000000b i6: fffff8004eb76b61 i7: 00000000004294b8
Oct 16 20:56:31 succubus I7: <die_if_kernel+0x160/0x2c0>
Oct 16 20:56:31 succubus Caller[00000000004294b8]: die_if_kernel+0x160/0x2c0
Oct 16 20:56:31 succubus Caller[000000000044d130]: unhandled_fault+0x78/0xe0
Oct 16 20:56:31 succubus Caller[00000000004076f4]: sparc64_realfault_common+0x18/0x20
Oct 16 20:56:31 succubus Caller[00000000100a2cec]: ar_context_add_page+0xf4/0x180 [firewire_ohci]
Oct 16 20:56:31 succubus Caller[00000000100a2db0]: ar_context_init+0x38/0x60 [firewire_ohci]
Oct 16 20:56:31 succubus Caller[00000000100a4c70]: pci_probe+0xf8/0x340 [firewire_ohci]
Oct 16 20:56:31 succubus Caller[00000000005299bc]: pci_device_probe+0x64/0xa0
Oct 16 20:56:31 succubus Caller[0000000000550a28]: driver_probe_device+0x90/0x1c0
Oct 16 20:56:31 succubus Caller[0000000000550bc0]: __driver_attach+0x68/0x80
Oct 16 20:56:31 succubus Caller[000000000054fe5c]: bus_for_each_dev+0x44/0x80
Oct 16 20:56:31 succubus Caller[0000000000550218]: bus_add_driver+0x80/0x1c0
Oct 16 20:56:31 succubus Caller[0000000000529b7c]: __pci_register_driver+0x44/0xa0
Oct 16 20:56:31 succubus Caller[000000000047d5cc]: sys_init_module+0x134/0x1400
Oct 16 20:56:31 succubus Caller[0000000000406094]: linux_sparc_syscall32+0x3c/0x40
Oct 16 20:56:31 succubus Caller[00000000000133d8]: 0x133e0
Oct 16 20:56:31 succubus Instruction DUMP: 80a42000  12480005  9610203e <d4060000> 4000cf10  a0102001  833c2000  901460d0  83287002 

As for the optimization code, I'd be happy to test it for you (once the bug is fixed), but I can do little more than smoke-test it.
Comment 9 Stefan Richter 2007-10-16 13:01:01 UTC
Created attachment 13184 [details]
test 1

Please try this, which merely adds printk()s to show where things go wrong.  And after that...
Comment 10 Stefan Richter 2007-10-16 13:03:16 UTC
Created attachment 13185 [details]
test 2

...please try this here on top of "test 1".  Then I hope the problem occurs at a different place.  If so, I think know what I'll have to do.
Comment 11 seraph@xs4all.nl 2007-10-16 13:50:25 UTC
Well, here's what happens with test 1 applied:

Oct 16 22:44:06 succubus firewire_ohci: - - - 1 - - -
Oct 16 22:44:06 succubus firewire_ohci: - - - 2 - - -
Oct 16 22:44:06 succubus firewire_ohci: - - - 3 - - -
Oct 16 22:44:06 succubus firewire_ohci: - - - 4 - - -
Oct 16 22:44:06 succubus firewire_ohci: - - - 5 - - -
Oct 16 22:44:06 succubus OOPS: Bogus kernel PC [0000000000000000] in fault handler
Oct 16 22:44:06 succubus OOPS: RPC [0000000010098d0c]
Oct 16 22:44:06 succubus RPC: <ar_context_add_page+0x114/0x1c0 [firewire_ohci]>
Oct 16 22:44:06 succubus OOPS: Fault was to vaddr[1009a000]
[snipped the rest]
Comment 12 seraph@xs4all.nl 2007-10-16 14:05:20 UTC
And here you have it with test 2 applied on top of test 1:

Oct 16 22:58:20 succubus firewire_ohci: - - - 1 - - -
Oct 16 22:58:20 succubus firewire_ohci: - - - 2 - - -
Oct 16 22:58:20 succubus firewire_ohci: - - - 3 - - -
Oct 16 22:58:20 succubus firewire_ohci: - - - 4 - - -
Oct 16 22:58:20 succubus firewire_ohci: - - - 5 - - -
Oct 16 22:58:20 succubus OOPS: Bogus kernel PC [0000000000000000] in fault handler
Oct 16 22:58:20 succubus OOPS: RPC [000000001007ad28]
Oct 16 22:58:20 succubus RPC: <ar_context_add_page+0x130/0x1c0 [firewire_ohci]>
Oct 16 22:58:20 succubus OOPS: Fault was to vaddr[1004e000]
[snipped the rest]

Mm, looks pretty much the same to me, apart from that line starting with RPC:
Comment 13 Stefan Richter 2007-10-16 14:38:14 UTC
Hmm, I didn't expect both to coke at the same place.

(The change from 0x114 to 0x130 are obviously due to the machine code size of the single function call that was added by test 2.)

Plan B:  Convert the ar buffers to a different kind of DMA mapping.  The ar buffers are accessed in wrong ways in other places in fw-ohci too, so better fix it from the ground up.  I hope I will have time for this tomorrow.

If that goes wrong:  Plan C -- ask sparc64 folks what their kernel is trying to tell us...

Thanks so far.
Comment 14 seraph@xs4all.nl 2007-10-17 13:25:55 UTC
You're welcome. I use the driver more out of curiosity than anything, so it's not critical to me, though I would like to see it fixed regardless.

I have only one actual firewire device, an external disk, and that one never worked with Linux at all. I mostly just play with IP over firewire, or using Oracle's "endpoint" software to turn my laptop into a makeshift disk.

If you have a fix for me to try, just drop me a line. If you prefer, I can give you my Jabber, AIM, Yahoo or MSN messenger contact info. I'll be happy to test your optimization code too.
Comment 15 seraph@xs4all.nl 2007-12-17 08:18:40 UTC
I don't wanna sound pushy, but is anything still being done with this?
Comment 16 Stefan Richter 2007-12-17 08:37:58 UTC
No, or not actively...

I would need to specifically ask at sparclinux for help with this (comment 13 plan C).

On the other hand, there are currently some people reworking the isochronous reception code of firewire-ohci (which is per se unrelated to asynchronous reception whose setup is making trouble here) and it looks a lot like the IR code will be converted to use coherent DMA allocations.  Similarly, I believe it makes sense to convert AR to use coherent DMA allocations as well.  Then all of the dma_sync_*() calls will go away.  (Comment 13 plan B.  Minus this strange "tomorrow" thing.)

One other thing we could try before that would be to assemble a patch series which rolls the firewire code from 2.6.23.y back to how it was in 2.6.22.  With a bisection in such a series we might be able to learn more about the bug here too.

PS, about comment 14:  Both IP over FireWire and Endpoint don't work with the new alternative firewire drivers at all yet, and won't do so anytime soon as far as I can tell from our development progress, or lack thereof.
Comment 17 Stefan Richter 2007-12-17 08:43:25 UTC
> One other thing we could try before that would be to assemble a patch
> series which rolls the firewire code from 2.6.23.y back to how it was
> in 2.6.22.

I'm gonna attempt to do this tonight.  There should only few patches in there:
http://me.in-berlin.de/~s5r6/linux1394/merged/in_2.6.23/
minus ???-ieee1394-*.patch and minus [fms]*.patch and minus most or all of ???-fw-sbp2-*.patch.
Comment 18 Stefan Richter 2007-12-17 09:03:40 UTC
FWIW, another report of the same bug was made against 2.6.23-rc3:
http://www.mail-archive.com/sparclinux@vger.kernel.org/msg01697.html
The poster also suspected dma_sync_single_for_device to be at fault.
There were no replies.
Comment 19 seraph@xs4all.nl 2007-12-17 09:18:26 UTC
Well, like I said, no rush. I was just curious as there were no updates for nearly two months.

Based on your comments, I reverted to the old FireWire stack for the time being so I can test IP over FireWire between my webserver and my database.

Of course I'd still be glad to test any code for you.


And now it appears my sungem interface is also not working in 2.6.23... Sigh, it's always something, isn't it?
Comment 20 Stefan Richter 2007-12-17 10:14:05 UTC
Created attachment 14086 [details]
revert firewire from 2.6.23 to 2.6.22

This patch reverts the whole firewire subsystem in 2.6.23 to how it was in 2.6.22. Please apply to 2.6.23(.y) and test-load firewire-ohci.

I only compile-tested the patch but I think it should work (to the extent to that the firewire-subsystem worked in 2.6.22, which was its debut in mainline).
Comment 21 Stefan Richter 2007-12-17 10:19:08 UTC
Created attachment 14087 [details]
revert a few selected 2.6.23 patches

This is a series of 5 patches:

patches/revert-560a-firewire-fw-ohci-ignore-failure-of-pci_set_power_state-fix-suspend-regression.patch
patches/revert-555-firewire-fix-unloading-of-fw-ohci-while-devices-are-attached.patch
patches/revert-530-firewire-fw-ohci-dma_free_coherent-needs-irqs-enabled.patch
patches/revert-522-firewire-fw-ohci-fix-scheduling-while-atomic.patch
patches/revert-521-firewire-fw-ohci-flush-mmio-write-before-msleep.patch
patches/series

These are applicable on top of clean 2.6.23(.y) in the order as listed.  The series file is only needed if you want to use quilt to apply and unapply these patches one after another.

This test would only be interesting if the previous sledge hammer patch actually fixes the oops (which I seriously doubt after having collected this smaller patch series here).
Comment 22 Stefan Richter 2007-12-17 10:29:43 UTC
My bet is:
  - Attachment 14086 [details] (comment 20) will not fix the oops,
  - the sparc architecture has been broken after 2.6.22 in such a way that dma_sync_*() cannot be used on a device whose probe method hasn't been finished.

But let's wait for your test result.
Comment 23 Stefan Richter 2007-12-17 10:45:47 UTC
I forgot:
Independently of the result of that test, another interesting point is whether the bug still exists in 2.6.24(-rc*).
Comment 24 seraph@xs4all.nl 2007-12-17 10:46:34 UTC
Well, if the layout of menuconfig is any indication, sparc code has been a mess for some time... I've been tempted to try and clean it up, but I'm afraid I'd only make it worse. ^^

Ah well, applying your patch from #20 now. And I suppose I better get that cross-compiler running again while I'm at it, compiling on the Blade 100 just takes too damn long...

I'll keep you updated.
Comment 25 seraph@xs4all.nl 2007-12-17 11:05:42 UTC
Oh, a little something I spotted while compiling:

  CC [M]  drivers/firewire/fw-cdev.o
drivers/firewire/fw-cdev.c: In function 'ioctl_queue_iso':
drivers/firewire/fw-cdev.c:725: warning: passing argument 2 of 'access_ok' makes pointer from integer without a cast
Comment 26 Stefan Richter 2007-12-17 11:25:39 UTC
> drivers/firewire/fw-cdev.c: In function 'ioctl_queue_iso':
> drivers/firewire/fw-cdev.c:725: warning: passing argument 2 of 'access_ok'
> makes pointer from integer without a cast

I got a slightly different sparse warning but not a compiler warning on i386.
It has been fixed in 2.6.24-rc1 though:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=1ccc9147f6a063c42fef67acff34de18435a4a6b
Comment 27 seraph@xs4all.nl 2007-12-17 11:35:34 UTC
Well, as you already suspected, the patch from #20 did not fix the problem. It still oopses, causing problems with a couple of other modules as well.

An interesting point is that network does work with this kernel, while it didn't with non-oopsing 2.6.23 kernels... Mm, wonder if another module is causing the failure...

Cross compiler works again, I'll try 2.6.24-rc5 next.
Comment 28 seraph@xs4all.nl 2007-12-17 13:05:19 UTC
I just tried 2.6.24-rc5. It also oopses just like the rest.
Comment 29 Stefan Richter 2007-12-17 13:21:16 UTC
Thanks a lot for taking the time to test thus far.  One more question just to be safe.  You wrote "Most recent kernel where this bug did not occur: 2.6.22.9" in the description.  Are you sure you already had firewire-ohci enabled under 2.6.22.y?
Comment 30 seraph@xs4all.nl 2007-12-17 14:32:56 UTC
Yes, I had it enabled, but at the time I was not using it for anything at all.

I just retested it to make sure. It 'works' in so far that it doesn't oops on loading. However, trying to use it results in hanging my machine. So I guess it was actually already broken in 2.6.22.9, I just didn't notice it.
Comment 31 Stefan Richter 2007-12-17 15:21:41 UTC
> It 'works' in so far that it doesn't oops on loading. However, trying
> to use it results in hanging my machine. So I guess it was actually
> already broken in 2.6.22.9, I just didn't notice it.

Well, it was already broken but in different ways.  The oops in the pci_probe path is IMO quite specific.  I will ask around whether this is something to be fixed in architecture code or driver core code, or whether the dma_sync_* has to be removed from that path.
Comment 32 Stefan Richter 2007-12-19 08:03:44 UTC
Current discussion at sparclinux, Cc'd to linux-kernel and linux1394-devel: http://lkml.org/lkml/2007/12/17/413
Comment 33 Stefan Richter 2007-12-20 07:11:18 UTC
Could you try this patch by David Miller?
posting:   http://lkml.org/lkml/2007/12/20/37
raw diff:  http://lkml.org/lkml/diff/2007/12/20/37/1
Comment 34 Stefan Richter 2007-12-22 04:50:43 UTC
Comment on attachment 14086 [details]
revert firewire from 2.6.23 to 2.6.22

Daveid Miller's fix has been confirmed on Sun Blade 2000.  See LKML link in the previous comment.
Comment 35 Stefan Richter 2007-12-22 05:06:07 UTC
I assume that the problem is now fixed in Linus' tree.
Bug fix commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=36bb61346d9e64b55285f27363e93a6e96f2abba
Merge commit (a day after 2.6.24-rc6):
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a4c80d2ae2cac531c6655f75658dae02c488abc7

Please reopen if I'm mistaken and the same issue is still present after the fix.

Note You need to log in before you can comment on or make changes to this bug.