Bug 8646

Summary: fw-ohci and ohci1394: panic in softirq, below smp_apic_timer_interrupt
Product: Drivers
Reporter: Stefan Richter (stefanr)
Component: IEEE1394
Assignee: drivers_ieee1394
Status: CLOSED CODE_FIX
Severity: high
Priority: P1
Hardware: All
OS: Linux
Kernel Version: all
Subsystem:
Regression: ---
Bisected commit-id:
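Attachments:
  - screenshot (attachment 11773)
  - screenshot (attachment 12445)
  - kill tasklets in fw-ohci::pci_remove (attachment 12461)
  - screenshot (attachment 12462)
  - firewire: fix unloading of fw-ohci while devices are attached (attachment 12463)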

Description Stefan Richter 2007-06-17 14:04:35 UTC
Most recent kernel where this bug did not occur: unknown
Hardware Environment: SMP i586 and SMP x86-64
Software Environment: Linux 2.6.x

If an SBP-2 device is attached and sbp2 or firewire-sbp2 is loaded, any of the following commands

# modprobe -r sbp2 && modprobe -r ohci1394
# modprobe -r firewire-ohci
# modprobe -r firewire-sbp2 firewire-ohci
# modprobe -r firewire-sbp2 && sleep 0 && modprobe -r firewire-ohci

will lead to a panic with a trace similar to this:
general protection fault [...]
Pid: 0, comm: swapper [...]
run_timer_softirq
__do_softirq
call_softirq
do_softirq
irq_exit
smp_apic_timer_interrupt
mwait_idle
apic_timer_interrupt
[...]

This happened on two different i945GM based boards with Core 2 Duo, with both 32-bit and 64-bit kernels.  It has been a while since I last tried this with ohci1394/sbp2; I have now seen it happen with the new drivers too.

The same trace also happened repeatedly on an older kernel in a totally different context, without FireWire drivers loaded:  It could be triggered by "make -j" in the kernel source tree, i.e. by spawning something with many subthreads.  For now, I don't have a spare machine with enough RAM available to test this again with a recent kernel.  I will try to make the machine where I saw it available again.

I.e. the bug may be entirely outside the old and new FireWire drivers.

*Not* affected are:
  - The same machines with "modprobe -r firewire-sbp2 && sleep 0.1 && modprobe -r firewire-ohci",
  - A VIA KM-266/AMD Athlon single-CPU PC, even if running an SMP kernel.
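
The difference between "sleep 0" and "sleep 0.1" suggests a timing window between the two unload steps. A minimal sketch for narrowing it down follows. It is not part of the original report: the helper name unload_cmd, the DRY_RUN variable, the intermediate delay values and the 5 second settle time are all assumptions. By default it only prints the commands; actually running them needs root and is expected to panic an affected machine.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original report.
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 runs them as root.
DRY_RUN="${DRY_RUN:-1}"

# Build the unload sequence from the report for a given delay in seconds.
unload_cmd() {
    printf 'modprobe -r firewire-sbp2 && sleep %s && modprobe -r firewire-ohci\n' "$1"
}

# 0 and 0.1 are the delays from the report; the values in between are guesses.
for delay in 0 0.01 0.05 0.1; do
    cmd=$(unload_cmd "$delay")
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $cmd"
    else
        # Reload the drivers and give the SBP-2 device time to be probed
        # (the 5 s settle time is an assumption).
        modprobe firewire-ohci && modprobe firewire-sbp2 && sleep 5
        sh -c "$cmd"
    fi
done
```

Fractional arguments to sleep, as used in the report, depend on the sleep implementation (GNU coreutils accepts them).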

I will attach a sample screenshot in a follow-up entry.
Comment 1 Stefan Richter 2007-06-17 14:07:30 UTC
Created attachment 11773
screenshot
Comment 2 Stefan Richter 2007-07-28 04:51:59 UTC
An old Pentium MMX notebook also crashes this way but does not print a panic message.  Tested with a UP PREEMPT kernel.
Comment 3 Stefan Richter 2007-08-19 09:34:21 UTC
Created attachment 12445
screenshot

Panic on 2.6.23-rc3 x86-64, triggered by modprobe -r firewire-ohci shortly after modprobe firewire-ohci.  firewire-sbp2 was not loaded.
Comment 4 Stefan Richter 2007-08-20 14:22:16 UTC
Created attachment 12461
kill tasklets in fw-ohci::pci_remove

It seems this patch doesn't help.  See next screenshot.
Comment 5 Stefan Richter 2007-08-20 14:32:27 UTC
Created attachment 12462
screenshot

Panic with the patch from attachment 12461 applied, triggered by modprobe -r firewire-ohci with an SBP-2 disk attached and firewire-sbp2 loaded.

The only difference, probably insignificant, is that __update_rq_clock (a new function in 2.6.23-rc3) appears in the trace between __do_softirq and run_timer_softirq.
Comment 6 Stefan Richter 2007-08-20 16:25:54 UTC
Created attachment 12463
firewire: fix unloading of fw-ohci while devices are attached

Fixes modprobe -r firewire-ohci in the presence of an SBP-2 device.
Comment 7 Stefan Richter 2007-08-20 16:35:36 UTC
Still to do:

  - Check whether patch attachment 12469 also fixes the bug described in comment #3.
    We can't tell for sure as long as bug 8906 is unfixed.

  - Fix the bug for the old ieee1394 stack.
Comment 8 Stefan Richter 2007-08-20 16:36:41 UTC
(I meant attachment 12463 of course.)
Comment 9 Stefan Richter 2007-09-18 14:59:11 UTC
Attachment 12463 has been merged into Linux 2.6.23-rc4.
Comment 10 Stefan Richter 2008-02-24 10:52:04 UTC
Re comment #7:
Fixes for bug 8906 have been posted, and the bug described in comment #3 no longer occurs.  I.e. only the old ieee1394 stack _may_ still be affected.

I am closing this bug now; if anybody still encounters this bug with the ieee1394 stack, please reopen and rename this bug.