Bug 6706

Summary: modprobe -r ohci1394 hangs or panics
Product: Drivers
Reporter: Stefan Richter (stefanr)
Component: IEEE1394
Assignee: Stefan Richter (stefanr)
Status: CLOSED CODE_FIX    
Severity: normal    
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.15, 2.6.16
Subsystem:
Regression: ---
Bisected commit-id:
Attachments:
 - effectively reverts 2.6.16 patch "Hold the device's parent's lock during probe and remove"
 - an unsane fix
 - another ugly workaround
 - previous patch revised

Description Stefan Richter 2006-06-18 03:40:46 UTC
Most recent kernel where this bug did not occur: unknown
Distribution: Mandrake 10.1
Hardware Environment: x86 K7 uniprocessor
two FireWire host adapters (0: TI based, 1394b, 1: VIA based, 1394a)
Software Environment: gcc 3.4.1, kernel.org's kernel
Problem Description:

When "modprobe ohci1394" is followed shortly by "modprobe -r ohci1394" (e.g. 1
second after the previous modprobe finished), one of the following may happen:

 - kernel panic due to exception in interrupt
   (happened on 2.6.15.x preempt uniprocessor)

   or

 - modprobe -r hangs in D state, as does knodemgrd_0
   (happened on 2.6.16.x preempt SMP on a uniprocessor machine,
   with original drivers as well as with 1394 drivers equivalent
   to 2.6.17-rc6-mm2 --- i.e. 2.6.17 will also be affected).

   The knodemgrd_1 slept interruptibly (S state) while the other
   slept uninterruptibly (D state) right after modprobe -r was
   issued. This happens with or without other nodes attached to
   the FireWire ports, with or without eth1394 loaded.

There is no problem if a longer pause is put between modprobe and modprobe -r,
for example 4 seconds.
Comment 1 Stefan Richter 2006-10-22 04:21:13 UTC
This didn't happen anymore when I tested again with Linux 2.6.19-rc2. I am too
lazy to find out which patch fixed it. At a quick glance, this fix in 2.6.18
looks like it:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=445151932e869fd76b23bccff75ae2a600ccf3c9
Comment 2 Stefan Richter 2006-11-18 15:37:58 UTC
Alas it's not fixed. I never get a panic but the deadlock between the modprobe
process and the knodemgrd kthread does still happen if the timing is right.
Tested with 2.6.19-rc.

# modprobe ohci1394 && modprobe -r ohci1394
works.

# modprobe ohci1394 && sleep 1 && modprobe -r ohci1394
gets stuck in uninterruptible sleep in kthread_stop(). modprobe -r is trying to
stop knodemgrd, which meanwhile sleeps uninterruptibly in
bus_rescan_devices_helper().


Call trace of the modprobe -r context:
    kthread_stop		in kernel/kthread.c
    nodemgr_remove_host		in drivers/ieee1394/nodemgr.c
    __unregister_host		in drivers/ieee1394/highlevel.c
    highlevel_remove_host	in drivers/ieee1394/highlevel.c
    hpsb_remove_host		in drivers/ieee1394/hosts.c
    ohci1394_pci_remove		in drivers/ieee1394/ohci1394.c
    pci_device_remove		in drivers/pci/pci-driver.c
    __device_release_driver	in drivers/base/dd.c
    driver_detach		in drivers/base/dd.c

Call trace of the knodemgrd context:
    bus_rescan_devices_helper	in drivers/base/bus.c
    bus_rescan_devices		in drivers/base/bus.c
    nodemgr_node_probe		in drivers/ieee1394/nodemgr.c
    nodemgr_host_thread		in drivers/ieee1394/nodemgr.c


It seems the following is the culprit:

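Since Linux 2.6.16, bus_rescan_devices_helper takes down(&dev->parent->sem) if a
parent device exists. This is true for all devices that are managed by nodemgr.
(FireWire ud's have ud's or ne's as parent, and FireWire ne's have hosts as
parent.) And yes, the call in driver_detach to __device_release_driver is
enclosed in down(&dev->sem). In effect, modprobe -r holds a device semaphore
while it waits in kthread_stop for knodemgrd, and knodemgrd waits for a parent's
device semaphore in bus_rescan_devices_helper, so neither can proceed.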
Comment 3 Stefan Richter 2006-11-19 03:56:10 UTC
Created attachment 9564
effectively reverts 2.6.16 patch "Hold the device's parent's lock during probe and remove"

As expected, reverting the dev->parent->sem changes made in Linux 2.6.16 avoids
the deadlock. Of course we cannot simply revert it without wreaking havoc in
the USB subsystem.
Comment 4 Stefan Richter 2006-11-19 05:47:36 UTC
Created attachment 9566
an unsane fix

This prevents the deadlock. The lines commented out with // are not required.

Note, although the deadlock is now gone for good according to repeated tests,
there is now a new oops in eth1394, logged as bug 7550. This could have been
the lock-up which I observed with kernels before 2.6.16.
Comment 5 Stefan Richter 2006-11-19 14:40:25 UTC
Discussion on the mailing lists: http://lkml.org/lkml/2006/11/18/140
Comment 6 Stefan Richter 2006-11-20 16:24:52 UTC
Created attachment 9577
another ugly workaround

This patch creates a dummy driver and binds it to all fw-host devices. That
way, bus_rescan_devices_helper will skip fw-host devices and won't block on
their parent's device semaphore.
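
Why binding a driver helps can be shown with a small userspace model. This is a sketch under the assumption that the rescan helper only touches devices without a bound driver, as described above; struct dev_model and rescan_helper_model are hypothetical names, and a counter stands in for the parent's semaphore.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a device; not the kernel's struct device. */
struct dev_model {
    const char *name;
    const char *driver;          /* NULL means no driver is bound */
    struct dev_model *parent;
    int sem_takes;               /* how often this device's semaphore was taken */
};

/*
 * Models the rescan helper: devices that already have a driver are skipped,
 * all others take the parent's semaphore (if there is a parent) before
 * attempting to attach. Returns true if the parent's semaphore was taken.
 */
bool rescan_helper_model(struct dev_model *dev)
{
    if (dev->driver != NULL)
        return false;            /* bound device: nothing to do, no locking */
    if (dev->parent != NULL) {
        dev->parent->sem_takes++;    /* stands in for down(&dev->parent->sem) */
        return true;
    }
    return false;
}
```

With an unbound fw-host model whose parent is the controller's device, the helper takes the parent's semaphore. Once a dummy driver name is set on the host, the helper returns early and never touches it.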
Comment 7 Stefan Richter 2006-11-21 13:51:42 UTC
Created attachment 9585
previous patch revised

Binds a dummy driver named "ieee1394" to fw-host devices. This prevents
fw-hosts from being scanned by the driver core and thus prevents the deadlock.
Comment 8 Stefan Richter 2006-12-10 08:32:13 UTC
The fix was merged and will appear in Linux 2.6.20-rc1.