Most recent kernel where this bug did not occur: unknown
Distribution: Mandrake 10.1
Hardware Environment: x86 K7 uniprocessor,
two FireWire host adapters (0: TI-based, 1394b; 1: VIA-based, 1394a)
Software Environment: gcc 3.4.1, kernel.org's kernel
When "modprobe ohci1394" is followed shortly by "modprobe -r ohci1394" (e.g. 1
second after the previous modprobe finished), one of the following may happen:
- kernel panic due to an exception in an interrupt
  (happened on 2.6.15.x preempt uniprocessor)
- modprobe -r hangs in D state, as does knodemgrd_0
  (happened on 2.6.16.x preempt SMP on a uniprocessor machine,
  with the original drivers as well as with 1394 drivers equivalent
  to 2.6.17-rc6-mm2, i.e. 2.6.17 will also be affected).
  knodemgrd_1 slept interruptibly (S state) while knodemgrd_0
  slept uninterruptibly (D state) right after modprobe -r was
  issued.
This happens with or without other nodes attached to
the FireWire ports, and with or without eth1394 loaded.
There is no problem if a longer pause is put between modprobe and modprobe -r,
for example 4 seconds.
Didn't happen anymore when I tested again with Linux 2.6.19-rc2. I am too lazy
to find out which patch fixed that. On a quick glance, this fix in 2.6.18 looks
Alas it's not fixed. I never get a panic but the deadlock between the modprobe
process and the knodemgrd kthread does still happen if the timing is right.
Tested with 2.6.19-rc.
# modprobe ohci1394 && modprobe -r ohci1394
# modprobe ohci1394 && sleep 1 && modprobe -r ohci1394
Here modprobe -r gets stuck in uninterruptible sleep in kthread_stop(). It is
trying to stop knodemgrd, which meanwhile sleeps uninterruptibly in
bus_rescan_devices_helper().
Call trace of the modprobe -r context:
kthread_stop in kernel/kthread.c
nodemgr_remove_host in drivers/ieee1394/nodemgr.c
__unregister_host in drivers/ieee1394/highlevel.c
highlevel_remove_host in drivers/ieee1394/highlevel.c
hpsb_remove_host in drivers/ieee1394/hosts.c
ohci1394_pci_remove in drivers/ieee1394/ohci1394.c
pci_device_remove in drivers/pci/pci-driver.c
__device_release_driver in drivers/base/dd.c
driver_detach in drivers/base/dd.c
Call trace of the knodemgrd context:
bus_rescan_devices_helper in drivers/base/bus.c
bus_rescan_devices in drivers/base/bus.c
nodemgr_node_probe in drivers/ieee1394/nodemgr.c
nodemgr_host_thread in drivers/ieee1394/nodemgr.c
It seems the following is the culprit:
Since Linux 2.6.16, bus_rescan_devices_helper does down(&dev->parent->sem) if a
parent device exists. A parent exists for all devices that are managed by
nodemgr. (FireWire unit directories (ud's) have ud's or node entries (ne's) as
parent, and FireWire ne's have hosts as parent.) And yes, the call in
driver_detach to __device_release_driver is enclosed in down(&dev->sem).
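The circular wait described above can be sketched as a small user-space model. This is not kernel code. It assumes that the parent semaphore knodemgrd wants is the same dev->sem that driver_detach already holds; the names has_wait_cycle, removal_deadlocks and the task enum are made up for illustration.

```c
#include <assert.h>

/* Toy model: each task is blocked on at most one other task. */
enum task { MODPROBE_R, KNODEMGRD, NTASKS };

/* waits_for[t] is the task that t is blocked on, or -1 if t can run. */
static int has_wait_cycle(const int waits_for[NTASKS])
{
    for (int start = 0; start < NTASKS; start++) {
        int t = waits_for[start];

        /* Follow the chain of blocked tasks for at most NTASKS hops. */
        for (int hops = 0; hops < NTASKS && t != -1; hops++) {
            assert(t >= 0 && t < NTASKS);
            if (t == start)
                return 1;       /* came back to where we started */
            t = waits_for[t];
        }
    }
    return 0;
}

/*
 * parent_sem_in_rescan != 0 models 2.6.16 and later, where the rescan
 * helper takes the parent device's semaphore (assumption: it is the
 * semaphore held by modprobe -r).
 */
static int removal_deadlocks(int parent_sem_in_rescan)
{
    int waits_for[NTASKS] = { -1, -1 };
    int sem_owner = MODPROBE_R;  /* driver_detach holds dev->sem */

    /* modprobe -r sits in kthread_stop() until knodemgrd exits. */
    waits_for[MODPROBE_R] = KNODEMGRD;

    /* knodemgrd, in the middle of a rescan, needs the parent's semaphore. */
    if (parent_sem_in_rescan)
        waits_for[KNODEMGRD] = sem_owner;

    return has_wait_cycle(waits_for);
}
```

In this model removal_deadlocks(1) reports a cycle, while removal_deadlocks(0), i.e. a rescan that does not touch the parent's semaphore, does not.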
Created attachment 9564
effectively reverts 2.6.16 patch "Hold the device's parent's lock during probe and remove"
As expected, reverting the dev->parent->sem changes made in Linux 2.6.16 avoids
the deadlock. Of course we cannot simply revert it without wreaking havoc in
the USB subsystem.
Created attachment 9566
an unsane fix
This prevents the deadlock. The lines in // are not required.
Note, although the deadlock is now gone for good according to repeated tests,
there is now a new oops in eth1394, logged as bug 7550. This could have been
the lock-up which I observed with kernels before 2.6.16.
Discussion on the mailing lists: http://lkml.org/lkml/2006/11/18/140
Created attachment 9577
another ugly workaround
This patch creates a dummy driver and binds it to all fw-host devices. That
way, bus_rescan_devices_helper will skip fw-host devices and won't block on
their parent's device semaphore.
Created attachment 9585
previous patch revised
Binds a dummy driver named "ieee1394" to fw-host devices. This keeps fw-hosts
from being scanned by the driver core and thereby prevents the deadlock.
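Why binding a driver helps can be illustrated with another user-space toy model. It relies only on the behavior stated in this report: the rescan skips devices that already have a driver, and otherwise takes the parent's semaphore. struct toy_device, rescan_would_block and host_rescan_blocks are hypothetical names, not driver core API.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct device; all names here are hypothetical. */
struct toy_device {
    const char *driver_name;     /* NULL while no driver is bound */
    struct toy_device *parent;
    int sem_held_elsewhere;      /* nonzero if another task holds this device's sem */
};

/* Returns 1 if a rescan of dev would sleep on the parent's semaphore. */
static int rescan_would_block(const struct toy_device *dev)
{
    assert(dev != NULL);
    if (dev->driver_name != NULL)
        return 0;                /* bound devices are skipped by the rescan */
    return dev->parent != NULL && dev->parent->sem_held_elsewhere;
}

/* Models an fw-host below a PCI device whose semaphore modprobe -r holds. */
static int host_rescan_blocks(int bind_dummy_driver)
{
    struct toy_device pci = { "ohci1394", NULL, 1 };
    struct toy_device host = { bind_dummy_driver ? "ieee1394" : NULL, &pci, 0 };

    return rescan_would_block(&host);
}
```

With host_rescan_blocks(0) the unbound fw-host makes the rescan block on the PCI device's semaphore. With host_rescan_blocks(1) the fw-host is skipped and nothing blocks.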
The fix was merged and will appear in Linux 2.6.20-rc1.