Most recent kernel where this bug did not occur: unknown
Distribution: Mandrake 10.1
Hardware Environment: x86 K7 uniprocessor, two FireWire host adapters (0: TI based, 1394b; 1: VIA based, 1394a)
Software Environment: gcc 3.4.1, kernel.org's kernel

Problem Description:
When "modprobe ohci1394" is followed shortly by "modprobe -r ohci1394" (e.g. 1 second after the previous modprobe finished), one of the following may happen:
- kernel panic due to exception in interrupt (happened on 2.6.15.x preempt uniprocessor), or
- modprobe -r hangs in D state, as does knodemgrd_0 (happened on 2.6.16.x preempt SMP on a uniprocessor machine, with original drivers as well as with 1394 drivers equivalent to 2.6.17-rc6-mm2, i.e. 2.6.17 will also be affected). The knodemgrd_1 slept interruptibly (S state) while the other slept uninterruptibly (D state) right after modprobe -r was issued.

This happens with or without other nodes attached to the FireWire ports, with or without eth1394 loaded. There is no problem if a longer pause is put between modprobe and modprobe -r, for example 4 seconds.
It didn't happen anymore when I tested again with Linux 2.6.19-rc2. I am too lazy to find out which patch fixed it. At a quick glance, this fix in 2.6.18 looks like it: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=445151932e869fd76b23bccff75ae2a600ccf3c9
Alas it's not fixed. I never get a panic, but the deadlock between the modprobe process and the knodemgrd kthread still happens if the timing is right. Tested with 2.6.19-rc.

    # modprobe ohci1394 && modprobe -r ohci1394
works.
    # modprobe ohci1394 && sleep 1 && modprobe -r ohci1394
gets stuck in uninterruptible sleep on kthread_stop(). This is trying to stop the knodemgrd, which meanwhile sleeps uninterruptibly in bus_rescan_devices_helper().

Call trace of the modprobe -r context:
    kthread_stop in kernel/kthread.c
    nodemgr_remove_host in drivers/ieee1394/nodemgr.c
    __unregister_host in drivers/ieee1394/highlevel.c
    highlevel_remove_host in drivers/ieee1394/highlevel.c
    hpsb_remove_host in drivers/ieee1394/hosts.c
    ohci1394_pci_remove in drivers/ieee1394/ohci1394.c
    pci_device_remove in pci/pci-driver.c
    __device_release_driver in drivers/base/dd.c
    driver_detach in drivers/base/dd.c

Call trace of the knodemgrd context:
    bus_rescan_devices_helper in drivers/base/bus.c
    bus_rescan_devices in drivers/base/bus.c
    nodemgr_node_probe in drivers/ieee1394/nodemgr.c
    nodemgr_host_thread in drivers/ieee1394/nodemgr.c

It seems the following is the culprit: Since Linux 2.6.16, bus_rescan_devices_helper takes down(&dev->parent->sem) if a parent device exists. This is true for all devices that are managed by nodemgr. (FireWire ud's have ud's or ne's as parent, and FireWire ne's have hosts as parent.) And yes, the call in driver_detach to __device_release_driver is enclosed in down(&dev->sem).
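The lock ordering described above can be modelled in user space. This is only a rough sketch under stated assumptions, not kernel code: a pthread mutex stands in for the device semaphore, pthread_join() stands in for kthread_stop(), and the names parent_sem, knodemgrd_model and rescan_blocks_during_remove are made up for illustration. The model thread uses a trylock so that the conflict is reported instead of hanging the program.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the device semaphore that the remove path
 * holds and that the rescan wants to take as dev->parent->sem. */
static pthread_mutex_t parent_sem = PTHREAD_MUTEX_INITIALIZER;

/* Models knodemgrd in bus_rescan_devices_helper(). The real helper calls
 * down() and sleeps; the model only tries the lock and records the result. */
static void *knodemgrd_model(void *result)
{
    int rc = pthread_mutex_trylock(&parent_sem);

    if (rc == 0)
        pthread_mutex_unlock(&parent_sem);
    *(int *)result = rc;
    return NULL;
}

/* Models the modprobe -r path. Returns 1 if the rescan would have had to
 * wait for a lock that the remover keeps holding while it waits for the
 * thread to exit, 0 if there was no conflict, -1 on setup failure. */
int rescan_blocks_during_remove(void)
{
    pthread_t tid;
    int rc = -1;

    /* driver_detach: the semaphore is taken before the driver is released */
    pthread_mutex_lock(&parent_sem);

    if (pthread_create(&tid, NULL, knodemgrd_model, &rc) != 0) {
        pthread_mutex_unlock(&parent_sem);
        return -1;
    }

    /* kthread_stop: wait for the thread to finish, lock still held */
    pthread_join(tid, NULL);

    pthread_mutex_unlock(&parent_sem);
    return rc == EBUSY;
}
```

Calling rescan_blocks_during_remove() from a small main() should report the conflict. With a blocking pthread_mutex_lock() in the model thread, the pthread_join() would never return, which corresponds to the two D-state tasks observed here.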
Created attachment 9564
effectively reverts 2.6.16 patch "Hold the device's parent's lock during probe and remove"

As expected, reverting the dev->parent->sem changes made in Linux 2.6.16 avoids the deadlock. Of course we cannot simply revert it without wreaking havoc in the USB subsystem.
Created attachment 9566
an unsane fix

This prevents the deadlock. The lines in // are not required. Note, although the deadlock is now gone for good according to repeated tests, there is now a new oops in eth1394, logged as bug 7550. This could have been the lock-up which I observed with kernels before 2.6.16.
Discussion on the mailing lists: http://lkml.org/lkml/2006/11/18/140
Created attachment 9577
another ugly workaround

This patch creates a dummy driver and binds it to all fw-host devices. That way, bus_rescan_devices_helper will skip fw-host devices and won't block on their parent's device semaphore.
Created attachment 9585
previous patch revised

Binds a dummy driver named "ieee1394" to fw-host devices. This keeps the driver core from scanning fw-hosts, which in turn prevents the deadlock.
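Why binding a driver helps can be shown with a small user-space model of the decision the rescan helper makes. This is a sketch under assumptions, not the patch itself: struct model_device, its fields and rescan_would_block are hypothetical names, and a plain flag replaces the real semaphore.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct device. */
struct model_device {
    const char *driver;            /* NULL while no driver is bound */
    struct model_device *parent;   /* NULL for a device without parent */
    bool sem_held;                 /* true while another context holds this device's semaphore */
};

/* Models the decision in bus_rescan_devices_helper(): a device that
 * already has a driver is skipped, otherwise the parent's semaphore
 * is needed before probing. Returns true if the rescan would have to
 * wait on a semaphore that is currently held. */
bool rescan_would_block(const struct model_device *dev)
{
    if (dev->driver != NULL)
        return false;              /* bound device: skipped, no lock taken */

    return dev->parent != NULL && dev->parent->sem_held;
}
```

In this model, a fw-host without a driver whose parent's semaphore is held by the remove path would block, while the same fw-host with the dummy "ieee1394" driver bound is skipped.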
The fix was merged and will appear in Linux 2.6.20-rc1.