Created attachment 107070: panic dmesg

There is a potential deadlock when the PCI sysfs user interfaces are manipulated from different bus hierarchies simultaneously, as follows:

path1: sysfs remove device:                   | path2: sysfs rescan device:
sysfs_schedule_callback_work()                | sysfs_write_file()
remove_callback()                             | flush_write_buffer()
*1* mutex_lock(&pci_remove_rescan_mutex)      | *2* sysfs_get_active(attr_sd)
...                                           | dev_attr_store()
device_remove_file()                          | dev_rescan_store()
...                                           | *4* mutex_lock(&pci_remove_rescan_mutex)
*3* sysfs_deactivate(sd)                      | ...
wait_for_completion()                         | *5* sysfs_put_active(attr_sd)
*6* mutex_unlock(&pci_remove_rescan_mutex)    |

If path1 takes pci_remove_rescan_mutex at *1*, and path2 then starts and reaches *2* before path1 reaches *3*, the two paths deadlock: path1 holds the mutex while waiting at *3* for path2 to decrement the sysfs_dirent's s_active count at *5*, but path2 is blocked at *4* trying to take pci_remove_rescan_mutex. Path1 will not release the mutex until it reaches *6*, and it never gets there because it is blocked at *3*.

Running the following two scripts simultaneously reproduces the issue:

1.
for i in `seq 20`; do
    echo $i
    echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove
    echo -n 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan
done

2.
for i in `seq 20000`; do
    echo $i
    sleep 0.1
    echo -n 1 > /sys/bus/pci/devices/0000\:1c\:00.1/remove
    echo -n 1 > /sys/bus/pci/devices/0000\:1c\:00.0/remove
    echo -n 1 > /sys/bus/pci/devices/0000\:1a\:01.0/rescan
done

The panic dmesg output and the PCI tree info are attached.
Created attachment 107071: pci tree infos
Gu, has this been fixed? I suspect that Rafael's work, e.g.,

  f41b32613138 ACPI / hotplug / PCI: Move PCI rescan-remove locking to hotplug_event()
  d42f5da23400 ACPI / hotplug / PCI: Scan root bus under the PCI rescan-remove lock

might have fixed this.
(In reply to Bjorn Helgaas from comment #2)
Hi Bjorn,

I'll confirm this in the coming week; the box is on other duty now.

Thanks,
Gu

> Gu, has this been fixed? I suspect that Rafael's work, e.g.,
>
> f41b32613138 ACPI / hotplug / PCI: Move PCI rescan-remove locking to
> hotplug_event()
> d42f5da23400 ACPI / hotplug / PCI: Scan root bus under the PCI
> rescan-remove lock

Yeah, the code looks like it could fix this issue.

>
> might have fixed this.