Bug 200283

Summary: PCI: Data corruption happening due to a race condition
Product: Drivers Reporter: Hari Kishore Vyas (hari.vyas)
Component: PCIAssignee: drivers_pci (drivers_pci)
Status: NEW ---    
Severity: normal CC: hari.vyas
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.18 Tree: Mainline
Regression: No
Attachments: Race condition while calling pci_set_master() and pci_bus_add_device()
Proposed fix to avoid race condition while updating is_busmaster and is_added bits of pci_dev structure

Description Hari Kishore Vyas 2018-06-26 09:58:41 UTC
Created attachment 276877 [details]
Race condition while calling pci_set_master() and pci_bus_add_device()

After removing/re-scanning PCIe nvme device multiple times, encountered "proc_dir entry already registered" message which keep on appearing on console with subsequent re-scan.


#echo 1 > /sys/bus/pci/rescan
[ 2260.042971] pci 0000:01:00.0: BAR 0: assigned [mem 0x400000000-0x400003fff 64bit]
[ 2260.050915] proc_dir_entry '0000:01/00.0' already registered
[ 2260.056765] -----------[ cut here ]-----------
[ 2260.061538] WARNING: CPU: 7 PID: 1966 at /tmp/yocto_builds/11411866_nxs_1.0_ea3_stingray-bcm958742t/poky/build/tmp/work-shared/bcm958742t/kernel-source/fs/proc/generic.c:346 proc_register+0x11c/0x158
[ 2260.079830] Modules linked in: bnxt_re(O) rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core bnxt_en(O) nvmem_bcm_ocotp nvmem_core ipv6
[ 2260.092043] CPU: 7 PID: 1966 Comm: sh Tainted: G W O 4.12.0-rc7+g2d2a228 #1
[ 2260.100209] Hardware name: Stingray SST100 (BCM958742T) (DT)
[ 2260.106047] task: ffff80a1626f8c00 task.stack: ffff80a160320000
[ 2260.112154] PC is at proc_register+0x11c/0x158
[ 2260.116737] LR is at proc_register+0x11c/0x158
[ 2260.121320] pc : [<ffff00000823138c>] lr : [<ffff00000823138c>] pstate: 00000145
[ 2260.128948] sp : ffff80a160323bb0
[ 2260.132365] x29: ffff80a160323bb0 x28: ffff80a1626f8c00
[ 2260.137845] x27: ffff0000087a2000 x26: 0000000000000040
[ 2260.143324] x25: ffff80a164173838 x24: ffff000008c18000
[ 2260.148804] x23: ffff80a15c524485 x22: ffff80a164173800
[ 2260.154283] x21: 0000000000000000 x20: ffff80a15c524400
[ 2260.159762] x19: ffff80a1640f4240 x18: 0000000000000010
[ 2260.165241] x17: 0000000000000000 x16: 0000000000000000
[ 2260.170721] x15: ffffffffffffffff x14: ffff000088be2d6f
[ 2260.176200] x13: ffff000008b28b38 x12: ffff00000846e1b8
[ 2260.181679] x11: 0000000005f5e0ff x10: 0000000000000016
[ 2260.187158] x9 : 00000000ffffffd0 x8 : 6465726574736967
[ 2260.192638] x7 : 6572207964616572 x6 : ffff80a17ffa0d38
[ 2260.198117] x5 : ffff80a17ffa0d38 x4 : 0000000000000000
[ 2260.203596] x3 : ffff80a17ffa5958 x2 : ffff80a17ffa0d38
[ 2260.209074] x1 : 000080a177490000 x0 : 0000000000000030
[ 2260.214554] --[ end trace f7b5e622386491a4 ]--
[ 2260.219315] Call trace:
[ 2260.221837] Exception stack(0xffff80a1603239e0 to 0xffff80a160323b10)
[ 2260.228480] 39e0: ffff80a1640f4240 0001000000000000 ffff80a160323bb0 ffff00000823138c
[ 2260.236557] 3a00: ffff0000087a2000 ffff80a1626f8c00 ffff80a160323a20 0000000000000000
[ 2260.244634] 3a20: ffff80a160323bb0 ffff80a160323bb0 ffff80a160323b70 00000000ffffffc8
[ 2260.252710] 3a40: ffff80a160323a70 ffff0000080e5810 ffff80a160323bb0 ffff80a160323bb0
[ 2260.260787] 3a60: ffff80a160323b70 00000000ffffffc8 ffff80a160323b20 ffff000008159920
[ 2260.268865] 3a80: 0000000000000030 000080a177490000 ffff80a17ffa0d38 ffff80a17ffa5958
[ 2260.276941] 3aa0: 0000000000000000 ffff80a17ffa0d38 ffff80a17ffa0d38 6572207964616572
[ 2260.285018] 3ac0: 6465726574736967 00000000ffffffd0 0000000000000016 0000000005f5e0ff
[ 2260.293094] 3ae0: ffff00000846e1b8 ffff000008b28b38 ffff000088be2d6f ffffffffffffffff
[ 2260.301171] 3b00: 0000000000000000 0000000000000000
[ 2260.306202] [<ffff00000823138c>] proc_register+0x11c/0x158
[ 2260.311860] [<ffff00000823166c>] proc_create_data+0x7c/0xe0
[ 2260.317610] [<ffff0000084272a8>] pci_proc_attach_device+0x78/0xf0
[ 2260.323896] [<ffff00000841565c>] pci_bus_add_device+0x34/0xa8
[ 2260.329823] [<ffff000008415704>] pci_bus_add_devices+0x34/0x80
[ 2260.335840] [<ffff00000841574c>] pci_bus_add_devices+0x7c/0x80
[ 2260.341855] [<ffff000008418528>] pci_rescan_bus+0x28/0x38
[ 2260.347423] [<ffff000008420f64>] bus_rescan_store+0x54/0x78
[ 2260.353171] [<ffff00000848f650>] bus_attr_store+0x20/0x30
[ 2260.358739] [<ffff00000823e064>] sysfs_kf_write+0x3c/0x50
[ 2260.364307] [<ffff00000823d324>] kernfs_fop_write+0x104/0x1f0
[ 2260.370235] [<ffff0000081c3ad8>] __vfs_write+0x18/0x100
[ 2260.375624] [<ffff0000081c4e7c>] vfs_write+0x9c/0x1a8
[ 2260.380833] [<ffff0000081c6490>] SyS_write+0x48/0xb0
[ 2260.385955] [<ffff000008082f8c>] __sys_trace_return+0x0/0x4
[ 2260.391807] nvme nvme0: pci function 0000:01:00.0

However, despite the errors, the device still appears to be reinitialized correctly.

Once we observe this error, it will occur for ever for all next re-scan of that device.

Steps to Reproduce:
1. Insert a Samsung 960 drive into the first PCIe slot.

2. Check it's detected

lspci -s 0000:01:00.0
0000:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
3. Remove the drive via the PCIe interface

echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove

4. Check it's removed

lspci -s 0000:01:00.0

5. Rescan the PCIe bus

echo 1 > /sys/bus/pci/rescan
[ 2677.645883] pci 0000:01:00.0: BAR 0: assigned [mem 0x400000000-0x400003fff 64bit] 
[ 2677.656855] nvme nvme0: pci function 0000:01:00.0

6. Repeat steps 2 through 5 until the error occurs.

Further debugging revealed that there is a race condition while setting is_added and is_busmaster bits of pci_dev structure.

This issue is happening  with multiple times device removal and rescan from sysfs. Card is not removed physically. 

Is_added bit is set after device attach which probe nvme driver. NVMe driver starts one workqueue and that one is calling pci_set_master()  to set is_busmaster bit.
With multiple times device removal and rescan from sysfs,  race condition happens and is_added bit is over-written to 0 from workqueue started by NVMe driver.
Due to this when user tries to remove sysfs entry, it fails in pci_stop_dev() as it checks for is_added bit which was supposed to be 1 but observed as zero and proc,sysfs entries remains.

This causes failure while creating new proc,sysfs entries as part of device addition as proc directory already existing.
 
Sequence 1:
pci_bus_add_device()
{
    device_attach();
   ...
    dev->is_added = 1;
}

Sequence 2:
nvme_probe()
{
...
INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work); ...
}
nvme_reset_work()--->nvme_pci_enable()-->pci_set_master()-->__pci_set_master(true)-->dev->is_busmaster = enable
Comment 1 Hari Kishore Vyas 2018-06-26 10:08:22 UTC
Created attachment 276879 [details]
Proposed fix to avoid race condition while updating is_busmaster and is_added bits  of pci_dev structure

Proposing to use a spin lock while updating is_busmater and is_added bits of pci_dev structure.

As all bit fields sharing same memory location so other solutions could be to use separate independent fields i.s.o. bitfields in include/linux/pci.h. This will need only modification in mentioned pci.h file but at the cost of additional memory but with advantage of no locking mechanism.