Bug 200283 - PCI: Data corruption happening due to a race condition
Summary: PCI: Data corruption happening due to a race condition
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-06-26 09:58 UTC by Hari Kishore Vyas
Modified: 2018-06-26 10:08 UTC (History)
1 user (show)

See Also:
Kernel Version: 4.18
Tree: Mainline
Regression: No


Attachments
Race condition while calling pci_set_master() and pci_bus_add_device() (1.94 KB, text/plain)
2018-06-26 09:58 UTC, Hari Kishore Vyas
Details
Proposed fix to avoid race condition while updating is_busmaster and is_added bits of pci_dev structure (4.66 KB, application/mbox)
2018-06-26 10:08 UTC, Hari Kishore Vyas
Details

Description Hari Kishore Vyas 2018-06-26 09:58:41 UTC
Created attachment 276877 [details]
Race condition while calling pci_set_master() and pci_bus_add_device()

After removing/re-scanning PCIe nvme device multiple times, encountered "proc_dir entry already registered" message which keep on appearing on console with subsequent re-scan.


#echo 1 > /sys/bus/pci/rescan
[ 2260.042971] pci 0000:01:00.0: BAR 0: assigned [mem 0x400000000-0x400003fff 64bit]
[ 2260.050915] proc_dir_entry '0000:01/00.0' already registered
[ 2260.056765] -----------[ cut here ]-----------
[ 2260.061538] WARNING: CPU: 7 PID: 1966 at /tmp/yocto_builds/11411866_nxs_1.0_ea3_stingray-bcm958742t/poky/build/tmp/work-shared/bcm958742t/kernel-source/fs/proc/generic.c:346 proc_register+0x11c/0x158
[ 2260.079830] Modules linked in: bnxt_re(O) rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core bnxt_en(O) nvmem_bcm_ocotp nvmem_core ipv6
[ 2260.092043] CPU: 7 PID: 1966 Comm: sh Tainted: G W O 4.12.0-rc7+g2d2a228 #1
[ 2260.100209] Hardware name: Stingray SST100 (BCM958742T) (DT)
[ 2260.106047] task: ffff80a1626f8c00 task.stack: ffff80a160320000
[ 2260.112154] PC is at proc_register+0x11c/0x158
[ 2260.116737] LR is at proc_register+0x11c/0x158
[ 2260.121320] pc : [<ffff00000823138c>] lr : [<ffff00000823138c>] pstate: 00000145
[ 2260.128948] sp : ffff80a160323bb0
[ 2260.132365] x29: ffff80a160323bb0 x28: ffff80a1626f8c00
[ 2260.137845] x27: ffff0000087a2000 x26: 0000000000000040
[ 2260.143324] x25: ffff80a164173838 x24: ffff000008c18000
[ 2260.148804] x23: ffff80a15c524485 x22: ffff80a164173800
[ 2260.154283] x21: 0000000000000000 x20: ffff80a15c524400
[ 2260.159762] x19: ffff80a1640f4240 x18: 0000000000000010
[ 2260.165241] x17: 0000000000000000 x16: 0000000000000000
[ 2260.170721] x15: ffffffffffffffff x14: ffff000088be2d6f
[ 2260.176200] x13: ffff000008b28b38 x12: ffff00000846e1b8
[ 2260.181679] x11: 0000000005f5e0ff x10: 0000000000000016
[ 2260.187158] x9 : 00000000ffffffd0 x8 : 6465726574736967
[ 2260.192638] x7 : 6572207964616572 x6 : ffff80a17ffa0d38
[ 2260.198117] x5 : ffff80a17ffa0d38 x4 : 0000000000000000
[ 2260.203596] x3 : ffff80a17ffa5958 x2 : ffff80a17ffa0d38
[ 2260.209074] x1 : 000080a177490000 x0 : 0000000000000030
[ 2260.214554] --[ end trace f7b5e622386491a4 ]--
[ 2260.219315] Call trace:
[ 2260.221837] Exception stack(0xffff80a1603239e0 to 0xffff80a160323b10)
[ 2260.228480] 39e0: ffff80a1640f4240 0001000000000000 ffff80a160323bb0 ffff00000823138c
[ 2260.236557] 3a00: ffff0000087a2000 ffff80a1626f8c00 ffff80a160323a20 0000000000000000
[ 2260.244634] 3a20: ffff80a160323bb0 ffff80a160323bb0 ffff80a160323b70 00000000ffffffc8
[ 2260.252710] 3a40: ffff80a160323a70 ffff0000080e5810 ffff80a160323bb0 ffff80a160323bb0
[ 2260.260787] 3a60: ffff80a160323b70 00000000ffffffc8 ffff80a160323b20 ffff000008159920
[ 2260.268865] 3a80: 0000000000000030 000080a177490000 ffff80a17ffa0d38 ffff80a17ffa5958
[ 2260.276941] 3aa0: 0000000000000000 ffff80a17ffa0d38 ffff80a17ffa0d38 6572207964616572
[ 2260.285018] 3ac0: 6465726574736967 00000000ffffffd0 0000000000000016 0000000005f5e0ff
[ 2260.293094] 3ae0: ffff00000846e1b8 ffff000008b28b38 ffff000088be2d6f ffffffffffffffff
[ 2260.301171] 3b00: 0000000000000000 0000000000000000
[ 2260.306202] [<ffff00000823138c>] proc_register+0x11c/0x158
[ 2260.311860] [<ffff00000823166c>] proc_create_data+0x7c/0xe0
[ 2260.317610] [<ffff0000084272a8>] pci_proc_attach_device+0x78/0xf0
[ 2260.323896] [<ffff00000841565c>] pci_bus_add_device+0x34/0xa8
[ 2260.329823] [<ffff000008415704>] pci_bus_add_devices+0x34/0x80
[ 2260.335840] [<ffff00000841574c>] pci_bus_add_devices+0x7c/0x80
[ 2260.341855] [<ffff000008418528>] pci_rescan_bus+0x28/0x38
[ 2260.347423] [<ffff000008420f64>] bus_rescan_store+0x54/0x78
[ 2260.353171] [<ffff00000848f650>] bus_attr_store+0x20/0x30
[ 2260.358739] [<ffff00000823e064>] sysfs_kf_write+0x3c/0x50
[ 2260.364307] [<ffff00000823d324>] kernfs_fop_write+0x104/0x1f0
[ 2260.370235] [<ffff0000081c3ad8>] __vfs_write+0x18/0x100
[ 2260.375624] [<ffff0000081c4e7c>] vfs_write+0x9c/0x1a8
[ 2260.380833] [<ffff0000081c6490>] SyS_write+0x48/0xb0
[ 2260.385955] [<ffff000008082f8c>] __sys_trace_return+0x0/0x4
[ 2260.391807] nvme nvme0: pci function 0000:01:00.0

However, despite the errors, the device still appears to be reinitialized correctly.

Once we observe this error, it will occur for ever for all next re-scan of that device.

Steps to Reproduce:
1. Insert a Samsung 960 drive into the first PCIe slot.

2. Check it's detected

lspci -s 0000:01:00.0
0000:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
3. Remove the drive via the PCIe interface

echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove

4. Check it's removed

lspci -s 0000:01:00.0

5. Rescan the PCIe bus

echo 1 > /sys/bus/pci/rescan
[ 2677.645883] pci 0000:01:00.0: BAR 0: assigned [mem 0x400000000-0x400003fff 64bit] 
[ 2677.656855] nvme nvme0: pci function 0000:01:00.0

6. Repeat steps 2 through 5 until the error occurs.

Further debugging revealed that there is a race condition while setting is_added and is_busmaster bits of pci_dev structure.

This issue is happening  with multiple times device removal and rescan from sysfs. Card is not removed physically. 

Is_added bit is set after device attach which probe nvme driver. NVMe driver starts one workqueue and that one is calling pci_set_master()  to set is_busmaster bit.
With multiple times device removal and rescan from sysfs,  race condition happens and is_added bit is over-written to 0 from workqueue started by NVMe driver.
Due to this when user tries to remove sysfs entry, it fails in pci_stop_dev() as it checks for is_added bit which was supposed to be 1 but observed as zero and proc,sysfs entries remains.

This causes failure while creating new proc,sysfs entries as part of device addition as proc directory already existing.
 
Sequence 1:
pci_bus_add_device()
{
    device_attach();
   ...
    dev->is_added = 1;
}

Sequence 2:
nvme_probe()
{
...
INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work); ...
}
nvme_reset_work()--->nvme_pci_enable()-->pci_set_master()-->__pci_set_master(true)-->dev->is_busmaster = enable
Comment 1 Hari Kishore Vyas 2018-06-26 10:08:22 UTC
Created attachment 276879 [details]
Proposed fix to avoid race condition while updating is_busmaster and is_added bits  of pci_dev structure

Proposing to use a spin lock while updating is_busmater and is_added bits of pci_dev structure.

As all bit fields sharing same memory location so other solutions could be to use separate independent fields i.s.o. bitfields in include/linux/pci.h. This will need only modification in mentioned pci.h file but at the cost of additional memory but with advantage of no locking mechanism.

Note You need to log in before you can comment on or make changes to this bug.