Bug 95101 - scsi/mpt2sas: setpci reset results in kernel oops
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 high
Assignee: drivers_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-03-19 17:21 UTC by Nagarajkumar N
Modified: 2015-04-02 05:58 UTC
CC List: 3 users

See Also:
Kernel Version: 3.19.1
Subsystem:
Regression: No
Bisected commit-id:


Attachments
var/log/messages log on kernel oops followed by setpci reset (19.08 KB, application/octet-stream)
2015-03-19 17:21 UTC, Nagarajkumar N
mpt2sas pci resource removal synchronization patch (11.22 KB, application/octet-stream)
2015-04-02 05:58 UTC, Nagarajkumar N

Description Nagarajkumar N 2015-03-19 17:21:10 UTC
Created attachment 171201
var/log/messages log on kernel oops followed by setpci reset

Hardware: LSI/Seagate Nytro WarpDrive flash card (Device ID: 007E)
Platform: x86_64
driver: mpt2sas

Issuing a setpci reset on the Nytro WarpDrive results in a kernel oops [000] while the following run in parallel:
(i) CLI commands through the ioctl path (ddcli tool)
(ii) BRM status reads through the sysfs path
(iii) I/O against the Nytro WarpDrive using the dd command (/dev/sdx)

This happens because the setpci reset frees resources while the CLI and BRM status accesses are still trying to use them; synchronization between these paths is missing.

Steps to reproduce:
1. Install the LSI card in the x86 64-bit server
2. Power on the server
3. Check lspci -vt to get the PCI address of the Nytro WarpDrive
4. Check for mpt2sas driver related information
5. Run the ddoemcli tool's list command in a loop (ioctl path) in the background
   script:
   for i in `seq 1 10000`; do
	ddoemcli -c 1 -list
   done
6. Run the BRM_status query in a loop over the sysfs path in the background
   script:
   #!/bin/bash
   while [ : ]
   do
   find /sys -name BRM_status | xargs cat
   done
7. Run dd reads against the device in the background
   script:
   for i in `seq 1 40`; do
	dd if=/dev/sdb of=/dev/null bs=10M &
   done
8. Issue the setpci reset
   setpci -s 0000:80:03.0 0xa0.l=0x308200d0   (where 0000:80:03.0 is the PCI address obtained from lspci -vt)

After the setpci reset is executed, the oops [000] message appears in /var/log/messages.

Logs for the kernel oops are attached.
Comment 1 Joe Lawrence 2015-03-19 21:46:04 UTC
From the /var/log/messages attachment:

Mar 16 01:13:10 RHEL63 kernel: mpt2sas1: _base_fault_reset_work: Running mpt2sas_dead_ioc thread success !!!!
...
Mar 16 01:13:20 RHEL63 kernel: mpt2sas1: _scsih_ir_shutdown: timeout
Mar 16 01:13:20 RHEL63 kernel: mpt2sas1: removing handle(0x0024), wwid(0x0c4e8a1c03a9b742)

indicates that _scsih_remove was called when the driver's watchdog detected that the device was misbehaving.  Driver device removal invokes:

mpt2sas_base_detach
  mpt2sas_base_free_resources
    iounmap(ioc->chip)

setting the stage for the crash:

Mar 16 01:13:37 RHEL63 kernel: BUG: unable to handle kernel paging request at ffffc900171e0000                             
Mar 16 01:13:37 RHEL63 kernel: IP: [<ffffffffa00502e0>] mpt2sas_base_get_iocstate+0x10/0x30 [mpt2sas]
...
Mar 16 01:13:37 RHEL63 kernel: RAX: ffffc900171e0000 RBX: ffff88105a0aa788 RCX: 0000000000004fdc

where mpt2sas_base_get_iocstate was probably calling readl(&ioc->chip->Doorbell)

So it would seem that the mpt2sas ioctl code (step 5, I think) isn't synchronized against device removal.
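
To illustrate the kind of guard that appears to be missing, here is a minimal user-space sketch, not the driver code. The names fake_ioc, fake_free_resources and fake_get_iocstate are hypothetical, a pthread mutex stands in for whatever kernel lock the driver would use, and a NULL pointer models the unmapped ioc->chip. The reader takes the same lock as the removal path and checks the mapping before touching it.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; these are not the mpt2sas structures. */
struct fake_ioc {
    pthread_mutex_t lock;     /* serializes register access against removal */
    volatile uint32_t *chip;  /* stands in for the ioremap()ed register block */
};

/* Removal path: drop the mapping while holding the lock (models iounmap). */
void fake_free_resources(struct fake_ioc *ioc)
{
    pthread_mutex_lock(&ioc->lock);
    ioc->chip = NULL;
    pthread_mutex_unlock(&ioc->lock);
}

/*
 * ioctl/sysfs path: returns 0 and stores the register value in *state,
 * or returns -1 if the mapping has already been torn down.
 */
int fake_get_iocstate(struct fake_ioc *ioc, uint32_t *state)
{
    int rc = -1;

    pthread_mutex_lock(&ioc->lock);
    if (ioc->chip != NULL) {
        *state = *ioc->chip;  /* models readl(&ioc->chip->Doorbell) */
        rc = 0;
    }
    pthread_mutex_unlock(&ioc->lock);
    return rc;
}
```

With a guard like this, a reader that loses the race gets an error return instead of dereferencing an unmapped address. Whether a mutex is appropriate in every calling context of the real driver is an open question that this sketch does not address.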
Comment 2 Nagarajkumar N 2015-04-02 05:58:32 UTC
Created attachment 172971
mpt2sas pci resource removal synchronization patch

The attached patch needs to be applied on the latest Linux main git branch, under scsi/mpt2sas.
The patch provides synchronization between the CLI path, the BRM status show path, and the PCI resource removal path. It uses mutex lock and spinlock protection on the linked list of controllers (multiple WarpDrive cards are used) when controller resources are removed and added.
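
For readers without the attachment, the general idea of a lock-protected controller list can be sketched in user space as follows. This is not the attached patch: fake_ctrl, ctrl_add, ctrl_remove and ctrl_present are hypothetical names, and a single pthread mutex stands in for the mutex and spinlock that the patch description mentions.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical controller entry; the real driver keeps far more state. */
struct fake_ctrl {
    int id;
    struct fake_ctrl *next;
};

/* One lock guards every walk and every modification of the list. */
static pthread_mutex_t ctrl_list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct fake_ctrl *ctrl_list;

/* Addition path: link a controller at the head of the list. */
void ctrl_add(struct fake_ctrl *c)
{
    pthread_mutex_lock(&ctrl_list_lock);
    c->next = ctrl_list;
    ctrl_list = c;
    pthread_mutex_unlock(&ctrl_list_lock);
}

/* Removal path: unlink the controller so later lookups cannot find it. */
void ctrl_remove(struct fake_ctrl *c)
{
    pthread_mutex_lock(&ctrl_list_lock);
    for (struct fake_ctrl **pp = &ctrl_list; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == c) {
            *pp = c->next;
            break;
        }
    }
    pthread_mutex_unlock(&ctrl_list_lock);
}

/* CLI/sysfs path: returns 1 if a controller with this id is still listed. */
int ctrl_present(int id)
{
    int found = 0;

    pthread_mutex_lock(&ctrl_list_lock);
    for (struct fake_ctrl *c = ctrl_list; c != NULL; c = c->next) {
        if (c->id == id) {
            found = 1;
            break;
        }
    }
    pthread_mutex_unlock(&ctrl_list_lock);
    return found;
}
```

Because lookups and removals take the same lock, a lookup either sees the controller fully linked or does not see it at all. A real fix would also have to keep the controller alive while a caller is still using it after the lookup, which this sketch leaves out.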
