Bug 95101
Summary: | scsi/mpt2sas: setpci reset results in kernel oops | | |
---|---|---|---|
Product: | Drivers | Reporter: | Nagarajkumar N (nagarajkumar.narayanan) |
Component: | Other | Assignee: | drivers_other |
Status: | NEW | | |
Severity: | high | CC: | joe.lawrence, linux-scsi, nagarajkumar.narayanan |
Priority: | P1 | | |
Hardware: | x86-64 | | |
OS: | Linux | | |
Kernel Version: | 3.19.1 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | var/log/messages log on kernel oops followed by setpci reset; mpt2sas pci resource removal synchronization patch | | |
From the /var/log/messages attachment:

Mar 16 01:13:10 RHEL63 kernel: mpt2sas1: _base_fault_reset_work: Running mpt2sas_dead_ioc thread success !!!!
...
Mar 16 01:13:20 RHEL63 kernel: mpt2sas1: _scsih_ir_shutdown: timeout
Mar 16 01:13:20 RHEL63 kernel: mpt2sas1: removing handle(0x0024), wwid(0x0c4e8a1c03a9b742)

indicates that _scsih_remove was called when the driver's watchdog detected that the device was misbehaving. Driver device removal invokes:

mpt2sas_base_detach
  mpt2sas_base_free_resources
    iounmap(ioc->chip)

setting the stage for the crash:

Mar 16 01:13:37 RHEL63 kernel: BUG: unable to handle kernel paging request at ffffc900171e0000
Mar 16 01:13:37 RHEL63 kernel: IP: [<ffffffffa00502e0>] mpt2sas_base_get_iocstate+0x10/0x30 [mpt2sas]
...
Mar 16 01:13:37 RHEL63 kernel: RAX: ffffc900171e0000 RBX: ffff88105a0aa788 RCX: 0000000000004fdc

where mpt2sas_base_get_iocstate was probably calling readl(&ioc->chip->Doorbell)

So it would seem that the mpt2sas ioctl code (step 5, I think) isn't synchronized against device removal.

Created attachment 172971
mpt2sas pci resource removal synchronization patch
The attached patch needs to be applied to scsi/mpt2sas on the latest Linux main git branch.
The patch provides synchronization between the cli, BRM status show, and PCI resource removal paths. It does so through mutex lock and spinlock protection on the linked list of controllers (multiple Warpdrive cards are used) when controller resources are removed and added.
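As a rough illustration of the idea only, and not the attached patch itself, the sketch below models the race in userspace C with a pthread mutex. The names fake_ioc, fake_ioc_init, fake_ioc_read_doorbell and fake_ioc_remove are hypothetical, and a heap allocation stands in for the ioremap()ed register window. The reader path checks the pointer under the same lock that the removal path holds while unmapping, so a late reader gets -ENODEV instead of touching freed memory.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one controller; not the driver's real adapter struct. */
struct fake_ioc {
    pthread_mutex_t lock;     /* serializes register access against removal */
    volatile uint32_t *chip;  /* stands in for the ioremap()ed register window */
};

/* Probe analogue: set up the lock and "map" the registers. */
int fake_ioc_init(struct fake_ioc *ioc)
{
    assert(ioc != NULL);
    if (pthread_mutex_init(&ioc->lock, NULL) != 0)
        return -ENOMEM;
    ioc->chip = calloc(1, sizeof(*ioc->chip));
    if (ioc->chip == NULL) {
        pthread_mutex_destroy(&ioc->lock);
        return -ENOMEM;
    }
    return 0;
}

/* Reader analogue (ioctl or sysfs path): dereference chip only under the lock. */
int fake_ioc_read_doorbell(struct fake_ioc *ioc, uint32_t *out)
{
    int rc = -ENODEV;

    assert(ioc != NULL && out != NULL);
    pthread_mutex_lock(&ioc->lock);
    if (ioc->chip != NULL) {
        *out = *ioc->chip;
        rc = 0;
    }
    pthread_mutex_unlock(&ioc->lock);
    return rc;
}

/*
 * Removal analogue: release the mapping under the same lock and clear the
 * pointer. The mutex is deliberately left initialized so that late readers
 * can still take it and observe that the device is gone.
 */
void fake_ioc_remove(struct fake_ioc *ioc)
{
    assert(ioc != NULL);
    pthread_mutex_lock(&ioc->lock);
    free((void *)ioc->chip);
    ioc->chip = NULL;
    pthread_mutex_unlock(&ioc->lock);
}
```

In the kernel the equivalent check would have to cover every path that reads ioc->chip, and the choice between a mutex and a spinlock depends on whether the path may sleep; the sketch ignores both points.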
Created attachment 171201
var/log/messages log on kernel oops followed by setpci reset

Hardware: LSI/SEAGATE Nytro Warpdrive Flash cards (Device ID: 007E)
Platform: x86_64
driver: mpt2sas

Issuing setpci reset on a Nytro Warpdrive results in kernel oops [000] while doing the following in parallel:
(i) running cli commands through the ioctl path (ddcli tool)
(ii) accessing BRM status through the sysfs path
(iii) running I/O to the Nytro Warpdrive using the dd command (/dev/sdx)

This happens because the setpci reset results in freeing resources while the cli and BRM status accesses are still trying to use those resources; synchronization is missing.

Steps to reproduce:
1. Connect the LSI card in the x86 64 bit server
2. Power on the server
3. Check lspci -vt to get the PCI address of the Nytro Warpdrive
4. Check for mpt2sas driver related information
5. Run the ddoemcli tool's list command in a loop (ioctl path) in the background
   script:
   for i in `seq 1 10000`; do
     ddoemcli -c 1 -list
   done
6. Run the BRM_status query in a loop over the sysfs path in the background
   script:
   #!/bin/bash
   while [ : ]
   do
     find /sys -name BRM_status | xargs cat
   done
7. Run I/O in parallel:
   for i in `seq 1 40`; do
     dd if=/dev/sdb of=/dev/null bs=10M&
   done
8. Issue setpci reset:
   setpci -s 0000:80:03.0 0xa0.l=0x308200d0
   (where 0000:80:03.0 is the PCI address obtained from lspci -vt)

On executing the setpci reset, an oops [000] message appears in /var/log/messages. Logs for the kernel oops are attached.