Bug 112241

Summary: Under heavy load FC TARGET going to Oops
Product: SCSI Drivers
Reporter: Anthony (anthony.bloodoff)
Component: QLOGIC QLA2XXX
Assignee: scsi_drivers-qla2xxx
Status: NEW
Severity: high
CC: szg00000
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 4.5.0
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Kernel stacktrace
             Screenshot with call trace for kernel 4.5.0

Description Anthony 2016-02-10 07:32:13 UTC
Created attachment 203261
Kernel stacktrace

The storage server runs Fedora Linux with a QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA, exporting an Adaptec RAID6 array with bcache on an Intel SSD to VMware 5.
Under heavy load (for example, migrating a VM off the storage), the system Oopses.


Stack trace attached.
Comment 1 Anthony 2016-02-29 02:24:47 UTC
Created attachment 206351
Screenshot with call trace for kernel 4.5.0
Comment 2 Anthony 2016-02-29 02:26:23 UTC
With kernel 4.5.0 on the target, the system hangs after clients connect to the target.
Comment 3 nab 2016-03-01 05:16:10 UTC
Hi Anthony,

On Mon, 2016-02-29 at 02:26 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=112241
> 
> Anthony <anthony.bloodoff@gmail.com> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>      Kernel Version|4.3.3                       |4.5.0
> 
> --- Comment #2 from Anthony <anthony.bloodoff@gmail.com> ---
> With kernel 4.5.0 on target, system hang after clients connects to target.
> 

So there are two things going on here.

First, the BUG_ON your ESX <-> LIO FC setup triggered has been addressed
recently in v4.5-rc4 and later kernels with the following series:

http://www.spinics.net/lists/target-devel/msg11822.html

Note that these patches will be backported to earlier stable kernels
over the next few weeks.

However, this specific bug is the final consequence of a larger ESX
v5.5u2+ host-side issue: AtomicTestAndSet (ATS) heartbeat is enabled (by
default) for all VMFS5 mounts:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956

Other folks have been hitting this recently, here's some extra
background:

http://permalink.gmane.org/gmane.linux.scsi.target.devel/11574
http://www.spinics.net/lists/target-devel/msg12124.html

Note that this affects all targets with VAAI ATS (including EMC, IBM, 3PAR,
SolidFire, etc.), and the current solution for ESX v5.5u2+ is to either:

  - Explicitly disable ATS heartbeat usage on all VMFS5 mounts as
    described in the VMware KB article, or:
  - Explicitly disable all ATS logic completely from LIO using 
    emulate_caw=0 on all backends connected to ESX v5.5u2+ hosts
    with VMFS5.
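
For the second option, here is a minimal sketch (not part of the original
reply) of clearing emulate_caw on every backend through configfs. It assumes
configfs is mounted at /sys/kernel/config and that each backend exposes its
attributes under target/core/<hba>/<device>/attrib/. The function name
disable_caw is made up for illustration.

```python
import glob
import os

# Assumed default location of the LIO backend tree in configfs.
CONFIGFS_CORE = "/sys/kernel/config/target/core"


def disable_caw(core_root=CONFIGFS_CORE):
    """Set emulate_caw to 0 on every backend under core_root.

    Returns the list of attribute files that were changed.
    """
    changed = []
    pattern = os.path.join(core_root, "*", "*", "attrib", "emulate_caw")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            current = f.read().strip()
        if current != "0":
            # Equivalent to: echo 0 > .../attrib/emulate_caw
            with open(path, "w") as f:
                f.write("0")
            changed.append(path)
    return changed
```

Calling disable_caw() as root on the target applies the change to all
backends, not only those exported to ESX, so narrow the glob pattern if that
is too broad. The change probably also needs to be saved with whatever tool
manages the target configuration, or it is likely to be lost on reboot.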

You can google for 'esx ats heartbeat bug' to see the gory details.

Thanks for reporting!

--nab