Bug 112241 - Under heavy load FC TARGET going to Oops
Summary: Under heavy load FC TARGET going to Oops
Status: NEW
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: QLOGIC QLA2XXX
Hardware: Intel Linux
Importance: P1 high
Assignee: scsi_drivers-qla2xxx
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-02-10 07:32 UTC by Anthony
Modified: 2016-03-01 05:16 UTC
CC List: 1 user

See Also:
Kernel Version: 4.5.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Kernel stacktrace (3.60 KB, text/plain)
2016-02-10 07:32 UTC, Anthony
Screenshot with call trace for kernel 4.5.0 (39.67 KB, image/png)
2016-02-29 02:24 UTC, Anthony

Description Anthony 2016-02-10 07:32:13 UTC
Created attachment 203261
Kernel stacktrace

Storage server running Fedora Linux with a QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA, exporting an Adaptec RAID6 array with bcache on an Intel SSD to VMware 5.
Under heavy load (for example, a VM migrating from the storage) the system hits a kernel Oops.


Stacktrace attached
Comment 1 Anthony 2016-02-29 02:24:47 UTC
Created attachment 206351
Screenshot with call trace for kernel 4.5.0
Comment 2 Anthony 2016-02-29 02:26:23 UTC
With kernel 4.5.0 on the target, the system hangs after clients connect to the target.
Comment 3 nab 2016-03-01 05:16:10 UTC
Hi Anthony,

On Mon, 2016-02-29 at 02:26 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=112241
> 
> Anthony <anthony.bloodoff@gmail.com> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>      Kernel Version|4.3.3                       |4.5.0
> 
> --- Comment #2 from Anthony <anthony.bloodoff@gmail.com> ---
> With kernel 4.5.0 on the target, the system hangs after clients connect to the target.
> 

So there are two things going on here.

First, the BUG_ON your ESX <-> LIO FC setup triggered has been addressed
recently in v4.5-rc4 and later kernels with the following series:

http://www.spinics.net/lists/target-devel/msg11822.html

Note that these patches will be making it back to earlier stable kernels
over the next few weeks.

However, this specific bug is the end result of a larger ESX v5.5u2+
host-side issue: the AtomicTestAndSet (ATS) heartbeat is enabled (by
default) for all VMFS5 mounts:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956

Other folks have been hitting this recently, here's some extra
background:

http://permalink.gmane.org/gmane.linux.scsi.target.devel/11574
http://www.spinics.net/lists/target-devel/msg12124.html

Note this affects all targets w/ VAAI ATS (including EMC, IBM, 3PAR,
SolidFire, etc.) and the current solution for ESX v5.5u2+ is to either:

  - Explicitly disable ATS heartbeat usage on all VMFS5 mounts as
    described in the VMware KB article, or:
  - Explicitly disable all ATS logic completely from LIO using 
    emulate_caw=0 on all backends connected to ESX v5.5u2+ hosts
    with VMFS5.
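
For the second option, here is a rough sketch of flipping emulate_caw
on every backend in one go. It assumes the usual LIO configfs layout of
/sys/kernel/config/target/core/<hba>/<device>/attrib/emulate_caw; the
helper name disable_caw and the DEFAULT_CORE constant are made up for
illustration and are not part of any LIO tooling. Treat it as a starting
point, not a tested procedure, and note that it touches all backends,
not only the ones exported to ESX.

```python
from pathlib import Path

# Assumption: configfs is mounted at the usual location and LIO backend
# devices live under target/core/<hba>/<device>/.
DEFAULT_CORE = Path("/sys/kernel/config/target/core")


def disable_caw(core_root=DEFAULT_CORE):
    """Write 0 to emulate_caw for every backend device under core_root.

    Returns the list of attribute files that were actually changed.
    """
    changed = []
    # Assumed layout: <core_root>/<hba>/<device>/attrib/emulate_caw
    for attr in sorted(Path(core_root).glob("*/*/attrib/emulate_caw")):
        # Skip devices that already have COMPARE AND WRITE emulation off.
        if attr.read_text().strip() != "0":
            attr.write_text("0\n")
            changed.append(attr)
    return changed
```

Run as root on the target, calling disable_caw() with no arguments would
walk the default configfs path; passing another directory lets you try it
against a scratch copy of the tree first. Attributes set this way are not
expected to survive a reboot unless the target configuration is saved.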

You can google for 'esx ats heartbeat bug' to see the gory details.

Thanks for reporting!

--nab
