Bug 196315

Summary: kernel BUG at nvme/host/pci.c
Product: Drivers
Reporter: Andreas Pflug (pgadmin)
Component: Other
Assignee: drivers_other
Status: RESOLVED CODE_FIX
Severity: high
CC: ben, dwmw2
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.9.30
Subsystem:
Regression: No
Bisected commit-id:
Attachments: netconsole kernel log

Description Andreas Pflug 2017-07-10 15:28:22 UTC
Created attachment 257445
netconsole kernel log

I'm running a patched (see below) Debian 4.9.30 kernel with Xen 4.8.1 on Debian 9. Shortly after I start a specific virtual machine, the kernel will emit

  kernel BUG at /usr/src/kernel/linux-4.9.30/drivers/nvme/host/pci.c:495!

via netconsole to my logging host, and become unstable until hard reset.

Hardware is dual E5-2620v4 on Supermicro 10DRI-T with two SAMSUNG
MZQLW960HMJP-00003 NVMe disks (mdadm RAID-1) backing the VHDs (OS on a separate SSD).

The bug was reported to Debian as https://bugs.debian.org/866511 . Following Ben Hutchings' advice, I patched the standard kernel with 0001-swiotlb-ensure-that-page-sized-mappings-are-page-ali.patch, since its description sounded promising, but the bug remains.

The log is attached, cut after 460 lines: the last trace on CPU15 is
repeated over and over, eventually leading to "Fixing recursive fault
but reboot is needed!"

Regards,
Andreas
Comment 1 Ben Hutchings 2017-07-10 17:38:38 UTC
This report should be reassigned or closed (as I don't think nvme bugs are tracked on Bugzilla).
Comment 2 David Woodhouse 2017-07-11 14:33:35 UTC
No idea why you filed this bug against raw NOR/NAND flash devices, but I'll take it anyway...
Comment 3 Andreas Pflug 2017-07-11 14:51:05 UTC
Because this appeared to be the least inappropriate category to me...

After Ben's hint, I posted to linux-nvme@lists.infradead.org, and checked with a 4.12.0 kernel, with the same result.
Comment 4 David Woodhouse 2017-08-15 12:28:56 UTC
http://xenbits.xen.org/xsa/advisory-229.html