Bug 191201 - Randomly freezes due to VMXNET3
Summary: Randomly freezes due to VMXNET3
Status: RESOLVED INVALID
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: x86-64 Linux
Importance: P1 blocking
Assignee: Stephen Hemminger
Reported: 2016-12-26 14:42 UTC by peter.hun
Modified: 2018-06-05 17:45 UTC
CC List: 15 users

See Also:
Kernel Version: 4.9.0
Tree: Mainline
Regression: No


Attachments
Kernel Panic log (77.46 KB, image/png)
2016-12-26 14:42 UTC, peter.hun
vmss file from VM at panic state (1.17 MB, application/octet-stream)
2017-03-23 11:41 UTC, Kalchenko Alexandr
CPU and RAM usage after bug occurred (26.33 KB, image/png)
2017-03-24 21:26 UTC, Remigiusz Szczepanik
vmss from panic state VM (493.90 KB, application/x-zip-compressed)
2017-03-25 15:16 UTC, Remigiusz Szczepanik
vmss from panic state VM - just after reboot (493.34 KB, application/x-zip-compressed)
2017-03-25 15:23 UTC, Remigiusz Szczepanik
crash dump from panic state (109.67 KB, text/plain)
2017-03-26 13:57 UTC, Remigiusz Szczepanik
config-4.10.3-1.el7.elrepo.x86_64 (184.75 KB, text/plain)
2017-03-27 17:01 UTC, Remigiusz Szczepanik
debug-patch (2.01 KB, patch)
2017-03-28 01:49 UTC, Shrikrishna Khare

Description peter.hun 2016-12-26 14:42:55 UTC
Created attachment 248601 [details]
Kernel Panic log

This issue affects all virtual machines running on an ESXi 6.5 host with virtual hardware version 13. The guest freezes randomly, sometimes a few minutes after power-on and sometimes several hours after boot.

I have captured this kernel panic log several times; the issue appears to be caused by VMXNET3.

All kernels newer than 4.8.x are affected. If I downgrade the kernel to 4.4.x, the VMs work like a charm.

(Guest OS is CentOS 7.3 with kernel-ml)

The issue does not happen with virtual hardware version 11 on ESXi 6.5; it only happens with virtual hardware version 13 on ESXi 6.5.
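
To check quickly whether a guest falls in the kernel range described above, something like the following sketch could be used. The helper names `version_ge` and `kernel_base` are made up for illustration, the 4.8 threshold is an assumption taken from the reports in this thread, and `sort -V` assumes GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Relies on `sort -V` (version sort from GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: strip the distribution suffix from a release string,
# e.g. "4.10.3-1.el7.elrepo.x86_64" -> "4.10.3".
kernel_base() {
    printf '%s\n' "${1%%-*}"
}

# The 4.8 threshold is an assumption based on the reports in this thread.
running="$(kernel_base "$(uname -r)")"
if version_ge "$running" "4.8"; then
    echo "kernel $running is in the range reported as affected"
else
    echo "kernel $running is older than the range reported as affected"
fi
```

This only compares version strings; it says nothing about the virtual hardware version or the NIC type, which the report also names as conditions.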
Comment 1 Mikołaj Kowalski 2017-01-05 16:21:17 UTC
I've probably also hit that bug on Ubuntu 16.10, linux-image-4.8.0-32-generic 4.8.0-32.34

[ 8442.722056] ------------[ cut here ]------------
[ 8442.722119] kernel BUG at /build/linux-_qw1uB/linux-4.8.0/drivers/net/vmxnet3/vmxnet3_drv.c:1413!
[ 8442.722198] invalid opcode: 0000 [#1] SMP
[ 8442.722235] Modules linked in: xt_multiport vmw_vsock_vmci_transport vsock ppdev vmw_balloon coretemp input_leds joydev serio_raw i2c_piix4 shpchp nfit vmw_vmci parport_pc parport mac_hid ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables ib_iser nf_conntrack_netbios_ns rdma_cm nf_conntrack_broadcast iw_cm ib_cm nf_nat_ftp ib_core nf_nat nf_conntrack_ftp nf_conntrack iptable_filter configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic
[ 8442.723051]  usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 vmwgfx lrw glue_helper ablk_helper cryptd ttm drm_kms_helper psmouse syscopyarea sysfillrect vmxnet3 sysimgblt ahci fb_sys_fops mptspi libahci mptscsih drm mptbase scsi_transport_spi pata_acpi fjes
[ 8442.723351] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.8.0-32-generic #34-Ubuntu
[ 8442.723416] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[ 8442.723508] task: ffff9333762a2700 task.stack: ffff9333762ac000
[ 8442.723560] RIP: 0010:[<ffffffffc030e701>]  [<ffffffffc030e701>] vmxnet3_rq_rx_complete+0x8d1/0xeb0 [vmxnet3]
[ 8442.723656] RSP: 0018:ffff93337fc83dc8  EFLAGS: 00010297
[ 8442.723704] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff933372735800
[ 8442.723765] RDX: 0000000000000040 RSI: 0000000000000001 RDI: 0000000000000040
[ 8442.723826] RBP: ffff93337fc83e40 R08: 0000000000000002 R09: 0000000000000030
[ 8442.723888] R10: 0000000000000000 R11: ffff93336d15c880 R12: ffff933372bae850
[ 8442.723949] R13: ffff93336d15d400 R14: ffff933372ab2010 R15: ffff933372b28018
[ 8442.724012] FS:  0000000000000000(0000) GS:ffff93337fc80000(0000) knlGS:0000000000000000
[ 8442.724081] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8442.724131] CR2: 00007f4ca84d62a0 CR3: 00000002327d5000 CR4: 00000000000006e0
[ 8442.724299] Stack:
[ 8442.724324]  ffff93336d15c880 ffff93336d15c880 0000000000000000 ffff93336d15c880
[ 8442.724401]  ffff93337fc8002d ffff93336d15d420 0000000000000002 0000000100000040
[ 8442.724475]  ffff93336d15d4e8 0000000000000000 ffff93336d15c880 ffff93336d15d420
[ 8442.724549] Call Trace:
[ 8442.726430]  <IRQ>
[ 8442.726458]  [<ffffffffc030ee3a>] vmxnet3_poll_rx_only+0x3a/0xb0 [vmxnet3]
[ 8442.730045]  [<ffffffff87167dfd>] ? add_interrupt_randomness+0x19d/0x210
[ 8442.730812]  [<ffffffff8737fa28>] net_rx_action+0x238/0x380
[ 8442.731566]  [<ffffffff8749dddd>] __do_softirq+0x10d/0x298
[ 8442.732306]  [<ffffffff86c88d93>] irq_exit+0xa3/0xb0
[ 8442.733035]  [<ffffffff8749db24>] do_IRQ+0x54/0xd0
[ 8442.733744]  [<ffffffff8749bc02>] common_interrupt+0x82/0x82
[ 8442.734437]  <EOI>
[ 8442.734449]  [<ffffffff86c64236>] ? native_safe_halt+0x6/0x10
[ 8442.735776]  [<ffffffff86c37e60>] default_idle+0x20/0xd0
[ 8442.736434]  [<ffffffff86c385cf>] arch_cpu_idle+0xf/0x20
[ 8442.737067]  [<ffffffff86cc77fa>] default_idle_call+0x2a/0x40
[ 8442.737682]  [<ffffffff86cc7afc>] cpu_startup_entry+0x2ec/0x350
[ 8442.738283]  [<ffffffff86c518a1>] start_secondary+0x151/0x190
[ 8442.738870] Code: 90 88 45 98 4c 89 55 a0 e8 4d f0 06 c7 0f b6 45 98 4c 8b 5d 90 4c 8b 55 a0 49 c7 85 60 01 00 00 00 00 00 00 89 c6 e9 91 f8 ff ff <0f> 0b 0f 0b 49 83 85 b8 01 00 00 01 49 c7 85 60 01 00 00 00 00
[ 8442.740689] RIP  [<ffffffffc030e701>] vmxnet3_rq_rx_complete+0x8d1/0xeb0 [vmxnet3]
[ 8442.741273]  RSP <ffff93337fc83dc8>
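
For comparing traces from different reporters, the two most useful facts in a log like the one above are the `kernel BUG at` location and the faulting function on the `RIP` line. A minimal sketch that pulls both out of a saved log follows; `oops_summary` is a hypothetical name, and the patterns assume the line format shown in this trace.

```shell
#!/bin/sh
# Hypothetical helper: read a kernel log on stdin and print the BUG location
# and the faulting function name. Patterns assume the format seen above.
oops_summary() {
    sed -n \
        -e 's/.*kernel BUG at \(.*\)!$/bug: \1/p' \
        -e 's/.*RIP.* \([A-Za-z_][A-Za-z0-9_]*\)+0x[0-9a-f]*\/0x[0-9a-f]*.*/function: \1/p' |
        awk '!seen[$0]++'    # the RIP line appears twice in an oops; keep one copy
}

# Usage (the file name is an example):
#   oops_summary < panic.log
```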
Comment 2 Kalchenko Alexandr 2017-02-21 15:38:22 UTC
Caught the same thing on Gentoo with the latest "stable" kernel 4.9.6-r1, VMware 6.5 fully patched.
A virtual machine with HW version 13 and a VMXNET3 network card crashed with a kernel panic immediately after I tried to ssh into it.
Comment 3 Shrikrishna Khare 2017-03-23 00:15:42 UTC
Thank you for reporting it.

I have been running a kernel 4.9 VM with hwversion 13 on an ESXi 6.5 host for several hours while exchanging traffic, and have not yet managed to reproduce this. Any pointers on how I might be able to reproduce it? A repro will allow me to collect a vmss file, which will help in debugging.

If you hit the issue again, could you please generate a vmss file while the VM is in the panic state? It can be obtained by suspending the VM in the panic state and then locating the vmss file in the directory of the virtual machine on the ESX host. More details here:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2005831
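
Once the VM has been suspended, the suspend-state files can be located with a sketch like the one below. `list_vmss` is a hypothetical helper and the datastore path in the usage comment is an assumption about a typical layout. The `.vmem` file is listed as well because, as the crash output later in this thread shows, crash looks for it next to the `.vmss`.

```shell
#!/bin/sh
# Hypothetical helper: list suspend-state (.vmss) and memory (.vmem) files
# in the VM directory given as $1.
list_vmss() {
    find "$1" -maxdepth 1 -type f \( -name '*.vmss' -o -name '*.vmem' \) | sort
}

# Usage on the ESXi shell (path is an assumed example layout):
#   list_vmss /vmfs/volumes/<datastore>/<vm-name>
# The VM can also be suspended from the ESXi shell instead of the UI:
#   vim-cmd vmsvc/getallvms            # look up the VM id
#   vim-cmd vmsvc/power.suspend <vmid>
```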
Comment 4 Kalchenko Alexandr 2017-03-23 11:41:43 UTC
Created attachment 255445 [details]
vmss file from VM at panic state

VMSS file in panic state from a Gentoo VM with the 4.9.6 kernel.
The panic appeared immediately after an ssh login attempt.
Comment 5 Kalchenko Alexandr 2017-03-23 12:31:48 UTC
So it was repeatable on a VM with Cores per socket = 4, VCPU = 4.
After I changed Cores per socket to 1 or 2, the kernel panic disappeared.
The strange thing is that when I set Cores per socket to 4 again, the kernel panic did not appear...
So I can't reproduce it 100% of the time for now.
Comment 6 Kalchenko Alexandr 2017-03-23 12:36:08 UTC
Sorry, a small mistake. It was reproducible with Cores per socket = 1, VCPU = 4.
Not reproducible with Cores per socket = 2 or 4 (VCPU = 4).
Comment 7 Shrikrishna Khare 2017-03-24 05:53:11 UTC
Thank you.

Any pointers on how I could get the vmlinux file (debug symbols) for this Gentoo VM? Also, the vmss is only 1.17 MB, which seems too small; without the vmlinux I could not open it to check.

Btw, I tried reproducing this again on an Ubuntu VM and on a Photon VM running kernel 4.9 with 4 VCPUs (cores per socket = 1) by passing traffic as well as leaving an ssh session running for several hours.

I have not hit it yet. Please let me know if there is anything else I could try that might increase the chances of hitting it locally.
Comment 8 Remigiusz Szczepanik 2017-03-24 21:26:45 UTC
Created attachment 255521 [details]
CPU and RAM usage after bug occurred

Hello everyone

I'm suffering from the same issue. It happens "randomly" (some time after SSH usage), in my case on two different VMs, though they are quite similar.

I will describe everything I did, so maybe it will give you an idea of how to reproduce this issue.

ESXi version: HPE Customized Image ESXi 6.5.0 version 650.9.6.0.28, released November 2016 and based on ESXi 6.5.0 VMkernel Release Build 4564106.
ESXi hardware: HP ProLiant MicroServer Gen8, 2 CPUs x Intel(R) Celeron(R) CPU G1610T @ 2.30GHz, 11.84 GB of RAM
Guest OS: CentOS 4/5 or later (64-bit) [actually it is CentOS 7]
Compatibility: ESXi 6.5 and later (VM version 13) [made from within ESXi 6.5 web client]
VMware Tools: Yes [open-vm-tools]
CPUs: 2 [2 cores per socket, 1 socket]
Memory: 1 GB
Network Adapter: VMXNET 3

CentOS 7 is installed with "minimal" software.
I changed the kernel using this Ansible Playbook (I'm learning Ansible, still quite green) - https://github.com/komarEX/ansible/blob/master/kernel.yml

The freeze occurs on the kernel I have installed right now: Linux 4.10.3-1.el7.elrepo.x86_64 #1 SMP Wed Mar 15 14:45:27 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

When the freeze occurs I can see high usage of one CPU on the ESXi monitor (attachment) and I cannot do anything from the ESXi console (frozen screen).
The attachment actually shows 2 occurrences of the bug: occurred, restart, occurred again, second restart.

I can do whatever is needed to better pinpoint the source of the issue, but I may need a little guidance in this matter.
Comment 9 Shrikrishna Khare 2017-03-24 21:35:20 UTC
(In reply to Remigiusz Szczepanik from comment #8)
> Created attachment 255521 [details]
> 
> I can try to do whatever is needed to better pinpoint source of issue but I
> may need a little of guidance in this matter.

Thank you! I will try to repro using the steps you mentioned. But what will really help us is the following:

Could you please generate a vmss file while the VM is in the panic state? It can be obtained by suspending the VM in the panic state and then locating the vmss file in the directory of the virtual machine on the ESX host. More details here:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2005831

If we have the vmss file from when the VM is in the panic state, and the debug symbols for the kernel you are using, we can analyze the crash dump. That will significantly improve our odds of root-causing this!

Thanks again,
Shri
Comment 10 peter.hun 2017-03-24 21:49:04 UTC
Thank you for responding to this issue. Since we waited so long after reporting it here, we switched to SR-IOV and E1000E, and everything seems to work like a charm, so the issue should be related only to VMXNET3.

I just deployed an experimental VM with the following configuration, and if it freezes again, I will try to capture a vmss file of the affected VM.

8 vCores
16 GB RAM
H/W v13
VMXNET3
CentOS with kernel-ml 4.10
Workload : cPanel v63 edge
Comment 11 Shrikrishna Khare 2017-03-25 01:01:53 UTC
(In reply to peter.hun from comment #10)
> Thank you for responding this issue, since we waited so long after this
> issue published to here, we had switched to SR-IOV and E1000E solutions and
> everything seemed work like a charms, so the issues should be only related
> to VMXNET3.
> 
> I just deployed a experimental VM with following configuration, and if it
> freezes again, I will try to capture vmss file of the affected VM.
> 
> 8 vCores
> 16 GB RAM
> H/W v13
> VMXNET3
> CentOS with kernel-ml 4.10
> Workload : cPanel v63 edge

Thanks a lot, Peter. That will really help in root-causing this.

Btw, I am not sure whether you originally hit the issue with 8 vCPUs? Earlier updates in this thread suggest the issue was hit with 4 vCPUs.
Comment 12 peter.hun 2017-03-25 01:08:32 UTC
Actually, all three of my major production servers are affected by this issue: one has dual Xeon E5-2620 v3 (12C24T), one has dual Xeon E5-2683 v3 (28C56T), and another has a Xeon E3-1231 v3 (4C8T).

Tested VMs with 8/12/24/28/56 vCPUs last year, but no luck.
Comment 13 Remigiusz Szczepanik 2017-03-25 15:16:10 UTC
Created attachment 255533 [details]
vmss from panic state VM

I was working on these VMs for half a day and the panic state finally happened.

I have downloaded the .vmss, .nvram and .vmem files from the suspended machine. I will share the .vmem file only if it's really needed.

Additionally, the panic happened exactly on the hour, so I believe some kind of cron job may have triggered it.
Comment 14 Remigiusz Szczepanik 2017-03-25 15:23:32 UTC
Created attachment 255535 [details]
vmss from panic state VM - just after reboot

After reboot it happened again instantly, so I attached a second .vmss (I also have the .nvram and .vmem files).
Comment 15 Shrikrishna Khare 2017-03-25 18:09:22 UTC
(In reply to Remigiusz Szczepanik from comment #14)
> Created attachment 255535 [details]
> vmss from panic state VM - just after reboot
> 
> After reboot it happened instantly again so I attached second .vmss (also
> have .nvram and .vmem files).

Thank you Remigiusz.

Could you please also share the vmlinux file (debug symbols) for the kernel you are running (is it CentOS 4.10.3)?

Using the vmss and vmlinux, I should be able to run gdb.
Comment 16 Remigiusz Szczepanik 2017-03-25 18:19:19 UTC
(In reply to Shrikrishna Khare from comment #15)
> Could you please also share vmlinux file (debug symbols) for the kernel (is
> it CentOS 4.10.3)you are running? 


I don't have debug symbols installed (only vmlinuz files I guess?).

But I believe you can get these from elrepo. I'm using "kernel-ml.x86_64 4.10.3-1.el7.elrepo".
Comment 17 Shrikrishna Khare 2017-03-25 21:10:03 UTC
(In reply to Remigiusz Szczepanik from comment #16)
> (In reply to Shrikrishna Khare from comment #15)
> > Could you please also share vmlinux file (debug symbols) for the kernel (is
> > it CentOS 4.10.3)you are running? 
> 
> 
> I don't have debug symbols installed (only vmlinuz files I guess?).
> 
> But I believe you can get these from elrepo. I'm using "kernel-ml.x86_64
> 4.10.3-1.el7.elrepo".

I think I will need the vmem file, as crash complained without it. Please see the steps I tried below. Btw, I could not find 4.10.3 in the elrepo, only 4.10.4.

# search vmlinux, vmlinuz
rpm2cpio kernel-ml-4.10.4-1.el7.elrepo.x86_64.rpm  | cpio -idt | grep -e "vmlinux" -e "vmlinuz"
./boot/vmlinuz-4.10.4-1.el7.elrepo.x86_64     # i.e. no vmlinux file, but only vmlinuz

# extract vmlinux from vmlinuz
wget -O extract-vmlinux https://raw.githubusercontent.com/torvalds/linux/master/scripts/extract-vmlinux
./extract-vmlinux boot/vmlinuz-4.10.4-1.el7.elrepo.x86_64 > vmlinux

crash vmlinux Hana-backup-92165ff7.vmss

crash 7.1.5
Copyright (C) 2002-2016  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

vmw: Memory dump is not part of this vmss file.
vmw: Try to locate the companion vmem file ...
crash: vmw: Hana-backup-92165ff7.vmem: No such file or directory

crash: Hana-backup-92165ff7.vmss: initialization failed
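
The failure above comes from crash looking for a `.vmem` file with the same base name next to the `.vmss`. A small pre-flight check along these lines could catch that before starting crash; `check_vmss_pair` is a hypothetical helper, and the naming rule is inferred only from the error message above.

```shell
#!/bin/sh
# Hypothetical pre-flight check: verify that <name>.vmss ($1) exists and that
# the companion <name>.vmem sits next to it, as the crash error above implies.
check_vmss_pair() {
    vmss="$1"
    vmem="${vmss%.vmss}.vmem"
    [ -f "$vmss" ] || { echo "missing: $vmss"; return 1; }
    [ -f "$vmem" ] || { echo "missing: $vmem"; return 1; }
    echo "ok: $vmss + $vmem"
}

# Usage:
#   check_vmss_pair Hana-backup-92165ff7.vmss && crash vmlinux Hana-backup-92165ff7.vmss
```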
Comment 18 Remigiusz Szczepanik 2017-03-25 23:59:27 UTC
I have extracted vmlinux from vmlinuz and I have used .vmem file.

# crash vmlinux Hana-backup-92165ff7.vmss

crash 7.1.4
Copyright (C) 2002-2015  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

vmw: Memory dump is not part of this vmss file.
vmw: Try to locate the companion vmem file ...
vmw: vmem file: Hana-backup-92165ff7.vmem

crash: vmlinux: no .gnu_debuglink section

crash: vmlinux: no debugging data available
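
The "no debugging data available" message is what one would expect from a vmlinux extracted out of a distribution vmlinuz, which is normally stripped. Whether an ELF image carries DWARF data can be checked before handing it to crash. The sketch below assumes `readelf` from binutils is available; `has_debug_sections` is a hypothetical helper that only inspects a section listing.

```shell
#!/bin/sh
# Hypothetical helper: read a section listing on stdin (for example the output
# of `readelf -S vmlinux`) and succeed if a .debug_info section is present.
has_debug_sections() {
    grep -q '\.debug_info'
}

# Usage (readelf is part of binutils):
#   if readelf -S vmlinux | has_debug_sections; then
#       echo "vmlinux has debug info"
#   else
#       echo "vmlinux is stripped; crash will not be able to use it"
#   fi
```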
Comment 19 Remigiusz Szczepanik 2017-03-26 00:14:56 UTC
If I understand correctly, elrepo does not provide source or debuginfo packages - https://elrepo.org/bugs/view.php?id=684

I do have the config file for that kernel. Can I use it to recompile the kernel from mainline source and somehow build debuginfo? I'm not very well versed in advanced kernel compilation.
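
One possible way to do this, sketched under assumptions: reuse the distribution config, force `CONFIG_DEBUG_INFO=y`, and build only the `vmlinux` target. `enable_debug_info` is a hypothetical helper, and the file and directory names in the usage comment are examples. A vmlinux rebuilt with a different compiler may not match the running kernel exactly.

```shell
#!/bin/sh
# Hypothetical helper: force CONFIG_DEBUG_INFO=y in the kernel config file $1.
enable_debug_info() {
    cfg="$1"
    tmp="$cfg.tmp.$$"
    # Drop any existing setting ("=y/n" or "is not set"), then append the new one.
    grep -v -e '^CONFIG_DEBUG_INFO=' -e '^# CONFIG_DEBUG_INFO is not set' "$cfg" > "$tmp" || true
    echo 'CONFIG_DEBUG_INFO=y' >> "$tmp"
    mv "$tmp" "$cfg"
}

# Usage (paths are examples, run from the unpacked linux-4.10.3 source tree):
#   cp ../config-4.10.3-1.el7.elrepo.x86_64 .config
#   enable_debug_info .config
#   make olddefconfig          # fill in defaults for any new options
#   make -j"$(nproc)" vmlinux  # builds the uncompressed image with debug info
```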
Comment 20 Remigiusz Szczepanik 2017-03-26 13:57:27 UTC
Created attachment 255557 [details]
crash dump from panic state

It took me a while to get crash working the way I wanted.
I had to recompile kernel 4.10.3 using the elrepo config file, and I had to rebuild crash at the newest version, as only crash 7.1.8 has fixes for bugs introduced in kernels 4.10+.

I have used crash on the second .vmss (panic after reboot).

As I said before, I would prefer not to disclose the .vmem file (for obvious reasons).
If you need anything more, please tell me what I need to type into crash.

The attachment contains:

# crash System.map-4.10.3-1.el7.elrepo.x86_64 ./kernel_rebuild/linux-4.10.3/vmlinux Hana-backup-92165ff7.vmss

crash 7.1.8
Copyright (C) 2002-2016  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

vmw: Memory dump is not part of this vmss file.
vmw: Try to locate the companion vmem file ...
vmw: vmem file: Hana-backup-92165ff7.vmem

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernels compiled by different gcc versions:
  ./kernel_rebuild/linux-4.10.3/vmlinux: 5.4.0
  Hana-backup-92165ff7.vmss kernel: 4.8.5

  SYSTEM MAP: System.map-4.10.3-1.el7.elrepo.x86_64
DEBUG KERNEL: ./kernel_rebuild/linux-4.10.3/vmlinux (4.10.3)
    DUMPFILE: Hana-backup-92165ff7.vmss
        CPUS: 2 [OFFLINE: 1]
        DATE: Sat Mar 25 16:18:48 2017
      UPTIME: 00:00:14
LOAD AVERAGE: 0.29, 0.06, 0.02
       TASKS: 167
    NODENAME: hana
     RELEASE: 4.10.3-1.el7.elrepo.x86_64
     VERSION: #1 SMP Wed Mar 15 14:45:27 EDT 2017
     MACHINE: x86_64  (2294 Mhz)
      MEMORY: 1 GB
       PANIC: "kernel BUG at drivers/net/vmxnet3/vmxnet3_drv.c:1413!"
         PID: 0
     COMMAND: "swapper/0"
        TASK: ffffffff81c10500  (1 of 2)  [THREAD_INFO: ffffffff81c10500]
         CPU: 0
       STATE: TASK_RUNNING
     WARNING: panic task not found

crash> bt
(...)
crash> log
(...)
crash> ps
Comment 21 Remigiusz Szczepanik 2017-03-26 14:07:16 UTC
@Shrikrishna Khare

Can you confirm that it is indeed the same bug (just on older kernel)?
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1654319
Comment 22 Shrikrishna Khare 2017-03-26 15:52:09 UTC
(In reply to Remigiusz Szczepanik from comment #21)
> @Shrikrishna Khare
> 
> Can you confirm that it is indeed the same bug (just on older kernel)?
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1654319

Yes, it is the same issue, let me check that vmss.
Comment 23 Shrikrishna Khare 2017-03-26 16:07:02 UTC
(In reply to Remigiusz Szczepanik from comment #20)
> Created attachment 255557 [details]
> crash dump from panic state
> 
> So it took me a while to make crash work like I would like it to.
> I had to recompile kernel 4.10.3 using elrepo config file and I had to
> recompile crash to newest version as only 7.1.8 crash has fixes for bugs
> introduced in kernels 4.10+.
> 
> I have used crash on second .vmss (panic after reboot).

Thanks a lot! Really appreciate all your efforts and help with this!

> 
> As I said before I would like to not disclose .vmem file (obvious reasons).
> If you need anything more please tell me what I need to type info crash.

I understand. That makes our debugging tricky, but let us give it a try.

On the crash prompt:

1. bt # now with the right debug symbols, I take it that it shows the panic stack with vmxnet3_rq_rx_complete?
2. set print pretty on
3. info locals # should list the local variables in vmxnet3_rq_rx_complete. Does it list rcd? We would like the values of all local variables as well as the arguments to this function. In particular, try the following:
4. p *rq
5. p *rcd
6. p *rbi
7. p idx
8. p ring_idx
9. p *ring
10. p *rxd

Basically, let us get the state of the vmxnet3 device when it hit the BUG_ON in the receive path; then I can check if anything is amiss.

The relevant code:
http://lxr.free-electrons.com/source/drivers/net/vmxnet3/vmxnet3_drv.c#L1257
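
To avoid typing the commands listed above one by one, they could be put in a file and passed to crash with its `-i` option, which runs the file's commands at startup. This is only a sketch: `write_crash_cmds` is a hypothetical helper, the trailing `exit` is an addition so that crash quits afterwards, and whether each gdb-style command is accepted as typed may depend on the crash version (crash's `gdb` command is the explicit passthrough).

```shell
#!/bin/sh
# Hypothetical helper: write the commands requested above into the file $1.
# The final "exit" is an addition so crash terminates after running them.
write_crash_cmds() {
    cat > "$1" <<'EOF'
bt
set print pretty on
info locals
p *rq
p *rcd
p *rbi
p idx
p ring_idx
p *ring
p *rxd
exit
EOF
}

# Usage (file names follow the earlier crash invocation in this thread):
#   write_crash_cmds cmds.txt
#   crash -i cmds.txt System.map-4.10.3-1.el7.elrepo.x86_64 \
#       ./kernel_rebuild/linux-4.10.3/vmlinux Hana-backup-92165ff7.vmss > crash-out.txt
```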

Thanks!
Shri





Comment 24 Remigiusz Szczepanik 2017-03-26 16:59:36 UTC
> 1. bt # now with the right debug symbols, I take it that it shows the panic
> stack with vmxnet3_rq_rx_complete?

Apparently it doesn't. Moreover, crash tells me:

       PANIC: "kernel BUG at drivers/net/vmxnet3/vmxnet3_drv.c:1413!"
         PID: 0
     COMMAND: "swapper/0"
        TASK: ffffffff81c10500  (1 of 2)  [THREAD_INFO: ffffffff81c10500]
         CPU: 0
       STATE: TASK_RUNNING (ACTIVE)
     WARNING: panic task not found

Note the last line: "panic task not found".

So if I run:

crash> bt
PID: 0      TASK: ffffffff81c10500  CPU: 0   COMMAND: "swapper/0"
 #0 [ffffffff81c03d80] __schedule at ffffffff8177ed1c
 #1 [ffffffff81c03db8] native_safe_halt at ffffffff81784006
 #2 [ffffffff81c03de8] default_idle at ffffffff81783d4e
 #3 [ffffffff81c03e08] arch_cpu_idle at ffffffff81037baf
 #4 [ffffffff81c03e18] default_idle_call at ffffffff8178416c
 #5 [ffffffff81c03e28] do_idle at ffffffff810cbef8
 #6 [ffffffff81c03e60] cpu_startup_entry at ffffffff810cc231
 #7 [ffffffff81c03e88] rest_init at ffffffff81776617
 #8 [ffffffff81c03e98] start_kernel at ffffffff81da71f7
 #9 [ffffffff81c03ee8] x86_64_start_reservations at ffffffff81da65d6
#10 [ffffffff81c03ef8] x86_64_start_kernel at ffffffff81da6724
crash> set -p
set: no panic task found!
crash> set -c 0
    PID: 0
COMMAND: "swapper/0"
   TASK: ffffffff81c10500  (1 of 2)  [THREAD_INFO: ffffffff81c10500]
    CPU: 0
  STATE: TASK_RUNNING (ACTIVE)
crash> set -c 1
    PID: 0
COMMAND: "swapper/1"
   TASK: ffff88003fb94080  (1 of 2)  [THREAD_INFO: ffff88003fb94080]
    CPU: 1
  STATE: TASK_RUNNING (ACTIVE)
crash> bt
PID: 0      TASK: ffff88003fb94080  CPU: 1   COMMAND: "swapper/1"
 #0 [ffffc9000020fe20] __schedule at ffffffff8177ed1c
 #1 [ffffc9000020fe58] native_safe_halt at ffffffff81784006
 #2 [ffffc9000020fe88] default_idle at ffffffff81783d4e
 #3 [ffffc9000020fea8] arch_cpu_idle at ffffffff81037baf
 #4 [ffffc9000020feb8] default_idle_call at ffffffff8178416c
 #5 [ffffc9000020fec8] do_idle at ffffffff810cbef8
 #6 [ffffc9000020ff00] cpu_startup_entry at ffffffff810cc231
 #7 [ffffc9000020ff28] start_secondary at ffffffff81052054


I have a lot of tasks but not the panic one:

crash> last
[66871679980576] [IN]  PID: 572    TASK: ffff8800399cab00  CPU: 0   COMMAND: "vmtoolsd"
[66871612637175] [IN]  PID: 6183   TASK: ffff88003b345600  CPU: 0   COMMAND: "smtp"
[66871581528376] [IN]  PID: 6170   TASK: ffff88003b139580  CPU: 0   COMMAND: "kworker/0:2"
[66871224877535] [IN]  PID: 938    TASK: ffff8800399bab00  CPU: 1   COMMAND: "tuned"
[66871213468448] [IN]  PID: 6097   TASK: ffff88003a98d600  CPU: 1   COMMAND: "kworker/1:1"
[66870695509563] [IN]  PID: 9      TASK: ffff88003fb8c080  CPU: 1   COMMAND: "rcuos/0"
[66870695472934] [IN]  PID: 7      TASK: ffff88003fb89580  CPU: 1   COMMAND: "rcu_sched"
[66870388582379] [IN]  PID: 21     TASK: ffff88003fbf1580  CPU: 1   COMMAND: "rcuos/1"
[66870377129420] [IN]  PID: 18     TASK: ffff88003fbbc080  CPU: 1   COMMAND: "ksoftirqd/1"
[66870371282980] [IN]  PID: 983    TASK: ffff88003b342b00  CPU: 1   COMMAND: "master"
[66870370621290] [IN]  PID: 985    TASK: ffff88003a2dc080  CPU: 0   COMMAND: "qmgr"
[66870368604083] [IN]  PID: 992    TASK: ffff88003b13d600  CPU: 1   COMMAND: "tlsmgr"
[66870348187337] [IN]  PID: 882    TASK: ffff88003c7bab00  CPU: 0   COMMAND: "rs:main Q:Reg"
[66870348137285] [IN]  PID: 881    TASK: ffff88003c7bd600  CPU: 0   COMMAND: "in:imjournal"
[66870348125355] [IN]  PID: 6182   TASK: ffff88003b13ab00  CPU: 1   COMMAND: "local"
[66870346763493] [IN]  PID: 6181   TASK: ffff8800361b4080  CPU: 0   COMMAND: "trivial-rewrite"
[66870346737411] [IN]  PID: 6178   TASK: ffff8800398f5600  CPU: 0   COMMAND: "cleanup"
[66870346653737] [RU]  PID: 415    TASK: ffff8800363ad600  CPU: 1   COMMAND: "systemd-journal"
[66870345260017] [IN]  PID: 6111   TASK: ffff8800363a9580  CPU: 1   COMMAND: "kworker/1:3"
[66870326578232] [IN]  PID: 4198   TASK: ffff88003c7b9580  CPU: 1   COMMAND: "kworker/1:0"
[66870323456819] [IN]  PID: 602    TASK: ffff8800399bd600  CPU: 1   COMMAND: "NetworkManager"
[66870323286289] [IN]  PID: 612    TASK: ffff88003b138000  CPU: 1   COMMAND: "gdbus"
[66870323128721] [IN]  PID: 563    TASK: ffff88003cc54080  CPU: 1   COMMAND: "dbus-daemon"
[66870323011581] [IN]  PID: 586    TASK: ffff88003be19580  CPU: 1   COMMAND: "gdbus"
[66870322884504] [IN]  PID: 566    TASK: ffff88003cc51580  CPU: 1   COMMAND: "polkitd"
[66870322580600] [IN]  PID: 567    TASK: ffff88003a90d600  CPU: 1   COMMAND: "systemd-logind"
[66870318690684] [IN]  PID: 5713   TASK: ffff88003fb45600  CPU: 1   COMMAND: "pickup"
[66870314012253] [IN]  PID: 1      TASK: ffff88003fb40000  CPU: 0   COMMAND: "systemd"
[66870312550634] [IN]  PID: 6099   TASK: ffff880036930000  CPU: 0   COMMAND: "kworker/0:0"
[66870300261359] [IN]  PID: 5664   TASK: ffff8800399c8000  CPU: 0   COMMAND: "kworker/u4:4"
[66870298754740] [IN]  PID: 598    TASK: ffff8800399b9580  CPU: 0   COMMAND: "firewalld"
[66870298066092] [IN]  PID: 579    TASK: ffff88003c7b8000  CPU: 0   COMMAND: "crond"
[66870297743047] [IN]  PID: 544    TASK: ffff8800399bc080  CPU: 0   COMMAND: "auditd"
[66870297622980] [IN]  PID: 40     TASK: ffff88003df7ab00  CPU: 0   COMMAND: "kauditd"
[66869466618701] [IN]  PID: 677    TASK: ffff88003b340000  CPU: 0   COMMAND: "gmain"
[66868653505756] [IN]  PID: 29     TASK: ffff88003dcd2b00  CPU: 1   COMMAND: "khugepaged"
[66868519433080] [IN]  PID: 13     TASK: ffff88003fb92b00  CPU: 0   COMMAND: "watchdog/0"
[66868518464398] [IN]  PID: 16     TASK: ffff88003fbb9580  CPU: 1   COMMAND: "watchdog/1"
[66868365687554] [IN]  PID: 565    TASK: ffff88003cc50000  CPU: 1   COMMAND: "irqbalance"
[66850221360652] [IN]  PID: 6      TASK: ffff88003fb88000  CPU: 0   COMMAND: "ksoftirqd/0"
[66849197364092] [IN]  PID: 399    TASK: ffff88003a3c1580  CPU: 1   COMMAND: "kworker/1:1H"
[66848173339718] [IN]  PID: 400    TASK: ffff88003a3c4080  CPU: 0   COMMAND: "kworker/0:1H"
[66844088824438] [IN]  PID: 342    TASK: ffff8800398f2b00  CPU: 1   COMMAND: "btrfs-cleaner"
[66844088496858] [IN]  PID: 343    TASK: ffff8800398f4080  CPU: 1   COMMAND: "btrfs-transacti"
[66843101155645] [IN]  PID: 503    TASK: ffff880039c04080  CPU: 0   COMMAND: "btrfs-cleaner"
[66843101066941] [IN]  PID: 504    TASK: ffff880039c05600  CPU: 1   COMMAND: "btrfs-transacti"
[66827185902700] [IN]  PID: 5711   TASK: ffff8800399b8000  CPU: 0   COMMAND: "kworker/u4:5"
[66827181568996] [IN]  PID: 6076   TASK: ffff880023660000  CPU: 1   COMMAND: "kworker/u4:1"
[66823251253622] [IN]  PID: 4638   TASK: ffff88003a8e2b00  CPU: 0   COMMAND: "kworker/0:1"
[66823251246612] [IN]  PID: 5547   TASK: ffff8800361b0000  CPU: 0   COMMAND: "kworker/0:4"
[66823251209329] [IN]  PID: 2      TASK: ffff88003fb41580  CPU: 0   COMMAND: "kthreadd"
[66823236413494] [IN]  PID: 879    TASK: ffff88003a8e5600  CPU: 1   COMMAND: "sshd"
[66818135390983] [IN]  PID: 6075   TASK: ffff880023665600  CPU: 0   COMMAND: "kworker/u4:0"
[66817576151940] [IN]  PID: 6167   TASK: ffff88003a8e1580  CPU: 0   COMMAND: "kworker/u4:3"
[66817573221273] [IN]  PID: 6166   TASK: ffff88003fbbd600  CPU: 0   COMMAND: "kworker/u4:2"
[66799754004545] [IN]  PID: 2339   TASK: ffff88003b13c080  CPU: 0   COMMAND: "sshd"
[66799753307264] [IN]  PID: 2408   TASK: ffff8800399c9580  CPU: 1   COMMAND: "bash"
[66785052428665] [IN]  PID: 11     TASK: ffff88003fb90000  CPU: 0   COMMAND: "migration/0"
[66784187730559] [IN]  PID: 3255   TASK: ffff88003b344080  CPU: 1   COMMAND: "kworker/u5:1"
[66784184445159] [IN]  PID: 553    TASK: ffff88003cc52b00  CPU: 1   COMMAND: "auditd"
[66781148806540] [IN]  PID: 17     TASK: ffff88003fbbab00  CPU: 1   COMMAND: "migration/1"
[66780003032440] [IN]  PID: 5601   TASK: ffff880023664080  CPU: 1   COMMAND: "kworker/1:2"
[66246988457847] [IN]  PID: 576    TASK: ffff88003a908000  CPU: 1   COMMAND: "chronyd"
[64814608465097] [IN]  PID: 877    TASK: ffff88003dcc4080  CPU: 1   COMMAND: "tuned"
[64058239054346] [IN]  PID: 41     TASK: ffff88003df7c080  CPU: 0   COMMAND: "kswapd0"
[63973989336167] [IN]  PID: 2439   TASK: ffff8800363a8000  CPU: 1   COMMAND: "sshd"
[63973988922623] [IN]  PID: 2463   TASK: ffff88003a3c2b00  CPU: 1   COMMAND: "bash"
[48761462305916] [IN]  PID: 323    TASK: ffff88003a2d9580  CPU: 1   COMMAND: "kworker/u5:0"
[47144992352413] [IN]  PID: 2462   TASK: ffff88003a3c5600  CPU: 0   COMMAND: "su"
[47144973301719] [IN]  PID: 2459   TASK: ffff88003c7bc080  CPU: 0   COMMAND: "sudo"
[47143593082031] [IN]  PID: 2440   TASK: ffff88003fb42b00  CPU: 0   COMMAND: "bash"
[47141229692450] [IN]  PID: 2436   TASK: ffff8800363aab00  CPU: 1   COMMAND: "sshd"
[46801273145488] [IN]  PID: 2407   TASK: ffff8800399cd600  CPU: 1   COMMAND: "su"
[46801249654976] [IN]  PID: 2404   TASK: ffff88003a3c0000  CPU: 1   COMMAND: "sudo"
[46799180716297] [IN]  PID: 2340   TASK: ffff88003dcdab00  CPU: 0   COMMAND: "bash"
[45891490302616] [IN]  PID: 2336   TASK: ffff88003dcc2b00  CPU: 1   COMMAND: "sshd"
[   13202418030] [IN]  PID: 939    TASK: ffff88003dcc5600  CPU: 1   COMMAND: "tuned"
[   13182382728] [IN]  PID: 937    TASK: ffff88003df79580  CPU: 1   COMMAND: "tuned"
[   13166754457] [IN]  PID: 936    TASK: ffff88003a8e4080  CPU: 0   COMMAND: "gmain"
[   12728443201] [IN]  PID: 874    TASK: ffff88003a90ab00  CPU: 0   COMMAND: "rsyslogd"
[    9875184187] [IN]  PID: 437    TASK: ffff8800363ac080  CPU: 1   COMMAND: "systemd-udevd"
[    9543608280] [IN]  PID: 593    TASK: ffff88003be1c080  CPU: 0   COMMAND: "agetty"
[    5877264907] [IN]  PID: 610    TASK: ffff88003a909580  CPU: 1   COMMAND: "gmain"
[    4544673780] [IN]  PID: 23     TASK: ffff88003fbf4080  CPU: 1   COMMAND: "kdevtmpfs"
[    4535151979] [IN]  PID: 596    TASK: ffff88003a989580  CPU: 1   COMMAND: "runaway-killer-"
[    4534482665] [IN]  PID: 594    TASK: ffff88003be1ab00  CPU: 0   COMMAND: "JS Sour~ Thread"
[    4519813916] [IN]  PID: 592    TASK: ffff88003be1d600  CPU: 1   COMMAND: "JS GC Helper"
[    4491563091] [IN]  PID: 584    TASK: ffff88003be18000  CPU: 0   COMMAND: "gmain"
[    4329930894] [IN]  PID: 564    TASK: ffff88003cc55600  CPU: 1   COMMAND: "dbus-daemon"
[    4111166111] [IN]  PID: 14     TASK: ffff88003fb95600  CPU: 0   COMMAND: "cpuhp/0"
[    4102748854] [IN]  PID: 15     TASK: ffff88003fbb8000  CPU: 1   COMMAND: "cpuhp/1"
[    4023245737] [IN]  PID: 502    TASK: ffff880039c02b00  CPU: 0   COMMAND: "btrfs-extent-re"
[    4023205983] [IN]  PID: 501    TASK: ffff880039c01580  CPU: 0   COMMAND: "btrfs-qgroup-re"
[    4023166619] [IN]  PID: 500    TASK: ffff880039c00000  CPU: 0   COMMAND: "btrfs-readahead"
[    4022763948] [IN]  PID: 499    TASK: ffff880039e7d600  CPU: 0   COMMAND: "btrfs-delayed-m"
[    4022725694] [IN]  PID: 498    TASK: ffff880039e7c080  CPU: 0   COMMAND: "btrfs-freespace"
[    4022687960] [IN]  PID: 497    TASK: ffff880039e7ab00  CPU: 0   COMMAND: "btrfs-endio-wri"
[    4022649715] [IN]  PID: 496    TASK: ffff880039e79580  CPU: 0   COMMAND: "btrfs-rmw"
[    4022596237] [IN]  PID: 495    TASK: ffff880039e78000  CPU: 0   COMMAND: "btrfs-endio-rep"
[    4022553019] [IN]  PID: 494    TASK: ffff880039d45600  CPU: 0   COMMAND: "btrfs-endio-rai"
[    4022270333] [IN]  PID: 493    TASK: ffff880039d44080  CPU: 0   COMMAND: "btrfs-endio-met"
[    4022187789] [IN]  PID: 492    TASK: ffff880039d42b00  CPU: 0   COMMAND: "btrfs-endio-met"
[    4022152262] [IN]  PID: 490    TASK: ffff880039d41580  CPU: 0   COMMAND: "btrfs-endio"
[    4015271795] [IN]  PID: 489    TASK: ffff880039d40000  CPU: 1   COMMAND: "btrfs-fixup"
[    4009744449] [IN]  PID: 487    TASK: ffff88003a98c080  CPU: 0   COMMAND: "btrfs-submit"
[    4007793306] [IN]  PID: 486    TASK: ffff88003a988000  CPU: 1   COMMAND: "btrfs-cache"
[    4004354466] [IN]  PID: 484    TASK: ffff88003bb8ab00  CPU: 0   COMMAND: "btrfs-flush_del"
[    4000721584] [IN]  PID: 483    TASK: ffff88003bb8d600  CPU: 1   COMMAND: "btrfs-delalloc"
[    3998239904] [IN]  PID: 482    TASK: ffff88003bb8c080  CPU: 0   COMMAND: "btrfs-worker-hi"
[    3994395002] [IN]  PID: 481    TASK: ffff88003bb89580  CPU: 0   COMMAND: "btrfs-worker"
[    3876841558] [IN]  PID: 463    TASK: ffff88003bb88000  CPU: 0   COMMAND: "nfit"
[    3056246826] [IN]  PID: 4      TASK: ffff88003fb44080  CPU: 0   COMMAND: "kworker/0:0H"
[    3052889052] [IN]  PID: 20     TASK: ffff88003fbf0000  CPU: 1   COMMAND: "kworker/1:0H"
[    2612557318] [IN]  PID: 341    TASK: ffff8800398f1580  CPU: 1   COMMAND: "btrfs-extent-re"
[    2612263455] [IN]  PID: 340    TASK: ffff8800398f0000  CPU: 1   COMMAND: "btrfs-qgroup-re"
[    2611026509] [IN]  PID: 339    TASK: ffff880039ecd600  CPU: 1   COMMAND: "btrfs-readahead"
[    2610979062] [IN]  PID: 338    TASK: ffff880039ecc080  CPU: 1   COMMAND: "btrfs-delayed-m"
[    2610943773] [IN]  PID: 337    TASK: ffff880039ecab00  CPU: 1   COMMAND: "btrfs-freespace"
[    2610796167] [IN]  PID: 336    TASK: ffff880039ec9580  CPU: 1   COMMAND: "btrfs-endio-wri"
[    2610736280] [IN]  PID: 335    TASK: ffff880039ec8000  CPU: 1   COMMAND: "btrfs-rmw"
[    2610672683] [IN]  PID: 334    TASK: ffff880039ee5600  CPU: 1   COMMAND: "btrfs-endio-rep"
[    2610615219] [IN]  PID: 333    TASK: ffff880039ee4080  CPU: 1   COMMAND: "btrfs-endio-rai"
[    2610555397] [IN]  PID: 332    TASK: ffff880039ee2b00  CPU: 1   COMMAND: "btrfs-endio-met"
[    2610494628] [IN]  PID: 331    TASK: ffff880039ee1580  CPU: 1   COMMAND: "btrfs-endio-met"
[    2610434824] [IN]  PID: 330    TASK: ffff880039ee0000  CPU: 1   COMMAND: "btrfs-endio"
[    2610369422] [IN]  PID: 329    TASK: ffff88003a911580  CPU: 1   COMMAND: "btrfs-fixup"
[    2610309520] [IN]  PID: 328    TASK: ffff88003a915600  CPU: 1   COMMAND: "btrfs-submit"
[    2610244336] [IN]  PID: 327    TASK: ffff8800361b5600  CPU: 1   COMMAND: "btrfs-cache"
[    2610100243] [IN]  PID: 326    TASK: ffff88003a8fab00  CPU: 1   COMMAND: "btrfs-flush_del"
[    2610027400] [IN]  PID: 325    TASK: ffff88003a98ab00  CPU: 1   COMMAND: "btrfs-delalloc"
[    2609608030] [IN]  PID: 324    TASK: ffff88003a2dd600  CPU: 1   COMMAND: "btrfs-worker-hi"
[    2609443664] [IN]  PID: 322    TASK: ffff88003a2d8000  CPU: 1   COMMAND: "btrfs-worker"
[    2555519433] [IN]  PID: 314    TASK: ffff88003a2dab00  CPU: 0   COMMAND: "bioset"
[    2269801721] [IN]  PID: 268    TASK: ffff88003a8f8000  CPU: 0   COMMAND: "scsi_eh_0"
[    2257640722] [IN]  PID: 271    TASK: ffff880036b5ab00  CPU: 0   COMMAND: "scsi_eh_2"
[    2114105126] [IN]  PID: 280    TASK: ffff880036931580  CPU: 0   COMMAND: "ttm_swap"
[    2105308766] [IN]  PID: 277    TASK: ffff8800361b2b00  CPU: 0   COMMAND: "bioset"
[    2101715141] [IN]  PID: 276    TASK: ffff8800361b1580  CPU: 1   COMMAND: "bioset"
[    2095462148] [IN]  PID: 274    TASK: ffff880036934080  CPU: 1   COMMAND: "vmw_pvscsi_wq_1"
[    2090856254] [IN]  PID: 273    TASK: ffff880036b58000  CPU: 0   COMMAND: "scsi_tmf_2"
[    2090315361] [IN]  PID: 272    TASK: ffff88003a8fc080  CPU: 1   COMMAND: "scsi_tmf_1"
[    2086686421] [IN]  PID: 270    TASK: ffff880036b5d600  CPU: 1   COMMAND: "scsi_eh_1"
[    2085309360] [IN]  PID: 269    TASK: ffff880036b5c080  CPU: 0   COMMAND: "scsi_tmf_0"
[    2079346645] [IN]  PID: 267    TASK: ffff88003a8fd600  CPU: 1   COMMAND: "ata_sff"
[    1548966015] [IN]  PID: 91     TASK: ffff88003a912b00  CPU: 1   COMMAND: "charger_manager"
[    1509549094] [IN]  PID: 78     TASK: ffff88003a910000  CPU: 1   COMMAND: "ipv6_addrconf"
[    1473328109] [IN]  PID: 76     TASK: ffff88003a9b0000  CPU: 0   COMMAND: "kaluad_sync"
[    1473216616] [IN]  PID: 75     TASK: ffff88003a9b1580  CPU: 0   COMMAND: "kaluad"
[    1473136969] [IN]  PID: 74     TASK: ffff88003a9b5600  CPU: 0   COMMAND: "kmpath_rdacd"
[    1465862404] [IN]  PID: 73     TASK: ffff88003a9b4080  CPU: 0   COMMAND: "acpi_thermal_pm"
[    1424377931] [IN]  PID: 72     TASK: ffff88003a9b2b00  CPU: 1   COMMAND: "kthrotld"
[    1038363757] [IN]  PID: 43     TASK: ffff88003a8e0000  CPU: 1   COMMAND: "bioset"
[    1038299274] [IN]  PID: 42     TASK: ffff88003df7d600  CPU: 1   COMMAND: "vmstat"
[     518060696] [IN]  PID: 38     TASK: ffff88003df78000  CPU: 1   COMMAND: "watchdogd"
[     472319573] [IN]  PID: 37     TASK: ffff88003dcdd600  CPU: 1   COMMAND: "devfreq_wq"
[     472212721] [IN]  PID: 36     TASK: ffff88003dcdc080  CPU: 1   COMMAND: "md"
[     154943148] [IN]  PID: 33     TASK: ffff88003dcd9580  CPU: 1   COMMAND: "kblockd"
[     154781957] [IN]  PID: 32     TASK: ffff88003dcd8000  CPU: 1   COMMAND: "bioset"
[     154590679] [IN]  PID: 31     TASK: ffff88003dcd5600  CPU: 1   COMMAND: "kintegrityd"
[     154521031] [IN]  PID: 30     TASK: ffff88003dcd4080  CPU: 1   COMMAND: "crypto"
[     153833820] [IN]  PID: 28     TASK: ffff88003dcc1580  CPU: 0   COMMAND: "ksmd"
[     153653459] [IN]  PID: 27     TASK: ffff88003dcd1580  CPU: 1   COMMAND: "kcompactd0"
[     153524673] [IN]  PID: 26     TASK: ffff88003dcd0000  CPU: 1   COMMAND: "writeback"
[     152959346] [IN]  PID: 25     TASK: ffff88003dcc0000  CPU: 0   COMMAND: "oom_reaper"
[     144428302] [IN]  PID: 24     TASK: ffff88003fbf5600  CPU: 1   COMMAND: "netns"
[     140877813] [IN]  PID: 22     TASK: ffff88003fbf2b00  CPU: 1   COMMAND: "rcuob/1"
[     135819063] [IN]  PID: 10     TASK: ffff88003fb8d600  CPU: 0   COMMAND: "rcuob/0"
[     134951627] [IN]  PID: 12     TASK: ffff88003fb91580  CPU: 0   COMMAND: "lru-add-drain"
[     134141337] [IN]  PID: 8      TASK: ffff88003fb8ab00  CPU: 0   COMMAND: "rcu_bh"
[             0] [RU]  PID: 0      TASK: ffffffff81c10500  CPU: 0   COMMAND: "swapper/0"
[             0] [RU]  PID: 0      TASK: ffff88003fb94080  CPU: 1   COMMAND: "swapper/1"
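To relate the listing above to the panic time, each bracketed value can be compared with the dmesg timestamp of the BUG. This is only a sketch: it assumes the bracketed numbers are per-task last-run timestamps in nanoseconds and that they share a clock base with dmesg, neither of which is confirmed in this report. The helper names `parse_ps_line` and `dmesg_to_ns` are made up for illustration.

```python
import re

# One line of the task listing: [timestamp] [state] PID/TASK/CPU/COMMAND.
PS_LINE = re.compile(
    r'^\[\s*(\d+)\]\s+\[(\w+)\]\s+PID:\s*(\d+)\s+TASK:\s*([0-9a-f]+)'
    r'\s+CPU:\s*(\d+)\s+COMMAND:\s*"([^"]*)"'
)

def parse_ps_line(line):
    """Split one listing line into its fields; return None if it does not match."""
    m = PS_LINE.match(line)
    if not m:
        return None
    return {
        "last_run_ns": int(m.group(1)),  # assumed to be nanoseconds
        "state": m.group(2),
        "pid": int(m.group(3)),
        "task": m.group(4),
        "cpu": int(m.group(5)),
        "command": m.group(6),
    }

def dmesg_to_ns(stamp):
    """Convert a dmesg stamp such as '66871.638531' to integer nanoseconds."""
    secs, frac = stamp.split(".")
    return int(secs) * 1_000_000_000 + int(frac.ljust(9, "0"))

row = parse_ps_line(
    '[66870298066092] [IN]  PID: 579    TASK: ffff88003c7b8000  CPU: 0   COMMAND: "crond"'
)
gap_ns = dmesg_to_ns("66871.638531") - row["last_run_ns"]
print(row["command"], "last ran", gap_ns / 1e9, "s before the BUG line")
```

Under those assumptions, a task whose gap is small was still being scheduled shortly before the BUG was logged.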


According to the log, the panic happened in interrupt context:

[66871.638531] ------------[ cut here ]------------
[66871.651583] kernel BUG at drivers/net/vmxnet3/vmxnet3_drv.c:1413!
[66871.651706] invalid opcode: 0000 [#1] SMP
[66871.651819] Modules linked in: ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vmw_vsock_vmci_transport vsock ppdev coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cryptd vmw_balloon intel_rapl_perf pcspkr input_leds sg i2c_piix4 nfit vmw_vmci shpchp parport_pc parport acpi_cpufreq ip_tables btrfs xor raid6_pq ata_generic pata_acpi sd_mod crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea vmxnet3 sysfillrect
[66871.652887]  sysimgblt fb_sys_fops ttm drm vmw_pvscsi ata_piix libata fjes
[66871.653513] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.10.3-1.el7.elrepo.x86_64 #1
[66871.653663] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[66871.653916] task: ffff88003fb94080 task.stack: ffffc9000020c000
[66871.654642] RIP: 0010:vmxnet3_rq_rx_complete+0xd35/0xdf0 [vmxnet3]
[66871.654791] RSP: 0018:ffff88003fd03e10 EFLAGS: 00010297
[66871.654894] RAX: 0000000000000040 RBX: ffff88003629d1e8 RCX: ffff88003d658700
[66871.655019] RDX: 0000000000000004 RSI: 0000000000000001 RDI: 0000000000000040
[66871.655126] RBP: ffff88003fd03e88 R08: 0000000000000030 R09: 0000000000000000
[66871.655246] R10: ffff88003b3cc3c0 R11: ffff88003629c900 R12: ffff88003c09f280
[66871.655358] R13: ffff88003b3560f0 R14: ffff88003629d100 R15: 0000000000000028
[66871.655482] FS:  0000000000000000(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
[66871.655603] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[66871.655717] CR2: 000055bd3ee9db28 CR3: 000000003c8ed000 CR4: 00000000001406e0
[66871.656197] Call Trace:
[66871.657426]  <IRQ>
[66871.660567]  ? ktime_get+0x3c/0xb0
[66871.660709]  vmxnet3_poll_rx_only+0x36/0xa0 [vmxnet3]
[66871.662274]  net_rx_action+0x260/0x3c0
[66871.664457]  __do_softirq+0xc9/0x28c
[66871.674621]  irq_exit+0xd9/0xf0
[66871.675640]  do_IRQ+0x51/0xd0
[66871.676849]  common_interrupt+0x93/0x93
[66871.677796] RIP: 0010:native_safe_halt+0x6/0x10
[66871.678663] RSP: 0018:ffffc9000020fe80 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff1b
[66871.681640] RAX: 0000000000000000 RBX: ffff88003fb94080 RCX: 0000000000000000
[66871.682518] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[66871.683502] RBP: ffffc9000020fe80 R08: 00003cd1cbb71906 R09: 0000000000000000
[66871.684346] R10: 0000000000000204 R11: 0000000000000000 R12: 0000000000000001
[66871.685258] R13: ffff88003fb94080 R14: 0000000000000000 R15: 0000000000000000
[66871.686044]  </IRQ>
[66871.686797]  default_idle+0x1e/0xd0
[66871.688424]  arch_cpu_idle+0xf/0x20
[66871.689185]  default_idle_call+0x2c/0x40
[66871.690262]  do_idle+0x158/0x200
[66871.690948]  cpu_startup_entry+0x71/0x80
[66871.692150]  start_secondary+0x154/0x190
[66871.693378]  start_cpu+0x14/0x14
[66871.693999] Code: 78 2c 12 a0 48 c7 c7 90 43 12 a0 31 c0 44 89 4d b0 4c 89 5d b8 e8 ec d4 28 e1 4c 8b 5d b8 44 8b 4d b0 e9 8e f5 ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 41 3b 96 78 01 00 00 0f 84 0a f4 ff ff 0f 0b 41 f6
[66871.696089] RIP: vmxnet3_rq_rx_complete+0xd35/0xdf0 [vmxnet3] RSP: ffff88003fd03e10
[66871.702943] ---[ end trace 07ef0fdac6ebe666 ]---
[66871.703653] Kernel panic - not syncing: Fatal exception in interrupt
[66871.708828] Kernel Offset: disabled
[66871.712014] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
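The key facts in a log like the one above (BUG location, CPU, PID, command name, faulting function and opcode) can be pulled out mechanically. The following is a minimal sketch, not a general oops parser; the function name `summarize_oops` is made up, and the regular expressions only assume the line shapes seen in this particular log. On x86, BUG() is implemented with the `ud2` instruction (bytes `0f 0b`), which is why the log reports "invalid opcode".

```python
import re

# Lines copied from the log above (only the ones the sketch needs).
OOPS = """\
[66871.651583] kernel BUG at drivers/net/vmxnet3/vmxnet3_drv.c:1413!
[66871.651706] invalid opcode: 0000 [#1] SMP
[66871.653513] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.10.3-1.el7.elrepo.x86_64 #1
[66871.654642] RIP: 0010:vmxnet3_rq_rx_complete+0xd35/0xdf0 [vmxnet3]
[66871.693999] Code: 78 2c 12 a0 48 c7 c7 90 43 12 a0 31 c0 44 89 4d b0 4c 89 5d b8 e8 ec d4 28 e1 4c 8b 5d b8 44 8b 4d b0 e9 8e f5 ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 41 3b 96 78 01 00 00 0f 84 0a f4 ff ff 0f 0b 41 f6
"""

def summarize_oops(text):
    """Collect BUG location, CPU/PID/comm, RIP symbol and faulting opcode."""
    info = {}
    m = re.search(r"kernel BUG at (\S+):(\d+)!", text)
    if m:
        info["file"], info["line"] = m.group(1), int(m.group(2))
    m = re.search(r"CPU: (\d+) PID: (\d+) Comm: (\S+)", text)
    if m:
        info["cpu"] = int(m.group(1))
        info["pid"] = int(m.group(2))
        info["comm"] = m.group(3)
    m = re.search(r"RIP: [0-9a-f]{4}:(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+", text)
    if m:
        info["rip_symbol"] = m.group(1)
    m = re.search(r"Code: (.*)", text)
    if m:
        code_bytes = m.group(1).split()
        for i, byte in enumerate(code_bytes):
            if byte.startswith("<"):
                # The byte in <...> is where RIP points; take it and the next one.
                info["opcode"] = " ".join([byte.strip("<>")] + code_bytes[i + 1:i + 2])
                break
    return info

info = summarize_oops(OOPS)
print(info)
print("ud2 (BUG) at RIP:", info.get("opcode") == "0f 0b")
```

The CPU and command name recovered this way identify which task was current when the BUG fired; here that is the idle task on CPU 1, interrupted by the network softirq.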


Should I run "bt" on every task and look for the vmxnet3_rq_rx_complete panic stack? I don't know whether I will find one when crash cannot find it by itself.
Comment 25 Remigiusz Szczepanik 2017-03-26 17:17:32 UTC
So according to the log, the panic happened in task "ffff88003fb94080", which is PID 0 on CPU 1, i.e. "swapper/1".
I have checked "irq" and found nothing interesting there. But I can give you the "task" output for the task where the panic occurred:


PID: 0      TASK: ffff88003fb94080  CPU: 1   COMMAND: "swapper/1"
struct task_struct {
  thread_info = {
    flags = 8
  },
  state = 0,
  stack = 0xffffc9000020c000,
  usage = {
    counter = 2
  },
  flags = 2097218,
  ptrace = 0,
  wake_entry = {
    next = 0x0
  },
  on_cpu = 1,
  cpu = 1,
  wakee_flips = 13,
  wakee_flip_decay_ts = 4361538059,
  last_wakee = 0xffff8800363ad600,
  wake_cpu = 1,
  on_rq = 1,
  prio = 120,
  static_prio = 120,
  normal_prio = 120,
  rt_priority = 0,
  sched_class = 0xffffffff8181cf80 <idle_sched_class>,
  se = {
    load = {
      weight = 1048576,
      inv_weight = 4194304
    },
    run_node = {
      __rb_parent_color = 1,
      rb_right = 0x0,
      rb_left = 0x0
    },
    group_node = {
      next = 0xffff88003fb94128,
      prev = 0xffff88003fb94128
    },
    on_rq = 0,
    exec_start = 135180636,
    sum_exec_runtime = 0,
    vruntime = 0,
    prev_sum_exec_runtime = 0,
    nr_migrations = 0,
    statistics = {
      wait_start = 0,
      wait_max = 0,
      wait_count = 0,
      wait_sum = 0,
      iowait_count = 0,
      iowait_sum = 0,
      sleep_start = 0,
      sleep_max = 0,
      sum_sleep_runtime = 0,
      block_start = 0,
      block_max = 0,
      exec_max = 0,
      slice_max = 0,
      nr_migrations_cold = 0,
      nr_failed_migrations_affine = 0,
      nr_failed_migrations_running = 0,
      nr_failed_migrations_hot = 0,
      nr_forced_migrations = 0,
      nr_wakeups = 0,
      nr_wakeups_sync = 0,
      nr_wakeups_migrate = 0,
      nr_wakeups_local = 0,
      nr_wakeups_remote = 0,
      nr_wakeups_affine = 0,
      nr_wakeups_affine_attempts = 0,
      nr_wakeups_passive = 0,
      nr_wakeups_idle = 0
    },
    depth = 0,
    parent = 0x0,
    cfs_rq = 0xffff88003fd19530,
    my_q = 0x0,
    avg = {
      last_update_time = 0,
      load_sum = 48887808,
      util_sum = 0,
      period_contrib = 1023,
      load_avg = 1024,
      util_avg = 0
    }
  },
  rt = {
    run_list = {
      next = 0xffff88003fb942c0,
      prev = 0xffff88003fb942c0
    },
    timeout = 0,
    watchdog_stamp = 0,
    time_slice = 100,
    on_rq = 0,
    on_list = 0,
    back = 0x0,
    parent = 0x0,
    rt_rq = 0xffff88003fd19670,
    my_q = 0x0
  },
  sched_task_group = 0xffffffff82015b80 <root_task_group>,
  dl = {
    rb_node = {
      __rb_parent_color = 18446612133383324432,
      rb_right = 0x0,
      rb_left = 0x0
    },
    dl_runtime = 0,
    dl_deadline = 0,
    dl_period = 0,
    dl_bw = 0,
    runtime = 0,
    deadline = 0,
    flags = 0,
    dl_throttled = 0,
    dl_boosted = 0,
    dl_yielded = 0,
    dl_timer = {
      node = {
        node = {
          __rb_parent_color = 18446612133383324520,
          rb_right = 0x0,
          rb_left = 0x0
        },
        expires = 0
      },
      _softexpires = 0,
      function = 0xffffffff810ca3d0 <dl_task_timer>,
      base = 0xffff88003fc12600,
      state = 0 '\000',
      is_rel = 0 '\000',
      start_pid = -1,
      start_site = 0x0,
      start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
    }
  },
  preempt_notifiers = {
    first = 0x0
  },
  btrace_seq = 0,
  policy = 0,
  nr_cpus_allowed = 1,
  cpus_allowed = {
    bits = {}
  },
  sched_info = {
    pcount = 0,
    run_delay = 0,
    last_arrival = 0,
    last_queued = 0
  },
  tasks = {
    next = 0xffff88003fb41cf8,
    prev = 0xffffffff81c10c78 <init_task+1912>
  },
  pushable_tasks = {
    prio = 140,
    prio_list = {
      next = 0xffff88003fb94810,
      prev = 0xffff88003fb94810
    },
    node_list = {
      next = 0xffff88003fb94820,
      prev = 0xffff88003fb94820
    }
  },
  pushable_dl_tasks = {
    __rb_parent_color = 18446612133383325744,
    rb_right = 0x0,
    rb_left = 0x0
  },
  mm = 0x0,
  active_mm = 0xffff880036057440,
  vmacache_seqnum = 0,
  vmacache = {0x0, 0x0, 0x0, 0x0},
  rss_stat = {
    events = 0,
    count = {0, 0, 0, 0}
  },
  exit_state = 0,
  exit_code = 0,
  exit_signal = 0,
  pdeath_signal = 0,
  jobctl = 0,
  personality = 0,
  sched_reset_on_fork = 0,
  sched_contributes_to_load = 1,
  sched_migrated = 0,
  sched_remote_wakeup = 0,
  in_execve = 0,
  in_iowait = 0,
  restore_sigmask = 0,
  memcg_may_oom = 0,
  memcg_kmem_skip_account = 0,
  atomic_flags = 0,
  restart_block = {
    fn = 0xffffffff810942a0 <do_no_restart_syscall>,
    {
      futex = {
        uaddr = 0x0,
        val = 0,
        flags = 0,
        bitset = 0,
        time = 0,
        uaddr2 = 0x0
      },
      nanosleep = {
        clockid = 0,
        rmtp = 0x0,
        compat_rmtp = 0x0,
        expires = 0
      },
      poll = {
        ufds = 0x0,
        nfds = 0,
        has_timeout = 0,
        tv_sec = 0,
        tv_nsec = 0
      }
    }
  },
  pid = 0,
  tgid = 0,
  stack_canary = 5645735404757978967,
  real_parent = 0xffff88003fb40000,
  parent = 0xffffffff81c10500 <init_task>,
  children = {
    next = 0xffff88003fb94918,
    prev = 0xffff88003fb94918
  },
  sibling = {
    next = 0xffff88003fb94928,
    prev = 0xffff88003fb94928
  },
  group_leader = 0xffff88003fb94080,
  ptraced = {
    next = 0xffff88003fb408c0,
    prev = 0xffff88003fb408c0
  },
  ptrace_entry = {
    next = 0xffff88003fb408d0,
    prev = 0xffff88003fb408d0
  },
  pids = {{
      node = {
        next = 0x0,
        pprev = 0x0
      },
      pid = 0xffffffff81c56c00 <init_struct_pid>
    }, {
      node = {
        next = 0x0,
        pprev = 0x0
      },
      pid = 0xffffffff81c56c00 <init_struct_pid>
    }, {
      node = {
        next = 0x0,
        pprev = 0x0
      },
      pid = 0xffffffff81c56c00 <init_struct_pid>
    }},
  thread_group = {
    next = 0xffff88003fb949a8,
    prev = 0xffff88003fb949a8
  },
  thread_node = {
    next = 0xffff88003fb53750,
    prev = 0xffff88003fb53750
  },
  vfork_done = 0x0,
  set_child_tid = 0x0,
  clear_child_tid = 0x0,
  utime = 0,
  stime = 2434000000,
  gtime = 0,
  prev_cputime = {
    utime = 0,
    stime = 0,
    lock = {
      raw_lock = {
        val = {
          counter = 0
        }
      }
    }
  },
  vtime_seqcount = {
    sequence = 6
  },
  vtime_snap = 4294667340,
  vtime_snap_whence = VTIME_SYS,
  tick_dep_mask = {
    counter = 0
  },
  nvcsw = 0,
  nivcsw = 1140396,
  start_time = 44000000,
  real_start_time = 44000000,
  min_flt = 0,
  maj_flt = 0,
  cputime_expires = {
    utime = 0,
    stime = 0,
    sum_exec_runtime = 0
  },
  cpu_timers = {{
      next = 0xffff88003fb94a70,
      prev = 0xffff88003fb94a70
    }, {
      next = 0xffff88003fb94a80,
      prev = 0xffff88003fb94a80
    }, {
      next = 0xffff88003fb94a90,
      prev = 0xffff88003fb94a90
    }},
  ptracer_cred = 0x0,
  real_cred = 0xffff88003fa9e9c0,
  cred = 0xffff88003fa9e9c0,
  comm = "swapper/1\000\000\000\000\000\000",
  nameidata = 0x0,
  sysvsem = {
    undo_list = 0x0
  },
  sysvshm = {
    shm_clist = {
      next = 0xffff88003fb94ad8,
      prev = 0xffff88003fb94ad8
    }
  },
  fs = 0xffff88003fa973c0,
  files = 0xffff88003facc2c0,
  nsproxy = 0xffffffff81c56ea0 <init_nsproxy>,
  signal = 0xffff88003fb53740,
  sighand = 0xffff88003fb4eb40,
  blocked = {
    sig = {0}
  },
  real_blocked = {
    sig = {0}
  },
  saved_sigmask = {
    sig = {0}
  },
  pending = {
    list = {
      next = 0xffff88003fb94b28,
      prev = 0xffff88003fb94b28
    },
    signal = {
      sig = {0}
    }
  },
  sas_ss_sp = 0,
  sas_ss_size = 0,
  sas_ss_flags = 2,
  task_works = 0x0,
  audit_context = 0x0,
  loginuid = {
    val = 4294967295
  },
  sessionid = 4294967295,
  seccomp = {
    mode = 0,
    filter = 0x0
  },
  parent_exec_id = 0,
  self_exec_id = 0,
  alloc_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  pi_lock = {
    raw_lock = {
      val = {
        counter = 0
      }
    }
  },
  wake_q = {
    next = 0x0
  },
  pi_waiters = {
    rb_node = 0x0
  },
  pi_waiters_leftmost = 0x0,
  pi_blocked_on = 0x0,
  journal_info = 0x0,
  bio_list = 0x0,
  plug = 0x0,
  reclaim_state = 0x0,
  backing_dev_info = 0x0,
  io_context = 0x0,
  ptrace_message = 0,
  last_siginfo = 0x0,
  ioac = {
    rchar = 0,
    wchar = 0,
    syscr = 0,
    syscw = 0,
    read_bytes = 0,
    write_bytes = 0,
    cancelled_write_bytes = 0
  },
  acct_rss_mem1 = 0,
  acct_vm_mem1 = 0,
  acct_timexpd = 0,
  mems_allowed = {
    bits = {1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
  },
  mems_allowed_seq = {
    sequence = 0
  },
  cpuset_mem_spread_rotor = -1,
  cpuset_slab_spread_rotor = -1,
  cgroups = 0xffffffff81c8cca0 <init_css_set>,
  cg_list = {
    next = 0xffff88003fb94cd8,
    prev = 0xffff88003fb94cd8
  },
  robust_list = 0x0,
  compat_robust_list = 0x0,
  pi_state_list = {
    next = 0xffff88003fb94cf8,
    prev = 0xffff88003fb94cf8
  },
  pi_state_cache = 0x0,
  perf_event_ctxp = {0x0, 0x0},
  perf_event_mutex = {
    owner = {
      counter = 0
    },
    wait_lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    osq = {
      tail = {
        counter = 0
      }
    },
    wait_list = {
      next = 0xffff88003fb94d30,
      prev = 0xffff88003fb94d30
    }
  },
  perf_event_list = {
    next = 0xffff88003fb94d40,
    prev = 0xffff88003fb94d40
  },
  mempolicy = 0xffff88003f88d000,
  il_next = 0,
  pref_node_fork = 0,
  numa_scan_seq = 0,
  numa_scan_period = 1000,
  numa_scan_period_max = 0,
  numa_preferred_nid = -1,
  numa_migrate_retry = 0,
  node_stamp = 0,
  last_task_numa_placement = 0,
  last_sum_exec_runtime = 0,
  numa_work = {
    next = 0xffff88003fb94d90,
    func = 0x0
  },
  numa_entry = {
    next = 0x0,
    prev = 0x0
  },
  numa_group = 0x0,
  numa_faults = 0x0,
  total_numa_faults = 0,
  numa_faults_locality = {0, 0, 0},
  numa_pages_migrated = 0,
  tlb_ubc = {
    cpumask = {
      bits = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
    },
    flush_required = false,
    writable = false
  },
  rcu = {
    next = 0x0,
    func = 0x0
  },
  splice_pipe = 0x0,
  task_frag = {
    page = 0x0,
    offset = 0,
    size = 0
  },
  delays = 0xffff88003fa97380,
  nr_dirtied = 0,
  nr_dirtied_pause = 32,
  dirty_paused_when = 0,
  timer_slack_ns = 50000,
  default_timer_slack_ns = 50000,
  curr_ret_stack = -1,
  ret_stack = 0x0,
  ftrace_timestamp = 0,
  trace_overrun = {
    counter = 0
  },
  tracing_graph_pause = {
    counter = 0
  },
  trace = 0,
  trace_recursion = 0,
  memcg_in_oom = 0x0,
  memcg_oom_gfp_mask = 0,
  memcg_oom_order = 0,
  memcg_nr_pages_over_high = 0,
  utask = 0x0,
  sequential_io = 0,
  sequential_io_avg = 0,
  pagefault_disabled = 0,
  oom_reaper_list = 0x0,
  stack_vm_area = 0xffff88003fb99500,
  stack_refcount = {
    counter = 1
  },
  thread = {
    tls_array = {{
        {
          {
            a = 0,
            b = 0
          },
          {
            limit0 = 0,
            base0 = 0,
            base1 = 0,
            type = 0,
            s = 0,
            dpl = 0,
            p = 0,
            limit = 0,
            avl = 0,
            l = 0,
            d = 0,
            g = 0,
            base2 = 0
          }
        }
      }, {
        {
          {
            a = 0,
            b = 0
          },
          {
            limit0 = 0,
            base0 = 0,
            base1 = 0,
            type = 0,
            s = 0,
            dpl = 0,
            p = 0,
            limit = 0,
            avl = 0,
            l = 0,
            d = 0,
            g = 0,
            base2 = 0
          }
        }
      }, {
        {
          {
            a = 0,
            b = 0
          },
          {
            limit0 = 0,
            base0 = 0,
            base1 = 0,
            type = 0,
            s = 0,
            dpl = 0,
            p = 0,
            limit = 0,
            avl = 0,
            l = 0,
            d = 0,
            g = 0,
            base2 = 0
          }
        }
      }},
    sp0 = 18446683600572186624,
    sp = 18446683600572186144,
    es = 0,
    ds = 0,
    fsindex = 0,
    gsindex = 0,
    status = 0,
    fsbase = 0,
    gsbase = 0,
    ptrace_bps = {0x0, 0x0, 0x0, 0x0},
    debugreg6 = 0,
    ptrace_dr7 = 0,
    cr2 = 0,
    trap_nr = 6,
    error_code = 0,
    io_bitmap_ptr = 0x0,
    iopl = 0,
    io_bitmap_max = 0,
    addr_limit = {
      seg = 18446744073709551615
    },
    sig_on_uaccess_err = 0,
    uaccess_err = 0,
    fpu = {
      last_cpu = 4294967295,
      fpstate_active = 0 '\000',
      fpregs_active = 0 '\000',
      state = {
        fsave = {
          cwd = 0,
          swd = 0,
          twd = 0,
          fip = 0,
          fcs = 0,
          foo = 0,
          fos = 0,
          st_space = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
          status = 0
        },
        fxsave = {
          cwd = 0,
          swd = 0,
          twd = 0,
          fop = 0,
          {
            {
              rip = 0,
              rdp = 0
            },
            {
              fip = 0,
              fcs = 0,
              foo = 0,
              fos = 0
            }
          },
          mxcsr = 0,
          mxcsr_mask = 0,
          st_space = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
          xmm_space = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
          padding = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
          {
            padding1 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
            sw_reserved = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
          }
        },
        soft = {
          cwd = 0,
          swd = 0,
          twd = 0,
          fip = 0,
          fcs = 0,
          foo = 0,
          fos = 0,
          st_space = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
          ftop = 0 '\000',
          changed = 0 '\000',
          lookahead = 0 '\000',
          no_update = 0 '\000',
          rm = 0 '\000',
          alimit = 0 '\000',
          info = 0x0,
          entry_eip = 0
        },
        xsave = {
          i387 = {
            cwd = 0,
            swd = 0,
            twd = 0,
            fop = 0,
            {
              {
                rip = 0,
                rdp = 0
              },
              {
                fip = 0,
                fcs = 0,
                foo = 0,
                fos = 0
              }
            },
            mxcsr = 0,
            mxcsr_mask = 0,
            st_space = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
            xmm_space = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
            padding = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
            {
              padding1 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
              sw_reserved = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
            }
          },
          header = {
            xfeatures = 0,
            xcomp_bv = 0,
            reserved = {0, 0, 0, 0, 0, 0}
          },
          extended_state_area = 0xffff88003fb95600 ""
        },
        __padding
      }
    }
  }
}

struct thread_info {
  flags = 8
}
Comment 26 Shrikrishna Khare 2017-03-26 17:22:44 UTC
(In reply to Remigiusz Szczepanik from comment #24)

> Should I do "bt" on every task and look for vmxnet3_rq_rx_complete panic
> stack? I don't know if I find one when crash cannot find it itself.

try

set logging on
thread apply all bt

i.e. redirect the backtraces of all threads to a file, and then grep for vmxnet3_rq_rx_complete.

Once the thread is found, go back to the crash prompt:

thread <thread number>
bt # should show stack with vmxnet3_rq_rx_complete

Now run the commands to get the rcd and rbi values, etc.
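Once the backtraces have been saved to a text file by whatever means the debugger offers, the grep step can also be scripted so that it reports which task the matching stack belongs to. This is a rough sketch under stated assumptions: each task's section is assumed to start with a `PID: ... TASK: ... CPU: ... COMMAND: "..."` header like the ones pasted earlier in this report, the sample text below is illustrative rather than real output, and `tasks_with_symbol` is a made-up helper name.

```python
import re

# Header that is assumed to start each task's backtrace in the saved dump.
HEADER = re.compile(
    r'^PID:\s*(\d+)\s+TASK:\s*([0-9a-f]+)\s+CPU:\s*(\d+)\s+COMMAND:\s*"([^"]*)"'
)

def tasks_with_symbol(dump_text, symbol):
    """Return the tasks whose backtrace section mentions the given symbol."""
    hits = []
    current = None
    for line in dump_text.splitlines():
        m = HEADER.match(line)
        if m:
            current = {
                "pid": int(m.group(1)),
                "task": m.group(2),
                "cpu": int(m.group(3)),
                "command": m.group(4),
            }
        elif current is not None and symbol in line and current not in hits:
            hits.append(current)
    return hits

# Illustrative dump: headers follow the format in this report, frames are made up.
SAMPLE = '''\
PID: 0      TASK: ffffffff81c10500  CPU: 0   COMMAND: "swapper/0"
 #0 [illustrative frame] default_idle
PID: 0      TASK: ffff88003fb94080  CPU: 1   COMMAND: "swapper/1"
 #0 [illustrative frame] vmxnet3_rq_rx_complete
 #1 [illustrative frame] vmxnet3_poll_rx_only
'''

matches = tasks_with_symbol(SAMPLE, "vmxnet3_rq_rx_complete")
for t in matches:
    print(t["task"], "CPU", t["cpu"], t["command"])
```

An empty result means no saved stack contains the function, which is itself useful to report.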
Comment 27 Remigiusz Szczepanik 2017-03-26 18:24:48 UTC
The commands you gave didn't work, but I did the same thing another way.

Anyway, there is no "vmxnet3" keyword in the "bt" output of any task.

I think that's because it didn't exactly crash the kernel, but instead introduced some kind of loop/freeze.
Comment 28 Shrikrishna Khare 2017-03-26 19:35:08 UTC
(In reply to Remigiusz Szczepanik from comment #27)
> Commands you gave didn't work but I did same think in other way.
> 
> Anyway there's no "vmxnet3" keyword anywhere in any task "bt".
> 
> I think that's because it didn't exactly crash kernel but it introduced some
> kind of loop/freeze.

I thought the vmss was created after hitting the BUG_ON. Could we collect a vmss while the VM is in the BUG_ON state?
Comment 29 Remigiusz Szczepanik 2017-03-26 19:48:48 UTC
(In reply to Shrikrishna Khare from comment #28)
> (In reply to Remigiusz Szczepanik from comment #27)
> > Commands you gave didn't work but I did same think in other way.
> > 
> > Anyway there's no "vmxnet3" keyword anywhere in any task "bt".
> > 
> > I think that's because it didn't exactly crash kernel but it introduced
> some
> > kind of loop/freeze.
> 
> I thought the vmss was created after hitting BUG_ON. Let us collect vmss
> when the VM is in BUG_ON state?

Well, according to the logs it did hit BUG_ON, but for some reason the kernel kept working for some time after that (which I believe is why I don't have a trace for this crash).
I may be mistaken though.

If you tell me what I must do to get proper vmss then I will but right now I'm clueless.
Comment 30 Shrikrishna Khare 2017-03-27 00:39:56 UTC
(In reply to Remigiusz Szczepanik from comment #29)
> Well according to logs it did hit BUG_ON but for some reason kernel was
> working after that that for some time (which I believe is why I don't have a
> trace for this crash).
> I may be mistaken though.
> 
> If you tell me what I must do to get proper vmss then I will but right now
> I'm clueless.

That is strange. Let me think about this.

In the meantime, maybe we can try some other tricks to get to the bottom of this:

1. Let's narrow down the patch that may have introduced this behavior.

Given that you hit this issue relatively frequently (I haven't hit this issue even once :-(), maybe you can try disabling one of the recently added vmxnet3 features, let the VM run, and see if the issue happens?

ethtool -G eth0 rx-mini 0  # disable rx data ring
ethtool -g eth0            # check if rx-mini is indeed 0

Replace eth0 with your interface name. Whether or not the bug happens again gives us a clue.

2. Would it be possible for you to apply a vmxnet3 patch and run the patched driver? In that case I can provide a debug patch; just let me know which kernel version I should generate the patch against and where I can find the sources for that kernel, and I will share a patch.

That debug patch will print out several pieces of state info (rcd, rxd, rbi etc.) before hitting the BUG_ON. Not as good as a crash dump, but it should help us make progress.
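
The "check if rx-mini is indeed 0" step from option 1 can be automated with a small parser for the text printed by ethtool -g. This is a rough sketch: the sample output below is illustrative (its numbers are invented), and the parser assumes the usual "Pre-set maximums" / "Current hardware settings" layout.

```python
def current_rx_mini(ethtool_g_output):
    """Return the 'RX Mini' value under 'Current hardware settings', or None."""
    in_current = False
    for line in ethtool_g_output.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("current hardware settings"):
            # Only values after this header describe the live ring sizes.
            in_current = True
        elif in_current and stripped.lower().startswith("rx mini:"):
            value = stripped.split(":", 1)[1].strip()
            return int(value) if value.isdigit() else None
    return None


# Illustrative output only; the numbers are not from a real system.
SAMPLE_ETHTOOL_G = """\
Ring parameters for eth0:
Pre-set maximums:
RX:\t\t4096
RX Mini:\t2048
RX Jumbo:\t4096
TX:\t\t4096
Current hardware settings:
RX:\t\t1024
RX Mini:\t0
RX Jumbo:\t256
TX:\t\t512
"""

print(current_rx_mini(SAMPLE_ETHTOOL_G) == 0)
```

In practice the text would come from running ethtool -g on the real interface, for example via subprocess.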

Again, thank you for patiently helping with this!

Thanks,
Shri
Comment 31 Remigiusz Szczepanik 2017-03-27 17:01:06 UTC
Created attachment 255579 [details]
config-4.10.3-1.el7.elrepo.x86_64

The first option is a "what if"; the second one is actually a lot better.

I have attached config for kernel 4.10.3.
You can use it on: https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.10.3.tar.xz

After you recompile it, please provide me with RPM packages for CentOS 7 and I will replace my currently running kernel with yours and try to "crash" it again.

I don't know if this is helpful, but as this is a network problem I guess I should tell you a bit about my network.
I believe that it is essentially crashing on some kind of unexpected data in packets or frames.

On the same ESXi I'm running PfSense (2.3.3-RELEASE-p1 (amd64)) and I have my 2 VPSes on the LAN side of PfSense. Could it be that PfSense forwards a somehow malformed packet to the CentOS VMs? I don't know, it's just a random idea.

However, if I find a way to "crash" the VM at will, then I will try to capture all packets going from PfSense to that VM so we can see if any of them is malformed or unexpected.
Comment 32 Shrikrishna Khare 2017-03-28 01:49:08 UTC
Created attachment 255585 [details]
debug-patch
Comment 33 Shrikrishna Khare 2017-03-28 01:51:50 UTC
(In reply to Remigiusz Szczepanik from comment #31)
> Created attachment 255579 [details]
> config-4.10.3-1.el7.elrepo.x86_64
> 
> The first option if "what if" and the second one is actually a lot better.
> 
> I have attached config for kernel 4.10.3.
> You can use it on:
> https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.10.3.tar.xz
> 
> After you recompile it please provide me with RPM packages for CentOS 7 

Please find debug-patch in the attachments. It applies cleanly on 4.10.3 and will print several vmxnet3 data structures just before hitting the BUG_ON case. Would you be able to patch your vmxnet3 driver and try it out?

Otherwise, I can try to create an RPM package for CentOS; I did not get around to trying that today.

> and
> I will change my current running kernel with yours and try to "crash" it
> again.
> 
> I don't know if it's helpful information but as this is network problem I
> guess I have to tell you a bit about my network.
> I believe that it essentially crashing on some kind of unexpected data in
> packets or frames.
> 
> On the same ESXi I'm running PfSense (2.3.3-RELEASE-p1 (amd64)) and I have
> my 2 VPSes on LAN side of PfSense. It may or may not be a case when PfSense
> forwards somehow deformed packet to CentOSes? I don't know it's just a
> random idea.
> 
> However if I somehow find a way to "crash" VM on my whim then I will try to
> collect all packets running from PfSense to that VM so we can see if any of
> them is anyhow deformed or unexpected.

Yes, that will surely help. Thanks again!
Comment 34 Remigiusz Szczepanik 2017-03-28 17:57:13 UTC
(In reply to Shrikrishna Khare from comment #33)
> Or else, I can try to create RPM package for CentOS. Did not get around to
> trying that today.


Apparently I have very little time this week, so an RPM would be very, very nice, please.
Comment 35 Shrikrishna Khare 2017-03-29 01:51:43 UTC
(In reply to Remigiusz Szczepanik from comment #34)
> (In reply to Shrikrishna Khare from comment #33)
> > Or else, I can try to create RPM package for CentOS. Did not get around to
> > trying that today.
> 
> 
> Apparently I have very little time this week so please .RPM would be very
> very nice.

Ok. Will do.
Comment 36 Remigiusz Szczepanik 2017-03-29 18:51:17 UTC
Just a tip.

The freeze happens frequently (but not every time) when I mistype "yum" and run any "yum list whateverpacket" as a non-root user. When yum reaches "Determining fastest mirrors", the server freezes.
It happened to me 2 times in a row today, but when I was prepared the 3rd time (tcpdumps on) it didn't happen.
Comment 37 Shrikrishna Khare 2017-03-30 01:47:37 UTC
I built a CentOS rpm with the debug patch, and just as I was about to share it, my CentOS VM got (un)lucky and I hit this issue, so I could root-cause it.

The bug is in the vmxnet3 device emulation (ESX 6.5) and will be fixed in the next update.

In the meantime, suggested workaround:
 - disable rx data ring: ethtool -G eth? rx-mini 0

The issue should not be hit if using HW version 12 or older (with any kernel) or with kernel older than 4.8 (any HW version).

Remigiusz, thank you for all your help in debugging and root causing this and being very patient as we went through multiple iterations.
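
As a rough sketch, the exposure rule stated above (HW version 13 or newer together with kernel 4.8 or newer) can be written as a check. The function names are made up for illustration, and the check only looks at the upstream version number: later comments report the same BUG on a RHEL 2.6.32 kernel, presumably with a backported driver, so a False result is not a guarantee for distro kernels.

```python
import re


def kernel_major_minor(release):
    """Extract (major, minor) from a release string such as '4.10.3-1.el7'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def may_hit_bug(kernel_release, hw_version):
    """Apply the rule from comment 37: HW version >= 13 and kernel >= 4.8.

    Assumption: upstream version numbers only; distro backports are ignored.
    """
    return hw_version >= 13 and kernel_major_minor(kernel_release) >= (4, 8)


print(may_hit_bug("4.10.3-1.el7.elrepo.x86_64", 13))
print(may_hit_bug("4.4.0", 13))
print(may_hit_bug("4.9.0", 11))
```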
Comment 38 Remigiusz Szczepanik 2017-03-30 15:57:30 UTC
Sure, no problem. Just don't forget about me when writing changelog :)

Will it be fixed in the next version of ESXi, or will you upstream a patch to the kernel driver?
Is there an estimated release date for the fix?
Comment 39 Shrikrishna Khare 2017-03-31 00:04:07 UTC
(In reply to Remigiusz Szczepanik from comment #38)
> Sure, no problem. Just don't forget about me when writing changelog :)

Certainly :-). The bug (and thus the fix) is in the device emulation though, which is part of the ESX sources and not the Linux kernel.


> 
> Would it be fixed in next version of ESXi or you upstream patch to kernel
> driver?
> Any actual time of fix release?

As I said above, it is not a driver bug, but an ESX bug. I don't know the timeline, I will try to find out and update.
Comment 40 Shrikrishna Khare 2017-04-11 00:54:01 UTC
(In reply to Shrikrishna Khare from comment #37)
> I built CentOS rpm with debug patch and just as I was about to share it, my
> CentOS VM got (un)lucky and I hit this issue, so could root cause it.
> 
> The bug is in the vmxnet3 device emulation (ESX 6.5) and will be fixed in
> the next update.
> 
> In the meantime, suggested workaround:
>  - disable rx data ring: ethtool -G eth? rx-mini 0


To be absolutely safe from hitting this bug, the above setting must be applied before the interface has had any opportunity to receive traffic.

This may not always be feasible. A safer alternative is to shut down the VM, add the line below to the vmx file, and power the VM back on:

vmxnet3.rev.30 = FALSE
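
A minimal sketch of adding that line to the text of a vmx file without duplicating it. Only the vmxnet3.rev.30 line comes from this comment; the helper name and the sample vmx contents are assumptions for illustration. The VM should be shut down before its vmx file is edited.

```python
def with_vmx_setting(vmx_text, key="vmxnet3.rev.30", value="FALSE"):
    """Return vmx_text with `key = value` set, replacing any existing entry."""
    kept = [
        line
        for line in vmx_text.splitlines()
        # Drop any earlier assignment of the same key (case-insensitive).
        if line.split("=", 1)[0].strip().lower() != key.lower()
    ]
    kept.append(f"{key} = {value}")
    return "\n".join(kept) + "\n"


# Illustrative vmx fragment, not taken from a real VM.
SAMPLE_VMX = 'virtualHW.version = "13"\nethernet0.virtualDev = "vmxnet3"\n'

print(with_vmx_setting(SAMPLE_VMX), end="")
```

Running the helper twice leaves a single copy of the setting, so it is safe to re-apply.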


> 
> The issue should not be hit if using HW version 12 or older (with any
> kernel) or with kernel older than 4.8 (any HW version).
> 
> Remigiusz, thank you for all your help in debugging and root causing this
> and being very patient as we went through multiple iterations.
Comment 41 peter.hun 2017-04-20 21:39:25 UTC
Is this issue fixed in ESXi 6.5d / ESXi-6.5.0-20170404001-standard (Build 5310538)?
Comment 42 Shrikrishna Khare 2017-04-20 21:48:19 UTC
(In reply to peter.hun from comment #41)
> is this issue fixed in ESXi 6.5d / ESXi-6.5.0-20170404001-standard (Build
> 5310538) ?

No. That was cut before the fix. The next 6.5 update should have the fix.
Comment 43 Dan Wall 2017-06-13 14:02:53 UTC
I just ran into this with Red Hat 6.9. A case was opened with Red Hat, and this is how I found this thread. 

I have a case open with VMware, and just awaiting a response.

Has a fix been created yet?
Comment 44 Shrikrishna Khare 2017-06-14 21:06:42 UTC
(In reply to Dan Wall from comment #43)
> I just ran into this with Red Hat 6.9. A case was opened with Red Hat, and
> this is how I found this thread. 
> 
> I have a case open with VMware, and just awaiting a response.
> 
> Has a fix been created yet?

What is the vmxnet3 driver version in RHEL 6.9? 

As mentioned above, the fix is part of ESXi, and will be part of next 6.5 update.
Comment 45 miniflowtrader 2017-07-04 09:53:01 UTC
I can consistently produce this bug on Debian stretch. The bug seems to occur ONLY when lro is enabled.
Comment 46 miniflowtrader 2017-07-04 10:21:10 UTC
I just read through all the comments and applied both vmxnet3.rev.30 = FALSE and ethtool -G eth? rx-mini 0.

Both corrected the issue as well.  Waiting patiently for the next ESXi update before going to a 4.9 kernel.
Comment 47 Wang Jingkai 2017-07-10 06:24:16 UTC
(In reply to Shrikrishna Khare from comment #44)
> (In reply to Dan Wall from comment #43)
> > I just ran into this with Red Hat 6.9. A case was opened with Red Hat, and
> > this is how I found this thread. 
> > 
> > I have a case open with VMware, and just awaiting a response.
> > 
> > Has a fix been created yet?
> 
> What is the vmxnet3 driver version in RHEL 6.9? 
> 
> As mentioned above, the fix is part of ESXi, and will be part of next 6.5
> update.

Hi, I have the same issue.
I use ESXi 6.5 ((Updated) Dell-ESXi-6.5.0-4564106-A00 (Dell)) on a Dell PowerEdge R630.
I created a machine running CentOS 6.6 and then used 'yum update' to upgrade to 6.9 ('uname -r' prints '2.6.32-696.3.2.el6.x86_64').

Sometimes it freezes several minutes after power on, and sometimes several hours after boot.

I checked kdump and found
 'PANIC: kernel BUG at drivers/net/vmxnet3/vmxnet3_drv.c:1412!'
Comment 48 Shrikrishna Khare 2017-07-10 18:18:45 UTC
(In reply to Wang Jingkai from comment #47)
> (In reply to Shrikrishna Khare from comment #44)
> > (In reply to Dan Wall from comment #43)
> > > I just ran into this with Red Hat 6.9. A case was opened with Red Hat,
> and
> > > this is how I found this thread. 
> > > 
> > > I have a case open with VMware, and just awaiting a response.
> > > 
> > > Has a fix been created yet?
> > 
> > What is the vmxnet3 driver version in RHEL 6.9? 
> > 
> > As mentioned above, the fix is part of ESXi, and will be part of next 6.5
> > update.
> 
> Hi, I got the same issuse.
> I use ESXi6.5 ((Updated) Dell-ESXi-6.5.0-4564106-A00 (Dell)) on Dell
> PowerEdge R630
> Create a machine running CentOS6.6 and then use 'yum update' upgrade to 6.9
> (use 'uname -r' then print '2.6.32-696.3.2.el6.x86_64')
> 
> Sometimes it freeze several minutes after power on, and sometimes freezes
> several hours from boot
> 
> I check kdump and found
>  'PANIC: kernel BUG at drivers/net/vmxnet3/vmxnet3_drv.c:1412!'

Build 4564106 does not have the fix. The fix will be available in the 6.5 update release (not released yet). In the meantime, please use the workaround mentioned above.
Comment 49 miniflowtrader 2017-07-30 19:04:34 UTC
Confirmed update release has fixed the issue.  Nice job!
Comment 50 Shrikrishna Khare 2017-07-31 18:49:09 UTC
(In reply to miniflowtrader from comment #49)
> Confirmed update release has fixed the issue.  Nice job!

Thank you for verifying this! I think it is OK to close this bug.

Thanks,
Shri
Comment 51 Ryan Breaker 2017-08-03 03:58:34 UTC
I'm running build 4887370 and am still experiencing this bug; is there a newer build I'm missing?

Thanks so much for the work on this in the meantime guys.
Comment 52 miniflowtrader 2017-08-03 04:02:26 UTC
(In reply to Ryan Breaker from comment #51)
> I'm running build 4887370 and am still experiencing this bug; is there a
> newer build I'm missing?
> 
> Thanks so much for the work on this in the meantime guys.

You want to be on 5969303. Released 2017-07-27.
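
A small sketch of comparing a host's build number against the build named above. The threshold is only what this comment states for the 6.5 line, and the "build-NNNNNNN" string format is assumed from the version string quoted later in this thread. OEM images may differ, so treat the result as a hint rather than proof.

```python
import re

# Build number named in comment 52 as the one to be on (assumed threshold).
FIXED_BUILD = 5969303


def build_number(version_string):
    """Extract the number from a string such as 'VMware ESXi 6.5.0 build-5969303'."""
    match = re.search(r"build-(\d+)", version_string)
    return int(match.group(1)) if match else None


def has_fix_build(version_string):
    """True if the build number is at or above the assumed fixed build."""
    build = build_number(version_string)
    return build is not None and build >= FIXED_BUILD


print(has_fix_build("VMware ESXi 6.5.0 build-4887370"))
print(has_fix_build("VMware ESXi 6.5.0 build-5969303"))
```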
Comment 53 Ryan Breaker 2017-08-03 04:18:02 UTC
(In reply to miniflowtrader from comment #52)
> (In reply to Ryan Breaker from comment #51)
> > I'm running build 4887370 and am still experiencing this bug; is there a
> > newer build I'm missing?
> > 
> > Thanks so much for the work on this in the meantime guys.
> 
> You want to be on 5969303. Released 2017-07-27.

Oh got it, I'm still new to VMWare and am looking into updating now, thank you!
Comment 54 Andrew Moore 2017-08-07 18:03:39 UTC
I'm running the HPE custom version of 5969303, HPE-ESXi-6.5.0-Update1-iso-650.U1.10.1.0.14 and I am still experiencing this issue with Debian Stretch.

Is it possible that HPE removed this fix?
Comment 55 nathan 2018-01-09 23:54:22 UTC
On the latest VMware ESXi 6.5.0 build-7526125 this kernel message is still printed as of kernel 4.9.75. Haven't seen an actual crash yet.
Comment 56 John Savanyo 2018-06-04 23:52:18 UTC
I believe this issue is resolved in vSphere/ESXi 6.5 U1, see:
https://kb.vmware.com/s/article/2151480
Comment 57 Fabio Pedretti 2018-06-05 12:12:44 UTC
This bug (the freeze issue) should be closed since it is not a kernel bug.
About the warning, you may want to refer to https://bugzilla.kernel.org/show_bug.cgi?id=194569
Comment 58 Stephen Hemminger 2018-06-05 16:32:43 UTC
Closing, since this is not an upstream kernel bug.
