Bug 197325 - NETDEV WATCHDOG: enp2s0f3 (i40e): transmit queue 4 timed out
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: Intel Linux
Importance: P1 high
Assignee: drivers_network@kernel-bugs.osdl.org
 
Reported: 2017-10-19 17:47 UTC by Pawel Staszewski
Modified: 2020-11-27 15:00 UTC

Kernel Version: 4.14.0-rc4
Regression: No


Attachments
4.13.9 panic (158.69 KB, image/png)
2017-10-22 18:47 UTC, Pawel Staszewski
Patch to add dump Tx Buffer Info and Descriptor for hung ring (3.10 KB, patch)
2017-11-30 20:32 UTC, Alexander Duyck

Description Pawel Staszewski 2017-10-19 17:47:55 UTC
[66663.913440] NETDEV WATCHDOG: enp2s0f3 (i40e): transmit queue 4 timed out
[66663.913455] ------------[ cut here ]------------
[66663.913461] WARNING: CPU: 5 PID: 0 at net/sched/sch_generic.c:320 dev_watchdog+0xc5/0x122
[66663.913462] Modules linked in: team_mode_roundrobin team x86_pkg_temp_thermal ipmi_si
[66663.913467] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.14.0-rc4 #3
[66663.913468] task: ffff88085ab34100 task.stack: ffffc900031f4000
[66663.913470] RIP: 0010:dev_watchdog+0xc5/0x122
[66663.913471] RSP: 0000:ffff88085e343e88 EFLAGS: 00010292
[66663.913472] RAX: 000000000000003c RBX: ffff880858151000 RCX: 0000000000000000
[66663.913472] RDX: ffff88085e353501 RSI: ffff88085e34cab8 RDI: ffff88085e34cab8
[66663.913473] RBP: ffff88085e343e98 R08: 00222038e6dd795d R09: ffff88087f013d6c
[66663.913474] R10: 0000000000000001 R11: 000000000000005c R12: 0000000000000004
[66663.913475] R13: ffffffff815ee486 R14: ffff880858151438 R15: ffff880858151000
[66663.913476] FS:  0000000000000000(0000) GS:ffff88085e340000(0000) knlGS:0000000000000000
[66663.913477] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[66663.913477] CR2: 00007f3181cdaff8 CR3: 0000000859d3b004 CR4: 00000000001606e0
[66663.913478] Call Trace:
[66663.913479]  <IRQ>
[66663.913482]  call_timer_fn+0x56/0x119
[66663.913484]  run_timer_softirq+0x136/0x15b
[66663.913487]  __do_softirq+0xe6/0x23a
[66663.913491]  irq_exit+0x4d/0x5b
[66663.913492]  smp_apic_timer_interrupt+0xb6/0xf0
[66663.913494]  apic_timer_interrupt+0x90/0xa0
[66663.913494]  </IRQ>
[66663.913497] RIP: 0010:default_idle+0x17/0x28
[66663.913498] RSP: 0000:ffffc900031f7eb8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[66663.913499] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88085e350b40
[66663.913500] RDX: 00000000131f6bfa RSI: 0000000000000005 RDI: 0000000000000001
[66663.913501] RBP: ffffc900031f7eb8 R08: 0000000000000000 R09: 0000000000000001
[66663.913501] R10: 0000000000000000 R11: 0000000000000033 R12: ffff88085ab34100
[66663.913502] R13: 0000000000000000 R14: ffff88085ab34100 R15: 00000000fffffff0
[66663.913504]  ? default_idle+0x15/0x28
[66663.913507]  arch_cpu_idle+0xa/0xc
[66663.913509]  default_idle_call+0x15/0x17
[66663.913511]  do_idle+0xb8/0x16f
[66663.913512]  cpu_startup_entry+0x1a/0x1c
[66663.913514]  start_secondary+0xe9/0xec
[66663.913516]  secondary_startup_64+0xa5/0xa5
[66663.913517] Code: 4e ea 6e 00 00 75 38 48 89 df c6 05 42 ea 6e 00 01 e8 ea 42 fe ff 44 89 e1 48 89 de 48 c7 c7 36 75 ac 81 48 89 c2 e8 7f 27 a8 ff <0f> ff eb 10 41 ff c4 48 05 40 01 00 00 41 39 f4 75 9a eb 0d 48
[66663.913539] ---[ end trace e6e15f4c0c2e9ed4 ]---
[66663.913544] i40e 0000:02:00.3 enp2s0f3: tx_timeout: VSI_seid: 397, Q 4, NTC: 0x64, HWB: 0x64, NTU: 0x4d, TAIL: 0x64, INT: 0x1
[66663.913546] i40e 0000:02:00.3 enp2s0f3: tx_timeout recovery level 1, hung_queue 4
[66664.298996] i40e 0000:02:00.3: PF reset failed, -15
[67633.048135] i40e 0000:02:00.3: update vlan stripping failed, err I40E_ERR_QUEUE_EMPTY aq_err OK
[67633.215896] i40e 0000:02:00.3: Failed to clear LAN Tx queue context on Tx ring 0 (pf_q 0), error: -19
[67633.482725] i40e 0000:02:00.3: PF reset failed, -15
[67644.164409] i40e 0000:02:00.3: update vlan stripping failed, err I40E_ERR_QUEUE_EMPTY aq_err OK
[67644.331271] i40e 0000:02:00.3: Failed to clear LAN Tx queue context on Tx ring 0 (pf_q 0), error: -19
[67644.597718] i40e 0000:02:00.3: PF reset failed, -15
[68843.749625] i40e 0000:02:00.0 enp2s0f0: tx_timeout: VSI_seid: 399, Q 3, NTC: 0x1c2, HWB: 0x1c2, NTU: 0x1ac, TAIL: 0x1c2, INT: 0x1
[68843.749628] i40e 0000:02:00.0 enp2s0f0: tx_timeout recovery level 1, hung_queue 3
[68844.055419] i40e 0000:02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on

After these messages appear in dmesg, I can't reset the link on the card:
ip link set up dev enp2s0f3
RTNETLINK answers: Cannot allocate memory

[69476.064969] i40e 0000:02:00.3: update vlan stripping failed, err I40E_ERR_QUEUE_EMPTY aq_err OK
[69476.234327] i40e 0000:02:00.3: Failed to clear LAN Tx queue context on Tx ring 0 (pf_q 0), error: -19
[69476.501691] i40e 0000:02:00.3: PF reset failed, -15
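
For reading the tx_timeout lines above, here is a minimal sketch. It assumes NTC and NTU are the driver's next_to_clean and next_to_use ring indices, so that the number of descriptors queued but not yet cleaned is (NTU - NTC) modulo the ring size. The ring size of 512 is a guess: every index reported in this bug is below 0x200, even though the script requests 2048. pending_descs is a hypothetical helper, not an i40e or ethtool command.

```shell
#!/bin/sh
# Hypothetical helper: descriptors between next_to_clean and next_to_use.
# Arguments: NTC, NTU (hex such as 0x64 is accepted) and the ring size.
pending_descs() {
    ntc=$(( $1 ))
    ntu=$(( $2 ))
    ring=$3
    # Add the ring size before the modulo so a wrapped NTU stays positive.
    echo $(( (ntu - ntc + ring) % ring ))
}

# Values from the first timeout on enp2s0f3; 512 is an assumed ring size.
pending_descs 0x64 0x4d 512
```

With NTC 0x64 and NTU 0x4d this prints 489, so the ring would be almost full if it really has 512 entries.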





CONFIGURATION:
ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3'
for i in $ifc
        do
        ip link set up dev $i
        ethtool -A $i autoneg off rx off tx off
        ethtool -G $i rx 2048 tx 2048
        ip link set $i txqueuelen 1000
        ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 17 tx-usecs 125
        ethtool -L $i combined 6
        ethtool -K $i ntuple on
        ethtool -K $i gro on
        ethtool -K $i tso on
        done
Comment 1 Pawel Staszewski 2017-10-19 17:48:35 UTC
lspci | grep Ether
02:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
02:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
02:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
02:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
03:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
03:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
03:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
Comment 2 Pawel Staszewski 2017-10-20 09:48:14 UTC
Today, with no configuration changes, another card went down with:
[125896.938279] i40e 0000:03:00.3 enp3s0f3: tx_timeout: VSI_seid: 397, Q 1, NTC: 0x1be, HWB: 0x1be, NTU: 0x1a6, TAIL: 0x1be, INT: 0x1
[125896.938281] i40e 0000:03:00.3 enp3s0f3: tx_timeout recovery level 1, hung_queue 1
[125897.298600] i40e 0000:03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on

Can't restore traffic on this NIC.
Comment 3 Pawel Staszewski 2017-10-20 10:08:12 UTC
And just seconds ago another card went down, also without any changes:
[127914.734443] i40e 0000:03:00.2 enp3s0f2: tx_timeout: VSI_seid: 396, Q 0, NTC: 0x18d, HWB: 0x18d, NTU: 0x166, TAIL: 0x18d, INT: 0x1
[127914.734445] i40e 0000:03:00.2 enp3s0f2: tx_timeout recovery level 1, hung_queue 0
[127915.365285] i40e 0000:03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on
Comment 4 Pawel Staszewski 2017-10-20 10:18:00 UTC
The problem described here is the same as mine:
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20161121/007324.html

But this is from 2016, so is it still not resolved?
hmm...
Comment 5 Pawel Staszewski 2017-10-20 16:22:44 UTC
Upgraded the kernel today to:
4.14.0-rc5-next-20171018

Cleaned up some settings and left them at their defaults.
The current settings are:
ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3'
for i in $ifc
        do
        ip link set up dev $i
        ethtool -A $i autoneg off rx off tx off
        ethtool -G $i rx 2048 tx 2048
        ip link set $i txqueuelen 1000
        ethtool -L $i combined 6
        done

Now waiting to see whether the PF reset happens again. The first one came after about 24 hours, so this may take until tomorrow.
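
While waiting, the kernel log can be checked for the two signatures seen so far. This is only a sketch: count_i40e_events is a hypothetical helper that reads log text on stdin, and the sample lines are copied from the description.

```shell
#!/bin/sh
# Hypothetical helper: count tx_timeout recoveries and failed PF resets
# in kernel log text supplied on stdin.
count_i40e_events() {
    awk '
        /tx_timeout recovery level/ { timeouts++ }
        /PF reset failed/           { failed++ }
        END { printf "tx_timeouts=%d pf_reset_failed=%d\n", timeouts, failed }
    '
}

# Demo on two lines taken from this report.
count_i40e_events <<'EOF'
[66663.913546] i40e 0000:02:00.3 enp2s0f3: tx_timeout recovery level 1, hung_queue 4
[66664.298996] i40e 0000:02:00.3: PF reset failed, -15
EOF

# On a live system (needs permission to read the kernel log):
#   dmesg | count_i40e_events
```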

Besides that there is a memleak :) but let's leave the memleak for another report :)
Comment 6 Pawel Staszewski 2017-10-20 16:35:59 UTC
Additional info:
ethtool -i enp2s0f0
driver: i40e
version: 2.1.14-k
firmware-version: 6.01 0x80003484 1.1747.0
expansion-rom-version:
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

All cards have the same FW.
Comment 7 Pawel Staszewski 2017-10-22 13:18:06 UTC
Same problem: after some time the first link goes down.
[175817.966335] NETDEV WATCHDOG: enp3s0f3 (i40e): transmit queue 3 timed out
[175817.966349] ------------[ cut here ]------------
[175817.966355] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:320 dev_watchdog+0xbc/0x117
[175817.966356] Modules linked in: team_mode_roundrobin team x86_pkg_temp_thermal ipmi_si
[175817.966361] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc5-next-20171018 #1
[175817.966362] task: ffffffff81c0e4c0 task.stack: ffffffff81c00000
[175817.966364] RIP: 0010:dev_watchdog+0xbc/0x117
[175817.966365] RSP: 0018:ffff88085e203eb0 EFLAGS: 00010282
[175817.966366] RAX: 000000000000003c RBX: ffff880857e87800 RCX: 0000000000000000
[175817.966367] RDX: ffff88085e213601 RSI: ffff88085e20cad8 RDI: ffff88085e20cad8
[175817.966368] RBP: 0000000000000003 R08: 00251efa74c4c7c7 R09: ffff88087f0139e4
[175817.966368] R10: 0000000000000001 R11: 000000000000005c R12: ffffffff815c122c
[175817.966369] R13: ffff880857e87c38 R14: ffff880857e87800 R15: 0000000000000001
[175817.966370] FS:  0000000000000000(0000) GS:ffff88085e200000(0000) knlGS:0000000000000000
[175817.966372] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[175817.966372] CR2: 000000000725a750 CR3: 0000000001c09005 CR4: 00000000001606f0
[175817.966373] Call Trace:
[175817.966375]  <IRQ>
[175817.966379]  call_timer_fn+0x51/0x10e
[175817.966382]  run_timer_softirq+0x12b/0x14e
[175817.966385]  ? timerqueue_add+0x56/0x74
[175817.966388]  __do_softirq+0xeb/0x247
[175817.966390]  irq_exit+0x49/0x55
[175817.966392]  smp_apic_timer_interrupt+0xb1/0xe9
[175817.966394]  apic_timer_interrupt+0x89/0x90
[175817.966395]  </IRQ>
[175817.966396] RIP: 0010:default_idle+0x13/0x22
[175817.966397] RSP: 0018:ffffffff81c03ed8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
[175817.966398] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff81c42040
[175817.966399] RDX: 00000000e6f00346 RSI: 0000000000000000 RDI: 0000000000000001
[175817.966400] RBP: ffffffff81c0e4c0 R08: 00251efa7319a991 R09: 0000000000000208
[175817.966401] R10: ffffc900031b7e20 R11: 0000000000000000 R12: 0000000000000000
[175817.966401] R13: ffffffff81c0e4c0 R14: ffffffff81c0e4c0 R15: 00000000fffffff0
[175817.966404]  ? default_idle+0x11/0x22
[175817.966406]  do_idle+0xaf/0x164
[175817.966408]  cpu_startup_entry+0x18/0x1a
[175817.966411]  start_kernel+0x3bd/0x3c5
[175817.966416]  secondary_startup_64+0xa5/0xb0
[175817.966417] Code: 3d 35 dd 71 00 00 75 35 48 89 df c6 05 29 dd 71 00 01 e8 58 53 fe ff 89 e9 48 89 de 48 c7 c7 f5 b4 ac 81 48 89 c2 e8 56 c6 aa ff <0f> ff eb 0e ff c5 48 05 40 01 00 00 39 f5 75 9d eb 0d 48 8b 83
[175817.966440] ---[ end trace 1ab6cb8e3150990a ]---
[175817.966445] i40e 0000:03:00.3 enp3s0f3: tx_timeout: VSI_seid: 397, Q 3, NTC: 0x193, HWB: 0x193, NTU: 0x179, TAIL: 0x193, INT: 0x1
[175817.966446] i40e 0000:03:00.3 enp3s0f3: tx_timeout recovery level 1, hung_queue 3
[175818.348483] i40e 0000:03:00.3: PF reset failed, -15
Comment 8 Pawel Staszewski 2017-10-22 14:24:07 UTC
Another link went down after some time:
[184084.975025] i40e 0000:03:00.0 enp3s0f0: tx_timeout: VSI_seid: 399, Q 3, NTC: 0x2e, HWB: 0x2e, NTU: 0x18, TAIL: 0x2e, INT: 0x0
[184084.975028] i40e 0000:03:00.0 enp3s0f0: tx_timeout recovery level 1, hung_queue 3
[184086.180816] i40e 0000:03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on
[184417.905765] i40e 0000:03:00.0 enp3s0f0: NIC Link is Down
Comment 9 Pawel Staszewski 2017-10-22 14:31:01 UTC
And about the first link that went down:
ip link set down enp3s0f3
ip link set up enp3s0f3
RTNETLINK answers: Cannot allocate memory

dmesg:
[185127.163571] i40e 0000:03:00.3: update vlan stripping failed, err I40E_ERR_QUEUE_EMPTY aq_err OK
[185127.369464] i40e 0000:03:00.3: Failed to clear LAN Tx queue context on Tx ring 0 (pf_q 0), error: -19
[185127.384530] i40e 0000:03:00.3 enp3s0f3: NIC Link is Down
[185127.855799] i40e 0000:03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on
Comment 10 Pawel Staszewski 2017-10-22 14:54:46 UTC
I can also bring up enp3s0f0, which had a PF reset too.

But after I set the link down / up on this device and then remove it from and re-add it to the team:
teamdctl team0 port remove enp3s0f0
teamdctl team0 port add enp3s0f0

there is no RX traffic on this device, and all frames are dropped on RX:
NIC statistics:
     rx_packets: 103923
     tx_packets: 346595112
     rx_bytes: 7603608
     tx_bytes: 317591433056
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 0
     tx_dropped: 0
     collisions: 0
     rx_length_errors: 0
     rx_crc_errors: 0
     rx_unicast: 0
     tx_unicast: 346590454
     rx_multicast: 61132
     tx_multicast: 22
     rx_broadcast: 42693
     tx_broadcast: 4575
     rx_unknown_protocol: 0
     tx_linearize: 0
     tx_force_wb: 0
     rx_alloc_fail: 0
     rx_pg_alloc_fail: 0
     tx-0.tx_packets: 58656175
     tx-0.tx_bytes: 54128063478
     rx-0.rx_packets: 8333
     rx-0.rx_bytes: 683328
     tx-1.tx_packets: 58351825
     tx-1.tx_bytes: 53885190020
     rx-1.rx_packets: 9413
     rx-1.rx_bytes: 768526
     tx-2.tx_packets: 56666428
     tx-2.tx_bytes: 51457891575
     rx-2.rx_packets: 13257
     rx-2.rx_bytes: 1087617
     tx-3.tx_packets: 58333047
     tx-3.tx_bytes: 53274070997
     rx-3.rx_packets: 62749
     rx-3.rx_bytes: 4229310
     tx-4.tx_packets: 57349402
     tx-4.tx_bytes: 53449582836
     rx-4.rx_packets: 8626
     rx-4.rx_bytes: 708605
     tx-5.tx_packets: 57238235
     tx-5.tx_bytes: 51396634150
     rx-5.rx_packets: 1545
     rx-5.rx_bytes: 126222
     port.rx_bytes: 5996611563
     port.tx_bytes: 320428847308
     port.rx_unicast: 23409778
     port.tx_unicast: 346572372
     port.rx_multicast: 96797
     port.tx_multicast: 50
     port.rx_broadcast: 44688
     port.tx_broadcast: 4575
     port.tx_errors: 0
     port.rx_dropped: 0
     port.tx_dropped_link_down: 18125
     port.rx_crc_errors: 0
     port.illegal_bytes: 0
     port.mac_local_faults: 6
     port.mac_remote_faults: 7
     port.tx_timeout: 1
     port.rx_csum_bad: 0
     port.rx_length_errors: 0
     port.link_xon_rx: 0
     port.link_xoff_rx: 0
     port.link_xon_tx: 0
     port.link_xoff_tx: 0
     port.rx_size_64: 1821065
     port.rx_size_127: 14633656
     port.rx_size_255: 2802123
     port.rx_size_511: 1352542
     port.rx_size_1023: 756897
     port.rx_size_1522: 2184968
     port.rx_size_big: 3
     port.tx_size_64: 35835246
     port.tx_size_127: 76595624
     port.tx_size_255: 15956690
     port.tx_size_511: 7729005
     port.tx_size_1023: 8031246
     port.tx_size_1522: 202429160
     port.tx_size_big: 0
     port.rx_undersize: 0
     port.rx_fragments: 0
     port.rx_oversize: 0
     port.rx_jabber: 0
     port.VF_admin_queue_requests: 0
     port.arq_overflows: 0
     port.rx_hwtstamp_cleared: 0
     port.tx_hwtstamp_skipped: 0
     port.fdir_flush_cnt: 3774
     port.fdir_atr_match: 0
     port.fdir_atr_tunnel_match: 0
     port.fdir_atr_status: 0
     port.fdir_sb_match: 0
     port.fdir_sb_status: 1
     port.tx_lpi_status: 0
     port.rx_lpi_status: 0
     port.tx_lpi_count: 0
     port.rx_lpi_count: 0
     port.tx_priority_0_xon: 0
     port.tx_priority_0_xoff: 0
     port.tx_priority_1_xon: 0
     port.tx_priority_1_xoff: 0
     port.tx_priority_2_xon: 0
     port.tx_priority_2_xoff: 0
     port.tx_priority_3_xon: 0
     port.tx_priority_3_xoff: 0
     port.tx_priority_4_xon: 0
     port.tx_priority_4_xoff: 0
     port.tx_priority_5_xon: 0
     port.tx_priority_5_xoff: 0
     port.tx_priority_6_xon: 0
     port.tx_priority_6_xoff: 0
     port.tx_priority_7_xon: 0
     port.tx_priority_7_xoff: 0
     port.rx_priority_0_xon: 0
     port.rx_priority_0_xoff: 0
     port.rx_priority_1_xon: 0
     port.rx_priority_1_xoff: 0
     port.rx_priority_2_xon: 0
     port.rx_priority_2_xoff: 0
     port.rx_priority_3_xon: 0
     port.rx_priority_3_xoff: 0
     port.rx_priority_4_xon: 0
     port.rx_priority_4_xoff: 0
     port.rx_priority_5_xon: 0
     port.rx_priority_5_xoff: 0
     port.rx_priority_6_xon: 0
     port.rx_priority_6_xoff: 0
     port.rx_priority_7_xon: 0
     port.rx_priority_7_xoff: 0
     port.rx_priority_0_xon_2_xoff: 0
     port.rx_priority_1_xon_2_xoff: 0
     port.rx_priority_2_xon_2_xoff: 0
     port.rx_priority_3_xon_2_xoff: 0
     port.rx_priority_4_xon_2_xoff: 0
     port.rx_priority_5_xon_2_xoff: 0
     port.rx_priority_6_xon_2_xoff: 0
     port.rx_priority_7_xon_2_xoff: 0
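
To make the RX side of the statistics above easier to compare, here is a small sketch that sums the per-queue rx_packets counters and prints them next to rx_packets, rx_unicast and port.rx_unicast. summarize_rx is a hypothetical helper. Reading the unprefixed counters as per-PF (VSI) values and the port.* counters as physical-port values is an assumption, not something stated in this report.

```shell
#!/bin/sh
# Hypothetical helper: condense "ethtool -S" style output read from stdin.
summarize_rx() {
    awk -F': *' '
        {
            name = $1
            sub(/^[ \t]+/, "", name)   # drop the ethtool indentation
        }
        name ~ /^rx-[0-9]+\.rx_packets$/ { queue_sum += $2 }
        name == "rx_packets"             { total = $2 }
        name == "rx_unicast"             { vsi_uc = $2 }
        name == "port.rx_unicast"        { port_uc = $2 }
        END {
            printf "queue_sum=%d rx_packets=%d rx_unicast=%d port.rx_unicast=%d\n",
                   queue_sum, total, vsi_uc, port_uc
        }
    '
}

# Demo on counters copied from the listing above.
summarize_rx <<'EOF'
     rx_packets: 103923
     rx_unicast: 0
     rx-0.rx_packets: 8333
     rx-1.rx_packets: 9413
     rx-2.rx_packets: 13257
     rx-3.rx_packets: 62749
     rx-4.rx_packets: 8626
     rx-5.rx_packets: 1545
     port.rx_unicast: 23409778
EOF

# On a live system:
#   ethtool -S enp3s0f0 | summarize_rx
```

For the listing above the per-queue sum equals rx_packets (103923), while rx_unicast is 0 and port.rx_unicast is 23409778.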

So this is the exact configuration:
modprobe team
teamd -d -f /etc/teamd.conf

cat /etc/teamd.conf
{
        "device": "team0",
        "hwaddr": "0c:c4:7a:bc:40:f7",
        "runner": {
                "name": "roundrobin",
                "active": true,
                "fast_rate": true,
                "tx_hash": ["eth", "ipv4", "ipv6"]
        },
        "link_watch": {"name": "ethtool"},
        "ports": {"enp2s0f0": {}, "enp2s0f1": {}, "enp2s0f2": {}, "enp2s0f3": {}, "enp3s0f0": {}, "enp3s0f1": {}, "enp3s0f2": {}, "enp3s0f3": {}}
}



ip link set up dev team0
ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3'
for i in $ifc
        do
        ip link set up dev $i
        ethtool -A $i autoneg off rx off tx off
        ethtool -G $i rx 2048 tx 2048
        ip link set $i txqueuelen 1000
        ethtool -L $i combined 6
        done


vlans="7 65 227 333 334 335 490 589 336 337 339 409 572 667 575 717 154 342 343 346 347 349 455 474 456 459 460 461 462 463 464 465 466 467 468 470 476 385 477 479 481 482 598 849 873 895 1327 1780 1781 1782 1783 1785 1786 1784 1787 968 630 1788 1790 1791 2002 1792 1793 1794 1795 1796 590 1797 1798 1800 1740 1741 1742 1743 1744 1745 4043 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1767 1768 1769 1770 1772 1773 1774 1775 1776 1777 1778 1779 1821 1822 1823 1824 1825 1826 1827 1828 1829 1841 1842 1843 1844 1845 1846 1852 1847 1848 1849 2568 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142"
for i in $vlans
        do
        ip link add link team0 name vlan$i type vlan id $i
        done



All traffic is on VLANs attached to team0.

But I can't recover traffic on any card after a PF reset. There is TX, but the switch can't find an aggregation member for sharing on the ports that were reset. And one port, always the first port that hits this bug, can't be brought up at all because of "Cannot allocate memory".

This problem does not exist on kernel 4.11.12 (with 4.11.12 everything worked for about 1 month without problems).
Comment 11 Pawel Staszewski 2017-10-22 15:34:34 UTC
Tonight I will go back a little and check 4.13.9 plus some patches for the memleak problems.

We will see how far back this PF reset bug goes.
Comment 12 Pawel Staszewski 2017-10-22 18:18:24 UTC
For this buggy i40e driver there should be something like a separate path for git bisect :)
Then you could bisect only the driver commits, not the whole kernel.

For example, I was planning to bisect this, but there are far too many kernel changes since 4.11.12 :) I know about many other bugs that would affect me during the bisect. For example, I was bisecting three bugs on the 4.12 -> 4.14 path (apic/fib/mellanox), and some of them make my hosts panic after about 15 to 30 seconds. So it is not good to bisect the whole kernel; there should be a way to bisect only the driver.

For example, I'm not using any Intel NICs faster than 10G now because i40e has too many bugs. If there were a way to bisect only the driver, I could push ahead with testing, report bugs, do bisects, etc. But really :) I don't want to bisect the whole kernel, because then I don't know what I'm bisecting. I would then need to prepare a custom git tree: my repo with my patches, which can't be bisected because it carries fixes for bugs that were already repaired.
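
On bisecting only the driver: git bisect accepts a path after "--", which restricts the search to commits that touch that path. Below is a minimal sketch. It assumes a mainline kernel checkout, with v4.11 as the known-good tag and v4.14-rc4 as the bad one (stand-ins for the kernels named in this report; 4.11.12 itself is a stable-tree release). run and bisect_driver are hypothetical helpers.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN is set, else run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: start a bisect limited to commits touching one path.
# Arguments: good revision, bad revision, path inside the kernel tree.
bisect_driver() {
    good=$1
    bad=$2
    path=$3
    run git bisect start "$bad" "$good" -- "$path"
}

# Dry run: only prints the command so it can be reviewed first.
DRY_RUN=1 bisect_driver v4.11 v4.14-rc4 drivers/net/ethernet/intel/i40e
```

Each step still builds and boots the whole kernel at that commit, so unrelated bugs in intermediate kernels can still get in the way, and a regression caused outside the driver directory would not be pinpointed.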
Comment 13 Pawel Staszewski 2017-10-22 18:23:38 UTC
Also, why is i40e less optimized than ixgbe? :)
For example, i40e:
   PerfTop:   48502 irqs/sec  kernel:90.3%  exact:  0.0% [4000Hz cycles],  (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------------------

     5.69%  [kernel]                [k] i40e_napi_poll
     4.73%  [kernel]                [k] fib_table_lookup
     3.79%  [kernel]                [k] do_raw_spin_lock
     3.43%  [kernel]                [k] __dev_queue_xmit
     2.97%  [kernel]                [k] ipt_do_table
     2.66%  [kernel]                [k] i40e_lan_xmit_frame
     1.93%  [kernel]                [k] ip_finish_output2
     1.91%  [team_mode_roundrobin]  [k] rr_transmit
     1.80%  [kernel]                [k] vlan_do_receive
     1.80%  [kernel]                [k] __netif_receive_skb_core
     1.57%  [kernel]                [k] irq_entries_start
     1.46%  [kernel]                [k] dev_gro_receive
     1.45%  [kernel]                [k] netif_skb_features
     1.28%  [kernel]                [k] __inet_lookup_established
     1.27%  [kernel]                [k] skb_release_data
     1.15%  [kernel]                [k] __build_skb
     1.14%  [kernel]                [k] dev_hard_start_xmit
     1.10%  [kernel]                [k] apic_timer_interrupt
     1.09%  [kernel]                [k] ip_rcv
     1.00%  [kernel]                [k] vlan_dev_hard_start_xmit
     0.99%  [kernel]                [k] __slab_free
     0.98%  [kernel]                [k] ip_forward
     0.93%  [kernel]                [k] inet_gro_receive
     0.77%  [kernel]                [k] __netdev_pick_tx
     0.74%  [kernel]                [k] validate_xmit_skb
     0.74%  [kernel]                [k] __qdisc_run
     0.69%  [kernel]                [k] netdev_pick_tx
     0.69%  [kernel]                [k] rt_cache_valid
     0.68%  [kernel]                [k] pfifo_fast_dequeue
     0.66%  libzebra.so.0.0.0       [.] kmap_item_lookup
     0.65%  [kernel]                [k] read_tsc
     0.62%  [kernel]                [k] atomic_add_unless.constprop.130
     0.60%  [kernel]                [k] tcp_gro_receive


And ixgbe:
     7.08%  [kernel]                [k] do_raw_spin_lock
     5.42%  [kernel]                [k] ixgbe_poll
     4.69%  [kernel]                [k] __dev_queue_xmit
     4.62%  [kernel]                [k] fib_table_lookup
     4.38%  [kernel]                [k] queued_spin_lock_slowpath
     3.82%  [team_mode_roundrobin]  [k] rr_transmit
     3.50%  [kernel]                [k] ixgbe_xmit_frame_ring
     2.69%  [kernel]                [k] netdev_pick_tx
     2.29%  [kernel]                [k] skb_release_data
     1.89%  [kernel]                [k] ip_finish_output2
     1.84%  [kernel]                [k] __netif_receive_skb_core
     1.67%  [kernel]                [k] page_frag_free
     1.67%  [kernel]                [k] dev_gro_receive
     1.58%  [kernel]                [k] vlan_do_receive
     1.53%  [kernel]                [k] skb_unref
     1.32%  [kernel]                [k] virt_to_head_page
     1.27%  [kernel]                [k] __slab_free
     1.22%  [kernel]                [k] netif_skb_features
     1.19%  [kernel]                [k] ixgbe_maybe_stop_tx
     1.19%  [kernel]                [k] fq_codel_dequeue
     1.10%  [kernel]                [k] __build_skb
     1.05%  [kernel]                [k] dma_mapping_error
     1.03%  [kernel]                [k] ip_rcv
     1.02%  [kernel]                [k] compound_head
     1.01%  [kernel]                [k] inet_gro_receive
     0.99%  [kernel]                [k] ipt_do_table
Comment 14 Pawel Staszewski 2017-10-22 18:25:36 UTC
Forgot to mention: 8x ixgbe = 60 Gbit of traffic, about 6 Mpps.
i40e was 20 Gbit, about 2.5 Mpps (so i40e carried less than half the traffic with the same or more load in the code).
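
A rough check of the two figures above, assuming they are aggregate rates: dividing the bit rate by the packet rate gives the average packet size, which differs between the two setups, so the profiles are not an exact like-for-like comparison. avg_pkt_bytes is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: average packet size in bytes.
# Arguments: bits per second, packets per second.
avg_pkt_bytes() {
    echo $(( $1 / $2 / 8 ))
}

avg_pkt_bytes 60000000000 6000000   # ixgbe: 60 Gbit/s at 6 Mpps
avg_pkt_bytes 20000000000 2500000   # i40e: 20 Gbit/s at 2.5 Mpps
```

This prints 1250 for ixgbe and 1000 for i40e.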
Comment 15 Pawel Staszewski 2017-10-22 18:26:19 UTC
Just adding the perf top header with irqs/sec for the ixgbe profile:
   PerfTop:  147621 irqs/sec  kernel:99.4%  exact:  0.0% [4000Hz cycles],  (all, 40 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

     7.14%  [kernel]                [k] do_raw_spin_lock
     5.43%  [kernel]                [k] ixgbe_poll
     4.67%  [kernel]                [k] __dev_queue_xmit
     4.62%  [kernel]                [k] fib_table_lookup
     3.84%  [kernel]                [k] queued_spin_lock_slowpath
     3.75%  [team_mode_roundrobin]  [k] rr_transmit
     3.53%  [kernel]                [k] ixgbe_xmit_frame_ring
     2.58%  [kernel]                [k] netdev_pick_tx
     2.30%  [kernel]                [k] skb_release_data
     1.91%  [kernel]                [k] ip_finish_output2
     1.81%  [kernel]                [k] __netif_receive_skb_core
     1.69%  [kernel]                [k] page_frag_free
     1.67%  [kernel]                [k] dev_gro_receive
     1.60%  [kernel]                [k] vlan_do_receive
     1.51%  [kernel]                [k] skb_unref
     1.31%  [kernel]                [k] virt_to_head_page
     1.27%  [kernel]                [k] __slab_free
     1.21%  [kernel]                [k] netif_skb_features
     1.20%  [kernel]                [k] ixgbe_maybe_stop_tx
Comment 16 Pawel Staszewski 2017-10-22 18:47:23 UTC
Just tried 4.13.9 :)
But I'm hitting another bug that is fixed in 4.14-rc3 and was not backported to 4.13.9.
So... I can't check 4.13.9. Attaching an image of the panic.
Comment 17 Pawel Staszewski 2017-10-22 18:47:53 UTC
Created attachment 260319
4.13.9 panic
Comment 18 Pawel Staszewski 2017-10-22 18:55:08 UTC
So now we have a problem :)

Kernel 4.11.12 works without problems, but its IP processing performance is too low compared to 4.14 :)

And 4.13.9 panics.
Also, all kernels > 4.11.12 have the memleak.

So now I need to think: either add the memleak patches from Alexander Duyck to 4.14-rc5 and check whether the PF reset issue is still there, or go back to 4.11.12 :) which will not handle the additional 1 Mpps that I need now :)
ehh :)

So I will try to add these patches:
[jkirsher/next-queue PATCH] i40e/i40evf: Revert "i40e/i40evf: bump tail only in multiples of 8"
[jkirsher/net-queue PATCH] i40e: Add programming descriptors to cleaned_count

to 4.14-rc5-next :)

I saw there were some more i40e patches that are currently not in the linux-next tree.

More patches are in the davem net tree, so maybe adding these patches to the net tree will be better...
Comment 19 Pawel Staszewski 2017-10-22 22:09:00 UTC
OK, just upgraded to
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/
with some patches for the memleak, like:
Re: [jkirsher/net-queue PATCH] i40e: Add programming descriptors to cleaned_count

The other was added as well.

Now checking whether the PF reset still happens.
Comment 20 Pawel Staszewski 2017-10-23 16:44:25 UTC
Same problem with the net.git tree:
[67277.797498] NETDEV WATCHDOG: enp2s0f1 (i40e): transmit queue 3 timed out
[67277.797514] ------------[ cut here ]------------
[67277.797519] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:320 dev_watchdog+0xc5/0x122
[67277.797520] Modules linked in: team_mode_roundrobin team ipmi_si x86_pkg_temp_thermal
[67277.797526] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.0-rc4 #4
[67277.797527] task: ffff88085ab32700 task.stack: ffffc900031e4000
[67277.797529] RIP: 0010:dev_watchdog+0xc5/0x122
[67277.797530] RSP: 0018:ffff88085e2c3e88 EFLAGS: 00010292
[67277.797531] RAX: 000000000000003c RBX: ffff8808598b5000 RCX: 0000000000000000
[67277.797532] RDX: ffff88085e2d3501 RSI: ffff88085e2ccab8 RDI: ffff88085e2ccab8
[67277.797532] RBP: ffff88085e2c3e98 R08: 0026672c83440046 R09: ffff88087f0138ac
[67277.797533] R10: 0000000000000001 R11: 000000000000005c R12: 0000000000000003
[67277.797533] R13: ffffffff815ee173 R14: ffff8808598b5438 R15: ffff8808598b5000
[67277.797534] FS:  0000000000000000(0000) GS:ffff88085e2c0000(0000) knlGS:0000000000000000
[67277.797535] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[67277.797535] CR2: 00007fbe415bdff0 CR3: 0000000856a3e005 CR4: 00000000001606e0
[67277.797536] Call Trace:
[67277.797537]  <IRQ>
[67277.797540]  call_timer_fn+0x56/0x119
[67277.797541]  run_timer_softirq+0x136/0x15b
[67277.797542]  ? tk_clock_read+0xc/0xe
[67277.797544]  ? timekeeping_get_ns+0x1d/0x31
[67277.797546]  __do_softirq+0xe6/0x23a
[67277.797549]  irq_exit+0x4d/0x5b
[67277.797550]  smp_apic_timer_interrupt+0xb6/0xf0
[67277.797551]  apic_timer_interrupt+0x90/0xa0
[67277.797552]  </IRQ>
[67277.797555] RIP: 0010:default_idle+0x17/0x28
[67277.797556] RSP: 0018:ffffc900031e7eb8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[67277.797557] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88085e2d0b40
[67277.797558] RDX: 00000000cb9b0cbe RSI: 0000000000000003 RDI: 0000000000000001
[67277.797559] RBP: ffffc900031e7eb8 R08: 0000000000000001 R09: 0000000000000208
[67277.797559] R10: ffffc90003297e38 R11: ffff88085e2d8f90 R12: ffff88085ab32700
[67277.797560] R13: 0000000000000000 R14: ffff88085ab32700 R15: 00000000fffffff0
[67277.797563]  ? default_idle+0x15/0x28
[67277.797565]  arch_cpu_idle+0xa/0xc
[67277.797567]  default_idle_call+0x15/0x17
[67277.797569]  do_idle+0xb8/0x16f
[67277.797570]  cpu_startup_entry+0x1a/0x1c
[67277.797573]  start_secondary+0xe9/0xec
[67277.797575]  secondary_startup_64+0xa5/0xa5
[67277.797576] Code: 64 ed 6e 00 00 75 38 48 89 df c6 05 58 ed 6e 00 01 e8 ea 42 fe ff 44 89 e1 48 89 de 48 c7 c7 b5 75 ac 81 48 89 c2 e8 92 2a a8 ff <0f> ff eb 10 41 ff c4 48 05 40 01 00 00 41 39 f4 75 9a eb 0d 48
[67277.797597] ---[ end trace 254b5ec935206247 ]---
[67277.797601] i40e 0000:02:00.1 enp2s0f1: tx_timeout: VSI_seid: 398, Q 3, NTC: 0x11, HWB: 0x11, NTU: 0x1f6, TAIL: 0x11, INT: 0x0
[67277.797602] i40e 0000:02:00.1 enp2s0f1: tx_timeout recovery level 1, hung_queue 3
[67278.517801] i40e 0000:02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on
[67916.300725] i40e 0000:02:00.1 enp2s0f1: NIC Link is Down
Comment 21 Pawel Staszewski 2017-10-23 17:03:49 UTC
And another port, just after the first one:
[68866.021624] i40e 0000:03:00.2 enp3s0f2: tx_timeout: VSI_seid: 396, Q 4, NTC: 0x1c8, HWB: 0x1c8, NTU: 0x1a6, TAIL: 0x1c8, INT: 0x1
[68866.021627] i40e 0000:03:00.2 enp3s0f2: tx_timeout recovery level 1, hung_queue 4
[68866.398730] i40e 0000:03:00.2: PF reset failed, -15



Same pattern as on the net-next/linux-next git trees.
Comment 22 Pawel Staszewski 2017-10-23 17:20:32 UTC
And yet another port just went down:
[69650.917687] i40e 0000:02:00.3 enp2s0f3: tx_timeout: VSI_seid: 397, Q 5, NTC: 0x117, HWB: 0x117, NTU: 0x100, TAIL: 0x117, INT: 0x1
[69650.917689] i40e 0000:02:00.3 enp2s0f3: tx_timeout recovery level 1, hung_queue 5
[69651.292249] i40e 0000:02:00.3: PF reset failed, -15




Whole dmesg for this: it begins with the same enp2s0f1 watchdog trace, tx_timeout and ENOSPC lines already shown in comment 20.
[67916.300725] i40e 0000:02:00.1 enp2s0f1: NIC Link is Down
[68866.021624] i40e 0000:03:00.2 enp3s0f2: tx_timeout: VSI_seid: 396, Q 4, NTC: 0x1c8, HWB: 0x1c8, NTU: 0x1a6, TAIL: 0x1c8, INT: 0x1
[68866.021627] i40e 0000:03:00.2 enp3s0f2: tx_timeout recovery level 1, hung_queue 4
[68866.398730] i40e 0000:03:00.2: PF reset failed, -15
[69650.917687] i40e 0000:02:00.3 enp2s0f3: tx_timeout: VSI_seid: 397, Q 5, NTC: 0x117, HWB: 0x117, NTU: 0x100, TAIL: 0x117, INT: 0x1
[69650.917689] i40e 0000:02:00.3 enp2s0f3: tx_timeout recovery level 1, hung_queue 5
[69651.292249] i40e 0000:02:00.3: PF reset failed, -15
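Each tx_timeout line above packs the ring state of the hung queue into a single line. A minimal Python sketch for pulling those fields out of a saved dmesg follows. It assumes NTC and NTU are the driver's next-to-clean and next-to-use ring indices, and the default ring size of 512 is only an assumption (the real value can be checked with `ethtool -g <iface>`). The function name `parse_tx_timeout` and the `ntc_to_ntu` field are hypothetical helpers for illustration, not part of the driver.

```python
import re

# Matches the i40e "tx_timeout:" line format seen in the dmesg output above.
TX_TIMEOUT_RE = re.compile(
    r"i40e (?P<pci>\S+) (?P<iface>\S+): tx_timeout: VSI_seid: (?P<seid>\d+), "
    r"Q (?P<queue>\d+), NTC: (?P<ntc>0x[0-9a-f]+), HWB: (?P<hwb>0x[0-9a-f]+), "
    r"NTU: (?P<ntu>0x[0-9a-f]+), TAIL: (?P<tail>0x[0-9a-f]+), "
    r"INT: (?P<intr>0x[0-9a-f]+)"
)


def parse_tx_timeout(line, ring_size=512):
    """Parse one tx_timeout line into a dict, or return None if it does not match.

    ring_size is an ASSUMPTION (not stated anywhere in this report).
    """
    match = TX_TIMEOUT_RE.search(line)
    if match is None:
        return None
    info = {
        "pci": match["pci"],
        "iface": match["iface"],
        "seid": int(match["seid"]),
        "queue": int(match["queue"]),
    }
    # The ring indices and the interrupt register are printed in hex.
    for key in ("ntc", "hwb", "ntu", "tail", "intr"):
        info[key] = int(match[key], 16)
    # Wrap-around distance from NTC to NTU under the assumed ring size.
    info["ntc_to_ntu"] = (info["ntu"] - info["ntc"]) % ring_size
    return info
```

Feeding every line of a saved dmesg through `parse_tx_timeout` and keeping the non-None results gives one record per hung queue, which makes it easier to compare the ring state across ports.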
Comment 23 Pawel Staszewski 2017-10-23 18:12:48 UTC
OK, all interfaces went down and I can't bring them back up. I can now only log in through IPMI and reboot. There is no more info in dmesg; all interfaces are down with "PF reset failed, -15".
Comment 24 Pawel Staszewski 2017-10-23 20:06:14 UTC
OK, reverting back to 4.11.12.
Comment 25 Pawel Staszewski 2017-10-24 00:48:31 UTC
OK, last chance: the kernel where I'm not hitting the DST_NOCACHE bug is 4.12.14.
I just applied the memleak patches for i40e and we will see. Now waiting about 16-18 hours, because after that much time all NICs were shutting down with PF reset failures.
Comment 26 Pawel Staszewski 2017-10-24 20:00:59 UTC
So far all is working with 4.12.14

So this is the latest working kernel for me.
Comment 27 Pawel Staszewski 2017-10-24 23:27:53 UTC
So what now? :)
Are we waiting a year for somebody else to hit the same problem? :)

For me:
4.12.14 - working kernel
4.13.X - NOCACHE FIB PANIC (can't apply patches because many things from 4.14 are missing)
4.14.X - any version from 4.14-rc1 to net-next / net / linux-next: PF reset bug
Comment 28 Alexander Duyck 2017-10-25 15:36:30 UTC
So I think there was some mention of bonding somewhere wasn't there? If you are using a bond could you provide the bonding config as well?

You are running the latest firmware from what I can tell which is important as there were known bonding issues with some of the older versions:
firmware-version: 6.01 0x80003484 1.1747.0

Just to paraphrase:
4.12.14 - working kernel (ignoring memory leak we recently fixed)
4.13.X - Don't know due to unrelated panic that blocks testing
4.14.X - PF reset bug issues

That narrows things down a bit. We will focus on trying to address the PF reset bug that is being reported. We just have to get to the root cause and then work on trying to fix it.
Comment 29 Pawel Staszewski 2017-10-25 23:24:43 UTC
Yes, there is teamd:
modprobe team
teamd -d -f /etc/teamd.conf

cat /etc/teamd.conf
{
        "device": "team0",
        "hwaddr": "0c:c4:7a:bc:40:f7",
        "runner": {
                "name": "roundrobin",
                "active": true,
                "fast_rate": true,
                "tx_hash": ["eth", "ipv4", "ipv6"]
        },
        "link_watch": {"name": "ethtool"},
        "ports": {"enp2s0f0": {}, "enp2s0f1": {}, "enp2s0f2": {}, "enp2s0f3": {}, "enp3s0f0": {}, "enp3s0f1": {}, "enp3s0f2": {}, "enp3s0f3": {}}
}
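To check which of the team ports listed in this config are still up after a failure, something like the following Python sketch could be used. It reads the standard `operstate` attribute under `/sys/class/net`. The function names `team_ports` and `port_states` are hypothetical helpers, and the config path in the usage note is taken from the command above.

```python
import json
from pathlib import Path


def team_ports(config_text):
    """Return the port names listed in a teamd JSON config, sorted by name."""
    return sorted(json.loads(config_text).get("ports", {}))


def port_states(ports, sysfs_root="/sys/class/net"):
    """Map each port name to its operstate, or "missing" if it cannot be read."""
    states = {}
    for port in ports:
        path = Path(sysfs_root) / port / "operstate"
        try:
            states[port] = path.read_text().strip()
        except OSError:
            # Interface does not exist or sysfs is not readable here.
            states[port] = "missing"
    return states
```

For example, `port_states(team_ports(Path("/etc/teamd.conf").read_text()))` would return a dict such as `{"enp2s0f0": "down", ...}` on the affected machine, depending on the state of each link.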
Comment 30 Pawel Staszewski 2017-10-26 14:42:26 UTC
OK, so 4.12.14 also hits the PF reset bug.
It just takes longer for the bug to show up:
[222383.298448] NETDEV WATCHDOG: enp2s0f0 (i40e): transmit queue 2 timed out
[222383.298464] ------------[ cut here ]------------
[222383.298469] WARNING: CPU: 3 PID: 24 at net/sched/sch_generic.c:316 dev_watchdog+0xc5/0x122
[222383.298470] Modules linked in: team_mode_roundrobin team x86_pkg_temp_thermal ipmi_si
[222383.298475] CPU: 3 PID: 24 Comm: ksoftirqd/3 Not tainted 4.12.14 #2
[222383.298476] task: ffff88085abb3100 task.stack: ffffc90003294000
[222383.298478] RIP: 0010:dev_watchdog+0xc5/0x122
[222383.298479] RSP: 0018:ffffc90003297d90 EFLAGS: 00010286
[222383.298480] RAX: 000000000000003c RBX: ffff880859144000 RCX: 0000000000000000
[222383.298481] RDX: ffff88085e2d3301 RSI: ffff88085e2ccaa8 RDI: ffff88085e2ccaa8
[222383.298482] RBP: ffffc90003297da0 R08: 0000000000000000 R09: ffff88087f0137cc
[222383.298482] R10: ffffc90003297cf8 R11: 000000000000005c R12: 0000000000000002
[222383.298483] R13: ffffffff815dd720 R14: ffff880859144438 R15: ffff880859144000
[222383.298484] FS:  0000000000000000(0000) GS:ffff88085e2c0000(0000) knlGS:0000000000000000
[222383.298485] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[222383.298486] CR2: 00007ffe7ef59b68 CR3: 000000084838c000 CR4: 00000000001406e0
[222383.298487] Call Trace:
[222383.298491]  call_timer_fn+0x56/0x119
[222383.298492]  run_timer_softirq+0x136/0x15b
[222383.298495]  ? update_next_balance+0x1a/0x2d
[222383.298499]  ? _raw_spin_lock+0x9/0xb
[222383.298501]  ? pick_next_task_fair+0x26d/0x2e5
[222383.298504]  __do_softirq+0xe6/0x23a
[222383.298508]  ? sort_range+0x1d/0x1d
[222383.298509]  run_ksoftirqd+0x15/0x2a
[222383.298510]  smpboot_thread_fn+0x126/0x13d
[222383.298512]  kthread+0xf6/0xfb
[222383.298513]  ? init_completion+0x24/0x24
[222383.298514]  ret_from_fork+0x22/0x30
[222383.298515] Code: b9 3a 70 00 00 75 38 48 89 df c6 05 ad 3a 70 00 01 e8 6a 59 fe ff 44 89 e1 48 89 de 48 c7 c7 4c a6 aa 81 48 89 c2 e8 30 5f b0 ff <0f> ff eb 10 41 ff c4 48 05 40 01 00 00 41 39 f4 75 9a eb 0d 48
[222383.298530] ---[ end trace a9810da52af61a52 ]---
[222383.298536] i40e 0000:02:00.0 enp2s0f0: tx_timeout: VSI_seid: 399, Q 2, NTC: 0x1e4, HWB: 0x1e4, NTU: 0x1cc, TAIL: 0x1e4, INT: 0x1
[222383.298537] i40e 0000:02:00.0 enp2s0f0: tx_timeout recovery level 1, hung_queue 2
[222383.666444] i40e 0000:02:00.0: PF reset failed, -15


The first link has gone down now.
So probably all the others will go down too.
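The dmesg timestamps are seconds since boot, so the time until the first watchdog hit can be compared between kernels. A minimal Python sketch follows. The helper name `first_watchdog` is hypothetical, and it assumes the saved log starts from the boot in question. For the 4.12.14 trace above it gives roughly 61.8 hours, against roughly 18.7 hours for the 4.14.0-rc4 trace in comment 22.

```python
import re

# Matches the NETDEV WATCHDOG line printed when a transmit queue times out.
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<secs>\d+\.\d+)\] NETDEV WATCHDOG: (?P<iface>\S+) \(i40e\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def first_watchdog(dmesg_text):
    """Return (iface, queue, hours_since_boot) for the first watchdog hit, or None."""
    match = WATCHDOG_RE.search(dmesg_text)
    if match is None:
        return None
    hours = float(match["secs"]) / 3600.0
    return match["iface"], int(match["queue"]), hours
```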
Comment 31 Pawel Staszewski 2017-10-26 14:53:34 UTC
Also can't bring this link up again:
ip link set up enp2s0f0
RTNETLINK answers: Cannot allocate memory


[223584.066499] i40e 0000:02:00.0 enp2s0f0: NIC Link is Down
[223584.682314] i40e 0000:02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on
Comment 32 Pawel Staszewski 2017-10-26 14:56:44 UTC
Next attempt to bring the port up:
ip link set up dev enp2s0f0
RTNETLINK answers: Device or resource busy

[224051.287277] WARNING: CPU: 3 PID: 25031 at drivers/net/ethernet/intel/i40e/i40e_txrx.c:1248 i40e_setup_rx_descriptors+0x15/0xa9
[224051.287278] Modules linked in: team_mode_roundrobin team x86_pkg_temp_thermal ipmi_si
[224051.287327] CPU: 3 PID: 25031 Comm: ip Tainted: G        W       4.12.14 #2
[224051.287330] task: ffff880859e09880 task.stack: ffffc900036ec000
[224051.287332] RIP: 0010:i40e_setup_rx_descriptors+0x15/0xa9
[224051.287332] RSP: 0018:ffffc900036ef6e8 EFLAGS: 00010286
[224051.287333] RAX: ffff8808595eda00 RBX: ffff880856d36d00 RCX: 014000c000000001
[224051.287334] RDX: 0000000000000001 RSI: ffff880844418000 RDI: ffff880856d36d00
[224051.287334] RBP: ffffc900036ef6f8 R08: 000000000001ccc3 R09: ffffea0021110620
[224051.287335] R10: 0000000000000000 R11: ffff88087effae90 R12: ffff8808590300a0
[224051.287335] R13: 0000000000000002 R14: 00000000fffffff0 R15: 0000000000000001
[224051.287336] FS:  00007f1e4658b740(0000) GS:ffff88085e2c0000(0000) knlGS:0000000000000000
[224051.287337] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[224051.287337] CR2: 00007ffd74790000 CR3: 000000059e2f4000 CR4: 00000000001406e0
[224051.287338] Call Trace:
[224051.287339]  i40e_vsi_open+0x7d/0x1e7
[224051.287341]  i40e_open+0x4d/0xc3
[224051.287342]  __dev_open+0x8b/0xcd
[224051.287344]  __dev_change_flags+0xa2/0x13d
[224051.287346]  dev_change_flags+0x20/0x53
[224051.287347]  do_setlink+0x2d0/0xad6
[224051.287349]  ? zone_statistics+0x5a/0x61
[224051.287350]  ? get_page_from_freelist+0x4c8/0x627
[224051.287352]  rtnl_newlink+0x391/0x6d6
[224051.287353]  ? netdev_master_upper_dev_get+0xd/0x57
[224051.287354]  ? rtnl_newlink+0x106/0x6d6
[224051.287356]  ? alloc_pages_vma+0x8c/0x17a
[224051.287357]  ? pagevec_lru_move_fn+0x20/0xc1
[224051.287359]  ? lru_cache_add_active_or_unevictable+0x27/0x7a
[224051.287360]  ? __handle_mm_fault+0x4c1/0x8ae
[224051.287362]  rtnetlink_rcv_msg+0x166/0x173
[224051.287363]  ? __kmalloc_node_track_caller+0x11f/0x12f
[224051.287365]  ? __alloc_skb+0x89/0x175
[224051.287366]  ? rtnl_newlink+0x6d6/0x6d6
[224051.287367]  netlink_rcv_skb+0x57/0xa0
[224051.287369]  rtnetlink_rcv+0x1e/0x25
[224051.287371]  netlink_unicast+0x103/0x187
[224051.287372]  netlink_sendmsg+0x28d/0x2ad
[224051.287374]  sock_sendmsg_nosec+0x12/0x1d
[224051.287375]  ___sys_sendmsg+0x19d/0x217
[224051.287377]  ? kmem_cache_free+0x4b/0xf3
[224051.287492]  ? alloc_pages_vma+0x147/0x17a
[224051.287494]  ? __page_set_anon_rmap+0x24/0x65
[224051.287495]  ? get_page+0x9/0xf
[224051.287496]  ? __lru_cache_add+0x18/0x47
[224051.287498]  ? __handle_mm_fault+0x4c1/0x8ae
[224051.287499]  __sys_sendmsg+0x40/0x5e
[224051.287564]  ? __sys_sendmsg+0x40/0x5e
[224051.287566]  SyS_sendmsg+0xd/0x17
[224051.287567]  entry_SYSCALL_64_fastpath+0x13/0x94
[224051.287568] RIP: 0033:0x7f1e45cac620
[224051.287569] RSP: 002b:00007ffd7478b4d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[224051.287570] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1e45cac620
[224051.287571] RDX: 0000000000000000 RSI: 00007ffd7478b520 RDI: 0000000000000003
[224051.287572] RBP: 00007ffd7478b520 R08: 0000000000000001 R09: fefefeff77686d74
[224051.287572] R10: 00000000000005e6 R11: 0000000000000246 R12: 00007ffd7478b560
[224051.287573] R13: 00000000006724c0 R14: 00007ffd747935e0 R15: 0000000000000000
[224051.287574] Code: 00 00 48 8b 7b 10 e8 41 f2 ff ff 48 c7 43 08 00 00 00 00 5b 5d c3 55 48 89 e5 41 54 53 48 83 7f 20 00 48 89 fb 4c 8b 67 10 74 02 <0f> ff 0f b7 7b 44 48 6b ff 18 e8 65 f5 ff ff 48 85 c0 48 89 43
[224051.287597] ---[ end trace a9810da52af61a5a ]---
[224051.287607] ------------[ cut here ]------------
[224051.287609] WARNING: CPU: 3 PID: 25031 at drivers/net/ethernet/intel/i40e/i40e_txrx.c:1248 i40e_setup_rx_descriptors+0x15/0xa9
[224051.287609] Modules linked in: team_mode_roundrobin team x86_pkg_temp_thermal ipmi_si
[224051.287612] CPU: 3 PID: 25031 Comm: ip Tainted: G        W       4.12.14 #2
[224051.287613] task: ffff880859e09880 task.stack: ffffc900036ec000
[224051.287614] RIP: 0010:i40e_setup_rx_descriptors+0x15/0xa9
[224051.287615] RSP: 0018:ffffc900036ef6e8 EFLAGS: 00010286
[224051.287616] RAX: ffff8808595eda00 RBX: ffff880856d36f00 RCX: 014000c000000002
[224051.287617] RDX: 0000000000000002 RSI: ffff880590cb4000 RDI: ffff880856d36f00
[224051.287618] RBP: ffffc900036ef6f8 R08: 000000000001ccc3 R09: ffffea0016432d20
[224051.287618] R10: 0000000000000000 R11: ffff88087effae90 R12: ffff8808590300a0
[224051.287619] R13: 0000000000000003 R14: 00000000fffffff0 R15: 0000000000000001
[224051.287620] FS:  00007f1e4658b740(0000) GS:ffff88085e2c0000(0000) knlGS:0000000000000000
[224051.287621] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[224051.287622] CR2: 00007ffd74790000 CR3: 000000059e2f4000 CR4: 00000000001406e0
[224051.287622] Call Trace:
[224051.287624]  i40e_vsi_open+0x7d/0x1e7
[224051.288201]  i40e_open+0x4d/0xc3
[224051.288203]  __dev_open+0x8b/0xcd
[224051.288205]  __dev_change_flags+0xa2/0x13d
[224051.288207]  dev_change_flags+0x20/0x53
[224051.288208]  do_setlink+0x2d0/0xad6
[224051.288210]  ? zone_statistics+0x5a/0x61
[224051.288212]  ? get_page_from_freelist+0x4c8/0x627
[224051.288213]  rtnl_newlink+0x391/0x6d6
[224051.288215]  ? netdev_master_upper_dev_get+0xd/0x57
[224051.288216]  ? rtnl_newlink+0x106/0x6d6
[224051.288217]  ? alloc_pages_vma+0x8c/0x17a
[224051.288219]  ? pagevec_lru_move_fn+0x20/0xc1
[224051.288220]  ? lru_cache_add_active_or_unevictable+0x27/0x7a
[224051.288221]  ? __handle_mm_fault+0x4c1/0x8ae
[224051.288224]  rtnetlink_rcv_msg+0x166/0x173
[224051.288225]  ? __kmalloc_node_track_caller+0x11f/0x12f
[224051.288227]  ? __alloc_skb+0x89/0x175
[224051.288228]  ? rtnl_newlink+0x6d6/0x6d6
[224051.288230]  netlink_rcv_skb+0x57/0xa0
[224051.288232]  rtnetlink_rcv+0x1e/0x25
[224051.288233]  netlink_unicast+0x103/0x187
[224051.288235]  netlink_sendmsg+0x28d/0x2ad
[224051.288236]  sock_sendmsg_nosec+0x12/0x1d
[224051.288238]  ___sys_sendmsg+0x19d/0x217
[224051.288239]  ? kmem_cache_free+0x4b/0xf3
[224051.288241]  ? alloc_pages_vma+0x147/0x17a
[224051.288242]  ? __page_set_anon_rmap+0x24/0x65
[224051.288244]  ? get_page+0x9/0xf
[224051.288245]  ? __lru_cache_add+0x18/0x47
[224051.288246]  ? __handle_mm_fault+0x4c1/0x8ae
[224051.288249]  __sys_sendmsg+0x40/0x5e
[224051.288250]  ? __sys_sendmsg+0x40/0x5e
[224051.288252]  SyS_sendmsg+0xd/0x17
[224051.288253]  entry_SYSCALL_64_fastpath+0x13/0x94
[224051.288254] RIP: 0033:0x7f1e45cac620
[224051.288255] RSP: 002b:00007ffd7478b4d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[224051.288256] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1e45cac620
[224051.288257] RDX: 0000000000000000 RSI: 00007ffd7478b520 RDI: 0000000000000003
[224051.288258] RBP: 00007ffd7478b520 R08: 0000000000000001 R09: fefefeff77686d74
[224051.288259] R10: 00000000000005e6 R11: 0000000000000246 R12: 00007ffd7478b560
[224051.288259] R13: 00000000006724c0 R14: 00007ffd747935e0 R15: 0000000000000000
[224051.288260] Code: 00 00 48 8b 7b 10 e8 41 f2 ff ff 48 c7 43 08 00 00 00 00 5b 5d c3 55 48 89 e5 41 54 53 48 83 7f 20 00 48 89 fb 4c 8b 67 10 74 02 <0f> ff 0f b7 7b 44 48 6b ff 18 e8 65 f5 ff ff 48 85 c0 48 89 43
[224051.288284] ---[ end trace a9810da52af61a5b ]---
[224051.288291] ------------[ cut here ]------------
[224051.288293] WARNING: CPU: 3 PID: 25031 at drivers/net/ethernet/intel/i40e/i40e_txrx.c:1248 i40e_setup_rx_descriptors+0x15/0xa9
[224051.288293] Modules linked in: team_mode_roundrobin team x86_pkg_temp_thermal ipmi_si
[224051.288295] CPU: 3 PID: 25031 Comm: ip Tainted: G        W       4.12.14 #2
[224051.288296] task: ffff880859e09880 task.stack: ffffc900036ec000
[224051.288297] RIP: 0010:i40e_setup_rx_descriptors+0x15/0xa9
[224051.288297] RSP: 0018:ffffc900036ef6e8 EFLAGS: 00010286
[224051.288298] RAX: ffff8808595eda00 RBX: ffff880856d37100 RCX: 014000c000000003
[224051.288299] RDX: 0000000000000003 RSI: ffff880845df0000 RDI: ffff880856d37100
[224051.288300] RBP: ffffc900036ef6f8 R08: 000000000001ccc3 R09: ffffea0021177c20
[224051.288300] R10: 0000000000000000 R11: ffff88087effae90 R12: ffff8808590300a0
[224051.288301] R13: 0000000000000004 R14: 00000000fffffff0 R15: 0000000000000001
[224051.288302] FS:  00007f1e4658b740(0000) GS:ffff88085e2c0000(0000) knlGS:0000000000000000
[224051.288303] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[224051.288303] CR2: 00007ffd74790000 CR3: 000000059e2f4000 CR4: 00000000001406e0
[224051.288304] Call Trace:
[224051.288306]  i40e_vsi_open+0x7d/0x1e7
[224051.288307]  i40e_open+0x4d/0xc3
[224051.288309]  __dev_open+0x8b/0xcd
[224051.288311]  __dev_change_flags+0xa2/0x13d
[224051.288313]  dev_change_flags+0x20/0x53
[224051.288314]  do_setlink+0x2d0/0xad6
[224051.288315]  ? zone_statistics+0x5a/0x61
[224051.288317]  ? get_page_from_freelist+0x4c8/0x627
[224051.288319]  rtnl_newlink+0x391/0x6d6
[224051.288320]  ? netdev_master_upper_dev_get+0xd/0x57
[224051.288321]  ? rtnl_newlink+0x106/0x6d6
[224051.288322]  ? alloc_pages_vma+0x8c/0x17a
[224051.288323]  ? pagevec_lru_move_fn+0x20/0xc1
[224051.288324]  ? lru_cache_add_active_or_unevictable+0x27/0x7a
[224051.288324]  ? __handle_mm_fault+0x4c1/0x8ae
[224051.288326]  rtnetlink_rcv_msg+0x166/0x173
[224051.288327]  ? __kmalloc_node_track_caller+0x11f/0x12f
[224051.288328]  ? __alloc_skb+0x89/0x175
[224051.288328]  ? rtnl_newlink+0x6d6/0x6d6
[224051.288330]  netlink_rcv_skb+0x57/0xa0
[224051.288331]  rtnetlink_rcv+0x1e/0x25
[224051.288332]  netlink_unicast+0x103/0x187
[224051.288333]  netlink_sendmsg+0x28d/0x2ad
[224051.288334]  sock_sendmsg_nosec+0x12/0x1d
[224051.288335]  ___sys_sendmsg+0x19d/0x217
[224051.288336]  ? kmem_cache_free+0x4b/0xf3
[224051.288337]  ? alloc_pages_vma+0x147/0x17a
[224051.288339]  ? __page_set_anon_rmap+0x24/0x65
[224051.288340]  ? get_page+0x9/0xf
[224051.288341]  ? __lru_cache_add+0x18/0x47
[224051.288342]  ? __handle_mm_fault+0x4c1/0x8ae
[224051.288344]  __sys_sendmsg+0x40/0x5e
[224051.288345]  ? __sys_sendmsg+0x40/0x5e
[224051.288347]  SyS_sendmsg+0xd/0x17
[224051.288349]  entry_SYSCALL_64_fastpath+0x13/0x94
[224051.288349] RIP: 0033:0x7f1e45cac620
[224051.288350] RSP: 002b:00007ffd7478b4d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[224051.288351] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1e45cac620
[224051.288352] RDX: 0000000000000000 RSI: 00007ffd7478b520 RDI: 0000000000000003
[224051.288353] RBP: 00007ffd7478b520 R08: 0000000000000001 R09: fefefeff77686d74
[224051.288353] R10: 00000000000005e6 R11: 0000000000000246 R12: 00007ffd7478b560
[224051.288354] R13: 00000000006724c0 R14: 00007ffd747935e0 R15: 0000000000000000
[224051.288355] Code: 00 00 48 8b 7b 10 e8 41 f2 ff ff 48 c7 43 08 00 00 00 00 5b 5d c3 55 48 89 e5 41 54 53 48 83 7f 20 00 48 89 fb 4c 8b 67 10 74 02 <0f> ff 0f b7 7b 44 48 6b ff 18 e8 65 f5 ff ff 48 85 c0 48 89 43
[224051.288376] ---[ end trace a9810da52af61a5c ]---
[224051.288382] ------------[ cut here ]------------
[224051.288384] WARNING: CPU: 3 PID: 25031 at drivers/net/ethernet/intel/i40e/i40e_txrx.c:1248 i40e_setup_rx_descriptors+0x15/0xa9
[224051.288384] Modules linked in: team_mode_roundrobin team x86_pkg_temp_thermal ipmi_si
[224051.288387] CPU: 3 PID: 25031 Comm: ip Tainted: G        W       4.12.14 #2
[224051.288387] task: ffff880859e09880 task.stack: ffffc900036ec000
[224051.288389] RIP: 0010:i40e_setup_rx_descriptors+0x15/0xa9
[224051.288389] RSP: 0018:ffffc900036ef6e8 EFLAGS: 00010286
[224051.288391] RAX: ffff8808595eda00 RBX: ffff880856d37300 RCX: 014000c000000004
[224051.288391] RDX: 0000000000000004 RSI: ffff88084bf2c000 RDI: ffff880856d37300
[224051.288392] RBP: ffffc900036ef6f8 R08: 000000000001ccc3 R09: ffffea00212fcb20
[224051.288393] R10: 0000000000000000 R11: ffff88087effae90 R12: ffff8808590300a0
[224051.288393] R13: 0000000000000005 R14: 00000000fffffff0 R15: 0000000000000001
[224051.288394] FS:  00007f1e4658b740(0000) GS:ffff88085e2c0000(0000) knlGS:0000000000000000
[224051.288395] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[224051.288396] CR2: 00007ffd74790000 CR3: 000000059e2f4000 CR4: 00000000001406e0
[224051.288396] Call Trace:
[224051.288398]  i40e_vsi_open+0x7d/0x1e7
[224051.288399]  i40e_open+0x4d/0xc3
[224051.288401]  __dev_open+0x8b/0xcd
[224051.288403]  __dev_change_flags+0xa2/0x13d
[224051.288404]  dev_change_flags+0x20/0x53
[224051.288405]  do_setlink+0x2d0/0xad6
[224051.288406]  ? zone_statistics+0x5a/0x61
[224051.288408]  ? get_page_from_freelist+0x4c8/0x627
[224051.288409]  rtnl_newlink+0x391/0x6d6
[224051.288409]  ? netdev_master_upper_dev_get+0xd/0x57
[224051.288410]  ? rtnl_newlink+0x106/0x6d6
[224051.288411]  ? alloc_pages_vma+0x8c/0x17a
[224051.288412]  ? pagevec_lru_move_fn+0x20/0xc1
[224051.288413]  ? lru_cache_add_active_or_unevictable+0x27/0x7a
[224051.288414]  ? __handle_mm_fault+0x4c1/0x8ae
[224051.288415]  rtnetlink_rcv_msg+0x166/0x173
[224051.288416]  ? __kmalloc_node_track_caller+0x11f/0x12f
[224051.288417]  ? __alloc_skb+0x89/0x175
[224051.288418]  ? rtnl_newlink+0x6d6/0x6d6
[224051.288419]  netlink_rcv_skb+0x57/0xa0
[224051.288421]  rtnetlink_rcv+0x1e/0x25
[224051.288422]  netlink_unicast+0x103/0x187
[224051.288424]  netlink_sendmsg+0x28d/0x2ad
[224051.288425]  sock_sendmsg_nosec+0x12/0x1d
[224051.288426]  ___sys_sendmsg+0x19d/0x217
[224051.288427]  ? kmem_cache_free+0x4b/0xf3
[224051.288429]  ? alloc_pages_vma+0x147/0x17a
[224051.288431]  ? __page_set_anon_rmap+0x24/0x65
[224051.288432]  ? get_page+0x9/0xf
[224051.288433]  ? __lru_cache_add+0x18/0x47
[224051.288434]  ? __handle_mm_fault+0x4c1/0x8ae
[224051.288436]  __sys_sendmsg+0x40/0x5e
[224051.288437]  ? __sys_sendmsg+0x40/0x5e
[224051.288438]  SyS_sendmsg+0xd/0x17
[224051.288439]  entry_SYSCALL_64_fastpath+0x13/0x94
[224051.288440] RIP: 0033:0x7f1e45cac620
[224051.288440] RSP: 002b:00007ffd7478b4d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[224051.288441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1e45cac620
[224051.288441] RDX: 0000000000000000 RSI: 00007ffd7478b520 RDI: 0000000000000003
[224051.288442] RBP: 00007ffd7478b520 R08: 0000000000000001 R09: fefefeff77686d74
[224051.288442] R10: 00000000000005e6 R11: 0000000000000246 R12: 00007ffd7478b560
[224051.288443] R13: 00000000006724c0 R14: 00007ffd747935e0 R15: 0000000000000000
[224051.288444] Code: 00 00 48 8b 7b 10 e8 41 f2 ff ff 48 c7 43 08 00 00 00 00 5b 5d c3 55 48 89 e5 41 54 53 48 83 7f 20 00 48 89 fb 4c 8b 67 10 74 02 <0f> ff 0f b7 7b 44 48 6b ff 18 e8 65 f5 ff ff 48 85 c0 48 89 43
[224051.288459] ---[ end trace a9810da52af61a5d ]---
[224051.288467] ------------[ cut here ]------------
[224051.288469] WARNING: CPU: 3 PID: 25031 at drivers/net/ethernet/intel/i40e/i40e_txrx.c:1248 i40e_setup_rx_descriptors+0x15/0xa9
[224051.288470] Modules linked in: team_mode_roundrobin team x86_pkg_temp_thermal ipmi_si
[224051.288473] CPU: 3 PID: 25031 Comm: ip Tainted: G        W       4.12.14 #2
[224051.288473] task: ffff880859e09880 task.stack: ffffc900036ec000
[224051.288475] RIP: 0010:i40e_setup_rx_descriptors+0x15/0xa9
[224051.288475] RSP: 0018:ffffc900036ef6e8 EFLAGS: 00010286
[224051.288476] RAX: ffff8808595eda00 RBX: ffff880856d37500 RCX: 014000c000000005
[224051.288477] RDX: 0000000000000005 RSI: ffff880847778000 RDI: ffff880856d37500
[224051.288478] RBP: ffffc900036ef6f8 R08: 000000000001ccc3 R09: ffffea00211dde20
[224051.288479] R10: 0000000000000000 R11: ffff88087effae90 R12: ffff8808590300a0
[224051.288479] R13: 0000000000000006 R14: 00000000fffffff0 R15: 0000000000000001
[224051.288480] FS:  00007f1e4658b740(0000) GS:ffff88085e2c0000(0000) knlGS:0000000000000000
[224051.288481] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[224051.288482] CR2: 00007ffd74790000 CR3: 000000059e2f4000 CR4: 00000000001406e0
[224051.288482] Call Trace:
[224051.288484]  i40e_vsi_open+0x7d/0x1e7
[224051.288486]  i40e_open+0x4d/0xc3
[224051.288487]  __dev_open+0x8b/0xcd
[224051.288489]  __dev_change_flags+0xa2/0x13d
[224051.288491]  dev_change_flags+0x20/0x53
[224051.288492]  do_setlink+0x2d0/0xad6
[224051.288494]  ? zone_statistics+0x5a/0x61
[224051.288496]  ? get_page_from_freelist+0x4c8/0x627
[224051.288497]  rtnl_newlink+0x391/0x6d6
[224051.288498]  ? netdev_master_upper_dev_get+0xd/0x57
[224051.288499]  ? rtnl_newlink+0x106/0x6d6
[224051.288501]  ? alloc_pages_vma+0x8c/0x17a
[224051.288503]  ? pagevec_lru_move_fn+0x20/0xc1
[224051.288504]  ? lru_cache_add_active_or_unevictable+0x27/0x7a
[224051.288505]  ? __handle_mm_fault+0x4c1/0x8ae
[224051.288507]  rtnetlink_rcv_msg+0x166/0x173
[224051.288509]  ? __kmalloc_node_track_caller+0x11f/0x12f
[224051.288510]  ? __alloc_skb+0x89/0x175
[224051.288511]  ? rtnl_newlink+0x6d6/0x6d6
[224051.288513]  netlink_rcv_skb+0x57/0xa0
[224051.288515]  rtnetlink_rcv+0x1e/0x25
[224051.288516]  netlink_unicast+0x103/0x187
[224051.288518]  netlink_sendmsg+0x28d/0x2ad
[224051.288520]  sock_sendmsg_nosec+0x12/0x1d
[224051.288521]  ___sys_sendmsg+0x19d/0x217
[224051.288523]  ? kmem_cache_free+0x4b/0xf3
[224051.288524]  ? alloc_pages_vma+0x147/0x17a
[224051.288526]  ? __page_set_anon_rmap+0x24/0x65
[224051.288527]  ? get_page+0x9/0xf
[224051.288528]  ? __lru_cache_add+0x18/0x47
[224051.288530]  ? __handle_mm_fault+0x4c1/0x8ae
[224051.288531]  __sys_sendmsg+0x40/0x5e
[224051.288533]  ? __sys_sendmsg+0x40/0x5e
[224051.288534]  SyS_sendmsg+0xd/0x17
[224051.288536]  entry_SYSCALL_64_fastpath+0x13/0x94
[224051.288537] RIP: 0033:0x7f1e45cac620
[224051.288537] RSP: 002b:00007ffd7478b4d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[224051.288538] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1e45cac620
[224051.288539] RDX: 0000000000000000 RSI: 00007ffd7478b520 RDI: 0000000000000003
[224051.288540] RBP: 00007ffd7478b520 R08: 0000000000000001 R09: fefefeff77686d74
[224051.288541] R10: 00000000000005e6 R11: 0000000000000246 R12: 00007ffd7478b560
[224051.288541] R13: 00000000006724c0 R14: 00007ffd747935e0 R15: 0000000000000000
[224051.288542] Code: 00 00 48 8b 7b 10 e8 41 f2 ff ff 48 c7 43 08 00 00 00 00 5b 5d c3 55 48 89 e5 41 54 53 48 83 7f 20 00 48 89 fb 4c 8b 67 10 74 02 <0f> ff 0f b7 7b 44 48 6b ff 18 e8 65 f5 ff ff 48 85 c0 48 89 43
[224051.288564] ---[ end trace a9810da52af61a5e ]---
[224051.487730] genirq: Flags mismatch irq 64. 00000000 (i40e-enp2s0f0-TxRx-0) vs. 00000000 (i40e-enp2s0f0-TxRx-0)
[224051.487734] i40e 0000:02:00.0: MSIX request_irq failed, error: -16
[224051.487735] i40e 0000:02:00.0: request_irq failed, Error -16
[224051.854175] i40e 0000:02:00.0: PF reset failed, -15
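The "request_irq failed, Error -16" line above is a negative Linux errno, which matches the "Device or resource busy" answer from `ip link`. A small Python sketch for decoding such values is below. The helper name `describe_errno` is hypothetical. Other negative numbers in these logs, such as the -15 after "PF reset failed", may be driver-internal status codes rather than errno values, so the helper should not be applied to them blindly.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a kernel-style negative errno value."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)
```

On Linux, `describe_errno(-16)` gives the `EBUSY` name, and `describe_errno(-12)` gives `ENOMEM`, the code behind the earlier "Cannot allocate memory" answer.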
Comment 33 Pawel Staszewski 2017-10-26 15:15:36 UTC
Another attempt to bring the interface down and up:
[225184.634024] i40e 0000:02:00.0: update vlan stripping failed, err I40E_ERR_QUEUE_EMPTY aq_err OK
[225184.834987] i40e 0000:02:00.0: Failed to clear LAN Tx queue context on Tx ring 0 (pf_q 0), error: -19
[225185.716274] i40e 0000:02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on
Comment 34 Pawel Staszewski 2017-10-26 15:17:53 UTC
Removed it from teamd and tried to bring it up:
[225308.277194] WARNING: CPU: 1 PID: 3142 at drivers/net/ethernet/intel/i40e/i40e_txrx.c:1248 i40e_setup_rx_descriptors+0x15/0xa9
[225308.277195] Modules linked in: team_mode_roundrobin team x86_pkg_temp_thermal ipmi_si
[225308.277198] CPU: 1 PID: 3142 Comm: ip Tainted: G        W       4.12.14 #2
[225308.277198] task: ffff8808435955c0 task.stack: ffffc90003dc8000
[225308.277200] RIP: 0010:i40e_setup_rx_descriptors+0x15/0xa9
[225308.277200] RSP: 0018:ffffc90003dcb6e8 EFLAGS: 00010286
[225308.277201] RAX: ffff8808595eda00 RBX: ffff880856d37500 RCX: 014000c000000005
[225308.277202] RDX: 0000000000000005 RSI: ffff880844430000 RDI: ffff880856d37500
[225308.277203] RBP: ffffc90003dcb6f8 R08: 000000000001ccc3 R09: ffffea0021110c20
[225308.277204] R10: 0000000000000000 R11: ffff88087effaf60 R12: ffff8808590300a0
[225308.277204] R13: 0000000000000006 R14: 00000000fffffff0 R15: 0000000000000001
[225308.277205] FS:  00007f1f1a058740(0000) GS:ffff88085e240000(0000) knlGS:0000000000000000
[225308.277205] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[225308.277206] CR2: 00007ffc4845f000 CR3: 0000000846df6000 CR4: 00000000001406e0
[225308.277206] Call Trace:
[225308.277207]  i40e_vsi_open+0x7d/0x1e7
[225308.277208]  i40e_open+0x4d/0xc3
[225308.277210]  __dev_open+0x8b/0xcd
[225308.277211]  __dev_change_flags+0xa2/0x13d
[225308.277212]  dev_change_flags+0x20/0x53
[225308.277213]  do_setlink+0x2d0/0xad6
[225308.277214]  ? zone_statistics+0x5a/0x61
[225308.277215]  ? get_page_from_freelist+0x4c8/0x627
[225308.277217]  rtnl_newlink+0x391/0x6d6
[225308.277218]  ? netdev_master_upper_dev_get+0xd/0x57
[225308.277219]  ? rtnl_newlink+0x106/0x6d6
[225308.277220]  ? ___slab_alloc+0xd2/0x425
[225308.277221]  ? ___slab_alloc+0xd2/0x425
[225308.277222]  ? __handle_mm_fault+0x4c1/0x8ae
[225308.277224]  rtnetlink_rcv_msg+0x166/0x173
[225308.277226]  ? __kmalloc_node_track_caller+0x11f/0x12f
[225308.277227]  ? __alloc_skb+0x89/0x175
[225308.277228]  ? rtnl_newlink+0x6d6/0x6d6
[225308.277230]  netlink_rcv_skb+0x57/0xa0
[225308.277231]  rtnetlink_rcv+0x1e/0x25
[225308.277233]  netlink_unicast+0x103/0x187
[225308.277235]  netlink_sendmsg+0x28d/0x2ad
[225308.277236]  sock_sendmsg_nosec+0x12/0x1d
[225308.277237]  ___sys_sendmsg+0x19d/0x217
[225308.277238]  ? kmem_cache_free+0x4b/0xf3
[225308.277240]  ? alloc_pages_vma+0x147/0x17a
[225308.277241]  ? __page_set_anon_rmap+0x24/0x65
[225308.277242]  ? get_page+0x9/0xf
[225308.277243]  ? __lru_cache_add+0x18/0x47
[225308.277243]  ? __handle_mm_fault+0x4c1/0x8ae
[225308.277244]  __sys_sendmsg+0x40/0x5e
[225308.277245]  ? __sys_sendmsg+0x40/0x5e
[225308.277246]  SyS_sendmsg+0xd/0x17
[225308.277248]  entry_SYSCALL_64_fastpath+0x13/0x94
[225308.277249] RIP: 0033:0x7f1f19779620
[225308.277250] RSP: 002b:00007ffc4845a328 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[225308.277251] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1f19779620
[225308.277251] RDX: 0000000000000000 RSI: 00007ffc4845a370 RDI: 0000000000000003
[225308.277252] RBP: 00007ffc4845a370 R08: 0000000000000001 R09: fefefeff77686d74
[225308.277252] R10: 00000000000005e6 R11: 0000000000000246 R12: 00007ffc4845a3b0
[225308.277252] R13: 00000000006724c0 R14: 00007ffc48462430 R15: 0000000000000000
[225308.277253] Code: 00 00 48 8b 7b 10 e8 41 f2 ff ff 48 c7 43 08 00 00 00 00 5b 5d c3 55 48 89 e5 41 54 53 48 83 7f 20 00 48 89 fb 4c 8b 67 10 74 02 <0f> ff 0f b7 7b 44 48 6b ff 18 e8 65 f5 ff ff 48 85 c0 48 89 43
[225308.277271] ---[ end trace a9810da52af61a6a ]---
[225308.279042] genirq: Flags mismatch irq 64. 00000000 (i40e-enp2s0f0-TxRx-0) vs. 00000000 (i40e-enp2s0f0-TxRx-0)
[225308.279044] i40e 0000:02:00.0: MSIX request_irq failed, error: -16
[225308.279045] i40e 0000:02:00.0: request_irq failed, Error -16
Comment 35 Pawel Staszewski 2017-10-26 15:21:22 UTC
So, as an update:
Latest working kernel: 4.11.12
Kernel 4.12.14 - PF reset bug
Comment 36 Pawel Staszewski 2017-10-26 20:51:38 UTC
Update here.

OK, now with 6.01 firmware and kernel 4.11.12 I have the same problem as on all other kernels.
The only change is the new firmware.


[ 4533.442365] ------------[ cut here ]------------
[ 4533.442372] WARNING: CPU: 9 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0xd4/0x12f
[ 4533.442374] NETDEV WATCHDOG: enp2s0f1 (i40e): transmit queue 2 timed out
[ 4533.442374] Modules linked in: team_mode_roundrobin team ipmi_si x86_pkg_temp_thermal
[ 4533.442379] CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W       4.11.12 #1
[ 4533.442381] Call Trace:
[ 4533.442382]  <IRQ>
[ 4533.442386]  dump_stack+0x4d/0x63
[ 4533.442389]  __warn+0xbd/0xd8
[ 4533.442391]  ? netif_tx_lock+0x79/0x79
[ 4533.442393]  warn_slowpath_fmt+0x46/0x4e
[ 4533.442397]  ? _raw_spin_lock+0x9/0xb
[ 4533.442398]  ? netif_tx_lock+0x4b/0x79
[ 4533.442399]  dev_watchdog+0xd4/0x12f
[ 4533.442401]  call_timer_fn+0x5b/0x139
[ 4533.442402]  run_timer_softirq+0x137/0x15a
[ 4533.442403]  ? tk_clock_read+0xc/0xe
[ 4533.442404]  ? timekeeping_get_ns+0x1d/0x31
[ 4533.442405]  ? ktime_get+0x3c/0x4d
[ 4533.442408]  __do_softirq+0xe4/0x259
[ 4533.442411]  ? tick_program_event+0x5d/0x64
[ 4533.442413]  irq_exit+0x4d/0x5b
[ 4533.442415]  smp_apic_timer_interrupt+0x29/0x34
[ 4533.442417]  apic_timer_interrupt+0x86/0x90
[ 4533.442419] RIP: 0010:default_idle+0x17/0x28
[ 4533.442420] RSP: 0018:ffffc90003217ec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[ 4533.442421] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88085e458f00
[ 4533.442422] RDX: 0000000015dac3d5 RSI: 0000000000000009 RDI: 0000000000000001
[ 4533.442423] RBP: ffffc90003217ec0 R08: 0000000000000000 R09: 0000000000000000
[ 4533.442424] R10: ffffc90003387e00 R11: ffff88085e458450 R12: ffff88085ab36200
[ 4533.442425] R13: 0000000000000000 R14: ffff88085ab36200 R15: ffff88085ab36200
[ 4533.442425]  </IRQ>
[ 4533.442429]  arch_cpu_idle+0xa/0xc
[ 4533.442431]  default_idle_call+0x15/0x17
[ 4533.442433]  do_idle+0xb0/0x185
[ 4533.442434]  cpu_startup_entry+0x1a/0x1c
[ 4533.442436]  start_secondary+0xd0/0xd3
[ 4533.442439]  start_cpu+0x14/0x14
[ 4533.442441] ---[ end trace 9411a47607befc55 ]---
[ 4533.442447] i40e 0000:02:00.1 enp2s0f1: tx_timeout: VSI_seid: 398, Q 2, NTC: 0x9b, HWB: 0x9b, NTU: 0x78, TAIL: 0x9b, INT: 0x1
[ 4533.442448] i40e 0000:02:00.1 enp2s0f1: tx_timeout recovery level 1, hung_queue 2
[ 4533.846398] i40e 0000:02:00.1: PF reset failed, -15
Comment 37 Pawel Staszewski 2017-10-27 17:58:15 UTC
Also
with 6.01
[    3.263620] i40e 0000:02:00.1: MAC address: 0c:c4:7a:bc:40:f5
[    3.288589] i40e 0000:02:00.1: PCI-Express: Speed 8.0GT/s Width x8
[    3.299881] i40e 0000:02:00.1: Features: PF-id[1] VSIs: 34 QP: 12 RSS FD_ATR FD_SB NTUPLE VxLAN Geneve PTP VEPA
[    3.321099] i40e 0000:02:00.2: fw 6.0.48442 api 1.7 nvm 6.01 0x80003484 1.1747.0
[    3.329262] i40e 0000:02:00.2: The driver for the device detected a newer version of the NVM image than expected. Please install the most recent version of the network driver.
[    3.569521] i40e 0000:02:00.2: MAC address: 0c:c4:7a:bc:40:f6
[    3.594613] i40e 0000:02:00.2: PCI-Express: Speed 8.0GT/s Width x8
[    3.605942] i40e 0000:02:00.2: Features: PF-id[2] VSIs: 34 QP: 12 RSS FD_ATR FD_SB NTUPLE VxLAN Geneve PTP VEPA
[    3.627105] i40e 0000:02:00.3: fw 6.0.48442 api 1.7 nvm 6.01 0x80003484 1.1747.0
[    3.635221] i40e 0000:02:00.3: The driver for the device detected a newer version of the NVM image than expected. Please install the most recent version of the network driver.
[    3.875453] i40e 0000:02:00.3: MAC address: 0c:c4:7a:bc:40:f7
[    3.901005] i40e 0000:02:00.3: PCI-Express: Speed 8.0GT/s Width x8
[    3.912273] i40e 0000:02:00.3: Features: PF-id[3] VSIs: 34 QP: 12 RSS FD_ATR FD_SB NTUPLE VxLAN Geneve PTP VEPA
[    3.932347] i40e 0000:03:00.0: fw 6.0.48442 api 1.7 nvm 6.01 0x80003484 1.1747.0
[    3.940604] i40e 0000:03:00.0: The driver for the device detected a newer version of the NVM image than expected. Please install the most recent version of the network driver.
[    4.180449] i40e 0000:03:00.0: MAC address: 0c:c4:7a:8e:b5:34
Comment 38 Pawel Staszewski 2017-10-27 18:37:52 UTC
With a kernel built from davem's net git tree:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/



[ 2184.928191] NETDEV WATCHDOG: enp2s0f1 (i40e): transmit queue 5 timed out
[ 2184.928207] ------------[ cut here ]------------
[ 2184.928212] WARNING: CPU: 2 PID: 19 at net/sched/sch_generic.c:320 dev_watchdog+0xc5/0x122
[ 2184.928212] Modules linked in: bonding x86_pkg_temp_thermal ipmi_si
[ 2184.928217] CPU: 2 PID: 19 Comm: ksoftirqd/2 Not tainted 4.14.0-rc5 #5
[ 2184.928219] task: ffff88085abb1a00 task.stack: ffffc9000326c000
[ 2184.928220] RIP: 0010:dev_watchdog+0xc5/0x122
[ 2184.928221] RSP: 0018:ffffc9000326fd90 EFLAGS: 00010286
[ 2184.928222] RAX: 000000000000003c RBX: ffff8808598f5000 RCX: 0000000000000000
[ 2184.928223] RDX: ffff88085e293501 RSI: ffff88085e28cab8 RDI: ffff88085e28cab8
[ 2184.928224] RBP: ffffc9000326fda0 R08: 002aca35a1b646e0 R09: ffff88087f013c8c
[ 2184.928224] R10: ffffc9000326fe38 R11: 000000000000005c R12: 0000000000000005
[ 2184.928225] R13: ffffffff815f3840 R14: ffff8808598f5438 R15: ffff8808598f5000
[ 2184.928226] FS:  0000000000000000(0000) GS:ffff88085e280000(0000) knlGS:0000000000000000
[ 2184.928227] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2184.928228] CR2: 000000000223b000 CR3: 0000000001c09002 CR4: 00000000001606e0
[ 2184.928229] Call Trace:
[ 2184.928233]  call_timer_fn+0x56/0x119
[ 2184.928235]  run_timer_softirq+0x136/0x15b
[ 2184.928238]  ? update_next_balance+0x1a/0x2d
[ 2184.928239]  ? pick_next_task_fair+0x1c0/0x30e
[ 2184.928243]  __do_softirq+0xe6/0x23a
[ 2184.928245]  ? sort_range+0x1d/0x1d
[ 2184.928247]  run_ksoftirqd+0x15/0x2a
[ 2184.928248]  smpboot_thread_fn+0x126/0x13d
[ 2184.928250]  kthread+0xf6/0xfb
[ 2184.928252]  ? __init_completion+0x24/0x24
[ 2184.928254]  ret_from_fork+0x22/0x30
[ 2184.928255] Code: 56 50 6f 00 00 75 38 48 89 df c6 05 4a 50 6f 00 01 e8 fe 42 fe ff 44 89 e1 48 89 de 48 c7 c7 e0 ce ac 81 48 89 c2 e8 26 d4 a7 ff <0f> ff eb 10 41 ff c4 48 05 40 01 00 00 41 39 f4 75 9a eb 0d 48
[ 2184.928279] ---[ end trace 19de655d7b9e0810 ]---
[ 2184.928284] i40e 0000:02:00.1 enp2s0f1: tx_timeout: VSI_seid: 398, Q 5, NTC: 0x10d, HWB: 0x10d, NTU: 0xf3, TAIL: 0x10d, INT: 0x1
[ 2184.928285] i40e 0000:02:00.1 enp2s0f1: tx_timeout recovery level 1, hung_queue 5
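
For anyone comparing several of these hangs, a minimal Python sketch that pulls the ring indices out of the `tx_timeout:` line above may help. The regular expression is written against the line format shown in this report only; the function name and the dictionary keys are hypothetical, and reading NTC/HWB/NTU/TAIL as next-to-clean, head write-back, next-to-use and tail is an assumption based on the abbreviations, not something confirmed in this thread.

```python
import re

# Matches the i40e "tx_timeout:" diagnostic line as printed in this report.
# Assumption: other driver versions may format this line differently.
TX_TIMEOUT_RE = re.compile(
    r"(?P<iface>\S+): tx_timeout: VSI_seid: (?P<vsi>\d+), Q (?P<queue>\d+), "
    r"NTC: (?P<ntc>0x[0-9a-f]+), HWB: (?P<hwb>0x[0-9a-f]+), "
    r"NTU: (?P<ntu>0x[0-9a-f]+), TAIL: (?P<tail>0x[0-9a-f]+), "
    r"INT: (?P<intr>0x[0-9a-f]+)"
)


def parse_tx_timeout(line):
    """Return the fields of one tx_timeout line as a dict, or None if no match."""
    m = TX_TIMEOUT_RE.search(line)
    if m is None:
        return None
    return {
        "iface": m.group("iface"),
        "vsi_seid": int(m.group("vsi")),
        "queue": int(m.group("queue")),
        # The ring indices are printed in hex; convert them to integers.
        "ntc": int(m.group("ntc"), 16),
        "hwb": int(m.group("hwb"), 16),
        "ntu": int(m.group("ntu"), 16),
        "tail": int(m.group("tail"), 16),
        "int": int(m.group("intr"), 16),
    }
```

Feeding it the line from this comment gives queue 5 on enp2s0f1 with NTC, HWB and TAIL all equal to 0x10d and NTU equal to 0xf3.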
Comment 39 Pawel Staszewski 2017-10-30 08:12:34 UTC
Problem solved.
You can close this bug.

Maybe somebody will reopen it after some time, or it will help somebody who is stuck with this problem.
Comment 40 Alexander Duyck 2017-10-30 15:01:53 UTC
(In reply to Pawel Staszewski from comment #33)
> Another try to get down/up interface
> [225184.634024] i40e 0000:02:00.0: update vlan stripping failed, err
> I40E_ERR_QUEUE_EMPTY aq_err OK
> [225184.834987] i40e 0000:02:00.0: Failed to clear LAN Tx queue context on
> Tx ring 0 (pf_q 0), error: -19
> [225185.716274] i40e 0000:02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters
> on PF, promiscuous mode forced on

My understanding is we have a reproduction for this issue and plan to work on getting to root cause and hopefully a resolution for the next release.
Comment 41 Pawel Staszewski 2017-11-02 15:17:14 UTC
What can I say: just turn off TSO and everything works with the i40e driver / XL710 cards.
CPU load is almost 30% higher, but there are no problems with card restarts or Tx hangs.
Comment 42 Todd Fujinaka 2017-11-02 15:18:30 UTC
Created attachment 260465 [details]
attachment-20971-0.html

I am on vacation 10/30 - 11/15.

For urgent technical assistance, please contact Aaron Rowden (aaron.f.rowden@intel.com) or
Andy Harazin (andrew.t.harazin@intel.com).
Comment 43 Pawel Staszewski 2017-11-27 17:56:23 UTC
Wow, so much helpful information here...
just vacation info... but OK, this is normal.


If you want to know more: all these bugs (hangs, Tx restarts, kernel panics) are related to GRO being turned on with the i40e driver.

Current settings:
ethtool -k enp3s0f2
Features for enp3s0f2:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: on
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: on
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: on
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp-mangleid-segmentation: on
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on



After turning on GRO, a day later there is a Tx transmit queue timeout and link down on the NIC.
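
When checking many ports, it can be easier to read the `ethtool -k` listing programmatically than by eye. The following is a rough Python sketch, assuming the output format shown above (`name: on|off`, optionally followed by `[fixed]`); the function name is hypothetical and nothing here is specific to i40e.

```python
def parse_ethtool_features(text):
    """Parse `ethtool -k` output into {feature: (enabled, fixed)}.

    Assumes the layout shown in this report: one "name: on|off [fixed]"
    entry per line, with sub-features indented under their parent.
    """
    features = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the "Features for <iface>:" header.
        if not line or line.startswith("Features for"):
            continue
        name, sep, value = line.partition(":")
        words = value.split()
        # Ignore anything that is not a "name: on|off" entry.
        if not sep or not words or words[0] not in ("on", "off"):
            continue
        features[name.strip()] = (words[0] == "on", "[fixed]" in words)
    return features
```

With the listing above, `parse_ethtool_features(output)["generic-receive-offload"]` would be `(False, False)` and the `tcp-segmentation-offload` entry would be `(True, False)`.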
Comment 44 Alexander Duyck 2017-11-27 18:22:51 UTC
Try disabling TSO on the interfaces and leaving GRO on. My concern is there may be something specific to the TCP segmentation offload on the i40e interfaces, and we can verify if that is the case by disabling TSO.
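
As a small illustration of the suggested test configuration, the Python sketch below only builds the `ethtool -K` command line for "TSO off, GRO on"; it does not run anything. The helper name is hypothetical and the interface name is just an example taken from this report. `tso` and `gro` are the short feature names that `ethtool -K` accepts, and applying the command normally requires root.

```python
def ethtool_offload_cmd(iface, **features):
    """Build an `ethtool -K` argv list from feature=True/False keyword arguments."""
    argv = ["ethtool", "-K", iface]
    for name, enabled in features.items():
        argv += [name, "on" if enabled else "off"]
    return argv


# Example: disable TSO and leave GRO enabled on one port.
cmd = ethtool_offload_cmd("enp2s0f0", tso=False, gro=True)
print(" ".join(cmd))  # prints: ethtool -K enp2s0f0 tso off gro on
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on the test machine; building it separately makes it easy to log exactly which settings were applied to which port.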
Comment 45 Alexander Duyck 2017-11-27 18:27:27 UTC
If you are willing to test a debug patch we could probably isolate things further. What we would need to determine is what is the format of the packets on the ring that are causing the Tx hang.

Last comment we had from you was that you weren't interested in doing that though so we are just doing our best with the limited information we have in the meantime as we have yet to get a full reproduction with the Tx hang you are seeing.
Comment 46 Pawel Staszewski 2017-11-29 17:09:09 UTC
OK, so I turned off TSO and enabled GRO on one port of a 4-port X710 card.
Now I will wait 24 hours or more to see whether it hangs.

About the comment: yes, but your comment #40 says that you have a reproduction and are working on a solution... so
Comment 47 Alexander Duyck 2017-11-29 18:29:39 UTC
So specifically we have a reproduction/resolution for the I40E_AQ_RC_ENOSPC issue and it is in testing. So we should be able to pass traffic without any issues when that issue is seen as we have a correction for some of the filtering behavior.

As I said this is one of the issues when you have 2 or 3 different issues all packed into one bugzilla. It becomes unclear what is being worked on.

If we verify this is only seen with TSO enabled we would want to identify the type of packet that is causing the issue. Would you be open to testing a special "debug" patch that would dump the descriptor ring and packet for a ring that is hung? If so I would just need to know what kernel you were planning to test on and I could provide you with a patch for it.
Comment 48 Pawel Staszewski 2017-11-29 19:08:32 UTC
Yes, if you have a debug patch I can apply it.
Current kernel is: 4.14.0-rc5

I can go to stable 4.14 if you want, or to davem's net-next tree.
Comment 49 Alexander Duyck 2017-11-30 20:32:50 UTC
Created attachment 260965 [details]
Patch to add dump Tx Buffer Info and Descriptor for hung ring

The attached patch should add some functionality that will allow for dumping any rings that still have packets outstanding at the time a reset is requested. I tried to make it so that we only dump the data that hasn't been completed as the rest will likely just be noise.

You should be able to apply the patch, enable TSO, and when you see the NETDEV WATCHDOG error displayed it should include some additional data in the dmesg that will dump the layout of any work on the Tx rings that hasn't been completed.
Comment 50 Pawel Staszewski 2017-12-02 15:12:24 UTC
About what I did in comment #46:
I turned off TSO and enabled GRO, then waited. Today I have the result.

Features for enp2s0f0:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: on
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: on
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: on
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off
        tx-tcp-ecn-segmentation: off
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on


As you can see, GRO is turned on on port enp2s0f0.

But today another port, one without GRO turned on, went down:
[3081717.211287] NETDEV WATCHDOG: enp3s0f0 (i40e): transmit queue 0 timed out
[3081717.211302] ------------[ cut here ]------------
[3081717.211307] WARNING: CPU: 0 PID: 7 at net/sched/sch_generic.c:320 dev_watchdog+0xc5/0x122
[3081717.211307] Modules linked in: bonding x86_pkg_temp_thermal ipmi_si
[3081717.211312] CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 4.14.0-rc5 #5
[3081717.211314] task: ffff88085aafce00 task.stack: ffffc900031b4000
[3081717.211315] RIP: 0010:dev_watchdog+0xc5/0x122
[3081717.211316] RSP: 0018:ffffc900031b7d90 EFLAGS: 00010286
[3081717.211317] RAX: 000000000000003c RBX: ffff88085858b000 RCX: 0000000000000000
[3081717.211318] RDX: ffff88085e213501 RSI: ffff88085e20cab8 RDI: ffff88085e20cab8
[3081717.211319] RBP: ffffc900031b7da0 R08: 00514e85479c2bea R09: ffff88087f013c8c
[3081717.211320] R10: ffff8805587c4d00 R11: 000000000000005c R12: 0000000000000000
[3081717.211321] R13: ffffffff815f3840 R14: ffff88085858b438 R15: ffff88085858b000
[3081717.211322] FS:  0000000000000000(0000) GS:ffff88085e200000(0000) knlGS:0000000000000000
[3081717.211323] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3081717.211323] CR2: 00007f40f14bc030 CR3: 0000000001c09001 CR4: 00000000001606f0
[3081717.211324] Call Trace:
[3081717.211328]  call_timer_fn+0x56/0x119
[3081717.211330]  run_timer_softirq+0x136/0x15b
[3081717.211332]  ? update_next_balance+0x1a/0x2d
[3081717.211334]  ? pick_next_task_fair+0x1c0/0x30e
[3081717.211338]  __do_softirq+0xe6/0x23a
[3081717.211340]  ? sort_range+0x1d/0x1d
[3081717.211343]  run_ksoftirqd+0x15/0x2a
[3081717.211344]  smpboot_thread_fn+0x126/0x13d
[3081717.211346]  kthread+0xf6/0xfb
[3081717.211348]  ? __init_completion+0x24/0x24
[3081717.211350]  ret_from_fork+0x22/0x30
[3081717.211351] Code: 56 50 6f 00 00 75 38 48 89 df c6 05 4a 50 6f 00 01 e8 fe 42 fe ff 44 89 e1 48 89 de 48 c7 c7 e0 ce ac 81 48 89 c2 e8 26 d4 a7 ff <0f> ff eb 10 41 ff c4 48 05 40 01 00 00 41 39 f4 75 9a eb 0d 48
[3081717.211374] ---[ end trace 43105fa53be459a5 ]---
[3081717.211380] i40e 0000:03:00.0 enp3s0f0: tx_timeout: VSI_seid: 399, Q 0, NTC: 0x13f, HWB: 0x13f, NTU: 0x128, TAIL: 0x13f, INT: 0x1
[3081717.211381] i40e 0000:03:00.0 enp3s0f0: tx_timeout recovery level 1, hung_queue 0
[3081717.335148] bond0: link status definitely down for interface enp3s0f0, disabling it
[3081717.439141] bond0: link status up again after 0 ms for interface enp2s0f0
[3081717.439143] bond0: link status up again after 0 ms for interface enp2s0f1
[3081717.439143] bond0: link status up again after 0 ms for interface enp2s0f2
[3081717.439144] bond0: link status up again after 0 ms for interface enp2s0f3
[3081717.439145] bond0: link status up again after 0 ms for interface enp3s0f1
[3081717.439146] bond0: link status up again after 0 ms for interface enp3s0f2
[3081717.439146] bond0: link status up again after 0 ms for interface enp3s0f3
[3081717.635189] i40e 0000:03:00.0: PF reset failed, -15


So it looks like it does not matter whether one port or all ports have GRO turned on: this will happen.

So in summary there are now 8 ports:
One (enp2s0f0) with GRO turned on and TSO turned off.

The rest with only TSO turned on and GRO turned off.

Previously everything worked for almost a month without any problem.
The problems started after GRO was enabled, about 48 hours later.
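
To double-check which offloads actually differ between two ports (or between two points in time on the same port), two `ethtool -k` listings can be compared directly. This is only a sketch in Python under the assumption that both listings use the `name: on|off` layout shown in this report; the function names are hypothetical.

```python
def feature_states(text):
    """Map each feature name in an `ethtool -k` listing to "on" or "off"."""
    states = {}
    for raw in text.splitlines():
        name, sep, value = raw.strip().partition(":")
        words = value.split()
        # Keep only "name: on|off" entries; headers and blanks are skipped.
        if sep and words and words[0] in ("on", "off"):
            states[name] = words[0]
    return states


def diff_features(before, after):
    """Return {feature: (state_before, state_after)} for features that changed."""
    a, b = feature_states(before), feature_states(after)
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}
```

Run against the two listings posted in this bug, it should report only the TCP segmentation entries and `generic-receive-offload` as changed, which is a quick way to confirm that nothing else was toggled by accident.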
Comment 51 Alexander Duyck 2017-12-03 20:08:46 UTC
Okay, so if you can apply the debug patch and run it with GRO and TSO enabled we can try capturing a descriptor ring dump for the hung Tx rings and see if there is a specific pattern we are seeing to the hang after capturing a couple of dumps.
Comment 52 Oleg Tarassov 2018-07-02 16:17:48 UTC
Hello -

We have been getting hit with this issue since the moment we provisioned the servers with LACP (802.3ad) bonds.
A bit of background: 2 servers acting as Hadoop master nodes have this issue at least once a week. We are using flannel to relay IP from container to containers on separate hosts.

I am wondering if there is a solution for this apart from disabling GRO with enabled TSO?

Logs:
```
Jul 2 08:15:35 master-node-02 kernel: i40e 0000:08:00.1 ens1f1: tx_timeout: VSI_seid: 386, Q 40, NTC: 0x30, HWB: 0x30, NTU: 0x1a, TAIL: 0x30, INT: 0x1
Jul 2 08:15:35 master-node-02 kernel: i40e 0000:08:00.1 ens1f1: tx_timeout recovery level 1, hung_queue 40
Jul 2 08:15:36 master-node-02 kernel: i40e 0000:08:00.1 ens1f1: vxlan port 8472 already offloaded
Jul 2 08:15:36 master-node-02 kernel: i40e 0000:08:00.1 ens1f1: speed changed to 0 for port ens1f1

[Jul 1 23:10] -----------[ cut here ]-----------
[ +0.000024] WARNING: at net/sched/sch_generic.c:297 dev_watchdog+0x276/0x280()
[ +0.000005] NETDEV WATCHDOG: ens1f0 (i40e): transmit queue 7 timed out
[ +0.000004] Modules linked in: nf_conntrack_netlink nfnetlink ipt_REJECT nf_reject_ipv4 xt_REDIRECT nf_nat_redirect vxlan ip6_udp_tunnel udp_tunnel xt_statistic xt_recent veth xt_comment xt_mark xt_nat ipt_MASQUERADE nf_nat_masquerade_i
[ +0.000074] wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgbl
[ +0.000022] CPU: 40 PID: 0 Comm: swapper/40 Tainted: G B ------------ 3.10.0-514.21.1.el7.x86_64 #1
[ +0.000003] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 01/22/2018
[ +0.000001] ffff882fbfc83d88 e16efe80e7f6bbb1 ffff882fbfc83d40 ffffffff81686f13
[ +0.000003] ffff882fbfc83d78 ffffffff81085cb0 0000000000000007 ffff882fb8b53000
[ +0.000019] ffff882fb875cf40 0000000000000040 0000000000000028 ffff882fbfc83de0
[ +0.000004] Call Trace:
[ +0.000002] <IRQ> [<ffffffff81686f13>] dump_stack+0x19/0x1b
[ +0.000028] [<ffffffff81085cb0>] warn_slowpath_common+0x70/0xb0
[ +0.000003] [<ffffffff81085d4c>] warn_slowpath_fmt+0x5c/0x80
[ +0.000004] [<ffffffff81597076>] dev_watchdog+0x276/0x280
[ +0.000003] [<ffffffff81596e00>] ? dev_graft_qdisc+0x80/0x80
[ +0.000005] [<ffffffff81095eb6>] call_timer_fn+0x36/0x110
[ +0.000002] [<ffffffff81596e00>] ? dev_graft_qdisc+0x80/0x80
[ +0.000004] [<ffffffff81098ba7>] run_timer_softirq+0x237/0x340
[ +0.000006] [<ffffffff8108f63f>] __do_softirq+0xef/0x280
[ +0.000006] [<ffffffff8169905c>] call_softirq+0x1c/0x30
[ +0.000004] [<ffffffff8102d365>] do_softirq+0x65/0xa0
[ +0.000003] [<ffffffff8108f9d5>] irq_exit+0x115/0x120
[ +0.000003] [<ffffffff81699cd5>] smp_apic_timer_interrupt+0x45/0x60
[ +0.000002] [<ffffffff8169821d>] apic_timer_interrupt+0x6d/0x80
[ +0.000001] <EOI> [<ffffffff81514c32>] ? cpuidle_enter_state+0x52/0xc0
[ +0.000007] [<ffffffff81514d79>] cpuidle_idle_call+0xd9/0x210
[ +0.000005] [<ffffffff810350ee>] arch_cpu_idle+0xe/0x30
[ +0.000007] [<ffffffff810e82f5>] cpu_startup_entry+0x245/0x290
[ +0.000006] [<ffffffff8104f09a>] start_secondary+0x1ba/0x230
[ +0.000002] --[ end trace de976ac7c5d6d0fd ]--
[ +0.000020] i40e 0000:08:00.0 ens1f0: tx_timeout: VSI_seid: 387, Q 7, NTC: 0x4d, HWB: 0x4d, NTU: 0x37, TAIL: 0x4d, INT: 0x1
[ +0.000003] i40e 0000:08:00.0 ens1f0: tx_timeout recovery level 1, hung_queue 7
[ +0.179134] i40e 0000:08:00.0 ens1f0: vxlan port 8472 already offloaded
[ +0.001595] i40e 0000:08:00.0 ens1f0: speed changed to 0 for port ens1f0
[ +0.001027] bond0: link status up again after 0 ms for interface ens1f0
[Jul 2 08:15] i40e 0000:08:00.1 ens1f1: tx_timeout: VSI_seid: 386, Q 40, NTC: 0x30, HWB: 0x30, NTU: 0x1a, TAIL: 0x30, INT: 0x1
[ +0.000013] i40e 0000:08:00.1 ens1f1: tx_timeout recovery level 1, hung_queue 40
[ +0.202427] i40e 0000:08:00.1 ens1f1: vxlan port 8472 already offloaded
[ +0.001631] i40e 0000:08:00.1 ens1f1: speed changed to 0 for port ens1f1
[ +0.001167] bond0: link status up again after 0 ms for interface ens1f1

[root@master-node-02 ~]# ethtool -i ens1f0
driver: i40e
version: 1.5.10-k
firmware-version: 6.00 0x8000366c 1.1825.0
expansion-rom-version:
bus-info: 0000:08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

[root@master-node-02 ~]# ethtool -i ens1f1
driver: i40e
version: 1.5.10-k
firmware-version: 6.00 0x8000366c 1.1825.0
expansion-rom-version:
bus-info: 0000:08:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
```
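
To put a number on "at least once a week", the hangs can be counted per interface and queue from a saved kernel log. The Python sketch below keys on the `tx_timeout recovery level` line rather than on the NETDEV WATCHDOG warning, on the assumption (based on the second timeout in the log above having no accompanying stack trace) that the warning is not printed for every event. The function name is hypothetical, and the log should come from a single source so the same event is not counted twice.

```python
import re
from collections import Counter

# Matches the i40e recovery line as printed in this report, e.g.
# "ens1f1: tx_timeout recovery level 1, hung_queue 40".
RECOVERY_RE = re.compile(
    r"(?P<iface>\S+): tx_timeout recovery level (?P<level>\d+), "
    r"hung_queue (?P<queue>\d+)"
)


def count_tx_hangs(log_text):
    """Count tx_timeout recovery events per (interface, queue)."""
    counts = Counter()
    for m in RECOVERY_RE.finditer(log_text):
        counts[(m.group("iface"), int(m.group("queue")))] += 1
    return counts
```

Calling `count_tx_hangs(open("dmesg.txt").read())` on a saved `dmesg` dump (the file name is only an example) gives a `Counter` whose `most_common()` output shows which ports and queues hang most often.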


Thank you


Oleg
Comment 53 Melroy 2020-11-27 01:32:44 UTC
I have the following network controller:  Marvell Technology Group Ltd. 88E8056 PCI-E Gigabit Ethernet Controller (rev 12)

Kernel version: 5.9.1

Network driver: sky2


It has been causing a lot of network issues lately whenever there is some network usage. dmesg log:


[   15.839448] sky2 0000:04:00.0 enp4s0: enabling interface
[   15.846933] sky2 0000:06:00.0 enp6s0: enabling interface
[   18.539088] sky2 0000:06:00.0 enp6s0: Link is up at 1000 Mbps, full duplex, flow control both
[   18.539106] IPv6: ADDRCONF(NETDEV_CHANGE): enp6s0: link becomes ready
[   18.545715] sky2 0000:06:00.0 enp6s0: checksum offload not possible with jumbo frames
[   23.013302] kauditd_printk_skb: 33 callbacks suppressed
[   23.013306] audit: type=1400 audit(1606408640.744:45): apparmor="STATUS" operation="profile_load" profile="unconfined" name="docker-default" pid=2215 comm="apparmor_parser"
[   23.990101] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[   23.993529] Bridge firewalling registered
[   24.002265] bpfilter: Loaded bpfilter_umh pid 2238
[   24.002586] Started bpfilter
[   24.074691] sky2 0000:06:00.0 enp6s0: checksum offload not possible with jumbo frames
[   24.084637] Initializing XFRM netlink socket
[   24.561733] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
[   24.734202] eth0: renamed from vetha5832a9
[  792.503168] perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 1095.385458] perf: interrupt took too long (3132 > 3127), lowering kernel.perf_event_max_sample_rate to 63000
[ 1598.989500] perf: interrupt took too long (4031 > 3915), lowering kernel.perf_event_max_sample_rate to 49000
[ 2448.406508] audit: type=1400 audit(1606411065.196:46): apparmor="DENIED" operation="open" profile="/usr/bin/evince-thumbnailer" name="/tmp/tumbler-XTVUMU0.png" pid=19857 comm="evince-thumbnai" requested_mask="wc" denied_mask="wc" fsuid=1000 ouid=1000
[ 2621.705271] perf: interrupt took too long (5382 > 5038), lowering kernel.perf_event_max_sample_rate to 37000
[ 6853.677486] perf: interrupt took too long (6893 > 6727), lowering kernel.perf_event_max_sample_rate to 29000
[12023.336263] audit: type=1400 audit(1606420640.398:47): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/snap/snapd/9721/usr/lib/snapd/snap-confine" pid=34337 comm="apparmor_parser"
[12023.336275] audit: type=1400 audit(1606420640.398:48): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/snap/snapd/9721/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=34337 comm="apparmor_parser"
[12023.342319] audit: type=1400 audit(1606420640.404:49): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.snap-store.snap-store" pid=34341 comm="apparmor_parser"
[12023.342844] audit: type=1400 audit(1606420640.404:50): apparmor="STATUS" operation="profile_load" profile="unconfined" name="snap.snap-store.hook.configure" pid=34340 comm="apparmor_parser"
[12023.343153] audit: type=1400 audit(1606420640.404:51): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.snap-store.ubuntu-software" pid=34342 comm="apparmor_parser"
[12023.343184] audit: type=1400 audit(1606420640.404:52): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.snap-store.ubuntu-software-local-file" pid=34343 comm="apparmor_parser"
[12024.086223] audit: type=1400 audit(1606420641.148:53): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap-update-ns.snap-store" pid=34339 comm="apparmor_parser"
[22988.439429] audit: type=1400 audit(1606431605.811:54): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=46950 comm="cups-browsed" capability=23  capname="sys_nice"
[28304.283736] sky2 0000:06:00.0: error interrupt status=0x8
[28316.481864] ------------[ cut here ]------------
[28316.481868] NETDEV WATCHDOG: enp6s0 (sky2): transmit queue 0 timed out
[28316.481886] WARNING: CPU: 7 PID: 63 at net/sched/sch_generic.c:442 dev_watchdog+0x2ab/0x2c0
[28316.481894] Modules linked in: xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge stp llc overlay joydev input_leds intel_powerclamp coretemp kvm_intel kvm intel_cstate serio_raw snd_hda_codec_analog snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi asus_atk0110 snd_seq_midi snd_hda_intel snd_seq_midi_event snd_intel_dspcfg mac_hid snd_hda_codec snd_rawmidi snd_hda_core snd_hwdep snd_seq snd_pcm snd_seq_device snd_timer snd soundcore i5500_temp i7core_edac sch_fq_codel binfmt_misc parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c dm_mirror dm_region_hash dm_log amdgpu hid_generic iommu_v2 gpu_sched i2c_algo_bit ttm usbhid psmouse hid drm_kms_helper syscopyarea sysfillrect i2c_i801 sysimgblt firewire_ohci i2c_smbus fb_sys_fops ahci firewire_core cec crc_itu_t
[28316.481951]  mxm_wmi pata_jmicron libahci pata_acpi lpc_ich sky2 drm wmi
[28316.481957] CPU: 7 PID: 63 Comm: ksoftirqd/7 Tainted: G          I       5.9.1-rt20melvalds-1.3 #1
[28316.481959] Hardware name: System manufacturer System Product Name/Rampage II Extreme, BIOS 2101    09/19/2011
[28316.481960] RIP: 0010:dev_watchdog+0x2ab/0x2c0
[28316.481963] Code: ff e9 4a ff ff ff 4c 89 f7 c6 05 3d a7 fd 00 01 e8 3a b3 fa ff 44 89 e9 4c 89 f6 48 c7 c7 e0 e3 83 91 48 89 c2 e8 2a 9a 6f ff <0f> 0b e9 2c ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 66
[28316.481965] RSP: 0000:ffff9e16802a7d60 EFLAGS: 00210286
[28316.481967] RAX: 0000000000000000 RBX: ffff8f0f2f3db000 RCX: 0000000000000027
[28316.481968] RDX: 0000000000000001 RSI: 0000000000200203 RDI: ffff8f0f33bd8998
[28316.481969] RBP: ffff8f0f2f0ca420 R08: ffff8f0f33bd8990 R09: 0000000000000002
[28316.481970] R10: ffff9e16802a7c98 R11: ffffffff91a7ede8 R12: ffff8f0f2f0ca4f0
[28316.481971] R13: 0000000000000000 R14: ffff8f0f2f0ca000 R15: ffff8f0f2f3db080
[28316.481972] FS:  0000000000000000(0000) GS:ffff8f0f33bc0000(0000) knlGS:0000000000000000
[28316.481973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[28316.481974] CR2: 00007fb393fd2000 CR3: 00000001d9c2c000 CR4: 00000000000006e0
[28316.481976] Call Trace:
[28316.481980]  ? dev_reset_queue.constprop.0+0xb0/0xb0
[28316.481983]  call_timer_fn+0x2d/0x1c0
[28316.481987]  run_timer_softirq+0x4cf/0x670
[28316.481990]  ? dev_reset_queue.constprop.0+0xb0/0xb0
[28316.481992]  ? psi_group_change+0x41/0x220
[28316.481996]  __do_softirq+0xe0/0x336
[28316.482000]  ? smpboot_register_percpu_thread+0xe0/0xe0
[28316.482003]  run_ksoftirqd+0x2c/0x70
[28316.482006]  smpboot_thread_fn+0x131/0x290
[28316.482008]  kthread+0x12a/0x170
[28316.482012]  ? kthread_park+0x90/0x90
[28316.482014]  ret_from_fork+0x22/0x30
[28316.482019] ---[ end trace 0000000000000002 ]---
[28316.482022] sky2 0000:06:00.0 enp6s0: tx timeout
[28316.482027] sky2 0000:06:00.0 enp6s0: transmit ring 65 .. 90 report=65 done=65
[28316.486225] sky2 0000:06:00.0 enp6s0: checksum offload not possible with jumbo frames
[28319.594754] sky2 0000:06:00.0 enp6s0: Link is up at 1000 Mbps, full duplex, flow control both
[28320.144290] sky2 0000:06:00.0: error interrupt status=0x8
[28325.186632] sky2 0000:06:00.0 enp6s0: tx timeout
[28325.186643] sky2 0000:06:00.0 enp6s0: transmit ring 0 .. 3 report=0 done=0
[28325.190806] sky2 0000:06:00.0 enp6s0: checksum offload not possible with jumbo frames
[28328.465992] sky2 0000:06:00.0 enp6s0: Link is up at 1000 Mbps, full duplex, flow control both
[28334.401941] sky2 0000:06:00.0 enp6s0: tx timeout
[28334.401948] sky2 0000:06:00.0 enp6s0: transmit ring 0 .. 26 report=0 done=0
[28334.406113] sky2 0000:06:00.0 enp6s0: checksum offload not possible with jumbo frames
[28337.501158] sky2 0000:06:00.0 enp6s0: Link is up at 1000 Mbps, full duplex, flow control both
[28343.105134] sky2 0000:06:00.0 enp6s0: tx timeout
[28343.105142] sky2 0000:06:00.0 enp6s0: transmit ring 0 .. 25 report=0 done=0
[28343.109333] sky2 0000:06:00.0 enp6s0: checksum offload not possible with jumbo frames
[28346.204571] sky2 0000:06:00.0 enp6s0: Link is up at 1000 Mbps, full duplex, flow control both
[28351.297871] sky2 0000:06:00.0 enp6s0: tx timeout
[28351.297879] sky2 0000:06:00.0 enp6s0: transmit ring 0 .. 25 report=0 done=0
[28351.302045] sky2 0000:06:00.0 enp6s0: checksum offload not possible with jumbo frames
[28354.463863] sky2 0000:06:00.0 enp6s0: Link is up at 1000 Mbps, full duplex, flow control both
[28360.512709] sky2 0000:06:00.0 enp6s0: tx timeout
[28360.512717] sky2 0000:06:00.0 enp6s0: transmit ring 0 .. 26 report=0 done=0
[28360.516910] sky2 0000:06:00.0 enp6s0: checksum offload not possible with jumbo frames
[28363.650952] sky2 0000:06:00.0 enp6s0: Link is up at 1000 Mbps, full duplex, flow control both
[28374.337322] sky2 0000:06:00.0 enp6s0: tx timeout
[28374.337329] sky2 0000:06:00.0 enp6s0: transmit ring 0 .. 25 report=0 done=0
[28374.341541] sky2 0000:06:00.0 enp6s0: checksum offload not possible with jumbo frames
[28377.468310] sky2 0000:06:00.0 enp6s0: Link is up at 1000 Mbps, full duplex, flow control both
[28383.040062] sky2 0000:06:00.0 enp6s0: tx timeout
[28383.040073] sky2 0000:06:00.0 enp6s0: transmit ring 0 .. 25 report=0 done=0
[28383.044258] sky2 0000:06:00.0 enp6s0: checksum offload not possible with jumbo frames
[28386.128565] sky2 0000:06:00.0 enp6s0: Link is up at 1000 Mbps, full duplex, flow control both
[28391.231853] sky2 0000:06:00.0 enp6s0: tx timeout
Comment 54 Todd Fujinaka 2020-11-27 01:33:13 UTC
Created attachment 293837 [details]
attachment-17382-0.html

I am out until 12/14

For urgent technical assistance, please contact Ngai-Mint Kwan <ngai-mint.kwan@intel.com>
Comment 55 Alexander Duyck 2020-11-27 01:41:55 UTC
(In reply to Melroy from comment #53)
> I have the following network controller:  Marvell Technology Group Ltd.
> 88E8056 PCI-E Gigabit Ethernet Controller (rev 12)
> 
> Kernel version: 5.9.1
> 
> Network driver: sky2
> 
> 
> Causing a lot of network issues lately when ever there is some network
> usage. dmesg log:

This is completely unrelated to this issue. This issue was an i40e adapter that was generating Tx timeouts.

Please start a new bugzilla for this issue so it can be directed to the maintainers for the sky2 network adapter.
