Bug 197005

Summary: Kernel panic with the latest net-next
Product: Networking Reporter: Pawel Staszewski (pstaszewski)
Component: OtherAssignee: Stephen Hemminger (stephen)
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: normal    
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 4.14-rc1+ Subsystem:
Regression: No Bisected commit-id:
Attachments: kernel panic screenshoot
Kernel panic Kernel 4.14.0-rc1-next-20170919
Kernel 4.14-rc1+ panic
Kernel 4.14-rc1+
Kernel 4.13.2
Kernel 4.13.2 - panic with two more lines
kernel panic screenshoot pinctrl-v4.13-1
kernel panic / bisection kernel 4.12-rc7+
kernel panic net-next
Kernel panic Kernel 4.14.0-rc1-next-20170919

Description Pawel Staszewski 2017-09-19 22:36:32 UTC
Created attachment 258499 [details]
kernel panic screenshoot

Just after few seconds when host is starting to forward traffic i have same kernel panic.
Type of traffic TCP/UDP normal Internet traffic from users.
Comment 1 Pawel Staszewski 2017-09-19 23:20:18 UTC
Created attachment 258501 [details]
Kernel panic Kernel 4.14.0-rc1-next-20170919

trying linux-next tree - also panic
Comment 2 Pawel Staszewski 2017-09-19 23:35:13 UTC
Created attachment 258503 [details]
Kernel 4.14-rc1+ panic

Added new screenshoot with panic for Kernel 4.14-rc1+ (net-next from 20-09-2017)
Difference to the first one is that now iommu is enabled - previous have iommu=off
Comment 3 Pawel Staszewski 2017-09-19 23:45:19 UTC
Created attachment 258507 [details]
Kernel 4.14-rc1+

Trying other boot options idle=poll iommu=off
Kernel panic attached
dst_destroy+0x4c/0x83
Comment 4 Pawel Staszewski 2017-09-20 00:03:14 UTC
Some information about enviroment:
Server is acting as a ip router with bgp
There are 6x bgp sessions - each with full bgp table ~600k prefixes

And it looks like panic is appearing after bgp sessions are connected - not by traffic - cause at time when panic occured there is almost no traffic.

Also when I run tris server without turning on BGP and push thru this server traffic by pktgen there is no panic.

just after it learn routes it panick
Comment 5 Pawel Staszewski 2017-09-20 00:06:47 UTC
Created attachment 258509 [details]
Kernel 4.13.2

Trying another kernels - and soo far same situation with 4.13.2 - panic just after bgp sessions are connected and kernel learning routes.
Comment 6 Pawel Staszewski 2017-09-20 00:30:28 UTC
Latest working kernel without this problem 4.12.13
Comment 7 Pawel Staszewski 2017-09-20 08:45:55 UTC
Created attachment 258521 [details]
Kernel 4.13.2 - panic with two more lines
Comment 8 Pawel Staszewski 2017-09-20 09:27:18 UTC
Created attachment 258525 [details]
kernel panic screenshoot pinctrl-v4.13-1

first bisection panic
git bisect good v4.12
Bisecting: 7105 revisions left to test after this (roughly 13 steps)
[ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Comment 9 Pawel Staszewski 2017-09-20 10:52:29 UTC
Created attachment 258527 [details]
kernel panic / bisection kernel 4.12-rc7+

bisect log

git bisect start
# bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
# bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024.
git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f


Going to next bisects
Comment 10 Pawel Staszewski 2017-09-20 11:14:08 UTC
Some warning at bisection - not connected to main problem but somebody want to know  maybee -  (no panic)
good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure' 


[  639.883924] INFO: rcu_sched self-detected stall on CPU
[  639.883930]  0-...: (5999 ticks this GP) idle=ae2/140000000000001/0 softirq=43421/43421 fqs=2992
[  639.883931]   (t=6000 jiffies g=8207 c=8206 q=1035171)
[  639.883937] NMI backtrace for cpu 0
[  639.883939] CPU: 0 PID: 2886 Comm: md2_raid1 Not tainted 4.12.0-rc4+ #2
[  639.883940] Call Trace:
[  639.883942]  <IRQ>
[  639.883947]  dump_stack+0x63/0x7f
[  639.883949]  nmi_cpu_backtrace+0x7a/0x89
[  639.883953]  ? irq_force_complete_move+0xe9/0xe9
[  639.883955]  nmi_trigger_cpumask_backtrace+0x4e/0xc6
[  639.883956]  arch_trigger_cpumask_backtrace+0x14/0x17
[  639.883958]  rcu_dump_cpu_stacks+0x6d/0xb9
[  639.883961]  rcu_check_callbacks+0x220/0x5fb
[  639.883964]  ? raise_softirq+0x29/0x30
[  639.883966]  update_process_times+0x2a/0x4f
[  639.883968]  tick_sched_timer+0xfa/0x12f
[  639.883969]  __hrtimer_run_queues+0x6e/0xcf
[  639.883970]  hrtimer_interrupt+0x69/0x10d
[  639.883972]  local_apic_timer_interrupt+0x49/0x4d
[  639.883973]  smp_apic_timer_interrupt+0x25/0x36
[  639.883975]  apic_timer_interrupt+0x89/0x90
[  639.883977] RIP: 0010:atomic64_cmpxchg+0xc/0xe
[  639.883978] RSP: 0018:ffffc90008607d20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  639.883979] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  639.883979] RDX: ffff88086cb4f000 RSI: 0000000000000000 RDI: ffff88046b1a29e8
[  639.883980] RBP: ffffc90008607d20 R08: 0000000000000000 R09: ffff88086be4c0d8
[  639.883980] R10: 000000000000001e R11: 000000000000001e R12: ffff88086cb4f000
[  639.883981] R13: ffff88046b1a29e8 R14: ffff88086be4c0d0 R15: 0000000000000000
[  639.883982]  </IRQ>
[  639.883983]  __mutex_trylock_or_owner+0x4b/0x63
[  639.883984]  __mutex_trylock+0x9/0x11
[  639.883986]  mutex_trylock+0x9/0xe
[  639.883989]  md_check_recovery+0xde/0x3d5
[  639.883991]  raid1d+0x35/0xc2c
[  639.883993]  md_thread+0x11d/0x13b
[  639.883995]  ? add_wait_queue_exclusive+0x43/0x43
[  639.883996]  ? md_do_sync+0xc61/0xc61
[  639.883998]  kthread+0xf4/0xfc
[  639.883999]  ? __kthread_init_worker+0x2a/0x2a
[  639.884001]  ret_from_fork+0x25/0x30
[  745.520198] INFO: task kworker/1:1:598 blocked for more than 120 seconds.
[  745.520201]       Not tainted 4.12.0-rc4+ #2
[  745.520201] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  745.520202] kworker/1:1     D    0   598      2 0x00000000
[  745.520209] Workqueue: events_long flush_old_commits
[  745.520210] Call Trace:
[  745.520225]  __schedule+0x40e/0x4ea
[  745.520226]  schedule+0x80/0x8b
[  745.520228]  queue_log_writer+0x81/0xb5
[  745.520231]  ? try_to_wake_up+0x1f9/0x1f9
[  745.520232]  do_journal_end+0x1c3/0xb74
[  745.520233]  ? do_journal_begin_r+0x219/0x24d
[  745.520235]  journal_end_sync+0x60/0x65
[  745.520236]  reiserfs_sync_fs+0x3a/0x57
[  745.520237]  flush_old_commits+0x7d/0x8e
[  745.520240]  process_one_work+0x134/0x1df
[  745.520241]  ? process_scheduled_works+0x2e/0x2e
[  745.520242]  worker_thread+0x198/0x25a
[  745.520244]  ? process_scheduled_works+0x2e/0x2e
[  745.520245]  kthread+0xf4/0xfc
[  745.520246]  ? __kthread_init_worker+0x2a/0x2a
[  745.520247]  ? __request_module+0x1d5/0x1d5
[  745.520249]  ? call_usermodehelper_exec_async+0x11e/0x129
[  745.520251]  ret_from_fork+0x25/0x30
[  745.520265] INFO: task snmpd:4410 blocked for more than 120 seconds.
[  745.520266]       Not tainted 4.12.0-rc4+ #2
[  745.520266] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  745.520267] snmpd           D    0  4410      1 0x00000000
[  745.520268] Call Trace:
[  745.520269]  __schedule+0x40e/0x4ea
[  745.520271]  schedule+0x80/0x8b
[  745.520272]  queue_log_writer+0x81/0xb5
[  745.520273]  ? try_to_wake_up+0x1f9/0x1f9
[  745.520274]  do_journal_begin_r+0x18d/0x24d
[  745.520275]  journal_begin+0xbb/0xf3
[  745.520277]  reiserfs_dirty_inode+0x4f/0x7c
[  745.520287]  ? current_kernel_time64+0xd/0x31
[  745.520288]  ? kfree+0x24/0xb1
[  745.520291]  __mark_inode_dirty+0x2c/0x1cf
[  745.520292]  generic_update_time+0x9f/0xaa
[  745.520294]  file_update_time+0xad/0xce
[  745.520296]  ? virt_to_head_page+0x3b/0x3d
[  745.520299]  __generic_file_write_iter+0x5d/0x152
[  745.520300]  ? destroy_inode+0x39/0x52
[  745.520301]  generic_file_write_iter+0x4b/0x95
[  745.520303]  __vfs_write+0x79/0xa2
[  745.520304]  vfs_write+0x96/0xfc
[  745.520306]  SyS_write+0x4d/0x7f
[  745.520307]  entry_SYSCALL_64_fastpath+0x1a/0xa5
[  745.520308] RIP: 0033:0x7f2da122e640
[  745.520309] RSP: 002b:00007ffe3443e4e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  745.520310] RAX: ffffffffffffffda RBX: 00007ffe3443b478 RCX: 00007f2da122e640
[  745.520311] RDX: 000000000000002c RSI: 00007f2da20e4000 RDI: 0000000000000003
[  745.520311] RBP: 0000000000000000 R08: 746e6f6362757320 R09: 69272072656e6961
[  745.520312] R10: 2027726464615f61 R11: 0000000000000246 R12: 0000000000000048
[  745.520313] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffe3443b6b8
[  745.520324] INFO: task objcopy:24869 blocked for more than 120 seconds.
[  745.520325]       Not tainted 4.12.0-rc4+ #2
[  745.520325] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  745.520326] objcopy         D    0 24869  24868 0x00000004
[  745.520327] Call Trace:
[  745.520329]  __schedule+0x40e/0x4ea
[  745.520330]  ? bit_wait_timeout+0x44/0x44
[  745.520331]  schedule+0x80/0x8b
[  745.520333]  io_schedule+0x12/0x35
[  745.520334]  bit_wait_io+0xd/0x2c
[  745.520335]  __wait_on_bit+0x4c/0x86
[  745.520337]  ? submit_bio+0xf6/0xff
[  745.520338]  out_of_line_wait_on_bit+0x6d/0x78
[  745.520340]  ? bit_wait_timeout+0x44/0x44
[  745.520342]  ? wake_atomic_t_function+0x28/0x28
[  745.520344]  __wait_on_buffer+0x21/0x23
[  745.520346]  search_by_key+0x19f/0x77c
[  745.520348]  search_for_position_by_key+0x34/0x62
[  745.520349]  ? journal_transaction_should_end+0x24/0x95
[  745.520350]  reiserfs_do_truncate+0x330/0x3ad
[  745.520352]  ? reiserfs_write_lock_nested+0x1f/0x35
[  745.520354]  reiserfs_delete_object+0x2a/0x68
[  745.520355]  reiserfs_evict_inode+0xc2/0x146
[  745.520356]  evict+0xb5/0x168
[  745.520357]  iput+0x134/0x139
[  745.520359]  do_unlinkat+0x10a/0x1a0
[  745.520360]  SyS_unlink+0x11/0x13
[  745.520361]  entry_SYSCALL_64_fastpath+0x1a/0xa5
[  745.520362] RIP: 0033:0x2b4f96f6e787
[  745.520363] RSP: 002b:00007fff6c35fcd8 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
[  745.520364] RAX: ffffffffffffffda RBX: 00002b4f97232640 RCX: 00002b4f96f6e787
[  745.520364] RDX: 0000000000008000 RSI: 00007fff6c35fce0 RDI: 00007fff6c3607c9
[  745.520365] RBP: 0000000000000ff1 R08: 0000000000000058 R09: 0000000000140000
[  745.520366] R10: 00007fff6c35fa90 R11: 0000000000000206 R12: 0000000001133930
[  745.520366] R13: 000000000000003f R14: 00002b4f97232698 R15: 000000000000e6e1
Comment 11 Pawel Staszewski 2017-09-20 11:40:20 UTC
Another bisect (some warnings):
with: 
# good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 'bpf-ctx-narrow'
git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70


[  171.393948] NETDEV WATCHDOG: enp4s0f0 (ixgbe): transmit queue 0 timed out
[  171.393957] ------------[ cut here ]------------
[  171.393962] WARNING: CPU: 24 PID: 0 at dev_watchdog+0xcf/0x139
[  171.393962] Modules linked in: bonding ipmi_si x86_pkg_temp_thermal
[  171.393967] CPU: 24 PID: 0 Comm: swapper/24 Not tainted 4.12.0-rc4+ #6
[  171.393968] task: ffff88046d98c000 task.stack: ffffc90003320000
[  171.393970] RIP: 0010:dev_watchdog+0xcf/0x139
[  171.393970] RSP: 0018:ffff88046fb83ea8 EFLAGS: 00010246
[  171.393971] RAX: 000000000000003d RBX: ffff88046bf10000 RCX: 0000000000000000
[  171.393972] RDX: ffff88046fb92f01 RSI: ffffffff81e07102 RDI: ffff88046fb8da08
[  171.393972] RBP: ffff88046fb83eb8 R08: ffffffff81e07102 R09: 000000000000003d
[  171.393973] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[  171.393973] R13: ffff88046fb83ef8 R14: ffff88046d98c000 R15: ffffffff81c06008
[  171.393974] FS:  0000000000000000(0000) GS:ffff88046fb80000(0000) knlGS:0000000000000000
[  171.393975] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  171.393976] CR2: 0000000001c060c8 CR3: 000000086b910000 CR4: 00000000001406e0
[  171.393976] Call Trace:
[  171.393978]  <IRQ>
[  171.393979]  ? netif_tx_lock+0x79/0x79
[  171.393983]  call_timer_fn.isra.24+0x17/0x77
[  171.393984]  run_timer_softirq+0x118/0x161
[  171.393986]  ? netif_tx_lock+0x79/0x79
[  171.393988]  ? arch_local_irq_enable+0xb/0xd
[  171.393990]  __do_softirq+0xa8/0x19d
[  171.393991]  irq_exit+0x5d/0x6b
[  171.393994]  smp_apic_timer_interrupt+0x2a/0x36
[  171.393995]  apic_timer_interrupt+0x89/0x90
[  171.393998] RIP: 0010:mwait_idle+0x4e/0x6a
[  171.393999] RSP: 0018:ffffc90003323e98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  171.394000] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  171.394000] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88046d98c000
[  171.394001] RBP: ffffc90003323e98 R08: 00000000000000af R09: 0000000000000001
[  171.394001] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[  171.394002] R13: ffff88046d98c000 R14: ffff88046d98c000 R15: ffff88046d98c000
[  171.394002]  </IRQ>
[  171.394006]  arch_cpu_idle+0xa/0xc
[  171.394007]  default_idle_call+0x19/0x1b
[  171.394009]  do_idle+0xb7/0x191
[  171.394010]  cpu_startup_entry+0x1d/0x21
[  171.394012]  start_secondary+0xd8/0xdc
[  171.394014]  secondary_startup_64+0x9f/0x9f
[  171.394015] Code: cc 75 bd eb 35 48 89 df c6 05 b2 ba 75 00 01 e8 f8 60 fe ff 44 89 e1 48 89 de 48 89 c2 48 c7 c7 03 32 a4 81 31 c0 e8 8f 1a b6 ff <0f> ff 48 8b 83 e0 01 00 00 48 89 df ff 50 78 48 8b 05 3f 9c 6b
[  171.394033] ---[ end trace b78b2168985afd72 ]---
Comment 12 Pawel Staszewski 2017-09-20 14:02:55 UTC
Created attachment 258529 [details]
kernel panic net-next

After aditional patch for debug:
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..1eaa3553a724dc8c048f67b556337072d5addc82 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3331,7 +3331,14 @@ void netdev_run_todo(void);
   */
  static inline void dev_put(struct net_device *dev)
  {
-	this_cpu_dec(*dev->pcpu_refcnt);
+	int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
+
+	if (!pref) {
+		pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
+		       dev, dev->name, dev->reg_state, dev->dismantle);
+		BUG();
+	}
+	this_cpu_dec(*pref);
  }
  
  /**
Comment 13 Pawel Staszewski 2017-09-20 15:05:45 UTC
Created attachment 258531 [details]
Kernel panic Kernel 4.14.0-rc1-next-20170919

A little more in new attachement from ipmi recording
Comment 14 Pawel Staszewski 2017-09-21 09:07:26 UTC
After adding patch :
diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a352..82aff41c6f63 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry
*dst, unsigned long time)
 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
 {
        if (dst)
-               atomic_inc(&dst->__refcnt);
+               dst_hold(dst);
        return dst;
 }



dmesg output

[  222.276139] WARNING: CPU: 25 PID: 0 at dst_hold+0x25/0x29
[  222.276140] Modules linked in: bonding ipmi_si x86_pkg_temp_thermal
[  222.276159] CPU: 25 PID: 0 Comm: swapper/25 Tainted: G        W       4.14.0-rc1+ #1
[  222.276160] task: ffff88046da5d000 task.stack: ffffc90003328000
[  222.276161] RIP: 0010:dst_hold+0x25/0x29
[  222.276162] RSP: 0018:ffff88046fbc39d8 EFLAGS: 00010246
[  222.276163] RAX: 0000000000000000 RBX: ffff88045b7dc400 RCX: 0000000000000b84
[  222.276164] RDX: ffff88084be83280 RSI: 0000000000000036 RDI: ffff88084be83200
[  222.276165] RBP: ffff88046fbc39d8 R08: 0000000000000000 R09: ffff8804573880e2
[  222.276165] R10: 00000000000000c0 R11: ffffffff815921b2 R12: ffff88046d7f8000
[  222.276166] R13: ffff88046d7f8000 R14: 0000000000000000 R15: 0000000000000002
[  222.276167] FS:  0000000000000000(0000) GS:ffff88046fbc0000(0000) knlGS:0000000000000000
[  222.276167] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.276168] CR2: 00007fdd4c81d580 CR3: 000000086b737003 CR4: 00000000001606e0
[  222.276168] Call Trace:
[  222.276169]  <IRQ>
[  222.276170]  skb_dst_force+0x32/0x34
[  222.276172]  __dev_queue_xmit+0x1ad/0x482
[  222.276173]  ? eth_header+0x28/0xc6
[  222.276174]  dev_queue_xmit+0xb/0xd
[  222.276175]  neigh_connected_output+0x9b/0xb4
[  222.276177]  ip_finish_output2+0x234/0x294
[  222.276178]  ? ipt_do_table+0x369/0x388
[  222.276179]  ip_finish_output+0x12c/0x13f
[  222.276181]  ip_output+0x53/0x87
[  222.276182]  ip_forward_finish+0x53/0x5a
[  222.276183]  ip_forward+0x2cb/0x3e6
[  222.276185]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
[  222.276186]  ip_rcv_finish+0x2e2/0x321
[  222.276187]  ip_rcv+0x26f/0x2eb
[  222.276189]  ? vlan_do_receive+0x4f/0x289
[  222.276190]  ? virt_to_head_page+0x3b/0x3d
[  222.276192]  ? skb_gro_receive+0x13c/0x29e
[  222.276193]  __netif_receive_skb_core+0x467/0x50b
[  222.276194]  __netif_receive_skb+0x4d/0x5f
[  222.276196]  netif_receive_skb_internal+0x5c/0xaf
[  222.276197]  napi_gro_complete+0x7e/0x88
[  222.276198]  napi_gro_flush+0x4f/0x6b
[  222.276199]  napi_complete_done+0x4b/0xbc
[  222.276201]  ixgbe_poll+0xd2e/0xf09
[  222.276202]  ? kmem_cache_free_bulk+0x1b6/0x1f7
[  222.276204]  net_rx_action+0xf4/0x266
[  222.276205]  __do_softirq+0xa8/0x19d
[  222.276207]  irq_exit+0x5d/0x6b
[  222.276208]  do_IRQ+0x9c/0xb5
[  222.276209]  common_interrupt+0x93/0x93
[  222.276210]  </IRQ>
[  222.276211] RIP: 0010:mwait_idle+0x4e/0x6a
[  222.276212] RSP: 0018:ffffc9000332be98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff56
[  222.276213] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  222.276213] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88046da5d000
[  222.276214] RBP: ffffc9000332be98 R08: fffffffffffffffc R09: 0000000000000000
[  222.276214] R10: ffff88046da5d000 R11: ffff88046da5d000 R12: 0000000000000000
[  222.276215] R13: ffff88046da5d000 R14: ffff88046da5d000 R15: ffff88046da5d000
[  222.276217]  arch_cpu_idle+0xa/0xc
[  222.276218]  default_idle_call+0x19/0x1b
[  222.276220]  do_idle+0xbc/0x196
[  222.276221]  ? do_idle+0x194/0x196
[  222.276222]  cpu_startup_entry+0x1d/0x20
[  222.276224]  start_secondary+0xf6/0xfb
[  222.276225]  secondary_startup_64+0xa5/0xa5
[  222.276226] Code: 08 0f 4f c2 5d c3 8b 87 80 00 00 00 55 48 8d 97 80 00 00 00 48 89 e5 85 c0 74 10 8d 48 01 f0 0f b1 0a 0f 94 c1 84 c9 75 04 eb ec <0f> ff 5d c3 48 8b 47 58 31 d2 a8 01 74 0b 31 d2 48 a9 fe ff ff
[  222.276246] ---[ end trace c197aa7ded9b7547 ]---
[  222.299282] ------------[ cut here ]------------
[  222.299288] WARNING: CPU: 19 PID: 0 at dst_hold+0x25/0x29
[  222.299288] Modules linked in: bonding ipmi_si x86_pkg_temp_thermal
[  222.299292] CPU: 19 PID: 0 Comm: swapper/19 Tainted: G        W       4.14.0-rc1+ #1
[  222.299293] task: ffff88086d919000 task.stack: ffffc900032f8000
[  222.299294] RIP: 0010:dst_hold+0x25/0x29
[  222.299295] RSP: 0018:ffff88087fa43a28 EFLAGS: 00010246
[  222.299296] RAX: 0000000000000000 RBX: ffff88045aceb000 RCX: 0000000000000000
[  222.299297] RDX: ffff880459d40180 RSI: 0000000000000000 RDI: ffff880459d40100
[  222.299297] RBP: ffff88087fa43a28 R08: ffff88084dd13800 R09: 00000000000005c0
[  222.299298] R10: ffff88084dd138c0 R11: ffffffff815921b2 R12: ffff88086cc4d000
[  222.299299] R13: ffff88086cc4d000 R14: 0000000000000000 R15: 0000000000000002
[  222.299300] FS:  0000000000000000(0000) GS:ffff88087fa40000(0000) knlGS:0000000000000000
[  222.299301] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.299301] CR2: 00007f83f3ce1580 CR3: 0000000866dca002 CR4: 00000000001606e0
[  222.299302] Call Trace:
[  222.299303]  <IRQ>
[  222.299305]  skb_dst_force+0x32/0x34
[  222.299306]  __dev_queue_xmit+0x1ad/0x482
[  222.299308]  ? eth_header+0x28/0xc6
[  222.299310]  dev_queue_xmit+0xb/0xd
[  222.299312]  neigh_connected_output+0x9b/0xb4
[  222.299314]  ip_finish_output2+0x234/0x294
[  222.299316]  ? ipt_do_table+0x369/0x388
[  222.299317]  ip_finish_output+0x12c/0x13f
[  222.299319]  ip_output+0x53/0x87
[  222.299320]  ip_forward_finish+0x53/0x5a
[  222.299321]  ip_forward+0x2cb/0x3e6
[  222.299323]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
[  222.299324]  ip_rcv_finish+0x2e2/0x321
[  222.299325]  ip_rcv+0x26f/0x2eb
[  222.299337]  ? vlan_do_receive+0x4f/0x289
[  222.299338]  __netif_receive_skb_core+0x467/0x50b
[  222.299341]  ? tcp_gro_receive+0x239/0x239
[  222.299344]  ? inet_gro_receive+0x36/0x238
[  222.299345]  __netif_receive_skb+0x4d/0x5f
[  222.299346]  netif_receive_skb_internal+0x5c/0xaf
[  222.299348]  napi_gro_receive+0x45/0x81
[  222.299351]  ixgbe_poll+0xc8a/0xf09
[  222.299354]  ? kmem_cache_free_bulk+0x1b6/0x1f7
[  222.299356]  net_rx_action+0xf4/0x266
[  222.299358]  __do_softirq+0xa8/0x19d
[  222.299361]  irq_exit+0x5d/0x6b
[  222.299362]  do_IRQ+0x9c/0xb5
[  222.299363]  common_interrupt+0x93/0x93
[  222.299364]  </IRQ>
[  222.299366] RIP: 0010:mwait_idle+0x4e/0x6a
[  222.299367] RSP: 0018:ffffc900032fbe98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffc4
[  222.299368] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  222.299368] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88086d919000
[  222.299369] RBP: ffffc900032fbe98 R08: 0000000000000000 R09: 0000000000000000
[  222.299370] R10: ffff88086d919000 R11: ffff88086d919000 R12: 0000000000000000
[  222.299370] R13: ffff88086d919000 R14: ffff88086d919000 R15: ffff88086d919000
[  222.299374]  arch_cpu_idle+0xa/0xc
[  222.299375]  default_idle_call+0x19/0x1b
[  222.299377]  do_idle+0xbc/0x196
[  222.299379]  cpu_startup_entry+0x1d/0x20
[  222.299380]  start_secondary+0xf6/0xfb
[  222.299383]  secondary_startup_64+0xa5/0xa5
[  222.299384] Code: 08 0f 4f c2 5d c3 8b 87 80 00 00 00 55 48 8d 97 80 00 00 00 48 89 e5 85 c0 74 10 8d 48 01 f0 0f b1 0a 0f 94 c1 84 c9 75 04 eb ec <0f> ff 5d c3 48 8b 47 58 31 d2 a8 01 74 0b 31 d2 48 a9 fe ff ff
[  222.299402] ---[ end trace c197aa7ded9b7548 ]---
[  222.324407] ------------[ cut here ]------------
[  222.324410] WARNING: CPU: 32 PID: 0 at dst_hold+0x25/0x29
[  222.324411] Modules linked in: bonding ipmi_si x86_pkg_temp_thermal
[  222.324414] CPU: 32 PID: 0 Comm: swapper/32 Tainted: G        W       4.14.0-rc1+ #1
[  222.324415] task: ffff88086d91d000 task.stack: ffffc90003360000
[  222.324416] RIP: 0010:dst_hold+0x25/0x29
[  222.324417] RSP: 0018:ffff88087fb039d8 EFLAGS: 00010246
[  222.324418] RAX: 0000000000000000 RBX: ffff880459bd4900 RCX: 00000000000011be
[  222.324418] RDX: ffff880459bf0280 RSI: 0000000000000036 RDI: ffff880459bf0200
[  222.324419] RBP: ffff88087fb039d8 R08: 0000000000000000 R09: ffff880848a920e2
[  222.324420] R10: 00000000000000c0 R11: ffffffff815921b2 R12: ffff88046d7f8000
[  222.324420] R13: ffff88046d7f8000 R14: 0000000000000000 R15: 0000000000000002
[  222.324421] FS:  0000000000000000(0000) GS:ffff88087fb00000(0000) knlGS:0000000000000000
[  222.324422] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.324422] CR2: 0000000011f99008 CR3: 000000086b737000 CR4: 00000000001606e0
[  222.324423] Call Trace:
[  222.324433]  <IRQ>
[  222.324435]  skb_dst_force+0x32/0x34
[  222.324436]  __dev_queue_xmit+0x1ad/0x482
[  222.324438]  ? eth_header+0x28/0xc6
[  222.324439]  dev_queue_xmit+0xb/0xd
[  222.324441]  neigh_connected_output+0x9b/0xb4
[  222.324443]  ip_finish_output2+0x234/0x294
[  222.324444]  ? ipt_do_table+0x369/0x388
[  222.324445]  ip_finish_output+0x12c/0x13f
[  222.324447]  ip_output+0x53/0x87
[  222.324448]  ip_forward_finish+0x53/0x5a
[  222.324449]  ip_forward+0x2cb/0x3e6
[  222.324451]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
[  222.324452]  ip_rcv_finish+0x2e2/0x321
[  222.324453]  ip_rcv+0x26f/0x2eb
[  222.324455]  ? vlan_do_receive+0x4f/0x289
[  222.324457]  ? virt_to_head_page+0x3b/0x3d
[  222.324458]  ? skb_gro_receive+0x13c/0x29e
[  222.324460]  __netif_receive_skb_core+0x467/0x50b
[  222.324461]  __netif_receive_skb+0x4d/0x5f
[  222.324462]  netif_receive_skb_internal+0x5c/0xaf
[  222.324464]  napi_gro_complete+0x7e/0x88
[  222.324465]  napi_gro_flush+0x4f/0x6b
[  222.324466]  napi_complete_done+0x4b/0xbc
[  222.324468]  ixgbe_poll+0xd2e/0xf09
[  222.324470]  ? kmem_cache_free_bulk+0x1b6/0x1f7
[  222.324471]  net_rx_action+0xf4/0x266
[  222.324473]  __do_softirq+0xa8/0x19d
[  222.324475]  irq_exit+0x5d/0x6b
[  222.324476]  do_IRQ+0x9c/0xb5
[  222.324477]  common_interrupt+0x93/0x93
[  222.324478]  </IRQ>
[  222.324480] RIP: 0010:mwait_idle+0x4e/0x6a
[  222.324480] RSP: 0018:ffffc90003363e98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffa6
[  222.324482] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  222.324482] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88086d91d000
[  222.324483] RBP: ffffc90003363e98 R08: fffffffffffffffc R09: 0000000000000000
[  222.324483] R10: ffff88086d91d000 R11: ffff88086d91d000 R12: 0000000000000000
[  222.324484] R13: ffff88086d91d000 R14: ffff88086d91d000 R15: ffff88086d91d000
[  222.324486]  arch_cpu_idle+0xa/0xc
[  222.324488]  default_idle_call+0x19/0x1b
[  222.324490]  do_idle+0xbc/0x196
[  222.324491]  ? complete+0x48/0x50
[  222.324493]  cpu_startup_entry+0x1d/0x20
[  222.324494]  start_secondary+0xf6/0xfb
[  222.324497]  secondary_startup_64+0xa5/0xa5
[  222.324497] Code: 08 0f 4f c2 5d c3 8b 87 80 00 00 00 55 48 8d 97 80 00 00 00 48 89 e5 85 c0 74 10 8d 48 01 f0 0f b1 0a 0f 94 c1 84 c9 75 04 eb ec <0f> ff 5d c3 48 8b 47 58 31 d2 a8 01 74 0b 31 d2 48 a9 fe ff ff
[  222.324516] ---[ end trace c197aa7ded9b7549 ]---
Comment 15 Pawel Staszewski 2017-09-21 09:29:17 UTC
update 
[  961.873764] dst_release: dst:ffff88086bfcfe00 refcnt:-1
[  989.919469] ------------[ cut here ]------------
[  989.919476] WARNING: CPU: 4 PID: 0 at dst_hold+0x25/0x29
[  989.919476] Modules linked in: bonding ipmi_si x86_pkg_temp_thermal
[  989.919481] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G        W       4.14.0-rc1+ #1
[  989.919482] task: ffff88046da32000 task.stack: ffffc90003280000
[  989.919484] RIP: 0010:dst_hold+0x25/0x29
[  989.919484] RSP: 0018:ffff88046f903a28 EFLAGS: 00010246
[  989.919485] RAX: 0000000000000000 RBX: ffff880467bd2900 RCX: 0000000000000000
[  989.919486] RDX: ffff88045ab85a80 RSI: 0000000000000000 RDI: ffff88045ab85a00
[  989.919487] RBP: ffff88046f903a28 R08: ffff880457e84000 R09: 0000000000000344
[  989.919488] R10: ffff880457e840c0 R11: ffffffff815921b2 R12: ffff88046d305000
[  989.919488] R13: ffff88046d305000 R14: 0000000000000000 R15: 0000000000000002
[  989.919489] FS:  0000000000000000(0000) GS:ffff88046f900000(0000) knlGS:0000000000000000
[  989.919490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  989.919491] CR2: 00007f5a78e3a000 CR3: 0000000863016001 CR4: 00000000001606e0
[  989.919492] Call Trace:
[  989.919493]  <IRQ>
[  989.919496]  skb_dst_force+0x32/0x34
[  989.919498]  __dev_queue_xmit+0x1ad/0x482
[  989.919501]  ? eth_header+0x28/0xc6
[  989.919502]  dev_queue_xmit+0xb/0xd
[  989.919504]  neigh_connected_output+0x9b/0xb4
[  989.919507]  ip_finish_output2+0x234/0x294
[  989.919509]  ? ipt_do_table+0x369/0x388
[  989.919510]  ip_finish_output+0x12c/0x13f
[  989.919512]  ip_output+0x53/0x87
[  989.919513]  ip_forward_finish+0x53/0x5a
[  989.919515]  ip_forward+0x2cb/0x3e6
[  989.919516]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
[  989.919518]  ip_rcv_finish+0x2e2/0x321
[  989.919519]  ip_rcv+0x26f/0x2eb
[  989.919522]  ? vlan_do_receive+0x4f/0x289
[  989.919523]  __netif_receive_skb_core+0x467/0x50b
[  989.919526]  ? tcp_gro_receive+0x239/0x239
[  989.919529]  ? inet_gro_receive+0x226/0x238
[  989.919530]  __netif_receive_skb+0x4d/0x5f
[  989.919532]  netif_receive_skb_internal+0x5c/0xaf
[  989.919533]  napi_gro_receive+0x45/0x81
[  989.919536]  ixgbe_poll+0xc8a/0xf09
[  989.919539]  ? kmem_cache_free_bulk+0x1b6/0x1f7
[  989.919540]  net_rx_action+0xf4/0x266
[  989.919543]  __do_softirq+0xa8/0x19d
[  989.919545]  irq_exit+0x5d/0x6b
[  989.919546]  do_IRQ+0x9c/0xb5
[  989.919548]  common_interrupt+0x93/0x93
[  989.919548]  </IRQ>
[  989.919551] RIP: 0010:mwait_idle+0x4e/0x6a
[  989.919551] RSP: 0018:ffffc90003283e98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff47
[  989.919553] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  989.919553] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88046da32000
[  989.919554] RBP: ffffc90003283e98 R08: 0000000000000001 R09: ffffffff81c06020
[  989.919554] R10: ffff88046da32000 R11: ffff88046da32000 R12: 0000000000000000
[  989.919555] R13: ffff88046da32000 R14: ffff88046da32000 R15: ffff88046da32000
[  989.919559]  arch_cpu_idle+0xa/0xc
[  989.919561]  default_idle_call+0x19/0x1b
[  989.919565]  do_idle+0xbc/0x196
[  989.919566]  cpu_startup_entry+0x1d/0x20
[  989.919568]  start_secondary+0xf6/0xfb
[  989.919571]  secondary_startup_64+0xa5/0xa5
[  989.919572] Code: 08 0f 4f c2 5d c3 8b 87 80 00 00 00 55 48 8d 97 80 00 00 00 48 89 e5 85 c0 74 10 8d 48 01 f0 0f b1 0a 0f 94 c1 84 c9 75 04 eb ec <0f> ff 5d c3 48 8b 47 58 31 d2 a8 01 74 0b 31 d2 48 a9 fe ff ff
[  989.919593] ---[ end trace c197aa7ded9b754f ]---
[  989.919594] dst_release: dst:ffff88045ab85a00 refcnt:-1
Comment 16 Pawel Staszewski 2017-09-26 08:21:05 UTC
Problem solved by patch:
https://www.mail-archive.com/netdev@vger.kernel.org/msg189383.html