Bug 208003 - kernel panic RIP: 0010:__cgroup_bpf_run_filter_skb+0xd9/0x230
Summary: kernel panic RIP: 0010:__cgroup_bpf_run_filter_skb+0xd9/0x230
Status: RESOLVED CODE_FIX
Alias: None
Product: Networking
Classification: Unclassified
Component: Other
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-05-31 14:08 UTC by karsten.elfenbein
Modified: 2020-07-24 14:22 UTC
CC List: 13 users

See Also:
Kernel Version: 5.6.15
Subsystem:
Regression: No
Bisected commit-id:


Attachments
blurry kernel panic :( (3.15 MB, image/jpeg)
2020-05-31 14:08 UTC, karsten.elfenbein
next kernel panic with the same output (2.62 MB, image/jpeg)
2020-06-01 07:50 UTC, karsten.elfenbein
screenshot of kernel panic with 5.4.45-1 LTS kernel (929.41 KB, image/jpeg)
2020-06-08 18:41 UTC, Zoé B

Description karsten.elfenbein 2020-05-31 14:08:42 UTC
Created attachment 289439 [details]
blurry kernel panic :(

I get random kernel panics on one of my systems.
It is running Arch Linux with kernel 5.6.15-arch1-1 on an AMD Ryzen 5 2400G with an "RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)" NIC (1462:7a40).

The machine uses a bridge, VLANs, and Docker.
I could only take a photo of the panic, as netconsole does not log it.

BUG: kernel NULL pointer dereference, address: 0000000000000010
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 3e65a6067 P4D 3e65a6067 PUD 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.15-arch1-1 #1
Hardware name: Micro-Star International Co., ltd. MS-7A40/B450I GAMING PLUS AC (MS-7A40), BIOS A.60 03/06/2019
RIP: 0010:__cgroup_bpf_run_filter_skb+0xd9/0x230
Code: 00 48 01 c8 48 89 43 50 41 83 ff 01 0f 84 c2 00 00 00 e8 ca f2 ec ff e8 85 31 f2 ff 44 89 fa 48 8d 84 d5 30 06 00 00 48 8b 00 <48> 8b 78 10 4c 8d 78 10 48 85 ff 0f 84 29 01 00 00 bd 01 00 00 00
RSP: 0018:ffffad9b402e4758 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff93d66b09c900 RCX: 00000000000000f0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff93d67abe8000 R08: 0000000000000000 R09: 0000000000000001
R10: 000000000000b302 R11: ffff93d66b09c9d4 R12: 0000000000000014
R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff93d6908c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 00000003ecfa0000 CR4: 00000000003406e0
Call Trace:
 <IRQ>
 sk_filter_trim_cap+0x12f/0x270
 ? tcp_v4_inbound_md5_hash+0x5e/0x170
 tcp_v4_rcv+0xb18/0xd80
 ip_protocol_deliver_rcu+0x2b/0x1e0
 ip_local_deliver_finish+0x55/0x70
 ip_local_deliver+0x115/0x130
 ? ip_protocol_deliver_rcu+0x1e0/0x1e0
 ip_sabotage_in+0x5a/0x70 [br_netfilter]
 nf_hook_slow+0x3f/0xb0
 ip_rcv+0xd1/0x110
 ? ip_rcv_finish_core.constprop.0+0x470/0x470
 __netif_receive_skb_one_core+0x80/0x90
 netif_receive_skb+0x162/0x1b0
 br_pass_frame_up+0xf0/0x1d0 [bridge]
...
Comment 1 karsten.elfenbein 2020-06-01 07:50:36 UTC
Created attachment 289443 [details]
next kernel panic with the same output

next crash a day later...
Comment 2 Zoé B 2020-06-08 11:38:41 UTC
I can confirm the same on an Intel NUC running 5.6.15-arch1.

Call Trace:
<IRQ>
 sk_filter_trim_cap+0x12f/0x270
 ? tcp_v4_inbound_md5_hash+0x5e/0x170
 tcp_v4_rcv+0xb18/0xd80
 ip_protocol_deliver_rcu+0x2b/0x1e0
 ip_local_deliver_finish+0x55/0x70
 ip_local_deliver+0x115/0x130
 ? ip_protocol_deliver_rcu+0x1e0/0x1e0
 ip_sublist_rcv_finish+0x3d/0x50
 ip_sublist_rcv+0x182/0x210
 ? ip_rcv_finish_core.constprop.0+0x470/0x470
 ip_list_rcv+0xf6/0x11a
 __netif_receive_skb_list_core+0x263/0x290
 netif_receive_skb_list_internal+0x1c8/0x300
 napi_complete_done+0x93/0x140
 e1000e_poll+0xca/0x2c0 [e1000e]
 net_rx_action+0x168/0x400
 __do_softirq+0x10f/0x358
 irq_exit+0xab/0x120
 do_IRQ+0x89/0x140
 common_interrupt+0xf/0xf
 </IRQ>
RIP: 0033:0x55a466def20c

Another time, the RIP was 0010:cpuidle_enter_state+0xb6/0x420.

The computer runs 24/7 and crashes about 3-4 times a week. Crash dumping does not seem to save any logs/core dump; I only have photos of the screen.
Comment 3 Zoé B 2020-06-08 11:42:36 UTC
I wanted to fill in a few more details. It is an Intel NUC6i3SYB.
The Ethernet interface is the integrated e1000e I219-V:
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection I219-V (rev 21)

Let me know if I can help you debug this.
Comment 4 Zoé B 2020-06-08 18:41:53 UTC
Created attachment 289567 [details]
screenshot of kernel panic with 5.4.45-1 LTS kernel

The behavior can also be observed with the recent 5.4.45 LTS kernel.
Comment 5 Zoé B 2020-06-08 19:06:43 UTC
This thread looks related: https://www.spinics.net/lists/netdev/msg658421.html
Comment 6 Thomas Clark 2020-06-09 15:12:39 UTC
I have 3 servers that were recently upgraded to kernel 5.6.15. Since the upgrade, two have crashed randomly and one is stable. The two that crashed are running Docker with systemd.unified_cgroup_hierarchy=0. Reverting to kernel 5.6.13 resolved the crashes. As above, I have nothing in the logs to explain the crashes.
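For anyone comparing setups: a generic way to check which cgroup hierarchy a machine is actually on (not specific to this bug) is to look at the filesystem type mounted at /sys/fs/cgroup:

stat -fc %T /sys/fs/cgroup/
# prints "cgroup2fs" on the unified hierarchy, "tmpfs" on legacy/hybrid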
Comment 8 wolf 2020-06-09 23:22:39 UTC
Possible workaround based on [0] is:

cd /usr/lib/systemd/system
sed -i.bak 's/^IPAddressDeny/#IPAddressDeny/g' *.service

Just posting it here as well so that people coming from a search engine can find it.

[0]: https://www.spinics.net/lists/netdev/msg658503.html
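As I understand it, IPAddressDeny= makes systemd attach cgroup BPF filters to the unit's cgroup, which is exactly the code path that crashes here; that would explain why commenting it out helps. Note that after editing the unit files, the units need to be reloaded and restarted for the old filters to be detached. A generic way to check what is still attached (assuming bpftool is available):

systemctl daemon-reload
# restart the affected services (or reboot) so the old BPF filters go away
bpftool cgroup tree
# lists cgroups with attached BPF programs; ingress/egress entries
# typically come from IPAddressDeny=/IPAddressAllow=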
Comment 9 wolf 2020-06-09 23:24:43 UTC
Also complete panic from my machine:

"BUG: kernel NULL pointer dereference, address: 0000000000000010
"#PF: supervisor read access in kernel mode
"#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 )
Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G     U            5.4.44-1-lts #1
Hardware name: ASUSTeK COMPUTER INC. P10S WS/P10S WS, BIOS 3402 07/12/2018
RIP: 0010:__cgroup_bpf_run_filter_skb+0xca/0x1d0
Code: 48 89 83 c8 00 00 00 48 01 c8 48 89 14 24 48 89 43 50 41 83 ff 01 0f 84 92 00 00 00 44 89 fa 48 8d 84 d6 30 06 00 00 48 8b 00 <48> 8b 78 10 4c 8d 78 10 48 85 ff 0f 84 df 00 00 00 bd 01 00 00 00
RSP: 0000:ffffc9000003cc10 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff8882a6c16300 RCX: 0000000000000114
RDX: 0000000000000000 RSI: ffff888820bec000 RDI: ffff8881ec9af1c0
RBP: ffff8882a6c16300 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000014
R13: 0000000000000014 R14: ffff88882054e862 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88884ea80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 0000000214c0a005 CR4: 00000000003606e0
Call Trace:
 <IRQ>
 sk_filter_trim_cap+0x123/0x220
 ? tcp_v4_inbound_md5_hash+0x56/0x170
 tcp_v4_rcv+0x990/0xbc0
 ip_protocol_deliver_rcu+0x2b/0x1b0
 ip_local_deliver_finish+0x44/0x50
 ip_local_deliver+0xf5/0x100
 ? ip_protocol_deliver_rcu+0x1b0/0x1b0
 ip_rcv+0xc5/0xd0
 ? ip_rcv_finish_core.constprop.0+0x390/0x390
 __netif_receive_skb_one_core+0x87/0xa0
 netif_receive_skb_internal+0x52/0xc0
 napi_gro_receive+0x114/0x140
 igb_poll+0x56e/0xf40 [igb]
 net_rx_action+0x16d/0x3f0
 ? rcu_core+0xde/0x4e0
 __do_softirq+0xe9/0x2dc
 irq_exit+0xa2/0xe0
 do_IRQ+0x58/0xe0
 common_interrupt+0xf/0xf
 </IRQ>
RIP: 0010:cpuidle_enter_state+0xea/0x400
Code: ed 8d 7e e8 58 39 99 ff 48 89 04 24 0f 1f 44 00 00 31 ff e8 08 47 99 ff 80 7c 24 0b 00 0f 85 12 02 00 00 fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 1e 02 00 00 49 63 cc 48 8b 34 24 48 8d 04 49 48 c1
RSP: 0000:ffffc900000c7e80 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd8
RAX: ffff88884eaaa500 RBX: ffffffff822bba00 RCX: 000000000000001f
RDX: 0000000000000000 RSI: 000000002819aad9 RDI: 0000000000000000
RBP: ffff88884eab4548 R08: 00002728d4ce484e R09: 0000000000000005
R10: 0000000000000031 R11: ffff88884eaa92c0 R12: 0000000000000001
R13: ffffffff822bba60 R14: ffffffff822bba78 R15: 00002728d4c27551
 cpuidle_enter+0x29/0x40
 do_idle+0x1c4/0x240
 cpu_startup_entry+0x19/0x20
 start_secondary+0x176/0x1d0
 secondary_startup_64+0xb6/0xc0
Modules linked in: rpcsec_gss_krb5 veth tun xt_nat xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_nat nf_nat br_netfilter bridge stp llc xt_tcpudp xt_state msr xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_nhlt snd_hda_codec eeepc_wmi x86_pkg_temp_thermal asus_wmi intel_powerclamp battery sparse_keymap coretemp iTCO_wdt iTCO_vendor_support mei_hdcp rfkill snd_hda_core wmi_bmof snd_hwdep mxm_wmi kvm snd_pcm snd_timer igb irqbypass intel_cstate intel_uncore snd intel_rapl_perf pcspkr i2c_i801 soundcore dca mei_me joydev mousedev mei input_leds wmi evdev mac_hid nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables x_tables btrfs libcrc32c crc32c_generic xor raid6_pq dm_crypt dm_mod hid_logitech_hidpp hid_logitech_dj hid_generic sd_mod usbhid hid crct10dif_pclmul crc32_pclmul
 crc32c_intel ghash_clmulni_intel aesni_intel ahci libahci crypto_simd cryptd glue_helper libata xhci_pci xhci_hcd scsi_mod i915 intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart
CR2: 0000000000000010
Comment 10 Zoé B 2020-06-12 14:10:30 UTC
(In reply to wolf from comment #8)
> Possible workaround based on [0] is:

Our machine with this workaround crashed after 3 days.
Comment 11 Andreas Kurth 2020-06-14 08:48:50 UTC
Contrary to the "Kernel Version" tag above, this bug was introduced with kernel 5.6.14.

I have been hit by this too, on a Debian system. The machine is running Docker and several other services, and it reproducibly stops working after several hours. I tested all kernels up to 5.7.1, all showing the same behaviour. Downgrading to 5.6.13 solves the problem.

Trying 5.7.1 with e2d928d5ee43f372618a9f98b0c73674717f2a2c reverted now.
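For anyone who wants to repeat this, a rough sketch of the revert-and-build steps (assuming a git checkout of the stable tree and an existing kernel config; adjust for your distribution):

git checkout v5.7.1
git revert --no-edit e2d928d5ee43f372618a9f98b0c73674717f2a2c
make olddefconfig
make -j"$(nproc)"
# then install the modules and kernel image as usual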
Comment 12 Daniel Sonck 2020-06-14 17:44:27 UTC
I can confirm Andreas Kurth's observation as well. My kernel crashed regularly with 5.7.2, but after reverting to 5.6.13 the problem does not occur.

I can add that I do not have anything related to IPAddressDeny; however, the VMs I run on the crashing host do show bpf-prog running. They are part of a Kubernetes cluster.
Comment 13 Arne Bockholdt 2020-06-15 09:17:35 UTC
We have the same problem with all kernels past 5.6.13 on two different servers. Both servers run VMs virtualized with QEMU/KVM, plus some Docker containers.

Here's the dmesg of a crash on one of the servers: 

[106909.018431][ T4042] BUG: kernel NULL pointer dereference, address: 0000000000000010
[106909.020070][ T4042] #PF: supervisor read access in kernel mode
[106909.021678][ T4042] #PF: error_code(0x0000) - not-present page
[106909.023550][ T4042] PGD 0 P4D 0 
[106909.025622][ T4042] Oops: 0000 [#1] SMP PTI
[106909.027662][ T4042] CPU: 6 PID: 4042 Comm: vhost-3998 Not tainted 5.7.1-serverv4 #336
[106909.029720][ T4042] Hardware name: Supermicro A1SAi/A1SRi, BIOS 2.2 11/22/2019
[106909.031802][ T4042] RIP: 0010:__cgroup_bpf_run_filter_skb+0xe7/0x3d0
[106909.033912][ T4042] Code: 89 87 c8 00 00 00 48 01 c8 48 89 5c 24 18 49 89 47 50 41 83 fd 01 0f 84 8b 01 00 00 44 89 ea 48 8d 84 d6 30 06 00 00 48 8b 00 <4c> 8b 70 10 4c 8d 68 10 4d 85 f6 0f 84 b7 02 00 00 49 8d 47 30 bb
[106909.039815][ T4042] RSP: 0018:ffffa90f0160f7f8 EFLAGS: 00010297
[106909.042990][ T4042] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000003c
[106909.046252][ T4042] RDX: 0000000000000000 RSI: ffff9b3d6b34d000 RDI: ffff9b3d6cbbb480
[106909.049508][ T4042] RBP: ffff9b3d69267000 R08: 000000000000b403 R09: 0000000000000000
[106909.052783][ T4042] R10: 0000000000000000 R11: 00000000712aa8c0 R12: 0000000000000006
[106909.056057][ T4042] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9b3d69267000
[106909.059322][ T4042] FS:  0000000000000000(0000) GS:ffff9b3d6fd80000(0000) knlGS:0000000000000000
[106909.062623][ T4042] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[106909.065923][ T4042] CR2: 0000000000000010 CR3: 000000043ef76000 CR4: 00000000001026e0
[106909.069247][ T4042] Call Trace:
[106909.072534][ T4042]  sk_filter_trim_cap+0xd7/0x210
[106909.075806][ T4042]  ? tcp_v4_inbound_md5_hash+0x4c/0x160
[106909.079059][ T4042]  tcp_v4_rcv+0xb52/0xd60
[106909.082281][ T4042]  ip_protocol_deliver_rcu+0x2b/0x1b0
[106909.085493][ T4042]  ip_local_deliver_finish+0x44/0x50
[106909.088692][ T4042]  ip_local_deliver+0xe5/0xf0
[106909.091877][ T4042]  ? ip_protocol_deliver_rcu+0x1b0/0x1b0
[106909.095103][ T4042]  ip_rcv+0xbc/0xd0
[106909.098290][ T4042]  ? ip_rcv_finish_core.isra.0+0x410/0x410
[106909.101492][ T4042]  __netif_receive_skb_one_core+0x87/0xa0
[106909.104720][ T4042]  netif_receive_skb+0x46/0x150
[106909.107911][ T4042]  br_pass_frame_up+0x133/0x150
[106909.111047][ T4042]  ? fdb_find_rcu+0xf2/0x170
[106909.114128][ T4042]  br_handle_frame_finish+0x316/0x440
[106909.117193][ T4042]  br_handle_frame+0x241/0x370
[106909.120242][ T4042]  ? br_pass_frame_up+0x150/0x150
[106909.123235][ T4042]  __netif_receive_skb_core+0x2ee/0xef0
[106909.126201][ T4042]  __netif_receive_skb_one_core+0x3d/0xa0
[106909.129139][ T4042]  netif_receive_skb+0x46/0x150
[106909.132041][ T4042]  ? __build_skb+0x1f/0x50
[106909.134910][ T4042]  tun_sendmsg+0x459/0x85e
[106909.137791][ T4042]  vhost_tx_batch.isra.0+0x5d/0xd0 [vhost_net]
[106909.140730][ T4042]  handle_tx_copy+0x15a/0x550 [vhost_net]
[106909.143658][ T4042]  handle_tx+0xa5/0xe0 [vhost_net]
[106909.146553][ T4042]  vhost_worker+0xac/0x100 [vhost]
[106909.149459][ T4042]  kthread+0xfb/0x130
[106909.152321][ T4042]  ? __vhost_add_used_n+0x210/0x210 [vhost]
[106909.155170][ T4042]  ? kthread_park+0x90/0x90
[106909.157948][ T4042]  ret_from_fork+0x35/0x40

The other server crashes too, but with completely different output; those crashes are also reproducible with 5.6.14+:

[ 8959.334989][  T264] ata6.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x6 frozen
[ 8959.335005][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335017][  T264] ata6.00: cmd 60/08:00:58:4c:bb/00:00:03:00:00/40 tag 0 ncq dma 4096 in
[ 8959.335017][  T264]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335025][  T264] ata6.00: status: { DRDY }
[ 8959.335032][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335042][  T264] ata6.00: cmd 60/08:08:b0:4c:98/00:00:07:00:00/40 tag 1 ncq dma 4096 in
[ 8959.335042][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335048][  T264] ata6.00: status: { DRDY }
[ 8959.335054][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335063][  T264] ata6.00: cmd 60/08:10:f0:b5:0a/00:00:09:00:00/40 tag 2 ncq dma 4096 in
[ 8959.335063][  T264]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335068][  T264] ata6.00: status: { DRDY }
[ 8959.335073][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335084][  T264] ata6.00: cmd 60/18:18:78:fe:1f/00:00:00:00:00/40 tag 3 ncq dma 12288 in
[ 8959.335084][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335091][  T264] ata6.00: status: { DRDY }
[ 8959.335096][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335106][  T264] ata6.00: cmd 60/08:20:f0:54:bf/00:00:0c:00:00/40 tag 4 ncq dma 4096 in
[ 8959.335106][  T264]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335111][  T264] ata6.00: status: { DRDY }
[ 8959.335116][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335124][  T264] ata6.00: cmd 60/30:28:98:fe:1f/00:00:00:00:00/40 tag 5 ncq dma 24576 in
[ 8959.335124][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335130][  T264] ata6.00: status: { DRDY }
[ 8959.335135][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335145][  T264] ata6.00: cmd 60/20:30:d0:fe:1f/00:00:00:00:00/40 tag 6 ncq dma 16384 in
[ 8959.335145][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335150][  T264] ata6.00: status: { DRDY }
[ 8959.335155][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335165][  T264] ata6.00: cmd 60/20:38:c0:cc:54/00:00:07:00:00/40 tag 7 ncq dma 16384 in
[ 8959.335165][  T264]          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335170][  T264] ata6.00: status: { DRDY }
[ 8959.335175][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335183][  T264] ata6.00: cmd 60/08:40:10:ff:1f/00:00:00:00:00/40 tag 8 ncq dma 4096 in
[ 8959.335183][  T264]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335187][  T264] ata6.00: status: { DRDY }
[ 8959.335191][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335198][  T264] ata6.00: cmd 60/08:48:a8:4c:98/00:00:07:00:00/40 tag 9 ncq dma 4096 in
[ 8959.335198][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335202][  T264] ata6.00: status: { DRDY }
[ 8959.335206][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335213][  T264] ata6.00: cmd 60/28:50:20:ff:1f/00:00:00:00:00/40 tag 10 ncq dma 20480 in
[ 8959.335213][  T264]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335217][  T264] ata6.00: status: { DRDY }
[ 8959.335221][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335228][  T264] ata6.00: cmd 60/08:58:f8:b5:0a/00:00:09:00:00/40 tag 11 ncq dma 4096 in
[ 8959.335228][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335233][  T264] ata6.00: status: { DRDY }
[ 8959.335237][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335244][  T264] ata6.00: cmd 60/20:60:58:ff:1f/00:00:00:00:00/40 tag 12 ncq dma 16384 in
[ 8959.335244][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335249][  T264] ata6.00: status: { DRDY }
[ 8959.335253][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335261][  T264] ata6.00: cmd 60/20:68:70:fd:1f/00:00:00:00:00/40 tag 13 ncq dma 16384 in
[ 8959.335261][  T264]          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335265][  T264] ata6.00: status: { DRDY }
[ 8959.335269][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335276][  T264] ata6.00: cmd 60/08:70:80:4c:98/00:00:07:00:00/40 tag 14 ncq dma 4096 in
[ 8959.335276][  T264]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335280][  T264] ata6.00: status: { DRDY }
[ 8959.335284][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335291][  T264] ata6.00: cmd 60/28:78:98:fd:1f/00:00:00:00:00/40 tag 15 ncq dma 20480 in
[ 8959.335291][  T264]          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335295][  T264] ata6.00: status: { DRDY }
[ 8959.335300][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335307][  T264] ata6.00: cmd 60/18:80:c8:fd:1f/00:00:00:00:00/40 tag 16 ncq dma 12288 in
[ 8959.335307][  T264]          res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335311][  T264] ata6.00: status: { DRDY }
[ 8959.335316][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335323][  T264] ata6.00: cmd 60/08:88:e8:b5:0a/00:00:09:00:00/40 tag 17 ncq dma 4096 in
[ 8959.335323][  T264]          res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335327][  T264] ata6.00: status: { DRDY }
[ 8959.335331][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335338][  T264] ata6.00: cmd 60/08:90:e8:fd:1f/00:00:00:00:00/40 tag 18 ncq dma 4096 in
[ 8959.335338][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335342][  T264] ata6.00: status: { DRDY }
[ 8959.335346][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335354][  T264] ata6.00: cmd 60/08:98:e8:54:bf/00:00:0c:00:00/40 tag 19 ncq dma 4096 in
[ 8959.335354][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335358][  T264] ata6.00: status: { DRDY }
[ 8959.335362][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335369][  T264] ata6.00: cmd 60/08:a0:f8:fd:1f/00:00:00:00:00/40 tag 20 ncq dma 4096 in
[ 8959.335369][  T264]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335373][  T264] ata6.00: status: { DRDY }
[ 8959.335377][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335383][  T264] ata6.00: cmd 60/18:a8:a8:cc:54/00:00:07:00:00/40 tag 21 ncq dma 12288 in
[ 8959.335383][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335387][  T264] ata6.00: status: { DRDY }
[ 8959.335392][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335399][  T264] ata6.00: cmd 60/08:b0:20:fe:1f/00:00:00:00:00/40 tag 22 ncq dma 4096 in
[ 8959.335399][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335403][  T264] ata6.00: status: { DRDY }
[ 8959.335407][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335413][  T264] ata6.00: cmd 60/20:b8:88:4c:98/00:00:07:00:00/40 tag 23 ncq dma 16384 in
[ 8959.335413][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335417][  T264] ata6.00: status: { DRDY }
[ 8959.335421][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335429][  T264] ata6.00: cmd 60/30:c0:80:ff:1f/00:00:00:00:00/40 tag 24 ncq dma 24576 in
[ 8959.335429][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335433][  T264] ata6.00: status: { DRDY }
[ 8959.335438][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335445][  T264] ata6.00: cmd 60/08:c8:f8:54:bf/00:00:0c:00:00/40 tag 25 ncq dma 4096 in
[ 8959.335445][  T264]          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335449][  T264] ata6.00: status: { DRDY }
[ 8959.335453][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335459][  T264] ata6.00: cmd 60/28:d0:b8:ff:1f/00:00:00:00:00/40 tag 26 ncq dma 20480 in
[ 8959.335459][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335463][  T264] ata6.00: status: { DRDY }
[ 8959.335467][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335474][  T264] ata6.00: cmd 60/08:d8:e0:cc:54/00:00:07:00:00/40 tag 27 ncq dma 4096 in
[ 8959.335474][  T264]          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335478][  T264] ata6.00: status: { DRDY }
[ 8959.335483][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335489][  T264] ata6.00: cmd 60/18:e0:e8:ff:1f/00:00:00:00:00/40 tag 28 ncq dma 12288 in
[ 8959.335489][  T264]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335494][  T264] ata6.00: status: { DRDY }
[ 8959.335498][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335505][  T264] ata6.00: cmd 60/08:e8:00:4c:bb/00:00:03:00:00/40 tag 29 ncq dma 4096 in
[ 8959.335505][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335508][  T264] ata6.00: status: { DRDY }
[ 8959.335512][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335520][  T264] ata6.00: cmd 60/20:f0:10:4c:bb/00:00:03:00:00/40 tag 30 ncq dma 16384 in
[ 8959.335520][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335524][  T264] ata6.00: status: { DRDY }
[ 8959.335528][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8959.335535][  T264] ata6.00: cmd 60/20:f8:50:fe:1f/00:00:00:00:00/40 tag 31 ncq dma 16384 in
[ 8959.335535][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8959.335539][  T264] ata6.00: status: { DRDY }
[ 8959.335546][  T264] ata6: hard resetting link
[ 8959.810909][  T264] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 8959.816788][  T264] ata6.00: configured for UDMA/133
[ 8959.816866][  T264] ata6: EH complete
[ 8992.114805][  T264] ata6.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x6 frozen
[ 8992.114815][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.114821][  T264] ata6.00: cmd 60/10:00:70:d4:23/00:00:0a:00:00/40 tag 0 ncq dma 8192 in
[ 8992.114821][  T264]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.114830][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.114835][  T264] ata6.00: cmd 60/10:08:88:d4:23/00:00:0a:00:00/40 tag 1 ncq dma 8192 in
[ 8992.114835][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.114839][  T264] ata6.00: status: { DRDY }
[ 8992.114842][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.114850][  T264] ata6.00: cmd 60/28:10:a0:d4:23/00:00:0a:00:00/40 tag 2 ncq dma 20480 in
[ 8992.114850][  T264]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.114855][  T264] ata6.00: status: { DRDY }
[ 8992.114858][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.114862][  T264] ata6.00: cmd 60/18:18:d0:d4:23/00:00:0a:00:00/40 tag 3 ncq dma 12288 in
[ 8992.114862][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.114867][  T264] ata6.00: status: { DRDY }
[ 8992.114870][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.114878][  T264] ata6.00: cmd 60/08:20:f0:d4:23/00:00:0a:00:00/40 tag 4 ncq dma 4096 in
[ 8992.114878][  T264]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.114881][  T264] ata6.00: status: { DRDY }
[ 8992.114886][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.114891][  T264] ata6.00: cmd 60/10:28:10:d5:23/00:00:0a:00:00/40 tag 5 ncq dma 8192 in
[ 8992.114891][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.114895][  T264] ata6.00: status: { DRDY }
[ 8992.114898][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.114902][  T264] ata6.00: cmd 60/28:30:28:d5:23/00:00:0a:00:00/40 tag 6 ncq dma 20480 in
[ 8992.114902][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.114908][  T264] ata6.00: status: { DRDY }
[ 8992.114911][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.114918][  T264] ata6.00: cmd 60/18:38:58:d5:23/00:00:0a:00:00/40 tag 7 ncq dma 12288 in
[ 8992.114918][  T264]          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.114924][  T264] ata6.00: status: { DRDY }
[ 8992.114927][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.114933][  T264] ata6.00: cmd 60/18:40:c8:d5:23/00:00:0a:00:00/40 tag 8 ncq dma 12288 in
[ 8992.114933][  T264]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.114936][  T264] ata6.00: status: { DRDY }
[ 8992.114942][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.114947][  T264] ata6.00: cmd 60/28:48:e8:d5:23/00:00:0a:00:00/40 tag 9 ncq dma 20480 in
[ 8992.114947][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.114951][  T264] ata6.00: status: { DRDY }
[ 8992.114954][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.114959][  T264] ata6.00: cmd 60/10:50:18:d6:23/00:00:0a:00:00/40 tag 10 ncq dma 8192 in
[ 8992.114959][  T264]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.114964][  T264] ata6.00: status: { DRDY }
[ 8992.114967][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.114974][  T264] ata6.00: cmd 60/18:58:30:d6:23/00:00:0a:00:00/40 tag 11 ncq dma 12288 in
[ 8992.114974][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.114980][  T264] ata6.00: status: { DRDY }
[ 8992.114983][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.114987][  T264] ata6.00: cmd 60/10:60:50:d6:23/00:00:0a:00:00/40 tag 12 ncq dma 8192 in
[ 8992.114987][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.114992][  T264] ata6.00: status: { DRDY }
[ 8992.114998][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115003][  T264] ata6.00: cmd 60/08:68:90:d6:23/00:00:0a:00:00/40 tag 13 ncq dma 4096 in
[ 8992.115003][  T264]          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115008][  T264] ata6.00: status: { DRDY }
[ 8992.115010][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115015][  T264] ata6.00: cmd 60/28:70:a0:d6:23/00:00:0a:00:00/40 tag 14 ncq dma 20480 in
[ 8992.115015][  T264]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115020][  T264] ata6.00: status: { DRDY }
[ 8992.115023][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115031][  T264] ata6.00: cmd 60/20:78:68:d6:7c/00:00:0a:00:00/40 tag 15 ncq dma 16384 in
[ 8992.115031][  T264]          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115033][  T264] ata6.00: status: { DRDY }
[ 8992.115039][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115044][  T264] ata6.00: cmd 60/10:80:d0:d6:23/00:00:0a:00:00/40 tag 16 ncq dma 8192 in
[ 8992.115044][  T264]          res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115050][  T264] ata6.00: status: { DRDY }
[ 8992.115055][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115060][  T264] ata6.00: cmd 60/08:88:f0:d6:23/00:00:0a:00:00/40 tag 17 ncq dma 4096 in
[ 8992.115060][  T264]          res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115065][  T264] ata6.00: status: { DRDY }
[ 8992.115068][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115073][  T264] ata6.00: cmd 60/28:90:00:d7:23/00:00:0a:00:00/40 tag 18 ncq dma 20480 in
[ 8992.115073][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115078][  T264] ata6.00: status: { DRDY }
[ 8992.115081][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115089][  T264] ata6.00: cmd 60/08:98:30:d7:23/00:00:0a:00:00/40 tag 19 ncq dma 4096 in
[ 8992.115089][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115091][  T264] ata6.00: status: { DRDY }
[ 8992.115097][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115101][  T264] ata6.00: cmd 60/10:a0:50:d7:23/00:00:0a:00:00/40 tag 20 ncq dma 8192 in
[ 8992.115101][  T264]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115104][  T264] ata6.00: status: { DRDY }
[ 8992.115107][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115112][  T264] ata6.00: cmd 60/10:a8:50:d7:7c/00:00:0a:00:00/40 tag 21 ncq dma 8192 in
[ 8992.115112][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115114][  T264] ata6.00: status: { DRDY }
[ 8992.115123][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115127][  T264] ata6.00: cmd 60/08:b0:68:d7:7c/00:00:0a:00:00/40 tag 22 ncq dma 4096 in
[ 8992.115127][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115130][  T264] ata6.00: status: { DRDY }
[ 8992.115137][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115142][  T264] ata6.00: cmd 60/28:b8:78:d7:7c/00:00:0a:00:00/40 tag 23 ncq dma 20480 in
[ 8992.115142][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115147][  T264] ata6.00: status: { DRDY }
[ 8992.115150][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115154][  T264] ata6.00: cmd 60/20:c0:70:d7:23/00:00:0a:00:00/40 tag 24 ncq dma 16384 in
[ 8992.115154][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115160][  T264] ata6.00: status: { DRDY }
[ 8992.115164][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115170][  T264] ata6.00: cmd 60/20:c8:98:d7:23/00:00:0a:00:00/40 tag 25 ncq dma 16384 in
[ 8992.115170][  T264]          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115175][  T264] ata6.00: status: { DRDY }
[ 8992.115178][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115183][  T264] ata6.00: cmd 60/08:d0:00:d4:23/00:00:0a:00:00/40 tag 26 ncq dma 4096 in
[ 8992.115183][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115188][  T264] ata6.00: status: { DRDY }
[ 8992.115191][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115198][  T264] ata6.00: cmd 60/28:d8:10:d4:23/00:00:0a:00:00/40 tag 27 ncq dma 20480 in
[ 8992.115198][  T264]          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115200][  T264] ata6.00: status: { DRDY }
[ 8992.115206][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115211][  T264] ata6.00: cmd 60/20:e0:c0:d7:23/00:00:0a:00:00/40 tag 28 ncq dma 16384 in
[ 8992.115211][  T264]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115214][  T264] ata6.00: status: { DRDY }
[ 8992.115219][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115224][  T264] ata6.00: cmd 60/08:e8:f8:d7:23/00:00:0a:00:00/40 tag 29 ncq dma 4096 in
[ 8992.115224][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115228][  T264] ata6.00: status: { DRDY }
[ 8992.115239][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115244][  T264] ata6.00: cmd 60/20:f0:18:58:22/00:00:0b:00:00/40 tag 30 ncq dma 16384 in
[ 8992.115244][  T264]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115246][  T264] ata6.00: status: { DRDY }
[ 8992.115252][  T264] ata6.00: failed command: READ FPDMA QUEUED
[ 8992.115259][  T264] ata6.00: cmd 60/28:f8:40:d4:23/00:00:0a:00:00/40 tag 31 ncq dma 20480 in
[ 8992.115259][  T264]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 8992.115262][  T264] ata6.00: status: { DRDY }
[ 8992.115268][  T264] ata6: hard resetting link
[ 8992.590750][  T264] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 8992.596667][  T264] ata6.00: configured for UDMA/133
[ 8992.596741][  T264] ata6: EH complete


I know the second log snippet is completely unrelated to the first one, but both servers crash with 5.6.14+ and are stable with all kernel versions before that.
Comment 14 karsten.elfenbein 2020-06-27 09:16:00 UTC
While observing the issue I noticed that the crash might be related to my use of the NFS4 server on the host. Sometimes the host would crash after I shut down my client.
Today it crashed after I used NFS and then took a small break.
So I tried to stress NFS a bit. Luckily, I got a reproduction very quickly: the host crashes within a few cycles of the script loop.

#!/bin/bash
while true
do
        date
        mount /mnt/nfs
        ls -l /mnt/nfs
        umount /mnt/nfs
        echo sleeping...
        sleep 1
done

The share options in fstab are nfs4 with user,intr,async,noauto.
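For reference, a hypothetical fstab entry matching these options (server name and export path are placeholders, not from my actual setup):

nfsserver:/export  /mnt/nfs  nfs4  user,intr,async,noauto  0  0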

Anyone else running NFS4 on their affected machines?
Comment 15 Zoé B 2020-06-27 09:22:38 UTC
Our machine which is affected by this is used as an NFS server.
Comment 16 Hao Wei Tee 2020-06-27 10:19:53 UTC
There is a patch on the netdev mailing list that appears to partially alleviate the problem, although it is not a full fix. As I understand it, the problem is incorrect refcounting somewhere, exacerbated by commit 090e28b229af92dc5b40786ca673999d59e73056; refcounting bugs tend to be rather hard to debug: https://www.spinics.net/lists/netdev/msg660252.html

It would be great if you could try the patch and report whether you can still reproduce the crash. Alternatively, just send the repro over to the mailing list; it might help.
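If it helps, a rough sketch of applying the patch to a kernel tree (assuming you saved the patch from the thread above as fix.mbox; the file name is just an example):

git am fix.mbox          # apply on top of the kernel source tree
# or, for a plain diff:  patch -p1 < fix.patch
# then rebuild and boot the patched kernel as in the revert sketch above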

For me, it seems to happen sporadically when I use Docker. Haven't figured out a reliable reproduction yet.
Comment 17 Thomas Clark 2020-06-27 18:08:45 UTC
I have two affected servers, both running nfsd and Docker containers. One of the servers crashes within hours; the other runs about a day before crashing. The server that crashes quickly is both a busier Docker server and a busier NFS server than the slower crasher; however, the NFS load difference is the most significant. The fast crasher also has a number of automount clients, with much more frequent mount/umount cycles.

I found the following snippet in the log of the "fast crasher" about 130 minutes prior to a crash, so I don't know if it is directly related or not.

Jun 07 06:24:09 kernel: NFSD: Using nfsdcld client tracking operations.
Jun 07 06:24:09 kernel: NFSD: starting 90-second grace period (net f00000a8)
Jun 07 06:24:12 kernel: ------------[ cut here ]------------
Jun 07 06:24:12 kernel: WARNING: CPU: 1 PID: 2018 at fs/nfsd/nfs4state.c:4873 nfsd4_process_open2+0x1291/0x1610 [nfsd]
Jun 07 06:24:12 kernel: Modules linked in: [modules list removed]
Jun 07 06:24:12 kernel:  [modules list removed]
Jun 07 06:24:12 kernel: CPU: 1 PID: 2018 Comm: nfsd Tainted: G           OE     5.6.15-200.fc31.x86_64 #1
Jun 07 06:24:12 kernel: Hardware name: Supermicro X10SAE/X10SAE, BIOS 3.1 02/23/2018
Jun 07 06:24:12 kernel: RIP: 0010:nfsd4_process_open2+0x1291/0x1610 [nfsd]
Jun 07 06:24:12 kernel: Code: 9a c0 c6 07 00 0f 1f 40 00 8b 54 24 48 85 d2 0f 85 b0 02 00 00 8b 05 22 19 e9 ff 4c 8b 64 24 58 89 c1 83 e1 10 e9 f4 f8 ff ff <0f> 0b e9 5a f4 ff ff 0f 0b e9 1b f2 ff ff 48 c7 44 24 50 00 00 00
Jun 07 06:24:12 kernel: RSP: 0018:ffffbd97811f7cc0 EFLAGS: 00010246
Jun 07 06:24:12 kernel: RAX: 0000000000000000 RBX: ffff98fcc16611e0 RCX: ffff98fcb10b6a50
Jun 07 06:24:12 kernel: RDX: 0000000000000001 RSI: ffff98fcb7b05498 RDI: ffff98fcb7b05444
Jun 07 06:24:12 kernel: RBP: ffffbd97811f7da0 R08: ffff98fcc1b020e0 R09: ffff98fcb10b6a50
Jun 07 06:24:12 kernel: R10: ffff98fcc1b02098 R11: 0000000000000000 R12: ffff98fcb7b05444
Jun 07 06:24:12 kernel: R13: ffff98fd39c04398 R14: ffff98fcb7b05440 R15: ffff98fcb7b05440
Jun 07 06:24:12 kernel: FS:  0000000000000000(0000) GS:ffff98fd3fa40000(0000) knlGS:0000000000000000
Jun 07 06:24:12 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 07 06:24:12 kernel: CR2: 00007f41dfc0e790 CR3: 000000014660a004 CR4: 00000000001606e0
Jun 07 06:24:12 kernel: Call Trace:
Jun 07 06:24:12 kernel:  ? nfsd_permission+0x63/0xe0 [nfsd]
Jun 07 06:24:12 kernel:  ? fh_verify+0x133/0x550 [nfsd]
Jun 07 06:24:12 kernel:  nfsd4_open+0x35b/0x730 [nfsd]
Jun 07 06:24:12 kernel:  nfsd4_proc_compound+0x3f2/0x760 [nfsd]
Jun 07 06:24:12 kernel:  ? nfs4svc_decode_compoundargs+0x407/0x4c0 [nfsd]
Jun 07 06:24:12 kernel:  nfsd_dispatch+0xc1/0x200 [nfsd]
Jun 07 06:24:12 kernel:  svc_process_common+0x459/0x6d0 [sunrpc]
Jun 07 06:24:12 kernel:  ? svc_recv+0x313/0x9c0 [sunrpc]
Jun 07 06:24:12 kernel:  ? nfsd_svc+0x2d0/0x2d0 [nfsd]
Jun 07 06:24:12 kernel:  svc_process+0xd4/0x110 [sunrpc]
Jun 07 06:24:12 kernel:  nfsd+0xe3/0x140 [nfsd]
Jun 07 06:24:12 kernel:  kthread+0xf9/0x130
Jun 07 06:24:12 kernel:  ? nfsd_destroy+0x50/0x50 [nfsd]
Jun 07 06:24:12 kernel:  ? kthread_park+0x90/0x90
Jun 07 06:24:12 kernel:  ret_from_fork+0x35/0x40
Jun 07 06:24:12 kernel: ---[ end trace 1d0afbeba344bcd4 ]---
Comment 18 karsten.elfenbein 2020-06-28 13:42:43 UTC
I'm running the patched kernel (https://www.spinics.net/lists/netdev/msg660252.html) now.
The Arch Linux stock kernel (5.7.6) crashed within 30 seconds.
The patched vanilla 5.7.6 has been running for 45 minutes already.
Comment 19 salvador.joseluis 2020-06-29 02:00:40 UTC
One full week without any crash running 5.4.47 with the patch (https://www.spinics.net/lists/netdev/msg660252.html). Without the patch, the kernel crashed in less than 12 hours.
Comment 20 Thomas Clark 2020-07-05 21:22:07 UTC
The patch referenced above seems to be successful. Does anyone know when it might make it into the kernel?
Comment 21 Hao Wei Tee 2020-07-06 09:03:33 UTC
Patch is being discussed here: https://patchwork.ozlabs.org/project/netdev/patch/20200616180352.18602-1-xiyou.wangcong@gmail.com/

Sorry for my outdated information earlier.
Comment 22 Hao Wei Tee 2020-07-12 11:16:54 UTC
https://patchwork.ozlabs.org/project/netdev/patch/20200702185256.17917-1-xiyou.wangcong@gmail.com/

Seems like it will be merged into mainline soon.
Comment 23 Hao Wei Tee 2020-07-17 15:39:52 UTC
"group: fix cgroup_sk_alloc() for sk_clone_lock()" was merged into 5.8-rc5 about a few hours after my previous comment: https://lore.kernel.org/lkml/CAHk-=wgX5+Q_trdMPaaQZmko0Og_eqAYoyMa_8S3ie+1Us6rkw@mail.gmail.com/
Comment 24 Thomas Clark 2020-07-21 14:15:22 UTC
Any chance it will make it into the current 5.7 series? It's kind of a major issue to leave open for an additional few weeks.
Comment 25 Hao Wei Tee 2020-07-21 14:19:21 UTC
It should be in 5.7.10, which will probably be released in a few more days. It's currently in the stable queue[1].

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree/queue-5.7/cgroup-fix-cgroup_sk_alloc-for-sk_clone_lock.patch
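Once 5.7.10 is released, a generic way to confirm the fix actually landed (assuming a git checkout of the stable tree with up-to-date tags):

git log --oneline v5.7.9..v5.7.10 | grep -i cgroup_sk_alloc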
Comment 26 salvador.joseluis 2020-07-22 07:52:57 UTC
It seems that version 5.4.53 includes the fix:

94886c86e833dbc8995202b6c6aaff592b7abd24
https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.4.53
Comment 27 karsten.elfenbein 2020-07-24 14:22:31 UTC
The fix is in 5.7.10. I tested for multiple hours with the mount/umount script and saw no crashes.

Marking the bug as resolved.
Thanks to all involved.
