Bug 210097

Summary: 5.9.6 hard lockup with infiniband
Product: File System
Reporter: SimplyCorbett (thezombiehunter)
Component: NFS
Assignee: Tejun Heo (tj)
Status: NEW
Severity: high
CC: cel
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.9.6
Subsystem:
Regression: No
Bisected commit-id:
Attachments: lshw, Kernel log, Kernel config

Description SimplyCorbett 2020-11-07 12:15:18 UTC
Created attachment 293533 [details]
lshw

Upgraded from 5.8.18 to 5.9.6 and I'm now getting hard lockups.

Setup:

3900x
128GB RAM
2080 ti
Several SATA devices with both f2fs and ext4, all mounted with cryptsetup
ConnectX-3 infiniband FDR

I have two ConnectX-3 cards connected directly to each other (no switch), with opensm running as the subnet manager.

On 5.8.18 this works fine. On 5.9.6 I am getting hard lockups.

Replication:

CLIENT:

/etc/fstab
172.16.1.20:/mnt/Heavy-1 /mnt/Ares-3TB-1 nfs4 _netdev,auto,proto=rdma,port=20049,hard,intr,rsize=65536,wsize=65536,noatime     0       0
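
Before reproducing, it can help to confirm that the client mount is really using the RDMA transport by checking its options in /proc/mounts. A minimal sketch, assuming the mount point from the fstab line above. The helper name is_rdma_mount is made up for illustration, and the optional second argument exists only so the check can be run against a saved copy of the mounts table.

```shell
# Hypothetical helper: succeed if the given mount point is an NFS mount
# whose options include proto=rdma.
# $1 = mount point, $2 = mounts table (defaults to /proc/mounts)
is_rdma_mount() {
    mountpoint=$1
    table=${2:-/proc/mounts}
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk -v mp="$mountpoint" '
        $2 == mp && $3 ~ /^nfs/ && ("," $4 ",") ~ /,proto=rdma,/ { found = 1 }
        END { exit !found }
    ' "$table"
}
```

Usage: is_rdma_mount /mnt/Ares-3TB-1 && echo "mounted over RDMA"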

SERVER:

echo rdma 20049 > /proc/fs/nfsd/portlist

BOTH:

/etc/init.d/opensm stop && modprobe ib_ipoib && modprobe ib_umad && modprobe ib_ipoib && modprobe ib_srp && modprobe ib_uverbs && modprobe rdma_ucm && modprobe mlx4_ib && modprobe ib_core && modprobe svcrdma && modprobe xprtrdma && /etc/init.d/net.ib0 start && /etc/init.d/net.ib1 start && /etc/init.d/opensm start
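
The module-loading part of the chain above can also be written as a loop, so that a module that fails to load is named instead of silently stopping the && chain. This is a sketch only: it uses the module names from the command above (ib_ipoib appears twice there and is listed once here), it does not touch opensm or the net.ib* init scripts, and the MODPROBE variable is an assumption added so the loop can be dry-run.

```shell
# Module names taken from the command above, duplicates removed.
IB_MODULES="ib_ipoib ib_umad ib_srp ib_uverbs rdma_ucm mlx4_ib ib_core svcrdma xprtrdma"

# Hypothetical helper: load each module in order and stop at the first failure.
# MODPROBE can be overridden (e.g. MODPROBE=echo) for a dry run.
load_ib_modules() {
    for mod in $IB_MODULES; do
        if ! ${MODPROBE:-modprobe} "$mod"; then
            echo "failed to load $mod" >&2
            return 1
        fi
    done
}
```

Usage: MODPROBE=echo load_ib_modules prints the module names in order without loading anything; run load_ib_modules as root to load them.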

How to replicate:

Transfer a 30GB .img file from server to client, wait five to ten seconds, watch the kernel go nope and crash.

Logs, lshw attached.
Comment 1 SimplyCorbett 2020-11-07 12:18:07 UTC
Created attachment 293535 [details]
Kernel log
Comment 2 SimplyCorbett 2020-11-07 12:19:03 UTC
Created attachment 293537 [details]
Kernel config
Comment 3 SimplyCorbett 2020-11-27 17:17:59 UTC
As of 5.9.11 the kernel is no longer hard crashing and files successfully transfer.

However, whichever program is transferring the files becomes unresponsive until the transfer finishes, which did not happen on kernel 5.8.