Bug 210097 - 5.9.6 hard lockup with infiniband
Summary: 5.9.6 hard lockup with infiniband
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All Linux
Importance: P1 high
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-11-07 12:15 UTC by SimplyCorbett
Modified: 2023-02-03 00:46 UTC
CC List: 1 user

See Also:
Kernel Version: 5.9.6
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lshw (40.86 KB, text/plain)
2020-11-07 12:15 UTC, SimplyCorbett
Kernel log (24.55 KB, text/plain)
2020-11-07 12:18 UTC, SimplyCorbett
Kernel config (146.84 KB, text/plain)
2020-11-07 12:19 UTC, SimplyCorbett

Description SimplyCorbett 2020-11-07 12:15:18 UTC
Created attachment 293533 [details]
lshw

Upgraded from 5.8.18 to 5.9.6 and I'm now getting hard lockups.

Setup:

Ryzen 9 3900X
128GB RAM
RTX 2080 Ti
Several SATA devices with both f2fs and ext4, all mounted with cryptsetup
ConnectX-3 InfiniBand FDR

I have two ConnectX-3 cards connected directly to each other (no switch), with opensm running as the subnet manager.

On 5.8.18 this works fine. On 5.9.6 I am getting hard lockups.

Replication:

CLIENT:

/etc/fstab
172.16.1.20:/mnt/Heavy-1 /mnt/Ares-3TB-1 nfs4 _netdev,auto,proto=rdma,port=20049,hard,intr,rsize=65536,wsize=65536,noatime     0       0

SERVER:

echo rdma 20049 > /proc/fs/nfsd/portlist
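
Before reproducing, it may be worth confirming that the server really registered the RDMA listener. The sketch below assumes that reading /proc/fs/nfsd/portlist returns one "transport port" pair per line; the helper name has_rdma_listener is made up for illustration and is not part of any NFS tooling.

```shell
# Hypothetical helper: succeed if an "rdma <port>" entry is present.
# $1: expected port number
# stdin: text in "transport port" form (assumed format of /proc/fs/nfsd/portlist)
has_rdma_listener() {
    awk -v port="$1" '$1 == "rdma" && $2 == port { found = 1 } END { exit !found }'
}
```

On the server this would be run as `has_rdma_listener 20049 < /proc/fs/nfsd/portlist && echo "rdma listener present"`.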

BOTH:

/etc/init.d/opensm stop && \
modprobe ib_ipoib && modprobe ib_umad && modprobe ib_srp && \
modprobe ib_uverbs && modprobe rdma_ucm && modprobe mlx4_ib && \
modprobe ib_core && modprobe svcrdma && modprobe xprtrdma && \
/etc/init.d/net.ib0 start && /etc/init.d/net.ib1 start && \
/etc/init.d/opensm start
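
Once the share is mounted on the client, one way to check that the mount is actually using RDMA rather than silently falling back to another transport is to look for proto=rdma in the mount options. This is only a sketch: rdma_nfs_mounts is a hypothetical helper name, and it assumes the usual /proc/mounts layout (device, mount point, filesystem type, options).

```shell
# Hypothetical helper: print the mount point of every NFS mount using RDMA.
# stdin: text in /proc/mounts format (device mountpoint fstype options ...)
rdma_nfs_mounts() {
    awk '$3 ~ /^nfs/ && $4 ~ /(^|,)proto=rdma(,|$)/ { print $2 }'
}
```

On the client this would be run as `rdma_nfs_mounts < /proc/mounts`; with the fstab entry above, /mnt/Ares-3TB-1 should be listed if RDMA is in use.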

How to replicate:

Transfer a 30GB .img file from the server to the client. Five to ten seconds into the transfer, the kernel hard locks up.

Logs, lshw attached.
Comment 1 SimplyCorbett 2020-11-07 12:18:07 UTC
Created attachment 293535 [details]
Kernel log
Comment 2 SimplyCorbett 2020-11-07 12:19:03 UTC
Created attachment 293537 [details]
Kernel config
Comment 3 SimplyCorbett 2020-11-27 17:17:59 UTC
As of 5.9.11 the kernel is no longer hard crashing and files successfully transfer.

However, whichever process is transferring the files becomes unresponsive until the transfer finishes, which did not happen on kernel 5.8.
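
To narrow down the unresponsiveness, a possible first step is to list processes in uninterruptible sleep (state D) while a transfer is running, since that is the state a process is usually in when it is blocked on I/O against a hard NFS mount. This is a sketch, not a diagnosis: d_state_filter is a hypothetical helper name, and it assumes input in the form produced by `ps -eo state=,pid=,comm=`.

```shell
# Hypothetical helper: keep only processes whose state column is "D"
# (uninterruptible sleep).
# stdin: lines of "state pid command", e.g. from: ps -eo state=,pid=,comm=
d_state_filter() {
    awk '$1 == "D"'
}
```

During a transfer this would be run as `ps -eo state=,pid=,comm= | d_state_filter`. If the copying process shows up there for the whole transfer, that would be consistent with it being blocked in the kernel rather than spinning in userspace.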
