Bug 73211
| Summary: | Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection | | |
|---|---|---|---|
| Product: | File System | Reporter: | Chuck Lever (chucklever) |
| Component: | NFS | Assignee: | Trond Myklebust (trondmy) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | szg00000 |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.10.17 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Chuck Lever
2014-03-30 18:03:09 UTC
Status 5 is IB_WC_WR_FLUSH_ERR, but the reported opcode value is not valid.

Steve Wise says the user verbs man page states:

> Not all wc attributes are always valid. If the completion
> status is other than IBV_WC_SUCCESS, only the following
> attributes are valid: wr_id, status, qp_num, and vendor_err.

The mlx4 provider (among others) does not set the ib_wc.opcode field for error completions. A Mellanox engineer has confirmed this. Thus rpcrdma_event_process() cannot rely on the contents of ib_wc.opcode when processing error completions.

This logic was introduced by:

    commit 5c635e09cec0feeeb310968e51dad01040244851
    Author: Tom Tucker <tom@ogc.us>
    Date:   Wed Feb 9 19:45:34 2011 +0000

        RPCRDMA: Fix FRMR registration/invalidate handling.

A suggestion was made to remove the FAST_REG_MR and LOCAL_INV completion logic introduced in this commit, but the commit appears to address a real problem. To be doubly sure, I confirmed that the InfiniBand Architecture Specification 1.2.1, pp. 631-632 says:
> If the status of the operation that generates the Work Completion
> is anything other than success, the contents of the Work Completion
> are un-defined except as noted below.
The exceptions, which are valid no matter what the completion status is, are wr_id, completion status, and freed resource count.
The basic problem with 5c635e09 is that the WR cookie (wr_id) is always a pointer, but sometimes it points to struct rpcrdma_rep and sometimes to struct rpcrdma_mw. The ib_wc.opcode field is currently used to distinguish these cases, but we now know this field is not reliable for error completions.

One solution is to mark these structures so they can be distinguished during completion processing. A magic number in rpcrdma_rep might accomplish that, and would be a straightforward fix.

A more radical solution would split the single CQ we use now into a send CQ and a receive CQ. The receive CQ would handle completions that use rpcrdma_rep, and the send CQ would handle completions that use rpcrdma_mw.

I have a Linux NFS/RDMA client here with ConnectX-2/mlx4. I have not successfully reproduced the panic when pulling the IB cable under load. However, I can see the bad rep and opcode as reported by the OP.

I've written a couple of patches to split the completion queue. Destructive testing shows they do not reproduce the bad rep/opcode.

klemens.senn@ims.co.at reports the completion queue split patch prevents the panic. However, connection recovery is not reliable, and sometimes the client NFS workload hangs after the link is re-connected. This appears to be a separate issue. Looking into it.