Bug 73211 - Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection
Summary: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Trond Myklebust
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-03-30 18:03 UTC by Chuck Lever
Modified: 2017-12-04 16:50 UTC
CC List: 1 user

See Also:
Kernel Version: 3.10.17
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Chuck Lever 2014-03-30 18:03:09 UTC
rafael.reiter@ims.co.at reports the following panic:

Call Trace:
 <IRQ>
 [<ffffffffa04f7cbe>] ? rpcrdma_run_tasklet+0x7e/0xc0 [xprtrdma]
 [<ffffffff81049c82>] tasklet_action+0x52/0xc0
 [<ffffffff81049870>] __do_softirq+0xe0/0x220
 [<ffffffff8155cbac>] call_softirq+0x1c/0x30
 [<ffffffff8100452d>] do_softirq+0x4d/0x80
 [<ffffffff81049b05>] irq_exit+0x95/0xa0
 [<ffffffff8100411e>] do_IRQ+0x5e/0xd0
 [<ffffffff81553eaa>] common_interrupt+0x6a/0x6a
 <EOI>
 [<ffffffff81069090>] ? __hrtimer_start_range_ns+0x1c0/0x400
 [<ffffffff8141de86>] ? cpuidle_enter_state+0x56/0xd0
 [<ffffffff8141de82>] ? cpuidle_enter_state+0x52/0xd0
 [<ffffffff8141dfb6>] cpuidle_idle_call+0xb6/0x200
 [<ffffffff8100aa39>] arch_cpu_idle+0x9/0x20
 [<ffffffff81087cc0>] cpu_startup_entry+0x80/0x200
 [<ffffffff815358a2>] rest_init+0x72/0x80
 [<ffffffff81ac4e28>] start_kernel+0x3b2/0x3bf
 [<ffffffff81ac4875>] ? repair_env_string+0x5e/0x5e
 [<ffffffff81ac45a5>] x86_64_start_reservations+0x2a/0x2c
 [<ffffffff81ac4675>] x86_64_start_kernel+0xce/0xd2

The HCA is Mellanox Technologies MT26428.

Reproduction:
1) Mount a directory via NFS/RDMA

mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/

2) ls /mnt
3) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband connection
4) ls /mnt
5) wait 5-30 seconds

When debugging is enabled:

 RPC:       rpcrdma_event_process: event rep ffff880848ed0000 status 5 opcode FFFF8808 length 4294936584
 RPC:       rpcrdma_event_process: WC opcode -30712 status 5, connection lost

For more detail, see http://www.spinics.net/lists/linux-nfs/msg42314.html
Comment 1 Chuck Lever 2014-03-30 18:08:06 UTC
Status 5 is IB_WC_WR_FLUSH_ERR, but the reported opcode value is not valid.

Steve Wise says:

The user verbs man page sez this:

>  Not all wc attributes are always valid. If the completion
>  status is other than IBV_WC_SUCCESS, only the following
>  attributes are valid: wr_id, status, qp_num, and vendor_err.

The mlx4 provider (among others) does not set the ib_wc.opcode field for error completions. A Mellanox engineer has confirmed this.

Thus rpcrdma_event_process() cannot rely on the contents of ib_wc.opcode when processing error completions.
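
For illustration, a minimal sketch of how a completion handler has to be written under that constraint (the handler name is hypothetical; this is not the xprtrdma code):

#include <rdma/ib_verbs.h>
#include <linux/printk.h>

/* On an error status only wr_id, status, qp_num and vendor_err can be
 * trusted, so the dispatch below must not look at wc->opcode for
 * flushed or failed completions. */
static void example_handle_wc(struct ib_wc *wc)
{
        if (wc->status != IB_WC_SUCCESS) {
                /* wc->opcode is undefined here on mlx4 and other providers */
                pr_warn("WC error: status %d vendor_err 0x%x wr_id 0x%llx\n",
                        wc->status, wc->vendor_err,
                        (unsigned long long)wc->wr_id);
                return;
        }

        switch (wc->opcode) {   /* only meaningful on success */
        case IB_WC_RECV:
                /* ... process a successful receive ... */
                break;
        default:
                break;
        }
}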

This logic was introduced by

commit 5c635e09cec0feeeb310968e51dad01040244851
Author: Tom Tucker <tom@ogc.us>
Date:   Wed Feb 9 19:45:34 2011 +0000

    RPCRDMA: Fix FRMR registration/invalidate handling.

A suggestion was made to remove the FAST_REG_MR and LOCAL_INV completion logic introduced in this commit, but it looks like the commit addresses a real problem.
Comment 2 Chuck Lever 2014-03-31 01:16:08 UTC
To be doubly sure, I confirmed that the InfiniBand Architecture Specification 1.2.1, pp. 631-632, says:

> If the status of the operation that generates the Work Completion
> is anything other than success, the contents of the Work Completion
> are undefined except as noted below.

The exceptions, which are valid no matter what the completion status is, are wr_id, completion status, and freed resource count.
Comment 3 Chuck Lever 2014-04-01 18:14:18 UTC
The basic problem with 5c635e09 is that the WR cookie (wr_id) is always a pointer, but sometimes it points to a struct rpcrdma_rep and sometimes to a struct rpcrdma_mw. The ib_wc.opcode field is currently used to distinguish these cases, but we now know this field is not reliable for error completions.

One solution is to mark these structures so they can be distinguished during completion processing. A magic number in rpcrdma_rep might accomplish that, and would be a straightforward fix.
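
A very rough sketch of that idea (the structure, field, and constant names below are made up for illustration, not taken from the code):

#include <rdma/ib_verbs.h>

#define EXAMPLE_REP_MAGIC 0x52455031    /* hypothetical marker written at rep init */

struct example_rep {
        u32 rr_magic;           /* would be set to EXAMPLE_REP_MAGIC on allocation */
        /* ... remainder of the reply descriptor ... */
};

/* wr_id always carries a pointer, so checking a marker in the pointed-to
 * memory avoids any reliance on wc->opcode. This assumes the word at the
 * same offset in rpcrdma_mw can never hold the marker value. */
static bool example_wr_id_is_rep(const struct ib_wc *wc)
{
        const struct example_rep *rep;

        rep = (const struct example_rep *)(unsigned long)wc->wr_id;
        return rep->rr_magic == EXAMPLE_REP_MAGIC;
}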

A more radical solution would split the single CQ we use now into a send CQ and a receive CQ. The receive CQ would handle completions that use rpcrdma_rep, and the send CQ would handle completions that use rpcrdma_mw.
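
As a sketch of that direction (handler names are invented; this is not a patch), with separate CQs each completion handler knows up front what wr_id points to, regardless of completion status:

#include <rdma/ib_verbs.h>

struct rpcrdma_rep;     /* reply descriptor, declared in xprt_rdma.h */
struct rpcrdma_mw;      /* MR tracking structure, declared in xprt_rdma.h */

/* Receive CQ: every wr_id is an rpcrdma_rep, even on flush errors. */
static void example_recv_cq_handler(struct ib_cq *cq, void *cq_context)
{
        struct ib_wc wc;

        while (ib_poll_cq(cq, 1, &wc) > 0) {
                struct rpcrdma_rep *rep;

                rep = (struct rpcrdma_rep *)(unsigned long)wc.wr_id;
                /* ... process the reply or release it on error ... */
        }
}

/* Send CQ: every wr_id is an rpcrdma_mw (FAST_REG_MR / LOCAL_INV). */
static void example_send_cq_handler(struct ib_cq *cq, void *cq_context)
{
        struct ib_wc wc;

        while (ib_poll_cq(cq, 1, &wc) > 0) {
                struct rpcrdma_mw *mw;

                mw = (struct rpcrdma_mw *)(unsigned long)wc.wr_id;
                /* ... update MR state without consulting wc.opcode ... */
        }
}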
Comment 4 Chuck Lever 2014-04-08 19:41:05 UTC
I have a Linux NFS/RDMA client here with ConnectX-2/mlx4. I have not successfully reproduced the panic when pulling the IB cable under load. However, I can see the bad rep and opcode as reported by the OP.

I've written a couple of patches to split the completion queue. Destructive testing with these patches applied no longer reproduces the bad rep/opcode.
Comment 5 Chuck Lever 2014-04-09 14:30:55 UTC
klemens.senn@ims.co.at reports that the completion queue split patch prevents the panic.

However, connection recovery is not reliable, and sometimes the client NFS workload hangs after the link is re-connected. This appears to be a separate issue. Looking into it.
