Bug 214523 - RDMA Mellanox RoCE drivers are unresponsive to ARP updates during a reconnect
Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP updates during a reconnect
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Infiniband/RDMA
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: drivers_infiniband-rdma
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-09-24 15:34 UTC by kolga
Modified: 2021-12-16 21:07 UTC
CC List: 0 users

See Also:
Kernel Version: 5.14
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description kolga 2021-09-24 15:34:32 UTC
A RoCE RDMA connection uses the CMA protocol to establish an RDMA connection. During setup the code uses hard-coded timeout/retry values, which govern how the Connect Request is retried when it goes unanswered. During those retries, ARP updates for the destination server are ignored. The current timeout values lead to a 4+ minute attempt to connect to a server that no longer owns the IP address once the ARP update has happened.

The ask is to make the timeout/retry values configurable via procfs or sysfs. This would allow environments that use RoCE to reduce the timeouts to more reasonable values and react to ARP updates faster. Other CMA users (e.g. IB) can continue to use the existing values.

The problem exists in all kernel versions, but this bugzilla is filed against the 5.14 kernel.

The use case is (RoCE-based) NFSoRDMA, where a server went down and another server was brought up in its place. The RDMA layer adds 4+ minutes before an RDMA connection can be re-established and IO can resume, because it cannot react to the ARP update.
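
For context, the hard-coded values in question live in drivers/infiniband/core/cma.c, and the CM encodes each timeout field as an exponent t meaning 4.096 us * 2^t. A rough sketch of the arithmetic, with the constants quoted from memory of the v5.14-era sources (please verify against the actual tree):

/* drivers/infiniband/core/cma.c (approximate v5.14-era values) */
#define CMA_CM_RESPONSE_TIMEOUT  20	/* 4.096 us * 2^20 ~= 4.3 s */
#define CMA_MAX_CM_RETRIES       15
#define CMA_IBOE_PACKET_LIFETIME 18	/* 4.096 us * 2^18 ~= 1.1 s */

/*
 * ib_send_cm_req() derives the per-attempt REQ timeout roughly as
 *   2 * packet_life_time + remote_cm_response_timeout
 * which with the values above is on the order of 6 seconds per attempt.
 * The REQ is then retried CMA_MAX_CM_RETRIES times before the connect
 * attempt fails, so a single connect already takes well over a minute,
 * before the CM timewait and the ULP's own reconnect logic are added.
 */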
Comment 1 leon 2021-09-26 08:02:38 UTC
On Fri, Sep 24, 2021 at 03:34:32PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=214523
> 
>             Bug ID: 214523
>            Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP
>                     updates during a reconnect
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 5.14
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Infiniband/RDMA
>           Assignee: drivers_infiniband-rdma@kernel-bugs.osdl.org
>           Reporter: kolga@netapp.com
>         Regression: No
> 
> RoCE RDMA connection uses CMA protocol to establish an RDMA connection.
> During
> the setup the code uses hard coded timeout/retry values. These values are
> used
> for when Connect Request is not being answered to to re-try the request.
> During
> the re-try attempts the ARP updates of the destination server are ignored.
> Current timeout values lead to 4+minutes long attempt at connecting to a
> server
> that no longer owns the IP since the ARP update happens. 
> 
> The ask is to make the timeout/retry values configurable via procfs or sysfs.
> This will allow for environments that use RoCE to reduce the timeouts to a
> more
> reasonable values and be able to react to the ARP updates faster. Other CMA
> users (eg IB or others) can continue to use existing values.
> 
> The problem exist in all kernel versions but bugzilla is filed for 5.14
> kernel.
> 
> The use case is (RoCE-based) NFSoRDMA where a server went down and another
> server was brought up in its place. RDMA layer introduces 4+ minutes in being
> able to re-establish an RDMA connection and let IO resume, due to inability
> to
> react to the ARP update.

RDMA-CM has many different timeouts, so I hope that my answer is about the
right one.

We probably need to extend rdma_connect() to receive a
remote_cm_response_timeout value, so NFSoRDMA can set it to whatever value
is appropriate.

The timewait will then be calculated from it in ib_send_cm_req().
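
A minimal sketch of what that could look like (hypothetical only -- the new
field does not exist today and the names are invented; existing fields of
struct rdma_conn_param are elided):

struct rdma_conn_param {
	/* ... existing fields (private_data, retry_count, ...) ... */
	u8 remote_cm_response_timeout;	/* 0 means "use the CMA default" */
};

/* In cma_connect_ib(), roughly: */
	req.remote_cm_response_timeout =
		conn_param->remote_cm_response_timeout ? :
		CMA_CM_RESPONSE_TIMEOUT;	/* current hard-coded value */
	req.max_cm_retries = CMA_MAX_CM_RETRIES;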

Thanks

> 
> -- 
> You may reply to this email to add a comment.
> 
> You are receiving this mail because:
> You are watching the assignee of the bug.
Comment 2 Chuck Lever 2021-09-26 17:36:11 UTC
Hi Leon-

Thanks for the suggestion! More below.

> On Sep 26, 2021, at 4:02 AM, Leon Romanovsky <leon@kernel.org> wrote:
> 
> On Fri, Sep 24, 2021 at 03:34:32PM +0000, bugzilla-daemon@bugzilla.kernel.org
> wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=214523
>> 
>>            Bug ID: 214523
>>           Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP
>>                    updates during a reconnect
>>           Product: Drivers
>>           Version: 2.5
>>    Kernel Version: 5.14
>>          Hardware: All
>>                OS: Linux
>>              Tree: Mainline
>>            Status: NEW
>>          Severity: normal
>>          Priority: P1
>>         Component: Infiniband/RDMA
>>          Assignee: drivers_infiniband-rdma@kernel-bugs.osdl.org
>>          Reporter: kolga@netapp.com
>>        Regression: No
>> 
>> RoCE RDMA connection uses CMA protocol to establish an RDMA connection.
>> During
>> the setup the code uses hard coded timeout/retry values. These values are
>> used
>> for when Connect Request is not being answered to to re-try the request.
>> During
>> the re-try attempts the ARP updates of the destination server are ignored.
>> Current timeout values lead to 4+minutes long attempt at connecting to a
>> server
>> that no longer owns the IP since the ARP update happens. 
>> 
>> The ask is to make the timeout/retry values configurable via procfs or
>> sysfs.
>> This will allow for environments that use RoCE to reduce the timeouts to a
>> more
>> reasonable values and be able to react to the ARP updates faster. Other CMA
>> users (eg IB or others) can continue to use existing values.

I would rather not add a user-facing tunable. The fabric should
be better at detecting addressing changes within a reasonable
time. It would be helpful to provide a history of why the ARP
timeout is so lax -- do certain ULPs rely on it being long?


>> The problem exist in all kernel versions but bugzilla is filed for 5.14
>> kernel.
>> 
>> The use case is (RoCE-based) NFSoRDMA where a server went down and another
>> server was brought up in its place. RDMA layer introduces 4+ minutes in
>> being
>> able to re-establish an RDMA connection and let IO resume, due to inability
>> to
>> react to the ARP update.
> 
> RDMA-CM has many different timeouts, so I hope that my answer is for the
> right timeout.
> 
> We probably need to extend rdma_connect() to receive
> remote_cm_response_timeout
> value, so NFSoRDMA will set it to whatever value its appropriate.
> 
> The timewait will be calculated based it in ib_send_cm_req().

I hope a mechanism can be found that behaves the same or nearly the
same way for all RDMA fabrics.

For those who are not NFS-savvy:

Simple NFS server failover is typically implemented with a heartbeat
between two similar platforms that both access the same backend
storage. When one platform fails, the other detects it and takes over
the failing platform's IP address. Clients detect connection loss
with the failing platform, and upon reconnection to that IP address
are transparently directed to the other platform.

NFS server vendors have tried to extend this behavior to RDMA fabrics,
with varying degrees of success.

In addition to enforcing availability SLAs, the time it takes to
re-establish a working connection is critical for NFSv4 because each
client maintains a lease to prevent the server from purging open and
lock state. If the reconnect takes too long, the client's lease is
jeopardized because other clients can then access files that client
might still have locked or open.


--
Chuck Lever
Comment 3 kolga 2021-09-26 19:25:41 UTC
(In reply to leon from comment #1)
> On Fri, Sep 24, 2021 at 03:34:32PM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=214523
> > 
> >             Bug ID: 214523
> >            Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP
> >                     updates during a reconnect
> >            Product: Drivers
> >            Version: 2.5
> >     Kernel Version: 5.14
> >           Hardware: All
> >                 OS: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Infiniband/RDMA
> >           Assignee: drivers_infiniband-rdma@kernel-bugs.osdl.org
> >           Reporter: kolga@netapp.com
> >         Regression: No
> > 
> > RoCE RDMA connection uses CMA protocol to establish an RDMA connection.
> > During
> > the setup the code uses hard coded timeout/retry values. These values are
> > used
> > for when Connect Request is not being answered to to re-try the request.
> > During
> > the re-try attempts the ARP updates of the destination server are ignored.
> > Current timeout values lead to 4+minutes long attempt at connecting to a
> > server
> > that no longer owns the IP since the ARP update happens. 
> > 
> > The ask is to make the timeout/retry values configurable via procfs or
> sysfs.
> > This will allow for environments that use RoCE to reduce the timeouts to a
> > more
> > reasonable values and be able to react to the ARP updates faster. Other CMA
> > users (eg IB or others) can continue to use existing values.
> > 
> > The problem exist in all kernel versions but bugzilla is filed for 5.14
> > kernel.
> > 
> > The use case is (RoCE-based) NFSoRDMA where a server went down and another
> > server was brought up in its place. RDMA layer introduces 4+ minutes in
> being
> > able to re-establish an RDMA connection and let IO resume, due to inability
> > to
> > react to the ARP update.
> 
> RDMA-CM has many different timeouts, so I hope that my answer is for the
> right timeout.

The values that Mellanox has suggested changing as a workaround were
CMA_CM_RESPONSE_TIMEOUT (16 is the suggested new value)
CMA_MAX_CM_RETRIES (5 is the suggested new value)
CMA_IBOE_PACKET_LIFETIME (16 is the suggested new value)

I don't have enough understanding to know whether changing only the timeout value is sufficient for the desired effect, but my educated guess is that all three are required.
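
The CM encodes these fields as 4.096 us * 2^t, so as a back-of-the-envelope comparison (numbers are approximate):

/*
 * Suggested workaround values vs. the current defaults:
 *   CMA_CM_RESPONSE_TIMEOUT  16  ->  ~0.27 s   (instead of 20 -> ~4.3 s)
 *   CMA_IBOE_PACKET_LIFETIME 16  ->  ~0.27 s   (instead of 18 -> ~1.1 s)
 *   CMA_MAX_CM_RETRIES        5               (instead of 15)
 *
 * The per-attempt REQ timeout drops to roughly 2 * 0.27 + 0.27 ~= 0.8 s,
 * retried 5 times, so the connect attempt gives up within a few seconds
 * instead of minutes and the ULP can redo address resolution and pick up
 * the new ARP entry much sooner.
 */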
  
> We probably need to extend rdma_connect() to receive
> remote_cm_response_timeout
> value, so NFSoRDMA will set it to whatever value its appropriate.
> 
> The timewait will be calculated based it in ib_send_cm_req().
> 
> Thanks
> 
> > 
> > -- 
> > You may reply to this email to add a comment.
> > 
> > You are receiving this mail because:
> > You are watching the assignee of the bug.
Comment 4 kolga 2021-09-26 19:33:33 UTC
(In reply to Chuck Lever from comment #2)
> Hi Leon-
> 
> Thanks for the suggestion! More below.
> 
> > On Sep 26, 2021, at 4:02 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > 
> > On Fri, Sep 24, 2021 at 03:34:32PM +0000,
> bugzilla-daemon@bugzilla.kernel.org
> > wrote:
> >> https://bugzilla.kernel.org/show_bug.cgi?id=214523
> >> 
> >>            Bug ID: 214523
> >>           Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP
> >>                    updates during a reconnect
> >>           Product: Drivers
> >>           Version: 2.5
> >>    Kernel Version: 5.14
> >>          Hardware: All
> >>                OS: Linux
> >>              Tree: Mainline
> >>            Status: NEW
> >>          Severity: normal
> >>          Priority: P1
> >>         Component: Infiniband/RDMA
> >>          Assignee: drivers_infiniband-rdma@kernel-bugs.osdl.org
> >>          Reporter: kolga@netapp.com
> >>        Regression: No
> >> 
> >> RoCE RDMA connection uses CMA protocol to establish an RDMA connection.
> >> During
> >> the setup the code uses hard coded timeout/retry values. These values are
> >> used
> >> for when Connect Request is not being answered to to re-try the request.
> >> During
> >> the re-try attempts the ARP updates of the destination server are ignored.
> >> Current timeout values lead to 4+minutes long attempt at connecting to a
> >> server
> >> that no longer owns the IP since the ARP update happens. 
> >> 
> >> The ask is to make the timeout/retry values configurable via procfs or
> >> sysfs.
> >> This will allow for environments that use RoCE to reduce the timeouts to a
> >> more
> >> reasonable values and be able to react to the ARP updates faster. Other
> CMA
> >> users (eg IB or others) can continue to use existing values.
> 
> I would rather not add a user-facing tunable. The fabric should
> be better at detecting addressing changes within a reasonable
> time. It would be helpful to provide a history of why the ARP
> timeout is so lax -- do certain ULPs rely on it being long?

I see this as equivalent to TCP's tcp_syn_retries sysctl (and friends), with CMA_MAX_CM_RETRIES being the counterpart here. Some environments may want it larger or smaller, and having the ability to tune it is what this bugzilla request is about.

I think finding a hard-coded value that works for all environments is probably not feasible.
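
Purely as an illustration of the kind of knob I mean (hypothetical, not an existing interface, names made up), even something as simple as module parameters in drivers/infiniband/core/cma.c would help:

static unsigned int cma_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
module_param(cma_cm_response_timeout, uint, 0644);
MODULE_PARM_DESC(cma_cm_response_timeout,
		 "CM response timeout exponent (4.096us * 2^t) for RoCE connects");

static unsigned int cma_max_cm_retries = CMA_MAX_CM_RETRIES;
module_param(cma_max_cm_retries, uint, 0644);
MODULE_PARM_DESC(cma_max_cm_retries,
		 "Number of CM REQ retries for RoCE connects");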
 

> >> The problem exist in all kernel versions but bugzilla is filed for 5.14
> >> kernel.
> >> 
> >> The use case is (RoCE-based) NFSoRDMA where a server went down and another
> >> server was brought up in its place. RDMA layer introduces 4+ minutes in
> >> being
> >> able to re-establish an RDMA connection and let IO resume, due to
> inability
> >> to
> >> react to the ARP update.
> > 
> > RDMA-CM has many different timeouts, so I hope that my answer is for the
> > right timeout.
> > 
> > We probably need to extend rdma_connect() to receive
> > remote_cm_response_timeout
> > value, so NFSoRDMA will set it to whatever value its appropriate.
> > 
> > The timewait will be calculated based it in ib_send_cm_req().
> 
> I hope a mechanism can be found that behaves the same or nearly the
> same way for all RDMA fabrics.

But this is specific to the CMA protocol, which is used by IB and RoCE but not by iWARP. Therefore, my ask is really about configurability of the CMA protocol parameters.
 
> For those who are not NFS-savvy:
> 
> Simple NFS server failover is typically implemented with a heartbeat
> between two similar platforms that both access the same backend
> storage. When one platform fails, the other detects it and takes over
> the failing platform's IP address. Clients detect connection loss
> with the failing platform, and upon reconnection to that IP address
> are transparently directed to the other platform.
> 
> NFS server vendors have tried to extend this behavior to RDMA fabrics,
> with varying degrees of success.
> 
> In addition to enforcing availability SLAs, the time it takes to
> re-establish a working connection is critical for NFSv4 because each
> client maintains a lease to prevent the server from purging open and
> lock state. If the reconnect takes too long, the client's lease is
> jeopardized because other clients can then access files that client
> might still have locked or open.
> 
> 
> --
> Chuck Lever
Comment 5 leon 2021-09-27 12:09:49 UTC
On Sun, Sep 26, 2021 at 05:36:01PM +0000, Chuck Lever III wrote:
> Hi Leon-
> 
> Thanks for the suggestion! More below.
> 
> > On Sep 26, 2021, at 4:02 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > 
> > On Fri, Sep 24, 2021 at 03:34:32PM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> >> https://bugzilla.kernel.org/show_bug.cgi?id=214523
> >> 
> >>            Bug ID: 214523
> >>           Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP
> >>                    updates during a reconnect
> >>           Product: Drivers
> >>           Version: 2.5
> >>    Kernel Version: 5.14
> >>          Hardware: All
> >>                OS: Linux
> >>              Tree: Mainline
> >>            Status: NEW
> >>          Severity: normal
> >>          Priority: P1
> >>         Component: Infiniband/RDMA
> >>          Assignee: drivers_infiniband-rdma@kernel-bugs.osdl.org
> >>          Reporter: kolga@netapp.com
> >>        Regression: No
> >> 
> >> RoCE RDMA connection uses CMA protocol to establish an RDMA connection.
> During
> >> the setup the code uses hard coded timeout/retry values. These values are
> used
> >> for when Connect Request is not being answered to to re-try the request.
> During
> >> the re-try attempts the ARP updates of the destination server are ignored.
> >> Current timeout values lead to 4+minutes long attempt at connecting to a
> server
> >> that no longer owns the IP since the ARP update happens. 
> >> 
> >> The ask is to make the timeout/retry values configurable via procfs or
> sysfs.
> >> This will allow for environments that use RoCE to reduce the timeouts to a
> more
> >> reasonable values and be able to react to the ARP updates faster. Other
> CMA
> >> users (eg IB or others) can continue to use existing values.
> 
> I would rather not add a user-facing tunable. The fabric should
> be better at detecting addressing changes within a reasonable
> time. It would be helpful to provide a history of why the ARP
> timeout is so lax -- do certain ULPs rely on it being long?

I don't know about ULPs and ARPs, but how to calculate TimeWait is
described in the spec.

Regarding the tunable, I agree. Because it would need to be per-connection,
most likely very few people in the world would succeed in configuring it
properly.

> 
> 
> >> The problem exist in all kernel versions but bugzilla is filed for 5.14
> kernel.
> >> 
> >> The use case is (RoCE-based) NFSoRDMA where a server went down and another
> >> server was brought up in its place. RDMA layer introduces 4+ minutes in
> being
> >> able to re-establish an RDMA connection and let IO resume, due to
> inability to
> >> react to the ARP update.
> > 
> > RDMA-CM has many different timeouts, so I hope that my answer is for the
> > right timeout.
> > 
> > We probably need to extend rdma_connect() to receive
> remote_cm_response_timeout
> > value, so NFSoRDMA will set it to whatever value its appropriate.
> > 
> > The timewait will be calculated based it in ib_send_cm_req().
> 
> I hope a mechanism can be found that behaves the same or nearly the
> same way for all RDMA fabrics.

It depends on the fabric itself; remote_cm_response_timeout can be
different in every network.

> 
> For those who are not NFS-savvy:
> 
> Simple NFS server failover is typically implemented with a heartbeat
> between two similar platforms that both access the same backend
> storage. When one platform fails, the other detects it and takes over
> the failing platform's IP address. Clients detect connection loss
> with the failing platform, and upon reconnection to that IP address
> are transparently directed to the other platform.
> 
> NFS server vendors have tried to extend this behavior to RDMA fabrics,
> with varying degrees of success.
> 
> In addition to enforcing availability SLAs, the time it takes to
> re-establish a working connection is critical for NFSv4 because each
> client maintains a lease to prevent the server from purging open and
> lock state. If the reconnect takes too long, the client's lease is
> jeopardized because other clients can then access files that client
> might still have locked or open.
> 
> 
> --
> Chuck Lever
> 
> 
>
Comment 6 jgg 2021-09-27 12:24:29 UTC
On Mon, Sep 27, 2021 at 03:09:44PM +0300, Leon Romanovsky wrote:
> On Sun, Sep 26, 2021 at 05:36:01PM +0000, Chuck Lever III wrote:
> > Hi Leon-
> > 
> > Thanks for the suggestion! More below.
> > 
> > > On Sep 26, 2021, at 4:02 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > > 
> > > On Fri, Sep 24, 2021 at 03:34:32PM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > >> https://bugzilla.kernel.org/show_bug.cgi?id=214523
> > >> 
> > >>            Bug ID: 214523
> > >>           Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP
> > >>                    updates during a reconnect
> > >>           Product: Drivers
> > >>           Version: 2.5
> > >>    Kernel Version: 5.14
> > >>          Hardware: All
> > >>                OS: Linux
> > >>              Tree: Mainline
> > >>            Status: NEW
> > >>          Severity: normal
> > >>          Priority: P1
> > >>         Component: Infiniband/RDMA
> > >>          Assignee: drivers_infiniband-rdma@kernel-bugs.osdl.org
> > >>          Reporter: kolga@netapp.com
> > >>        Regression: No
> > >> 
> > >> RoCE RDMA connection uses CMA protocol to establish an RDMA connection.
> During
> > >> the setup the code uses hard coded timeout/retry values. These values
> are used
> > >> for when Connect Request is not being answered to to re-try the request.
> During
> > >> the re-try attempts the ARP updates of the destination server are
> ignored.
> > >> Current timeout values lead to 4+minutes long attempt at connecting to a
> server
> > >> that no longer owns the IP since the ARP update happens. 
> > >> 
> > >> The ask is to make the timeout/retry values configurable via procfs or
> sysfs.
> > >> This will allow for environments that use RoCE to reduce the timeouts to
> a more
> > >> reasonable values and be able to react to the ARP updates faster. Other
> CMA
> > >> users (eg IB or others) can continue to use existing values.
> > 
> > I would rather not add a user-facing tunable. The fabric should
> > be better at detecting addressing changes within a reasonable
> > time. It would be helpful to provide a history of why the ARP
> > timeout is so lax -- do certain ULPs rely on it being long?
> 
> I don't know about ULPs and ARPs, but how to calculate TimeWait is
> described in the spec.
> 
> Regarding tunable, I agree. Because it needs to be per-connection, most
> likely not many people in the world will success to configure it properly.

Maybe we should be disconnecting the cm_id if a gratuitous ARP changes
the MAC address? The cm_id is surely broken after that event, right?

Jason
Comment 7 markzhang 2021-09-27 12:55:42 UTC
On 9/27/2021 8:24 PM, Jason Gunthorpe wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Mon, Sep 27, 2021 at 03:09:44PM +0300, Leon Romanovsky wrote:
>> On Sun, Sep 26, 2021 at 05:36:01PM +0000, Chuck Lever III wrote:
>>> Hi Leon-
>>>
>>> Thanks for the suggestion! More below.
>>>
>>>> On Sep 26, 2021, at 4:02 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>>>
>>>> On Fri, Sep 24, 2021 at 03:34:32PM +0000,
>>>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=214523
>>>>>
>>>>>             Bug ID: 214523
>>>>>            Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP
>>>>>                     updates during a reconnect
>>>>>            Product: Drivers
>>>>>            Version: 2.5
>>>>>     Kernel Version: 5.14
>>>>>           Hardware: All
>>>>>                 OS: Linux
>>>>>               Tree: Mainline
>>>>>             Status: NEW
>>>>>           Severity: normal
>>>>>           Priority: P1
>>>>>          Component: Infiniband/RDMA
>>>>>           Assignee: drivers_infiniband-rdma@kernel-bugs.osdl.org
>>>>>           Reporter: kolga@netapp.com
>>>>>         Regression: No
>>>>>
>>>>> RoCE RDMA connection uses CMA protocol to establish an RDMA connection.
>>>>> During
>>>>> the setup the code uses hard coded timeout/retry values. These values are
>>>>> used
>>>>> for when Connect Request is not being answered to to re-try the request.
>>>>> During
>>>>> the re-try attempts the ARP updates of the destination server are
>>>>> ignored.
>>>>> Current timeout values lead to 4+minutes long attempt at connecting to a
>>>>> server
>>>>> that no longer owns the IP since the ARP update happens.
>>>>>
>>>>> The ask is to make the timeout/retry values configurable via procfs or
>>>>> sysfs.
>>>>> This will allow for environments that use RoCE to reduce the timeouts to
>>>>> a more
>>>>> reasonable values and be able to react to the ARP updates faster. Other
>>>>> CMA
>>>>> users (eg IB or others) can continue to use existing values.
>>>
>>> I would rather not add a user-facing tunable. The fabric should
>>> be better at detecting addressing changes within a reasonable
>>> time. It would be helpful to provide a history of why the ARP
>>> timeout is so lax -- do certain ULPs rely on it being long?
>>
>> I don't know about ULPs and ARPs, but how to calculate TimeWait is
>> described in the spec.
>>
>> Regarding tunable, I agree. Because it needs to be per-connection, most
>> likely not many people in the world will success to configure it properly.
> 
> Maybe we should be disconnecting the cm_id if a gratituous ARP changes
> the MAC address? The cm_id is surely broken after that event right?

Is there an event on gratuitous ARP? And we also need to notify the
user-space application, right?
Comment 8 jgg 2021-09-27 13:10:46 UTC
On Mon, Sep 27, 2021 at 08:55:19PM +0800, Mark Zhang wrote:
> On 9/27/2021 8:24 PM, Jason Gunthorpe wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Mon, Sep 27, 2021 at 03:09:44PM +0300, Leon Romanovsky wrote:
> > > On Sun, Sep 26, 2021 at 05:36:01PM +0000, Chuck Lever III wrote:
> > > > Hi Leon-
> > > > 
> > > > Thanks for the suggestion! More below.
> > > > 
> > > > > On Sep 26, 2021, at 4:02 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > > > > 
> > > > > On Fri, Sep 24, 2021 at 03:34:32PM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=214523
> > > > > > 
> > > > > >             Bug ID: 214523
> > > > > >            Summary: RDMA Mellanox RoCE drivers are unresponsive to
> ARP
> > > > > >                     updates during a reconnect
> > > > > >            Product: Drivers
> > > > > >            Version: 2.5
> > > > > >     Kernel Version: 5.14
> > > > > >           Hardware: All
> > > > > >                 OS: Linux
> > > > > >               Tree: Mainline
> > > > > >             Status: NEW
> > > > > >           Severity: normal
> > > > > >           Priority: P1
> > > > > >          Component: Infiniband/RDMA
> > > > > >           Assignee: drivers_infiniband-rdma@kernel-bugs.osdl.org
> > > > > >           Reporter: kolga@netapp.com
> > > > > >         Regression: No
> > > > > > 
> > > > > > RoCE RDMA connection uses CMA protocol to establish an RDMA
> connection. During
> > > > > > the setup the code uses hard coded timeout/retry values. These
> values are used
> > > > > > for when Connect Request is not being answered to to re-try the
> request. During
> > > > > > the re-try attempts the ARP updates of the destination server are
> ignored.
> > > > > > Current timeout values lead to 4+minutes long attempt at connecting
> to a server
> > > > > > that no longer owns the IP since the ARP update happens.
> > > > > > 
> > > > > > The ask is to make the timeout/retry values configurable via procfs
> or sysfs.
> > > > > > This will allow for environments that use RoCE to reduce the
> timeouts to a more
> > > > > > reasonable values and be able to react to the ARP updates faster.
> Other CMA
> > > > > > users (eg IB or others) can continue to use existing values.
> > > > 
> > > > I would rather not add a user-facing tunable. The fabric should
> > > > be better at detecting addressing changes within a reasonable
> > > > time. It would be helpful to provide a history of why the ARP
> > > > timeout is so lax -- do certain ULPs rely on it being long?
> > > 
> > > I don't know about ULPs and ARPs, but how to calculate TimeWait is
> > > described in the spec.
> > > 
> > > Regarding tunable, I agree. Because it needs to be per-connection, most
> > > likely not many people in the world will success to configure it
> properly.
> > 
> > Maybe we should be disconnecting the cm_id if a gratituous ARP changes
> > the MAC address? The cm_id is surely broken after that event right?
> 
> Is there an event on gratuitous ARP? And we also need to notify user-space
> application, right?

I think there is a net notifier for this?

Userspace will see it via the CM event we'll need to trigger.

Jason
Comment 9 haakon.bugge 2021-09-27 13:32:41 UTC
> On 27 Sep 2021, at 15:10, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Mon, Sep 27, 2021 at 08:55:19PM +0800, Mark Zhang wrote:
>> On 9/27/2021 8:24 PM, Jason Gunthorpe wrote:
>>> External email: Use caution opening links or attachments
>>> 
>>> 
>>> On Mon, Sep 27, 2021 at 03:09:44PM +0300, Leon Romanovsky wrote:
>>>> On Sun, Sep 26, 2021 at 05:36:01PM +0000, Chuck Lever III wrote:
>>>>> Hi Leon-
>>>>> 
>>>>> Thanks for the suggestion! More below.
>>>>> 
>>>>>> On Sep 26, 2021, at 4:02 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>>>>> 
>>>>>> On Fri, Sep 24, 2021 at 03:34:32PM +0000,
>>>>>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=214523
>>>>>>> 
>>>>>>>            Bug ID: 214523
>>>>>>>           Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP
>>>>>>>                    updates during a reconnect
>>>>>>>           Product: Drivers
>>>>>>>           Version: 2.5
>>>>>>>    Kernel Version: 5.14
>>>>>>>          Hardware: All
>>>>>>>                OS: Linux
>>>>>>>              Tree: Mainline
>>>>>>>            Status: NEW
>>>>>>>          Severity: normal
>>>>>>>          Priority: P1
>>>>>>>         Component: Infiniband/RDMA
>>>>>>>          Assignee: drivers_infiniband-rdma@kernel-bugs.osdl.org
>>>>>>>          Reporter: kolga@netapp.com
>>>>>>>        Regression: No
>>>>>>> 
>>>>>>> RoCE RDMA connection uses CMA protocol to establish an RDMA connection.
>>>>>>> During
>>>>>>> the setup the code uses hard coded timeout/retry values. These values
>>>>>>> are used
>>>>>>> for when Connect Request is not being answered to to re-try the
>>>>>>> request. During
>>>>>>> the re-try attempts the ARP updates of the destination server are
>>>>>>> ignored.
>>>>>>> Current timeout values lead to 4+minutes long attempt at connecting to
>>>>>>> a server
>>>>>>> that no longer owns the IP since the ARP update happens.
>>>>>>> 
>>>>>>> The ask is to make the timeout/retry values configurable via procfs or
>>>>>>> sysfs.
>>>>>>> This will allow for environments that use RoCE to reduce the timeouts
>>>>>>> to a more
>>>>>>> reasonable values and be able to react to the ARP updates faster. Other
>>>>>>> CMA
>>>>>>> users (eg IB or others) can continue to use existing values.
>>>>> 
>>>>> I would rather not add a user-facing tunable. The fabric should
>>>>> be better at detecting addressing changes within a reasonable
>>>>> time. It would be helpful to provide a history of why the ARP
>>>>> timeout is so lax -- do certain ULPs rely on it being long?
>>>> 
>>>> I don't know about ULPs and ARPs, but how to calculate TimeWait is
>>>> described in the spec.
>>>> 
>>>> Regarding tunable, I agree. Because it needs to be per-connection, most
>>>> likely not many people in the world will success to configure it properly.
>>> 
>>> Maybe we should be disconnecting the cm_id if a gratituous ARP changes
>>> the MAC address? The cm_id is surely broken after that event right?
>> 
>> Is there an event on gratuitous ARP? And we also need to notify user-space
>> application, right?
> 
> I think there is a net notifier for this?

NETEVENT_NEIGH_UPDATE, maybe?


Thxs, Håkon

> 
> Userspace will see it via the CM event we'll need to trigger.
> 
> Jason
Comment 10 kolga 2021-09-27 13:46:35 UTC
(In reply to leon from comment #5)
> On Sun, Sep 26, 2021 at 05:36:01PM +0000, Chuck Lever III wrote:
> > Hi Leon-
> > 
> > Thanks for the suggestion! More below.
> > 
> > > On Sep 26, 2021, at 4:02 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > > 
> > > On Fri, Sep 24, 2021 at 03:34:32PM +0000,
> > bugzilla-daemon@bugzilla.kernel.org wrote:
> > >> https://bugzilla.kernel.org/show_bug.cgi?id=214523
> > >> 
> > >>            Bug ID: 214523
> > >>           Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP
> > >>                    updates during a reconnect
> > >>           Product: Drivers
> > >>           Version: 2.5
> > >>    Kernel Version: 5.14
> > >>          Hardware: All
> > >>                OS: Linux
> > >>              Tree: Mainline
> > >>            Status: NEW
> > >>          Severity: normal
> > >>          Priority: P1
> > >>         Component: Infiniband/RDMA
> > >>          Assignee: drivers_infiniband-rdma@kernel-bugs.osdl.org
> > >>          Reporter: kolga@netapp.com
> > >>        Regression: No
> > >> 
> > >> RoCE RDMA connection uses CMA protocol to establish an RDMA connection.
> > During
> > >> the setup the code uses hard coded timeout/retry values. These values
> are
> > used
> > >> for when Connect Request is not being answered to to re-try the request.
> > During
> > >> the re-try attempts the ARP updates of the destination server are
> ignored.
> > >> Current timeout values lead to 4+minutes long attempt at connecting to a
> > server
> > >> that no longer owns the IP since the ARP update happens. 
> > >> 
> > >> The ask is to make the timeout/retry values configurable via procfs or
> > sysfs.
> > >> This will allow for environments that use RoCE to reduce the timeouts to
> a
> > more
> > >> reasonable values and be able to react to the ARP updates faster. Other
> > CMA
> > >> users (eg IB or others) can continue to use existing values.
> > 
> > I would rather not add a user-facing tunable. The fabric should
> > be better at detecting addressing changes within a reasonable
> > time. It would be helpful to provide a history of why the ARP
> > timeout is so lax -- do certain ULPs rely on it being long?
> 
> I don't know about ULPs and ARPs, but how to calculate TimeWait is
> described in the spec.
> 
> Regarding tunable, I agree. Because it needs to be per-connection, most
> likely not many people in the world will success to configure it properly.

While it is true that requiring users to configure this properly is a problem, not providing such configuration at the machine level and per connection seems bad as well. Otherwise the ULP (NFSoRDMA) would need to know what type of fabric it is running on (i.e., the user would need to specify it at mount time), and NFSoRDMA would then supply different timeouts for IB vs RoCE, which seems like a layering violation.

The idea of triggering a disconnect on the ARP update seems like a good one that would not require any of this configuration tuning.

> 
> > 
> > 
> > >> The problem exist in all kernel versions but bugzilla is filed for 5.14
> > kernel.
> > >> 
> > >> The use case is (RoCE-based) NFSoRDMA where a server went down and
> another
> > >> server was brought up in its place. RDMA layer introduces 4+ minutes in
> > being
> > >> able to re-establish an RDMA connection and let IO resume, due to
> > inability to
> > >> react to the ARP update.
> > > 
> > > RDMA-CM has many different timeouts, so I hope that my answer is for the
> > > right timeout.
> > > 
> > > We probably need to extend rdma_connect() to receive
> > remote_cm_response_timeout
> > > value, so NFSoRDMA will set it to whatever value its appropriate.
> > > 
> > > The timewait will be calculated based it in ib_send_cm_req().
> > 
> > I hope a mechanism can be found that behaves the same or nearly the
> > same way for all RDMA fabrics.
> 
> It depends on the fabric itself, in every network
> remote_cm_response_timeout can be different.
> 
> > 
> > For those who are not NFS-savvy:
> > 
> > Simple NFS server failover is typically implemented with a heartbeat
> > between two similar platforms that both access the same backend
> > storage. When one platform fails, the other detects it and takes over
> > the failing platform's IP address. Clients detect connection loss
> > with the failing platform, and upon reconnection to that IP address
> > are transparently directed to the other platform.
> > 
> > NFS server vendors have tried to extend this behavior to RDMA fabrics,
> > with varying degrees of success.
> > 
> > In addition to enforcing availability SLAs, the time it takes to
> > re-establish a working connection is critical for NFSv4 because each
> > client maintains a lease to prevent the server from purging open and
> > lock state. If the reconnect takes too long, the client's lease is
> > jeopardized because other clients can then access files that client
> > might still have locked or open.
> > 
> > 
> > --
> > Chuck Lever
> > 
> > 
> >
Comment 11 Chuck Lever 2021-09-27 16:14:54 UTC
> On Sep 27, 2021, at 8:09 AM, Leon Romanovsky <leon@kernel.org> wrote:
> 
> On Sun, Sep 26, 2021 at 05:36:01PM +0000, Chuck Lever III wrote:
>> Hi Leon-
>> 
>> Thanks for the suggestion! More below.
>> 
>>> On Sep 26, 2021, at 4:02 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>> 
>>> On Fri, Sep 24, 2021 at 03:34:32PM +0000,
>>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=214523
>>>> 
>>>>           Bug ID: 214523
>>>>          Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP
>>>>                   updates during a reconnect
>>>>          Product: Drivers
>>>>          Version: 2.5
>>>>   Kernel Version: 5.14
>>>>         Hardware: All
>>>>               OS: Linux
>>>>             Tree: Mainline
>>>>           Status: NEW
>>>>         Severity: normal
>>>>         Priority: P1
>>>>        Component: Infiniband/RDMA
>>>>         Assignee: drivers_infiniband-rdma@kernel-bugs.osdl.org
>>>>         Reporter: kolga@netapp.com
>>>>       Regression: No
>>>> 
>>>> RoCE RDMA connection uses CMA protocol to establish an RDMA connection.
>>>> During
>>>> the setup the code uses hard coded timeout/retry values. These values are
>>>> used
>>>> for when Connect Request is not being answered to to re-try the request.
>>>> During
>>>> the re-try attempts the ARP updates of the destination server are ignored.
>>>> Current timeout values lead to 4+minutes long attempt at connecting to a
>>>> server
>>>> that no longer owns the IP since the ARP update happens. 
>>>> 
>>>> The ask is to make the timeout/retry values configurable via procfs or
>>>> sysfs.
>>>> This will allow for environments that use RoCE to reduce the timeouts to a
>>>> more
>>>> reasonable values and be able to react to the ARP updates faster. Other
>>>> CMA
>>>> users (eg IB or others) can continue to use existing values.
>> 
>> I would rather not add a user-facing tunable. The fabric should
>> be better at detecting addressing changes within a reasonable
>> time. It would be helpful to provide a history of why the ARP
>> timeout is so lax -- do certain ULPs rely on it being long?
> 
> I don't know about ULPs and ARPs, but how to calculate TimeWait is
> described in the spec.
> 
> Regarding tunable, I agree. Because it needs to be per-connection, most
> likely not many people in the world will success to configure it properly.

Exactly.


>>>> The problem exist in all kernel versions but bugzilla is filed for 5.14
>>>> kernel.
>>>> 
>>>> The use case is (RoCE-based) NFSoRDMA where a server went down and another
>>>> server was brought up in its place. RDMA layer introduces 4+ minutes in
>>>> being
>>>> able to re-establish an RDMA connection and let IO resume, due to
>>>> inability to
>>>> react to the ARP update.
>>> 
>>> RDMA-CM has many different timeouts, so I hope that my answer is for the
>>> right timeout.
>>> 
>>> We probably need to extend rdma_connect() to receive
>>> remote_cm_response_timeout
>>> value, so NFSoRDMA will set it to whatever value its appropriate.
>>> 
>>> The timewait will be calculated based it in ib_send_cm_req().
>> 
>> I hope a mechanism can be found that behaves the same or nearly the
>> same way for all RDMA fabrics.
> 
> It depends on the fabric itself, in every network
> remote_cm_response_timeout can be different.

What I mean is I hope a way can be found so that RDMA consumers do
not have to be aware of the fabric differences.


>> For those who are not NFS-savvy:
>> 
>> Simple NFS server failover is typically implemented with a heartbeat
>> between two similar platforms that both access the same backend
>> storage. When one platform fails, the other detects it and takes over
>> the failing platform's IP address. Clients detect connection loss
>> with the failing platform, and upon reconnection to that IP address
>> are transparently directed to the other platform.
>> 
>> NFS server vendors have tried to extend this behavior to RDMA fabrics,
>> with varying degrees of success.
>> 
>> In addition to enforcing availability SLAs, the time it takes to
>> re-establish a working connection is critical for NFSv4 because each
>> client maintains a lease to prevent the server from purging open and
>> lock state. If the reconnect takes too long, the client's lease is
>> jeopardized because other clients can then access files that client
>> might still have locked or open.
>> 
>> 
>> --
>> Chuck Lever

--
Chuck Lever
Comment 12 markzhang 2021-10-15 06:36:12 UTC
On 9/27/2021 9:32 PM, Haakon Bugge wrote:
> External email: Use caution opening links or attachments
> 
> 
>> On 27 Sep 2021, at 15:10, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>> On Mon, Sep 27, 2021 at 08:55:19PM +0800, Mark Zhang wrote:
>>> On 9/27/2021 8:24 PM, Jason Gunthorpe wrote:
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> On Mon, Sep 27, 2021 at 03:09:44PM +0300, Leon Romanovsky wrote:
>>>>> On Sun, Sep 26, 2021 at 05:36:01PM +0000, Chuck Lever III wrote:
>>>>>> Hi Leon-
>>>>>>
>>>>>> Thanks for the suggestion! More below.
>>>>>>
>>>>>>> On Sep 26, 2021, at 4:02 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>>>>>>
>>>>>>> On Fri, Sep 24, 2021 at 03:34:32PM +0000,
>>>>>>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=214523
>>>>>>>>
>>>>>>>>             Bug ID: 214523
>>>>>>>>            Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP
>>>>>>>>                     updates during a reconnect
>>>>>>>>            Product: Drivers
>>>>>>>>            Version: 2.5
>>>>>>>>     Kernel Version: 5.14
>>>>>>>>           Hardware: All
>>>>>>>>                 OS: Linux
>>>>>>>>               Tree: Mainline
>>>>>>>>             Status: NEW
>>>>>>>>           Severity: normal
>>>>>>>>           Priority: P1
>>>>>>>>          Component: Infiniband/RDMA
>>>>>>>>           Assignee: drivers_infiniband-rdma@kernel-bugs.osdl.org
>>>>>>>>           Reporter: kolga@netapp.com
>>>>>>>>         Regression: No
>>>>>>>>
>>>>>>>> RoCE RDMA connection uses CMA protocol to establish an RDMA
>>>>>>>> connection. During
>>>>>>>> the setup the code uses hard coded timeout/retry values. These values
>>>>>>>> are used
>>>>>>>> for when Connect Request is not being answered to to re-try the
>>>>>>>> request. During
>>>>>>>> the re-try attempts the ARP updates of the destination server are
>>>>>>>> ignored.
>>>>>>>> Current timeout values lead to 4+minutes long attempt at connecting to
>>>>>>>> a server
>>>>>>>> that no longer owns the IP since the ARP update happens.
>>>>>>>>
>>>>>>>> The ask is to make the timeout/retry values configurable via procfs or
>>>>>>>> sysfs.
>>>>>>>> This will allow for environments that use RoCE to reduce the timeouts
>>>>>>>> to a more
>>>>>>>> reasonable values and be able to react to the ARP updates faster.
>>>>>>>> Other CMA
>>>>>>>> users (eg IB or others) can continue to use existing values.
>>>>>>
>>>>>> I would rather not add a user-facing tunable. The fabric should
>>>>>> be better at detecting addressing changes within a reasonable
>>>>>> time. It would be helpful to provide a history of why the ARP
>>>>>> timeout is so lax -- do certain ULPs rely on it being long?
>>>>>
>>>>> I don't know about ULPs and ARPs, but how to calculate TimeWait is
>>>>> described in the spec.
>>>>>
>>>>> Regarding tunable, I agree. Because it needs to be per-connection, most
>>>>> likely not many people in the world will success to configure it
>>>>> properly.
>>>>
>>>> Maybe we should be disconnecting the cm_id if a gratituous ARP changes
>>>> the MAC address? The cm_id is surely broken after that event right?
>>>
>>> Is there an event on gratuitous ARP? And we also need to notify user-space
>>> application, right?
>>
>> I think there is a net notifier for this?
> 
> NETEVENT_NEIGH_UPDATE may be?

How about doing it like this:

1. In cma.c we do register_netevent_notifier();
2. On each NETEVENT_NEIGH_UPDATE event, in netevent_callback():
    2.1. Allocate a work item (the callback appears to run in atomic context);
    2.2. In the new work:
           foreach(cm_dev) {
               foreach(id_priv) {
                   if ((id_priv.dst_ip == event.ip) &&
                       (id_priv.dst_addr != event.ha)) {

                       /* Anything more to do? */
                       report_event(RDMA_CM_EVENT_ADDR_CHANGE);
                   }
               }
           }
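
A slightly more concrete sketch of steps 1 and 2.1, just to illustrate
(hypothetical code, names invented; locking, IPv6 handling and the per-id
matching from 2.2 are elided):

#include <linux/slab.h>
#include <linux/if_ether.h>
#include <linux/workqueue.h>
#include <net/neighbour.h>
#include <net/netevent.h>

struct cma_neigh_work {
	struct work_struct work;
	u8 ha[ETH_ALEN];
	__be32 ip;
};

static void cma_neigh_work_handler(struct work_struct *_work)
{
	struct cma_neigh_work *w =
		container_of(_work, struct cma_neigh_work, work);

	/* Step 2.2: walk the id_priv lists and report
	 * RDMA_CM_EVENT_ADDR_CHANGE for ids whose destination IP matches
	 * w->ip but whose cached MAC differs from w->ha (elided here). */

	kfree(w);
}

static int cma_netevent_callback(struct notifier_block *nb,
				 unsigned long event, void *ctx)
{
	struct neighbour *neigh = ctx;
	struct cma_neigh_work *w;

	if (event != NETEVENT_NEIGH_UPDATE)
		return NOTIFY_DONE;

	/* Atomic context: defer the list walk to a work item. */
	w = kzalloc(sizeof(*w), GFP_ATOMIC);
	if (!w)
		return NOTIFY_DONE;

	INIT_WORK(&w->work, cma_neigh_work_handler);
	memcpy(w->ha, neigh->ha, ETH_ALEN);
	memcpy(&w->ip, neigh->primary_key, sizeof(w->ip));	/* IPv4 case */
	queue_work(cma_wq, &w->work);	/* cma.c's existing workqueue */

	return NOTIFY_DONE;
}

static struct notifier_block cma_netevent_nb = {
	.notifier_call = cma_netevent_callback,
};

/* Step 1, in cma_init(): register_netevent_notifier(&cma_netevent_nb);
 * and unregister_netevent_notifier(&cma_netevent_nb) in cma_cleanup(). */

The work item carries copies of the IP and MAC rather than the neighbour
pointer, so no reference needs to be held across the deferral.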

And I have these questions:
1. Should we do it in cma.c or cm.c?
2. Should we register only once, or per id? If we register per id,
    there may be many ids;
3. If we do it in cm.c, should we do more, like ib_cancel_mad()?
    Or is reporting an event enough?
4. We need to create a work item on each ARP event; would that be a heavy load?
5. Do we need a new event, instead of RDMA_CM_EVENT_ADDR_CHANGE?
6. What if the peer is not in the same subnet?

Thank you very much.
Comment 13 kolga 2021-12-16 21:07:10 UTC
Any progress on the issue? Has there been a consensus about the approach?
