Bug 118081 - open-iscsi Ping timeout erro
Summary: open-iscsi Ping timeout erro
Status: NEW
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: scsi_drivers-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-05-12 03:34 UTC by Liu zhengyuan
Modified: 2016-05-24 08:08 UTC (History)
0 users

See Also:
Kernel Version: 4.4.7
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Liu zhengyuan 2016-05-12 03:34:45 UTC
Hi everyone:
I create a target using fileio as the backend storage on ARM64 server. The initiator reported some errors showed bellow  while perform iozone test.

[178444.145679]  connection14:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4339462894, last ping 4339464146, now 4339465400
[178444.145706]  connection14:0: detected conn error (1011)
[178469.674313]  connection14:0: detected conn error (1020)
[178504.420979]  connection14:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4339477953, last ping 4339479204, now 4339480456
[178504.421001]  connection14:0: detected conn error (1011)
[178532.064262]  connection14:0: detected conn error (1020)
[178564.584087]  connection14:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4339492980, last ping 4339494232, now 4339495484
..............................

I try to trace the function call of target iscsi. Then, I found the  receiving  thread of target iscsi blocked at fd_execute_sync_cache -> vfs_fsync_range. Further, vfs_fsync_range may takes more than 10 seconds to return,while initiator Ping timeout would happened after 5 seconds.   vfs_fsync_range was call with the form vfs_fsync_range(fd_dev->fd_file, 0, LLONG_MAX, 1) every times  which means sync all device cache. 
So, is this a bug?
How  does Initiator send sync_cache scsi command? 
Does it need to sync all device cache at once?
Any reply would be thankful.
Comment 1 Mike Christie 2016-05-13 16:20:01 UTC
On 05/11/2016 10:34 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=118081
> 
>             Bug ID: 118081
>            Summary: open-iscsi Ping timeout erro
>            Product: SCSI Drivers
>            Version: 2.5
>     Kernel Version: 4.4.7
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>           Assignee: scsi_drivers-other@kernel-bugs.osdl.org
>           Reporter: liuzhengyuang521@gmail.com
>         Regression: No
> 
> Hi everyone:
> I create a target using fileio as the backend storage on ARM64 server. The
> initiator reported some errors showed bellow  while perform iozone test.
> 
> [178444.145679]  connection14:0: ping timeout of 5 secs expired, recv timeout
> 5, last rx 4339462894, last ping 4339464146, now 4339465400
> [178444.145706]  connection14:0: detected conn error (1011)
> [178469.674313]  connection14:0: detected conn error (1020)
> [178504.420979]  connection14:0: ping timeout of 5 secs expired, recv timeout
> 5, last rx 4339477953, last ping 4339479204, now 4339480456
> [178504.421001]  connection14:0: detected conn error (1011)
> [178532.064262]  connection14:0: detected conn error (1020)
> [178564.584087]  connection14:0: ping timeout of 5 secs expired, recv timeout
> 5, last rx 4339492980, last ping 4339494232, now 4339495484
> ..............................
> 
> I try to trace the function call of target iscsi. Then, I found the 
> receiving 
> thread of target iscsi blocked at fd_execute_sync_cache -> vfs_fsync_range.
> Further, vfs_fsync_range may takes more than 10 seconds to return,while
> initiator Ping timeout would happened after 5 seconds.   vfs_fsync_range was
> call with the form vfs_fsync_range(fd_dev->fd_file, 0, LLONG_MAX, 1) every
> times  which means sync all device cache. 
> So, is this a bug?
> How  does Initiator send sync_cache scsi command? 
> Does it need to sync all device cache at once?
> Any reply would be thankful.
> 

The upper layers like the FS or application determine when to send a
sync cache. They send down a request and the iscsi layer just sends it
to the target.

You are using LIO right? It looks like we end up syncing the entire
device sometimes. I think for iscsi pings/Nops that have the immediate
bit set, the target would want to reply to them right away. They should
not be getting stuck behind these type of commands.

Nick, what do you think?
Comment 2 Liu zhengyuan 2016-05-17 03:46:03 UTC
Thanks for your replay.  Yes, I am using LIO and I had make a XFS file system on the LUN at the Initiator side . As you said,  the sync_cache command was issued by XFS log infrastructure actually. When I replace the XFS with EXT2 that dmesg error don`t appear anymore during the iozone test.  I think pings/Nops got no response  because it still stay in TCP stack not received by the target rx-thread. 
It is not good for me to make a file system on the LUN based on a FILEIO backstore?
The Ping time out would lead to upper layers application like iozone to faild when issued sync at the end of IO?

> On 05/11/2016 10:34 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=118081
> > 
> >             Bug ID: 118081
> >            Summary: open-iscsi Ping timeout erro
> >            Product: SCSI Drivers
> >            Version: 2.5
> >     Kernel Version: 4.4.7
> >           Hardware: All
> >                 OS: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Other
> >           Assignee: scsi_drivers-other@kernel-bugs.osdl.org
> >           Reporter: liuzhengyuang521@gmail.com
> >         Regression: No
> > 
> > Hi everyone:
> > I create a target using fileio as the backend storage on ARM64 server. The
> > initiator reported some errors showed bellow  while perform iozone test.
> > 
> > [178444.145679]  connection14:0: ping timeout of 5 secs expired, recv
> timeout
> > 5, last rx 4339462894, last ping 4339464146, now 4339465400
> > [178444.145706]  connection14:0: detected conn error (1011)
> > [178469.674313]  connection14:0: detected conn error (1020)
> > [178504.420979]  connection14:0: ping timeout of 5 secs expired, recv
> timeout
> > 5, last rx 4339477953, last ping 4339479204, now 4339480456
> > [178504.421001]  connection14:0: detected conn error (1011)
> > [178532.064262]  connection14:0: detected conn error (1020)
> > [178564.584087]  connection14:0: ping timeout of 5 secs expired, recv
> timeout
> > 5, last rx 4339492980, last ping 4339494232, now 4339495484
> > ..............................
> > 
> > I try to trace the function call of target iscsi. Then, I found the 
> receiving 
> > thread of target iscsi blocked at fd_execute_sync_cache -> vfs_fsync_range.
> > Further, vfs_fsync_range may takes more than 10 seconds to return,while
> > initiator Ping timeout would happened after 5 seconds.   vfs_fsync_range
> was
> > call with the form vfs_fsync_range(fd_dev->fd_file, 0, LLONG_MAX, 1) every
> > times  which means sync all device cache. 
> > So, is this a bug?
> > How  does Initiator send sync_cache scsi command? 
> > Does it need to sync all device cache at once?
> > Any reply would be thankful.
> > 
> 
> The upper layers like the FS or application determine when to send a
> sync cache. They send down a request and the iscsi layer just sends it
> to the target.
> 
> You are using LIO right? It looks like we end up syncing the entire
> device sometimes. I think for iscsi pings/Nops that have the immediate
> bit set, the target would want to reply to them right away. They should
> not be getting stuck behind these type of commands.
> 
> Nick, what do you think?
Comment 3 nab 2016-05-24 07:10:13 UTC
Hey Mike & Co,

Apologies for missing this bug-report.  Comments below.

(Adding target-devel CC')

On Fri, 2016-05-13 at 11:19 -0500, Mike Christie wrote:
> On 05/11/2016 10:34 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=118081
> > 
> >             Bug ID: 118081
> >            Summary: open-iscsi Ping timeout erro
> >            Product: SCSI Drivers
> >            Version: 2.5
> >     Kernel Version: 4.4.7
> >           Hardware: All
> >                 OS: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Other
> >           Assignee: scsi_drivers-other@kernel-bugs.osdl.org
> >           Reporter: liuzhengyuang521@gmail.com
> >         Regression: No
> > 
> > Hi everyone:
> > I create a target using fileio as the backend storage on ARM64 server. The
> > initiator reported some errors showed bellow  while perform iozone test.
> > 
> > [178444.145679]  connection14:0: ping timeout of 5 secs expired, recv
> timeout
> > 5, last rx 4339462894, last ping 4339464146, now 4339465400
> > [178444.145706]  connection14:0: detected conn error (1011)
> > [178469.674313]  connection14:0: detected conn error (1020)
> > [178504.420979]  connection14:0: ping timeout of 5 secs expired, recv
> timeout
> > 5, last rx 4339477953, last ping 4339479204, now 4339480456
> > [178504.421001]  connection14:0: detected conn error (1011)
> > [178532.064262]  connection14:0: detected conn error (1020)
> > [178564.584087]  connection14:0: ping timeout of 5 secs expired, recv
> timeout
> > 5, last rx 4339492980, last ping 4339494232, now 4339495484
> > ..............................
> > 
> > I try to trace the function call of target iscsi. Then, I found the 
> receiving 
> > thread of target iscsi blocked at fd_execute_sync_cache -> vfs_fsync_range.
> > Further, vfs_fsync_range may takes more than 10 seconds to return,while
> > initiator Ping timeout would happened after 5 seconds.   vfs_fsync_range
> was
> > call with the form vfs_fsync_range(fd_dev->fd_file, 0, LLONG_MAX, 1) every
> > times  which means sync all device cache. 
> > So, is this a bug?
> > How  does Initiator send sync_cache scsi command? 
> > Does it need to sync all device cache at once?
> > Any reply would be thankful.
> > 
> 
> The upper layers like the FS or application determine when to send a
> sync cache. They send down a request and the iscsi layer just sends it
> to the target.
> 
> You are using LIO right? It looks like we end up syncing the entire
> device sometimes. I think for iscsi pings/Nops that have the immediate
> bit set, the target would want to reply to them right away. They should
> not be getting stuck behind these type of commands.
> 
> Nick, what do you think?
> 

In modern iscsi-target code, the backend sbc_ops->execute_sync_cache()
call is invoked directly from iscsi_trx kthread process context.

For FILEIO backends, this can block immediate iscsi commands like NOPs
on the same connection (eg: socket) processing a SYNCHRONIZE_CACHE CDB
that is taking a long time to complete.

Wrt to Liu's follow-up question, an initiator should attempt to retry
all outstanding commands that did not receive a response once the
initiator side NopOut timeout has fired.

So from an application perspective, the initiator NopOut timeout and
subsequent iscsi session reinstatement should not be propagating up a
SYNCHRONIZE_CACHE failure unless the default 120 second Linux/iSCSI
initiator I/O timeout has elapsed.

Liu, are you seeing something different wrt iozone on a Linux host..?
Comment 4 Liu zhengyuan 2016-05-24 08:08:34 UTC
Hi, nab:
Did you mean doing iozone test on a local disk LUN not a iscsi LUN from the word "are you seeing something different wrt iozone on a Linux host..?"?  If so, I didn`t found any different between this two LUN from application perspective. As for iscsi LUN, the iozone would abort due to write system call failure.
(In reply to nab from comment #3)
> Hey Mike & Co,
> 
> Apologies for missing this bug-report.  Comments below.
> 
> (Adding target-devel CC')
> 
> On Fri, 2016-05-13 at 11:19 -0500, Mike Christie wrote:
> > On 05/11/2016 10:34 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=118081
> > > 
> > >             Bug ID: 118081
> > >            Summary: open-iscsi Ping timeout erro
> > >            Product: SCSI Drivers
> > >            Version: 2.5
> > >     Kernel Version: 4.4.7
> > >           Hardware: All
> > >                 OS: Linux
> > >               Tree: Mainline
> > >             Status: NEW
> > >           Severity: normal
> > >           Priority: P1
> > >          Component: Other
> > >           Assignee: scsi_drivers-other@kernel-bugs.osdl.org
> > >           Reporter: liuzhengyuang521@gmail.com
> > >         Regression: No
> > > 
> > > Hi everyone:
> > > I create a target using fileio as the backend storage on ARM64 server.
> The
> > > initiator reported some errors showed bellow  while perform iozone test.
> > > 
> > > [178444.145679]  connection14:0: ping timeout of 5 secs expired, recv
> timeout
> > > 5, last rx 4339462894, last ping 4339464146, now 4339465400
> > > [178444.145706]  connection14:0: detected conn error (1011)
> > > [178469.674313]  connection14:0: detected conn error (1020)
> > > [178504.420979]  connection14:0: ping timeout of 5 secs expired, recv
> timeout
> > > 5, last rx 4339477953, last ping 4339479204, now 4339480456
> > > [178504.421001]  connection14:0: detected conn error (1011)
> > > [178532.064262]  connection14:0: detected conn error (1020)
> > > [178564.584087]  connection14:0: ping timeout of 5 secs expired, recv
> timeout
> > > 5, last rx 4339492980, last ping 4339494232, now 4339495484
> > > ..............................
> > > 
> > > I try to trace the function call of target iscsi. Then, I found the 
> receiving 
> > > thread of target iscsi blocked at fd_execute_sync_cache ->
> vfs_fsync_range.
> > > Further, vfs_fsync_range may takes more than 10 seconds to return,while
> > > initiator Ping timeout would happened after 5 seconds.   vfs_fsync_range
> was
> > > call with the form vfs_fsync_range(fd_dev->fd_file, 0, LLONG_MAX, 1)
> every
> > > times  which means sync all device cache. 
> > > So, is this a bug?
> > > How  does Initiator send sync_cache scsi command? 
> > > Does it need to sync all device cache at once?
> > > Any reply would be thankful.
> > > 
> > 
> > The upper layers like the FS or application determine when to send a
> > sync cache. They send down a request and the iscsi layer just sends it
> > to the target.
> > 
> > You are using LIO right? It looks like we end up syncing the entire
> > device sometimes. I think for iscsi pings/Nops that have the immediate
> > bit set, the target would want to reply to them right away. They should
> > not be getting stuck behind these type of commands.
> > 
> > Nick, what do you think?
> > 
> 
> In modern iscsi-target code, the backend sbc_ops->execute_sync_cache()
> call is invoked directly from iscsi_trx kthread process context.
> 
> For FILEIO backends, this can block immediate iscsi commands like NOPs
> on the same connection (eg: socket) processing a SYNCHRONIZE_CACHE CDB
> that is taking a long time to complete.
> 
> Wrt to Liu's follow-up question, an initiator should attempt to retry
> all outstanding commands that did not receive a response once the
> initiator side NopOut timeout has fired.
> 
> So from an application perspective, the initiator NopOut timeout and
> subsequent iscsi session reinstatement should not be propagating up a
> SYNCHRONIZE_CACHE failure unless the default 120 second Linux/iSCSI
> initiator I/O timeout has elapsed.
> 
> Liu, are you seeing something different wrt iozone on a Linux host..?

Note You need to log in before you can comment on or make changes to this bug.