Bug 213869 - NFS Video Stream Frequently Drops With Kernel 5.13.5
Summary: NFS Video Stream Frequently Drops With Kernel 5.13.5
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: bfields
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-07-27 04:25 UTC by Chad Stobbie
Modified: 2021-09-06 22:06 UTC
CC: 4 users

See Also:
Kernel Version: 5.13.5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Log from Kodi (1.08 MB, application/octet-stream)
2021-07-27 04:25 UTC, Chad Stobbie
Details

Description Chad Stobbie 2021-07-27 04:25:51 UTC
Created attachment 298053 [details]
Log from Kodi

I use Kodi to stream video via NFSv3. Both the client running Kodi and the server are Linux x86-64 PCs. The NFS is mounted within Kodi using the libnfs library version 4.0.0. 

I recently upgraded both PCs to the 5.13 kernel, and noticed that video frequently stops in the middle of playback. I took a Kodi log and attached it to this report. To summarize, the error in the log is:

ERROR <general>: Read - Error( -4, read call failed with "Command was cancelled" )

I looked at the source code for libnfs. The "Command was cancelled" error is triggered when a "status" variable is equal to a macro titled "RPC_STATUS_CANCEL", which is defined as 2. I presume this status code is what is being received from the server.
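To illustrate the check described above, here is a minimal C sketch of the client-side status handling, assuming the status values as reported (RPC_STATUS_CANCEL = 2); the helper function rpc_status_str is hypothetical, and the real definitions live in libnfs's headers:

```c
#include <string.h>

/* Status values as described in this report; the authoritative
 * definitions are in the libnfs source, not here. */
enum rpc_status {
    RPC_STATUS_SUCCESS = 0,
    RPC_STATUS_ERROR   = 1,
    RPC_STATUS_CANCEL  = 2,  /* surfaces in Kodi as "Command was cancelled" */
};

/* Hypothetical helper mirroring the check found in the libnfs source:
 * map an RPC completion status to the error string seen in the log. */
static const char *rpc_status_str(int status)
{
    switch (status) {
    case RPC_STATUS_CANCEL:
        return "Command was cancelled";
    case RPC_STATUS_ERROR:
        return "Command failed";
    default:
        return "Success";
    }
}
```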

Downgrading the server to its previous kernel version 5.12.11 fixed the issue. Therefore, I believe that one of the commits to the NFS code in 5.13 is the cause of this bug. I am happy to provide any other information or logs if they would be helpful in diagnosing the issue.
Comment 1 bfields 2021-07-27 13:16:22 UTC
The bugzilla kernel version field says 5.13.5, but the description refers to 5.13.  Which version are you running on the server?

I ask because there is a known bug in the 5.13 nfs server, but it's fixed by 7f6cbd0b3e1e "mm/page_alloc: correct return value of populated elements if bulk array is populated", which went into 5.13.1.
Comment 2 Chad Stobbie 2021-07-27 17:09:53 UTC
The version is 5.13.5. I was referring to the 5.13 kernel series generally, not version 5.13 specifically. I upgraded from 5.12.x directly to 5.13.4, and then to 5.13.5. I never used 5.13 or 5.13.1. 

Sorry for any confusion, I have corrected the title.
Comment 3 bfields 2021-07-30 16:05:09 UTC
Got it, thanks.

I'm not sure how to proceed.

From a quick look I don't think RPC_STATUS_CANCEL is anything directly returned by the server.  It may be a sign that there was a timeout or that the connection was dropped?  Some further debugging of libnfs to figure out exactly what's happening in that case might help.

It's often useful to collect the network traffic (e.g. with "tcpdump -s0 -wtmp.pcap") but video streaming may generate rather a lot of it.
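To keep the capture from a long streaming session manageable, one option is to filter on the NFS port and rotate the output files; the interface name (eth0) is an assumption and should be adjusted for your system:

```shell
# Capture only NFS traffic (port 2049) at full snap length (-s0),
# rotating through five 100 MB files so the disk doesn't fill up.
tcpdump -i eth0 -s0 port 2049 -C 100 -W 5 -w nfs.pcap
```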

Bisecting versions between v5.12.11 and 5.13.4 would be tedious, but helpful if it's possible.
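A rough sketch of how such a bisect might be driven against the mainline tree; restricting it to the NFS server and SUNRPC directories is an assumption that the regression lies there, but it would cut down the number of steps:

```shell
# v5.13.4 is known bad, v5.12.11 known good; limit the bisect
# to the server-side NFS code paths to reduce the step count.
git bisect start v5.13.4 v5.12.11 -- fs/nfsd net/sunrpc

# For each kernel git checks out: build, boot, run the streaming
# test for an hour or so, then report the result:
git bisect good   # if playback survived
git bisect bad    # if it stalled
```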
Comment 4 aduvnjak 2021-08-29 16:09:31 UTC
I can confirm the same is happening here.  It's not limited to Kodi: I see degraded performance using the native Windows NFS client and the macOS client.  I've stayed back on the 5.12.x series (currently running 5.12.19) due to the issue.  All the 5.13.x releases I've tried, and 5.14.0-rc7, have the issue.

I did a bisect back when reporting a similar error in 5.10.x, where Kodi, Windows, and macOS had issues (specifically commit 8e081627f3a7f733a4955ee40b385c972f010f05, "SUNRPC: Fix NFS READs that start at non-page-aligned offsets"), which seemed to help in the creation of a patch for 5.10.15.

The only thing is, this one is harder to verify quickly (I would need to run each kernel for at least an hour or so copying files).  On top of that, I imagine the changes between 5.12.19 and 5.13.x would be substantial.  It might be a gigantic task, but if there is no other way to proceed I'm willing to give it a go.

Out of interest, would it be possible to manually compile 5.12.19 with the relevant NFS changes from the 5.13.x branch, and then do the reverse (compile 5.13.13 with the relevant NFS files from 5.12.19), just to do a double dissociation and confirm that is where the issue lies?  That might be a quick way to at least get started?
Comment 5 aduvnjak 2021-08-29 16:21:28 UTC
Sorry, I forgot to add that I believe it's an NFS bug, purely because other file sharing/transfer protocols (AFP, for example) aren't affected in the jump from 5.12.x to 5.13.x.

Any suggestions on how to proceed to trace the issue down would be appreciated.
Comment 6 bfields 2021-08-29 20:23:16 UTC
The patch mentioned here might be relevant: https://bugzilla.kernel.org/show_bug.cgi?id=213887#c14 (though as Neil says, only if you're setting the svc_rpc_per_connection_limit module parameter to something non-zero).

Anj Duvnjak, could you describe your symptoms in more detail?  It's not clear to me why it's necessarily the same bug as Chad Stobbie's.
Comment 7 aduvnjak 2021-08-29 22:00:57 UTC
I might have been overly presumptuous in thinking it's the same bug.  Sorry, I jumped the gun on that.

My Kodi symptom is similar to Chad's: using Kodi's internal NFS library to mount a folder and then streaming a video over NFS causes occasional stutters and the occasional stop.  This only occurs when the server is running 5.13.x, not 5.12.x.  I haven't checked the Kodi logs to see if I have the same error as Chad.

In my case, what I've noticed across three clients (Kodi, the Windows 10 native NFS client, and macOS NFS mounts) is that I get sudden stops (or extreme slowdowns) when transferring decent amounts of data.  So streaming a 4K video or a DSD128 audio file will stutter or fail occasionally.  These stalls might last a few seconds at a time before normal speed is restored, and then repeat at seemingly random points.

What I have been doing lately to test whether a kernel version is misbehaving is to copy a large directory (say 50GB) over NFS in Windows 10.  Windows 10 shows a handy little graph of transfer speed when copying files.  If all is working well (i.e. running 5.12.x), I see a steady line of approximately 85MB/s throughout the entire transfer.  When running 5.13.x, I will usually not reach the end of the 50GB transfer without seeing at least one dive down to 15MB/s (or often even lower).
Comment 8 aduvnjak 2021-08-29 22:17:32 UTC
I should have mentioned, in relation to that patch, that sadly cat /sys/module/sunrpc/parameters/svc_rpc_per_connection_limit shows 0 here.
Comment 9 bfields 2021-08-30 15:11:25 UTC
Thanks for the additional detail.

Looks like Neil Brown spotted the problem; could you try this patch?:

https://lore.kernel.org/linux-nfs/162915504980.9892.4132343755469951234@noble.neil.brown.name/T/#mf3f1199dfa591215f8df6f91d4d96cdb85f7e990

I think it could explain the symptoms that both of you are seeing.
Comment 10 aduvnjak 2021-08-30 16:52:50 UTC
Looks like that's done the trick.  To test, I installed kernel 5.13.13 (unpatched) and confirmed the issue was there (I normally see a dip or stall within 20GB of file transfers, and multiple times within a 50GB transfer).

After patching 5.13.13, I have transferred in excess of 300GB, completely rock solid.  I'd say that's it!
Comment 11 Chad Stobbie 2021-08-31 06:06:43 UTC
I recently upgraded to kernel 5.13.13 and can also confirm that the issue persists there.  That patch looks very promising.  I am currently away from home on a trip, but I will test the patch this weekend after I get back and report the results.
Comment 12 Chad Stobbie 2021-09-06 04:26:39 UTC
I applied the patch to 5.13.13, and it's working for me too. I've streamed about 10 hours of video to test it, and haven't had a single drop. I'm going to mark this bug as RESOLVED. Thanks for your support in tracking this one down. Will this patch be merged into a 5.14.x stable release?
Comment 13 bfields 2021-09-06 22:06:41 UTC
(In reply to Chad Stobbie from comment #12)
> Thanks for your support in tracking this one
> down. Will this patch be merged into a 5.14.x stable release?

I'd guess so, but I'm a little hazy on the stable process.  I'd say give it a week or so and then ask if it hasn't happened.
