Bug 38572

Summary: New NFS: directory XXX contains a readdir loop seems to be triggered by well-behaving server
Product: File System Reporter: Petr Vandrovec (petr)
Component: NFSAssignee: Trond Myklebust (trondmy)
Status: CLOSED CODE_FIX    
Severity: normal CC: akpm, anna.schumaker, difrost.kernel, florian, maciej.rutecki, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.0.0-rc4-00096 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 32012    
Attachments: wireshark decoded trace
[PATCH] NFS: Fix spurious readdir cookie loop messages
[PATCH v2] NFS: Fix spurious readdir cookie loop messages
[PATCH v3] NFS: Fix spurious readdir cookie loop messages

Description Petr Vandrovec 2011-06-30 19:38:54 UTC
Created attachment 64052 [details]
wireshark decoded trace

When I run

for a in `seq 1 1000`; do mkdir $a; touch $a-test; done
for a in `seq 1 1000`; do rmdir $a; done
for a in `seq 1 1000`; do rm $a-test; done

in one terminal, and

while true; do ls; done

in the another, every now and then kernel says:

Jun 30 11:38:36 petr-dev3 kernel: [65277.074773] NFS: directory petr/asd contains a readdir loop.  Please contact your server vendor.  Offending cookie: 41912
Jun 30 11:38:37 petr-dev3 kernel: [65277.520933] NFS: directory petr/asd contains a readdir loop.  Please contact your server vendor.  Offending cookie: 42904

First I talked to our IT, but after capturing packets I cannot see anything wrong on server's side.  In wireshark capture I see that cookie 41912 was returned twice for the same readdir loop (server does not use cookieverf, it returns 0 always):

petr-dev3:~# grep '\(cookie: 0\|Cookie: 41912\|EOF: 1\)' trace.txt4
    cookie: 0
        Cookie: 41912
        Cookie: 41912
    cookie: 0
        Cookie: 41912
        Cookie: 41912

But it did so because client issued very same readdir request twice, completely ignoring all data from first request:

petr-dev3:~# grep '\(cookie:\|Cookie: 41912\|EOF: 1\)' trace.txt4 | grep -C2 41912
    cookie: 40584
    cookie: 41744
        Cookie: 41912
    cookie: 41744
        Cookie: 41912
    cookie: 42904
    cookie: 44068
--
    cookie: 40616
    cookie: 41776
        Cookie: 41912
    cookie: 41776
        Cookie: 41912
    cookie: 42936
    cookie: 44100

Textual decoded trace of readdirplus cycle which triggered kernel messages is attached.  Non-decoded trace of full session is 80MB :-(
Comment 1 Andrew Morton 2011-06-30 19:44:49 UTC
This is marked as a regression.  From which kernel version?

Thanks.
Comment 2 Petr Vandrovec 2011-06-30 19:52:58 UTC
Message is new since March.  It was not present in 2.6.38, but it is present in 2.6.39 (and 3.0.0-rc).  My testing with 2.6.38 kernel shows no message, and 'ls -U' does not report any duplicates.

commit 8ef2ce3e16d9bec6cf015207c1c82a5b864046ac
Author: Bryan Schumaker <bjschuma@netapp.com>
Date:   Wed Mar 23 15:04:31 2011 -0400

    NFS: Detect loops in a readdir due to bad cookies

    Some filesystems (such as ext4) can return the same cookie value for
    multiple files.  If we try to start a readdir with one of these cookies,
    the server will return the first file found with a cookie of the same
    value.  This can cause the client to enter an infinite loop.

    Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Comment 3 Trond Myklebust 2011-06-30 22:17:09 UTC
Hmm... So you are basically alternating between growing and shrinking
the directory, while running 'ls'?

I'll see if I can look into that over the vacation.
Comment 4 Petr Vandrovec 2011-06-30 23:31:38 UTC
Not really.  One script creates and then deletes lots of files & directories, while other script list directory content.  I've never hit problem when first script is in creation phase, only 'ls' during deletion phase generates duplicate warning.

Originally I noticed problem by deleting directory content while watching progress with 'ls | wc -l' ...
Comment 5 Trond Myklebust 2011-07-01 18:22:25 UTC
Created attachment 64402 [details]
[PATCH] NFS: Fix spurious readdir cookie loop messages

If the directory contents change, then we have to accept that the
file->f_pos value may shrink if we do a 'search-by-cookie'. In that
case, we should turn off the loop detection and let the NFS client
try to recover.

Does this patch help?
Comment 6 Petr Vandrovec 2011-07-08 00:15:25 UTC
I'm afraid that for whatever reason it does not.

[539613.027512] FS-Cache: Netfs 'nfs' registered for caching
[539650.085051] NFS: directory petr/x contains a readdir loop.  Please contact your server vendor.  Offending cookie: 42528
[539652.812132] NFS: directory petr/x contains a readdir loop.  Please contact your server vendor.  Offending cookie: 54200
[539654.673139] NFS: directory petr/x contains a readdir loop.  Please contact your server vendor.  Offending cookie: 62156
[539655.846894] NFS: directory petr/x contains a readdir loop.  Please contact your server vendor.  Offending cookie: 66900
Comment 7 Jacek Luczak 2011-07-20 14:10:31 UTC
I guess I got same issue. Will try to trace cookies but looks like this is not ,,vendor'' problem.
Comment 8 Trond Myklebust 2011-07-28 17:02:07 UTC
Created attachment 67042 [details]
[PATCH v2] NFS: Fix spurious readdir cookie loop messages

Bryan Schumaker reports that this slight variant on the original patch
appears to fix the spurious loop warnings in his testing.

Could you please check if it works for you.
Comment 9 Trond Myklebust 2011-07-28 18:25:31 UTC
Gah... Never mind, that one is buggy too. I'm working on a fix, and will
have Bryan test before I ask anyone else to do so...
Comment 10 Trond Myklebust 2011-07-28 20:38:38 UTC
Created attachment 67052 [details]
[PATCH v3] NFS: Fix spurious readdir cookie loop messages

OK. This one seems to have passed everything we have thrown at it so far.
In addition to the bug that was fixed by the v2 patch, it also fixes an
issue whereby the loop detection would kick in too early.
Comment 11 Florian Mickler 2011-08-09 07:32:41 UTC
The patch seems to have been merged in v3.1-rc1: 

commit 0c0308066ca53fdf1423895f3a42838b67b3a5a8
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Sat Jul 30 12:45:35 2011 -0400

    NFS: Fix spurious readdir cookie loop messages
Comment 12 Jacek Luczak 2011-08-10 11:41:26 UTC
Trond looks like your patch is correct and fix the issue in my case.