Bug 76851 - inotify_rm_watch(2) unspecified behavior
Summary: inotify_rm_watch(2) unspecified behavior
Status: NEW
Alias: None
Product: Documentation
Classification: Unclassified
Component: man-pages (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: documentation_man-pages@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-05-24 20:35 UTC by Jeff
Modified: 2014-05-30 12:34 UTC (History)
2 users (show)

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
test_wd_reuse.c - Test watch descriptor reuse (3.79 KB, text/plain)
2014-05-29 17:03 UTC, Heinrich Schuchardt
Details

Description Jeff 2014-05-24 20:35:24 UTC
The inotify interface leaves unspecified what happens to potentially "pending" watch events (i.e., caught by the kernel but not yet read()) after an inotify_rm_watch() call.

IN_IGNORE flagged events *allow* inotify_rm_watch() calls to be monitored asynchronously, but it is not stated whether triggering actions between a most recent read() and an inotify_rm_watch() call *require* all other watch events to be handled so when users key map inotify_event wd fields to local objects.

Similarly, a conservative user would need to keep a queue of removed objects formerly associated with a watch descriptor if it is assumed possible that the kernel can both deliver events on an rm'd wd and reuse wds before the last pertinent event has been read.

If the purpose of IN_IGNORE being generated on the watch removal is to protect against poll/blocking-read vs. inotify_rm_watch() races, it should be plainly stated. Likewise, if it is needed for safely reacting to "stale" events, this should be explained. I believe that the current language of inotify_rm_watch(2) and the read(2) section of inotify(7) leave both interpretations defensible, which is sub-optimal.
Comment 1 Jeff 2014-05-24 21:36:15 UTC
For what it's worth, neither Documentation/filesystems/inotify.txt nor fs/notify/inotify/inotify_user.c itself in the kernel source plainly comment on the issue. As far as I can tell, nothing besides the implementation standardizes the behavior.
Comment 2 Michael Kerrisk 2014-05-26 09:10:02 UTC
(In reply to Jeff from comment #0)
> The inotify interface leaves unspecified what happens to potentially
> "pending" watch events (i.e., caught by the kernel but not yet read()) after
> an inotify_rm_watch() call.
> 
> IN_IGNORE flagged events *allow* inotify_rm_watch() calls to be monitored
> asynchronously, but it is not stated whether triggering actions between a
> most recent read() and an inotify_rm_watch() call *require* all other watch
> events to be handled so when users key map inotify_event wd fields to local
> objects.
> 
> Similarly, a conservative user would need to keep a queue of removed objects
> formerly associated with a watch descriptor if it is assumed possible that
> the kernel can both deliver events on an rm'd wd and reuse wds before the
> last pertinent event has been read.
> 
> If the purpose of IN_IGNORE being generated on the watch removal is to
> protect against poll/blocking-read vs. inotify_rm_watch() races, it should
> be plainly stated. Likewise, if it is needed for safely reacting to "stale"
> events, this should be explained. I believe that the current language of
> inotify_rm_watch(2) and the read(2) section of inotify(7) leave both
> interpretations defensible, which is sub-optimal.


Jeff, thanks for the report. You're quite correct that there's no spec for the behavior (and your points about the implications of the lack of such spec are right on the money). However, man-pages is (sadly) not a spec, at least given current development practices. Nevertheless, something should be documented.

One thing that would have been useful in the bug report would have been a statement of what the existing behavior actually is. As far as I can see, as things stand, if a watch is removed, pending unread events for that watch descriptor remain pending. Can you confirm?
Comment 3 Jeff 2014-05-26 16:38:43 UTC
(In reply to Michael Kerrisk from comment #2)
> Jeff, thanks for the report. You're quite correct that there's no spec for
> the behavior (and your points about the implications of the lack of such
> spec are right on the money). However, man-pages is (sadly) not a spec, at
> least given current development practices. Nevertheless, something should be
> documented.
> 
> One thing that would have been useful in the bug report would have been a
> statement of what the existing behavior actually is. As far as I can see, as
> things stand, if a watch is removed, pending unread events for that watch
> descriptor remain pending. Can you confirm?

Michael, I can confirm that this behavior, for at least 3.x kernels. I didn't test this or dig through the kernel source for anything older though.

I've not been involved enough in the kernel scene to know best how to approach this, so far as claiming what exactly seemed wrong, but the documentation seemed like the most innocuous reasonable place to start. I frankly don't even know where "standard" system call APIs are supposed to be officially specified, if anywhere.

I tried to be dispassionate in my initial report since I don't know what peoples' intents were, but it's my opinion that the current behavior is bad enough to warrant changing. I've never used any other multiplexing interface that can deliver events after a handle de-registration. I've also never seen anyone's inotify-handling code other than my own that even tried to safely handle late queued updates, and I suspect that some fraction of those people are ignorant of the hazards.

As I tried to imply above, extant IN_IGNORE event generation would provide sufficient safety from client poll/rm/read deadlock races in the event of an implementation change, but I don't have the slightest clue about tactfully approaching the kernel community with this, being a non-established individual.
Comment 4 Jeff 2014-05-27 16:25:24 UTC
I should also add that my assumptions about watch descriptor reuse may very well be overly cautious. If there is a guarantee that a wd will never be returned by a inotify_add_watch() call until nothing with that wd remains in the read queue, clients could handle deregistration semi-synchronously by knowing to drop locally unmapped events.

This is something that I've not investigated recently, and I am completely uncertain of any guarantees, but the kernel folks may have already done this and just not passed the information along to users.
Comment 5 Heinrich Schuchardt 2014-05-29 17:03:35 UTC
Created attachment 137681 [details]
test_wd_reuse.c - Test watch descriptor reuse

The test program demonstrates that inotify_add_watch may return a watch descriptor ID for which events still exist on the inotify queue.

It creates one watch for file "0" to demonstrate the problem.
Afterwards it creates and removes watches again and again just to force idr_alloc_cyclic to reach INT_MAX.
Then events are created for file "0" on the queue.
The watch for "0" is removed.
A watch for another file is created.

Example program output (after a few hours, on Linux 3.14.4 x86_64):

...
2147459044      2147467235      2147475426      2147483617      2147483647      
Preparation done
BINGO Collision detected
Watch descriptor 1 for /tmp/test/8192
Watch descriptor 1 for /tmp/test/0
Comment 6 Heinrich Schuchardt 2014-05-29 17:05:01 UTC
Watch descriptor IDs are returned by inotify_add_watch.
When calling inotify_rm_watch an IN_IGNORE is placed on the inotify queue pointing to the ID of the removed watch.

inotify_add_watch should not return a watch descriptor ID for which events are still on the queue but should return an unused ID.

Unfortunately the existing Kernel code does not provide such a guarantee.

Actually in rare cases watch descriptor IDs are returned by inotify_add_watch for which events are still on the inotify queue.

I verified the problem exists with the appended test program (test_wd_reuse.c).
Comment 7 Michael Kerrisk 2014-05-30 12:24:00 UTC
(In reply to Jeff from comment #3)

Hi Jeff,

> I tried to be dispassionate in my initial report since I don't know what
> peoples' intents were, but it's my opinion that the current behavior is bad
> enough to warrant changing. I've never used any other multiplexing interface
> that can deliver events after a handle de-registration. I've also never seen
> anyone's inotify-handling code other than my own that even tried to safely
> handle late queued updates, 

(I have some code that handles it.)

> and I suspect that some fraction of those people
> are ignorant of the hazards.

I expect you are right,

> As I tried to imply above, extant IN_IGNORE event generation would provide
> sufficient safety from client poll/rm/read deadlock races in the event of an
> implementation change, but I don't have the slightest clue about tactfully
> approaching the kernel community with this, being a non-established
> individual.

I'm not sure I understand the sentence "As I tried to imply...", but I'm guessing you mean that it would be sufficient to discard all pending events except the IN_IGNORED event, right?

It depends on the use case, I guess, some folk would, I expect, want to do log tracking -- that is, record all events that happened on a file. Such applications wouldn't want events to be discarded.

Cheers,

Michael
Comment 8 Michael Kerrisk 2014-05-30 12:25:42 UTC
(In reply to Jeff from comment #4)
> I should also add that my assumptions about watch descriptor reuse may very
> well be overly cautious. If there is a guarantee that a wd will never be
> returned by a inotify_add_watch() call until nothing with that wd remains in
> the read queue [...]


And that's the nub of the issue of course. As I suspected, and Heinrich has demonstrated in #5, the kernel does not provide such a guarantee.
Comment 9 Michael Kerrisk 2014-05-30 12:34:02 UTC
(In reply to Heinrich Schuchardt from comment #5)
> Created attachment 137681 [details]
> test_wd_reuse.c - Test watch descriptor reuse
> 
> The test program demonstrates that inotify_add_watch may return a watch
> descriptor ID for which events still exist on the inotify queue.
> 
> It creates one watch for file "0" to demonstrate the problem.
> Afterwards it creates and removes watches again and again just to force
> idr_alloc_cyclic to reach INT_MAX.
> Then events are created for file "0" on the queue.
> The watch for "0" is removed.
> A watch for another file is created.
> 
> Example program output (after a few hours, on Linux 3.14.4 x86_64):
> 
> ...
> 2147459044      2147467235      2147475426      2147483617      2147483647  
> 
> Preparation done
> BINGO Collision detected
> Watch descriptor 1 for /tmp/test/8192
> Watch descriptor 1 for /tmp/test/0

Nice work Heinrich! I pretty much suspected this would be the case, even before looking at the kernel code, and it's nice to see a demonstration[1] of the problem. (I had in mind to code something similar myself, just to be sure, but you beat me to it.)

Anyway, this recycling behavior, coupled with "events remain pending after inotify_rm_watch()" behavior that Jeff reports mean that yes indeed we can events on a recycled watch descriptor that do indeed belong to a previous incarnation of that WD. On the other hand, you have to work pretty hard to do this: you have to fail to read your queued events while at the same time cycling through INT_MAX watch descriptors. I suppose no application is going to run into that scenario in a hurry.

Cheers,

Michael

[1] No need to wait for hours when you test, Just rebuild a kernel with the following edit in fs/notify/inotify/inotify_user.c::inotify_add_to_idr()

-        ret = idr_alloc_cyclic(idr, i_mark, 1, 0, GFP_NOWAIT);
-        ret = idr_alloc_cyclic(idr, i_mark, 1, 500000, GFP_NOWAIT);

That's sufficient as a proof of concept of the problem.

Note You need to log in before you can comment on or make changes to this bug.