Bug 205933

Summary: There seems to be an in-kernel race between epoll_wait and epoll_ctl which leads to lost events on an eventfd file descriptor
Product: IO/Storage Reporter: Max Neunhöffer (max)
Component: OtherAssignee: io_other
Status: NEW ---    
Severity: normal CC: joh82875, leitao
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: Stock 5.4.0 from tag v5.4 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Program which shows the bug.
Kernel configuration of Stock 5.4 kernel which shows the behaviour
Another program showing a variant of the buggy behaviour.
Kernel config for commit a218cc4914209ac14476cb32769b31a556355b22 to see the problem.
Kernel config for commit c3e320b61581ef7919269ca242ff13951ccfc763 to see the problem not yet there.
Example program for bug with level triggered events.

Description Max Neunhöffer 2019-12-20 15:52:24 UTC
Created attachment 286381 [details]
Program which shows the bug.

To whom it may concern,

We are developing a network server written in C++ (https://arangodb.com) based on boost::asio. We observe that very occasionally wakeup events are missed in the main event loop.

We have narrowed it down to a race between the epoll_wait call waiting on an epoll file descriptor which contains (among other things) an eventfd filedescriptor, which is constantly kept in the readable state. We use edge-based notification relying on the fact that epoll_wait tracks which edges have already been reported.

We observe that if the wakeup call (done via epoll_ctl by updating the epoll file descriptor with the same eventfd file descriptor) happens approximately at the same time as the epoll_wait call, then it can happen that the event is not reported and the epoll_wait runs into its timeout.

We observe that this happens with good reproducability on 5.3.0 kernels but not on other versions we have tried (5.0.0 and 4.15). Unfortunately, we could only try this out on generic Ubuntu kernels so far. The attachment contains a reproducing program. Simply compile with

  gcc epollbug.c -o epollbug -Wall -lpthread

and run `./epollbug`. If it writes something with "Timeout" the bug has been reproduced. On our machines it happens in most runs but not all. On older kernels it has never happened.

What does the program do? It plays ping pong between two threads, synchronized via an atomic_int. It then uses a pthread_mutex for memory barriers and basically tries to do epoll_ctl in one thread and epoll_wait in the other. It then checks that the event is caught by epoll_wait and complains if it runs into a timeout.

Note that the idea to use edge-based events is not ours but is what boost::asio does. We could not find documentation if the reinitialization of the epoll file descriptor with the eventfd file descriptor is actually guaranteed to create a new edge. If not, then this is a misuse which seems no longer to work in 5.3.0. But more likely it seems to be an in-kernel race in this area.

Best regards,
  Max
Comment 1 Max Neunhöffer 2020-01-06 13:19:51 UTC
Dear All,

I have now replicated the problem on a stock vanilla kernel version 5.4.

I used this git repo:

   git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

and the "v5.4" tag in there.

Best regards,
  Max.
Comment 2 Max Neunhöffer 2020-01-06 13:48:32 UTC
Created attachment 286655 [details]
Kernel configuration of Stock 5.4 kernel which shows the behaviour
Comment 3 Max Neunhöffer 2020-01-28 09:09:41 UTC
Dear All,

I have exchanged emails with Chris Kohlhoff who uses this functionality in `boost::asio`. He has kindly provided a patched version of the program which displays the buggy behaviour. The patched version uses `eventfd` and `epoll_wait` in a way which "some would consider a more traditional use". This time, the
flag `EPOLLONESHOT` is used and the `eventfd` is explicitly rearmed by a
`epoll_ctl` call with `EPOLL_CTL_MOD` as suggested in the `epoll_ctl` manpage.
Furthermore, the flag `EFD_SEMAPHORE` is used for the event fd. I would imagine that other projects than `boost::asio` would follow similar usage patterns like this one and might therefore also suffer from lost wakeup events.

Furthermore, Chris has reproduced the problem on a 5.4 kernel.

I attach the patched program.

Any comment or help with this would be really appreciated.

Cheers,
  Max.
Comment 4 Max Neunhöffer 2020-01-28 09:10:32 UTC
Created attachment 287007 [details]
Another program showing a variant of the buggy behaviour.
Comment 5 Max Neunhöffer 2020-02-03 14:34:12 UTC
I did a git bisect and have tracked down the bug to this commit:

https://github.com/torvalds/linux/commit/a218cc4914209ac14476cb32769b31a556355b22
Comment 6 Max Neunhöffer 2020-02-03 14:56:32 UTC
The latest kernel at https://github.com/torvalds/linux still shows the problem.
Note furthermore that the problem does not occur if the VM to run the kernel has only one CPU. With 4 CPUs I could reproduce.

Below I will post the kernel configs I used to compile the two commits

  a218cc4914209ac14476cb32769b31a556355b22 (bad)
  c3e320b61581ef7919269ca242ff13951ccfc763 (good)

Cheers,
  Max.
Comment 7 Max Neunhöffer 2020-02-03 14:57:38 UTC
Created attachment 287089 [details]
Kernel config for commit a218cc4914209ac14476cb32769b31a556355b22 to see the problem.

Kernel config for commit a218cc4914209ac14476cb32769b31a556355b22 to see the problem.
Comment 8 Max Neunhöffer 2020-02-03 14:58:20 UTC
Created attachment 287091 [details]
Kernel config for commit c3e320b61581ef7919269ca242ff13951ccfc763 to see the problem not yet there.
Comment 9 Max Neunhöffer 2020-02-03 14:58:58 UTC
Christopher Kohlhoff has produced another program to show that the bug also shows up with level triggered eventfds. I attach the program.

  Max.
Comment 10 Max Neunhöffer 2020-02-03 14:59:28 UTC
Created attachment 287093 [details]
Example program for bug with level triggered events.
Comment 11 Max Neunhöffer 2020-02-26 12:13:58 UTC
For info: This has in the meantime been discussed and fixed on the netdev mailing list, see 

  https://lore.kernel.org/netdev/d4d188512ea84f243310dd9464922a82@suse.de/

 Cheers, Max
Comment 12 Breno Leitao 2020-03-12 16:05:46 UTC
I understand that this bug is still not fixed upstream (5.6-rc5), is this right?