Bug 99671
| Summary: | glibc deadlock in __check_pf() presumed due to missing netlink response | | |
|---|---|---|---|
| Product: | Networking | Reporter: | David Woodhouse (dwmw2) |
| Component: | Other | Assignee: | Stephen Hemminger (stephen) |
| Status: | NEW --- | | |
| Severity: | normal | CC: | fweimer, koct9i |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 4.0 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
David Woodhouse
2015-06-08 21:20:23 UTC
This is the glibc code in question: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/check_pf.c;h=162606d7;hb=glibc-2.21#l166

I'm running a kernel with http://patchwork.ozlabs.org/patch/473041/ now. I don't have a reliable reproducer, but it usually happens within a day or two. If it lasts a week on this kernel, I'll call it fixed. (Thanks to Eric Dumazet for pointing it out.)

I've found a couple of bugs in glibc:
1) This function ignores the NLMSG_ERROR message. In this case the kernel doesn't send NLMSG_DONE at all.
2) The check (nlmh->nlmsg_pid != pid) isn't safe: in case of a pid collision (for example, several pid namespaces in one net namespace) the kernel binds the netlink socket to some random (and negative?) pid.

The kernel sends -ECONNREFUSED because it cannot find the socket in the hash, and it looks like this code is broken differently in 3.17..3.19 and in 4.0.

Neither of those was actually causing the problem in my case though; we were getting *no* messages back.

I did see it again after applying the patch mentioned in comment 2. After *also* applying a version of the patch from https://patchwork.ozlabs.org/patch/473049/ I think it finally seems to have gone away.

I've found a race in kernels 3.17..3.19. Fix: https://patchwork.ozlabs.org/patch/488736/
In my case I see an NLMSG_ERROR message in the receive buffer; glibc ignores it and waits for more.
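
For illustration, here is a minimal sketch of how a reader of a netlink dump reply could treat NLMSG_ERROR as a terminator instead of waiting for an NLMSG_DONE that will never arrive. This is not the glibc code or the eventual fix: the names `scan_dump_reply` and `enum dump_state` are hypothetical, and matching replies on the sequence number alone (rather than on nlmsg_pid) is an assumption made for this sketch.

```c
#include <errno.h>
#include <linux/netlink.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result codes for this sketch; not part of glibc. */
enum dump_state {
    DUMP_MORE,  /* no terminator seen yet; the caller may receive again */
    DUMP_DONE,  /* NLMSG_DONE seen; the dump finished normally */
    DUMP_ERROR  /* NLMSG_ERROR seen; no NLMSG_DONE will follow */
};

/*
 * Scan one received buffer of netlink messages.
 * Messages whose sequence number differs from 'seq' are skipped.
 * Assumption: nlmsg_pid is deliberately not compared here.
 * On DUMP_ERROR, *err_out holds a positive errno value
 * (0 would mean the message was a plain acknowledgement).
 */
static enum dump_state
scan_dump_reply(void *buf, int len, uint32_t seq, int *err_out)
{
    struct nlmsghdr *nlmh;

    for (nlmh = buf; NLMSG_OK(nlmh, len); nlmh = NLMSG_NEXT(nlmh, len)) {
        if (nlmh->nlmsg_seq != seq)
            continue;               /* reply to some other request */

        if (nlmh->nlmsg_type == NLMSG_DONE)
            return DUMP_DONE;

        if (nlmh->nlmsg_type == NLMSG_ERROR) {
            if (nlmh->nlmsg_len >= NLMSG_LENGTH(sizeof(struct nlmsgerr))) {
                struct nlmsgerr *e = NLMSG_DATA(nlmh);
                *err_out = -e->error;   /* kernel stores a negative errno */
            } else {
                *err_out = EIO;         /* truncated error message */
            }
            return DUMP_ERROR;
        }
        /* Any other type is payload (e.g. an address record); keep scanning. */
    }
    return DUMP_MORE;
}
```

A caller would stop its receive loop on either DUMP_DONE or DUMP_ERROR and only call recvmsg() again on DUMP_MORE. Note that this only helps when an error message is actually delivered; it does nothing for the case where no messages come back at all.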