Bug 198791

Summary: Deadlock/hang in lock_sock_nested from getorigdst
Product: Networking Reporter: Petr Vandrovec (petr)
Component: Netfilter/IptablesAssignee: networking_netfilter-iptables (networking_netfilter-iptables)
Status: NEW ---    
Severity: normal CC: pabeni
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.16.0-rc1-64-00034 Subsystem:
Regression: Yes Bisected commit-id:

Description Petr Vandrovec 2018-02-15 00:08:17 UTC
I have configured NAT on my system to automatically proxy all outgoing connections to the external proxy, using REDIRECT operation to forward all external traffic to local port 8080, from where userspace daemon adds 'CONNECT <dstip>' in front of TCP stream and forwards whole thing to our web proxy:

proxy -h proxy.vmware.com:3128 -p 8080
iptables -t nat -F OUTPUT
iptables -t nat -A OUTPUT -d 127.0.0.0/255.0.0.0 -p tcp -j ACCEPT
iptables -t nat -A OUTPUT -d 10.0.0.0/255.0.0.0 -p tcp -j ACCEPT
iptables -t nat -A OUTPUT -d 172.16.0.0/255.255.0.0 -p tcp -j ACCEPT
iptables -t nat -A OUTPUT -d 192.168.0.0/255.255.0.0 -p tcp -j ACCEPT
iptables -t nat -A OUTPUT -p tcp -j REDIRECT --to-ports 8080

Everything worked until today's update from 4.15.0-rc9-00023 to 4.16.0-rc1-00032 - now it seems that proxy process deadlocked when it received first connection request - it is stuck in getsockopt(fd, SOL_IP, SO_ORIGINAL_DST):

[ 1600.470029] INFO: task proxy:5124 blocked for more than 120 seconds.
[ 1600.476490]       Tainted: G           OE    4.16.0-rc1-64-00034-g124cf890fee7 #17
[ 1600.484131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1600.492094] proxy           D    0  5124      1 0x00000000
[ 1600.497622] Call Trace:
[ 1600.500196]  ? __schedule+0x238/0x740
[ 1600.503967]  schedule+0x2e/0x90
[ 1600.507214]  __lock_sock+0x74/0xb0
[ 1600.510722]  ? wait_woken+0x80/0x80
[ 1600.514331]  lock_sock_nested+0x4b/0x50
[ 1600.518268]  getorigdst+0x55/0x240 [nf_conntrack_ipv4]
[ 1600.523497]  ? _copy_to_user+0x22/0x30
[ 1600.527361]  ? move_addr_to_user+0x74/0xc0
[ 1600.531555]  nf_getsockopt+0x3c/0x60
[ 1600.535246]  ip_getsockopt+0x72/0xc0
[ 1600.538921]  SyS_getsockopt+0x6b/0xd0
[ 1600.542683]  do_syscall_64+0x73/0x190
[ 1600.546462]  entry_SYSCALL_64_after_hwframe+0x21/0x86
[ 1600.551599] RIP: 0033:0x7fbdd05982aa
[ 1600.555286] RSP: 002b:00007ffc5b162c48 EFLAGS: 00000206 ORIG_RAX: 0000000000000037
[ 1600.562923] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fbdd05982aa
[ 1600.570146] RDX: 0000000000000050 RSI: 0000000000000000 RDI: 0000000000000004
[ 1600.577369] RBP: 00007ffc5b163ef2 R08: 00007ffc5b162c5c R09: 00007fbdd0853230
[ 1600.584573] R10: 00007ffc5b162c70 R11: 0000000000000206 R12: 0000000000000005
[ 1600.591798] R13: 00007ffc5b163f03 R14: 000000000000901f R15: 0000000000000000
[ 1723.350030] INFO: task proxy:5124 blocked for more than 120 seconds.
[ 1723.356495]       Tainted: G           OE    4.16.0-rc1-64-00034-g124cf890fee7 #17
[ 1723.364138] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1723.372057] proxy           D    0  5124      1 0x00000000
[ 1723.377652] Call Trace:
[ 1723.380221]  ? __schedule+0x238/0x740
[ 1723.383997]  schedule+0x2e/0x90
[ 1723.387251]  __lock_sock+0x74/0xb0
[ 1723.390763]  ? wait_woken+0x80/0x80
[ 1723.394376]  lock_sock_nested+0x4b/0x50
[ 1723.398317]  getorigdst+0x55/0x240 [nf_conntrack_ipv4]
[ 1723.403548]  ? _copy_to_user+0x22/0x30
[ 1723.407420]  ? move_addr_to_user+0x74/0xc0
[ 1723.411616]  nf_getsockopt+0x3c/0x60
[ 1723.415310]  ip_getsockopt+0x72/0xc0
[ 1723.418986]  SyS_getsockopt+0x6b/0xd0
[ 1723.422745]  do_syscall_64+0x73/0x190
[ 1723.426521]  entry_SYSCALL_64_after_hwframe+0x21/0x86
[ 1723.431662] RIP: 0033:0x7fbdd05982aa
[ 1723.435342] RSP: 002b:00007ffc5b162c48 EFLAGS: 00000206 ORIG_RAX: 0000000000000037
[ 1723.442980] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fbdd05982aa
[ 1723.450210] RDX: 0000000000000050 RSI: 0000000000000000 RDI: 0000000000000004
[ 1723.457429] RBP: 00007ffc5b163ef2 R08: 00007ffc5b162c5c R09: 00007fbdd0853230
[ 1723.464640] R10: 00007ffc5b162c70 R11: 0000000000000206 R12: 0000000000000005
[ 1723.471869] R13: 00007ffc5b163f03 R14: 000000000000901f R15: 0000000000000000


I'll try to bisect it tomorrow if nobody beats me to it.
Comment 1 Petr Vandrovec 2018-02-15 08:19:06 UTC
bisect says that there is a problem with move of lock_sock from ip_setsockopt into getorigdst(): is getsockopt operation, not setsockopt, and that change did not modify ip_getsockopt() to not acquire socket lock, it modified ip_setsockopt() only.  So sk lock gets acquired twice, once in ip_getsockopt, and once in getorigdst :-(

Perhaps intention was to remove lock from ip_getsockopt rather than ip_setsockopt?  I would guess that most of nf_setsockopt stuff will need exclusion, no?

commit 3f34cfae1238848fd53f25e5c8fd59da57901f4b
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Tue Jan 30 19:01:40 2018 +0100

    netfilter: on sockopt() acquire sock lock only in the required scope

    Syzbot reported several deadlocks in the netfilter area caused by
    rtnl lock and socket lock being acquired with a different order on
    different code paths, leading to backtraces like the following one:
...
    The problem, as Florian noted, is that nf_setsockopt() is always
    called with the socket held, even if the lock itself is required only
    for very tight scopes and only for some operation.

    This patch addresses the issues moving the lock_sock() call only
    where really needed, namely in ipv*_getorigdst(), so that nf_setsockopt()
    does not need anymore to acquire both locks.

    Fixes: 22265a5c3c10 ("netfilter: xt_TEE: resolve oif using netdevice notifiers")
    Reported-by: syzbot+a4c2dc980ac1af699b36@syzkaller.appspotmail.com
    Suggested-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Comment 2 Petr Vandrovec 2018-02-15 08:42:27 UTC
BTW, net/decnet/af_decnet.c seems to be broken too - both its setsockopt & getsockopt call nf_[gs]etsockopt with sk lock held, which does not match new invocation rules for IPv4/IPv6.
Comment 3 Paolo Abeni 2018-02-15 10:37:46 UTC
(In reply to Petr Vandrovec from comment #1)
> bisect says that there is a problem with move of lock_sock from
> ip_setsockopt into getorigdst(): is getsockopt operation, not setsockopt,
> and that change did not modify ip_getsockopt() to not acquire socket lock,
> it modified ip_setsockopt() only.  So sk lock gets acquired twice, once in
> ip_getsockopt, and once in getorigdst :-(

This should be fixed by the following patch:

http://patchwork.ozlabs.org/patch/870827/

The patch should already be in the nf tree and should land mainline soon.

Thanks,

Paolo