Bug 121131 - Regression: No traffic when connected via SSL VPN (e.g. Juniper Network Connect)
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 high
Assignee: drivers_network@kernel-bugs.osdl.org
Reported: 2016-06-28 23:50 UTC by Jonas Lippuner
Modified: 2017-04-24 15:44 UTC

Kernel Version: 4.6.1
Tree: Mainline
Regression: Yes


Attachments
Merge Patch (2.06 KB, patch)
2016-07-06 22:58 UTC, [account disabled by administrator]
changin default addrgenmode to "none" instead of changing "none" to "random" and default to "eui64" (496 bytes, patch)
2016-07-09 17:17 UTC, Bjørn Mork

Description Jonas Lippuner 2016-06-28 23:50:36 UTC
When I connect to a Juniper VPN network, the tunnel gets established, but no data is sent over it.

ifconfig gives

tun0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1400
        inet <my ip>  netmask 255.255.255.255  destination <my ip>
        inet6 <my ip>  prefixlen 64  scopeid 0x20<link>
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 500  (UNSPEC)
        RX packets 10  bytes 600 (600.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1  bytes 48 (48.0 B)
        TX errors 0  dropped 1476 overruns 0  carrier 0  collisions 0

Pinging anything (not just something in the VPN network) fails. When I ping <my ip> from another machine that's correctly connected to the VPN network (via Windows, *ugh*), the ping fails, but the received data counter increases on my Linux system, indicating that it receives the ping packets but is not sending a response. Also, the dropped packet counter increases.

This seems to be a regression of https://bugzilla.kernel.org/show_bug.cgi?id=90901

I've seen it reported online that connecting to a Juniper VPN worked in 4.4, but stopped working in 4.5 and 4.6. See for example https://wiki.archlinux.org/index.php/Juniper_VPN#ncsvc_and_kernel_versions_3.19.2C_4.5_and_4.6 and http://www.unixgr.com/juniper-ncsvc-and-linux-3-19/#comment-342
Comment 1 [account disabled by administrator] 2016-07-02 04:25:21 UTC
Seems this is already fixed by the report here, https://bugzilla.kernel.org/show_bug.cgi?id=90901.
Comment 2 Jonas Lippuner 2016-07-02 04:42:36 UTC
This is a regression. The VPN works as expected with kernel 4.4, but it does not work at all with kernel 4.6. The fix mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=90901 (which was 3.19) is already implemented in 4.6, so something else is going wrong.
Comment 3 [account disabled by administrator] 2016-07-02 08:15:47 UTC
Sorry I didn't check if that commit was merged into the kernel version you were using. Since you have a good and bad kernel version, git bisection may be a good place to start finding the offending commit.
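
For readers new to bisection: `git bisect` performs a binary search over the commits between a known-good and a known-bad revision. The sketch below only illustrates that idea. It assumes a linear history, and the `is_bad` callback is a hypothetical stand-in for "build this kernel and try the VPN"; it is not how git itself is implemented.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and every commit after the first bad one is bad too.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

Each step halves the remaining range, so even the thousands of commits between two kernel releases need only a dozen or so test builds.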
Comment 4 Jonas Lippuner 2016-07-04 00:33:00 UTC
Result of git bisect

cc9da6cc4f56e05cc9e591459fe0192727ff58b3 is the first bad commit
commit cc9da6cc4f56e05cc9e591459fe0192727ff58b3
Author: Bjørn Mork <bjorn@mork.no>
Date:   Wed Dec 16 16:44:38 2015 +0100

    ipv6: addrconf: use stable address generator for ARPHRD_NONE

    Add a new address generator mode, using the stable address generator
    with an automatically generated secret. This is intended as a default
    address generator mode for device types with no EUI64 implementation.
    The new generator is used for ARPHRD_NONE interfaces initially, adding
    default IPv6 autoconf support to e.g. tun interfaces.

    If the addrgenmode is set to 'random', either by default or manually,
    and no stable secret is available, then a random secret is used as
    input for the stable-privacy address generator.  The secret can be
    read and modified like manually configured secrets, using the proc
    interface.  Modifying the secret will change the addrgen mode to
    'stable-privacy' to indicate that it operates on a known secret.

    Existing behaviour of the 'stable-privacy' mode is kept unchanged. If
    a known secret is available when the device is created, then the mode
    will default to 'stable-privacy' as before.  The mode can be manually
    set to 'random' but it will behave exactly like 'stable-privacy' in
    this case. The secret will not change.

    Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
    Cc: 吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
    Signed-off-by: Bjørn Mork <bjorn@mork.no>
    Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 d836f778ff48680479f9e38e138e315dc64409f4 54151e4934861855dc02e58b15683389e7c8ebf2 M   include
:040000 040000 8f303ac24b9e83e9b0ffda29aacf87cc3dcaf018 7c8ea2a7339ecb8cf28f50c680d428551490d7e4 M   net


And the following patch fixes the problem for me:

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 819b777..d57dd53 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3109,11 +3109,6 @@ static void addrconf_dev_config(struct net_device *dev)
        if (IS_ERR(idev))
                return;
 
-       /* this device type has no EUI support */
-       if (dev->type == ARPHRD_NONE &&
-           idev->addr_gen_mode == IN6_ADDR_GEN_MODE_EUI64)
-               idev->addr_gen_mode = IN6_ADDR_GEN_MODE_RANDOM;
-
        addrconf_addr_gen(idev, false);
 }


This commit makes the kernel generate a random IPv6 address for the tunnel interface, but the tunnel apparently does not support IPv6. With the bad commit I get

tun0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1400
        inet <my ip>  netmask 255.255.255.255  destination <my ip>
        inet6 fe80::4a6b:8329:ee97:25fe  prefixlen 64  scopeid 0x20<link>
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 500  (UNSPEC)

and no traffic is sent over the VPN tunnel. With my patch, I get

tun0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1400
        inet <my ip>  netmask 255.255.255.255  destination <my ip>
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 500  (UNSPEC)

and VPN works as expected.

Since I don't know all the intricacies of the network driver, IPv6, etc., my patch may not be the optimal solution. Maybe there needs to be some code that detects whether the interface (in this case the Juniper VPN tunnel) supports IPv6, and only generates a random IPv6 address if this is sure not to break the interface.

I'm adding the email addresses listed on the bad commit to the CC of this bug.
Comment 5 [account disabled by administrator] 2016-07-04 01:54:15 UTC
Seems the line:
idev->addr_gen_mode == IN6_ADDR_GEN_MODE_EUI64
needs to be:
idev->addr_gen_mode == IN6_ADDR_GEN_MODE_NONE.
Can you change that equality check to that and see if it fixes your issue?
Comment 6 Jonas Lippuner 2016-07-04 02:16:04 UTC
Indeed, the following patch DOES WORK:

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 819b777..24c2287 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3111,7 +3111,7 @@ static void addrconf_dev_config(struct net_device *dev)
 
        /* this device type has no EUI support */
        if (dev->type == ARPHRD_NONE &&
-           idev->addr_gen_mode == IN6_ADDR_GEN_MODE_EUI64)
+           idev->addr_gen_mode == IN6_ADDR_GEN_MODE_NONE)
                 idev->addr_gen_mode = IN6_ADDR_GEN_MODE_RANDOM;
                                                                                             
        addrconf_addr_gen(idev, false);
Comment 7 Bjørn Mork 2016-07-04 07:57:27 UTC
Is it possible, and does it help, if you do the same from userspace before bringing up the tun0 interface?  I.e.

 ip link set tun0 addrgenmode none

Sorry about the regression.  I have no strong feelings against changing the default, except that this now is so late that it will cause a regression for anyone depending on the v4.5 and v4.6 behaviour.  Don't know the best way to deal with that....
Comment 8 Jonas Lippuner 2016-07-04 15:02:13 UTC
Thanks for your thoughts, Bjørn. If I run

$ ip link set tun0 addrgenmode none

as either user or root before connecting to the VPN, I just get:

  Cannot find device "tun0"

If I run the command after connecting to VPN, it succeeds (only as root), but doesn't change anything because at that time the IPv6 has already been set on tun0.

Is there a way to set the addrgenmode to none for the device before it exists? In a config file somewhere perhaps?
Comment 9 [account disabled by administrator] 2016-07-04 17:14:30 UTC
4.5 was mainline only, so unless a distribution or product chose that kernel and hits this regression, I feel we don't need to backport to 4.5. However, I will keep a mental note in case this does occur in the future. On the other hand, 4.6 is a stable release, so backporting to that release is important. In any case, that line is broken and should be fixed: if you read the comment above it, we are actually checking for the address gen mode we should not have enabled. Jonas, please close this bug, as I feel it's been fixed; if there are any further issues, just reopen it.
Comment 10 Josh Boyer 2016-07-06 13:42:24 UTC
This bug isn't actually fixed as far as I can tell.  There is nothing queued in davem's tree, or even posted to the netdev list.  Given that it's a regression, I don't see how you can consider this fixed.
Comment 11 [account disabled by administrator] 2016-07-06 21:12:13 UTC
The protocol is generally to close bugs once they're fixed. The merge will happen in a week or so, as David is probably busy.
Comment 12 Josh Boyer 2016-07-06 22:13:56 UTC
(In reply to bastienphilbert from comment #11)
> The protocol is generally to close bugs once they're fixed. The merge will
> happen in a week or so, as David is probably busy.

It's not fixed.  There's a fragment of a patch in this bug.  There is no upstream acceptable patch, there is no posting on netdev, the patch isn't on patchwork, and there is nothing in a maintainer's tree.

What this bug has is an identified code change that can work, but until it is sent upstream I do not see how anyone can conclude it is actually the correct fix.  I also don't see how you expect a merge to happen in a week when nobody has actually sent the patch out.
Comment 13 Jonas Lippuner 2016-07-06 22:16:19 UTC
Ok, so do I need to submit my patch somewhere? I am very new to submitting a bug report and patch to the Linux kernel, so I don't know what the usual work flow is...
Comment 14 Josh Boyer 2016-07-06 22:22:22 UTC
(In reply to Jonas Lippuner from comment #13)
> Ok, so do I need to submit my patch somewhere? I am very new to submitting a
> bug report and patch to the Linux kernel, so I don't know what the usual
> work flow is...

Yes.  Ideally you would write up a patch with a commit log that describes what the problem is, points to the commit that introduced the regression, and send it to the netdev maintainers.  Perhaps Bjørn could help you through the process.
Comment 15 [account disabled by administrator] 2016-07-06 22:54:06 UTC
Josh,
Here is a patch for upstream.
Comment 16 [account disabled by administrator] 2016-07-06 22:58:24 UTC
Created attachment 222291 [details]
Merge Patch
Comment 17 Josh Boyer 2016-07-07 11:34:45 UTC
(In reply to bastienphilbert from comment #16)
> Created attachment 222291 [details]
> Merge Patch

Fantastic.  Now it just needs to be sent to the netdev list.  Thanks for doing that.
Comment 18 Bjørn Mork 2016-07-09 17:17:37 UTC
Created attachment 222531 [details]
changin default addrgenmode to "none" instead of changing "none" to "random" and default to "eui64"

Rethinking this a bit, I believe the proposed patch will cause regressions for other userspace applications.  In particular NetworkManager, which depends on being able to manage IPv6 LL addresses itself by setting "addrgenmode none".  The proposed patch will convert this into "addrgenmode random", which is pretty unexpected.
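
To make the objection concrete, here is a small Python model of the mode rewrite in addrconf_dev_config() as shown in the diffs in comments 4 and 6. It is a sketch, not kernel code: the function names are made up for illustration, and it models only the mode check, not address generation. The enum values and ARPHRD_NONE follow the kernel headers as far as I know them.

```python
from enum import IntEnum

ARPHRD_ETHER = 1
ARPHRD_NONE = 0xFFFE  # device type of tun interfaces


class AddrGenMode(IntEnum):
    # Modelled on enum in6_addr_gen_mode from the kernel UAPI headers.
    EUI64 = 0
    NONE = 1
    STABLE_PRIVACY = 2
    RANDOM = 3


def mode_after_bisected_commit(dev_type: int, mode: AddrGenMode) -> AddrGenMode:
    """Check as introduced by commit cc9da6cc: EUI64 becomes RANDOM."""
    if dev_type == ARPHRD_NONE and mode == AddrGenMode.EUI64:
        return AddrGenMode.RANDOM
    return mode


def mode_after_comment6_patch(dev_type: int, mode: AddrGenMode) -> AddrGenMode:
    """Check as changed by the patch in comment 6: NONE becomes RANDOM."""
    if dev_type == ARPHRD_NONE and mode == AddrGenMode.NONE:
        return AddrGenMode.RANDOM
    return mode


# A tun device left at the default mode no longer gets a random address...
print(mode_after_comment6_patch(ARPHRD_NONE, AddrGenMode.EUI64).name)
# ...but a device explicitly set to "none" is silently switched to "random".
print(mode_after_comment6_patch(ARPHRD_NONE, AddrGenMode.NONE).name)
```

With the comment 6 patch, a tun device at the default "eui64" mode keeps that mode, which is why the VPN client works again, while an interface that userspace deliberately set to "none" ends up in "random" mode.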

Could any of you who are affected by the regression test the attached patch instead?  It should revert the problematic behaviour without causing unnecessary regressions to other applications (it will cause regressions for anyone depending on the "random" default, but we have to accept that I guess).
Comment 19 Jonas Lippuner 2016-07-09 20:03:01 UTC
(In reply to Bjørn Mork from comment #18)
> Could any of you who are being affected by the regression test the attached
> patch instead?

It works, thanks!
Comment 20 [account disabled by administrator] 2016-07-15 03:06:50 UTC
Bjorn,
Would you like me to send this version instead, in order to avoid the regressions you are discussing?
Comment 21 Bjørn Mork 2016-07-15 09:23:44 UTC
More info is needed if this is going to progress. The proposed workaround is tested and we know it is effective.  But:

The workaround is *not* a bugfix.  It simply disables a feature the rest of the world wants, just to accommodate a buggy VPN client.  This issue is perceived as a regression simply because it is a feature which previously was not implemented for tun interfaces. But it is fundamentally wrong if that fact then should prevent the feature from ever being implemented.

Like the Juniper TAC has told me numerous times: Function As Designed :)

Anyway, everybody is interested in making this client work.  We just don't want to blindly go around and disable features, without understanding the underlying issues. So someone with the capability to test this VPN client needs to step up and do some debugging. Please see the patchwork discussion of the last proposed workaround.  It has all the necessary details:

http://patchwork.ozlabs.org/patch/646958/

There will be no further progress here until we have more information.  Sorry.  I'm told OpenConnect now supports Juniper/Pulse VPNs, but I don't have first hand info since I have no such VPN to test against.
Comment 22 [account disabled by administrator] 2016-07-15 16:41:46 UTC
Bjorn,
If this was a new feature being implemented, then you're right that this is a workaround. From what I am reading, there seems to be an issue, at least with setting up the right bits on the device during certain key tun functions.
Comment 23 John Brooks 2016-07-19 17:30:00 UTC
I have access to a VPN system that I can use to test this. I'll poke around and see if I can get anywhere.
Comment 24 John Brooks 2016-07-20 21:48:15 UTC
My findings so far:

I ran strace on ncsvc right as it was started, and found that it reads 48 bytes from /dev/net/tun:

open("/dev/net/tun", O_RDWR)            = 25
ioctl(25, TUNSETIFF, 0xfff077f0)        = 0
fcntl64(25, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
read(25, "`\0\0\0\0\10:\377\376\200\0\0\0\0\0\0\337\315\224#Wd\336T\377\2\0\0\0\0\0\0"..., 2048) = 48

Immediately after, it writes the following line to the log file:
20160718175944.702288 ncsvc[p28122.t28122] adapter.para reading 48 bytes from tun (adapter.cpp:309)
20160718175944.702347 ncsvc[p28122.t28122] adapter.warn Bad ip packet len 48 - should be 0 (adapter.cpp:180)

It does not read from /dev/net/tun again after this.
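
The bytes strace printed can be decoded by hand. The Python sketch below parses the fixed IPv6 header fields from the 32 bytes shown above (strace truncated the rest, so the destination address is not decoded). The helper name `parse_ipv6_prefix` is made up for illustration.

```python
import ipaddress
import struct

# The 32 bytes strace printed for the first read() (the call returned 48).
STRACE_PREFIX = (
    b"`\0\0\0\0\10:\377\376\200\0\0\0\0\0\0\337\315\224#Wd\336T"
    b"\377\2\0\0\0\0\0\0"
)


def parse_ipv6_prefix(data: bytes) -> dict:
    """Decode the fixed IPv6 header fields available in a truncated capture."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes (up to the source address)")
    word, payload_len, next_header, hop_limit = struct.unpack("!IHBB", data[:8])
    return {
        "version": word >> 28,
        "traffic_class": (word >> 20) & 0xFF,
        "flow_label": word & 0xFFFFF,
        "payload_len": payload_len,
        "next_header": next_header,  # 58 is ICMPv6
        "hop_limit": hop_limit,
        "src": str(ipaddress.IPv6Address(data[8:24])),
        "total_len": 40 + payload_len,  # the fixed IPv6 header is 40 bytes
    }


info = parse_ipv6_prefix(STRACE_PREFIX)
print(info)
```

The result is an IPv6 packet (version 6) carrying ICMPv6 (next header 58) with hop limit 255, a link-local source address, and 40 + 8 = 48 bytes in total, matching the read() return value. That is consistent with a router solicitation sent by the kernel, although the truncated output alone does not prove it.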

Disabling router solicitations (echo 0 > /proc/sys/net/ipv6/conf/default/router_solicitations) and attempting to connect again will fix the problem. It may be worth noting that the ncsvc process does not need to be killed in between.
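
The same workaround can be scripted. A minimal Python sketch follows, assuming it is run as root before the VPN client is started. The function name and the `proc_sys` parameter are illustrative only; the parameter exists so the function can be tried against a scratch directory.

```python
from pathlib import Path


def set_router_solicitations(count: int, conf: str = "default",
                             proc_sys: str = "/proc/sys") -> Path:
    """Write the router_solicitations sysctl for one IPv6 conf entry.

    Writing under the real /proc/sys requires root. The proc_sys argument
    exists only so the function can be exercised against a scratch directory.
    """
    path = Path(proc_sys) / "net" / "ipv6" / "conf" / conf / "router_solicitations"
    path.write_text(f"{count}\n")
    return path
```

Calling `set_router_solicitations(0)` changes the "default" entry, which is applied to interfaces created afterwards. That is why it works for tun0 even though the device does not exist until the VPN connects. The setting is not persistent across reboots.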
Comment 25 hannes 2016-07-20 22:31:15 UTC
(In reply to John Brooks from comment #24)
> Disabling router solicitations (echo 0 >
> /proc/sys/net/ipv6/conf/default/router_solicitations) and attempting to
> connect again will fix the problem. It may be worth noting that the ncsvc
> process does not need to be killed in between.

That is bad to hear, somehow I fear we can't fix this basically. :(
Comment 26 John Brooks 2016-07-20 23:40:13 UTC
After the VPN was up and running with router solicitations disabled, I did this:
ping6 -I tun0 fe80::1

It shows up in Wireshark watching tun0, and strace showed the ncsvc process reading the ping packet from /dev/net/tun:
read(25, "`\3\333\204\0@:@\376\200\0\0\0\0\0\0\2076r\20\343\363Fs\376\200\0\0\0\0\0\0"..., 2048) = 104

ncsvc will not read anything else from the tunnel until the connection is restarted (to clarify, I mean that no more read calls are even made), though the packets will continue to show up in Wireshark.

So it looks like trying to send any IPv6 packet over the tunnel interface will cause the client to stop relaying packets.

I do not think that this is a remote problem (such as security software in the remote network triggering on IPv6 packets) for two reasons:
1. The trace shows that ncsvc does not even try to send anything to the network after this occurs
2. This doesn't happen on Windows with the Pulse Secure desktop client; I watched IPv6 packets go through in Wireshark, and the connection remained functional

It's likely that this is a bug in Network Connect. Unfortunately, Network Connect is not open source, so it's difficult to verify.
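
For illustration only, this is roughly what tolerant handling could look like in an IPv4-only tunnel client. It is a hypothetical Python sketch, not Network Connect's code (which is not available). It assumes packets arrive without a packet-information header, as the strace output above suggests, so the first nibble is the IP version.

```python
from typing import Iterable, Iterator


def ip_version(packet: bytes) -> int:
    """Return the IP version from the first nibble, or 0 for an empty packet."""
    return packet[0] >> 4 if packet else 0


def ipv4_only(packets: Iterable[bytes]) -> Iterator[bytes]:
    """Yield IPv4 packets and silently skip everything else.

    Hypothetical filter for an IPv4-only tunnel client; packets stand in
    for successive read() results from the tun file descriptor.
    """
    for packet in packets:
        if ip_version(packet) == 4:
            yield packet
        # Non-IPv4 traffic (e.g. IPv6 router solicitations) is dropped,
        # but the loop keeps reading instead of giving up.
```

The point of the sketch is that an unexpected IPv6 packet is skipped and reading continues, rather than the read loop stopping for the rest of the session.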
Comment 27 Dennis Kieselhorst 2016-11-03 09:07:23 UTC
After updating to Ubuntu 16.10, which contains kernel 4.8, more and more users run into this problem. Even the latest Network Connect client 8.2R5 doesn't work.

How to proceed with this issue?
Comment 28 Bjørn Mork 2016-11-03 09:47:22 UTC
I believe John Brooks presented a workaround in comment #24: https://bugzilla.kernel.org/show_bug.cgi?id=121131#c24

A more permanent fix is unlikely at this point.  The problem has been traced to a rather stupid and simple-to-fix bug in the client. An in-kernel workaround is not appropriate and has been rejected.  The fact that the client is unmaintained and closed source does not help.

I recommend selecting a new and maintained client.  If that is not possible, then implement the suggested workaround locally.
Comment 29 Dennis Kieselhorst 2016-11-03 10:24:49 UTC
I should have read the comments more carefully. Thank you for pointing this out; the workaround is very helpful for me.

Anyway as far as I understand Pulse Secure should provide a solution here.
Comment 30 hannes 2017-03-12 23:10:43 UTC
Please have a look at this patch:

https://patchwork.ozlabs.org/patch/737900/

Maybe it helps?
Comment 31 David Watt 2017-03-16 17:55:56 UTC
I have raised this issue in a service request with PulseSecure. I sent them to this thread, so it is possible (at least I hope!) that they may reach out to some of the people who have posted such constructive analysis here.  You *may* be able to see my discussion with them in their SalesForce database at https://pulsesecure.force.com/ps/500j000000TEe7e, but no promises; I don't know what their security model is.
Comment 32 David Watt 2017-03-16 18:04:29 UTC
I've posted to PulseSecure's support system; if you don't have access to that, I'd encourage you to post here:

https://forums.pulsesecure.net/topic/pulse-connect-secure
