Created attachment 84271 [details]
Extend inet_select_addr() to match most specific address

Hello.

I have discovered the following three cases where neighbour discovery for IPv4 does not work as expected (the third is a related ICMP issue):

1. Interface has no IPv4 address assigned and arp_ignore is set to 0 (default).
2. Interface has multiple IPv4 addresses assigned from within the same IP subnet but with different subnet masks (overlapping address space).
3. When sysctl net/ipv4/icmp_errors_use_inbound_ifaddr is 1, the ICMP reply is sent with the first address configured on the interface instead of the most specific address (matching subnet), or with the first address on any network interface (again with no matching done).

Each problem in more detail:
-----------------------------------

1. A common ISP configuration used to conserve IPv4 addresses is the so-called "ip unnumbered" address assignment scheme: instead of allocating a /30 subnet to each customer for one IPv4 address (two addresses go unused: network and broadcast), the customer is assigned one address but with a mask of (for example) /24. Many network equipment vendors have implemented this for a long time. Linux also supports it (it still needs to be configured) thanks to the arp_ignore sysctl. The configuration on Linux looks like the following:

PC1             |             |
ip: 10.0.1.2/24 |             | Linux Router
gw: 10.0.1.1    |--------eth0-| Lo0:   10.10.10.10/32
                              | Lo255: 10.0.1.1/24
PC2             |--------eth1-|        10.0.2.1/24
ip: 10.0.1.3/24 |             | eth[0-2]: no ip address
gw: 10.0.1.1    |             | ip route 10.0.1.2/32 dev eth0 src 10.0.1.1
                            +-| ip route 10.0.1.3/32 dev eth1 src 10.0.1.1
                            | | ip route 10.0.2.2/32 dev eth2 src 10.0.2.1
PC3             |-----eth2--+
ip: 10.0.2.2/24 |
gw: 10.0.2.1    |

PC1-3 are hosts with Linux 3.2.23-3-amd64 (Debian) kernels. No IP address is assigned on eth0-2, but IPv4 is enabled.
The following sysctl settings are used (at least):

sysctl net/ipv4/ip_forward=1
sysctl net/ipv4/conf/all/forwarding=1      // default
sysctl net/ipv4/conf/default/forwarding=1  // default
sysctl net/ipv4/conf/all/proxy_arp=0
sysctl net/ipv4/conf/default/proxy_arp=1   // needed for communication between PC1 & PC2
sysctl net/ipv4/conf/all/arp_ignore=0      // default
sysctl net/ipv4/conf/default/arp_ignore=0  // default

The main routing table (254) contains routes to the customers' IP addresses.
Lo0 is a dummy network interface named Lo(opback)0 that hosts the IP address of the router (10.10.10.10).
Lo255 is a dummy network interface named Lo(opback)255, created with

ip link add name lo255 type dummy

However, using a dummy interface to host the addresses is not necessary: the addresses may be configured on ANY interface, even a real physical NIC, but NOT on the system loopback (127.0.0.1). Routes on the loopback are considered local to the host and are placed into the local (255) routing table. Since the Linux policy routing rules place the lookup of the local routing table at priority 0, the "ip unnumbered" scheme can't work there: the route is found in the local routing table (/24) even if more specific (/32) routes exist in main.

PROBLEM: If more than one IP address is assigned in the system, the ARP request generated by the kernel in the NUD PROBE phase takes its source from the first IP address assigned to the interface (and with ip unnumbered there is no such address on the interface) or from the first address found on any interface with scope <= LINK. The src parameter (ip(8)) is not taken into account even if configured.

Here is confirmation from a lab environment reflecting the scheme described above.
reading from file arp-probe-bug.pcap, link-type EN10MB (Ethernet)
!--- PROBE phase begins ---
13:28:57.395181 08:00:27:3b:63:ae > 0a:00:27:00:00:00, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.1.2 tell 10.10.10.10, length 28
13:28:58.395257 08:00:27:3b:63:ae > 0a:00:27:00:00:00, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.1.2 tell 10.10.10.10, length 28
13:28:59.395207 08:00:27:3b:63:ae > 0a:00:27:00:00:00, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.1.2 tell 10.10.10.10, length 28
!--- FAILED phase begins ---
13:29:01.393739 08:00:27:3b:63:ae > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.1.2 tell 10.0.1.1, length 28
13:29:01.393862 0a:00:27:00:00:00 > 08:00:27:3b:63:ae, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Reply 10.0.1.2 is-at 0a:00:27:00:00:00, length 46

As can be seen from the capture, the router sends ARP with a source that is NOT from the subnet where the destination address 10.0.1.2 resides (configured on Lo255). This causes the NUD entry renewal to go into the FAILED state on the Linux router, and neighbour resolution starts again from the beginning. At that point it does look at the 'src' parameter associated with the /32 routes and resolves the address successfully. Hosts PC1-3 ignored the ARP with the "fake" IP address in the source (Linux behaviour). Network equipment from other vendors (especially security devices) may treat this as an attack and apply the configured actions.

See attachment arp-case-1.tar.xz for the network stack configuration of the Linux router.

2. The next case is closely related to the first, but the configuration has changed:

PC1              |        |
ip: 10.0.1.2/24  |        |    Linux Router
gw: 10.0.1.1     |--------+    | Lo0:   10.10.10.10/32
                          |    | Lo255: 10.0.2.1/24
PC2                       +----| eth0:  10.0.1.1/24
ip: 10.0.1.130/25 |       |    |        10.0.1.129/25
gw: 10.0.1.129    |-------+    |

We have overlapping address space (10.0.1.0/24 contains 10.0.1.128/25). This can happen when a network has grown from 10.0.1.128/25 to 10.0.1.0/24 but not all hosts have been migrated to the new settings. In this case we see the same behaviour as in the first case, but the address is taken from eth0 and it is the first one that matches the subnet, even if a more specific one (/25) exists.

reading from file arp-probe-bug.pcap, link-type EN10MB (Ethernet)
!--- PROBE phase begins ---
14:15:59.716100 08:00:27:3b:63:ae > 0a:00:27:00:00:00, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.1.130 tell 10.0.1.1, length 28
14:16:00.715429 08:00:27:3b:63:ae > 0a:00:27:00:00:00, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.1.130 tell 10.0.1.1, length 28
14:16:01.715302 08:00:27:3b:63:ae > 0a:00:27:00:00:00, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.1.130 tell 10.0.1.1, length 28
!--- FAILED phase begins ---
14:16:03.713596 08:00:27:3b:63:ae > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.1.130 tell 10.0.1.129, length 28
14:16:03.713663 0a:00:27:00:00:00 > 08:00:27:3b:63:ae, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Reply 10.0.1.130 is-at 0a:00:27:00:00:00, length 46

See attachment arp-case-2.tar.xz for the network stack configuration of the Linux router (sysctls identical to the first case).

3. With the same scheme as in case 2, when running tracepath (traceroute) we get the ICMP answer from 10.0.1.1 and NOT from 10.0.1.129 as expected.
sysctl net/ipv4/icmp_errors_use_inbound_ifaddr = 1

# tracepath -n 10.0.2.2
 1:  10.0.1.130    0.130ms pmtu 1500
 1:  no reply   // Here should be 10.0.1.129, but see next output from tcpdump

16:19:32.043941 08:00:27:3b:63:ae > 0a:00:27:00:00:00, ethertype IPv4 (0x0800), length 590: (tos 0xc0, ttl 64, id 61526, offset 0, flags [none], proto ICMP (1), length 576)
    10.0.1.1 > 10.0.1.130: ICMP time exceeded in-transit, length 556
        (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto UDP (17), length 1500)
    10.0.1.130.55610 > 10.0.2.2.44444: UDP, length 1472

--------------------------------------------------------------------------

The proposed patch is in the attachment. It extends inet_select_addr() in net/ipv4/devinet.c to look for the most specific network configured on the interface. It was tested in our lab environment and found to work correctly. The arp-probe-bug.bsh script in the attachment can be used to reproduce the bug.

In general: even with inet_select_addr() updated by our patch, there is still one place that breaks source address selection for kernel-generated IPv4 traffic:

ip_route_output_slow() (net/ipv4/route.c)
  FIB_RES_PREFSRC() (include/net/ip_fib.h)
    fib_info_update_nh_saddr() (net/ipv4/fib_semantics.c)

In fib_info_update_nh_saddr(), inet_select_addr() is called with dst == nh->nh_gw, but if the network is directly connected, nh_gw == 0, so we get the first best address instead of one selected by the packet destination. However, this only happens when the route doesn't have a prefsrc ('src' parameter) associated with it.
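To illustrate the selection rule the patch aims for (prefer the address whose configured subnet is the most specific match for the destination), here is a minimal user-space sketch in Python. It is not the attached kernel patch and does not reproduce inet_select_addr() itself. The function names select_first_match and select_most_specific are hypothetical, and the address list is simply the eth0 configuration from case 2.

```python
import ipaddress

def select_first_match(ifaddrs, dst):
    """First configured address whose subnet contains dst
    (the behaviour described in this report)."""
    dst = ipaddress.ip_address(dst)
    for ifa in ifaddrs:
        if dst in ifa.network:
            return ifa.ip
    return None

def select_most_specific(ifaddrs, dst):
    """Address whose subnet contains dst and has the longest prefix
    (the behaviour the report asks for; hypothetical sketch)."""
    dst = ipaddress.ip_address(dst)
    best = None
    for ifa in ifaddrs:
        if dst not in ifa.network:
            continue
        if best is None or ifa.network.prefixlen > best.network.prefixlen:
            best = ifa
    return best.ip if best is not None else None

# eth0 addresses from case 2, in configuration order
ETH0 = [
    ipaddress.ip_interface("10.0.1.1/24"),
    ipaddress.ip_interface("10.0.1.129/25"),
]

print(select_first_match(ETH0, "10.0.1.130"))    # 10.0.1.1   (seen in the PROBE capture)
print(select_most_specific(ETH0, "10.0.1.130"))  # 10.0.1.129 (expected source)
print(select_most_specific(ETH0, "10.0.1.2"))    # 10.0.1.1   (only the /24 matches)
```

With first-match selection the order of addresses on the interface decides the ARP source. With longest-prefix selection the result depends only on the destination, which is the behaviour expected in cases 2 and 3.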
Created attachment 84281 [details] ARP case 1 system configuration and tcpdump captures
Created attachment 84291 [details] ARP case 2 system configuration and tcpdump captures
Created attachment 84301 [details] bash script to reproduce bug
Created attachment 84311 [details] ICMP Reply on net/ipv4/icmp_errors_use_inbound_ifaddr (related)
Please post a summary of this to netdev@vger.kernel.org (you don't need to be subscribed). We use bugzilla for tracking reported bugs rather than for necessarily fixing them.