Bug 109471
Summary: | Linux generating invalid Ethernet frames which get dropped (as over-size) by gateway | ||
---|---|---|---|
Product: | Networking | Reporter: | Brad Hein (linuxbrad) |
Component: | IPV4 | Assignee: | Stephen Hemminger (stephen) |
Status: | NEW --- | ||
Severity: | normal | CC: | Dean_Jenkins, linux, shemming, szg00000 |
Priority: | P1 | ||
Hardware: | ARM | ||
OS: | Linux | ||
Kernel Version: | 3.18.11-v7+ #781 SMP PREEMPT armv7l GNU/Linux | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | attachment-7946-0.html |
Description
Brad Hein
2015-12-16 15:47:16 UTC
I also have an intrusion detection server (IDS) on the network that happens to see the over-sized packets, and it also throwing kernel alerts about the oversized packet length. Dec 16 23:57:11 mri kernel: [1587969.108160] asix 1-1.5:1.0 eth1: asix_rx_fixup() Bad RX Length 1522 Dec 16 23:57:07 mri kernel: [1587964.778865] asix 1-1.5:1.0 eth1: asix_rx_fixup() Bad RX Length 1522 Dec 16 23:57:06 mri kernel: [1587963.821912] asix 1-1.5:1.0 eth1: asix_rx_fixup() Bad RX Length 1522 Dec 16 23:57:05 mri kernel: [1587962.620922] asix 1-1.5:1.0 eth1: asix_rx_fixup() Bad RX Length 1522 Dec 16 23:57:04 mri kernel: [1587962.119229] asix 1-1.5:1.0 eth1: asix_rx_fixup() Bad RX Length 1522 Dec 16 23:57:04 mri kernel: [1587961.535235] asix 1-1.5:1.0 eth1: asix_rx_fixup() Bad RX Length 1522 Dec 16 23:57:03 mri kernel: [1587961.000442] asix 1-1.5:1.0 eth1: asix_rx_fixup() Bad RX Length 1522 Its a driver bug. The Linux networking stack does not care what received packet size (only that the frame is valid). The driver is responsible for handling long frames. A correctly written driver should: * drop any frame that is truncated on receive * keep local statistics about dropped packets The gateway doesn't show any errors in the syslog associated with the drops... but the statistic rx_length_errors does correlate to the drops. The fixup errors I posted are from another server listening to the traffic to/from the gateway. I hope I didn't confuse the question by adding the syslog errors from the IDS server. I meant only to illustrate that multiple servers on the network are complaining about the packet length coming from the access point. Just to be extra clear, the issue being reported is the over-sized network frames. I'd like to try and rule some thing out... here's a sketch of the packet flow: wifi client -> hostapd -> wlan0 -> Linux Bridge -> eth0 I'll work on ruling out the hardware by trying the same OS image on another device. Maybe bad hardware? I could also use some pointers/suggestions for ruling out the Linux Bridge drivers/logic. I swapped hardware (took the OS SD card and moved it to another Pi) and the problem followed the software. So that at least rules out hardware. I took a new disk and loaded a slightly older OS but with the same hostapd/bridge/network configuration and same hardware. The problem doesn't exist with same configuration except older kernel/software. Not Broken: Linux 3.18.7-v7+ #755 SMP PREEMPT Thu Feb 12 17:20:48 GMT 2015 armv7l GNU/Linux (Source: 2015-02-16-raspbian-wheezy.img) Broken: Linux 3.18.11-v7+ #781 SMP PREEMPT Tue Apr 21 18:07:59 BST 2015 armv7l GNU/Linux (Source: 2015-09-24-raspbian-jessie.img) Scratch that - the problem happens in both OS images, it took a while to recreate the conditions necessary to see the big packets but eventually the older image also exhibited the problem. I did however narrow down the problem further by trying a different USB wifi adapter. I found that switching to a Tenda RT5370 USB adapter made the problem go away. **The problem followed the USB adapter**. The following adapter exhibits the too-long-frame problem reported in this ticket: Bus 001 Device 005: ID 0cf3:9271 Atheros Communications, Inc. AR9271 802.11n (TP-Link TL-WN-722N USB with external RPSMA antenna) For comparison the following adapter does not exhibit the problem: GOOD: Bus 001 Device 004: ID 148f:5370 Ralink Technology, Corp. RT5370 Wireless Adapter (Tenda USB 802.11N) Wrote up a bug report on the Atheros firmware site, in case it's a problem with the in-RAM firmware on this adapter. https://github.com/qca/open-ath9k-htc-firmware/issues/80 From what i see here right now: - old kernel is used - old firmware is used Lets start with updating things... It's a little tricky getting the latest kernel onto a Raspberry Pi. the public repo (https://github.com/raspberrypi/linux) is currently at 4.1.y and it looks like 4.3 is current. Today I loaded this 4.1.y kernel and I'll re-check the wifi card to see if the behavior is still present. Not sure how feasible it is to take a kernel from kernel.org and apply the raspi config to build it. I'm guessing it's nontrivial since there is a dedicated repo for the raspberry pi ARM branch. Nonetheless I may also tray that route to help narrow down the search area. Suddenly i get this kind of answer from RPi users all the time. Not sure if i wish to accept any bug report from RPi side. Now i need to let out my frustration: RPi is not an open source product! Bana Pi is much more open then RPi, it is just working with vanilla kernel from kernel.org. Why do people care about RPi at all?! Seems the above analysis in comment 1 above is incomplete. The warning is Dec 16 23:57:11 mri kernel: [1587969.108160] asix 1-1.5:1.0 eth1: asix_rx_fixup() Bad RX Length 1522 which shows the received Ethernet frame length is 1522 octets long. This is a legal Ethernet frame length with a MTU size of 1500 and with 802.1Q tag (VLAN) field present, see https://en.wikipedia.org/wiki/Ethernet_frame#Structure Looking at the ASIX driver source code in drivers/net/usb/asix_common.c (linux 4.7) shows: int asix_rx_fixup_internal(struct usbnet *dev, struct sk_buff *skb, struct asix_rx_fixup_info *rx) <snip> if (size > dev->net->mtu + ETH_HLEN + VLAN_HLEN) { netdev_dbg(dev->net, "asix_rx_fixup() Bad RX Length %d\n", size); return 0; } <snip> Note linux 4.7 changed the netdev_err() to netdev_dbg() as the "fix" for the reported issue by no longer reporting the size issue, sigh. Analysis of the size calculation shows it has a bug in it: if (size > dev->net->mtu + ETH_HLEN + VLAN_HLEN) because the maximum Ethernet frame length for MTU of 1500 is: Ethernet header length (14) + 802.1Q tag (VLAN) length (4) + payload length (MTU) (1500) + FCS length (4) = 1522 Therefore the Frame Check Sequence Frame (FCS) field length is missing from the size calculation so the maximum length accepted by the source code was 1518. Note the VLAN length and FCS length are both 4 so the code works OK when there is no VLAN in use which is the usual case. This masks the defect. I will raise a patch for this issue to add in the missing ETH_FCS_LEN into the size calculation. In the failure case, in the above comment description the wireshark report shows a bad FCS value "ETHERNET FRAME CHECK SEQUENCE INCORRECT". That is worrying. Also I think wireshark is not showing a 802.1Q tag (VLAN) field. So it is unclear why the Ethernet frame length is 1522 octets from the wireshark log. Note that when the size variable value exceeds 1518 octets (current source code), the Ethernet frame is immediately not read out of the URB buffer and the URB is not processed further (potentially losing other Ethernet frames to be read). This can also impact the next URB as synchronisation with the next header data word may have been lost. In addition no network statistics are incremented to indicate a dropped frame because the frame never not read out of the URB buffer. This was why netdev_err() was used to report the failure. Created attachment 226511 [details] attachment-7946-0.html Just to confirm, no tagging is in use on the network in question, so there were no 802.1q headers present at any point. [Sent from mobile device] On Jul 26, 2016 9:29 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=109471 > > Dean Jenkins <Dean_Jenkins@mentor.com> changed: > > What |Removed |Added > > ---------------------------------------------------------------------------- > CC| |Dean_Jenkins@mentor.com > > --- Comment #12 from Dean Jenkins <Dean_Jenkins@mentor.com> --- > Seems the above analysis in comment 1 above is incomplete. The warning is > > Dec 16 23:57:11 mri kernel: [1587969.108160] asix 1-1.5:1.0 eth1: > asix_rx_fixup() Bad RX Length 1522 > > which shows the received Ethernet frame length is 1522 octets long. This > is a > legal Ethernet frame length with a MTU size of 1500 and with 802.1Q tag > (VLAN) > field present, see https://en.wikipedia.org/wiki/Ethernet_frame#Structure > > Looking at the ASIX driver source code in drivers/net/usb/asix_common.c > (linux > 4.7) shows: > > int asix_rx_fixup_internal(struct usbnet *dev, struct sk_buff *skb, > struct asix_rx_fixup_info *rx) > <snip> > if (size > dev->net->mtu + ETH_HLEN + VLAN_HLEN) { > netdev_dbg(dev->net, "asix_rx_fixup() Bad RX Length %d\n", > size); > return 0; > } > <snip> > > Note linux 4.7 changed the netdev_err() to netdev_dbg() as the "fix" for > the > reported issue by no longer reporting the size issue, sigh. > > Analysis of the size calculation shows it has a bug in it: > if (size > dev->net->mtu + ETH_HLEN + VLAN_HLEN) > > because the maximum Ethernet frame length for MTU of 1500 is: > Ethernet header length (14) + 802.1Q tag (VLAN) length (4) + payload length > (MTU) (1500) + FCS length (4) = 1522 > > Therefore the Frame Check Sequence Frame (FCS) field length is missing > from the > size calculation so the maximum length accepted by the source code was > 1518. > > Note the VLAN length and FCS length are both 4 so the code works OK when > there > is no VLAN in use which is the usual case. This masks the defect. > > I will raise a patch for this issue to add in the missing ETH_FCS_LEN into > the > size calculation. > > In the failure case, in the above comment description the wireshark report > shows a bad FCS value "ETHERNET FRAME CHECK SEQUENCE INCORRECT". That is > worrying. Also I think wireshark is not showing a 802.1Q tag (VLAN) field. > So > it is unclear why the Ethernet frame length is 1522 octets from the > wireshark > log. > > Note that when the size variable value exceeds 1518 octets (current source > code), the Ethernet frame is immediately not read out of the URB buffer > and the > URB is not processed further (potentially losing other Ethernet frames to > be > read). This can also impact the next URB as synchronisation with the next > header data word may have been lost. > > In addition no network statistics are incremented to indicate a dropped > frame > because the frame never not read out of the URB buffer. This was why > netdev_err() was used to report the failure. > > -- > You are receiving this mail because: > You reported the bug. > (In reply to Brad Hein from comment #13) > > Just to confirm, no tagging is in use on the network in question, so there > were no 802.1q headers present at any point. > I was aware that "no tagging" was reported. It is a separate issue for why the Ethernet frame contains 1522 octets which seems to be non-conforming to the Ethernet specifications were transmitted. However, the scenario unmasks a bug in the ASIX driver's sanity length calculation for receiving Ethernet frames. The ASIX driver at the point of reporting a bad RX frame length only cares that the number of octets to read out of the URB buffer does not exceed the maximum permissible Ethernet frame length. The ASIX driver does not decode the Ethernet frame, the size value comes from the USB to Ethernet ASIX chip itself by inserting an in-band 32-bit header value saying that the next block of data has a certain length until the next in-band 32-bit header value appears in the data stream. A block of data is the Ethernet frame. Therefore, the ASIX driver should accept a block of data up to the maximum length of 1522 octets for a MTU value of 1500. Then pass this block of data to the network stack to decode and reject the Ethernet frame as necessary. As the ASIX sanity length check calculation is wrong, the network stack never gets the opportunity to check the 1522 octet block of data against the Ethernet protocol specifications. In other words, the ASIX driver is ignoring legal Ethernet frame lengths of 1522 octets for a MTU of 1500. Had VLAN tagging been used so 1522 octet frame lengths were used, the Ethernet frame would still not get through to the network stack so clearly is a bug. I understand why the FCS check fails when receiving the "gateway big Ethernet frame". See https://en.wikipedia.org/wiki/Ethernet_frame#Frame_check_sequence When an Ethernet frame is received, the FCS algorithm (CRC) is applied across ALL octets of the received data block. This means the FCS field contents is treated as data to the algorithm. The FCS is declared good when the FCS algorithm outputs the expected magic number result. In the case of the "gateway big Ethernet frame", 4 extra octets are erroneously appended to the end of the Ethernet frame. Therefore, when the FCS algorithm is applied, these extra octets are consumed causing the CRC calculation to not create the expected magic number result. This suggests that the gateway has a bug in it which allocates a block of data too big to contain the Ethernet frame although the length of the data block is a legal length. Probably, part of the gateway software thought VLAN tagging was deployed but the actual Ethernet frame did not have a VLAN tag field so there is a mismatch in the lengths of the block and the Ethernet frame. |