Bug 29082
Summary: | asix_rx_fixup() bad header length and Bad RX Length in dmesg | ||
---|---|---|---|
Product: | Drivers | Reporter: | Carl Hetherington (cth) |
Component: | Network | Assignee: | drivers_network (drivers_network) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | alan, emilgoode, igor.raits, kernel, lipski.bogdan, lvml, myxal.mxl, slava, windmail01, zarhan |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.4 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
Output of lsusb -vvxxx
syslog from working session syslog from non-working session Test patch Test patch for preventing package loss on AX88772B. |
Created attachment 47692 [details]
syslog from working session
Created attachment 47702 [details]
syslog from non-working session
In the logs there are outputs from a few additional printks that I added. It seems that in the "bad" condition the asix_rx_fixup function continuously receives SKBs of length 2048, as opposed to shorter SKBs of variable length in the "good" condition. I can confirm this bug to still exist in the current git head, a recent patch posted here http://www.mail-archive.com/stable@vger.kernel.org/msg09600.html does _not_ fix the issue for me (but seems to address a similar issue that happens only with VLAN tagging, while the "Bad RX Length" appear for about 0.6% of packets I send via the AXIS USB-to-EtherNet adapter). For those who experience this problem with the Asus-supplied AXIS adapter, I have described a work-around here: http://www.linlap.com/wiki/asus+ux32vd#asix_usb-to-ethernet_adapter (but continously using a patched manufacturer driver does not leave me too happy). Useful to know there is a vendor driver which doesn't go wrong - it may help identify the difference. I confirm this bug, it is still present in linux-3.11.2, with "ASIX AX88772 USB 2.0 Ethernet" (according to dmesg). The adapter works in both cases - whether it was plugged in at boot time or not, but message "asix_rx_fixup() Bad Header Length 0xf0823dfb, offset 16" is periodically logged, especially when moderate traffic (about several mbytes/sec) passes through it . I've noticed interesting peculiarity: the 'Bad header length' message is logged very often (one time per several seconds) during uploading large file through fuse/sshfs, but not logged at all (or logged very infrequently - I didn't observe it for at least 4 minutes) when piping /dev/urandom through netcat. So I suppose that the problem is triggered by packets of specific size (or by sequence of packets with specific sizes). I've finally discovered the easy way to reproduce this bug. ping -s 1965 $HOSTNAME causes repeated messages Dec 19 22:31:30 fileserver kernel: asix 1-1:1.0 eth1: asix_rx_fixup() Bad Header Length 0xd3c8e7c3, offset 6 and no icmp replies from host being pinged. Both hosts are connected to 100mbit ethernet with standard mtu=1500, so this packet is splitted into two packets with sizes 1514 and 527. 'ping -s 485', which produces one packet of size 527, doesn't trigger this bug; 'ping -s 1964' (which emits two packets 1514 and 526 bytes long) doesn't trigger it too. Two packets of legth 1514 and 527 generated by successive ping commands also doesn't trigger it (perhaps because they have larger time interval between them). So the bug is triggered by two successive packets of sepcific sizes (specifically 1514 then (any from 527 to 538) bytes, perhaps there are other combinations too) recieved in a short time (perhaps sufficiently short to be buffered together). I hope my investigation will be useful. I've found the reason of this problem. The device may send several ethernet packets to the host at once, and the total size of such a pack is limited by 2048 bytes. But it is the summary size of raw ethernet frames; when the device sends them to the host, each ethernet frame is prefixed by 32bit header (with frame size and some kind of checksum for this size). As a result, the actual size of data being transfered may exceed dev->rx_urb_size, which is set to 2048. I don't understand what exactly happens in this case (the data is either truncated or, even worse, written beyond the end of the buffer), but, anyway, asix_rx_fixup doesn't see the rest of the data, tries to read it from the begining of next 2048 bytes received and "lose sync". The increase of rx_urb_size to 4096 solves the problem (at least for me): diff -ur drivers/net/usb.orig/asix_devices.c drivers/net/usb/asix_devices.c --- drivers/net/usb.orig/asix_devices.c 2013-12-21 02:25:44.372538206 +0400 +++ drivers/net/usb/asix_devices.c 2013-12-21 03:50:14.406536611 +0400 @@ -488,7 +488,7 @@ if (dev->driver_info->flags & FLAG_FRAMING_AX) { /* hard_mtu is still the default - the device does not support jumbo eth frames */ - dev->rx_urb_size = 2048; + dev->rx_urb_size = 4096; } dev->driver_priv = kzalloc(sizeof(struct asix_common_private), GFP_KERNEL); The exact rx_urb_size could be selected more precisely - 4096 is more than enough. Also I'm not sure that this solution is complete - rx_urb_size seems to be changed in other places (for example, in usbnet.c when changing MTU). I wonder if the MFB_2048 etc values are '2048 bytes + a header' in the devicce itself ? If you change ax88178_set_mfb to do dev->rx_urb_size +=32; just before it does rxctl=asix_read_rx_ctl(dev) and change the line you already did to 2048 + 32 does it then behave - and do the combinations 32 bytes higher still work ? Alan *** Bug 30952 has been marked as a duplicate of this bug. *** *** Bug 16831 has been marked as a duplicate of this bug. *** *** Bug 69671 has been marked as a duplicate of this bug. *** Ok. I'll test it after 2h.. (In reply to Alan from comment #10) > I wonder if the MFB_2048 etc values are '2048 bytes + a header' in the > devicce itself ? > > If you change ax88178_set_mfb to do > > dev->rx_urb_size +=32; > > just before it does rxctl=asix_read_rx_ctl(dev) > > and change the line you already did to 2048 + 32 > > does it then behave - and do the combinations 32 bytes higher still work ? > > Alan Well. Thank you. I've testing this patch: diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c index 9765a7d..6d60fc2 100644 --- a/drivers/net/usb/asix_devices.c +++ b/drivers/net/usb/asix_devices.c @@ -487,7 +487,7 @@ static int ax88772_bind(struct usbnet *dev, struct usb_interface *intf) if (dev->driver_info->flags & FLAG_FRAMING_AX) { /* hard_mtu is still the default - the device does not support jumbo eth frames */ - dev->rx_urb_size = 2048; + dev->rx_urb_size = 2048 + 32; } dev->driver_priv = kzalloc(sizeof(struct asix_common_private), GFP_KERNEL); @@ -746,6 +746,7 @@ static void ax88178_set_mfb(struct usbnet *dev) mfb = AX_RX_CTL_MFB_16384; } + dev->rx_urb_size += 32; rxctl = asix_read_rx_ctl(dev); asix_write_rx_ctl(dev, (rxctl & ~AX_RX_CTL_MFB_16384) | mfb); And it works for me (basic tests). I will do more testing tomorrow, but I think it will work. Could you prepare correctly patch for commiting to the git ? Feel free to add: Tested-by: Igor Gnatenko <i.gnatenko.brain@gmail.com> If you have another machine on the same LAN, I suggest testing with "endless zeros": machine A: nc -l 4444 </dev/zero >/dev/null machine B: nc <machineA-IP> 4444 </dev/zero >/dev/null Back when I had the ASIX NIC this would trigger the bug within 30 seconds. Watch the bandwidth with ifstat or similar. If it passes testing I will do so mh. sad news. Jan 30 10:09:08 X1Carbon.localdomain kernel: asix 1-1.2:1.0 enp0s26u1u2: asix_rx_fixup() Bad Header Length 0x9f2750ba, offset 4 Jan 30 10:09:09 X1Carbon.localdomain kernel: asix 1-1.2:1.0 enp0s26u1u2: asix_rx_fixup() Bad Header Length 0xc63f9f61, offset 4 Jan 30 10:09:09 X1Carbon.localdomain kernel: asix 1-1.2:1.0 enp0s26u1u2: asix_rx_fixup() Bad Header Length 0x8d231476, offset 4 Jan 30 10:09:09 X1Carbon.localdomain kernel: asix 1-1.2:1.0 enp0s26u1u2: asix_rx_fixup() Bad Header Length 0x2b0add58, offset 4 Jan 30 10:09:10 X1Carbon.localdomain kernel: asix 1-1.2:1.0 enp0s26u1u2: asix_rx_fixup() Bad Header Length 0xe9eab2bc, offset 4 Jan 30 10:09:10 X1Carbon.localdomain kernel: asix 1-1.2:1.0 enp0s26u1u2: asix_rx_fixup() Bad Header Length 0xe4f91d4d, offset 4 Jan 30 10:09:11 X1Carbon.localdomain kernel: asix 1-1.2:1.0 enp0s26u1u2: asix_rx_fixup() Bad Header Length 0x7ba31fe0, offset 4 Digging in the info I have can you reverse any other changes and apply the patch I'll attach in a moment and see if that helps. If not can you retry after changing + return asix_rx_fixup_internal(dev, skb, rx, 3 * sizeof(u16), 3); in the patch to + return asix_rx_fixup_internal(dev, skb, rx, 2 * sizeof(u16), 3); Created attachment 123871 [details]
Test patch
(In reply to Alan from comment #20) > Created attachment 123871 [details] > Test patch I'm testing this patch w/o/ previous fix. Still buggy. Testing + return asix_rx_fixup_internal(dev, skb, rx, 2 * sizeof(u16), 3); Yes. With return asix_rx_fixup_internal(dev, skb, rx, 2 * sizeof(u16), 3); It works for me. BUt getting weird msgs from NM. Jan 31 14:57:54 X1Carbon.localdomain NetworkManager[839]: <info> Activation (enp0s20u2) Stage 5 of 5 (IPv6 Commit) complete. Jan 31 14:57:58 X1Carbon.localdomain NetworkManager[839]: <info> Activation (enp0s20u2) Stage 5 of 5 (IPv6 Commit) scheduled... Jan 31 14:57:58 X1Carbon.localdomain NetworkManager[839]: <info> Activation (enp0s20u2) Stage 5 of 5 (IPv6 Commit) started... Jan 31 14:57:58 X1Carbon.localdomain NetworkManager[839]: <info> Activation (enp0s20u2) Stage 5 of 5 (IPv6 Commit) complete. Jan 31 14:58:51 X1Carbon.localdomain NetworkManager[839]: <info> Activation (enp0s20u2) Stage 5 of 5 (IPv6 Commit) scheduled... Jan 31 14:58:51 X1Carbon.localdomain NetworkManager[839]: <info> Activation (enp0s20u2) Stage 5 of 5 (IPv6 Commit) started... Jan 31 14:58:51 X1Carbon.localdomain NetworkManager[839]: <info> Activation (enp0s20u2) Stage 5 of 5 (IPv6 Commit) complete. Jan 31 14:58:55 X1Carbon.localdomain NetworkManager[839]: <info> Activation (enp0s20u2) Stage 5 of 5 (IPv6 Commit) scheduled... Jan 31 14:58:55 X1Carbon.localdomain NetworkManager[839]: <info> Activation (enp0s20u2) Stage 5 of 5 (IPv6 Commit) started... Jan 31 14:58:55 X1Carbon.localdomain NetworkManager[839]: <info> Activation (enp0s20u2) Stage 5 of 5 (IPv6 Commit) complete. Jan 31 14:58:59 X1Carbon.localdomain NetworkManager[839]: <info> Activation (enp0s20u2) Stage 5 of 5 (IPv6 Commit) scheduled... Jan 31 14:58:59 X1Carbon.localdomain NetworkManager[839]: <info> Activation (enp0s20u2) Stage 5 of 5 (IPv6 Commit) started... Jan 31 14:58:59 X1Carbon.localdomain NetworkManager[839]: <info> Activation (enp0s20u2) Stage 5 of 5 (IPv6 Commit) complete. Oh. bad news. after some time got again err.. [ 2807.803177] asix 3-2:1.0 enp0s20u2: asix_rx_fixup() Bad Header Length 0x66db500, offset 8 [ 2829.395928] asix 3-2:1.0 enp0s20u2: asix_rx_fixup() Bad Header Length 0x1b72148, offset 16 [ 2834.755826] asix 3-2:1.0 enp0s20u2: asix_rx_fixup() Bad Header Length 0xf738066d, offset 10 [ 2836.130169] asix 3-2:1.0 enp0s20u2: asix_rx_fixup() Bad Header Length 0xaf3f9d21, offset 16 [ 3078.781824] asix 3-2:1.0 enp0s20u2: asix_rx_fixup() Bad Header Length 0xf738066d, offset 10 [ 3102.137216] asix 3-2:1.0 enp0s20u2: asix_rx_fixup() Bad Header Length 0xf738066d, offset 10 [ 3110.797625] asix 3-2:1.0 enp0s20u2: asix_rx_fixup() Bad Header Length 0xf738066d, offset 10 [ 3119.510456] asix 3-2:1.0 enp0s20u2: asix_rx_fixup() Bad Header Length 0xf738066d, offset 10 [ 3403.743047] asix 3-2:1.0 enp0s20u2: asix_rx_fixup() Bad Header Length 0x66db500, offset 8 [ 3409.607052] asix 3-2:1.0 enp0s20u2: asix_rx_fixup() Bad Header Length 0xffffffff, offset 10 [ 3745.013895] asix 3-2:1.0 enp0s20u2: asix_rx_fixup() Bad Header Length 0x1b00f738, offset 12 Created attachment 124481 [details]
Test patch for preventing package loss on AX88772B.
I also noticed this bug and I have found that the problem is that the usbnet module discards buffers when skb->len is less than dev->net->hard_header_len. Since the AX88772B occasionally places more than one packet in a single urb, causing the amount of data to exceed the dev->rx_urb_size, this causes the next urb to contain the remaining headerless data of the last packet in the previous urb. When the remaining data is of less than hard_header_len number of octets the buffer is discarded. The default value of hard_header_len is 14 octets, which causes the following behavior: ping -c 1 -s 1976 $HOSTNAME In the above case the we get two packets of the sizes 1514+4 and 538+4 in one urb. This is a total of 2060 octets which leaves 12 remaining octets to be sent in the next urb since rx_urb_size is 2048. But instead the next urb contains a new packet at offset 0, which means we lost the remaining data of the last package in the previous urb. ping -c 1 -s 1977 $HOSTNAME In this case the second urb contains 14 octets of header less data. We sent only one more octet so it should be 13, but the rx_fixup function rounds it to an even number. The following is what the code in the rx_complete function in usbnet.c does: if (skb->len < dev->net->hard_header_len) { state = rx_cleanup; ^^^^^^^^^^ dev->net->stats.rx_errors++; dev->net->stats.rx_length_errors++; } ... /* stop rx if packet error rate is high */ if (++dev->pkt_cnt > 30) { dev->pkt_cnt = 0; dev->pkt_err = 0; } else { if (state == rx_cleanup) dev->pkt_err++; ^^^^^^^^^^ if (dev->pkt_err > 20) set_bit(EVENT_RX_KILL, &dev->flags); } The quick fix is to set dev->net->hard_header_len to 0, you can test the temporary patch in the previous post. However I don't think this is a pretty solution as I believe the hard_header_len value is also used when transmitting packages. Perhaps there should be a way to indicate to the usbnet module that it should make an exception to the hard_header_len test in a situation like this? Best regards, Emil Goode (In reply to Emil Goode from comment #24) > Created attachment 124481 [details] > Test patch for preventing package loss on AX88772B. Yep. This patch fixed my bug. Ok, I will send a patch with this fix. any news here ? This is really uncomfortably... I have sent the below patch instead to fix this problem, it's waiting for testing/maintainer review. http://marc.info/?l=linux-kernel&m=139231010015039&w=2 (In reply to Emil Goode from comment #29) > I have sent the below patch instead to fix this problem, it's waiting for > testing/maintainer review. > > http://marc.info/?l=linux-kernel&m=139231010015039&w=2 Should I test something ? I have tested it on a AX88772B, if you have a different device that would be great. I have interesting device. https://bugzilla.kernel.org/attachment.cgi?id=123781 I will test patch. The patch has been applied and queued for stable kernels. commit 9e88626331cf11d45adaf32aadd1f245dcfcdf05 Author: Emil Goode <emilgoode@gmail.com> Date: Wed Feb 12 17:37:53 2014 +0100 usbnet: remove generic hard_header_len check (In reply to Emil Goode from comment #31) > I have tested it on a AX88772B, if you have a different device that would be > great. I'm a little confused. Isn't ax88179_178a.c supposed to be the driver for ax88179 or ax88178a? Can it work with ax88772b? The AX88772B uses the asix module. There is a separate minidriver for the AX88179 and AX88178A. Let's close this bug! Emil, thank you for patch! |
Created attachment 47682 [details] Output of lsusb -vvxxx Using a Edimax EU-4207 USB ethernet adapter on a HP Probook 4520s. Output of lsusb -vvxxx attached. If the adapter is plugged in on boot, the adapter lights up but does not pass network traffic, and repeated "asix_rx_fixup() bad header length" and "asix_rx_fixup() Bad RX Length messages" appear in dmesg. On unplugging and replugging the adapter, everything works fine. I have also seen the bad messages start to appear after a period of the card working fine. I recompiled asix.ko and usbnet.ko with DEBUG=1, and I attach two syslogs; "unhappy" is when the errors are occurring, and "happy" is when they are not. Both these logs start from insmod asix.ko.