Bug 5030

Summary: net VIA Rhine Oversized Ethernet frame spanned multiple buffers, entry 0x4 length 0 status 00000600!
Product: Drivers
Reporter: udo (udovdh)
Component: Network
Assignee: Roger Luethi (rl)
Status: REJECTED INSUFFICIENT_DATA
Severity: normal
CC: a1bert, bunk, folkert, protasnb, richard, rl
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: various, now 2.6.13
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Patch from Roger (maintainer)
Some results from the patch, hopefully useful debug output
Some results from the patch, hopefully useful debug output
Another debug patch from Roger (maintainer).
Results of the 2nd debug patch from Roger.
More results

Description udo 2005-08-08 21:55:11 UTC
Distribution: Fedora
Hardware Environment: VIA CL6000E
Software Environment: Kernel
Problem Description: After about 24h+ of uptime I see messages like:

Jan 13 19:35:46 epia kernel: eth1: Oversized Ethernet frame spanned multiple buffers, entry 0x4 length 0 status 00000600!
Jan 13 19:35:46 epia kernel: eth1: Oversized Ethernet frame ccf80040 vs ccf80040.
Jan 13 19:35:46 epia kernel: eth1: Oversized Ethernet frame spanned multiple buffers, entry 0x5 length 0 status 00000400!
Jan 13 19:35:46 epia kernel: eth1: Oversized Ethernet frame ccf80050 vs ccf80050.
[...]

Jan 13 19:35:46 epia kernel: eth1: Oversized Ethernet frame spanned multiple buffers, entry 0xf length 0 status 00000400!
Jan 13 19:35:46 epia kernel: eth1: Oversized Ethernet frame ccf800f0 vs ccf800f0.
Jan 13 19:35:46 epia kernel: eth1: Oversized Ethernet frame spanned multiple buffers, entry 0x0 length 0 status 00000400!
Jan 13 19:35:46 epia kernel: eth1: Oversized Ethernet frame ccf80000 vs ccf80000.
Jan 13 19:35:46 epia kernel: eth1: Oversized Ethernet frame spanned multiple buffers, entry 0x1 length 0 status 00000400!
Jan 13 19:35:46 epia kernel: eth1: Oversized Ethernet frame ccf80010 vs ccf80010.
Jan 13 19:35:46 epia kernel: eth1: Oversized Ethernet frame spanned multiple buffers, entry 0x2 length 0 status 00000400!
Jan 13 19:35:46 epia kernel: eth1: Oversized Ethernet frame ccf80020 vs ccf80020.
Jan 13 19:35:46 epia kernel: eth1: Oversized Ethernet frame spanned multiple buffers, entry 0x3 length 0 status 00000581!
Jan 13 19:35:46 epia kernel: eth1: Oversized Ethernet frame ccf80030 vs ccf80030.

This happens every 3 or 4 days or so when I really use the card. (Please note that all 16
entries are used and the length is 0.)
The connection is gone most of the time when these messages appear.
Eth1 is connected to an Alcatel Speedtouch Home at 10 Mbit/s, half duplex.
This problem did not occur when I was using other hardware for the firewall.
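
The "length" and "status" fields in these messages can be read off a single 32-bit status word. The following is only a rough sketch, not the driver's code. It assumes the length is the upper 16 bits of the status word (which agrees with the "length 1518 status 05ee8d00" line reported later in this bug) and that bits 0x0200 and 0x0100 mark the first and last buffer of a frame. The macro and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of the Rx status word; the names are illustrative
 * and not taken from the driver source. */
#define RX_FIRST_BUFFER 0x0200u  /* assumed: buffer holds the start of a frame */
#define RX_LAST_BUFFER  0x0100u  /* assumed: buffer holds the end of a frame */

/* Frame length as printed in the log: the upper 16 bits of the status word. */
static inline unsigned int rx_frame_length(uint32_t status)
{
    return status >> 16;
}

/* True only if a single buffer holds the whole frame (both flags set). */
static inline bool rx_whole_frame(uint32_t status)
{
    const uint32_t both = RX_FIRST_BUFFER | RX_LAST_BUFFER;

    return (status & both) == both;
}
```

Under these assumptions, status 00000600 decodes to length 0 with only the first-buffer flag set, 00000400 has neither flag, and 00000581 has only the last-buffer flag, so none of the entries shown above passes the whole-frame test.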

While googling I saw that this is an old bug which has not been fixed for years.
It was reported before at http://lkml.org/lkml/2005/1/15/47
The alternate driver by VIA does not have this bug but has higher CPU consumption.

Steps to reproduce:

Build kernel, boot kernel, use it.
Comment 1 udo 2005-08-08 21:56:06 UTC
ifconfig eth1 down
ifconfig eth1 up

can fix the issue if it occurs.
Comment 2 udo 2005-08-08 21:59:41 UTC
My firewall is VIA EPIA CL-6000 with VIA Rhine network chips running FC3/4.
Comment 3 udo 2005-08-09 07:17:52 UTC
relevant .config:
CONFIG_VIA_RHINE=y
CONFIG_VIA_RHINE_MMIO=y
Comment 4 udo 2005-08-22 10:33:33 UTC
We tried the VIA Rhinefet driver instead with a slight modification. (see
http://www.viaarena.com/downloads/Source/rhinefet.tgz)
Then we get:

Aug 22 00:06:55 epia kernel:  eth1 : the received frame span multple RDs
Aug 22 00:14:05 epia kernel:  eth1 : the received frame span multple RDs
Aug 22 00:25:51 epia kernel:  eth1 : the received frame span multple RDs
Aug 22 00:44:09 epia kernel:  eth1 : the received frame span multple RDs
Aug 22 00:49:13 epia kernel:  eth1 : the received frame span multple RDs

But the connection is NOT dropped!
Modification:

in rhine_receive_frame():

    if ((pRD->rdesc0.byRSR1 & (RSR1_STP|RSR1_EDP)) != (RSR1_STP|RSR1_EDP)) {
        RHINE_PRT(MSG_LEVEL_VERBOSE,
            KERN_NOTICE " %s : the received frame span multple RDs\n",
            dev->name);
        pStats->rx_length_errors++;
        return FALSE;
    }

MSG_LEVEL_VERBOSE was changed to 0 and KERN_NOTICE was replaced with KERN_ERR.

So it is a chipset bug which is not well handled by the standard Linux driver.
Comment 5 udo 2005-09-03 07:42:29 UTC
Created attachment 5876
Patch from Roger (maintainer)

This patch may help find the cause of the bug.
Comment 6 udo 2005-09-04 02:13:05 UTC
Created attachment 5886
Some results from the patch, hopefully useful debug output
Comment 7 udo 2005-09-04 07:19:03 UTC
Created attachment 5888
Some results from the patch, hopefully useful debug output
Comment 8 Roger Luethi 2005-09-04 10:36:06 UTC
Log files are interesting if they are taken when the driver fails. They should
indicate where the driver stopped working (making the ifconfig routine as
described in the report necessary).
Comment 9 Roger Luethi 2005-09-04 10:44:27 UTC
move severity down to normal
Comment 10 udo 2005-09-06 08:54:36 UTC
Created attachment 5917
Another debug patch from Roger (maintainer).
Comment 11 udo 2005-09-06 08:55:44 UTC
With the first debug patch, extra messages were logged when things went wrong,
but the link was no longer dropped (as it would be with the unpatched
driver).
Comment 12 udo 2005-09-08 10:20:06 UTC
Created attachment 5933
Results of the 2nd debug patch from Roger.
Comment 13 udo 2005-09-08 10:24:34 UTC
With the second debug patch, the connection was dropped when the error messages
were produced (as with the vanilla driver, but unlike with the first debug patch).
Comment 14 udo 2005-09-08 10:42:12 UTC
Created attachment 5934
More results
Comment 15 Roger Luethi 2005-09-13 04:36:48 UTC
Looks as if there were two types of errors here: If only two or three buffers
are affected, the error handling recovers properly. There are, however, rare
events where all buffers are affected and the Rx engine is shut down by the chip
(something that is not supposed to happen). So the solution is to turn the Rx
engine back on. Patch in testing. ETA Linux 2.6.15.
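
The distinction between the two error types can be pictured as a count over the Rx ring. This is a hypothetical illustration only, not the patch in testing: the ring size of 16 is taken from the original report, the 0x0300 whole-frame mask is an assumption, and all names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define RX_RING_SIZE   16       /* the report shows entries 0x0 to 0xf */
#define RX_WHOLE_FRAME 0x0300u  /* assumed: first-buffer | last-buffer flags */

enum rx_fault {
    RX_FAULT_NONE,     /* every buffer holds a complete frame */
    RX_FAULT_PARTIAL,  /* some buffers affected; normal error handling copes */
    RX_FAULT_STALLED   /* all buffers affected; the Rx engine may need a restart */
};

/* Classify one snapshot of the ring's status words. */
enum rx_fault classify_rx_ring(const uint32_t status[RX_RING_SIZE])
{
    size_t bad = 0;

    for (size_t i = 0; i < RX_RING_SIZE; i++) {
        if ((status[i] & RX_WHOLE_FRAME) != RX_WHOLE_FRAME)
            bad++;
    }
    if (bad == 0)
        return RX_FAULT_NONE;
    return bad == RX_RING_SIZE ? RX_FAULT_STALLED : RX_FAULT_PARTIAL;
}
```

Only the RX_FAULT_STALLED case corresponds to the rare event described above, where turning the Rx engine back on would be needed.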
Comment 16 Adrian Bunk 2006-12-17 18:35:07 UTC
What is the current status of this issue?
Comment 17 udo 2006-12-17 22:34:15 UTC
I no longer have the CL6000 board on which I experienced the bug.
I now use an EK8000.
While using the CL6000 I switched ADSL modems and noticed that the bug goes away when
the ethernet link goes from half to full duplex. In other words: full duplex == no bug,
AFAIK.
Roger provided various patches to find out what we (he!) know now.
With Roger's patches the impact was reduced quite a lot.
I did not get around to testing more because of the ADSL modem issue. After the
ADSL 2+ upgrade of the network my old Alcatel (was it?) modem dropped the link
randomly, whereas the Thomson that I got as a replacement is stable but lacks the
diagnostic info about line quality.
So there is no usable half-duplex network gear around to test with. :-(

Comment 18 Natalie Protasevich 2007-07-18 14:45:51 UTC
Any updates on this issue? Udo, did you have any luck getting hold of half-duplex adapters? If not, we will close this bug for now and re-open it if needed in the future.
Comment 19 Roger Luethi 2007-07-19 00:04:20 UTC
I agree with closing it for now since no one is able to reproduce the error right now. It is not supposed to happen anyway (yeah, I know). If it returned, it could be addressed by restarting the chip's Rx engine (which can die on this error, but appears to take an indeterminate amount of time (33+ ms!) to do so).
Comment 20 udo 2007-07-20 09:27:47 UTC
I do not have a working half-duplex ADSL modem around.
I do not have other machines that I can connect to the firewall in question.
I even replaced the VIA CL6000 board with an EK8000.
The bug is still there if not much has changed.
I cannot help you much, though, due to my hardware situation.
Comment 21 Itay Perl 2007-12-13 12:34:19 UTC
I have a VIA Rhine II ethernet controller, and an Alcatel SpeedTouch Home.
I get oversized frames very frequently; connection lost about once a day.

Running kernel 2.6.22. 
FWIW, I get this on every boot:
via-rhine.c:v1.10-LK1.4.3 2007-03-06 Written by Donald Becker
via-rhine: Broken BIOS detected, avoid_D3 enabled.
Comment 22 Natalie Protasevich 2007-12-13 13:08:19 UTC
Roger, is this something you can look at? According to the reporter, there is still a problem in 2.6.22.
Comment 23 udo 2007-12-13 23:43:21 UTC
Re comment #21: Itay, is your link running in half duplex?
(my problems went away when I could run the link in full duplex)
Comment 24 Itay Perl 2007-12-14 14:44:36 UTC
I have no idea what this means, but I guess the answer is yes:
Dec 15 00:30:52 localhost eth0: link up, 10Mbps, half-duplex, lpa 0x0021
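
For what it's worth, the lpa value in that message is the link partner's advertised abilities, read from the MII link partner ability register. Below is a minimal sketch of decoding it, using the same bit values that linux/mii.h defines as LPA_10HALF through LPA_100FULL. The two function names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same values as the LPA_* constants in linux/mii.h. */
#define LPA_10HALF  0x0020u  /* partner can do 10 Mbit/s half duplex */
#define LPA_10FULL  0x0040u  /* partner can do 10 Mbit/s full duplex */
#define LPA_100HALF 0x0080u  /* partner can do 100 Mbit/s half duplex */
#define LPA_100FULL 0x0100u  /* partner can do 100 Mbit/s full duplex */

/* True if the partner advertises any full-duplex mode. */
bool partner_offers_full_duplex(uint16_t lpa)
{
    return (lpa & (LPA_10FULL | LPA_100FULL)) != 0;
}

/* True if the partner advertises any 100 Mbit/s mode. */
bool partner_offers_100mbit(uint16_t lpa)
{
    return (lpa & (LPA_100HALF | LPA_100FULL)) != 0;
}
```

For lpa 0x0021 both helpers return false: the only mode bit set is 0x0020 (10 Mbit/s half duplex), so the modem is not advertising full duplex at all.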
Comment 25 Itay Perl 2008-01-29 23:24:16 UTC
OK, I took some time to test the connection in full duplex (I used mii-tool to change it). It appears to be a bit more stable but still disconnects every day or two.
Comment 26 udo 2008-01-30 07:08:06 UTC
Disconnects? I did not have that experience back then.
Comment 27 Natalie Protasevich 2008-01-30 16:18:22 UTC
Is there a consensus that the driver is stable now? Itay, can you elaborate on the disconnects (kind of traffic, maybe traces, error messages in system logs), unless your connection is working OK currently.
Comment 28 Itay Perl 2008-02-04 23:46:03 UTC
I have tons of messages similar to this one in the logs:

pptp[5894]: anon log[decaps_gre:pptp_gre.c:407]: buffering packet 1865279 (expecting 1865270, lost or reordered)

Once in a while I get these errors:

eth0: Oversized Ethernet frame spanned multiple buffers, entry 0xe length 0 status 00000600!
eth0: Oversized Ethernet frame f683c0e0 vs f683c0e0.
eth0: Oversized Ethernet frame spanned multiple buffers, entry 0xf length 23187 status 5a930d19!
eth0: Oversized Ethernet frame f683c0f0 vs f683c0f0.

The connection to the Alcatel is lost about once a day. Restarting the modem manually restores the connection. Just before disconnecting I get these errors:

pptp[5900]: anon log[pptp_handle_timer:pptp_ctrl.c:1049]: closing control connection due to missing echo reply
pptp[5900]: anon log[ctrlp_rep:pptp_ctrl.c:251]: Sent control packet type is 12 'Call-Clear-Request'
pptp[5900]: anon log[pptp_conn_close:pptp_ctrl.c:430]: Closing PPTP connection
pptp[5900]: anon log[ctrlp_rep:pptp_ctrl.c:251]: Sent control packet type is 3 'Stop-Control-Connection-Request'
pptp[5900]: anon log[call_callback:pptp_callmgr.c:78]: Closing connection (call state)
Comment 29 udo 2008-02-05 08:08:50 UTC
I only had the 'Oversized Ethernet frame spanned multiple buffers' messages.
I have not seen them since getting a full-duplex link.

I do get the pptp errors here too, mostly because of CPU load, I guess.
Comment 30 Richard Wall 2010-06-24 11:32:47 UTC
Hello Roger,

I am encountering this bug in Linux Kernel 2.6.31.13, 2.6.32.15 and 2.6.34 and have gathered some further information which may be useful in tracking down the cause....

Device is a Dlink DFE-530TX PCI card

# cat /sys/class/net/eth1/device/vendor 
0x1106
# cat /sys/class/net/eth1/device/device           
0x3043

http://www.pcidatabase.com/vendor_details.php?id=648
{{{
0x3043	
Chip Number:	VT86C100A
Chip Description:	Rhine 10/100 Ethernet Adapter
Notes:	Dlink released a NIC (530TX revA1) that has this Device ID and number, it's the same card
}}}

{{{
#ethtool eth1
Settings for eth1:
	Supported ports: [ TP MII ]
	Supported link modes:   10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	Advertised auto-negotiation: Yes
	Speed: 100Mb/s
	Duplex: Full
	Port: MII
	PHYAD: 8
	Transceiver: internal
	Auto-negotiation: on
	Supports Wake-on: d
	Wake-on: d
	Current message level: 0x00000001 (1)
	Link detected: yes
}}}


The following errors are logged with Kernel 2.6.32.15...
{{{
Jun 23 16:23:11 rwcb100 kernel: [ 1028.410162] eth1: Oversized Ethernet frame f58dd1d0 vs f58dd1d0.
Jun 23 16:23:11 rwcb100 kernel: [ 1028.410168] eth1: Oversized Ethernet frame spanned multiple buffers, entry 0x1e length 1518 status 05ee8d00!
Jun 23 16:23:11 rwcb100 kernel: [ 1028.410174] eth1: Oversized Ethernet frame f58dd1e0 vs f58dd1e0.
Jun 23 16:23:11 rwcb100 kernel: [ 1028.544841] eth1: Too much work at interrupt, status=0x0000ffff.
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004012] ------------[ cut here ]------------
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004027] WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x1be/0x1d0()
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004033] Hardware name:         
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004037] NETDEV WATCHDOG: eth1 (via-rhine): transmit queue 0 timed out
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004041] Modules linked in: ip_gre e1000 via_rhine tg3 libphy r8169 pcnet32 e100 8139too mii vt8231 asus_atk0110 via686a hwmon_vid coretemp hwmon
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004069] Pid: 0, comm: swapper Not tainted 2.6.32.15 #2
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004073] Call Trace:
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004083]  [<c0129fae>] ? warn_slowpath_common+0x6e/0xb0
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004090]  [<c035df5e>] ? dev_watchdog+0x1be/0x1d0
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004097]  [<c012a03b>] ? warn_slowpath_fmt+0x2b/0x30
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004103]  [<c035df5e>] ? dev_watchdog+0x1be/0x1d0
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004111]  [<c0120382>] ? __wake_up+0x42/0x60
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004119]  [<c0139a42>] ? insert_work+0x42/0x50
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004125]  [<c035dda0>] ? dev_watchdog+0x0/0x1d0
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004132]  [<c01331d9>] ? run_timer_softirq+0xf9/0x1c0
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004139]  [<c0149678>] ? tick_program_event+0x28/0x40
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004147]  [<c012f1b0>] ? __do_softirq+0x80/0x100
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004153]  [<c012f25d>] ? do_softirq+0x2d/0x40
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004160]  [<c0114ed4>] ? smp_apic_timer_interrupt+0x54/0x90
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004168]  [<c0103616>] ? apic_timer_interrupt+0x2a/0x30
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004175]  [<c0109c02>] ? mwait_idle+0x42/0x60
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004181]  [<c0101d25>] ? cpu_idle+0x35/0x60
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004189]  [<c05487c3>] ? start_kernel+0x2b4/0x2b9
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004196]  [<c054834d>] ? unknown_bootoption+0x0/0x190
Jun 23 16:23:12 rwcb100 kernel: [ 1032.004200] ---[ end trace 2ff4834a8e699e17 ]---
Jun 23 16:23:12 rwcb100 kernel: [ 1032.006472] eth1: Transmit timed out, status ffff, PHY status ffff, resetting...
Jun 23 16:23:12 rwcb100 kernel: [ 1032.006602] via-rhine: Reset not complete yet. Trying harder.
Jun 23 16:23:12 rwcb100 kernel: [ 1032.016952] eth1: link up, 100Mbps, full-duplex, lpa 0xFFFF
Jun 23 16:23:16 rwcb100 kernel: [ 1036.006282] eth1: Transmit timed out, status ffff, PHY status ffff, resetting...
Jun 23 16:23:16 rwcb100 kernel: [ 1036.006407] via-rhine: Reset not complete yet. Trying harder.
Jun 23 16:23:16 rwcb100 kernel: [ 1036.016797] eth1: link up, 100Mbps, full-duplex, lpa 0xFFFF
Jun 23 16:23:20 rwcb100 kernel: [ 1040.006278] eth1: Transmit timed out, status ffff, PHY status ffff, resetting...
Jun 23 16:23:20 rwcb100 kernel: [ 1040.006404] via-rhine: Reset not complete yet. Trying harder.
Jun 23 16:23:20 rwcb100 kernel: [ 1040.016797] eth1: link up, 100Mbps, full-duplex, lpa 0xFFFF

}}}

We get the same errors in 2.6.31.13 and 2.6.34, although not always with a call trace.

We can recreate this problem by downloading a large file (e.g. the Linux kernel) via that interface.

When the problem occurs, dmesg and syslog are flooded with these errors, the box is not accessible via network, but it sometimes remains responsive on a serial console. Serial console often also becomes unresponsive to normal key presses, but does respond to sysrq magic keys.

If you have any possible fixes for this problem I am happy to try out patches and can provide any other information that might be useful to you.

I have not been able to compile and test the VIA Rhinefet driver for any of these kernels due to the changes in the netdev.h structures.

Thanks in advance.

-RichardW.
Comment 31 a1bert 2010-06-29 10:13:30 UTC
Well, we are experiencing the same problem on busy ALIX-based routers; it looks like there is a broadcast spike at the same time...

please REOPEN
Comment 32 Roger Luethi 2010-06-29 10:37:29 UTC
I don't seem to be able to reopen the bug, but I'm monitoring it anyway.

Richard Wall is currently trying an older kernel to determine whether this could have been a regression to begin with.

a1bert: Instructions for reproducing the problem would be appreciated.
Comment 33 Richard Wall 2010-06-29 10:41:31 UTC
Roger: I haven't had time to try 2.6.8 yet - will try and do it this week.

A1bert: We also use Alix boards in some products, and I will try and recreate the problem on that board.

Here's a record of the conversation I've had with Roger in the last few days.
{{{
On Fri, 25 Jun 2010 10:16:28 +0100, Richard Wall wrote:
> Sorry for contacting you directly. I posted some information to the
> following Linux Kernel bugzilla ticket yesterday, but since it is
> marked as RESOLVED I don't know whether you have received a
> notification.
>  * https://bugzilla.kernel.org/show_bug.cgi?id=5030

I have, but I'm not sure what to make of it. This bug is ancient, and so is
your hardware. I'm not sure I even have a Rhine-I (VT86C100A) anymore.

> Here is the information I posted and if you had any ideas about what
> might be wrong or if you have any potential workarounds I'd be really
> interested to try them out.

Can you try Linux 2.6.8?

Roger Luethi
}}}

{{{
On Fri, Jun 25, 2010 at 4:25 PM, Roger Luethi <rl@hellgate.ch> wrote:
> On Fri, 25 Jun 2010 10:16:28 +0100, Richard Wall wrote:
>> Sorry for contacting you directly. I posted some information to the
>> following Linux Kernel bugzilla ticket yesterday, but since it is
>> marked as RESOLVED I don't know whether you have received a
>> notification.
>>  * https://bugzilla.kernel.org/show_bug.cgi?id=5030
>
> I have, but I'm not sure what to make of it. This bug is ancient, and so is
> your hardware. I'm not sure I even have a Rhine-I (VT86C100A) anymore.

Hi Roger,

Thanks for replying. I understand that it must be difficult for you to
develop for all versions of the hardware, but in fact that chipset is
still being sold. - Note that the card is actually a Dlink with VIA
chipset....

DFE-530TX

http://www.google.co.uk/products/catalog?hl=en&q=DFE-530TX&cid=11389062735910045865&ei=P9QkTKjZD8KR-Abkz53WDw&sa=title&ved=0CAcQ8wIwADgA#p

>> Here is the information I posted and if you had any ideas about what
>> might be wrong or if you have any potential workarounds I'd be really
>> interested to try them out.
>
> Can you try Linux 2.6.8?

I might be able to try it and attempt to recreate the problem on that
version, but ultimately we rely on some of the features in the latest
Kernels (particularly TPROXY) so this won't be a long term solution
for us.

I'll let you know whether the problem also occurs with 2.6.8

-RichardW.
}}}


{{{
On Fri, 25 Jun 2010 17:18:29 +0100, Richard Wall wrote:
> Thanks for replying. I understand that it must be difficult for you to
> develop for all versions of the hardware, but in fact that chipset is
> still being sold. - Note that the card is actually a Dlink with VIA
> chipset....

I didn't realize they were still being made.

> I might be able to try it and attempt to recreate the problem on that
> version, but ultimately we rely on some of the features in the latest
> Kernels (particularly TPROXY) so this won't be a long term solution
> for us.

I am grasping for straws here. But 2.6.8 had a feature that more recent
kernels don't have -- one which might help recovering from Rx problems.

> I'll let you know whether the problem also occurs with 2.6.8

Thx.

Roger
}}}
Comment 34 a1bert 2010-06-29 11:05:30 UTC
Well, forget about the broadcast spike I reported; it's an ARP storm after the NIC has died...
Comment 35 Richard Wall 2010-07-05 09:17:55 UTC
On Sun, Jun 27, 2010 at 6:55 AM, Roger Luethi <rl@hellgate.ch> wrote:
<snip>
> I am grasping for straws here. But 2.6.8 had a feature that more recent
> kernels don't have -- one which might help recovering from Rx problems.
>> I'll let you know whether the problem also occurs with 2.6.8

Roger,

I wasn't able to compile the 2.6.8 kernel on my Ubuntu 10.04 machine, it looked like an incompatibility with the latest GCC. Even if I could compile it, we wouldn't be able to use such an old kernel on our products, so I'm going to recommend to colleagues that we avoid this old chipset.

Thanks for your help and for maintaining the Via drivers.

A1bert, 

I wasn't able to recreate the problem on one of our Geode / Alix boards by doing the same series of large downloads through each of its three interfaces. It has a newer chipset, so perhaps it's not the same problem.

{{{
root@geode:~# cat /sys/class/net/eth1/device/vendor
0x1106
root@geode:~# cat /sys/class/net/eth1/device/device 
0x3053
}}}

http://www.pcidatabase.com/vendor_details.php?id=648
{{{
0x3053	
Chip Number:	VT6105M
Chip Description:	Rhine III Management Adapter
Notes:	drivers
}}}

-RichardW.