Bug 14737

Summary: e1000e driver experiences large packet losses
Product: Drivers
Reporter: Caleb Cushing (xenoterracide)
Component: Network
Assignee: drivers_network (drivers_network)
Status: REOPENED
Severity: normal
CC: alan, jbrandeb, korgie, szg00000
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.2
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  2.6.31 dmesg
  netstat -s
  lspci -vv 2.6.32 (while bug not occurring)
  /proc/interrupts (while bug not occurring)
  2.6.32 config
  netstat -s (while bug not occurring)

Description Caleb Cushing 2009-12-05 07:02:48 UTC
Possibly related to bug #13835: same symptoms, and I uploaded a lot of info for this bug there.

Also read this thread: http://marc.info/?t=125699907100001&r=1&w=2

So far I have been able to find similar symptoms all the way back to 2.6.29.6; I have not yet tested farther back. The problem is intermittent. It does not appear to affect another NIC on the system (however, testing of that hasn't been extensive, and it uses a different driver). If the bug hasn't manifested after a boot, it will not manifest until I reboot the computer (unless, perhaps, I reload modules or restart interfaces; not tested).
Comment 1 Caleb Cushing 2009-12-05 14:18:00 UTC
Created attachment 24034
2.6.31 dmesg
Comment 2 Caleb Cushing 2009-12-05 14:21:37 UTC
Created attachment 24036
netstat -s
Comment 3 Caleb Cushing 2009-12-05 14:24:09 UTC
Created attachment 24037
lspci -vv 2.6.32 (while bug not occurring)
Comment 4 Caleb Cushing 2009-12-05 14:26:51 UTC
Created attachment 24038
/proc/interrupts (while bug not occurring)
Comment 5 Caleb Cushing 2009-12-05 14:29:36 UTC
Created attachment 24039
2.6.32 config
Comment 6 Caleb Cushing 2009-12-05 14:32:33 UTC
The upstream router is a Linksys WRT54GL (1.1, I think?) running OpenWrt 8.09.1. Several other Windows systems work fine on it, and another Linux system ran fine on its wireless too.
Comment 7 Caleb Cushing 2009-12-05 14:51:48 UTC
Created attachment 24040
netstat -s (while bug not occurring)
Comment 8 Andrew Morton 2009-12-07 21:19:34 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sat, 5 Dec 2009 07:02:49 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14737
> 
>            Summary: e1000e driver experiences large packet losses
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 2.6.32--
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Network
>         AssignedTo: drivers_network@kernel-bugs.osdl.org
>         ReportedBy: xenoterracide@gmail.com
>         Regression: No
> 
> 
> possibly related to this bug #13835 same symptoms and I upped a lot of info
> for this bug on there.
>
> Also Read this thread http://marc.info/?t=125699907100001&r=1&w=2
>
> I have thus far been able to find similar symptoms all the way back to
> 2.6.29.6 I've not yet done testing farther. problem is intermittent. It does
> not appear to affect another nic on the system (however testing of that
> hasn't been extensive,and a different driver). if I do not reboot the
> computer and the bug hasn't manifested, it will not manifest (perhaps unless
> I reload modules or restart interfaces (not tested).
>
Comment 9 Jesse Brandeburg 2009-12-07 21:53:14 UTC
Thanks akpm, I've been watching this thread but now I will try to jump in.

Caleb, can you please summarize where we are today? You've done a lot of
testing and the thread has gone on a while.

Kernels known to fail (after any length):

Kernels known to work:

Have you been able to try the latest e1000e from 2.6.32?  It has some
fixes in it, although none right off the top of my head that will fix your
issue.

I have a couple of related questions. Why don't you have irqbalance
enabled?  Network interrupts should not be migrating across all CPUs
evenly; at the very least your system should be reconfigured to lock the
interrupts to a particular core with smp_affinity.


           CPU0       CPU1       CPU2       CPU3       
  0:        119         59         69         70   IO-APIC-edge      timer
  1:          1          2          1          0   IO-APIC-edge      i8042
  6:          0          1          0          1   IO-APIC-edge      floppy
  8:        185        178        175        180   IO-APIC-edge      rtc0
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          0          1          2          3   IO-APIC-edge      i8042
 16:     761720     767583     765772     762262   IO-APIC-fasteoi   uhci_hcd:usb3, EMU10K1
 17:          2          1          0          0   IO-APIC-fasteoi   ohci1394
 18:          0          0          0          2   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb8
 19:     192022     191598     191809     191886   IO-APIC-fasteoi   uhci_hcd:usb5, uhci_hcd:usb7
 21:          0          1          1          3   IO-APIC-fasteoi   uhci_hcd:usb4, eth0
 23:      19600      19263      19489      19502   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb6
 25:     419910     412980     411109     416834   PCI-MSI-edge      i915
 26:     233236     233744     233647     233567   PCI-MSI-edge      ahci
 27:     709493     708677     709630     708963   PCI-MSI-edge      eth1
NMI:          0          0          0          0   Non-maskable interrupts
LOC:   10375694    9592098    6283658    6319369   Local timer interrupts
SPU:          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0   Performance monitoring interrupts
PND:          0          0          0          0   Performance pending work
RES:      50103      49240      47545      45606   Rescheduling interrupts
CAL:      74174        408      71586        453   Function call interrupts
TLB:      49410      53567      50409      52426   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0   Machine check exceptions
MCP:        271        271        271        271   Machine check polls
ERR:          0
MIS:          0
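
The even spread in the table above can be put into numbers. What follows is a minimal sketch in Python, not something from the original report: the function names are hypothetical, and the eth1 row (IRQ 27) is used purely as an example row copied from the table. It parses /proc/interrupts-style text, computes each CPU's share of one IRQ, and derives the hex bitmask that could be written to /proc/irq/<n>/smp_affinity to pin that IRQ to a single core (assuming a system with no more than 32 CPUs, where the mask is a single hex word).

```python
def parse_interrupts(text):
    """Return {irq_label: (per_cpu_counts, description)} from /proc/interrupts text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpus = len(lines[0].split())              # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        # Summary rows such as ERR/MIS carry a single counter; skip them.
        if len(fields) < ncpus or not all(f.isdigit() for f in fields[:ncpus]):
            continue
        counts = [int(f) for f in fields[:ncpus]]
        table[label.strip()] = (counts, " ".join(fields[ncpus:]))
    return table


def cpu_shares(counts):
    """Fraction of an IRQ's interrupts handled by each CPU."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def affinity_mask(cpu):
    """Hex bitmask selecting one CPU, in the form smp_affinity accepts."""
    return format(1 << cpu, "x")


# Example rows copied from the table above.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 21:          0          1          1          3   IO-APIC-fasteoi   uhci_hcd:usb4, eth0
 27:     709493     708677     709630     708963   PCI-MSI-edge      eth1
ERR:          0
"""

table = parse_interrupts(SAMPLE)
counts, desc = table["27"]
shares = cpu_shares(counts)
print(desc, [round(s, 3) for s in shares])
print("mask to pin IRQ 27 to CPU2:", affinity_mask(2))
```

Shares close to 1/N on every CPU mean the interrupt is migrating across all cores; after pinning, one share should approach 1.0 as new interrupts arrive.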

There is nothing in the ethtool -S statistics that I see that indicates
anything is wrong, and you've gotten no tx timeouts as far as I can tell.
Have you had any system panics (possibly seemingly unrelated to the network)?


Comment 10 Caleb Cushing 2009-12-07 22:21:07 UTC
On Mon, Dec 7, 2009 at 4:53 PM, Brandeburg, Jesse
<jesse.brandeburg@intel.com> wrote:
> thanks akpm, I've been watching this thread but now I will try to jump in.
>
> Caleb, can you please summarize where we are today, you've done a lot of
> testing and the thread has gone on a while.
>
> Kernels known to fail (after any length):
2.6.29.6 through 2.6.32 is the range I've tested. 2.6.29 only seemed to
have 10% packet loss with mtr, as opposed to the 30-50% on later kernels;
still, that's abnormal and shouldn't be happening. I haven't tested farther
back yet.

> Kernels known to work:

Flawlessly, none at this point. I've been able to replicate it on every
version tested. Given that it doesn't happen on every reboot and I
rarely reboot, this is difficult to test. Other than the fact
that I've been unable to find a good kernel, nothing suggests hardware
failure. Given that some of the other e1000e bugs go back farther than
I've tested...

> Have you been able to try the latest e1000e from 2.6.32?  it has some
> fixes in it, although none right off the top of my head that will fix your
> issue.

Yes, and it is reproducible. Whether it occurs as often, I'm not sure.

> I have a couple of related questions, why don't you have irqbalance
> enabled?  Network interrupts should not be migrating across all cpus
> evenly, at the very least your system should be reconfigured to lock the
> interrupts to a particular core with smp_affinity.

Is that new with 2.6.32? If not, I don't know... I'm using Arch Linux's
config as a base; if it's something they should have enabled, I can
relay the message.

> There is nothing in the ethtool -S statistics that I see that indicates
> anything is wrong, you've gotten no tx timeouts as far as I can tell, have
> you had any system panics (possibly seeming unrelated to network?)

No. My system seems perfectly stable (outside of some end-user
software bugs, and even then only Kopete seems to crash these days,
due to me using an experimental protocol). I'm unable to explain why
the tests aren't showing anything wrong...

Hmm... a thought: possibly iptables is dropping them as INVALID? I'm
still thinking that testing on just this system, with one NIC hooked
into the other, might be a good idea, as the firewall configuration in
OpenWrt is not straightforward to me. This would also remove any QoS
rules that the router is applying, and random packets floating around
(that Windows boxen are sending).
-- 
Caleb Cushing

http://xenoterracide.blogspot.com
Comment 11 Jarek Poplawski 2009-12-07 22:59:44 UTC
Brandeburg, Jesse wrote, On 12/07/2009 10:53 PM:

> There is nothing in the ethtool -S statistics that I see that indicates 
> anything is wrong, you've gotten no tx timeouts as far as I can tell, have 
> you had any system panics (possibly seeming unrelated to network?)


There are unanswered ICMP echoes and a lot of TCP retransmits in the
first one (netstat_after.slave4.log.gz):

Ip:
    812 total packets received
    1 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    802 incoming packets delivered
    1048 requests sent out
Icmp:
    488 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 2
        timeout in transit: 289
        echo replies: 197
    677 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 2
IcmpMsg:
        InType0: 197
        InType3: 2
        InType11: 289
        OutType3: 2
        OutType69: 675
Tcp:
    17 active connections openings
    0 passive connection openings
    14 failed connection attempts
    0 connection resets received
    0 connections established
    45 segments received
    49 segments send out
    19 segments retransmited
    0 bad segments received.
    19 resets sent
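
The counters above can be turned into rough loss figures. This is only a sketch in Python under stated assumptions, not part of the original analysis: it assumes every ICMP message sent other than the two destination-unreachable ones was an echo probe (as with an mtr run), and that each answered probe produced exactly one echo reply or one "timeout in transit" message. The variable names are made up for illustration.

```python
# Counters copied from the netstat -s output above.
icmp_sent = 677
icmp_out_dest_unreachable = 2
icmp_in_echo_replies = 197
icmp_in_time_exceeded = 289          # "timeout in transit"
tcp_segments_sent = 49
tcp_segments_retransmitted = 19


def ratio(part, whole):
    """part / whole, or 0.0 when there is nothing to divide by."""
    return part / whole if whole else 0.0


# Assumption: all remaining sent ICMP messages are echo probes.
probes = icmp_sent - icmp_out_dest_unreachable
# Assumption: one echo reply or one time-exceeded message per answered probe.
answered = icmp_in_echo_replies + icmp_in_time_exceeded
unanswered = probes - answered

icmp_loss = ratio(unanswered, probes)
tcp_retrans = ratio(tcp_segments_retransmitted, tcp_segments_sent)

print(f"probes={probes} answered={answered} unanswered={unanswered}")
print(f"unanswered ICMP probes: {icmp_loss:.1%}")
print(f"TCP retransmits relative to segments sent out: {tcp_retrans:.1%}")
```

Under these assumptions the unanswered share of probes lands close to the low end of the 30-50% loss reported with mtr.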

I analyzed tcpdumps from the router, and ICMP echo requests really were
missing on input. At the same time, there is nothing wrong in the stats
(qdisc, ifconfig, ethtool) of the sending box with e1000e, or in the
router's ifconfig (the only thing I didn't see is the router's netstat).

Anyway, I doubt it's accidental, or that the router is to blame, if another
NIC, and this e1000e with some boots/kernels(?), can work flawlessly.

Btw, after finding the similarly mysterious story below (with the
same NIC), I wonder if the router model can matter here too, but
maybe I'm wrong.

http://bugzilla.kernel.org/show_bug.cgi?id=11998


Jarek P.
  

Comment 12 Alan 2012-06-18 13:02:29 UTC
Closing as obsolete. If this is still seen with a modern kernel (3.2+), please update the bug.
Comment 13 Caleb Cushing 2012-06-18 14:15:41 UTC
About 2 months ago, on what would have been a 3.3 or 3.2 kernel (Arch is on 3.4 now), I had this problem when I tried connecting my computer directly to the modem. The problem doesn't occur when connecting to the new router, I suspect because it only happens when I'm connecting at less than gigabit speed (e.g. modem to PC is probably only 10/100, but the router is 10/100/1000).
Comment 14 Alan 2012-06-18 14:26:37 UTC
Sounds like a problem with the devices falling out with one another when negotiating. Something we do see now and then, particularly with cheap hardware at one end.

Bug updated so we know it's current.
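
To check the "less than gigabit" suspicion on an affected machine, the negotiated link speed and duplex can be read from sysfs. The following is a minimal sketch in Python, not something posted in this bug: the interface name eth0 and the helper names are assumptions, and it relies only on the standard /sys/class/net/<iface>/speed and duplex attributes.

```python
from pathlib import Path


def read_link(iface, sysfs_root="/sys/class/net"):
    """Return the negotiated speed (Mb/s) and duplex of iface, or None values."""
    base = Path(sysfs_root) / iface

    def read(name):
        try:
            return (base / name).read_text().strip()
        except OSError:
            return None                    # attribute missing or link down

    raw_speed = read("speed")
    try:
        speed_mbps = int(raw_speed) if raw_speed is not None else None
    except ValueError:
        speed_mbps = None
    if speed_mbps is not None and speed_mbps < 0:
        speed_mbps = None                  # the kernel reports -1 for "unknown"
    return {"speed_mbps": speed_mbps, "duplex": read("duplex")}


def below_gigabit(link):
    """True when the link negotiated a known speed under 1000 Mb/s."""
    return link["speed_mbps"] is not None and link["speed_mbps"] < 1000


# "eth0" is an assumed interface name; substitute the e1000e interface.
link = read_link("eth0")
print(link, "below gigabit:", below_gigabit(link))
```

Comparing this output between a boot where the loss shows up and one where it does not would show whether the failing case coincides with a 10/100 link or a half-duplex result.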