Possibly related to bug #13835: same symptoms, and I uploaded a lot of info for this bug there. Also read this thread: http://marc.info/?t=125699907100001&r=1&w=2 So far I have been able to find similar symptoms all the way back to 2.6.29.6; I have not yet tested further back. The problem is intermittent. It does not appear to affect another NIC on the system (however, testing of that hasn't been extensive, and it uses a different driver). If I do not reboot the computer and the bug hasn't manifested, it will not manifest (unless, perhaps, I reload modules or restart interfaces; not tested).
Created attachment 24034 [details] 2.6.31 dmesg
Created attachment 24036 [details] netstat -s
Created attachment 24037 [details] lspci -vv 2.6.32 (while bug not occurring)
Created attachment 24038 [details] /proc/interrupts (while bug not occurring)
Created attachment 24039 [details] 2.6.32 config
The upstream router is a Linksys WRT54GL (v1.1, I think) running OpenWrt 8.09.1. Several Windows systems work fine on it, and another Linux system also ran fine on its wireless.
Created attachment 24040 [details] netstat -s (while bug not occurring)
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

On Sat, 5 Dec 2009 07:02:49 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14737
>
>            Summary: e1000e driver experiences large packet losses
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 2.6.32--
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Network
>         AssignedTo: drivers_network@kernel-bugs.osdl.org
>         ReportedBy: xenoterracide@gmail.com
>         Regression: No
>
>
> possibly related to this bug #13835 same symptoms and I upped a lot of info for
> this bug on there.
>
> Also Read this thread http://marc.info/?t=125699907100001&r=1&w=2
>
> I have thus far been able to find similar symptoms all the way back to 2.6.29.6
> I've not yet done testing farther. problem is intermittent. It does not appear
> to affect another nic on the system (however testing of that hasn't been
> extensive,and a different driver). if I do not reboot the computer and the bug
> hasn't manifested, it will not manifest (perhaps unless I reload modules or
> restart interfaces (not tested).
>
thanks akpm, I've been watching this thread but now I will try to jump in.

Caleb, can you please summarize where we are today, you've done a lot of testing and the thread has gone on a while.

Kernels known to fail (after any length):
Kernels known to work:

Have you been able to try the latest e1000e from 2.6.32? it has some fixes in it, although none right off the top of my head that will fix your issue.

I have a couple of related questions, why don't you have irqbalance enabled? Network interrupts should not be migrating across all cpus evenly, at the very least your system should be reconfigured to lock the interrupts to a particular core with smp_affinity.

           CPU0       CPU1       CPU2       CPU3
  0:        119         59         69         70   IO-APIC-edge      timer
  1:          1          2          1          0   IO-APIC-edge      i8042
  6:          0          1          0          1   IO-APIC-edge      floppy
  8:        185        178        175        180   IO-APIC-edge      rtc0
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          0          1          2          3   IO-APIC-edge      i8042
 16:     761720     767583     765772     762262   IO-APIC-fasteoi   uhci_hcd:usb3, EMU10K1
 17:          2          1          0          0   IO-APIC-fasteoi   ohci1394
 18:          0          0          0          2   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb8
 19:     192022     191598     191809     191886   IO-APIC-fasteoi   uhci_hcd:usb5, uhci_hcd:usb7
 21:          0          1          1          3   IO-APIC-fasteoi   uhci_hcd:usb4, eth0
 23:      19600      19263      19489      19502   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb6
 25:     419910     412980     411109     416834   PCI-MSI-edge      i915
 26:     233236     233744     233647     233567   PCI-MSI-edge      ahci
 27:     709493     708677     709630     708963   PCI-MSI-edge      eth1
NMI:          0          0          0          0   Non-maskable interrupts
LOC:   10375694    9592098    6283658    6319369   Local timer interrupts
SPU:          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0   Performance monitoring interrupts
PND:          0          0          0          0   Performance pending work
RES:      50103      49240      47545      45606   Rescheduling interrupts
CAL:      74174        408      71586        453   Function call interrupts
TLB:      49410      53567      50409      52426   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0   Machine check exceptions
MCP:        271        271        271        271   Machine check polls
ERR:          0
MIS:          0

There is nothing in the ethtool -S statistics that I see that indicates anything is wrong, you've gotten no tx timeouts as far as I can tell, have you had any system panics (possibly seeming unrelated to network?)

On Mon, 7 Dec 2009, Andrew Morton wrote:

> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Sat, 5 Dec 2009 07:02:49 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
>
> > http://bugzilla.kernel.org/show_bug.cgi?id=14737
> >
> >            Summary: e1000e driver experiences large packet losses
> >            Product: Drivers
> >            Version: 2.5
> >     Kernel Version: 2.6.32--
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Network
> >         AssignedTo: drivers_network@kernel-bugs.osdl.org
> >         ReportedBy: xenoterracide@gmail.com
> >         Regression: No
> >
> >
> > possibly related to this bug #13835 same symptoms and I upped a lot of info for
> > this bug on there.
> >
> > Also Read this thread http://marc.info/?t=125699907100001&r=1&w=2
> >
> > I have thus far been able to find similar symptoms all the way back to 2.6.29.6
> > I've not yet done testing farther. problem is intermittent. It does not appear
> > to affect another nic on the system (however testing of that hasn't been
> > extensive,and a different driver). if I do not reboot the computer and the bug
> > hasn't manifested, it will not manifest (perhaps unless I reload modules or
> > restart interfaces (not tested).
> >
>
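For reference, a minimal sketch of the smp_affinity suggestion above, assuming the e1000e port is eth1 on IRQ 27 as the /proc/interrupts dump suggests. The helper names `cpu_mask` and `pin_irq` are hypothetical; the only kernel interface used is the `/proc/irq/<n>/smp_affinity` file, which takes a hexadecimal CPU bitmask and needs root to write. A running irqbalance daemon may later rewrite the value, so this is an illustration rather than a recommended setup.

```python
from pathlib import Path


def cpu_mask(cpus):
    """Build the hex bitmask string for a set of CPU numbers.

    Bit N set means CPU N may service the interrupt. This simple form
    is enough for the 4-CPU box in this thread; machines with more
    than 32 CPUs use comma-separated 32-bit groups, which this sketch
    does not produce.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def pin_irq(irq, cpus, proc_root="/proc"):
    """Write the affinity mask for one IRQ and return the file path.

    proc_root is a parameter only so the function can be exercised
    against a scratch directory; on a real system it stays "/proc".
    """
    path = Path(proc_root) / "irq" / str(irq) / "smp_affinity"
    path.write_text(cpu_mask(cpus) + "\n")
    return path


# Example (as root), assuming IRQ 27 is the e1000e interface:
#     pin_irq(27, [1])   # writes "2", i.e. CPU1 only
```

With four CPUs the mask for "CPU1 only" is 2 and for "CPU0 and CPU3" is 9; after pinning, only the chosen column for IRQ 27 in /proc/interrupts should keep growing.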
On Mon, Dec 7, 2009 at 4:53 PM, Brandeburg, Jesse <jesse.brandeburg@intel.com> wrote:
> thanks akpm, I've been watching this thread but now I will try to jump in.
>
> Caleb, can you please summarize where we are today, you've done a lot of
> testing and the thread has gone on a while.
>
> Kernels known to fail (after any length):

2.6.29.6 through 2.6.32 is the range I've tested. 2.6.29 only seemed to have 10% packet loss with mtr, as opposed to the higher 30-50% on later kernels; still, that's abnormal and shouldn't be happening. I haven't tested farther back yet.

> Kernels known to work:

Flawlessly, none at this point. I've been able to replicate the problem on every version tested. Given that it doesn't happen on every reboot and I rarely reboot, this is difficult to test. Other than the fact that I've been unable to find a good kernel, nothing suggests hardware failure. Given that some of the other e1000e bugs go back farther than I've tested...

> Have you been able to try the latest e1000e from 2.6.32? it has some
> fixes in it, although none right off the top of my head that will fix your
> issue.

Yes. Reproducible. Whether it occurs as often, I'm not sure.

> I have a couple of related questions, why don't you have irqbalance
> enabled? Network interrupts should not be migrating across all cpus
> evenly, at the very least your system should be reconfigured to lock the
> interrupts to a particular core with smp_affinity.

Is that new with .32? If not, I don't know... I'm using Arch Linux's config as a base; if it's something they should have enabled, I can relay the message.

> There is nothing in the ethtool -S statistics that I see that indicates
> anything is wrong, you've gotten no tx timeouts as far as I can tell, have
> you had any system panics (possibly seeming unrelated to network?)

No. My system seems perfectly stable (outside of some end-user software bugs, and even then only kopete seems to crash these days, due to me using an experimental protocol).

I can't explain why the tests aren't showing anything wrong... hmm... a thought: possibly iptables is dropping the packets as INVALID? I still think that testing on just this system, with one NIC hooked into the other, might be a good idea. The firewall configuration in OpenWrt is not straightforward to me, and this would also remove any QoS rules that the router is applying, as well as random packets floating around (that the Windows boxes are sending).

--
Caleb Cushing

http://xenoterracide.blogspot.com
Brandeburg, Jesse wrote, On 12/07/2009 10:53 PM:

> There is nothing in the ethtool -S statistics that I see that indicates
> anything is wrong, you've gotten no tx timeouts as far as I can tell, have
> you had any system panics (possibly seeming unrelated to network?)

There are unreplied icmp echos and a lot of tcp retransmits in the first one (netstat_after.slave4.log.gz):

Ip:
    812 total packets received
    1 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    802 incoming packets delivered
    1048 requests sent out
Icmp:
    488 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 2
        timeout in transit: 289
        echo replies: 197
    677 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 2
IcmpMsg:
        InType0: 197
        InType3: 2
        InType11: 289
        OutType3: 2
        OutType69: 675
Tcp:
    17 active connections openings
    0 passive connection openings
    14 failed connection attempts
    0 connection resets received
    0 connections established
    45 segments received
    49 segments send out
    19 segments retransmited
    0 bad segments received.
    19 resets sent

I analyzed tcpdumps from the router and there were really skipped icmp echo requests on input. At the same time nothing wrong in the stats (qdisc, ifconfig, ethtool) of this sending box with e1000e, and the router's ifconfig (I only didn't see router's netstat). Anyway, I doubt it's accidental or router to blame if another NIC, and this e1000e with some boots/kernels(?) can work flawlessly.

Btw, after finding this similarly mysterious story below (with the same NIC) I wonder if the router model can matter here too, but maybe I'm wrong.

http://bugzilla.kernel.org/show_bug.cgi?id=11998

Jarek P.

> On Mon, 7 Dec 2009, Andrew Morton wrote:
>> (switched to email. Please respond via emailed reply-to-all, not via the
>> bugzilla web interface).
>>
>> On Sat, 5 Dec 2009 07:02:49 GMT
>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>
>>> http://bugzilla.kernel.org/show_bug.cgi?id=14737
>>>
>>>            Summary: e1000e driver experiences large packet losses
>>>            Product: Drivers
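As a rough cross-check of the counters in the netstat dump above, the arithmetic can be written out as below. This is a sketch under stated assumptions: the 675 ICMP messages sent that are not "destination unreachable" (listed as OutType69 in the dump) are taken to be mtr probes, and each answered probe is taken to produce exactly one echo reply or one "timeout in transit" message. The function name `loss_fraction` is made up for the illustration.

```python
def loss_fraction(sent, answered):
    """Fraction of sent probes that never got any answer."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return (sent - answered) / sent


# Counters copied from the netstat -s output in this message.
icmp_sent = 677            # "ICMP messages sent"
dest_unreach_sent = 2      # not probes, so excluded
echo_replies = 197         # answers from the final hop
time_exceeded = 289        # "timeout in transit", answers from intermediate hops

# Assumption: everything sent except the unreachables is an mtr probe.
probes = icmp_sent - dest_unreach_sent          # 675
answered = echo_replies + time_exceeded         # 486

probe_loss = loss_fraction(probes, answered)

# TCP: share of sent segments that were retransmissions.
tcp_segments_out = 49
tcp_retransmitted = 19
tcp_retrans_share = tcp_retransmitted / tcp_segments_out

print(f"unanswered ICMP probes: {probe_loss:.1%}")        # 28.0%
print(f"TCP retransmit share:   {tcp_retrans_share:.1%}")  # 38.8%
```

Under those assumptions roughly 28% of probes went unanswered, which is in the same range as the 30-50% loss that mtr reported on the later kernels.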
Closing as obsolete. If this is still seen with a modern kernel (3.2+), please update the bug.
About 2 months ago, which would have been a 3.2 or 3.3 kernel (Arch is on 3.4 now), I had this problem when I tried connecting my computer directly to the modem. The problem doesn't occur when connecting to the new router, I suspect because it only happens when I'm connecting at less than gigabit speed (e.g. modem to PC is probably only 10/100, but the router is 10/100/1000).
Sounds like a problem with the devices falling out with one another when negotiating. It's something we do see now and then, particularly with cheap hardware on one end. Bug updated so we know it's current.
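One way to confirm the "less than gigabit" theory the next time the loss shows up is to record what the link actually negotiated. The sketch below reads the `speed` and `duplex` attributes that Linux exposes under `/sys/class/net/<iface>/`; the interface name eth1 and the helper names are assumptions for illustration, and reading `speed` can fail or return a non-positive value when the link is down, which is treated here as "unknown".

```python
from pathlib import Path


def read_attr(iface, name, sys_root):
    """Return one sysfs attribute as a stripped string, or None if unreadable."""
    try:
        return (Path(sys_root) / iface / name).read_text().strip()
    except OSError:
        return None


def link_state(iface, sys_root="/sys/class/net"):
    """Report negotiated speed (Mb/s) and duplex for an interface.

    sys_root is a parameter only so the function can be pointed at a
    scratch directory; on a real system it stays at the default.
    """
    raw = read_attr(iface, "speed", sys_root)
    try:
        speed = int(raw) if raw is not None else None
    except ValueError:
        speed = None
    if speed is not None and speed <= 0:
        speed = None  # link down or speed not reported
    return {"speed_mbps": speed, "duplex": read_attr(iface, "duplex", sys_root)}


def below_gigabit(state):
    """True only when a speed was reported and it is under 1000 Mb/s."""
    return state["speed_mbps"] is not None and state["speed_mbps"] < 1000


# Example, assuming the e1000e port is eth1:
#     state = link_state("eth1")
#     print(state, "below gigabit:", below_gigabit(state))
```

Logging this alongside an mtr run on both a good and a bad boot would show whether the loss really tracks a 10/100 negotiation.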