Bug 11998 - [PATCH]e1000e driver does not initialize correctly
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assignee: Jesse Brandeburg
Reported: 2008-11-09 14:40 UTC by Bastien Durel
Modified: 2012-11-12 20:58 UTC
CC List: 14 users

Kernel Version: 3.3.3
Regression: No


Attachments
ethregs dumps on Vostro 200 bisect good/bad (28.48 KB, application/x-gtar)
2009-02-20 15:00 UTC, Joao S Veiga
router's good tcpdump (24.28 KB, application/octet-stream)
2009-02-28 03:12 UTC, Bastien Durel
Removes e1000e ICH8 MDIX detect code (821 bytes, patch)
2009-03-02 13:35 UTC, dave graham
Switched switch and revertICH8MDIX tests (130.00 KB, application/gzip)
2009-03-03 16:35 UTC, Joao S Veiga
e1000e patch dumps phy registers to msg log on ethtool -d (649 bytes, patch)
2009-03-26 23:48 UTC, dave graham
small script to invoke mulitple phy register dumps (114 bytes, application/octet-stream)
2009-03-26 23:50 UTC, dave graham
phy registers dump tests (24.91 KB, application/octet-stream)
2009-03-27 23:34 UTC, Joao S Veiga
proposed parameter to control mdi-x (2.85 KB, patch)
2010-01-27 23:43 UTC, Jesse Brandeburg

Description Bastien Durel 2008-11-09 14:40:57 UTC
Latest working kernel version: 2.6.25
Earliest failing kernel version: 2.6.28-rc3
Distribution: Debian
Hardware Environment: Dell Vostro 200
Software Environment: Debian Lenny
Problem Description: I initially reported this bug in the Debian BTS; see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=497392

The e1000e kernel module does not initialize correctly, so the network cannot come up.
It does not complain, but packets do not leave the interface correctly. The port's LED on the switch blinks quickly all the time, and IPv6 RAs are received by the kernel, but no DHCP request packet reaches the router, nor does any ICMP packet after manual configuration.
Removing and reloading the module (rmmod / modprobe) *sometimes* makes it work correctly: the network interface is then brought up, DHCP responds, and communication can be established.

The problem first appeared in the Debian 2.6.26-3 kernel, but I reproduced it with stock 2.6.28-rc3, as requested by Bastian Blank.

Steps to reproduce:

Boot the computer.
When the module is inserted, nothing complains and the RAs are received,
but IPv4 cannot be used on the interface.
DHCP does not respond, and manual configuration does not allow packets to reach another computer.

tcpdump on the router does not show any packet coming from the computer
during DHCP or after manual configuration.

After a few (the exact number varies) rmmod/modprobe cycles, IPv4 traffic can go
through the interface.
Comment 1 Troels Liebe Bentsen 2009-01-23 04:46:08 UTC
I have the same problem, also on a Vostro 200. Replugging the cable sometimes solves the problem:

Settings for eth0:
	Supported ports: [ TP ]
	Supported link modes:   10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	Advertised auto-negotiation: Yes
	Speed: 100Mb/s
	Duplex: Full
	Port: Twisted Pair
	PHYAD: 1
	Transceiver: internal
	Auto-negotiation: on
	Supports Wake-on: pumbag
	Wake-on: g
	Current message level: 0x00000001 (1)
	Link detected: yes

root@ubuntu:/home/tlb# ethtool -e eth0 length 256
Offset		Values
------		------
0x0000		00 21 70 08 dd 4f 00 08 ff ff 12 10 ff ff ff ff 
0x0010		ff ff ff ff c3 10 38 02 28 10 c0 10 86 80 00 00 
0x0020		02 04 00 00 00 00 85 86 20 00 00 00 00 00 07 00 
0x0030		84 06 40 2b 43 00 04 00 ad ba ad ba be 10 bf 10 
0x0040		ad ba 4c 29 bd 10 ad ba 00 00 00 00 00 00 00 00 
0x0050		00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0060		00 01 00 40 32 12 07 40 ff ff ff ff ff ff ff ff 
0x0070		ff ff ff ff ff ff ff ff ff ff ff ff ff ff d6 de 
0x0080		00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0090		00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x00a0		ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00b0		ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00c0		ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00d0		ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00e0		ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00f0		ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 


root@ubuntu:/home/tlb# cat /proc/interrupts
           CPU0       CPU1       
  0:         30          1   IO-APIC-edge      timer
  1:          1          1   IO-APIC-edge      i8042
  8:         26         27   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          2          2   IO-APIC-edge      i8042
 16:          0          0   IO-APIC-fasteoi   uhci_hcd:usb1
 18:          1          2   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb8
 19:       9659       9705   IO-APIC-fasteoi   uhci_hcd:usb5, uhci_hcd:usb7, ata_piix, ata_piix
 21:        478        488   IO-APIC-fasteoi   uhci_hcd:usb3
 22:         88         89   IO-APIC-fasteoi   HDA Intel
 23:          0          0   IO-APIC-fasteoi   ehci_hcd:usb4, uhci_hcd:usb6
 25:       1327       1272   PCI-MSI-edge      eth0
NMI:          0          0   Non-maskable interrupts
LOC:      11532       7848   Local timer interrupts
RES:       1144        971   Rescheduling interrupts
CAL:         83         51   Function call interrupts
TLB:        927        820   TLB shootdowns
SPU:          0          0   Spurious interrupts
ERR:          0
MIS:          0

cat /var/log/dmesg
x0]
pci_bus 0000:03: resource 0 io:  [0x3000-0x4fff]
pci_bus 0000:03: resource 1 mem: [0xec000000-0xedffffff]
pci_bus 0000:03: resource 2 mem: [0xe4000000-0xe40fffff]
pci_bus 0000:03: resource 3 mem: [0x0-0x0]
pci_bus 0000:04: resource 0 io:  [0x5000-0x6fff]
pci_bus 0000:04: resource 1 mem: [0xe8000000-0xe9ffffff]
pci_bus 0000:04: resource 2 mem: [0xe4100000-0xe41fffff]
pci_bus 0000:04: resource 3 mem: [0x0-0x0]
pci_bus 0000:0c: resource 0 io:  [0x7000-0x8fff]
pci_bus 0000:0c: resource 1 mem: [0xea000000-0xebffffff]
pci_bus 0000:0c: resource 2 mem: [0xe4200000-0xe42fffff]
pci_bus 0000:0c: resource 3 mem: [0x0-0x0]
pci_bus 0000:15: resource 0 io:  [0x9000-0xcfff]
pci_bus 0000:15: resource 1 mem: [0xe4300000-0xe7ffffff]
pci_bus 0000:15: resource 2 mem: [0xe0000000-0xe3ffffff]
pci_bus 0000:15: resource 3 io:  [0x00-0xffff]
pci_bus 0000:15: resource 4 mem: [0x000000-0xffffffff]
pci_bus 0000:16: resource 0 io:  [0x9000-0x90ff]
pci_bus 0000:16: resource 1 io:  [0x9400-0x94ff]
pci_bus 0000:16: resource 2 mem: [0xe0000000-0xe3ffffff]
pci_bus 0000:16: resource 3 mem: [0x88000000-0x8bffffff]
NET: Registered protocol family 2
IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
TCP bind hash table entries: 65536 (order: 7, 524288 bytes)
TCP: Hash tables configured (established 131072 bind 65536)
TCP reno registered
NET: Registered protocol family 1
checking if image is initramfs... it is
Freeing initrd memory: 4732k freed
Simple Boot Flag at 0x35 set to 0x1
Machine check exception polling timer started.
Scanning for low memory corruption every 60 seconds
audit: initializing netlink socket (disabled)
type=2000 audit(1232711864.715:1): initialized
highmem bounce pool size: 64 pages
HugeTLB registered 4 MB page size, pre-allocated 0 pages
VFS: Disk quotas dquot_6.5.2
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
Created ptmx node in devpts ino 2
msgmni has been set to 1712
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 254)
io scheduler noop registered
io scheduler cfq registered (default)
pci 0000:00:02.0: Boot video device
Linux agpgart interface v0.103
agpgart-intel 0000:00:00.0: Intel 945GM Chipset
agpgart-intel 0000:00:00.0: detected 7932K stolen memory
agpgart-intel 0000:00:00.0: AGP aperture is 256M @ 0xd0000000
brd: module loaded
PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
cpuidle: using governor ladder
cpuidle: using governor menu
Using IPI No-Shortcut mode
registered taskstats version 1
Freeing unused kernel memory: 292k freed
fuse init (API version 7.11)
ACPI: SSDT 7F6F1D26, 0240 (r1  PmRef  Cpu0Ist      100 INTL 20050513)
ACPI: SSDT 7F6F1FEB, 065A (r1  PmRef  Cpu0Cst      100 INTL 20050513)
ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3])
processor ACPI_CPU:00: registered as cooling_device0
ACPI: Processor [CPU0] (supports 8 throttling states)
ACPI: SSDT 7F6F1C5E, 00C8 (r1  PmRef  Cpu1Ist      100 INTL 20050513)
ACPI: SSDT 7F6F1F66, 0085 (r1  PmRef  Cpu1Cst      100 INTL 20050513)
ACPI: CPU1 (power states: C1[C1] C2[C2] C3[C3])
processor ACPI_CPU:01: registered as cooling_device1
ACPI: Processor [CPU1] (supports 8 throttling states)
thermal LNXTHERM:01: registered as thermal_zone0
ACPI: Thermal Zone [THM0] (33 C)
thermal LNXTHERM:02: registered as thermal_zone1
ACPI: Thermal Zone [THM1] (36 C)
IBM TrackPoint firmware: 0x0e, buttons: 3/3
input: TPPS/2 IBM TrackPoint as /devices/platform/i8042/serio1/input/input0
e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k6
e1000e: Copyright (c) 1999-2008 Intel Corporation.
e1000e 0000:02:00.0: Disabling L1 ASPM
e1000e 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
e1000e 0000:02:00.0: setting latency timer to 64
0000:02:00.0: 0000:02:00.0: Failed to initialize MSI interrupts.  Falling back to legacy interrupts.
input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input1
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
uhci_hcd: USB Universal Host Controller Interface driver
uhci_hcd 0000:00:1d.0: power state changed by ACPI to D0
uhci_hcd 0000:00:1d.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
uhci_hcd 0000:00:1d.0: setting latency timer to 64
uhci_hcd 0000:00:1d.0: UHCI Host Controller
uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 1
uhci_hcd 0000:00:1d.0: irq 16, io base 0x00001820
usb usb1: New USB device found, idVendor=1d6b, idProduct=0001
usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb1: Product: UHCI Host Controller
usb usb1: Manufacturer: Linux 2.6.29-rc2-custom uhci_hcd
usb usb1: SerialNumber: 0000:00:1d.0
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 2 ports detected
uhci_hcd 0000:00:1d.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
uhci_hcd 0000:00:1d.1: setting latency timer to 64
uhci_hcd 0000:00:1d.1: UHCI Host Controller
uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 2
uhci_hcd 0000:00:1d.1: irq 17, io base 0x00001840
usb usb2: New USB device found, idVendor=1d6b, idProduct=0001
usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb2: Product: UHCI Host Controller
usb usb2: Manufacturer: Linux 2.6.29-rc2-custom uhci_hcd
usb usb2: SerialNumber: 0000:00:1d.1
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
Warning! ehci_hcd should always be loaded before uhci_hcd and ohci_hcd, not after
uhci_hcd 0000:00:1d.2: power state changed by ACPI to D0
uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
uhci_hcd 0000:00:1d.2: setting latency timer to 64
uhci_hcd 0000:00:1d.2: UHCI Host Controller
uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 3
uhci_hcd 0000:00:1d.2: irq 18, io base 0x00001860
usb usb3: New USB device found, idVendor=1d6b, idProduct=0001
usb usb3: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb3: Product: UHCI Host Controller
usb usb3: Manufacturer: Linux 2.6.29-rc2-custom uhci_hcd
usb usb3: SerialNumber: 0000:00:1d.2
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
ehci_hcd 0000:00:1d.7: power state changed by ACPI to D0
ehci_hcd 0000:00:1d.7: PCI INT D -> GSI 19 (level, low) -> IRQ 19
ehci_hcd 0000:00:1d.7: setting latency timer to 64
ehci_hcd 0000:00:1d.7: EHCI Host Controller
ehci_hcd 0000:00:1d.7: new USB bus registered, assigned bus number 4
ehci_hcd 0000:00:1d.7: debug port 1
ehci_hcd 0000:00:1d.7: cache line size of 32 is not supported
ehci_hcd 0000:00:1d.7: irq 19, io mem 0xee444000
ehci_hcd 0000:00:1d.7: USB 2.0 started, EHCI 1.00
usb usb4: New USB device found, idVendor=1d6b, idProduct=0002
usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb4: Product: EHCI Host Controller
usb usb4: Manufacturer: Linux 2.6.29-rc2-custom ehci_hcd
usb usb4: SerialNumber: 0000:00:1d.7
usb usb4: configuration #1 chosen from 1 choice
hub 4-0:1.0: USB hub found
hub 4-0:1.0: 8 ports detected
uhci_hcd 0000:00:1d.3: PCI INT D -> GSI 19 (level, low) -> IRQ 19
uhci_hcd 0000:00:1d.3: setting latency timer to 64
uhci_hcd 0000:00:1d.3: UHCI Host Controller
uhci_hcd 0000:00:1d.3: new USB bus registered, assigned bus number 5
uhci_hcd 0000:00:1d.3: irq 19, io base 0x00001880
usb usb5: New USB device found, idVendor=1d6b, idProduct=0001
usb usb5: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb5: Product: UHCI Host Controller
usb usb5: Manufacturer: Linux 2.6.29-rc2-custom uhci_hcd
usb usb5: SerialNumber: 0000:00:1d.3
usb usb5: configuration #1 chosen from 1 choice
hub 5-0:1.0: USB hub found
hub 5-0:1.0: 2 ports detected
e1000e 0000:02:00.0: Warning: detected ASPM enabled in EEPROM
SCSI subsystem initialized
libata version 3.00 loaded.
ahci 0000:00:1f.2: version 3.0
ahci 0000:00:1f.2: PCI INT B -> GSI 16 (level, low) -> IRQ 16
ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 1.5 Gbps 0x1 impl SATA mode
ahci 0000:00:1f.2: flags: 64bit ncq pm led clo pio slum part 
ahci 0000:00:1f.2: setting latency timer to 64
scsi0 : ahci
scsi1 : ahci
scsi2 : ahci
scsi3 : ahci
ata1: SATA max UDMA/133 abar m1024@0xee444400 port 0xee444500 irq 16
ata2: DUMMY
ata3: DUMMY
ata4: DUMMY
0000:02:00.0: eth0: (PCI Express:2.5GB/s:Width x1) 00:16:d3:3b:a6:18
0000:02:00.0: eth0: Intel(R) PRO/1000 Network Connection
0000:02:00.0: eth0: MAC: 2, PHY: 2, PBA No: 005302-003
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
usb 1-1: new full speed USB device using uhci_hcd and address 2
ata1.00: ATA-7: HTS721010G9SA00, MCZIC10V, max UDMA/100
ata1.00: 195371568 sectors, multi 16: LBA48 
ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
ata1.00: configured for UDMA/100
ata1.00: configured for UDMA/100
ata1: EH complete
scsi 0:0:0:0: Direct-Access     ATA      HTS721010G9SA00  MCZI PQ: 0 ANSI: 5
pata_acpi 0000:00:1f.1: PCI INT C -> GSI 16 (level, low) -> IRQ 16
pata_acpi 0000:00:1f.1: setting latency timer to 64
pata_acpi 0000:00:1f.1: PCI INT C disabled
scsi 0:0:0:0: Attached scsi generic sg0 type 0
ata_piix 0000:00:1f.1: version 2.12
ata_piix 0000:00:1f.1: PCI INT C -> GSI 16 (level, low) -> IRQ 16
ata_piix 0000:00:1f.1: setting latency timer to 64
scsi4 : ata_piix
scsi5 : ata_piix
ata5: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x1810 irq 14
ata6: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0x1818 irq 15
Driver 'sd' needs updating - please use bus_type methods
sd 0:0:0:0: [sda] 195371568 512-byte hardware sectors: (100 GB/93.1 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 195371568 512-byte hardware sectors: (100 GB/93.1 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda:<6>usb 1-1: New USB device found, idVendor=056a, idProduct=00b2
usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 1-1: Product: PTZ-930
usb 1-1: Manufacturer: Tablet
usb 1-1: configuration #1 chosen from 1 choice
usbcore: registered new interface driver hiddev
usbcore: registered new interface driver usbhid
usbhid: v2.6:USB HID core driver
ata6: port disabled. ignoring.
usb 3-1: new low speed USB device using uhci_hcd and address 2
 sda1 sda2
sd 0:0:0:0: [sda] Attached SCSI disk
usb 3-1: New USB device found, idVendor=413c, idProduct=2003
usb 3-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 3-1: Product: Dell USB Keyboard
usb 3-1: Manufacturer: Dell
usb 3-1: configuration #1 chosen from 1 choice
input: Dell Dell USB Keyboard as /devices/pci0000:00/0000:00:1d.2/usb3/3-1/3-1:1.0/input/input2
generic-usb 0003:413C:2003.0001: input,hidraw0: USB HID v1.10 Keyboard [Dell Dell USB Keyboard] on usb-0000:00:1d.2-1/input0
PM: Starting manual resume from disk
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
udevd version 124 started
cfg80211: Calling CRDA to update world regulatory domain
lib80211: common routines for IEEE802.11 drivers
lib80211_crypt: registered algorithm 'NULL'
intel_rng: FWH not detected
Non-volatile memory driver v1.3
input: Power Button (FF) as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input3
ACPI: Power Button (FF) [PWRF]
input: Lid Switch as /devices/LNXSYSTM:00/device:00/PNP0C0D:00/input/input4
ACPI: Lid Switch [LID]
input: Sleep Button (CM) as /devices/LNXSYSTM:00/device:00/PNP0C0E:00/input/input5
ACPI: Sleep Button (CM) [SLPB]
acpi device:03: registered as cooling_device2
input: Video Bus as /devices/LNXSYSTM:00/device:00/PNP0A08:00/device:02/input/input6
ACPI: Video Device [VID] (multi-head: yes  rom: no  post: no)
ACPI: AC Adapter [AC] (on-line)
ACPI: Battery Slot [BAT0] (battery present)
sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
sdhci-pci 0000:15:00.2: SDHCI controller found [1180:0822] (rev 18)
sdhci-pci 0000:15:00.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
Registered led device: mmc0
mmc0: SDHCI controller on PCI [0000:15:00.2] using PIO
yenta_cardbus 0000:15:00.0: CardBus bridge found [17aa:201c]
iwl3945: Intel(R) PRO/Wireless 3945ABG/BG Network Connection driver for Linux, 1.2.26ks
iwl3945: Copyright(c) 2003-2008 Intel Corporation
iwl3945 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
iwl3945 0000:03:00.0: setting latency timer to 64
thinkpad_acpi: ThinkPad ACPI Extras v0.22
thinkpad_acpi: http://ibm-acpi.sf.net/
thinkpad_acpi: ThinkPad BIOS 7BETD2WW (2.13 ), EC 7BHT40WW-1.13
thinkpad_acpi: Lenovo ThinkPad X60s, model 170237G
thinkpad_acpi: ACPI backlight control delay disabled
thinkpad_acpi: radio switch found; radios are enabled
thinkpad_acpi: This ThinkPad has standard ACPI backlight brightness control, supported by the ACPI video driver
thinkpad_acpi: Disabling thinkpad-acpi brightness events by default...
Registered led device: tpacpi::thinklight
Registered led device: tpacpi::power
Registered led device: tpacpi:orange:batt
Registered led device: tpacpi:green:batt
Registered led device: tpacpi::dock_active
Registered led device: tpacpi::bay_active
Registered led device: tpacpi::dock_batt
Registered led device: tpacpi::unknown_led
Registered led device: tpacpi::standby
thinkpad_acpi: Standard ACPI backlight interface available, not loading native one.
input: ThinkPad Extra Buttons as /devices/virtual/input/input7
iwl3945: Tunable channels: 13 802.11bg, 23 802.11a channels
iwl3945: Detected Intel Wireless WiFi Link 3945ABG
iwl3945 0000:03:00.0: PCI INT A disabled
wmaster0 (iwl3945): not using net_device_ops yet
phy0: Selected rate control algorithm 'iwl-3945-rs'
yenta_cardbus 0000:15:00.0: ISA IRQ mask 0x0cb8, PCI irq 16
yenta_cardbus 0000:15:00.0: Socket status: 30000006
yenta_cardbus 0000:15:00.0: pcmcia: parent PCI bridge I/O window: 0x9000 - 0xcfff
yenta_cardbus 0000:15:00.0: pcmcia: parent PCI bridge Memory window: 0xe4300000 - 0xe7ffffff
yenta_cardbus 0000:15:00.0: pcmcia: parent PCI bridge Memory window: 0xe0000000 - 0xe3ffffff
wlan0 (iwl3945): not using net_device_ops yet
HDA Intel 0000:00:1b.0: PCI INT B -> GSI 17 (level, low) -> IRQ 17
hda_intel: probe_mask set to 0x1 for device 17aa:2010
HDA Intel 0000:00:1b.0: setting latency timer to 64
loop: module loaded
rt2870sta: module is from the staging directory, the quality is unknown, you have been warned.
rtusb init --->
usbcore: registered new interface driver rt2870
swap_cgroup: uses 1940 bytes of vmalloc for pointer array space and 1986560 bytes to hold mem_cgroup pointers on swap
swap_cgroup can be disabled by noswapaccount boot option.
Adding 1984016k swap on /dev/sda2.  Priority:-1 extents:1 across:1984016k 
EXT3 FS on sda1, internal journal
ip_tables: (C) 2000-2006 Netfilter Core Team
Comment 2 Joao S Veiga 2009-01-23 18:53:32 UTC
Exactly the same here: Debian Lenny, Dell Vostro 200 too. It started happening after I upgraded the kernel from 2.6.24-1 (Debian) to 2.6.26 (Debian). I've been stuck with 2.6.24 since then (didn't try 2.6.25). I found Bastien's report at bugs.debian and followed the same suggestion (try vanilla kernel.org 2.6.28). Still the same.
dell:/home/jsveiga# lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller (rev 02)
00:01.0 PCI bridge: Intel Corporation 82G33/G31/P35/P31 Express PCI Express Root Port (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82G33/G31 Express Integrated Graphics Controller (rev 02)
00:19.0 Ethernet controller: Intel Corporation 82562V-2 10/100 Network Connection (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 02)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation 82801IR (ICH9R) LPC Interface Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 4 port SATA IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
00:1f.5 IDE interface: Intel Corporation 82801I (ICH9 Family) 2 port SATA IDE Controller (rev 02)
02:00.0 Communication controller: Conexant HSF 56k Data/Fax Modem
Comment 3 Joao S Veiga 2009-01-23 20:11:18 UTC
Now here's an interesting fact: I'm beta testing Windows 7, and got the same problem there (sometimes network does not work, disabling/enabling the connection brings it back - not a DHCP issue, my IP is fixed). Works fine with Windows Vista. Maybe whatever changed from 2.6.25 to 2.6.26 also changed from Vista to Seven!
Comment 4 Nico Poppelier 2009-02-09 05:10:44 UTC
It looks like I have the same problem: Intel 82562V-2 network card in a Dell Vostro 200. When I run kernel 2.6.24 (driver version 0.2.0) the network card works fine 100% of the time. When I run kernel 2.6.27 (driver version 0.3.3.3-k6) chances are 50/50 that the network card gives trouble.

Symptoms of the problem are: (1) even before there is a login prompt, there is considerable activity on the eth0 device; (2) getting an IP address through DHCP (normal mode of operation on my home LAN) is impossible; (3) manual configuration of eth0 is possible, but then the network is excruciatingly slow.

I can send debugging output to everyone who wants to have it.
Comment 5 Joao S Veiga 2009-02-09 17:47:02 UTC
Looks like the problem was introduced between vanilla 2.6.25.19 and 2.6.26.

You can copy ../drivers/net/e1000e/* from 2.6.25.19 to 2.6.26, cleanly compile the 2.6.26 kernel, and it seems to be OK (I just rebooted 10 times and got the network working every time).

With the "complete" 2.6.26 (same .config), it's 50/50 whether a reboot ends with the quickly flashing router LED and no network. I've seen some other reports on the web of trouble with this NIC on Dell computers, even under Vista. Does something Intel introduced in the newer drivers not like Dell?

Unfortunately the e1000e tree from 2.6.25.19 does not compile on 2.6.28.4 (which still has the problem). I'm not willing to try that with 2.6.27 because of the 2.6.27+e1000e bricking NICs history.

Developers, any test you'd like us to try?
Thanks,
Comment 6 Jesse Brandeburg 2009-02-10 11:08:07 UTC
There were quite a few patches between those two kernel versions. Is there a chance you can do a git-bisect limited to drivers/net/e1000e?
git-bisect start v2.6.26 v2.6.25 drivers/net/e1000e

We have not seen this behavior here.
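
For readers unfamiliar with bisection, the search requested above can be pictured as a binary search over an ordered list of commits. The Python sketch below is only an illustration under simplifying assumptions, not git's actual implementation (real history is a graph, not a list); the commit names and the `is_bad` predicate are hypothetical.

```python
def first_bad(commits, is_bad):
    """Return (first bad commit, number of kernels tested).

    Assumes commits are ordered oldest to newest, the oldest is known
    good, the newest is known bad, and every commit after the first
    bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1   # lo: known good, hi: known bad
    steps = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        steps += 1                 # one build-and-boot test per step
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], steps


# Hypothetical linear history of 64 commits; "c40" introduced the bug.
history = [f"c{i}" for i in range(64)]
culprit, steps = first_bad(history, lambda c: int(c[1:]) >= 40)
print(culprit, steps)
```

Each step halves the remaining range, which is why restricting the bisect to one directory keeps the number of kernels to build and boot small.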
Comment 7 Joao S Veiga 2009-02-12 14:54:10 UTC
Sorry, somehow I did not get the cc: when this was updated; only saw it now.

I'm new to git; I'm running "git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6" now so I can try the bisect process (I had been trying to do the job with diff and vi).
I will post the results as soon as I get them.

Thanks (I was worried that nobody was looking at this)
Comment 8 Joao S Veiga 2009-02-13 16:15:28 UTC
Hello Jesse, below is the result of the bisect. To avoid false "goods" I used a script to reboot 8 times and ping the router, and only considered a kernel "good" if all 8 attempts succeeded. All the "bads" actually happened on the first reboot, so there's little chance of error here.
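
The acceptance rule used for each bisect step can be written down as a small Python sketch. The function names are illustrative, and the 50% per-boot failure rate is only the rough figure reported in comments 4 and 5, treated here as if boots were independent.

```python
def classify(ping_results):
    """Mark a kernel 'good' only if the ping succeeded after every reboot."""
    return "good" if all(ping_results) else "bad"


def false_good_probability(failure_rate, reboots):
    """Chance that a bad kernel passes every reboot by luck,
    assuming each boot fails independently with failure_rate."""
    return (1.0 - failure_rate) ** reboots


print(classify([True] * 8))             # all 8 pings answered
print(classify([False] + [True] * 7))   # first boot failed
print(false_good_probability(0.5, 8))   # (1/2)**8 = 1/256
```

Under those assumptions a bad kernel would be mislabeled "good" about once in 256 tries, which supports the claim that there is little chance of error.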

Dell Vostro 200 indeed uses ICH9, so it sounds right. Summarizing the problem: during the boot process one can see the router/switch LED associated with the PC go off, then start blinking quite fast. ethtool says the link is detected, and ifconfig says the interface is up, with no errors in dmesg, but it is not possible to access the network. Sometimes disconnecting/reconnecting the Ethernet cable brings it back to normal, sometimes not.

dell:/usr/src/git/linux-2.6# git bisect good
97ac8caee238d2a81c23661916f7acd3a22c85fe is first bad commit
commit 97ac8caee238d2a81c23661916f7acd3a22c85fe
Author: Bruce Allan <bruce.w.allan@intel.com>
Date:   Tue Apr 29 09:16:05 2008 -0700

    e1000e: Add support for BM PHYs on ICH9

    This patch adds support for the BM PHY, a new PHY model being used
    on ICH9-based implementations.

    This new PHY exposes issues in the ICH9 silicon when receiving
    jumbo frames large enough to use more than a certain part of the
    Rx FIFO, and this unfortunately breaks packet split jumbo receives.
    For this reason we re-introduce (for affected adapters only) the
    jumbo single-skb receive routine back so that people who do
    wish to use jumbo frames on these ich9 platforms can do so.
    Part of this problem has to do with CPU sleep states and to make
    sure that all the wake up timings are correctly we force them
    with the recently merged pm_qos infrastructure written by Mark
    Gross. (See http://lkml.org/lkml/2007/10/4/400).

    To make code read a bit easier we introduce a _IS_ICH flag so
    that we don't need to do mac type checks over the code.

    Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
    Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
    Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

:040000 040000 692d08ad725d7abb00e82a7eb012e60a4e5a8048 ff07a5d12182d2b3054dfd735372ff1a23ce0c3f M      drivers

BR,
Joao
Comment 9 Joao S Veiga 2009-02-13 16:23:39 UTC
PS1: Not sure if this is important, but when I first got the problem, my BIOS was 1.0.11; I updated to 1.0.15 because it said:
Fixes/Enhancements:
1.Updated Intel microcode.
but it didn't help.
Comment 10 Jesse Brandeburg 2009-02-18 21:15:34 UTC
Adding Bruce, our ICH expert. Bruce, do you see anything in the offending commit 97ac8caee that seems in error?
Comment 11 Jesse Brandeburg 2009-02-18 21:23:11 UTC
We unfortunately don't have one of these systems here. If it isn't too much trouble, can you run the ethregs utility and dump the output to a file before (when it's working) and after (when it's not)?
It doesn't dump the PHY registers; for those we may need a driver change applied to both the before and after drivers.

I really appreciate the effort taken with the bisect! I'm a little surprised that this system is affected by that patch (and that it works before that patch).

ethregs can be downloaded from http://prdownloads.sf.net/e1000/ethregs-1.1.tar.gz

You'll probably need to install pciutils-devel to be able to build it; I haven't tried it on Debian before.
Comment 12 Bruce Allan 2009-02-19 11:41:26 UTC
Comment#1 from Troels Liebe Bentsen looks suspicious because the EEPROM dump suggests it is an 82562-V device but the dmesg output suggests it is an 82573.
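
One quick cross-check of the two outputs in comment 1: on these Intel adapters the first three 16-bit EEPROM words hold the MAC address, so the first six bytes of the `ethtool -e` dump can be compared with the address dmesg prints for eth0. A minimal Python sketch, with both values copied from comment 1 and a hypothetical helper name:

```python
def mac_from_eeprom_row(row):
    """Take one 'offset  bytes...' row of an `ethtool -e` dump and
    return its first six bytes formatted as a MAC address."""
    fields = row.split()
    return ":".join(fields[1:7]).lower()


# First row of the EEPROM dump in comment 1.
eeprom_row = "0x0000  00 21 70 08 dd 4f 00 08 ff ff 12 10 ff ff ff ff"
# Address from the 'eth0: (PCI Express:2.5GB/s:Width x1)' dmesg line.
dmesg_mac = "00:16:d3:3b:a6:18"

eeprom_mac = mac_from_eeprom_row(eeprom_row)
print(eeprom_mac, dmesg_mac, eeprom_mac == dmesg_mac)
```

The two addresses differ, which is consistent with the EEPROM dump and the dmesg log not describing the same adapter.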

As for comment#5 from Joao S Veiga, the e1000e bricking NICs issue only happens on 2.6.27-rcX and 2.6.27 kernels with ftrace enabled.  The issue was fixed in 2.6.27.1.

Offhand, the only thing I see that is possibly wrong in the commit reported by the git bisect (btw, thanks for doing that) is e1000e_disable_gig_wol_ich8lan() should check the PHY type and *not* write the PHY_CTRL register if it is an IFE PHY.  This is part of the suspend/shutdown path so it might not be related to this bug however.
Comment 13 Bruce Allan 2009-02-19 12:05:47 UTC
It turns out while we don't have a Dell Vostro 200, I do have a similar ICH9 w/ IFE PHY platform and will try to repro this issue in-house.
Comment 14 Joao S Veiga 2009-02-20 15:00:27 UTC
Created attachment 20318
ethregs dumps on Vostro 200 bisect good/bad
Comment 15 Joao S Veiga 2009-02-20 15:01:19 UTC
Sorry for the delay; I git-replayed to the bad state and collected the dumps.

While booted on the bad state, I collected some dumps with the NIC not working, then after I managed to make it work (dis/re-connect nw cable, un/re-load e1000e) collected one more.

Then I moved on with the bisect to the good state and collected one dump.

(to compile ethregs in debian, the pciutils-devel is in the libpci-dev package)

I'll be far from my Vostro 200 from feb/22 to mar/3, so unfortunately I won't be able to do more tests next week, but I'll be glad to do it as soon as I'm back.
Comment 16 dave graham 2009-02-25 17:00:07 UTC
I'm continuing to work this issue with Bruce Allan.

I looked at the ethregs good & bad dumps, and they are essentially the same, with nothing obviously wrong in either one. 
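
A comparison of this kind can be reproduced with any text diff. As a rough Python sketch (the register names and values below are invented placeholders, not taken from the attached dumps, and the real ethregs output format may differ):

```python
import difflib


def changed_lines(good_dump, bad_dump):
    """Return only the removed/added lines between two text dumps."""
    diff = difflib.unified_diff(
        good_dump.splitlines(), bad_dump.splitlines(),
        fromfile="good", tofile="bad", lineterm="", n=0,
    )
    lines = list(diff)[2:]   # drop the '--- good' / '+++ bad' header
    # Hunk markers start with '@'; keep only '-' and '+' lines.
    return [line for line in lines if line.startswith(("+", "-"))]


# Placeholder dumps; names and values are hypothetical.
good = "REG_A 0x00000001\nREG_B 0x000000ff\nREG_C 0x00000000"
bad = "REG_A 0x00000001\nREG_B 0x000000fe\nREG_C 0x00000000"
print(changed_lines(good, bad))
```

An empty result would mean the dumps are identical; a handful of differing lines is what "essentially the same" suggests here.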

The earlier dmesg trace is indeed bogus. It may show a problem, but not one with the 82562V-2 at Bus:Device.Function 0:19.0. Could someone provide a dmesg trace taken after the problem first shows, on a system with the originally reported device? Maybe I'll find a clue in it.

Like Bruce, I don't have a Dell Vostro 200. I do have an ICH9 with the same 82562V-2 device at 00:19.0. I tried to reproduce the problem locally using a vanilla 2.6.26 kernel under Fedora 9, so I'm likely using a significantly different kernel .config, and certainly different networking 'middleware', than the Debian Lenny installs that you (Joao and Bastien) are reporting on. That may be relevant. I don't know yet, but I plan a Debian Lenny install next to find out.

With the exception of the MAC address and the PXE version, my EEPROM image is the same as reported. I'm following up separately on the PXE version, but doubt it's relevant.

It may also be that the important difference between our configurations is in the connection to the link partner. Can you tell me please what make/model of switch you are connected to, and whether you are using a straight-through cable or a crossover (if you know)?

Thanks
Dave
Comment 17 Bastien Durel 2009-02-25 23:41:17 UTC
Hello,

Thanks for your investigations.

I'm connecting to a SMC FS5 switch with a straight through cable.
Comment 18 dave graham 2009-02-27 09:29:25 UTC
Thanks Bastien. I can't get hold of exactly that switch, but based on its specs, I tried the closest I could find, a Trendnet TE100 S5 10/100 switch that does auto MDI/MDIX (which accommodates both straight-thru and crossover TX/RX cablings). On my system I am still not able to see the reported failure.

btw, Joao - what switch are you using ?

I strongly suspect that MDI/MDIX behavior is relevant, especially because commit 97ac8ca, which Joao identified in his bisect, introduces (or tries to improve - I'm not sure yet) auto-MDI/MDIX for the IFE phy in your platform. Of course if we do have an interoperability problem with this switch type, it might still be an issue with the connected switch.

I can't be sure that it's the Auto-MDI/MDIX feature that's causing the problem just from the commit id, as the commit is 1100 lines of code, and the MDI/MDIX part is only about 20 lines within it. But by testing with an updated e1000e driver that removes the MDI/MDIX code alone, we could find out. If you can build a kernel, and are prepared to help in that effort, please let me know, and I'll send you a patch to apply to your 2.6.26 kernel tree.

There's other stuff that you could do to help, and that might add circumstantial evidence to the suspected MDI/MDIX failure. See if there's a firmware update for your switch, and if so apply it and retest (I actually went looking through the SMC site & FAQs for the SMC-FS5, but couldn't find anything interesting). I think it's unlikely it will resolve the issue, but if a FW update does resolve it, then that's almost an admission of a problem on the switch vendor's part. Could you try another cable (a crossover if you were using a straight-through), maybe try another switch ?

I'll keep trying to get a repro here, but am not too optimistic that I'll succeed without the right switch. I still haven't tried Debian Lenny, but don't expect that to be relevant.

If you're up for a switch-swap, I think that would help us get definite root cause, and be my best shot at providing a true fix or workaround. I can arrange to send you a (110V, 50Hz) replacement, so let me know if you're interested in that too, and we would then discuss the detail offline from this bugzilla.
 
Comment 19 Bastien Durel 2009-02-28 03:12:04 UTC
Created attachment 20389 [details]
router's good tcpdump

As I live in France, I cannot use a 110V switch, and I don't have another one.

I tried with a crossover cable, it didn't help.

But as I had plugged in a crossover cable, I tried to connect it directly to the router, and the connection was up each time I plugged it in or rebooted.
I attach the router's tcpdump, but I'm not sure it can be useful. I may try to build ethregs if it works on OpenBSD (as my router runs it)

I can test a patch for the 2.6.26 kernel if you send one
Comment 20 Joao S Veiga 2009-02-28 13:01:37 UTC
Hi Dave (sorry, as I mentioned before I'm away from home this week, and not often on the web).
I'll get the dmesg from before/after as soon as I'm back home; but I can answer the switch connection part:

I'm using a Gi-Link (rtl8186 chipset), with a 2.4-kernel based firmware, connected to the Vostro 200 with a direct (not crossover) CAT5 cable, all wires correctly crimped (not only rx/tx pairs).

Although I'd love to accept your offer for a switch, I'm in Brazil, so shipping costs would be too big. I have a 100Mbps switch (encore) and a 10Mbps hub (also encore IIRC) which I can test with as soon as I'm back home on Tuesday 3/3.

I'll also be glad to test any patch or debug version of the module.

BR,
Joao
Comment 21 dave graham 2009-03-02 13:35:22 UTC
Created attachment 20408 [details]
Removes e1000e ICH8 MDIX detect code

To be applied to vanilla 2.6.26 kernel tree.
Comment 22 dave graham 2009-03-02 13:36:11 UTC
Thanks Bastien, Joao.

I've added a small patch (revertICH8MDIX.patch) that reverts the MDI/MDIX section of the commit that's been identified as introducing the problem. Bastien, Joao, if one or both of you could test with this patch applied that would be great.

I think that there may be a problem with enabling MDI/MDIX detection because I see in at least one other place in the code [force_speed_duplex()] where we are disabling it for ICH8 100Mbps interfaces. I don't yet understand the relation between this new code - the code my patch removes - and force_speed_duplex, but knowing for sure that it is (or isn't) the MDI/MDIX stuff causing the problem will help me know that my focus is good.

Dave
Comment 23 Joao S Veiga 2009-03-03 16:33:26 UTC
Hello Dave,

I'm back, and made the tests.

First I tried the first bad commit from git bisect with an Encore 100Mbps switch (8 port NWay Switch, model ENH908-NWY), but it didn't make any difference. The first boot did come up with the NIC working, but it was a lucky one. I'm attaching dmesg outputs from booting with the Encore and the Gi-Link. (The 10Mbps hub I also wanted to test has vanished, sorry.)

Then I compiled a vanilla kernel.org 2.6.26 to make sure it didn't work, booted 5 times (all 5 got a nonworking NIC), then patched it with your revert patch, compiled, and got 5 good boots in a row.

I've attached dmesgs for these two cases, and also the /proc/config.gz.

Please let me know if more tests are needed.

BR,
Joao
Comment 24 Joao S Veiga 2009-03-03 16:35:21 UTC
Created attachment 20419 [details]
Switched switch and revertICH8MDIX tests
Comment 25 Bastien Durel 2009-03-07 03:39:56 UTC
Hello,

it works well with the revertICH8MDIX patch applied, plugged on switch.
Comment 26 dave graham 2009-03-11 14:19:05 UTC
Thanks. I apologize for the delay in getting back to you. I am working on 2 fronts to move this along.

1) I have our interop lab helping in the testing, as they have access to a lot of switch types and are our best shot at seeing a local repro.

2) I am working on some code that will allow us to get relevant phy configuration and status information from your configurations that will let me see what's going on.  

The current theory is that the auto-MDI/MDIX behavior within our own phy is switching TX/RX, but that so is the auto-MDI/MDIX capability in the connected switch, and that they both keep switching.

If you could try another switch, that'd still be a good datapoint.

Also, if you could try the standalone latest released driver at sourceforge,
(http://sourceforge.net/projects/e1000/ click Download, then Browse All Packages, then click in the download link for e1000e_stable), it'd be useful to know if the problem still exists there. I expect that it does.

  
Comment 27 Joao S Veiga 2009-03-12 16:39:22 UTC
Hello, thanks for the update.
Using the module from sourceforge:
e1000e: Intel(R) PRO/1000 Network Driver - 0.5.11.2-NAPI
Symptoms are the same (fast blinking, no network access most of the times).

Now, at the risk of losing any credibility I might have, here's a strange thing:
After a thunderstorm earlier this week (which killed my PS3), my Gi-Link router is going south. Its WAN (ethernet) interface is behaving badly (I have to plug/unplug the cable several times to make it connect to the cable modem).

BUT - I no longer can reproduce the fast-blinking/no connection problem on this Gi-Link with the Vostro 200!! I can use the 2.6.26 vanilla, the first bad bisect from 2.6.25 to 2.6.26, or the latest standalone driver, and the connection between the router and the Vostro 200 works all the time. I don't know if there's any reasonable explanation for this. ifconfig shows no error/drops on both sides of the vostro/router link (does show some errors on the WAN interface though). The only change I could expect from the lightning would be on the impedance from half-damaged front-end circuits.

LUCKILY I still get the problem with the Encore switch (ENH908-NWY) I had tested before, and which I used to confirm that the standalone driver still does the same.

I also found that the problem can be reproduced with the KNOPPIX_V6.0.1CD-2009-02-08-EN.iso live CD/pendrive (debian lenny based, 2.6.28.4 kernel), so if you want a quick way of checking multiple PCs for selecting a testbed, using this on a pendrive could make it easier.

BR,
Joao
Comment 28 dave graham 2009-03-24 15:58:47 UTC
Joao, 
That's strange about the thunderstorm, but OK you still see the problem with the Encore. I now have tried a bunch of auto MDI/MDIX 10/100 equipment, and one of those is also the Encore, which I managed to get from Amazon. Here's my current list.

Netgear RP614 v2
NetGear MR314
LinkSys BEFSR81
Encore ENH908-NWY

Using each of these, I still fail to see the problem, so far. 

Thanks also for testing using the e1000e-0.5.11.2. I'll provide the debug code(which I referenced in my previous note) on that as a base - it'll be faster to work with the standalone driver. I should have that as a new comment and attachment in a day or two, and will provide instructions with it.
Dave
Comment 29 Joao S Veiga 2009-03-25 23:06:04 UTC
Thank you Dave,

I've brought a Netgear ProSafe Gigabit Switch GS108 from work for testing, and the problem does not appear with it (0.5.11.2-NAPI).

I was never able to reproduce it again with the half-fried Gi-Link; I'll maybe buy a Linksys WRT54G as a replacement this weekend, so I'll have one more "partner" to try. Currently only the Encore switch is doing the fast-blinking with the Vostro 200.

I will also try to bring more switch models from work (including other same-model Encore) to check if I can get more "bad" test cases to try with your debug code.

Best regards,

Joao S Veiga
Comment 30 Joao S Veiga 2009-03-26 21:55:25 UTC
I brought another Encore ENH908-NWY from work, and it also fastblinks (so it's not unique to my specific ENH908-NWY unit). I'll try to bring a Gbit 3Com switch, and an Edimax tomorrow.

I also noticed something interesting: I connected the switch before turning the Vostro 200 on, and when I turned the PC on I got the fast blinking BEFORE linux booted - while still on the GRUB menu.

At this point, the behavior is the same as after booted with the "bad" MDIX detect: Unplugging/plugging the network cable sometimes brings a steady led on, most of the times the fast blinking.

So whatever causes the "resonance" is the default behavior of the NIC, without the driver telling it what to do. I did mention I get the same thing with Windows 7 Beta 64bits (but not with Vista 32); do they use different MDIX detection code too?

BR,

Joao
Comment 31 dave graham 2009-03-26 23:48:44 UTC
Created attachment 20696 [details]
e1000e patch dumps phy registers to msg log on ethtool -d
Comment 32 dave graham 2009-03-26 23:50:50 UTC
Created attachment 20697 [details]
small script to invoke mulitple phy register dumps

Requires that the patch e1000e-0.5.18.3.dumpphy.patch is applied to e1000e-0.5.18.3 base driver.
Comment 33 dave graham 2009-03-26 23:59:48 UTC
Thanks for continuing to work with me on this Joao.
Your new note is interesting, and I haven't had too much time to ponder it yet. Somehow we'll have to explain how the MDI/MDIX commit can cause/resolve the issue.

I have attached items that together should allow you to collect more data for me, so that I can see whatever the phy is seeing with the MDI/MDIX code running.

e1000e-0.5.18.3.dumpphy.patch
This is to apply to the (new) e1000e-0.5.18.3 sourceforge driver, which will almost certainly still exhibit the problem we are discussing. Could you please use that driver with this patch for your continued testing. The patch simply adds a phy register dump capability to the "ethtool -d ethx" command, and will place the data in the system message log.

e1000e-0.5.18.3.dump.sh
This small bash script invokes 21 phy dumps at 300ms apart, placing the result in a file (via dmesg) "phyregisters".
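For readers without the attachment, a script of the kind described could look roughly like the sketch below. It is not the attached script: the function name collect_phy_dumps, the default interface eth0 and the loop structure are assumptions. It relies only on the behavior stated above, namely that the patched driver writes a phy register dump to the kernel log on every "ethtool -d". The fractional "sleep 0.3" assumes a sleep that accepts fractions, such as GNU sleep.

```shell
# Hypothetical helper: trigger repeated phy register dumps, then save the kernel log.
# Usage: collect_phy_dumps [interface] [count]
collect_phy_dumps() {
    iface=${1:-eth0}    # interface name (assumed default)
    count=${2:-21}      # 21 dumps, as described for the attached script
    i=0
    while [ "$i" -lt "$count" ]; do
        # With the dumpphy patch applied, this writes the phy registers to the kernel log
        ethtool -d "$iface" > /dev/null
        sleep 0.3       # about 300ms between dumps
        i=$((i + 1))
    done
    # The dumps are retrieved from the kernel message buffer
    dmesg > phyregisters
}
```

It would be run as root, for example as "collect_phy_dumps eth0", while the LED is fast-blinking.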

If you could run the script when the LAN port is exhibiting the flashing LED behavior under the new driver, and send me the file "phyregisters", I think it should allow me to see what's going on.

Thanks again
Dave
Comment 34 Joao S Veiga 2009-03-27 23:33:45 UTC
Thank you Dave,

Before going to the dump tests:

I tested an old Edimax "Switch Fast Ethernet" (no model number), and a 3Com Gbit 3CGSU08 switch, and none fast blinked.

I could also get fastblinks again with my Gi-Link, but I have to un/plug the ethernet cable several times (about 10% fastblinks)

Tests (running vanilla kernel.org 2.6.26, attached config.gz):

1 - I first compiled the new driver module as it came from sourceforge, to make sure it still has the fastblinks.

2 - ifdown, unloaded old modules, loaded the new module:
e1000e: Intel(R) PRO/1000 Network Driver - 0.5.18.3-NAPI
e1000e: Copyright(c) 1999 - 2009 Intel Corporation.
ACPI: PCI Interrupt 0000:00:19.0[A] -> GSI 20 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:00:19.0 to 64
0000:00:19.0: : Failed to initialize MSI interrupts.  Falling back to legacy interrupts.
0000:00:19.0: eth0: (PCI Express:2.5GB/s:Width x1) 00:1e:c9:1a:83:5d
0000:00:19.0: eth0: Intel(R) PRO/10/100 Network Connection
0000:00:19.0: eth0: MAC: 8, PHY: 7, PBA No: ffffff-0ff

3 - ifup, unplugged the CAT5e from the GiLink, plugged on the Encore, it fastblinks. Connected/disconnected the cable from the Encore 10 times, got 7 bads, 3 goods.

4 - applied the e1000e-0.5.18.3.dumpphy.patch, recompiled, ifdown, un/load e1000e module, ifup.

Result note: I had left the Encore in a working state, but when I unloaded the module it started fastblinking. When I ifup'ped the led turned off, then back to fastblink. I think this confirms that when "left alone" the NIC hardware also causes the fastblinks.

5 - ran the script, collected bad_phyregisters_1

6 - un/plugged the cable 5 times, until I got a working state, collected good_phyregisters_2

7 - un/plugged the cable twice, until I got fastblinking, collected bad_phyregisters_3

8 - unplugged from Encore, un/plugged to Gi-Link 8 times, until I got a bad state, collected bad_phyregisters_4

9 - I recompiled the kernel enabling MSI interrupts, to avoid the warning when loading the module, just in case, but it did not change the fastblinking results.
ACPI: PCI interrupt for device 0000:00:19.0 disabled
e1000e: Intel(R) PRO/1000 Network Driver - 0.5.18.3-NAPI
e1000e: Copyright(c) 1999 - 2009 Intel Corporation.
ACPI: PCI Interrupt 0000:00:19.0[A] -> GSI 20 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:00:19.0 to 64
0000:00:19.0: eth0: (PCI Express:2.5GB/s:Width x1) 00:1e:c9:1a:83:5d
0000:00:19.0: eth0: Intel(R) PRO/10/100 Network Connection
0000:00:19.0: eth0: MAC: 8, PHY: 7, PBA No: ffffff-0ff

10 - collected bad_phyregisters_5 and good_phyregisters_6 with MSI interrupts enabled and the Encore switch, also attached the msi_config.gz.

Please let me know if more tests are needed.

Best regards,

Joao S Veiga
Comment 35 Joao S Veiga 2009-03-27 23:34:45 UTC
Created attachment 20714 [details]
phy registers dump tests
Comment 36 Joao S Veiga 2009-03-31 21:49:40 UTC
Linksys WRT54G V8 firmware 8.00.5 does not fastblink.
BR
Joao
Comment 37 dave graham 2009-04-01 21:56:55 UTC
Thanks Joao for the register dumps. They show what's happening. 

First, for both the good and bad cases the first read result was always different from the others, but it seems that's because the registers around 0x13 are statistics registers that are cleared on read, and so probably not important. The outstanding item is the setting of register 0x1C, which reads:

    bad case   0x90
    good case  0xB0 (I see this value in my testing, even with the NWay switch)

In both cases, the 82562V-2 indicates (bit 4) that the Autodetect function has successfully achieved resolution; the difference is bit 5, which the Phy documentation says "Indicates the state of the MDI pair. 1 = MDI-X (cross over). 0 = MDI (straight through)."
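
The two readings can be split into the bits discussed here with a few lines of shell arithmetic. This is only an illustrative sketch: the function name decode_reg_1c is made up, and it prints the raw values of bits 4 and 5 without interpreting them.

```shell
# Hypothetical helper: print bits 4 and 5 of a phy register 0x1C reading.
decode_reg_1c() {
    value=$(( $1 ))                 # accepts hex such as 0x90
    bit4=$(( (value >> 4) & 1 ))    # autodetect resolution bit
    bit5=$(( (value >> 5) & 1 ))    # MDI pair state bit
    echo "bit4=$bit4 bit5=$bit5"
}

decode_reg_1c 0x90   # bad case:  bit4=1 bit5=0
decode_reg_1c 0xB0   # good case: bit4=1 bit5=1
```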

So in the failing case, our 82562V-2 has decided that the cable is X-over, which it isn't. Interestingly, all the failure cases show the same thing, and they don't show us changing back and forth. Maybe we did indeed detect the cross-over configuration based on the link partner's first auto-MDI/MDI-X switch, changed to the crossover configuration itself, which sent the link partner into a weird mode. I don't know the detail of the auto-detect function though, so have forwarded the data to our Silicon sustaining team for further analysis, and will update when they respond.

In the meantime, it's good that you have the LinkSys.
Comment 38 Dominique Leducq 2009-11-10 14:52:06 UTC
I have this same bug on a Dell Inspiron 530, also with an Intel 82562V-2 NIC.

I am running Mandriva 2010.0, with a 2.6.31 kernel, so 2.6.31 still has this bug apparently.
details at :
https://qa.mandriva.com/show_bug.cgi?id=43493
I have a Ruby Tech 8 port 10/100 switch, a straight-through cable, and no other switch to test with at the moment.

Best regards,

Dominique
Comment 39 Bastien Durel 2009-11-10 15:39:24 UTC
I switched to a netgear GS-108T, and there is no more problem.
For Dominique : I ran for a few months with dave's patch applied on my debian kernel, it worked well.
Comment 40 Caleb Cushing 2009-12-05 07:04:58 UTC
bug #14737 may be related, but I'm not sure.
Comment 41 Jesse Brandeburg 2010-01-27 23:43:57 UTC
Created attachment 24749 [details]
proposed parameter to control mdi-x

This is a proposed patch to allow the user to override MDI-X behavior at module load time.

patch is against 2.6.33-rc5, please test with your funny switch(es).

<apply patch>
make M=drivers/net/e1000e
insmod drivers/net/e1000e/e1000e.ko mdix=1,1

this will force MDI (straight through) mode on the phy, value of 2 forces MDI-X mode.
Comment 42 Joao S Veiga 2010-01-31 20:24:32 UTC
Hi, one of my funny switches (actually a Gi-link router) had a lightning problem, and does not misbehave with the mdi-x, but I still have the Encore ENH908-NWY which triggers the problem, so I used it to test.

I was having trouble booting with 2.6.33-rc5, so instead of deviating from the intended course to solve that, I applied your patch to the 2.6.29.1 I'm running now (it applied Ok, and I installed the new module in /lib/modules):

# modinfo e1000e
filename:       /lib/modules/2.6.29.1/kernel/drivers/net/e1000e/e1000e.ko
version:        0.3.3.3-k6
(...)
parm:           mdix:MDI-X crossover control: 0 - auto (default), 1 - mdi only, 2 - mdix only (array of int)

Using the module with mdix=1 worked 100% of the time; with mdix=2 the fastblink appeared 100% of the time. Without the parameter, the result is random (it seems to depend on how it was set previously):

If I booted on a fastblink state, reloading e1000e with mdix=1 always fixed it. Reloading with mdix=2 didn't.

Good:
ifdown eth0; modprobe -r e1000e; modprobe e1000e mdix=1; ifup eth0
Fastblinks:
ifdown eth0; modprobe -r e1000e; modprobe e1000e mdix=2; ifup eth0

(note: I used "mdix=1" or "mdix=2", not "mdix=1,1" - is that right?)

I couldn't however make it work directly from boot. I tried:
- adding "options e1000e mdix=1" on /etc/modprobe.d/e1000e (which makes modprobe e1000e work without the option, but not at boot time, it seems), 
- adding "mdix=1" and "e1000e.mdix=1" in the kernel line (grub), but got random results
- adding "e1000e mdix=1" on /etc/modules, also random results.

Also noticed the option does not show up at /sys/module/e1000e/parameters/, which makes it hard to understand if the failures at boot happened because the option was not set or if it was set but did not work well.

When loading the module on the commandline, it tells what the MDI-X was set to, but I couldn't see that during boot time.

Best regards,

Joao S Veiga
Comment 43 Bastien Durel 2010-02-01 07:38:45 UTC
As I gave my "funny switch" to my father-in-law, it shall be hard for me to test it.
I may try to borrow it next week-end, if you need more feedback, though.
Comment 44 Bastien Durel 2010-02-05 18:11:42 UTC
Hello,

I borrowed my old switch for testing.

I applied patch to Debian's 2.6.32-trunk, built & installed the module, and made tests.
* mdix=1 leads to 100% connection
* mdix=2 leads to 100% failure
* mdix=0 leads to random results (looks like it depends on timing, more failures if I try to connect as soon as I install the module, but I cannot be sure)

option assignment via /etc/modprobe.d works well after rebuilding initrd (did you do that Joao ?)

Thanks for your work.
Best regards,

Bastien Durel
Comment 45 Nico Poppelier 2010-03-01 13:31:32 UTC
Bastien, could you explain with an example how option assignment via /etc/modprobe.d works? Like Joao, I tried "mdix=1" on the kernel line in /boot/grub/menu.lst, but got random results. 

Thanks! Best regards,

Nico Poppelier
Comment 46 Bastien Durel 2010-03-01 15:07:06 UTC
Hello,

Most Linux distros use initial ram images to boot. These images are generated once, at kernel install time, and ship copies of /etc/modprobe.conf & /etc/modprobe.d/* along with the running (and maybe other) modules.
Their init script loads the embedded modules with parameters taken from the embedded modprobe config files.
Then when Linux switches its root to your root disk, e1000e is already loaded with its old, empty, configuration.

You have to rebuild the initrd after modifying modprobe files (calling mkinitrd with proper arguments, or "update-initramfs -u" under debian)
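
As a concrete illustration, the steps on a Debian-style system might look like the sketch below. It assumes the e1000e module was built with the proposed mdix parameter patch, that the interface is eth0, and that the config file is named e1000e.conf (the file name is an arbitrary choice for this example). Everything has to be run as root.

```shell
# Set the module option (requires the patched e1000e with the mdix parameter)
echo "options e1000e mdix=1" > /etc/modprobe.d/e1000e.conf

# Rebuild the initial ram image so its embedded copy of the modprobe config is refreshed
update-initramfs -u

# Optionally reload the driver right away instead of rebooting
ifdown eth0; modprobe -r e1000e; modprobe e1000e; ifup eth0
```

Other distributions use a different tool for the rebuild step, so only the first and last commands carry over unchanged.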

Regards,

Bastien Durel
Comment 47 Jesse Brandeburg 2010-05-25 17:59:53 UTC
in fedora you can use dracut to rebuild the initrd.

latest update: I still haven't committed this patch to our out-of-tree driver, but I still believe it to be a good feature.  I hope we will release the fix soon, but we need some time (hard to find) to complete the correct upstream fix, which is probably to enable ethtool to control MDI-X state as well as report it.
Comment 48 Nico Poppelier 2010-06-19 19:32:40 UTC
When could you commit a correct upstream fix? I just upgraded to Ubuntu 10.04 and am again building a patched kernel (2.6.32-22)! It would be nice to be able to use stock kernels instead of patched ones. The patch works of course, and I'm still grateful it is available. :-)

Regards, Nico
Comment 49 Bastien Durel 2010-11-22 18:20:05 UTC
Hello,

Have you committed a patch ? Any reports ?

Thanks,

-- 
Bastien
Comment 50 spuerhund 2011-02-13 17:52:26 UTC
I am also affected by this bug. It also exists in Ubuntu 10.10, so I assume the patch has not been committed yet. The issue is described in the launchpad system at https://bugs.launchpad.net/ubuntu/+bug/408351
Comment 51 Nico Poppelier 2011-05-02 07:47:04 UTC
I upgraded to Ubuntu 11.04 (Natty Narwhal) yesterday, and the problem is still there. Narwhal uses kernel version 2.6.38. I can apply the patch presented here more than a year ago (see post of 2010-01-27) but would really appreciate a permanent solution!
Comment 52 Mac Michaels 2012-06-07 21:48:49 UTC
I applied the patch linked to by Comment #41 From Jesse Brandeburg 2010-01-27 and added the option mdix=1 to the e1000e module.  It corrected a long standing problem booting the Vostro 200.  I am connected to a Netgear 8 port 10/100 mbps switch FS608 v3.  The switch node blinked rapidly and the computer could not access the network when booting sometimes.  My work around was to power cycle the Netgear switch repeatedly until it started working.  Power cycling the switch causes the driver to reset and try again.

This patch needs to be in the kernel. I am using Linux version 3.3.3-gentoo.  The patch applied OK once I found the actual e1000e driver directory in my source tree.

Please add the patch in Comment #41 to the official kernel sources.
Comment 53 Jesse Brandeburg 2012-06-07 23:07:35 UTC
The module parameter patch will not be accepted upstream, but I have finished the code for the driver changes and associated app changes to ethtool to allow this to work.  These changes will go into our internal test and then be pushed upstream.

there will be a new argument to ethtool -s ethx called mdix to be used like this:

ethtool -s ethx mdix <auto|on|off>
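
A sketch of how that could be used for the failing setups in this report, assuming a kernel and an ethtool build that both include the MDI-X control support, and assuming the interface is eth0. In ethtool's syntax "off" selects MDI (straight through) and "on" selects MDI-X (crossover), so "off" would presumably correspond to the mdix=1 module parameter that worked for the testers above.

```shell
# Force MDI (straight through) on the NIC
ethtool -s eth0 mdix off

# Return to automatic detection
ethtool -s eth0 mdix auto

# Show the MDI-X state reported by the driver
ethtool eth0 | grep -i "mdi-x"
```

Unlike a module parameter, a setting made this way does not persist across reboots, so it would need to be reapplied from the distribution's network configuration.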
Comment 54 Florian Mickler 2012-11-11 18:57:07 UTC
A patch referencing this bug report has been merged in Linux v3.7-rc1:

commit 4e8186b68fb944ad9e7fd4080cd8bd8f10eb7cbd
Author: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date:   Thu Jul 26 02:31:14 2012 +0000

    e1000e: implement MDI/MDI-X control
