Bug 6675

Summary:	e1000: eeprom checksum error should not be fatal
Product:	Drivers	Reporter:	Patrick Horn (phrh)
Component:	Network	Assignee:	Jesse Brandeburg (jesse.brandeburg)
Status:	REJECTED INSUFFICIENT_DATA
Severity:	normal	CC:	auke-jan.h.kok, bunk, jesse.brandeburg, rl03, sanjoy, sebastian
Priority:	P2
Hardware:	i386
OS:	Linux
Kernel Version:	2.6.17-rc6	Subsystem:
Regression:	---	Bisected commit-id:
Attachments:	ethtool -e and lspci output

Description Patrick Horn 2006-06-10 17:20:21 UTC

Distribution: Debian testing (etch), with compiled kernel
Hardware Environment: Pentium 4 2.6GHZ with ACPI/Hyperthreading and SMP.
Software Environment: gcc (GCC) 4.0.4 20060507 (prerelease) (Debian 4.0.3-3)
Problem Description:
The e1000 module fails to load with the error
e1000: 0000:02:01.0: e1000_probe: The EEPROM Checksum Is Not Valid

This error is not fatal, as commenting out
//		err = -EIO;
//		goto err_eeprom;
on line 759 of e1000_main.c (immediately below the above printk)
allows things to work perfectly.

The solution should be either to provide a kernel argument to override this
error (like e1000checksum=force), or, better, not to treat this as an error in
any situation, as that may cause less confusion for most people when it occurs.

Steps to reproduce:
1. Suspend with "echo mem > /sys/power/state"
2. Reboot Linux to find this error.

Comment 1 Patrick Horn 2006-06-11 01:57:46 UTC

This error seems to stay in the EEPROM after any number of reboots/powerdowns.

All it seemed to have taken is one suspend, and now my EEPROM on my network card
is permanently changed.  So I am now unable to get network support in any kernel
or on LiveCDs, as they all report this same error.

Another possibility would be to have a utility to rewrite the EEPROM checksum.
But it seems that this would be at least as dangerous as simply ignoring the
error, and I would prefer a permanent kernel change over having to modify an
EEPROM each time this happens.

Comment 2 Jesse Brandeburg 2006-06-16 11:41:01 UTC

A couple of comments:

First, we need some information about your hardware, lspci, driver version,
ethtool -e output.

Second, our position is that if the eeprom has a bad checksum we won't load the
driver because we can't depend on the configuration of the hardware to be
correct.  The e1000 hardware relies on the eeprom to configure many parameters
of the hardware.

We will not implement a driver change that will override the eeprom check.  In
general the driver never writes to the eeprom, and so it should never be able to
be corrupted unless your hardware has failed.  You should be working with your
motherboard or NIC vendor (it might be intel) to determine if your hardware has
failed.

And lastly, it is likely that if you have suspended the system, the LAN adapter
is in D3 after you've resumed (it's not clear from your description if you
resumed or rebooted), and e1000_suspend should wake it up to D0 and reinitialize
everything.

Comment 3 Patrick Horn 2006-06-20 19:05:30 UTC

Created attachment 8357
ethtool -e and lspci output

Comment 4 Patrick Horn 2006-06-20 19:13:18 UTC

This is definitely a permanent change in the EEPROM (I have tried shutting off
and rebooting, and trying a different kernel).  This problem seems to happen for
many other people as well, given the number of Google hits.

Suspending seemed to freeze my computer, so I had no choice but to power off and
on, which means that it probably didn't cause this unless suspending changed the
eeprom.

I tried using Windows, and my onboard e1000 works perfectly fine there; however
the Linux e1000 driver still gives the EEPROM error even after rebooting from
Windows.  This means that the error can be safely ignored without any ill effects.

pts/3:~>sudo ethtool -t eth0 offline
The test result is FAIL
The test extra info:
Register test  (offline)         0
Eeprom test    (offline)         2
Interrupt test (offline)         0
Loopback test  (offline)         0
Link test   (on/offline)         0

The error code 2 (from drivers/net/e1000/e1000_ethtool.c) is invalid checksum.

I added a line in DMESG to print out the expected checksums:
[17183163.480000] e1000: eth0: e1000_eeprom_test: Failed EEPROM Checksum (Error
2): got BABD, expected BABA [Added by Patrick]

As you can see, the checksum is off by a mere 3, which shows that very little
data actually changed, possibly related to a BIOS setting that I may have
changed at the time.
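
For readers following along, the check under discussion can be modeled in user space. This is only a minimal sketch, assuming the driver sums the 16-bit EEPROM words 0x00 through 0x3F and compares the total with 0xBABA (the "expected BABA" in the log line above). The names eeprom_sum and eeprom_checksum_ok are made up for illustration and are not the driver's own functions.

```c
#include <stddef.h>
#include <stdint.h>

#define EEPROM_WORDS 64      /* assumption: words 0x00..0x3F are summed */
#define EEPROM_SUM   0xBABA  /* expected total, as in the log line above */

/* Add up all EEPROM words, wrapping modulo 2^16. */
static uint16_t eeprom_sum(const uint16_t words[EEPROM_WORDS])
{
    uint16_t sum = 0;

    for (size_t i = 0; i < EEPROM_WORDS; i++)
        sum = (uint16_t)(sum + words[i]);
    return sum;
}

/* Return 1 if the image passes the checksum test, 0 otherwise. */
static int eeprom_checksum_ok(const uint16_t words[EEPROM_WORDS])
{
    return eeprom_sum(words) == EEPROM_SUM;
}
```

Under this model, a sum of BABD means the words add up to 3 more than they should, which fits a small change in one or two words rather than wholesale corruption.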

It would be nice to either have some way to safely rewrite the eeprom, or else
to be able to override the check once you have determined that nothing has
actually failed.

Comment 5 Sebastian Bergmann 2006-07-01 07:18:09 UTC

Same here on my new ThinkPad X60s. Applying the patch makes the device/driver
work without problems.

Comment 6 Sanjoy Mahajan 2006-07-07 19:36:03 UTC

I get the same error on my TP T60 in 2.6.18-rc1.  The error comes and goes. 
With Ubuntu's 2.6.15-25-386 kernel, it showed up a few times early on.  Those
times I was able to reboot into Windows, reboot back into Linux, and it worked
fine.  

But the error has shown up again (in 2.6.18-rc1 and Ubuntu's kernel) and I've
long erased Windows, so that trick doesn't help.  Some reports found via google
said that powering off and taking out the battery fixed it for them, but I
haven't tried that yet.

Comment 7 Sanjoy Mahajan 2006-07-07 19:48:36 UTC

Based on a suggestion at <http://forum.thinkpads.com/viewtopic.php?t=23776>, I tried

modprobe -r e1000
/* plug in ethernet cable */
modprobe e1000

And now it works fine, even though I didn't reboot.  So somehow the driver gets
confused if the cable isn't connected when it's loaded?

Comment 8 Tomasz Chmielewski 2006-08-14 03:03:17 UTC

It looks like some vendors ship a "crippled" version of e1000 network card with
their hardware.

This is the case, for example, with the Thecus n4100: with a vanilla kernel, it
is impossible to use this card ("The EEPROM Checksum Is Not Valid"). Commenting
out the EEPROM checksum check doesn't help - the module will then complain about
a broken MAC.

A quick investigation shows that the Thecus n4100's e1000 network cards don't
contain the EEPROM in which the MAC is stored.
MAC is hardcoded in the kernel module, and a diff between a vanilla 2.6.9 code,
and 2.6.9 Thecus code shows us these changes (note the hardcoded
00:50:60:70:80:90 MAC):


diff -ur linux-2.6.9/drivers/net/e1000/e1000_hw.c linux/drivers/net/e1000/e1000_hw.c
--- linux-2.6.9/drivers/net/e1000/e1000_hw.c    2004-10-18 23:55:28.000000000 +0200
+++ linux/drivers/net/e1000/e1000_hw.c  2005-09-16 09:54:15.000000000 +0200
@@ -3814,11 +3814,10 @@
 int32_t
 e1000_read_mac_addr(struct e1000_hw * hw)
 {
-    uint16_t offset;
-    uint16_t eeprom_data, i;
+    uint16_t i;

     DEBUGFUNC("e1000_read_mac_addr");
-
+/*
     for(i = 0; i < NODE_ADDRESS_SIZE; i += 2) {
         offset = i >> 1;
         if(e1000_read_eeprom(hw, offset, 1, &eeprom_data) < 0) {
@@ -3831,7 +3830,13 @@
     if(((hw->mac_type == e1000_82546) || (hw->mac_type == e1000_82546_rev_3)) &&
        (E1000_READ_REG(hw, STATUS) & E1000_STATUS_FUNC_1))
             hw->perm_mac_addr[5] ^= 0x01;
-
+*/
+    hw->perm_mac_addr[0]=0x00;
+    hw->perm_mac_addr[1]=0x50;
+    hw->perm_mac_addr[2]=0x60;
+    hw->perm_mac_addr[3]=0x70;
+    hw->perm_mac_addr[4]=0x80;
+    hw->perm_mac_addr[5]=0x90;
     for(i = 0; i < NODE_ADDRESS_SIZE; i++)
         hw->mac_addr[i] = hw->perm_mac_addr[i];
     return E1000_SUCCESS;


--- linux-2.6.9/drivers/net/e1000/e1000_main.c  2004-10-18 23:53:50.000000000 +0200
+++ linux/drivers/net/e1000/e1000_main.c        2006-02-21 10:08:44.000000000 +0100
@@ -58,6 +58,7 @@
  * Macro expands to...
  *   {PCI_DEVICE(PCI_VENDOR_ID_INTEL, device_id)}
  */
+
 static struct pci_device_id e1000_pci_tbl[] = {
        INTEL_E1000_ETHERNET_DEVICE(0x1000),
        INTEL_E1000_ETHERNET_DEVICE(0x1001),
@@ -386,6 +387,7 @@
        int pci_using_dac;
        int i;
        int err;
+       uint8_t lattimer = 0x45;
        uint16_t eeprom_data;

        if((err = pci_enable_device(pdev)))
@@ -436,6 +438,8 @@
                goto err_ioremap;
        }

+       pci_write_config_byte(pdev,PCI_LATENCY_TIMER,lattimer);
+
        for(i = BAR_1; i <= BAR_5; i++) {
                if(pci_resource_len(pdev, i) == 0)
                        continue;
@@ -510,13 +514,13 @@
        e1000_reset_hw(&adapter->hw);

        /* make sure the EEPROM is good */
-
+/*
        if(e1000_validate_eeprom_checksum(&adapter->hw) < 0) {
                DPRINTK(PROBE, ERR, "The EEPROM Checksum Is Not Valid\n");
                err = -EIO;
                goto err_eeprom;
        }
-
+*/
        /* copy the MAC address out of the EEPROM */

        if (e1000_read_mac_addr(&adapter->hw))

Comment 9 Patrick Horn 2006-11-04 15:25:14 UTC

I just tried 2.6.18.1, and I am still having to make this patch to the e1000
driver in an otherwise vanilla kernel to fix something that has always worked in
Windows.

I don't see how deleting lines 845-846 of e1000_main.c with "err = -EIO; goto
err_eeprom;" and leaving the DPRINTK on the line above will cause anything bad
to happen.

If the EEPROM checksum is wrong, there will be a few possibilities:
1. The checksum mismatch causes nothing bad to happen.  This is my case, and
that of most other reports about this that I find on Google.
2. The MAC address is wrong but the card still works... again, the chances that
this causes something bad are low, but if the MAC address is unusable, it should
get caught by the next check (Invalid MAC Address).  This seems to be the case
for Tomasz.
3. The EEPROM is unusable and possibly causes strange behavior, in which case a
look at dmesg will reveal a nasty error message about the EEPROM.  Again, this
is no worse than the current problem of having no network card.
4. Maybe it is possible that reading in the EEPROM and using it incorrectly
causes hardware damage or crashes the kernel.

In the first two cases, the card will operate no worse than before.  In the
third, the card will be unusable, and the user could disable the e1000 kernel
module in the worst case.
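
The "next check" in case 2 can be pictured roughly as follows. This is a user-space sketch modeled loosely on the kernel's is_valid_ether_addr() helper, assuming the test rejects all-zero and multicast (including broadcast) addresses; mac_is_valid is a hypothetical name used only here.

```c
#include <stdint.h>

#define ETH_ALEN 6  /* bytes in an Ethernet MAC address */

/*
 * Return 1 if addr looks like a usable unicast MAC, 0 otherwise.
 * Rejects 00:00:00:00:00:00 and any address with the multicast bit
 * set, which also covers the broadcast address FF:FF:FF:FF:FF:FF.
 */
static int mac_is_valid(const uint8_t addr[ETH_ALEN])
{
    uint8_t any_bits = 0;

    for (int i = 0; i < ETH_ALEN; i++)
        any_bits |= addr[i];

    if (any_bits == 0)
        return 0;          /* all-zero address */
    if (addr[0] & 0x01)
        return 0;          /* multicast/broadcast bit set */
    return 1;
}
```

An EEPROM that reads back as all zeros or all ones would fail such a test, while the hardcoded 00:50:60:70:80:90 from the Thecus patch would pass it.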

Case 4 is the only problem to be worried about, but I don't think anything that
important is stored in the EEPROM.
It seems likely that there would already be other problems as well if it ever
gets this bad. Again, the e1000 module could be disabled.

If we are just worried about case 4, then maybe there should be a "dangerous"
config option named "Override EEPROM Check".

Jesse, in response to your comment that "it should never be able to be corrupted
unless your hardware has failed", this most certainly is not the case for me or
most others as it works fine in Linux with the patch, and the Windows e1000
driver doesn't even seem to care about the EEPROM checksum!

Comment 10 Jesse Brandeburg 2006-11-06 14:46:44 UTC

ugh, sorry for the long response, but here goes.

bugme-daemon@bugzilla.kernel.org wrote:
> ------- Additional Comments From phrh@yahoo.com  2006-11-04 15:25
> I just tried 2.6.18.1, and I am still having to make this patch to
> the e1000 driver in an otherwise vanilla kernel to fix something that
> has always worked in Windows.

Just because Windows works doesn't mean all is well; see below.

> I don't see how deleting line 845-846 of e1000_main.c with "err =
> -EIO; goto err_eeprom;" and leaving the DPRINTK on the line above
> will cause anything bad to happen.
> 
> If the EEPROM checksum is wrong, there will be a few possibilities:
> 1. The checksum mismatch causes nothing bad to happen.  This is the
> case I, and most other results about this I find on Google are in.
> 2. The MAC address is wrong but the card still works... again, the
> chances that this causes something bad are low, but if the MAC
> address is unusable, it should get caught by the next check (Invalid
> MAC Address).  This seems to be the case for Tomasz.
> 3. The EEPROM is unusable and possibly causes strange behavior, in
> which case a look at dmesg will reveal a nasty error message about
> the EEPROM.  Again, this is no worse than the current problem of
> having no network card. 
> 4. Maybe it is possible that reading in the EEPROM and using it
> incorrectly causes hardware damage or crashes the kernel.
> 
> In the first two cases, the card will operate no worse than before. 
> In the third, the card will be unusable, and the user could disable
> the e1000 kernel module in the worst case.
> 
> Case 4 is the only problem to be worried about, but I don't think
> anything that important is stored in the EEPROM.
> It seems likely that there would already be other problems as well if
> it ever gets this bad. Again, the e1000 module could be disabled.

This is where the argument falls apart.  On the PCI/PCI-X (and some
PCIe) versions of e1000 the phy for the most part initializes itself
correctly.  On the 82547, and 82571/2, and 82562 chips the eeprom
configures more than just the MAC, it also configures the PHY, including
transmit power settings and PCI express link settings, and possibly the
firmware for the management engine.  Especially these versions of
hardware MUST have an eeprom which is configured correctly to be
reliable and stable.  Worst case is that a marginal quality link partner
is actually ruined (I'm not sure if this really can happen or not); best
case (for screwed-up eeprom PHY settings) is that we wouldn't pass
FCC, but that things would still work.  The eeprom also configures pins
on our part to be input/output; if these are configured wrong, lots of
different things could happen, some as simple as LEDs not working,
others as bad as never getting link up messages from fiber cards.

> If we are just worried about case 4, then maybe there should be a
> "dangerous" config option named "Override EEPROM Check".

This would probably be safe, but we are in between a rock and a hard
place here.  We actually do have to pay money out to repair/replace
motherboards that are returned to Intel under warranty.  

> Jesse, in response to your comment that "it should never be able to
> be corrupted unless your hardware has failed", this most certainly is
> not the case for me or most others as it works fine in Linux with the
> patch, and the Windows e1000 driver doesn't even seem to care about
> the EEPROM checksum! 

Windows (driver) doesn't care about the checksum because they have to
try to decrease driver load time due to crazy Microsoft constraints.  If
you run the PROSet/Control panel application from Intel and run the
eeprom test (maybe as part of the self test?), it should fail with an
eeprom checksum failure.

If it passes then this gets even weirder.

If you would like to update your eeprom checksum (at your own risk) you
can use ethtool -E and program word 5 (byte offsets 10 and 11, decimal,
zero-based) to anything you want.  ethtool -i will show the binary
coded decimal version you programmed into the eeprom, but it is not used
for anything else.

****** Patrick, in your very specific case I was able to validate that
your eeprom looks okay for all the important bytes.  You can try
updating the checksum using ethtool -E as mentioned above and things
should be okay.  This will not apply to everyone.
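
As a rough illustration of the arithmetic behind that suggestion: a sketch, assuming the check sums words 0x00 through 0x3F to 0xBABA and that the EEPROM image is laid out low byte first. Both helper names are hypothetical, and nothing here talks to real hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define EEPROM_WORDS 64      /* assumption: words 0x00..0x3F are summed */
#define EEPROM_SUM   0xBABA  /* assumption: required total */

/*
 * Value that word idx must hold so that all words sum to EEPROM_SUM
 * (modulo 2^16), leaving every other word untouched.
 */
static uint16_t eeprom_fixup_word(const uint16_t words[EEPROM_WORDS],
                                  size_t idx)
{
    uint16_t others = 0;

    for (size_t i = 0; i < EEPROM_WORDS; i++)
        if (i != idx)
            others = (uint16_t)(others + words[i]);
    return (uint16_t)(EEPROM_SUM - others);
}

/*
 * Split a word into the bytes for offsets 2*idx and 2*idx + 1,
 * assuming little-endian layout (low byte at the lower offset).
 */
static void word_to_bytes(uint16_t word, uint8_t out[2])
{
    out[0] = (uint8_t)(word & 0xFF);
    out[1] = (uint8_t)(word >> 8);
}
```

With idx = 5 this gives the replacement for the version word, and the two bytes are what would go to offsets 10 and 11. For an image that sums to BABD, the result is simply the old word 5 minus 3.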

Comment 11 Jesse Brandeburg 2006-12-18 14:58:40 UTC

patrick? any response?

Comment 12 Adrian Bunk 2007-02-20 11:54:53 UTC

Please reopen this bug if it's still present with kernel 2.6.20.