Bug 6675
| Summary: | e1000: eeprom checksum error should not be fatal | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Patrick Horn (phrh) |
| Component: | Network | Assignee: | Jesse Brandeburg (jesse.brandeburg) |
| Status: | REJECTED INSUFFICIENT_DATA | | |
| Severity: | normal | CC: | auke-jan.h.kok, bunk, jesse.brandeburg, rl03, sanjoy, sebastian |
| Priority: | P2 | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.17-rc6 | Subsystem: | |
| Regression: | --- | Bisected commit-id: | |
| Attachments: | ethtool -e and lspci output | | |
Description
Patrick Horn
2006-06-10 17:20:21 UTC
This error seems to stay in the EEPROM after any number of reboots or power-downs. All it seems to have taken is one suspend, and now the EEPROM on my network card is permanently changed. I am now unable to get network support in any kernel or on LiveCDs, as they all report this same error.

Another possibility would be to have a utility to rewrite the EEPROM checksum. But it seems that this would be at least as dangerous as simply ignoring the error, and I would prefer a permanent kernel change over having to modify an EEPROM each time this happens.

A couple of comments:

First, we need some information about your hardware: lspci, driver version, ethtool -e output.

Second, our position is that if the eeprom has a bad checksum we won't load the driver, because we can't depend on the configuration of the hardware to be correct. The e1000 hardware relies on the eeprom to configure many parameters of the hardware. We will not implement a driver change that will override the eeprom check. In general the driver never writes to the eeprom, so it should never be able to be corrupted unless your hardware has failed. You should be working with your motherboard or NIC vendor (it might be Intel) to determine if your hardware has failed.

And lastly, it is likely that if you have suspended the system, the LAN adapter is in D3 after you've resumed (it's not clear from your description if you resumed or rebooted), and e1000_suspend should wake it up to D0 and reinitialize everything.

Created attachment 8357
ethtool -e and lspci output
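The check discussed above can be sketched roughly as follows. This is a minimal illustration, not the driver's code: it assumes the checksum covers the first 64 16-bit EEPROM words and that their wrapping sum must equal 0xBABA (the "expected" value the driver message later in this thread reports). The names `eeprom_sum` and `eeprom_checksum_ok` are made up for this example.

```c
#include <stdint.h>

#define EEPROM_WORDS 64      /* assumption: words 0x00..0x3F are summed */
#define EEPROM_SUM   0xBABA  /* value the 16-bit sum is expected to equal */

/* Return the 16-bit wrapping sum of the first EEPROM_WORDS words. */
uint16_t eeprom_sum(const uint16_t *words)
{
    uint16_t sum = 0;
    for (int i = 0; i < EEPROM_WORDS; i++)
        sum = (uint16_t)(sum + words[i]);
    return sum;
}

/* Non-zero when the image passes the checksum test, zero otherwise. */
int eeprom_checksum_ok(const uint16_t *words)
{
    return eeprom_sum(words) == EEPROM_SUM;
}
```

Because the test is a plain sum, a change of a few counts in any single word is enough to make it fail, and the sum by itself does not say which word changed.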
This is definitely a permanent change in the EEPROM (I have tried shutting off and rebooting, and trying a different kernel). This problem seems to happen for many other people as well, given the number of Google hits.

Suspending seemed to freeze my computer, so I had no choice but to power off and on, which means that it probably didn't cause this unless suspending changed the eeprom.

I tried using Windows, and my onboard e1000 works perfectly fine there; however the Linux e1000 driver still gives the EEPROM error even after rebooting from Windows. This means that the error can be safely ignored without any ill effects.

pts/3:~>sudo ethtool -t eth0 offline
The test result is FAIL
The test extra info:
Register test (offline) 0
Eeprom test (offline) 2
Interrupt test (offline) 0
Loopback test (offline) 0
Link test (on/offline) 0

The error code 2 (from drivers/net/e1000/e1000_ethtool.c) is invalid checksum. I added a line in DMESG to print out the expected checksums:

[17183163.480000] e1000: eth0: e1000_eeprom_test: Failed EEPROM Checksum (Error 2): got BABD, expected BABA [Added by Patrick]

As you can see, the checksum is off by a mere 3, which shows that very little data actually changed, possibly related to a BIOS setting that I may have changed at the time.

It would be nice to either have some way to safely rewrite the eeprom, or else to be able to override the check once you have determined that nothing has actually failed.

Same here on my new ThinkPad X60s. Applying the patch makes the device/driver work without problems.

I get the same error on my TP T60 in 2.6.18-rc1. The error comes and goes. With Ubuntu's 2.6.15-25-386 kernel, it showed up a few times early on. Those times I was able to reboot into Windows, reboot back into Linux, and it worked fine. But the error has shown up again (in 2.6.18-rc1 and Ubuntu's kernel) and I've long since erased Windows, so that trick doesn't help.
Some reports found via google said that powering off and taking out the battery fixed it for them, but I haven't tried that yet.

Based on a suggestion at <http://forum.thinkpads.com/viewtopic.php?t=23776>, I tried

modprobe -r e1000
/* plug in ethernet cable */
modprobe e1000

And now it works fine, even though I didn't reboot. So somehow the driver gets confused if the cable isn't connected when it's loaded?

It looks like some vendors ship a "crippled" version of the e1000 network card with their hardware. This is the case, for example, with the Thecus n4100: with a vanilla kernel, it is impossible to use this card ("The EEPROM Checksum Is Not Valid"). Commenting out the EEPROM checksum check doesn't help; the module will then complain about a broken MAC.

A quick investigation shows that the Thecus n4100's e1000 network cards don't contain an EEPROM in which the MAC is stored. The MAC is hardcoded in the kernel module, and a diff between the vanilla 2.6.9 code and the 2.6.9 Thecus code shows these changes (note the hardcoded 00:50:60:70:80:90 MAC):

diff -ur linux-2.6.9/drivers/net/e1000/e1000_hw.c linux/drivers/net/e1000/e1000_hw.c
--- linux-2.6.9/drivers/net/e1000/e1000_hw.c 2004-10-18 23:55:28.000000000 +0200
+++ linux/drivers/net/e1000/e1000_hw.c 2005-09-16 09:54:15.000000000 +0200
@@ -3814,11 +3814,10 @@
 int32_t
 e1000_read_mac_addr(struct e1000_hw * hw)
 {
-    uint16_t offset;
-    uint16_t eeprom_data, i;
+    uint16_t i;
 
     DEBUGFUNC("e1000_read_mac_addr");
-
+/*
     for(i = 0; i < NODE_ADDRESS_SIZE; i += 2) {
         offset = i >> 1;
         if(e1000_read_eeprom(hw, offset, 1, &eeprom_data) < 0) {
@@ -3831,7 +3830,13 @@
     if(((hw->mac_type == e1000_82546) || (hw->mac_type == e1000_82546_rev_3)) &&
        (E1000_READ_REG(hw, STATUS) & E1000_STATUS_FUNC_1))
         hw->perm_mac_addr[5] ^= 0x01;
-
+*/
+    hw->perm_mac_addr[0]=0x00;
+    hw->perm_mac_addr[1]=0x50;
+    hw->perm_mac_addr[2]=0x60;
+    hw->perm_mac_addr[3]=0x70;
+    hw->perm_mac_addr[4]=0x80;
+    hw->perm_mac_addr[5]=0x90;
     for(i = 0; i < NODE_ADDRESS_SIZE; i++)
         hw->mac_addr[i] = hw->perm_mac_addr[i];
     return E1000_SUCCESS;
--- linux-2.6.9/drivers/net/e1000/e1000_main.c 2004-10-18 23:53:50.000000000 +0200
+++ linux/drivers/net/e1000/e1000_main.c 2006-02-21 10:08:44.000000000 +0100
@@ -58,6 +58,7 @@
  * Macro expands to...
  * {PCI_DEVICE(PCI_VENDOR_ID_INTEL, device_id)}
  */
+
 static struct pci_device_id e1000_pci_tbl[] = {
     INTEL_E1000_ETHERNET_DEVICE(0x1000),
     INTEL_E1000_ETHERNET_DEVICE(0x1001),
@@ -386,6 +387,7 @@
     int pci_using_dac;
     int i;
     int err;
+    uint8_t lattimer = 0x45;
     uint16_t eeprom_data;
 
     if((err = pci_enable_device(pdev)))
@@ -436,6 +438,8 @@
         goto err_ioremap;
     }
 
+    pci_write_config_byte(pdev,PCI_LATENCY_TIMER,lattimer);
+
     for(i = BAR_1; i <= BAR_5; i++) {
         if(pci_resource_len(pdev, i) == 0)
             continue;
@@ -510,13 +514,13 @@
     e1000_reset_hw(&adapter->hw);
 
     /* make sure the EEPROM is good */
-
+/*
     if(e1000_validate_eeprom_checksum(&adapter->hw) < 0) {
         DPRINTK(PROBE, ERR, "The EEPROM Checksum Is Not Valid\n");
         err = -EIO;
         goto err_eeprom;
     }
-
+*/
     /* copy the MAC address out of the EEPROM */
 
     if (e1000_read_mac_addr(&adapter->hw))

I just tried 2.6.18.1, and I am still having to make this patch to the e1000 driver in an otherwise vanilla kernel to fix something that has always worked in Windows.

I don't see how deleting line 845-846 of e1000_main.c with "err = -EIO; goto err_eeprom;" and leaving the DPRINTK on the line above will cause anything bad to happen.

If the EEPROM checksum is wrong, there will be a few possibilities:
1. The checksum mismatch causes nothing bad to happen. This is the case I, and most other results about this I find on Google are in.
2. The MAC address is wrong but the card still works... again, the chances that this causes something bad are low, but if the MAC address is unusable, it should get caught by the next check (Invalid MAC Address). This seems to be the case for Tomasz.
3. The EEPROM is unusable and possibly causes strange behavior, in which case a look at dmesg will reveal a nasty error message about the EEPROM.
Again, this is no worse than the current problem of having no network card.
4. Maybe it is possible that reading in the EEPROM and using it incorrectly causes hardware damage or crashes the kernel.

In the first two cases, the card will operate no worse than before. In the third, the card will be unusable, and the user could disable the e1000 kernel module in the worst case.

Case 4 is the only problem to be worried about, but I don't think anything that important is stored in the EEPROM. It seems likely that there would already be other problems as well if it ever gets this bad. Again, the e1000 module could be disabled.

If we are just worried about case 4, then maybe there should be a "dangerous" config option named "Override EEPROM Check".

Jesse, in response to your comment that "it should never be able to be corrupted unless your hardware has failed", this most certainly is not the case for me or most others as it works fine in Linux with the patch, and the Windows e1000 driver doesn't even seem to care about the EEPROM checksum!

ugh, sorry for the long response, but here goes.

bugme-daemon@bugzilla.kernel.org wrote:
> ------- Additional Comments From phrh@yahoo.com 2006-11-04 15:25
> I just tried 2.6.18.1, and I am still having to make this patch to
> the e1000 driver in an otherwise vanilla kernel to fix something that
> has always worked in Windows.

Just because windows works isn't a good thing, see below.

> I don't see how deleting line 845-846 of e1000_main.c with "err =
> -EIO; goto err_eeprom;" and leaving the DPRINTK on the line above
> will cause anything bad to happen.
>
> If the EEPROM checksum is wrong, there will be a few possibilities:
> 1. The checksum mismatch causes nothing bad to happen. This is the
> case I, and most other results about this I find on Google are in.
> 2. The MAC address is wrong but the card still works... again, the
> chances that this causes something bad are low, but if the MAC
> address is unusable, it should get caught by the next check (Invalid
> MAC Address). This seems to be the case for Tomasz.
> 3. The EEPROM is unusable and possibly causes strange behavior, in
> which case a look at dmesg will reveal a nasty error message about
> the EEPROM. Again, this is no worse than the current problem of
> having no network card.
> 4. Maybe it is possible that reading in the EEPROM and using it
> incorrectly causes hardware damage or crashes the kernel.
>
> In the first two cases, the card will operate no worse than before.
> In the third, the card will be unusable, and the user could disable
> the e1000 kernel module in the worst case.
>
> Case 4 is the only problem to be worried about, but I don't think
> anything that important is stored in the EEPROM.
> It seems likely that there would already be other problems as well if
> it ever gets this bad. Again, the e1000 module could be disabled.

This is where the argument falls apart. On the PCI/PCI-X (and some PCIe) versions of e1000 the phy for the most part initializes itself correctly. On the 82547, and 82571/2, and 82562 chips the eeprom configures more than just the MAC, it also configures the PHY, including transmit power settings and PCI express link settings, and possibly the firmware for the management engine. Especially these versions of hardware MUST have an eeprom which is configured correctly to be reliable and stable.

Worst case is that a marginal quality link partner is actually ruined (I'm not sure if this really can happen or not), best case (for a screwed up eeprom phy settings) is that we wouldn't pass FCC, but that things would still work.

The eeprom also configures pins on our part to be input/output, if these are configured wrong lots of different things could happen, some as simple as LEDs not working, others as bad as never getting link up messages from fiber cards.
> If we are just worried about case 4, then maybe there should be a
> "dangerous" config option named "Override EEPROM Check".

This would probably be safe, but we are in between a rock and a hard place here. We actually do have to pay money out to repair/replace motherboards that are returned to Intel under warranty.

> Jesse, in response to your comment that "it should never be able to
> be corrupted unless your hardware has failed", this most certainly is
> not the case for me or most others as it works fine in Linux with the
> patch, and the Windows e1000 driver doesn't even seem to care about
> the EEPROM checksum!

Windows (driver) doesn't care about the checksum because they have to try to decrease driver load time due to crazy Microsoft constraints. If you run the PROSet/Control panel application from Intel and run the eeprom test (maybe as part of the self test?), it should fail with an eeprom checksum failure. If it passes then this gets even weirder.

If you would like to update your eeprom checksum (at your own risk) you can use ethtool -E and program word 5 (bytes offset decimal 0 based count 10/11) to anything you want. ethtool -i will show the binary coded decimal version you programmed into the eeprom, but it is not used for anything else.

****** Patrick, in your very specific case I was able to validate that your eeprom looks okay for all the important bytes. You can try updating the checksum using ethtool -E as mentioned above and things should be okay. This will not apply to everyone.

patrick? any response?

Please reopen this bug if it's still present with kernel 2.6.20.
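
The repair suggested above (changing word 5 rather than the stored checksum) comes down to simple 16-bit arithmetic: lower the word by however much the reported sum exceeds the expected 0xBABA. The sketch below only illustrates that arithmetic and makes assumptions not stated in this thread: the function names are hypothetical, the current value of word 5 has to be read from your own ethtool -e dump, and the low-byte-first order for byte offsets 10/11 is an assumption to verify against that dump before writing anything.

```c
#include <stdint.h>

#define EEPROM_SUM 0xBABA  /* expected 16-bit sum, as printed in the dmesg line */

/* Given the sum the driver reported ("got") and the current value of one
 * freely changeable word, return the value that word must hold so that
 * the overall sum becomes EEPROM_SUM. Arithmetic wraps at 16 bits. */
uint16_t adjusted_word(uint16_t reported_sum, uint16_t current_word)
{
    uint16_t excess = (uint16_t)(reported_sum - EEPROM_SUM);
    return (uint16_t)(current_word - excess);
}

/* Split a word into the two bytes to program at offsets 10 and 11.
 * Assumption: the low byte sits at the lower offset. */
void word_to_bytes(uint16_t word, uint8_t out[2])
{
    out[0] = (uint8_t)(word & 0xFF);
    out[1] = (uint8_t)(word >> 8);
}
```

With the values from this report (got BABD, expected BABA) the excess is 3, so a word 5 that currently held a hypothetical 0x0103 would need to become 0x0100.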