Bug 11382 - e1000e: 2.6.27-rc1 corrupts EEPROM/NVM
Status: CLOSED CODE_FIX
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assigned To: Jesse Brandeburg
Depends on:
Blocks: Regressions-2.6.26
Reported: 2008-08-20 06:20 UTC by Rafael J. Wysocki
Modified: 2008-10-23 14:17 UTC
CC List: 29 users

See Also:
Kernel Version: 2.6.27-rc1
Tree: Mainline
Regression: Yes


Attachments
eeprom dump from P5E-VM DO (14.28 KB, text/plain)
2008-09-26 18:36 UTC, Gabriel C
patch to avoid nvm corruption (7.68 KB, patch)
2008-10-02 12:32 UTC, Jesse Brandeburg

Description Rafael J. Wysocki 2008-08-20 06:20:35 UTC
Subject    : e1000e: 2.6.27-rc1 corrupts EEPROM/NVM
Submitter  : David Vrabel <david.vrabel@csr.com>
Date       : 2008-08-08 10:47
References : http://marc.info/?l=linux-kernel&m=121819267211679&w=4

This entry is being used for tracking a regression from 2.6.26.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Pierre Ossman 2008-08-26 08:45:16 UTC
I sent the following to e1000-devel, but it didn't show up in any archive so I guess the list is subscribers-only (yet not marked as such in MAINTAINERS):

From: Pierre Ossman
Date: Sun, 24 Aug 2008 00:35:36 +0200

I've just noticed that the e1000e has delightfully made poo poo all
over my EEPROM (something David Vrabel also has reported). Shit happens
and all that I guess, but how do I get the thing back in a working
order? Couldn't find anything useful on the interwebs...

Rgds
-- 
     -- Pierre Ossman

  Linux kernel, MMC maintainer        http://www.kernel.org
  rdesktop, core developer          http://www.rdesktop.org

  WARNING: This correspondence is being monitored by the
  Swedish government. Make sure your server uses encryption
  for SMTP traffic and consider using PGP for end-to-end
  encryption.

There is one reply that might reveal some of the following thread:

http://marc.info/?t=121969255400001&r=1&w=2
Comment 2 Yanko Kaneti 2008-08-31 03:45:34 UTC
I follow Fedora Rawhide, and I guess somewhere around the rc1 timeframe a Q35 integrated e1000 on an ASUS motherboard (P5E-VM DO) stopped working because of an invalid checksum. I managed to bring it back to life with ibautil.exe from the Intel Boot Agent utilities download http://downloadcenter.intel.com/Detail_Desc.aspx?ProductID=412&DwnldID=8242&lang=eng , with a broken MAC (88:88:88:88:8E:88 or something like that), but working.

This week (Fedora kernels around rc4) this happened again. When I tried to revive it using ibautil.exe, it got corrupted enough to no longer be enumerated. It would be nice if there was a way to bring that port back to life, because I really don't think anything electrical has happened to it.
Comment 3 Pierre Ossman 2008-08-31 04:29:32 UTC
Right, the same thing happened here when I ran the ibautil program. So until this is figured out, I suggest people stay clear of that tool.
Comment 4 Chris Jones 2008-09-01 02:28:29 UTC
The ibautil tool seems to be careless about what it does, and for users on laptops it's not even supposed to be used at all (but it cheerfully runs anyway and invalidates your firmware to the point where the device won't enumerate).

See http://www.mail-archive.com/e1000-devel@lists.sourceforge.net/msg00398.html

I have had this happen to me on a Thinkpad X300 which will be shipped back to Lenovo tomorrow for repair and I am very keen to see that this does not happen again, or to other people.

What can be done to prevent it? There seem to be very few use cases which require the ability for a LAN driver to write to its own firmware, so would it perhaps be wise to simply remove the ability of the driver to do this? Even if it is controlled by a module option, the functions will still be in the kernel to write to the firmware and a bug elsewhere could presumably lead to the writing code being executed with bogus data?
Comment 5 Rafael J. Wysocki 2008-09-02 14:10:38 UTC
Handled-By : Christopher Li <chrisl@vmware.com>
Patch : http://marc.info/?l=linux-mm-commits&m=122038324200305&w=4
Comment 6 Jesse Brandeburg 2008-09-02 14:44:15 UTC
Rafael, where did this patch come from?  I don't see it on any mailing list, not to mention netdev@vger wasn't CC'd, I'm replying to the list now.  I don't think the patch is technically correct nor is it going to fix the actual issue, but would be interested to see if it did somehow.
Comment 7 Rafael J. Wysocki 2008-09-03 08:28:34 UTC
On Tuesday, 2 of September 2008, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=11382
> 
> 
> jesse.brandeburg@intel.com changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |bruce.w.allan@intel.com
> 
> 
> 
> 
> ------- Comment #6 from jesse.brandeburg@intel.com  2008-09-02 14:44 -------
> Rafael, where did this patch come from?  I don't see it on any mailing list,
> not to mention netdev@vger wasn't CC'd, I'm replying to the list now.  I don't
> think the patch is technically correct nor is it going to fix the actual issue,
> but would be interested to see if it did somehow.

Well, I spotted the message on mm-commits and since Andrew had picked that
patch up, I assumed it was confirmed to work at least.

Comment 8 Pierre Ossman 2008-09-03 09:21:46 UTC
As soon as a way of restoring the device is presented, I'll gladly help out confirming any and all patches. :)
Comment 9 Chris Jones 2008-09-03 09:25:34 UTC
Pierre raises a good point - how feasible are dump/restore tools? I encountered this while testing on a machine that I really can't afford to keep RMAing, so I won't be able to do any more tests without a reasonable expectation of recovery.

Also, Jesse - I've seen your post on linux-netdev about this and in it you mention that you think this is something other than the e1000e driver, because the chip's NVRAM is part of a larger storage area. If that were the case, would I not have seen other problems as a result? e.g. my BIOS settings being lost, or.... (I have no idea what else would be stored in said EEPROM, I'm just thinking out loud).
Comment 11 Chris Jones 2008-09-20 15:36:42 UTC
Rafael: I don't believe that is a fix for this bug - this is about e1000e and that is a commit to e1000?
I don't appear to be able to re-open this bug, but I believe it should be.
Comment 12 Rafael J. Wysocki 2008-09-20 16:09:24 UTC
OK, reopened.

Ignore-Patch : http://marc.info/?l=linux-mm-commits&m=122038324200305&w=4
Comment 13 Jesse Brandeburg 2008-09-22 17:21:39 UTC
More reports of this bug are coming in, so I am raising its priority internally so that we can have a dedicated person working on it.

Many users seem to experience a graphics panic just before they reboot and have this problem.

All users that I have seen with the original issue of e1000e reporting an incorrect checksum seem to be running 2.6.27-something.

Some users can only read 0xff out of their NVM, others are able to see valid data.  

Due to differences in the images from each system it is difficult for us to provide a tool that will "fix" any broken system.
Comment 14 Hanno Boeck 2008-09-23 05:52:08 UTC
I saw this issue popping up on a number of news pages, though I find the available information not very satisfying. Maybe it would be good if the kernel devs could send out some information on WHO is endangered.

That means especially which kernel versions are dangerous (is rc7 fixed? There's something in the changelog). ATM I'm on WLAN and I don't dare to use my wired e1000e (I removed the module), though I'd like to change that again ASAP.
Comment 15 Stefan Richter 2008-09-23 06:07:38 UTC
Does the driver make the EEPROM accessible as an MMIO area?  This is alluded to
in the bug description at the top of
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555 and its reference
[0] to the OpenBSD driver.
Comment 16 Stefan Richter 2008-09-23 06:24:53 UTC
Re comment #15:
Looks like the answer is that there are memory mapped control registers which control the flashing of the nonvolatile memory.
Comment 17 Jesse Brandeburg 2008-09-23 10:12:37 UTC
If you are going to test for this bug, or haven't seen it yet on your ich8/9
system, I strongly recommend that you RIGHT NOW run:

    ethtool -e ethX > savemyeep.txt

Having a saved copy of your eeprom means we can help you write it back to your
system.
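
A saved dump can also be sanity-checked offline. The following is a minimal Python sketch, not an Intel tool. It assumes the dump has the usual `ethtool -e` text layout (each data line is an `0x....:` offset followed by hex bytes) and that the e1000-family convention applies, where the first 64 16-bit little-endian words of the NVM sum to 0xBABA. The function names are made up for illustration.

```python
import re

NVM_CHECKSUM_WORDS = 0x40  # words 0x00..0x3F take part in the checksum
NVM_SUM = 0xBABA           # expected 16-bit sum of those words

# Assumed line layout: "0x0010:   de ad be ef ..." (headers are skipped).
_DUMP_LINE = re.compile(r"^\s*0x([0-9a-fA-F]+):?\s+((?:[0-9a-fA-F]{2}\s*)+)$")


def parse_ethtool_dump(text):
    """Turn the text of an `ethtool -e` dump into raw bytes."""
    data = bytearray()
    for line in text.splitlines():
        match = _DUMP_LINE.match(line)
        if match is None:
            continue  # header, separator, or blank line
        offset = int(match.group(1), 16)
        if offset != len(data):
            raise ValueError("gap or overlap at offset 0x%04x" % offset)
        data.extend(bytes.fromhex("".join(match.group(2).split())))
    return bytes(data)


def nvm_status(data):
    """Classify a dump as 'short', 'erased', 'ok' or 'bad-checksum'."""
    size = 2 * NVM_CHECKSUM_WORDS
    if len(data) < size:
        return "short"
    region = data[:size]
    if all(byte == 0xFF for byte in region):
        return "erased"  # only 0xff can be read back
    total = sum(
        int.from_bytes(region[i:i + 2], "little") for i in range(0, size, 2)
    ) & 0xFFFF
    return "ok" if total == NVM_SUM else "bad-checksum"
```

For example, `nvm_status(parse_ethtool_dump(open("savemyeep.txt").read()))` should report "ok" for a healthy image. A result of "erased" corresponds to the systems that can only read 0xff out of their NVM.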
Comment 18 David S. Miller 2008-09-23 13:59:23 UTC
Are people really running vanilla 2.6.27-rcX when they hit this bug
or are most people running a Fedora or Opensuse kernel?

Also, what X server are people running when they trigger this corruption?

My hunch is that the X server or the GEM kernel patches might
be the culprit, as that's the kind of pattern that is starting to look
believable.

Comment 19 Chris Jones 2008-09-23 14:18:28 UTC
David: I hit it running Ubuntu Intrepid (which is currently at Alpha stage). It's not a completely vanilla kernel, but there are no GEM kernel patches.
It's a fairly recent Xorg and Intel graphics driver (my laptop is i965).

Sorry this is a bit vague, but I don't know exactly which versions were installed at the time. If it's particularly important I might be able to backtrack through logs to figure it out, but it would take a bit of time.
Comment 20 Chris Snook 2008-09-23 15:57:01 UTC
For the benefit of the panicking masses (me) can you tell us if there's anything you're certain is *not* impacted?

From everything I've read, it seems to be confined to ICH8* and ICH9* LOMs, with a 2.6.27-rc kernel.

Have we ruled out the possibility of this impacting add-on cards, older/newer ICH chipsets, ESB chipsets, and older kernels running bleeding-edge X servers?
Comment 21 Adam Williamson 2008-09-23 16:10:46 UTC
If anyone sees anything inaccurate or overly broad in http://blog.mandriva.com/2008/09/23/urgent-notification-major-bug-in-all-mandriva-linux-2009-pre-releases/ , please let me know. It seems this is truly restricted to 2.6.27, so I'll probably remove the references to 2.6.25 and 2.6.26 (we added them because some Fedora bugs appeared to point to the problem cropping up there, but it seems this was not actually the case).

We haven't actually had any live reports of this happening on Mandriva yet, but we decided to be cautious since it seems to have affected both SUSE and Fedora. I'll pass on the reports if we do get any.
Comment 22 Stefan Richter 2008-09-24 03:02:14 UTC
Re my comment #16:
The recently posted proposed patches suggest that MMIO writes have more direct consequences on ICH8 and ICH9 than I assumed.
Comment 23 Renato S. Yamane 2008-09-24 05:48:26 UTC
Has anyone tried these patches from Jeff Kirsher?
http://lkml.org/lkml/2008/9/23/427
http://lkml.org/lkml/2008/9/23/431
http://lkml.org/lkml/2008/9/23/432

Best regards,
Renato
Comment 24 Herton Ronaldo Krzesinski 2008-09-25 10:50:43 UTC
We got a report of this happening with e100 too, but in this case it corrupts only the PXE ROM (https://qa.mandriva.com/show_bug.cgi?id=44192). Not sure it is the same problem.

Summarizing the report:
* Affected laptops: Samsung X15, Sony Vaio PCG-V505AP
* One BIOS reflash (not told on which laptop) made it work again, but the latest BIOS reflash didn't help and PXE stopped working. The Ethernet interface still works; only PXE boot is affected.
* Looks related to 2.6.27 kernels (the latest we are using is 2.6.27-rc7-git1). But we don't have any useful test case or anything to help debugging :/. It also doesn't seem related to a gfx crash in this case.
Comment 25 Jiri Kosina 2008-09-25 11:11:10 UTC
(In reply to comment #24)
> We got a report of this happening with e100 too, but in this case it corrupts
> only PXE rom (https://qa.mandriva.com/show_bug.cgi?id=44192). Not sure it is the
> same problem.

Do we know what X server and graphics card this user has?
Comment 26 Herton Ronaldo Krzesinski 2008-09-25 11:31:37 UTC
(In reply to comment #25)
> Do we know what X server and graphics card this user has?

Samsung should have an Intel 855GM but Vaio a 845MP chipset + ATI video ('ATI
Mobility Radeon') looking at product specs, I'll request the lspci info from
the reporter.

The X server is at 1.4.x (1.4.2 at the moment).
Comment 27 Gabriel C 2008-09-26 04:57:36 UTC
(In reply to comment #2)
> I follow fedora rawhide and I guess somewhere around the rc1 timeframe had a
> Q35 integrated e1000 on an ASUS motherboard (P5E-VM DO) stop working because of
> an invalid checksum. Managed to bring it back to life with the ibautil.exe from
> the Intel Boot Agent utilities download
> http://downloadcenter.intel.com/Detail_Desc.aspx?ProductID=412&DwnldID=8242&lang=eng
>  with a broken mac (88:88:88:88:8E:88 or something like that), but working.
> 
> This week (fedora kernels around rc4) this happened again. When I tried to
> revive it using ibautil.exe it got corrupt enough to no longer be enumerated.
> Would be nice if there was a way to bring that port back to life, because I
> really don't think anything electrical has happened to it.
> 

Hmm, I have the same motherboard and I cannot see anything like this with a vanilla kernel using
Frugalware Linux 0.9-devel.

Comment 28 Herton Ronaldo Krzesinski 2008-09-26 05:44:28 UTC
(In reply to comment #26)
> Samsung should have an Intel 855GM but Vaio a 845MP chipset + ATI video ('ATI
> Mobility Radeon') looking at product specs, I'll request the lspci info from
> the reporter.

Got the lspci output from the machines, and it confirms the specs above. The lspci output can be seen on the original ticket in case it could be useful; I didn't find any notable pattern.
Comment 29 Yanko Kaneti 2008-09-26 11:16:55 UTC
(In reply to comment #27)
> (In reply to comment #2)
> > I follow fedora rawhide and I guess somewhere around the rc1 timeframe had a
> > Q35 integrated e1000 on an ASUS motherboard (P5E-VM DO) stop working because
> > of an invalid checksum.88:88:88:8E:88 or something like that), but working.
... 
> > This week (fedora kernels around rc4) this happened again. When I tried to
> > revive it using ibautil.exe it got corrupt enough to no longer be
> > enumerated.
> > 
> 
> Hmm , I have the same motherboard and I cannot see something like this with an
> vanilla kernel using Frugalware Linux 0.9-devel.

Could you, please, attach here or mail me the output of ethtool -e eth0
I think I won't be able to do much with it because the device is no longer on the bus, but perhaps the clued people here could help with that.

Also, when you say "vanilla kernel" it's unclear which particular version.

TIA.
Comment 30 Gabriel C 2008-09-26 18:33:43 UTC
(In reply to comment #29)
> (In reply to comment #27)
> > (In reply to comment #2)
> > > I follow fedora rawhide and I guess somewhere around the rc1 timeframe had a
> > > Q35 integrated e1000 on an ASUS motherboard (P5E-VM DO) stop working because
> > > of an invalid checksum.88:88:88:8E:88 or something like that), but working.
> ... 
> > > This week (fedora kernels around rc4) this happened again. When I tried to
> > > revive it using ibautil.exe it got corrupt enough to no longer be
> > > enumerated.
> > > 
> > 
> > Hmm , I have the same motherboard and I cannot see something like this with an
> > vanilla kernel using Frugalware Linux 0.9-devel.
> 
> Could you, please, attach here or mail me the output of ethtool -e eth0
> I think I won't be able to do much with it because the device is no longer on
> the bus, but perhaps the clued people here could help with that.

Sure , attached.

> 
> Also when you say "vanilla kernel" its unclear which particular version.

I've tested all 2.6.27-gitX/-rcX kernels on that box, as well as the latest tip/master and linux-next kernels.
 

Comment 31 Gabriel C 2008-09-26 18:36:10 UTC
Created attachment 18063 [details]
eeprom dump from P5E-VM DO
Comment 32 Herton Ronaldo Krzesinski 2008-09-30 13:16:39 UTC
(In reply to comment #24)
> We got a report of this happening with e100 too, but in this case it corrupts
> only PXE rom (https://qa.mandriva.com/show_bug.cgi?id=44192). Not sure it is the
> same problem.

Hi, please disregard the problem with e100, it was a hardware issue with the switch being used, sorry for the noise.
Comment 33 Gustavo De Nardin (spuk) 2008-10-01 10:39:51 UTC
@intel guys, I see http://lkml.org/lkml/2008/9/29/395 ("[RFC PATCH 11/12] e1000e: write protect ICHx NVM to prevent malicious write/erase"). I believe you are internally doing some good testing on this (since you'd be the ones with more resources and less risks)? Would you ping here when you're very sure the card won't get "bricked", if possible? Thanks.
Comment 34 Gustavo De Nardin (spuk) 2008-10-02 12:19:24 UTC
I'm not aware of how many people were actually affected by this, but maybe it would help to start a little "database" of affected systems, and ask other people to donate EEPROM dumps from similar working systems, in case it becomes possible to revive the bricked Ethernet ports.

Even if there would be no security issues with this, more than 1 dump per system could be useful I guess (both for verification and for different hardware revisions etc., possibly).
Comment 35 Jesse Brandeburg 2008-10-02 12:32:02 UTC
Created attachment 18139 [details]
patch to avoid nvm corruption

This patch should prevent any further NVM corruptions by locking the NVM and locking those registers that lock the NVM, until the next reboot.  This of course only happens once the driver is loaded.
Comment 37 Michael Fritscher 2008-10-14 09:49:00 UTC
I fixed my x61t with the Vista drivers from intel.com (11-12 MB). I just installed them and played with the diagnostic tools in the driver.

Both PXE and normal networking under Linux are working again :-)
A BIOS upgrade alone did not help (actually, it was a downgrade^^)

Please write something about this solution in the error message about the wrong CRC; it would help many people.

I am reopening the bug because of this, though it is more to help people unbrick their hardware.

Thanks,
Michael Fritscher
Comment 38 Jesse Brandeburg 2008-10-17 13:58:59 UTC
In the interest of adding some closure to this bug: the issue turns out to have never been the e1000e driver's fault.  The fault lies with the CONFIG_DYNAMIC_FTRACE option.  Specifically, when the ftrace code was enabled, it was doing a locked cmpxchg instruction on memory that had previously been used as __init code by some other module.

a) some other module loads
b) that module's init code calls into ftrace which stores the EIP
c) that module discards its init code
d) e1000e loads
e) e1000e asks the kernel for memory to ioremap onto, and gets the memory location of the code at b) and maps the flash/NVM control registers there.
f) ftraced runs and rewrites onto bytes 4-8 of the memory location from b/e
g) since the lock/cmpxchg instruction is undefined for memory mapped registers, random junk is written to the b/e location
h) depending on the contents of the junk in g) the NVM is either byte corrupted or block erased, which is detected the next time the e1000e driver is loaded.

A short-term workaround is in 2.6.27.1 (disable CONFIG_DYNAMIC_FTRACE), and the longer-term fix is a rewrite of the cmpxchg code (which is already done and will be in 2.6.28-rc1).
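
The sequence a) to h) can be replayed in a toy model. This is only a Python sketch of the reasoning, not kernel code: the `ToyKernel` class, its method names, the example address, and the `forget_records` flag are all hypothetical. The flag merely shows that the stale recorded address is what makes the later write land on the NVM control registers; it is not a description of how the kernel was actually fixed.

```python
class ToyKernel:
    """Toy address-ownership model; none of this is real kernel API."""

    def __init__(self):
        self.owner = {}         # address -> current owner of that address
        self.recorded = []      # call sites ftrace plans to patch later
        self.stray_writes = []  # patches that landed on NVM control registers

    def load_module(self, name, init_addr):
        # a) + b): the module's init code runs and ftrace records its address
        self.owner[init_addr] = name + ":init"
        self.recorded.append(init_addr)

    def free_init(self, init_addr, forget_records=False):
        # c): the init code is discarded, so the address becomes reusable
        del self.owner[init_addr]
        if forget_records:
            # hypothetical guard: drop records that point into freed memory
            self.recorded = [a for a in self.recorded if a != init_addr]

    def ioremap_nvm(self, addr):
        # d) + e): e1000e maps the flash/NVM control registers at a free address
        if addr in self.owner:
            raise RuntimeError("address still in use")
        self.owner[addr] = "e1000e:nvm-regs"

    def ftraced_run(self):
        # f) + g): patch every recorded site without checking what lives there
        for addr in self.recorded:
            if self.owner.get(addr) == "e1000e:nvm-regs":
                self.stray_writes.append(addr)
        self.recorded = []


def replay(forget_records):
    """Run steps a) to g) once and return the addresses hit by stray writes."""
    kernel = ToyKernel()
    addr = 0xF8000000  # arbitrary illustrative address
    kernel.load_module("some_module", addr)
    kernel.free_init(addr, forget_records=forget_records)
    kernel.ioremap_nvm(addr)
    kernel.ftraced_run()
    return kernel.stray_writes
```

With `replay(False)` the stale record survives the free, so the later patch hits the remapped registers; with `replay(True)` the record is gone and nothing is written.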
Comment 39 Pierre Ossman 2008-10-18 00:17:56 UTC
I don't suppose you've found a way to bring back the cards from the dead?
Comment 40 Stefan Richter 2008-10-18 09:17:49 UTC
Re comment #39: http://lkml.org/lkml/2008/10/17/214
Comment 41 Pierre Ossman 2008-10-18 10:08:14 UTC
Thanks, I'll poke Karsten and see if he can sort out my machine.
Comment 42 Pierre Ossman 2008-10-23 14:17:53 UTC
In case anyone finds this bug while looking for a way to repair their card, Karsten was able to help me fully restore my NVM (via an image from an identical machine).

It should also be noted that I was in the state where the device had "fallen off the bus" (it turned out that the device was still there, but reporting a bad vendor and device ID as a result of the broken NVM data, causing Linux and Windows to ignore the device). So there should be hope for everyone. :)
