Subject : e1000e: 2.6.27-rc1 corrupts EEPROM/NVM
Submitter : David Vrabel <david.vrabel@csr.com>
Date : 2008-08-08 10:47
References : http://marc.info/?l=linux-kernel&m=121819267211679&w=4
This entry is being used for tracking a regression from 2.6.26. Please don't close it until the problem is fixed in the mainline.
I sent the following to e1000-devel, but it didn't show up in any archive, so I guess the list is subscribers-only (yet not marked as such in MAINTAINERS):
From: Pierre Ossman
Date: Sun, 24 Aug 2008 00:35:36 +0200
I've just noticed that the e1000e has delightfully made poo poo all over my EEPROM (something David Vrabel also has reported). Shit happens and all that I guess, but how do I get the thing back in a working order? Couldn't find anything useful on the interwebs...
Rgds
--
Pierre Ossman
Linux kernel, MMC maintainer http://www.kernel.org
rdesktop, core developer http://www.rdesktop.org
There is one reply that might reveal some of the following thread: http://marc.info/?t=121969255400001&r=1&w=2
I follow Fedora rawhide, and I guess somewhere around the rc1 timeframe I had a Q35 integrated e1000 on an ASUS motherboard (P5E-VM DO) stop working because of an invalid checksum. I managed to bring it back to life with ibautil.exe from the Intel Boot Agent utilities download http://downloadcenter.intel.com/Detail_Desc.aspx?ProductID=412&DwnldID=8242&lang=eng with a broken MAC (88:88:88:88:8E:88 or something like that), but working. This week (Fedora kernels around rc4) this happened again. When I tried to revive it using ibautil.exe, it got corrupt enough to no longer be enumerated. It would be nice if there was a way to bring that port back to life, because I really don't think anything electrical has happened to it.
Right, the same thing happened here when I ran the ibautil program. So until this is figured out, I suggest people stay clear of that tool.
The ibautil tool seems to be careless about what it does, and for users on laptops it's not even supposed to be used at all (but it cheerfully runs anyway and invalidates your firmware to the point where the device won't enumerate). See http://www.mail-archive.com/e1000-devel@lists.sourceforge.net/msg00398.html
I have had this happen to me on a Thinkpad X300, which will be shipped back to Lenovo tomorrow for repair, and I am very keen to see that this does not happen again, to me or to other people. What can be done to prevent it? There seem to be very few use cases which require a LAN driver to be able to write to its own firmware, so would it perhaps be wise to simply remove the driver's ability to do this? Even if it is controlled by a module option, the functions that write to the firmware will still be in the kernel, and a bug elsewhere could presumably lead to the writing code being executed with bogus data?
Handled-By : Christopher Li <chrisl@vmware.com>
Patch : http://marc.info/?l=linux-mm-commits&m=122038324200305&w=4
Rafael, where did this patch come from? I don't see it on any mailing list, not to mention netdev@vger wasn't CC'd, I'm replying to the list now. I don't think the patch is technically correct nor is it going to fix the actual issue, but would be interested to see if it did somehow.
On Tuesday, 2 of September 2008, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=11382
>
> jesse.brandeburg@intel.com changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |bruce.w.allan@intel.com
>
> ------- Comment #6 from jesse.brandeburg@intel.com 2008-09-02 14:44 -------
> Rafael, where did this patch come from? I don't see it on any mailing list,
> not to mention netdev@vger wasn't CC'd, I'm replying to the list now. I don't
> think the patch is technically correct nor is it going to fix the actual issue,
> but would be interested to see if it did somehow.
Well, I spotted the message on mm-commits and since Andrew had picked that patch up, I assumed it was confirmed to work at least.
As soon as a way of restoring the device is presented, I'll gladly help out confirming any and all patches. :)
Pierre raises a good point - how feasible are dump/restore tools? I encountered this while testing on a machine that I really can't afford to keep RMAing, so I won't be able to do any more tests without a reasonable expectation of recovery. Also, Jesse - I've seen your post on linux-netdev about this and in it you mention that you think this is something other than the e1000e driver, because the chip's NVRAM is part of a larger storage area. If that were the case, would I not have seen other problems as a result? e.g. my BIOS settings being lost, or.... (I have no idea what else would be stored in said EEPROM, I'm just thinking out loud).
Fixed by: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=78566fecbb12a7616ae9a88b2ffbc8062c4a89e3
Rafael: I don't believe that is a fix for this bug - this is about e1000e and that is a commit to e1000? I don't appear to be able to re-open this bug, but I believe it should be.
OK, reopened.
Ignore-Patch : http://marc.info/?l=linux-mm-commits&m=122038324200305&w=4
More reports of this bug are coming in, so I am raising its priority internally so that we can have a dedicated person working on it. Many users seem to experience a graphics panic just before they reboot and then have this problem. All users that I have seen with the original issue of e1000e reporting an incorrect checksum seem to be running 2.6.27-something. Some users can only read 0xff out of their NVM; others are able to see valid data. Because the images differ from system to system, it is difficult for us to provide a tool that will "fix" any broken system.
I saw this issue popping up on a number of news pages, though I find the available information not very satisfying. Maybe it'd be good if the kernel devs could send out some information on WHO is endangered. That means especially which kernel versions are dangerous (is rc7 fixed? there's something in the changelog). ATM I'm on wlan and I don't dare to use my wired e1000e (I removed the module), though I'd like to change that again asap.
Does the driver make the EEPROM accessible as an MMIO area? This is alluded to in the bug description at the top of https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555 and its reference [0] to the OpenBSD driver.
Re comment #15: Looks like the answer is that there are memory mapped control registers which control the flashing of the nonvolatile memory.
I strongly recommend that, if you are going to test for this bug or haven't seen it yet on your ICH8/9 system, you RIGHT NOW do
    ethtool -e ethX > savemyeep.txt
Having a saved copy of your EEPROM means we can help you write it back to your system.
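A saved dump is only useful if it was taken while the NVM was still intact, so it may be worth sanity-checking the file right after saving it. The following is a minimal sketch, not an Intel tool. It assumes that `ethtool -e` prints lines of the form `0x0000:  00 11 22 ...`, that the bytes form little-endian 16-bit words, and that the first 0x40 words must sum to 0xBABA (which is how I understand the e1000/e1000e checksum validation to work). All function names are made up for this illustration.

```python
import re

NVM_SUM = 0xBABA        # assumed expected 16-bit sum (as used by e1000/e1000e)
CHECKSUM_WORDS = 0x40   # assumed: words 0x00..0x3F take part in the sum


def parse_ethtool_dump(text):
    """Return the EEPROM contents as bytes from `ethtool -e` style hex output."""
    data = bytearray()
    for line in text.splitlines():
        # Data lines look like "0x0010:  ff ff 12 34 ..."; header lines are skipped.
        match = re.match(r"\s*0x[0-9a-fA-F]+:\s*(.*)$", line)
        if match:
            data.extend(int(token, 16) for token in match.group(1).split())
    return bytes(data)


def nvm_words(data):
    """Interpret the byte stream as little-endian 16-bit words."""
    return [data[i] | (data[i + 1] << 8) for i in range(0, len(data) - 1, 2)]


def looks_erased(data):
    """True if every byte reads back as 0xff, as some affected users reported."""
    return len(data) > 0 and all(byte == 0xFF for byte in data)


def checksum_ok(data):
    """True if the first CHECKSUM_WORDS words sum to NVM_SUM modulo 2**16."""
    words = nvm_words(data)
    if len(words) < CHECKSUM_WORDS:
        return False
    return (sum(words[:CHECKSUM_WORDS]) & 0xFFFF) == NVM_SUM
```

Typical use would be `checksum_ok(parse_ethtool_dump(open("savemyeep.txt").read()))`. A result of False, or `looks_erased(...)` returning True, suggests the dump was taken after the damage and should not be treated as a good backup.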
Are people really running vanilla 2.6.27-rcX when they hit this bug, or are most people running a Fedora or openSUSE kernel? Also, what X server are people running when they trigger this corruption? My hunch is that the X server or the GEM kernel patches might be the culprit, as that's the kind of pattern that is starting to look believable.
David: I hit it running Ubuntu Intrepid (which is currently at Alpha stage). It's not a completely vanilla kernel, but there are no GEM kernel patches. It's a fairly recent Xorg and Intel graphics driver (my laptop is i965). Sorry this is a bit vague, but I don't know exactly which versions were installed at the time. If it's particularly important I might be able to backtrack through logs to figure it out, but it would take a bit of time.
For the benefit of the panicking masses (me) can you tell us if there's anything you're certain is *not* impacted? From everything I've read, it seems to be confined to ICH8* and ICH9* LOMs, with a 2.6.27-rc kernel. Have we ruled out the possibility of this impacting add-on cards, older/newer ICH chipsets, ESB chipsets, and older kernels running bleeding-edge X servers?
If anyone sees anything inaccurate or overly broad in http://blog.mandriva.com/2008/09/23/urgent-notification-major-bug-in-all-mandriva-linux-2009-pre-releases/ , please let me know. It seems this is truly restricted to 2.6.27, so I'll probably remove the references to 2.6.25 and 2.6.26 (we added them because some Fedora bugs appeared to point to the problem cropping up there, but it seems this was not actually the case). We haven't actually had any live reports of this happening on Mandriva yet, but we decided to be cautious since it seems to have affected both SUSE and Fedora. I'll pass on the reports if we do get any.
Re my comment #16: The recently posted proposed patches suggest that MMIO writes have more direct consequences on ICH8 and ICH9 than I assumed.
Has anyone tried these patches from Jeff Kirsher?
http://lkml.org/lkml/2008/9/23/427
http://lkml.org/lkml/2008/9/23/431
http://lkml.org/lkml/2008/9/23/432
Best regards, Renato
We got a report of this happening with e100 too, but in this case it corrupts only the PXE ROM (https://qa.mandriva.com/show_bug.cgi?id=44192). Not sure it is the same problem. Summarizing the report:
* Affected laptops: Samsung X15, Sony Vaio PCG-V505AP
* One BIOS reflash (we were not told on which laptop) made it work again, but the latest BIOS reflash didn't work out and PXE stopped working. The ethernet interface still works; only PXE boot is affected.
* Looks related to 2.6.27 kernels (the latest we are using is 2.6.27-rc7-git1).
But we don't have any useful testcase or anything to help debugging :/ Also, it doesn't seem related to a gfx crash in this case.
(In reply to comment #24)
> We got a report of this happening with e100 too, but in this case it corrupts
> only PXE rom (https://qa.mandriva.com/show_bug.cgi?id=44192). Not sure is the
> same problem.
Do we know what X server and graphics card this user has?
(In reply to comment #25)
> Do we know what X server and graphic card does this user have?
Looking at the product specs, the Samsung should have an Intel 855GM, and the Vaio an 845MP chipset + ATI video ('ATI Mobility Radeon'). I'll request the lspci info from the reporter. The X server is at 1.4.x (1.4.2 at the moment).
(In reply to comment #2)
> I follow fedora rawhide and I guess somewhere around the rc1 timeframe had a
> Q35 integrated e1000 on a ASUS motherboard (PE5-VM DO) stop working because of
> an invalid checksum. Managed to bring it back to life with the ibautil.exe from
> the Intel Boot Agent utilities download
> http://downloadcenter.intel.com/Detail_Desc.aspx?ProductID=412&DwnldID=8242&lang=eng
> with a broken mac (88:88:88:88:8E:88 or something like that), but working.
>
> This week (fedora kernels around rc4) this happened again. When I tried to
> revive it using ibautil.exe it got corrupt enough to no longer be enumerated.
> Would be nice if there was a way to bring that port back to life, because I
> really don't think anything electrical has happend to it.
Hmm, I have the same motherboard and I cannot see anything like this with a vanilla kernel using Frugalware Linux 0.9-devel.
(In reply to comment #26)
> Samsung should have an Intel 855GM but Vaio a 845MP chipset + ATI video ('ATI
> Mobility Radeon') looking at product specs, I'll request the lspci info from
> the reporter.
I got the lspci output from the machines, and it confirms the specs above. The lspci output can be seen on the original ticket in case it could be useful; I didn't find any notable pattern.
(In reply to comment #27)
> (In reply to comment #2)
> > I follow fedora rawhide and I guess somewhere around the rc1 timeframe had a
> > Q35 integrated e1000 on a ASUS motherboard (PE5-VM DO) stop working because
> > of an invalid checksum.88:88:88:8E:88 or something like that), but working.
...
> > This week (fedora kernels around rc4) this happened again. When I tried to
> > revive it using ibautil.exe it got corrupt enough to no longer be
> > enumerated.
> Hmm , I have the same motherboard and I cannot see something like this with an
> vanilla kernel using Frugalware Linux 0.9-devel.
Could you, please, attach here or mail me the output of
    ethtool -e eth0
I think I won't be able to do much with it because the device is no longer on the bus, but perhaps the clued people here could help with that.
Also, when you say "vanilla kernel" it's unclear which particular version. TIA.
(In reply to comment #29)
> (In reply to comment #27)
> > (In reply to comment #2)
> > > I follow fedora rawhide and I guess somewhere around the rc1 timeframe had a
> > > Q35 integrated e1000 on a ASUS motherboard (PE5-VM DO) stop working because
> > > of an invalid checksum.88:88:88:8E:88 or something like that), but working.
> ...
> > > This week (fedora kernels around rc4) this happened again. When I tried to
> > > revive it using ibautil.exe it got corrupt enough to no longer be
> > > enumerated.
> > Hmm , I have the same motherboard and I cannot see something like this with an
> > vanilla kernel using Frugalware Linux 0.9-devel.
> Could you, please, attach here or mail me the output of ethtool -e eth0
> I think I won't be able to do much with it because the device is no longer on
> the bus, but perhaps the clued people here could help with that.
Sure, attached.
> Also when you say "vanilla kernel" its unclear which particular version.
I've tested all 2.6.27-gitX/-rcX kernels on that box, as well as the latest tip/master and linux-next kernel(s).
Created attachment 18063
eeprom dump from P5E-VM DO
(In reply to comment #24)
> We got a report of this happening with e100 too, but in this case it corrupts
> only PXE rom (https://qa.mandriva.com/show_bug.cgi?id=44192). Not sure is the
> same problem.
Hi, please disregard the problem with e100; it was a hardware issue with the switch being used. Sorry for the noise.
@intel guys, I see http://lkml.org/lkml/2008/9/29/395 ("[RFC PATCH 11/12] e1000e: write protect ICHx NVM to prevent malicious write/erase"). I believe you are internally doing some good testing on this (since you'd be the ones with more resources and less risks)? Would you ping here when you're very sure the card won't get "bricked", if possible? Thanks.
I'm not aware of how many people were actually affected by this, but maybe it would help to start a little "database" of affected systems and ask other people to donate EEPROM dumps from similar working systems, in case it turns out to be possible to revive the bricked ethernet devices. Even if there were no security issues with this, more than one dump per system could be useful, I guess (both for verification and to cover different hardware revisions etc.).
Created attachment 18139
patch to avoid nvm corruption
This patch should prevent any further NVM corruption by locking the NVM and locking those registers that lock the NVM, until the next reboot. This of course only happens once the driver is loaded.
Fixed by: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4a7703582836f55a1cbad0e2c1c6ebbee3f9b3a7
I fixed my x61t with the Vista drivers from intel.com (11-12 MB). I just installed them and played with the diagnostic tools in the driver. Both PXE and normal networking under Linux are working again :-) A BIOS upgrade alone did not help (actually, it was a downgrade^^). Please write something about this solution in the error message about the wrong CRC; it would help many people. I am reopening the bug because of this, though it is more about helping people unbrick their hardware. Thanks, Michael Fritscher
So, in the interests of adding some closure to this bug: the issue turns out to have never been the e1000e driver's fault. The fault lies with the CONFIG_DYNAMIC_FTRACE option. Specifically, when the FTRACE code was enabled, it was doing a locked cmpxchg instruction on memory that had previously been used as __INIT code from some other module:
a) some other module loads
b) that module's init code calls into ftrace, which stores the EIP
c) that module discards its init code
d) e1000e loads
e) e1000e asks the kernel for memory to ioremap onto, gets the memory location of the code at b), and maps the flash/NVM control registers there
f) ftraced runs and rewrites bytes 4-8 of the memory location from b/e
g) since the lock/cmpxchg instruction is undefined for memory mapped registers, random junk is written to the b/e location
h) depending on the contents of the junk in g), the NVM is either byte corrupted or block erased, which is detected the next time the e1000e driver is loaded
A short-term workaround is in 2.6.27.1 (disable CONFIG_DYNAMIC_FTRACE), and the longer-term fix is a rewrite of the cmpxchg code (which is already done and will be in 2.6.28-rc1).
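For readers who find the sequence a) to h) hard to follow, here is a toy model of it in Python. It is only a sketch of the stale-address problem: the classes, the addresses and the `forget_range` guard are all invented for illustration, and none of this is the kernel's actual ftrace or ioremap code, nor the actual fix.

```python
class ToyFlashController:
    """Stand-in for NVM control registers: any write has a side effect."""

    def __init__(self):
        self.nvm = bytearray(range(16))
        self.unexpected_writes = 0

    def register_write(self, offset, value):
        # Real hardware would decode the value as a command; here every
        # write simply damages one byte of the toy NVM.
        self.unexpected_writes += 1
        self.nvm[offset % len(self.nvm)] = value & 0xFF


class ToyAddressSpace:
    """Addresses are backed either by plain memory or by device registers."""

    def __init__(self):
        self.ram = {}
        self.mmio = {}

    def load_init_code(self, base, size):
        for addr in range(base, base + size):
            self.ram[addr] = 0x90

    def free_range(self, base, size):
        for addr in range(base, base + size):
            self.ram.pop(addr, None)

    def ioremap(self, base, size, device):
        for offset in range(size):
            self.mmio[base + offset] = (device, offset)

    def write(self, addr, value):
        if addr in self.mmio:
            device, offset = self.mmio[addr]
            device.register_write(offset, value)
        elif addr in self.ram:
            self.ram[addr] = value
        else:
            raise ValueError("write to unmapped address 0x%x" % addr)


class ToyTracer:
    """Remembers call-site addresses and patches them later."""

    def __init__(self, space):
        self.space = space
        self.call_sites = []

    def record(self, addr):
        self.call_sites.append(addr)

    def forget_range(self, base, size):
        # Hypothetical guard: drop records that point into freed memory.
        self.call_sites = [a for a in self.call_sites
                           if not base <= a < base + size]

    def patch_call_sites(self):
        for addr in self.call_sites:
            self.space.write(addr, 0x66)


def simulate(forget_freed_init_code):
    """Return how many stray writes reach the flash controller."""
    space = ToyAddressSpace()
    tracer = ToyTracer(space)
    flash = ToyFlashController()
    base, size = 0x1000, 16
    space.load_init_code(base, size)      # a) another module loads
    tracer.record(base + 4)               # b) its init code is recorded
    space.free_range(base, size)          # c) init code is discarded
    if forget_freed_init_code:
        tracer.forget_range(base, size)
    space.ioremap(base, size, flash)      # d), e) registers reuse the range
    tracer.patch_call_sites()             # f), g) late patching of old sites
    return flash.unexpected_writes        # h) damage found afterwards
```

In this model `simulate(False)` lets one stray write reach the flash controller, while `simulate(True)` lets none through. The point is only that a recorded address outliving the memory it described is enough to turn a harmless code patch into a device register write.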
I don't suppose you've found a way to bring back the cards from the dead?
Re comment #39: http://lkml.org/lkml/2008/10/17/214
Thanks, I'll poke Karsten and see if he can sort out my machine.
In case anyone finds this bug while looking for a way to repair their card, Karsten was able to help me fully restore my NVM (via an image from an identical machine). It should also be noted that I was in the state where the device had "fallen off the bus" (it turned out that the device was still there, but reporting a bad vendor and device ID as a result of the broken NVM data, causing Linux and Windows to ignore the device). So there should be hope for everyone. :)