Bug 3598 - (net sk98lin) eth0 error: Cannot read VPD keys (Marvell Yukon gigabit chipset / Shuttle SN95G5)
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: i386 Linux
Importance: P2 normal
Assignee: Stephen Hemminger
Reported: 2004-10-19 21:25 UTC by Luke Hutchison
Modified: 2006-01-18 12:02 UTC
Kernel Version: 2.6.8
Tree: Mainline
Regression: ---


Attachments
Dump of VPD area from a SN95G5 system (929 bytes, text/plain)
2004-11-17 13:09 UTC, Oliver Jowett
deleted (7 bytes, text/plain)
2005-02-05 12:54 UTC, Luke Hutchison
Sticker showing the digits of your MAC address, with the required digits indicated (23.75 KB, image/jpeg)
2005-02-05 12:55 UTC, Luke Hutchison

Description Luke Hutchison 2004-10-19 21:25:50 UTC
Distribution: FC3-test3
Hardware Environment: Shuttle-SN95G5 nForce3 250
Software Environment: n/a
Problem Description:

I installed fedora-development on a Shuttle-SN95G5 nForce3 250
system, which has a Marvell Yukon gigabit ethernet chip in it.  When
the sk98lin module is loaded, the kernel gives the following error
every second or two:

Class: internal Software error
Nr: 0x19e
Msg: Vpd Cannot read VPD keys

A Google search turned up many pages describing a related problem on ASUS K8VSE
motherboards, including:

http://kerneltrap.org/node/view/3419
http://lkml.org/lkml/2004/4/6/317
http://linux.derkeiler.com/Mailing-Lists/Kernel/2004-03/5484.html

It seems that the Shuttle BIOS may be having the same problem.

My questions are:

(1) is it possible to have the driver report this error once, rather
than every couple of seconds, so that the logs don't fill up?
(2) is there a way I can determine what the current VPD data is, and
what it should be, so I can write a fix similar to the one described
in the links above?  Is the VPD data supposed to be the same for all
Marvell-based boards?
(3) is it true this is a BIOS problem in the first place, or is it a driver problem?
(4) what problems are likely to arise from this error?  (Ethernet seems to work
fine, in spite of it, although apparently people with the problem on Asus boards
have a limited subset of features available without the workaround.  I haven't
tested extensively enough to know yet.)
(5) should I report this to Shuttle?

Thanks..
Comment 1 Luke Hutchison 2004-11-05 06:40:34 UTC
I figured out that I get the error every second because nifd is checking the
interface status at 1-second intervals.  I disabled the nifd service.  I still
get the error on startup and shutdown, but at least my logs are no longer
filling up.  Data transfer on the interface seems flawless.

I also reported this to Red Hat Bugzilla, bug number 136158.  No response there yet.
Comment 2 Oliver Jowett 2004-11-17 13:07:36 UTC
I have the same problem on the same hardware.

The underlying problem seems to be a bad VPD checksum, as in the Asus case.

I will attach a dump of my VPD area. The driver computes an 8-bit sum of bytes
00..75 inclusive; this should total 00 but actually totals c0 in my case.
Repairing the checksum along the lines of the existing driver's fix for Asus
boards eliminates the error and makes the driver work without errors. Since the
VPD area appears to have system-specific data (serial numbers etc) it seems hard
to generalize this without knowing if there is a pattern to the bad checksums.
It's not even clear if it's the checksum at fault or if there really is
corruption in the VPD area somewhere.
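
The checksum rule described above can be sketched as follows. This is a minimal illustration, not the driver's code: the function names are made up, the sample bytes are invented rather than taken from the attached dump, and it assumes that "00..75" means hexadecimal offsets and that the checksum byte is the last byte of the summed range.

```python
def vpd_checksum(vpd: bytes, last: int) -> int:
    """Return the 8-bit sum of vpd[0..last] inclusive; 0 means the checksum is good."""
    return sum(vpd[: last + 1]) & 0xFF


def repaired(vpd: bytes, last: int) -> bytes:
    """Return a copy whose byte at offset `last` is adjusted so the sum wraps to 0.

    Assumption: the checksum byte is the final byte of the summed range.
    """
    fixed = bytearray(vpd)
    fixed[last] = (fixed[last] - vpd_checksum(vpd, last)) & 0xFF
    return bytes(fixed)


# Invented data (NOT the attached dump): 0x76 bytes whose 8-bit sum is 0xc0.
LAST = 0x75  # assumes the "00..75" range is given in hex
sample = bytes([0xC0] + [0x00] * LAST)

print(hex(vpd_checksum(sample, LAST)))                  # sum before the repair
print(hex(vpd_checksum(repaired(sample, LAST), LAST)))  # sum after the repair
```

Note that a repair like this only makes the sum come out to zero; it cannot tell whether the checksum byte or some other byte in the area was the one that was wrong.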

SysKonnect support (linux@syskonnect.de) reckons this is a known
Shuttle-mainboard-specific problem and that Shuttle should have updated VPD
data. I am waiting on a response from Shuttle at the moment.

Perhaps the driver should log bad VPD checksums and continue, rather than
failing the VPD initialization entirely?
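
A rough sketch of that "log and continue" idea, written in Python for brevity rather than as driver code. All names here (`VpdState`, `init_vpd`, the logger name) are hypothetical, and the checksum rule is the 8-bit sum described earlier in this comment.

```python
import logging

log = logging.getLogger("vpd_sketch")  # hypothetical logger name


class VpdState:
    """Per-adapter state for the sketch."""

    def __init__(self) -> None:
        self.warned = False  # has the bad checksum been reported yet?
        self.usable = False  # may callers use the VPD data?


def init_vpd(state: VpdState, vpd: bytes, last: int) -> bool:
    """Accept the VPD data even if the checksum is bad, warning only once."""
    total = sum(vpd[: last + 1]) & 0xFF
    if total != 0 and not state.warned:
        log.warning("VPD checksum is 0x%02x, expected 0x00; continuing", total)
        state.warned = True
    # Continue regardless, instead of failing the whole initialization.
    state.usable = True
    return state.usable
```

Calling `init_vpd` repeatedly on the same `VpdState` with bad data logs a single warning and reports the data as usable each time. The trade-off is that genuinely corrupt VPD contents would also be accepted.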
Comment 3 Oliver Jowett 2004-11-17 13:09:04 UTC
Created attachment 4067
Dump of VPD area from a SN95G5 system
Comment 4 Luke Hutchison 2004-11-17 14:27:12 UTC
It is definitely a problem with some Shuttle mainboards, but it seems to be
even more prevalent on Asus boards, as noted in the original comment.  I got
far more hits on Google for Asus than for Shuttle, but I'm glad I'm not the
only Shuttle user having this problem.

Is it just that the programmers of the affected BIOSes forgot to calculate the
VPD checksum?  That might explain why it was left as zero.  Is there any reason
(security-related or otherwise) to even check the VPD checksum?  Couldn't
drivers just assume the data in the VPD area is good?
Comment 5 Luke Hutchison 2005-01-31 00:56:08 UTC
Interestingly, the latest Fedora-development packages cause the VPD checksum
error to occur whenever the interface status is checked, e.g. by the GNOME
System Monitor applet.  This didn't use to happen (the nifd daemon, ifup/ifdown
etc. cause the error, but not the System Monitor applet).

Also, now that the error occurs a lot more frequently (according to whatever the
applet update frequency is set to), it is easy to see that each time the error
occurs, the system blocks / locks up for around 100ms.  This means that
screensavers jump and DVD playback jumps periodically.  Even mouse movement
jumps.  (And of course the logs fill up fast.)  I think that makes this error a
"higher-impact" problem than it may seem on face value.

Oliver -- can you please attach a slightly modified version of your patch, so
that it logs the error once when it first occurs, and then sets the checksum to
what it should be, so that the error doesn't occur again?  It would be nice to
see if that would get accepted into the kernel.  (I have a brother who just got
a Shuttle system (different model), and noticed his DVDs were jumping,
apparently because of this issue.  I don't think this will be a rare problem.)
Comment 6 Luke Hutchison 2005-02-05 12:52:53 UTC
I emailed Shuttle, and they sent me the following reply after about 2 weeks:

---- <begin> ----

From: 	Support.S <Support@tw.shuttle.com>
Subject: 	Re: Website problem]

Dear Sir/Madam:
 
Thank you for choosing Shuttle.

Regarding your concern about the SN95G5 issue, please follow the procedure set
out below.
    1. Unzip the attached file to a floppy or any bootable device.
    2. Boot up to DOS mode.
    3. Execute "Yukonvpd  -R  FN95.raw"
    4. Edit the MAC address in the FN95.ini file.
    5. Fill in "BEGIN=  xxxxxx " (please refer to the picture).
    6. Execute "Yukonmt  FN95.ini  FN95.raw "
    7. Execute "Yukonvpd  -P  FN95.raw "
    8. Execute "Yukondg" to view EEPROM data 
 

Thank you very much.
 
Best Regards,
 
Shuttle Inc.
Technical Support

---- <end> ----

I followed the procedure, and my Marvell Yukon's VPD area now has the correct
checksum, so I don't see the VPD error anymore.

I will attach the two files that came with the above email.  The first is the
Yukon Manufacturer's Tools, which read and write VPD data.  The second is an
image that shows the sticker on the inside of your box that contains your MAC
address.  You have to fill in the last 6 digits in the "BEGIN=" section of
FN95.ini as described in the email.

This is a permanent fix on a per-machine basis, but it would be nice if the
kernel were fixed to only output the error once (and/or automatically fix the
checksum), as following the above process is kind of a pain (especially since
you need to set up a bootable DOS partition to install the files to).
Comment 7 Luke Hutchison 2005-02-05 12:54:29 UTC
Created attachment 4518
deleted
Comment 8 Luke Hutchison 2005-02-05 12:55:49 UTC
Created attachment 4520
Sticker showing the digits of your MAC address, with the required digits indicated
Comment 9 Luke Hutchison 2005-02-05 13:02:32 UTC
Another issue -- when the error is generated, the system locks up for somewhere
between 25 and 50 ms.  It seems even interrupts may be blocked during this time
(the entire system locks up briefly).  This may be indicative of a bigger
problem with how the error is actually handled by the kernel.  Surely an error
condition shouldn't cause the system to lock up for such a long period of time?
Comment 10 Oliver Jowett 2005-02-05 13:19:10 UTC
I guess that reading the EEPROM blocks the system briefly; when the error
occurs, the driver is reading the EEPROM on each access, but bailing out with a
checksum error before it marks the in-memory data as valid.
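
To make that guess concrete, here is a small simulation of the suspected behaviour. It is a sketch only and does not reproduce sk98lin's actual code: `FakeEeprom`, `VpdCache`, and the `tolerate_bad_checksum` switch are invented for illustration, and the "EEPROM" is just a byte string with a read counter.

```python
class FakeEeprom:
    """Stands in for the slow EEPROM and counts how often it is read."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.reads = 0

    def read(self) -> bytes:
        self.reads += 1  # each real read would stall the system briefly
        return self.data


class VpdCache:
    """In-memory copy of the VPD area, filled lazily on the first query."""

    def __init__(self, eeprom: FakeEeprom, tolerate_bad_checksum: bool) -> None:
        self.eeprom = eeprom
        self.tolerate_bad_checksum = tolerate_bad_checksum
        self.valid = False
        self.data = b""

    def query(self) -> bool:
        if self.valid:
            return True  # fast path: no EEPROM access
        data = self.eeprom.read()
        if sum(data) & 0xFF != 0 and not self.tolerate_bad_checksum:
            return False  # bails out before the cache is marked valid
        self.data = data
        self.valid = True
        return True


bad = bytes([0xC0, 0x00, 0x00])  # invented data with a non-zero 8-bit sum

strict = VpdCache(FakeEeprom(bad), tolerate_bad_checksum=False)
lenient = VpdCache(FakeEeprom(bad), tolerate_bad_checksum=True)
for _ in range(5):
    strict.query()
    lenient.query()

print(strict.eeprom.reads, lenient.eeprom.reads)
```

In the strict variant every status query goes back to the EEPROM, which would match the pause seen each time the interface status is checked; in the lenient variant only the first query does.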

I'll try the EEPROM update and get before & after VPD images to see if we can
intelligently detect whatever is causing the checksum error.
Comment 11 Oliver Jowett 2005-02-05 15:30:44 UTC
I've updated the EEPROM on my system and it does fix the checksum error.

Comparing the before and after VPD data (and also the .raw files the vpd tools
generate), the only change to the EEPROM was the checksum byte. I also noticed
that the diagnostics tool (yukondg.exe), when run before the EEPROM update,
complained about a VPD checksum/format error. So it's not our checksum
calculation at fault.

I will sort out a patch to warn and ignore checksum errors when a Shuttle MAC
address is seen. Don't think we can do anything cleaner than that :(
Comment 12 Luke Hutchison 2005-02-05 23:22:35 UTC
Awesome, thanks Oliver.
Comment 13 Luke Hutchison 2005-02-05 23:25:43 UTC
BTW I noticed the vpd tools show the checksum passes after the update.  I also
noticed that you can manually change vpd keys after updating.  So it's possible
the Shuttle guys ran the update, then decided to change a VPD key manually, and
didn't re-calculate the checksum.

A recent Google search shows the prevalence of this problem is increasing
(compared to when I looked 2-3 months ago), as more people get SN95G5 systems
and run Linux on them.
Comment 14 Waldo 2005-04-15 17:30:24 UTC
FWIW -- I have the same Shuttle SN95G5 and the same problem.  What an annoying
bug this is too.  Found two (software, i.e. no flashing) solutions online.

The first is a patch for the sk98lin driver, which I'll attach.  All credit for
this goes to Jan Willem, the blogmaster of this site:

http://www.lxtreme.nl/index.pl/blog/1105288158
http://www.lxtreme.nl/index.pl/blog/1107365956
etc.

His patch is here:  http://www.lxtreme.nl/pub/skvpd_fn95_090105.patch.gz

It patches against

/usr/src/linux/drivers/net/sk98lin/skge.c

Solution #2, which I think is even better, is to use the New SysKonnect
GigaEthernet support driver (aka the SKGE driver, currently EXPERIMENTAL in the
-mm kernel).  It's a smaller driver for Yukon which doesn't give the VPD errors
at all, even in an unflashed system.  Supposedly it's faster too.  I would LOVE
to see it included in the mainstream kernel.

If you use gentoo, it's in the gentoo-sources kernel, starting around 2.6.11-r2
or so.

    -> Device Drivers                                                  
         -> Networking support                                            
           -> Network device support (NETDEVICES [=y])                   
             -> Ethernet (1000 Mbit)

Finally, if you have Windows (or FreeDOS, which is what I'd have to use), the
BIOS patch is supposedly found here:

http://www.asus.com/support/download/selectftp.aspx?l1_id=1&l2_id=20&l3_id=1&m_id=3&f_name=vpd_patch.zip~zaqwedc

W
Comment 15 Luke Hutchison 2005-08-13 21:37:22 UTC
Yukon Manufacturer's Tools attachment has been deleted at Shuttle's
(kindly-written) request, due to copyright issues in their contract with Marvell.
Comment 16 Paul Sorensen 2005-08-28 14:31:07 UTC
This problem is not unique to the Shuttle motherboards. I am using a couple of
D-Link DGE-530R PCI ethernet adapters. Both show the same problem. My guess is
that the same buggy manufacturer's tool is being used to write the EEPROM with
a bad checksum. In other words, it is probably a manufacturing defect, not a
problem in the driver. Maybe a GPL'd tool like Yukon's that corrects the
checksum would be a better solution than patching the driver as a workaround.
Comment 17 Stephen Hemminger 2006-01-18 12:02:00 UTC
This bug will be closed. There are two possible solutions for people with
affected systems: either run the skge driver, which doesn't use VPD, or get the
vendor to provide tools that write a correct checksum.

Since skge is up and stable, putting a workaround in sk98lin is a bad idea.

The D-Link board has a different problem. The board has no EEPROM, so the
sk98lin driver will never work; therefore that PCI ID has been deleted from the
table, and the skge driver will be picked up instead.
