Distribution: FC3-test3
Hardware Environment: Shuttle-SN95G5 nForce3 250
Software Environment: n/a
Problem Description:

I installed fedora-development on a Shuttle-SN95G5 nForce3 250 system, which has a Marvell Yukon gigabit ethernet chip in it. When inserting the sk98lin module, the kernel gives the following error every second or two:

Class: internal Software error
Nr: 0x19e
Msg: Vpd Cannot read VPD keys

Google found the following related problems in many pages for the ASUS K8VSE mobos:

http://kerneltrap.org/node/view/3419
http://lkml.org/lkml/2004/4/6/317
http://linux.derkeiler.com/Mailing-Lists/Kernel/2004-03/5484.html

It seems that the Shuttle BIOS may be having the same problem. My questions are:

(1) is it possible to have the driver report this error once, rather than every couple of seconds, so that the logs don't fill up?

(2) is there a way I can determine what the current VPD data is, and what it should be, so I can write a fix similar to the one described in the links above? Is the VPD data supposed to be the same for all Marvell-based boards?

(3) is it true this is a BIOS problem in the first place, or is it a driver problem?

(4) what problems are likely to arise from this error? (Ethernet seems to work fine, in spite of it, although apparently people with the problem on Asus boards have a limited subset of features available without the workaround. I haven't tested extensively enough to know yet.)

(5) should I report this to Shuttle?

Thanks..
I figured out that the reason I get the error every second is because nifd is checking the interface status at 1 sec intervals. I disabled the nifd service. I still get the error on startup and shutdown, but at least my logs are not getting long. Data transfer on the interface seems flawless. I also reported this to redhat bugzilla, bug number 136158. No response there yet.
I have the same problem on the same hardware. The underlying problem seems to be a bad VPD checksum, as in the Asus case. I will attach a dump of my VPD area. The driver computes an 8-bit sum of bytes 00..75 inclusive; this should total 00 but actually totals c0 in my case. Repairing the checksum along the lines of the existing driver's fix for Asus boards eliminates the error and makes the driver work without errors. Since the VPD area appears to have system-specific data (serial numbers etc) it seems hard to generalize this without knowing if there is a pattern to the bad checksums. It's not even clear if it's the checksum at fault or if there really is corruption in the VPD area somewhere. SysKonnect support (linux@syskonnect.de) reckons this is a known Shuttle-mainboard-specific problem and that Shuttle should have updated VPD data. I am waiting on a response from Shuttle at the moment. Perhaps the driver should log bad VPD checksums and continue, rather than failing the VPD initialization entirely?
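For anyone who wants to check their own dump by hand, here is a minimal Python sketch of the check described above: an 8-bit sum over bytes 0x00..0x75 inclusive that should come out as 0x00. It is not taken from the sk98lin source. The function names are made up for illustration, and `repair_checksum` assumes the checksum byte is the last byte of the summed range (offset 0x75), which should be confirmed against a real dump before writing anything back to an EEPROM.

```python
# Sketch only: names are hypothetical, not from the sk98lin driver.
VPD_SUM_END = 0x75  # last offset included in the sum, per the comment above


def vpd_sum(vpd: bytes, end: int = VPD_SUM_END) -> int:
    """Return the 8-bit sum of bytes 0..end inclusive (0 means the checksum is good)."""
    if len(vpd) <= end:
        raise ValueError("VPD dump is shorter than the checksummed range")
    return sum(vpd[: end + 1]) & 0xFF


def repair_checksum(vpd: bytes, end: int = VPD_SUM_END) -> bytes:
    """Return a copy whose byte at `end` is adjusted so the 8-bit sum is 0.

    ASSUMPTION: the checksum byte sits at offset `end`.
    """
    fixed = bytearray(vpd)
    fixed[end] = (fixed[end] - vpd_sum(vpd, end)) & 0xFF
    return bytes(fixed)


if __name__ == "__main__":
    # Synthetic data (not a real dump) whose sum is 0xc0, as in the report above.
    dump = bytearray(0x100)
    dump[0] = 0xC0
    print(hex(vpd_sum(bytes(dump))))                   # sum before repair
    print(hex(vpd_sum(repair_checksum(bytes(dump)))))  # sum after repair
```

This only shows the arithmetic. It cannot tell whether the checksum byte itself or some other byte in the area is the one that is wrong.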
Created attachment 4067 [details] Dump of VPD area from a SN95G5 system
It is definitely a problem with some Shuttle mainboards, but it seems to be even more prevalent on Asus boards, as noted in the original comment. I got far more hits on Google for Asus than for Shuttle, but I'm glad I'm not the only Shuttle user having this problem. Is it just that the programmers of the affected BIOSes forgot to calculate the VPD checksum? That might explain why it was left as zero. Is there any reason (security-related or otherwise) to even check the VPD checksum? Couldn't drivers just assume the data in the VPD area is good?
Interestingly, the latest Fedora-development packages cause the VPD checksum error to occur whenever the interface status is checked, e.g. by the GNOME System Monitor applet. This didn't use to happen (the nifd daemon, ifup/ifdown etc. cause the error, but not the System Monitor applet).

Also, now that the error occurs a lot more frequently (according to whatever the applet update frequency is set to), it is easy to see that each time the error occurs, the system blocks / locks up for around 100ms. This means that screensavers jump and DVD playback jumps periodically. Even mouse movement jumps. (And of course the logs fill up fast.) I think that makes this error a "higher-impact" problem than it may seem at face value.

Oliver -- can you please attach a slight modification to your patch, so that it logs the error once when it first occurs, and then sets the checksum to what it should be, so that the error doesn't occur again? It would be nice to see if that would get accepted into the kernel.

(I have a brother who just got a Shuttle system (different model), and noticed his DVDs were jumping, apparently because of this issue. I don't think this will be a rare problem.)
I emailed Shuttle, and they sent me the following reply after about 2 weeks:

---- <begin> ----
From: Support.S <Support@tw.shuttle.com>
Subject: Re: Website problem]

Dear Sir/Madam:

Thank you choosing Shuttle. Regarding your concern about SN95G5 issue, please follow the procedure set out below. Please follow the procedure set out below.

1. Upzip the attachment file to floppy or any boot up device.
2. Boot up to DOS mode.
3. Execute "Yukonvpd -R FN95.raw"
4. Edit Mac address in FN95.ini file
5. Full in "BEGIN= xxxxxx ", Please refer to the pic.
6. Execute "Yukonmt FN95.ini FN95.raw "
7. Execute "Yukonvpd -P FN95.raw "
8. Execute "Yukondg" to view EEPROM data

Thank you very much.

Best Regards,
Shuttle Inc. Technical Support
---- <end> ----

I followed the procedure, and my Marvell Yukon's VPD area now has the correct checksum, so I don't see the VPD error anymore.

I will attach the two files that came with the above email. The first is the Yukon Manufacturer's Tools, that reads/writes VPD data. The second is an image that shows the sticker on the inside of your box that contains your MAC address. You have to fill in the last 6 digits in the "BEGIN=" section of FN95.ini as described in the email.

This is a permanent fix on a per-machine basis, but it would be nice if the kernel were fixed to only output the error once (and/or automatically fix the checksum), as following the above process is kind of a pain (especially since you need to set up a bootable DOS partition to install the files to).
Created attachment 4518 [details] deleted
Created attachment 4520 [details] Sticker showing the digits of your MAC address, with the required digits indicated
Another issue -- when the error is generated, the system locks up for somewhere between 25 and 50 ms. It seems even interrupts may be blocked during this time period (the entire system locks up briefly). This may be indicative of a bigger problem with how the error is actually handled by the kernel. Surely an error condition shouldn't cause the system to lock up for such a long period of time?
I guess that reading the EEPROM blocks the system briefly; when the error occurs, the driver is reading the EEPROM on each access, but bailing out with a checksum error before it marks the in-memory data as valid. I'll try the EEPROM update and get before & after VPD images to see if we can intelligently detect whatever is causing the checksum error.
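To illustrate the guess above, here is a rough Python sketch (not the sk98lin code, and all names are hypothetical) of the "log once and continue" behaviour requested earlier in this thread. The slow EEPROM read happens on the first access only; the in-memory copy is marked valid even if the checksum is bad, so later status checks never touch the EEPROM again. `read_eeprom` stands in for whatever slow routine actually reads the device, and the checksummed range 0x00..0x75 is the one reported earlier in this thread.

```python
import logging

log = logging.getLogger("vpd_sketch")  # hypothetical logger name


class VpdCache:
    """Read the VPD area once, warn once on a bad checksum, then serve from memory."""

    def __init__(self, read_eeprom, sum_end=0x75):
        self._read_eeprom = read_eeprom  # hypothetical slow callable returning bytes
        self._sum_end = sum_end          # last offset included in the 8-bit sum
        self._data = None                # None means "not read yet"
        self.eeprom_reads = 0            # counter, only here to make the effect visible
        self.checksum_ok = None

    def get(self) -> bytes:
        if self._data is None:
            raw = self._read_eeprom()
            self.eeprom_reads += 1
            self.checksum_ok = (sum(raw[: self._sum_end + 1]) & 0xFF) == 0
            if not self.checksum_ok:
                # Logged once, because this branch only runs on the first access.
                log.warning("VPD checksum mismatch; continuing with data as read")
            self._data = raw  # cache the data whether or not the checksum passed
        return self._data


if __name__ == "__main__":
    bad_dump = bytes([0xC0]) + bytes(0xFF)  # synthetic data with a non-zero sum
    cache = VpdCache(lambda: bad_dump)
    for _ in range(3):
        cache.get()
    print(cache.eeprom_reads, cache.checksum_ok)
```

Whether it is safe for a driver to trust VPD data that failed its checksum is a separate question; the sketch only shows how caching removes the repeated reads and the repeated log lines.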
I've updated the EEPROM on my system and it does fix the checksum error. Comparing the before- and after- VPD data (and also the .raw files the vpd tools generate), the only change to the EEPROM was the checksum byte. I also noticed that the diagnostics tool (yukondg.exe) when run before the EEPROM update complained about a VPD checksum/format error. So it's not our checksum calculation at fault. I will sort out a patch to warn and ignore checksum errors when a Shuttle MAC address is seen. Don't think we can do anything cleaner than that :(
Awesome, thanks Oliver.
BTW I noticed the vpd tools show the checksum passes after the update. I also noticed that you can manually change vpd keys after updating. So it's possible the Shuttle guys ran the update, then decided to change a VPD key manually, and didn't re-calculate the checksum. A recent Google search shows the prevalence of this problem is increasing (compared to when I looked 2-3 months ago), as more people get SN95G5 systems and run Linux on them.
FWIW-- I have the same sn95g5 shuttle and the same problem. What an annoying bug this is too. Found two (software, i.e. no flashing) solutions online.

The first is a patch for the sk98lin driver, which I'll attach. All credit for this goes to Jan Willem, the blogmaster of this site:

http://www.lxtreme.nl/index.pl/blog/1105288158
http://www.lxtreme.nl/index.pl/blog/1107365956
etc.

His patch is here:

http://www.lxtreme.nl/pub/skvpd_fn95_090105.patch.gz

It patches against /usr/src/linux/drivers/net/sk98lin/skge.c

Solution #2, which I think is even better, is to use the New SysKonnect GigaEthernet support driver (aka the SKGE driver, currently EXPERIMENTAL in the -mm kernel). It's a smaller driver for Yukon which doesn't give the VPD errors at all, even in an unflashed system. Supposedly it's faster too. I would LOVE to see it included in the mainstream kernel. If you use gentoo, it's in the gentoo-sources kernel, starting around 2.6.11-r2 or so. The config option is under:

-> Device Drivers -> Networking support -> Network device support (NETDEVICES [=y]) -> Ethernet (1000 Mbit)

Finally, if you have Windows (or FreeDOS, which is what I'd have to use), the BIOS patch is supposedly found here:

http://www.asus.com/support/download/selectftp.aspx?l1_id=1&l2_id=20&l3_id=1&m_id=3&f_name=vpd_patch.zip~zaqwedc

W
Yukon Manufacturer's Tools attachment has been deleted at Shuttle's (kindly-written) request, due to copyright issues in their contract with Marvell.
This problem is not unique to the Shuttle motherboards. I am using a couple of D-Link DGE-530R PCI ethernet adapters. Both show the same problem. My guess is that the same buggy manufacturer's tool is being used to write the EEPROM with a bad checksum. In other words, it is probably a manufacturing defect, not a problem in the driver. Maybe a GPL'd tool like Yukon's that corrects the checksum would be a better solution than patching the driver as a workaround.
This bug will be closed. There are two possible solutions for people with affected systems: either run the skge driver, which doesn't use VPD, or get the vendor to provide tools that write a correct checksum. Since skge is up and stable, putting a workaround in sk98lin is a bad idea. The D-Link board has a different problem. The board has no EEPROM, so the sk98lin driver will never work; therefore that PCI ID has been deleted from the table, and the skge driver will be picked up instead.