Bug 11196 - [sky2] System freeze upon PCI Express error
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 high
Assignee: Jeff Garzik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-07-31 11:53 UTC by Pascal BERNARD
Modified: 2009-10-04 18:12 UTC

See Also:
Kernel Version: 2.6.26.1
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Picture of the console (236.82 KB, image/jpeg)
2008-08-02 14:50 UTC, Pascal BERNARD
Kernel message before freeze (660.94 KB, image/jpeg)
2008-11-22 11:34 UTC, Pascal BERNARD
before resetting BIOS parameters (54.78 KB, application/octet-stream)
2008-11-27 13:55 UTC, Pascal BERNARD
after the reset (54.78 KB, application/octet-stream)
2008-11-27 13:55 UTC, Pascal BERNARD

Description Pascal BERNARD 2008-07-31 11:53:14 UTC
Latest working kernel version: -
Earliest failing kernel version: 2.6.18
Distribution:Debian64, but I do not see any change about sky2 in the changelog
Hardware Environment: ASUS P5B / Duo 6600
Software Environment:
Problem Description:
The system freezes without leaving any log. Once, I got the following message on the console:
sky2 0000:02:00.0 PCI Express error (0x40000)

You cannot get the message from an X-Window session, since the system freezes before anything can be displayed.

There are two Ethernet controllers on that board:
05:04.0 Ethernet controller [0200]: Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller [11ab:4320] (rev 14)
02:00.0 Ethernet controller [0200]: Marvell Technology Group Ltd. Unknown device [11ab:4364] (rev 12)

Since I switched to the other connection (skge driver), I no longer get freezes.

The problem also occurs on Debian sid and on a Mint distribution, both 32-bit.

Steps to reproduce:

The problem happens randomly but regularly (you have a good chance of hitting it within three hours).
Comment 1 Andrew Morton 2008-07-31 12:02:24 UTC
2.6.18 is dreadfully old for the kernel.org developers.  Perhaps
your distro is still supporting it.

Please try 2.6.26 if poss, thanks.
Comment 2 Pascal BERNARD 2008-07-31 12:19:57 UTC
I have 2.6.22 under Mint and 2.6.25 under Debian sid, though!

I will work with Debian sid and confirm. I do not remember which kernel version it occurred with. It was obviously not an old one in any case.
Comment 3 Pascal BERNARD 2008-08-02 14:50:39 UTC
Created attachment 17063 [details]
Picture of the console

Luckily enough, I could reproduce the problem while I was on the console.

I use a Debian kernel:
 2.6.25-2-686 #1 SMP Fri Jul 18 17:46:56 UTC
Comment 4 Pascal BERNARD 2008-08-03 01:19:59 UTC
To be more precise, I lost the network connection while I was under X, and I had time to switch to the console before the crash. Thus, you may consider corrupted memory. The fact that it happened with several different kernels argues in favour of a corruption inside the sky2 driver itself.
Comment 5 Pascal BERNARD 2008-08-03 07:04:45 UTC
As directed by Teodor <mteodor@gmail.com> from Debian, I switched to version 2.6.26-1 of the kernel. I could work for a few hours without a problem, but I do not see what could explain the better behaviour:

pascal@moraes:~/linux-2.6/linux-2.6.26.y/drivers/net$ git log 'v2.6.25'.. sky2.c
commit a3b4fcedee5cf1d1342b85f1318c0fe1ff1727a9
Author: Stephen Hemminger <shemminger@vyatta.com>
Date:   Sat Jun 14 10:32:15 2008 -0700

    sky2: 88E8040T pci device id
    
    Missed one pci id for 88E8040T.
    
    Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
    Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

commit 68c2889834602f6efed195f44439ef5d526683a8
Author: Ben Hutchings <bhutchings@solarflare.com>
Date:   Sat May 31 16:52:52 2008 +0100

    sky2: Hold RTNL while calling dev_close()
    
    dev_close() must be called holding the RTNL.
    
    Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
    Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

commit d494eacde8858f9b53f5c640692caf14eb3c8239
Author: Stephen Hemminger <shemminger@vyatta.com>
Date:   Wed May 14 17:04:13 2008 -0700

    sky2: restore vlan acceleration on reset
    
    If device has to be reset by sky2_restart, then need to restore
    the VLAN acceleration settings.
    
    Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
    Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
pascal@moraes:~/linux-2.6/linux-2.6.26.y/drivers/net$ git log 'v2.6.25'.. sky2.h
commit a300344ab9b77130310fc225fdc7677e129b1163
Author: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date:   Tue May 6 14:34:35 2008 -0700

    sky2: fix simple define thinko
    
    noticed while browsing code, apparent thinko.  compile tested only.
    
    Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
    CC: Stephen Hemminger <shemminger@linux-foundation.org>
    Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Comment 6 Pascal BERNARD 2008-08-03 15:08:18 UTC
I got a kernel freeze with 2.6.26-1-686. I did not have the chance to catch a message on the console this time.
Comment 7 Pascal BERNARD 2008-08-13 10:59:24 UTC
Just in case it helps:
- I got a kernel freeze with sky2 module loaded but no link through it
- I have upgraded the BIOS of my motherboard (1212 which is not marked beta)
Comment 8 Pascal BERNARD 2008-11-22 11:34:05 UTC
Created attachment 18970 [details]
Kernel message before freeze

Here is a new instance of the issue:
- BIOS upgraded to the latest version, 1236
- Linux Mint (i.e. kernel vmlinuz-2.6.24-21-generic which probably does not patch 2.5.24)

See attachment "Kernel message before freeze"
Comment 9 Stephen Hemminger 2008-11-22 22:56:41 UTC
Do you have 4G or more of memory? I know of driver problems with 4G or more of memory, but some motherboards do not correctly wire the upper address bits, so the memory above 4G is not accessible and causes a PCI error. You might be able to work around this with the kernel parameter iommu=soft.
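
As a rough way to answer the memory question above, here is a minimal Python sketch that parses the text of /proc/meminfo and /proc/cmdline. The function names and the 3.5 GiB threshold are assumptions made for this illustration; MemTotal reports somewhat less than the installed RAM, so the threshold is only a heuristic and not something the driver itself checks.

```python
import re

# Heuristic threshold (an assumption for this sketch): MemTotal under-reports
# installed RAM, so treat anything from about 3.5 GiB up as "4G or more".
THRESHOLD_KIB = int(3.5 * 1024 * 1024)


def mem_total_kib(meminfo_text):
    """Return the MemTotal value, in KiB, from /proc/meminfo-style text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1))


def might_have_memory_above_4g(meminfo_text):
    """Heuristic: True if the machine plausibly has 4G or more of RAM."""
    return mem_total_kib(meminfo_text) >= THRESHOLD_KIB


def has_kernel_param(cmdline_text, param):
    """True if `param` appears as a whole word on the kernel command line."""
    return param in cmdline_text.split()
```

To use it on a live system, pass in the contents of /proc/meminfo and /proc/cmdline, for example `has_kernel_param(open("/proc/cmdline").read(), "iommu=soft")`.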

Unfortunately, the sky2 driver is/was a strictly volunteer effort, and yours seems to be the only outstanding report of driver failures. Until others see the problem, I only really have the resources to give you hints to solve the problem yourself.
Comment 10 Pascal BERNARD 2008-11-23 14:39:20 UTC
I "only" have 2G of memory, so I suppose it is not worth trying iommu=soft.

The problem probably appeared after upgrading the BIOS. This can explain why others do not have the problem.

I do not know what I could do to give more hints.
Comment 11 Stephen Hemminger 2008-11-25 20:35:08 UTC
On Sun, 23 Nov 2008 14:39:21 -0800 (PST)
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=11196
> ------- Comment #10 from pascal.bernard1@free.fr  2008-11-23 14:39 -------
> I "only" have 2G of memory, so I suppose it is not worth trying iommu=soft.
> 
> The problem probably appeared after upgrading the BIOS. This can explain why
> others do not have the problem.
> 
> I do not know what I could do to give more hints.

You could capture a register dump with ethtool -d
for both the old and the new BIOS. It could be that the BIOS set up some
part of the chip differently, and that could be fixed. The problem is that
the driver generally tries not to fiddle with bits it doesn't need to,
because the BIOS often initializes stuff based on hardware and timing
parameters that depend on the system bus speed, etc.
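
Once two register dumps have been saved as binary files, they can be compared word by word. The following is a minimal Python sketch, not part of ethtool or the driver: the function name and the 4-byte word size are assumptions chosen for illustration, and it presumes the dumps were written in raw form (ethtool offers a raw register dump option).

```python
def diff_dumps(old, new, width=4):
    """Compare two raw register dumps given as bytes objects.

    Returns a list of (offset, old_hex, new_hex) tuples, one for each
    `width`-byte word that differs. Only the common length is compared.
    """
    differences = []
    length = min(len(old), len(new))
    for offset in range(0, length, width):
        old_word = old[offset:offset + width]
        new_word = new[offset:offset + width]
        if old_word != new_word:
            differences.append((offset, old_word.hex(), new_word.hex()))
    return differences
```

A caller would read each dump with `open(path, "rb").read()` and print the returned offsets in hexadecimal, so they can be matched against the register offsets defined in the driver's header.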
Comment 12 Pascal BERNARD 2008-11-27 13:55:06 UTC
Created attachment 19057 [details]
before resetting BIOS parameters
Comment 13 Pascal BERNARD 2008-11-27 13:55:47 UTC
Created attachment 19058 [details]
after the reset
Comment 14 Pascal BERNARD 2008-11-27 13:57:11 UTC
I have run ethtool with the latest BIOS (1236), result in eth0.1236

I tried to load the latest stable release, 1101, but the BIOS did not let me load an older version!

I then asked the BIOS to revert to its default settings. I was quite surprised to see that the output of ethtool changed! Result in eth0.afterdefault.

The BIOS offers the possibility of overclocking, and I had tried a +10% overclock. Is it a possible cause of driver failure?

I will let you know if I observe kernel freeze or panic under nominal conditions.

Thank you for your support.
Comment 15 Pascal BERNARD 2008-12-04 13:48:01 UTC
It does not happen so often, but the system froze twice. Even though I did not have the chance to catch a message, it is likely to be a sky2 issue. Disabling sky2 resulted in no problem. I will try to get more info. This was just to let you know not to close this ticket.
Comment 16 Pascal BERNARD 2009-01-02 17:07:18 UTC
Here is a more precise message before kernel panic:
Jan  2 23:00:59 moraes kernel: [12337.606996] sky2 0000:02:00.0: error interrupt status=0x80000000
Jan  2 23:00:59 moraes kernel: [12337.606996] sky2 0000:02:00.0: PCI Express error (0x40000)
Jan  2 23:00:59 moraes kernel: [12337.606996] sky2 0000:02:00.0: error interrupt status=0x80000000
Jan  2 23:00:59 moraes kernel: [12337.606996] sky2 0000:02:00.0: PCI Express error (0x40000)
Jan  2 23:00:59 moraes kernel: [12337.606996] sky2 0000:02:00.0: error interrupt status=0x80000000
Jan  2 23:00:59 moraes kernel: [12337.606996] sky2 0000:02:00.0: PCI Express error (0x40000)

This happened while:
- the driver was not in use! I even blacklisted the module in modprobe.d, but it was loaded anyway (does anyone know why?)
- the BIOS settings were as standard as possible (no overclocking)
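
The value in the "PCI Express error" messages above can be decoded bit by bit. The Python sketch below assumes that the number in parentheses is the content of the PCI Express Advanced Error Reporting "Uncorrectable Error Status" register; the thread does not confirm which register the driver prints, so treat that as an assumption. The bit names follow the PCI Express specification, and the function names are made up for this illustration.

```python
import re

# Bit positions of the PCIe AER Uncorrectable Error Status register.
# Assumption: the value logged by the driver uses this layout.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}


def parse_pcie_error(line):
    """Extract the hexadecimal error value from a sky2 log line, or None."""
    match = re.search(r"PCI Express error \((0x[0-9a-fA-F]+)\)", line)
    return int(match.group(1), 16) if match else None


def decode_uncorrectable(status):
    """Return the names of all bits set in `status`, lowest bit first."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            names.append(AER_UNCORRECTABLE_BITS.get(bit, "bit %d (unknown)" % bit))
    return names
```

Under that assumption, the value 0x40000 has only bit 18 set, which the table above names "Malformed TLP".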
Comment 17 Pascal BERNARD 2009-10-04 18:12:03 UTC
It looks like a hardware issue with power management on the motherboard. The freeze occurs even if the driver is not loaded, so this cannot be the driver. I am changing the status to resolved (maybe a status of REJECTED would be more appropriate).

Sorry to have bothered you with this problem.
