Bug 6142
Summary: | Skge related Oops on P3 SMP box with IRQ migration enabled | ||
---|---|---|---|
Product: | Drivers | Reporter: | Krzysztof Oledzki (ole) |
Component: | Network | Assignee: | Stephen Hemminger (stephen) |
Status: | RESOLVED CODE_FIX | ||
Severity: | blocking | CC: | bunk, stephen |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.15.4 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | Full .config file; possible IRQ race fix | ||
Description
Krzysztof Oledzki
2006-02-28 15:41:36 UTC
I'm having a similar problem with 2.6.15.6 on an Athlon64 X2 3800+ running 64-bit Gentoo. The motherboard is an ASUS A8N-SLI nForce4 based board with two integrated NICs, one Marvell 88E8001 and one nVidia. The nVidia NIC works fine, but using the Marvell NIC with the skge driver eventually causes the system to lock up hard. It takes a while, but usually ~10 minutes of heavy NFS traffic (>20 MB/s) will break the system. It's not a hardware issue, since the Marvell NIC works fine (albeit slower and less efficiently) with the in-kernel sk98lin driver. The problem only manifests when using an SMP kernel.

Setting smp_affinity to 1 on the skge interrupt (82 on my system) seems to make the problem go away. Smells like the race condition problems haven't quite been fixed yet. One minor complication: I'm using the loop-aes 3.1c patch and have disk encryption on all of my drives. Perhaps this is the source of the problem.

Here is a list of things that don't seem to have any effect on the problem:

- Over/Underclocking the system
- 2.6.16-rc5
- Linux Vserver patches
- Preempt vs. Non-Preempt
- Monkeying around with the interrupt coalescing settings with ethtool

Side note: The newer sk98lin driver from SysKonnect causes the system to crash spectacularly whenever any NFS traffic occurs unless a big chunk of SSH traffic (>10MB) occurs first. If the SSH transfer occurs first, the system will be rock solid -- hours of high-speed data transfer -- from then on out.

On Wed, 15 Mar 2006, bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=6142
>
> ------- Additional Comments From robert@firehead.org 2006-03-15 15:18 -------
> I'm having a similar problem with 2.6.15.6 on a Athlon64 X2 3800+ running 64 bit
> gentoo.
> [...]

You may also try to disable rx and/or tx csum. With rx and tx hardware checksumming disabled, my system is stable even with smp_affinity set to 3. Now I only need to test which is the real problem: rx or tx...

Best regards,
Krzysztof Ol

Please retest with the new 1.4 version (post 2.6.16). You can find the diff from the 2.6.16 version at:
http://developer.osdl.org/shemminger/prototypes/skge-1.4.diff

Applied the skge 1.4 patch to 2.6.16-vserver (presence or absence of vserver had no effect on crashes previously). This time the system locked up within a few minutes of heavy NFS traffic, so it seems the bug is still there. The SMP affinity setting decreased the frequency of crashing, but did not eliminate the problem entirely. Turning off tx and rx checksumming with ethtool -K seems to have made the bug go away for now. This caused a performance hit of about 20%, which I was able to get rid of by adjusting the interrupt coalescing settings on all the machines.

Please send the full .config of a non-working system. I can't reproduce this with an old P3 SMP box and 2.6.16.6, so something different is going on. It may have something to do with bonding or VLANs. I saw the bonding config; are you using VLANs as well?

Please reopen this bug if:
- it is still present in kernel 2.6.17 and
- you can provide the requested information.

Created attachment 8711
Full .config file
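For readers unfamiliar with the smp_affinity values mentioned in the comments above (1 and 3): the value written to /proc/irq/<N>/smp_affinity is a hexadecimal CPU bitmask, where bit i allows CPU i to service the interrupt. The following is only a small sketch of that arithmetic; the helper names cpu_mask_for() and format_mask() are made up for illustration and are not part of any kernel or ethtool interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Build a CPU bitmask: bit i set means CPU i may handle the IRQ.
 * Hypothetical helper, limited here to CPUs 0..31. */
static unsigned int cpu_mask_for(const int *cpus, int count)
{
    unsigned int mask = 0;
    for (int i = 0; i < count; i++) {
        assert(cpus[i] >= 0 && cpus[i] < 32); /* avoid an out-of-range shift */
        mask |= 1u << cpus[i];
    }
    return mask;
}

/* Format the mask as lowercase hex without a "0x" prefix,
 * the form that is written to /proc/irq/<N>/smp_affinity. */
static int format_mask(unsigned int mask, char *buf, size_t len)
{
    return snprintf(buf, len, "%x", mask);
}
```

Under this scheme, a mask of 1 pins the interrupt to CPU 0 only, while 3 allows both CPU 0 and CPU 1 on a two-processor box, which matches the two settings discussed in this report.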
The bug still exists in 2.6.17. However, it takes some time before the system crashes, sometimes even a day or two, and this server is quite busy (pop3/imap/smtp/amavis/apache/mysql/etc). For now I'm happy with the "/usr/sbin/ethtool -K eth1 tx off" workaround. Also, I don't use VLANs on this server, only bonding (active/backup) with eth0+eth1.

Created attachment 8900
possible IRQ race fix
This changes the order of taking the lock and reading the IRQ register; the
previous order could theoretically cause problems.
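To make the ordering concern concrete, here is a userspace sketch, not the attached patch and not skge code. It assumes that reading the interrupt status register has a side effect (modelled here as clearing the pending bits), so the read should happen only while the lock is held. Every name in it (fake_nic, hw_lock, read_isrc, fake_intr) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a NIC's shared hardware state. */
struct fake_nic {
    atomic_flag hw_lock;   /* simple spinlock guarding the fields below */
    uint32_t isrc;         /* pending interrupt sources */
    unsigned int handled;  /* number of interrupts the handler claimed */
};

static void fake_nic_init(struct fake_nic *nic, uint32_t pending)
{
    atomic_flag_clear(&nic->hw_lock);
    nic->isrc = pending;
    nic->handled = 0;
}

static void fake_lock(struct fake_nic *nic)
{
    while (atomic_flag_test_and_set_explicit(&nic->hw_lock,
                                             memory_order_acquire))
        ; /* spin until the current holder releases the lock */
}

static void fake_unlock(struct fake_nic *nic)
{
    atomic_flag_clear_explicit(&nic->hw_lock, memory_order_release);
}

/* Assumed behaviour: the read returns the pending bits and clears them. */
static uint32_t read_isrc(struct fake_nic *nic)
{
    assert(nic);
    uint32_t status = nic->isrc;
    nic->isrc = 0;
    return status;
}

/* Handler using the "lock first, then read" order. If the read came
 * before fake_lock(), another CPU holding the lock could change the
 * device state between the read and the lock acquisition, and the
 * handler would then act on a stale status value. */
static int fake_intr(struct fake_nic *nic)
{
    int claimed = 0;

    fake_lock(nic);
    uint32_t status = read_isrc(nic);
    if (status != 0) {
        nic->handled++;
        claimed = 1;
    }
    fake_unlock(nic);
    return claimed;
}
```

Calling fake_intr() on a device with pending bits claims the interrupt once; a second call finds nothing pending and returns 0, with the lock released in both cases.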
The problems should now be fixed in 2.6.17.13 and 2.6.18.