Bug 9173

Summary: BUG: soft lockup detected on CPU#0 - maybe related to TCP_MD5SIG
Product: Networking
Reporter: Tore Anderson (tore)
Component: IPV4
Assignee: Stephen Hemminger (stephen)
Status: RESOLVED CODE_FIX
Severity: normal
CC: tore, yoshfuji
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.20-9-server-lp2
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: Call traces

Description Tore Anderson 2007-10-17 07:34:49 UTC
Most recent kernel where this bug did not occur: 2.6.12 (with TCP_MD5SIG-implementation from http://hasso.linux.ee/doku.php/english:network:rfc2385)
Distribution: Ubuntu 6.06.1 LTS
Hardware Environment: Sun X4100 (x86_64, SMP)
Software Environment: Ubuntu 2.6.20-9-server-lp2 (-lp2 because it's recompiled with TCP_MD5SIG enabled), 64-bits userspace
Problem Description:

The server is a border router running Quagga for BGP and OSPF, and usually forwards 400-500 Mbps worth of traffic between around 80 VLAN interfaces.  It has four network interfaces, bonded pairwise, and three BGP sessions with MD5 signatures enabled.

The server has an identical twin (for failover) which has also locked up like this, although it happens much more frequently on the active one (no matter which one is active, unfortunately).  We've got lots of these servers, but only the border routers have had these lockups.

Once in a while (say, once every four to six weeks) it will flood the console with "BUG: soft lockup detected on CPU#0!" and shortly afterwards fail completely.  This time I had increased the default printk level one notch and got the back traces too.  I'm not used to reading those, but the md5sig stuff seems to stand out...

I'll try to attach the trace somehow (got an error message about the bug being too large when attempting to include it here).

Steps to reproduce:  It happens completely out of the blue, so I don't know how.

Tore
Comment 1 Tore Anderson 2007-10-17 07:35:49 UTC
Created attachment 13187
Call traces

The traces printed to the console when the server locks up
Comment 2 Stephen Hemminger 2007-10-29 22:52:22 UTC
This bug was just fixed.

commit 2c4f6219aca5939b57596278ea8b014275d4917b
Author: David S. Miller <davem@sunset.davemloft.net>
Date:   Tue Feb 20 23:51:47 2007 -0800

    [TCP]: Fix MD5 signature pool locking.
    
    The locking calls assumed that these code paths were only
    invoked in software interrupt context, but that isn't true.
    
    Therefore we need to use spin_{lock,unlock}_bh() throughout.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
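
For readers unfamiliar with why the _bh variants matter here, the following is a toy single-CPU model in userspace C, not the kernel code from the commit. Every name in it (struct toy_cpu and the toy_* functions) is hypothetical and made up for illustration. It assumes the simplest case: process-context code takes the lock, and a softirq that wants the same lock then arrives on the same CPU. With a plain lock the softirq would spin on a lock whose owner it has interrupted, which is the kind of situation a soft lockup report points at; with the _bh style, softirqs are held off until the lock is dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-CPU model. None of these names are kernel APIs. */
struct toy_cpu {
    bool lock_held;    /* the shared lock is held by code on this CPU */
    int  bh_disabled;  /* nesting depth of "softirqs disabled" sections */
};

/* Models a plain spin_lock(): softirqs stay enabled while the lock is held. */
static void toy_lock(struct toy_cpu *c)   { c->lock_held = true; }
static void toy_unlock(struct toy_cpu *c) { c->lock_held = false; }

/* Models spin_lock_bh(): disable softirqs first, then take the lock. */
static void toy_lock_bh(struct toy_cpu *c)
{
    c->bh_disabled++;
    c->lock_held = true;
}

/* Models spin_unlock_bh(): drop the lock, then re-enable softirqs. */
static void toy_unlock_bh(struct toy_cpu *c)
{
    assert(c->bh_disabled > 0);
    c->lock_held = false;
    c->bh_disabled--;
}

/*
 * A softirq that needs the same lock arrives on this CPU.
 * Returns true if it would spin forever: it runs immediately (softirqs
 * are enabled) while the lock is held by the code it interrupted.
 */
static bool toy_softirq_would_lock_up(const struct toy_cpu *c)
{
    if (c->bh_disabled > 0)
        return false;      /* softirq is deferred until the section ends */
    return c->lock_held;   /* runs now; a held lock can never be released */
}
```

In this model, calling toy_softirq_would_lock_up() after toy_lock() reports a lockup, while calling it after toy_lock_bh() does not, because the softirq is deferred until toy_unlock_bh() has run.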