Bug 10309 - BUG: soft lockup - CPU#0 stuck
Summary: BUG: soft lockup - CPU#0 stuck
Status: CLOSED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: SPARC64
Hardware: All Linux
Importance: P1 normal
Assignee: platform_sparc64
Reported: 2008-03-22 06:53 UTC by Arno
Modified: 2010-01-19 17:02 UTC

Kernel Version: 2.6.24.3
Regression: No

Description Arno 2008-03-22 06:53:39 UTC
Earliest failing kernel version: 2.6.24.3
Distribution: Debian 4.0/Etch Sparc64
Hardware Environment: Sparc64
Problem Description:
During boot of my Sparc64 with a vanilla 2.6.24.3 kernel, the kernel gets stuck. After a while it continues, allowing normal use of the machine. The message shown in the kernel logs looks like this:

BUG: soft lockup - CPU#0 stuck for 11s! [ifconfig:2408]
TSTATE: 0000004480009603 TPC: 0000000010013390 TNPC: 0000000010013394 Y: 00000000    Not tainted
TPC: <gem_interrupt+0x14/0xec [sungem]>
g0: 0000000000009000 g1: 0800000000000001 g2: 0000000000000100 g3: 0000000000000400
g4: fffff8003e61c060 g5: 0000000000000020 g6: fffff8003e7e8000 g7: 0000000000000000
o0: 0000000000000001 o1: fffff8003d090670 o2: 0000000000000001 o3: 000001fe0000f078
o4: 7fffffffffffffff o5: 0000000080000000 sp: fffff8003e7ea341 ret_pc: 00000000100133bc
RPC: <gem_interrupt+0x40/0xec [sungem]>
l0: fffff8003d090670 l1: 0000000000821400 l2: fffff8003e7eaca0 l3: 0000000000000400
l4: 0000000000000000 l5: 0000000000000005 l6: 0000000000000000 l7: 0000000000000008
i0: 0000000000000009 i1: fffff8003d090620 i2: 000000001c2245fa i3: 000000000000000c
i4: 7fffffffffffffff i5: 0000000000000000 i6: fffff8003e7ea401 i7: 000000000047e974
I7: <handle_IRQ_event+0x34/0x74>

It apparently has to do with the network (ifconfig), although I'm not 100% sure. I tried changing several settings in the kernel (too many to write down here), but all give the same result.

Steps to reproduce:
- Build a 2.6.24.3 for Sparc64
- Boot a SUN Sparc64 machine with this kernel
- System gets stuck during boot but after a while it continues its boot.
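
For triage, the headline and the symbolized addresses in a report like the one above can be pulled out of a log mechanically. The following is only a minimal sketch, assuming the log lines have exactly the shape shown in this report; the helper names `parse_lockup` and `parse_symbol` are made up for illustration and are not part of any kernel tooling.

```python
import re

# Headline of a soft-lockup report, in the shape printed in this bug report.
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

# Symbolized address such as "<gem_interrupt+0x14/0xec [sungem]>".
# The trailing "[module]" part is optional (absent for built-in code).
SYMBOL_RE = re.compile(
    r"<(?P<func>\w+)\+0x(?P<off>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
    r"(?: \[(?P<module>\w+)\])?>"
)


def parse_lockup(line):
    """Return CPU, duration, task name and PID from a headline, or None."""
    m = LOCKUP_RE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m.group("cpu")),
        "seconds": int(m.group("secs")),
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
    }


def parse_symbol(line):
    """Return function, offset, size and module from a symbol line, or None."""
    m = SYMBOL_RE.search(line)
    if m is None:
        return None
    return {
        "func": m.group("func"),
        "offset": int(m.group("off"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("module"),  # None when no module is named
    }


if __name__ == "__main__":
    print(parse_lockup("BUG: soft lockup - CPU#0 stuck for 11s! [ifconfig:2408]"))
    print(parse_symbol("TPC: <gem_interrupt+0x14/0xec [sungem]>"))
    print(parse_symbol("I7: <handle_IRQ_event+0x34/0x74>"))
```

Applied to the trace above, this identifies the stuck task as ifconfig and places the program counter inside gem_interrupt in the sungem module, called from handle_IRQ_event.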
Comment 1 Anonymous Emailer 2008-03-22 09:37:41 UTC
Reply-To: akpm@linux-foundation.org

(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sat, 22 Mar 2008 06:53:40 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=10309
> 
>            Summary: BUG: soft lockup - CPU#0 stuck
>            Product: Platform Specific/Hardware
>            Version: 2.5
>      KernelVersion: 2.6.24.3
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: SPARC64
>         AssignedTo: platform_sparc64@kernel-bugs.osdl.org
>         ReportedBy: arnova@eld.physics.leidenuniv.nl
> 
> 

I expect it would be useful if you tell us the latest version of the kernel
on which this didn't happen.  ie: what kernel version were you running before
you "up"graded to 2.6.24?

Thanks.
Comment 2 Arno 2008-03-22 10:20:32 UTC
Sorry, I forgot to mention that. The last kernel that didn't have this problem was Debian's 2.6.18-6-sparc64 stock kernel.
Comment 3 Arno 2008-03-23 10:00:08 UTC
Just built and tried a 2.6.19.7 vanilla kernel, and this kernel also does NOT suffer from this issue. I will now try a 2.6.23.17 vanilla kernel and see what it does.
Comment 4 Arno 2008-03-23 12:13:49 UTC
I can confirm that this issue also doesn't exist in 2.6.23.17. Just tested with a vanilla kernel on my test system and the problem does NOT occur. The obvious conclusion is that this problem got introduced in 2.6.24 (post 2.6.23)...
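
The reasoning behind that conclusion can be written down as a small helper that takes the test results so far and returns the newest good and oldest bad version. This is only a sketch under stated assumptions: the names `version_key` and `regression_window` are hypothetical, the Debian 2.6.18-6-sparc64 kernel is represented by its upstream base "2.6.18", and the results are assumed to be monotonic (once a version is bad, all later ones are bad too).

```python
def version_key(version):
    """Turn '2.6.23.17' into (2, 6, 23, 17) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def regression_window(results):
    """Return (last_good, first_bad) from a {version: is_good} mapping.

    Walks the versions in ascending order and stops at the first bad one.
    Either element is None if no such version has been tested.
    """
    last_good = None
    for version in sorted(results, key=version_key):
        if results[version]:
            last_good = version
        else:
            return last_good, version
    return last_good, None


# Results reported in this thread so far (True = no soft lockup at boot).
# "2.6.18" stands in for Debian's 2.6.18-6-sparc64 (an assumption).
results = {
    "2.6.18": True,
    "2.6.19.7": True,
    "2.6.23.17": True,
    "2.6.24.3": False,
}

if __name__ == "__main__":
    print(regression_window(results))
```

With the four kernels tested so far, the window is 2.6.23.17 (good) to 2.6.24.3 (bad), which is the range a bisection between those releases would have to cover.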
Comment 5 Aaron Sethman 2008-08-07 07:58:21 UTC
This issue still exists on 2.6.26.2. 
Comment 6 Evgeni Golov 2008-11-28 18:59:32 UTC
It is also present in 2.6.26.7.
Comment 7 Evgeni Golov 2008-12-08 02:48:28 UTC
And in 2.6.28-rc7

The bad commit seems to be
commit bea3348eef27e6044b6161fd04c3152215f96411
Author: Stephen Hemminger <shemminger@linux-foundation.org>
Date:   Wed Oct 3 16:41:36 2007 -0700

    [NET]: Make NAPI polling independent of struct net_device objects.

Will try to debug further
Comment 8 Evgeni Golov 2008-12-20 12:22:41 UTC
It's still present in 2.6.28-rc9 and I was not able to debug more than the commit above - git did not want to revert it :(
Comment 9 Evgeni Golov 2009-02-05 04:43:51 UTC
Still present in 2.6.29-rc3 (or more precisely: linux-2.6.git at eda58a85ec3fc05855a26654d97a2b53f0e715b9).
Comment 10 Evgeni Golov 2009-02-12 23:42:53 UTC
Fine, it's fixed now.

Commit: 71822faa3bc0af5dbf5e333a2d085f1ed7cd809f
sungem: Soft lockup in sungem on Netra AC200 when switching interface up
