7066 – __free_pages+0xc/0x34

Bug 7066 - __free_pages+0xc/0x34

Summary: __free_pages+0xc/0x34

Status:	CLOSED PATCH_ALREADY_AVAILABLE

Alias:	None

Product:	Memory Management
Classification:	Unclassified
Component:	Page Allocator (show other bugs)
Hardware:	i386 Linux

Importance:	P2 normal
Assignee:	Andrew Morton

URL:
Keywords:

Depends on:
Blocks:

Reported:	2006-08-28 10:03 UTC by Don Harter
Modified:	2007-02-25 08:36 UTC (History)
CC List:	3 users (show)

See Also:
Kernel Version:	linux-2.6.17.11
Subsystem:
Regression:	---
Bisected commit-id:

Attachments
System map (672.66 KB, text/plain) 2006-08-28 10:50 UTC, Don Harter	Details
lspci -vvv (18.10 KB, text/plain) 2006-08-28 10:54 UTC, Don Harter	Details
my .config file (66.44 KB, text/plain) 2006-08-28 14:35 UTC, Don Harter	Details
2 new crashes in a log (5.60 KB, text/plain) 2006-09-09 22:34 UTC, Don Harter	Details
strace of mount command (3.07 KB, text/plain) 2006-09-15 17:39 UTC, Don Harter	Details
bootup and log of the crash (20.61 KB, text/plain) 2006-09-17 07:39 UTC, Don Harter	Details
system map for the smp kernel compile (692.30 KB, text/plain) 2006-09-17 07:40 UTC, Don Harter	Details
my smp .config file (65.81 KB, text/plain) 2006-09-17 07:44 UTC, Don Harter	Details
Remove superfluous lock from rtc.c (1.96 KB, patch) 2006-09-29 10:03 UTC, Takashi Iwai	Details \| Diff
Fix ABBA deadlock in rtc and sound timer (617 bytes, patch) 2006-09-29 11:03 UTC, Takashi Iwai	Details \| Diff
crash dump (10.42 KB, text/plain) 2006-09-29 15:35 UTC, Don Harter	Details
system map for the patched kernel (735.29 KB, text/plain) 2006-09-29 15:37 UTC, Don Harter	Details
another crash- syslog (3.97 KB, text/plain) 2006-10-02 15:17 UTC, Don Harter	Details
Show Obsolete (1) Add an attachment (proposed patch, testcase, etc.)

Description Don Harter 2006-08-28 10:03:58 UTC

Most recent kernel where this bug did not occur:
Distribution: suse 10.1
Hardware Environment: abit KN8 sli
Software Environment:
Problem Description: system crashes

Steps to reproduce: use the system

kernel BUG at include/linux/mm.h:296!
invalid opcode: 0000 [#1]
Modules linked in: nls_iso8859_1 nls_cp437 vfat fat ipt_LOG xt_limit xt_pkttype
af_packet edd snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ip6t_REJECT
xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat ip_nat iptable_filter
ip6table_mangle ip_conntrack nfnetlink ip_tables ip6table_filter ip6_tables
x_tables ipv6 loop dm_mod snd_intel8x0 ehci_hcd snd_ac97_codec snd_ac97_bus
i2c_nforce2 snd_pcm snd_timer snd ohci_hcd soundcore usbcore i2c_core
snd_page_alloc ide_cd cdrom ohci1394 ieee1394 forcedeth ext3 jbd processor sg
sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core
CPU:    0
EIP:    0060:[<c013c778>]    Not tainted VLI
EFLAGS: 00013246   (2.6.17.11-default #1) 
EIP is at __free_pages+0xc/0x34
eax: 00000000   ebx: 00000000   ecx: c16af540   edx: 00000000
esi: f6d27bdc   edi: f57aa000   ebp: f6d27b58   esp: f6d27b58
ds: 007b   es: 007b   ss: 0068
Process X (pid: 3552, threadinfo=f6d26000 task=dfafea70)
Stack: f6d27b60 c013c7cd f6d27b78 c01631e3 00000014 00000001 f65d3734 00000023 
       f6d27e2c c0163a9f f6d27e4c dfbc226c 00000001 f6d27fa0 f6d27f4c 00000000 
       00000023 f6d27eb4 f6d27ed4 f6d27ef4 f6d27e54 f6d27e74 f6d27e94 00000007 
Call Trace:
 <c0103ba3> show_stack_log_lvl+0x85/0x8f  <c0103d04> show_registers+0x11f/0x18b
 <c0103ec1> die+0x151/0x261  <c0287c52> do_trap+0x7c/0x96
 <c01045a9> do_invalid_op+0x89/0x93  <c0103693> error_code+0x4f/0x54
 <c013c7cd> free_pages+0x2d/0x2f  <c01631e3> poll_freewait+0x50/0x5c
 <c0163a9f> do_select+0x4a3/0x4c6  <c0163d8a> core_sys_select+0x2c8/0x2ed
 <c01642fd> sys_select+0x93/0x166  <c0102afb> sysenter_past_esp+0x54/0x75
Code: 18 89 4a 04 ff 75 e8 89 55 ec ba 01 00 00 00 e8 60 fb ff ff 53 9d 5e 8d 65
f4 5b 5e 5f 5d c3 55 89 c1 8b 40 04 89 e5 85 c0 75 08 <0f> 0b 28 01 41 6c 2a c0
ff 49 04 0f 94 c0 84 c0 74 14 85 d2 75 
EIP: [<c013c778>] __free_pages+0xc/0x34 SS:ESP 0068:f6d27b58
 <1>BUG: unable to handle kernel paging request at virtual address 0020020c
 printing eip:
c014f450
*pde = 76564067
Oops: 0000 [#2]
Modules linked in: nls_iso8859_1 nls_cp437 vfat fat ipt_LOG xt_limit xt_pkttype
af_packet edd snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ip6t_REJECT
xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat ip_nat iptable_filter
ip6table_mangle ip_conntrack nfnetlink ip_tables ip6table_filter ip6_tables
x_tables ipv6 loop dm_mod snd_intel8x0 ehci_hcd snd_ac97_codec snd_ac97_bus
i2c_nforce2 snd_pcm snd_timer snd ohci_hcd soundcore usbcore i2c_core
snd_page_alloc ide_cd cdrom ohci1394 ieee1394 forcedeth ext3 jbd processor sg
sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core
CPU:    0
EIP:    0060:[<c014f450>]    Not tainted VLI
EFLAGS: 00013246   (2.6.17.11-default #1) 
EIP is at cache_alloc_debugcheck_after+0xe6/0x127
eax: 80000000   ebx: f78e0d34   ecx: 00200200   edx: 00000000
esi: dffff980   edi: f78e0d34   ebp: f6d278fc   esp: f6d278e8
ds: 007b   es: 007b   ss: 0068
Process X (pid: 3552, threadinfo=f6d26000 task=dfafea70)
Stack: c0114957 00000020 dffff980 00000020 00003246 f6d27918 c015024e f88dbbe5 
       f88dbbe5 f7a4fd88 00000680 f7a4fd88 f6d2793c c0235195 00000000 00000020 
       00000724 dfff6500 c2194000 dfa262c0 0000003d f6d27958 f88dbbe5 dfa26000 
Call Trace:
 <c0103ba3> show_stack_log_lvl+0x85/0x8f  <c0103d04> show_registers+0x11f/0x18b
 <c0103ec1> die+0x151/0x261  <c0288a5d> do_page_fault+0x41b/0x57e
 <c0103693> error_code+0x4f/0x54  <c015024e> __kmalloc_track_caller+0xaa/0xb7
 <c0235195> __alloc_skb+0x52/0xfa  <f88dbbe5> nv_alloc_rx+0x48/0x147 [forcedeth]
 <f88dc134> nv_nic_irq+0x84/0x191 [forcedeth]  <c013770b> handle_IRQ_event+0x27/0x52
 <c01377b0> __do_IRQ+0x7a/0xce  <c0104b09> do_IRQ+0x49/0x5c
 <c01035a6> common_interrupt+0x1a/0x20  <c0103f1a> die+0x1aa/0x261
 <c0287c52> do_trap+0x7c/0x96  <c01045a9> do_invalid_op+0x89/0x93
 <c0103693> error_code+0x4f/0x54  <c013c7cd> free_pages+0x2d/0x2f
 <c01631e3> poll_freewait+0x50/0x5c  <c0163a9f> do_select+0x4a3/0x4c6
 <c0163d8a> core_sys_select+0x2c8/0x2ed  <c01642fd> sys_select+0x93/0x166
 <c0102afb> sysenter_past_esp+0x54/0x75 
Code: ff ff c7 00 a5 c2 0f 17 8d 97 00 00 00 40 c1 ea 0c c1 e2 05 03 15 50 65 3e
c0 8b 02 f6 c4 40 74 03 8b 52 0c 8b 4a 1c 89 fb 31 d2 <2b> 59 0c 89 d8 f7 76 10
c7 44 81 1c fd ff ff ff 8b 5e 3c 03 be 
EIP: [<c014f450>] cache_alloc_debugcheck_after+0xe6/0x127 SS:ESP 0068:f6d278e8
 <0>Kernel panic - not syncing: Fatal exception in interrupt

Comment 1 Don Harter 2006-08-28 10:45:53 UTC

root:~>/usr/src/linux-2.6.17.11/scripts/ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

Linux c-69-137-115-121 2.6.17.11-default #1 Mon Aug 28 01:21:52 CDT 2006 i686
athlon i386 GNU/Linux

Gnu C                  4.1.0
Gnu make               3.80
binutils               2.16.91.0.5
util-linux             2.12r
mount                  2.12r
module-init-tools      3.2.2
e2fsprogs              1.38
jfsutils               1.1.10
reiserfsprogs          3.6.19
xfsprogs               2.7.11
quota-tools            3.13.
PPP                    2.4.3
isdn4k-utils           3.9
nfs-utils              1.0.7
Linux C Library        > libc.2.4
Dynamic linker (ldd)   2.4
Linux C++ Library      6.0.8
Procps                 3.2.6
Net-tools              1.60
Kbd                    1.12
Sh-utils               5.93
udev                   085
Modules Loaded         nls_iso8859_1 nls_cp437 vfat fat ipt_LOG xt_limit
xt_pkttype af_packet edd snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device
ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat ip_nat
iptable_filter ip6table_mangle ip_conntrack nfnetlink ip_tables ip6table_filter
ip6_tables x_tables ipv6 loop dm_mod snd_intel8x0 snd_ac97_codec snd_ac97_bus
snd_pcm snd_timer snd ehci_hcd soundcore snd_page_alloc ohci_hcd usbcore
i2c_nforce2 i2c_core ide_cd cdrom ohci1394 ieee1394 forcedeth ext3 jbd processor
sg sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core
root:~>

Comment 2 Don Harter 2006-08-28 10:47:40 UTC

root:~>cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 43
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
stepping        : 1
cpu MHz         : 2009.490
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext
3dnow up pni lahf_lm cmp_legacy ts fid vid ttp
bogomips        : 4022.72

Comment 3 Don Harter 2006-08-28 10:49:12 UTC

root:~>cat /proc/ioports
0000-001f : dma1
0020-0021 : pic1
0040-0043 : timer0
0050-0053 : timer1
0060-006f : keyboard
0070-0077 : rtc
0080-008f : dma page reg
00a0-00a1 : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : ide1
01f0-01f7 : ide0
0376-0376 : ide1
03c0-03df : vga+
03f6-03f6 : ide0
0970-0977 : 0000:00:07.0
  0970-0977 : sata_nv
09f0-09f7 : 0000:00:07.0
  09f0-09f7 : sata_nv
0b70-0b73 : 0000:00:07.0
  0b70-0b73 : sata_nv
0bf0-0bf3 : 0000:00:07.0
  0bf0-0bf3 : sata_nv
1c00-1c3f : 0000:00:01.1
  1c00-1c07 : nForce2_smbus
1c40-1c7f : 0000:00:01.1
  1c40-1c47 : nForce2_smbus
7000-7fff : PCI Bus #05
8000-8fff : PCI Bus #04
9000-9fff : PCI Bus #03
a000-afff : PCI Bus #02
b000-bfff : PCI Bus #01
  b400-b407 : 0000:01:0a.0
  b800-b807 : 0000:01:0a.0
  bc00-bc1f : 0000:01:0a.0
    bc00-bc07 : serial
    bc08-bc0f : serial
c800-c807 : 0000:00:0a.0
  c800-c807 : forcedeth
cc00-cc0f : 0000:00:07.0
  cc00-cc0f : sata_nv
e000-e00f : 0000:00:06.0
  e000-e007 : ide0
  e008-e00f : ide1
ec00-ecff : 0000:00:04.0
  ec00-ecff : NVidia CK804
f000-f0ff : 0000:00:04.0
  f000-f0ff : NVidia CK804
fc00-fc1f : 0000:00:01.1
root:~>cat /proc/iomem
00000000-0009efff : System RAM
  00000000-00000000 : Crash kernel
0009f000-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000cedff : Video ROM
000f0000-000fffff : System ROM
00100000-7fedffff : System RAM
  00100000-00289fbf : Kernel code
  00289fc0-0037cd6f : Kernel data
7fee0000-7fee2fff : ACPI Non-volatile Storage
7fee3000-7feeffff : ACPI Tables
7fef0000-7fefffff : reserved
d0000000-dfffffff : PCI Bus #05
  d0000000-dfffffff : 0000:05:00.0
e0000000-efffffff : reserved
fa000000-fcffffff : PCI Bus #05
  fa000000-faffffff : 0000:05:00.0
  fb000000-fbffffff : 0000:05:00.0
  fc000000-fc01ffff : 0000:05:00.0
fd800000-fd8fffff : PCI Bus #04
fd900000-fd9fffff : PCI Bus #04
fda00000-fdafffff : PCI Bus #03
fdb00000-fdbfffff : PCI Bus #03
fdc00000-fdcfffff : PCI Bus #02
fdd00000-fddfffff : PCI Bus #02
fde00000-fdefffff : PCI Bus #01
  fdef8000-fdefbfff : 0000:01:06.0
  fdeff000-fdeff7ff : 0000:01:06.0
    fdeff000-fdeff7ff : ohci1394
fdf00000-fdffffff : PCI Bus #01
fe02a000-fe02afff : 0000:00:0a.0
  fe02a000-fe02afff : forcedeth
fe02b000-fe02bfff : 0000:00:07.0
  fe02b000-fe02bfff : sata_nv
fe02d000-fe02dfff : 0000:00:04.0
  fe02d000-fe02dfff : NVidia CK804
fe02f000-fe02ffff : 0000:00:02.0
  fe02f000-fe02ffff : ohci_hcd
feb00000-feb000ff : 0000:00:02.1
  feb00000-feb000ff : ehci_hcd
fec00000-ffffffff : reserved
root:~>

Comment 4 Don Harter 2006-08-28 10:50:58 UTC

Created attachment 8887 [details]
System map

Comment 5 Don Harter 2006-08-28 10:54:09 UTC

Created attachment 8888 [details]
lspci -vvv

Comment 6 Don Harter 2006-08-28 10:55:13 UTC

root:~>cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: ST3160812AS      Rev: 3.AA
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: ST3160812AS      Rev: 3.AA
  Type:   Direct-Access                    ANSI SCSI revision: 05
root:~>

Comment 7 Andrew Morton 2006-08-28 13:08:21 UTC

Begin forwarded message:

Date: Mon, 28 Aug 2006 12:59:12 -0700 (PDT)
From: Linus Torvalds <torvalds@osdl.org>
To: Andrew Morton <akpm@osdl.org>
Subject: Re: Fw: [Bug 7066] New: __free_pages+0xc/0x34

On Mon, 28 Aug 2006, Andrew Morton wrote:
> 
> This is another in the ongoing once-a-month dribble of weird crashes coming
> out of the poll() code.  This time it looks like a free of a freed page.
> 
> There's enough noise coming out of that area to make me think we have some
> subtle bug in there.  Or maybe it's just that random driver
> lifetime-management bugs will sometimes manifest in a way which makes it
> look like the poll() code is to blame, dunno.

I really don't think the poll/select code is to blame - all the memory 
management there is really really simple, and it's all totally thread- 
private, so there aren't even any race conditions or locking issues.

The "memory management" is a poll_wqueues entry allocated on the thread 
kernel stack, and a simple linked list of pages. 

I suspect things show up in the poll() code not because the poll code 
itself is buggy, but because when you run X, poll() is one of the most 
common system calls, and of the common system calls it's the only one that 
does any real amount of page allocation (ie the other ones are 
gettimeofday() and things like sendmsg()/recvmsg(), and the former one 
obviously doesn't do any allocations at all, and the latter ones hide all 
their allocations through the slab layer).

In other words, if something corrupts the free-page list, poll() just ends 
up being the most likely thing to ever notice.

The part that may give more of a clue is the second oops, which gets a 
page fault on this part:

	#ifdef CONFIG_DEBUG_SLAB_LEAK
	        {
	                struct slab *slabp;
	                unsigned objnr;

	                slabp = page_get_slab(virt_to_page(objp));
	                objnr = (unsigned)(objp - slabp->s_mem) / cachep->buffer_size;
	                slab_bufctl(slabp)[objnr] = BUFCTL_ACTIVE;
	        }
	#endif

where if I read the oops right, it's the "objp - slabp->s_mem" that 
crashes because "slabp" is 0x00200200, ie the list poisoning thing.

But slabp was gotten through "page_get_slab()", which in turn is 
page->lru.prev.

And notice how we did NOT trigger the

	BUG_ON(!PageSlab(page));

in page_get_slab().

That looks strange and interesting. Somebody seems to have done a 
list_del(&page->lru.prev) on a page that is still marked as being a slab 
page.

And yeah, I bet it's related to the first oops somehow, but the thing is, 
that code-sequence should have _nothing_ to do with slab pages. And 
besides, if that sequence does a list_del(), I'd have expected it to do so 
_after_ it does the put_page_testzero() anyway (but I didn't go check).

		Linus

Comment 8 Linus Torvalds 2006-08-28 13:59:59 UTC

Don, can you confirm whether you can reproduce this bug at will or not? 
Has it happened only once, or can you basically depend on it happening 
under certain loads?

		Linus

Comment 9 Don Harter 2006-08-28 14:31:33 UTC

I had been having problems with earlier kernel versions and which were tainted.
 I decided to try at that time the latest kernel version linux-2.6.17.1 which
was tainted.  I got more crashes on tainted  linux-2.6.17.1 than I did on
tainted linux-2.6.16.21-0.13.
I have not been using untainted linux-2.6.17.11 very long.  I just compiled it
this morning.  It seems to be more stable than tainted linux-2.6.17.1 so far.  I
had problems with clock not keeping the correct time in smp tainted versions of
suse linux-2.6.16.21-0.13. So I switched to an default non smp config.  With
this untainted kernel that I have submitted my bug report on, my screensaver
does not change.
  Today I did a bios upgrade for my motherboard and that might fix some
problems.  I am jsut waiting for the next crash on this kernel.  I have a serial
cable to another linux box and it will hopefully capture it. If it is like
tainted linux-2.6.17.1, I shouldn't have to wait very long.

Comment 10 Don Harter 2006-08-28 14:35:33 UTC

Created attachment 8891 [details]
my .config file

Comment 11 Don Harter 2006-08-29 04:51:17 UTC

It has been over 12 hours and this hasn't crashed.  It seems to be more stable.
 I suspect that those smp changes made in earlier versions may have caused that.
 I am expecting lightning storms and need to shut down my computer.  I will see
if I can crash this system using my kmid suse bug like the other kernels.  That
may be an audio driver issue though.
If I compile a smp version and it crashes, would that apply to this bug report
or should I submit a separate one?

Comment 12 Don Harter 2006-08-29 05:08:34 UTC

Yes I can crash it with kmid.  I am not sure if it is related to this bug though.
Here is what I captured:
BUG: spinlock recursion on CPU#0, kmid/10249
 lock: c0342198, .magic: dead4ead, .owner: kmid/10249, .owner_cpu: 0
BUG: spinlock lockup on CPU#0, kmid/10249, c0342198
&#134;&#130;??!&#149;?

Comment 13 Don Harter 2006-09-09 09:43:57 UTC

I am back from vacation.  I have been running on this kernel for over 3 days now
and it hasn't crashed.  I will try to load it more or try a smp version on this
dual core processor.

Comment 14 Don Harter 2006-09-09 22:34:02 UTC

Created attachment 8979 [details]
2 new crashes in a log

I was just reading an email in thunderbird when it crashed.  I was interacting
with the mail window.

Comment 15 Linus Torvalds 2006-09-10 11:35:23 UTC

On Sat, 9 Sep 2006, bugme-daemon@bugzilla.kernel.org wrote:
> 
> I was just reading an email in thunderbird when it crashed.  I was interacting
> with the mail window.

Looks like the "sock->ops->sendmsg()" function pointer was corrupted (with 
a value of something like 0xa951937b, which doesn't look like any obvious 
string or other poisoning value, so there's no hint there).

The "sock->ops" values should normally be pointers to a compile-time 
static structure, so I suspect the "sock" structure itself got corrupted 
somehow. Which points to more slab corruption. Which makes this very hard 
to debug, since the corruption could have happened at any point.

Don, do you have slab debugging turned on in your tree? If you don't, 
turning it on might help us. It's "CONFIG_DEBUG_SLAB".

		Linus

Comment 16 Don Harter 2006-09-10 12:04:06 UTC

I have built an smp version.
I chose these options among others:
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set
CONFIG_OBSOLETE_INTERMODULE=m
If you would rather me use a non-smp version let me know.  I have had more
crashes though with smp kernels.
Others with my same motherboard have said that they do not have the problems
that I have had.  Could the problem be my add in cards or the drivers which
others may not have?

Comment 17 Don Harter 2006-09-11 14:47:55 UTC

I have built and am using a smp version.  It hasn't crashed yet.
However my clock on the kde panel is running fast.

Comment 18 Don Harter 2006-09-11 14:50:20 UTC

I had attached my kernel config file and I see that for the crash reports the
slab debug option was set.

Comment 19 Don Harter 2006-09-15 17:39:56 UTC

Created attachment 9028 [details]
strace of mount command

Not sure if this is related, but I cannot mount my other file system as a ext3
type.  When I issue the command it either locks up the terminal and does not
respond or segmentation faults.  I have attached an strace of where it just
locks up.  If I issue the same command as "ext2", it works.

Comment 20 Don Harter 2006-09-17 07:38:22 UTC

This crashed using smp kernel.  I was running the following command
"cp -R --preserve -f -x -t /mnt /"
I had a same size partition mounted on /mnt and was backing up my current
partition.  I will attach the boot up log and the information below.  I will
then attach my system.map for the snmp kernel.

CPU:    1
EIP:    0060:[<c02974ba>]    Not tainted VLI
EFLAGS: 00010086   (2.6.17.11-smp #1) 
EIP is at schedule+0x88a/0xa9e
eax: c241ce00   ebx: 00000001   ecx: 39393909   edx: c241ce00
esi: f68de9f0   edi: 00000000   ebp: d9777da0   esp: d9777d38
ds: 007b   es: 007b   ss: 0068
Process pdflush (pid: 3966, threadinfo=d9776000 task=dfe13560)
Stack: e5592810 dfb16420 9775360d 00000190 d9777d5c 0000000a dfe13668 dfe13560 
       df41da90 c241d760 97942cf2 00000190 001ef6e5 00000001 00000000 00000002 
       f68c0bc4 0000000c 00000001 00000000 dfab0d74 00000000 d9777d98 c241d760 
Call Trace:
 <c0104d2a> show_stack_log_lvl+0x85/0x8f  <c0104ea7> show_registers+0x13b/0x1af
 <c0105090> die+0x175/0x285  <c029a248> do_page_fault+0x428/0x58e
 <c010485b> error_code+0x4f/0x54  <c0297884> io_schedule+0x26/0x30
 <c0161d97> sync_buffer+0x33/0x37  <c0297fa0> __wait_on_bit+0x36/0x5d
 <c0298030> out_of_line_wait_on_bit+0x69/0x71  <c0161cef> __wait_on_buffer+0x1f/0x25
 <c0162df5> __bread+0x92/0xa8  <c019b931> ext2_get_inode+0x94/0xef
 <c019b9c5> ext2_update_inode+0x39/0x2b5  <c019bc49> ext2_write_inode+0x8/0xa
 <c017df89> __writeback_single_inode+0x1ba/0x310  <c017e367>
sync_sb_inodes+0x1a2/0x268
 <c017e8a6> writeback_inodes+0x8e/0xe0  <c0149111> background_writeout+0x5e/0x87
 <c0149763> pdflush+0xf2/0x189  <c0130520> kthread+0xa3/0xd0
 <c0102005> kernel_thread_helper+0x5/0xb 
Code: 8b 58 10 b8 80 28 3d c0 74 7f f0 0f b3 9e 78 01 00 00 89 c2 03 14 9d 80 ef
39 c0 c7 42 04 01 00 00 00 03 04 9d 80 ef 39 c0 89 08 <f0> 0f ab 99 78 01 00 00
8b 41 24 05 00 00 00 40 0f 22 d8 8b 81 
EIP: [<c02974ba>] schedule+0x88a/0xa9e SS:ESP 0068:d9777d38
 <0>BUG: spinlock recursion on CPU#1, pdflush/3966
BUG: spinlock lockup on CPU#0, swapper/0, c241d760
BUG: spinlock lockup on CPU#1, pdflush/3966, c0346ec0
BUG: spinlock lockup on CPU#0, swapper/0, c0346ec0
BUG: spinlock lockup on CPU#1, pdflush/3966, c0346ec0
BUG: spinlock lockup on CPU#0, swapper/0, c0346ec0
BUG: spinlock lockup on CPU#1, pdflush/3966, c0346ec0
BUG: spinlock lockup on CPU#0, swapper/0, c0346ec0
BUG: spinlock lockup on CPU#1, pdflush/3966, c0346ec0
BUG: spinlock lockup on CPU#0, swapper/0, c0346ec0
BUG: spinlock lockup on CPU#1, pdflush/3966, c0346ec0
BUG: spinlock lockup on CPU#0, swapper/0, c0346ec0
BUG: spinlock lockup on CPU#1, pdflush/3966, c0346ec0
BUG: spinlock lockup on CPU#0, swapper/0, c0346ec0
BUG: spinlock lockup on CPU#1, pdflush/3966, c0346ec0
BUG: spinlock lockup on CPU#0, swapper/0, c0346ec0
BUG: spinlock lockup on CPU#1, pdflush/3966, c0346ec0
BUG: spinlock lockup on CPU#0, swapper/0, c0346ec0
BUG: spinlock lockup on CPU#1, pdflush/3966, c0346ec0
BUG: spinlock lockup on CPU#0, swapper/0, c0346ec0
BUG: spinlock lockup on CPU#1, pdflush/3966, c0346ec0

Comment 21 Don Harter 2006-09-17 07:39:27 UTC

Created attachment 9033 [details]
bootup and log of the crash

Comment 22 Don Harter 2006-09-17 07:40:53 UTC

Created attachment 9034 [details]
system map for the smp kernel compile

Comment 23 Don Harter 2006-09-17 07:44:05 UTC

Created attachment 9035 [details]
my smp .config file

Comment 24 Linus Torvalds 2006-09-17 09:48:09 UTC


On Sun, 17 Sep 2006, bugme-daemon@bugzilla.kernel.org wrote:
>
> EIP:    0060:[<c02974ba>]    Not tainted VLI
> EIP is at schedule+0x88a/0xa9e
> eax: c241ce00   ebx: 00000001   ecx: 39393909   edx: c241ce00
> esi: f68de9f0   edi: 00000000   ebp: d9777da0   esp: d9777d38
> Code: 8b 58 10 b8 80 28 3d c0 74 7f f0 0f b3 9e 78 01 00 00 89 c2 03 14 9d 80 ef
> 39 c0 c7 42 04 01 00 00 00 03 04 9d 80 ef 39 c0 89 08 <f0> 0f ab 99 78 01 00 00
> 8b 41 24 05 00 00 00 40 0f 22 d8 8b 81 

That would be at the instruction sequence

	lock bts %ebx,%ds:0x178(%ecx)
	mov    0x24(%ecx),%eax
	add    $0x40000000,%eax
	mov    %eax,%cr3

which is the TLB invalidate, where "ecx" is 39393909, which looks like the 
string "\t999". It _should_ be this code (from switch_mm() in 
asm-i386/mmu_context.h):

                cpu_set(cpu, next->cpu_vm_mask);

                /* Re-load page tables */
                load_cr3(next->pgd);

so it's "next->cpu_vm_mask" that is corrupt.

		Linus

Comment 25 Don Harter 2006-09-28 16:48:13 UTC

I crashed this smp kernel again.  I just ran the command dd if=/dev/sda2
of=/dev/sdb2.
Here is what happened.
BUG: spinlock lockup on CPU#0, klogd/2530, c241d760

I was planning on trying the latest kernel but needed to run this command first.
 My main partition is almost full.

Comment 26 Don Harter 2006-09-29 02:30:48 UTC

I ran this same dd command with 2.6.18 and it did not crash.

Comment 27 Don Harter 2006-09-29 03:54:56 UTC

2.6.18 still has a bug with kmid. this may be a different bug and may be kmids fault


=======================================================
[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
kmid/4918 is trying to acquire lock:
 (&timer->lock){++..}, at: [<f89cee75>] snd_timer_interrupt+0x21/0x241 [snd_timer]

but task is already holding lock:
 (rtc_task_lock){+...}, at: [<c02275f6>] rtc_interrupt+0x96/0xe3

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (rtc_task_lock){+...}:
       [<c0136cde>] lock_acquire+0x60/0x80
       [<c02b07cf>] _spin_lock_irqsave+0x22/0x32
       [<c0227175>] rtc_control+0x2e/0x66
       [<f8b1c087>] rtctimer_start+0x45/0x5a [snd_rtctimer]
       [<f89ce0e7>] snd_timer_start1+0x67/0x78 [snd_timer]
       [<f89cec7f>] snd_timer_start+0x54/0x82 [snd_timer]
       [<f8ad0d2c>] snd_seq_timer_start+0x33/0x4a [snd_seq]
       [<f8acfcdb>] snd_seq_control_queue+0xb6/0x13e [snd_seq]
       [<f8ad10d2>] event_input_timer+0xe/0x10 [snd_seq]
       [<f8acce4e>] snd_seq_deliver_single_event+0xdd/0x1cc [snd_seq]
       [<f8acd0b3>] snd_seq_deliver_event+0x176/0x184 [snd_seq]
       [<f8acd4e9>] snd_seq_dispatch_event+0x10f/0x127 [snd_seq]
       [<f8acf6ef>] snd_seq_check_queue+0x9d/0xd6 [snd_seq]
       [<f8acf9b0>] snd_seq_enqueue_event+0xce/0xdf [snd_seq]
       [<f8acd173>] snd_seq_client_enqueue_event+0xb2/0xdc [snd_seq]
       [<f8ace80e>] snd_seq_write+0x129/0x16c [snd_seq]
       [<c01694d3>] vfs_write+0xab/0x157
       [<c0169b16>] sys_write+0x3b/0x60
       [<c0103db9>] sysenter_past_esp+0x56/0x8d

-> #0 (&timer->lock){++..}:
       [<c0136cde>] lock_acquire+0x60/0x80
       [<c02b07cf>] _spin_lock_irqsave+0x22/0x32
       [<f89cee75>] snd_timer_interrupt+0x21/0x241 [snd_timer]
       [<f8b1c0f0>] rtctimer_interrupt+0xd/0x11 [snd_rtctimer]
       [<c0227605>] rtc_interrupt+0xa5/0xe3
       [<c014b428>] handle_IRQ_event+0x20/0x4d
       [<c014b4e9>] __do_IRQ+0x94/0xef
       [<c010627d>] do_IRQ+0x71/0x84
       [<c01048d9>] common_interrupt+0x25/0x2c

other info that might help us debug this:

1 lock held by kmid/4918:
 #0:  (rtc_task_lock){+...}, at: [<c02275f6>] rtc_interrupt+0x96/0xe3

stack backtrace:
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c013604c>] print_circular_bug_tail+0x59/0x64
 [<c0136882>] __lock_acquire+0x82b/0x9b6
 [<c0136cde>] lock_acquire+0x60/0x80
 [<c02b07cf>] _spin_lock_irqsave+0x22/0x32
 [<f89cee75>] snd_timer_interrupt+0x21/0x241 [snd_timer]
 [<f8b1c0f0>] rtctimer_interrupt+0xd/0x11 [snd_rtctimer]
 [<c0227605>] rtc_interrupt+0xa5/0xe3
 [<c014b428>] handle_IRQ_event+0x20/0x4d
 [<c014b4e9>] __do_IRQ+0x94/0xef
 [<c010627d>] do_IRQ+0x71/0x84
 [<c01048d9>] common_interrupt+0x25/0x2c
DWARF2 unwinder stuck at common_interrupt+0x25/0x2c
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c013604c>] print_circular_bug_tail+0x59/0x64
 [<c0136882>] __lock_acquire+0x82b/0x9b6
 [<c0136cde>] lock_acquire+0x60/0x80
 [<c02b07cf>] _spin_lock_irqsave+0x22/0x32
 [<f89cee75>] snd_timer_interrupt+0x21/0x241 [snd_timer]
 [<f8b1c0f0>] rtctimer_interrupt+0xd/0x11 [snd_rtctimer]
 [<c0227605>] rtc_interrupt+0xa5/0xe3
 [<c014b428>] handle_IRQ_event+0x20/0x4d
 [<c014b4e9>] __do_IRQ+0x94/0xef
 [<c010627d>] do_IRQ+0x71/0x84
 [<c01048d9>] common_interrupt+0x25/0x2c
BUG: spinlock lockup on CPU#0, kmid/4947, c0321338
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c01d82d4>] _raw_spin_lock+0xc2/0xe8
 [<c02b07d6>] _spin_lock_irqsave+0x29/0x32
 [<c0227175>] rtc_control+0x2e/0x66
 [<f8b1c03e>] rtctimer_stop+0x3e/0x42 [snd_rtctimer]
 [<f89ceb71>] _snd_timer_stop+0xd0/0x137 [snd_timer]
 [<f89cebe7>] snd_timer_pause+0xf/0x11 [snd_timer]
 [<f8ad0a19>] snd_seq_timer_stop+0x20/0x26 [snd_seq]
 [<f8ad0d0f>] snd_seq_timer_start+0x16/0x4a [snd_seq]
 [<f8acfcdb>] snd_seq_control_queue+0xb6/0x13e [snd_seq]
 [<f8ad10d2>] event_input_timer+0xe/0x10 [snd_seq]
 [<f8acce4e>] snd_seq_deliver_single_event+0xdd/0x1cc [snd_seq]
 [<f8acd0b3>] snd_seq_deliver_event+0x176/0x184 [snd_seq]
 [<f8acd119>] snd_seq_client_enqueue_event+0x58/0xdc [snd_seq]
 [<f8ace80e>] snd_seq_write+0x129/0x16c [snd_seq]
 [<c01694d3>] vfs_write+0xab/0x157
 [<c0169b16>] sys_write+0x3b/0x60
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c01d82d4>] _raw_spin_lock+0xc2/0xe8
 [<c02b07d6>] _spin_lock_irqsave+0x29/0x32
 [<c0227175>] rtc_control+0x2e/0x66
 [<f8b1c03e>] rtctimer_stop+0x3e/0x42 [snd_rtctimer]
 [<f89ceb71>] _snd_timer_stop+0xd0/0x137 [snd_timer]
 [<f89cebe7>] snd_timer_pause+0xf/0x11 [snd_timer]
 [<f8ad0a19>] snd_seq_timer_stop+0x20/0x26 [snd_seq]
 [<f8ad0d0f>] snd_seq_timer_start+0x16/0x4a [snd_seq]
 [<f8acfcdb>] snd_seq_control_queue+0xb6/0x13e [snd_seq]
 [<f8ad10d2>] event_input_timer+0xe/0x10 [snd_seq]
 [<f8acce4e>] snd_seq_deliver_single_event+0xdd/0x1cc [snd_seq]
 [<f8acd0b3>] snd_seq_deliver_event+0x176/0x184 [snd_seq]
 [<f8acd119>] snd_seq_client_enqueue_event+0x58/0xdc [snd_seq]
 [<f8ace80e>] snd_seq_write+0x129/0x16c [snd_seq]
 [<c01694d3>] vfs_write+0xab/0x157
 [<c0169b16>] sys_write+0x3b/0x60
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: spinlock lockup on CPU#1, kmid/4918, dfbb263c
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c01d82d4>] _raw_spin_lock+0xc2/0xe8
 [<c02b07d6>] _spin_lock_irqsave+0x29/0x32
 [<f89cee75>] snd_timer_interrupt+0x21/0x241 [snd_timer]
 [<f8b1c0f0>] rtctimer_interrupt+0xd/0x11 [snd_rtctimer]
 [<c0227605>] rtc_interrupt+0xa5/0xe3
 [<c014b428>] handle_IRQ_event+0x20/0x4d
 [<c014b4e9>] __do_IRQ+0x94/0xef
 [<c010627d>] do_IRQ+0x71/0x84
 [<c01048d9>] common_interrupt+0x25/0x2c
DWARF2 unwinder stuck at common_interrupt+0x25/0x2c
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c01d82d4>] _raw_spin_lock+0xc2/0xe8
 [<c02b07d6>] _spin_lock_irqsave+0x29/0x32
 [<f89cee75>] snd_timer_interrupt+0x21/0x241 [snd_timer]
 [<f8b1c0f0>] rtctimer_interrupt+0xd/0x11 [snd_rtctimer]
 [<c0227605>] rtc_interrupt+0xa5/0xe3
 [<c014b428>] handle_IRQ_event+0x20/0x4d
 [<c014b4e9>] __do_IRQ+0x94/0xef
 [<c010627d>] do_IRQ+0x71/0x84
 [<c01048d9>] common_interrupt+0x25/0x2c

Comment 28 Takashi Iwai 2006-09-29 08:46:29 UTC

A simple workaround for lockdep of rtctimer is to disable CONFIG_SND_RTCTIMER,
or pass seq_default_timer_device=0 to snd-seq module.  This will use a system
timer instead of RTC.

But, of course, this AB<->BA lock must be fixed.  I'll check it now...

Comment 29 Takashi Iwai 2006-09-29 10:01:14 UTC

A simple solution is to remove the superfluous spinlock in rtc.c.
Since snd-rtctimer is the only user of this rtc in-kernel-control stuff and it
has own lock from the caller side (this triggered lockdep BUG), we can simply
remove rtc_task_lock in rtc.c.

I'll attach the patch.  Don, could you check whether kmid works with it?

Comment 30 Takashi Iwai 2006-09-29 10:03:24 UTC

Created attachment 9126 [details]
Remove superfluous lock from rtc.c

Comment 31 Linus Torvalds 2006-09-29 10:19:24 UTC

[ Peter Zijlistra added to Cc just because he's touched that area too 
  lately. Peter, you probably don't care, but it's a independent problem 
  to the one that has been getting tracked throught

	http://bugzilla.kernel.org/show_bug.cgi?id=7066

  and I thought I'd at least mention it to you ]

On Fri, 29 Sep 2006, Takashi Iwai wrote:
>
> A simple solution is to remove the superfluous spinlock in rtc.c.
> Since snd-rtctimer is the only user of this rtc in-kernel-control stuff and it
> has own lock from the caller side (this triggered lockdep BUG), we can simply
> remove rtc_task_lock in rtc.c.
> 
> I'll attach the patch.  Don, could you check whether kmid works with it?

I worry about this a bit - it sounds to me that the lock _should_ be in 
the rtc layer, and maybe it's the other lock that should be removed?

But you're the one who added that RTC lock in the first place, so ...

(Well, it's attributed to Jaroslav, but in the log-message he says it came 
from you - this was before sign-offs, and before git).

Who else really has worked on this thing? It seems a bit broken that 
there's this interface in the RTC code that is apparently only used by the 
sound subsystem, and seems to be badly designed anyway (and where you can 
apparently even disable the use of it).

This all makes me go "Hmm, that can't be very good.."

		Linus

Comment 32 Takashi Iwai 2006-09-29 11:02:28 UTC

We can unlock temporarily in timer->lock in sound/core/rtctimer.c, too.  But the
double-lock of rtc_task_lock might still happen due to a code path like
rtc_interrupt -> sound -> rtc_control().

Your concern is right.  There is one little non-safe case by the complete
removal of rtc_task_lock.  We have no protection for rtc_callback instance being
used in rtc_interrupt().

So, a safer option is to remove the lock in rtc_control().  This is just for
sanity-check and plays no real role.  This should suffice for avoiding the AB/BA
lock.  The new patch is attached again.

And, yes, the interface design is my bad, indeed.  It was written to be generic,
but so far the sound is the only user.  (RTC was the only usable major
fine-timer-source on PC at that time :)  It can get rid when we use more finer
timer-subsystem...

Comment 33 Takashi Iwai 2006-09-29 11:03:52 UTC

Created attachment 9127 [details]
Fix ABBA deadlock in rtc and sound timer

Comment 34 Don Harter 2006-09-29 15:35:41 UTC

Created attachment 9129 [details]
crash dump

This generates debug information with the patch but does not lock up the
computer.
I tried and could not lock it up by moving the left volume bar up and down.  I
don't hear any sound even after playing with a mixer and an alsa mixer.  I can
play mp3s with mpg321.

Comment 35 Don Harter 2006-09-29 15:37:28 UTC

Created attachment 9130 [details]
system map for the patched kernel

Comment 36 Don Harter 2006-10-02 05:09:13 UTC

Since I have tried his patch my KDE clock is keeping the correct time. Before it
would not keep the correct time even with ntp.
I can also set my keyboard to repeat keys now.  Before I had to turn that off.

Comment 37 Don Harter 2006-10-02 15:17:20 UTC

Created attachment 9147 [details]
another crash- syslog

Here is data from another crash.  I found out my console log on the other
computer was not turned on so this is from syslog on the crashed machine.

Comment 38 Don Harter 2006-10-11 04:13:54 UTC

Oct  9 19:19:56 c-69-137-114-21 kernel:
=============================================
Oct  9 19:19:56 c-69-137-114-21 kernel: [ INFO: possible recursive locking
detected ]
Oct  9 19:19:56 c-69-137-114-21 kernel:
---------------------------------------------
Oct  9 19:19:56 c-69-137-114-21 kernel: java_vm/11309 is trying to acquire lock:
Oct  9 19:19:56 c-69-137-114-21 kernel:  (slock-AF_INET6){-+..}, at:
[<c0257c3f>] sk_clone+0xb6/0x27e
Oct  9 19:19:56 c-69-137-114-21 kernel:
Oct  9 19:19:56 c-69-137-114-21 kernel: but task is already holding lock:
Oct  9 19:19:56 c-69-137-114-21 kernel:  (slock-AF_INET6){-+..}, at:
[<f8f00d7d>] tcp_v6_rcv+0x318/0x717 [ipv6]
Oct  9 19:19:56 c-69-137-114-21 kernel:
Oct  9 19:19:56 c-69-137-114-21 kernel: other info that might help us debug this:
Oct  9 19:19:56 c-69-137-114-21 kernel: 1 lock held by java_vm/11309:
Oct  9 19:19:56 c-69-137-114-21 kernel:  #0:  (slock-AF_INET6){-+..}, at:
[<f8f00d7d>] tcp_v6_rcv+0x318/0x717 [ipv6]
Oct  9 19:19:56 c-69-137-114-21 kernel:


ct  9 19:19:56 c-69-137-114-21 kernel:
Oct  9 19:19:56 c-69-137-114-21 kernel: stack backtrace:
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c0104eda>] show_trace_log_lvl+0x58/0x16a
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c01054d7>] show_trace+0xd/0x10
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c01055f6>] dump_stack+0x19/0x1b
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c01367ea>] __lock_acquire+0x793/0x9b6
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c0136cde>] lock_acquire+0x60/0x80
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c02b0473>] _spin_lock+0x19/0x28
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c0257c3f>] sk_clone+0xb6/0x27e
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c027ef35>] inet_csk_clone+0xd/0x5e
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c028f7fd>]
tcp_create_openreq_child+0x1b/0x382
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<f8eff9fa>]
tcp_v6_syn_recv_sock+0x248/0x575 [ipv6]
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c028fd34>] tcp_check_req+0x1d0/0x2e4
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<f8efed23>] tcp_v6_do_rcv+0x142/0x356
[ipv6]
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<f8f0112a>] tcp_v6_rcv+0x6c5/0x717 [ipv6]
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<f8ee7a03>] ip6_input+0x1c3/0x296 [ipv6]
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<f8ee7f73>] ipv6_rcv+0x1d2/0x21f [ipv6]
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c025dbd8>] netif_receive_skb+0x2c6/0x34a
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c025f572>] process_backlog+0x99/0x114
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c025f76f>] net_rx_action+0x9d/0x162
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c01245e0>] __do_softirq+0x78/0xf2
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c0124698>] do_softirq+0x3e/0x56
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c012494a>] local_bh_enable_ip+0xa3/0xc9
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c02b0411>] _spin_unlock_bh+0x25/0x28
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c02569fb>] release_sock+0xb0/0xb8
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c0297dd8>]
inet_stream_connect+0x129/0x21c
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c025593a>] sys_connect+0x67/0x84
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c0255fac>] sys_socketcall+0x8c/0x186
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c0103db9>] sysenter_past_esp+0x56/0x8d
Oct  9 19:19:56 c-69-137-114-21 kernel: DWARF2 unwinder stuck at
sysenter_past_esp+0x56/0x8d
Oct  9 19:19:56 c-69-137-114-21 kernel: Leftover inexact backtrace:
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c01054d7>] show_trace+0xd/0x10
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c01055f6>] dump_stack+0x19/0x1b
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c01367ea>] __lock_acquire+0x793/0x9b6
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c0136cde>] lock_acquire+0x60/0x80
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c02b0473>] _spin_lock+0x19/0x28
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c0257c3f>] sk_clone+0xb6/0x27e
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c027ef35>] inet_csk_clone+0xd/0x5e
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c028f7fd>]
tcp_create_openreq_child+0x1b/0x382
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<f8eff9fa>]
tcp_v6_syn_recv_sock+0x248/0x575 [ipv6]
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c028fd34>] tcp_check_req+0x1d0/0x2e4
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<f8efed23>] tcp_v6_do_rcv+0x142/0x356
[ipv6]
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<f8f0112a>] tcp_v6_rcv+0x6c5/0x717 [ipv6]
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<f8ee7a03>] ip6_input+0x1c3/0x296 [ipv6]

Oct  9 19:19:56 c-69-137-114-21 kernel:  [<f8ee7f73>] ipv6_rcv+0x1d2/0x21f [ipv6]
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c025dbd8>] netif_receive_skb+0x2c6/0x34a
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c025f572>] process_backlog+0x99/0x114
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c025f76f>] net_rx_action+0x9d/0x162
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c01245e0>] __do_softirq+0x78/0xf2
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c0124698>] do_softirq+0x3e/0x56
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c012494a>] local_bh_enable_ip+0xa3/0xc9
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c02b0411>] _spin_unlock_bh+0x25/0x28
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c02569fb>] release_sock+0xb0/0xb8
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c0297dd8>]
inet_stream_connect+0x129/0x21c
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c025593a>] sys_connect+0x67/0x84
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c0255fac>] sys_socketcall+0x8c/0x186
Oct  9 19:19:56 c-69-137-114-21 kernel:  [<c0103db9>] sysenter_past_esp+0x56/0x8d
Oct  9 19:19:56 c-69-137-114-21 kernel: kobject vcs10: registering. parent: vc,
set: class_obj
Oct  9 19:19:56 c-69-137-114-21 kernel: kobject_uevent
Oct  9 19:19:56 c-69-137-114-21 kernel: fill_kobj_path: path = '/class/vc/vcs10'
Oct  9 19:19:56 c-69-137-114-21 kernel: kobject vcsa10: registering. parent: vc,
set: class_obj
Oct  9 19:19:56 c-69-137-114-21 kernel: kobject_uevent
Oct  9 19:19:56 c-69-137-114-21 kernel: fill_kobj_path: path = '/class/vc/vcsa10'
Oct  9 19:31:00 c-69-137-114-21 syslog-ng[2601]: STATS: dropped 0
Oct  9 20:31:00 c-69-137-114-21 syslog-ng[2601]: STATS: dropped 0

Comment 39 Don Harter 2006-10-11 04:18:58 UTC

I am not sure when this error was made. Perhaps in an older kernel or maybe not.
fsck 1.38 (30-Jun-2005)
[/bin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/sda1 
/dev/sda1 has gone 60 days without being checked, check forced.
/dev/sda1: Duplicate or bad block in use!
/dev/sda1: Multiply-claimed block(s) in inode 9601216: 16777225
/dev/sda1: (There are 1 inodes containing multiply-claimed blocks.)

Comment 40 Don Harter 2006-10-27 03:32:29 UTC

I have been having problems compiling a src rpm.  My system has been up for
several days.  It fails do to a segmentation fault.  When I change to that
directory and rerun the command manualy, it compiles OK without errors.  I just
did a memtest overnight and it passed with 0 errors.  Then I noticed these
messages in the log.

=============================================
[ INFO: possible recursive locking detected ]
---------------------------------------------
java_vm/6897 is trying to acquire lock:
 (slock-AF_INET6){-+..}, at: [<c0257c3f>] sk_clone+0xb6/0x27e

but task is already holding lock:
 (slock-AF_INET6){-+..}, at: [<f8f1ad7d>] tcp_v6_rcv+0x318/0x717 [ipv6]

other info that might help us debug this:
1 lock held by java_vm/6897:
 #0:  (slock-AF_INET6){-+..}, at: [<f8f1ad7d>] tcp_v6_rcv+0x318/0x717 [ipv6]

stack backtrace:
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c01367ea>] __lock_acquire+0x793/0x9b6
 [<c0136cde>] lock_acquire+0x60/0x80
 [<c02b0473>] _spin_lock+0x19/0x28
 [<c0257c3f>] sk_clone+0xb6/0x27e
 [<c027ef35>] inet_csk_clone+0xd/0x5e
 [<c028f7fd>] tcp_create_openreq_child+0x1b/0x382
 [<f8f199fa>] tcp_v6_syn_recv_sock+0x248/0x575 [ipv6]
 [<c028fd34>] tcp_check_req+0x1d0/0x2e4
 [<f8f18d23>] tcp_v6_do_rcv+0x142/0x356 [ipv6]
 [<f8f1b12a>] tcp_v6_rcv+0x6c5/0x717 [ipv6]
 [<f8f01a03>] ip6_input+0x1c3/0x296 [ipv6]
 [<f8f01f73>] ipv6_rcv+0x1d2/0x21f [ipv6]
 [<c025dbd8>] netif_receive_skb+0x2c6/0x34a
 [<c025f572>] process_backlog+0x99/0x114
 [<c025f76f>] net_rx_action+0x9d/0x162
 [<c01245e0>] __do_softirq+0x78/0xf2
 [<c0124698>] do_softirq+0x3e/0x56
[<c012494a>] local_bh_enable_ip+0xa3/0xc9
 [<c02b0411>] _spin_unlock_bh+0x25/0x28
 [<c02569fb>] release_sock+0xb0/0xb8
 [<c0297dd8>] inet_stream_connect+0x129/0x21c
 [<c025593a>] sys_connect+0x67/0x84
 [<c0255fac>] sys_socketcall+0x8c/0x186
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c01367ea>] __lock_acquire+0x793/0x9b6
 [<c0136cde>] lock_acquire+0x60/0x80
 [<c02b0473>] _spin_lock+0x19/0x28
 [<c0257c3f>] sk_clone+0xb6/0x27e
 [<c027ef35>] inet_csk_clone+0xd/0x5e
 [<c028f7fd>] tcp_create_openreq_child+0x1b/0x382
 [<f8f199fa>] tcp_v6_syn_recv_sock+0x248/0x575 [ipv6]
 [<c028fd34>] tcp_check_req+0x1d0/0x2e4
 [<f8f18d23>] tcp_v6_do_rcv+0x142/0x356 [ipv6]
 [<f8f1b12a>] tcp_v6_rcv+0x6c5/0x717 [ipv6]
 [<f8f01a03>] ip6_input+0x1c3/0x296 [ipv6]
 [<f8f01f73>] ipv6_rcv+0x1d2/0x21f [ipv6]
 [<c025dbd8>] netif_receive_skb+0x2c6/0x34a
 [<c025f572>] process_backlog+0x99/0x114
 [<c025f76f>] net_rx_action+0x9d/0x162
 [<c01245e0>] __do_softirq+0x78/0xf2
 [<c0124698>] do_softirq+0x3e/0x56
 [<c012494a>] local_bh_enable_ip+0xa3/0xc9
 [<c02b0411>] _spin_unlock_bh+0x25/0x28
 [<c02569fb>] release_sock+0xb0/0xb8
 [<c0297dd8>] inet_stream_connect+0x129/0x21c
 [<c025593a>] sys_connect+0x67/0x84
 [<c0255fac>] sys_socketcall+0x8c/0x186
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
SFW2-INext-DROP-DEFLT IN=eth0 OUT= MAC=00:50:8d:db:39:3d:00:01:5c:22:76:02:08:00
SRC=60.11.125.41 DST=69.137.114.21 LEN=928 TOS=0x00 PREC=0x20 TTL=43 ID=0 DF
PROTO=UDP SPT=57695 DPT=1026 LEN=908 
SFW2-INext-DROP-DEFLT IN=eth0 OUT= MAC=00:50:8d:db:39:3d:00:01:5c:22:76:02:08:00
SRC=60.11.125.41 DST=69.137.114.21 LEN=928 TOS=0x00 PREC=0x20 TTL=43 ID=0 DF
PROTO=UDP SPT=57695 DPT=1027 LEN=908 
kobject_uevent
fill_kobj_path: path = '/class/vc/vcs7'
kobject vcs7: cleaning up
kobject_uevent
fill_kobj_path: path = '/class/vc/vcsa7'
kobject vcsa7: cleaning up
INIT: Switching to runINIT: kobject_uevent
Sending processefill_kobj_path: path = '/class/vc/vcs3'
s the TERM signal
kobject vcs3: cleaning up
kobject_uevent
fill_kobj_path: path = '/class/vc/vcsa3'
kobject vcsa3: cleaning up
kobject_uevent
fill_kobj_path: path = '/class/vc/vcs2'
kobject vcs2: cleaning up
kobject_uevent
fill_kobj_path: path = '/class/vc/vcsa2'
kobject vcsa2: cleaning up
kobject_uevent
fill_kobj_path: path = '/class/vc/vcs5'
kobject vcs5: cleaning up
kobject_uevent
fill_kobj_path: path = '/class/vc/vcsa5'
kobject vcsa5: cleaning up
kobject_uevent
fill_kobj_path: path = '/class/vc/vcs6'
kobject vcs6: cleaning up
kobject_uevent
fill_kobj_path: path = '/class/vc/vcsa6'
kobject vcsa6: cleaning up
kobject_uevent
fill_kobj_path: path = '/class/vc/vcs4'
kobject vcs4: cleaning up
kobject_uevent
fill_kobj_path: path = '/class/vc/vcsa4'
kobject vcsa4: cleaning up

Comment 41 Don Harter 2006-10-28 10:00:53 UTC

This latest crash was caused by just untaring a large source file:
Oops: 0002 [#1]
SMP 
Modules linked in: ipt_LOG xt_limit xt_pkttype af_packet cpufreq_ondemand
cpufreq_userspace cpufreq_powersave powernow_k8 freq_table edd snd_pcm_oss
snd_mixer_oss snd_seq snd_seq_device asus_acpi button battery ac ip6t_REJECT
xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat ip_nat iptable_filter
ip6table_mangle ip_conntrack nfnetlink ip_tables ip6table_filter ip6_tables
x_tables ipv6 loop dm_mod snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_nforce2 ehci_hcd ohci_hcd i2c_core
usbcore ohci1394 ieee1394 ide_cd cdrom forcedeth ext3 jbd fan thermal processor
sg sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core
CPU:    1
EIP:    0060:[<c017e0e0>]    Not tainted VLI
EFLAGS: 00010206   (2.6.18-smp #2) 
EIP is at prune_one_dentry+0x23/0x79
eax: 00207400   ebx: d7f4f07c   ecx: d7f4f0a4   edx: c2aa2fb4
esi: d7f4f07c   edi: d7f4f084   ebp: dfad7ee8   esp: dfad7ee0
ds: 007b   es: 007b   ss: 0068
Process kswapd0 (pid: 213, ti=dfad6000 task=dff0cae0 task.ti=dfad6000)
Stack: dfa769d0 d7f4f07c dfad7f04 c017e2ff 00000000 00000053 00027678 dfff2404 
       000000a8 dfad7f0c c017e34f dfad7f3c c0154b5a 01d8da00 00000000 01d8da00 
       0006d934 000000d0 00000180 00000000 00000060 c030ea80 c0312200 dfad7fc8 
Call Trace:
 [<c017e2ff>] prune_dcache+0xfc/0x133
 [<c017e34f>] shrink_dcache_memory+0x19/0x31
 [<c0154b5a>] shrink_slab+0xd0/0x137
 [<c0154f33>] kswapd+0x2d2/0x3a8
 [<c01310d7>] kthread+0xc3/0xf0
 [<c0102005>] kernel_thread_helper+0x5/0xb
DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
Leftover inexact backtrace:
 [<c0105076>] show_stack_log_lvl+0x8a/0x95
 [<c01051ad>] show_registers+0x12c/0x199
 [<c010539b>] die+0x181/0x284
 [<c02b190b>] do_page_fault+0x3e5/0x4ad
 [<c0104a71>] error_code+0x39/0x40
 [<c017e2ff>] prune_dcache+0xfc/0x133
 [<c017e34f>] shrink_dcache_memory+0x19/0x31
 [<c0154b5a>] shrink_slab+0xd0/0x137
 [<c0154f33>] kswapd+0x2d2/0x3a8
 [<c01310d7>] kthread+0xc3/0xf0
 [<c0102005>] kernel_thread_helper+0x5/0xb
Code: fe ff ff 89 d8 5b 5d c3 55 89 e5 56 53 89 c3 8b 40 04 a8 10 75 1f 83 c8 10
89 43 04 8d 4b 28 8b 43 28 8b 51 04 85 c0 89 02 74 03 <89> 50 04 c7 41 04 00 02
20 00 8d 4b 48 8b 53 48 8b 41 04 89 42 
EIP: [<c017e0e0>] prune_one_dentry+0x23/0x79 SS:ESP 0068:dfad7ee0
 <4>SFW2-INext-DROP-DEFLT IN=eth0 OUT=
MAC=00:50:8d:db:39:3d:00:01:5c:22:76:02:08:00 SRC=221.208.208.94
DST=69.137.114.21 LEN=485 TOS=0x00 PREC=0x20 TTL=43 ID=0 DF PROTO=UDP SPT=53530
DPT=1027 LEN=465 
BUG: soft lockup detected on CPU#0!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c017dcd3>] d_instantiate+0x24/0x8a
 [<f88bb7fc>] ext3_add_nondir+0x2e/0x42 [ext3]
 [<f88bbd2d>] ext3_create+0xa8/0xdc [ext3]
 [<c0175603>] vfs_create+0xce/0x13e
 [<c01780cc>] open_namei+0x16b/0x630
 [<c0167b0c>] do_filp_open+0x1f/0x35
 [<c0167b62>] do_sys_open+0x40/0xb5
 [<c0167c03>] sys_open+0x16/0x18
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: soft lockup detected on CPU#1!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c01c9409>] _atomic_dec_and_lock+0x2d/0x4c
 [<c017df95>] dput+0x34/0x139
 [<c0174957>] path_release+0xd/0x23
 [<c0171765>] vfs_stat_fd+0x36/0x40
 [<c01717f1>] vfs_stat+0x11/0x13
 [<c0171807>] sys_stat64+0x14/0x28
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: soft lockup detected on CPU#0!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c01ccee0>] __delay+0x9/0xb
 [<c01d828c>] _raw_spin_lock+0x7a/0xe8
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c017dcd3>] d_instantiate+0x24/0x8a
 [<f88bb7fc>] ext3_add_nondir+0x2e/0x42 [ext3]
 [<f88bbd2d>] ext3_create+0xa8/0xdc [ext3]
 [<c0175603>] vfs_create+0xce/0x13e
 [<c01780cc>] open_namei+0x16b/0x630
 [<c0167b0c>] do_filp_open+0x1f/0x35
 [<c0167b62>] do_sys_open+0x40/0xb5
 [<c0167c03>] sys_open+0x16/0x18
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: soft lockup detected on CPU#1!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c01c9409>] _atomic_dec_and_lock+0x2d/0x4c
 [<c017df95>] dput+0x34/0x139
 [<c0174957>] path_release+0xd/0x23
 [<c0171765>] vfs_stat_fd+0x36/0x40
 [<c01717f1>] vfs_stat+0x11/0x13
 [<c0171807>] sys_stat64+0x14/0x28
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: soft lockup detected on CPU#0!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c017dcd3>] d_instantiate+0x24/0x8a
 [<f88bb7fc>] ext3_add_nondir+0x2e/0x42 [ext3]
 [<f88bbd2d>] ext3_create+0xa8/0xdc [ext3]
 [<c0175603>] vfs_create+0xce/0x13e
 [<c01780cc>] open_namei+0x16b/0x630
 [<c0167b0c>] do_filp_open+0x1f/0x35
 [<c0167b62>] do_sys_open+0x40/0xb5
 [<c0167c03>] sys_open+0x16/0x18
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: soft lockup detected on CPU#1!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c01c9409>] _atomic_dec_and_lock+0x2d/0x4c
 [<c017df95>] dput+0x34/0x139
 [<c0174957>] path_release+0xd/0x23
 [<c0171765>] vfs_stat_fd+0x36/0x40
 [<c01717f1>] vfs_stat+0x11/0x13
 [<c0171807>] sys_stat64+0x14/0x28
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: soft lockup detected on CPU#0!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c017dcd3>] d_instantiate+0x24/0x8a
 [<f88bb7fc>] ext3_add_nondir+0x2e/0x42 [ext3]
 [<f88bbd2d>] ext3_create+0xa8/0xdc [ext3]
 [<c0175603>] vfs_create+0xce/0x13e
 [<c01780cc>] open_namei+0x16b/0x630
 [<c0167b0c>] do_filp_open+0x1f/0x35
 [<c0167b62>] do_sys_open+0x40/0xb5
 [<c0167c03>] sys_open+0x16/0x18
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: soft lockup detected on CPU#1!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c01c9409>] _atomic_dec_and_lock+0x2d/0x4c
 [<c017df95>] dput+0x34/0x139
 [<c0174957>] path_release+0xd/0x23
 [<c0171765>] vfs_stat_fd+0x36/0x40
 [<c01717f1>] vfs_stat+0x11/0x13
 [<c0171807>] sys_stat64+0x14/0x28
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: soft lockup detected on CPU#0!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c01ccee0>] __delay+0x9/0xb
 [<c01d828c>] _raw_spin_lock+0x7a/0xe8
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c017dcd3>] d_instantiate+0x24/0x8a
 [<f88bb7fc>] ext3_add_nondir+0x2e/0x42 [ext3]
 [<f88bbd2d>] ext3_create+0xa8/0xdc [ext3]
 [<c0175603>] vfs_create+0xce/0x13e
 [<c01780cc>] open_namei+0x16b/0x630
 [<c0167b0c>] do_filp_open+0x1f/0x35
 [<c0167b62>] do_sys_open+0x40/0xb5
 [<c0167c03>] sys_open+0x16/0x18
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: soft lockup detected on CPU#1!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c01c9409>] _atomic_dec_and_lock+0x2d/0x4c
 [<c017df95>] dput+0x34/0x139
 [<c0174957>] path_release+0xd/0x23
 [<c0171765>] vfs_stat_fd+0x36/0x40
 [<c01717f1>] vfs_stat+0x11/0x13
 [<c0171807>] sys_stat64+0x14/0x28
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: soft lockup detected on CPU#0!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c017dcd3>] d_instantiate+0x24/0x8a
 [<f88bb7fc>] ext3_add_nondir+0x2e/0x42 [ext3]
 [<f88bbd2d>] ext3_create+0xa8/0xdc [ext3]
 [<c0175603>] vfs_create+0xce/0x13e
 [<c01780cc>] open_namei+0x16b/0x630
 [<c0167b0c>] do_filp_open+0x1f/0x35
 [<c0167b62>] do_sys_open+0x40/0xb5
 [<c0167c03>] sys_open+0x16/0x18
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: soft lockup detected on CPU#1!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c01c9409>] _atomic_dec_and_lock+0x2d/0x4c
 [<c017df95>] dput+0x34/0x139
 [<c0174957>] path_release+0xd/0x23
 [<c0171765>] vfs_stat_fd+0x36/0x40
 [<c01717f1>] vfs_stat+0x11/0x13
 [<c0171807>] sys_stat64+0x14/0x28
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: soft lockup detected on CPU#0!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c01d828c>] _raw_spin_lock+0x7a/0xe8
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c017dcd3>] d_instantiate+0x24/0x8a
 [<f88bb7fc>] ext3_add_nondir+0x2e/0x42 [ext3]
 [<f88bbd2d>] ext3_create+0xa8/0xdc [ext3]
 [<c0175603>] vfs_create+0xce/0x13e
 [<c01780cc>] open_namei+0x16b/0x630
 [<c0167b0c>] do_filp_open+0x1f/0x35
 [<c0167b62>] do_sys_open+0x40/0xb5
 [<c0167c03>] sys_open+0x16/0x18
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: soft lockup detected on CPU#1!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c01c9409>] _atomic_dec_and_lock+0x2d/0x4c
 [<c017df95>] dput+0x34/0x139
 [<c0174957>] path_release+0xd/0x23
 [<c0171765>] vfs_stat_fd+0x36/0x40
 [<c01717f1>] vfs_stat+0x11/0x13
 [<c0171807>] sys_stat64+0x14/0x28
 [<c0103db9>] sysenter_past_esp+0x56/0x8d
BUG: soft lockup detected on CPU#0!
 [<c0104eda>] show_trace_log_lvl+0x58/0x16a
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
 [<c01054d7>] show_trace+0xd/0x10
 [<c01055f6>] dump_stack+0x19/0x1b
 [<c014b1b7>] softlockup_tick+0xa5/0xb9
 [<c0128467>] run_local_timers+0x12/0x14
 [<c01287e6>] update_process_times+0x3c/0x61
 [<c0113a78>] smp_apic_timer_interrupt+0x6d/0x75
 [<c010499a>] apic_timer_interrupt+0x2a/0x30
 [<c02b047a>] _spin_lock+0x20/0x28
 [<c017dcd3>] d_instantiate+0x24/0x8a
 [<f88bb7fc>] ext3_add_nondir+0x2e/0x42 [ext3]
 [<f88bbd2d>] ext3_create+0xa8/0xdc [ext3]
 [<c0175603>] vfs_create+0xce/0x13e
 [<c01780cc>] open_namei+0x16b/0x630
 [<c0167b0c>] do_filp_open+0x1f/0x35
 [<c0167b62>] do_sys_open+0x40/0xb5
 [<c0167c03>] sys_open+0x16/0x18
 [<c0103db9>] sysenter_past_esp+0x56/0x8d

Comment 42 Linus Torvalds 2006-10-28 10:40:45 UTC

On Sat, 28 Oct 2006, bugme-daemon@bugzilla.kernel.org wrote:
>
> This latest crash was caused by just untaring a large source file:
>
> EIP:    0060:[<c017e0e0>]    Not tainted VLI
> EFLAGS: 00010206   (2.6.18-smp #2) 
> EIP is at prune_one_dentry+0x23/0x79
> eax: 00207400   ebx: d7f4f07c   ecx: d7f4f0a4   edx: c2aa2fb4
> esi: d7f4f07c   edi: d7f4f084   ebp: dfad7ee8   esp: dfad7ee0
> ds: 007b   es: 007b   ss: 0068
> Process kswapd0 (pid: 213, ti=dfad6000 task=dff0cae0 task.ti=dfad6000)
> Stack: dfa769d0 d7f4f07c dfad7f04 c017e2ff 00000000 00000053 00027678 dfff2404 
>        000000a8 dfad7f0c c017e34f dfad7f3c c0154b5a 01d8da00 00000000 01d8da00 
>        0006d934 000000d0 00000180 00000000 00000060 c030ea80 c0312200 dfad7fc8 
> Code: ... c0 89 02 74 03 <89> 50 04 c7 41 04 ...

That's this sequence:

		test   %eax,%eax
		mov    %eax,(%edx)
		je     over
***		mov    %edx,0x4(%eax)	***
	over:
		movl   $0x200200,0x4(%ecx)
		lea    0x48(%ebx),%ecx
		..

where the starred instruction oopses because %eax is 00207400.

That sequence is "hlist_del_rcu()" from __d_drop() at the very top of the 
function.

The value for %eax is somewhat interesting. It's 0x00200200 + 0x7200, and 
while I don't see how the 0x7200 got there, it's still close enough to 
0x00200200 that it's intriguing. And 0x00200200 is obviously the magic 
number to poison the list entries.

Anyway, the value of %eax at that point comes from "dentry->d_hash.next", 
so the dentry hash-list has clearly gotten corrupted. It _looks_ like the 
same dentry has gotten unhashed twice, but we actually have code to 
protect against that in __d_drop:

	if (!(dentry->d_flags & DCACHE_UNHASHED)) {
		dentry->d_flags |= DCACHE_UNHASHED;
		hlist_del_rcu(&dentry->d_hash);
	}

but it's still intriguing.

(This all should run under "dcache_lock" _and_ the dentry->d_lock, so 
there's no race on the DCACHE_UNHASHED thing either).

Very strange. 

		Linus

Comment 43 Don Harter 2006-10-30 04:48:20 UTC

drh:/boot>md5sum vmlinuz-2.6.18-smp
583308f23f63ecf72dc80d37c825857e  vmlinuz-2.6.18-smp
drh:/boot>md5sum initrd-2.6.18-smp
d1cda47c9e9d23c128c09f0bc48f3f20  initrd-2.6.18-smp

drh:/boot/grub>for name in `ls`; do md5sum $name; done
f3317bc6b36d7f44282ad8e020b3e2f9  default
089d43a54adbc29b7d2c5308e9cd8439  e2fs_stage1_5
a99945c26d8455ee9787bfc937491699  fat_stage1_5
12f49f1bdedaaa6f24eab8aa06078bbc  ffs_stage1_5
453f03aa3c0363e877abadd7f52b4fc3  iso9660_stage1_5
ffed452493ddc3466ab40a3ad7de4f91  jfs_stage1_5
4481893f2403b21d93ae213982d3bddb  minix_stage1_5
45c5163d47e8dcdea74ae5a8271d622b  reiserfs_stage1_5
addcae72a6ed5918e733d6f5a31862d5  stage1
9f66f479e64fe8eed2528be43295af00  stage2
bbb6e9daaa91fe2f178118454d56fc4c  ufs2_stage1_5
20feb267048c8e8fd426f79ccfa42ca5  vstafs_stage1_5
08c1dc8ce7aeb110aff9b85ee27ea2f9  xfs_stage1_5

Comment 44 Don Harter 2006-11-28 04:49:21 UTC

I am running linux-2.6.18.3 now.  Recently I have done some heavy processing and
it did not crash.
I noticed this message in the log which may be a clue as to what might have/be
causing the problem.
Nov 28 06:16:40 c-69-137-114-21 kernel: [165129.452000] eth0: too many
iterations (6) in nv_nic_irq.
I have seen this message in other areas of the log.

Comment 45 Don Harter 2007-01-01 14:50:57 UTC

This bug does not seem to appear in linux-2.6.18.3.
I have not had a crash yet with this kernel version.
I turned off my other computer with the serial console log, because of a lack of
crashes. 
I did notice an abnormality today, but it may not be a kernel problem.  When I
ran grub-install on /dev/hda I got an error message for dev/fd0 even though
there were no references to /dev/fd0.  /dev/fd0 was in the map though.
So if you want to close this you may.  If I find additional problems, I will
open another bug report. Thanks!

Note You need to log in before you can comment on or make changes to this bug.