Bug 5964 - slab: double free detected in cache 'vm_area_struct'
Status: REJECTED UNREPRODUCIBLE
Product: Memory Management
Classification: Unclassified
Component: Slab Allocator
Hardware: i386 Linux
Importance: P2 high
Assigned To: Andrew Morton
Depends on:
Blocks:
Reported: 2006-01-26 04:29 UTC by Malte S. Stretz
Modified: 2007-11-06 02:10 UTC (History)

See Also:
Kernel Version: 2.6.16
Tree: Mainline
Regression: Yes


Attachments
config for 2.6.14 (36.19 KB, text/plain)
2006-01-26 04:43 UTC, Malte S. Stretz
dmesg output from 2.6.16 (56.30 KB, text/plain)
2006-04-07 15:32 UTC, Malte S. Stretz
config of 2.6.16 (39.07 KB, text/plain)
2006-04-07 15:34 UTC, Malte S. Stretz
"screenshot" of the last crash (124.27 KB, image/jpeg)
2006-06-03 07:27 UTC, Malte S. Stretz

Description Malte S. Stretz 2006-01-26 04:29:43 UTC
Most recent kernel where this bug did not occur:
2.6.10-gentoo-r4

Most recent kernel where this bug did occur:
2.6.14

Distribution:
Gentoo

Hardware Environment:
mss@otherland ~ $ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 4
model name      : AMD Athlon(tm) Processor
stepping        : 2
cpu MHz         : 1199.924
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips        : 2402.42

mss@otherland ~ $ lspci
00:00.0 Host bridge: VIA Technologies, Inc. VT8366/A/7 [Apollo KT266/A/333]
00:01.0 PCI bridge: VIA Technologies, Inc. VT8366/A/7 [Apollo KT266/A/333 AGP]
00:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
00:08.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
00:09.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 61)
00:09.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 61)
00:09.2 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 63)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8233 PCI to ISA Bridge
00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:11.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 1b)
00:11.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 1b)
00:11.4 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 1b)
00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237 AC97 Audio Controller (rev 10)
01:00.0 VGA compatible controller: ATI Technologies Inc RV280 [Radeon 9200 PRO] (rev 01)
01:00.1 Display controller: ATI Technologies Inc RV280 [Radeon 9200 PRO] (Secondary) (rev 01)
mss@otherland ~ $ lsusb
Bus 005 Device 001: ID 0000:0000
Bus 004 Device 002: ID 07cc:0301 Carry Computer Eng., Co., Ltd
Bus 004 Device 001: ID 0000:0000
Bus 003 Device 007: ID 04a6:0180 Nokia Display Products
Bus 003 Device 004: ID 045e:001e Microsoft Corp. IntelliMouse Explorer
Bus 003 Device 002: ID 0451:1446 Texas Instruments, Inc. TUSB2040/2070 Hub
Bus 003 Device 001: ID 0000:0000
Bus 002 Device 001: ID 0000:0000
Bus 001 Device 001: ID 0000:0000

Software Environment:
From an earlier kernel:

mss@otherland /usr/src/linux-2.6.16-rc1 $ sh scripts/ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

Linux otherland 2.6.14-gentoo-r5 #3 PREEMPT Mon Jan 23 10:37:43 CET 2006 i686
AMD Athlon(tm) Processor AuthenticAMD GNU/Linux

Gnu C                  3.4.4
Gnu make               3.80
binutils               2.16.1
util-linux             2.12r
mount                  2.12r
module-init-tools      3.2.1
e2fsprogs              1.38
jfsutils               1.1.8
reiserfsprogs          3.6.19
reiser4progs           line
xfsprogs               2.6.25
Linux C Library        2.3.5
Dynamic linker (ldd)   2.3.5
Procps                 3.2.5
Net-tools              1.60
Kbd                    1.12
Sh-utils               5.2.1
udev                   079
Modules Loaded         snd_seq snd_pcm_oss snd_mixer_oss snd_via82xx
snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd_page_alloc snd_mpu401_uart
snd_rawmidi snd_seq_device snd soundcore sd_mod ext2 mbcache usbhid usb_storage
scsi_mod vfat fat ide_cd cdrom uhci_hcd usbcore via_rhine mii capability
commoncap button fan thermal processor non_fatal

Problem Description:
This was first reported in the Gentoo Bugzilla under
<http://bugs.gentoo.org/show_bug.cgi?id=117507>.  As it was (fairly)
reproducible with the vanilla 2.6.16-rc1, I was sent here.

Kernels since 2.6.14 (or some other version after 2.6.10, as I didn't try the
versions in between) tend to crash my machine.  Going back to 2.6.10-gentoo-r5
fixes the problem.  The kernel crashes with the message "slab: double free
detected in cache 'vm_area_struct'" and "kernel BUG at mm/slab.c:2701! invalid
opcode: 0000 [#1]".

The whole dmesg output for 2.6.16-rc1 is available at
<http://bugs.gentoo.org/attachment.cgi?id=77737> (other versions are available
in that gentoo bug, too).  The cut version:
slab: double free detected in cache 'vm_area_struct', objp f72c73d8
------------[ cut here ]------------
kernel BUG at mm/slab.c:2701!
invalid opcode: 0000 [#1]
PREEMPT 
Modules linked in: snd_seq snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usb_lib
snd_hwdep snd_via82xx snd_ac97_codec snd_ac97_bus snd_pcm snd_timer
snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore sd_mod
ext2 mbcache usb_storage scsi_mod usbhid vfat fat ide_cd cdrom ehci_hcd uhci_hcd
usbcore ipv6 via_rhine mii capability commoncap button fan thermal processor
non_fatal
CPU:    0
EIP:    0060:[<c015e86d>]    Not tainted VLI
EFLAGS: 00210096   (2.6.16-rc1) 
EIP is at free_block+0xbd/0x180
eax: 00000047   ebx: f72c7000   ecx: c034ee6c   edx: c034ee6c
esi: c18ddd60   edi: 00000008   ebp: f72c701c   esp: d4747eb8
ds: 007b   es: 007b   ss: 0068
Process sh (pid: 8989, threadinfo=d4746000 task=ea1250b0)
Stack: <0>c031ac14 c0317bed f72c73d8 00200202 f72c73d8 00000002 c18de7cc c18dc820 
       f764b504 00000010 c015e9a3 c18dc820 c18da610 00000010 00000000 c18da610 
       c18ddd60 c18da600 c18dc820 f764b504 00200282 c015ecbb c18dc820 c18da600 
Call Trace:
 [<c015e9a3>] cache_flusharray+0x73/0x170
 [<c015ecbb>] kmem_cache_free+0x7b/0x80
 [<c014fbbc>] remove_vma+0x5c/0x80
 [<c014fbbc>] remove_vma+0x5c/0x80
 [<c0151f18>] exit_mmap+0xd8/0x110
 [<c011a8f7>] mmput+0x37/0xb0
 [<c011f4e3>] do_exit+0xf3/0x460
 [<c011f8c4>] do_group_exit+0x34/0xa0
 [<c010320b>] sysenter_past_esp+0x54/0x75
Code: 00 00 8d 6b 1c 83 7c bd 00 fe 74 27 8b 44 24 10 8b 54 24 2c 89 44 24 08 8b
42 48 c7 04 24 14 ac 31 c0 89 44 24 04 e8 c3 ea fb ff <0f> 0b 8d 0a 7d 9c 31 c0
8b 43 14 89 44 bd 00 89 7b 14 8b 4c 24 
 <1>Fixing recursive fault but reboot is needed!
scheduling while atomic: sh/0x00000001/8989
 [<c02fe198>] schedule+0x588/0x670
 [<c01040b8>] show_stack_log_lvl+0xa8/0xe0
 [<c015e882>] free_block+0xd2/0x180
 [<c011f68d>] do_exit+0x29d/0x460
 [<c011d347>] printk+0x17/0x20
 [<c0104545>] die+0x195/0x1a0
 [<c01048d0>] do_invalid_op+0x0/0xb0
 [<c0104972>] do_invalid_op+0xa2/0xb0
 [<c015e86d>] free_block+0xbd/0x180
 [<c011d74f>] release_console_sem+0xcf/0xf0
 [<c011d5e5>] vprintk+0x295/0x2b0
 [<c0144323>] buffered_rmqueue+0xc3/0x230
 [<c0103cfb>] error_code+0x4f/0x54
 [<c015e86d>] free_block+0xbd/0x180
 [<c015e9a3>] cache_flusharray+0x73/0x170
 [<c015ecbb>] kmem_cache_free+0x7b/0x80
 [<c014fbbc>] remove_vma+0x5c/0x80
 [<c014fbbc>] remove_vma+0x5c/0x80
 [<c0151f18>] exit_mmap+0xd8/0x110
 [<c011a8f7>] mmput+0x37/0xb0
 [<c011f4e3>] do_exit+0xf3/0x460
 [<c011f8c4>] do_group_exit+0x34/0xa0
 [<c010320b>] sysenter_past_esp+0x54/0x75
Unable to handle kernel paging request at virtual address 00100104
 printing eip:
c015e806
*pde = 00000000
Oops: 0002 [#2]
PREEMPT 
Modules linked in: snd_seq snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usb_lib
snd_hwdep snd_via82xx snd_ac97_codec snd_ac97_bus snd_pcm snd_timer
snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore sd_mod
ext2 mbcache usb_storage scsi_mod usbhid vfat fat ide_cd cdrom ehci_hcd uhci_hcd
usbcore ipv6 via_rhine mii capability commoncap button fan thermal processor
non_fatal
CPU:    0
EIP:    0060:[<c015e806>]    Not tainted VLI
EFLAGS: 00210082   (2.6.16-rc1) 
EIP is at free_block+0x56/0x180
eax: 00100100   ebx: f72c7000   ecx: 00000000   edx: 00200200
esi: c18ddd60   edi: 00000007   ebp: 00000000   esp: c1927e90
ds: 007b   es: 007b   ss: 0068
Process events/0 (pid: 4, threadinfo=c1926000 task=c1907a70)
Stack: <0>c18dc780 f78dc060 00000007 00000000 f72c7248 00000000 c18da610 c18da600 
       00000007 00000000 c015f210 c18dc820 c18da610 00000007 00000000 c18dc820 
       c18ddd24 c18dc820 c18ddd60 00000001 c015f2f4 c18dc820 c18da600 00000000 
Call Trace:
 [<c015f210>] drain_array_locked+0x80/0xd0
 [<c015f2f4>] cache_reap+0x94/0x200
 [<c012de42>] run_workqueue+0x92/0x130
 [<c015f260>] cache_reap+0x0/0x200
 [<c012e038>] worker_thread+0x158/0x180
 [<c0119270>] default_wake_function+0x0/0x20
 [<c0119270>] default_wake_function+0x0/0x20
 [<c012dee0>] worker_thread+0x0/0x180
 [<c0131896>] kthread+0xb6/0xf0
 [<c01317e0>] kthread+0x0/0xf0
 [<c0101395>] kernel_thread_helper+0x5/0x10
Code: 8b 4c 24 38 89 d0 89 54 24 10 05 00 00 00 40 c1 e8 0c c1 e0 05 03 05 10 7a
4d c0 8b 58 1c 8b 44 24 2c 8b 53 04 8b 74 88 14 8b 03 <89> 50 04 89 02 31 d2 c7
03 00 01 10 00 c7 43 04 00 02 20 00 8b 
 <6>note: events/0[4] exited with preempt_count 1

Steps to reproduce:
Not sure.  First, use some kernel after 2.6.10.  Current ideas about what could be
the culprit:
1. High system load; that at least helps (like compiling via Gentoo's emerge -uD
world), but a relatively idle system seems to crash after a few days too, just
not as often.
2. Some USB driver. I tend to switch off my screen, which has a USB hub
built in; it seems that the system doesn't crash if I don't do so.
3. The serial driver. It seems the system doesn't crash when I don't load it.
4. Something completely different.
I say "seems" in all cases because it's very hard to reproduce. I don't think it's
a hardware failure, though, because going back to 2.6.10 gives me a stable system.

Any ideas on how to debug this are welcome.
Comment 1 Malte S. Stretz 2006-01-26 04:43:40 UTC
Created attachment 7153 [details]
config for 2.6.14

I accidentally deleted the 2.6.16-rc1 config (and the binary) -- I'm currently
recreating it, based on this config for the (also crashing) 2.6.14.
Comment 2 Malte S. Stretz 2006-04-07 15:32:57 UTC
Created attachment 7808 [details]
dmesg output from 2.6.16

This is the dmesg output from 2.6.16-gentoo-r1; it first seemed to be pretty
stable but finally crashed...
Comment 3 Malte S. Stretz 2006-04-07 15:34:27 UTC
Created attachment 7809 [details]
config of 2.6.16
Comment 4 Malte S. Stretz 2006-04-07 15:53:09 UTC
Guess I'll just disable DEBUG_SLAB.  If anybody wants to debug this.....
Comment 5 Andrew Morton 2006-04-07 16:28:06 UTC
How long does the crash take to happen?

Are there any different-looking crashes, or always this one?

If poss, can you run memtest86 on that machine for 24 hours?
Comment 6 Malte S. Stretz 2006-04-07 16:35:58 UTC
How long: Depends.  The box was running for a few days without problems, then it
crashed twice in a row.  It was under high load both times (first compiling
KOffice, then the kernel), but I have also compiled a whole lot of stuff without
any crashes.

To me the crash always looks the same; just the order in which the processes die
is different.

I will try a memtest at some point but as I said, 2.6.10 runs rock solid.
Comment 7 Malte S. Stretz 2006-04-18 09:07:38 UTC
The Easter weekend gave me a good chance for a memtest86+ (v1.65) session:  
100h running, the tests passed 70 times without any errors.
Comment 8 Hugh Dickins 2006-04-18 11:31:24 UTC
Thanks for doing such a thorough memtest86+: sounds convincing.

I had been thinking of your vm_area_struct double-free as just
one of several confounding slab corruptions seen in recent months.
But looking at it again now, suspect it's more specific.

Could you please rebuild with the patch below, run your testing on
that kernel, and report back how it goes when you've run for long
enough to judge?

Even if it seems to fix your immediate problem, I don't believe it's
the real fix: more something to try, and if it works, then we have a
better idea of what direction to look in next.

(That is, something I could easily cook up to keep you busy, while
I go away and think about something else - oops, how unprofessional,
forget I said that ;-)

Hugh

--- 2.6.16/mm/mmap.c	2006-03-20 05:53:29.000000000 +0000
+++ linux/mm/mmap.c	2006-04-18 18:59:39.000000000 +0100
@@ -1933,7 +1933,7 @@ EXPORT_SYMBOL(do_brk);
 void exit_mmap(struct mm_struct *mm)
 {
 	struct mmu_gather *tlb;
-	struct vm_area_struct *vma = mm->mmap;
+	struct vm_area_struct *vma = xchg(&mm->mmap, 0);
 	unsigned long nr_accounted = 0;
 	unsigned long end;
 

Comment 9 Malte S. Stretz 2006-04-18 23:14:14 UTC
That didn't really help.  Yesterday night the box crashed again, this time it 
even took down the xinetd process so I can't get a trace.
Comment 10 Hugh Dickins 2006-04-19 09:59:33 UTC
On Tue, 18 Apr 2006, bugme-daemon@bugzilla.kernel.org wrote:
> That didn't really help.  Yesterday night the box crashed again,
> this time it even took down the xinetd process so I can't get a trace.

Hmmm.  Very inconclusive.  It might be that the patch was irrelevant
and didn't help at all; or it might be that the patch helped to get
around the vm_area_struct freeing errors, and so let the system sail
on to hit the effects of the underlying bug.  I think I'd like to ask
you to run with the patch again, in the hope that "can't get a trace"
was a one-off, and more info emerges this time around.

Comment 11 Malte S. Stretz 2006-06-03 07:27:22 UTC
Created attachment 8248 [details]
"screenshot" of the last crash

After some time I again tried a kernel with the patch applied.  After the
system had been running for half a day it went into a crash frenzy again, this
time also on startup.  All I could gather after some tries is this "screenshot",
which says that it is crashing in slab.c:2392 now (but maybe the line numbers
have changed because this is the more recent kernel 2.6.16-gentoo-r8).
Comment 12 Malte S. Stretz 2006-06-05 10:44:29 UTC
Crashed again.  I *think* the trace looks different but who knows...

[17179569.184000] Linux version 2.6.16-gentoo-r8-bug5964-try1 (root@otherland) 
(gcc version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)) #1 PREEMPT Fri 
Jun 2 18:40:48 CEST 2006
[...]
[17351310.004000] slab: double free detected in cache 'vm_area_struct', objp ebb56a18
[17351310.004000] ------------[ cut here ]------------
[17351310.004000] kernel BUG at mm/slab.c:2392!
[17351310.004000] invalid opcode: 0000 [#1]
[17351310.004000] PREEMPT 
[17351310.004000] Modules linked in: w83627hf hwmon_vid hwmon eeprom i2c_isa 
i2c_viapro md5 ipv6 snd_seq snd_pcm_oss snd_mixer_oss snd_via82xx gameport 
snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd_page_alloc snd_mpu401_uart 
snd_rawmidi snd_seq_device snd soundcore sd_mod usb_storage scsi_mod usbhid 
dm_mod vfat fat ide_cd cdrom 8250 serial_core ehci_hcd uhci_hcd usbcore tun 
ne2k_pci 8390 3c59x via_rhine mii capability commoncap button fan thermal 
processor non_fatal rtc
[17351310.004000] CPU:    0
[17351310.004000] EIP:    0060:[<b0160241>]    Not tainted VLI
[17351310.004000] EFLAGS: 00010096   (2.6.16-gentoo-r8-bug5964-try1 #1) 
[17351310.004000] EIP is at slab_put_obj+0x51/0xa0
[17351310.004000] eax: 00000059   ebx: ebb56000   ecx: b0356f6c   edx: 00000001
[17351310.004000] esi: 00000018   edi: ebb5601c   ebp: bdc23e70   esp: bdc23e54
[17351310.004000] ds: 007b   es: 007b   ss: 0068
[17351310.004000] Process tcsh (pid: 6732, threadinfo=bdc22000 task=e9618590)
[17351310.004000] Stack: <0>b0321cd8 b031e951 ebb56a18 bdc23e68 ebb56a18 
ebb56000 effedd60 bdc23e98 
[17351310.004000]        b0160d68 effec820 ebb56000 ebb56a18 00000000 0000000d 
effee7cc effec820 
[17351310.004000]        b86fa7c0 bdc23ec8 b0160e5b effec820 effea610 00000010 
00000000 effea610 
[17351310.004000] Call Trace:
[17351310.004000]  [<b01040ca>] show_stack_log_lvl+0xaa/0xe0
[17351310.004000]  [<b01042e7>] show_registers+0x197/0x210
[17351310.004000]  [<b01044e7>] die+0xf7/0x1a0
[17351310.004000]  [<b0104617>] do_trap+0x87/0xd0
[17351310.004000]  [<b0104985>] do_invalid_op+0xb5/0xc0
[17351310.004000]  [<b0103ceb>] error_code+0x4f/0x54
[17351310.004000]  [<b0160d68>] free_block+0x88/0x100
[17351310.004000]  [<b0160e5b>] cache_flusharray+0x7b/0x180
[17351310.004000]  [<b0161172>] kmem_cache_free+0x72/0x80
[17351310.004000]  [<b0152588>] remove_vma+0x58/0x70
[17351310.004000]  [<b01547dd>] exit_mmap+0xdd/0x110
[17351310.004000]  [<b011a903>] mmput+0x33/0xb0
[17351310.004000]  [<b011f85d>] exit_mm+0x8d/0x110
[17351310.004000]  [<b01200e7>] do_exit+0xf7/0x4c0
[17351310.004000]  [<b012052b>] do_group_exit+0x3b/0xd0
[17351310.004000]  [<b01205d5>] sys_exit_group+0x15/0x20
[17351310.004000]  [<b01031fb>] sysenter_past_esp+0x54/0x75
[17351310.004000] Code: 3b 45 14 75 45 8d 7b 1c 83 3c b7 fe 74 25 8b 45 10 8b 
55 08 89 44 24 08 8b 42 44 c7 04 24 d8 1c 32 b0 89 44 24 04 e8 2f db fb ff <0f> 
0b 58 09 6e 0e 32 b0 8b 43 14 89 04 b7 ff 4b 10 89 73 14 8b 
[17351310.004000]  <1>Fixing recursive fault but reboot is needed!
[17351310.004000] scheduling while atomic: tcsh/0x00000001/6732
[17351310.004000]  [<b0104010>] show_trace+0x20/0x30
[17351310.004000]  [<b010414e>] dump_stack+0x1e/0x20
[17351310.004000]  [<b030588c>] schedule+0x5ac/0x690
[17351310.004000]  [<b01202ee>] do_exit+0x2fe/0x4c0
[17351310.004000]  [<b0104585>] die+0x195/0x1a0
[17351310.004000]  [<b0104617>] do_trap+0x87/0xd0
[17351310.004000]  [<b0104985>] do_invalid_op+0xb5/0xc0
[17351310.004000]  [<b0103ceb>] error_code+0x4f/0x54
[17351310.004000]  [<b0160d68>] free_block+0x88/0x100
[17351310.004000]  [<b0160e5b>] cache_flusharray+0x7b/0x180
[17351310.004000]  [<b0161172>] kmem_cache_free+0x72/0x80
[17351310.004000]  [<b0152588>] remove_vma+0x58/0x70
[17351310.004000]  [<b01547dd>] exit_mmap+0xdd/0x110
[17351310.004000]  [<b011a903>] mmput+0x33/0xb0
[17351310.004000]  [<b011f85d>] exit_mm+0x8d/0x110
[17351310.004000]  [<b01200e7>] do_exit+0xf7/0x4c0
[17351310.004000]  [<b012052b>] do_group_exit+0x3b/0xd0
[17351310.004000]  [<b01205d5>] sys_exit_group+0x15/0x20
[17351310.004000]  [<b01031fb>] sysenter_past_esp+0x54/0x75
Comment 13 Hugh Dickins 2006-06-11 13:09:30 UTC
I don't have high hopes that it will enlighten me, but please apply
patch below, rebuild your kernel (with or without CONFIG_DEBUG_SLAB
as you prefer), try running, and report the messages you get - along
with output from /proc/slabinfo (or at least its relevant lines e.g.
for "vm_area_struct").  I've cut out the BUG (with its not very
interesting backtrace), so you should be able to continue running
successfully, as you found you could before without DEBUG_SLAB.  After
gathering several groups of error messages, it's probably worth reboot
and trying again: to help build up a picture of what's common.  Thanks.

--- 2.6.16/mm/slab.c	2006-03-20 05:53:29.000000000 +0000
+++ linux/mm/slab.c	2006-06-11 20:53:19.000000000 +0100
@@ -2368,10 +2368,8 @@ static void *slab_get_obj(struct kmem_ca
 
 	slabp->inuse++;
 	next = slab_bufctl(slabp)[slabp->free];
-#if DEBUG
 	slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
 	WARN_ON(slabp->nodeid != nodeid);
-#endif
 	slabp->free = next;
 
 	return objp;
@@ -2382,16 +2380,16 @@ static void slab_put_obj(struct kmem_cac
 {
 	unsigned int objnr = (unsigned)(objp-slabp->s_mem) / cachep->buffer_size;
 
-#if DEBUG
 	/* Verify that the slab belongs to the intended node */
 	WARN_ON(slabp->nodeid != nodeid);
 
 	if (slab_bufctl(slabp)[objnr] != BUFCTL_FREE) {
+		kmem_bufctl_t *bufctl = slab_bufctl(slabp) + objnr;
 		printk(KERN_ERR "slab: double free detected in cache "
 		       "'%s', objp %p\n", cachep->name, objp);
-		BUG();
+		printk(KERN_ERR "      slab_bufctl(%p)[%x] = %x@%p\n",
+			slabp, objnr, *bufctl, bufctl);
 	}
-#endif
 	slab_bufctl(slabp)[objnr] = slabp->free;
 	slabp->free = objnr;
 	slabp->inuse--;

Comment 14 Malte S. Stretz 2006-06-15 09:50:28 UTC
Here's one:
[17179569.184000] Linux version 2.6.16-gentoo-r9-bug5964-try2 (root@otherland) 
(gcc version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)) #3 PREEMPT Tue 
Jun 13 00:55:11 CEST 2006
[...]
[17379204.504000] hub 3-2:1.0: USB hub found
[17379204.508000] hub 3-2:1.0: 4 ports detected
[17383280.004000] slab: double free detected in cache 'biovec-1', objp efd4ead0
[17383280.004000]       slab_bufctl(efd4e000)[78] = ff00fffe@efd4e1fc
[17386724.516000] usb 3-2: USB disconnect, address 14

Unfortunately that was some time ago; the last dmesg entry is dated
[17411086.788000].  slabinfo:
biovec-1             374    609     16  203    1 : tunables  120   60    0 : slabdata      3      3      0

I'll see if I find the time to create a cron job to log this stuff.
Comment 15 Malte S. Stretz 2006-06-19 04:57:39 UTC
[17479641.280000] slab: double free detected in cache 'vm_area_struct', objp db38a90c
[17479641.280000]       slab_bufctl(db38a000)[18] = fffe@db38a07c

current dmesg ts: 17720900.356000
Comment 16 Malte S. Stretz 2006-06-20 03:02:02 UTC
Hmmm... these patterns are really getting interesting :)

[17720900.356000] hub 3-2:1.0: 4 ports detected
[17775653.028000] slab: double free detected in cache 'biovec-1', objp eff848d0
[17775653.028000]       slab_bufctl(eff84000)[58] = ff00fffe@eff8417c
[17776279.448000] usb 3-2: USB disconnect, address 28
[17781983.028000] slab: double free detected in cache 'bio', objp efcac320
[17781983.028000]       slab_bufctl(efcac000)[8] = ff00fffe@efcac03c
[17803962.460000] slab: double free detected in cache 'vm_area_struct', objp ec0dc90c
[17803962.460000]       slab_bufctl(ec0dc000)[18] = ff00fffe@ec0dc07c
[17814556.724000] usb 3-2: new full speed USB device using uhci_hcd and address 29

biovec-1             280    609     16  203    1 : tunables  120   60    0 : slabdata      3      3      0
bio                  280    413     64   59    1 : tunables  120   60    0 : slabdata      7      7      0
vm_area_struct      7830  10076     88   44    1 : tunables  120   60    0 : slabdata    229    229      0
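
A note on reading the debug lines above: the printed entry addresses are consistent with a bufctl array of 4-byte entries that starts 0x1c bytes into the slab descriptor. The sketch below recomputes them in user space. The 0x1c header size is inferred from the logged addresses (it is not quoted from the kernel source), the function names are made up for this note, and the s_mem value used in the example is worked backwards from objp, so treat all of these as assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: inferred from the logged slab/entry address pairs. */
#define SLAB_HEADER_BYTES  0x1cu
/* Assumption: kmem_bufctl_t is a 32-bit unsigned int on i386. */
#define BUFCTL_ENTRY_BYTES 4u

/* Address of bufctl entry number objnr for the slab descriptor at slabp. */
static uint32_t bufctl_entry_addr(uint32_t slabp, uint32_t objnr)
{
    return slabp + SLAB_HEADER_BYTES + objnr * BUFCTL_ENTRY_BYTES;
}

/* Mirrors the expression in the patch:
 * objnr = (objp - slabp->s_mem) / cachep->buffer_size */
static uint32_t object_index(uint32_t objp, uint32_t s_mem, uint32_t buffer_size)
{
    assert(buffer_size != 0 && objp >= s_mem);
    return (objp - s_mem) / buffer_size;
}
```

For example, bufctl_entry_addr(0xdb38a000, 0x18) gives 0xdb38a07c, matching Comment 15, and object_index(0xec0dc90c, 0xec0dc0cc, 88) gives 0x18 when the 88-byte vm_area_struct object size from slabinfo is taken as the buffer size.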
Comment 17 Malte S. Stretz 2006-07-20 15:34:55 UTC
After a long time running, finally another one.  And I had hoped that the switch
to the radeon driver had made them go away:

[17253932.868000] slab: double free detected in cache 'anon_vma', objp d6eb1728
[17253932.868000]       slab_bufctl(d6eb1000)[38] = fffe@d6eb10fc

Same pattern, two bytes 00 instead of the expected ff.
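
To make the pattern concrete, the logged entries can be compared against the marker the check expects. This is a small user-space sketch, assuming a 32-bit kmem_bufctl_t and BUFCTL_FREE = ~0U - 1 (0xfffffffe) as in the 2.6.16 slab code; the helper names are invented for this note. An entry printed as "fffe" with %x is read as 0x0000fffe.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: expected marker for an allocated object (BUFCTL_FREE). */
#define BUFCTL_FREE_ASSUMED 0xfffffffeu

/* Bits that differ between a logged entry and the expected marker. */
static uint32_t bufctl_diff(uint32_t observed)
{
    return observed ^ BUFCTL_FREE_ASSUMED;
}

/* Byte number index (0 = least significant) of a 32-bit value. */
static uint32_t byte_at(uint32_t value, int index)
{
    assert(index >= 0 && index < 4);
    return (value >> (8 * index)) & 0xffu;
}

/* Number of bytes that read 0x00 where the marker has 0xff. */
static int zeroed_ff_bytes(uint32_t observed)
{
    int count = 0;
    for (int i = 0; i < 4; i++) {
        if (byte_at(BUFCTL_FREE_ASSUMED, i) == 0xffu &&
            byte_at(observed, i) == 0x00u)
            count++;
    }
    return count;
}
```

With these assumptions, bufctl_diff(0xff00fffe) is 0x00ff0000 (one byte cleared, as in Comments 14 and 16) and bufctl_diff(0x0000fffe) is 0xffff0000 (two bytes cleared, as in Comments 15 and 17). The low 16 bits match the marker in every logged case.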
Comment 18 Malte S. Stretz 2006-10-02 00:24:04 UTC
I switched to 2.6.17 and it's getting nasty again.  Now the double free is
detected and, when the second error line is supposed to be printed, the kernel
oopses with an "unable to handle kernel paging request".  I had to modify Hugh's
patch for 2.6.17 as the code has changed, but that was pretty straightforward.

The new patch and the oops:
--- mm/slab.c.orig      2006-09-19 20:13:04.000000000 +0200
+++ mm/slab.c   2006-09-27 12:31:36.000000000 +0200
@@ -2431,10 +2431,8 @@

        slabp->inuse++;
        next = slab_bufctl(slabp)[slabp->free];
-#if DEBUG
        slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
        WARN_ON(slabp->nodeid != nodeid);
-#endif
        slabp->free = next;

        return objp;
@@ -2445,16 +2443,16 @@
 {
        unsigned int objnr = obj_to_index(cachep, slabp, objp);

-#if DEBUG
        /* Verify that the slab belongs to the intended node */
        WARN_ON(slabp->nodeid != nodeid);

        if (slab_bufctl(slabp)[objnr] + 1 <= SLAB_LIMIT + 1) {
+               kmem_bufctl_t *bufctl = slab_bufctl(slabp)[objnr];
                printk(KERN_ERR "slab: double free detected in cache "
                                "'%s', objp %p\n", cachep->name, objp);
-               BUG();
+               printk(KERN_ERR "      slab_bufctl(%p)[%x] = %x@%p\n",
+                               slabp, objnr, *bufctl, bufctl);
        }
-#endif
        slab_bufctl(slabp)[objnr] = slabp->free;
        slabp->free = objnr;
        slabp->inuse--;


[17191396.540000] slab: double free detected in cache 'vm_area_struct', objp ea67990c
[17191396.540000] BUG: unable to handle kernel paging request at virtual address 0000fffe
[17191396.540000]  printing eip:
[17191396.540000] c015ca22
[17191396.540000] *pde = 00000000
[17191396.540000] Oops: 0000 [#1]
[17191396.540000] PREEMPT
[17191396.540000] Modules linked in: w83627hf hwmon_vid hwmon ipv6 eeprom i2c_isa i2c_viapro iptable_mangle iptable_filter ip_tables x_tables snd_seq snd_via82xx gameport snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore joydev sd_mod usbhid usb_storage scsi_mod vfat fat ide_cd cdrom 8250 serial_core ehci_hcd uhci_hcd usbcore tun ne2k_pci 8390 3c59x via_rhine mii capability commoncap button fan thermal processor non_fatal radeon
[17191396.540000] CPU:    0
[17191396.540000] EIP:    0060:[<c015ca22>]    Not tainted VLI
[17191396.540000] EFLAGS: 00010086   (2.6.17-gentoo-r8-b5964t3 #3)
[17191396.540000] EIP is at free_block+0x132/0x1b0
[17191396.540000] eax: 00000059   ebx: ea679000   ecx: 00000073   edx: 00000001
[17191396.540000] esi: 0000fffe   edi: dfffdd20   ebp: c7c6fe8c   esp: c7c6fe50
[17191396.540000] ds: 007b   es: 007b   ss: 0068
[17191396.540000] Process dcop (pid: 5306, threadinfo=c7c6e000 task=e3344570)
[17191396.540000] Stack: c031bda0 c0318b54 ea67990c c7c6fe64 0000fffe 00000018 
0000003c dfff9010
[17191396.540000]        dffffc60 00000030 ea67990c ea67907c dfffebe0 0000003c 
dffffc60 c7c6feb8
[17191396.540000]        c015c6d7 00000000 c7c6fec4 c0161c52 dfff9010 dfff9000 
00000000 dfff9000
[17191396.540000] Call Trace:
[17191396.540000]  <c01042ad> show_stack_log_lvl+0x9d/0xd0  <c01044f6> 
show_registers+0x1c6/0x250
[17191396.540000]  <c010469e> die+0x11e/0x2c0  <c0116846> 
do_page_fault+0x276/0x68c
[17191396.540000]  <c0103c6f> error_code+0x4f/0x54  <c015c6d7> 
cache_flusharray+0x47/0xf0
[17191396.540000]  <c015c848> kmem_cache_free+0x48/0x50  <c0150826> 
remove_vma+0x46/0x50
[17191396.540000]  <c015090f> exit_mmap+0xdf/0x110  <c0119f93> mmput+0x33/0xc0
[17191396.540000]  <c011ded3> exit_mm+0x93/0x120  <c011f709> do_exit+0xd9/0x9c0
[17191396.540000]  <c0120027> do_group_exit+0x37/0xa0  <c01200a5> 
sys_exit_group+0x15/0x20
[17191396.540000]  <c0103173> sysenter_past_esp+0x54/0x75
[17191396.540000] Code: ff ff 83 c4 30 5b 5e 5f c9 c3 8b 55 ec 8b 4d e4 89 54 
24 08 8b 41 44 c7 04 24 a0 bd 31 c0 89 44 24 04 e8 62 08 fc ff 89 74 24 10 <8b>
 06 89 5c 24 04 c7 04 24 d8 bd 31 c0 89 44 24 0c 8b 45 d8 89
[17191396.540000] EIP: [<c015ca22>] free_block+0x132/0x1b0 SS:ESP 0068:c7c6fe50
[17191396.540000]  <1>Fixing recursive fault but reboot is needed!
[17191396.540000] BUG: scheduling while atomic: dcop/0x00000001/5306
[17191396.540000]  <c0104323> show_trace+0x13/0x20  <c010496e> 
dump_stack+0x1e/0x20
[17191396.540000]  <c02fd21a> schedule+0x49a/0x670  <c011fc70> 
do_exit+0x640/0x9c0
[17191396.540000]  <c010483d> die+0x2bd/0x2c0  <c0116846> 
do_page_fault+0x276/0x68c
[17191396.540000]  <c0103c6f> error_code+0x4f/0x54  <c015c6d7> 
cache_flusharray+0x47/0xf0
[17191396.540000]  <c015c848> kmem_cache_free+0x48/0x50  <c0150826> 
remove_vma+0x46/0x50
[17191396.540000]  <c015090f> exit_mmap+0xdf/0x110  <c0119f93> mmput+0x33/0xc0
[17191396.540000]  <c011ded3> exit_mm+0x93/0x120  <c011f709> do_exit+0xd9/0x9c0
[17191396.540000]  <c0120027> do_group_exit+0x37/0xa0  <c01200a5> 
sys_exit_group+0x15/0x20
[17191396.540000]  <c0103173> sysenter_past_esp+0x54/0x75
[17191396.552000] slab: double free detected in cache 'vm_area_struct', objp e1f7ac7c
[17191396.552000] general protection fault: 0000 [#2]
[17191396.552000] PREEMPT
[17191396.552000] Modules linked in: w83627hf hwmon_vid hwmon ipv6 eeprom i2c_isa i2c_viapro iptable_mangle iptable_filter ip_tables x_tables snd_seq snd_via82xx gameport snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore joydev sd_mod usbhid usb_storage scsi_mod vfat fat ide_cd cdrom 8250 serial_core ehci_hcd uhci_hcd usbcore tun ne2k_pci 8390 3c59x via_rhine mii capability commoncap button fan thermal processor non_fatal radeon
[17191396.552000] CPU:    0
[17191396.552000] EIP:    0060:[<c015ca22>]    Not tainted VLI
[17191396.552000] EFLAGS: 00010086   (2.6.17-gentoo-r8-b5964t3 #3)
[17191396.552000] EIP is at free_block+0x132/0x1b0
[17191396.552000] eax: 00000059   ebx: e1f7a000   ecx: 00000073   edx: f5690000
[17191396.552000] esi: ffffffff   edi: dfffdd20   ebp: f5691e8c   esp: f5691e50
[17191396.552000] ds: 007b   es: 007b   ss: 0068
[17191396.552000] Process dcop (pid: 5308, threadinfo=f5690000 task=d8e225d0)
[17191396.552000] Stack: c031bda0 c0318b54 e1f7ac7c f5691e64 ffffffff 00000022 
0000003c dfff9010
[17191396.552000]        dffffc60 00000000 e1f7ac7c e1f7a0a4 dfffebe0 0000003c 
dffffc60 f5691eb8
[17191396.552000]        c015c6d7 00000000 f5691ec4 c0161c52 dfff9010 dfff9000 
00000000 dfff9000
[17191396.552000] Call Trace:
[17191396.552000]  <c01042ad> show_stack_log_lvl+0x9d/0xd0  <c01044f6> 
show_registers+0x1c6/0x250
[17191396.552000]  <c010469e> die+0x11e/0x2c0  <c0105781> 
do_general_protection+0x1d1/0x230
[17191396.552000]  <c0103c6f> error_code+0x4f/0x54  <c015c6d7> 
cache_flusharray+0x47/0xf0
[17191396.552000]  <c015c848> kmem_cache_free+0x48/0x50  <c0150826> 
remove_vma+0x46/0x50
[17191396.552000]  <c015090f> exit_mmap+0xdf/0x110  <c0119f93> mmput+0x33/0xc0
[17191396.552000]  <c011ded3> exit_mm+0x93/0x120  <c011f709> do_exit+0xd9/0x9c0
[17191396.552000]  <c0120027> do_group_exit+0x37/0xa0  <c01200a5> 
sys_exit_group+0x15/0x20
[17191396.552000]  <c0103173> sysenter_past_esp+0x54/0x75
[17191396.552000] Code: ff ff 83 c4 30 5b 5e 5f c9 c3 8b 55 ec 8b 4d e4 89 54 
24 08 8b 41 44 c7 04 24 a0 bd 31 c0 89 44 24 04 e8 62 08 fc ff 89 74 24 10 <8b>
 06 89 5c 24 04 c7 04 24 d8 bd 31 c0 89 44 24 0c 8b 45 d8 89
[17191396.552000] EIP: [<c015ca22>] free_block+0x132/0x1b0 SS:ESP 0068:f5691e50
[17191396.552000]  <1>Fixing recursive fault but reboot is needed!
[17191396.552000] BUG: scheduling while atomic: dcop/0x00000001/5308
[17191396.552000]  <c0104323> show_trace+0x13/0x20  <c010496e> 
dump_stack+0x1e/0x20
[17191396.552000]  <c02fd21a> schedule+0x49a/0x670  <c011fc70> 
do_exit+0x640/0x9c0
[17191396.552000]  <c010483d> die+0x2bd/0x2c0  <c0105781> 
do_general_protection+0x1d1/0x230
[17191396.552000]  <c0103c6f> error_code+0x4f/0x54  <c015c6d7> 
cache_flusharray+0x47/0xf0
[17191396.552000]  <c015c848> kmem_cache_free+0x48/0x50  <c0150826> 
remove_vma+0x46/0x50
[17191396.552000]  <c015090f> exit_mmap+0xdf/0x110  <c0119f93> mmput+0x33/0xc0
[17191396.552000]  <c011ded3> exit_mm+0x93/0x120  <c011f709> do_exit+0xd9/0x9c0
[17191396.552000]  <c0120027> do_group_exit+0x37/0xa0  <c01200a5> 
sys_exit_group+0x15/0x20
[17191396.552000]  <c0103173> sysenter_past_esp+0x54/0x75
...
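
A possible reading of the oopses above, not confirmed anywhere in this thread: Hugh's 2.6.16 patch initialises bufctl with slab_bufctl(slabp) + objnr (the entry's address), whereas the 2.6.17 adaptation uses slab_bufctl(slabp)[objnr] (the entry's value). The second printk then dereferences that value, which would fit the faulting address 0000fffe (also in esi) and the general protection fault with esi = ffffffff. The user-space sketch below repeats the 2.6.17 comparison to show which entry values it flags, and what address ends up being read when an entry's value is used as a pointer. The marker constants assume a 32-bit kmem_bufctl_t with BUFCTL_END = ~0U, BUFCTL_FREE = ~0U - 1 and SLAB_LIMIT = ~0U - 3; they and the helper names are assumptions, not quoted from the 2.6.17 source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t kmem_bufctl_t;   /* assumption: 32-bit on i386 */

/* Assumed marker values (see the note above). */
#define BUFCTL_END  ((kmem_bufctl_t)0xffffffffu)
#define BUFCTL_FREE ((kmem_bufctl_t)0xfffffffeu)
#define SLAB_LIMIT  ((kmem_bufctl_t)0xfffffffcu)

/* Same comparison as in the adapted patch:
 * slab_bufctl(slabp)[objnr] + 1 <= SLAB_LIMIT + 1
 * The additions wrap modulo 2^32, so BUFCTL_END + 1 becomes 0. */
static int flagged_as_double_free(kmem_bufctl_t entry)
{
    return (kmem_bufctl_t)(entry + 1) <= (kmem_bufctl_t)(SLAB_LIMIT + 1);
}

/* Address that "*bufctl" would read if bufctl were initialised from the
 * entry's value (table[objnr]) instead of its address (table + objnr). */
static uintptr_t value_used_as_address(const kmem_bufctl_t *table,
                                       unsigned int objnr)
{
    assert(table != NULL);
    return (uintptr_t)table[objnr];
}
```

Under these assumptions both 0x0000fffe and 0xffffffff are flagged, while an intact BUFCTL_FREE marker is not; an entry holding 0x0000fffe used as a pointer would be read at address 0x0000fffe.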
Comment 19 Daniel Drake 2007-04-29 07:16:32 UTC
Malte, is this still an issue on the latest kernel release (2.6.21 or newer)?
Comment 20 Malte S. Stretz 2007-05-30 14:29:56 UTC
When I left Germany in January it was still an issue, haven't used my 
workstation at home since then. I can check again in September.
Comment 21 Natalie Protasevich 2007-11-05 23:30:36 UTC
Malte, did you have a chance to test recently?
Thanks.
Comment 22 Malte S. Stretz 2007-11-06 02:10:59 UTC
Nope, sorry, don't have that system anymore.  And I heard there will be a new allocator anyway, so I'll just close this bug.
