Most recent kernel where this bug did not occur: 2.6.10-gentoo-r4 Most recent kernel where this bug did occur: 2.6.14 Distribution: Gentoo Hardware Environment: mss@otherland ~ $ cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 4 model name : AMD Athlon(tm) Processor stepping : 2 cpu MHz : 1199.924 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow bogomips : 2402.42 mss@otherland ~ $ lspci 00:00.0 Host bridge: VIA Technologies, Inc. VT8366/A/7 [Apollo KT266/A/333] 00:01.0 PCI bridge: VIA Technologies, Inc. VT8366/A/7 [Apollo KT266/A/333 AGP] 00:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS) 00:08.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42) 00:09.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 61) 00:09.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 61) 00:09.2 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 63) 00:11.0 ISA bridge: VIA Technologies, Inc. VT8233 PCI to ISA Bridge 00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06) 00:11.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 1b) 00:11.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 1b) 00:11.4 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 1b) 00:11.5 Multimedia audio controller: VIA Technologies, Inc. 
VT8233/A/8235/8237 AC97 Audio Controller (rev 10) 01:00.0 VGA compatible controller: ATI Technologies Inc RV280 [Radeon 9200 PRO] (rev 01) 01:00.1 Display controller: ATI Technologies Inc RV280 [Radeon 9200 PRO] (Secondary) (rev 01) mss@otherland ~ $ lsusb Bus 005 Device 001: ID 0000:0000 Bus 004 Device 002: ID 07cc:0301 Carry Computer Eng., Co., Ltd Bus 004 Device 001: ID 0000:0000 Bus 003 Device 007: ID 04a6:0180 Nokia Display Products Bus 003 Device 004: ID 045e:001e Microsoft Corp. IntelliMouse Explorer Bus 003 Device 002: ID 0451:1446 Texas Instruments, Inc. TUSB2040/2070 Hub Bus 003 Device 001: ID 0000:0000 Bus 002 Device 001: ID 0000:0000 Bus 001 Device 001: ID 0000:0000 Software Environment: From an earlier kernel: mss@otherland /usr/src/linux-2.6.16-rc1 $ sh scripts/ver_linux If some fields are empty or look unusual you may have an old version. Compare to the current minimal requirements in Documentation/Changes. Linux otherland 2.6.14-gentoo-r5 #3 PREEMPT Mon Jan 23 10:37:43 CET 2006 i686 AMD Athlon(tm) Processor AuthenticAMD GNU/Linux Gnu C 3.4.4 Gnu make 3.80 binutils 2.16.1 util-linux 2.12r mount 2.12r module-init-tools 3.2.1 e2fsprogs 1.38 jfsutils 1.1.8 reiserfsprogs 3.6.19 reiser4progs line xfsprogs 2.6.25 Linux C Library 2.3.5 Dynamic linker (ldd) 2.3.5 Procps 3.2.5 Net-tools 1.60 Kbd 1.12 Sh-utils 5.2.1 udev 079 Modules Loaded snd_seq snd_pcm_oss snd_mixer_oss snd_via82xx snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore sd_mod ext2 mbcache usbhid usb_storage scsi_mod vfat fat ide_cd cdrom uhci_hcd usbcore via_rhine mii capability commoncap button fan thermal processor non_fatal Problem Description: This was first reported in the Gentoo Bugzilla under <http://bugs.gentoo.org/show_bug.cgi?id=117507>. As it was (fairly) reproducible with the vanilla 2.6.16-rc1, I was sent here. 
Kernels since 2.6.14 (or some other version after 2.6.10 as I didn't try the versions between) tend to crash my machine. Going back to 2.6.10-gentoo-r5 fixes the problem. The kernel crashes with the message "slab: double free detected in cache 'vm_area_struct'" and "kernel BUG at mm/slab.c:2701! invalid opcode: 0000 [#1]". The whole dmesg output for 2.6.16-rc1 is available at <http://bugs.gentoo.org/attachment.cgi?id=77737> (other versions are available in that gentoo bug, too). The cut version: slab: double free detected in cache 'vm_area_struct', objp f72c73d8 ------------[ cut here ]------------ kernel BUG at mm/slab.c:2701! invalid opcode: 0000 [#1] PREEMPT Modules linked in: snd_seq snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usb_lib snd_hwdep snd_via82xx snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore sd_mod ext2 mbcache usb_storage scsi_mod usbhid vfat fat ide_cd cdrom ehci_hcd uhci_hcd usbcore ipv6 via_rhine mii capability commoncap button fan thermal processor non_fatal CPU: 0 EIP: 0060:[<c015e86d>] Not tainted VLI EFLAGS: 00210096 (2.6.16-rc1) EIP is at free_block+0xbd/0x180 eax: 00000047 ebx: f72c7000 ecx: c034ee6c edx: c034ee6c esi: c18ddd60 edi: 00000008 ebp: f72c701c esp: d4747eb8 ds: 007b es: 007b ss: 0068 Process sh (pid: 8989, threadinfo=d4746000 task=ea1250b0) Stack: <0>c031ac14 c0317bed f72c73d8 00200202 f72c73d8 00000002 c18de7cc c18dc820 f764b504 00000010 c015e9a3 c18dc820 c18da610 00000010 00000000 c18da610 c18ddd60 c18da600 c18dc820 f764b504 00200282 c015ecbb c18dc820 c18da600 Call Trace: [<c015e9a3>] cache_flusharray+0x73/0x170 [<c015ecbb>] kmem_cache_free+0x7b/0x80 [<c014fbbc>] remove_vma+0x5c/0x80 [<c014fbbc>] remove_vma+0x5c/0x80 [<c0151f18>] exit_mmap+0xd8/0x110 [<c011a8f7>] mmput+0x37/0xb0 [<c011f4e3>] do_exit+0xf3/0x460 [<c011f8c4>] do_group_exit+0x34/0xa0 [<c010320b>] sysenter_past_esp+0x54/0x75 Code: 00 00 8d 6b 1c 83 7c bd 00 fe 74 27 8b 44 24 10 8b 54 24 2c 89 44 
24 08 8b 42 48 c7 04 24 14 ac 31 c0 89 44 24 04 e8 c3 ea fb ff <0f> 0b 8d 0a 7d 9c 31 c0 8b 43 14 89 44 bd 00 89 7b 14 8b 4c 24 <1>Fixing recursive fault but reboot is needed! scheduling while atomic: sh/0x00000001/8989 [<c02fe198>] schedule+0x588/0x670 [<c01040b8>] show_stack_log_lvl+0xa8/0xe0 [<c015e882>] free_block+0xd2/0x180 [<c011f68d>] do_exit+0x29d/0x460 [<c011d347>] printk+0x17/0x20 [<c0104545>] die+0x195/0x1a0 [<c01048d0>] do_invalid_op+0x0/0xb0 [<c0104972>] do_invalid_op+0xa2/0xb0 [<c015e86d>] free_block+0xbd/0x180 [<c011d74f>] release_console_sem+0xcf/0xf0 [<c011d5e5>] vprintk+0x295/0x2b0 [<c0144323>] buffered_rmqueue+0xc3/0x230 [<c0103cfb>] error_code+0x4f/0x54 [<c015e86d>] free_block+0xbd/0x180 [<c015e9a3>] cache_flusharray+0x73/0x170 [<c015ecbb>] kmem_cache_free+0x7b/0x80 [<c014fbbc>] remove_vma+0x5c/0x80 [<c014fbbc>] remove_vma+0x5c/0x80 [<c0151f18>] exit_mmap+0xd8/0x110 [<c011a8f7>] mmput+0x37/0xb0 [<c011f4e3>] do_exit+0xf3/0x460 [<c011f8c4>] do_group_exit+0x34/0xa0 [<c010320b>] sysenter_past_esp+0x54/0x75 Unable to handle kernel paging request at virtual address 00100104 printing eip: c015e806 *pde = 00000000 Oops: 0002 [#2] PREEMPT Modules linked in: snd_seq snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usb_lib snd_hwdep snd_via82xx snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore sd_mod ext2 mbcache usb_storage scsi_mod usbhid vfat fat ide_cd cdrom ehci_hcd uhci_hcd usbcore ipv6 via_rhine mii capability commoncap button fan thermal processor non_fatal CPU: 0 EIP: 0060:[<c015e806>] Not tainted VLI EFLAGS: 00210082 (2.6.16-rc1) EIP is at free_block+0x56/0x180 eax: 00100100 ebx: f72c7000 ecx: 00000000 edx: 00200200 esi: c18ddd60 edi: 00000007 ebp: 00000000 esp: c1927e90 ds: 007b es: 007b ss: 0068 Process events/0 (pid: 4, threadinfo=c1926000 task=c1907a70) Stack: <0>c18dc780 f78dc060 00000007 00000000 f72c7248 00000000 c18da610 c18da600 00000007 00000000 c015f210 c18dc820 c18da610 
00000007 00000000 c18dc820 c18ddd24 c18dc820 c18ddd60 00000001 c015f2f4 c18dc820 c18da600 00000000
Call Trace:
[<c015f210>] drain_array_locked+0x80/0xd0
[<c015f2f4>] cache_reap+0x94/0x200
[<c012de42>] run_workqueue+0x92/0x130
[<c015f260>] cache_reap+0x0/0x200
[<c012e038>] worker_thread+0x158/0x180
[<c0119270>] default_wake_function+0x0/0x20
[<c0119270>] default_wake_function+0x0/0x20
[<c012dee0>] worker_thread+0x0/0x180
[<c0131896>] kthread+0xb6/0xf0
[<c01317e0>] kthread+0x0/0xf0
[<c0101395>] kernel_thread_helper+0x5/0x10
Code: 8b 4c 24 38 89 d0 89 54 24 10 05 00 00 00 40 c1 e8 0c c1 e0 05 03 05 10 7a 4d c0 8b 58 1c 8b 44 24 2c 8b 53 04 8b 74 88 14 8b 03 <89> 50 04 89 02 31 d2 c7 03 00 01 10 00 c7 43 04 00 02 20 00 8b
<6>note: events/0[4] exited with preempt_count 1

Steps to reproduce:
Not sure. First, use some kernel after 2.6.10. Current ideas about what could be the culprit:
1. High system load; that at least helps (like compiling via Gentoo's emerge -uD world), but a relatively idle system seems to crash after a few days, too, just not as often.
2. Some USB driver. I tend to switch off my screen, which has a USB hub included; it just seems that the system doesn't crash if I don't do so.
3. The serial driver. Seems like the system doesn't crash when I don't load it.
4. Something completely different.

I say "seems" in all cases because it's very hard to reproduce. I don't think it's a hardware failure, though, because going back to 2.6.10 gives me a stable system. Any ideas on how to debug this are welcome.
Created attachment 7153 [details] config for 2.6.14 I accidentally deleted the 2.6.16-rc1 config (and the binary) -- I'm currently recreating it, based on this config for the (also crashing) 2.6.14.
Created attachment 7808 [details] dmesg output from 2.6.16 This is the dmesg output from 2.6.16-gentoo-r1; it first seemed to be pretty stable but finally crashed...
Created attachment 7809 [details] config of 2.6.16
Guess I'll just disable DEBUG_SLAB. If anybody wants to debug this.....
How long does the crash take to happen? Are there any different-looking crashes, or always this one? If possible, can you run memtest86 on that machine for 24 hours?
How long: Depends. The box was running for a few days without problems, then it crashed twice in a row. It was under high load both times (first compiling KOffice, then the kernel), but I have also compiled a whole lot of stuff without any crashes. To me the crash always looks the same; just the order in which the processes die is different. I will try a memtest at some point but, as I said, 2.6.10 runs rock solid.
The Easter weekend gave me a good chance for a memtest86+ (v1.65) session: 100h running, the tests passed 70 times without any errors.
Thanks for doing such a thorough memtest86+: sounds convincing.

I had been thinking of your vm_area_struct double-free as just one of several confounding slab corruptions seen in recent months. But looking at it again now, suspect it's more specific.

Could you please rebuild with the patch below, run your testing on that kernel, and report back how it goes when you've run for long enough to judge?

Even if it seems to fix your immediate problem, I don't believe it's the real fix: more something to try, and if it works, then we have a better idea of what direction to look in next.

(That is, something I could easily cook up to keep you busy, while I go away and think about something else - oops, how unprofessional, forget I said that ;-)

Hugh

--- 2.6.16/mm/mmap.c	2006-03-20 05:53:29.000000000 +0000
+++ linux/mm/mmap.c	2006-04-18 18:59:39.000000000 +0100
@@ -1933,7 +1933,7 @@ EXPORT_SYMBOL(do_brk);
 void exit_mmap(struct mm_struct *mm)
 {
 	struct mmu_gather *tlb;
-	struct vm_area_struct *vma = mm->mmap;
+	struct vm_area_struct *vma = xchg(&mm->mmap, 0);
 	unsigned long nr_accounted = 0;
 	unsigned long end;
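The effect of the xchg() in the test patch can be sketched in user space. This is only a minimal sketch, assuming C11 atomic_exchange() as a stand-in for the kernel's xchg(): whoever takes the list head first gets the whole list and leaves NULL behind, so a second teardown pass over the same head finds nothing to free. The names struct node, take_list, free_list and build_list are hypothetical and are not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a vma list node; not the kernel structure. */
struct node {
	struct node *next;
};

/* Atomically detach the whole list: return the old head, leave NULL. */
static struct node *take_list(struct node *_Atomic *head)
{
	return atomic_exchange(head, (struct node *)NULL);
}

/* Free every node of a detached list and return how many were freed. */
static size_t free_list(struct node *n)
{
	size_t freed = 0;

	while (n) {
		struct node *next = n->next;

		free(n);
		freed++;
		n = next;
	}
	return freed;
}

/* Build a list of 'count' nodes, pushing each new node at the head. */
static struct node *build_list(size_t count)
{
	struct node *head = NULL;

	while (count--) {
		struct node *n = malloc(sizeof(*n));

		assert(n != NULL);
		n->next = head;
		head = n;
	}
	return head;
}
```

With a three-node list, the first take_list() returns the list and free_list() frees three nodes; a second take_list() on the same head returns NULL and frees nothing. As the comment above says, this would only hide a second walk over the list, not explain why it happens.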
That didn't really help. Last night the box crashed again; this time it even took down the xinetd process, so I can't get a trace.
On Tue, 18 Apr 2006, bugme-daemon@bugzilla.kernel.org wrote: > That didn't really help. Yesterday night the box crashed again, > this time it even took down the xinetd process so I can't get a trace. Hmmm. Very inconclusive. It might be that the patch was irrelevant and didn't help at all; or it might be that the patch helped to get around the vm_area_struct freeing errors, and so let the system sail on to hit the effects of the underlying bug. I think I'd like to ask you to run with the patch again, in the hope that "can't get a trace" was a one-off, and more info emerges this time around.
Created attachment 8248 [details] "screenshot" of the last crash After some time I tried a kernel with the patch applied again; after running for half a day, the system went into a crash frenzy again, this time also on startup. All I could gather after some tries is this "screenshot", which says that it is now crashing at slab.c:2392 (but the line numbers may have changed, because this is the more recent kernel 2.6.16-gentoo-r8).
Crashed again. I *think* the trace looks different but who knows... [17179569.184000] Linux version 2.6.16-gentoo-r8-bug5964-try1 (root@otherland) (gcc version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)) #1 PREEMPT Fri Jun 2 18:40:48 CEST 2006 [...] [17351310.004000] slab: double free detected in cache 'vm_area_struct', objp ebb56a18 [17351310.004000] ------------[ cut here ]------------ [17351310.004000] kernel BUG at mm/slab.c:2392! [17351310.004000] invalid opcode: 0000 [#1] [17351310.004000] PREEMPT [17351310.004000] Modules linked in: w83627hf hwmon_vid hwmon eeprom i2c_isa i2c_viapro md5 ipv6 snd_seq snd_pcm_oss snd_mixer_oss snd_via82xx gameport snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore sd_mod usb_storage scsi_mod usbhid dm_mod vfat fat ide_cd cdrom 8250 serial_core ehci_hcd uhci_hcd usbcore tun ne2k_pci 8390 3c59x via_rhine mii capability commoncap button fan thermal processor non_fatal rtc [17351310.004000] CPU: 0 [17351310.004000] EIP: 0060:[<b0160241>] Not tainted VLI [17351310.004000] EFLAGS: 00010096 (2.6.16-gentoo-r8-bug5964-try1 #1) [17351310.004000] EIP is at slab_put_obj+0x51/0xa0 [17351310.004000] eax: 00000059 ebx: ebb56000 ecx: b0356f6c edx: 00000001 [17351310.004000] esi: 00000018 edi: ebb5601c ebp: bdc23e70 esp: bdc23e54 [17351310.004000] ds: 007b es: 007b ss: 0068 [17351310.004000] Process tcsh (pid: 6732, threadinfo=bdc22000 task=e9618590) [17351310.004000] Stack: <0>b0321cd8 b031e951 ebb56a18 bdc23e68 ebb56a18 ebb56000 effedd60 bdc23e98 [17351310.004000] b0160d68 effec820 ebb56000 ebb56a18 00000000 0000000d effee7cc effec820 [17351310.004000] b86fa7c0 bdc23ec8 b0160e5b effec820 effea610 00000010 00000000 effea610 [17351310.004000] Call Trace: [17351310.004000] [<b01040ca>] show_stack_log_lvl+0xaa/0xe0 [17351310.004000] [<b01042e7>] show_registers+0x197/0x210 [17351310.004000] [<b01044e7>] die+0xf7/0x1a0 [17351310.004000] [<b0104617>] do_trap+0x87/0xd0 
[17351310.004000] [<b0104985>] do_invalid_op+0xb5/0xc0 [17351310.004000] [<b0103ceb>] error_code+0x4f/0x54 [17351310.004000] [<b0160d68>] free_block+0x88/0x100 [17351310.004000] [<b0160e5b>] cache_flusharray+0x7b/0x180 [17351310.004000] [<b0161172>] kmem_cache_free+0x72/0x80 [17351310.004000] [<b0152588>] remove_vma+0x58/0x70 [17351310.004000] [<b01547dd>] exit_mmap+0xdd/0x110 [17351310.004000] [<b011a903>] mmput+0x33/0xb0 [17351310.004000] [<b011f85d>] exit_mm+0x8d/0x110 [17351310.004000] [<b01200e7>] do_exit+0xf7/0x4c0 [17351310.004000] [<b012052b>] do_group_exit+0x3b/0xd0 [17351310.004000] [<b01205d5>] sys_exit_group+0x15/0x20 [17351310.004000] [<b01031fb>] sysenter_past_esp+0x54/0x75 [17351310.004000] Code: 3b 45 14 75 45 8d 7b 1c 83 3c b7 fe 74 25 8b 45 10 8b 55 08 89 44 24 08 8b 42 44 c7 04 24 d8 1c 32 b0 89 44 24 04 e8 2f db fb ff <0f> 0b 58 09 6e 0e 32 b0 8b 43 14 89 04 b7 ff 4b 10 89 73 14 8b [17351310.004000] <1>Fixing recursive fault but reboot is needed! [17351310.004000] scheduling while atomic: tcsh/0x00000001/6732 [17351310.004000] [<b0104010>] show_trace+0x20/0x30 [17351310.004000] [<b010414e>] dump_stack+0x1e/0x20 [17351310.004000] [<b030588c>] schedule+0x5ac/0x690 [17351310.004000] [<b01202ee>] do_exit+0x2fe/0x4c0 [17351310.004000] [<b0104585>] die+0x195/0x1a0 [17351310.004000] [<b0104617>] do_trap+0x87/0xd0 [17351310.004000] [<b0104985>] do_invalid_op+0xb5/0xc0 [17351310.004000] [<b0103ceb>] error_code+0x4f/0x54 [17351310.004000] [<b0160d68>] free_block+0x88/0x100 [17351310.004000] [<b0160e5b>] cache_flusharray+0x7b/0x180 [17351310.004000] [<b0161172>] kmem_cache_free+0x72/0x80 [17351310.004000] [<b0152588>] remove_vma+0x58/0x70 [17351310.004000] [<b01547dd>] exit_mmap+0xdd/0x110 [17351310.004000] [<b011a903>] mmput+0x33/0xb0 [17351310.004000] [<b011f85d>] exit_mm+0x8d/0x110 [17351310.004000] [<b01200e7>] do_exit+0xf7/0x4c0 [17351310.004000] [<b012052b>] do_group_exit+0x3b/0xd0 [17351310.004000] [<b01205d5>] sys_exit_group+0x15/0x20 
[17351310.004000] [<b01031fb>] sysenter_past_esp+0x54/0x75
I don't have high hopes that it will enlighten me, but please apply the patch below, rebuild your kernel (with or without CONFIG_DEBUG_SLAB as you prefer), try running, and report the messages you get - along with output from /proc/slabinfo (or at least its relevant lines, e.g. for "vm_area_struct").

I've cut out the BUG (with its not very interesting backtrace), so you should be able to continue running successfully, as you found you could before without DEBUG_SLAB. After gathering several groups of error messages, it's probably worth rebooting and trying again: to help build up a picture of what's common.

Thanks.

--- 2.6.16/mm/slab.c	2006-03-20 05:53:29.000000000 +0000
+++ linux/mm/slab.c	2006-06-11 20:53:19.000000000 +0100
@@ -2368,10 +2368,8 @@ static void *slab_get_obj(struct kmem_ca
 	slabp->inuse++;
 	next = slab_bufctl(slabp)[slabp->free];
-#if DEBUG
 	slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
 	WARN_ON(slabp->nodeid != nodeid);
-#endif
 	slabp->free = next;
 	return objp;
@@ -2382,16 +2380,16 @@ static void slab_put_obj(struct kmem_cac
 {
 	unsigned int objnr = (unsigned)(objp-slabp->s_mem) / cachep->buffer_size;
-#if DEBUG
 	/* Verify that the slab belongs to the intended node */
 	WARN_ON(slabp->nodeid != nodeid);
 	if (slab_bufctl(slabp)[objnr] != BUFCTL_FREE) {
+		kmem_bufctl_t *bufctl = slab_bufctl(slabp) + objnr;
 		printk(KERN_ERR "slab: double free detected in cache "
 		       "'%s', objp %p\n", cachep->name, objp);
-		BUG();
+		printk(KERN_ERR " slab_bufctl(%p)[%x] = %x@%p\n",
+			slabp, objnr, *bufctl, bufctl);
 	}
-#endif
 	slab_bufctl(slabp)[objnr] = slabp->free;
 	slabp->free = objnr;
 	slabp->inuse--;
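The check that this patch keeps alive can be modelled in a few lines of user-space C. This is a simplified sketch, not the kernel's slab code: toy_slab, toy_get and toy_put are hypothetical names, the slab holds a fixed eight objects, and the marker value 0xfffffffe is an assumption chosen to match the BUFCTL_FREE value that the reports in this thread compare against. An object's bufctl entry holds the marker while the object is handed out; on free, any other value is reported and, as in the patch, execution continues instead of hitting BUG().

```c
#include <assert.h>
#include <stdio.h>

#define NOBJ 8

typedef unsigned int bufctl_t;

#define TOY_BUFCTL_END  (~0u)      /* end of the free list */
#define TOY_BUFCTL_FREE (~0u - 1)  /* marker stored while an object is handed out */

/* Hypothetical, much simplified slab: one bufctl entry per object. */
struct toy_slab {
	bufctl_t bufctl[NOBJ];
	unsigned int free;   /* index of the first free object */
	unsigned int inuse;  /* number of objects handed out */
};

/* Chain all objects into the free list: 0 -> 1 -> ... -> END. */
static void toy_init(struct toy_slab *s)
{
	for (unsigned int i = 0; i < NOBJ; i++)
		s->bufctl[i] = i + 1;
	s->bufctl[NOBJ - 1] = TOY_BUFCTL_END;
	s->free = 0;
	s->inuse = 0;
}

/* Hand out the first free object and stamp its entry with the marker. */
static unsigned int toy_get(struct toy_slab *s)
{
	unsigned int objnr = s->free;

	assert(objnr != TOY_BUFCTL_END);
	s->free = s->bufctl[objnr];
	s->bufctl[objnr] = TOY_BUFCTL_FREE;
	s->inuse++;
	return objnr;
}

/* Return an object; report (but survive) a missing marker. Returns -1 then. */
static int toy_put(struct toy_slab *s, unsigned int objnr)
{
	int ret = 0;

	if (s->bufctl[objnr] != TOY_BUFCTL_FREE) {
		fprintf(stderr, "double free suspected: bufctl[%x] = %x\n",
			objnr, s->bufctl[objnr]);
		ret = -1;
	}
	s->bufctl[objnr] = s->free;
	s->free = objnr;
	s->inuse--;
	return ret;
}
```

Two different situations trip the same check in this model: freeing an object twice, and a handed-out object whose marker word was overwritten by something else before its first free. The extra printk in the patch prints the entry's value and address, which is what allows the two cases to be told apart.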
Here's one:

[17179569.184000] Linux version 2.6.16-gentoo-r9-bug5964-try2 (root@otherland) (gcc version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)) #3 PREEMPT Tue Jun 13 00:55:11 CEST 2006
[...]
[17379204.504000] hub 3-2:1.0: USB hub found
[17379204.508000] hub 3-2:1.0: 4 ports detected
[17383280.004000] slab: double free detected in cache 'biovec-1', objp efd4ead0
[17383280.004000] slab_bufctl(efd4e000)[78] = ff00fffe@efd4e1fc
[17386724.516000] usb 3-2: USB disconnect, address 14

Unfortunately this happened some time ago (the last dmesg entry is dated [17411086.788000]). slabinfo:

biovec-1 374 609 16 203 1 : tunables 120 60 0 : slabdata 3 3 0

I'll see if I can find the time to create a cron job to log this stuff.
[17479641.280000] slab: double free detected in cache 'vm_area_struct', objp db38a90c
[17479641.280000] slab_bufctl(db38a000)[18] = fffe@db38a07c

current dmesg ts: 17720900.356000
Hmmm... these patterns are really getting interesting :)

[17720900.356000] hub 3-2:1.0: 4 ports detected
[17775653.028000] slab: double free detected in cache 'biovec-1', objp eff848d0
[17775653.028000] slab_bufctl(eff84000)[58] = ff00fffe@eff8417c
[17776279.448000] usb 3-2: USB disconnect, address 28
[17781983.028000] slab: double free detected in cache 'bio', objp efcac320
[17781983.028000] slab_bufctl(efcac000)[8] = ff00fffe@efcac03c
[17803962.460000] slab: double free detected in cache 'vm_area_struct', objp ec0dc90c
[17803962.460000] slab_bufctl(ec0dc000)[18] = ff00fffe@ec0dc07c
[17814556.724000] usb 3-2: new full speed USB device using uhci_hcd and address 29

biovec-1 280 609 16 203 1 : tunables 120 60 0 : slabdata 3 3 0
bio 280 413 64 59 1 : tunables 120 60 0 : slabdata 7 7 0
vm_area_struct 7830 10076 88 44 1 : tunables 120 60 0 : slabdata 229 229 0
After a long time running, finally another one. And I had hoped the switch to the radeon driver had made them go away:

[17253932.868000] slab: double free detected in cache 'anon_vma', objp d6eb1728
[17253932.868000] slab_bufctl(d6eb1000)[38] = fffe@d6eb10fc

Same pattern, two bytes 00 instead of the expected ff.
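The "bytes 00 instead of ff" observation can be checked mechanically by XOR-ing each logged word with the value the check expects. A minimal sketch, assuming the expected marker is 0xfffffffe (that is, the value whose missing ff bytes are being discussed) and that a value printed as fffe means 0x0000fffe; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of the marker the debug check compares against. */
#define EXPECTED_MARKER 0xfffffffeu

/* Bits that differ between a logged word and the expected marker. */
static uint32_t changed_bits(uint32_t observed)
{
	return observed ^ EXPECTED_MARKER;
}

/* Number of bytes (out of four) that differ from the expected marker. */
static int changed_bytes(uint32_t observed)
{
	uint32_t diff = changed_bits(observed);
	int n = 0;

	for (int i = 0; i < 4; i++)
		if ((diff >> (8 * i)) & 0xffu)
			n++;
	return n;
}

/* Offset of a logged address within an aligned block of 'size' bytes
 * ('size' must be a power of two). */
static uint32_t offset_in_block(uint32_t addr, uint32_t size)
{
	return addr & (size - 1);
}
```

For ff00fffe the difference is 0x00ff0000 (one byte cleared), and for fffe it is 0xffff0000 (two bytes cleared); the low half of the word is intact in both. Applied to the six addresses logged above (efd4e1fc, db38a07c, eff8417c, efcac03c, ec0dc07c, d6eb10fc), offset_in_block(addr, 64) gives 0x3c each time. That is only an observation about these numbers, not a diagnosis.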
I switched to 2.6.17 and it's getting nasty again. Now the double free is detected, and when the second error line is supposed to be printed, the kernel oopses with an "unable to handle kernel paging request". I had to modify Hugh's patch for 2.6.17 as the code has changed, but that was pretty straightforward. The new patch and the oops:

--- mm/slab.c.orig	2006-09-19 20:13:04.000000000 +0200
+++ mm/slab.c	2006-09-27 12:31:36.000000000 +0200
@@ -2431,10 +2431,8 @@
 	slabp->inuse++;
 	next = slab_bufctl(slabp)[slabp->free];
-#if DEBUG
 	slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
 	WARN_ON(slabp->nodeid != nodeid);
-#endif
 	slabp->free = next;
 	return objp;
@@ -2445,16 +2443,16 @@
 {
 	unsigned int objnr = obj_to_index(cachep, slabp, objp);
-#if DEBUG
 	/* Verify that the slab belongs to the intended node */
 	WARN_ON(slabp->nodeid != nodeid);
 	if (slab_bufctl(slabp)[objnr] + 1 <= SLAB_LIMIT + 1) {
+		kmem_bufctl_t *bufctl = slab_bufctl(slabp)[objnr];
 		printk(KERN_ERR "slab: double free detected in cache "
 		       "'%s', objp %p\n", cachep->name, objp);
-		BUG();
+		printk(KERN_ERR " slab_bufctl(%p)[%x] = %x@%p\n",
+			slabp, objnr, *bufctl, bufctl);
 	}
-#endif
 	slab_bufctl(slabp)[objnr] = slabp->free;
 	slabp->free = objnr;
 	slabp->inuse--;

[17191396.540000] slab: double free detected in cache 'vm_area_struct', objp ea67990c
[17191396.540000] BUG: unable to handle kernel paging request at virtual address 0000fffe
[17191396.540000] printing eip:
[17191396.540000] c015ca22
[17191396.540000] *pde = 00000000
[17191396.540000] Oops: 0000 [#1]
[17191396.540000] PREEMPT
[17191396.540000] Modules linked in: w83627hf hwmon_vid hwmon ipv6 eeprom i2c_isa i2c_viapro iptable_mangle iptable_filter ip_tables x_tables snd_seq snd_via82xx gameport snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore joydev sd_mod usbhid usb_storage scsi_mod vfat fat ide_cd cdrom 8250 serial_core ehci_hcd uhci_hcd usbcore tun ne2k_pci 8390 3c59x via_rhine mii capability
commoncap button fan therm al processor non_fatal radeon [17191396.540000] CPU: 0 [17191396.540000] EIP: 0060:[<c015ca22>] Not tainted VLI [17191396.540000] EFLAGS: 00010086 (2.6.17-gentoo-r8-b5964t3 #3) [17191396.540000] EIP is at free_block+0x132/0x1b0 [17191396.540000] eax: 00000059 ebx: ea679000 ecx: 00000073 edx: 00000001 [17191396.540000] esi: 0000fffe edi: dfffdd20 ebp: c7c6fe8c esp: c7c6fe50 [17191396.540000] ds: 007b es: 007b ss: 0068 [17191396.540000] Process dcop (pid: 5306, threadinfo=c7c6e000 task=e3344570) [17191396.540000] Stack: c031bda0 c0318b54 ea67990c c7c6fe64 0000fffe 00000018 0000003c dfff9010 [17191396.540000] dffffc60 00000030 ea67990c ea67907c dfffebe0 0000003c dffffc60 c7c6feb8 [17191396.540000] c015c6d7 00000000 c7c6fec4 c0161c52 dfff9010 dfff9000 00000000 dfff9000 [17191396.540000] Call Trace: [17191396.540000] <c01042ad> show_stack_log_lvl+0x9d/0xd0 <c01044f6> show_registers+0x1c6/0x250 [17191396.540000] <c010469e> die+0x11e/0x2c0 <c0116846> do_page_fault+0x276/0x68c [17191396.540000] <c0103c6f> error_code+0x4f/0x54 <c015c6d7> cache_flusharray+0x47/0xf0 [17191396.540000] <c015c848> kmem_cache_free+0x48/0x50 <c0150826> remove_vma+0x46/0x50 [17191396.540000] <c015090f> exit_mmap+0xdf/0x110 <c0119f93> mmput+0x33/0xc0 [17191396.540000] <c011ded3> exit_mm+0x93/0x120 <c011f709> do_exit+0xd9/0x9c0 [17191396.540000] <c0120027> do_group_exit+0x37/0xa0 <c01200a5> sys_exit_group+0x15/0x20 [17191396.540000] <c0103173> sysenter_past_esp+0x54/0x75 [17191396.540000] Code: ff ff 83 c4 30 5b 5e 5f c9 c3 8b 55 ec 8b 4d e4 89 54 24 08 8b 41 44 c7 04 24 a0 bd 31 c0 89 44 24 04 e8 62 08 fc ff 89 74 24 10 <8b> 06 89 5c 24 04 c7 04 24 d8 bd 31 c0 89 44 24 0c 8b 45 d8 89 [17191396.540000] EIP: [<c015ca22>] free_block+0x132/0x1b0 SS:ESP 0068:c7c6fe50 [17191396.540000] <1>Fixing recursive fault but reboot is needed! 
[17191396.540000] BUG: scheduling while atomic: dcop/0x00000001/5306 [17191396.540000] <c0104323> show_trace+0x13/0x20 <c010496e> dump_stack+0x1e/0x20 [17191396.540000] <c02fd21a> schedule+0x49a/0x670 <c011fc70> do_exit+0x640/0x9c0 [17191396.540000] <c010483d> die+0x2bd/0x2c0 <c0116846> do_page_fault+0x276/0x68c [17191396.540000] <c0103c6f> error_code+0x4f/0x54 <c015c6d7> cache_flusharray+0x47/0xf0 [17191396.540000] <c015c848> kmem_cache_free+0x48/0x50 <c0150826> remove_vma+0x46/0x50 [17191396.540000] <c015090f> exit_mmap+0xdf/0x110 <c0119f93> mmput+0x33/0xc0 [17191396.540000] <c011ded3> exit_mm+0x93/0x120 <c011f709> do_exit+0xd9/0x9c0 [17191396.540000] <c0120027> do_group_exit+0x37/0xa0 <c01200a5> sys_exit_group+0x15/0x20 [17191396.540000] <c0103173> sysenter_past_esp+0x54/0x75 [17191396.552000] slab: double free detected in cache 'vm_area_struct', objp e1f7ac7c [17191396.552000] general protection fault: 0000 [#2] [17191396.552000] PREEMPT [17191396.552000] Modules linked in: w83627hf hwmon_vid hwmon ipv6 eeprom i2c_isa i2c_viapro iptable_mangle iptable_filter ip_tables x_tables snd_seq snd_via 82xx gameport snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore joydev sd_mod usbhid usb_ storage scsi_mod vfat fat ide_cd cdrom 8250 serial_core ehci_hcd uhci_hcd usbcore tun ne2k_pci 8390 3c59x via_rhine mii capability commoncap button fan therm al processor non_fatal radeon [17191396.552000] CPU: 0 [17191396.552000] EIP: 0060:[<c015ca22>] Not tainted VLI [17191396.552000] EFLAGS: 00010086 (2.6.17-gentoo-r8-b5964t3 #3) [17191396.552000] EIP is at free_block+0x132/0x1b0 [17191396.552000] eax: 00000059 ebx: e1f7a000 ecx: 00000073 edx: f5690000 [17191396.552000] esi: ffffffff edi: dfffdd20 ebp: f5691e8c esp: f5691e50 [17191396.552000] ds: 007b es: 007b ss: 0068 [17191396.552000] Process dcop (pid: 5308, threadinfo=f5690000 task=d8e225d0) [17191396.552000] Stack: c031bda0 c0318b54 e1f7ac7c f5691e64 ffffffff 
00000022 0000003c dfff9010 [17191396.552000] dffffc60 00000000 e1f7ac7c e1f7a0a4 dfffebe0 0000003c dffffc60 f5691eb8 [17191396.552000] c015c6d7 00000000 f5691ec4 c0161c52 dfff9010 dfff9000 00000000 dfff9000 [17191396.552000] Call Trace: [17191396.552000] <c01042ad> show_stack_log_lvl+0x9d/0xd0 <c01044f6> show_registers+0x1c6/0x250 [17191396.552000] <c010469e> die+0x11e/0x2c0 <c0105781> do_general_protection+0x1d1/0x230 [17191396.552000] <c0103c6f> error_code+0x4f/0x54 <c015c6d7> cache_flusharray+0x47/0xf0 [17191396.552000] <c015c848> kmem_cache_free+0x48/0x50 <c0150826> remove_vma+0x46/0x50 [17191396.552000] <c015090f> exit_mmap+0xdf/0x110 <c0119f93> mmput+0x33/0xc0 [17191396.552000] <c011ded3> exit_mm+0x93/0x120 <c011f709> do_exit+0xd9/0x9c0 [17191396.552000] <c0120027> do_group_exit+0x37/0xa0 <c01200a5> sys_exit_group+0x15/0x20 [17191396.552000] <c0103173> sysenter_past_esp+0x54/0x75 [17191396.552000] Code: ff ff 83 c4 30 5b 5e 5f c9 c3 8b 55 ec 8b 4d e4 89 54 24 08 8b 41 44 c7 04 24 a0 bd 31 c0 89 44 24 04 e8 62 08 fc ff 89 74 24 10 <8b> 06 89 5c 24 04 c7 04 24 d8 bd 31 c0 89 44 24 0c 8b 45 d8 89 [17191396.552000] EIP: [<c015ca22>] free_block+0x132/0x1b0 SS:ESP 0068:f5691e50 [17191396.552000] <1>Fixing recursive fault but reboot is needed! 
[17191396.552000] BUG: scheduling while atomic: dcop/0x00000001/5308 [17191396.552000] <c0104323> show_trace+0x13/0x20 <c010496e> dump_stack+0x1e/0x20 [17191396.552000] <c02fd21a> schedule+0x49a/0x670 <c011fc70> do_exit+0x640/0x9c0 [17191396.552000] <c010483d> die+0x2bd/0x2c0 <c0105781> do_general_protection+0x1d1/0x230 [17191396.552000] <c0103c6f> error_code+0x4f/0x54 <c015c6d7> cache_flusharray+0x47/0xf0 [17191396.552000] <c015c848> kmem_cache_free+0x48/0x50 <c0150826> remove_vma+0x46/0x50 [17191396.552000] <c015090f> exit_mmap+0xdf/0x110 <c0119f93> mmput+0x33/0xc0 [17191396.552000] <c011ded3> exit_mm+0x93/0x120 <c011f709> do_exit+0xd9/0x9c0 [17191396.552000] <c0120027> do_group_exit+0x37/0xa0 <c01200a5> sys_exit_group+0x15/0x20 [17191396.552000] <c0103173> sysenter_past_esp+0x54/0x75 ...
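One possible reading of the oops above, offered as an assumption and not as a confirmed diagnosis: the faulting address 0000fffe is the same number that earlier reports printed as the value of the bufctl entry. The 2.6.17 adaptation initializes the bufctl pointer from slab_bufctl(slabp)[objnr], which is the entry's value, where the 2.6.16 patch used slab_bufctl(slabp) + objnr, which is the entry's address. Dereferencing the former for the second printk would read from address 0xfffe. The user-space sketch below contrasts the two expressions; table, entry_address and entry_value_as_address are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned int bufctl_t;

/* Address of entry 'idx': the form used by the 2.6.16 debug patch. */
static bufctl_t *entry_address(bufctl_t *table, unsigned int idx)
{
	return table + idx;
}

/* Value of entry 'idx' reinterpreted as an address: the form the
 * 2.6.17 adaptation ends up with. It is returned as an integer so
 * that nothing in this sketch dereferences it. */
static uintptr_t entry_value_as_address(const bufctl_t *table, unsigned int idx)
{
	return (uintptr_t)table[idx];
}
```

With an entry holding 0x0000fffe, entry_address() points into the table and can safely be dereferenced, while entry_value_as_address() yields 0xfffe, which matches the address in the paging request above. If this reading is right, the underlying corruption is the same as before and only the reporting code faults.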
Malte, is this still an issue on the latest kernel release (2.6.21 or newer)?
When I left Germany in January it was still an issue, haven't used my workstation at home since then. I can check again in September.
Malte, did you have a chance to test recently? Thanks.
Nope, sorry, don't have that system anymore. And I heard there will be a new allocator anyway, so I'll just close this bug.