Bug 7707
Summary: | "Eeek! page_mapcount(page) went negative! (-1)" | ||
---|---|---|---|
Product: | Memory Management | Reporter: | Chris Rankin (rankincj) |
Component: | Page Allocator | Assignee: | Nick Piggin (nickpiggin) |
Status: | REJECTED INVALID | ||
Severity: | high | CC: | bunk, hughd, pj, protasnb, todorovic.s, valkyrie, vincent.kessler |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.19.1 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | rmap debug for 2.6.20 |
Description
Chris Rankin
2006-12-18 10:45:31 UTC
I think I got the same bug here:

Most recent kernel where this bug did *NOT* occur: unknown
Distribution: Custom Gentoo based
Hardware Environment: Via EPIA based (12 hours memtest86 proofed)
Software Environment: kernel 2.6.19

Eeek! page_mapcount(page) went negative! (-1)
page->flags = 80000004
page->count = 0
page->mapping = 00000000
------------[ cut here ]------------
kernel BUG at mm/rmap.c:578!
invalid opcode: 0000 [#1]
PREEMPT
Modules linked in: fcpci(P)
CPU: 0
EIP: 0060:[<c0143bd5>] Tainted: P VLI
EFLAGS: 00010286 (2.6.19 #2)
EIP is at page_remove_rmap+0x75/0xa6
eax: 0000001e ebx: c122b140 ecx: ffffffff edx: 00002a1d
esi: b7f5c000 edi: de060d70 ebp: 00000020 esp: de0f5dd0
ds: 007b es: 007b ss: 0068
Process telnet (pid: 948, ti=de0f4000 task=dee7b030 task.ti=de0f4000)
Stack: c030f271 00000000 c122b140 c013e434 c122b140 b7f5c000 1158a025 00006000
 00000000 00000001 b7f86000 de0edb7c de03f220 c03d248c 00000000 fffffffe
 de0edb7c b7f86000 00000000 de0f5e4c de04dcf8 de03f220 00000002 c0140f76
Call Trace:
 [<c013e434>] unmap_vmas+0x267/0x470
 [<c0140f76>] exit_mmap+0x75/0x101
 [<c011453f>] mmput+0x25/0x85
 [<c01178bf>] exit_mm+0xcd/0xd3
 [<c0118d2d>] do_exit+0x19c/0x7a5
 [<c01193ba>] sys_exit_group+0x0/0x11
 [<c0121416>] get_signal_to_deliver+0x391/0x3b7
 [<c0102529>] do_notify_resume+0x84/0x609
 [<c015d7a3>] d_rehash+0x17/0x44
 [<c027eade>] sock_attach_fd+0x7d/0xd9
 [<c027e693>] sockfd_lookup_light+0x24/0x3e
 [<c027e811>] sys_setsockopt+0x7f/0x9e
 [<c027ff15>] sys_socketcall+0x9a/0x24d
 [<c0102ede>] work_notifysig+0x13/0x25
=======================
Code: 74 03 8b 53 0c 8b 42 04 c7 04 24 5a f2 30 c0 89 44 24 04 e8 e6 31 fd ff 8b 43 10 c7 04 24 71 f2 30 c0 89 44 24 04 e8 d3 31 fd ff <0f> 0b 42 02 06 f2 30 c0 8b 03 8b 53 10 c1 e8 1f 83 f2 01 83 e2
EIP: [<c0143bd5>] page_remove_rmap+0x75/0xa6 SS:ESP 0068:de0f5dd0
<1>Fixing recursive fault but reboot is needed!
BUG: scheduling while atomic: telnet/0x00000002/948
 [<c02f588f>] __sched_text_start+0x4f/0x54f
 [<c011aaad>] irq_exit+0x25/0x30
 [<c0105a23>] do_IRQ+0x70/0x85
 [<c0118c7a>] do_exit+0xe9/0x7a5
 [<c01047a1>] die+0x286/0x28e
 [<c01050a8>] do_invalid_op+0x0/0xb4
 [<c0105153>] do_invalid_op+0xab/0xb4
 [<c0143bd5>] page_remove_rmap+0x75/0xa6
 [<c0116957>] release_console_sem+0x1a1/0x1e0
 [<c0116d8b>] vprintk+0x29c/0x2b9
 [<c02b5f37>] tcp_v4_send_check+0x8a/0xcf
 [<c02f7639>] error_code+0x39/0x40
 [<c0143bd5>] page_remove_rmap+0x75/0xa6
 [<c013e434>] unmap_vmas+0x267/0x470
 [<c0140f76>] exit_mmap+0x75/0x101
 [<c011453f>] mmput+0x25/0x85
 [<c01178bf>] exit_mm+0xcd/0xd3
 [<c0118d2d>] do_exit+0x19c/0x7a5
 [<c01193ba>] sys_exit_group+0x0/0x11
 [<c0121416>] get_signal_to_deliver+0x391/0x3b7
 [<c0102529>] do_notify_resume+0x84/0x609
 [<c015d7a3>] d_rehash+0x17/0x44
 [<c027eade>] sock_attach_fd+0x7d/0xd9
 [<c027e693>] sockfd_lookup_light+0x24/0x3e
 [<c027e811>] sys_setsockopt+0x7f/0x9e
 [<c027ff15>] sys_socketcall+0x9a/0x24d
 [<c0102ede>] work_notifysig+0x13/0x25
=======================

Are these repeatable at all? If so, I have a patch that could help track it down, if you are interested in trying it.

Thanks Nick,
I haven't been able to reproduce this yet, no. (Which is annoying, because it happened very quickly the first time.) But could you post the patch anyway, please?
Thanks,
Chris

Created attachment 10309 [details]
rmap debug for 2.6.20
This should keep track of exactly who incremented the mapcount for a given page,
and will print traces if somebody decrements the mapcount when they are not
supposed to.
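
To make the idea behind the debug attachment concrete, here is a minimal user-space sketch of "remember who mapped the page, complain when someone else unmaps it". It is not the attached patch and does not use kernel APIs; `struct tracked_page`, `track_map`, `track_unmap`, and `MAX_MAPPERS` are hypothetical names invented for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_MAPPERS 8   /* hypothetical limit, for this sketch only */

/* Hypothetical stand-in for a page with per-mapping ownership records. */
struct tracked_page {
    int mapcount;                      /* number of live mappings */
    const char *mappers[MAX_MAPPERS];  /* who created each live mapping */
};

/* Record a new mapping and remember who created it. Returns 0 on success. */
static int track_map(struct tracked_page *pg, const char *who)
{
    if (pg->mapcount >= MAX_MAPPERS)
        return -1;
    pg->mappers[pg->mapcount++] = who;
    return 0;
}

/*
 * Drop a mapping. Returns 0 if 'who' really held one. If not, the count is
 * left untouched and a report is printed, instead of letting the count
 * silently go negative.
 */
static int track_unmap(struct tracked_page *pg, const char *who)
{
    for (int i = 0; i < pg->mapcount; i++) {
        if (strcmp(pg->mappers[i], who) == 0) {
            /* Fill the freed slot with the last entry. */
            pg->mappers[i] = pg->mappers[--pg->mapcount];
            return 0;
        }
    }
    fprintf(stderr, "unexpected unmap by %s (mapcount=%d)\n",
            who, pg->mapcount);
    return -1;
}
```

In this sketch an unmap by a caller that never mapped the page is reported at the point where it happens, which is the kind of information the "went negative" message alone cannot give.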
Chris, Vincent,
Were you able to incorporate the patch and run with it? Has the problem been reproduced with the patch in?
Thanks,
--Natalie

Looks like what I have, too:

Distribution: Linux version 2.6.21 (2.6.21-5) (root@skeptic) (gcc version 4.1.3 20070601 (prerelease) (Debian 4.1.2-12)) #1 SMP Wed Jul 11 16:37:10 EEST 2007
However, I think I had the same issue while I was still running Debian 4.0.
Hardware Environment: Acer 7003WSMi
Software Environment:
Problem Description: Executing `sudo vim /etc/fstab' while attempting to get my laptop connected to my Nokia 7373 phone via a bluetooth usb device, an `A-Link Bluetooth USB 2.0 Adapter A2 - US'. I was attempting to use `fuse', or so.

Dumps to terminals:

Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
localhost kernel: Eeek! page_mapcount(page) went negative! (-1)
Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
localhost kernel: page pfn = 0
Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
localhost kernel: page->flags = 400
Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
localhost kernel: page->count = 1
Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
localhost kernel: page->mapping = 0000000000000000
Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
localhost kernel: vma->vm_ops = 0xffffffff804b31e0
Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
localhost kernel: vma->vm_ops->nopage = filemap_nopage+0x0/0x2fe
Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
localhost kernel: vma->vm_file->f_op->mmap = generic_file_mmap+0x0/0x3f
Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
localhost kernel: ------------[ cut here ]------------
Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
localhost kernel: invalid opcode: 0000 [1] SMP

dmesg:

Eeek! page_mapcount(page) went negative! (-1)
page pfn = 0
page->flags = 400
page->count = 1
page->mapping = 0000000000000000
vma->vm_ops = 0xffffffff804b31e0
vma->vm_ops->nopage = filemap_nopage+0x0/0x2fe
vma->vm_file->f_op->mmap = generic_file_mmap+0x0/0x3f
------------[ cut here ]------------
kernel BUG at mm/rmap.c:596!
invalid opcode: 0000 [1] SMP
CPU 0
Modules linked in: fuse hci_usb rfcomm l2cap bluetooth ppdev parport_pc lp parport button ac battery cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_ondemand cpufreq_conservative ipv6 arc4 ecb ieee80211_crypt_wep powernow_k8 freq_table loop pcmcia snd_hda_intel snd_hda_codec bcm43xx snd_pcm_oss firmware_class snd_mixer_oss ieee80211softmac snd_pcm snd_timer snd soundcore yenta_socket rsrc_nonstatic tifm_7xx1 ieee80211 ieee80211_crypt psmouse serio_raw k8temp pcspkr snd_page_alloc i2c_nforce2 pcmcia_core i2c_core evdev ext3 jbd mbcache sha256 aes cbc blkcipher dm_crypt ide_generic dm_mirror dm_snapshot dm_mod ide_cd cdrom ide_disk generic amd74xx ide_core sata_nv forcedeth ata_generic libata scsi_mod ehci_hcd ohci_hcd thermal processor fan
Pid: 7253, comm: sh Not tainted 2.6.21 #1
RIP: 0010:[<ffffffff8020ab42>] [<ffffffff8020ab42>] page_remove_rmap+0xe4/0x100
RSP: 0000:ffff81001bc83cd8 EFLAGS: 00010296
RAX: 000000000000003b RBX: ffff81003b1f3000 RCX: 0000000000005c7f
RDX: 00000000ffffffff RSI: ffff81001db1c268 RDI: ffffffff804aed3c
RBP: ffff810026749768 R08: ffffffff80595b40 R09: 00039c0000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 000000000804d000
R13: ffff81001db1c268 R14: ffff81000100b800 R15: 00000000080e9000
FS: 00000000f7ef8000(0000) GS:ffffffff804e4000(0000) knlGS:00000000f7e1b6c0
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000008049724 CR3: 000000001cc9d000 CR4: 00000000000006e0
Process sh (pid: 7253, threadinfo ffff81001bc82000, task ffff81001d9b6180)
Stack: 0000000036c92020 ffff81003b1f3000 0000000000000000 ffffffff80207a0a
 0000000000000000 ffff81001bc83dc8 ffffffffffffffff 0000000000000000
 ffff810026749768 ffff81001bc83dd0 00000000003fafff 0000000000000000
Call Trace:
 [<ffffffff80207a0a>] unmap_vmas+0x3dd/0x6fe
 [<ffffffff80235c85>] exit_mmap+0x76/0xeb
 [<ffffffff80237bb1>] mmput+0x28/0x98
 [<ffffffff8021370e>] do_exit+0x212/0x7ef
 [<ffffffff8020a9de>] do_page_fault+0x6f0/0x770
 [<ffffffff8021c7b4>] __dentry_open+0x101/0x1aa
 [<ffffffff802253a4>] do_filp_open+0x2a/0x38
 [<ffffffff8022cfcd>] sys_rt_sigprocmask+0x50/0xce
 [<ffffffff8025b93d>] error_exit+0x0/0x84
Code: 0f 0b eb fe 8b 77 18 41 58 5b 5d 83 e6 01 f7 de 83 c6 04 e9
RIP [<ffffffff8020ab42>] page_remove_rmap+0xe4/0x100
RSP <ffff81001bc83cd8>
Fixing recursive fault but reboot is needed!

P.S. The last one I just got was like this:

Message from syslogd@localhost at Fri Jul 13 07:35:48 2007 ...
localhost kernel: Oops: 0000 [1] SMP
Message from syslogd@localhost at Fri Jul 13 07:35:48 2007 ...
localhost kernel: CR2: 0000000000000040
Message from syslogd@localhost at Fri Jul 13 07:35:55 2007 ...
localhost kernel: stack segment: 0000 [2] SMP

I'll skip dmesg so as not to clutter this too much.

(In reply to comment #5)
> Chris, Vincent,
> Were you able to incorporate the patch and run with it? Has the problem been
> reproduced with the patch in?
> Thanks,
> --Natalie

I'm compiling the kernel (2.6.21) w/the patch. Will report back when I know more.
Susanna.

Subject: Re: "Eeek! page_mapcount(page) went negative! (-1)"
> Eeek! page_mapcount(page) went negative! (-1)
> page pfn = 0
> page->flags = 400
By all means try that patch as Natalie suggests, but I doubt it'll
tell you anything interesting: your page table contains an entry
marked present for page 0 (itself correctly flagged as Reserved).
That means your page table has been corrupted: I don't think the
stack display goes far enough to show the corrupt page table entry
(we can make a patch for that later if this remains a mystery).
Let's suppose that it might be 00000001, a single bit error.
Please run memtest86+ overnight to check your RAM.
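
As a rough illustration of why a page table entry of 00000001 would produce exactly this report, the sketch below decodes a raw x86 page table entry: bit 0 is the Present bit, and the page frame number sits above the low 12 flag bits. The value 00000001 is only the supposition made above, not something read from the dump, and the helper names (`pte_is_present`, `pte_to_pfn`, `bits_set`) are hypothetical, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT   0x1ULL                 /* bit 0: entry is present */
#define PTE_PFN_SHIFT 12                     /* 4 KiB pages */
#define PTE_ADDR_MASK 0x000ffffffffff000ULL  /* address bits 12..51 (x86-64) */

/* Nonzero if the hardware would treat this entry as a valid mapping. */
static int pte_is_present(uint64_t pte)
{
    return (pte & PTE_PRESENT) != 0;
}

/* Page frame number the entry points at. */
static uint64_t pte_to_pfn(uint64_t pte)
{
    return (pte & PTE_ADDR_MASK) >> PTE_PFN_SHIFT;
}

/* Number of set bits, i.e. how far the entry is from an empty (zero) one. */
static int bits_set(uint64_t v)
{
    int n = 0;
    while (v) {
        v &= v - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

With a supposed entry of `0x00000001`, `pte_is_present` is true, `pte_to_pfn` yields 0, and `bits_set` yields 1: a single flipped bit in an otherwise empty entry is enough to make it look like a present mapping of page 0.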
Subject: Re: "Eeek! page_mapcount(page) went negative! (-1)"

On Fri, 13 Jul 2007, bugme-daemon@bugzilla.kernel.org wrote:
>
> ------- Comment #8 from echughesiv@gmail.com 2007-07-13 06:22 -------
> Subject: Re: "Eeek! page_mapcount(page) went negative! (-1)"
>...
> Please run memtest86+ overnight to check your RAM.

I would just like to add that I believe I am
Hugh Dickins <hugh@veritas.com>
despite bugzilla having decided that I am
Edward Hughes IV <echughesiv@gmail.com>
Help!!

I applied the patch, and after quite a few tries I finally got the kernel compiled. The problem had nothing to do w/the source code. It seems that there's something wrong w/my system, as it starts behaving oddly [1] when I compile without nice +20.

Anyway, I haven't seen another Eeek! as of yet, altho I did suffer a total freeze just a few minutes ago. I suspect that those are caused by a) kcryptd/LUKS, b) powernowd, c) bad hardware or d) an unrelated kernel bug.

[1] Ordinary commands start failing (commands that won't fail after a reboot), and a freeze sometimes follows soon after, altho more often I reboot the system myself as it has become so unusable (tar and gcc crashing, etc.)

And the answer is: c) bad hardware. Seems that the extra 512 MB of memory didn't last for more than a few months.

memtest: (a gazillion errors)
http://aycu13.webshots.com/image/21292/2002493603451648766_rs.jpg

Thank you all. =)

P.S. No new Eeek!s so far.

Thanks, the bug can be closed now.

Subject: Re: "Eeek! page_mapcount(page) went negative! (-1)"

Sorry, just testing whether I'm Hugh Dickins <hugh@veritas.com> again!

I have been getting this bug for a while now, and had the bug yesterday on 2.6.20-16 (kubuntu 7.04). I ran memtest86 overnight for 15 hours, with no errors. Other people are reporting this on bugs.launchpad.net.

I have noticed that some people (including myself) who reported this on bugs.launchpad.net load nvidia's driver and taint the kernel. I will try Nick's patch at some point, and if someone can let me know how to possibly reproduce this, I can rmmod nvidia and try to trigger it.

We have no idea how it can be reproduced. If you can reproduce it without nvidia loaded and with my patch installed, it may help.