Bug 7707

Summary: "Eeek! page_mapcount(page) went negative! (-1)"
Product: Memory Management Reporter: Chris Rankin (rankincj)
Component: Page Allocator Assignee: Nick Piggin (nickpiggin)
Status: REJECTED INVALID    
Severity: high CC: bunk, hughd, pj, protasnb, todorovic.s, valkyrie, vincent.kessler
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.19.1 Subsystem:
Regression: --- Bisected commit-id:
Attachments: rmap debug for 2.6.20

Description Chris Rankin 2006-12-18 10:45:31 UTC
Most recent kernel where this bug did *NOT* occur:
2.6.18.5 (to my knowledge)

Distribution:
Userspace is FC5

Hardware Environment:
Dual P4 2.66 GHz Xeon (Northwood), HT enabled, 2 GB RAM, compiled with gcc-4.1.1

Software Environment:

Problem Description:
Compiling xine-lib triggered internal BUG report:

Eeek! page_mapcount(page) went negative! (-1)
  page->flags = 14
  page->count = 0
  page->mapping = 00000000
------------[ cut here ]------------
kernel BUG at /home/chris/LINUX/linux-2.6.19/mm/rmap.c:578!
invalid opcode: 0000 [#1]
PREEMPT SMP
Modules linked in: radeon drm pwc eeprom cpufreq_ondemand p4_clockmod
speedstep_lib nfsd exportfs ipv6 autofs4 nfs lockd sunrpc af_packet
firmware_class binfmt_misc video thermal processor fan button ac lp parport_pc
parport nvram video1394 raw1394 eth1394 compat_ioctl32 videodev v4l1_compat
v4l2_common snd_usb_audio snd_usb_lib snd_intel8x0 snd_emu10k1_synth
snd_emux_synth snd_seq_virmidi snd_seq_midi_emul snd_emu10k1 snd_rawmidi
snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss ohci1394
snd_seq_midi_event snd_seq ieee1394 snd_pcm_oss snd_mixer_oss snd_pcm ehci_hcd
e7xxx_edac serio_raw snd_seq_device uhci_hcd edac_mc e1000 psmouse snd_timer
snd_page_alloc snd_util_mem snd_hwdep ide_cd cdrom snd soundcore pcspkr
intel_agp i2c_i801 i2c_core agpgart usbcore ext3 jbd
CPU:    0
EIP:    0060:[<c0145fa0>]    Not tainted VLI
EFLAGS: 00010282   (2.6.19.1 #1)
EIP is at page_remove_rmap+0x70/0x8f
eax: 0000001e   ebx: c100f500   ecx: ebd0c000   edx: 00000002
esi: 00000020   edi: 08665000   ebp: eac11994   esp: ebd0cee0
ds: 007b   es: 007b   ss: 0068
Process cc1 (pid: 24220, ti=ebd0c000 task=ed5b3a90 task.ti=ebd0c000)
Stack: c0284bf0 00000000 c100f500 c0140c39 00000000 edb60518 ebd0cf54 00000000
       00000001 08696000 ec3f3084 f798f040 c200f0c0 fffffff9 ffffffff c155822c
       ec3f3084 08696000 00000000 00000000 ebd0cf54 ecfab5c0 f798f040 00000001
Call Trace:
 [<c0140c39>] unmap_vmas+0x24d/0x4df
 [<c01436ac>] exit_mmap+0x7e/0x10e
 [<c011717d>] mmput+0x1d/0x78
 [<c011bb94>] do_exit+0x1a9/0x77c
 [<c0111c2f>] do_page_fault+0x281/0x51a
 [<c0150689>] vfs_write+0xfc/0x13b
 [<c011c1dd>] sys_exit_group+0x0/0xd
 [<c0102b9d>] sysenter_past_esp+0x56/0x79
 =======================
Code: 74 03 8b 53 0c 8b 42 04 89 44 24 04 c7 04 24 d9 4b 28 c0 e8 52 3c fd ff 8b
43 10 89 44 24 04 c7 04 24 f0 4b 28 c0 e8 3f 3c fd ff <0f> 0b 42 02 66 4b 28 c0
8b 53 10 83 f2 01 83 e2 01 89 d8 5b 59
EIP: [<c0145fa0>] page_remove_rmap+0x70/0x8f SS:ESP 0068:ebd0cee0
 <1>Fixing recursive fault but reboot is needed!
BUG: scheduling while atomic: cc1/0x00000002/24220
 [<c026d6cf>] __sched_text_start+0x4f/0x900
 [<c019ed40>] cfq_free_io_context+0x57/0xbb
 [<c0140068>] sys_madvise+0x168/0x3a4
 [<c011bad9>] do_exit+0xee/0x77c
 [<c0140068>] sys_madvise+0x168/0x3a4
 [<c0140068>] sys_madvise+0x168/0x3a4
 [<c0103fc7>] die+0x2a5/0x2cc
 [<c010486c>] do_invalid_op+0x0/0xab
 [<c010490e>] do_invalid_op+0xa2/0xab
 [<c0145fa0>] page_remove_rmap+0x70/0x8f
 [<c0119b85>] vprintk+0x2b9/0x313
 [<c0119b8f>] vprintk+0x2c3/0x313
 [<c013a59c>] __pagevec_free+0x18/0x22
 [<c026ff79>] error_code+0x39/0x40
 [<c0145fa0>] page_remove_rmap+0x70/0x8f
 [<c0140c39>] unmap_vmas+0x24d/0x4df
 [<c01436ac>] exit_mmap+0x7e/0x10e
 [<c011717d>] mmput+0x1d/0x78
 [<c011bb94>] do_exit+0x1a9/0x77c
 [<c0111c2f>] do_page_fault+0x281/0x51a
 [<c0150689>] vfs_write+0xfc/0x13b
 [<c011c1dd>] sys_exit_group+0x0/0xd
 [<c0102b9d>] sysenter_past_esp+0x56/0x79
 =======================

Steps to reproduce:
Have not reproduced it yet.
Comment 1 Vincent Kessler 2007-01-16 12:43:15 UTC
I think I got the same bug here:

Most recent kernel where this bug did *NOT* occur:
unknown

Distribution:
Custom Gentoo based

Hardware Environment:
Via EPIA based
(passed 12 hours of memtest86)

Software Environment:
kernel 2.6.19


Eeek! page_mapcount(page) went negative! (-1)
  page->flags = 80000004
  page->count = 0
  page->mapping = 00000000
------------[ cut here ]------------
kernel BUG at mm/rmap.c:578!
invalid opcode: 0000 [#1]
PREEMPT
Modules linked in: fcpci(P)
CPU:    0
EIP:    0060:[<c0143bd5>]    Tainted: P      VLI
EFLAGS: 00010286   (2.6.19 #2)
EIP is at page_remove_rmap+0x75/0xa6
eax: 0000001e   ebx: c122b140   ecx: ffffffff   edx: 00002a1d
esi: b7f5c000   edi: de060d70   ebp: 00000020   esp: de0f5dd0
ds: 007b   es: 007b   ss: 0068
Process telnet (pid: 948, ti=de0f4000 task=dee7b030 task.ti=de0f4000)
Stack: c030f271 00000000 c122b140 c013e434 c122b140 b7f5c000 1158a025 00006000
       00000000 00000001 b7f86000 de0edb7c de03f220 c03d248c 00000000 fffffffe
       de0edb7c b7f86000 00000000 de0f5e4c de04dcf8 de03f220 00000002 c0140f76
Call Trace:
 [<c013e434>] unmap_vmas+0x267/0x470
 [<c0140f76>] exit_mmap+0x75/0x101
 [<c011453f>] mmput+0x25/0x85
 [<c01178bf>] exit_mm+0xcd/0xd3
 [<c0118d2d>] do_exit+0x19c/0x7a5
 [<c01193ba>] sys_exit_group+0x0/0x11
 [<c0121416>] get_signal_to_deliver+0x391/0x3b7
 [<c0102529>] do_notify_resume+0x84/0x609
 [<c015d7a3>] d_rehash+0x17/0x44
 [<c027eade>] sock_attach_fd+0x7d/0xd9
 [<c027e693>] sockfd_lookup_light+0x24/0x3e
 [<c027e811>] sys_setsockopt+0x7f/0x9e
 [<c027ff15>] sys_socketcall+0x9a/0x24d
 [<c0102ede>] work_notifysig+0x13/0x25
 =======================
Code: 74 03 8b 53 0c 8b 42 04 c7 04 24 5a f2 30 c0 89 44 24 04 e8 e6 31 fd ff 8b
43 10 c7 04 24 71 f2 30 c0 89 44 24 04 e8 d3 31 fd ff <0f> 0b 42 02 06 f2 30 c0
8b 03 8b 53 10 c1 e8 1f 83 f2 01 83 e2
EIP: [<c0143bd5>] page_remove_rmap+0x75/0xa6 SS:ESP 0068:de0f5dd0
 <1>Fixing recursive fault but reboot is needed!
BUG: scheduling while atomic: telnet/0x00000002/948
 [<c02f588f>] __sched_text_start+0x4f/0x54f
 [<c011aaad>] irq_exit+0x25/0x30
 [<c0105a23>] do_IRQ+0x70/0x85
 [<c0118c7a>] do_exit+0xe9/0x7a5
 [<c01047a1>] die+0x286/0x28e
 [<c01050a8>] do_invalid_op+0x0/0xb4
 [<c0105153>] do_invalid_op+0xab/0xb4
 [<c0143bd5>] page_remove_rmap+0x75/0xa6
 [<c0116957>] release_console_sem+0x1a1/0x1e0
 [<c0116d8b>] vprintk+0x29c/0x2b9
 [<c02b5f37>] tcp_v4_send_check+0x8a/0xcf
 [<c02f7639>] error_code+0x39/0x40
 [<c0143bd5>] page_remove_rmap+0x75/0xa6
 [<c013e434>] unmap_vmas+0x267/0x470
 [<c0140f76>] exit_mmap+0x75/0x101
 [<c011453f>] mmput+0x25/0x85
 [<c01178bf>] exit_mm+0xcd/0xd3
 [<c0118d2d>] do_exit+0x19c/0x7a5
 [<c01193ba>] sys_exit_group+0x0/0x11
 [<c0121416>] get_signal_to_deliver+0x391/0x3b7
 [<c0102529>] do_notify_resume+0x84/0x609
 [<c015d7a3>] d_rehash+0x17/0x44
 [<c027eade>] sock_attach_fd+0x7d/0xd9
 [<c027e693>] sockfd_lookup_light+0x24/0x3e
 [<c027e811>] sys_setsockopt+0x7f/0x9e
 [<c027ff15>] sys_socketcall+0x9a/0x24d
 [<c0102ede>] work_notifysig+0x13/0x25
 =======================
Comment 2 Nick Piggin 2007-02-01 01:22:50 UTC
Are these repeatable at all? If so, I have a patch that could help track it down,
if you are interested in trying it.

Thanks
Comment 3 Chris Rankin 2007-02-03 04:16:51 UTC
Nick,

I haven't been able to reproduce this yet, no. (Which is annoying, because it
happened very quickly the first time.) But could you post the patch anyway, please?

Thanks,
Chris
Comment 4 Nick Piggin 2007-02-05 20:03:36 UTC
Created attachment 10309 [details]
rmap debug for 2.6.20

This should keep track of exactly who incremented the mapcount for a given
page, and will print traces if somebody decrements the mapcount when they are
not supposed to.
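
The idea described above (remember who mapped each page, complain when someone unmaps a page they never mapped) can be modelled in a few lines. This is only a conceptual sketch in Python, not the contents of the attachment: the class name, method names and the "who" tags are all hypothetical, and the real patch works on the kernel's page structures rather than on a dictionary.

```python
from collections import defaultdict


class MapcountTracker:
    """Toy model of per-page mapping ownership (hypothetical, not kernel code)."""

    def __init__(self):
        # page id -> list of tags naming whoever mapped the page
        self.owners = defaultdict(list)
        # messages recorded for unmaps that had no matching map
        self.warnings = []

    def add_rmap(self, page, who):
        """Record that `who` added a mapping of `page`."""
        self.owners[page].append(who)

    def remove_rmap(self, page, who):
        """Drop one mapping of `page` held by `who`.

        Returns True if a matching mapping existed, False (and records a
        warning) if `who` never mapped this page.
        """
        if who in self.owners[page]:
            self.owners[page].remove(who)
            return True
        self.warnings.append(
            f"{who} unmapped page {page:#x} it never mapped; "
            f"current mappers: {self.owners[page]}"
        )
        return False

    def mapcount(self, page):
        """Number of mappings currently recorded for `page`."""
        return len(self.owners[page])
```

In this model the "went negative" case corresponds to remove_rmap() returning False: the count would drop below the number of mappings anyone actually created, and the recorded warning names the caller responsible.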
Comment 5 Natalie Protasevich 2007-05-22 14:42:34 UTC
Chris, Vincent,
Were you able to incorporate the patch and run with it? Has the problem been
reproduced with the patch in?
Thanks,
--Natalie
Comment 6 Susanna Kaukinen 2007-07-12 23:50:56 UTC
Looks like what I have, too:

Distribution:
Linux version 2.6.21 (2.6.21-5) (root@skeptic) (gcc version 4.1.3 20070601 (prerelease) (Debian 4.1.2-12)) #1 SMP Wed Jul 11 16:37:10 EEST 2007

However, I think I had the same issue while I was still running Debian 4.0.

Hardware Environment: Acer 7003WSMi

Software Environment:

Problem Description:
The error appeared when I executed `sudo vim /etc/fstab' while attempting to connect my laptop to my Nokia 7373 phone via a Bluetooth USB device, an `A-Link Bluetooth USB 2.0 Adapter A2 - US'. I was attempting to use `fuse', or so.

Dumps to terminals:

 Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
 localhost kernel: Eeek! page_mapcount(page) went negative! (-1)

 Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
 localhost kernel:   page pfn = 0

 Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
 localhost kernel:   page->flags = 400

 Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
 localhost kernel:   page->count = 1

 Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
 localhost kernel:   page->mapping = 0000000000000000

 Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
 localhost kernel:   vma->vm_ops = 0xffffffff804b31e0

 Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
 localhost kernel:   vma->vm_ops->nopage = filemap_nopage+0x0/0x2fe

 Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
 localhost kernel:   vma->vm_file->f_op->mmap = generic_file_mmap+0x0/0x3f

 Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
 localhost kernel: ------------[ cut here ]------------

 Message from syslogd@localhost at Thu Jul 12 22:11:53 2007 ...
 localhost kernel: invalid opcode: 0000 [1] SMP

dmesg:

Eeek! page_mapcount(page) went negative! (-1)
  page pfn = 0
  page->flags = 400
  page->count = 1
  page->mapping = 0000000000000000
  vma->vm_ops = 0xffffffff804b31e0
  vma->vm_ops->nopage = filemap_nopage+0x0/0x2fe
  vma->vm_file->f_op->mmap = generic_file_mmap+0x0/0x3f
------------[ cut here ]------------
kernel BUG at mm/rmap.c:596!
invalid opcode: 0000 [1] SMP
CPU 0
Modules linked in: fuse hci_usb rfcomm l2cap bluetooth ppdev parport_pc
lp parport button ac battery cpufreq_powersave cpufreq_stats
cpufreq_userspace cpufreq_ondemand cpufreq_conservative ipv6 arc4 ecb
ieee80211_crypt_wep powernow_k8 freq_table loop pcmcia snd_hda_intel
snd_hda_codec bcm43xx snd_pcm_oss firmware_class snd_mixer_oss
ieee80211softmac snd_pcm snd_timer snd soundcore yenta_socket
rsrc_nonstatic tifm_7xx1 ieee80211 ieee80211_crypt psmouse serio_raw
k8temp pcspkr snd_page_alloc i2c_nforce2 pcmcia_core i2c_core evdev ext3
jbd mbcache sha256 aes cbc blkcipher dm_crypt ide_generic dm_mirror
dm_snapshot dm_mod ide_cd cdrom ide_disk generic amd74xx ide_core sata_nv
forcedeth ata_generic libata scsi_mod ehci_hcd ohci_hcd thermal processor
fan
Pid: 7253, comm: sh Not tainted 2.6.21 #1
RIP: 0010:[<ffffffff8020ab42>]  [<ffffffff8020ab42>] page_remove_rmap+0xe4/0x100
RSP: 0000:ffff81001bc83cd8  EFLAGS: 00010296
RAX: 000000000000003b RBX: ffff81003b1f3000 RCX: 0000000000005c7f
RDX: 00000000ffffffff RSI: ffff81001db1c268 RDI: ffffffff804aed3c
RBP: ffff810026749768 R08: ffffffff80595b40 R09: 00039c0000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 000000000804d000
R13: ffff81001db1c268 R14: ffff81000100b800 R15: 00000000080e9000
FS:  00000000f7ef8000(0000) GS:ffffffff804e4000(0000) knlGS:00000000f7e1b6c0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000008049724 CR3: 000000001cc9d000 CR4: 00000000000006e0
Process sh (pid: 7253, threadinfo ffff81001bc82000, task ffff81001d9b6180)
Stack:  0000000036c92020 ffff81003b1f3000 0000000000000000 ffffffff80207a0a
 0000000000000000 ffff81001bc83dc8 ffffffffffffffff 0000000000000000
 ffff810026749768 ffff81001bc83dd0 00000000003fafff 0000000000000000
Call Trace:
 [<ffffffff80207a0a>] unmap_vmas+0x3dd/0x6fe
 [<ffffffff80235c85>] exit_mmap+0x76/0xeb
 [<ffffffff80237bb1>] mmput+0x28/0x98
 [<ffffffff8021370e>] do_exit+0x212/0x7ef
 [<ffffffff8020a9de>] do_page_fault+0x6f0/0x770
 [<ffffffff8021c7b4>] __dentry_open+0x101/0x1aa
 [<ffffffff802253a4>] do_filp_open+0x2a/0x38
 [<ffffffff8022cfcd>] sys_rt_sigprocmask+0x50/0xce
 [<ffffffff8025b93d>] error_exit+0x0/0x84


Code: 0f 0b eb fe 8b 77 18 41 58 5b 5d 83 e6 01 f7 de 83 c6 04 e9
RIP  [<ffffffff8020ab42>] page_remove_rmap+0xe4/0x100
 RSP <ffff81001bc83cd8>
Fixing recursive fault but reboot is needed!

P.S.
The last one I just got was like this:

Message from syslogd@localhost at Fri Jul 13 07:35:48 2007 ...
localhost kernel: Oops: 0000 [1] SMP 

Message from syslogd@localhost at Fri Jul 13 07:35:48 2007 ...
localhost kernel: CR2: 0000000000000040

Message from syslogd@localhost at Fri Jul 13 07:35:55 2007 ...
localhost kernel: stack segment: 0000 [2] SMP 

I'll skip the dmesg output so as not to clutter this too much.
Comment 7 Susanna Kaukinen 2007-07-13 00:17:07 UTC
(In reply to comment #5)
> Chris, Vincent,
> Were you able to incorporate the patch and run with it? Has the problem been
> reproduced with the patch in?
> Thanks,
> --Natalie

I'm compiling the kernel (2.6.21) w/the patch. Will report back when I know more.

Susanna.
Comment 8 Edward Hughes IV 2007-07-13 06:22:08 UTC
Subject: Re:  "Eeek! page_mapcount(page) went negative! (-1)"

> Eeek! page_mapcount(page) went negative! (-1)
>   page pfn = 0
>   page->flags = 400

By all means try that patch as Natalie suggests, but I doubt it'll
tell you anything interesting: your page table contains an entry
marked present for page 0 (itself correctly flagged as Reserved).

That means your page table has been corrupted: I don't think the
stack display goes far enough to show the corrupt page table entry
(we can make a patch for that later if this remains a mystery).
Let's suppose that it might be 00000001, a single bit error.

Please run memtest86+ overnight to check your RAM.
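
To make the reasoning above concrete, here is a small illustrative decode of the hypothetical entry 00000001. It is a sketch under stated assumptions, not kernel code: the helper names are invented for this example, the x86 layout used is "bit 0 = present, physical frame number above bit 12", and the Reserved bit number (10) is an assumption for kernels of this era that should be checked against include/linux/page-flags.h in the tree actually in use.

```python
PAGE_SHIFT = 12                    # 4 KiB pages on x86
PAGE_PRESENT = 1 << 0              # x86 page table entry bit 0: present
PTE_PFN_MASK = 0x000FFFFFFFFFF000  # physical address bits 12..51
PG_RESERVED_BIT = 10               # ASSUMPTION: bit number of PG_reserved


def decode_pte(pte):
    """Split a page table entry into its present bit and frame number."""
    return {
        "present": bool(pte & PAGE_PRESENT),
        "pfn": (pte & PTE_PFN_MASK) >> PAGE_SHIFT,
    }


def is_reserved(page_flags):
    """True if the (assumed) Reserved bit is set in page->flags."""
    return bool(page_flags & (1 << PG_RESERVED_BIT))


def single_bit_flips(value, width=32):
    """All values that differ from `value` in exactly one bit."""
    return [value ^ (1 << bit) for bit in range(width)]


if __name__ == "__main__":
    print(decode_pte(0x00000001))                      # {'present': True, 'pfn': 0}
    print(is_reserved(0x400))                          # True
    print(0x00000000 in single_bit_flips(0x00000001))  # True
```

Under these assumptions an entry of 00000001 decodes as "present, pfn 0", which matches the "page pfn = 0" line in the report, and it differs from an empty entry (00000000) by a single bit, which is why a RAM fault is a plausible cause.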
Comment 9 Edward Hughes IV 2007-07-13 06:38:47 UTC
Subject: Re:  "Eeek! page_mapcount(page) went negative! (-1)"

On Fri, 13 Jul 2007, bugme-daemon@bugzilla.kernel.org wrote:
> 
> ------- Comment #8 from echughesiv@gmail.com  2007-07-13 06:22 -------
> Subject: Re:  "Eeek! page_mapcount(page) went negative! (-1)"
>... 
> Please run memtest86+ overnight to check your RAM.

I would just like to add that I believe I am
Hugh Dickins <hugh@veritas.com>
despite bugzilla having decided that I am
Edward Hughes IV <echughesiv@gmail.com>

Help!!
Comment 10 Susanna Kaukinen 2007-07-13 08:56:48 UTC
I applied the patch, and after quite a few tries I finally got the kernel compiled. The problem had nothing to do w/the source code. It seems that there's something wrong w/my system, as it starts behaving oddly [1] when I compile without nice +20.

Anyway, I haven't seen another Eeek! as of yet, altho I did suffer a total freeze just a few minutes ago. I suspect that those are caused by a) kcryptd/LUKS, b) powernowd, c) bad hardware or d) an unrelated kernel bug.

[1] Ordinary commands start failing (commands that work fine after a reboot), and a freeze sometimes follows soon after, altho more often I reboot the system myself because it has become so unusable (tar and gcc crashing, etc.)
Comment 11 Susanna Kaukinen 2007-07-13 10:42:58 UTC
And the answer is: c) bad hardware.

Seems that the extra 512 MB of memory didn't last for more than a few months.

memtest: (a gazillion errors)
http://aycu13.webshots.com/image/21292/2002493603451648766_rs.jpg

Thank you all. =)

P.S. No new Eeek!s so far.
Comment 12 Natalie Protasevich 2007-07-13 11:37:44 UTC
Thanks, the bug can be closed now.
Comment 13 Hugh Dickins 2007-07-14 07:47:27 UTC
Subject: Re:  "Eeek! page_mapcount(page) went negative! (-1)"

Sorry, just testing whether I'm Hugh Dickins <hugh@veritas.com> again!
Comment 14 Srdjan Todorovic 2007-09-05 02:35:10 UTC
I have been getting this bug for a while now, and had the bug
yesterday on 2.6.20-16 (kubuntu 7.04). I ran memtest86 overnight
for 15 hours, with no errors.

Other people are reporting this on bugs.launchpad.net

I have noticed that some people (including myself) who reported
this on bugs.launchpad.net load nvidia's driver and taint the kernel.

I will try Nick's patch at some point, and if someone can let me
know how to reproduce this, I can rmmod nvidia and try to
trigger it.
Comment 15 Nick Piggin 2007-09-05 05:11:12 UTC
We have no idea how it can be reproduced. If you can reproduce it without
nvidia loaded and with my patch installed, it may help.