Bug 10412

Summary: Memory corruption of a_ops or 'mapping' pointer
Product: Memory Management Reporter: Plamen Petrov (plamen.sisi)
Component: Other Assignee: Jan Kara (jack)
Status: CLOSED INVALID    
Severity: blocking CC: jack
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: linux 2.6.25-rc8-00166-g6fdf5e6 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 9832    
Attachments: gzipped config used
dmesg output
lspci -vv output

Description Plamen Petrov 2008-04-07 02:33:04 UTC
Latest working kernel version: not sure
Earliest failing kernel version: never happened before
Distribution: BlueWhite64 12 (unofficial Slackware x64 port)
Hardware Environment: Core 2 Duo E4300, 1GB RAM, Via based mobo
Software Environment: KDE desktop
Problem Description: BUG while closing KMail to shut down the machine

Steps to reproduce:

While closing KMail from the KDE desktop to
shut down the machine, I got this:

[===snip===]

[145026.240307] general protection fault: 0000 [1] PREEMPT SMP
[145026.240313] CPU 0
[145026.240315] Modules linked in:
[145026.240318] Pid: 32546, comm: kmail Not tainted 2.6.25-rc8-00166-g6fdf5e6 #1
[145026.240320] RIP: 0010:[<ffffffff80275274>]  [<ffffffff80275274>] set_page_dirty+0x34/0x90
[145026.240327] RSP: 0018:ffff810004ba9d98  EFLAGS: 00010206
[145026.240328] RAX: 6b636f6c5f747369 RBX: ffff81000779cae0 RCX: ffff8100010046e0
[145026.240330] RDX: ffffffff802c08d0 RSI: 8000000033d60067 RDI: ffffe20000b56d00
[145026.240332] RBP: 000000000115c000 R08: 0000000000000006 R09: 0000000000000001
[145026.240334] R10: 0000000000000002 R11: 000000000000031b R12: ffffe20000b56d00
[145026.240335] R13: 0000000001200000 R14: 0000000033d60067 R15: 0000000000000000
[145026.240338] FS:  0000000000000000(0000) GS:ffffffff80993000(0000) knlGS:0000000000000000
[145026.240340] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[145026.240341] CR2: 00007f695e7aa3f0 CR3: 0000000000201000 CR4: 00000000000006e0
[145026.240343] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[145026.240345] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[145026.240347] Process kmail (pid: 32546, threadinfo ffff810004ba8000, task ffff81002137a000)
[145026.240348] Stack:  0000000001155000 ffff81000779cae0 000000000115c000 ffffffff8027e271
[145026.240353]  0000000000000000 0000000001574fff 0000000000000000 ffff810004ba9ea8
[145026.240357]  ffffffffffffffff 0000000000000000 ffff81003af86bd0 ffff810004ba9eb0
[145026.240360] Call Trace:
[145026.240364]  [<ffffffff8027e271>] ? unmap_vmas+0x611/0x800
[145026.240367]  [<ffffffff8028252a>] ? exit_mmap+0x8a/0x130
[145026.240370]  [<ffffffff80236957>] ? mmput+0x57/0xe0
[145026.240373]  [<ffffffff8023cc8a>] ? do_exit+0x19a/0x7e0
[145026.240377]  [<ffffffff807079fb>] ? lockdep_sys_exit_thunk+0x35/0x67
[145026.240381]  [<ffffffff803f6281>] ? __up_read+0x21/0xb0
[145026.240383]  [<ffffffff8023d303>] ? do_group_exit+0x33/0xa0
[145026.240386]  [<ffffffff8020b56b>] ? system_call_after_swapgs+0x7b/0x80
[145026.240387]
[145026.240388]
[145026.240389] Code: 48 8b 47 18 66 85 d2 78 70 a8 01 75 50 48 85 c0 74 4b 48 8b 80 b8 00 00 00 48 c7 c2 d0 08 2c 80 48 8b 40 20 48 85 c0 48 0f 44 c2 <ff> d0 85 c0 89 c5 74 21 65 48 8b 34 25 00 00 00 00 48 81 c6 d0
[145026.240416] RIP  [<ffffffff80275274>] set_page_dirty+0x34/0x90
[145026.240419]  RSP <ffff810004ba9d98>
[145026.240423] ---[ end trace 6b50f2d712edcd7a ]---
[145026.240424] Fixing recursive fault but reboot is needed!
[145026.240426] BUG: scheduling while atomic: kmail/32546/0x00000003
[145026.240428] INFO: lockdep is turned off.
[145026.240430] Pid: 32546, comm: kmail Tainted: G      D 2.6.25-rc8-00166-g6fdf5e6 #1
[145026.240431]
[145026.240432] Call Trace:
[145026.240435]  [<ffffffff80705a26>] thread_return+0x34c/0x576
[145026.240437]  [<ffffffff80239469>] release_console_sem+0x49/0x200
[145026.240439]  [<ffffffff80239e5e>] printk+0x4e/0x60
[145026.240441]  [<ffffffff8023d29e>] do_exit+0x7ae/0x7e0
[145026.240444]  [<ffffffff8022e223>] __wake_up+0x43/0x70
[145026.240446]  [<ffffffff8020c8e7>] oops_end+0x87/0x90
[145026.240448]  [<ffffffff8070884d>] error_exit+0x0/0x9a
[145026.240452]  [<ffffffff802c08d0>] __set_page_dirty_buffers+0x0/0xc0
[145026.240454]  [<ffffffff80275274>] set_page_dirty+0x34/0x90
[145026.240457]  [<ffffffff8027b6f8>] __dec_zone_state+0x18/0x80
[145026.240459]  [<ffffffff8027e271>] unmap_vmas+0x611/0x800
[145026.240461]  [<ffffffff8028252a>] exit_mmap+0x8a/0x130
[145026.240463]  [<ffffffff80236957>] mmput+0x57/0xe0
[145026.240466]  [<ffffffff8023cc8a>] do_exit+0x19a/0x7e0
[145026.240468]  [<ffffffff807079fb>] lockdep_sys_exit_thunk+0x35/0x67
[145026.240470]  [<ffffffff803f6281>] __up_read+0x21/0xb0
[145026.240473]  [<ffffffff8023d303>] do_group_exit+0x33/0xa0
[145026.240475]  [<ffffffff8020b56b>] system_call_after_swapgs+0x7b/0x80
[145026.240476]

[===end-snip===]
Comment 1 Plamen Petrov 2008-04-07 02:35:10 UTC
Created attachment 15644
gzipped config used

This is the kernel config used, copied directly from /proc.
Comment 2 Plamen Petrov 2008-04-07 02:35:53 UTC
Created attachment 15645
dmesg output

output of dmesg
Comment 3 Plamen Petrov 2008-04-07 02:36:16 UTC
Created attachment 15646
lspci -vv output

lspci -vv output
Comment 4 Plamen Petrov 2008-04-07 02:38:47 UTC
First report reference:

http://lkml.org/lkml/2008/4/7/36
Comment 5 Rafael J. Wysocki 2008-04-07 03:24:06 UTC
This entry is being used for tracking a regression from 2.6.24.  Please don't
close it until the problem is fixed in the mainline.
Comment 6 Rafael J. Wysocki 2008-04-07 03:25:39 UTC
Can you identify the piece of code corresponding to set_page_dirty+0x34, please?
Comment 7 Rafael J. Wysocki 2008-04-07 04:47:50 UTC
Please keep the bugme-daemon address on the CC list, so that the bug tracker
can pick up your replies automatically.

On Monday, 7 of April 2008, Plamen Petrov wrote:
> bugme-daemon@bugzilla.kernel.org wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=10412
> > 
> > ------- Comment #6 from rjw@sisk.pl  2008-04-07 03:25 -------
> > Can you identify the piece of code corresponding to set_page_dirty+0x34,
> > please?
> > 
> Sorry, I'm not sure what exactly you want me to do...
> And, frankly, I'm not sure if I am even capable of doing it...

You can use gdb for this purpose.  Please go to the directory where you have
compiled the kernel and run "gdb vmlinux".  Then, under gdb, execute
"l *set_page_dirty+0x34" and it should show you which line of code this address
corresponds to.

This will work if your kernel has been compiled with CONFIG_DEBUG_INFO set.
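
As a rough illustration of what "set_page_dirty+0x34" means, the sketch below splits the RIP line from the oops into its parts. It is only a helper for reading the report, not part of any kernel tooling; the function name parse_rip is made up for this example.

```python
import re

# RIP line copied from the oops in the description.
RIP_LINE = ("RIP: 0010:[<ffffffff80275274>]  [<ffffffff80275274>] "
            "set_page_dirty+0x34/0x90")


def parse_rip(line):
    """Return (address, symbol, offset, size) from an oops RIP line.

    parse_rip is a hypothetical helper written for this example.
    """
    m = re.search(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", line)
    if m is None:
        raise ValueError("no symbol+offset/size found")
    addr, sym, off, size = m.groups()
    return int(addr, 16), sym, int(off, 16), int(size, 16)


addr, sym, off, size = parse_rip(RIP_LINE)
# The function starts at the faulting address minus the offset.
func_start = addr - off
print(f"{sym} starts at {func_start:#x}; fault at +{off:#x} of {size:#x} bytes")
print(f"gdb command: l *{sym}+{off:#x}")
```

The offset (0x34) is the distance of the faulting instruction from the start of the function, and 0x90 is the size of the function, so the fault sits well inside set_page_dirty itself.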
Comment 8 Plamen Petrov 2008-04-07 04:57:06 UTC
Rafael J. Wysocki wrote:
> Please keep the bugme-daemon address on the CC list, so that the bug tracker
> can pick up your replies automatically.
> 
Will do, sorry.
> On Monday, 7 of April 2008, Plamen Petrov wrote:
>> bugme-daemon@bugzilla.kernel.org wrote:
>>> http://bugzilla.kernel.org/show_bug.cgi?id=10412
>>>
>>> ------- Comment #6 from rjw@sisk.pl  2008-04-07 03:25 -------
>>> Can you identify the piece of code corresponding to set_page_dirty+0x34,
>>> please?
>>>
>> Sorry, I'm not sure what exactly you want me to do...
>> And, frankly, I'm not sure if I am even capable of doing it...
> 
> You can use gdb for this purpose.  Please go to the directory where you have
> compiled the kernel and run "gdb vmlinux".  Then, under gdb, execute
> "l *set_page_dirty+0x34" and it should show you which line of code this
> address corresponds to.
> 
> This will work if your kernel has been compiled with CONFIG_DEBUG_INFO set.
Unfortunately, I will be away from the machine for at least the next 6
hours or so - the debug will have to wait.

Just checked - excerpt from my kernel config:

...
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
...

If I re-compile the same kernel with CONFIG_DEBUG_INFO set, will it do?
Or should I go to latest git available, and enable CONFIG_DEBUG_INFO?

Thanks,
Comment 9 Linus Torvalds 2008-04-07 08:22:12 UTC
The piece of code that oopses is (use "linux/scripts/decodecode" to see it 
from the oops):

    12:   48 8b 80 b8 00 00 00    mov    0xb8(%rax),%rax
    19:   48 c7 c2 d0 08 2c 80    mov    $0xffffffff802c08d0,%rdx
    20:   48 8b 40 20             mov    0x20(%rax),%rax
    24:   48 85 c0                test   %rax,%rax
    27:   48 0f 44 c2             cmove  %rdx,%rax
**   0:   ff d0                   callq  *%rax            <--------- THIS
     2:   85 c0                   test   %eax,%eax
     4:   89 c5                   mov    %eax,%ebp
     6:   74 21                   je     0x29

and it gets a GP fault on the "callq" to never-never-land due to RAX being
corrupt (RAX: 6b636f6c5f747369).

That RAX value is a string ("ist_lock"), not a pointer. The code itself 
comes from

        int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
        if (!spd)
                spd = __set_page_dirty_buffers;
        return (*spd)(page);

(in __set_page_dirty() - inlined), so it looks like "mapping" or "a_ops" 
was corrupted.
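
The "ist_lock" reading of RAX can be checked by hand. A minimal sketch, assuming only that the register value is interpreted as eight bytes in x86-64 (little-endian) memory order:

```python
RAX = 0x6b636f6c5f747369  # value reported in the oops

# x86-64 is little-endian, so the lowest byte of the register is the
# first byte in memory.
raw = RAX.to_bytes(8, "little")
text = raw.decode("ascii")
print(text)
```

Every byte is printable ASCII, which is what suggests string data was read where a function pointer was expected.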

The "Scheduling while atomic" part is uninteresting. It's just a result of 
this earlier oops (we killed the process while it was holding locks).
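
For reference, the position of the faulting instruction can also be cross-checked from the "Code:" line of the oops, where the byte at RIP is wrapped in <>. The sketch below is not decodecode, just a small stand-in that locates the marked byte; faulting_offset is a name made up for this example.

```python
# "Code:" bytes copied from the oops in the description.
CODE = ("48 8b 47 18 66 85 d2 78 70 a8 01 75 50 48 85 c0 74 4b 48 8b 80 "
        "b8 00 00 00 48 c7 c2 d0 08 2c 80 48 8b 40 20 48 85 c0 48 0f 44 c2 "
        "<ff> d0 85 c0 89 c5 74 21 65 48 8b 34 25 00 00 00 00 48 81 c6 d0")


def faulting_offset(code_line):
    """Return (index, byte value) of the <..>-marked byte in a Code: dump.

    faulting_offset is a hypothetical helper written for this example.
    """
    tokens = code_line.split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            return i, int(tok[1:-1], 16)
    raise ValueError("no marked byte in code dump")


tokens = CODE.split()
idx, byte = faulting_offset(CODE)
next_byte = int(tokens[idx + 1], 16)
# RIP is set_page_dirty+0x34, so the dump begins this far into the function.
dump_start = 0x34 - idx
print(f"marked byte {byte:#04x} at dump offset {idx:#x}, followed by {next_byte:#04x}")
print(f"dump starts at set_page_dirty+{dump_start:#x}")
```

The marked bytes "ff d0" sit at dump offset 0x2b, directly after the cmove at 0x27 in the listing above, which matches the "callq *%rax" line.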
Comment 10 Jan Kara 2008-04-10 07:47:15 UTC
Plamen, could you try running memtest (to rule out the possibility of a single-bit error in the pointer) and, if it doesn't find anything, try enabling CONFIG_DEBUG_SLAB to catch memory corruption?
Comment 11 Plamen Petrov 2008-04-10 22:44:22 UTC
bugme-daemon@bugzilla.kernel.org wrote:
> Plamen, could you try running memtest (to rule out the possibility of a
> single-bit error in the pointer) and, if it doesn't find anything, try
> enabling CONFIG_DEBUG_SLAB to catch memory corruption?
> 
> 

Well, I must admit I only ran memtest when I assembled that machine six 
months or so ago...

Anyway, yesterday I changed my motherboard - the new one is a 
GA-P35-DS3R from Gigabyte.

What I can do is enable CONFIG_DEBUG_SLAB, but with this new mobo -
I have no problems at all.

I'll keep you informed if anything comes up...
Comment 12 Jan Kara 2008-04-14 08:49:44 UTC
OK, I'll close the bug as invalid for now; please reopen it if you see the same problem again. I'm also changing the subject of the bug to better match the real problem...