217154 – BUG: Bad page map in process : bad pte

Status:	NEW

Alias:	None

Product:	Memory Management
Classification:	Unclassified
Component:	Other
Hardware:	ARM Linux

Importance:	P1 normal
Assignee:	Andrew Morton

Reported:	2023-03-07 08:58 UTC by Youngjun
Modified:	2023-03-08 01:19 UTC
CC List:	1 user

Kernel Version:	5.4.210
Regression:	No

Attachments
.config (92.84 KB, text/plain) 2023-03-07 08:58 UTC, Youngjun
dmesg (468.68 KB, text/plain) 2023-03-07 10:07 UTC, Youngjun

Description Youngjun 2023-03-07 08:58:00 UTC

Created attachment 303892 [details]
.config

Hi All,

When I run an application that uses almost all of the memory on my custom board and then terminate it with Ctrl+C (SIGINT), I get the error below.

[   34.560094] BUG: Bad page map in process opengltest:disk$0  pte:86000707 pmd:87516835
[   34.568054] page:e0400000 refcount:0 mapcount:-129 mapping:00000000 index:0x1
[   34.575151] flags: 0x2(referenced)
[   34.578486] raw: 00000002 e0411284 e0400fc4 00000000 00000001 00000004 ffffff7e 00000000
[   34.586583] raw: 00000000
[   34.589136] page dumped because: bad pte
[   34.593040] addr:b27e7000 vm_flags:140440fb anon_vma:00000000 mapping:cb53e2f8 index:10282
[   34.601333] file:renderD129 fault:lima_gem_fault mmap:lima_gem_mmap readpage:0x0
[   34.608681] CPU: 1 PID: 178 Comm: opengltest:disk$0 Tainted: P           O      5.4.210-custom #1
[   34.616955] Hardware name: Custom Cortex-A7 board (Flattened Device Tree)
[   34.623282] Backtrace:
[   34.625720] [<c06cc9d8>] (dump_backtrace) from [<c06ccc60>] (show_stack+0x20/0x24)
[   34.633254]  r7:e0400000 r6:600f0013 r5:00000000 r4:c0b6e964
[   34.638889] [<c06ccc40>] (show_stack) from [<c06d5138>] (dump_stack+0x90/0xac)
[   34.646083] [<c06d50a8>] (dump_stack) from [<c02784f4>] (print_bad_pte+0x170/0x1a0)
[   34.653706]  r7:e0400000 r6:b27e7000 r5:ca8d7478 r4:00000000
[   34.659343] [<c0278384>] (print_bad_pte) from [<c0279980>] (unmap_page_range+0x35c/0x454)
[   34.667489]  r10:c99d2c98 r9:86000707 r8:e0400000 r7:ca8d7478 r6:b2800000 r5:b27e7000
[   34.675282]  r4:c997be20
[   34.677802] [<c0279624>] (unmap_page_range) from [<c0279b04>] (unmap_single_vma+0x8c/0x94)
[   34.686036]  r10:000000f0 r9:c0baf44c r8:00000000 r7:c997be20 r6:b27e7000 r5:ca8d7478
[   34.693829]  r4:b2a47000
[   34.696348] [<c0279a78>] (unmap_single_vma) from [<c0279ce4>] (unmap_vmas+0x60/0x68)
[   34.704060]  r7:00000000 r6:c997be20 r5:ffffffff r4:ca8d7478
[   34.709697] [<c0279c84>] (unmap_vmas) from [<c02803a0>] (exit_mmap+0xf0/0x150)
[   34.716888]  r8:00000000 r7:c99b024c r6:ffffe000 r5:00000000 r4:c9919948
[   34.723564] [<c02802b0>] (exit_mmap) from [<c011d000>] (__mmput+0x48/0xc0)
[   34.730404]  r5:c99b0200 r4:c99b0200
[   34.733962] [<c011cfb8>] (__mmput) from [<c011d0a8>] (mmput+0x30/0x34)
[   34.740457]  r5:c99b0200 r4:c99b0200
[   34.744016] [<c011d078>] (mmput) from [<c01249b8>] (do_exit+0x404/0xa38)
[   34.750684]  r5:c99b0200 r4:c9991680
[   34.754242] [<c01245b4>] (do_exit) from [<c01250a4>] (do_group_exit+0x6c/0xcc)
[   34.761429]  r7:c99a0540
[   34.763950] [<c0125038>] (do_group_exit) from [<c01319f4>] (get_signal+0x214/0x708)
[   34.771573]  r7:c99a0540 r6:c997bf44 r5:ffffe000 r4:00418004
[   34.777210] [<c01317e0>] (get_signal) from [<c010c238>] (do_work_pending+0xf0/0x448)
[   34.784922]  r10:000000f0 r9:b6c01af6 r8:b6c01af4 r7:fffffe00 r6:ffffe000 r5:00000001
[   34.792715]  r4:c997bfb0
[   34.795236] [<c010c148>] (do_work_pending) from [<c010106c>] (slow_work_pending+0xc/0x20)
[   34.803377] Exception stack(0xc997bfb0 to 0xc997bff8)
[   34.808406] bfa0:                                     05b4ae80 00000189 00000000 00000000
[   34.816555] bfc0: 00000000 ffffffff 00000000 000000f0 05b4ae80 b6c3c4c5 00000000 00000000
[   34.824700] bfe0: 000000f0 b576daa0 b6c3a7cd b6c01af4 80010030 05b4ae80
[   34.831289]  r10:000000f0 r9:c997a000 r8:c0101264 r7:000000f0 r6:00000000 r5:ffffffff
[   34.839082]  r4:00000000
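
The values in this dump can be cross-checked against one another. The following is a minimal sketch, not kernel code. It assumes 4 KiB pages, RAM starting at PFN 0x80000 (from the /proc/iomem excerpt in this report), a linear-map offset of 0x40000000 between physical and kernel virtual addresses (that is, PAGE_OFFSET 0xc0000000 and PHYS_OFFSET 0x80000000, both assumptions), a 36-byte struct page (the raw dump shows nine 32-bit words), and a mem_map physical base taken from the memblock_reserve line in this report. All helper names are hypothetical.

```python
# Assumptions (not confirmed by the report): see the paragraph above.
PAGE_SHIFT = 12                 # 4 KiB pages
ARCH_PFN_OFFSET = 0x80000       # first RAM PFN (RAM starts at 0x80000000)
LINEAR_OFFSET = 0x40000000      # kernel virtual minus physical, lowmem
SIZEOF_STRUCT_PAGE = 36         # nine 32-bit words in the "raw:" lines
MEM_MAP_PHYS = 0xA0328000       # start of the memblock reserved for mem_map


def pte_to_pfn(pte):
    """Drop the low attribute bits of a PTE value to get the PFN."""
    return pte >> PAGE_SHIFT


def pfn_to_page_virt(pfn):
    """Flat memory model: mem_map + (pfn - ARCH_PFN_OFFSET)."""
    index = pfn - ARCH_PFN_OFFSET
    return MEM_MAP_PHYS + LINEAR_OFFSET + index * SIZEOF_STRUCT_PAGE


def virt_to_phys(vaddr):
    """Linear-map translation for lowmem addresses."""
    return vaddr - LINEAR_OFFSET


def mapcount_from_raw(raw_word):
    """page_mapcount() is the signed 32-bit _mapcount field plus one."""
    signed = raw_word - (1 << 32) if raw_word & 0x80000000 else raw_word
    return signed + 1


pfn = pte_to_pfn(0x86000707)
page = pfn_to_page_virt(pfn)
print(hex(pfn), hex(page), hex(virt_to_phys(page)))
print(mapcount_from_raw(0xFFFFFF7E))
```

Under these assumptions, pte:86000707 maps PFN 0x86000, whose struct page sits at virtual address 0xe0400000 (the page: value in the dump), which corresponds to physical address 0xa0400000. The raw word ffffff7e decodes to the reported mapcount of -129.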


I did a lot of debugging before reporting this bug, and here's my conclusion.

The problem occurs only when the memblock that alloc_node_mem_map() allocates for mem_map (the struct page[] array of the flat memory model) contains the physical address 0xa0400000.

[    0.000000] memblock_alloc_try_nid: 5013504 bytes align=0x40 nid=0 from=0x00000000 max_addr=0x00000000 alloc_node_mem_map.constprop.0+0x7c/0x118
[    0.000000] memblock_reserve: [0xa0328000-0xa07effff] memblock_alloc_range_nid+0x100/0x13c

As you can see in the log above, the range 0xa0328000-0xa07effff contains 0xa0400000. Also, the problem always occurs when accessing a certain area starting at 0xa0400000.
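
The range arithmetic from the memblock log can be checked with a short sketch. The start address and size come from the two log lines above; the 36-byte struct page size is an assumption based on the nine 32-bit words in the raw page dump, and the function names are hypothetical.

```python
ALLOC_SIZE = 5013504            # bytes, from memblock_alloc_try_nid
RESERVED_START = 0xA0328000     # from memblock_reserve
SUSPECT_ADDR = 0xA0400000       # address the report focuses on
SIZEOF_STRUCT_PAGE = 36         # assumption: nine 32-bit words per struct page
PAGE_SIZE = 4096                # assumption: 4 KiB pages


def range_end(start, size):
    """Inclusive end address of a range of `size` bytes."""
    return start + size - 1


def contains(start, size, addr):
    """True if addr falls inside [start, start + size)."""
    return start <= addr < start + size


end = range_end(RESERVED_START, ALLOC_SIZE)
pages = ALLOC_SIZE // SIZEOF_STRUCT_PAGE
print(hex(end))
print(contains(RESERVED_START, ALLOC_SIZE, SUSPECT_ADDR))
print(pages, pages * PAGE_SIZE // (1024 * 1024), "MiB covered")
```

With these inputs the computed end address matches the logged 0xa07effff, the suspect address lies inside the range, and the allocation holds 139264 page structures, which is exactly 544 MiB of 4 KiB pages.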

When I searched for similar issues, I found that it might be a hardware issue, so I marked the area 0xa0400000-0xa04fffff as reserved and tested it with memtester. There was no problem.

The physical memory on my board is 256MB * 4, which is 1GB. However, for testing purposes, I have limited the memory to 544 MB with a kernel parameter.

# cat /proc/iomem
...
80000000-877fffff : System RAM
  80008000-809fffff : Kernel code
  80b00000-80bc80c7 : Kernel data
88000000-8b8fffff : System RAM
9e200000-9ebfffff : System RAM
9fd00000-a1ffffff : System RAM

Physical Linear
80000000-8FFFFFFF 256MB
90000000-9FFFFFFF 256MB
A0000000-A1FFFFFF  32MB
-----------------------
                  544MB
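
The sizes in the two listings above can be recomputed with a small sketch. The ranges are copied from this report; the /proc/iomem excerpt is elided ("..."), so the second total only covers the System RAM lines that are actually shown. The helper name is hypothetical.

```python
MIB = 1024 * 1024

# Physical layout table from the report (inclusive ranges).
layout = [
    (0x80000000, 0x8FFFFFFF),
    (0x90000000, 0x9FFFFFFF),
    (0xA0000000, 0xA1FFFFFF),
]

# "System RAM" lines shown in the /proc/iomem excerpt (inclusive ranges).
system_ram = [
    (0x80000000, 0x877FFFFF),
    (0x88000000, 0x8B8FFFFF),
    (0x9E200000, 0x9EBFFFFF),
    (0x9FD00000, 0xA1FFFFFF),
]


def total_mib(ranges):
    """Sum the sizes of inclusive [start, end] ranges, in MiB."""
    return sum(end - start + 1 for start, end in ranges) // MIB


print(total_mib(layout))       # 544
print(total_mib(system_ram))   # 222
# The suspect address lies in the last System RAM range shown.
print(any(s <= 0xA0400000 <= e for s, e in system_ram))
```

The layout table adds up to the stated 544 MB, and 0xa0400000 falls inside the last System RAM range listed (9fd00000-a1ffffff).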


I'm stuck on how to debug here. Any advice would be appreciated.

Comment 1 Youngjun 2023-03-07 10:07:51 UTC

Created attachment 303893 [details]
dmesg

Comment 2 Andrew Morton 2023-03-07 22:06:15 UTC

Comment on attachment 303893 [details]
dmesg

It does sound like a hardware issue.

Perhaps you could modify alloc_node_mem_map(): if the memory it got from memmap_alloc() contains 0xa0400000 then simply leak it and allocate another chunk of memory?
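
The leak-and-retry idea can be illustrated with a toy simulation. This is only a sketch of the control flow, not a kernel patch: BumpAllocator is a hypothetical stand-in that hands out ascending ranges and does not model how memblock actually chooses addresses, and alloc_avoiding is likewise a made-up name.

```python
BAD_ADDR = 0xA0400000
MEM_MAP_SIZE = 5013504  # bytes, from the memblock log in this report


class BumpAllocator:
    """Hypothetical toy allocator handing out ascending address ranges."""

    def __init__(self, base, limit):
        self.next = base
        self.limit = limit

    def alloc(self, size):
        start = self.next
        if start + size > self.limit:
            return None  # out of memory
        self.next = start + size
        return start


def alloc_avoiding(allocator, size, bad_addr):
    """Allocate until a chunk does not contain bad_addr; leak the rest."""
    leaked = []
    while True:
        start = allocator.alloc(size)
        if start is None:
            return None, leaked
        if start <= bad_addr < start + size:
            leaked.append((start, start + size - 1))  # never freed
            continue
        return start, leaked


allocator = BumpAllocator(0xA0328000, 0xA2000000)
start, leaked = alloc_avoiding(allocator, MEM_MAP_SIZE, BAD_ADDR)
print(hex(start), [(hex(s), hex(e)) for s, e in leaked])
```

In this toy run the first chunk (0xa0328000-0xa07effff) is leaked because it contains the suspect address, and the second chunk is returned instead.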

Comment 3 Youngjun 2023-03-08 01:19:56 UTC

(In reply to Andrew Morton from comment #2)
> Comment on attachment 303893 [details]
> dmesg
> 
> It does sound like a hardware issue.
> 
> Perhaps you could modify alloc_node_mem_map(): if the memory it got from
> memmap_alloc() contains 0xa0400000 then simply leak it and allocate another
> chunk of memory?

It's a little different from what you're suggesting, but as I mentioned above, when I prevented alloc_node_mem_map() from allocating the 0xa0400000 area by marking it reserved with the no-map property, I didn't have the problem. Is there a difference in results between what I did and what you're suggesting?

I also tested the 0xa0400000 region for 24 hours with the command 'memtester -p 0xa0400000 1M' and had no issues.
