Created attachment 303892 [details]
.config

Hi All,

When I run an application that uses almost all of the memory on my custom board and then exit it with Ctrl+C (SIGINT), I get the error below.

[   34.560094] BUG: Bad page map in process opengltest:disk$0  pte:86000707 pmd:87516835
[   34.568054] page:e0400000 refcount:0 mapcount:-129 mapping:00000000 index:0x1
[   34.575151] flags: 0x2(referenced)
[   34.578486] raw: 00000002 e0411284 e0400fc4 00000000 00000001 00000004 ffffff7e 00000000
[   34.586583] raw: 00000000
[   34.589136] page dumped because: bad pte
[   34.593040] addr:b27e7000 vm_flags:140440fb anon_vma:00000000 mapping:cb53e2f8 index:10282
[   34.601333] file:renderD129 fault:lima_gem_fault mmap:lima_gem_mmap readpage:0x0
[   34.608681] CPU: 1 PID: 178 Comm: opengltest:disk$0 Tainted: P O 5.4.210-custom #1
[   34.616955] Hardware name: Custom Cortex-A7 board (Flattened Device Tree)
[   34.623282] Backtrace:
[   34.625720] [<c06cc9d8>] (dump_backtrace) from [<c06ccc60>] (show_stack+0x20/0x24)
[   34.633254]  r7:e0400000 r6:600f0013 r5:00000000 r4:c0b6e964
[   34.638889] [<c06ccc40>] (show_stack) from [<c06d5138>] (dump_stack+0x90/0xac)
[   34.646083] [<c06d50a8>] (dump_stack) from [<c02784f4>] (print_bad_pte+0x170/0x1a0)
[   34.653706]  r7:e0400000 r6:b27e7000 r5:ca8d7478 r4:00000000
[   34.659343] [<c0278384>] (print_bad_pte) from [<c0279980>] (unmap_page_range+0x35c/0x454)
[   34.667489]  r10:c99d2c98 r9:86000707 r8:e0400000 r7:ca8d7478 r6:b2800000 r5:b27e7000
[   34.675282]  r4:c997be20
[   34.677802] [<c0279624>] (unmap_page_range) from [<c0279b04>] (unmap_single_vma+0x8c/0x94)
[   34.686036]  r10:000000f0 r9:c0baf44c r8:00000000 r7:c997be20 r6:b27e7000 r5:ca8d7478
[   34.693829]  r4:b2a47000
[   34.696348] [<c0279a78>] (unmap_single_vma) from [<c0279ce4>] (unmap_vmas+0x60/0x68)
[   34.704060]  r7:00000000 r6:c997be20 r5:ffffffff r4:ca8d7478
[   34.709697] [<c0279c84>] (unmap_vmas) from [<c02803a0>] (exit_mmap+0xf0/0x150)
[   34.716888]  r8:00000000 r7:c99b024c r6:ffffe000 r5:00000000 r4:c9919948
[   34.723564] [<c02802b0>] (exit_mmap) from [<c011d000>] (__mmput+0x48/0xc0)
[   34.730404]  r5:c99b0200 r4:c99b0200
[   34.733962] [<c011cfb8>] (__mmput) from [<c011d0a8>] (mmput+0x30/0x34)
[   34.740457]  r5:c99b0200 r4:c99b0200
[   34.744016] [<c011d078>] (mmput) from [<c01249b8>] (do_exit+0x404/0xa38)
[   34.750684]  r5:c99b0200 r4:c9991680
[   34.754242] [<c01245b4>] (do_exit) from [<c01250a4>] (do_group_exit+0x6c/0xcc)
[   34.761429]  r7:c99a0540
[   34.763950] [<c0125038>] (do_group_exit) from [<c01319f4>] (get_signal+0x214/0x708)
[   34.771573]  r7:c99a0540 r6:c997bf44 r5:ffffe000 r4:00418004
[   34.777210] [<c01317e0>] (get_signal) from [<c010c238>] (do_work_pending+0xf0/0x448)
[   34.784922]  r10:000000f0 r9:b6c01af6 r8:b6c01af4 r7:fffffe00 r6:ffffe000 r5:00000001
[   34.792715]  r4:c997bfb0
[   34.795236] [<c010c148>] (do_work_pending) from [<c010106c>] (slow_work_pending+0xc/0x20)
[   34.803377] Exception stack(0xc997bfb0 to 0xc997bff8)
[   34.808406] bfa0:                                     05b4ae80 00000189 00000000 00000000
[   34.816555] bfc0: 00000000 ffffffff 00000000 000000f0 05b4ae80 b6c3c4c5 00000000 00000000
[   34.824700] bfe0: 000000f0 b576daa0 b6c3a7cd b6c01af4 80010030 05b4ae80
[   34.831289]  r10:000000f0 r9:c997a000 r8:c0101264 r7:000000f0 r6:00000000 r5:ffffffff
[   34.839082]  r4:00000000

I did a lot of debugging before reporting this bug, and here is my conclusion: the bug occurs only when alloc_node_mem_map() allocates a memblock for the flat-memory-model mem_map (the array of struct page) that contains the address 0xa0400000.

[    0.000000] memblock_alloc_try_nid: 5013504 bytes align=0x40 nid=0 from=0x00000000 max_addr=0x00000000 alloc_node_mem_map.constprop.0+0x7c/0x118
[    0.000000] memblock_reserve: [0xa0328000-0xa07effff] memblock_alloc_range_nid+0x100/0x13c

As you can see in the log above, the range 0xa0328000-0xa07effff contains 0xa0400000. The problem also always occurs when accessing a certain area starting at 0xa0400000.
When I googled other issues, I found that it might be a hardware issue, so I marked the area 0xa0400000-0xa04fffff as reserved, tested it with memtester, and there was no problem.

The physical memory on my board is 256MB * 4, i.e. 1GB. However, for testing purposes, I have reduced the memory to 544MB with a kernel parameter.

# cat /proc/iomem
...
80000000-877fffff : System RAM
  80008000-809fffff : Kernel code
  80b00000-80bc80c7 : Kernel data
88000000-8b8fffff : System RAM
9e200000-9ebfffff : System RAM
9fd00000-a1ffffff : System RAM

Physical            Linear
80000000-8FFFFFFF   256MB
90000000-9FFFFFFF   256MB
A0000000-A1FFFFFF   32MB
                    -----
                    544MB

I'm stuck on how to debug this further. Any advice would be appreciated.
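Since the oops reports the board as "Flattened Device Tree", a reservation like the one described above would typically be expressed as a reserved-memory node with the no-map property. A minimal sketch, assuming a 32-bit address map and with the node/label names being illustrative (the reg values follow the 0xa0400000-0xa04fffff range mentioned in the report):

```dts
/* Sketch: keep the kernel (and alloc_node_mem_map()) away from
 * the suspect 1 MiB region. Node and label names are illustrative. */
reserved-memory {
	#address-cells = <1>;
	#size-cells = <1>;
	ranges;

	bad_ram: badram@a0400000 {
		reg = <0xa0400000 0x100000>;	/* 0xa0400000-0xa04fffff */
		no-map;				/* never map or allocate this range */
	};
};
```

With no-map, the region is excluded from the kernel's linear mapping entirely, which matches the observation that the problem disappears once nothing is allocated there.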
Created attachment 303893 [details]
dmesg
Comment on attachment 303893 [details]
dmesg

It does sound like a hardware issue.

Perhaps you could modify alloc_node_mem_map(): if the memory it got from memmap_alloc() contains 0xa0400000 then simply leak it and allocate another chunk of memory?
(In reply to Andrew Morton from comment #2)
> Comment on attachment 303893 [details]
> dmesg
>
> It does sound like a hardware issue.
>
> Perhaps you could modify alloc_node_mem_map(): if the memory it got from
> memmap_alloc() contains 0xa0400000 then simply leak it and allocate another
> chunk of memory?

It's a little different from what you're suggesting, but as I mentioned above, when I prevented alloc_node_mem_map() from allocating the 0xa0400000 area by marking it reserved with the no-map property, the problem did not occur. Is there a difference in results between what I did and what you're suggesting?

I also tested the 0xa0400000 region for 24 hours with the command 'memtester -p 0xa0400000 1M' and had no issues.