On a Xen VM running as pvh

[ 3.499843] Adding 131068k swap on /dev/xvda9. Priority:-2 extents:1 across:131068k SSFS
[ 3.547312] EXT4-fs (xvda1): re-mounted. Opts: (null)
[ 3.988606] EXT4-fs (xvda1): re-mounted. Opts: errors=remount-ro
[ 24.647744] BUG: unable to handle kernel NULL pointer dereference at 00000008
[ 24.647801] IP: __radix_tree_lookup+0x14/0xa0
[ 24.647811] *pdpt = 00000000253d6027 *pde = 0000000000000000
[ 24.647828] Oops: 0000 [#1] SMP
[ 24.647842] CPU: 5 PID: 3600 Comm: java Not tainted 4.14.13-rh10-20180115190010.xenU.i386 #1
[ 24.647855] task: e52518c0 task.stack: e4e7a000
[ 24.647866] EIP: __radix_tree_lookup+0x14/0xa0
[ 24.647876] EFLAGS: 00010286 CPU: 5
[ 24.647884] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000
[ 24.647895] ESI: 00000000 EDI: 00000000 EBP: e4e7bdb8 ESP: e4e7bda0
[ 24.647904] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
[ 24.647917] CR0: 80050033 CR2: 00000008 CR3: 25360000 CR4: 00002660
[ 24.647930] Call Trace:
[ 24.647942] radix_tree_lookup_slot+0x13/0x30
[ 24.647955] find_get_entry+0x1d/0x120
[ 24.647963] pagecache_get_page+0x1f/0x230
[ 24.647975] lookup_swap_cache+0x42/0x140
[ 24.647983] swap_readahead_detect+0x66/0x2e0
[ 24.647993] do_swap_page+0x1fa/0x860
[ 24.648010] ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
[ 24.648026] ? xen_pmd_val+0x10/0x20
[ 24.648035] handle_mm_fault+0x6f8/0x1020
[ 24.648046] __do_page_fault+0x18a/0x450
[ 24.648055] ? vmalloc_sync_all+0x250/0x250
[ 24.648063] do_page_fault+0x21/0x30
[ 24.648074] common_exception+0x45/0x4a
[ 24.648082] EIP: 0xb76d873e
[ 24.648088] EFLAGS: 00010206 CPU: 5
[ 24.648096] EAX: 76a10000 EBX: 76a1cd14 ECX: 00000006 EDX: 00000006
[ 24.648105] ESI: 00000040 EDI: b796c380 EBP: 77881008 ESP: 77880ff8
[ 24.648115] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[ 24.648124] Code: ff ff ff 00 47 03 e9 69 ff ff ff 8b 45 08 89 06 e9 1f ff ff ff 66 90 55 89 e5 57 89 d7 56 53 83 ec 0c 89 45 ec 89 4d e8 8b 45 ec <8b> 58 04 89 d8 83 e0 03 48 89 5d f0 75 64 89 d8 83 e0 fe 0f b6
[ 24.648195] EIP: __radix_tree_lookup+0x14/0xa0 SS:ESP: 0069:e4e7bda0
[ 24.648205] CR2: 0000000000000008
[ 24.648273] ---[ end trace ed356e59f215ce07 ]---
[ 28.890326] BUG: unable to handle kernel NULL pointer dereference at 00000008
[ 28.890372] IP: __radix_tree_lookup+0x14/0xa0
[ 28.890382] *pdpt = 0000000025488027 *pde = 0000000000000000
[ 28.890396] Oops: 0000 [#2] SMP
[ 28.890408] CPU: 7 PID: 3542 Comm: java Tainted: G D 4.14.13-rh10-20180115190010.xenU.i386 #1
[ 28.890423] task: e8691080 task.stack: e52a6000
[ 28.890433] EIP: __radix_tree_lookup+0x14/0xa0
[ 28.890442] EFLAGS: 00010286 CPU: 7
[ 28.890449] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000
[ 28.890459] ESI: 00000000 EDI: 00000000 EBP: e52a7db8 ESP: e52a7da0
[ 28.890469] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
[ 28.890484] CR0: 80050033 CR2: 00000008 CR3: 25161000 CR4: 00002660
[ 28.890498] Call Trace:
[ 28.890510] radix_tree_lookup_slot+0x13/0x30
[ 28.890522] find_get_entry+0x1d/0x120
[ 28.890531] pagecache_get_page+0x1f/0x230
[ 28.890541] lookup_swap_cache+0x42/0x140
[ 28.890550] swap_readahead_detect+0x66/0x2e0
[ 28.890559] do_swap_page+0x1fa/0x860
[ 28.890573] ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
[ 28.890588] ? xen_pmd_val+0x10/0x20
[ 28.890597] handle_mm_fault+0x6f8/0x1020
[ 28.890607] __do_page_fault+0x18a/0x450
[ 28.890616] ? vmalloc_sync_all+0x250/0x250
[ 28.890681] do_page_fault+0x21/0x30
[ 28.890707] common_exception+0x45/0x4a
[ 28.890715] EIP: 0xb779774f
[ 28.890722] EFLAGS: 00010202 CPU: 7
[ 28.890730] EAX: 00000000 EBX: 66dd9d6c ECX: 02000000 EDX: 00000001
[ 28.890740] ESI: 02000000 EDI: 00000000 EBP: 674fe068 ESP: 674fe058
[ 28.890751] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[ 28.890759] Code: ff ff ff 00 47 03 e9 69 ff ff ff 8b 45 08 89 06 e9 1f ff ff ff 66 90 55 89 e5 57 89 d7 56 53 83 ec 0c 89 45 ec 89 4d e8 8b 45 ec <8b> 58 04 89 d8 83 e0 03 48 89 5d f0 75 64 89 d8 83 e0 fe 0f b6
[ 28.890830] EIP: __radix_tree_lookup+0x14/0xa0 SS:ESP: 0069:e52a7da0
[ 28.890841] CR2: 0000000000000008
[ 28.890886] ---[ end trace ed356e59f215ce08 ]---

# java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) Server VM (build 24.51-b03, mixed mode)

~# free -m
             total       used       free     shared    buffers     cached
Mem:          5418        572       4846          0         25        245
-/+ buffers/cache:        301       5117
Swap:          127          0        127

# uptime
 01:21:08 up 2:02, 3 users, load average: 13.47, 13.65, 13.63
FWIW, I am seeing this on a couple of different servers. A busy Java process seems to be a common trigger. From memory, only 32-bit kernels seem to be hitting the problem.
(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 18 Jan 2018 01:21:56 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=198497
>
>             Bug ID: 198497
>            Summary: handle_mm_fault / xen_pmd_val / radix_tree_lookup_slot
>                     Null pointer
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: Linux app1.vpsgate.com
>                     4.14.13-rh10-20180115190010.xenU.i386 #1 SMP Mon Jan
>                     15 19:04:55 UTC 2018 i686 GNU/Linux
>           Hardware: Intel
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Page Allocator
>           Assignee: akpm@linux-foundation.org
>           Reporter: peter@rimuhosting.com
>         Regression: No

Does this look familiar to anyone?
On 01/18/2018 01:55 PM, Andrew Morton wrote:
> Does this look familiar to anyone?

Fedora has been seeing similar reports
https://bugzilla.redhat.com/show_bug.cgi?id=1531779

Multiple reporters, one in XEN, another on actual hardware
On Thu, Jan 18, 2018 at 02:18:20PM -0800, Laura Abbott wrote: > On 01/18/2018 01:55 PM, Andrew Morton wrote: > > > [ 24.647744] BUG: unable to handle kernel NULL pointer dereference at > > > 00000008 > > > [ 24.647801] IP: __radix_tree_lookup+0x14/0xa0 > > > [ 24.647811] *pdpt = 00000000253d6027 *pde = 0000000000000000 > > > [ 24.647828] Oops: 0000 [#1] SMP > > > [ 24.647842] CPU: 5 PID: 3600 Comm: java Not tainted > > > 4.14.13-rh10-20180115190010.xenU.i386 #1 > > > [ 24.647855] task: e52518c0 task.stack: e4e7a000 > > > [ 24.647866] EIP: __radix_tree_lookup+0x14/0xa0 > > > [ 24.647876] EFLAGS: 00010286 CPU: 5 > > > [ 24.647884] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000 > > > [ 24.647895] ESI: 00000000 EDI: 00000000 EBP: e4e7bdb8 ESP: e4e7bda0 > > > [ 24.647904] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069 > > > [ 24.647917] CR0: 80050033 CR2: 00000008 CR3: 25360000 CR4: 00002660 > > > [ 24.647930] Call Trace: > > > [ 24.647942] radix_tree_lookup_slot+0x13/0x30 > > > [ 24.647955] find_get_entry+0x1d/0x120 > > > [ 24.647963] pagecache_get_page+0x1f/0x230 > > > [ 24.647975] lookup_swap_cache+0x42/0x140 > > > [ 24.647983] swap_readahead_detect+0x66/0x2e0 > > > [ 24.647993] do_swap_page+0x1fa/0x860 > > > [ 24.648010] ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10 > > > [ 24.648026] ? xen_pmd_val+0x10/0x20 > > > [ 24.648035] handle_mm_fault+0x6f8/0x1020 > > > [ 24.648046] __do_page_fault+0x18a/0x450 > > > [ 24.648055] ? 
vmalloc_sync_all+0x250/0x250
> > > [ 24.648063] do_page_fault+0x21/0x30
> > > [ 24.648074] common_exception+0x45/0x4a
> > > [ 24.648082] EIP: 0xb76d873e
> > > [ 24.648088] EFLAGS: 00010206 CPU: 5
> > > [ 24.648096] EAX: 76a10000 EBX: 76a1cd14 ECX: 00000006 EDX: 00000006
> > > [ 24.648105] ESI: 00000040 EDI: b796c380 EBP: 77881008 ESP: 77880ff8
> > > [ 24.648115] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
> > > [ 24.648124] Code: ff ff ff 00 47 03 e9 69 ff ff ff 8b 45 08 89 06 e9 1f ff
> > > ff ff 66 90 55 89 e5 57 89 d7 56 53 83 ec 0c 89 45 ec 89 4d e8 8b 45 ec <8b> 58
> > > 04 89 d8 83 e0 03 48 89 5d f0 75 64 89 d8 83 e0 fe 0f b6
> > > [ 24.648195] EIP: __radix_tree_lookup+0x14/0xa0 SS:ESP: 0069:e4e7bda0
> > > [ 24.648205] CR2: 0000000000000008
> > > [ 24.648273] ---[ end trace ed356e59f215ce07 ]---

Running that code through decodecode, I get:

   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   57                      push   %edi
   4:   89 d7                   mov    %edx,%edi
   6:   56                      push   %esi
   7:   53                      push   %ebx
   8:   83 ec 0c                sub    $0xc,%esp
   b:   89 45 ec                mov    %eax,-0x14(%ebp)
   e:   89 4d e8                mov    %ecx,-0x18(%ebp)
  11:   8b 45 ec                mov    -0x14(%ebp),%eax
  14:*  8b 58 04                mov    0x4(%eax),%ebx      <-- trapping instruction
  17:   89 d8                   mov    %ebx,%eax
  19:   83 e0 03                and    $0x3,%eax

Which I think means it's looking at offset 4 from whichever argument
the x86 calling convention puts in register %eax. Which I think is
argument 0? Which is the radix tree root. And that makes sense; we're
loading the root node from the radix tree root at offset 4. The problem
is that %eax has the value 4 in it. That would match with 'page_tree'
being at offset 4 from the start of address_space. So find_get_page()
got called with a NULL mapping, so pagecache_get_page() got called
with a NULL mapping.

Which means I've tracked it back to:

        page = find_get_page(swap_address_space(entry), swp_offset(entry));

and swap_address_space() is returning NULL. Has this machine run swapoff
recently, perhaps?
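The offset arithmetic in the analysis above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the two structs are hypothetical stand-ins that use fixed 32-bit fields to mimic an i386 layout, and the field names first_word and host are made up. Only the two offsets (page_tree at offset 4 of address_space, and the load from offset 4 of the root) come from the thread.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for the i386 layout discussed above. Fixed 32-bit
 * fields are used so the offsets are the same on any build host. The field
 * names first_word and host are illustrative assumptions.
 */
struct mock_radix_tree_root {
	uint32_t first_word;	/* offset 0: not touched by the trapping load */
	uint32_t rnode;		/* offset 4: read by "mov 0x4(%eax),%ebx" */
};

struct mock_address_space {
	uint32_t host;				/* offset 0 */
	struct mock_radix_tree_root page_tree;	/* offset 4, per the analysis */
};

/* Value passed as the radix tree root (%eax) for a given mapping pointer. */
static uint32_t root_pointer(uint32_t mapping)
{
	return mapping + (uint32_t)offsetof(struct mock_address_space, page_tree);
}

/* Address read by the trapping instruction, i.e. what ends up in CR2. */
static uint32_t fault_address(uint32_t mapping)
{
	return root_pointer(mapping) +
	       (uint32_t)offsetof(struct mock_radix_tree_root, rnode);
}
```

With a NULL mapping, root_pointer(0) is 0x4 and fault_address(0) is 0x8, which lines up with "EAX: 00000004" and "CR2: 00000008" in the oops.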
On 19/01/18 4:04 PM, Matthew Wilcox wrote:
> Which means I've tracked it back to:
>
>         page = find_get_page(swap_address_space(entry), swp_offset(entry));
>
> and swap_address_space() is returning NULL. Has this machine run swapoff
> recently, perhaps?

Swap was on.
Swap is small (127MB). Swap had not been dipped into.

             total       used       free     shared    buffers     cached
Swap:          127          0        127

PS: I cannot recall seeing this issue on x86_64, just on 32-bit.

PPS: A reminder that this is on a Xen VM, which per
https://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html#PVH-Guest-Specific-Options
has "out of sync pagetables", if that is relevant (we do not set that option,
and I am unsure what default is used).
On Fri, Jan 19, 2018 at 04:14:42PM +1300, xen@randomwebstuff.com wrote:
>
> On 19/01/18 4:04 PM, Matthew Wilcox wrote:
> > On Thu, Jan 18, 2018 at 02:18:20PM -0800, Laura Abbott wrote:
> > > On 01/18/2018 01:55 PM, Andrew Morton wrote:
> > > > > [ 24.647744] BUG: unable to handle kernel NULL pointer dereference at
> > > > > 00000008
> > > > > [ 24.647801] IP: __radix_tree_lookup+0x14/0xa0
> > > > > [ 24.647811] *pdpt = 00000000253d6027 *pde = 0000000000000000
> > > > > [ 24.647828] Oops: 0000 [#1] SMP
> > > > > [ 24.647842] CPU: 5 PID: 3600 Comm: java Not tainted
> > > > > 4.14.13-rh10-20180115190010.xenU.i386 #1
> > > > > [ 24.647855] task: e52518c0 task.stack: e4e7a000
> > > > > [ 24.647866] EIP: __radix_tree_lookup+0x14/0xa0
> > > > > [ 24.647876] EFLAGS: 00010286 CPU: 5
> > > > > [ 24.647884] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000

If my understanding is right, EDX contains the index we're looking up.
Which is zero. So the swp_entry we got is one bit away from being NULL.
Hmm. Have you run memtest86 or some other memory tester on the system
recently?

> PS: cannot recall seeing this issue on x86_64, just 32 bit.

Laura has 64-bit instances of this.

> PPS: reminder this is on a Xen VM which per
> https://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html#PVH-Guest-Specific-Options
> has "out of sync pagetables" if that is relevant (we do not set that option,
> I am unsure what default is used).

Laura also has non-Xen instances of this. They may not all be the same
bug, of course.
On Thu, Jan 18, 2018 at 02:18:20PM -0800, Laura Abbott wrote:
> Fedora has been seeing similar reports
> https://bugzilla.redhat.com/show_bug.cgi?id=1531779
>
> Multiple reporters, one in XEN, another on actual hardware

Can you chuck this patch into Fedora? Should make it easier to see if
it's a "stuck bit" kind of a problem.

---

From: Matthew Wilcox <mawilcox@microsoft.com>
Subject: Detect bad swap entries in lookup

If we have a stuck bit in a PTE, we can end up looking for an entry in
a NULL mapping, which oopses fairly quickly. Print a warning to help
us debug, and return NULL which will help the machine survive a little
longer. Although if it has a permanently stuck bit in a PTE, there's only
a 50% chance it'll survive the insertion of a real PTE into that entry.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 39ae7cfad90f..5a928e0191a1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -334,8 +334,12 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long ra_info;
 	int win, hits, readahead;
+	struct address_space *swapper_space = swap_address_space(entry);
+
+	if (WARN(!swapper_space, "Bad swp_entry: %lx\n", entry.val))
+		return NULL;

-	page = find_get_page(swap_address_space(entry), swp_offset(entry));
+	page = find_get_page(swapper_space, swp_offset(entry));

 	INC_CACHE_INFO(find_total);
 	if (page) {
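The warning in the patch above prints the raw entry.val, so it may help to split such a value into its swap type and offset by hand. The following is a minimal sketch for a 32-bit kernel, not the kernel's own code: the helper names are hypothetical, and the two shift constants (5 bits of swap type, 2 low bits reserved) are an assumption about the 4.14 encoding that is not stated anywhere in this thread and should be checked against the tree being debugged.

```c
#include <stdint.h>

/*
 * Assumed constants for a 32-bit 4.14 kernel; they are not quoted anywhere
 * in this thread and should be checked against the tree being debugged.
 */
#define ASSUMED_MAX_SWAPFILES_SHIFT	5
#define ASSUMED_EXCEPTIONAL_SHIFT	2
#define ASSUMED_SWP_TYPE_SHIFT \
	(32 - (ASSUMED_MAX_SWAPFILES_SHIFT + ASSUMED_EXCEPTIONAL_SHIFT))

/* Swap device index encoded in the top bits of entry.val. */
static uint32_t decode_swp_type(uint32_t val)
{
	return val >> ASSUMED_SWP_TYPE_SHIFT;
}

/* Page offset within that swap device, from the remaining low bits. */
static uint32_t decode_swp_offset(uint32_t val)
{
	return val & ((UINT32_C(1) << ASSUMED_SWP_TYPE_SHIFT) - 1);
}
```

Under these assumptions, the value e000000 that is reported later in this thread would decode to type 7, offset 0. On a machine with a single swap device (type 0), such an entry would not correspond to any active swap area, which would be consistent with swap_address_space() returning NULL.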
On 01/19/2018 05:21 AM, Matthew Wilcox wrote:
> On Fri, Jan 19, 2018 at 04:14:42PM +1300, xen@randomwebstuff.com wrote:
>>
>> On 19/01/18 4:04 PM, Matthew Wilcox wrote:
>>> On Thu, Jan 18, 2018 at 02:18:20PM -0800, Laura Abbott wrote:
>>>> On 01/18/2018 01:55 PM, Andrew Morton wrote:
>>>>>> [ 24.647744] BUG: unable to handle kernel NULL pointer dereference at
>>>>>> 00000008
>>>>>> [ 24.647801] IP: __radix_tree_lookup+0x14/0xa0
>>>>>> [ 24.647811] *pdpt = 00000000253d6027 *pde = 0000000000000000
>>>>>> [ 24.647828] Oops: 0000 [#1] SMP
>>>>>> [ 24.647842] CPU: 5 PID: 3600 Comm: java Not tainted
>>>>>> 4.14.13-rh10-20180115190010.xenU.i386 #1
>>>>>> [ 24.647855] task: e52518c0 task.stack: e4e7a000
>>>>>> [ 24.647866] EIP: __radix_tree_lookup+0x14/0xa0
>>>>>> [ 24.647876] EFLAGS: 00010286 CPU: 5
>>>>>> [ 24.647884] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000
>
> If my understanding is right, EDX contains the index we're looking up.
> Which is zero. So the swp_entry we got is one bit away from being NULL.
> Hmm. Have you run memtest86 or some other memory tester on the system
> recently?
>
>> PS: cannot recall seeing this issue on x86_64, just 32 bit.
>
> Laura has 64-bit instances of this.
>

The 64-bit backtraces reported in the bugzilla looked different,
I would consider it a different issue.

>> PPS: reminder this is on a Xen VM which per
>> https://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html#PVH-Guest-Specific-Options
>> has "out of sync pagetables" if that is relevant (we do not set that option,
>> I am unsure what default is used).
>
> Laura also has non-Xen instances of this. They may not all be the same
> bug, of course.
>
Created attachment 273867
attachment-13826-0.html

On 20/01/18 6:30 AM, Laura Abbott wrote:
> On 01/19/2018 05:21 AM, Matthew Wilcox wrote:
>> On Fri, Jan 19, 2018 at 04:14:42PM +1300, xen@randomwebstuff.com wrote:
>>>
>>> On 19/01/18 4:04 PM, Matthew Wilcox wrote:
>>>> On Thu, Jan 18, 2018 at 02:18:20PM -0800, Laura Abbott wrote:
>>>>> On 01/18/2018 01:55 PM, Andrew Morton wrote:
>>>>>>> [ 24.647744] BUG: unable to handle kernel NULL pointer dereference at
>>>>>>> 00000008
>>>>>>> [ 24.647801] IP: __radix_tree_lookup+0x14/0xa0
>>>>>>> [ 24.647811] *pdpt = 00000000253d6027 *pde = 0000000000000000
>>>>>>> [ 24.647828] Oops: 0000 [#1] SMP
>>>>>>> [ 24.647842] CPU: 5 PID: 3600 Comm: java Not tainted
>>>>>>> 4.14.13-rh10-20180115190010.xenU.i386 #1
>>>>>>> [ 24.647855] task: e52518c0 task.stack: e4e7a000
>>>>>>> [ 24.647866] EIP: __radix_tree_lookup+0x14/0xa0
>>>>>>> [ 24.647876] EFLAGS: 00010286 CPU: 5
>>>>>>> [ 24.647884] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000
>>
>> If my understanding is right, EDX contains the index we're looking up.
>> Which is zero. So the swp_entry we got is one bit away from being NULL.
>> Hmm. Have you run memtest86 or some other memory tester on the system
>> recently?
>>
>>> PS: cannot recall seeing this issue on x86_64, just 32 bit.
>>
>> Laura has 64-bit instances of this.
>>
>
> The 64-bit backtraces reported in the bugzilla looked different,
> I would consider it a different issue.
>
>>> PPS: reminder this is on a Xen VM which per
>>> https://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html#PVH-Guest-Specific-Options
>>> has "out of sync pagetables" if that is relevant (we do not set that
>>> option, I am unsure what default is used).
>>
>> Laura also has non-Xen instances of this. They may not all be the same
>> bug, of course.
>>

Re-tried with the current latest 4.14 (4.14.15).
Received the following:

[2018-01-24 19:26:57] Ubuntu 14.04.5 LTS dev hvc0
[2018-01-24 19:26:57]
[2018-01-24 19:26:57] dev login: [44501.106868] BUG: unable to handle kernel NULL pointer dereference at 00000008
[2018-01-25 07:47:50] [44501.106897] IP: __radix_tree_lookup+0x14/0xa0
[2018-01-25 07:47:50] [44501.106905] *pdpt = 000000001fe82027 *pde = 0000000000000000
[2018-01-25 07:47:50] [44501.106916] Oops: 0000 [#1] SMP
[2018-01-25 07:47:50] [44501.106924] CPU: 0 PID: 3344 Comm: PassengerAgent Not tainted 4.14.15-rh13-20180123235331.xenU.i386 #1
[2018-01-25 07:47:50] [44501.106935] task: dfee39c0 task.stack: dff12000
[2018-01-25 07:47:50] [44501.106943] EIP: __radix_tree_lookup+0x14/0xa0
[2018-01-25 07:47:50] [44501.106950] EFLAGS: 00210286 CPU: 0
[2018-01-25 07:47:50] [44501.106955] EAX: 00000004 EBX: 00000001 ECX: 00000000 EDX: 00000000
[2018-01-25 07:47:50] [44501.106963] ESI: 00000000 EDI: 00000000 EBP: dff13db8 ESP: dff13da0
[2018-01-25 07:47:50] [44501.106971] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
[2018-01-25 07:47:50] [44501.106979] CR0: 80050033 CR2: 00000008 CR3: 1fdb1000 CR4: 00002660
[2018-01-25 07:47:50] [44501.106989] Call Trace:
[2018-01-25 07:47:50] [44501.106995] radix_tree_lookup_slot+0x13/0x30
[2018-01-25 07:47:50] [44501.107004] find_get_entry+0x1d/0x120
[2018-01-25 07:47:50] [44501.107011] pagecache_get_page+0x1f/0x230
[2018-01-25 07:47:50] [44501.107018] lookup_swap_cache+0x42/0x140
[2018-01-25 07:47:50] [44501.107024] swap_readahead_detect+0x66/0x2e0
[2018-01-25 07:47:50] [44501.107032] do_swap_page+0x1fa/0x860
[2018-01-25 07:47:50] [44501.107040] ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
[2018-01-25 07:47:50] [44501.107050] ? xen_pmd_val+0x10/0x20
[2018-01-25 07:47:50] [44501.107057] handle_mm_fault+0x6f8/0x1020
[2018-01-25 07:47:50] [44501.107065] ? _raw_spin_unlock_irqrestore+0x13/0x20
[2018-01-25 07:47:50] [44501.107074] ? pvclock_clocksource_read+0xa6/0x1a0
[2018-01-25 07:47:50] [44501.107081] __do_page_fault+0x18a/0x450
[2018-01-25 07:47:50] [44501.107089] ? _copy_to_user+0x28/0x40
[2018-01-25 07:47:50] [44501.107096] ? vmalloc_sync_all+0x250/0x250
[2018-01-25 07:47:50] [44501.107102] do_page_fault+0x21/0x30
[2018-01-25 07:47:50] [44501.107109] common_exception+0x45/0x4a
[2018-01-25 07:47:50] [44501.107115] EIP: 0x82c3358
[2018-01-25 07:47:50] [44501.107120] EFLAGS: 00210202 CPU: 0
[2018-01-25 07:47:50] [44501.107126] EAX: b702d0b8 EBX: 081557a9 ECX: 00000000 EDX: 0a4296bc
[2018-01-25 07:47:50] [44501.107133] ESI: b467c2cc EDI: 00000000 EBP: b467c138 ESP: b467c110
[2018-01-25 07:47:50] [44501.107141] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[2018-01-25 07:47:50] [44501.107147] Code: ff ff ff 00 47 03 e9 69 ff ff ff 8b 45 08 89 06 e9 1f ff ff ff 66 90 55 89 e5 57 89 d7 56 53 83 ec 0c 89 45 ec 89 4d e8 8b 45 ec <8b> 58 04 89 d8 83 e0 03 48 89 5d f0 75 64 89 d8 83 e0 fe 0f b6
[2018-01-25 07:47:50] [44501.110296] EIP: __radix_tree_lookup+0x14/0xa0 SS:ESP: 0069:dff13da0
[2018-01-25 07:47:50] [44501.110304] CR2: 0000000000000008
[2018-01-25 07:47:50] [44501.110356] ---[ end trace 89cdd2ba8e7323a8 ]---
On Fri, Jan 26, 2018 at 07:54:06PM +1300, xen@randomwebstuff.com wrote:
> Re-tried with the current latest 4.14 (4.14.15). Received the following:
>
> [2018-01-24 19:26:57] dev login: [44501.106868] BUG: unable to handle kernel NULL pointer dereference at 00000008
> [2018-01-25 07:47:50] [44501.106897] IP: __radix_tree_lookup+0x14/0xa0

Please try including this patch:

https://bugzilla.kernel.org/show_bug.cgi?id=198497#c7

And have you had the chance to run memtest86 yet?
On 27/01/18 8:40 AM, Matthew Wilcox wrote: > On Fri, Jan 26, 2018 at 07:54:06PM +1300, xen@randomwebstuff.com wrote: >> Re-tried with the current latest 4.14 (4.14.15). Received the following: >> >> [2018-01-24 19:26:57] dev login: [44501.106868] BUG: unable to handle kernel >> NULL pointer dereference at 00000008 >> [2018-01-25 07:47:50] [44501.106897] IP: __radix_tree_lookup+0x14/0xa0 > Please try including this patch: > > https://bugzilla.kernel.org/show_bug.cgi?id=198497#c7 > > And have you had the chance to run memtest86 yet? Added the patch at https://bugzilla.kernel.org/show_bug.cgi?id=198497#c7 After, received this stack. Have not tried memtest86. These are production hosts. This has occurred on multiple hosts. I can only recall this occurring on 32 bit kernels. I cannot recall issues with other VMs not running that kernel on the same hosts. [ 125.329163] Bad swp_entry: e000000 [ 125.329202] ------------[ cut here ]------------ [ 125.329219] WARNING: CPU: 0 PID: 4175 at mm/swap_state.c:339 lookup_swap_cache+0x140/0x160 [ 125.329233] CPU: 0 PID: 4175 Comm: apt-show-versio Not tainted 4.14.15-rh14-20180126233810.xenU.i386-00001-g6ba70cb #1 [ 125.329245] task: ead9a940 task.stack: e7c8c000 [ 125.329253] EIP: lookup_swap_cache+0x140/0x160 [ 125.329260] EFLAGS: 00010282 CPU: 0 [ 125.329267] EAX: 00000016 EBX: 00000000 ECX: ec5289c4 EDX: 0100016d [ 125.329275] ESI: b6312000 EDI: e7d94ea0 EBP: e7c8de24 ESP: e7c8de0c [ 125.329284] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069 [ 125.329295] CR0: 80050033 CR2: b63124b0 CR3: 2718c000 CR4: 00002660 [ 125.329308] Call Trace: [ 125.329323] ? percpu_counter_add_batch+0x91/0xb0 [ 125.329332] swap_readahead_detect+0x66/0x2e0 [ 125.329343] ? radix_tree_tag_set+0x7a/0xe0 [ 125.329352] do_swap_page+0x1fa/0x860 [ 125.329361] ? __set_page_dirty_buffers+0xb1/0xe0 [ 125.329372] ? ext4_set_page_dirty+0x22/0x60 [ 125.329383] ? fault_dirty_shared_page.isra.90+0x3e/0xa0 [ 125.329396] ? 
xen_pmd_val+0x10/0x20 [ 125.329403] handle_mm_fault+0x6f8/0x1020 [ 125.329414] ? handle_irq_event_percpu+0x3c/0x50 [ 125.329424] __do_page_fault+0x18a/0x450 [ 125.329432] ? vmalloc_sync_all+0x250/0x250 [ 125.329439] do_page_fault+0x21/0x30 [ 125.329449] common_exception+0x45/0x4a [ 125.329456] EIP: 0xb7ce397b [ 125.329462] EFLAGS: 00010202 CPU: 0 [ 125.329469] EAX: 0000052a EBX: b7d77ff4 ECX: 000004fa EDX: b6311000 [ 125.329477] ESI: bf90eae0 EDI: b6ed4b20 EBP: bf90ea60 ESP: bf90ea20 [ 125.329486] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b [ 125.329493] Code: 18 1f 14 c2 85 ff 0f 85 41 ff ff ff f0 ff 05 38 fb 02 c2 e9 35 ff ff ff 8d 76 00 89 44 24 04 c7 04 24 55 93 f3 c1 e8 8c e7 f5 ff <0f> ff 8b 5d f4 31 c0 8b 75 f8 8b 7d fc 89 ec 5d c3 64 ff 05 18 [ 125.329558] ---[ end trace dd2704ca649b44ba ]---
On Tue, Jan 30, 2018 at 11:26:42AM +1300, xen@randonwebstuff.com wrote:
> After, received this stack.
>
> Have not tried memtest86. These are production hosts. This has occurred on
> multiple hosts. I can only recall this occurring on 32 bit kernels. I
> cannot recall issues with other VMs not running that kernel on the same
> hosts.
>
> [ 125.329163] Bad swp_entry: e000000

Mixed news here then ... 'e' is 8 | 4 | 2, so it's not a single bitflip.
So no point in running memtest86.

I should have made the printk produce leading zeroes, because that's
0x0e00'0000. ptes use the top 5 bits to encode the swapfile, so
this swap entry is decoded as swapfile 1, page number 0x0600'0000.
That's clearly ludicrous because you don't have a swapfile 1, and if
you did, it wouldn't be so large as a terabyte.

I think the next step in debugging this is printing the PTE which gave
us this swp_entry. If you can drop the patch I asked you to try, and
apply this patch instead, we'll have more idea about what's going on.

Thanks!

diff --git a/mm/memory.c b/mm/memory.c
index 403934297a3d..8caaddb07747 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2892,6 +2892,10 @@ int do_swap_page(struct vm_fault *vmf)
 	if (!page)
 		page = lookup_swap_cache(entry, vma_readahead ? vma : NULL,
 					 vmf->address);
+	if (IS_ERR(page)) {
+		pte_ERROR(vmf->orig_pte);
+		page = NULL;
+	}
 	if (!page) {
 		struct swap_info_struct *si = swp_swap_info(entry);
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 7fbe67be86fa..905fa34e022a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1651,6 +1651,10 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	if (swap.val) {
 		/* Look it up and read it in.. */
 		page = lookup_swap_cache(swap, NULL, 0);
+		if (IS_ERR(page)) {
+			pte_ERROR(vmf->orig_pte);
+			page = NULL;
+		}
 		if (!page) {
 			/* Or update major stats only when swapin succeeds?? */
 			if (fault_type) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 39ae7cfad90f..7ee594c8eadd 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -334,8 +334,14 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long ra_info;
 	int win, hits, readahead;
+	struct address_space *swapper_space = swap_address_space(entry);
+
+	if (!swapper_space) {
+		pr_err("Bad swp_entry: %lx\n", entry.val);
+		return ERR_PTR(-EFAULT);
+	}
 
-	page = find_get_page(swap_address_space(entry), swp_offset(entry));
+	page = find_get_page(swapper_space, swp_offset(entry));
 
 	INC_CACHE_INFO(find_total);
 	if (page) {
@@ -676,6 +682,10 @@ struct page *swap_readahead_detect(struct vm_fault *vmf,
 	if ((unlikely(non_swap_entry(entry))))
 		return NULL;
 	page = lookup_swap_cache(entry, vma, faddr);
+	if (IS_ERR(page)) {
+		pte_ERROR(vmf->orig_pte);
+		page = NULL;
+	}
 	if (page)
 		return page;
 
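The decoding arithmetic in the message above can be checked with a small userspace sketch. It assumes, as the message states, a 32-bit entry whose top 5 bits select the swapfile and whose remaining 27 bits are the page number. The names decode_swp_entry, struct swp_fields and bits_set are made up for illustration; they are not the kernel's own macros, whose exact layout may differ.

```c
#include <stdint.h>

/*
 * Hypothetical layout, taken from the message above: the top 5 bits of a
 * 32-bit swap entry name the swapfile, the low 27 bits are the page number.
 */
#define SWP_TYPE_BITS   5
#define SWP_OFFSET_BITS (32 - SWP_TYPE_BITS)

struct swp_fields {
	uint32_t type;   /* swapfile index */
	uint32_t offset; /* page number within that swapfile */
};

/* Split a raw 32-bit swap entry into swapfile index and page number. */
static struct swp_fields decode_swp_entry(uint32_t val)
{
	struct swp_fields f;

	f.type = val >> SWP_OFFSET_BITS;
	f.offset = val & ((UINT32_C(1) << SWP_OFFSET_BITS) - 1);
	return f;
}

/* Count set bits; a single bitflip of an all-zero entry would give 1. */
static unsigned int bits_set(uint32_t val)
{
	unsigned int n = 0;

	for (; val; val &= val - 1)
		n++;
	return n;
}
```

For the reported value, decode_swp_entry(0x0e000000) gives swapfile 1 and page number 0x06000000, and bits_set(0x0e000000) is 3, which is why a single bitflip is ruled out.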
https://bugzilla.redhat.com/show_bug.cgi?id=1531779

This might be related to the "x86/mm: Found insecure W+X mapping at address" message that is printed at boot.

Are you seeing "x86/mm: Found insecure W+X mapping at address" before hitting "BUG: unable to handle kernel NULL pointer dereference"?
I checked two different servers that have printed this BUG. I am not seeing 'Found insecure' in dmesg on either of them.
On Thu, Feb 01, 2018 at 08:02:43AM +0900, Tetsuo Handa wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=1531779
>
> It might be something related that
> "x86/mm: Found insecure W+X mapping at address" message is printed at boot.
>
> Are you seeing "x86/mm: Found insecure W+X mapping at address" before
> hitting "BUG: unable to handle kernel NULL pointer dereference" ?

There are about eight different bugs in that thread; the only commonality
I see between them is that there's a null pointer dereference somewhere
in the kernel.
ping?

On Wed, Jan 31, 2018 at 02:54:56AM -0800, Matthew Wilcox wrote:
> On Tue, Jan 30, 2018 at 11:26:42AM +1300, xen@randonwebstuff.com wrote:
> > [ 125.329163] Bad swp_entry: e000000
>
> I think the next step in debugging this is printing the PTE which gave
> us this swp_entry. If you can drop the patch I asked you to try, and
> apply this patch instead, we'll have more idea about what's going on.
>
> <snip>
4.14.15 i686 with the patch above included: Feb 9 14:31:27 cs01 kernel: Bad swp_entry: 2000000 Feb 9 14:31:27 cs01 kernel: mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000) Feb 9 15:35:19 cs01 kernel: Bad swp_entry: 2000000 Feb 9 15:35:19 cs01 kernel: mm/swap_state.c:683: bad pte eee17f38(8000000100000000) Here is a non patched trace for reference: Jan 30 18:19:14 cs01 kernel: Oops: 0000 [#1] SMP Jan 30 18:19:14 cs01 kernel: Modules linked in: iptable_mangle ip_tables bonding Jan 30 18:19:14 cs01 kernel: CPU: 5 PID: 14205 Comm: nbdserver Not tainted 4.14.15 #4 Jan 30 18:19:14 cs01 kernel: Hardware name: To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M., BIOS 4.6.3 07/26/2010 Jan 30 18:19:14 cs01 kernel: task: ed97bb80 task.stack: ef11e000 Jan 30 18:19:14 cs01 kernel: EIP: __radix_tree_lookup+0xe/0xa0 Jan 30 18:19:14 cs01 kernel: EFLAGS: 00010292 CPU: 5 Jan 30 18:19:14 cs01 kernel: EAX: 00000004 EBX: 08627000 ECX: 00000000 EDX: 00000000 Jan 30 18:19:14 cs01 kernel: ESI: 00000000 EDI: 00000004 EBP: ef11fdfc ESP: ef11fdec Jan 30 18:19:14 cs01 kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Jan 30 18:19:14 cs01 kernel: CR0: 80050033 CR2: 00000008 CR3: 2ee46980 CR4: 000006f0 Jan 30 18:19:14 cs01 kernel: Call Trace: Jan 30 18:19:14 cs01 kernel: radix_tree_lookup_slot+0x11/0x30 Jan 30 18:19:14 cs01 kernel: ? smp_call_function_single+0xa3/0xf0 Jan 30 18:19:14 cs01 kernel: find_get_entry+0x1d/0x110 Jan 30 18:19:14 cs01 kernel: pagecache_get_page+0x1f/0x240 Jan 30 18:19:14 cs01 kernel: lookup_swap_cache+0x34/0xe0 Jan 30 18:19:14 cs01 kernel: swap_readahead_detect+0x56/0x2c0 Jan 30 18:19:14 cs01 kernel: ? flush_tlb_mm_range+0x98/0xe0 Jan 30 18:19:14 cs01 kernel: do_swap_page+0x102/0x5c0 Jan 30 18:19:14 cs01 kernel: ? page_add_new_anon_rmap+0x4b/0x70 Jan 30 18:19:14 cs01 kernel: ? _raw_spin_unlock+0x8/0x10 Jan 30 18:19:14 cs01 kernel: ? wp_page_copy+0x259/0x460 Jan 30 18:19:14 cs01 kernel: ? 
kmap_atomic_prot+0x28/0xc0 Jan 30 18:19:14 cs01 kernel: handle_mm_fault+0x428/0x9b0 Jan 30 18:19:14 cs01 kernel: __do_page_fault+0x173/0x420 Jan 30 18:19:14 cs01 kernel: ? vmalloc_sync_all+0x10/0x10 Jan 30 18:19:14 cs01 kernel: do_page_fault+0xb/0x10 Jan 30 18:19:14 cs01 kernel: common_exception+0x38/0x3e Jan 30 18:19:14 cs01 kernel: EIP: 0x805d5fa Jan 30 18:19:14 cs01 kernel: EFLAGS: 00010286 CPU: 5 Jan 30 18:19:14 cs01 kernel: EAX: 00000000 EBX: 08109ff4 ECX: 00000000 EDX: 00000000 Jan 30 18:19:14 cs01 kernel: ESI: 08627550 EDI: 08622f18 EBP: b6f72368 ESP: b6f72300 Jan 30 18:19:14 cs01 kernel: DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b Jan 30 18:19:14 cs01 kernel: ? vmalloc_sync_all+0x10/0x10 Jan 30 18:19:14 cs01 kernel: Code: 90 8d 74 26 00 80 40 03 01 e9 7c ff ff ff 8d b4 26 00 00 00 00 0f 0b 8d b6 00 00 00 00 55 89 e5 57 56 53 89 c7 83 ec 04 89 4d f0 <8b> 5f 04 89 d8 83 e0 03 83 f8 01 75 67 89 d8 83 e0 fe 0f b6 08 Jan 30 18:19:14 cs01 kernel: EIP: __radix_tree_lookup+0xe/0xa0 SS:ESP: 0068:ef11fdec Jan 30 18:19:14 cs01 kernel: CR2: 0000000000000008 Jan 30 18:19:14 cs01 kernel: ---[ end trace e2125a60a6ea383d ]---
Note there is no swap space in my environment. If this is helpful I can apply and run any further patches.
I just joined the linux-mm mailing list. If someone could reply to this topic on the mailing list, I will post my findings via email. Thanks.
The oops happens here on a xen-x86-32bit-domU running kernel 4.14.19-gentoo when I try to emerge net-dns/libidn-1.33-r2. The oops leaves a javac hanging with full cpu-usage. I applied the patch in Comment 12. This lets me emerge net-dns/libidn-1.33-r2 without javac hanging. The bug can be easily reproduced by emerging libidn again. Feb 17 21:14:56 colin kernel: Bad swp_entry: 4000000 Feb 17 21:14:56 colin kernel: mm/swap_state.c:683: bad pte e2187f30(8000000200000000) Feb 17 21:14:56 colin kernel: Bad swp_entry: 4000000 Feb 17 21:14:56 colin kernel: mm/swap_state.c:683: bad pte e584bf30(8000000200000000) Feb 17 21:16:26 colin kernel: Bad swp_entry: 4000000 Feb 17 21:16:26 colin kernel: mm/swap_state.c:683: bad pte e210df30(8000000200000000) Feb 17 21:23:21 colin kernel: Bad swp_entry: 4000000 Feb 17 21:23:21 colin kernel: mm/swap_state.c:683: bad pte dd5b1f30(8000000200000000) oops (unpatched kernel): Feb 17 21:08:57 colin kernel: BUG: unable to handle kernel NULL pointer dereference at 00000008 Feb 17 21:08:57 colin kernel: IP: __radix_tree_lookup+0xe/0xb0 Feb 17 21:08:57 colin kernel: *pdpt = 0000000021a03027 *pde = 0000000000000000 Feb 17 21:08:57 colin kernel: Oops: 0000 [#1] SMP Feb 17 21:08:57 colin kernel: Modules linked in: Feb 17 21:08:57 colin kernel: CPU: 5 PID: 10914 Comm: javac Not tainted 4.14.19-gentoo #4 Feb 17 21:08:57 colin kernel: task: e9fec0c0 task.stack: e1a9a000 Feb 17 21:08:57 colin kernel: EIP: __radix_tree_lookup+0xe/0xb0 Feb 17 21:08:57 colin kernel: EFLAGS: 00010296 CPU: 5 Feb 17 21:08:57 colin kernel: EAX: 00000004 EBX: 6612d000 ECX: 00000000 EDX: 00000000 Feb 17 21:08:57 colin kernel: ESI: 00000000 EDI: 00000004 EBP: e1a9bddc ESP: e1a9bdcc Feb 17 21:08:57 colin kernel: DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069 Feb 17 21:08:57 colin kernel: CR0: 80050033 CR2: 00000008 CR3: 267e2000 CR4: 00042660 Feb 17 21:08:57 colin kernel: Call Trace: Feb 17 21:08:57 colin kernel: radix_tree_lookup_slot+0x11/0x30 Feb 17 21:08:57 colin kernel: 
find_get_entry+0x1d/0xe0 Feb 17 21:08:57 colin kernel: pagecache_get_page+0x1f/0x230 Feb 17 21:08:57 colin kernel: lookup_swap_cache+0x35/0xf0 Feb 17 21:08:57 colin kernel: swap_readahead_detect+0x4c/0x350 Feb 17 21:08:57 colin kernel: ? flush_tlb_mm_range+0x91/0xe0 Feb 17 21:08:57 colin kernel: do_swap_page+0x1ca/0x6d0 Feb 17 21:08:57 colin kernel: ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10 Feb 17 21:08:57 colin kernel: ? wp_page_copy+0x2af/0x520 Feb 17 21:08:57 colin kernel: ? xen_pmd_val+0x11/0x20 Feb 17 21:08:57 colin kernel: handle_mm_fault+0x3b8/0x940 Feb 17 21:08:57 colin kernel: __do_page_fault+0x178/0x400 Feb 17 21:08:57 colin kernel: ? vmalloc_sync_all+0x250/0x250 Feb 17 21:08:57 colin kernel: do_page_fault+0x1a/0x20 Feb 17 21:08:57 colin kernel: common_exception+0x84/0x8a Feb 17 21:08:57 colin kernel: EIP: 0xb744edf4 Feb 17 21:08:57 colin kernel: EFLAGS: 00010202 CPU: 5 Feb 17 21:08:57 colin kernel: EAX: 00000004 EBX: b77f6f28 ECX: 000284a2 EDX: 00001425 Feb 17 21:08:57 colin kernel: ESI: 6612d094 EDI: b7817148 EBP: 674ff0f8 ESP: 674ff0dc Feb 17 21:08:57 colin kernel: DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b Feb 17 21:08:57 colin kernel: Code: 90 8d 74 26 00 80 41 03 01 eb a7 66 90 0f 0b 8d b6 00 00 00 00 0f 0b 8d b6 00 00 00 00 55 89 e5 57 89 c7 56 53 83 ec 04 89 4d f0 <8b> 77 04 89 f0 83 e0 03 83 f8 01 75 75 89 f0 83 e0 fe 0f b6 08 Feb 17 21:08:57 colin kernel: EIP: __radix_tree_lookup+0xe/0xb0 SS:ESP: 0069:e1a9bdcc Feb 17 21:08:57 colin kernel: CR2: 0000000000000008 Feb 17 21:08:57 colin kernel: ---[ end trace f157259f300d3491 ]---
Additional info:

colin ~ # free -m
              total        used        free      shared  buff/cache   available
Mem:           4050         839        2430           0         780        3058
Swap:           959           0         959
colin ~ # swapon -s
Filename                                Type            Size    Used    Priority
/dev/xvda3                              partition       983036  0       -2
colin ~ #
I see this issue with an OCaml binary on a Xen 32-bit PAE Dom0. I use the patch from https://bugzilla.kernel.org/show_bug.cgi?id=198497#c12 as a band-aid to let the system keep running. The bad ptes occur inconsistently but often. The systems have a single swap partition, but I don't think swap has been used.

Kaby Lake machine:
Bad swp_entry: 80000000
mm/swap_state.c:683: bad pte cdf9df1c(8000000400000000)
Different boot:
Bad swp_entry: 80000000
mm/swap_state.c:683: bad pte ccf27f1c(8000000400000000)

Skylake:
Bad swp_entry: 80000000
mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
Different boot:
Bad swp_entry: 80000000
mm/swap_state.c:683: bad pte ccb7df1c(8000000400000000)

Ivy Bridge:
Bad swp_entry: 40000000
mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
Bad swp_entry: 40000000
mm/swap_state.c:683: bad pte c7833f1c(8000000200000000)
Different boot:
Bad swp_entry: 40000000
mm/swap_state.c:683: bad pte c223df1c(8000000200000000)
Hi,

Any update on this, please? We are also facing the same problem on a 32-bit kernel with Intel Atom on our production systems (with v4.14.2).

--
Regards
Sudip
On Fri, 9 Feb 2018 06:47:26 -0800 Matthew Wilcox <willy@infradead.org> wrote:

>
> ping?
>

There have been a bunch of updates to this issue in bugzilla
(https://bugzilla.kernel.org/show_bug.cgi?id=198497). Sigh, I don't
know what to do about this - maybe there's some way of getting bugzilla
to echo everything to linux-mm or something.

Anyway, please take a look - we appear to have a bug here. Perhaps
this bug is sufficiently gnarly for you to prepare a debugging patch
which we can add to the mainline kernel so we get (much) more debugging
info when people hit it?
On Thu, Apr 12, 2018 at 10:12:09AM -0700, Andrew Morton wrote:
> On Fri, 9 Feb 2018 06:47:26 -0800 Matthew Wilcox <willy@infradead.org> wrote:
>
> >
> > ping?
> >
>
> There have been a bunch of updates to this issue in bugzilla
> (https://bugzilla.kernel.org/show_bug.cgi?id=198497). Sigh, I don't
> know what to do about this - maybe there's some way of getting bugzilla
> to echo everything to linux-mm or something.
>
> Anyway, please take a look - we appear to have a bug here. Perhaps
> this bug is sufficiently gnarly for you to prepare a debugging patch
> which we can add to the mainline kernel so we get (much) more debugging
> info when people hit it?

I have a few thoughts ...

 - The debugging patch I prepared appears to be doing its job well.
   People get the message and their machine stays working.
 - The commonality appears to be Xen running 32-bit kernels. Maybe we
   can kick the problem over to them to solve?
 - If we are seeing corruption purely in the lower bits, *we'll never
   know*. The radix tree lookup will simply not find anything, and all
   will be well. That said, the bad PTE values reported in that bug have
   the NX bit and one other bit set; generally bit 32, 33 or 34. I have
   an idea for adding a parity bit, but haven't had time to implement it.
   Anyone have an intern who wants an interesting kernel project to work on?

Given that this is happening on Xen, I wonder if Xen is using some of the
bits in the page table for its own purposes.
(In reply to willy from comment #25)
> On Thu, Apr 12, 2018 at 10:12:09AM -0700, Andrew Morton wrote:
> > On Fri, 9 Feb 2018 06:47:26 -0800 Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > ping?
> > >
>
> <snip>
>
> Given that this is happening on Xen, I wonder if Xen is using some of the
> bits in the page table for its own purposes.

My system is not Xen. We do not have any type of virtualisation.
On Thu, Apr 12, 2018 at 1:28 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=198497
>
> --- Comment #25 from willy@infradead.org ---
> <snip>
>
> Given that this is happening on Xen, I wonder if Xen is using some of the
> bits in the page table for its own purposes.

The backtraces include do_swap_page(). While I have a swap partition
configured, I don't think it's being used. Are we somehow misidentifying
the page as a swap page? I'm not familiar with the code, but is there an
easy way to query global swap usage? That way we can see if the check
for a swap page is bogus.

My system works with the band-aid patch. When that patch sets page = NULL,
does that mean userspace is just going to get a zero-ed page? Userspace
still works AFAICT, which makes me think it is a mis-identified page to
start with.

Regards,
Jason
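On the question of querying global swap usage: from userspace, one option on Linux is sysinfo(2), which reports total and free swap. The sketch below is illustrative only; swap_used_bytes and query_swap_used are hypothetical helper names, and it says nothing about how the kernel decides whether a given pte is a swap entry.

```c
#include <stdint.h>
#include <sys/sysinfo.h>

/* Swap in use, in bytes, given totals counted in units of 'unit' bytes. */
static uint64_t swap_used_bytes(uint64_t total, uint64_t free_units,
				uint64_t unit)
{
	return (total - free_units) * unit;
}

/*
 * Ask the running kernel for its swap totals via sysinfo(2).
 * Returns 0 and stores the result in *used on success, -1 on failure.
 */
static int query_swap_used(uint64_t *used)
{
	struct sysinfo si;

	if (sysinfo(&si) != 0)
		return -1;
	*used = swap_used_bytes(si.totalswap, si.freeswap, si.mem_unit);
	return 0;
}
```

A result of 0 from query_swap_used() on an affected machine would be consistent with the reports that swap is configured but unused.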
On Fri, Apr 20, 2018 at 09:10:11AM -0400, Jason Andryuk wrote:
> > Given that this is happening on Xen, I wonder if Xen is using some of the
> > bits in the page table for its own purposes.
>
> The backtraces include do_swap_page(). While I have a swap partition
> configured, I don't think it's being used. Are we somehow
> misidentifying the page as a swap page? I'm not familiar with the
> code, but is there an easy way to query global swap usage? That way
> we can see if the check for a swap page is bogus.
>
> My system works with the band-aid patch. When that patch sets page =
> NULL, does that mean userspace is just going to get a zero-ed page?
> Userspace still works AFAICT, which makes me think it is a
> mis-identified page to start with.

Here's how this code works.

When we swap out an anonymous page (a page which is not backed by a
file; could be from a MAP_PRIVATE mapping, could be brk()), we write it
to the swap cache. In order to be able to find it again, we store a
cookie (called a swp_entry_t) in the process' page table (marked with
the 'present' bit clear, so the CPU will fault on it). When we get a
fault, we look up the cookie in a radix tree and bring that page back
in from swap.

If there's no page found in the radix tree, we put a freshly zeroed
page into the process's address space. That's because we won't find
a page in the swap cache's radix tree for the first time we fault.
It's not an indication of a bug if there's no page to be found.

What we're seeing for this bug is page table entries of the format
0x8000'0004'0000'0000. That would be a zeroed entry, except for the
fact that something's stepped on the upper bits.

What is worrying is that potentially Xen might be stepping on the upper
bits of either a present entry (leading to the process loading a page
that belongs to someone else) or an entry which has been swapped out,
leading to the process getting a zeroed page when it should be getting
its page back from swap.

Defending against this kind of corruption would take adding a parity
bit to the page tables. That's not a project I have time for right now.
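To make the 0x8000'0004'0000'0000 pattern concrete, here is a small userspace sketch that lists which bits are set in a 64-bit PAE page table entry. The bit positions used (bit 0 for 'present', bit 63 for NX) are the standard x86 ones; the helper names pte_bit and pte_set_bits are invented for this example.

```c
#include <stdint.h>

#define PTE_PRESENT_BIT 0  /* x86: entry maps a page the CPU may use */
#define PTE_NX_BIT      63 /* x86 PAE/64-bit: no-execute */

/* Return 1 if bit 'n' (0..63) of a 64-bit page table entry is set. */
static int pte_bit(uint64_t pte, unsigned int n)
{
	return (int)((pte >> n) & 1);
}

/*
 * Store the indices of all set bits in out[], lowest first, and
 * return how many there are.
 */
static unsigned int pte_set_bits(uint64_t pte, unsigned int out[64])
{
	unsigned int n, count = 0;

	for (n = 0; n < 64; n++)
		if (pte_bit(pte, n))
			out[count++] = n;
	return count;
}
```

For 0x8000000400000000 this reports bits 34 and 63 only: the present bit is clear and the whole low word is zero, so apart from NX and one stray high bit the entry is empty. The other reported values, 0x8000000200000000 and 0x8000000100000000, differ only in the stray bit (33 and 32).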
Adding xen-devel and the Linux Xen maintainers.

Summary: Some Xen users (and maybe others) are hitting a BUG in
__radix_tree_lookup() under do_swap_page() - example backtrace is
provided at the end. Matthew Wilcox provided a band-aid patch that
prints errors like the following instead of triggering the bug.

Skylake 32bit PAE Dom0:
Bad swp_entry: 80000000
mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)

Ivy Bridge 32bit PAE Dom0:
Bad swp_entry: 40000000
mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)

Other 32bit DomU:
Bad swp_entry: 4000000
mm/swap_state.c:683: bad pte e2187f30(8000000200000000)

Other 32bit:
Bad swp_entry: 2000000
mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)

The Linux bugzilla has more info
https://bugzilla.kernel.org/show_bug.cgi?id=198497

This may not be exclusive to Xen Linux, but most of the reports are on
Xen. Matthew wonders if Xen might be stepping on the upper bits of a
pte.

On Fri, Apr 20, 2018 at 9:39 AM, Matthew Wilcox <willy@infradead.org> wrote:
> On Fri, Apr 20, 2018 at 09:10:11AM -0400, Jason Andryuk wrote:
>> > Given that this is happening on Xen, I wonder if Xen is using some of the
>> > bits in the page table for its own purposes.
>>
>> The backtraces include do_swap_page(). While I have a swap partition
>> configured, I don't think it's being used. Are we somehow
>> misidentifying the page as a swap page? I'm not familiar with the
>> code, but is there an easy way to query global swap usage? That way
>> we can see if the check for a swap page is bogus.
>>
>> My system works with the band-aid patch. When that patch sets page =
>> NULL, does that mean userspace is just going to get a zero-ed page?
>> Userspace still works AFAICT, which makes me think it is a
>> mis-identified page to start with.
>
> Here's how this code works.

Thanks for the description.

> When we swap out an anonymous page (a page which is not backed by a
> file; could be from a MAP_PRIVATE mapping, could be brk()), we write it
> to the swap cache. In order to be able to find it again, we store a
> cookie (called a swp_entry_t) in the process' page table (marked with
> the 'present' bit clear, so the CPU will fault on it). When we get a
> fault, we look up the cookie in a radix tree and bring that page back
> in from swap.
>
> If there's no page found in the radix tree, we put a freshly zeroed
> page into the process's address space. That's because we won't find
> a page in the swap cache's radix tree for the first time we fault.
> It's not an indication of a bug if there's no page to be found.

Is "no page found" the case for a lazy, un-allocated MAP_ANONYMOUS page?

> What we're seeing for this bug is page table entries of the format
> 0x8000'0004'0000'0000. That would be a zeroed entry, except for the
> fact that something's stepped on the upper bits.

Does a totally zero-ed entry correspond to an un-allocated MAP_ANONYMOUS page?

> What is worrying is that potentially Xen might be stepping on the upper
> bits of either a present entry (leading to the process loading a page
> that belongs to someone else) or an entry which has been swapped out,
> leading to the process getting a zeroed page when it should be getting
> its page back from swap.

There was at least one report of non-Xen 32bit being affected. There
was no backtrace, so it could be something else. One report doesn't
have any swap configured.

> Defending against this kind of corruption would take adding a parity
> bit to the page tables. That's not a project I have time for right now.

Understood. Thanks for the response.

Regards, Jason [ 2234.939079] BUG: unable to handle kernel NULL pointer dereference at 00000008 [ 2234.942154] IP: __radix_tree_lookup+0xe/0xa0 [ 2234.945176] *pdpt = 0000000008cd5027 *pde = 0000000000000000 [ 2234.948382] Oops: 0000 [#1] SMP [ 2234.951410] Modules linked in: hp_wmi sparse_keymap rfkill wmi_bmof pcspkr i915 wmi hp_accel lis3lv02d input_polldev drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm hp_wireless i2c_algo_bit hid_multitouch sha256_generic xen_netfront v4v(O) psmouse ecb xts hid_generic xhci_pci xhci_hcd ohci_pci ohci_hcd uhci_hcd ehci_pci ehci_hcd usbhid hid tpm_tis tpm_tis_core tpm [ 2234.960816] CPU: 1 PID: 2338 Comm: xenvm Tainted: G O 4.14.18 #1 [ 2234.963991] Hardware name: Hewlett-Packard HP EliteBook Folio 9470m/18DF, BIOS 68IBD Ver. F.40 02/01/2013 [ 2234.967186] task: d4370980 task.stack: cf8e8000 [ 2234.970351] EIP: __radix_tree_lookup+0xe/0xa0 [ 2234.973520] EFLAGS: 00010286 CPU: 1 [ 2234.976699] EAX: 00000004 EBX: b5900000 ECX: 00000000 EDX: 00000000 [ 2234.979887] ESI: 00000000 EDI: 00000004 EBP: cf8e9dd0 ESP: cf8e9dc0 [ 2234.983081] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069 [ 2234.986233] CR0: 80050033 CR2: 00000008 CR3: 08f12000 CR4: 00042660 [ 2234.989340] Call Trace: [ 2234.992354] radix_tree_lookup_slot+0x1d/0x50 [ 2234.995341] ? xen_irq_disable_direct+0xc/0xc [ 2234.998288] find_get_entry+0x1d/0x110 [ 2235.001140] pagecache_get_page+0x1f/0x240 [ 2235.003948] ? xen_flush_tlb_others+0x17b/0x260 [ 2235.006784] lookup_swap_cache+0x32/0xe0 [ 2235.009632] swap_readahead_detect+0x67/0x2c0 [ 2235.012447] do_swap_page+0x10a/0x750 [ 2235.015270] ? wp_page_copy+0x2c4/0x590 [ 2235.018043] ? xen_pmd_val+0x11/0x20 [ 2235.020729] handle_mm_fault+0x3f8/0x970 [ 2235.023352] ? xen_smp_send_reschedule+0xa/0x10 [ 2235.025927] ? resched_curr+0x68/0xc0 [ 2235.028444] __do_page_fault+0x1a7/0x480 [ 2235.030883] do_page_fault+0x33/0x110 [ 2235.033250] ? do_fast_syscall_32+0xb3/0x200 [ 2235.035567] ? 
vmalloc_sync_all+0x290/0x290 [ 2235.037828] common_exception+0x84/0x8a [ 2235.040011] EIP: 0xb7c8ddea [ 2235.042111] EFLAGS: 00010202 CPU: 1 [ 2235.044153] EAX: b7dd38d0 EBX: b7dd2780 ECX: b7dd2000 EDX: b5900010 [ 2235.046176] ESI: 00000000 EDI: b7dd38f0 EBP: b56ff124 ESP: b56ff070 [ 2235.048152] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b [ 2235.050053] Code: 42 14 29 c6 89 f0 c1 f8 02 e9 71 ff ff ff e8 aa 81 aa ff 8d 76 00 8d bc 27 00 00 00 00 55 89 e5 57 89 c7 56 53 83 ec 04 89 4d f0 <8b> 5f 04 89 d8 83 e0 03 83 f8 01 75 67 89 d8 83 e0 fe 0f b6 08 [ 2235.053998] EIP: __radix_tree_lookup+0xe/0xa0 SS:ESP: 0069:cf8e9dc0 [ 2235.055895] CR2: 0000000000000008
On 20/04/18 16:20, Jason Andryuk wrote:
> Adding xen-devel and the Linux Xen maintainers.
>
> Summary: Some Xen users (and maybe others) are hitting a BUG in
> __radix_tree_lookup() under do_swap_page() - example backtrace is
> provided at the end. Matthew Wilcox provided a band-aid patch that
> prints errors like the following instead of triggering the bug.
>
> Skylake 32bit PAE Dom0:
> Bad swp_entry: 80000000
> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
>
> Ivy Bridge 32bit PAE Dom0:
> Bad swp_entry: 40000000
> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
>
> Other 32bit DomU:
> Bad swp_entry: 4000000
> mm/swap_state.c:683: bad pte e2187f30(8000000200000000)
>
> Other 32bit:
> Bad swp_entry: 2000000
> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)
>
> The Linux bugzilla has more info
> https://bugzilla.kernel.org/show_bug.cgi?id=198497
>
> This may not be exclusive to Xen Linux, but most of the reports are on
> Xen. Matthew wonders if Xen might be stepping on the upper bits of a
> pte.

Yes - Xen does use the upper bits of a PTE, but only 1 in release
builds, and a second in debug builds. I don't understand where you're
getting the 3rd bit in there.

The use of these bits is dubious, and not adequately described in the
ABI, and attempts to improve the state of play have come to nothing in
the past.

~Andrew
On 20/04/18 16:25, Andrew Cooper wrote: > On 20/04/18 16:20, Jason Andryuk wrote: >> Adding xen-devel and the Linux Xen maintainers. >> >> Summary: Some Xen users (and maybe others) are hitting a BUG in >> __radix_tree_lookup() under do_swap_page() - example backtrace is >> provided at the end. Matthew Wilcox provided a band-aid patch that >> prints errors like the following instead of triggering the bug. >> >> Skylake 32bit PAE Dom0: >> Bad swp_entry: 80000000 >> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000) >> >> Ivy Bridge 32bit PAE Dom0: >> Bad swp_entry: 40000000 >> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000) >> >> Other 32bit DomU: >> Bad swp_entry: 4000000 >> mm/swap_state.c:683: bad pte e2187f30(8000000200000000) >> >> Other 32bit: >> Bad swp_entry: 2000000 >> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000) >> >> The Linux bugzilla has more info >> https://bugzilla.kernel.org/show_bug.cgi?id=198497 >> >> This may not be exclusive to Xen Linux, but most of the reports are on >> Xen. Matthew wonders if Xen might be stepping on the upper bits of a >> pte. > Yes - Xen does use the upper bits of a PTE, but only 1 in release > builds, and a second in debug builds. I don't understand where you're > getting the 3rd bit in there. > > The use of these bits are dubious, and not adequately described in the > ABI, and attempts to improve the state of play has come to nothing in > the past. Sorry - hit send too early. To be rather more helpful: For 64bit guests only, we use one bit to distinguish between guest kernel and guest user pages. This is because both guest user and guest kernel run in ring3, and both have to have _PAGE_USER set on them. We use bit 52 to tag guest kernel mappings, which is seeded from the guest kernel's choice of _PAGE_USER. In debug builds of the hypervisor only, we use bit 62 to tag grant mappings. This is to help spot API errors in the guest, and results in an instant crash if we spot misuse. ~Andrew
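The two tag bits described above can be checked against the reported PTE values with a few lines of Python. This is only an illustrative sketch: the constant names are hypothetical labels chosen here (not identifiers from Xen or Linux), and the bit positions are simply the ones stated in the message above.

```python
# Bit positions taken from the description above. The constant names are
# hypothetical labels for this sketch, not identifiers from Xen or Linux.
PRESENT_BIT = 1 << 0        # architectural x86 "present" bit
GUEST_KERNEL_TAG = 1 << 52  # guest kernel mappings, 64-bit guests only
GRANT_TAG = 1 << 62         # grant mappings, debug hypervisor builds only

# Values copied from the "bad pte" log lines quoted in this thread.
REPORTED_PTES = [0x8000000400000000, 0x8000000200000000, 0x8000000100000000]


def xen_tag_bits(pte):
    """Report which of the bits discussed above are set in a 64-bit PTE."""
    return {
        "present": bool(pte & PRESENT_BIT),
        "bit52": bool(pte & GUEST_KERNEL_TAG),
        "bit62": bool(pte & GRANT_TAG),
    }


for pte in REPORTED_PTES:
    print(f"{pte:#018x}", xen_tag_bits(pte))
```

None of the three reported values has bit 0, bit 52, or bit 62 set, so on this evidence alone the stray bits do not look like either tag.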
On Fri, Apr 20, 2018 at 11:42 AM, Jan Beulich <JBeulich@suse.com> wrote: >>>> On 20.04.18 at 17:25, <andrew.cooper3@citrix.com> wrote: >> On 20/04/18 16:20, Jason Andryuk wrote: >>> Adding xen-devel and the Linux Xen maintainers. >>> >>> Summary: Some Xen users (and maybe others) are hitting a BUG in >>> __radix_tree_lookup() under do_swap_page() - example backtrace is >>> provided at the end. Matthew Wilcox provided a band-aid patch that >>> prints errors like the following instead of triggering the bug. >>> >>> Skylake 32bit PAE Dom0: >>> Bad swp_entry: 80000000 >>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000) >>> >>> Ivy Bridge 32bit PAE Dom0: >>> Bad swp_entry: 40000000 >>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000) >>> >>> Other 32bit DomU: >>> Bad swp_entry: 4000000 >>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000) >>> >>> Other 32bit: >>> Bad swp_entry: 2000000 >>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000) >>> >>> The Linux bugzilla has more info >>> https://bugzilla.kernel.org/show_bug.cgi?id=198497 >>> >>> This may not be exclusive to Xen Linux, but most of the reports are on >>> Xen. Matthew wonders if Xen might be stepping on the upper bits of a >>> pte. >> >> Yes - Xen does use the upper bits of a PTE, but only 1 in release >> builds, and a second in debug builds. I don't understand where you're >> getting the 3rd bit in there. > > The former supposedly is _PAGE_GUEST_KERNEL, which we use for 64-bit > guests only. Above talk is of 32-bit guests only. > > In addition both this and _PAGE_GNTTAB are used on present PTEs only, > while above talk is about swap entries. This hits a BUG going through do_swap_page, but it seems like users don't think they are actually using swap at the time. One reporter didn't have any swap configured. Some of this information was further down in my original message. I'm wondering if somehow we have a PTE that should be empty and should be lazily filled. 
For some reason, the entry has some bits set and is causing the trouble. Would Xen mess with the PTEs in that case? Thanks, Jason
On 20/04/18 16:52, Jason Andryuk wrote: > On Fri, Apr 20, 2018 at 11:42 AM, Jan Beulich <JBeulich@suse.com> wrote: >>>>> On 20.04.18 at 17:25, <andrew.cooper3@citrix.com> wrote: >>> On 20/04/18 16:20, Jason Andryuk wrote: >>>> Adding xen-devel and the Linux Xen maintainers. >>>> >>>> Summary: Some Xen users (and maybe others) are hitting a BUG in >>>> __radix_tree_lookup() under do_swap_page() - example backtrace is >>>> provided at the end. Matthew Wilcox provided a band-aid patch that >>>> prints errors like the following instead of triggering the bug. >>>> >>>> Skylake 32bit PAE Dom0: >>>> Bad swp_entry: 80000000 >>>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000) >>>> >>>> Ivy Bridge 32bit PAE Dom0: >>>> Bad swp_entry: 40000000 >>>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000) >>>> >>>> Other 32bit DomU: >>>> Bad swp_entry: 4000000 >>>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000) >>>> >>>> Other 32bit: >>>> Bad swp_entry: 2000000 >>>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000) >>>> >>>> The Linux bugzilla has more info >>>> https://bugzilla.kernel.org/show_bug.cgi?id=198497 >>>> >>>> This may not be exclusive to Xen Linux, but most of the reports are on >>>> Xen. Matthew wonders if Xen might be stepping on the upper bits of a >>>> pte. >>> Yes - Xen does use the upper bits of a PTE, but only 1 in release >>> builds, and a second in debug builds. I don't understand where you're >>> getting the 3rd bit in there. >> The former supposedly is _PAGE_GUEST_KERNEL, which we use for 64-bit >> guests only. Above talk is of 32-bit guests only. >> >> In addition both this and _PAGE_GNTTAB are used on present PTEs only, >> while above talk is about swap entries. > This hits a BUG going through do_swap_page, but it seems like users > don't think they are actually using swap at the time. One reporter > didn't have any swap configured. Some of this information was further > down in my original message. 
> > I'm wondering if somehow we have a PTE that should be empty and should > be lazily filled. For some reason, the entry has some bits set and is > causing the trouble. Would Xen mess with the PTEs in that case? Any PTE with the present bit clear will be accepted and used unmodified. That said, I believe there is some batching of updates for efficiency reasons in the PVops layer of the kernel, which might end up causing a disconnect between what the swap system thinks and what the actual PTEs show when read. ~Andrew
My trace is up in comment 17. My kernel is 32-bit PAE, not Xen, with no swap enabled. # uname -a Linux is1 4.14.32 #4 SMP Fri Apr 6 16:35:18 PDT 2018 i686 i686 i386 GNU/Linux # cat /proc/swaps Filename Type Size Used Priority Apr 17 14:43:57 is1 kernel: Bad swp_entry: 2000000 Apr 17 14:43:57 is1 kernel: mm/swap_state.c:683: bad pte ed3e3f38(8000000100000000) Also, Sudip says in comment #26 that his is not a Xen system.
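The "no swap enabled" check above (a `/proc/swaps` listing with only the header row) can also be done programmatically. The following is a minimal sketch, assuming the usual five-column `/proc/swaps` layout; the function name `parse_proc_swaps` and the sample row are made up for illustration.

```python
def parse_proc_swaps(text):
    """Parse the contents of /proc/swaps into a list of dicts.

    The first non-empty line is the column header; every later line
    describes one active swap area. Sizes are reported in KiB.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    entries = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip anything that is not a full five-column row
        name, kind, size_kib, used_kib, priority = fields[:5]
        entries.append({
            "name": name,
            "type": kind,
            "size_kib": int(size_kib),
            "used_kib": int(used_kib),
            "priority": int(priority),
        })
    return entries


# Illustrative input only: a header row plus one made-up swap partition.
sample = "Filename Type Size Used Priority\n/dev/xvda9 partition 131068 0 -2\n"
for entry in parse_proc_swaps(sample):
    print(entry["name"], entry["used_kib"], "KiB used")
```

On a live system the input would come from `open("/proc/swaps").read()`. An empty result means no swap area is active, and `used_kib == 0` on every entry means swap is configured but currently unused.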
>>> On 20.04.18 at 17:52, <jandryuk@gmail.com> wrote: > On Fri, Apr 20, 2018 at 11:42 AM, Jan Beulich <JBeulich@suse.com> wrote: >>>>> On 20.04.18 at 17:25, <andrew.cooper3@citrix.com> wrote: >>> On 20/04/18 16:20, Jason Andryuk wrote: >>>> Adding xen-devel and the Linux Xen maintainers. >>>> >>>> Summary: Some Xen users (and maybe others) are hitting a BUG in >>>> __radix_tree_lookup() under do_swap_page() - example backtrace is >>>> provided at the end. Matthew Wilcox provided a band-aid patch that >>>> prints errors like the following instead of triggering the bug. >>>> >>>> Skylake 32bit PAE Dom0: >>>> Bad swp_entry: 80000000 >>>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000) >>>> >>>> Ivy Bridge 32bit PAE Dom0: >>>> Bad swp_entry: 40000000 >>>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000) >>>> >>>> Other 32bit DomU: >>>> Bad swp_entry: 4000000 >>>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000) >>>> >>>> Other 32bit: >>>> Bad swp_entry: 2000000 >>>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000) >>>> >>>> The Linux bugzilla has more info >>>> https://bugzilla.kernel.org/show_bug.cgi?id=198497 >>>> >>>> This may not be exclusive to Xen Linux, but most of the reports are on >>>> Xen. Matthew wonders if Xen might be stepping on the upper bits of a >>>> pte. >>> >>> Yes - Xen does use the upper bits of a PTE, but only 1 in release >>> builds, and a second in debug builds. I don't understand where you're >>> getting the 3rd bit in there. >> >> The former supposedly is _PAGE_GUEST_KERNEL, which we use for 64-bit >> guests only. Above talk is of 32-bit guests only. >> >> In addition both this and _PAGE_GNTTAB are used on present PTEs only, >> while above talk is about swap entries. > > This hits a BUG going through do_swap_page, but it seems like users > don't think they are actually using swap at the time. One reporter > didn't have any swap configured. Some of this information was further > down in my original message. 
> > I'm wondering if somehow we have a PTE that should be empty and should > be lazily filled. For some reason, the entry has some bits set and is > causing the trouble. Would Xen mess with the PTEs in that case? As said in my previous reply - both of the bits Andrew has mentioned can only ever be set when the present bit is also set (which doesn't appear to be the case here). The set bits above are actually in the range of bits designated to the address, which Xen wouldn't ever play with. Jan
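To make the bit positions concrete, here is a small decoding sketch for the reported values. It assumes the standard x86 PAE layout of a 64-bit PTE (bit 0 present, bits 12 to 51 physical address, bit 63 NX); the helper name `describe_pte` is hypothetical.

```python
def describe_pte(pte):
    """Split a 64-bit PAE PTE into the fields relevant to this thread."""
    set_bits = [bit for bit in range(64) if (pte >> bit) & 1]
    return {
        "low": pte & 0xFFFFFFFF,           # 32-bit half written as pte_low
        "high": pte >> 32,                 # 32-bit half written as pte_high
        "present": bool(pte & 1),          # bit 0
        "nx": bool((pte >> 63) & 1),       # bit 63, no-execute
        "address_bits": [b for b in set_bits if 12 <= b <= 51],
    }


# Values copied from the "bad pte" log lines quoted in this thread.
for pte in (0x8000000400000000, 0x8000000200000000, 0x8000000100000000):
    info = describe_pte(pte)
    print(f"{pte:#018x} low={info['low']:#010x} high={info['high']:#010x} "
          f"present={info['present']} nx={info['nx']} "
          f"address_bits={info['address_bits']}")
```

Each value has an all-zero low half, the NX bit, and exactly one bit in the address range (bit 34, 33, or 32), which is consistent with the observation above that the stray bits sit in the address field and that the present bit is clear.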
>>> On 20.04.18 at 17:25, <andrew.cooper3@citrix.com> wrote: > On 20/04/18 16:20, Jason Andryuk wrote: >> Adding xen-devel and the Linux Xen maintainers. >> >> Summary: Some Xen users (and maybe others) are hitting a BUG in >> __radix_tree_lookup() under do_swap_page() - example backtrace is >> provided at the end. Matthew Wilcox provided a band-aid patch that >> prints errors like the following instead of triggering the bug. >> >> Skylake 32bit PAE Dom0: >> Bad swp_entry: 80000000 >> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000) >> >> Ivy Bridge 32bit PAE Dom0: >> Bad swp_entry: 40000000 >> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000) >> >> Other 32bit DomU: >> Bad swp_entry: 4000000 >> mm/swap_state.c:683: bad pte e2187f30(8000000200000000) >> >> Other 32bit: >> Bad swp_entry: 2000000 >> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000) >> >> The Linux bugzilla has more info >> https://bugzilla.kernel.org/show_bug.cgi?id=198497 >> >> This may not be exclusive to Xen Linux, but most of the reports are on >> Xen. Matthew wonders if Xen might be stepping on the upper bits of a >> pte. > > Yes - Xen does use the upper bits of a PTE, but only 1 in release > builds, and a second in debug builds. I don't understand where you're > getting the 3rd bit in there. The former supposedly is _PAGE_GUEST_KERNEL, which we use for 64-bit guests only. Above talk is of 32-bit guests only. In addition both this and _PAGE_GNTTAB are used on present PTEs only, while above talk is about swap entries. Jan
On 04/20/2018 12:02 PM, Jan Beulich wrote: >>>> On 20.04.18 at 17:52, <jandryuk@gmail.com> wrote: >> On Fri, Apr 20, 2018 at 11:42 AM, Jan Beulich <JBeulich@suse.com> wrote: >>>>>> On 20.04.18 at 17:25, <andrew.cooper3@citrix.com> wrote: >>>> On 20/04/18 16:20, Jason Andryuk wrote: >>>>> Adding xen-devel and the Linux Xen maintainers. >>>>> >>>>> Summary: Some Xen users (and maybe others) are hitting a BUG in >>>>> __radix_tree_lookup() under do_swap_page() - example backtrace is >>>>> provided at the end. Matthew Wilcox provided a band-aid patch that >>>>> prints errors like the following instead of triggering the bug. >>>>> >>>>> Skylake 32bit PAE Dom0: >>>>> Bad swp_entry: 80000000 >>>>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000) >>>>> >>>>> Ivy Bridge 32bit PAE Dom0: >>>>> Bad swp_entry: 40000000 >>>>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000) >>>>> >>>>> Other 32bit DomU: >>>>> Bad swp_entry: 4000000 >>>>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000) >>>>> >>>>> Other 32bit: >>>>> Bad swp_entry: 2000000 >>>>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000) >>>>> >>>>> The Linux bugzilla has more info >>>>> https://bugzilla.kernel.org/show_bug.cgi?id=198497 >>>>> >>>>> This may not be exclusive to Xen Linux, but most of the reports are on >>>>> Xen. Matthew wonders if Xen might be stepping on the upper bits of a >>>>> pte. >>>> Yes - Xen does use the upper bits of a PTE, but only 1 in release >>>> builds, and a second in debug builds. I don't understand where you're >>>> getting the 3rd bit in there. >>> The former supposedly is _PAGE_GUEST_KERNEL, which we use for 64-bit >>> guests only. Above talk is of 32-bit guests only. >>> >>> In addition both this and _PAGE_GNTTAB are used on present PTEs only, >>> while above talk is about swap entries. >> This hits a BUG going through do_swap_page, but it seems like users >> don't think they are actually using swap at the time. 
One reporter >> didn't have any swap configured. Some of this information was further >> down in my original message. >> >> I'm wondering if somehow we have a PTE that should be empty and should >> be lazily filled. For some reason, the entry has some bits set and is >> causing the trouble. Would Xen mess with the PTEs in that case? > As said in my previous reply - both of the bits Andrew has mentioned can > only ever be set when the present bit is also set (which doesn't appear to > be the case here). The set bits above are actually in the range of bits > designated to the address, which Xen wouldn't ever play with. The bug description starts with: "On a Xen VM running as pvh" So is this a PV or a PVH guest? -boris
(In reply to David Strand from comment #34) > My trace is up in comment 17. My kernel is 32 bit PAE, Not Xen, no swap > enabled. > > # uname -a > Linux is1 4.14.32 #4 SMP Fri Apr 6 16:35:18 PDT 2018 i686 i686 i386 GNU/Linux > > # cat /proc/swaps > Filename Type Size Used > Priority > > Apr 17 14:43:57 is1 kernel: Bad swp_entry: 2000000 > Apr 17 14:43:57 is1 kernel: mm/swap_state.c:683: bad pte > ed3e3f38(8000000100000000) > > Also Sudip on comment #26 says his is not a xen system. Should we then open a new bug report with the title non-xen systems and attach the same traces? :)
On 20/04/18 21:20, Boris Ostrovsky wrote: > On 04/20/2018 12:02 PM, Jan Beulich wrote: >>>>> On 20.04.18 at 17:52, <jandryuk@gmail.com> wrote: >>> On Fri, Apr 20, 2018 at 11:42 AM, Jan Beulich <JBeulich@suse.com> wrote: >>>>>>> On 20.04.18 at 17:25, <andrew.cooper3@citrix.com> wrote: >>>>> On 20/04/18 16:20, Jason Andryuk wrote: >>>>>> Adding xen-devel and the Linux Xen maintainers. >>>>>> >>>>>> Summary: Some Xen users (and maybe others) are hitting a BUG in >>>>>> __radix_tree_lookup() under do_swap_page() - example backtrace is >>>>>> provided at the end. Matthew Wilcox provided a band-aid patch that >>>>>> prints errors like the following instead of triggering the bug. >>>>>> >>>>>> Skylake 32bit PAE Dom0: >>>>>> Bad swp_entry: 80000000 >>>>>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000) >>>>>> >>>>>> Ivy Bridge 32bit PAE Dom0: >>>>>> Bad swp_entry: 40000000 >>>>>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000) >>>>>> >>>>>> Other 32bit DomU: >>>>>> Bad swp_entry: 4000000 >>>>>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000) >>>>>> >>>>>> Other 32bit: >>>>>> Bad swp_entry: 2000000 >>>>>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000) >>>>>> >>>>>> The Linux bugzilla has more info >>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=198497 >>>>>> >>>>>> This may not be exclusive to Xen Linux, but most of the reports are on >>>>>> Xen. Matthew wonders if Xen might be stepping on the upper bits of a >>>>>> pte. >>>>> Yes - Xen does use the upper bits of a PTE, but only 1 in release >>>>> builds, and a second in debug builds. I don't understand where you're >>>>> getting the 3rd bit in there. >>>> The former supposedly is _PAGE_GUEST_KERNEL, which we use for 64-bit >>>> guests only. Above talk is of 32-bit guests only. >>>> >>>> In addition both this and _PAGE_GNTTAB are used on present PTEs only, >>>> while above talk is about swap entries. 
>>> This hits a BUG going through do_swap_page, but it seems like users >>> don't think they are actually using swap at the time. One reporter >>> didn't have any swap configured. Some of this information was further >>> down in my original message. >>> >>> I'm wondering if somehow we have a PTE that should be empty and should >>> be lazily filled. For some reason, the entry has some bits set and is >>> causing the trouble. Would Xen mess with the PTEs in that case? >> As said in my previous reply - both of the bits Andrew has mentioned can >> only ever be set when the present bit is also set (which doesn't appear to >> be the case here). The set bits above are actually in the range of bits >> designated to the address, which Xen wouldn't ever play with. > > > The bug description starts with: "On a Xen VM running as pvh" > > So is this a PV or a PVH guest? The stack backtrace suggests PV. Juergen
I am really sorry. I mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=198497#c26 that I have the same trace on non-Xen hardware, but I did not post my trace since it looked the same. [ 154.782681] BUG: unable to handle kernel NULL pointer dereference at 00000008 [ 155.249221] task: ec144a40 task.stack: ec178000 [ 155.269253] EIP: radix_tree_load_root+0x4/0x38 [ 155.274214] EFLAGS: 00010296 CPU: 1 [ 155.278107] EAX: 00000004 EBX: 00000008 ECX: ec179ddc EDX: ec179dd8 [ 155.285108] ESI: 00000000 EDI: 00000000 EBP: ec179dcc ESP: ec179dc8 [ 155.292110] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 [ 155.298140] CR0: 80050033 CR2: 00000008 CR3: 3458bee0 CR4: 001006f0 [ 155.305133] Call Trace: [ 155.307866] __radix_tree_lookup+0x21/0x78 [ 155.312442] radix_tree_lookup_slot+0xf/0x1d [ 155.317213] find_get_entry+0x20/0xb6 [ 155.321303] pagecache_get_page+0x1f/0x1a7 [ 155.325880] lookup_swap_cache+0x30/0xc0 [ 155.330260] swap_readahead_detect+0x66/0x20e [ 155.335127] do_swap_page+0x2c/0x614 [ 155.339120] ? page_add_file_rmap+0x3e/0x67 [ 155.343816] ? alloc_set_pte+0x206/0x269 [ 155.348196] ? filemap_map_pages+0x55/0x260 [ 155.352869] handle_mm_fault+0x874/0x9b0 [ 155.357253] __do_page_fault+0x1ce/0x376 [ 155.361634] ? vmalloc_sync_all+0x5/0x5 [ 155.365946] do_page_fault+0xb/0xd [ 155.369744] common_exception+0x62/0x6a [ 155.374027] EIP: 0xb78c527f [ 155.377142] EFLAGS: 00010286 CPU: 1 [ 155.381044] EAX: b3546114 EBX: b78dc644 ECX: 00000001 EDX: b34730d4 [ 155.388038] ESI: 09ced960 EDI: 00006e67 EBP: b22ffd08 ESP: b22ffc90 [ 155.395041] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b [ 155.401074] ? 
vmalloc_sync_all+0x5/0x5 [ 155.405355] Code: 42 08 00 00 00 00 89 e5 40 89 02 89 42 04 31 c0 5d c3 55 8d 4a 1a 89 e5 53 bb 01 00 00 00 d3 e3 23 18 89 d8 5b 5d c3 55 89 e5 53 <8b> 40 04 89 cb 89 02 89 c2 83 e2 03 4a 75 1a 83 e0 fe 0f b6 08 [ 155.426416] EIP: radix_tree_load_root+0x4/0x38 SS:ESP: 0068:ec179dc8 [ 155.433515] CR2: 0000000000000008 [ 155.437208] ---[ end trace 1a562e195c40a78c ]---
On Fri, Apr 20, 2018 at 10:02:29AM -0600, Jan Beulich wrote: > >>>> Skylake 32bit PAE Dom0: > >>>> Bad swp_entry: 80000000 > >>>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000) > >>>> > >>>> Ivy Bridge 32bit PAE Dom0: > >>>> Bad swp_entry: 40000000 > >>>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000) > >>>> > >>>> Other 32bit DomU: > >>>> Bad swp_entry: 4000000 > >>>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000) > >>>> > >>>> Other 32bit: > >>>> Bad swp_entry: 2000000 > >>>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000) > As said in my previous reply - both of the bits Andrew has mentioned can > only ever be set when the present bit is also set (which doesn't appear to > be the case here). The set bits above are actually in the range of bits > designated to the address, which Xen wouldn't ever play with. Is it relevant that all the crashes we've seen are with PAE in the guest? Is it possible that Xen thinks the guest is not using PAE?
On 21/04/18 16:35, Matthew Wilcox wrote: > On Fri, Apr 20, 2018 at 10:02:29AM -0600, Jan Beulich wrote: >>>>>> Skylake 32bit PAE Dom0: >>>>>> Bad swp_entry: 80000000 >>>>>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000) >>>>>> >>>>>> Ivy Bridge 32bit PAE Dom0: >>>>>> Bad swp_entry: 40000000 >>>>>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000) >>>>>> >>>>>> Other 32bit DomU: >>>>>> Bad swp_entry: 4000000 >>>>>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000) >>>>>> >>>>>> Other 32bit: >>>>>> Bad swp_entry: 2000000 >>>>>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000) > >> As said in my previous reply - both of the bits Andrew has mentioned can >> only ever be set when the present bit is also set (which doesn't appear to >> be the case here). The set bits above are actually in the range of bits >> designated to the address, which Xen wouldn't ever play with. > > Is it relevant that all the crashes we've seen are with PAE in the guest? > Is it possible that Xen thinks the guest is not using PAE? > All Xen 32-bit PV guests use PAE. It's part of the PV ABI. Juergen
On 20/04/18 17:20, Jason Andryuk wrote: > Adding xen-devel and the Linux Xen maintainers. > > Summary: Some Xen users (and maybe others) are hitting a BUG in > __radix_tree_lookup() under do_swap_page() - example backtrace is > provided at the end. Matthew Wilcox provided a band-aid patch that > prints errors like the following instead of triggering the bug. > > Skylake 32bit PAE Dom0: > Bad swp_entry: 80000000 > mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000) > > Ivy Bridge 32bit PAE Dom0: > Bad swp_entry: 40000000 > mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000) > > Other 32bit DomU: > Bad swp_entry: 4000000 > mm/swap_state.c:683: bad pte e2187f30(8000000200000000) > > Other 32bit: > Bad swp_entry: 2000000 > mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000) > > The Linux bugzilla has more info > https://bugzilla.kernel.org/show_bug.cgi?id=198497 > > This may not be exclusive to Xen Linux, but most of the reports are on > Xen. Matthew wonders if Xen might be stepping on the upper bits of a > pte. > > On Fri, Apr 20, 2018 at 9:39 AM, Matthew Wilcox <willy@infradead.org> wrote: >> On Fri, Apr 20, 2018 at 09:10:11AM -0400, Jason Andryuk wrote: >>>> Given that this is happening on Xen, I wonder if Xen is using some of the >>>> bits in the page table for its own purposes. >>> >>> The backtraces include do_swap_page(). While I have a swap partition >>> configured, I don't think it's being used. Are we somehow >>> misidentifying the page as a swap page? I'm not familiar with the >>> code, but is there an easy way to query global swap usage? That way >>> we can see if the check for a swap page is bogus. >>> >>> My system works with the band-aid patch. When that patch sets page = >>> NULL, does that mean userspace is just going to get a zero-ed page? >>> Userspace still works AFAICT, which makes me think it is a >>> mis-identified page to start with. >> >> Here's how this code works. > > Thanks for the description. 
> >> When we swap out an anonymous page (a page which is not backed by a >> file; could be from a MAP_PRIVATE mapping, could be brk()), we write it >> to the swap cache. In order to be able to find it again, we store a >> cookie (called a swp_entry_t) in the process' page table (marked with >> the 'present' bit clear, so the CPU will fault on it). When we get a >> fault, we look up the cookie in a radix tree and bring that page back >> in from swap. >> >> If there's no page found in the radix tree, we put a freshly zeroed >> page into the process's address space. That's because we won't find >> a page in the swap cache's radix tree for the first time we fault. >> It's not an indication of a bug if there's no page to be found. > > Is "no page found" the case for a lazy, un-allocated MAP_ANONYMOUS page? > >> What we're seeing for this bug is page table entries of the format >> 0x8000'0004'0000'0000. That would be a zeroed entry, except for the >> fact that something's stepped on the upper bits. > > Does a totally zero-ed entry correspond to an un-allocated MAP_ANONYMOUS > page? > >> What is worrying is that potentially Xen might be stepping on the upper >> bits of either a present entry (leading to the process loading a page >> that belongs to someone else) or an entry which has been swapped out, >> leading to the process getting a zeroed page when it should be getting >> its page back from swap. > > There was at least one report of non-Xen 32bit being affected. There > was no backtrace, so it could be something else. One report doesn't > have any swap configured. > >> Defending against this kind of corruption would take adding a parity >> bit to the page tables. That's not a project I have time for right now. > > Understood. Thanks for the response. 
> > Regards, > Jason > > > [ 2234.939079] BUG: unable to handle kernel NULL pointer dereference at > 00000008 > [ 2234.942154] IP: __radix_tree_lookup+0xe/0xa0 > [ 2234.945176] *pdpt = 0000000008cd5027 *pde = 0000000000000000 > [ 2234.948382] Oops: 0000 [#1] SMP > [ 2234.951410] Modules linked in: hp_wmi sparse_keymap rfkill wmi_bmof > pcspkr i915 wmi hp_accel lis3lv02d input_polldev drm_kms_helper > syscopyarea sysfillrect sysimgblt fb_sys_fops drm hp_wireless > i2c_algo_bit hid_multitouch sha256_generic xen_netfront v4v(O) psmouse > ecb xts hid_generic xhci_pci xhci_hcd ohci_pci ohci_hcd uhci_hcd > ehci_pci ehci_hcd usbhid hid tpm_tis tpm_tis_core tpm > [ 2234.960816] CPU: 1 PID: 2338 Comm: xenvm Tainted: G O 4.14.18 > #1 > [ 2234.963991] Hardware name: Hewlett-Packard HP EliteBook Folio > 9470m/18DF, BIOS 68IBD Ver. F.40 02/01/2013 > [ 2234.967186] task: d4370980 task.stack: cf8e8000 > [ 2234.970351] EIP: __radix_tree_lookup+0xe/0xa0 > [ 2234.973520] EFLAGS: 00010286 CPU: 1 > [ 2234.976699] EAX: 00000004 EBX: b5900000 ECX: 00000000 EDX: 00000000 > [ 2234.979887] ESI: 00000000 EDI: 00000004 EBP: cf8e9dd0 ESP: cf8e9dc0 > [ 2234.983081] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069 > [ 2234.986233] CR0: 80050033 CR2: 00000008 CR3: 08f12000 CR4: 00042660 > [ 2234.989340] Call Trace: > [ 2234.992354] radix_tree_lookup_slot+0x1d/0x50 > [ 2234.995341] ? xen_irq_disable_direct+0xc/0xc > [ 2234.998288] find_get_entry+0x1d/0x110 > [ 2235.001140] pagecache_get_page+0x1f/0x240 > [ 2235.003948] ? xen_flush_tlb_others+0x17b/0x260 > [ 2235.006784] lookup_swap_cache+0x32/0xe0 > [ 2235.009632] swap_readahead_detect+0x67/0x2c0 > [ 2235.012447] do_swap_page+0x10a/0x750 > [ 2235.015270] ? wp_page_copy+0x2c4/0x590 > [ 2235.018043] ? xen_pmd_val+0x11/0x20 > [ 2235.020729] handle_mm_fault+0x3f8/0x970 > [ 2235.023352] ? xen_smp_send_reschedule+0xa/0x10 > [ 2235.025927] ? 
resched_curr+0x68/0xc0 > [ 2235.028444] __do_page_fault+0x1a7/0x480 > [ 2235.030883] do_page_fault+0x33/0x110 > [ 2235.033250] ? do_fast_syscall_32+0xb3/0x200 > [ 2235.035567] ? vmalloc_sync_all+0x290/0x290 > [ 2235.037828] common_exception+0x84/0x8a > [ 2235.040011] EIP: 0xb7c8ddea > [ 2235.042111] EFLAGS: 00010202 CPU: 1 > [ 2235.044153] EAX: b7dd38d0 EBX: b7dd2780 ECX: b7dd2000 EDX: b5900010 > [ 2235.046176] ESI: 00000000 EDI: b7dd38f0 EBP: b56ff124 ESP: b56ff070 > [ 2235.048152] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b > [ 2235.050053] Code: 42 14 29 c6 89 f0 c1 f8 02 e9 71 ff ff ff e8 aa > 81 aa ff 8d 76 00 8d bc 27 00 00 00 00 55 89 e5 57 89 c7 56 53 83 ec > 04 89 4d f0 <8b> 5f 04 89 d8 83 e0 03 83 f8 01 75 67 89 d8 83 e0 fe 0f > b6 08 > [ 2235.053998] EIP: __radix_tree_lookup+0xe/0xa0 SS:ESP: 0069:cf8e9dc0 > [ 2235.055895] CR2: 0000000000000008 > Could it be we just have a race regarding pte_clear()? This will set the low part of the pte to zero first and then the high part. If pte_clear() is used in interrupt mode, Xen in particular will be rather slow, as it emulates the two writes to the page table, resulting in a larger window where the race might happen. Juergen
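The window described above can be illustrated with a small model. This is only a sketch of the hypothesis, not kernel code: the starting PTE is a made-up present mapping (NX set, physical address 0x400000000, flag bits 0x067), and the helper names are hypothetical. The "looks like a swap entry" test is a simplification of the fault path, assuming that a non-present, non-zero PTE is what ends up in do_swap_page().

```python
PRESENT_BIT = 1 << 0


def split_clear_states(pte):
    """Return every value a concurrent reader could observe while a
    64-bit PTE is cleared with two 32-bit writes, low half first."""
    after_low_write = (pte >> 32) << 32   # low 32 bits zeroed, high half intact
    after_high_write = 0                  # both halves zeroed
    return [pte, after_low_write, after_high_write]


def looks_like_swap_entry(pte):
    """Simplified model: not present, but not empty either."""
    return pte != 0 and not (pte & PRESENT_BIT)


# Hypothetical present PTE: NX | physical address 0x4_0000_0000 | flags 0x067.
start = 0x8000000400000067
for value in split_clear_states(start):
    print(f"{value:#018x} looks_like_swap_entry={looks_like_swap_entry(value)}")
```

The intermediate state is 0x8000000400000000, the same shape as the bad PTEs in the reports: the present bit is already gone while NX and one address bit are still in place.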
On Mon, Apr 23, 2018 at 4:17 AM Juergen Gross <jgross@suse.com> wrote: > On 20/04/18 17:20, Jason Andryuk wrote: > > Adding xen-devel and the Linux Xen maintainers. > > > > Summary: Some Xen users (and maybe others) are hitting a BUG in > > __radix_tree_lookup() under do_swap_page() - example backtrace is > > provided at the end. Matthew Wilcox provided a band-aid patch that > > prints errors like the following instead of triggering the bug. > > > > Skylake 32bit PAE Dom0: > > Bad swp_entry: 80000000 > > mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000) > > > > Ivy Bridge 32bit PAE Dom0: > > Bad swp_entry: 40000000 > > mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000) > > > > Other 32bit DomU: > > Bad swp_entry: 4000000 > > mm/swap_state.c:683: bad pte e2187f30(8000000200000000) > > > > Other 32bit: > > Bad swp_entry: 2000000 > > mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000) > > > > The Linux bugzilla has more info > > https://bugzilla.kernel.org/show_bug.cgi?id=198497 > > > > This may not be exclusive to Xen Linux, but most of the reports are on > > Xen. Matthew wonders if Xen might be stepping on the upper bits of a > > pte. > > <snip> > > Could it be we just have a race regarding pte_clear()? This will set > the low part of the pte to zero first and then the hight part. > > In case pte_clear() is used in interrupt mode especially Xen will be > rather slow as it emulates the two writes to the page table resulting > in a larger window where the race might happen. It looks like Juergen was correct. With the L1TF vulnerability, the Xen hypervisor needs to detect vulnerable PTEs. For 32bit PAE, Xen would trap on PTEs like 0x8000'0002'0000'0000 - the same format as seen in this bug. 
He wrote two patches for Linux, now upstream, to write PTEs with 64bit operations or hypercalls and avoid the invalid PTEs: f7c90c2aa400 "x86/xen: don't write ptes directly in 32-bit PV guests" b2d7a075a1cc "x86/pae: use 64 bit atomic xchg function in native_ptep_get_and_clear" With those patches, I have not seen a "Bad swp_entry", so this seems fixed for me on Xen. There was also a report of a non-Xen kernel being affected. Is there an underlying problem that native PAE code updates PTEs in two writes, but there is no locking to prevent the intermediate PTE from being used elsewhere in the kernel? Regards, Jason
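The difference between the two update styles can be modelled on a byte buffer. The sketch below uses Python's `struct` module to stand in for memory; it is not the kernel's implementation, the helper names are hypothetical, and the starting PTE value is made up. It only shows which values become observable after each store when a little-endian 64-bit PTE is cleared with two 32-bit stores versus one 64-bit store.

```python
import struct


def read_pte(buf):
    """Read the 64-bit little-endian PTE held in buf."""
    return struct.unpack_from("<Q", buf, 0)[0]


def clear_with_two_stores(buf):
    """Clear the low 32-bit half, then the high half; record each state."""
    seen = []
    struct.pack_into("<I", buf, 0, 0)   # low half lives at byte offset 0
    seen.append(read_pte(buf))
    struct.pack_into("<I", buf, 4, 0)   # high half lives at byte offset 4
    seen.append(read_pte(buf))
    return seen


def clear_with_one_store(buf):
    """Clear all 64 bits in a single store; record the resulting state."""
    struct.pack_into("<Q", buf, 0, 0)
    return [read_pte(buf)]


# Hypothetical present PTE: NX | physical address 0x2_0000_0000 | flags 0x067.
start = 0x8000000200000067
print([hex(v) for v in clear_with_two_stores(bytearray(struct.pack("<Q", start)))])
print([hex(v) for v in clear_with_one_store(bytearray(struct.pack("<Q", start)))])
```

With two stores, the half-cleared value 0x8000000200000000 exists between the writes. With a single 64-bit store, a reader can only ever see the old value or zero.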