Bug 198497

Summary: handle_mm_fault / xen_pmd_val / radix_tree_lookup_slot Null pointer
Product: Memory Management Reporter: peter
Component: Page Allocator Assignee: Andrew Morton (akpm)
Status: NEW ---    
Severity: high CC: christian, dpstrand, jandryuk, octavsly, sudipm.mukherjee
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: Linux app1.com 4.14.13-rh10-20180115190010.xenU.i386 #1 SMP Mon Jan 15 19:04:55 UTC 2018 i686 GNU/Linux Subsystem:
Regression: No Bisected commit-id:
Attachments: attachment-13826-0.html

Description peter 2018-01-18 01:21:56 UTC
On a Xen VM running as pvh

[    3.499843] Adding 131068k swap on /dev/xvda9.  Priority:-2 extents:1 across:131068k SSFS
[    3.547312] EXT4-fs (xvda1): re-mounted. Opts: (null)
[    3.988606] EXT4-fs (xvda1): re-mounted. Opts: errors=remount-ro
[   24.647744] BUG: unable to handle kernel NULL pointer dereference at 00000008
[   24.647801] IP: __radix_tree_lookup+0x14/0xa0
[   24.647811] *pdpt = 00000000253d6027 *pde = 0000000000000000 
[   24.647828] Oops: 0000 [#1] SMP
[   24.647842] CPU: 5 PID: 3600 Comm: java Not tainted 4.14.13-rh10-20180115190010.xenU.i386 #1
[   24.647855] task: e52518c0 task.stack: e4e7a000
[   24.647866] EIP: __radix_tree_lookup+0x14/0xa0
[   24.647876] EFLAGS: 00010286 CPU: 5
[   24.647884] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000
[   24.647895] ESI: 00000000 EDI: 00000000 EBP: e4e7bdb8 ESP: e4e7bda0
[   24.647904]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
[   24.647917] CR0: 80050033 CR2: 00000008 CR3: 25360000 CR4: 00002660
[   24.647930] Call Trace:
[   24.647942]  radix_tree_lookup_slot+0x13/0x30
[   24.647955]  find_get_entry+0x1d/0x120
[   24.647963]  pagecache_get_page+0x1f/0x230
[   24.647975]  lookup_swap_cache+0x42/0x140
[   24.647983]  swap_readahead_detect+0x66/0x2e0
[   24.647993]  do_swap_page+0x1fa/0x860
[   24.648010]  ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
[   24.648026]  ? xen_pmd_val+0x10/0x20
[   24.648035]  handle_mm_fault+0x6f8/0x1020
[   24.648046]  __do_page_fault+0x18a/0x450
[   24.648055]  ? vmalloc_sync_all+0x250/0x250
[   24.648063]  do_page_fault+0x21/0x30
[   24.648074]  common_exception+0x45/0x4a
[   24.648082] EIP: 0xb76d873e
[   24.648088] EFLAGS: 00010206 CPU: 5
[   24.648096] EAX: 76a10000 EBX: 76a1cd14 ECX: 00000006 EDX: 00000006
[   24.648105] ESI: 00000040 EDI: b796c380 EBP: 77881008 ESP: 77880ff8
[   24.648115]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[   24.648124] Code: ff ff ff 00 47 03 e9 69 ff ff ff 8b 45 08 89 06 e9 1f ff ff ff 66 90 55 89 e5 57 89 d7 56 53 83 ec 0c 89 45 ec 89 4d e8 8b 45 ec <8b> 58 04 89 d8 83 e0 03 48 89 5d f0 75 64 89 d8 83 e0 fe 0f b6
[   24.648195] EIP: __radix_tree_lookup+0x14/0xa0 SS:ESP: 0069:e4e7bda0
[   24.648205] CR2: 0000000000000008
[   24.648273] ---[ end trace ed356e59f215ce07 ]---
[   28.890326] BUG: unable to handle kernel NULL pointer dereference at 00000008
[   28.890372] IP: __radix_tree_lookup+0x14/0xa0
[   28.890382] *pdpt = 0000000025488027 *pde = 0000000000000000 
[   28.890396] Oops: 0000 [#2] SMP
[   28.890408] CPU: 7 PID: 3542 Comm: java Tainted: G      D         4.14.13-rh10-20180115190010.xenU.i386 #1
[   28.890423] task: e8691080 task.stack: e52a6000
[   28.890433] EIP: __radix_tree_lookup+0x14/0xa0
[   28.890442] EFLAGS: 00010286 CPU: 7
[   28.890449] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000
[   28.890459] ESI: 00000000 EDI: 00000000 EBP: e52a7db8 ESP: e52a7da0
[   28.890469]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
[   28.890484] CR0: 80050033 CR2: 00000008 CR3: 25161000 CR4: 00002660
[   28.890498] Call Trace:
[   28.890510]  radix_tree_lookup_slot+0x13/0x30
[   28.890522]  find_get_entry+0x1d/0x120
[   28.890531]  pagecache_get_page+0x1f/0x230
[   28.890541]  lookup_swap_cache+0x42/0x140
[   28.890550]  swap_readahead_detect+0x66/0x2e0
[   28.890559]  do_swap_page+0x1fa/0x860
[   28.890573]  ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
[   28.890588]  ? xen_pmd_val+0x10/0x20
[   28.890597]  handle_mm_fault+0x6f8/0x1020
[   28.890607]  __do_page_fault+0x18a/0x450
[   28.890616]  ? vmalloc_sync_all+0x250/0x250
[   28.890681]  do_page_fault+0x21/0x30
[   28.890707]  common_exception+0x45/0x4a
[   28.890715] EIP: 0xb779774f
[   28.890722] EFLAGS: 00010202 CPU: 7
[   28.890730] EAX: 00000000 EBX: 66dd9d6c ECX: 02000000 EDX: 00000001
[   28.890740] ESI: 02000000 EDI: 00000000 EBP: 674fe068 ESP: 674fe058
[   28.890751]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[   28.890759] Code: ff ff ff 00 47 03 e9 69 ff ff ff 8b 45 08 89 06 e9 1f ff ff ff 66 90 55 89 e5 57 89 d7 56 53 83 ec 0c 89 45 ec 89 4d e8 8b 45 ec <8b> 58 04 89 d8 83 e0 03 48 89 5d f0 75 64 89 d8 83 e0 fe 0f b6
[   28.890830] EIP: __radix_tree_lookup+0x14/0xa0 SS:ESP: 0069:e52a7da0
[   28.890841] CR2: 0000000000000008
[   28.890886] ---[ end trace ed356e59f215ce08 ]---

# java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) Server VM (build 24.51-b03, mixed mode)

~# free -m
             total       used       free     shared    buffers     cached
Mem:          5418        572       4846          0         25        245
-/+ buffers/cache:        301       5117
Swap:          127          0        127

# uptime
 01:21:08 up  2:02,  3 users,  load average: 13.47, 13.65, 13.63
Comment 1 peter 2018-01-18 02:00:46 UTC
FWIW, I am seeing this on a couple of different servers.  A busy Java process seems to be a common trigger.  From memory, only 32-bit kernels seem to hit the problem.
Comment 2 Andrew Morton 2018-01-18 21:55:23 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).


On Thu, 18 Jan 2018 01:21:56 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=198497
> 
>             Bug ID: 198497
>            Summary: handle_mm_fault / xen_pmd_val / radix_tree_lookup_slot
>                     Null pointer
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: Linux app1.vpsgate.com
>                     4.14.13-rh10-20180115190010.xenU.i386 #1 SMP Mon Jan
>                     15 19:04:55 UTC 2018 i686 GNU/Linux
>           Hardware: Intel
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Page Allocator
>           Assignee: akpm@linux-foundation.org
>           Reporter: peter@rimuhosting.com
>         Regression: No

Does this look familiar to anyone?

> On a Xen VM running as pvh
> 
> [    3.499843] Adding 131068k swap on /dev/xvda9.  Priority:-2 extents:1
> across:131068k SSFS
> [    3.547312] EXT4-fs (xvda1): re-mounted. Opts: (null)
> [    3.988606] EXT4-fs (xvda1): re-mounted. Opts: errors=remount-ro
> [   24.647744] BUG: unable to handle kernel NULL pointer dereference at
> 00000008
> [   24.647801] IP: __radix_tree_lookup+0x14/0xa0
> [   24.647811] *pdpt = 00000000253d6027 *pde = 0000000000000000 
> [   24.647828] Oops: 0000 [#1] SMP
> [   24.647842] CPU: 5 PID: 3600 Comm: java Not tainted
> 4.14.13-rh10-20180115190010.xenU.i386 #1
> [   24.647855] task: e52518c0 task.stack: e4e7a000
> [   24.647866] EIP: __radix_tree_lookup+0x14/0xa0
> [   24.647876] EFLAGS: 00010286 CPU: 5
> [   24.647884] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000
> [   24.647895] ESI: 00000000 EDI: 00000000 EBP: e4e7bdb8 ESP: e4e7bda0
> [   24.647904]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
> [   24.647917] CR0: 80050033 CR2: 00000008 CR3: 25360000 CR4: 00002660
> [   24.647930] Call Trace:
> [   24.647942]  radix_tree_lookup_slot+0x13/0x30
> [   24.647955]  find_get_entry+0x1d/0x120
> [   24.647963]  pagecache_get_page+0x1f/0x230
> [   24.647975]  lookup_swap_cache+0x42/0x140
> [   24.647983]  swap_readahead_detect+0x66/0x2e0
> [   24.647993]  do_swap_page+0x1fa/0x860
> [   24.648010]  ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
> [   24.648026]  ? xen_pmd_val+0x10/0x20
> [   24.648035]  handle_mm_fault+0x6f8/0x1020
> [   24.648046]  __do_page_fault+0x18a/0x450
> [   24.648055]  ? vmalloc_sync_all+0x250/0x250
> [   24.648063]  do_page_fault+0x21/0x30
> [   24.648074]  common_exception+0x45/0x4a
> [   24.648082] EIP: 0xb76d873e
> [   24.648088] EFLAGS: 00010206 CPU: 5
> [   24.648096] EAX: 76a10000 EBX: 76a1cd14 ECX: 00000006 EDX: 00000006
> [   24.648105] ESI: 00000040 EDI: b796c380 EBP: 77881008 ESP: 77880ff8
> [   24.648115]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
> [   24.648124] Code: ff ff ff 00 47 03 e9 69 ff ff ff 8b 45 08 89 06 e9 1f ff
> ff ff 66 90 55 89 e5 57 89 d7 56 53 83 ec 0c 89 45 ec 89 4d e8 8b 45 ec <8b>
> 58
> 04 89 d8 83 e0 03 48 89 5d f0 75 64 89 d8 83 e0 fe 0f b6
> [   24.648195] EIP: __radix_tree_lookup+0x14/0xa0 SS:ESP: 0069:e4e7bda0
> [   24.648205] CR2: 0000000000000008
> [   24.648273] ---[ end trace ed356e59f215ce07 ]---
> [   28.890326] BUG: unable to handle kernel NULL pointer dereference at
> 00000008
> [   28.890372] IP: __radix_tree_lookup+0x14/0xa0
> [   28.890382] *pdpt = 0000000025488027 *pde = 0000000000000000 
> [   28.890396] Oops: 0000 [#2] SMP
> [   28.890408] CPU: 7 PID: 3542 Comm: java Tainted: G      D        
> 4.14.13-rh10-20180115190010.xenU.i386 #1
> [   28.890423] task: e8691080 task.stack: e52a6000
> [   28.890433] EIP: __radix_tree_lookup+0x14/0xa0
> [   28.890442] EFLAGS: 00010286 CPU: 7
> [   28.890449] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000
> [   28.890459] ESI: 00000000 EDI: 00000000 EBP: e52a7db8 ESP: e52a7da0
> [   28.890469]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
> [   28.890484] CR0: 80050033 CR2: 00000008 CR3: 25161000 CR4: 00002660
> [   28.890498] Call Trace:
> [   28.890510]  radix_tree_lookup_slot+0x13/0x30
> [   28.890522]  find_get_entry+0x1d/0x120
> [   28.890531]  pagecache_get_page+0x1f/0x230
> [   28.890541]  lookup_swap_cache+0x42/0x140
> [   28.890550]  swap_readahead_detect+0x66/0x2e0
> [   28.890559]  do_swap_page+0x1fa/0x860
> [   28.890573]  ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
> [   28.890588]  ? xen_pmd_val+0x10/0x20
> [   28.890597]  handle_mm_fault+0x6f8/0x1020
> [   28.890607]  __do_page_fault+0x18a/0x450
> [   28.890616]  ? vmalloc_sync_all+0x250/0x250
> [   28.890681]  do_page_fault+0x21/0x30
> [   28.890707]  common_exception+0x45/0x4a
> [   28.890715] EIP: 0xb779774f
> [   28.890722] EFLAGS: 00010202 CPU: 7
> [   28.890730] EAX: 00000000 EBX: 66dd9d6c ECX: 02000000 EDX: 00000001
> [   28.890740] ESI: 02000000 EDI: 00000000 EBP: 674fe068 ESP: 674fe058
> [   28.890751]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
> [   28.890759] Code: ff ff ff 00 47 03 e9 69 ff ff ff 8b 45 08 89 06 e9 1f ff
> ff ff 66 90 55 89 e5 57 89 d7 56 53 83 ec 0c 89 45 ec 89 4d e8 8b 45 ec <8b>
> 58
> 04 89 d8 83 e0 03 48 89 5d f0 75 64 89 d8 83 e0 fe 0f b6
> [   28.890830] EIP: __radix_tree_lookup+0x14/0xa0 SS:ESP: 0069:e52a7da0
> [   28.890841] CR2: 0000000000000008
> [   28.890886] ---[ end trace ed356e59f215ce08 ]---
> 
> # java -version
> java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) Server VM (build 24.51-b03, mixed mode)
> 
> ~# free -m
>              total       used       free     shared    buffers     cached
> Mem:          5418        572       4846          0         25        245
> -/+ buffers/cache:        301       5117
> Swap:          127          0        127
> 
> # uptime
>  01:21:08 up  2:02,  3 users,  load average: 13.47, 13.65, 13.63
Comment 3 Laura Abbott 2018-01-18 22:18:28 UTC
On 01/18/2018 01:55 PM, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> 
> On Thu, 18 Jan 2018 01:21:56 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
> 
>> https://bugzilla.kernel.org/show_bug.cgi?id=198497
>>
>>              Bug ID: 198497
>>             Summary: handle_mm_fault / xen_pmd_val / radix_tree_lookup_slot
>>                      Null pointer
>>             Product: Memory Management
>>             Version: 2.5
>>      Kernel Version: Linux app1.vpsgate.com
>>                      4.14.13-rh10-20180115190010.xenU.i386 #1 SMP Mon Jan
>>                      15 19:04:55 UTC 2018 i686 GNU/Linux
>>            Hardware: Intel
>>                  OS: Linux
>>                Tree: Mainline
>>              Status: NEW
>>            Severity: high
>>            Priority: P1
>>           Component: Page Allocator
>>            Assignee: akpm@linux-foundation.org
>>            Reporter: peter@rimuhosting.com
>>          Regression: No
> 
> Does this look familiar to anyone?
> 

Fedora has been seeing similar reports
https://bugzilla.redhat.com/show_bug.cgi?id=1531779

Multiple reporters, one in XEN, another on actual hardware

>> On a Xen VM running as pvh
>>
>> [    3.499843] Adding 131068k swap on /dev/xvda9.  Priority:-2 extents:1
>> across:131068k SSFS
>> [    3.547312] EXT4-fs (xvda1): re-mounted. Opts: (null)
>> [    3.988606] EXT4-fs (xvda1): re-mounted. Opts: errors=remount-ro
>> [   24.647744] BUG: unable to handle kernel NULL pointer dereference at
>> 00000008
>> [   24.647801] IP: __radix_tree_lookup+0x14/0xa0
>> [   24.647811] *pdpt = 00000000253d6027 *pde = 0000000000000000
>> [   24.647828] Oops: 0000 [#1] SMP
>> [   24.647842] CPU: 5 PID: 3600 Comm: java Not tainted
>> 4.14.13-rh10-20180115190010.xenU.i386 #1
>> [   24.647855] task: e52518c0 task.stack: e4e7a000
>> [   24.647866] EIP: __radix_tree_lookup+0x14/0xa0
>> [   24.647876] EFLAGS: 00010286 CPU: 5
>> [   24.647884] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000
>> [   24.647895] ESI: 00000000 EDI: 00000000 EBP: e4e7bdb8 ESP: e4e7bda0
>> [   24.647904]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
>> [   24.647917] CR0: 80050033 CR2: 00000008 CR3: 25360000 CR4: 00002660
>> [   24.647930] Call Trace:
>> [   24.647942]  radix_tree_lookup_slot+0x13/0x30
>> [   24.647955]  find_get_entry+0x1d/0x120
>> [   24.647963]  pagecache_get_page+0x1f/0x230
>> [   24.647975]  lookup_swap_cache+0x42/0x140
>> [   24.647983]  swap_readahead_detect+0x66/0x2e0
>> [   24.647993]  do_swap_page+0x1fa/0x860
>> [   24.648010]  ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
>> [   24.648026]  ? xen_pmd_val+0x10/0x20
>> [   24.648035]  handle_mm_fault+0x6f8/0x1020
>> [   24.648046]  __do_page_fault+0x18a/0x450
>> [   24.648055]  ? vmalloc_sync_all+0x250/0x250
>> [   24.648063]  do_page_fault+0x21/0x30
>> [   24.648074]  common_exception+0x45/0x4a
>> [   24.648082] EIP: 0xb76d873e
>> [   24.648088] EFLAGS: 00010206 CPU: 5
>> [   24.648096] EAX: 76a10000 EBX: 76a1cd14 ECX: 00000006 EDX: 00000006
>> [   24.648105] ESI: 00000040 EDI: b796c380 EBP: 77881008 ESP: 77880ff8
>> [   24.648115]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
>> [   24.648124] Code: ff ff ff 00 47 03 e9 69 ff ff ff 8b 45 08 89 06 e9 1f
>> ff
>> ff ff 66 90 55 89 e5 57 89 d7 56 53 83 ec 0c 89 45 ec 89 4d e8 8b 45 ec <8b>
>> 58
>> 04 89 d8 83 e0 03 48 89 5d f0 75 64 89 d8 83 e0 fe 0f b6
>> [   24.648195] EIP: __radix_tree_lookup+0x14/0xa0 SS:ESP: 0069:e4e7bda0
>> [   24.648205] CR2: 0000000000000008
>> [   24.648273] ---[ end trace ed356e59f215ce07 ]---
>> [   28.890326] BUG: unable to handle kernel NULL pointer dereference at
>> 00000008
>> [   28.890372] IP: __radix_tree_lookup+0x14/0xa0
>> [   28.890382] *pdpt = 0000000025488027 *pde = 0000000000000000
>> [   28.890396] Oops: 0000 [#2] SMP
>> [   28.890408] CPU: 7 PID: 3542 Comm: java Tainted: G      D
>> 4.14.13-rh10-20180115190010.xenU.i386 #1
>> [   28.890423] task: e8691080 task.stack: e52a6000
>> [   28.890433] EIP: __radix_tree_lookup+0x14/0xa0
>> [   28.890442] EFLAGS: 00010286 CPU: 7
>> [   28.890449] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000
>> [   28.890459] ESI: 00000000 EDI: 00000000 EBP: e52a7db8 ESP: e52a7da0
>> [   28.890469]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
>> [   28.890484] CR0: 80050033 CR2: 00000008 CR3: 25161000 CR4: 00002660
>> [   28.890498] Call Trace:
>> [   28.890510]  radix_tree_lookup_slot+0x13/0x30
>> [   28.890522]  find_get_entry+0x1d/0x120
>> [   28.890531]  pagecache_get_page+0x1f/0x230
>> [   28.890541]  lookup_swap_cache+0x42/0x140
>> [   28.890550]  swap_readahead_detect+0x66/0x2e0
>> [   28.890559]  do_swap_page+0x1fa/0x860
>> [   28.890573]  ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
>> [   28.890588]  ? xen_pmd_val+0x10/0x20
>> [   28.890597]  handle_mm_fault+0x6f8/0x1020
>> [   28.890607]  __do_page_fault+0x18a/0x450
>> [   28.890616]  ? vmalloc_sync_all+0x250/0x250
>> [   28.890681]  do_page_fault+0x21/0x30
>> [   28.890707]  common_exception+0x45/0x4a
>> [   28.890715] EIP: 0xb779774f
>> [   28.890722] EFLAGS: 00010202 CPU: 7
>> [   28.890730] EAX: 00000000 EBX: 66dd9d6c ECX: 02000000 EDX: 00000001
>> [   28.890740] ESI: 02000000 EDI: 00000000 EBP: 674fe068 ESP: 674fe058
>> [   28.890751]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
>> [   28.890759] Code: ff ff ff 00 47 03 e9 69 ff ff ff 8b 45 08 89 06 e9 1f
>> ff
>> ff ff 66 90 55 89 e5 57 89 d7 56 53 83 ec 0c 89 45 ec 89 4d e8 8b 45 ec <8b>
>> 58
>> 04 89 d8 83 e0 03 48 89 5d f0 75 64 89 d8 83 e0 fe 0f b6
>> [   28.890830] EIP: __radix_tree_lookup+0x14/0xa0 SS:ESP: 0069:e52a7da0
>> [   28.890841] CR2: 0000000000000008
>> [   28.890886] ---[ end trace ed356e59f215ce08 ]---
>>
>> # java -version
>> java version "1.7.0_51"
>> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
>> Java HotSpot(TM) Server VM (build 24.51-b03, mixed mode)
>>
>> ~# free -m
>>               total       used       free     shared    buffers     cached
>> Mem:          5418        572       4846          0         25        245
>> -/+ buffers/cache:        301       5117
>> Swap:          127          0        127
>>
>> # uptime
>>   01:21:08 up  2:02,  3 users,  load average: 13.47, 13.65, 13.63
Comment 4 willy 2018-01-19 03:04:53 UTC
On Thu, Jan 18, 2018 at 02:18:20PM -0800, Laura Abbott wrote:
> On 01/18/2018 01:55 PM, Andrew Morton wrote:
> > > [   24.647744] BUG: unable to handle kernel NULL pointer dereference at
> > > 00000008
> > > [   24.647801] IP: __radix_tree_lookup+0x14/0xa0
> > > [   24.647811] *pdpt = 00000000253d6027 *pde = 0000000000000000
> > > [   24.647828] Oops: 0000 [#1] SMP
> > > [   24.647842] CPU: 5 PID: 3600 Comm: java Not tainted
> > > 4.14.13-rh10-20180115190010.xenU.i386 #1
> > > [   24.647855] task: e52518c0 task.stack: e4e7a000
> > > [   24.647866] EIP: __radix_tree_lookup+0x14/0xa0
> > > [   24.647876] EFLAGS: 00010286 CPU: 5
> > > [   24.647884] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000
> > > [   24.647895] ESI: 00000000 EDI: 00000000 EBP: e4e7bdb8 ESP: e4e7bda0
> > > [   24.647904]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
> > > [   24.647917] CR0: 80050033 CR2: 00000008 CR3: 25360000 CR4: 00002660
> > > [   24.647930] Call Trace:
> > > [   24.647942]  radix_tree_lookup_slot+0x13/0x30
> > > [   24.647955]  find_get_entry+0x1d/0x120
> > > [   24.647963]  pagecache_get_page+0x1f/0x230
> > > [   24.647975]  lookup_swap_cache+0x42/0x140
> > > [   24.647983]  swap_readahead_detect+0x66/0x2e0
> > > [   24.647993]  do_swap_page+0x1fa/0x860
> > > [   24.648010]  ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
> > > [   24.648026]  ? xen_pmd_val+0x10/0x20
> > > [   24.648035]  handle_mm_fault+0x6f8/0x1020
> > > [   24.648046]  __do_page_fault+0x18a/0x450
> > > [   24.648055]  ? vmalloc_sync_all+0x250/0x250
> > > [   24.648063]  do_page_fault+0x21/0x30
> > > [   24.648074]  common_exception+0x45/0x4a
> > > [   24.648082] EIP: 0xb76d873e
> > > [   24.648088] EFLAGS: 00010206 CPU: 5
> > > [   24.648096] EAX: 76a10000 EBX: 76a1cd14 ECX: 00000006 EDX: 00000006
> > > [   24.648105] ESI: 00000040 EDI: b796c380 EBP: 77881008 ESP: 77880ff8
> > > [   24.648115]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
> > > [   24.648124] Code: ff ff ff 00 47 03 e9 69 ff ff ff 8b 45 08 89 06 e9
> 1f ff
> > > ff ff 66 90 55 89 e5 57 89 d7 56 53 83 ec 0c 89 45 ec 89 4d e8 8b 45 ec
> <8b> 58
> > > 04 89 d8 83 e0 03 48 89 5d f0 75 64 89 d8 83 e0 fe 0f b6
> > > [   24.648195] EIP: __radix_tree_lookup+0x14/0xa0 SS:ESP: 0069:e4e7bda0
> > > [   24.648205] CR2: 0000000000000008
> > > [   24.648273] ---[ end trace ed356e59f215ce07 ]---

Running that code through decodecode, I get:

   0:	55                   	push   %ebp
   1:	89 e5                	mov    %esp,%ebp
   3:	57                   	push   %edi
   4:	89 d7                	mov    %edx,%edi
   6:	56                   	push   %esi
   7:	53                   	push   %ebx
   8:	83 ec 0c             	sub    $0xc,%esp
   b:	89 45 ec             	mov    %eax,-0x14(%ebp)
   e:	89 4d e8             	mov    %ecx,-0x18(%ebp)
  11:	8b 45 ec             	mov    -0x14(%ebp),%eax
  14:*	8b 58 04             	mov    0x4(%eax),%ebx		<-- trapping instruction
  17:	89 d8                	mov    %ebx,%eax
  19:	83 e0 03             	and    $0x3,%eax

Which I think means it's looking at offset 4 from whichever argument
the x86 calling convention puts in register %eax.  Which I think is
argument 0?  Which is the radix tree root.  And that makes sense; we're
loading the root node from the radix tree root at offset 4.  The problem
is that %eax has the value 4 in it.  That would match with 'page_tree'
being at offset 4 from the start of address_space.  So find_get_page()
got called with a NULL mapping, so pagecache_get_page() got called
with a NULL mapping.

Which means I've tracked it back to:

        page = find_get_page(swap_address_space(entry), swp_offset(entry));

and swap_address_space() is returning NULL.  Has this machine run swapoff
recently, perhaps?
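
The offset arithmetic above can be checked with a small user-space sketch.  It is a simplified model, not kernel code: the struct names are hypothetical, pointers are represented as 32-bit integers to mimic i386, and the member order (host pointer first, then page_tree; gfp_mask first, then the root node pointer) is an assumption about the 4.14-era layout that should be confirmed against the tree being debugged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified i386 model: pointers are stored as 32-bit integers so the
 * layout is the same regardless of the host that compiles this sketch. */
typedef uint32_t ptr32_t;

/* Assumed layout: gfp_mask first, then the pointer to the root node. */
struct model_radix_tree_root {
	uint32_t gfp_mask;	/* offset 0 */
	ptr32_t  rnode;		/* offset 4, read by the trapping instruction */
};

/* Assumed layout: only the first two members matter for this analysis. */
struct model_address_space {
	ptr32_t host;				/* offset 0 */
	struct model_radix_tree_root page_tree;	/* offset 4 */
};

/* Compile-time check that the model really has the offsets discussed. */
_Static_assert(offsetof(struct model_address_space, page_tree) == 4,
	       "page_tree is expected at offset 4 in this model");
_Static_assert(offsetof(struct model_radix_tree_root, rnode) == 4,
	       "rnode is expected at offset 4 in this model");

/* Address that "&mapping->page_tree" evaluates to for a given mapping. */
static uint32_t page_tree_addr(uint32_t mapping)
{
	return mapping + (uint32_t)offsetof(struct model_address_space, page_tree);
}

/* Address read by "mov 0x4(%eax),%ebx" when %eax holds the root pointer. */
static uint32_t rnode_load_addr(uint32_t root)
{
	return root + (uint32_t)offsetof(struct model_radix_tree_root, rnode);
}
```

Under these assumptions, page_tree_addr(0) is 4, which matches "EAX: 00000004" in the oops, and rnode_load_addr(page_tree_addr(0)) is 8, which matches "CR2: 00000008".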
Comment 5 xen 2018-01-19 03:25:09 UTC
On 19/01/18 4:04 PM, Matthew Wilcox wrote:
> On Thu, Jan 18, 2018 at 02:18:20PM -0800, Laura Abbott wrote:
>> On 01/18/2018 01:55 PM, Andrew Morton wrote:
>>>> [   24.647744] BUG: unable to handle kernel NULL pointer dereference at
>>>> 00000008
>>>> [   24.647801] IP: __radix_tree_lookup+0x14/0xa0
>>>> [   24.647811] *pdpt = 00000000253d6027 *pde = 0000000000000000
>>>> [   24.647828] Oops: 0000 [#1] SMP
>>>> [   24.647842] CPU: 5 PID: 3600 Comm: java Not tainted
>>>> 4.14.13-rh10-20180115190010.xenU.i386 #1
>>>> [   24.647855] task: e52518c0 task.stack: e4e7a000
>>>> [   24.647866] EIP: __radix_tree_lookup+0x14/0xa0
>>>> [   24.647876] EFLAGS: 00010286 CPU: 5
>>>> [   24.647884] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000
>>>> [   24.647895] ESI: 00000000 EDI: 00000000 EBP: e4e7bdb8 ESP: e4e7bda0
>>>> [   24.647904]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
>>>> [   24.647917] CR0: 80050033 CR2: 00000008 CR3: 25360000 CR4: 00002660
>>>> [   24.647930] Call Trace:
>>>> [   24.647942]  radix_tree_lookup_slot+0x13/0x30
>>>> [   24.647955]  find_get_entry+0x1d/0x120
>>>> [   24.647963]  pagecache_get_page+0x1f/0x230
>>>> [   24.647975]  lookup_swap_cache+0x42/0x140
>>>> [   24.647983]  swap_readahead_detect+0x66/0x2e0
>>>> [   24.647993]  do_swap_page+0x1fa/0x860
>>>> [   24.648010]  ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
>>>> [   24.648026]  ? xen_pmd_val+0x10/0x20
>>>> [   24.648035]  handle_mm_fault+0x6f8/0x1020
>>>> [   24.648046]  __do_page_fault+0x18a/0x450
>>>> [   24.648055]  ? vmalloc_sync_all+0x250/0x250
>>>> [   24.648063]  do_page_fault+0x21/0x30
>>>> [   24.648074]  common_exception+0x45/0x4a
>>>> [   24.648082] EIP: 0xb76d873e
>>>> [   24.648088] EFLAGS: 00010206 CPU: 5
>>>> [   24.648096] EAX: 76a10000 EBX: 76a1cd14 ECX: 00000006 EDX: 00000006
>>>> [   24.648105] ESI: 00000040 EDI: b796c380 EBP: 77881008 ESP: 77880ff8
>>>> [   24.648115]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
>>>> [   24.648124] Code: ff ff ff 00 47 03 e9 69 ff ff ff 8b 45 08 89 06 e9 1f
>>>> ff
>>>> ff ff 66 90 55 89 e5 57 89 d7 56 53 83 ec 0c 89 45 ec 89 4d e8 8b 45 ec
>>>> <8b> 58
>>>> 04 89 d8 83 e0 03 48 89 5d f0 75 64 89 d8 83 e0 fe 0f b6
>>>> [   24.648195] EIP: __radix_tree_lookup+0x14/0xa0 SS:ESP: 0069:e4e7bda0
>>>> [   24.648205] CR2: 0000000000000008
>>>> [   24.648273] ---[ end trace ed356e59f215ce07 ]---
> Running that code through decodecode, I get:
>
>     0:        55                      push   %ebp
>     1:        89 e5                   mov    %esp,%ebp
>     3:        57                      push   %edi
>     4:        89 d7                   mov    %edx,%edi
>     6:        56                      push   %esi
>     7:        53                      push   %ebx
>     8:        83 ec 0c                sub    $0xc,%esp
>     b:        89 45 ec                mov    %eax,-0x14(%ebp)
>     e:        89 4d e8                mov    %ecx,-0x18(%ebp)
>    11:        8b 45 ec                mov    -0x14(%ebp),%eax
>    14:*       8b 58 04                mov    0x4(%eax),%ebx           <--
>    trapping instruction
>    17:        89 d8                   mov    %ebx,%eax
>    19:        83 e0 03                and    $0x3,%eax
>
> Which I think means it's looking at offset 4 from whichever argument
> the x86 calling convention puts in register %eax.  Which I think is
> argument 0?  Which is the radix tree root.  And that makes sense; we're
> loading the root node from the radix tree root at offset 4.  The problem
> is that %eax has the value 4 in it.  That would match with 'page_tree'
> being at offset 4 from the start of address_space.  So find_get_page()
> got called with a NULL mapping, so pagecache_get_page() got called
> with a NULL mapping.
>
> Which means I've tracked it back to:
>
>          page = find_get_page(swap_address_space(entry), swp_offset(entry));
>
> and swap_address_space() is returning NULL.  Has this machine run swapoff
> recently, perhaps?
Swap was on.  Swap is small (127 MB).  Swap had not been dipped into:
             total       used       free     shared    buffers     cached
Swap:          127          0        127

PS: I cannot recall seeing this issue on x86_64, only on 32-bit.  PPS: a
reminder that this is a Xen VM, which per
https://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html#PVH-Guest-Specific-Options
has an "out of sync pagetables" option, if that is relevant (we do not set
that option; I am unsure what default is used).
Comment 6 willy 2018-01-19 13:21:52 UTC
On Fri, Jan 19, 2018 at 04:14:42PM +1300, xen@randomwebstuff.com wrote:
> 
> On 19/01/18 4:04 PM, Matthew Wilcox wrote:
> > On Thu, Jan 18, 2018 at 02:18:20PM -0800, Laura Abbott wrote:
> > > On 01/18/2018 01:55 PM, Andrew Morton wrote:
> > > > > [   24.647744] BUG: unable to handle kernel NULL pointer dereference
> at
> > > > > 00000008
> > > > > [   24.647801] IP: __radix_tree_lookup+0x14/0xa0
> > > > > [   24.647811] *pdpt = 00000000253d6027 *pde = 0000000000000000
> > > > > [   24.647828] Oops: 0000 [#1] SMP
> > > > > [   24.647842] CPU: 5 PID: 3600 Comm: java Not tainted
> > > > > 4.14.13-rh10-20180115190010.xenU.i386 #1
> > > > > [   24.647855] task: e52518c0 task.stack: e4e7a000
> > > > > [   24.647866] EIP: __radix_tree_lookup+0x14/0xa0
> > > > > [   24.647876] EFLAGS: 00010286 CPU: 5
> > > > > [   24.647884] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX:
> 00000000

If my understanding is right, EDX contains the index we're looking up.
Which is zero.  So the swp_entry we got is one bit away from being NULL.
Hmm.  Have you run memtest86 or some other memory tester on the system
recently?
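
To illustrate what "one bit away from being NULL" could look like, here is a user-space sketch of how a 32-bit swap entry splits into a type and an offset.  The shift constants are assumptions about a 32-bit 4.14-era kernel (check include/linux/swapops.h and include/linux/swap.h), all model_* names are hypothetical, and the model leaves out the second-level indexing by offset that the real swap_address_space() performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed constants for a 32-bit 4.14-era kernel; verify against the tree. */
#define MODEL_MAX_SWAPFILES_SHIFT 5
#define MODEL_EXCEPTIONAL_SHIFT   2
#define MODEL_SWP_TYPE_SHIFT \
	(32 - (MODEL_MAX_SWAPFILES_SHIFT + MODEL_EXCEPTIONAL_SHIFT))
#define MODEL_SWP_OFFSET_MASK ((UINT32_C(1) << MODEL_SWP_TYPE_SHIFT) - 1)
#define MODEL_NR_TYPES 32

typedef struct { uint32_t val; } model_swp_entry_t;

/* Upper bits select the swap device ("type"). */
static uint32_t model_swp_type(model_swp_entry_t e)
{
	return e.val >> MODEL_SWP_TYPE_SHIFT;
}

/* Lower bits are the page offset within that device. */
static uint32_t model_swp_offset(model_swp_entry_t e)
{
	return e.val & MODEL_SWP_OFFSET_MASK;
}

/* Stand-in for the one address space that swapon created (device 0). */
static int model_device0_space;

/* One slot per swap device; slots for devices never enabled stay NULL. */
static int *model_swapper_spaces[MODEL_NR_TYPES] = { &model_device0_space };

/* Simplified lookup: returns NULL when the entry names an unused device. */
static int *model_swap_address_space(model_swp_entry_t e)
{
	uint32_t type = model_swp_type(e);

	if (type >= MODEL_NR_TYPES)
		return NULL;	/* corrupted type bits beyond the table */
	return model_swapper_spaces[type];
}
```

In this model an entry of 0x02000000 (a single set bit) decodes to type 1, offset 0, and model_swap_address_space() returns NULL for it because only device 0 exists.  That is one way, under the stated assumptions, to end up with index 0 and a NULL mapping at the same time.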

> PS: cannot recall seeing this issue on x86_64, just 32 bit.

Laura has 64-bit instances of this.

> PPS: reminder
> this is on a Xen VM which per
> https://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html#PVH-Guest-Specific-Options
> has "out of sync pagetables" if that is relevant (we do not set that option,
> I am unsure what default is used).

Laura also has non-Xen instances of this.  They may not all be the same
bug, of course.
Comment 7 willy 2018-01-19 13:33:56 UTC
On Thu, Jan 18, 2018 at 02:18:20PM -0800, Laura Abbott wrote:
> Fedora has been seeing similar reports
> https://bugzilla.redhat.com/show_bug.cgi?id=1531779
> 
> Multiple reporters, one in XEN, another on actual hardware

Can you chuck this patch into Fedora?  Should make it easier to see if it's
a "stuck bit" kind of a problem.

---

From: Matthew Wilcox <mawilcox@microsoft.com>
Subject: Detect bad swap entries in lookup

If we have a stuck bit in a PTE, we can end up looking for an entry in
a NULL mapping, which oopses fairly quickly.  Print a warning to help
us debug, and return NULL which will help the machine survive a little
longer.  Although if it has a permanently stuck bit in a PTE, there's only
a 50% chance it'll survive the insertion of a real PTE into that entry.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 39ae7cfad90f..5a928e0191a1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -334,8 +334,12 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long ra_info;
 	int win, hits, readahead;
+	struct address_space *swapper_space = swap_address_space(entry);
+
+	if (WARN(!swapper_space, "Bad swp_entry: %lx\n", entry.val))
+		return NULL;
 
-	page = find_get_page(swap_address_space(entry), swp_offset(entry));
+	page = find_get_page(swapper_space, swp_offset(entry));
 
 	INC_CACHE_INFO(find_total);
 	if (page) {
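
For readers who want to see the shape of the check outside the kernel, here is a minimal user-space sketch of the same guard.  Every model_* name is a hypothetical stand-in, and fprintf() takes the place of WARN(), which has no user-space equivalent; it illustrates the pattern only and is not the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for struct address_space and struct page. */
struct model_mapping { int unused; };
struct model_page { int unused; };

static struct model_mapping model_mapping0;	/* a valid mapping */
static struct model_page model_cached_page;	/* the page every lookup finds */

/* Stand-in for find_get_page(); the real one dereferences the mapping. */
static struct model_page *model_find_get_page(struct model_mapping *mapping,
					      unsigned long offset)
{
	(void)mapping;
	(void)offset;
	return &model_cached_page;
}

/* Warn and bail out before a NULL mapping can reach the lookup. */
static struct model_page *model_lookup(struct model_mapping *mapping,
				       unsigned long entry_val,
				       unsigned long offset)
{
	if (!mapping) {
		fprintf(stderr, "Bad swp_entry: %lx\n", entry_val);
		return NULL;
	}
	return model_find_get_page(mapping, offset);
}
```

With this shape, model_lookup(NULL, 0x2000000UL, 0) prints the warning and returns NULL instead of faulting, while model_lookup(&model_mapping0, 0, 0) proceeds to the lookup as before.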
Comment 8 Laura Abbott 2018-01-19 17:30:31 UTC
On 01/19/2018 05:21 AM, Matthew Wilcox wrote:
> On Fri, Jan 19, 2018 at 04:14:42PM +1300, xen@randomwebstuff.com wrote:
>>
>> On 19/01/18 4:04 PM, Matthew Wilcox wrote:
>>> On Thu, Jan 18, 2018 at 02:18:20PM -0800, Laura Abbott wrote:
>>>> On 01/18/2018 01:55 PM, Andrew Morton wrote:
>>>>>> [   24.647744] BUG: unable to handle kernel NULL pointer dereference at
>>>>>> 00000008
>>>>>> [   24.647801] IP: __radix_tree_lookup+0x14/0xa0
>>>>>> [   24.647811] *pdpt = 00000000253d6027 *pde = 0000000000000000
>>>>>> [   24.647828] Oops: 0000 [#1] SMP
>>>>>> [   24.647842] CPU: 5 PID: 3600 Comm: java Not tainted
>>>>>> 4.14.13-rh10-20180115190010.xenU.i386 #1
>>>>>> [   24.647855] task: e52518c0 task.stack: e4e7a000
>>>>>> [   24.647866] EIP: __radix_tree_lookup+0x14/0xa0
>>>>>> [   24.647876] EFLAGS: 00010286 CPU: 5
>>>>>> [   24.647884] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 00000000
> 
> If my understanding is right, EDX contains the index we're looking up.
> Which is zero.  So the swp_entry we got is one bit away from being NULL.
> Hmm.  Have you run memtest86 or some other memory tester on the system
> recently?
> 
>> PS: cannot recall seeing this issue on x86_64, just 32 bit.
> 
> Laura has 64-bit instances of this.
> 

The 64-bit backtraces reported in the bugzilla looked different,
I would consider it a different issue.

>> PPS: reminder
>> this is on a Xen VM which per
>> https://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html#PVH-Guest-Specific-Options
>> has "out of sync pagetables" if that is relevant (we do not set that option,
>> I am unsure what default is used).
> 
> Laura also has non-Xen instances of this.  They may not all be the same
> bug, of course.
>
Comment 9 xen 2018-01-26 06:54:15 UTC
Created attachment 273867 [details]
attachment-13826-0.html

On 20/01/18 6:30 AM, Laura Abbott wrote:
> On 01/19/2018 05:21 AM, Matthew Wilcox wrote:
>> On Fri, Jan 19, 2018 at 04:14:42PM +1300, xen@randomwebstuff.com wrote:
>>>
>>> On 19/01/18 4:04 PM, Matthew Wilcox wrote:
>>>> On Thu, Jan 18, 2018 at 02:18:20PM -0800, Laura Abbott wrote:
>>>>> On 01/18/2018 01:55 PM, Andrew Morton wrote:
>>>>>>> [   24.647744] BUG: unable to handle kernel NULL pointer 
>>>>>>> dereference at
>>>>>>> 00000008
>>>>>>> [   24.647801] IP: __radix_tree_lookup+0x14/0xa0
>>>>>>> [   24.647811] *pdpt = 00000000253d6027 *pde = 0000000000000000
>>>>>>> [   24.647828] Oops: 0000 [#1] SMP
>>>>>>> [   24.647842] CPU: 5 PID: 3600 Comm: java Not tainted
>>>>>>> 4.14.13-rh10-20180115190010.xenU.i386 #1
>>>>>>> [   24.647855] task: e52518c0 task.stack: e4e7a000
>>>>>>> [   24.647866] EIP: __radix_tree_lookup+0x14/0xa0
>>>>>>> [   24.647876] EFLAGS: 00010286 CPU: 5
>>>>>>> [   24.647884] EAX: 00000004 EBX: 00000007 ECX: 00000000 EDX: 
>>>>>>> 00000000
>>
>> If my understanding is right, EDX contains the index we're looking up.
>> Which is zero.  So the swp_entry we got is one bit away from being NULL.
>> Hmm.  Have you run memtest86 or some other memory tester on the system
>> recently?
>>
>>> PS: cannot recall seeing this issue on x86_64, just 32 bit.
>>
>> Laura has 64-bit instances of this.
>>
>
> The 64-bit backtraces reported in the bugzilla looked different,
> I would consider it a different issue.
>
>> PPS: reminder
>>> this is on a Xen VM which per 
>>>
>>> https://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html#PVH-Guest-Specific-Options
>>> has "out of sync pagetables" if that is relevant (we do not set that 
>>> option,
>>> I am unsure what default is used).
>>
>> Laura also has non-Xen instances of this.  They may not all be the same
>> bug, of course.
>>
Re-tried with the current latest 4.14 (4.14.15).  Received the following:

[2018-01-24 19:26:57] Ubuntu 14.04.5 LTS dev hvc0
[2018-01-24 19:26:57]
[2018-01-24 19:26:57] dev login: [44501.106868] BUG: unable to handle kernel NULL pointer dereference at 00000008
[2018-01-25 07:47:50] [44501.106897] IP: __radix_tree_lookup+0x14/0xa0
[2018-01-25 07:47:50] [44501.106905] *pdpt = 000000001fe82027 *pde = 0000000000000000
[2018-01-25 07:47:50] [44501.106916] Oops: 0000 [#1] SMP
[2018-01-25 07:47:50] [44501.106924] CPU: 0 PID: 3344 Comm: PassengerAgent Not tainted 4.14.15-rh13-20180123235331.xenU.i386 #1
[2018-01-25 07:47:50] [44501.106935] task: dfee39c0 task.stack: dff12000
[2018-01-25 07:47:50] [44501.106943] EIP: __radix_tree_lookup+0x14/0xa0
[2018-01-25 07:47:50] [44501.106950] EFLAGS: 00210286 CPU: 0
[2018-01-25 07:47:50] [44501.106955] EAX: 00000004 EBX: 00000001 ECX: 00000000 EDX: 00000000
[2018-01-25 07:47:50] [44501.106963] ESI: 00000000 EDI: 00000000 EBP: dff13db8 ESP: dff13da0
[2018-01-25 07:47:50] [44501.106971]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
[2018-01-25 07:47:50] [44501.106979] CR0: 80050033 CR2: 00000008 CR3: 1fdb1000 CR4: 00002660
[2018-01-25 07:47:50] [44501.106989] Call Trace:
[2018-01-25 07:47:50] [44501.106995]  radix_tree_lookup_slot+0x13/0x30
[2018-01-25 07:47:50] [44501.107004]  find_get_entry+0x1d/0x120
[2018-01-25 07:47:50] [44501.107011]  pagecache_get_page+0x1f/0x230
[2018-01-25 07:47:50] [44501.107018]  lookup_swap_cache+0x42/0x140
[2018-01-25 07:47:50] [44501.107024]  swap_readahead_detect+0x66/0x2e0
[2018-01-25 07:47:50] [44501.107032]  do_swap_page+0x1fa/0x860
[2018-01-25 07:47:50] [44501.107040]  ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
[2018-01-25 07:47:50] [44501.107050]  ? xen_pmd_val+0x10/0x20
[2018-01-25 07:47:50] [44501.107057]  handle_mm_fault+0x6f8/0x1020
[2018-01-25 07:47:50] [44501.107065]  ? _raw_spin_unlock_irqrestore+0x13/0x20
[2018-01-25 07:47:50] [44501.107074]  ? pvclock_clocksource_read+0xa6/0x1a0
[2018-01-25 07:47:50] [44501.107081]  __do_page_fault+0x18a/0x450
[2018-01-25 07:47:50] [44501.107089]  ? _copy_to_user+0x28/0x40
[2018-01-25 07:47:50] [44501.107096]  ? vmalloc_sync_all+0x250/0x250
[2018-01-25 07:47:50] [44501.107102]  do_page_fault+0x21/0x30
[2018-01-25 07:47:50] [44501.107109]  common_exception+0x45/0x4a
[2018-01-25 07:47:50] [44501.107115] EIP: 0x82c3358
[2018-01-25 07:47:50] [44501.107120] EFLAGS: 00210202 CPU: 0
[2018-01-25 07:47:50] [44501.107126] EAX: b702d0b8 EBX: 081557a9 ECX: 00000000 EDX: 0a4296bc
[2018-01-25 07:47:50] [44501.107133] ESI: b467c2cc EDI: 00000000 EBP: b467c138 ESP: b467c110
[2018-01-25 07:47:50] [44501.107141]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[2018-01-25 07:47:50] [44501.107147] Code: ff ff ff 00 47 03 e9 69 ff ff ff 8b 45 08 89 06 e9 1f ff ff ff 66 90 55 89 e5 57 89 d7 56 53 83 ec 0c 89 45 ec 89 4d e8 8b 45 ec <8b> 58 04 89 d8 83 e0 03 48 89 5d f0 75 64 89 d8 83 e0 fe 0f b6
[2018-01-25 07:47:50] [44501.110296] EIP: __radix_tree_lookup+0x14/0xa0 SS:ESP: 0069:dff13da0
[2018-01-25 07:47:50] [44501.110304] CR2: 0000000000000008
[2018-01-25 07:47:50] [44501.110356] ---[ end trace 89cdd2ba8e7323a8 ]---
Comment 10 willy 2018-01-26 19:41:06 UTC
On Fri, Jan 26, 2018 at 07:54:06PM +1300, xen@randomwebstuff.com wrote:
> Re-tried with the current latest 4.14 (4.14.15).  Received the following:
> 
> [2018-01-24 19:26:57] dev login: [44501.106868] BUG: unable to handle kernel
> NULL pointer dereference at 00000008
> [2018-01-25 07:47:50] [44501.106897] IP: __radix_tree_lookup+0x14/0xa0

Please try including this patch:

https://bugzilla.kernel.org/show_bug.cgi?id=198497#c7

And have you had the chance to run memtest86 yet?
Comment 11 xen 2018-01-29 22:26:55 UTC
On 27/01/18 8:40 AM, Matthew Wilcox wrote:

> On Fri, Jan 26, 2018 at 07:54:06PM +1300, xen@randomwebstuff.com wrote:
>> Re-tried with the current latest 4.14 (4.14.15).  Received the following:
>>
>> [2018-01-24 19:26:57] dev login: [44501.106868] BUG: unable to handle kernel
>> NULL pointer dereference at 00000008
>> [2018-01-25 07:47:50] [44501.106897] IP: __radix_tree_lookup+0x14/0xa0
> Please try including this patch:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=198497#c7
>
> And have you had the chance to run memtest86 yet?
Added the patch at https://bugzilla.kernel.org/show_bug.cgi?id=198497#c7

With the patch applied, I received the stack below.

I have not tried memtest86, as these are production hosts.  This has
occurred on multiple hosts, and I can only recall it occurring on 32-bit
kernels.  I cannot recall issues with other VMs on the same hosts that
are not running that kernel.

[  125.329163] Bad swp_entry: e000000
[  125.329202] ------------[ cut here ]------------
[  125.329219] WARNING: CPU: 0 PID: 4175 at mm/swap_state.c:339 lookup_swap_cache+0x140/0x160
[  125.329233] CPU: 0 PID: 4175 Comm: apt-show-versio Not tainted 4.14.15-rh14-20180126233810.xenU.i386-00001-g6ba70cb #1
[  125.329245] task: ead9a940 task.stack: e7c8c000
[  125.329253] EIP: lookup_swap_cache+0x140/0x160
[  125.329260] EFLAGS: 00010282 CPU: 0
[  125.329267] EAX: 00000016 EBX: 00000000 ECX: ec5289c4 EDX: 0100016d
[  125.329275] ESI: b6312000 EDI: e7d94ea0 EBP: e7c8de24 ESP: e7c8de0c
[  125.329284]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
[  125.329295] CR0: 80050033 CR2: b63124b0 CR3: 2718c000 CR4: 00002660
[  125.329308] Call Trace:
[  125.329323]  ? percpu_counter_add_batch+0x91/0xb0
[  125.329332]  swap_readahead_detect+0x66/0x2e0
[  125.329343]  ? radix_tree_tag_set+0x7a/0xe0
[  125.329352]  do_swap_page+0x1fa/0x860
[  125.329361]  ? __set_page_dirty_buffers+0xb1/0xe0
[  125.329372]  ? ext4_set_page_dirty+0x22/0x60
[  125.329383]  ? fault_dirty_shared_page.isra.90+0x3e/0xa0
[  125.329396]  ? xen_pmd_val+0x10/0x20
[  125.329403]  handle_mm_fault+0x6f8/0x1020
[  125.329414]  ? handle_irq_event_percpu+0x3c/0x50
[  125.329424]  __do_page_fault+0x18a/0x450
[  125.329432]  ? vmalloc_sync_all+0x250/0x250
[  125.329439]  do_page_fault+0x21/0x30
[  125.329449]  common_exception+0x45/0x4a
[  125.329456] EIP: 0xb7ce397b
[  125.329462] EFLAGS: 00010202 CPU: 0
[  125.329469] EAX: 0000052a EBX: b7d77ff4 ECX: 000004fa EDX: b6311000
[  125.329477] ESI: bf90eae0 EDI: b6ed4b20 EBP: bf90ea60 ESP: bf90ea20
[  125.329486]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[  125.329493] Code: 18 1f 14 c2 85 ff 0f 85 41 ff ff ff f0 ff 05 38 fb 02 c2 e9 35 ff ff ff 8d 76 00 89 44 24 04 c7 04 24 55 93 f3 c1 e8 8c e7 f5 ff <0f> ff 8b 5d f4 31 c0 8b 75 f8 8b 7d fc 89 ec 5d c3 64 ff 05 18
[  125.329558] ---[ end trace dd2704ca649b44ba ]---
Comment 12 willy 2018-01-31 10:54:59 UTC
On Tue, Jan 30, 2018 at 11:26:42AM +1300, xen@randonwebstuff.com wrote:
> After, received this stack.
> 
> Have not tried memtest86.  These are production hosts.  This has occurred on
> multiple hosts.  I can only recall this occurring on 32 bit kernels.  I
> cannot recall issues with other VMs not running that kernel on the same
> hosts.
> 
> [  125.329163] Bad swp_entry: e000000

Mixed news here then ... 'e' is 8 | 4 | 2, so it's not a single bitflip.
So no point in running memtest86.

I should have made the printk produce leading zeroes, because that's
0x0e00'0000.  ptes use the top 5 bits to encode the swapfile, so
this swap entry is decoded as swapfile 1, page number 0x0600'0000.
That's clearly ludicrous because you don't have a swapfile 1, and if
you did, it wouldn't be so large as a terabyte.
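
To make the arithmetic above concrete, here is a minimal C sketch of that
decoding.  It assumes, as stated above, that the top 5 bits of a 32-bit
swp_entry select the swapfile and the remaining 27 bits are the page
number.  The sketch_* names are made up for illustration, and the kernel's
real macros may split the bits differently.

```c
#include <stdint.h>

/*
 * Assumption taken from the comment above: the swapfile index ("type")
 * lives in the top 5 bits of a 32-bit swap entry and the page number
 * ("offset") in the remaining 27 bits.  This only reproduces the
 * arithmetic in the text; it is not the kernel's own definition.
 */
#define SKETCH_TYPE_BITS   5u
#define SKETCH_OFFSET_BITS (32u - SKETCH_TYPE_BITS)

/* Swapfile index: the top 5 bits of the entry. */
static inline uint32_t sketch_swp_type(uint32_t val)
{
	return val >> SKETCH_OFFSET_BITS;
}

/* Page number within that swapfile: the low 27 bits of the entry. */
static inline uint32_t sketch_swp_offset(uint32_t val)
{
	return val & ((UINT32_C(1) << SKETCH_OFFSET_BITS) - 1u);
}

/* Number of set bits; a value one bitflip away from zero has exactly one. */
static inline unsigned int sketch_popcount32(uint32_t val)
{
	unsigned int n = 0;

	while (val) {
		val &= val - 1u;	/* clear the lowest set bit */
		n++;
	}
	return n;
}
```

With 0x0e000000 this gives swapfile 1 and page number 0x06000000, and
three set bits, which is why a single bitflip is ruled out.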

I think the next step in debugging this is printing the PTE which gave
us this swp_entry.  If you can drop the patch I asked you to try, and
apply this patch instead, we'll have more idea about what's going on.

Thanks!

diff --git a/mm/memory.c b/mm/memory.c
index 403934297a3d..8caaddb07747 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2892,6 +2892,10 @@ int do_swap_page(struct vm_fault *vmf)
 	if (!page)
 		page = lookup_swap_cache(entry, vma_readahead ? vma : NULL,
 					 vmf->address);
+	if (IS_ERR(page)) {
+		pte_ERROR(vmf->orig_pte);
+		page = NULL;
+	}
 	if (!page) {
 		struct swap_info_struct *si = swp_swap_info(entry);
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 7fbe67be86fa..905fa34e022a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1651,6 +1651,10 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	if (swap.val) {
 		/* Look it up and read it in.. */
 		page = lookup_swap_cache(swap, NULL, 0);
+		if (IS_ERR(page)) {
+			pte_ERROR(vmf->orig_pte);
+			page = NULL;
+		}
 		if (!page) {
 			/* Or update major stats only when swapin succeeds?? */
 			if (fault_type) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 39ae7cfad90f..7ee594c8eadd 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -334,8 +334,14 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long ra_info;
 	int win, hits, readahead;
+	struct address_space *swapper_space = swap_address_space(entry);
+
+	if (!swapper_space) {
+		pr_err("Bad swp_entry: %lx\n", entry.val);
+		return ERR_PTR(-EFAULT);
+	}
 
-	page = find_get_page(swap_address_space(entry), swp_offset(entry));
+	page = find_get_page(swapper_space, swp_offset(entry));
 
 	INC_CACHE_INFO(find_total);
 	if (page) {
@@ -676,6 +682,10 @@ struct page *swap_readahead_detect(struct vm_fault *vmf,
 	if ((unlikely(non_swap_entry(entry))))
 		return NULL;
 	page = lookup_swap_cache(entry, vma, faddr);
+	if (IS_ERR(page)) {
+		pte_ERROR(vmf->orig_pte);
+		page = NULL;
+	}
 	if (page)
 		return page;
Comment 13 Tetsuo Handa 2018-01-31 23:41:06 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1531779

This might be related to the
"x86/mm: Found insecure W+X mapping at address" message printed at boot.

Are you seeing "x86/mm: Found insecure W+X mapping at address" before
hitting "BUG: unable to handle kernel NULL pointer dereference" ?
Comment 14 peter 2018-02-01 00:16:00 UTC
I checked two different servers that have printed this BUG.  I am not seeing 'Found insecure' in dmesg on either of them.
Comment 15 willy 2018-02-01 09:48:06 UTC
On Thu, Feb 01, 2018 at 08:02:43AM +0900, Tetsuo Handa wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=1531779
> 
> It might be something related that
> "x86/mm: Found insecure W+X mapping at address" message is printed at boot.
> 
> Are you seeing "x86/mm: Found insecure W+X mapping at address" before
> hitting "BUG: unable to handle kernel NULL pointer dereference" ?

There are about eight different bugs in that thread; the only commonality
I see between them is that there's a null pointer dereference somewhere
in the kernel.
Comment 16 willy 2018-02-09 14:47:30 UTC
ping?

On Wed, Jan 31, 2018 at 02:54:56AM -0800, Matthew Wilcox wrote:
> On Tue, Jan 30, 2018 at 11:26:42AM +1300, xen@randonwebstuff.com wrote:
> > After, received this stack.
> > 
> > Have not tried memtest86.  These are production hosts.  This has occurred
> on
> > multiple hosts.  I can only recall this occurring on 32 bit kernels.  I
> > cannot recall issues with other VMs not running that kernel on the same
> > hosts.
> > 
> > [  125.329163] Bad swp_entry: e000000
> 
> Mixed news here then ... 'e' is 8 | 4 | 2, so it's not a single bitflip.
> So no point in running memtest86.
> 
> I should have made the printk produce leading zeroes, because that's
> 0x0e00'0000.  ptes use the top 5 bits to encode the swapfile, so
> this swap entry is decoded as swapfile 1, page number 0x0600'0000.
> That's clearly ludicrous because you don't have a swapfile 1, and if
> you did, it wouldn't be so large as a terabyte.
> 
> I think the next step in debugging this is printing the PTE which gave
> us this swp_entry.  If you can drop the patch I asked you to try, and
> apply this patch instead, we'll have more idea about what's going on.
> 
> Thanks!
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 403934297a3d..8caaddb07747 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2892,6 +2892,10 @@ int do_swap_page(struct vm_fault *vmf)
>       if (!page)
>               page = lookup_swap_cache(entry, vma_readahead ? vma : NULL,
>                                        vmf->address);
> +     if (IS_ERR(page)) {
> +             pte_ERROR(vmf->orig_pte);
> +             page = NULL;
> +     }
>       if (!page) {
>               struct swap_info_struct *si = swp_swap_info(entry);
>  
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 7fbe67be86fa..905fa34e022a 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1651,6 +1651,10 @@ static int shmem_getpage_gfp(struct inode *inode,
> pgoff_t index,
>       if (swap.val) {
>               /* Look it up and read it in.. */
>               page = lookup_swap_cache(swap, NULL, 0);
> +             if (IS_ERR(page)) {
> +                     pte_ERROR(vmf->orig_pte);
> +                     page = NULL;
> +             }
>               if (!page) {
>                       /* Or update major stats only when swapin succeeds?? */
>                       if (fault_type) {
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 39ae7cfad90f..7ee594c8eadd 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -334,8 +334,14 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct
> vm_area_struct *vma,
>       struct page *page;
>       unsigned long ra_info;
>       int win, hits, readahead;
> +     struct address_space *swapper_space = swap_address_space(entry);
> +
> +     if (!swapper_space) {
> +             pr_err("Bad swp_entry: %lx\n", entry.val);
> +             return ERR_PTR(-EFAULT);
> +     }
>  
> -     page = find_get_page(swap_address_space(entry), swp_offset(entry));
> +     page = find_get_page(swapper_space, swp_offset(entry));
>  
>       INC_CACHE_INFO(find_total);
>       if (page) {
> @@ -676,6 +682,10 @@ struct page *swap_readahead_detect(struct vm_fault *vmf,
>       if ((unlikely(non_swap_entry(entry))))
>               return NULL;
>       page = lookup_swap_cache(entry, vma, faddr);
> +     if (IS_ERR(page)) {
> +             pte_ERROR(vmf->orig_pte);
> +             page = NULL;
> +     }
>       if (page)
>               return page;
>  
> 
Comment 17 David Strand 2018-02-12 22:01:14 UTC
4.14.15 i686 with the patch above included:

Feb  9 14:31:27 cs01 kernel: Bad swp_entry: 2000000
Feb  9 14:31:27 cs01 kernel: mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)

Feb  9 15:35:19 cs01 kernel: Bad swp_entry: 2000000
Feb  9 15:35:19 cs01 kernel: mm/swap_state.c:683: bad pte eee17f38(8000000100000000)


Here is a non patched trace for reference:
Jan 30 18:19:14 cs01 kernel: Oops: 0000 [#1] SMP
Jan 30 18:19:14 cs01 kernel: Modules linked in: iptable_mangle ip_tables bonding
Jan 30 18:19:14 cs01 kernel: CPU: 5 PID: 14205 Comm: nbdserver Not tainted 4.14.15 #4
Jan 30 18:19:14 cs01 kernel: Hardware name: To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M., BIOS 4.6.3 07/26/2010
Jan 30 18:19:14 cs01 kernel: task: ed97bb80 task.stack: ef11e000
Jan 30 18:19:14 cs01 kernel: EIP: __radix_tree_lookup+0xe/0xa0
Jan 30 18:19:14 cs01 kernel: EFLAGS: 00010292 CPU: 5
Jan 30 18:19:14 cs01 kernel: EAX: 00000004 EBX: 08627000 ECX: 00000000 EDX: 00000000
Jan 30 18:19:14 cs01 kernel: ESI: 00000000 EDI: 00000004 EBP: ef11fdfc ESP: ef11fdec
Jan 30 18:19:14 cs01 kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Jan 30 18:19:14 cs01 kernel: CR0: 80050033 CR2: 00000008 CR3: 2ee46980 CR4: 000006f0
Jan 30 18:19:14 cs01 kernel: Call Trace:
Jan 30 18:19:14 cs01 kernel: radix_tree_lookup_slot+0x11/0x30
Jan 30 18:19:14 cs01 kernel: ? smp_call_function_single+0xa3/0xf0
Jan 30 18:19:14 cs01 kernel: find_get_entry+0x1d/0x110
Jan 30 18:19:14 cs01 kernel: pagecache_get_page+0x1f/0x240
Jan 30 18:19:14 cs01 kernel: lookup_swap_cache+0x34/0xe0
Jan 30 18:19:14 cs01 kernel: swap_readahead_detect+0x56/0x2c0
Jan 30 18:19:14 cs01 kernel: ? flush_tlb_mm_range+0x98/0xe0
Jan 30 18:19:14 cs01 kernel: do_swap_page+0x102/0x5c0
Jan 30 18:19:14 cs01 kernel: ? page_add_new_anon_rmap+0x4b/0x70
Jan 30 18:19:14 cs01 kernel: ? _raw_spin_unlock+0x8/0x10
Jan 30 18:19:14 cs01 kernel: ? wp_page_copy+0x259/0x460
Jan 30 18:19:14 cs01 kernel: ? kmap_atomic_prot+0x28/0xc0
Jan 30 18:19:14 cs01 kernel: handle_mm_fault+0x428/0x9b0
Jan 30 18:19:14 cs01 kernel: __do_page_fault+0x173/0x420
Jan 30 18:19:14 cs01 kernel: ? vmalloc_sync_all+0x10/0x10
Jan 30 18:19:14 cs01 kernel: do_page_fault+0xb/0x10
Jan 30 18:19:14 cs01 kernel: common_exception+0x38/0x3e
Jan 30 18:19:14 cs01 kernel: EIP: 0x805d5fa
Jan 30 18:19:14 cs01 kernel: EFLAGS: 00010286 CPU: 5
Jan 30 18:19:14 cs01 kernel: EAX: 00000000 EBX: 08109ff4 ECX: 00000000 EDX: 00000000
Jan 30 18:19:14 cs01 kernel: ESI: 08627550 EDI: 08622f18 EBP: b6f72368 ESP: b6f72300
Jan 30 18:19:14 cs01 kernel: DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
Jan 30 18:19:14 cs01 kernel: ? vmalloc_sync_all+0x10/0x10
Jan 30 18:19:14 cs01 kernel: Code: 90 8d 74 26 00 80 40 03 01 e9 7c ff ff ff 8d b4 26 00 00 00 00 0f 0b 8d b6 00 00 00 00 55 89 e5 57 56 53 89 c7 83 ec 04 89 4d f0 <8b> 5f 04 89 d8 83 e0 03 83 f8 01 75 67 89 d8 83 e0 fe 0f b6 08
Jan 30 18:19:14 cs01 kernel: EIP: __radix_tree_lookup+0xe/0xa0 SS:ESP: 0068:ef11fdec
Jan 30 18:19:14 cs01 kernel: CR2: 0000000000000008
Jan 30 18:19:14 cs01 kernel: ---[ end trace e2125a60a6ea383d ]---
Comment 18 David Strand 2018-02-12 22:10:30 UTC
Note there is no swap space in my environment. If this is helpful I can apply and run any further patches.
Comment 19 David Strand 2018-02-13 19:29:29 UTC
I just joined the linux-mm mailing list. If someone could reply to this topic on the list, I will post my findings via email. Thanks.
Comment 20 Christian Holpert 2018-02-17 20:29:43 UTC
The oops happens here on a Xen x86 32-bit domU running kernel 4.14.19-gentoo when I try to emerge net-dns/libidn-1.33-r2. The oops leaves a javac process hanging at full CPU usage.

I applied the patch in Comment 12. This lets me emerge net-dns/libidn-1.33-r2 without javac hanging.

The bug can be easily reproduced by emerging libidn again.



Feb 17 21:14:56 colin kernel: Bad swp_entry: 4000000
Feb 17 21:14:56 colin kernel: mm/swap_state.c:683: bad pte e2187f30(8000000200000000)
Feb 17 21:14:56 colin kernel: Bad swp_entry: 4000000
Feb 17 21:14:56 colin kernel: mm/swap_state.c:683: bad pte e584bf30(8000000200000000)
Feb 17 21:16:26 colin kernel: Bad swp_entry: 4000000
Feb 17 21:16:26 colin kernel: mm/swap_state.c:683: bad pte e210df30(8000000200000000)
Feb 17 21:23:21 colin kernel: Bad swp_entry: 4000000
Feb 17 21:23:21 colin kernel: mm/swap_state.c:683: bad pte dd5b1f30(8000000200000000)



oops (unpatched kernel):
Feb 17 21:08:57 colin kernel: BUG: unable to handle kernel NULL pointer dereference at 00000008
Feb 17 21:08:57 colin kernel: IP: __radix_tree_lookup+0xe/0xb0
Feb 17 21:08:57 colin kernel: *pdpt = 0000000021a03027 *pde = 0000000000000000
Feb 17 21:08:57 colin kernel: Oops: 0000 [#1] SMP
Feb 17 21:08:57 colin kernel: Modules linked in:
Feb 17 21:08:57 colin kernel: CPU: 5 PID: 10914 Comm: javac Not tainted 4.14.19-gentoo #4
Feb 17 21:08:57 colin kernel: task: e9fec0c0 task.stack: e1a9a000
Feb 17 21:08:57 colin kernel: EIP: __radix_tree_lookup+0xe/0xb0
Feb 17 21:08:57 colin kernel: EFLAGS: 00010296 CPU: 5
Feb 17 21:08:57 colin kernel: EAX: 00000004 EBX: 6612d000 ECX: 00000000 EDX: 00000000
Feb 17 21:08:57 colin kernel: ESI: 00000000 EDI: 00000004 EBP: e1a9bddc ESP: e1a9bdcc
Feb 17 21:08:57 colin kernel:  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Feb 17 21:08:57 colin kernel: CR0: 80050033 CR2: 00000008 CR3: 267e2000 CR4: 00042660
Feb 17 21:08:57 colin kernel: Call Trace:
Feb 17 21:08:57 colin kernel:  radix_tree_lookup_slot+0x11/0x30
Feb 17 21:08:57 colin kernel:  find_get_entry+0x1d/0xe0
Feb 17 21:08:57 colin kernel:  pagecache_get_page+0x1f/0x230
Feb 17 21:08:57 colin kernel:  lookup_swap_cache+0x35/0xf0
Feb 17 21:08:57 colin kernel:  swap_readahead_detect+0x4c/0x350
Feb 17 21:08:57 colin kernel:  ? flush_tlb_mm_range+0x91/0xe0
Feb 17 21:08:57 colin kernel:  do_swap_page+0x1ca/0x6d0
Feb 17 21:08:57 colin kernel:  ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
Feb 17 21:08:57 colin kernel:  ? wp_page_copy+0x2af/0x520
Feb 17 21:08:57 colin kernel:  ? xen_pmd_val+0x11/0x20
Feb 17 21:08:57 colin kernel:  handle_mm_fault+0x3b8/0x940
Feb 17 21:08:57 colin kernel:  __do_page_fault+0x178/0x400
Feb 17 21:08:57 colin kernel:  ? vmalloc_sync_all+0x250/0x250
Feb 17 21:08:57 colin kernel:  do_page_fault+0x1a/0x20
Feb 17 21:08:57 colin kernel:  common_exception+0x84/0x8a
Feb 17 21:08:57 colin kernel: EIP: 0xb744edf4
Feb 17 21:08:57 colin kernel: EFLAGS: 00010202 CPU: 5
Feb 17 21:08:57 colin kernel: EAX: 00000004 EBX: b77f6f28 ECX: 000284a2 EDX: 00001425
Feb 17 21:08:57 colin kernel: ESI: 6612d094 EDI: b7817148 EBP: 674ff0f8 ESP: 674ff0dc
Feb 17 21:08:57 colin kernel:  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
Feb 17 21:08:57 colin kernel: Code: 90 8d 74 26 00 80 41 03 01 eb a7 66 90 0f 0b 8d b6 00 00 00 00 0f 0b 8d b6 00 00 00 00 55 89 e5 57 89 c7 56 53 83 ec 04 89 4d f0 <8b> 77 04 89 f0 83 e0 03 83 f8 01 75 75 89 f0 83 e0 fe 0f b6 08
Feb 17 21:08:57 colin kernel: EIP: __radix_tree_lookup+0xe/0xb0 SS:ESP: 0069:e1a9bdcc
Feb 17 21:08:57 colin kernel: CR2: 0000000000000008
Feb 17 21:08:57 colin kernel: ---[ end trace f157259f300d3491 ]---
Comment 21 Christian Holpert 2018-02-17 20:33:51 UTC
additional info:

colin ~ # free -m
              total        used        free      shared  buff/cache   available
Mem:           4050         839        2430           0         780        3058
Swap:           959           0         959
colin ~ # swapon -s
Filename                                Type            Size    Used    Priority
/dev/xvda3                              partition       983036  0       -2
colin ~ #
Comment 22 jandryuk 2018-03-28 12:28:14 UTC
I see this issue with an OCaml binary on a Xen 32-bit PAE Dom0.  I use the patch from https://bugzilla.kernel.org/show_bug.cgi?id=198497#c12 as a band-aid to let the system keep running.  The bad PTEs occur inconsistently but often.  The systems have a single swap partition, but I don't think swap has been used.

Kaby Lake machine:
Bad swp_entry: 80000000
mm/swap_state.c:683: bad pte cdf9df1c(8000000400000000)
Different Boot:
Bad swp_entry: 80000000
mm/swap_state.c:683: bad pte ccf27f1c(8000000400000000)

Skylake:
Bad swp_entry: 80000000
mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
Different Boot:
Bad swp_entry: 80000000
mm/swap_state.c:683: bad pte ccb7df1c(8000000400000000)

Ivy Bridge:
Bad swp_entry: 40000000
mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
Bad swp_entry: 40000000
mm/swap_state.c:683: bad pte c7833f1c(8000000200000000)
Different boot:
Bad swp_entry: 40000000
mm/swap_state.c:683: bad pte c223df1c(8000000200000000)
Comment 23 Sudip 2018-04-12 16:27:21 UTC
Hi,
Any update on this please?

We are also facing the same problem on a 32-bit kernel with an Intel Atom on our production systems (with v4.14.2).


--
Regards
Sudip
Comment 24 Andrew Morton 2018-04-12 17:12:12 UTC
On Fri, 9 Feb 2018 06:47:26 -0800 Matthew Wilcox <willy@infradead.org> wrote:

> 
> ping?
> 

There have been a bunch of updates to this issue in bugzilla
(https://bugzilla.kernel.org/show_bug.cgi?id=198497).  Sigh, I don't
know what to do about this - maybe there's some way of getting bugzilla
to echo everything to linux-mm or something.

Anyway, please take a look - we appear to have a bug here.  Perhaps
this bug is sufficiently gnarly for you to prepare a debugging patch
which we can add to the mainline kernel so we get (much) more debugging
info when people hit it?
Comment 25 willy 2018-04-12 17:28:35 UTC
On Thu, Apr 12, 2018 at 10:12:09AM -0700, Andrew Morton wrote:
> On Fri, 9 Feb 2018 06:47:26 -0800 Matthew Wilcox <willy@infradead.org> wrote:
> 
> > 
> > ping?
> > 
> 
> There have been a bunch of updates to this issue in bugzilla
> (https://bugzilla.kernel.org/show_bug.cgi?id=198497).  Sigh, I don't
> know what to do about this - maybe there's some way of getting bugzilla
> to echo everything to linux-mm or something.
> 
> Anyway, please take a look - we appear to have a bug here.  Perhaps
> this bug is sufficiently gnarly for you to prepare a debugging patch
> which we can add to the mainline kernel so we get (much) more debugging
> info when people hit it?

I have a few thoughts ...

 - The debugging patch I prepared appears to be doing its job well.
   People get the message and their machine stays working.
 - The commonality appears to be Xen running 32-bit kernels.  Maybe we
   can kick the problem over to them to solve?
 - If we are seeing corruption purely in the lower bits, *we'll never
   know*.  The radix tree lookup will simply not find anything, and all
   will be well.  That said, the bad PTE values reported in that bug have
   the NX bit and one other bit set; generally bit 32, 33 or 34.  I have
   an idea for adding a parity bit, but haven't had time to implement it.
   Anyone have an intern who wants an interesting kernel project to work on?

Given that this is happening on Xen, I wonder if Xen is using some of the
bits in the page table for its own purposes.
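
For reference, a small C sketch of how the reported values can be checked
for the "NX plus one other bit" pattern.  It assumes only that bit 63 of a
PAE page table entry is NX; the sketch_* helpers are hypothetical and are
not kernel code.

```c
#include <stdint.h>

/* Bit 63 of a PAE/64-bit x86 page table entry is the NX (no-execute) bit. */
#define SKETCH_PTE_NX (UINT64_C(1) << 63)

/* Returns nonzero if the NX bit is set in the raw PTE value. */
static inline int sketch_pte_has_nx(uint64_t pte)
{
	return (pte & SKETCH_PTE_NX) != 0;
}

/*
 * With NX masked off, returns the index of the single remaining set bit,
 * or -1 if no bit or more than one bit remains.
 */
static inline int sketch_pte_stray_bit(uint64_t pte)
{
	uint64_t rest = pte & ~SKETCH_PTE_NX;
	int bit = 0;

	if (rest == 0 || (rest & (rest - 1)) != 0)
		return -1;
	while ((rest & 1) == 0) {
		rest >>= 1;
		bit++;
	}
	return bit;
}
```

Applied to the values reported in this bug, 8000000100000000,
8000000200000000 and 8000000400000000 yield stray bits 32, 33 and 34
respectively, each with NX set.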
Comment 26 Sudip 2018-04-12 17:44:57 UTC
(In reply to willy from comment #25)
> On Thu, Apr 12, 2018 at 10:12:09AM -0700, Andrew Morton wrote:
> > On Fri, 9 Feb 2018 06:47:26 -0800 Matthew Wilcox <willy@infradead.org>
> wrote:
> > 
> > > 
> > > ping?
> > > 
> > 
<snip>
> 
> Given that this is happening on Xen, I wonder if Xen is using some of the
> bits in the page table for its own purposes.

My system is not Xen. We do not have any type of virtualisation.
Comment 27 jandryuk 2018-04-20 13:10:18 UTC
On Thu, Apr 12, 2018 at 1:28 PM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=198497
>
> --- Comment #25 from willy@infradead.org ---
> On Thu, Apr 12, 2018 at 10:12:09AM -0700, Andrew Morton wrote:
>> On Fri, 9 Feb 2018 06:47:26 -0800 Matthew Wilcox <willy@infradead.org>
>> wrote:
>>
>> >
>> > ping?
>> >
>>
>> There have been a bunch of updates to this issue in bugzilla
>> (https://bugzilla.kernel.org/show_bug.cgi?id=198497).  Sigh, I don't
>> know what to do about this - maybe there's some way of getting bugzilla
>> to echo everything to linux-mm or something.
>>
>> Anyway, please take a look - we appear to have a bug here.  Perhaps
>> this bug is sufficiently gnarly for you to prepare a debugging patch
>> which we can add to the mainline kernel so we get (much) more debugging
>> info when people hit it?
>
> I have a few thoughts ...
>
>  - The debugging patch I prepared appears to be doing its job well.
>    People get the message and their machine stays working.
>  - The commonality appears to be Xen running 32-bit kernels.  Maybe we
>    can kick the problem over to them to solve?
>  - If we are seeing corruption purely in the lower bits, *we'll never
>    know*.  The radix tree lookup will simply not find anything, and all
>    will be well.  That said, the bad PTE values reported in that bug have
>    the NX bit and one other bit set; generally bit 32, 33 or 34.  I have
>    an idea for adding a parity bit, but haven't had time to implement it.
>    Anyone have an intern who wants an interesting kernel project to work on?
>
> Given that this is happening on Xen, I wonder if Xen is using some of the
> bits in the page table for its own purposes.

The backtraces include do_swap_page().  While I have a swap partition
configured, I don't think it's being used.  Are we somehow
misidentifying the page as a swap page?  I'm not familiar with the
code, but is there an easy way to query global swap usage?  That way
we can see if the check for a swap page is bogus.

My system works with the band-aid patch.  When that patch sets page =
NULL, does that mean userspace is just going to get a zero-ed page?
Userspace still works AFAICT, which makes me think it is a
mis-identified page to start with.

Regards,
Jason
Comment 28 willy 2018-04-20 13:39:53 UTC
On Fri, Apr 20, 2018 at 09:10:11AM -0400, Jason Andryuk wrote:
> > Given that this is happening on Xen, I wonder if Xen is using some of the
> > bits in the page table for its own purposes.
> 
> The backtraces include do_swap_page().  While I have a swap partition
> configured, I don't think it's being used.  Are we somehow
> misidentifying the page as a swap page?  I'm not familiar with the
> code, but is there an easy way to query global swap usage?  That way
> we can see if the check for a swap page is bogus.
> 
> My system works with the band-aid patch.  When that patch sets page =
> NULL, does that mean userspace is just going to get a zero-ed page?
> Userspace still works AFAICT, which makes me think it is a
> mis-identified page to start with.

Here's how this code works.

When we swap out an anonymous page (a page which is not backed by a
file; could be from a MAP_PRIVATE mapping, could be brk()), we write it
to the swap cache.  In order to be able to find it again, we store a
cookie (called a swp_entry_t) in the process' page table (marked with
the 'present' bit clear, so the CPU will fault on it).  When we get a
fault, we look up the cookie in a radix tree and bring that page back
in from swap.

If there's no page found in the radix tree, we put a freshly zeroed
page into the process's address space.  That's because we won't find
a page in the swap cache's radix tree for the first time we fault.
It's not an indication of a bug if there's no page to be found.

What we're seeing for this bug is page table entries of the format
0x8000'0004'0000'0000.  That would be a zeroed entry, except for the
fact that something's stepped on the upper bits.

What is worrying is that potentially Xen might be stepping on the upper
bits of either a present entry (leading to the process loading a page
that belongs to someone else) or an entry which has been swapped out,
leading to the process getting a zeroed page when it should be getting
its page back from swap.

Defending against this kind of corruption would take adding a parity
bit to the page tables.  That's not a project I have time for right now.
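
As a rough illustration of the parity idea only (not a proposed kernel
change), the following C sketch keeps a 64-bit entry at even parity.  The
choice of bit 62 as the parity bit and all sketch_* names are assumptions
made up for this example; a real implementation would have to pick a bit
that neither the hardware nor the hypervisor uses.

```c
#include <stdint.h>

/* Hypothetical choice: borrow bit 62 of the entry as the parity bit. */
#define SKETCH_PARITY_BIT (UINT64_C(1) << 62)

/* XOR-fold the value; returns 1 if an odd number of bits are set. */
static inline unsigned int sketch_parity64(uint64_t v)
{
	v ^= v >> 32;
	v ^= v >> 16;
	v ^= v >> 8;
	v ^= v >> 4;
	v ^= v >> 2;
	v ^= v >> 1;
	return (unsigned int)(v & 1u);
}

/* Set or clear the parity bit so the whole entry has even parity. */
static inline uint64_t sketch_seal(uint64_t entry)
{
	entry &= ~SKETCH_PARITY_BIT;
	if (sketch_parity64(entry))
		entry |= SKETCH_PARITY_BIT;
	return entry;
}

/* Nonzero if the entry still has even parity. */
static inline int sketch_check(uint64_t entry)
{
	return sketch_parity64(entry) == 0;
}
```

A single parity bit only catches an odd number of flipped bits.  An entry
that started as zero and gained both NX and one other bit, such as
0x8000'0004'0000'0000, has even parity and would pass this check, so the
scheme would need more than one check bit to catch that pattern.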
Comment 29 jandryuk 2018-04-20 15:20:29 UTC
Adding xen-devel and the Linux Xen maintainers.

Summary: Some Xen users (and maybe others) are hitting a BUG in
__radix_tree_lookup() under do_swap_page() - example backtrace is
provided at the end.  Matthew Wilcox provided a band-aid patch that
prints errors like the following instead of triggering the bug.

Skylake 32bit PAE Dom0:
Bad swp_entry: 80000000
mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)

Ivy Bridge 32bit PAE Dom0:
Bad swp_entry: 40000000
mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)

Other 32bit DomU:
Bad swp_entry: 4000000
mm/swap_state.c:683: bad pte e2187f30(8000000200000000)

Other 32bit:
Bad swp_entry: 2000000
mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)

The Linux bugzilla has more info
https://bugzilla.kernel.org/show_bug.cgi?id=198497

This may not be exclusive to Xen Linux, but most of the reports are on
Xen.  Matthew wonders if Xen might be stepping on the upper bits of a
pte.

On Fri, Apr 20, 2018 at 9:39 AM, Matthew Wilcox <willy@infradead.org> wrote:
> On Fri, Apr 20, 2018 at 09:10:11AM -0400, Jason Andryuk wrote:
>> > Given that this is happening on Xen, I wonder if Xen is using some of the
>> > bits in the page table for its own purposes.
>>
>> The backtraces include do_swap_page().  While I have a swap partition
>> configured, I don't think it's being used.  Are we somehow
>> misidentifying the page as a swap page?  I'm not familiar with the
>> code, but is there an easy way to query global swap usage?  That way
>> we can see if the check for a swap page is bogus.
>>
>> My system works with the band-aid patch.  When that patch sets page =
>> NULL, does that mean userspace is just going to get a zero-ed page?
>> Userspace still works AFAICT, which makes me think it is a
>> mis-identified page to start with.
>
> Here's how this code works.

Thanks for the description.

> When we swap out an anonymous page (a page which is not backed by a
> file; could be from a MAP_PRIVATE mapping, could be brk()), we write it
> to the swap cache.  In order to be able to find it again, we store a
> cookie (called a swp_entry_t) in the process' page table (marked with
> the 'present' bit clear, so the CPU will fault on it).  When we get a
> fault, we look up the cookie in a radix tree and bring that page back
> in from swap.
>
> If there's no page found in the radix tree, we put a freshly zeroed
> page into the process's address space.  That's because we won't find
> a page in the swap cache's radix tree for the first time we fault.
> It's not an indication of a bug if there's no page to be found.

Is "no page found" the case for a lazy, un-allocated MAP_ANONYMOUS page?

> What we're seeing for this bug is page table entries of the format
> 0x8000'0004'0000'0000.  That would be a zeroed entry, except for the
> fact that something's stepped on the upper bits.

Does a totally zero-ed entry correspond to an un-allocated MAP_ANONYMOUS page?

> What is worrying is that potentially Xen might be stepping on the upper
> bits of either a present entry (leading to the process loading a page
> that belongs to someone else) or an entry which has been swapped out,
> leading to the process getting a zeroed page when it should be getting
> its page back from swap.

There was at least one report of non-Xen 32bit being affected.  There
was no backtrace, so it could be something else.  One report doesn't
have any swap configured.

> Defending against this kind of corruption would take adding a parity
> bit to the page tables.  That's not a project I have time for right now.

Understood.  Thanks for the response.

Regards,
Jason


[ 2234.939079] BUG: unable to handle kernel NULL pointer dereference at 00000008
[ 2234.942154] IP: __radix_tree_lookup+0xe/0xa0
[ 2234.945176] *pdpt = 0000000008cd5027 *pde = 0000000000000000
[ 2234.948382] Oops: 0000 [#1] SMP
[ 2234.951410] Modules linked in: hp_wmi sparse_keymap rfkill wmi_bmof
pcspkr i915 wmi hp_accel lis3lv02d input_polldev drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops drm hp_wireless
i2c_algo_bit hid_multitouch sha256_generic xen_netfront v4v(O) psmouse
ecb xts hid_generic xhci_pci xhci_hcd ohci_pci ohci_hcd uhci_hcd
ehci_pci ehci_hcd usbhid hid tpm_tis tpm_tis_core tpm
[ 2234.960816] CPU: 1 PID: 2338 Comm: xenvm Tainted: G           O    4.14.18 #1
[ 2234.963991] Hardware name: Hewlett-Packard HP EliteBook Folio
9470m/18DF, BIOS 68IBD Ver. F.40 02/01/2013
[ 2234.967186] task: d4370980 task.stack: cf8e8000
[ 2234.970351] EIP: __radix_tree_lookup+0xe/0xa0
[ 2234.973520] EFLAGS: 00010286 CPU: 1
[ 2234.976699] EAX: 00000004 EBX: b5900000 ECX: 00000000 EDX: 00000000
[ 2234.979887] ESI: 00000000 EDI: 00000004 EBP: cf8e9dd0 ESP: cf8e9dc0
[ 2234.983081]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
[ 2234.986233] CR0: 80050033 CR2: 00000008 CR3: 08f12000 CR4: 00042660
[ 2234.989340] Call Trace:
[ 2234.992354]  radix_tree_lookup_slot+0x1d/0x50
[ 2234.995341]  ? xen_irq_disable_direct+0xc/0xc
[ 2234.998288]  find_get_entry+0x1d/0x110
[ 2235.001140]  pagecache_get_page+0x1f/0x240
[ 2235.003948]  ? xen_flush_tlb_others+0x17b/0x260
[ 2235.006784]  lookup_swap_cache+0x32/0xe0
[ 2235.009632]  swap_readahead_detect+0x67/0x2c0
[ 2235.012447]  do_swap_page+0x10a/0x750
[ 2235.015270]  ? wp_page_copy+0x2c4/0x590
[ 2235.018043]  ? xen_pmd_val+0x11/0x20
[ 2235.020729]  handle_mm_fault+0x3f8/0x970
[ 2235.023352]  ? xen_smp_send_reschedule+0xa/0x10
[ 2235.025927]  ? resched_curr+0x68/0xc0
[ 2235.028444]  __do_page_fault+0x1a7/0x480
[ 2235.030883]  do_page_fault+0x33/0x110
[ 2235.033250]  ? do_fast_syscall_32+0xb3/0x200
[ 2235.035567]  ? vmalloc_sync_all+0x290/0x290
[ 2235.037828]  common_exception+0x84/0x8a
[ 2235.040011] EIP: 0xb7c8ddea
[ 2235.042111] EFLAGS: 00010202 CPU: 1
[ 2235.044153] EAX: b7dd38d0 EBX: b7dd2780 ECX: b7dd2000 EDX: b5900010
[ 2235.046176] ESI: 00000000 EDI: b7dd38f0 EBP: b56ff124 ESP: b56ff070
[ 2235.048152]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[ 2235.050053] Code: 42 14 29 c6 89 f0 c1 f8 02 e9 71 ff ff ff e8 aa
81 aa ff 8d 76 00 8d bc 27 00 00 00 00 55 89 e5 57 89 c7 56 53 83 ec
04 89 4d f0 <8b> 5f 04 89 d8 83 e0 03 83 f8 01 75 67 89 d8 83 e0 fe 0f
b6 08
[ 2235.053998] EIP: __radix_tree_lookup+0xe/0xa0 SS:ESP: 0069:cf8e9dc0
[ 2235.055895] CR2: 0000000000000008
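
The faulting address in the oops above can be cross-checked against its Code: line. This is a sketch only, assuming the marked bytes <8b> 5f 04 are a 32-bit load of the form `mov r32, [reg + disp8]` (opcode 0x8b with ModRM mod=01). The helper decode_mov_load_disp8 is hypothetical and handles just that one encoding.

```python
REG_NAMES = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_mov_load_disp8(code, regs):
    """Decode `8b /r disp8` (mov r32, [base + disp8]) and return
    (destination register, base register, effective address)."""
    opcode, modrm, disp = code
    if opcode != 0x8B:
        raise ValueError("only opcode 0x8b is handled")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b01 or rm == 0b100:
        raise ValueError("only [reg + disp8] without a SIB byte is handled")
    if disp >= 0x80:          # disp8 is sign-extended
        disp -= 0x100
    base = REG_NAMES[rm]
    return REG_NAMES[reg], base, (regs[base] + disp) & 0xFFFF_FFFF

# Faulting bytes and the EDI value copied from the oops above.
dst, base, addr = decode_mov_load_disp8(
    [0x8B, 0x5F, 0x04], {"edi": 0x0000_0004})
print("mov 0x4(%%%s),%%%s -> address %#010x" % (base, dst, addr))
```

The computed address agrees with CR2 (00000008): the load went through a base pointer of 4 held in EDI, which is not a valid kernel address.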
Comment 30 Andrew Cooper 2018-04-20 15:35:46 UTC
On 20/04/18 16:20, Jason Andryuk wrote:
> Adding xen-devel and the Linux Xen maintainers.
>
> Summary: Some Xen users (and maybe others) are hitting a BUG in
> __radix_tree_lookup() under do_swap_page() - example backtrace is
> provided at the end.  Matthew Wilcox provided a band-aid patch that
> prints errors like the following instead of triggering the bug.
>
> Skylake 32bit PAE Dom0:
> Bad swp_entry: 80000000
> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
>
> Ivy Bridge 32bit PAE Dom0:
> Bad swp_entry: 40000000
> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
>
> Other 32bit DomU:
> Bad swp_entry: 4000000
> mm/swap_state.c:683: bad pte e2187f30(8000000200000000)
>
> Other 32bit:
> Bad swp_entry: 2000000
> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)
>
> The Linux bugzilla has more info
> https://bugzilla.kernel.org/show_bug.cgi?id=198497
>
> This may not be exclusive to Xen Linux, but most of the reports are on
> Xen.  Matthew wonders if Xen might be stepping on the upper bits of a
> pte.

Yes - Xen does use the upper bits of a PTE, but only 1 in release
builds, and a second in debug builds.  I don't understand where you're
getting the 3rd bit in there.

The use of these bits is dubious and not adequately described in the
ABI, and attempts to improve the state of play have come to nothing in
the past.

~Andrew
Comment 31 Andrew Cooper 2018-04-20 15:40:45 UTC
On 20/04/18 16:25, Andrew Cooper wrote:
> On 20/04/18 16:20, Jason Andryuk wrote:
>> Adding xen-devel and the Linux Xen maintainers.
>>
>> Summary: Some Xen users (and maybe others) are hitting a BUG in
>> __radix_tree_lookup() under do_swap_page() - example backtrace is
>> provided at the end.  Matthew Wilcox provided a band-aid patch that
>> prints errors like the following instead of triggering the bug.
>>
>> Skylake 32bit PAE Dom0:
>> Bad swp_entry: 80000000
>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
>>
>> Ivy Bridge 32bit PAE Dom0:
>> Bad swp_entry: 40000000
>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
>>
>> Other 32bit DomU:
>> Bad swp_entry: 4000000
>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000)
>>
>> Other 32bit:
>> Bad swp_entry: 2000000
>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)
>>
>> The Linux bugzilla has more info
>> https://bugzilla.kernel.org/show_bug.cgi?id=198497
>>
>> This may not be exclusive to Xen Linux, but most of the reports are on
>> Xen.  Matthew wonders if Xen might be stepping on the upper bits of a
>> pte.
> Yes - Xen does use the upper bits of a PTE, but only 1 in release
> builds, and a second in debug builds.  I don't understand where you're
> getting the 3rd bit in there.
>
> The use of these bits is dubious and not adequately described in the
> ABI, and attempts to improve the state of play have come to nothing in
> the past.

Sorry - hit send too early.  To be rather more helpful:

For 64bit guests only, we use one bit to distinguish between guest
kernel and guest user pages.  This is because both guest user and kernel
run in ring3, and have to have _PAGE_USER set on them.  We use bit 52 to
tag guest kernel mappings, which is seeded from the guest kernel's choice
of _PAGE_USER.

In debug builds of the hypervisor only, we use bit 62 to tag grant
mappings.  This is to help spot API errors in the guest, and results in
an instant crash if we spot misuse.

~Andrew
Comment 32 jandryuk 2018-04-20 15:52:14 UTC
On Fri, Apr 20, 2018 at 11:42 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 20.04.18 at 17:25, <andrew.cooper3@citrix.com> wrote:
>> On 20/04/18 16:20, Jason Andryuk wrote:
>>> Adding xen-devel and the Linux Xen maintainers.
>>>
>>> Summary: Some Xen users (and maybe others) are hitting a BUG in
>>> __radix_tree_lookup() under do_swap_page() - example backtrace is
>>> provided at the end.  Matthew Wilcox provided a band-aid patch that
>>> prints errors like the following instead of triggering the bug.
>>>
>>> Skylake 32bit PAE Dom0:
>>> Bad swp_entry: 80000000
>>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
>>>
>>> Ivy Bridge 32bit PAE Dom0:
>>> Bad swp_entry: 40000000
>>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
>>>
>>> Other 32bit DomU:
>>> Bad swp_entry: 4000000
>>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000)
>>>
>>> Other 32bit:
>>> Bad swp_entry: 2000000
>>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)
>>>
>>> The Linux bugzilla has more info
>>> https://bugzilla.kernel.org/show_bug.cgi?id=198497
>>>
>>> This may not be exclusive to Xen Linux, but most of the reports are on
>>> Xen.  Matthew wonders if Xen might be stepping on the upper bits of a
>>> pte.
>>
>> Yes - Xen does use the upper bits of a PTE, but only 1 in release
>> builds, and a second in debug builds.  I don't understand where you're
>> getting the 3rd bit in there.
>
> The former supposedly is _PAGE_GUEST_KERNEL, which we use for 64-bit
> guests only. Above talk is of 32-bit guests only.
>
> In addition both this and _PAGE_GNTTAB are used on present PTEs only,
> while above talk is about swap entries.

This hits a BUG going through do_swap_page, but it seems like users
don't think they are actually using swap at the time.  One reporter
didn't have any swap configured.  Some of this information was further
down in my original message.

I'm wondering if somehow we have a PTE that should be empty and should
be lazily filled.  For some reason, the entry has some bits set and is
causing the trouble.  Would Xen mess with the PTEs in that case?

Thanks,
Jason
Comment 33 Andrew Cooper 2018-04-20 16:00:54 UTC
On 20/04/18 16:52, Jason Andryuk wrote:
> On Fri, Apr 20, 2018 at 11:42 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 20.04.18 at 17:25, <andrew.cooper3@citrix.com> wrote:
>>> On 20/04/18 16:20, Jason Andryuk wrote:
>>>> Adding xen-devel and the Linux Xen maintainers.
>>>>
>>>> Summary: Some Xen users (and maybe others) are hitting a BUG in
>>>> __radix_tree_lookup() under do_swap_page() - example backtrace is
>>>> provided at the end.  Matthew Wilcox provided a band-aid patch that
>>>> prints errors like the following instead of triggering the bug.
>>>>
>>>> Skylake 32bit PAE Dom0:
>>>> Bad swp_entry: 80000000
>>>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
>>>>
>>>> Ivy Bridge 32bit PAE Dom0:
>>>> Bad swp_entry: 40000000
>>>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
>>>>
>>>> Other 32bit DomU:
>>>> Bad swp_entry: 4000000
>>>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000)
>>>>
>>>> Other 32bit:
>>>> Bad swp_entry: 2000000
>>>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)
>>>>
>>>> The Linux bugzilla has more info
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=198497
>>>>
>>>> This may not be exclusive to Xen Linux, but most of the reports are on
>>>> Xen.  Matthew wonders if Xen might be stepping on the upper bits of a
>>>> pte.
>>> Yes - Xen does use the upper bits of a PTE, but only 1 in release
>>> builds, and a second in debug builds.  I don't understand where you're
>>> getting the 3rd bit in there.
>> The former supposedly is _PAGE_GUEST_KERNEL, which we use for 64-bit
>> guests only. Above talk is of 32-bit guests only.
>>
>> In addition both this and _PAGE_GNTTAB are used on present PTEs only,
>> while above talk is about swap entries.
> This hits a BUG going through do_swap_page, but it seems like users
> don't think they are actually using swap at the time.  One reporter
> didn't have any swap configured.  Some of this information was further
> down in my original message.
>
> I'm wondering if somehow we have a PTE that should be empty and should
> be lazily filled.  For some reason, the entry has some bits set and is
> causing the trouble.  Would Xen mess with the PTEs in that case?

Any PTE with the present bit clear will be accepted and used
unmodified.  That said, I believe there is some batching of updates for
efficiency reasons in the PVops layer of the kernel, which might end up
causing a disconnect between what the swap system thinks, and what the
actual PTEs show when read.

~Andrew
Comment 34 David Strand 2018-04-20 16:02:04 UTC
My trace is up in comment 17. My kernel is 32 bit PAE, Not Xen, no swap enabled.

# uname -a
Linux is1 4.14.32 #4 SMP Fri Apr 6 16:35:18 PDT 2018 i686 i686 i386 GNU/Linux

# cat /proc/swaps
Filename                                Type            Size    Used    Priority

Apr 17 14:43:57 is1 kernel: Bad swp_entry: 2000000
Apr 17 14:43:57 is1 kernel: mm/swap_state.c:683: bad pte ed3e3f38(8000000100000000)

Also Sudip on comment #26 says his is not a xen system.
Comment 35 JBeulich 2018-04-20 16:02:27 UTC
>>> On 20.04.18 at 17:52, <jandryuk@gmail.com> wrote:
> On Fri, Apr 20, 2018 at 11:42 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 20.04.18 at 17:25, <andrew.cooper3@citrix.com> wrote:
>>> On 20/04/18 16:20, Jason Andryuk wrote:
>>>> Adding xen-devel and the Linux Xen maintainers.
>>>>
>>>> Summary: Some Xen users (and maybe others) are hitting a BUG in
>>>> __radix_tree_lookup() under do_swap_page() - example backtrace is
>>>> provided at the end.  Matthew Wilcox provided a band-aid patch that
>>>> prints errors like the following instead of triggering the bug.
>>>>
>>>> Skylake 32bit PAE Dom0:
>>>> Bad swp_entry: 80000000
>>>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
>>>>
>>>> Ivy Bridge 32bit PAE Dom0:
>>>> Bad swp_entry: 40000000
>>>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
>>>>
>>>> Other 32bit DomU:
>>>> Bad swp_entry: 4000000
>>>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000)
>>>>
>>>> Other 32bit:
>>>> Bad swp_entry: 2000000
>>>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)
>>>>
>>>> The Linux bugzilla has more info
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=198497 
>>>>
>>>> This may not be exclusive to Xen Linux, but most of the reports are on
>>>> Xen.  Matthew wonders if Xen might be stepping on the upper bits of a
>>>> pte.
>>>
>>> Yes - Xen does use the upper bits of a PTE, but only 1 in release
>>> builds, and a second in debug builds.  I don't understand where you're
>>> getting the 3rd bit in there.
>>
>> The former supposedly is _PAGE_GUEST_KERNEL, which we use for 64-bit
>> guests only. Above talk is of 32-bit guests only.
>>
>> In addition both this and _PAGE_GNTTAB are used on present PTEs only,
>> while above talk is about swap entries.
> 
> This hits a BUG going through do_swap_page, but it seems like users
> don't think they are actually using swap at the time.  One reporter
> didn't have any swap configured.  Some of this information was further
> down in my original message.
> 
> I'm wondering if somehow we have a PTE that should be empty and should
> be lazily filled.  For some reason, the entry has some bits set and is
> causing the trouble.  Would Xen mess with the PTEs in that case?

As said in my previous reply - both of the bits Andrew has mentioned can
only ever be set when the present bit is also set (which doesn't appear to
be the case here). The set bits above are actually in the range of bits
designated to the address, which Xen wouldn't ever play with.

Jan
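
The argument above can be checked mechanically against the PTE values reported in this thread. This is a sketch, assuming the usual x86 PAE layout (bit 0 present, bits 12-51 physical address, bit 63 NX) and the two Xen software bits as Andrew described them (bit 52 for guest-kernel mappings, bit 62 for grant mappings). The constant and function names are made up for illustration and are not the kernel's or Xen's definitions.

```python
PRESENT = 1 << 0
ADDR_MASK = ((1 << 52) - 1) & ~((1 << 12) - 1)   # bits 12..51
XEN_GUEST_KERNEL = 1 << 52   # per Andrew's description (64-bit guests only)
XEN_GNTTAB = 1 << 62         # per Andrew's description (debug builds only)
NX = 1 << 63

REPORTED = [
    0x8000000400000000,  # Skylake 32bit PAE Dom0
    0x8000000200000000,  # Ivy Bridge 32bit PAE Dom0, other 32bit DomU
    0x8000000100000000,  # other 32bit
]

def classify(pte):
    """Split a 64-bit PAE entry into the fields discussed in this thread."""
    known = PRESENT | ADDR_MASK | XEN_GUEST_KERNEL | XEN_GNTTAB | NX
    return {
        "present": bool(pte & PRESENT),
        "xen_bits": bool(pte & (XEN_GUEST_KERNEL | XEN_GNTTAB)),
        "nx": bool(pte & NX),
        "addr_bits": [b for b in range(12, 52) if (pte >> b) & 1],
        "other": pte & ~known,
    }

for pte in REPORTED:
    print("%#018x" % pte, classify(pte))
```

Under these assumptions none of the reported entries is present and none has bit 52 or bit 62 set. Apart from bit 63 (NX), the single stray bit (32, 33 or 34) falls inside the address field.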
Comment 36 JBeulich 2018-04-20 16:02:46 UTC
>>> On 20.04.18 at 17:25, <andrew.cooper3@citrix.com> wrote:
> On 20/04/18 16:20, Jason Andryuk wrote:
>> Adding xen-devel and the Linux Xen maintainers.
>>
>> Summary: Some Xen users (and maybe others) are hitting a BUG in
>> __radix_tree_lookup() under do_swap_page() - example backtrace is
>> provided at the end.  Matthew Wilcox provided a band-aid patch that
>> prints errors like the following instead of triggering the bug.
>>
>> Skylake 32bit PAE Dom0:
>> Bad swp_entry: 80000000
>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
>>
>> Ivy Bridge 32bit PAE Dom0:
>> Bad swp_entry: 40000000
>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
>>
>> Other 32bit DomU:
>> Bad swp_entry: 4000000
>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000)
>>
>> Other 32bit:
>> Bad swp_entry: 2000000
>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)
>>
>> The Linux bugzilla has more info
>> https://bugzilla.kernel.org/show_bug.cgi?id=198497 
>>
>> This may not be exclusive to Xen Linux, but most of the reports are on
>> Xen.  Matthew wonders if Xen might be stepping on the upper bits of a
>> pte.
> 
> Yes - Xen does use the upper bits of a PTE, but only 1 in release
> builds, and a second in debug builds.  I don't understand where you're
> getting the 3rd bit in there.

The former supposedly is _PAGE_GUEST_KERNEL, which we use for 64-bit
guests only. Above talk is of 32-bit guests only.

In addition both this and _PAGE_GNTTAB are used on present PTEs only,
while above talk is about swap entries.

Jan
Comment 37 boris.ostrovsky 2018-04-20 20:49:52 UTC
On 04/20/2018 12:02 PM, Jan Beulich wrote:
>>>> On 20.04.18 at 17:52, <jandryuk@gmail.com> wrote:
>> On Fri, Apr 20, 2018 at 11:42 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>> On 20.04.18 at 17:25, <andrew.cooper3@citrix.com> wrote:
>>>> On 20/04/18 16:20, Jason Andryuk wrote:
>>>>> Adding xen-devel and the Linux Xen maintainers.
>>>>>
>>>>> Summary: Some Xen users (and maybe others) are hitting a BUG in
>>>>> __radix_tree_lookup() under do_swap_page() - example backtrace is
>>>>> provided at the end.  Matthew Wilcox provided a band-aid patch that
>>>>> prints errors like the following instead of triggering the bug.
>>>>>
>>>>> Skylake 32bit PAE Dom0:
>>>>> Bad swp_entry: 80000000
>>>>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
>>>>>
>>>>> Ivy Bridge 32bit PAE Dom0:
>>>>> Bad swp_entry: 40000000
>>>>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
>>>>>
>>>>> Other 32bit DomU:
>>>>> Bad swp_entry: 4000000
>>>>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000)
>>>>>
>>>>> Other 32bit:
>>>>> Bad swp_entry: 2000000
>>>>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)
>>>>>
>>>>> The Linux bugzilla has more info
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=198497 
>>>>>
>>>>> This may not be exclusive to Xen Linux, but most of the reports are on
>>>>> Xen.  Matthew wonders if Xen might be stepping on the upper bits of a
>>>>> pte.
>>>> Yes - Xen does use the upper bits of a PTE, but only 1 in release
>>>> builds, and a second in debug builds.  I don't understand where you're
>>>> getting the 3rd bit in there.
>>> The former supposedly is _PAGE_GUEST_KERNEL, which we use for 64-bit
>>> guests only. Above talk is of 32-bit guests only.
>>>
>>> In addition both this and _PAGE_GNTTAB are used on present PTEs only,
>>> while above talk is about swap entries.
>> This hits a BUG going through do_swap_page, but it seems like users
>> don't think they are actually using swap at the time.  One reporter
>> didn't have any swap configured.  Some of this information was further
>> down in my original message.
>>
>> I'm wondering if somehow we have a PTE that should be empty and should
>> be lazily filled.  For some reason, the entry has some bits set and is
>> causing the trouble.  Would Xen mess with the PTEs in that case?
> As said in my previous reply - both of the bits Andrew has mentioned can
> only ever be set when the present bit is also set (which doesn't appear to
> be the case here). The set bits above are actually in the range of bits
> designated to the address, which Xen wouldn't ever play with.


The bug description starts with: "On a Xen VM running as pvh"

So is this a PV or a PVH guest?


-boris
Comment 38 Sudip 2018-04-20 21:09:42 UTC
(In reply to David Strand from comment #34)
> My trace is up in comment 17. My kernel is 32 bit PAE, Not Xen, no swap
> enabled.
> 
> # uname -a
> Linux is1 4.14.32 #4 SMP Fri Apr 6 16:35:18 PDT 2018 i686 i686 i386 GNU/Linux
> 
> # cat /proc/swaps
> Filename                                Type            Size    Used   
> Priority
> 
> Apr 17 14:43:57 is1 kernel: Bad swp_entry: 2000000
> Apr 17 14:43:57 is1 kernel: mm/swap_state.c:683: bad pte
> ed3e3f38(8000000100000000)
> 
> Also Sudip on comment #26 says his is not a xen system.

Should we then open a new bug report with the title non-xen systems and attach the same traces? :)
Comment 39 jgross 2018-04-21 06:33:46 UTC
On 20/04/18 21:20, Boris Ostrovsky wrote:
> On 04/20/2018 12:02 PM, Jan Beulich wrote:
>>>>> On 20.04.18 at 17:52, <jandryuk@gmail.com> wrote:
>>> On Fri, Apr 20, 2018 at 11:42 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>>> On 20.04.18 at 17:25, <andrew.cooper3@citrix.com> wrote:
>>>>> On 20/04/18 16:20, Jason Andryuk wrote:
>>>>>> Adding xen-devel and the Linux Xen maintainers.
>>>>>>
>>>>>> Summary: Some Xen users (and maybe others) are hitting a BUG in
>>>>>> __radix_tree_lookup() under do_swap_page() - example backtrace is
>>>>>> provided at the end.  Matthew Wilcox provided a band-aid patch that
>>>>>> prints errors like the following instead of triggering the bug.
>>>>>>
>>>>>> Skylake 32bit PAE Dom0:
>>>>>> Bad swp_entry: 80000000
>>>>>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
>>>>>>
>>>>>> Ivy Bridge 32bit PAE Dom0:
>>>>>> Bad swp_entry: 40000000
>>>>>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
>>>>>>
>>>>>> Other 32bit DomU:
>>>>>> Bad swp_entry: 4000000
>>>>>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000)
>>>>>>
>>>>>> Other 32bit:
>>>>>> Bad swp_entry: 2000000
>>>>>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)
>>>>>>
>>>>>> The Linux bugzilla has more info
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=198497 
>>>>>>
>>>>>> This may not be exclusive to Xen Linux, but most of the reports are on
>>>>>> Xen.  Matthew wonders if Xen might be stepping on the upper bits of a
>>>>>> pte.
>>>>> Yes - Xen does use the upper bits of a PTE, but only 1 in release
>>>>> builds, and a second in debug builds.  I don't understand where you're
>>>>> getting the 3rd bit in there.
>>>> The former supposedly is _PAGE_GUEST_KERNEL, which we use for 64-bit
>>>> guests only. Above talk is of 32-bit guests only.
>>>>
>>>> In addition both this and _PAGE_GNTTAB are used on present PTEs only,
>>>> while above talk is about swap entries.
>>> This hits a BUG going through do_swap_page, but it seems like users
>>> don't think they are actually using swap at the time.  One reporter
>>> didn't have any swap configured.  Some of this information was further
>>> down in my original message.
>>>
>>> I'm wondering if somehow we have a PTE that should be empty and should
>>> be lazily filled.  For some reason, the entry has some bits set and is
>>> causing the trouble.  Would Xen mess with the PTEs in that case?
>> As said in my previous reply - both of the bits Andrew has mentioned can
>> only ever be set when the present bit is also set (which doesn't appear to
>> be the case here). The set bits above are actually in the range of bits
>> designated to the address, which Xen wouldn't ever play with.
> 
> 
> The bug description starts with: "On a Xen VM running as pvh"
> 
> So is this a PV or a PVH guest?

The stack backtrace suggests PV.


Juergen
Comment 40 Sudip 2018-04-21 09:02:01 UTC
I am really sorry, I mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=198497#c26 that I have the same trace on non-Xen hardware. But I missed posting my trace since it looked the same.


[  154.782681] BUG: unable to handle kernel NULL pointer dereference at 00000008
[  155.249221] task: ec144a40 task.stack: ec178000
[  155.269253] EIP: radix_tree_load_root+0x4/0x38
[  155.274214] EFLAGS: 00010296 CPU: 1
[  155.278107] EAX: 00000004 EBX: 00000008 ECX: ec179ddc EDX: ec179dd8
[  155.285108] ESI: 00000000 EDI: 00000000 EBP: ec179dcc ESP: ec179dc8
[  155.292110]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  155.298140] CR0: 80050033 CR2: 00000008 CR3: 3458bee0 CR4: 001006f0
[  155.305133] Call Trace:
[  155.307866]  __radix_tree_lookup+0x21/0x78
[  155.312442]  radix_tree_lookup_slot+0xf/0x1d
[  155.317213]  find_get_entry+0x20/0xb6
[  155.321303]  pagecache_get_page+0x1f/0x1a7
[  155.325880]  lookup_swap_cache+0x30/0xc0
[  155.330260]  swap_readahead_detect+0x66/0x20e
[  155.335127]  do_swap_page+0x2c/0x614
[  155.339120]  ? page_add_file_rmap+0x3e/0x67
[  155.343816]  ? alloc_set_pte+0x206/0x269
[  155.348196]  ? filemap_map_pages+0x55/0x260
[  155.352869]  handle_mm_fault+0x874/0x9b0
[  155.357253]  __do_page_fault+0x1ce/0x376
[  155.361634]  ? vmalloc_sync_all+0x5/0x5
[  155.365946]  do_page_fault+0xb/0xd
[  155.369744]  common_exception+0x62/0x6a
[  155.374027] EIP: 0xb78c527f
[  155.377142] EFLAGS: 00010286 CPU: 1
[  155.381044] EAX: b3546114 EBX: b78dc644 ECX: 00000001 EDX: b34730d4
[  155.388038] ESI: 09ced960 EDI: 00006e67 EBP: b22ffd08 ESP: b22ffc90
[  155.395041]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[  155.401074]  ? vmalloc_sync_all+0x5/0x5
[  155.405355] Code: 42 08 00 00 00 00 89 e5 40 89 02 89 42 04 31 c0 5d c3 55 8d 4a 1a 89 e5 53 bb 01 00 00 00 d3 e3 23 18 89 d8 5b 5d c3 55 89 e5 53 <8b> 40 04 89 cb 89 02 89 c2 83 e2 03 4a 75 1a 83 e0 fe 0f b6 08
[  155.426416] EIP: radix_tree_load_root+0x4/0x38 SS:ESP: 0068:ec179dc8
[  155.433515] CR2: 0000000000000008
[  155.437208] ---[ end trace 1a562e195c40a78c ]---
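
One possible reading of EAX = 00000004 together with the fault address 00000008 is that the lookup started from a NULL struct address_space pointer. The sketch below models this with ctypes. The field order and sizes are assumptions based on how struct address_space and struct radix_tree_root are commonly laid out on a 32-bit 4.14 kernel; they are not taken from this thread, and the class names are hypothetical.

```python
import ctypes

class RadixTreeRoot(ctypes.Structure):
    # Assumed 32-bit layout: gfp mask first, then the root node pointer.
    _fields_ = [("gfp_mask", ctypes.c_uint32),
                ("rnode", ctypes.c_uint32)]       # 32-bit pointer

class AddressSpaceHead(ctypes.Structure):
    # Assumed first two members of struct address_space on 32-bit.
    _fields_ = [("host", ctypes.c_uint32),        # 32-bit pointer
                ("page_tree", RadixTreeRoot)]

mapping = 0  # hypothetical NULL struct address_space pointer

# Address of the embedded radix tree root, then of its rnode field.
root = mapping + AddressSpaceHead.page_tree.offset
fault = root + RadixTreeRoot.rnode.offset

print("root pointer: %#010x" % root)
print("address read: %#010x" % fault)
```

If the layout assumption holds, a NULL mapping gives a root pointer of 4 and a read at 8, which would be consistent with the EAX and CR2 values in the trace. It does not by itself explain why the swap entry selected a missing address space.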
Comment 41 willy 2018-04-21 14:35:16 UTC
On Fri, Apr 20, 2018 at 10:02:29AM -0600, Jan Beulich wrote:
> >>>> Skylake 32bit PAE Dom0:
> >>>> Bad swp_entry: 80000000
> >>>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
> >>>>
> >>>> Ivy Bridge 32bit PAE Dom0:
> >>>> Bad swp_entry: 40000000
> >>>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
> >>>>
> >>>> Other 32bit DomU:
> >>>> Bad swp_entry: 4000000
> >>>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000)
> >>>>
> >>>> Other 32bit:
> >>>> Bad swp_entry: 2000000
> >>>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)

> As said in my previous reply - both of the bits Andrew has mentioned can
> only ever be set when the present bit is also set (which doesn't appear to
> be the case here). The set bits above are actually in the range of bits
> designated to the address, which Xen wouldn't ever play with.

Is it relevant that all the crashes we've seen are with PAE in the guest?
Is it possible that Xen thinks the guest is not using PAE?
Comment 42 jgross 2018-04-22 05:50:27 UTC
On 21/04/18 16:35, Matthew Wilcox wrote:
> On Fri, Apr 20, 2018 at 10:02:29AM -0600, Jan Beulich wrote:
>>>>>> Skylake 32bit PAE Dom0:
>>>>>> Bad swp_entry: 80000000
>>>>>> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
>>>>>>
>>>>>> Ivy Bridge 32bit PAE Dom0:
>>>>>> Bad swp_entry: 40000000
>>>>>> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
>>>>>>
>>>>>> Other 32bit DomU:
>>>>>> Bad swp_entry: 4000000
>>>>>> mm/swap_state.c:683: bad pte e2187f30(8000000200000000)
>>>>>>
>>>>>> Other 32bit:
>>>>>> Bad swp_entry: 2000000
>>>>>> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)
> 
>> As said in my previous reply - both of the bits Andrew has mentioned can
>> only ever be set when the present bit is also set (which doesn't appear to
>> be the case here). The set bits above are actually in the range of bits
>> designated to the address, which Xen wouldn't ever play with.
> 
> Is it relevant that all the crashes we've seen are with PAE in the guest?
> Is it possible that Xen thinks the guest is not using PAE?
> 

All Xen 32-bit PV guests are using PAE. It's part of the PV ABI.


Juergen
Comment 43 jgross 2018-04-23 08:17:19 UTC
On 20/04/18 17:20, Jason Andryuk wrote:
> Adding xen-devel and the Linux Xen maintainers.
> 
> Summary: Some Xen users (and maybe others) are hitting a BUG in
> __radix_tree_lookup() under do_swap_page() - example backtrace is
> provided at the end.  Matthew Wilcox provided a band-aid patch that
> prints errors like the following instead of triggering the bug.
> 
> Skylake 32bit PAE Dom0:
> Bad swp_entry: 80000000
> mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
> 
> Ivy Bridge 32bit PAE Dom0:
> Bad swp_entry: 40000000
> mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
> 
> Other 32bit DomU:
> Bad swp_entry: 4000000
> mm/swap_state.c:683: bad pte e2187f30(8000000200000000)
> 
> Other 32bit:
> Bad swp_entry: 2000000
> mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)
> 
> The Linux bugzilla has more info
> https://bugzilla.kernel.org/show_bug.cgi?id=198497
> 
> This may not be exclusive to Xen Linux, but most of the reports are on
> Xen.  Matthew wonders if Xen might be stepping on the upper bits of a
> pte.
> 
> On Fri, Apr 20, 2018 at 9:39 AM, Matthew Wilcox <willy@infradead.org> wrote:
>> On Fri, Apr 20, 2018 at 09:10:11AM -0400, Jason Andryuk wrote:
>>>> Given that this is happening on Xen, I wonder if Xen is using some of the
>>>> bits in the page table for its own purposes.
>>>
>>> The backtraces include do_swap_page().  While I have a swap partition
>>> configured, I don't think it's being used.  Are we somehow
>>> misidentifying the page as a swap page?  I'm not familiar with the
>>> code, but is there an easy way to query global swap usage?  That way
>>> we can see if the check for a swap page is bogus.
>>>
>>> My system works with the band-aid patch.  When that patch sets page =
>>> NULL, does that mean userspace is just going to get a zero-ed page?
>>> Userspace still works AFAICT, which makes me think it is a
>>> mis-identified page to start with.
>>
>> Here's how this code works.
> 
> Thanks for the description.
> 
>> When we swap out an anonymous page (a page which is not backed by a
>> file; could be from a MAP_PRIVATE mapping, could be brk()), we write it
>> to the swap cache.  In order to be able to find it again, we store a
>> cookie (called a swp_entry_t) in the process' page table (marked with
>> the 'present' bit clear, so the CPU will fault on it).  When we get a
>> fault, we look up the cookie in a radix tree and bring that page back
>> in from swap.
>>
>> If there's no page found in the radix tree, we put a freshly zeroed
>> page into the process's address space.  That's because we won't find
>> a page in the swap cache's radix tree for the first time we fault.
>> It's not an indication of a bug if there's no page to be found.
> 
> Is "no page found" the case for a lazy, un-allocated MAP_ANONYMOUS page?
> 
>> What we're seeing for this bug is page table entries of the format
>> 0x8000'0004'0000'0000.  That would be a zeroed entry, except for the
>> fact that something's stepped on the upper bits.
> 
> Does a totally zero-ed entry correspond to an un-allocated MAP_ANONYMOUS
> page?
> 
>> What is worrying is that potentially Xen might be stepping on the upper
>> bits of either a present entry (leading to the process loading a page
>> that belongs to someone else) or an entry which has been swapped out,
>> leading to the process getting a zeroed page when it should be getting
>> its page back from swap.
> 
> There was at least one report of non-Xen 32bit being affected.  There
> was no backtrace, so it could be something else.  One report doesn't
> have any swap configured.
> 
>> Defending against this kind of corruption would take adding a parity
>> bit to the page tables.  That's not a project I have time for right now.
> 
> Understood.  Thanks for the response.
> 
> Regards,
> Jason
> 
> 
> [ 2234.939079] BUG: unable to handle kernel NULL pointer dereference at 00000008
> [ 2234.942154] IP: __radix_tree_lookup+0xe/0xa0
> [ 2234.945176] *pdpt = 0000000008cd5027 *pde = 0000000000000000
> [ 2234.948382] Oops: 0000 [#1] SMP
> [ 2234.951410] Modules linked in: hp_wmi sparse_keymap rfkill wmi_bmof pcspkr i915 wmi hp_accel lis3lv02d input_polldev drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm hp_wireless i2c_algo_bit hid_multitouch sha256_generic xen_netfront v4v(O) psmouse ecb xts hid_generic xhci_pci xhci_hcd ohci_pci ohci_hcd uhci_hcd ehci_pci ehci_hcd usbhid hid tpm_tis tpm_tis_core tpm
> [ 2234.960816] CPU: 1 PID: 2338 Comm: xenvm Tainted: G           O    4.14.18 #1
> [ 2234.963991] Hardware name: Hewlett-Packard HP EliteBook Folio 9470m/18DF, BIOS 68IBD Ver. F.40 02/01/2013
> [ 2234.967186] task: d4370980 task.stack: cf8e8000
> [ 2234.970351] EIP: __radix_tree_lookup+0xe/0xa0
> [ 2234.973520] EFLAGS: 00010286 CPU: 1
> [ 2234.976699] EAX: 00000004 EBX: b5900000 ECX: 00000000 EDX: 00000000
> [ 2234.979887] ESI: 00000000 EDI: 00000004 EBP: cf8e9dd0 ESP: cf8e9dc0
> [ 2234.983081]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
> [ 2234.986233] CR0: 80050033 CR2: 00000008 CR3: 08f12000 CR4: 00042660
> [ 2234.989340] Call Trace:
> [ 2234.992354]  radix_tree_lookup_slot+0x1d/0x50
> [ 2234.995341]  ? xen_irq_disable_direct+0xc/0xc
> [ 2234.998288]  find_get_entry+0x1d/0x110
> [ 2235.001140]  pagecache_get_page+0x1f/0x240
> [ 2235.003948]  ? xen_flush_tlb_others+0x17b/0x260
> [ 2235.006784]  lookup_swap_cache+0x32/0xe0
> [ 2235.009632]  swap_readahead_detect+0x67/0x2c0
> [ 2235.012447]  do_swap_page+0x10a/0x750
> [ 2235.015270]  ? wp_page_copy+0x2c4/0x590
> [ 2235.018043]  ? xen_pmd_val+0x11/0x20
> [ 2235.020729]  handle_mm_fault+0x3f8/0x970
> [ 2235.023352]  ? xen_smp_send_reschedule+0xa/0x10
> [ 2235.025927]  ? resched_curr+0x68/0xc0
> [ 2235.028444]  __do_page_fault+0x1a7/0x480
> [ 2235.030883]  do_page_fault+0x33/0x110
> [ 2235.033250]  ? do_fast_syscall_32+0xb3/0x200
> [ 2235.035567]  ? vmalloc_sync_all+0x290/0x290
> [ 2235.037828]  common_exception+0x84/0x8a
> [ 2235.040011] EIP: 0xb7c8ddea
> [ 2235.042111] EFLAGS: 00010202 CPU: 1
> [ 2235.044153] EAX: b7dd38d0 EBX: b7dd2780 ECX: b7dd2000 EDX: b5900010
> [ 2235.046176] ESI: 00000000 EDI: b7dd38f0 EBP: b56ff124 ESP: b56ff070
> [ 2235.048152]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
> [ 2235.050053] Code: 42 14 29 c6 89 f0 c1 f8 02 e9 71 ff ff ff e8 aa 81 aa ff 8d 76 00 8d bc 27 00 00 00 00 55 89 e5 57 89 c7 56 53 83 ec 04 89 4d f0 <8b> 5f 04 89 d8 83 e0 03 83 f8 01 75 67 89 d8 83 e0 fe 0f b6 08
> [ 2235.053998] EIP: __radix_tree_lookup+0xe/0xa0 SS:ESP: 0069:cf8e9dc0
> [ 2235.055895] CR2: 0000000000000008
> 

Could it be we just have a race regarding pte_clear()? This will set
the low part of the pte to zero first and then the high part.

If pte_clear() is used in interrupt mode, Xen in particular will be
rather slow because it emulates the two writes to the page table, which
widens the window in which the race can happen.
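
The suspected window can be illustrated with a small userspace model. This
is only a sketch, not kernel code: model_pte, classify() and the two
clear_*_half() helpers are hypothetical names, the starting entry is a
made-up value, and the present/swap test is simplified to "bit 0 set"
versus "non-zero with bit 0 clear".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 32-bit PAE PTE as two 32-bit halves. */
typedef struct {
    uint32_t low;   /* bit 0 models the 'present' bit */
    uint32_t high;  /* bit 31 models NX (bit 63 of the full entry) */
} model_pte;

static_assert(sizeof(model_pte) == 8, "a PAE PTE is 64 bits wide");

enum pte_kind { PTE_NONE, PTE_PRESENT, PTE_LOOKS_LIKE_SWAP };

/* Combine both halves into the 64-bit value an observer would read. */
static uint64_t pte_val(model_pte p)
{
    return ((uint64_t)p.high << 32) | p.low;
}

/* Simplified classification: all-zero is "none", bit 0 set is
 * "present", anything else is treated as a swap entry. */
static enum pte_kind classify(model_pte p)
{
    if (pte_val(p) == 0)
        return PTE_NONE;
    if (p.low & 1u)
        return PTE_PRESENT;
    return PTE_LOOKS_LIKE_SWAP;
}

/* First write of a two-write clear: only the low half becomes zero. */
static model_pte clear_low_half(model_pte p)
{
    p.low = 0;
    return p;
}

/* Second write of a two-write clear: the high half becomes zero. */
static model_pte clear_high_half(model_pte p)
{
    p.high = 0;
    return p;
}
```

Starting from an assumed present entry such as 0x8000000412345067, the
state after clear_low_half() reads as 0x8000000400000000, which is
non-zero with the present bit clear and so looks like a swap entry to
anything that reads it before clear_high_half() runs.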


Juergen
Comment 44 jandryuk 2018-09-04 12:54:47 UTC
On Mon, Apr 23, 2018 at 4:17 AM Juergen Gross <jgross@suse.com> wrote:
> On 20/04/18 17:20, Jason Andryuk wrote:
> > Adding xen-devel and the Linux Xen maintainers.
> >
> > Summary: Some Xen users (and maybe others) are hitting a BUG in
> > __radix_tree_lookup() under do_swap_page() - example backtrace is
> > provided at the end.  Matthew Wilcox provided a band-aid patch that
> > prints errors like the following instead of triggering the bug.
> >
> > Skylake 32bit PAE Dom0:
> > Bad swp_entry: 80000000
> > mm/swap_state.c:683: bad pte d3a39f1c(8000000400000000)
> >
> > Ivy Bridge 32bit PAE Dom0:
> > Bad swp_entry: 40000000
> > mm/swap_state.c:683: bad pte d3a05f1c(8000000200000000)
> >
> > Other 32bit DomU:
> > Bad swp_entry: 4000000
> > mm/swap_state.c:683: bad pte e2187f30(8000000200000000)
> >
> > Other 32bit:
> > Bad swp_entry: 2000000
> > mm/swap_state.c:683: bad pte ef3a3f38(8000000100000000)
> >
> > The Linux bugzilla has more info
> > https://bugzilla.kernel.org/show_bug.cgi?id=198497
> >
> > This may not be exclusive to Xen Linux, but most of the reports are on
> > Xen.  Matthew wonders if Xen might be stepping on the upper bits of a
> > pte.
> >
<snip>
>
> Could it be we just have a race regarding pte_clear()? This will set
> the low part of the pte to zero first and then the high part.
>
> If pte_clear() is used in interrupt mode, Xen in particular will be
> rather slow because it emulates the two writes to the page table, which
> widens the window in which the race can happen.

It looks like Juergen was correct.  With the L1TF vulnerability, the
Xen hypervisor needs to detect vulnerable PTEs.  For 32bit PAE, Xen
would trap on PTEs like 0x8000'0002'0000'0000, the same format as
seen in this bug.  He wrote two patches for Linux, now upstream, that
write PTEs with 64bit operations or hypercalls and so avoid the
invalid intermediate PTEs:
f7c90c2aa400 "x86/xen: don't write ptes directly in 32-bit PV guests"
b2d7a075a1cc "x86/pae: use 64 bit atomic xchg function in native_ptep_get_and_clear"
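
For comparison, here is a rough userspace analogue of the idea in the
second patch title, clearing the entry with one 64-bit atomic exchange so
that no half-cleared value is ever visible. It is a sketch using C11
atomics, not the code from the patch; model_pte_slot and
model_ptep_get_and_clear() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit PAE page table slot. */
typedef _Atomic uint64_t model_pte_slot;

static_assert(sizeof(uint64_t) == 8, "entry must be 64 bits wide");

/* Return the old entry and leave zero behind in a single indivisible
 * 64-bit exchange; a concurrent reader sees either the old value or
 * zero, never a mix of the two halves. */
static uint64_t model_ptep_get_and_clear(model_pte_slot *slot)
{
    return atomic_exchange(slot, (uint64_t)0);
}
```

With this shape, a reader that loads the slot atomically can only
classify the entry as the original mapping or as empty, so the
0x8000'000x'0000'0000 intermediate pattern cannot be produced by the
clear itself.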

With those patches, I have not seen a "Bad swp_entry", so this seems
fixed for me on Xen.

There was also a report of a non-Xen kernel being affected.  Is there
an underlying problem that native PAE code updates PTEs in two writes,
but there is no locking to prevent the intermediate PTE from being
used elsewhere in the kernel?

Regards,
Jason