Hi all,

With kernel 5.19.x we noticed system freezes. This happens in virtual
environments as well as on real hardware. On a real hardware machine we were
able to catch the moment of the freeze with continuous profiling.

Specification of the machine where we captured the freeze:
Thinkpad T14
CPU: AMD Ryzen 7 PRO 4750U
Kernel: 5.19.8-200.fc36.x86_64

Stacktrace of kworker/12:3 that is using all resources and causing the freeze:

 #  Source Location                           Function Name                     Function Line
 0  arch/x86/include/asm/vdso/processor.h:13  rep_nop                           11
 1  arch/x86/include/asm/vdso/processor.h:18  cpu_relax                         16
 2  kernel/locking/qspinlock.c:514            native_queued_spin_lock_slowpath  316
 3  kernel/locking/qspinlock.c:316            native_queued_spin_lock_slowpath  N/A
 4  arch/x86/include/asm/paravirt.h:591       pv_queued_spin_lock_slowpath      588
 5  arch/x86/include/asm/qspinlock.h:51       queued_spin_lock_slowpath         49
 6  include/asm-generic/qspinlock.h:114       queued_spin_lock                  107
 7  include/linux/spinlock.h:185              do_raw_spin_lock                  182
 8  include/linux/spinlock_api_smp.h:134      __raw_spin_lock                   130
 9  kernel/locking/spinlock.c:154             _raw_spin_lock                    152
10  include/linux/spinlock.h:349              spin_lock                         347
11  mm/vmalloc.c:1805                         find_vmap_area                    1801
12  mm/vmalloc.c:2525                         find_vm_area                      2521
13  mm/vmalloc.c:2639                         __vunmap                          2628
14  mm/vmalloc.c:97                           free_work                         91
15  kernel/workqueue.c:2289                   process_one_work                  2181
16  kernel/workqueue.c:2436                   worker_thread                     2378
17  kernel/kthread.c:376                      kthread                           330
18  N/A                                       ret_from_fork                     N/A

The functions in the above stacktrace hardly ever change. There is only one
commit, 993d0b287e2ef7bee2e8b13b0ce4d2b5066f278e, which introduces changes to
find_vmap_area() for 5.19.

With this change in mind, we looked for stacktraces that also exercise this
new commit.
And in a different kernel thread we do notice the use of check_heap_object():

 #  Source Location                      Function Name                    Function Line
 0  arch/x86/include/asm/paravirt.h:704  arch_local_irq_enable            702
 1  arch/x86/include/asm/irqflags.h:138  arch_local_irq_restore           135
 2  kernel/sched/sched.h:1330            raw_spin_rq_unlock_irqrestore    1327
 3  kernel/sched/sched.h:1327            raw_spin_rq_unlock_irqrestore    N/A
 4  kernel/sched/sched.h:1611            rq_unlock_irqrestore             1607
 5  kernel/sched/fair.c:8288             update_blocked_averages          8272
 6  kernel/sched/fair.c:11133            run_rebalance_domains            11115
 7  kernel/softirq.c:571                 __do_softirq                     528
 8  kernel/softirq.c:445                 invoke_softirq                   433
 9  kernel/softirq.c:650                 __irq_exit_rcu                   640
10  arch/x86/kernel/apic/apic.c:1106     sysvec_apic_timer_interrupt      N/A
11  N/A                                  asm_sysvec_apic_timer_interrupt  N/A
12  include/linux/mmzone.h:1403          __nr_to_section                  1395
13  include/linux/mmzone.h:1488          __pfn_to_section                 1486
14  include/linux/mmzone.h:1539          pfn_valid                        1524
15  arch/x86/mm/physaddr.c:65            __virt_addr_valid                47
16  mm/usercopy.c:188                    check_heap_object                161
17  mm/usercopy.c:250                    __check_object_size              212
18  mm/usercopy.c:212                    __check_object_size              N/A
19  include/linux/thread_info.h:199      check_object_size                195
20  lib/strncpy_from_user.c:137          strncpy_from_user                113
21  fs/namei.c:150                       getname_flags                    129
22  fs/namei.c:2896                      user_path_at_empty               2893
23  include/linux/namei.h:57             user_path_at                     54
24  fs/open.c:446                        do_faccessat                     420
25  arch/x86/entry/common.c:50           do_syscall_x64                   40
26  arch/x86/entry/common.c:80           do_syscall_64                    73
27  N/A                                  entry_SYSCALL_64_after_hwframe   N/A

We are neither experts in the mm subsystem nor can we provide a fix, but we
wanted to let you know about our findings.

Cheers,
Florian
(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 14 Sep 2022 15:07:46 +0000 bugzilla-daemon@kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=216489
>
>            Bug ID: 216489
>           Summary: Machine freezes due to memory lock
>           Product: Memory Management
>           Version: 2.5
>    Kernel Version: 5.19.8
>          Hardware: AMD
>                OS: Linux
>              Tree: Mainline
>            Status: NEW
>          Severity: high
>          Priority: P1
>         Component: Other
>          Assignee: akpm@linux-foundation.org
>          Reporter: dev@der-flo.net
>        Regression: No
>
> Hi all,
> With Kernel 5.19.x we noticed system freezes. This happens in virtual
> environments as well as on real hardware.
> On a real hardware machine we were able to catch the moment of freeze with
> continuous profiling.

Thanks.  I forwarded this to Uladzislau and he offered to help.  He said:

: I can help with debugging. What I need are reproduction steps. Could you
: please clarify if it is easy to hit and what kind of profiling triggers it?

and

: I do not think that Matthew Wilcox's commits destroy it but... I see that
: __vunmap() is invoked by free_work(), thus a caller may be in atomic
: context, including IRQ context.
> Specification of the machine where we captured the freeze:
> Thinkpad T14
> CPU: AMD Ryzen 7 PRO 4750U
> Kernel: 5.19.8-200.fc36.x86_64
>
> [two stack traces and commit analysis snipped; see the original report]
>
> We are neither experts in the mm subsystem nor can provide a fix, but wanted
> to let you know about our findings.
>
> Cheers,
> Florian
On Thu, Sep 15, 2022 at 01:39:31PM -0700, Andrew Morton wrote:
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> [quoted bug report snipped]
>
> Thanks.  I forwarded this to Uladzislau and he offered to help.  He said:
>
> : I can help with debugging. What I need are reproduction steps. Could you
> : please clarify if it is easy to hit and what kind of profiling triggers it?
>
> and
>
> : I do not think that Matthew Wilcox's commits destroy it but... I see that
> : __vunmap() is invoked by free_work(), thus a caller may be in atomic
> : context, including IRQ context.

To keep all of this information together: Florian emailed me off-list
and I replied cc'ing Kees.
I asked him to try this patch, to decide whether it's the extra load on the
spinlock from examining the vmap tree more often:

diff --git a/mm/usercopy.c b/mm/usercopy.c
index c1ee15a98633..76d2d4fb6d22 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -173,15 +173,6 @@ static inline void check_heap_object(const void *ptr, unsigned long n,
 	}
 
 	if (is_vmalloc_addr(ptr)) {
-		struct vmap_area *area = find_vmap_area(addr);
-
-		if (!area)
-			usercopy_abort("vmalloc", "no area", to_user, 0, n);
-
-		if (n > area->va_end - addr) {
-			offset = addr - area->va_start;
-			usercopy_abort("vmalloc", NULL, to_user, offset, n);
-		}
 		return;
 	}

Kees wrote:

} If you can reproduce the hangs, perhaps enable:
}
} CONFIG_DEBUG_LOCKDEP=y
} CONFIG_DEBUG_ATOMIC_SLEEP=y
}
} I would expect any hung spinlock to complain very loudly under
} LOCKDEP...

I hope we can keep the remainder of the debugging in this email thread.

> > [quoted stack traces and bug report snipped]
On Thu, Sep 15, 2022 at 4:42 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Sep 15, 2022 at 01:39:31PM -0700, Andrew Morton wrote:
> > (switched to email. Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
> >
> > [quoted bug report and Uladzislau's comments snipped]
>
> To keep all of this information together: Florian emailed me off-list
> and I replied cc'ing Kees.
>
> [usercopy.c patch, Kees's LOCKDEP suggestion, and quoted stack traces snipped]

I think this is a manifestation of the lockdep warning I reported a couple
of weeks ago:

https://lore.kernel.org/r/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9b3EQ@mail.gmail.com/
On Thu, Sep 15, 2022 at 05:59:56PM -0600, Yu Zhao wrote:
> I think this is a manifestation of the lockdep warning I reported a couple
> of weeks ago:
>
> https://lore.kernel.org/r/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9b3EQ@mail.gmail.com/

That would certainly match the symptoms.

Turning vmap_lock into an NMI-safe lock would be bad.  I don't even know
if we have primitives for that (it's not like you can disable an NMI ...)

I don't quite have time to write a patch right now.  Perhaps something
like:

struct vmap_area *find_vmap_area_nmi(unsigned long addr)
{
	struct vmap_area *va;

	if (!spin_trylock(&vmap_area_lock))
		return NULL;
	va = __find_vmap_area(addr, &vmap_area_root);
	spin_unlock(&vmap_area_lock);

	return va;
}

and then call find_vmap_area_nmi() in check_heap_object().  I may have
the polarity of the return value of spin_trylock() incorrect.
On Fri, Sep 16, 2022 at 09:38:33AM +0100, Matthew Wilcox wrote:
> On Thu, Sep 15, 2022 at 05:59:56PM -0600, Yu Zhao wrote:
> > I think this is a manifestation of the lockdep warning I reported a couple
> > of weeks ago:
> >
> > https://lore.kernel.org/r/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9b3EQ@mail.gmail.com/
>
> That would certainly match the symptoms.
>
> Turning vmap_lock into an NMI-safe lock would be bad.  I don't even know
> if we have primitives for that (it's not like you can disable an NMI ...)
>
> I don't quite have time to write a patch right now.  Perhaps something
> like:
>
> [find_vmap_area_nmi() sketch snipped]
>
> and then call find_vmap_area_nmi() in check_heap_object().  I may have
> the polarity of the return value of spin_trylock() incorrect.

I think we'll need something slightly tweaked, since this would
return NULL under any contention (and a NULL return is fatal in
check_heap_object()).  It seems like we need to explicitly check
for being in NMI context in check_heap_object() to deal with it?
Like this (only build tested):

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..c8a00f181a11 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -216,6 +216,7 @@ void free_vm_area(struct vm_struct *area);
 extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 struct vmap_area *find_vmap_area(unsigned long addr);
+struct vmap_area *find_vmap_area_try(unsigned long addr);
 
 static inline bool is_vm_area_hugepages(const void *addr)
 {
diff --git a/mm/usercopy.c b/mm/usercopy.c
index c1ee15a98633..9f943c29e7ec 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -173,7 +173,16 @@ static inline void check_heap_object(const void *ptr, unsigned long n,
 	}
 
 	if (is_vmalloc_addr(ptr)) {
-		struct vmap_area *area = find_vmap_area(addr);
+		struct vmap_area *area;
+
+		if (!in_nmi()) {
+			area = find_vmap_area(addr);
+		} else {
+			area = find_vmap_area_try(addr);
+			/* Give up under NMI to avoid deadlocks. */
+			if (!area)
+				return;
+		}
 
 		if (!area)
 			usercopy_abort("vmalloc", "no area", to_user, 0, n);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index dd6cdb201195..f14f1902c2f6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1840,6 +1840,17 @@ struct vmap_area *find_vmap_area(unsigned long addr)
 	return va;
 }
 
+struct vmap_area *find_vmap_area_try(unsigned long addr)
+{
+	struct vmap_area *va = NULL;
+
+	if (spin_trylock(&vmap_area_lock)) {
+		va = __find_vmap_area(addr, &vmap_area_root);
+		spin_unlock(&vmap_area_lock);
+	}
+	return va;
+}
+
 /*** Per cpu kva allocator ***/

/*
(In reply to willy from comment #2)
> On Thu, Sep 15, 2022 at 01:39:31PM -0700, Andrew Morton wrote:
> [..]
> I asked to try this patch to decide whether it's the extra load on the
> spinlock from examining the vmap tree more often:
>
> [usercopy.c patch snipped]

I did give this a try, but I can't tell whether it works or not. The problem
at the moment is that we can't reliably reproduce the freeze. It "just"
happens, and so far I can only report the symptoms we see.
On Fri, Sep 16, 2022 at 02:46:39AM -0700, Kees Cook wrote:
> On Fri, Sep 16, 2022 at 09:38:33AM +0100, Matthew Wilcox wrote:
> > [find_vmap_area_nmi() sketch snipped]
>
> I think we'll need something slightly tweaked, since this would
> return NULL under any contention (and a NULL return is fatal in
> check_heap_object()).  It seems like we need to explicitly check
> for being in NMI context in check_heap_object() to deal with it?
>
> Like this (only build tested):
>
> [find_vmap_area_try() patch snipped]

OK. The problem is about using find_vmap_area() from IRQ context. Indeed,
it can deadlock; it is not supposed to be used there.
Please note that it might be a regular IRQ also so it is not limited to NMI context only, because somebody could decide later to use it from a regular IRQ. IMHO, it makes sense to make use of in_interrupt() helper instead so we cover here a hw-IRQ context including NMI one. It also would be aligned with deferred vfreeing: <snip> tatic void __vfree(const void *addr) { if (unlikely(in_interrupt())) __vfree_deferred(addr); else __vunmap(addr, 1); } <snip> so we handle here not only NMI scenario. I think we should align. -- Uladzislau Rezki
On Fri, Sep 16, 2022 at 02:28:08PM +0200, Uladzislau Rezki wrote: > On Fri, Sep 16, 2022 at 02:46:39AM -0700, Kees Cook wrote: > > On Fri, Sep 16, 2022 at 09:38:33AM +0100, Matthew Wilcox wrote: > > > On Thu, Sep 15, 2022 at 05:59:56PM -0600, Yu Zhao wrote: > > > > I think this is a manifest of the lockdep warning I reported a couple > > > > of weeks ago: > > > > > https://lore.kernel.org/r/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9b3EQ@mail.gmail.com/ > > > > > > That would certainly match the symptoms. > > > > > > Turning vmap_lock into an NMI-safe lock would be bad. I don't even know > > > if we have primitives for that (it's not like you can disable an NMI > > > ...) > > > > > > I don't quite have time to write a patch right now. Perhaps something > > > like: > > > > > > struct vmap_area *find_vmap_area_nmi(unsigned long addr) > > > { > > > struct vmap_area *va; > > > > > > if (spin_trylock(&vmap_area_lock)) > > > return NULL; > > > va = __find_vmap_area(addr, &vmap_area_root); > > > spin_unlock(&vmap_area_lock); > > > > > > return va; > > > } > > > > > > and then call find_vmap_area_nmi() in check_heap_object(). I may have > > > the polarity of the return value of spin_trylock() incorrect. > > > > I think we'll need something slightly tweaked, since this would > > return NULL under any contention (and a NULL return is fatal in > > check_heap_object()). It seems like we need to explicitly check > > for being in nmi context in check_heap_object() to deal with it? 
> > Like this (only build tested): > > > > > > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h > > index 096d48aa3437..c8a00f181a11 100644 > > --- a/include/linux/vmalloc.h > > +++ b/include/linux/vmalloc.h > > @@ -216,6 +216,7 @@ void free_vm_area(struct vm_struct *area); > > extern struct vm_struct *remove_vm_area(const void *addr); > > extern struct vm_struct *find_vm_area(const void *addr); > > struct vmap_area *find_vmap_area(unsigned long addr); > > +struct vmap_area *find_vmap_area_try(unsigned long addr); > > > > static inline bool is_vm_area_hugepages(const void *addr) > > { > > diff --git a/mm/usercopy.c b/mm/usercopy.c > > index c1ee15a98633..9f943c29e7ec 100644 > > --- a/mm/usercopy.c > > +++ b/mm/usercopy.c > > @@ -173,7 +173,16 @@ static inline void check_heap_object(const void *ptr, > unsigned long n, > > } > > > > if (is_vmalloc_addr(ptr)) { > > - struct vmap_area *area = find_vmap_area(addr); > > + struct vmap_area *area; > > + > > + if (!in_nmi()) { > > + area = find_vmap_area(addr); > > + } else { > > + area = find_vmap_area_try(addr); > > + /* Give up under NMI to avoid deadlocks. */ > > + if (!area) > > + return; > > + } > > > > if (!area) > > usercopy_abort("vmalloc", "no area", to_user, 0, n); > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > index dd6cdb201195..f14f1902c2f6 100644 > > --- a/mm/vmalloc.c > > +++ b/mm/vmalloc.c > > @@ -1840,6 +1840,17 @@ struct vmap_area *find_vmap_area(unsigned long addr) > > return va; > > } > > > > +struct vmap_area *find_vmap_area_try(unsigned long addr) > > +{ > > + struct vmap_area *va = NULL; > > + > > + if (spin_trylock(&vmap_area_lock)) { > > + va = __find_vmap_area(addr, &vmap_area_root); > > + spin_unlock(&vmap_area_lock); > > + } > > + return va; > > +} > > + > > /*** Per cpu kva allocator ***/ > > > > /* > > > OK. The problem is about using find_vmap_area() from the IRQ context. Indeed > it can be dead-locked. It is not supposed to be used there. 
> But if you want it, then we should have a helper.
>
> Please note that it might also be a regular IRQ, so it is not limited to
> the NMI context only, because somebody could decide later to use it from a
> regular IRQ.
>
> IMHO, it makes sense to use the in_interrupt() helper instead, so we cover
> hw-IRQ context here, including the NMI one. It would also be aligned with
> deferred vfreeing:
>
> <snip>
> static void __vfree(const void *addr)
> {
> 	if (unlikely(in_interrupt()))
> 		__vfree_deferred(addr);
> 	else
> 		__vunmap(addr, 1);
> }
> <snip>
>
> so we handle not only the NMI scenario here. I think we should align.

Another thing that I should mention: using sleepable locks (which is what
PREEMPT_RT does) is not allowed in any atomic context, so from the
PREEMPT_RT point of view it is broken.

--
Uladzislau Rezki
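[Editorial note: for readers following along, here is a hedged userspace sketch of the trylock-and-bail pattern being discussed. pthread_mutex_trylock() stands in for spin_trylock(), and a single static area stands in for the vmap_area_root tree; the names mirror the thread but this is a model, not kernel code.]

```c
#include <pthread.h>
#include <stddef.h>

/* Toy stand-in for struct vmap_area. */
struct vmap_area {
	unsigned long va_start;
	unsigned long va_end;
};

static pthread_mutex_t vmap_area_lock = PTHREAD_MUTEX_INITIALIZER;
static struct vmap_area demo_area = { 0x1000, 0x2000 };

/* Stand-in for __find_vmap_area(); caller must hold vmap_area_lock. */
static struct vmap_area *__find_vmap_area(unsigned long addr)
{
	if (addr >= demo_area.va_start && addr < demo_area.va_end)
		return &demo_area;
	return NULL;
}

/*
 * Trylock variant: gives up (returns NULL) instead of spinning when the
 * lock is already held, so a lookup that interrupts the lock holder
 * cannot self-deadlock on the same lock.
 */
static struct vmap_area *find_vmap_area_try(unsigned long addr)
{
	struct vmap_area *va = NULL;

	if (pthread_mutex_trylock(&vmap_area_lock) == 0) {
		va = __find_vmap_area(addr);
		pthread_mutex_unlock(&vmap_area_lock);
	}
	return va;
}
```

The caller must treat NULL as "could not check" rather than "no area", which is exactly the tweak Kees made relative to the first draft.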
On Fri, Sep 16, 2022 at 02:46:39AM -0700, Kees Cook wrote:
> On Fri, Sep 16, 2022 at 09:38:33AM +0100, Matthew Wilcox wrote:
> > On Thu, Sep 15, 2022 at 05:59:56PM -0600, Yu Zhao wrote:
> > > I think this is a manifestation of the lockdep warning I reported a
> > > couple of weeks ago:
> > > https://lore.kernel.org/r/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9b3EQ@mail.gmail.com/
> >
> > That would certainly match the symptoms.
> >
> > Turning vmap_lock into an NMI-safe lock would be bad. I don't even know
> > if we have primitives for that (it's not like you can disable an NMI
> > ...)
> >
> > I don't quite have time to write a patch right now. Perhaps something
> > like:
> >
> > struct vmap_area *find_vmap_area_nmi(unsigned long addr)
> > {
> > 	struct vmap_area *va;
> >
> > 	if (spin_trylock(&vmap_area_lock))
> > 		return NULL;
> > 	va = __find_vmap_area(addr, &vmap_area_root);
> > 	spin_unlock(&vmap_area_lock);
> >
> > 	return va;
> > }
> >
> > and then call find_vmap_area_nmi() in check_heap_object(). I may have
> > the polarity of the return value of spin_trylock() incorrect.
>
> I think we'll need something slightly tweaked, since this would
> return NULL under any contention (and a NULL return is fatal in
> check_heap_object()). It seems like we need to explicitly check
> for being in NMI context in check_heap_object() to deal with it?
> Like this (only build tested):

Right, and Ulad is right about it being callable from any context. I think
the long-term solution is to make the vmap_area_root tree walkable under
RCU protection.

For now, let's have a distinct return code (ERR_PTR(-EAGAIN), perhaps?) to
indicate that we've hit contention. It generally won't matter if we hit it
in process context, because hardening doesn't have to be 100% reliable to
be useful.

Erm ...
so what prevents this race:

	CPU 0					CPU 1
	copy_to_user()
	  check_heap_object()
	    area = find_vmap_area(addr)
						__purge_vmap_area_lazy()
						  merge_or_add_vmap_area_augment()
						    __merge_or_add_vmap_area()
						      kmem_cache_free(vmap_area_cachep, va);
	    if (n > area->va_end - addr) {

Yes, it's a race in the code that allocated this memory: it is
simultaneously calling copy_to_user() and __vunmap(). We'll catch that bad
behaviour sooner rather than later, but sometimes, in trying to catch this
bug, we'll get caught by the bug ourselves and go splat. I don't know that
we need to go through heroics to be sure we don't get caught by it. The
freeing already has to run through a workqueue; we could delay it even
further with RCU or something, but we're only trading off one kind of
badness for another.
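[Editorial note: the interleaving above can be sketched sequentially in userspace. In this model a `freed` flag stands in for kmem_cache_free() so the demo stays well-defined; the names (`lookup`, `purge`, `size_check_touches_freed`) are illustrative, not kernel APIs. The point is only that nothing pins the vmap_area between the lookup and the `n > area->va_end - addr` check.]

```c
#include <stddef.h>

struct vmap_area {
	unsigned long va_start;
	unsigned long va_end;
	int freed;			/* models kmem_cache_free() */
};

static struct vmap_area demo_va = { 0x1000, 0x2000, 0 };

/* CPU 0, step 1: the lookup (lock held in the kernel, elided here). */
static struct vmap_area *lookup(unsigned long addr)
{
	if (addr >= demo_va.va_start && addr < demo_va.va_end)
		return &demo_va;
	return NULL;
}

/* CPU 1: the lazy purge frees the vmap_area out from under CPU 0. */
static void purge(struct vmap_area *va)
{
	va->freed = 1;
}

/*
 * CPU 0, step 2: the size check dereferences the possibly stale pointer
 * after the lock was dropped. Returns 1 if it just read freed memory.
 */
static int size_check_touches_freed(const struct vmap_area *va,
				    unsigned long addr, unsigned long n)
{
	int stale = va->freed;

	(void)(n > va->va_end - addr);	/* the check from check_heap_object() */
	return stale;
}
```

Running the three steps in the order shown in the diagram (lookup, purge, check) makes the final check touch freed memory, which is the splat scenario discussed below.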
On Fri, Sep 16, 2022 at 03:15:05PM +0100, Matthew Wilcox wrote:
> Right, and Ulad is right about it being callable from any context. I think
> the long-term solution is to make the vmap_area_root tree walkable under
> RCU protection.

Agreed; I've updated my proposed patch.

> For now, let's have a distinct return code (ERR_PTR(-EAGAIN), perhaps?) to
> indicate that we've hit contention. It generally won't matter if we
> hit it in process context because hardening doesn't have to be 100%
> reliable to be useful.

Right, and as I note in the series[1], hardening shouldn't be getting called
_at all_ in this path. :P

-Kees

[1] https://lore.kernel.org/linux-hardening/20220916135953.1320601-1-keescook@chromium.org/
On Fri, Sep 16, 2022 at 03:15:05PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 16, 2022 at 02:46:39AM -0700, Kees Cook wrote:
> > On Fri, Sep 16, 2022 at 09:38:33AM +0100, Matthew Wilcox wrote:
> > > On Thu, Sep 15, 2022 at 05:59:56PM -0600, Yu Zhao wrote:
> > > > I think this is a manifestation of the lockdep warning I reported a
> > > > couple of weeks ago:
> > > > https://lore.kernel.org/r/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9b3EQ@mail.gmail.com/
> > >
> > > That would certainly match the symptoms.
> > >
> > > Turning vmap_lock into an NMI-safe lock would be bad. I don't even know
> > > if we have primitives for that (it's not like you can disable an NMI
> > > ...)
> > >
> > > I don't quite have time to write a patch right now. Perhaps something
> > > like:
> > >
> > > struct vmap_area *find_vmap_area_nmi(unsigned long addr)
> > > {
> > > 	struct vmap_area *va;
> > >
> > > 	if (spin_trylock(&vmap_area_lock))
> > > 		return NULL;
> > > 	va = __find_vmap_area(addr, &vmap_area_root);
> > > 	spin_unlock(&vmap_area_lock);
> > >
> > > 	return va;
> > > }
> > >
> > > and then call find_vmap_area_nmi() in check_heap_object(). I may have
> > > the polarity of the return value of spin_trylock() incorrect.
> >
> > I think we'll need something slightly tweaked, since this would
> > return NULL under any contention (and a NULL return is fatal in
> > check_heap_object()). It seems like we need to explicitly check
> > for being in NMI context in check_heap_object() to deal with it?
> > Like this (only build tested):
>
> Right, and Ulad is right about it being callable from any context. I think
> the long-term solution is to make the vmap_area_root tree walkable under
> RCU protection.
>
> For now, let's have a distinct return code (ERR_PTR(-EAGAIN), perhaps?) to
> indicate that we've hit contention. It generally won't matter if we
> hit it in process context because hardening doesn't have to be 100%
> reliable to be useful.
>
> Erm ...
> so what prevents this race:
>
> 	CPU 0					CPU 1
> 	copy_to_user()
> 	  check_heap_object()
> 	    area = find_vmap_area(addr)
> 						__purge_vmap_area_lazy()
> 						  merge_or_add_vmap_area_augment()
> 						    __merge_or_add_vmap_area()
> 						      kmem_cache_free(vmap_area_cachep, va);
> 	    if (n > area->va_end - addr) {

Sounds like it can happen. I think it is a good argument to switch to RCU
here for safe access to the va after the lock is released. I can think
about it and put it on my todo list.

Since it is not urgent, it is OK so far to wait for a splat. But it might
never happen :)

--
Uladzislau Rezki
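[Editorial note: the RCU direction suggested here can be modeled in userspace without liburcu. In this hedged sketch a single reader counter stands in for RCU grace periods and a single published pointer stands in for the vmap_area_root tree; `reader_begin`/`reader_end`/`unpublish_and_wait` are invented names, not kernel APIs.]

```c
#include <stdatomic.h>
#include <stddef.h>

struct vmap_area {
	unsigned long va_start;
	unsigned long va_end;
};

/* The "tree": one published pointer, NULL when nothing is published. */
static _Atomic(struct vmap_area *) current_va;
static atomic_int active_readers;

/* Models rcu_read_lock() plus a lockless lookup. */
static struct vmap_area *reader_begin(void)
{
	atomic_fetch_add(&active_readers, 1);
	return atomic_load(&current_va);
}

/* Models rcu_read_unlock(). */
static void reader_end(void)
{
	atomic_fetch_sub(&active_readers, 1);
}

/*
 * Models unlinking plus synchronize_rcu() before reclaim: unpublish the
 * area, then wait until no reader can still hold a reference. Only
 * after this returns is it safe to hand the area back to the allocator.
 */
static struct vmap_area *unpublish_and_wait(void)
{
	struct vmap_area *va = atomic_exchange(&current_va, NULL);

	while (atomic_load(&active_readers) > 0)
		;	/* spin until the last reader calls reader_end() */
	return va;
}
```

The busy-wait is a deliberately coarse stand-in for a grace period; the property it preserves is the one that matters for the race above: the freer cannot reclaim the vmap_area while a lookup still holds a pointer to it.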
I can confirm that the Debian 6.0-rc7 rt kernels lock up after about 5
minutes, while the non-rt kernel has been running for more than 20.