Bug 215616

Summary: With numa balancing enabled, physical page is migrated even though we called gup
Product: Memory Management
Component: Other
Assignee: Andrew Morton (akpm)
Reporter: ogabbay
Status: NEW
Severity: high
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.9 and above
Subsystem:
Regression: No
Bisected commit-id:
CC: daniel, dave, david, torvalds, xzpeter, yuranu

Description ogabbay 2022-02-16 16:45:59 UTC
We have a demo running on two dual-socket XEON servers that utilize the Habanalabs Gaudi AI accelerator.
Each server contains 8 PCIe accelerators.

The demo opens 8 processes (on each server), each process opens a file descriptor of a different Gaudi device. The user-space process does NOT use fork.

As part of the demo (and almost every program we run using our device), the userspace app allocates memory and calls the HabanaLabs kernel driver to pin that memory and map it into the Gaudi's MMU, so the Gaudi's DMA engine will be able to read/write that memory (very similar to GPU and RDMA drivers).
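
(For reference, the driver-side pin-and-map path is essentially the usual GUP + DMA-mapping pattern. The sketch below is illustrative only -- the function and variable names are not the actual habanalabs code:)

#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

/* Illustrative sketch, not the habanalabs driver: pin the user buffer so
 * its physical pages stay put, then DMA-map it so the device MMU can be
 * programmed with the resulting DMA addresses. */
static int pin_and_map_user_buf(struct device *dev, unsigned long uaddr,
                                unsigned long npages, struct page **pages,
                                struct sg_table *sgt)
{
        int pinned;

        pinned = pin_user_pages_fast(uaddr, npages,
                                     FOLL_WRITE | FOLL_LONGTERM, pages);
        if (pinned != npages)
                goto err_unpin;

        if (sg_alloc_table_from_pages(sgt, pages, npages, 0,
                                      npages << PAGE_SHIFT, GFP_KERNEL))
                goto err_unpin;

        if (dma_map_sgtable(dev, sgt, DMA_BIDIRECTIONAL, 0))
                goto err_free_sgt;

        /* The device MMU page tables are then programmed from sgt's DMA
         * addresses; the CPU page tables are not consulted again. */
        return 0;

err_free_sgt:
        sg_free_table(sgt);
err_unpin:
        if (pinned > 0)
                unpin_user_pages(pages, pinned);
        return -EFAULT;
}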

When working with kernel 5.9 and above, with numa-balancing enabled, and without setting affinity on the user-space processes, we see that the device sometimes reads corrupted memory from the host.

When I say corrupted memory, I mean that it should contain valid "commands" for the device's DMA engine, but in reality it contains garbage. Let's call that memory the "command buffer".

When we analyze the Gaudi's MMU page tables and compare them to the CPU's page tables, we see that the corruption stems from the fact that in the CPU's page tables, the command buffer's user-space virtual address points to physical page A, while in the device's MMU page tables it points to physical page B.
If we read the memory through the CPU, the contents are valid.

We saw that we can make this bug disappear using one of the following workarounds:

1. Disable numa-balancing through /proc/sys/kernel/numa_balancing.

2. Use affinity for each process, binding each process to a specific numa node.

After bisecting between kernels 5.8 and 5.9, I arrived at the following patch:
2020-09-04 - 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 -  mm: do_wp_page() simplification <Linus Torvalds>

If I revert this patch (and 3 more patches that came after it and depend on it), then the problem disappears.

I've also verified this on the latest kernel, 5.17-rc4 (I had to revert 3 more patches that cleaned up code left unused by that change).

I would appreciate it if the MM people could look at how this change affects the locking between GUP and the numa-balancing code.
Comment 1 Dave Hansen 2022-02-16 19:16:28 UTC
get_user_pages() ensures that the physical page remains allocated.  But it does not guarantee that the mapping from which it was gup'd remains.  Non-CPU MMUs generally get updates to the CPU page tables via mmu notifiers.
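
(For context, the typical way a non-CPU MMU learns about CPU page-table changes is an mmu notifier -- roughly the sketch below. This is illustrative only and not a claim about what the habanalabs driver does:)

#include <linux/mmu_notifier.h>

struct dev_mmu_tracker {
        struct mmu_notifier mn;
        /* ... device MMU state ... */
};

/* Invoked before the CPU PTEs covering [range->start, range->end) are
 * changed; a driver would invalidate or re-map its device MMU entries
 * covering that range here. */
static int dev_invalidate_range_start(struct mmu_notifier *mn,
                                      const struct mmu_notifier_range *range)
{
        /* tear down / refresh device mappings for the range */
        return 0;
}

static const struct mmu_notifier_ops dev_mn_ops = {
        .invalidate_range_start = dev_invalidate_range_start,
};

static int dev_track_mm(struct dev_mmu_tracker *t, struct mm_struct *mm)
{
        t->mn.ops = &dev_mn_ops;
        return mmu_notifier_register(&t->mn, mm);
}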

The patch to which you bisected contains this hunk:

+               if (PageKsm(page) || page_count(page) != 1)
+                       goto copy;

That looks to be more aggressively copying the pages than before.  Copying changes the mapping.  If there were a driver bug where it assumed a static mapping after a gup, it's plausible that this hunk would cause more mapping changes and expose driver bugs more easily.

I'm not sure how NUMA balancing is involved.
Comment 2 Linus Torvalds 2022-02-16 19:24:52 UTC
(In reply to Dave Hansen from comment #1)
> 
> I'm not sure how NUMA balancing is involved.

I literally think it's simply just that NUMA balancing causes random page faults at possibly high frequencies, and the NUMA fault handler is running at the same time on another CPU and elevates the page count, and exposes some random situation.

Now, the thing that I don't understand is why such a page would be marked COW to begin with, though. It shouldn't. The original page pinning should have COW'ed the page and made it writable - and NUMA balancing shouldn't affect *that*.

So my reaction is "How did it get unwritable in the first place?"

I have no idea what that demo program does, of course. That might be interesting.
Comment 3 ogabbay 2022-02-16 19:52:45 UTC
In general, the demo performs collective operations (all-gather, all-reduce) between 16 accelerators that are installed in 2 servers. Inside each server they communicate through the internal RDMA links on the accelerators, and between the two servers they communicate through TCP over the host NIC.

The demo is using a new library we call HCCL, which is an emulation library for Nvidia's NCCL library, but that's not really interesting.

The relevant part is that each process opens a single device and then starts performing collective operations with the other devices. The device reads both commands and data from the server's memory.

When the user wants the device to read/write from host memory, it performs the following:
1. Allocate memory in userspace, either hugepage memory using mmap, or regular memory using new.

2. Call the driver's ioctl to map that memory into the device's MMU. The driver pins the memory, dma-maps it, and programs the page tables of the device's MMU with the dma addresses. It then returns a different virtual address to the user, one that only the device knows. We don't do SVM/UVA.

3. The user fills the memory with commands/data and then gives the device virtual address to the device so it can access the memory.

So it is a fairly simple procedure, one that has been used in every test, demo, and real deep-learning training we have executed on our devices for the past 4 years.
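
(Roughly, the userspace side of steps 1-3 above looks like the sketch below. The ioctl number and argument struct are placeholders, not the real habanalabs uAPI:)

#include <stdint.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Placeholder argument struct and ioctl number -- NOT the real uAPI. */
struct map_args {
        uint64_t host_addr;   /* in:  userspace virtual address */
        uint64_t size;        /* in:  length to pin and map */
        uint64_t device_va;   /* out: address the device will use */
};
#define DRV_IOCTL_MAP_HOST_MEM  _IOWR('H', 0x01, struct map_args)

static uint64_t map_for_device(int fd, void *buf, size_t size)
{
        struct map_args args = {
                .host_addr = (uint64_t)(uintptr_t)buf,
                .size = size,
        };

        /* Step 2: the driver pins the memory, dma-maps it and programs
         * the device MMU, returning a device-only virtual address. */
        if (ioctl(fd, DRV_IOCTL_MAP_HOST_MEM, &args))
                return 0;
        return args.device_va;
}

/*
 * Step 1: buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
 *                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
 * Step 3: fill buf with commands/data, then hand args.device_va (not buf)
 *         to the device so its DMA engine can access the memory.
 */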

A few months ago we encountered a similar bug (the physical address in the CPU differs from the one in the device) when using pytorch (a deep learning framework). The root cause of that bug was that pytorch uses fork, and the fork happened after the memory was allocated, so COW killed us. We solved it by calling madvise with MADV_DONTFORK after each allocation.
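
(That workaround boils down to a call like the sketch below after each allocation:)

#include <stdio.h>
#include <stddef.h>
#include <sys/mman.h>

/* Sketch of the workaround described above: keep a device-accessible
 * buffer out of any child created by fork(), so COW cannot split the
 * CPU mapping from the pinned pages. */
static int exclude_from_fork(void *buf, size_t size)
{
        if (madvise(buf, size, MADV_DONTFORK)) {
                perror("madvise(MADV_DONTFORK)");
                return -1;
        }
        return 0;
}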

But in this demo, we don't fork, at least not consciously. I will double-check whether a fork happens through some utility or external library.

Another thing I noticed is that in kernels 5.9 and above, the amount of migrations increased significantly: from tens of migrations before 5.9 to 1k migrations per second (observed via /proc/vmstat).

Hope this info helps a bit.
Comment 4 Linus Torvalds 2022-02-16 20:01:55 UTC
(In reply to Linus Torvalds from comment #2)
> 
> So my reaction is "How did it get unwritable in the first place?"

Actually, I think I see how this happened.

Look at change_pte_range(), and that

        /*
         * Avoid trapping faults against the zero or KSM
         * pages. See similar comment in change_huge_pmd.
         */
        if (prot_numa) {

code block in particular.

It does that

                /* Also skip shared copy-on-write pages */
                if (is_cow_mapping(vma->vm_flags) &&
                    page_mapcount(page) != 1)
                        continue;

thing, and I think that 'page_mapcount(page) != 1' is exactly the problem.

It's basically using the wrong rule for what is a COW page.

And that wrong rule happens to match the old world order, where a COW fault would just fault the page in again.

That said, what still confuses me is

 (a) the code very explicitly tries to save the writability bit, with this code doing

        if (preserve_write)
                ptent = pte_mk_savedwrite(ptent);


    and the code to reinstate a NUMA page will do

        bool was_writable = pte_savedwrite(vmf->orig_pte);
        ...
        if (was_writable) 
                pte = pte_mkwrite(pte);


when putting the page back.

 (b) if the migration actually happens, all the COW rules are out the window anyway, since the migration code would just have allocated a new page (whether in the old or the new COW world order)

Anyway, I do suspect that changing that "page_mapcount(page) != 1" to "page_count(page) != 1" would just fix this.
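
(Concretely, something like the following in change_pte_range() in mm/mprotect.c -- a sketch of the suggested one-liner; the surrounding context may differ slightly:)

                /* Also skip shared copy-on-write pages */
                if (is_cow_mapping(vma->vm_flags) &&
                    page_count(page) != 1)      /* was: page_mapcount(page) != 1 */
                        continue;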

Oded, mind trying that?
Comment 5 ogabbay 2022-02-16 20:04:26 UTC
Sure, no problem. I will try that and update here.
Comment 6 Linus Torvalds 2022-02-16 20:06:50 UTC
(In reply to ogabbay from comment #3)
> 
> Another thing I noticed is that in kernels 5.9 and above, the amount of
> migrations increased significantly: from tens of migrations before 5.9 to 1k
> migrations per second (observed via /proc/vmstat).

So I'm wondering whether that's the real issue.

See my (b) comment in Comment #4 above: I'm not really seeing anything that keeps NUMA migration from happening on a pinned page. Not really before, not after. It *looks* to me like a numa migration could happen even for a pinned page. And then that would obviously not work out right.

So I'm wondering if at least part of the issue might be incidental.

Also, it might be that the 'preserve_write' logic is slightly broken. The only thing that would cause is excessive COW faults, and normal loads wouldn't notice. But the pinned pages *would*, since they depend on "just keep it writable" as part of the semantics.
Comment 7 ogabbay 2022-02-16 21:44:49 UTC
Linus, I think you were spot-on with your suggestion.

So far it has completed 12 successful runs of the demo. Up until now, the bug usually happened during the first 3 runs. The migration rate is still high (around 2-3k per second), but without failures.

I would like to leave it running in a loop overnight and see tomorrow whether there were any failures. With the revert it ran for a full night. Each run is about 15 minutes.

If you want to make sure I didn't make a stupid mistake, I put the change you asked me to do in a commit on top of 5.17-rc4 at:

https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux.git/log/?h=linus_fix_numa

I'll update tomorrow (my) morning.

Thanks
Comment 8 Linus Torvalds 2022-02-16 22:08:01 UTC
(In reply to ogabbay from comment #7)
> Linus, I think you were spot-on with your suggestion.

I'm both glad and a bit disappointed.

I still think the numa migration code shouldn't have made a difference unless migration actually happens, and then it should have failed before that COW simplification too.

At the same time, the one-liner to disable migration for COW pages with multiple users is one that I think is obviously the right thing to do regardless, so I'm happy it works for you.

Let me know tomorrow, and assuming everything still looks good I'll commit that patch and mark it for stable.
Comment 9 ogabbay 2022-02-17 07:19:06 UTC
(In reply to Linus Torvalds from comment #8)
> (In reply to ogabbay from comment #7)
> > Linus, I think you were spot-on with your suggestion.
> 
> I'm both glad and a bit disappointed.
> 
> I still think the numa migration code shouldn't have made a difference
> unless migration actually happens, and then it should have failed before
> that COW simplification too.

So you are saying there are two options:

1. This behavior was always present, from day 1. However, because the migrations were very few (at least for our workloads), we never saw this "bug". But with this simplification, the amount of migrations increased by two orders of magnitude, and now there is more chance for the original "bug" to occur.

2. The simplification removed some serialization/lock, which introduced a race between gup and numa migration.

From what you say, I understand you are leaning towards option 1, correct ?

> 
> At the same time, the one-liner to disable migration for COW pages with
> multiple users is one that I think is obviously the right thing to do
> regardless, so I'm happy it works for you.
> 
> Let me know tomorrow, and assuming everything still looks good I'll commit
> that patch and mark it for stable.

No failures were observed after the overnight runs. About 130 iterations so far without any failure. I'll let it keep running but I'm fairly certain I won't see any failure. I think it is good to add that patch.

Thanks for the fast help.
Comment 10 Peter Xu 2022-02-17 07:27:40 UTC
Hi, all,

(In reply to Linus Torvalds from comment #8)
> I still think the numa migration code shouldn't have made a difference
> unless migration actually happens, and then it should have failed before
> that COW simplification too.

Maybe it's because we'll check the page refcount before the final migration, and we bail out if GUP is detected?  I'm looking at folio_migrate_mapping():

		/* Anonymous page without mapping */
		if (folio_ref_count(folio) != expected_count)
			return -EAGAIN;

> 
> At the same time, the one-liner to disable migration for COW pages with
> multiple users is one that I think is obviously the right thing to do
> regardless, so I'm happy it works for you.
> 
> Let me know tomorrow, and assuming everything still looks good I'll commit
> that patch and mark it for stable.

Replacing page_mapcount() with page_count() looks like a workaround for the real problem.  Meanwhile, is page_count()==1 too strict in this case?  At least it doesn't consider PageSwapCache() as expected_page_refs() does.  Not sure whether it'll have any performance implications on some NUMA hosts.

I also agree that the write bit shouldn't get lost at all, and that's still a mystery.  I was looking into all the procedures but I didn't find anything in the NUMA balancing code that could cause it to happen.

However, what we know is that NUMA balancing definitely triggered on these pinned pages, even if it bailed out in the end.  We did replace the ptes/pmds with migration entries, but probably restored them afterwards.

So either we missed some tricky path in NUMA balancing, or another guess is that NUMA balancing could have made some other code path trigger more easily, which could cause the write bit to get lost.  I don't have a solid clue yet.
Comment 11 ogabbay 2022-02-17 07:33:06 UTC
(In reply to ogabbay from comment #9)
> (In reply to Linus Torvalds from comment #8)
> > (In reply to ogabbay from comment #7)
> > > Linus, I think you were spot-on with your suggestion.
> > 
> > I'm both glad and a bit disappointed.
> > 
> > I still think the numa migration code shouldn't have made a difference
> > unless migration actually happens, and then it should have failed before
> > that COW simplification too.
> 
> So you are saying there are two options:
> 
> 1. This behavior was always present, from day 1. However, because the
> migrations were very few (at least for our workloads), we never saw this
> "bug". But with this simplification, the amount of migrations increased by
> two orders of magnitude, and now there is more chance for the original "bug" to occur.
> 
> 2. The simplification removed some serialization/lock, which introduced a
> race between gup and numa migration.
> 
> From what you say, I understand you are leaning towards option 1, correct ?
> 
> > 
> > At the same time, the one-liner to disable migration for COW pages with
> > multiple users is one that I think is obviously the right thing to do
> > regardless, so I'm happy it works for you.
> > 
> > Let me know tomorrow, and assuming everything still looks good I'll commit
> > that patch and mark it for stable.
> 
> No failures were observed after the overnight runs. About 130 iterations so
> far without any failure. I'll let it keep running but I'm fairly certain I
> won't see any failure. I think it is good to add that patch.
> 
> Thanks for the fast help.

Just wanted to add that this change hasn't reduced the amount of page migrations. They still happen at a rate of 1k-2k migrations per second.
Comment 12 Linus Torvalds 2022-02-17 17:20:42 UTC
(In reply to ogabbay from comment #9)
> 
> 1. This behavior was always present, from day 1. However, because the
> migrations were very few (at least for our workloads), we never saw this
> "bug". But with this simplification, the amount of migrations increased by
> two orders of magnitude, and now there is more chance for the original "bug" to occur.

I doubt this is the case. The fact that you saw an increase in migrations is interesting, but I can't see how it really would be a fundamental reason.

No, what I think happens is that the NUMA migration code - even when it decided to not migrate in the end - caused the writability bit to be lost somehow. So the NUMA migration code probably causes not just NUMA faults, but also extraneous COW faults afterwards because it has switched the pte read-only when it lost the write bit.

And the old nonsensical "page_mapcount()" code happened to be the same condition that the COW code used to do (for horrible reasons). In fact, I think that page_mapcount() test existed exactly because it mirrored the COW code - somebody had seen the extra COW faults, and disabled NUMA balancing for this case.

The change to use "page_count()" thus just disables NUMA balancing further, using the *correct* test - the old one was nonsensical but happened to work.

Anyway, I committed that one-liner as commit 80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses") since it clearly fixes the issue. I'd love to know how the PTE writability got lost, but I'm hoping some actual NUMA balancing person would look into that...
Comment 13 David Hildenbrand 2022-02-18 09:16:10 UTC
I finally took a more detailed look at this issue and the fix.


change_pte_range() roughly does the following for MM_CP_PROT_NUMA

if (is_cow_mapping(vma->vm_flags) &&
    page_mapcount(page) != 1)
	continue
...
ptep_modify_prot_start()
-> ptep_get_and_clear()
...
ptent = pte_mk_savedwrite(ptent);
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 ->  set_pte_at()


Now, I might be wrong, but replacing "page_mapcount(page) != 1" by "page_count(page) != 1" might not fix the complete issue. I think GUP-fast can just race with that code and FOLL_PIN / FOLL_GET that page, until the page table entry was cleared *and* the TLB was flushed.


Looking at change_huge_pmd(), I don't think that we particularly care about the page being exclusive to a single process. Looking at the history, it was introduced via 859d4adc3415 ("mm: numa: do not trap faults on shared data section pages.") as a pure optimization. After all, we're not even holding the page lock.


So actually I think we might instead want to do something like the following that would also make it clearer that this is all best effort:

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 386d03e17b64..7b962ef654b9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -92,11 +92,22 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
                                if (!page || PageKsm(page))
                                        continue;
 
-                               /* Also skip shared copy-on-write pages */
+                               /*
+                                * Also skip shared copy-on-write pages. Note
+                                * that this is a pure optimization and not
+                                * required for correctness.
+                                */
                                if (is_cow_mapping(vma->vm_flags) &&
                                    page_mapcount(page) != 1)
                                        continue;
 
+                               /*
+                                * Note: this is a racy check: migration of
+                                * pinned pages will fail either way.
+                                */
+                               if (page_maybe_dma_pinned(page))
+                                       continue;
+
                                /*
                                 * While migration can move some dirty pages,
                                 * it cannot move them all from MIGRATE_ASYNC



At least concurrent GUP racing with that code should still be able to observe that issue if I'm not missing something important. change_huge_pmd() is similarly affected, especially with the COW security issue fix in place (streamlining the COW logic) -- but that was to be expected to result in the same wrong COWs until the wrong COWs are fixed for all of these.


Having that said, as explained in [1], part 2 I have in the works as described above should take care of that issue for FOLL_PIN -- and I assume even without the page_count() check. part 3 would take care of the issue for O_DIRECT.

I already have a prototype of part2 running, if anybody who can reproduce the issue wants to give it a shot and see their machine go up in flames.



... but I also still don't see how write protection is lost. At first I suspected try_to_migrate(), because it checks against pte_write() and not pte_savedwrite(). However

  include/linux/pgtable.h:#define pte_savedwrite pte_write

for all except ppc64. So that doesn't look like a candidate.


[1] https://lkml.kernel.org/r/39d17db6-0f8a-0e54-289b-85b9baf1e936@redhat.com
Comment 14 ogabbay 2022-02-18 09:45:51 UTC
(In reply to David Hildenbrand from comment #13)
> I finally took a more detailed look at this issue and the fix.
> 
> 
> change_pte_range() roughly does the following for MM_CP_PROT_NUMA
> 
> if (is_cow_mapping(vma->vm_flags) &&
>     page_mapcount(page) != 1)
>       continue
> ...
> ptep_modify_prot_start()
> -> ptep_get_and_clear()
> ...
> ptent = pte_mk_savedwrite(ptent);
> ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>  ->  set_pte_at()
> 
> 
> Now, I might be wrong, but replacing "page_mapcount(page) != 1" by
> "page_count(page) != 1" might not fix the complete issue. I think GUP-fast
> can just race with that code and FOLL_PIN / FOLL_GET that page, until the
> page table entry was cleared *and* the TLB was flushed.
> 
> 
> Looking at change_huge_pmd(), I don't think that we particularly care about
> the page being exclusive to a single process. Looking at the history, it was
> introduced via 859d4adc3415 ("mm: numa: do not trap faults on shared data
> section pages.") as a pure optimization. After all, we're not even holding
> the page lock.
> 
> 
> So actually I think we might instead want to do something like the following
> that would also make it clearer that this is all best effort:
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 386d03e17b64..7b962ef654b9 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -92,11 +92,22 @@ static unsigned long change_pte_range(struct
> vm_area_struct *vma, pmd_t *pmd,
>                                 if (!page || PageKsm(page))
>                                         continue;
>  
> -                               /* Also skip shared copy-on-write pages */
> +                               /*
> +                                * Also skip shared copy-on-write pages. Note
> +                                * that this is a pure optimization and not
> +                                * required for correctness.
> +                                */
>                                 if (is_cow_mapping(vma->vm_flags) &&
>                                     page_mapcount(page) != 1)
>                                         continue;
>  
> +                               /*
> +                                * Note: this is a racy check: migration of
> +                                * pinned pages will fail either way.
> +                                */
> +                               if (page_maybe_dma_pinned(page))
> +                                       continue;
> +
>                                 /*
>                                  * While migration can move some dirty pages,
>                                  * it cannot move them all from MIGRATE_ASYNC
> 
> 
> 
> At least concurrent GUP racing with that code should still be able to
> observe that issue if I'm not missing something important. change_huge_pmd()
> is similarly affected, especially with the COW security issue fix in place
> (streamlining the COW logic) -- but that was to be expected to result in the
> same wrong COWs until the wrong COWs are fixed for all of these.
> 
> 
> Having that said, as explained in [1], part 2 I have in the works as
> described above should take care of that issue for FOLL_PIN -- and I assume
> even without the page_count() check. part 3 would take care of the issue for
> O_DIRECT.
> 
> I already have a prototype of part2 running, if anybody who can reproduce
> the issue wants to give it a shot and see their machine go up in flames.

Hi David, 
I can try whatever you think is worth trying with my workload. 

Do you want me to take the change you wrote here alone? Or should I apply anything else? 

If you have a git tree ready I can clone, that will be the easiest way for me. 

> 
> 
> 
> ... but I also still don't see how write protection is lost. At first I
> suspected try_to_migrate(), because it checks against pte_write() and not
> pte_savedwrite(). However
> 
>   include/linux/pgtable.h:#define pte_savedwrite pte_write
> 
> for all except ppc64. So that doesn't look like a candidate.
> 
> 
> [1] https://lkml.kernel.org/r/39d17db6-0f8a-0e54-289b-85b9baf1e936@redhat.com
Comment 15 David Hildenbrand 2022-02-19 10:25:38 UTC
> > Having that said, as explained in [1], part 2 I have in the works as
> > described above should take care of that issue for FOLL_PIN -- and I assume
> > even without the page_count() check. part 3 would take care of the issue
> > for O_DIRECT.
> > 
> > I already have a prototype of part2 running, if anybody who can reproduce
> > the issue wants to give it a shot and see their machine go up in flames.
> 
> Hi David, 
> I can try whatever you think is worth trying with my workload. 
> 

Hi,


> Do you want me to take the change you wrote here alone? Or should I apply
> anything else? 

My change should have the same effect as Linus fix (for FOLL_PIN), and it would suffer from the same GUP-fast race, thus the added comments regarding possible races.

> 
> If you have a git tree ready I can clone, that will be the easiest way for
> me. 

See my private mail from yesterday with a link to a WIP version. Hopefully that will solve the issue you're seeing as well.
Comment 16 ogabbay 2022-02-21 09:00:57 UTC
(In reply to David Hildenbrand from comment #15)
> > > Having that said, as explained in [1], part 2 I have in the works as
> > > described above should take care of that issue for FOLL_PIN -- and I
> > > assume even without the page_count() check. part 3 would take care of
> > > the issue for O_DIRECT.
> > > 
> > > I already have a prototype of part2 running, if anybody who can reproduce
> > > the issue wants to give it a shot and see their machine go up in flames.
> > 
> > Hi David, 
> > I can try whatever you think is worth trying with my workload. 
> > 
> 
> Hi,
> 
> 
> > Do you want me to take the change you wrote here alone? Or should I apply
> > anything else? 
> 
> My change should have the same effect as Linus fix (for FOLL_PIN), and it
> would suffer from the same GUP-fast race, thus the added comments regarding
> possible races.
> 
> > 
> > If you have a git tree ready I can clone, that will be the easiest way for
> > me. 
> 
> See my private mail from yesterday with a link to a WIP version. Hopefully
> that will solve the issue you're seeing as well.

I wanted to update here that I tried your WIP branch and it also solves the issue I saw. I ran over 250 iterations for more than 36 hours and the issue didn't reproduce.
Comment 17 David Hildenbrand 2022-02-22 19:46:11 UTC
(In reply to ogabbay from comment #16)
> (In reply to David Hildenbrand from comment #15)
> > > > Having that said, as explained in [1], part 2 I have in the works as
> > > > described above should take care of that issue for FOLL_PIN -- and I
> > > > assume even without the page_count() check. part 3 would take care of
> > > > the issue for O_DIRECT.
> > > > 
> > > > I already have a prototype of part2 running, if anybody who can
> > > > reproduce the issue wants to give it a shot and see their machine go
> > > > up in flames.
> > > 
> > > Hi David, 
> > > I can try whatever you think is worth trying with my workload. 
> > > 
> > 
> > Hi,
> > 
> > 
> > > Do you want me to take the change you wrote here alone? Or should I apply
> > > anything else? 
> > 
> > My change should have the same effect as Linus fix (for FOLL_PIN), and it
> > would suffer from the same GUP-fast race, thus the added comments regarding
> > possible races.
> > 
> > > 
> > > If you have a git tree ready I can clone, that will be the easiest way
> > > for me.
> > 
> > See my private mail from yesterday with a link to a WIP version. Hopefully
> > that will solve the issue you're seeing as well.
> 
> I wanted to update here that I tried your WIP branch and it also solves the
> issue I saw. I ran over 250 iterations for more than 36 hours and the issue
> didn't reproduce.


Awesome, thanks for testing. I'm in the process of cleaning that work up (including part 2b) and testing, to then start sending it out as RFC. Thanks!
Comment 18 Linus Torvalds 2022-02-22 20:07:11 UTC
(In reply to David Hildenbrand from comment #17)
> 
> Awesome, thanks for testing. I'm in the process of cleaning that work up
> (including part 2b) and testing, to then start sending it out as RFC. Thanks!

Note that even with "might_be_pinned()" taking care of the GUP case, I really don't want change_pte_range() to go back to the nonsensical "page_mapcount()" logic.

It really is pure garbage. "mapcount" makes zero sense in there, with or without any GUP handling.

As mentioned in the commit message for 80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses") any other mapping might have been moved to the swap cache, and no longer count toward mapcount - because the reference is instead held by the swap cache, and page_count() is elevated.

(Of course, once it has been actively swapped out, even page_count() isn't elevated any more, but at that point it's also not in the page tables as a present page at all any more, so that code wouldn't see that case).

I really think we should get rid of page_mapcount() entirely. And if we don't get rid of it, we should at least make sure it is only used for cases where the number of actual mappings explicitly makes sense.

The only sensible use I can think of off-hand for page_mapcount() is when you're doing a rmap walk to find all memory mappings of a particular page. There may be others, but that's the only one that I think is _clearly_ sensible and you can make some optimization where you say "Ok, I don't need to check other VMs because the mapcount says it's ok".

And even that is actually somewhat questionable, because I don't see a good serialization point, so it had better also be a page that cannot be faulted in somewhere else. So even that rmap case is limited, and possibly just a "I have truncated the file, so the page is now invisible to new uses, and I will now walk all mappings to make sure it's gone from them".

End result: page_mapcount() should be treated with extreme care, because it's *so* pointless and possibly misleading.
Comment 19 David Hildenbrand 2022-02-23 08:17:21 UTC
> Note that even with "might_be_pinned()" taking care of the GUP case, I
> really don't want change_pte_range() to go back to the nonsensical
> "page_mapcount()" logic.
> 
> It really is pure garbage. "mapcount" makes zero sense in there, with or
> without any GUP handling.
> 
> As mentioned in the commit message for 80d47f5de5e3 ("mm: don't try to
> NUMA-migrate COW pages that have other uses") any other mapping might have
> been moved to the swap cache, and no longer count toward mapcount - because
> the reference is instead held by the swap cache, and page_count() is
> elevated.

I would completely agree if the code in question hadn't been added as a pure optimization in 859d4adc3415 ("mm: numa: do not trap faults on shared data section pages.").

The problem is: we don't have anything better to test for "am I likely the only relevant user of this page or is there most certainly another user".

page_mapcount(page) == 1: Might fail to detect shared usage
page_count(page) == 1: Might fail to detect exclusive usage

So in the context of an optimization as documented in 859d4adc3415 ("mm: numa: do not trap faults on shared data section pages."), I don't consider "page_mapcount(page) == 1:" *that* bad.

Maybe wrapping "page_mapcount(page) == 1" into something like "page_maybe_mapped_exclusively()" and clearly documenting the locking expectation+semantics might improve the situation.
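
For instance, something along these lines (a sketch only -- the helper name is hypothetical and does not exist in the kernel; it assumes the usual linux/mm.h context):

/*
 * Hypothetical helper, not existing kernel code: returns true if this
 * page *might* be mapped by only the current process.  It can return a
 * false positive (e.g. another user's mapping may have been moved to
 * the swap cache and no longer counts toward mapcount), so callers must
 * treat it purely as an optimization hint, never as a correctness
 * guarantee.
 */
static inline bool page_maybe_mapped_exclusively(struct page *page)
{
        return page_mapcount(page) == 1;
}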


But these are just my 2 cents, and I don't particularly care as long as we have it properly fixed.

> 
> (Of course, once it has been actively swapped out, even page_count() isn't
> elevated any more, but at that point it's also not in the page tables as a
> present page at all any more, so that code wouldn't see that case).
> 
> I really think we should get rid of page_mapcount() entirely. And if we
> don't get rid of it, we should at least make sure it is only used for cases
> where the number of actual mappings explicitly makes sense.

I am also not a big supporter of page_mapcount(). But I do see value in it for

1) Is this page still mapped (what you describe below) -- mapcount != 0
2) Are we currently the only one mapping it or is there *most certainly* someone else mapping it right now -- mapcount != 1

> 
> The only sensible use I can think of off-hand for page_mapcount() is when
> you're doing a rmap walk to find all memory mappings of a particular page.
> There may be others, but that's the only one that I think is _clearly_
> sensible and you can make some optimization where you say "Ok, I don't need
> to check other VMs because the mapcount says it's ok".
> 
> And even that is actually somewhat questionable, because I don't see a good
> serialization point, so it had better also be a page that cannot be faulted
> in somewhere else. So even that rmap case is limited, and possibly just a "I
> have truncated the file, so the page is now invisible to new uses, and I
> will now walk all mappings to make sure it's gone from them".
> 
> End result: page_mapcount() should be treated with extreme care, because
> it's *so* pointless and possibly misleading.

Absolutely, and even the locking semantics are ... weird and we have to be very careful.


As I said, I'll send out the RFC of PageAnonExclusive() soonish, which Oded tested with. That should take care of any races with GUP-fast FOLL_PIN.