Bug 107781

Summary: init_memory_block adds missing sections to memory block on large systems
Product: Memory Management
Reporter: Andrew Banman (abanman)
Component: Other
Assignee: Andrew Morton (akpm)
Status: NEW
Severity: normal
CC: athorlton, rja, spartacus06
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.3
Regression: No
Attachments: all printk debug output + drop into kdb

Description Andrew Banman 2015-11-12 21:22:52 UTC
Created attachment 192881 [details]
all printk debug output + drop into kdb

When block_size_bytes is set to 2GB (the default for systems with 64GB
or more of memory), init_memory_block runs the risk of adding non-present memory
sections to a memory block. These are sections that were not discovered in
sparse_init_one_section and so do not have a valid mem_map. Every pfn
associated with missing sections is invalid:
!SECTION_MARKED_PRESENT -> !SECTION_HAS_MEM_MAP -> pfn_valid = false.
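For reference, under CONFIG_SPARSEMEM pfn_valid() reduces to a check of the
section's SECTION_HAS_MEM_MAP flag. Paraphrased from include/linux/mmzone.h
(the exact form varies by config):

        static inline int valid_section(struct mem_section *section)
        {
                return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
        }

        static inline int pfn_valid(unsigned long pfn)
        {
                if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
                        return 0;
                return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
        }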

The problem is that memory blocks are always set to span the full
sections_per_block count, so they can include missing memory sections:

drivers/base/memory.c, init_memory_block():
---
	mem->start_section_nr =
			base_memory_block_id(scn_nr) * sections_per_block;
	mem->end_section_nr = mem->start_section_nr + sections_per_block - 1;
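For context, base_memory_block_id() in the same file just divides the section
number by sections_per_block, so start_section_nr is always rounded down to a
sections_per_block boundary (paraphrased):

        static inline int base_memory_block_id(int section_nr)
        {
                return section_nr / sections_per_block;
        }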

Relevant commits:
cb5e39b8 - drivers: base: refactor add_memory_section to add_memory_block
d3360164 - memory hotplug: Update phys_index to [start|end]_section_nr
56a3c655 - memory-hotplug: update documentation ... and remove end_phys_index

The printks show which memory sections get SECTION_MARKED_PRESENT and
SECTION_HAS_MEM_MAP, and the start and end section numbers for each memory
block. I found that memory16 spans sections 256 through 271, but sections 264
through 511 are missing (see attached).
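(For reference: with 2GB blocks and 128MB sections, sections_per_block =
2048MB / 128MB = 16, so memory16 starts at section 16 * 16 = 256 and ends at
271. With sections 264-511 absent, the last eight sections of that block,
264-271, have no mem_map.)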

I found the problem when attempting to offline a block with missing sections.
# echo 0 > /sys/devices/system/memory/memory16/online
In test_pages_in_a_zone, pfn_valid_within always returns 1 since
CONFIG_HOLES_IN_ZONE* is not set, allowing an invalid pfn from a missing
section to make its way through the rest of the code and quickly drop the
system into kdb (see line 1233 of the attachment). This was on the recent 4.3
release.
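For reference, pfn_valid_within() is a constant 1 unless CONFIG_HOLES_IN_ZONE
is set (roughly, from include/linux/mmzone.h):

        #ifdef CONFIG_HOLES_IN_ZONE
        #define pfn_valid_within(pfn) pfn_valid(pfn)
        #else
        #define pfn_valid_within(pfn) (1)
        #endif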

I can't tell what the desired behavior is supposed to be. Was it intended for
memory blocks to have missing sections in order to give them a uniform number of
sections? If that's the case, then pfn_valid_within is dangerous. Or is what I
describe bad behavior, and memory blocks should only encompass valid pfns?
OR is the real bug the fact that we have missing sections to begin with?

Any advice would be great,

Thanks!

Andrew Banman
abanman@sgi.com

*Note that CONFIG_HOLES_IN_ZONE is not available on x86, and setting it would be
inappropriate in this case anyway: the problem is missing sections, not holes
within a MAX_ORDER_NR_PAGES-aligned block.
Comment 1 Seth Jennings 2015-11-23 22:15:57 UTC
Commit bdee237c0 added dynamic memory block sizing:
x86: mm: Use 2GB memory block size on large-memory x86-64 systems

From that commit message:
The caveat is that the memory can't be offlined (for hotplug or
otherwise) with the finer default 128MB granularity, but this is
unimportant due to the high memory densities generally used with such
large-memory systems, where eg a single DIMM is the order of 16GB.

The expectation of the code is that all sections in a block are present.

Before this commit, MIN_MEMORY_BLOCK_SIZE (128MB) was the only block size for x86 and it matched the section size (i.e. there was only ever one section per block unless CONFIG_X86_UV was set).

After this commit, this is no longer the case.  This bug demonstrates there can be 128MB section holes in the new 2G block size.
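The heuristic amounts to the following (a paraphrase of the logic that commit
added to arch/x86/mm/init_64.c, not the literal kernel code):

        /* paraphrase: systems with >= 64GB of memory get 2GB blocks,
         * smaller systems keep the 128MB MIN_MEMORY_BLOCK_SIZE */
        static unsigned long probe_memory_block_size(void)
        {
                if ((max_pfn << PAGE_SHIFT) < (64UL << 30))
                        return MIN_MEMORY_BLOCK_SIZE;   /* 128MB, one section */

                return 2UL << 30;                       /* 2GB, 16 sections */
        }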

Other than reverting the 2G dynamic block size, one fix could be to check each memory block for holes at block initialization time and prohibit offlining blocks with non-present sections.

probe_memory_block_size() could use some cleanup too :-/
Comment 2 Andrew Banman 2015-11-23 23:31:57 UTC
(In reply to Seth Jennings from comment #1)
> Other than reverting the 2G dynamic block size, one fix could be to check
> each memory block for holes at block initialization time and prohibit
> offlining blocks with non-present sections.

That's probably the simplest thing to do, although we'd have to live with holes in our blocks :( 

However, at the minimum, add_memory_block and memory_dev_init need to interface better. Right now it's possible for two memory blocks to have overlapping sections, primarily because the memory_dev_init loop doesn't know where the previous block actually ended. 

Alternatively we could implement a solution that prevents holes in the first place. I can think of two possibilities:

1. Add an array of section numbers as a mem block member, so that add_memory_block can assign the first 2GB of *present* sections to the next mem block. 


2. Allow variable block sizes, so that a new block starts at the next present section and ends up to 16 sections later, or earlier if it finds a missing section (a rough sketch follows below). This could be a decent compromise between reverting the 2GB blocks and overhauling them.
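A rough sketch of what option 2 could look like. Only present_section_nr(),
sections_per_block and the struct memory_block fields are existing kernel
names; the function itself is hypothetical:

        /* hypothetical: start the block at the first present section and end
         * it at the first hole, or after sections_per_block sections */
        static void init_variable_memory_block(struct memory_block *mem,
                                               unsigned long first_present_nr)
        {
                unsigned long nr = first_present_nr;

                mem->start_section_nr = nr;
                while (nr + 1 < first_present_nr + sections_per_block &&
                       present_section_nr(nr + 1))
                        nr++;
                mem->end_section_nr = nr;
        }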
In either case

Thank you for your reply!

AB
Comment 3 Seth Jennings 2015-12-01 22:23:15 UTC
(In reply to Andrew Banman from comment #2)
> (In reply to Seth Jennings from comment #1)
> > Other than reverting the 2G dynamic block size, one fix could be to check
> > each memory block for holes at block initialization time and prohibit
> > offlining blocks with non-present sections.
> 
> That's probably the simplest thing to do, although we'd have to live with
> holes in our blocks :( 
> 
> However, at the minimum, add_memory_block and memory_dev_init need to
> interface better. Right now it's possible for two memory blocks to have
> overlapping sections, primarily because the memory_dev_init loop doesn't
> know where the previous block actually ended.

I'm not following how a section could be added to two different memory blocks.  Can you explain how this could happen?

> 
> Alternatively we could implement a solution that prevents holes in the first
> place. I can think of two possibilities:
> 
> 1. Add an array of section numbers as a mem block member, so that
> add_memory_block can assign the first 2GB of *present* sections to the next
> mem block. 
> 
> 
> 2. Allow variable block sizes, so that a new block starts at the next
> present section and ends up to 16 sections later- earlier if it finds a
> missing section. This could be a decent compromise between reverting the 2GB
> blocks and overhauling them.

Both of those would be pretty invasive I think.

Either way, this patch is a quick fix for the crash:

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index dd30744..6d7b14c 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -303,6 +303,10 @@ static int memory_subsys_offline(struct device *dev)
        if (mem->state == MEM_OFFLINE)
                return 0;
 
+       /* Can't offline block with non-present sections */
+       if (mem->section_count != sections_per_block)
+               return -EINVAL;
+
        return memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
 }

What do you think?

Seth

> In either case
> 
> Thank you for your reply!
> 
> AB
Comment 4 Seth Jennings 2015-12-01 22:26:52 UTC
Also, FYI, I have this patch on this list to clean up probe_memory_block_size()

https://lkml.org/lkml/2015/11/30/556
Comment 5 Andrew Banman 2015-12-02 02:02:52 UTC
(In reply to Seth Jennings from comment #3)
> (In reply to Andrew Banman from comment #2)
> > (In reply to Seth Jennings from comment #1)
> > > Other than reverting the 2G dynamic block size, one fix could be to check
> > > each memory block for holes at block initialization time and prohibit
> > > offlining blocks with non-present sections.
> > 
> > That's probably the simplest thing to do, although we'd have to live with
> > holes in our blocks :( 
> > 
> > However, at the minimum, add_memory_block and memory_dev_init need to
> > interface better. Right now it's possible for two memory blocks to have
> > overlapping sections, primarily because the memory_dev_init loop doesn't
> > know where the previous block actually ended.
> 
> I'm not following how a section could be added to two different memory
> blocks.  Can you explain how this could happen?

Sure thing. A 2GB block size corresponds to sections_per_block = 16. Suppose sections 0-3 are missing. memory_dev_init calls add_memory_block(i) starting with i=0, then i=16, etc.

        for (i = 0; i < NR_MEM_SECTIONS; i += sections_per_block) {
                err = add_memory_block(i);
                if (!ret)
                        ret = err;
        }

but add_memory_block will loop through sections until it finds a present one, and passes that section to init_memory_block as the start of the block:

        for (i = base_section_nr;
             (i < base_section_nr + sections_per_block) && i < NR_MEM_SECTIONS;
             i++) {
                if (!present_section_nr(i))
                        continue;
                if (section_count == 0)
                        section_nr = i;
                section_count++;
        }

        if (section_count == 0)
                return 0;
        ret = init_memory_block(&mem, __nr_to_section(section_nr), MEM_ONLINE);

Now as I mentioned above, init_memory_block will set the end_section_nr to be
sections_per_block ahead of the start_section_nr. So in this example memory0 would include sections 4-19. BUT, the memory_dev_init loop continues by calling add_memory_block(16), which doesn't find a missing section and so goes about constructing memory1 with sections 16-31.

> 
> > 
> > Alternatively we could implement a solution that prevents holes in the
> first
> > place. I can think of two possibilities:
> > 
> > 1. Add an array of section numbers as a mem block member, so that
> > add_memory_block can assign the first 2GB of *present* sections to the next
> > mem block. 
> > 
> > 
> > 2. Allow variable block sizes, so that a new block starts at the next
> > present section and ends up to 16 sections later- earlier if it finds a
> > missing section. This could be a decent compromise between reverting the
> 2GB
> > blocks and overhauling them.
> 
> Both of those would be pretty invasive I think.

I agree, but would it be worth it to bring the code in line with the original intention? I'm just going off the memory-hotplug doc: "It is expected that all memory sections in this range are present and no memory holes exist in the range".

There are a number of checks against missing sections throughout memory.c that
could be cleaned out if the memory_block initialization worked correctly :/
I'd say the second option is the better of the two.

> 
> Either way, the patch is a quick fix against the crash:
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index dd30744..6d7b14c 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -303,6 +303,10 @@ static int memory_subsys_offline(struct device *dev)
>         if (mem->state == MEM_OFFLINE)
>                 return 0;
>  
> +       /* Can't offline block with non-present sections */
> +       if (mem->section_count != sections_per_block)
> +               return -EINVAL;
> +
>         return memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
>  }
> 
> What do you think?

I think that would work for this case. The same logic lives in register_new_memory, but I'm confused why the block's section_count gets incremented there; I'm worried an inappropriate inc/dec of section_count would ruin the logic.

Actually this issue was half-way addressed by 

commit 21ea9f5ace3a7317cc3ba1fbc749758021a83136
Author: Russ Anderson <rja@sgi.com>
Date:   Wed Aug 28 16:35:18 2013 -0700
drivers/base/memory.c: fix show_mem_removable() to handle missing sections

The bad memory correctly had removable = 0. Perhaps there's a good way to use the removable value to prevent the user from trying to offline the block, maybe by just repeating the same check used there. Regardless of missing sections, if a block is labelled un-removable the user should probably be stopped.
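For reference, the check that commit added to show_mem_removable() skips
non-present sections when computing the removable value (drivers/base/memory.c,
roughly as of 4.3):

        static ssize_t show_mem_removable(struct device *dev,
                                struct device_attribute *attr, char *buf)
        {
                unsigned long i, pfn;
                int ret = 1;
                struct memory_block *mem = to_memory_block(dev);

                for (i = 0; i < sections_per_block; i++) {
                        if (!present_section_nr(mem->start_section_nr + i))
                                continue;
                        pfn = section_nr_to_pfn(mem->start_section_nr + i);
                        ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
                }

                return sprintf(buf, "%d\n", ret);
        }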

It seems like this missing section issue keeps cropping up. To me this warrants a bigger fix than a bandaid, but I'm new to this :) 

Thanks,

AB

> 
> Seth
> 
> > In either case
> > 
> > Thank you for your reply!
> > 
> > AB