Created attachment 192881 [details]
all printk debug output + drop into kdb

When block_size_bytes is set to 2GB (the default behavior for systems with 64GB or more of memory), init_memory_block runs the risk of adding non-present memory sections to a memory block. These are sections that were not discovered in sparse_init_one_section and so do not have a valid mem_map. Every pfn associated with a missing section is invalid:

    !SECTION_MARKED_PRESENT -> !SECTION_HAS_MEM_MAP -> pfn_valid = false

The problem is that memory blocks are set to always span the full number of sections per block, which runs the risk of including missing memory sections:

drivers/base/memory.c
---
614         mem->start_section_nr =
615                         base_memory_block_id(scn_nr) * sections_per_block;
616         mem->end_section_nr = mem->start_section_nr + sections_per_block - 1;

Relevant commits:

cb5e39b8 - drivers: base: refactor add_memory_section to add_memory_block
d3360164 - memory hotplug: Update phys_index to [start|end]_section_nr
56a3c655 - memory-hotplug: update documentation ... and remove end_phys_index

printks show which memory sections are getting SECTION_MARKED_PRESENT and SECTION_HAS_MEM_MAP, and the start and end section numbers for each memory block. I found that memory16 spans sections 256 through 271, but sections 264 through 511 are missing (see attached).

I found the problem when attempting to offline a block with missing sections:

# echo 0 > /sys/devices/system/memory/memory16/online

In test_pages_in_a_zone, pfn_valid_within always returns 1 since CONFIG_HOLES_IN_ZONE* is not set, thereby allowing an invalid pfn from a missing section to make its way through the rest of the code, quickly causing the system to drop into kdb (see attached, line 1233). This was on the recent 4.3 release.

I can't tell what the desired behavior is supposed to be. Was it intended for memory blocks to have missing sections in order to give them a uniform number of sections? If that's the case, then pfn_valid_within is dangerous. Or is what I describe bad behavior, and should memory blocks only encompass valid pfns? Or is the real bug the fact that we have missing sections to begin with?

Any advice would be great, thanks!

Andrew Banman
abanman@sgi.com

*Note that CONFIG_HOLES_IN_ZONE is not available on x86, and setting it would be inappropriate in this case - the problem is missing sections, not holes within a MAX_ORDER_NR_PAGES range.
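For context, the chain from a non-present section to an invalid pfn can be seen in the sparsemem helpers (roughly paraphrased from include/linux/mmzone.h - illustration only, not an exact copy of the 4.3 source):

/* Roughly paraphrased sparsemem helpers, for illustration. A section that
 * never got SECTION_MARKED_PRESENT never gets SECTION_HAS_MEM_MAP either,
 * so every pfn inside it fails pfn_valid(). */
static inline int valid_section(struct mem_section *section)
{
	return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
}

static inline int pfn_valid(unsigned long pfn)
{
	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
		return 0;
	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
}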
Commit bdee237c0 added dynamic memory block sizing:

    x86: mm: Use 2GB memory block size on large-memory x86-64 systems

From that commit message:

    The caveat is that the memory can't be offlined (for hotplug or
    otherwise) with the finer default 128MB granularity, but this is
    unimportant due to the high memory densities generally used with such
    large-memory systems, where eg a single DIMM is the order of 16GB.

The expectation of the code is that all sections in a block are present.

Before this commit, MIN_MEMORY_BLOCK_SIZE (128MB) was the only block size for x86, and it matched the section size (i.e. there was only ever one section per block, unless CONFIG_X86_UV was set).

After this commit, this is no longer the case. This bug demonstrates there can be 128MB section holes in the new 2GB block size.

Other than reverting the 2GB dynamic block size, one fix could be to check each memory block for holes at block initialization time and prohibit offlining blocks with non-present sections.

probe_memory_block_size() could use some cleanup too :-/
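To make the numbers concrete, memory_dev_init() derives sections_per_block by dividing the probed block size by MIN_MEMORY_BLOCK_SIZE. A small illustration (not kernel source) of how the 2GB block size becomes 16 sections per block on x86-64, where SECTION_SIZE_BITS is 27:

/* Illustration only: 2GB blocks vs. 128MB sections on x86-64. */
#define SECTION_SIZE_BITS	27
#define MIN_MEMORY_BLOCK_SIZE	(1UL << SECTION_SIZE_BITS)	/* 128MB */

unsigned long block_sz = 2UL << 30;	/* 2GB, as probed on a >= 64GB system */
unsigned long sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;	/* = 16 */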
(In reply to Seth Jennings from comment #1)
> Other than reverting the 2G dynamic block size, one fix could be to check
> each memory block for holes at block initialization time and prohibit
> offlining blocks with non-present sections.

That's probably the simplest thing to do, although we'd have to live with holes in our blocks :(

However, at the minimum, add_memory_block and memory_dev_init need to interface better. Right now it's possible for two memory blocks to have overlapping sections, primarily because the memory_dev_init loop doesn't know where the previous block actually ended.

Alternatively, we could implement a solution that prevents holes in the first place. I can think of two possibilities:

1. Add an array of section numbers as a mem block member, so that add_memory_block can assign the first 2GB of *present* sections to the next mem block.

2. Allow variable block sizes, so that a new block starts at the next present section and ends up to 16 sections later, or earlier if it finds a missing section. This could be a decent compromise between reverting the 2GB blocks and overhauling them. (A rough sketch of this option follows below.)

In either case, thank you for your reply!

AB
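For illustration, a rough, untested sketch of what option 2 might look like inside init_memory_block, assuming scn_nr is the first present section handed in from add_memory_block (names follow the 4.3 code quoted in the report; this is not a proposed patch):

/* Untested sketch of option 2 (variable-sized blocks): start the block at
 * the first present section and stop early at the first hole instead of
 * always spanning sections_per_block sections. */
mem->start_section_nr = scn_nr;
mem->end_section_nr = scn_nr;
for (i = scn_nr + 1;
     i < scn_nr + sections_per_block && i < NR_MEM_SECTIONS; i++) {
	if (!present_section_nr(i))
		break;			/* hole: end this block early */
	mem->end_section_nr = i;
}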
(In reply to Andrew Banman from comment #2)
> (In reply to Seth Jennings from comment #1)
> > Other than reverting the 2G dynamic block size, one fix could be to check
> > each memory block for holes at block initialization time and prohibit
> > offlining blocks with non-present sections.
>
> That's probably the simplest thing to do, although we'd have to live with
> holes in our blocks :(
>
> However, at the minimum, add_memory_block and memory_dev_init need to
> interface better. Right now it's possible for two memory blocks to have
> overlapping sections, primarily because the memory_dev_init loop doesn't
> know where the previous block actually ended.

I'm not following how a section could be added to two different memory blocks. Can you explain how this could happen?

> Alternatively, we could implement a solution that prevents holes in the
> first place. I can think of two possibilities:
>
> 1. Add an array of section numbers as a mem block member, so that
> add_memory_block can assign the first 2GB of *present* sections to the next
> mem block.
>
> 2. Allow variable block sizes, so that a new block starts at the next
> present section and ends up to 16 sections later, or earlier if it finds a
> missing section. This could be a decent compromise between reverting the
> 2GB blocks and overhauling them.

Both of those would be pretty invasive, I think.

Either way, this patch is a quick fix for the crash:

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index dd30744..6d7b14c 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -303,6 +303,10 @@ static int memory_subsys_offline(struct device *dev)
 	if (mem->state == MEM_OFFLINE)
 		return 0;
 
+	/* Can't offline block with non-present sections */
+	if (mem->section_count != sections_per_block)
+		return -EINVAL;
+
 	return memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
 }

What do you think?

Seth

> In either case, thank you for your reply!
>
> AB
Also, FYI, I have a patch on the list to clean up probe_memory_block_size():

https://lkml.org/lkml/2015/11/30/556
(In reply to Seth Jennings from comment #3)
> (In reply to Andrew Banman from comment #2)
> > (In reply to Seth Jennings from comment #1)
> > > Other than reverting the 2G dynamic block size, one fix could be to check
> > > each memory block for holes at block initialization time and prohibit
> > > offlining blocks with non-present sections.
> >
> > That's probably the simplest thing to do, although we'd have to live with
> > holes in our blocks :(
> >
> > However, at the minimum, add_memory_block and memory_dev_init need to
> > interface better. Right now it's possible for two memory blocks to have
> > overlapping sections, primarily because the memory_dev_init loop doesn't
> > know where the previous block actually ended.
>
> I'm not following how a section could be added to two different memory
> blocks. Can you explain how this could happen?

Sure thing. The 2GB block size corresponds to 16 sections_per_block. Suppose sections 0-3 are missing. memory_dev_init calls add_memory_block(i) starting with i=0, then i=16, and so on:

	for (i = 0; i < NR_MEM_SECTIONS; i += sections_per_block) {
		err = add_memory_block(i);
		if (!ret)
			ret = err;
	}

but add_memory_block will loop through sections until it finds a present one, and passes that section to init_memory_block as the start of the block:

	for (i = base_section_nr;
	     (i < base_section_nr + sections_per_block) && i < NR_MEM_SECTIONS;
	     i++) {
		if (!present_section_nr(i))
			continue;
		if (section_count == 0)
			section_nr = i;
		section_count++;
	}

	if (section_count == 0)
		return 0;
	ret = init_memory_block(&mem, __nr_to_section(section_nr), MEM_ONLINE);

Now, as I mentioned above, init_memory_block will set the end_section_nr to be sections_per_block ahead of the start_section_nr. So in this example memory0 would include sections 4-19. BUT, the memory_dev_init loop continues by calling add_memory_block(16), which doesn't find a missing section and so goes about constructing memory1 with sections 16-31.

> > Alternatively, we could implement a solution that prevents holes in the
> > first place. I can think of two possibilities:
> >
> > 1. Add an array of section numbers as a mem block member, so that
> > add_memory_block can assign the first 2GB of *present* sections to the next
> > mem block.
> >
> > 2. Allow variable block sizes, so that a new block starts at the next
> > present section and ends up to 16 sections later, or earlier if it finds a
> > missing section. This could be a decent compromise between reverting the
> > 2GB blocks and overhauling them.
>
> Both of those would be pretty invasive, I think.

I agree, but would it be worth it to bring the code in line with the original intention? I'm just going off the memory-hotplug doc: "It is expected that all memory sections in this range are present and no memory holes exist in the range". There are a number of checks against missing sections throughout memory.c that could be cleaned out if the memory_block initialization worked correctly :/

I'd say the second option is the better of the two.
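To make the scenario easy to poke at, here is a small stand-alone C toy that walks the two loops the way they are read above - assuming 16 sections per block, sections 0-3 missing, and the block start taken from the first present section found. It is only meant to reproduce the numbers in the example, not the exact kernel code:

/* Toy simulation of the overlap scenario described above - illustration only. */
#include <stdio.h>
#include <stdbool.h>

#define SECTIONS_PER_BLOCK	16
#define NR_SECTIONS		64

static bool section_present(int nr)
{
	return nr >= 4;		/* sections 0-3 are missing */
}

int main(void)
{
	int base, i, blk = 0;

	for (base = 0; base < NR_SECTIONS; base += SECTIONS_PER_BLOCK) {
		int start = -1;

		/* add_memory_block(): find the first present section in range */
		for (i = base; i < base + SECTIONS_PER_BLOCK; i++) {
			if (section_present(i)) {
				start = i;
				break;
			}
		}
		if (start < 0)
			continue;	/* no present sections, no block */

		/* init_memory_block(): block spans a full 16 sections */
		printf("memory%d: sections %d-%d\n",
		       blk++, start, start + SECTIONS_PER_BLOCK - 1);
	}
	return 0;
}

Under this reading the toy prints "memory0: sections 4-19" followed by "memory1: sections 16-31", i.e. sections 16-19 show up in both blocks.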
>
> Either way, this patch is a quick fix for the crash:
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index dd30744..6d7b14c 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -303,6 +303,10 @@ static int memory_subsys_offline(struct device *dev)
>  	if (mem->state == MEM_OFFLINE)
>  		return 0;
>  
> +	/* Can't offline block with non-present sections */
> +	if (mem->section_count != sections_per_block)
> +		return -EINVAL;
> +
>  	return memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
>  }
>
> What do you think?

I think that would work for this case. The same logic lives in register_new_memory, but I'm confused about why the block's section_count gets incremented there; I'm worried an inappropriate inc/dec of section_count would ruin the logic.

Actually, this issue was halfway addressed by

commit 21ea9f5ace3a7317cc3ba1fbc749758021a83136
Author: Russ Anderson <rja@sgi.com>
Date:   Wed Aug 28 16:35:18 2013 -0700

    drivers/base/memory.c: fix show_mem_removable() to handle missing sections

The bad memory correctly had removable = 0. Perhaps there's a good way to use the removable value to prevent the user from trying to offline the block, maybe by just repeating the same check used there. Regardless of missing sections, if a block is labelled un-removable the user should probably be stopped.

It seems like this missing section issue keeps cropping up. To me this warrants a bigger fix than a bandaid, but I'm new to this :)

Thanks,

AB

>
> Seth
>
> > In either case, thank you for your reply!
> >
> > AB
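As a concrete illustration of "repeating the same check used there", a rough, untested sketch of what gating memory_subsys_offline on present sections might look like, reusing present_section_nr and sections_per_block the way show_mem_removable does after commit 21ea9f5 (the helper name and placement are assumptions, not a proposed patch):

/* Untested sketch: refuse to offline a block whose span contains
 * non-present sections, mirroring the per-section check that
 * show_mem_removable() does after commit 21ea9f5. */
static bool memory_block_has_holes(struct memory_block *mem)
{
	unsigned long nr;

	for (nr = 0; nr < sections_per_block; nr++)
		if (!present_section_nr(mem->start_section_nr + nr))
			return true;
	return false;
}

static int memory_subsys_offline(struct device *dev)
{
	struct memory_block *mem = to_memory_block(dev);

	if (mem->state == MEM_OFFLINE)
		return 0;

	/* Don't let the offline path walk pfns in missing sections */
	if (memory_block_has_holes(mem))
		return -EINVAL;

	return memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
}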