Bug 107781
| Summary: | init_memory_block adds missing sections to memory block on large systems | | |
|---|---|---|---|
| Product: | Memory Management | Reporter: | Andrew Banman (abanman) |
| Component: | Other | Assignee: | Andrew Morton (akpm) |
| Status: | NEW | | |
| Severity: | normal | CC: | athorlton, rja, spartacus06 |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 4.3 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | all printk debug output + drop into kdb | | |
Description
Andrew Banman
2015-11-12 21:22:52 UTC
Commit bdee237c0 added dynamic memory block sizing:

    x86: mm: Use 2GB memory block size on large-memory x86-64 systems

From that commit message:

> This caveat is that the memory can't be offlined (for hotplug or otherwise)
> with the finer default 128MB granularity, but this is unimportant due to
> the high memory densities generally used with such large-memory systems,
> where eg a single DIMM is the order of 16GB.

The expectation of the code is that all sections in a block are present. Before this commit, MIN_MEMORY_BLOCK_SIZE (128MB) was the only block size for x86, and it matched the section size, i.e. there was only ever one section per block unless CONFIG_X86_UV was set. After this commit, this is no longer the case: this bug demonstrates that there can be 128MB section holes inside the new 2GB blocks.
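As an illustration of the sizes involved, here is a minimal standalone sketch, assuming the 128MB section size and the 2GB block size added by bdee237c0; the macro names are local to this example, not kernel symbols.

```c
/*
 * Minimal sketch of the sizes involved, assuming the 128MB section size
 * and the 2GB block size added by bdee237c0. The macro names are local
 * to this example, not kernel symbols.
 */
#include <stdio.h>

#define SECTION_SIZE (128UL << 20)	/* one memory section: 128MB */
#define BLOCK_SIZE   (2UL << 30)	/* one memory block: 2GB */

int main(void)
{
	unsigned long sections_per_block = BLOCK_SIZE / SECTION_SIZE;

	/* 16 sections per block; any absent one is a 128MB hole */
	printf("sections per block: %lu\n", sections_per_block);

	/* fixed-size blocks map a section to a block by division */
	printf("section 20 lives in block %lu\n", 20 / sections_per_block);
	return 0;
}
```

With one section per 128MB block, as before the commit, the hole problem cannot arise; with 16 sections per 2GB block it can.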
Comment 1
Seth Jennings

Other than reverting the 2G dynamic block size, one fix could be to check each memory block for holes at block initialization time and prohibit offlining blocks with non-present sections.

probe_memory_block_size() could use some cleanup too :-/

Comment 2
Andrew Banman

(In reply to Seth Jennings from comment #1)
> Other than reverting the 2G dynamic block size, one fix could be to check
> each memory block for holes at block initialization time and prohibit
> offlining blocks with non-present sections.

That's probably the simplest thing to do, although we'd have to live with holes in our blocks :(

However, at the minimum, add_memory_block and memory_dev_init need to interface better. Right now it's possible for two memory blocks to have overlapping sections, primarily because the memory_dev_init loop doesn't know where the previous block actually ended.

Alternatively we could implement a solution that prevents holes in the first place. I can think of two possibilities:

1. Add an array of section numbers as a mem block member, so that add_memory_block can assign the first 2GB of *present* sections to the next mem block.

2. Allow variable block sizes, so that a new block starts at the next present section and ends up to 16 sections later, or earlier if it finds a missing section. This could be a decent compromise between reverting the 2GB blocks and overhauling them.

In either case, thank you for your reply!

AB

Comment 3
Seth Jennings

(In reply to Andrew Banman from comment #2)
> That's probably the simplest thing to do, although we'd have to live with
> holes in our blocks :(
>
> However, at the minimum, add_memory_block and memory_dev_init need to
> interface better. Right now it's possible for two memory blocks to have
> overlapping sections, primarily because the memory_dev_init loop doesn't
> know where the previous block actually ended.

I'm not following how a section could be added to two different memory blocks. Can you explain how this could happen?

> Alternatively we could implement a solution that prevents holes in the
> first place. I can think of two possibilities:
>
> 1. Add an array of section numbers as a mem block member, so that
> add_memory_block can assign the first 2GB of *present* sections to the
> next mem block.
>
> 2. Allow variable block sizes, so that a new block starts at the next
> present section and ends up to 16 sections later, or earlier if it finds
> a missing section. This could be a decent compromise between reverting
> the 2GB blocks and overhauling them.

Both of those would be pretty invasive I think.

Either way, this patch is a quick fix against the crash:

```diff
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index dd30744..6d7b14c 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -303,6 +303,10 @@ static int memory_subsys_offline(struct device *dev)
 	if (mem->state == MEM_OFFLINE)
 		return 0;
 
+	/* Can't offline block with non-present sections */
+	if (mem->section_count != sections_per_block)
+		return -EINVAL;
+
 	return memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
 }
```

What do you think?

Seth

Also, FYI, I have this patch on the list to clean up probe_memory_block_size():
https://lkml.org/lkml/2015/11/30/556
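For illustration, here is a rough userspace view of what the proposed guard would look like from the other side of sysfs, assuming the standard memory-hotplug state file; the block number and error handling are illustrative, not taken from the thread.

```c
/*
 * Rough userspace view of the proposed guard, assuming the standard
 * sysfs memory-hotplug interface. The block number is illustrative;
 * with the patch above, offlining a block with missing sections would
 * fail and errno would be EINVAL.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path = "/sys/devices/system/memory/memory0/state";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* the buffered write is flushed at fclose, so check both */
	if (fputs("offline", f) == EOF || fclose(f) == EOF)
		fprintf(stderr, "offline failed: %s\n", strerror(errno));
	return 0;
}
```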
Comment 4
Andrew Banman

(In reply to Seth Jennings from comment #3)
> I'm not following how a section could be added to two different memory
> blocks. Can you explain how this could happen?

Sure thing. The 2GB block size corresponds to sections_per_block = 16. Suppose sections 0-3 are missing. memory_dev_init calls add_memory_block(i) starting with i = 0, then i = 16, and so on:

```c
	for (i = 0; i < NR_MEM_SECTIONS; i += sections_per_block) {
		err = add_memory_block(i);
		if (!ret)
			ret = err;
	}
```

but add_memory_block will loop through sections until it finds a present one, and passes that section to init_memory_block as the start of the block:

```c
	for (i = base_section_nr;
	     (i < base_section_nr + sections_per_block) && i < NR_MEM_SECTIONS;
	     i++) {
		if (!present_section_nr(i))
			continue;
		if (section_count == 0)
			section_nr = i;
		section_count++;
	}

	if (section_count == 0)
		return 0;
	ret = init_memory_block(&mem, __nr_to_section(section_nr), MEM_ONLINE);
```

Now, as I mentioned above, init_memory_block will set the end_section_nr to be sections_per_block ahead of the start_section_nr. So in this example memory0 would include sections 4-19. BUT, the memory_dev_init loop continues by calling add_memory_block(16), which doesn't find a missing section and so goes about constructing memory1 with sections 16-31.

> Both of those would be pretty invasive I think.

I agree, but would it be worth it to bring the code in line with the original intention? I'm just going off the memory-hotplug doc: "It is expected that all memory sections in this range are present and no memory holes exist in the range". There are a number of checks against missing sections throughout memory.c that could be cleaned out if the memory_block initialization worked correctly :/

I'd say the second option is the better of the two.

> Either way, the patch is a quick fix against the crash: [...]
>
> What do you think?

I think that would work for this case. The same logic lives in register_new_memory, but I'm confused why the block's section_count gets incremented there. I'm worried an inappropriate inc/dec of section_count would ruin the logic.

Actually, this issue was half-way addressed by:

    commit 21ea9f5ace3a7317cc3ba1fbc749758021a83136
    Author: Russ Anderson <rja@sgi.com>
    Date:   Wed Aug 28 16:35:18 2013 -0700

        drivers/base/memory.c: fix show_mem_removable() to handle missing sections

The bad memory correctly had removable = 0. Perhaps there's a good way to use the removable value to prevent the user from trying to offline the block, maybe by just repeating the same check used there. Regardless of missing sections, if a block is labelled un-removable the user should probably be stopped.

It seems like this missing section issue keeps cropping up. To me this warrants a bigger fix than a bandaid, but I'm new to this :)

Thanks,

AB
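To make the walkthrough in comment #4 concrete, here is a small standalone simulation of the described flow, assuming 16 sections per block and sections 0-3 missing; the helpers are simplified stand-ins for the kernel code quoted above, and the block span follows the comment's description (start at the first present section, cover a full 16 sections from there).

```c
/*
 * Standalone simulation of the flow described above: 16 sections per
 * block, sections 0-3 missing. The helpers are simplified stand-ins
 * for the kernel code quoted in this comment, and the block span
 * follows the walkthrough (start at the first present section, cover
 * a full 16 sections from there).
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_MEM_SECTIONS    64
#define SECTIONS_PER_BLOCK 16

static bool present_section_nr(int nr)
{
	return nr >= 4;			/* sections 0-3 are the hole */
}

static void add_memory_block(int base_section_nr)
{
	int i, section_count = 0, section_nr = 0;

	/* scan the block-sized window for the first present section */
	for (i = base_section_nr;
	     i < base_section_nr + SECTIONS_PER_BLOCK && i < NR_MEM_SECTIONS;
	     i++) {
		if (!present_section_nr(i))
			continue;
		if (section_count == 0)
			section_nr = i;
		section_count++;
	}
	if (section_count == 0)
		return;

	/* init_memory_block then spans a full block from that section */
	printf("memory block: sections %d-%d\n",
	       section_nr, section_nr + SECTIONS_PER_BLOCK - 1);
}

int main(void)
{
	int i;

	for (i = 0; i < NR_MEM_SECTIONS; i += SECTIONS_PER_BLOCK)
		add_memory_block(i);
	return 0;
}
```

With this layout it prints spans 4-19 and 16-31 back to back, i.e. sections 16-19 end up in two blocks, which is the overlap described above.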
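And a hedged sketch of option 2 from comment #2 (variable block sizes), which comment #4 favors: a block starts at the next present section and ends up to 16 sections later, or earlier at the first hole. This is one possible reading of the proposal, not kernel code.

```c
/*
 * Sketch of option 2 (variable block sizes): a block starts at the
 * next present section and ends up to 16 sections later, or earlier
 * at the first hole. One possible reading of the proposal only.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_MEM_SECTIONS    64
#define SECTIONS_PER_BLOCK 16

static bool present_section_nr(int nr)
{
	return nr >= 4 && nr != 10;	/* example holes: 0-3 and 10 */
}

int main(void)
{
	int nr = 0;

	while (nr < NR_MEM_SECTIONS) {
		int start, end;

		/* skip holes to find where the next block starts */
		while (nr < NR_MEM_SECTIONS && !present_section_nr(nr))
			nr++;
		if (nr >= NR_MEM_SECTIONS)
			break;
		start = nr;

		/* grow up to 16 sections, stopping early at a hole */
		end = start;
		while (end + 1 < NR_MEM_SECTIONS &&
		       end + 1 < start + SECTIONS_PER_BLOCK &&
		       present_section_nr(end + 1))
			end++;

		printf("memory block: sections %d-%d\n", start, end);
		nr = end + 1;
	}
	return 0;
}
```

Here the hole at section 10 cuts the first block short (sections 4-9) instead of being absorbed into it, so no block ever contains a non-present section.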
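Finally, a sketch of the removable-flag idea from the end of comment #4: every block exports a "removable" file in sysfs, so the same signal that show_mem_removable computes could gate offline attempts from userspace. The block number is illustrative.

```c
/*
 * Sketch of the removable-flag idea: every block exports a "removable"
 * file in sysfs, so the same signal that show_mem_removable computes
 * could gate offline attempts. The block number is illustrative.
 */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/devices/system/memory/memory0/removable";
	FILE *f = fopen(path, "r");
	int removable = 0;

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fscanf(f, "%d", &removable) == 1 && !removable)
		printf("memory0 is not removable; don't try to offline it\n");
	fclose(f);
	return 0;
}
```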