Bug 206401

Summary: kernel panic on Hyper-V after 5 minutes due to memory hot-add
Product: Memory Management
Reporter: Taketo Kabe (kkabe)
Component: Other
Assignee: Andrew Morton (akpm)
Status: NEW
Severity: blocking
Priority: P1
Hardware: i386
OS: Linux
Kernel Version: 5.5.0
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Don't free pages just onlined on memory hot-add
Patch to add diags for memset() call
use C code for page_init_poison() instead of memset()
mm/hotplug: do not __free_pages_core in generic_online_page
dmesg.txt
panic.txt
kernel-i586.config
readelf-kcore.txt
dmesg
ha00.patch

Description Taketo Kabe 2020-02-04 11:00:25 UTC
When a Hyper-V guest runs with little memory (e.g. 160 MB initially)
and is put under memory pressure,
the kernel panics after about 5 minutes, at the point where
the hypervisor requests memory hot-add.

Related bug for kernel 4.19.95:
https://bugzilla.kernel.org/show_bug.cgi?id=206181#c12

[  302.169238] hv_balloon: Max. dynamic memory size: 1048576 MB
[  302.367821] BUG: unable to handle page fault for address: 00280000
[  302.367862] #PF: supervisor write access in kernel mode
[  302.367900] #PF: error_code(0x0002) - not-present page
[  302.367933] *pde = 00000000
[  302.367961] Oops: 0002 [#1] SMP
[  302.367984] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.5.0-1.el8.i586 #1
[  302.368030] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  05/23/2012
[  302.368091] Workqueue: events hot_add_req [hv_balloon]
[  302.368129] EIP: memset+0xb/0x20
[  302.368161] Code: f9 01 72 0b 8a 0e 88 0f 8d b4 26 00 00 00 00 8b 45 f0 83 c4 04 5b 5e 5f 5d c3 90 8d 74 26 00 55 89 e5 57 89 c7 53 89 c3 89 d0 <f3> aa 89 d8 5b 5f 5d c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 3e
[  302.368209] EAX: ffffffff EBX: 00280000 ECX: 000a0000 EDX: ffffffff
[  302.368236] ESI: 00010000 EDI: 00280000 EBP: c993fdec ESP: c993fde4
[  302.368253] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010202
[  302.368275] CR0: 80050033 CR2: 00280000 CR3: 053eb000 CR4: 003406d0
[  302.368298] Call Trace:
[  302.368308]  page_init_poison+0x1d/0x30
[  302.368319]  sparse_add_section+0x137/0x1ab
[  302.368356]  __add_pages+0x9a/0x100
[  302.368373]  arch_add_memory+0x39/0x40
[  302.368394]  add_memory_resource+0x155/0x200
[  302.368412]  ? irq_work_queue+0x1f/0x30
[  302.368429]  __add_memory+0x7e/0xf0
[  302.368447]  add_memory+0x2c/0x40
[  302.368462]  hot_add_req+0x564/0x5c0 [hv_balloon]
[  302.368477]  process_one_work+0x176/0x310
[  302.368498]  worker_thread+0x39/0x3c0
[  302.368516]  kthread+0xf0/0x110
[  302.368530]  ? rescuer_thread+0x2f0/0x2f0
[  302.368551]  ? kthread_park+0x90/0x90
[  302.368570]  ret_from_fork+0x2e/0x40
[  302.368588] Modules linked in: rfkill intel_rapl_msr intel_rapl_common crc32_pclmul sg snd_pcm snd_timer snd intel_rapl_perf soundcore pcspkr hv_utils hv_netvsc hv_balloon hyperv_fb i2c_piix4 joydev ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod ata_generic hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw hv_vmbus libata
[  302.368659] CR2: 0000000000280000
[  302.368668] ---[ end trace ecd710eeebcc6d97 ]---
[  302.368690] EIP: memset+0xb/0x20
[  302.368704] Code: f9 01 72 0b 8a 0e 88 0f 8d b4 26 00 00 00 00 8b 45 f0 83 c4 04 5b 5e 5f 5d c3 90 8d 74 26 00 55 89 e5 57 89 c7 53 89 c3 89 d0 <f3> aa 89 d8 5b 5f 5d c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 3e
[  302.368769] EAX: ffffffff EBX: 00280000 ECX: 000a0000 EDX: ffffffff
[  302.368800] ESI: 00010000 EDI: 00280000 EBP: c993fdec ESP: c993fde4
[  302.368827] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010202
[  302.368856] CR0: 80050033 CR2: 00280000 CR3: 053eb000 CR4: 003406d0
[  302.368924] Kernel panic - not syncing: Fatal exception
[  302.368949] Kernel Offset: 0x2800000 from 0xc1000000 (relocation range: 0xc0000000-0xca7effff)
[  302.368972] ---[ end Kernel panic - not syncing: Fatal exception ]---
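
A note on reading the register dump above, offered as a sketch rather than a diagnosis. The bytes around the marked instruction appear to decode as `mov %eax,%ebx` (89 c3) followed by `rep stos %al,%es:(%edi)` (f3 aa). If so, EBX holds the destination memset() was called with, EDI the current store address, and ECX the number of bytes still to be written. Under that reading, EDI == EBX == CR2 means the very first store faulted, so memset() was already handed 0x00280000. The helper and macro names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Register values copied from the oops above. */
#define OOPS_EBX 0x00280000u /* copy of the destination saved by memset() */
#define OOPS_EDI 0x00280000u /* current store pointer of "rep stos" */
#define OOPS_ECX 0x000a0000u /* bytes still to be stored */
#define OOPS_CR2 0x00280000u /* faulting address */

/* Bytes stored before the fault: how far EDI moved from the saved start. */
uint32_t bytes_stored_before_fault(uint32_t start, uint32_t current)
{
    /* EFLAGS in the oops has DF clear, so "rep stos" only moves upward. */
    assert(current >= start);
    return current - start;
}

/* Length originally requested: bytes already stored plus bytes remaining. */
uint32_t requested_length(uint32_t stored, uint32_t remaining)
{
    return stored + remaining;
}
```

With the values from the oops, `bytes_stored_before_fault(OOPS_EBX, OOPS_EDI)` is 0 and the requested length works out to 0xa0000 (655360) bytes.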
Comment 1 Taketo Kabe 2020-02-04 11:25:48 UTC
Created attachment 287103 [details]
Don't free pages just onlined on memory hot-add

This patch fixed the panic that occurred after 5 minutes.
Comment 2 Taketo Kabe 2020-02-06 11:18:01 UTC
Created attachment 287165 [details]
Patch to add diags for memset() call

Patch to see parameters passed in to memset()
Comment 3 Taketo Kabe 2020-02-06 11:27:54 UTC
The patch in #c1 didn't completely fix the issue.

Something goes wrong in the memset() call in page_init_poison().
The patch in https://bugzilla.kernel.org/show_bug.cgi?id=206401#c2
adds printk diagnostics around the memset() call.

The kernel still panics, now after around 10 minutes.
The 1st argument of memset() is printed as expected (passed in as %ebx),
but the panic dump shows %ebx holding 0x280000, and I don't know where that value comes from.

If I replace the memset() call inside page_init_poison() with C code
(commented out in the patch #c2),
the panic doesn't happen and the hot-added memory is recognized properly.


[  302.572775] hv_balloon: Max. dynamic memory size: 1048576 MB
[  657.870102] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x10000) << PAGE_SHIFT)=0x10000000, (HA_CHUNK << PAGE_SHIFT)=134217728)
[  657.883219] sparse_add_section: page_init_poison(pfn_to_page(start_pfn=65536)=0xdf4a3955, (sizeof(struct page)=40 * nr_pages)=655360)
[  657.896495] page_init_poison: poisoning(0xdf4a3955 size=655360)
[  657.896521] __memset_generic: (0xdf4a3955, -1, 655360)
[  657.896542] BUG: unable to handle page fault for address: 00280000
[  657.896588] #PF: supervisor write access in kernel mode
[  657.896629] #PF: error_code(0x0002) - not-present page
[  657.896650] *pde = 00000000
[  657.896669] Oops: 0002 [#1] SMP
[  657.896682] CPU: 0 PID: 495 Comm: kworker/0:0 Tainted: G            E     5.5.0-2.el8.i586 #6
[  657.896713] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  05/23/2012
[  657.896750] Workqueue: events hot_add_req [hv_balloon]
[  657.896769] EIP: memset+0x16/0x21
[  657.896794] Code: 00 00 00 00 8b 45 f0 83 c4 04 5b 5e 5f 5d c3 90 8d 74 26 00 55 89 e5 57 53 89 c3 83 ec 08 84 d2 0f 85 0f 00 00 00 89 df 89 d0 <f3> aa 8d 65 f8 89 d8 5b 5f 5d c3 0f be c2 51 50 53 68 50 7a 31 c7
[  657.896860] EAX: ffffffff EBX: 00280000 ECX: 000a0000 EDX: ffffffff
[  657.896885] ESI: 000a0000 EDI: 00280000 EBP: c9d31dbc ESP: c9d31dac
[  657.896910] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286
[  657.896934] CR0: 80050033 CR2: 00280000 CR3: 077ca000 CR4: 003406d0
[  657.896964] Call Trace:
[  657.896971]  page_init_poison.cold.6+0x21/0x29
[  657.896983]  sparse_add_section+0x168/0x1df
[  657.896993]  __add_pages+0x9a/0x100
[  657.897003]  arch_add_memory+0x39/0x40
[  657.897013]  add_memory_resource+0x155/0x200
[  657.897024]  __add_memory+0x7e/0xf0
[  657.897033]  add_memory+0x2c/0x40
[  657.897043]  hot_add_req+0x564/0x5c0 [hv_balloon]
[  657.897055]  process_one_work+0x176/0x310
[  657.897065]  worker_thread+0x39/0x3c0
[  657.897074]  kthread+0xf0/0x110
[  657.897083]  ? rescuer_thread+0x2f0/0x2f0
[  657.897094]  ? kthread_park+0x90/0x90
[  657.897104]  ret_from_fork+0x2e/0x40
[  657.897119] Modules linked in: rfkill intel_rapl_msr intel_rapl_common crc32_pclmul snd_pcm snd_timer snd intel_rapl_perf soundcore pcspkr hv_netvsc hyperv_fb sg hv_balloon(E) hv_utils i2c_piix4 joydev ip_tables ext4 mbcache jbd2 sd_mod sr_mod cdrom ata_generic hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc ata_piix hv_vmbus crc32c_intel serio_raw libata
[  657.897211] CR2: 0000000000280000
[  657.897225] ---[ end trace ccb1bf2b4a48fd4e ]---
[  657.897240] EIP: memset+0x16/0x21
[  657.897253] Code: 00 00 00 00 8b 45 f0 83 c4 04 5b 5e 5f 5d c3 90 8d 74 26 00 55 89 e5 57 53 89 c3 83 ec 08 84 d2 0f 85 0f 00 00 00 89 df 89 d0 <f3> aa 8d 65 f8 89 d8 5b 5f 5d c3 0f be c2 51 50 53 68 50 7a 31 c7
[  657.897319] EAX: ffffffff EBX: 00280000 ECX: 000a0000 EDX: ffffffff
[  657.897349] ESI: 000a0000 EDI: 00280000 EBP: c9d31dbc ESP: c9d31dac
[  657.897368] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286
[  657.897402] CR0: 80050033 CR2: 00280000 CR3: 077ca000 CR4: 003406d0
[  657.897418] Kernel panic - not syncing: Fatal exception
[  657.897426] Kernel Offset: 0x5a00000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff)
[  657.897477] ---[ end Kernel panic - not syncing: Fatal exception ]---
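
One observation on the unexplained 0x280000, offered as a possibility rather than a confirmed cause: with the values printed by the diagnostic patch above, it equals start_pfn * sizeof(struct page) (0x10000 * 40). That is what indexing a flat array of struct page at start_pfn would give if the array base were 0. Whether this is what actually happens inside sparse_add_section() is an assumption. Note also that plain `%p` in printk has printed hashed values since Linux 4.15, so the `0xdf4a3955` in the diagnostic lines would not be expected to match the register dump even if both refer to the same pointer. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values printed by the diagnostic patch in the log above. */
#define START_PFN        0x10000u  /* start_pfn passed to add_memory() */
#define STRUCT_PAGE_SIZE 40u       /* sizeof(struct page) on this build */
#define POISON_BYTES     655360u   /* size handed to page_init_poison() */
#define FAULT_ADDR       0x280000u /* CR2 / EBX / EDI in the oops */

/* Byte offset of a pfn's struct page inside a flat array of struct page. */
uint32_t memmap_offset(uint32_t pfn, uint32_t struct_page_size)
{
    return pfn * struct_page_size;
}

/* Array base that would make base + offset equal the observed address. */
uint32_t implied_memmap_base(uint32_t addr, uint32_t pfn,
                             uint32_t struct_page_size)
{
    uint32_t offset = memmap_offset(pfn, struct_page_size);

    assert(addr >= offset);
    return addr - offset;
}

/* Number of struct page entries covered by one poisoning call. */
uint32_t pages_poisoned(uint32_t bytes, uint32_t struct_page_size)
{
    assert(struct_page_size != 0);
    return bytes / struct_page_size;
}
```

Here `implied_memmap_base(FAULT_ADDR, START_PFN, STRUCT_PAGE_SIZE)` comes out as 0, and `pages_poisoned(POISON_BYTES, STRUCT_PAGE_SIZE)` as 16384 pages, i.e. 64 MiB of hot-added memory per poisoning call with 4 KiB pages.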
Comment 4 Taketo Kabe 2020-02-09 10:54:41 UTC
Created attachment 287259 [details]
use C code for page_init_poison() instead of memset()

not for production

The generated asm code:

c11f8591 <page_init_poison.cold.6>:
                pr_info("%s: poisoning(0x%p size=%u)\n", __func__, p, size);
c11f8591:       52                      push   %edx
c11f8592:       89 d6                   mov    %edx,%esi
c11f8594:       89 c3                   mov    %eax,%ebx
c11f8596:       50                      push   %eax
c11f8597:       68 60 29 87 c1          push   $0xc1872960
c11f859c:       68 d4 e9 ae c1          push   $0xc1aee9d4
c11f85a1:       e8 74 b5 ec ff          call   c10c3b1a <printk>
c11f85a6:       31 c0                   xor    %eax,%eax
c11f85a8:       83 c4 10                add    $0x10,%esp
                for (; size; size--) {
c11f85ab:       39 f0                   cmp    %esi,%eax
c11f85ad:       0f 84 24 fc ff ff       je     c11f81d7 <page_init_poison+0x17>
                        *p++ = PAGE_POISON_PATTERN;
c11f85b3:       c6 04 03 ff             movb   $0xff,(%ebx,%eax,1)
c11f85b7:       83 c0 01                add    $0x1,%eax
c11f85ba:       eb ef                   jmp    c11f85ab <page_init_poison.cold.6+0x1a>
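
For readers without the attachment at hand, the loop interleaved in the listing above corresponds roughly to the following user-space sketch. It is not the attached patch: the function name is hypothetical, and the 0xff fill value is taken from the `movb $0xff` store in the disassembly.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 0xff matches the "movb $0xff" store in the listing above. */
#define POISON_BYTE 0xffu

/* Fill size bytes starting at mem with the poison byte, one store at a time. */
void poison_bytes(void *mem, size_t size)
{
    unsigned char *p = mem;

    assert(mem != NULL || size == 0);
    for (; size; size--)
        *p++ = POISON_BYTE;
}
```

Because each byte is written by an explicit store, a bad destination pointer faults inside this function rather than inside memset(), which is why the EIP moves in the later oops output.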
Comment 5 Taketo Kabe 2020-02-09 11:02:59 UTC
I stopped using memset() in page_init_poison() and tried
substituting C code for it.
After 10 minutes, memory hot-add kicks in and
the code still tries to write to 0x00280000, which comes out of the blue.
%eax holds the proper value at least on entry to the function, as
the pr_info() output shows.

I can't figure out why %ebx is clobbered to 0x00280000.
This isn't -fPIC code, so GLOBAL_OFFSET_TABLE shouldn't be an issue here.


[   20.642141] __memset_generic: (0x8df6368b, -1, 4096)
[   21.696914] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
[   22.388843] __memset_generic: (0x20d15bd3, -52, 4096)
[   41.909261] __memset_generic: (0x8b330031, -1, 4096)
[   41.935897] __memset_generic: (0xb5798dd0, -1, 4096)
[  302.901147] hv_balloon: Max. dynamic memory size: 1048576 MB
[  642.922869] __memset_generic: (0xd3993b1f, -1, 4096)
[  647.684524] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x10000) << PAGE_SHIFT)=0x10000000, (HA_CHUNK << PAGE_SHIFT)=134217728)
[  647.713291] sparse_add_section: page_init_poison(pfn_to_page(start_pfn=65536)=0xd0c5af0b, (sizeof(struct page)=40 * nr_pages)=655360)
[  647.713360] page_init_poison: poisoning(0xd0c5af0b size=655360)
[  647.725456] BUG: unable to handle page fault for address: 00280000
[  647.725478] #PF: supervisor write access in kernel mode
[  647.725501] #PF: error_code(0x0002) - not-present page
[  647.725527] *pde = 00000000
[  647.725556] Oops: 0002 [#1] SMP
[  647.725571] CPU: 0 PID: 459 Comm: kworker/0:1 Tainted: G            E     5.5.0-2.el8.i586 #23
[  647.725614] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  05/23/2012
[  647.725649] Workqueue: events hot_add_req [hv_balloon]
[  647.725672] EIP: page_init_poison.cold.6+0x22/0x2b
[  647.725698] Code: b5 ec ff 83 c4 3c c9 c3 52 89 d6 89 c3 50 68 60 29 47 c3 68 d4 e9 6e c3 e8 74 b5 ec ff 31 c0 83 c4 10 39 f0 0f 84 24 fc ff ff <c6> 04 03 ff 83 c0 01 eb ef 8b 43 04 a8 01 0f 85 6f 01 00 00 8b 43
[  647.725770] EAX: 00000000 EBX: 00280000 ECX: ca400f00 EDX: ca3f6e8c
[  647.725795] ESI: 000a0000 EDI: 00000004 EBP: c2573ddc ESP: c2573dd4
[  647.725815] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010287
[  647.725835] CR0: 80050033 CR2: 00280000 CR3: 08caa000 CR4: 003406d0
[  647.725853] Call Trace:
[  647.725862]  sparse_add_section+0x168/0x1df
[  647.725877]  __add_pages+0x9a/0x100
[  647.725895]  arch_add_memory+0x39/0x40
[  647.725909]  add_memory_resource+0x155/0x200
[  647.725927]  ? irq_work_queue+0x1f/0x30
[  647.725937]  __add_memory+0x7e/0xf0
[  647.725956]  add_memory+0x2c/0x40
[  647.725968]  hot_add_req+0x564/0x5c0 [hv_balloon]
[  647.725985]  process_one_work+0x176/0x310
[  647.726004]  worker_thread+0x39/0x3c0
[  647.726030]  kthread+0xf0/0x110
[  647.726043]  ? rescuer_thread+0x2f0/0x2f0
[  647.726072]  ? kthread_park+0x90/0x90
[  647.726101]  ret_from_fork+0x2e/0x40
[  647.726113] Modules linked in: rfkill intel_rapl_msr intel_rapl_common snd_pcm snd_timer snd sg crc32_pclmul soundcore intel_rapl_perf hv_utils pcspkr hv_balloon(E) i2c_piix4 joydev hv_netvsc hyperv_fb ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod ata_generic hid_hyperv hyperv_keyboard hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw libata hv_vmbus
[  647.726197] CR2: 0000000000280000
[  647.726230] ---[ end trace 9d9aa98b89d59f21 ]---
[  647.726287] EIP: page_init_poison.cold.6+0x22/0x2b
[  647.726336] Code: b5 ec ff 83 c4 3c c9 c3 52 89 d6 89 c3 50 68 60 29 47 c3 68 d4 e9 6e c3 e8 74 b5 ec ff 31 c0 83 c4 10 39 f0 0f 84 24 fc ff ff <c6> 04 03 ff 83 c0 01 eb ef 8b 43 04 a8 01 0f 85 6f 01 00 00 8b 43
[  647.726393] EAX: 00000000 EBX: 00280000 ECX: ca400f00 EDX: ca3f6e8c
[  647.726412] ESI: 000a0000 EDI: 00000004 EBP: c2573ddc ESP: c2573dd4
[  647.726429] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010287
[  647.726442] CR0: 80050033 CR2: 00280000 CR3: 08caa000 CR4: 003406d0
[  647.726455] Kernel panic - not syncing: Fatal exception
[  647.726473] Kernel Offset: 0x1c00000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff)
[  647.726493] ---[ end Kernel panic - not syncing: Fatal exception ]---
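
The `Code:` bytes in this oops can be matched against the disassembly in Comment 4 once the `Kernel Offset` is subtracted. As a sketch under that assumption: the bytes `68 60 29 47 c3` read as `push $0xc3472960`, and subtracting the reported offset 0x1c00000 gives 0xc1872960, the immediate shown in the listing. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Translate a run-time kernel address back to its link-time address. */
uint32_t runtime_to_linktime(uint32_t runtime_addr, uint32_t kernel_offset)
{
    assert(runtime_addr >= kernel_offset);
    return runtime_addr - kernel_offset;
}

/* Assemble a little-endian 32-bit immediate from four instruction bytes. */
uint32_t imm32_le(const unsigned char bytes[4])
{
    return (uint32_t)bytes[0] |
           ((uint32_t)bytes[1] << 8) |
           ((uint32_t)bytes[2] << 16) |
           ((uint32_t)bytes[3] << 24);
}
```

The same subtraction applies to any EIP or call-trace address when looking it up in an unrelocated vmlinux.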
Comment 6 Taketo Kabe 2020-02-10 05:40:04 UTC
Sprinkling a couple of "volatile"s made the compiler emit code
that does not use %ebx.
Still, -4(%ebp) (the first argument of page_init_poison()) seems to be
corrupted somewhere.

I'm out of ideas. I believe printk() doesn't clobber the stack beyond its own frame.


[   17.823414] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
[   17.920401] __memset_generic: (0x25102f96, -52, 4096)
[   42.250068] __memset_generic: (0xa93cc8c2, -1, 4096)
[  302.493577] hv_balloon: Max. dynamic memory size: 1048576 MB
[  302.809823] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x10000) << PAGE_SHIFT)=0x10000000, (HA_CHUNK << PAGE_SHIFT)=134217728)
[  302.817265] sparse_add_section: page_init_poison(pfn_to_page(start_pfn=65536)=0x3d1d0f24, (sizeof(struct page)=40 * nr_pages)=655360)
[  302.823824] page_init_poison: poisoning(0x3d1d0f24 size=655360)	<<<<<<<correct "page" parameter
[  302.823866] page_init_poison: eax(-0x4(%ebp)) = 0x280000		<<<<<<<corrupt
[  302.823899] BUG: unable to handle page fault for address: 00280000
[  302.823948] #PF: supervisor write access in kernel mode
[  302.823992] #PF: error_code(0x0002) - not-present page
[  302.824036] *pde = 00000000
[  302.824059] Oops: 0002 [#1] SMP
[  302.824072] CPU: 0 PID: 209 Comm: kworker/0:3 Tainted: G            E     5.5.0-2.el8.i586 #27
[  302.824095] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  05/23/2012
[  302.824123] Workqueue: events hot_add_req [hv_balloon]
[  302.824151] EIP: page_init_poison.cold.6+0x44/0x4c
[  302.824169] Code: 8b 45 fc 50 68 60 29 c7 c3 68 f8 e9 ee c3 e8 5c b5 ec ff 8b 55 f8 83 c4 1c 85 d2 0f 84 0c fc ff ff 8b 45 fc 83 ea 01 8d 48 01 <c6> 00 ff 89 4d fc eb e7 8b 43 04 a8 01 0f 85 6f 01 00 00 8b 43 10
[  302.824215] EAX: 00280000 EBX: 00010000 ECX: 00280001 EDX: 0009ffff
[  302.824234] ESI: c5e00000 EDI: 00000004 EBP: c57f1ddc ESP: c57f1dd4
[  302.824262] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010216
[  302.824291] CR0: 80050033 CR2: 00280000 CR3: 07ac6000 CR4: 003406d0
[  302.824324] Call Trace:
[  302.824340]  sparse_add_section+0x168/0x1df
[  302.824361]  __add_pages+0x9a/0x100
[  302.824376]  arch_add_memory+0x39/0x40
[  302.824393]  add_memory_resource+0x155/0x200
[  302.824407]  ? irq_work_queue+0x1f/0x30
[  302.824420]  __add_memory+0x7e/0xf0
[  302.824437]  add_memory+0x2c/0x40
[  302.824450]  hot_add_req+0x564/0x5c0 [hv_balloon]
[  302.824461]  process_one_work+0x176/0x310
[  302.824473]  worker_thread+0x39/0x3c0
[  302.824481]  kthread+0xf0/0x110
[  302.824491]  ? rescuer_thread+0x2f0/0x2f0
[  302.824504]  ? kthread_park+0x90/0x90
[  302.824518]  ret_from_fork+0x2e/0x40
[  302.824525] Modules linked in: rfkill intel_rapl_msr intel_rapl_common crc32_pclmul snd_pcm snd_timer snd intel_rapl_perf soundcore pcspkr hv_netvsc sg hyperv_fb i2c_piix4 hv_balloon(E) hv_utils joydev ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod ata_generic hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw libata hv_vmbus
[  302.824651] CR2: 0000000000280000
[  302.824671] ---[ end trace fc6908983c929182 ]---
[  302.824692] EIP: page_init_poison.cold.6+0x44/0x4c
[  302.824714] Code: 8b 45 fc 50 68 60 29 c7 c3 68 f8 e9 ee c3 e8 5c b5 ec ff 8b 55 f8 83 c4 1c 85 d2 0f 84 0c fc ff ff 8b 45 fc 83 ea 01 8d 48 01 <c6> 00 ff 89 4d fc eb e7 8b 43 04 a8 01 0f 85 6f 01 00 00 8b 43 10
[  302.824788] EAX: 00280000 EBX: 00010000 ECX: 00280001 EDX: 0009ffff
[  302.824817] ESI: c5e00000 EDI: 00000004 EBP: c57f1ddc ESP: c57f1dd4
[  302.824847] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010216
[  302.824861] CR0: 80050033 CR2: 00280000 CR3: 07ac6000 CR4: 003406d0
[  302.824887] Kernel panic - not syncing: Fatal exception
[  302.824902] Kernel Offset: 0x2400000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff)
[  302.824930] ---[ end Kernel panic - not syncing: Fatal exception ]---


0000003f <page_init_poison.cold.6>:
                unsigned char * volatile p = (unsigned char *)page;
  3f:   89 45 fc                mov    %eax,-0x4(%ebp)
                pr_info("%s: poisoning(0x%p size=%u)\n", __func__, p, size);
  42:   8b 45 fc                mov    -0x4(%ebp),%eax
  45:   52                      push   %edx
  46:   50                      push   %eax			<<<<<<has correct value
  47:   68 00 00 00 00          push   $0x0
  4c:   68 08 01 00 00          push   $0x108
  51:   89 55 f8                mov    %edx,-0x8(%ebp)
  54:   e8 fc ff ff ff          call   55 <page_init_poison.cold.6+0x16>
                asm volatile ("mov -0x4(%%ebp),%0" : "=mr" (eax));
  59:   8b 45 fc                mov    -0x4(%ebp),%eax
                pr_info("%s: eax(-0x4(%%ebp)) = 0x%lx\n", __func__, eax);
  5c:   50                      push   %eax			<<<<<<corrupt to 0x280000
  5d:   68 00 00 00 00          push   $0x0
  62:   68 28 01 00 00          push   $0x128
  67:   e8 fc ff ff ff          call   68 <page_init_poison.cold.6+0x29>
                for (; size; size--) {
  6c:   8b 55 f8                mov    -0x8(%ebp),%edx
                pr_info("%s: eax(-0x4(%%ebp)) = 0x%lx\n", __func__, eax);
  6f:   83 c4 1c                add    $0x1c,%esp
                for (; size; size--) {
  72:   85 d2                   test   %edx,%edx
  74:   0f 84 14 00 00 00       je     8e <__dump_page.cold.7+0x3>
                        *p++ = PAGE_POISON_PATTERN;
  7a:   8b 45 fc                mov    -0x4(%ebp),%eax
                for (; size; size--) {
  7d:   83 ea 01                sub    $0x1,%edx
                        *p++ = PAGE_POISON_PATTERN;
  80:   8d 48 01                lea    0x1(%eax),%ecx
  83:   c6 00 ff                movb   $0xff,(%eax)
  86:   89 4d fc                mov    %ecx,-0x4(%ebp)
  89:   eb e7                   jmp    72 <page_init_poison.cold.6+0x33>
Comment 7 Taketo Kabe 2020-02-10 09:23:44 UTC
I suspected that speculative execution was doing something wrong,
so I inserted a serializing instruction (cpuid), but
the stack value seems to be corrupt already.


[   21.746337] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
[   22.653412] __memset_generic: (0x9895d2fb, -52, 4096)
[   23.387076] __memset_generic: (0x73d71684, -1, 4096)
[   42.498008] __memset_generic: (0xcbc859df, -1, 4096)
[   42.507944] __memset_generic: (0xb78a3cd6, -1, 4096)
[   57.065106] __memset_generic: (0x330820c3, -1, 4096)
[  302.573533] hv_balloon: Max. dynamic memory size: 1048576 MB
[  626.396603] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x10000) << PAGE_SHIFT)=0x10000000, (HA_CHUNK << PAGE_SHIFT)=134217728)
[  626.445711] sparse_add_section: page_init_poison(pfn_to_page(start_pfn=65536)=0xa8a43056, (sizeof(struct page)=40 * nr_pages)=655360)
[  626.454676] page_init_poison: poisoning(0xa8a43056 size=655360)	<<<<<<correct
[  626.454691] page_init_poison: -0xc(%ebp) = 0x280000			<<<<<<corrupt
[  626.454701] BUG: unable to handle page fault for address: 00280000
[  626.454713] #PF: supervisor write access in kernel mode
[  626.454723] #PF: error_code(0x0002) - not-present page
[  626.454733] *pde = 00000000
[  626.454741] Oops: 0002 [#1] SMP
[  626.454750] CPU: 0 PID: 253 Comm: kworker/0:6 Tainted: G            E     5.5.0-2.el8.i586 #31
[  626.454769] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  05/23/2012
[  626.454786] Workqueue: events hot_add_req [hv_balloon]
[  626.454797] EIP: page_init_poison.cold.6+0x44/0x4c
[  626.454808] Code: c0 0f a2 8b 45 f4 50 68 60 29 07 c3 68 f7 ec 2e c3 e8 49 b5 ec ff 83 c4 1c 85 f6 0f 84 fe fb ff ff 8b 45 f4 83 ee 01 8d 50 01 <c6> 00 ff 89 55 f4 eb e7 8b 43 04 a8 01 0f 85 6f 01 00 00 8b 43 10
[  626.454839] EAX: 00280000 EBX: 756e6547 ECX: 00000007 EDX: 00280001
[  626.454850] ESI: 0009ffff EDI: 00000004 EBP: c96dfddc ESP: c96dfdd0
[  626.454870] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010216
[  626.454882] CR0: 80050033 CR2: 00280000 CR3: 035ca000 CR4: 003406d0
[  626.454896] Call Trace:
[  626.454903]  sparse_add_section+0x168/0x1df
[  626.454914]  __add_pages+0x9a/0x100
[  626.454927]  arch_add_memory+0x39/0x40
[  626.454937]  add_memory_resource+0x155/0x200
[  626.454952]  __add_memory+0x7e/0xf0
[  626.454958]  add_memory+0x2c/0x40
[  626.454964]  hot_add_req+0x564/0x5c0 [hv_balloon]
[  626.454976]  process_one_work+0x176/0x310
[  626.454984]  worker_thread+0x39/0x3c0
[  626.454992]  kthread+0xf0/0x110
[  626.454999]  ? rescuer_thread+0x2f0/0x2f0
[  626.455007]  ? kthread_park+0x90/0x90
[  626.455018]  ret_from_fork+0x2e/0x40
[  626.455039] Modules linked in: rfkill intel_rapl_msr intel_rapl_common sg crc32_pclmul snd_pcm snd_timer snd intel_rapl_perf soundcore hv_netvsc pcspkr hv_balloon(E) hv_utils hyperv_fb i2c_piix4 joydev ip_tables ext4 mbcache jbd2 sr_mod cdrom ata_generic sd_mod hid_hyperv hv_storvsc scsi_transport_fc hyperv_keyboard ata_piix hv_vmbus crc32c_intel serio_raw libata
[  626.455120] CR2: 0000000000280000
[  626.455128] ---[ end trace f0ac4774afc72c0d ]---
[  626.455143] EIP: page_init_poison.cold.6+0x44/0x4c
[  626.455157] Code: c0 0f a2 8b 45 f4 50 68 60 29 07 c3 68 f7 ec 2e c3 e8 49 b5 ec ff 83 c4 1c 85 f6 0f 84 fe fb ff ff 8b 45 f4 83 ee 01 8d 50 01 <c6> 00 ff 89 55 f4 eb e7 8b 43 04 a8 01 0f 85 6f 01 00 00 8b 43 10
[  626.455201] EAX: 00280000 EBX: 756e6547 ECX: 00000007 EDX: 00280001
[  626.455219] ESI: 0009ffff EDI: 00000004 EBP: c96dfddc ESP: c96dfdd0
[  626.455239] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010216
[  626.455262] CR0: 80050033 CR2: 00280000 CR3: 035ca000 CR4: 003406d0
[  626.455290] Kernel panic - not syncing: Fatal exception
[  626.455335] Kernel Offset: 0x1800000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff)
[  626.455365] ---[ end Kernel panic - not syncing: Fatal exception ]---


c11f85a1 <page_init_poison.cold.6>:
                unsigned char * volatile p = (unsigned char *)page;
c11f85a1:       89 45 f4                mov    %eax,-0xc(%ebp)
                pr_info("%s: poisoning(0x%p size=%u)\n", __func__, p, size);
c11f85a4:       8b 45 f4                mov    -0xc(%ebp),%eax
c11f85a7:       89 d6                   mov    %edx,%esi
c11f85a9:       52                      push   %edx
c11f85aa:       50                      push   %eax		<<<<<<correct
c11f85ab:       68 60 29 87 c1          push   $0xc1872960
c11f85b0:       68 d8 e9 ae c1          push   $0xc1aee9d8
c11f85b5:       e8 60 b5 ec ff          call   c10c3b1a <printk>
                asm volatile ("cpuid" : : "a" (0) : "ebx","ecx","edx"); /* serialize */
c11f85ba:       31 c0                   xor    %eax,%eax
c11f85bc:       0f a2                   cpuid
                asm volatile ("mov -0xc(%%ebp),%0" : "=r" (eax));
c11f85be:       8b 45 f4                mov    -0xc(%ebp),%eax
                pr_info("%s: -0xc(%%ebp) = 0x%lx\n", __func__, eax);
c11f85c1:       50                      push   %eax		<<<<<<<corrupt to 0x280000
c11f85c2:       68 60 29 87 c1          push   $0xc1872960
c11f85c7:       68 f7 ec ae c1          push   $0xc1aeecf7
c11f85cc:       e8 49 b5 ec ff          call   c10c3b1a <printk>
c11f85d1:       83 c4 1c                add    $0x1c,%esp
                for (; size; size--) {
c11f85d4:       85 f6                   test   %esi,%esi
c11f85d6:       0f 84 fe fb ff ff       je     c11f81da <page_init_poison+0x1a>
                        *p++ = PAGE_POISON_PATTERN;
c11f85dc:       8b 45 f4                mov    -0xc(%ebp),%eax
                for (; size; size--) {
c11f85df:       83 ee 01                sub    $0x1,%esi
                        *p++ = PAGE_POISON_PATTERN;
c11f85e2:       8d 50 01                lea    0x1(%eax),%edx
c11f85e5:       c6 00 ff                movb   $0xff,(%eax)
c11f85e8:       89 55 f4                mov    %edx,-0xc(%ebp)
c11f85eb:       eb e7                   jmp    c11f85d4 <page_init_poison.cold.6+0x33>
Comment 8 Andrew Morton 2020-02-10 23:20:21 UTC
Please see http://lkml.kernel.org/r/20200210060923.GC8965@MiWiFi-R3L-srv

It is hoped that https://lore.kernel.org/linux-mm/20200209104826.3385-7-bhe@redhat.com/ will address this bug.  Are you able to test that patch?

Thanks.
Comment 9 Taketo Kabe 2020-02-11 05:59:09 UTC
Applied the patch from https://lore.kernel.org/linux-mm/20200209104826.3385-7-bhe@redhat.com/ ;
page_init_poison() seems to work now, but the kernel panics in a different place.

I suspect this is the related bug for kernel 4.19.95:
https://bugzilla.kernel.org/show_bug.cgi?id=206181#c12

And I'm still at a loss as to why the register/stack was corrupted.


[   25.342662] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
[   25.421668] __memset_generic: (0x876ab121, -52, 4096)
[   41.549826] __memset_generic: (0xc66e2fca, -1, 4096)
[  302.439480] hv_balloon: Max. dynamic memory size: 1048576 MB
[  302.886990] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x10000) << PAGE_SHIFT)=0x10000000, (HA_CHUNK << PAGE_SHIFT)=134217728)
[  302.896657] sparse_add_section: page_init_poison(pfn_to_page(start_pfn=65536)=0x98f4054f, (sizeof(struct page)=40 * nr_pages)=655360)
[  302.896716] page_init_poison: poisoning(0xfc490b07 size=655360)
[  302.907080] page_init_poison: -0xc(%ebp) = 0xfc490b07
[  303.182662] sparse_add_section: page_init_poison(pfn_to_page(start_pfn=81920)=0xa4eb0440, (sizeof(struct page)=40 * nr_pages)=655360)
[  303.182696] page_init_poison: poisoning(0x4aa5620d size=655360)
[  303.182720] page_init_poison: -0xc(%ebp) = 0x4aa5620d
[  303.462739] BUG: unable to handle page fault for address: d2bff000
[  303.462763] #PF: supervisor write access in kernel mode
[  303.462772] #PF: error_code(0x0002) - not-present page
[  303.462787] *pde = 00000000
[  303.462796] Oops: 0002 [#1] SMP
[  303.462806] CPU: 0 PID: 349 Comm: systemd-udevd Tainted: G            E     5.5.0-2.el8.i586 #40
[  303.462824] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  05/23/2012
[  303.462843] EIP: wp_page_copy+0x8e/0x750
[  303.462855] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 a9 de e5 ff 89 45 bc 89 f8 e8 9f de e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
[  303.462888] EAX: d2bff000 EBX: c746bf0c ECX: 018a3e00 EDX: c358c000
[  303.462907] ESI: c358c000 EDI: d2bff004 EBP: c746bed0 ESP: c746be8c
[  303.462923] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
[  303.462938] CR0: 80050033 CR2: d2bff000 CR3: 0947d000 CR4: 003406d0
[  303.462953] Call Trace:
[  303.462967]  ? reuse_swap_page+0x83/0x390
[  303.462978]  do_wp_page+0x87/0x6e0
[  303.462989]  handle_mm_fault+0x808/0xe30
[  303.463000]  ? __sys_recvmsg+0x3c/0x80
[  303.463010]  __do_page_fault+0x18e/0x3d0
[  303.463022]  ? __do_page_fault+0x3d0/0x3d0
[  303.463033]  do_page_fault+0x28/0xd0
[  303.463043]  ? __do_page_fault+0x3d0/0x3d0
[  303.463068]  common_exception_read_cr2+0x15a/0x15f
[  303.463078] EIP: 0xb7b03e44
[  303.463086] Code: 8d 04 31 89 44 24 10 39 30 0f 85 77 01 00 00 8b 51 08 8b 41 0c 39 4a 0c 0f 85 35 01 00 00 39 48 08 0f 85 2c 01 00 00 89 42 0c <89> 50 08 81 fb ef 03 00 00 76 0b 8b 59 10 85 db 0f 85 da 01 00 00
[  303.463115] EAX: 018a4688 EBX: 00000141 ECX: 01827258 EDX: b7c2a878
[  303.463137] ESI: 00000140 EDI: 00000100 EBP: b7c2a7a0 ESP: bfa74a20
[  303.463148] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00210246
[  303.463160] Modules linked in: rfkill intel_rapl_msr intel_rapl_common crc32_pclmul intel_rapl_perf snd_pcm snd_timer snd soundcore pcspkr hv_utils hv_netvsc i2c_piix4 hyperv_fb sg hv_balloon(E) joydev ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod ata_generic hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw libata hv_vmbus
[  303.463218] CR2: 00000000d2bff000
[  303.463230] ---[ end trace 53dd1f0742b34512 ]---
[  303.463244] EIP: wp_page_copy+0x8e/0x750
[  303.463257] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 a9 de e5 ff 89 45 bc 89 f8 e8 9f de e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
[  303.463288] EAX: d2bff000 EBX: c746bf0c ECX: 018a3e00 EDX: c358c000
[  303.463304] ESI: c358c000 EDI: d2bff004 EBP: c746bed0 ESP: c746be8c
[  303.463321] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
[  303.463337] CR0: 80050033 CR2: d2bff000 CR3: 0947d000 CR4: 003406d0
[  303.463352] Kernel panic - not syncing: Fatal exception
[  303.463364] Kernel Offset: 0x2c00000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff)
[  303.463383] ---[ end Kernel panic - not syncing: Fatal exception ]---
Comment 10 Taketo Kabe 2020-02-11 06:42:53 UTC
Created attachment 287293 [details]
mm/hotplug: do not __free_pages_core in generic_online_page

The patch from https://lore.kernel.org/linux-mm/20200209104826.3385-7-bhe@redhat.com/ together with this one seems to fix Hyper-V memory hot-add.
Comment 11 Taketo Kabe 2020-02-11 09:52:15 UTC
With the 2 patches in https://bugzilla.kernel.org/show_bug.cgi?id=206401#c10 ,
a plain text-based installation seems to work, but under severe memory pressure
while running the anaconda installer, it still panics.
This time the path is sparse_add_section()->compaction_alloc(), so
it seems I've hit a different bug.


[  303.803241] invalid opcode: 0000 [#1] SMP
[  303.803250] CPU: 0 PID: 452 Comm: kworker/0:5 Not tainted 5.5.0-2.el8.i586 #1
[  303.803261] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  05/23/2012
[  303.803281] Workqueue: events hot_add_req [hv_balloon]
[  303.803290] EIP: isolate_freepages_block+0x314/0x350
[  303.803299] Code: 8d 5c c3 d8 66 90 03 75 0c 03 5d cc 39 75 e4 0f 87 89 fd ff ff e9 e3 fd ff ff 8d 74 26 00 ba 50 e0 eb c2 89 d8 e8 8c 57 00 00 <0f> 0b 8d 76 00 8d bc 27 00 00 00 00 89 75 cc c7 45 dc 00 00 00 00
[  303.803326] EAX: c2eeecd2 EBX: dc5bfd80 ECX: 00000007 EDX: c2ebe050
[  303.803337] ESI: 0001fff0 EDI: dfb99c78 EBP: dfb99b34 ESP: dfb99af0
[  303.803353] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010092
[  303.803365] CR0: 80050033 CR2: 0184f6e4 CR3: 031ca000 CR4: 003406d0
[  303.803386] Call Trace:
[  303.803396]  compaction_alloc+0x862/0x920
[  303.803405]  ? isolate_freepages_block+0x350/0x350
[  303.803415]  migrate_pages+0xc6/0xa20
[  303.803425]  ? __ClearPageMovable+0xa0/0xa0
[  303.803440]  ? isolate_freepages_block+0x350/0x350
[  303.803448]  compact_zone+0x708/0xc50
[  303.803455]  ? __switch_to_asm+0x28/0x50
[  303.803465]  ? __switch_to_asm+0x34/0x50
[  303.803478]  try_to_compact_pages+0x122/0x2e0
[  303.803493]  __alloc_pages_direct_compact+0x6a/0x120
[  303.803507]  __alloc_pages_slowpath+0x39d/0xc10
[  303.803519]  ? update_load_avg+0xac/0x760
[  303.803528]  ? __switch_to_asm+0x34/0x50
[  303.803539]  ? __switch_to_asm+0x34/0x50
[  303.803551]  ? __switch_to_asm+0x28/0x50
[  303.803561]  ? __switch_to_asm+0x34/0x50
[  303.803569]  ? __switch_to_asm+0x28/0x50
[  303.803577]  ? __switch_to_asm+0x34/0x50
[  303.803587]  ? __switch_to_asm+0x28/0x50
[  303.803597]  __alloc_pages_nodemask+0x27a/0x2b0
[  303.803608]  ? __switch_to_asm+0x28/0x50
[  303.803618]  populate_section_memmap+0x16/0x4d
[  303.803630]  sparse_add_section+0xe5/0x18e
[  303.803641]  __add_pages+0x9a/0x100
[  303.803652]  arch_add_memory+0x39/0x40
[  303.803660]  add_memory_resource+0x155/0x200
[  303.803670]  __add_memory+0x7e/0xf0
[  303.803678]  ? boot_override_clock+0x20/0x47
[  303.803689]  add_memory+0x2c/0x40
[  303.803700]  hot_add_req+0x3ae/0x5c0 [hv_balloon]
[  303.803715]  process_one_work+0x176/0x310
[  303.803730]  worker_thread+0x39/0x3c0
[  303.803741]  kthread+0xf0/0x110
[  303.803752]  ? rescuer_thread+0x2f0/0x2f0
[  303.803760]  ? kthread_park+0x90/0x90
[  303.803768]  ret_from_fork+0x2e/0x40
[  303.803782] Modules linked in: vfat fat xfs libfc rfkill zram sg intel_rapl_msr intel_rapl_common intel_rapl_perf pcspkr hv_utils hv_balloon i2c_piix4 joydev ext4 mbcache jbd2 loop nls_utf8 isofs sr_mod cdrom sd_mod ata_generic 8021q garp mrp stp llc hv_netvsc hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc hyperv_fb ata_piix crc32_pclmul serio_raw hv_vmbus libata sunrpc xts lrw dm_crypt dm_round_robin dm_multipath dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_zero dm_mod linear raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_intel raid1 raid0 iscsi_ibft squashfs cramfs be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 libcxgbi libcxgb iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi edd
[  303.803968] ---[ end trace 3a57a22ff51d9741 ]---
[  303.803994] EIP: isolate_freepages_block+0x314/0x350
[  303.804019] Code: 8d 5c c3 d8 66 90 03 75 0c 03 5d cc 39 75 e4 0f 87 89 fd ff ff e9 e3 fd ff ff 8d 74 26 00 ba 50 e0 eb c2 89 d8 e8 8c 57 00 00 <0f> 0b 8d 76 00 8d bc 27 00 00 00 00 89 75 cc c7 45 dc 00 00 00 00
[  303.804073] EAX: c2eeecd2 EBX: dc5bfd80 ECX: 00000007 EDX: c2ebe050
[  303.804086] ESI: 0001fff0 EDI: dfb99c78 EBP: dfb99b34 ESP: dfb99af0
[  303.804112] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010092
[  303.804139] CR0: 80050033 CR2: 0184f6e4 CR3: 031ca000 CR4: 003406d0
[  303.804169] Kernel panic - not syncing: Fatal exception
[  303.804194] Kernel Offset: 0x1400000 from 0xc1000000 (relocation range: 0xc0000000-0xe07effff)
[  303.804238] ---[ end Kernel panic - not syncing: Fatal exception ]---
Comment 12 Andrew Morton 2020-02-12 00:41:33 UTC
On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang <richardw.yang@linux.intel.com> wrote:

> On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
> >On 02/10/20 at 02:09pm, Baoquan He wrote:
> >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
> >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com> wrote:
> >> > 
> >> > > Hi Andrew,
> >> > > 
> >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
> >> > > > On Tue, 04 Feb 2020 11:25:48 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
> >> > > > 
> >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> >> > > > > 
> >> > > > 
> >> > > > An oops during mem hotadd.  Could someone please take a look when
> >> > > > convenient?
> >> > > 
> >> > > This has been addressed by Wei Yang's patch, please check it here:
> >> > > 
> >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> >> > > 
> >> > 
> >> > hm, OK, thanks.  It's unfortunate that a 5.5 fix is buried in a
> >> > six-patch series which is still in progress!  Can we please merge that
> >> > as a standalone fix with a cc:stable, Fixes:, etc?
> >
> >Maybe can add Fixes tag as follow when merge:
> >
> >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> >

The reporter (cc'ed here) is still seeing issues:
https://bugzilla.kernel.org/show_bug.cgi?id=206401

Could we please continue this investigation via emailed reply-to-all,
rather than via the bugzilla interface?
Comment 13 Baoquan He 2020-02-12 07:31:46 UTC
On 02/11/20 at 04:41pm, Andrew Morton wrote:
> On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang <richardw.yang@linux.intel.com>
> wrote:
> 
> > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
> > >On 02/10/20 at 02:09pm, Baoquan He wrote:
> > >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
> > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com> wrote:
> > >> > 
> > >> > > Hi Andrew,
> > >> > > 
> > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
> > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
> > >> > > > 
> > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> > >> > > > > 
> > >> > > > 
> > >> > > > An oops during mem hotadd.  Could someone please take a look when
> > >> > > > convenient?
> > >> > > 
> > >> > > This has been addressed by Wei Yang's patch, please check it here:
> > >> > > 
> > >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> > >> > > 
> > >> > 
> > >> > hm, OK, thanks.  It's unfortunate that a 5.5 fix is buried in a
> > >> > six-patch series which is still in progress!  Can we please merge that
> > >> > as a standalone fix with a cc:stable, Fixes:, etc?
> > >
> > >Maybe can add Fixes tag as follow when merge:
> > >
> > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> > >
> 
> The reporter (cc'ed here) is still seeing issues:
> https://bugzilla.kernel.org/show_bug.cgi?id=206401
> 
> Could we please continue this investigation via emailed reply-to-all,
> rather than via the bugzilla interface?

Yes, people prefer mailing list to discuss issues.

Hi T.Kabe, 

Could you provide the call trace again after the patch below is applied?
Comment #9 in bugzilla is not very clear to me.

mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com

And, as you said, applying the above patch and not calling
__free_pages_core() in generic_online_page() will work. I doubt it,
because without __free_pages_core() your added pages are not handed
to the buddy allocator for management. I think we should clarify this problem
first, so as not to introduce new problems through an improper workaround,
and then check what comes next.

Thanks
Baoquan
Comment 14 David Hildenbrand 2020-02-12 08:22:15 UTC
On 12.02.20 08:31, Baoquan He wrote:
> On 02/11/20 at 04:41pm, Andrew Morton wrote:
>> On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang <richardw.yang@linux.intel.com>
>> wrote:
>>
>>> On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
>>>> On 02/10/20 at 02:09pm, Baoquan He wrote:
>>>>> On 02/09/20 at 09:56pm, Andrew Morton wrote:
>>>>>> On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> On 02/09/20 at 09:32pm, Andrew Morton wrote:
>>>>>>>> On Tue, 04 Feb 2020 11:25:48 +0000 bugzilla-daemon@bugzilla.kernel.org
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=206401
>>>>>>>>>
>>>>>>>>
>>>>>>>> An oops during mem hotadd.  Could someone please take a look when
>>>>>>>> convenient?
>>>>>>>
>>>>>>> This has been addressed by Wei Yang's patch, please check it here:
>>>>>>>
>>>>>>> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
>>>>>>>
>>>>>>
>>>>>> hm, OK, thanks.  It's unfortunate that a 5.5 fix is buried in a
>>>>>> six-patch series which is still in progress!  Can we please merge that
>>>>>> as a standalone fix with a cc:stable, Fixes:, etc?
>>>>
>>>> Maybe can add Fixes tag as follow when merge:
>>>>
>>>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>>>>
>>
>> The reporter (cc'ed here) is still seeing issues:
>> https://bugzilla.kernel.org/show_bug.cgi?id=206401
>>
>> Could we please continue this investigation via emailed reply-to-all,
>> rather than via the bugzilla interface?
> 
> Yes, people prefer mailing list to discuss issues.
> 
> Hi T.Kabe, 
> 
> Could you provide the call trace again after below patch is applied?
> The comment #9 in bugzilla is not very clear to me.
> 
> mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> 
> And, as you said, applying above patch, and do not call
> __free_pages_core() in generic_online_page() will work. I doubt it,
> because without __free_pages_core(), your added pages are not added
> into buddy for managing. 

Removing __free_pages_core() from generic_online_page() is just
plain wrong and would break memory hotplug in general. So that is
certainly not the right fix.

HV supports memory sections that are fully added, but of which only parts
are actually backed by the hypervisor, "online" and exposed to the buddy.

When onlining memory, it will online the backed parts via
hv_online_page()->generic_online_page(). When requested to hot-add
more memory, the guest will online the remaining parts that are now
backed via handle_pg_range()->hv_bring_pgs_online().

So if generic_online_page() fails it's either because

1. HV guest driver has a bug and tries to online something it shouldn't
2. HV hypervisor has a bug and does not back memory properly before hot-adding
3. Memory hotplug code has a bug and does not properly add the memory block/sections


Please note that we switched to using generic_online_page() in

commit 30a9c246b9f6fe0591e8afb05758a3e3b096fabe
Author: David Hildenbrand <david@redhat.com>
Date:   Sat Nov 30 17:53:55 2019 -0800

    hv_balloon: use generic_online_page()
    
    Let's use the generic onlining function - which will now also take care
    of calling kernel_map_pages().

However, the old code ended up calling
	__free_pages_core() -> __free_pages()
And the new one ends up calling
	__online_page_free() -> __free_reserved_page() -> __free_page()
So I don't think it's related to that.


Especially, looking at the kernel messages, I can see that the kernel crashes
when adding memory, not when onlining it? So I do think there is still
something wrong in the SPARSE hot-add code if you keep seeing issues.
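
To make the partial-backing behaviour described above easier to follow, here is a toy model in Python. It is only a sketch under stated assumptions: the class and method names are hypothetical and do not correspond to hv_balloon or mm code, and the PFN values are made up. It shows a section that is fully added while only the backed pages are onlined, with a later hot-add request onlining the newly backed remainder.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Toy model of one hot-added memory section (all names hypothetical)."""
    start_pfn: int
    nr_pages: int
    backed: set = field(default_factory=set)  # PFNs backed by the hypervisor
    online: set = field(default_factory=set)  # PFNs exposed to the allocator

    def back_range(self, pfn, count):
        """Mark [pfn, pfn + count) as backed, clipped to this section."""
        lo = max(pfn, self.start_pfn)
        hi = min(pfn + count, self.start_pfn + self.nr_pages)
        self.backed.update(range(lo, hi))

    def online_backed(self):
        """Online pages that are backed but not yet online; return the count."""
        new = self.backed - self.online
        self.online |= new
        return len(new)


# Made-up numbers: a 0x4000-page section of which 0x1000 pages are backed first.
sec = Section(start_pfn=0x10000, nr_pages=0x4000)
sec.back_range(0x10000, 0x1000)
first = sec.online_backed()

# A later hot-add request backs another 0x800 pages of the same section.
sec.back_range(0x11000, 0x800)
second = sec.online_backed()

print(first, second, len(sec.online))
```

In this model a page is never onlined unless it is backed, which is the invariant the three failure cases listed above would each break in a different place.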
Comment 15 kabe 2020-02-13 04:30:04 UTC
bhe@redhat.com sed in <20200212073123.GG8965@MiWiFi-R3L-srv>

>> On 02/11/20 at 04:41pm, Andrew Morton wrote:
>> > On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang
>> <richardw.yang@linux.intel.com> wrote:
>> > 
>> > > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
>> > > >On 02/10/20 at 02:09pm, Baoquan He wrote:
>> > > >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
>> > > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com>
>> wrote:
>> > > >> > 
>> > > >> > > Hi Andrew,
>> > > >> > > 
>> > > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
>> > > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000
>> bugzilla-daemon@bugzilla.kernel.org wrote:
>> > > >> > > > 
>> > > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
>> > > >> > > > > 
>> > > >> > > > 
>> > > >> > > > An oops during mem hotadd.  Could someone please take a look
>> when
>> > > >> > > > convenient?
>> > > >> > > 
>> > > >> > > This has been addressed by Wei Yang's patch, please check it
>> here:
>> > > >> > > 
>> > > >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
>> > > >> > > 
>> > > >> > 
>> > > >> > hm, OK, thanks.  It's unfortunate that a 5.5 fix is buried in a
>> > > >> > six-patch series which is still in progress!  Can we please merge
>> that
>> > > >> > as a standalone fix with a cc:stable, Fixes:, etc?
>> > > >
>> > > >Maybe can add Fixes tag as follow when merge:
>> > > >
>> > > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>> > > >
>> > 
>> > The reporter (cc'ed here) is still seeing issues:
>> > https://bugzilla.kernel.org/show_bug.cgi?id=206401
>> > 
>> > Could we please continue this investigation via emailed reply-to-all,
>> > rather than via the bugzilla interface?
>> 
>> Yes, people prefer mailing list to discuss issues.
>> 
>> Hi T.Kabe, 
>> 
>> Could you provide the call trace again after below patch is applied?
>> The comment #9 in bugzilla is not very clear to me.
>> 
>> mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
>> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
>> 
>> And, as you said, applying above patch, and do not call
>> __free_pages_core() in generic_online_page() will work. I doubt it,
>> because without __free_pages_core(), your added pages are not added
>> into buddy for managing. I think we should make clear this problem
>> firstly, in order not to introduce new problem by improper work around,
>> then check next.
>> 
>> Thanks
>> Baoquan

Got it, I restarted off fresh from kernel-5.6-rc1,
applied patch
>> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
and got the following panic.

The diagnostic printk's for add_memory() et al. are not there, but I guess
the memory hot-add request from the hypervisor is returning "success",
corrupting something else and bombing out later.


[   24.289967] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
[  302.263730] hv_balloon: Max. dynamic memory size: 1048576 MB
[  635.216014] BUG: unable to handle page fault for address: d13ff000
[  635.216058] #PF: supervisor write access in kernel mode
[  635.216076] #PF: error_code(0x0002) - not-present page
[  635.216106] *pde = 00000000
[  635.216139] Oops: 0002 [#1] SMP
[  635.216171] CPU: 0 PID: 470 Comm: systemd-udevd Not tainted 5.6.0-rc1.el8.i586 #1
[  635.216199] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  05/23/2012
[  635.216233] EIP: wp_page_copy+0x8e/0x750
[  635.216253] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
[  635.216293] EAX: d13ff000 EBX: c3743f28 ECX: 00000000 EDX: c10c9000
[  635.216314] ESI: c10c9000 EDI: d13ff004 EBP: c3743eec ESP: c3743ea8
[  635.216336] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
[  635.216368] CR0: 80050033 CR2: d13ff000 CR3: 03add000 CR4: 003406d0
[  635.216389] Call Trace:
[  635.216407]  ? reuse_swap_page+0x83/0x390
[  635.216425]  do_wp_page+0x87/0x6e0
[  635.216438]  ? __do_sys_fstat64+0x4a/0x60
[  635.216453]  handle_mm_fault+0x808/0xe30
[  635.216468]  do_page_fault+0x19f/0x4d0
[  635.216484]  ? do_kern_addr_fault+0x80/0x80
[  635.216500]  common_exception_read_cr2+0x15a/0x15f
[  635.216521] EIP: 0xb7b28104
[  635.216538] Code: 29 f9 89 4c 24 10 83 f9 0f 0f 86 92 00 00 00 8b 45 40 8d 14 3e 8b 4c 24 0c 39 48 0c 75 74 8b 4c 24 0c 81 7c 24 10 ef 03 00 00 <89> 42 08 89 4a 0c 89 55 40 89 50 0c 76 0e c7 42 10 00 00 00 00 c7
[  635.216591] EAX: b7c4e7d8 EBX: 000011a0 ECX: b7c4e7d8 EDX: 01994178
[  635.216606] ESI: 01993168 EDI: 00001010 EBP: b7c4e7a0 ESP: bfcc9f00
[  635.216628] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00210293
[  635.216661] Modules linked in: rfkill intel_rapl_msr intel_rapl_common snd_pcm snd_timer snd soundcore crc32_pclmul intel_rapl_perf sg pcspkr hv_netvsc joydev i2c_piix4 hyperv_fb hv_utils hv_balloon ip_tables ext4 mbcache jbd2 sd_mod t10_pi sr_mod cdrom ata_generic hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw hv_vmbus libata
[  635.216758] CR2: 00000000d13ff000
[  635.216769] ---[ end trace dee4a93859538102 ]---
[  635.216785] EIP: wp_page_copy+0x8e/0x750
[  635.216811] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
[  635.216847] EAX: d13ff000 EBX: c3743f28 ECX: 00000000 EDX: c10c9000
[  635.216864] ESI: c10c9000 EDI: d13ff004 EBP: c3743eec ESP: c3743ea8
[  635.216883] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
[  635.216899] CR0: 80050033 CR2: d13ff000 CR3: 03add000 CR4: 003406d0
[  635.216914] Kernel panic - not syncing: Fatal exception
[  635.216926] Kernel Offset: 0x1400000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff)
[  635.216946] ---[ end Kernel panic - not syncing: Fatal exception ]---
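
For readers decoding the "#PF: error_code(0x0002)" line in the trace above, the following Python sketch expands the low bits of an x86 page-fault error code. The bit meanings are the architectural ones; the function name and table layout are just illustrative choices, not kernel code.

```python
# Low bits of the x86 page-fault error code.
# Each entry: (mask, meaning if set, meaning if clear or None).
PF_BITS = [
    (0x01, "protection violation", "not-present page"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "supervisor mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error(code):
    """Return human-readable descriptions for a #PF error code."""
    out = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            out.append(if_set)
        elif if_clear is not None:
            out.append(if_clear)
    return out


print(decode_pf_error(0x0002))
```

For 0x0002 this yields a supervisor-mode write to a not-present page, matching the two "#PF:" lines printed by the kernel.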
Comment 16 Baoquan He 2020-02-13 08:20:01 UTC
On 02/13/20 at 01:22pm, kabe@vega.pgw.jp wrote:
> bhe@redhat.com sed in <20200212073123.GG8965@MiWiFi-R3L-srv>
> 
> >> On 02/11/20 at 04:41pm, Andrew Morton wrote:
> >> > On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang
> <richardw.yang@linux.intel.com> wrote:
> >> > 
> >> > > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
> >> > > >On 02/10/20 at 02:09pm, Baoquan He wrote:
> >> > > >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
> >> > > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com>
> wrote:
> >> > > >> > 
> >> > > >> > > Hi Andrew,
> >> > > >> > > 
> >> > > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
> >> > > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000
> bugzilla-daemon@bugzilla.kernel.org wrote:
> >> > > >> > > > 
> >> > > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> >> > > >> > > > > 
> >> > > >> > > > 
> >> > > >> > > > An oops during mem hotadd.  Could someone please take a look
> when
> >> > > >> > > > convenient?
> >> > > >> > > 
> >> > > >> > > This has been addressed by Wei Yang's patch, please check it
> here:
> >> > > >> > > 
> >> > > >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> >> > > >> > > 
> >> > > >> > 
> >> > > >> > hm, OK, thanks.  It's unfortunate that a 5.5 fix is buried in a
> >> > > >> > six-patch series which is still in progress!  Can we please merge
> that
> >> > > >> > as a standalone fix with a cc:stable, Fixes:, etc?
> >> > > >
> >> > > >Maybe can add Fixes tag as follow when merge:
> >> > > >
> >> > > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> >> > > >
> >> > 
> >> > The reporter (cc'ed here) is still seeing issues:
> >> > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> >> > 
> >> > Could we please continue this investigation via emailed reply-to-all,
> >> > rather than via the bugzilla interface?
> >> 
> >> Yes, people prefer mailing list to discuss issues.
> >> 
> >> Hi T.Kabe, 
> >> 
> >> Could you provide the call trace again after below patch is applied?
> >> The comment #9 in bugzilla is not very clear to me.
> >> 
> >> mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
> >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> >> 
> >> And, as you said, applying above patch, and do not call
> >> __free_pages_core() in generic_online_page() will work. I doubt it,
> >> because without __free_pages_core(), your added pages are not added
> >> into buddy for managing. I think we should make clear this problem
> >> firstly, in order not to introduce new problem by improper work around,
> >> then check next.
> >> 
> >> Thanks
> >> Baoquan
> 
> Got it, I restarted off fresh from kernel-5.6-rc1,
> applied patch
> >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> and got the following panic.
> 
> Diag printk's for add_memory() et al is not there, but I guess
> memory hot-add request from hypervisor is returning "success", 
> corrupting something else and bombing out later.
> 
> 
> [   24.289967] Not activating Mandatory Access Control as /sbin/tomoyo-init
> does not exist.
> [  302.263730] hv_balloon: Max. dynamic memory size: 1048576 MB
> [  635.216014] BUG: unable to handle page fault for address: d13ff000
> [  635.216058] #PF: supervisor write access in kernel mode
> [  635.216076] #PF: error_code(0x0002) - not-present page
> [  635.216106] *pde = 00000000

Thanks for the info. What ARCH is your system?  Could you attach your
kernel config and paste the output of executing 'readelf /proc/kcore'?

The pmd entry is not filled. I want to check which address range the kernel
is accessing, so please also attach the dmesg log. It's probably the hot-added
page area, I guess, since this time the preceding trace is different
from the one in comment #9.
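
As a rough way to see which range the faulting address d13ff000 falls in, the arithmetic can be sketched in Python. This assumes a non-PAE i386 kernel with the default 3G/1G split (PAGE_OFFSET = 0xc0000000), 2-level paging, and that the address lies in the lowmem direct mapping; none of this is confirmed by the trace itself, and the function name is hypothetical.

```python
PAGE_SHIFT = 12           # 4 KiB pages
PGDIR_SHIFT = 22          # non-PAE 2-level paging: 10 + 10 + 12 bits
PAGE_OFFSET = 0xC0000000  # assumption: default 3G/1G split


def decode_lowmem_address(vaddr):
    """Split a direct-mapped kernel virtual address into paging components."""
    phys = vaddr - PAGE_OFFSET
    return {
        "phys": phys,
        "pfn": phys >> PAGE_SHIFT,
        "pde_index": vaddr >> PGDIR_SHIFT,
        "pte_index": (vaddr >> PAGE_SHIFT) & 0x3FF,
        "phys_mib": phys >> 20,  # whole MiB below the address
    }


info = decode_lowmem_address(0xD13FF000)
print({k: hex(v) for k, v in info.items()})
```

Under these assumptions the address maps to physical 0x113ff000, a little under 276 MiB, which is above the 160MB of initial memory mentioned in the report and so would be consistent with the guess that it lies in the hot-added area.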

> [  635.216139] Oops: 0002 [#1] SMP
> [  635.216171] CPU: 0 PID: 470 Comm: systemd-udevd Not tainted
> 5.6.0-rc1.el8.i586 #1
> [  635.216199] Hardware name: Microsoft Corporation Virtual Machine/Virtual
> Machine, BIOS 090006  05/23/2012
> [  635.216233] EIP: wp_page_copy+0x8e/0x750
> [  635.216253] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff
> 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08
> 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
> [  635.216293] EAX: d13ff000 EBX: c3743f28 ECX: 00000000 EDX: c10c9000
> [  635.216314] ESI: c10c9000 EDI: d13ff004 EBP: c3743eec ESP: c3743ea8
> [  635.216336] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
> [  635.216368] CR0: 80050033 CR2: d13ff000 CR3: 03add000 CR4: 003406d0
> [  635.216389] Call Trace:
> [  635.216407]  ? reuse_swap_page+0x83/0x390
> [  635.216425]  do_wp_page+0x87/0x6e0
> [  635.216438]  ? __do_sys_fstat64+0x4a/0x60
> [  635.216453]  handle_mm_fault+0x808/0xe30
> [  635.216468]  do_page_fault+0x19f/0x4d0
> [  635.216484]  ? do_kern_addr_fault+0x80/0x80
> [  635.216500]  common_exception_read_cr2+0x15a/0x15f
> [  635.216521] EIP: 0xb7b28104
> [  635.216538] Code: 29 f9 89 4c 24 10 83 f9 0f 0f 86 92 00 00 00 8b 45 40 8d
> 14 3e 8b 4c 24 0c 39 48 0c 75 74 8b 4c 24 0c 81 7c 24 10 ef 03 00 00 <89> 42
> 08 89 4a 0c 89 55 40 89 50 0c 76 0e c7 42 10 00 00 00 00 c7
> [  635.216591] EAX: b7c4e7d8 EBX: 000011a0 ECX: b7c4e7d8 EDX: 01994178
> [  635.216606] ESI: 01993168 EDI: 00001010 EBP: b7c4e7a0 ESP: bfcc9f00
> [  635.216628] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00210293
> [  635.216661] Modules linked in: rfkill intel_rapl_msr intel_rapl_common
> snd_pcm snd_timer snd soundcore crc32_pclmul intel_rapl_perf sg pcspkr
> hv_netvsc joydev i2c_piix4 hyperv_fb hv_utils hv_balloon ip_tables ext4
> mbcache jbd2 sd_mod t10_pi sr_mod cdrom ata_generic hyperv_keyboard
> hid_hyperv hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw
> hv_vmbus libata
> [  635.216758] CR2: 00000000d13ff000
> [  635.216769] ---[ end trace dee4a93859538102 ]---
> [  635.216785] EIP: wp_page_copy+0x8e/0x750
> [  635.216811] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff
> 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08
> 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
> [  635.216847] EAX: d13ff000 EBX: c3743f28 ECX: 00000000 EDX: c10c9000
> [  635.216864] ESI: c10c9000 EDI: d13ff004 EBP: c3743eec ESP: c3743ea8
> [  635.216883] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
> [  635.216899] CR0: 80050033 CR2: d13ff000 CR3: 03add000 CR4: 003406d0
> [  635.216914] Kernel panic - not syncing: Fatal exception
> [  635.216926] Kernel Offset: 0x1400000 from 0xc1000000 (relocation range:
> 0xc0000000-0xcafeffff)
> [  635.216946] ---[ end Kernel panic - not syncing: Fatal exception ]---
> 
> -- 
> kabe
>
Comment 17 Taketo Kabe 2020-02-14 14:30:04 UTC
Created attachment 287371 [details]
dmesg.txt

bhe@redhat.com sed in <20200213081941.GA19207@MiWiFi-R3L-srv>

>> On 02/13/20 at 01:22pm, kabe@vega.pgw.jp wrote:
>> > bhe@redhat.com sed in <20200212073123.GG8965@MiWiFi-R3L-srv>
>> > 
>> > >> On 02/11/20 at 04:41pm, Andrew Morton wrote:
>> > >> > On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang
>> <richardw.yang@linux.intel.com> wrote:
>> > >> > 
>> > >> > > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
>> > >> > > >On 02/10/20 at 02:09pm, Baoquan He wrote:
>> > >> > > >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
>> > >> > > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com>
>> wrote:
>> > >> > > >> > 
>> > >> > > >> > > Hi Andrew,
>> > >> > > >> > > 
>> > >> > > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
>> > >> > > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000
>> bugzilla-daemon@bugzilla.kernel.org wrote:
>> > >> > > >> > > > 
>> > >> > > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
>> > >> > > >> > > > > 
>> > >> > > >> > > > 
>> > >> > > >> > > > An oops during mem hotadd.  Could someone please take a
>> look when
>> > >> > > >> > > > convenient?
>> > >> > > >> > > 
>> > >> > > >> > > This has been addressed by Wei Yang's patch, please check it
>> here:
>> > >> > > >> > > 
>> > >> > > >> > >
>> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
>> > >> > > >> > > 
>> > >> > > >> > 
>> > >> > > >> > hm, OK, thanks.  It's unfortunate that a 5.5 fix is buried in
>> a
>> > >> > > >> > six-patch series which is still in progress!  Can we please
>> merge that
>> > >> > > >> > as a standalone fix with a cc:stable, Fixes:, etc?
>> > >> > > >
>> > >> > > >Maybe can add Fixes tag as follow when merge:
>> > >> > > >
>> > >> > > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>> > >> > > >
>> > >> > 
>> > >> > The reporter (cc'ed here) is still seeing issues:
>> > >> > https://bugzilla.kernel.org/show_bug.cgi?id=206401
>> > >> > 
>> > >> > Could we please continue this investigation via emailed reply-to-all,
>> > >> > rather than via the bugzilla interface?
>> > >> 
>> > >> Yes, people prefer mailing list to discuss issues.
>> > >> 
>> > >> Hi T.Kabe, 
>> > >> 
>> > >> Could you provide the call trace again after below patch is applied?
>> > >> The comment #9 in bugzilla is not very clear to me.
>> > >> 
>> > >> mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
>> > >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
>> > >> 
>> > >> And, as you said, applying above patch, and do not call
>> > >> __free_pages_core() in generic_online_page() will work. I doubt it,
>> > >> because without __free_pages_core(), your added pages are not added
>> > >> into buddy for managing. I think we should make clear this problem
>> > >> firstly, in order not to introduce new problem by improper work around,
>> > >> then check next.
>> > >> 
>> > >> Thanks
>> > >> Baoquan
>> > 
>> > Got it, I restarted off fresh from kernel-5.6-rc1,
>> > applied patch
>> > >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
>> > and got the following panic.
>> > 
>> > Diag printk's for add_memory() et al is not there, but I guess
>> > memory hot-add request from hypervisor is returning "success", 
>> > corrupting something else and bombing out later.
>> > 
>> > 
>> > [   24.289967] Not activating Mandatory Access Control as
>> /sbin/tomoyo-init does not exist.
>> > [  302.263730] hv_balloon: Max. dynamic memory size: 1048576 MB
>> > [  635.216014] BUG: unable to handle page fault for address: d13ff000
>> > [  635.216058] #PF: supervisor write access in kernel mode
>> > [  635.216076] #PF: error_code(0x0002) - not-present page
>> > [  635.216106] *pde = 00000000
>> 
>> Thanks for the info. What ARCH is your system?  Could you attach your
>> kernel config and paste the output of executing 'readelf /proc/kcore'?

Arch is i386(i586), non-PAE.

I'll attach the "readelf -a /proc/kcore" output, dmesg and .config .
The stack trace is different this time too;
it seems to produce a slightly different panic trace every time
after handle_mm_fault().

I've temporarily added pr_info() before and after add_memory() in hv_balloon.ko,
so it says it's tainting the kernel.
add_memory() itself is returning 0 (success).


>> The pmd entry is not filled, I want to check which address range the kernel
>> is accessing, and please attach the log of dmesg. Probably it's hot added
>> page area, I guess, since this time the preceding trace is different
>> with comment #9.
>> 
>> > [  635.216139] Oops: 0002 [#1] SMP
>> > [  635.216171] CPU: 0 PID: 470 Comm: systemd-udevd Not tainted
>> 5.6.0-rc1.el8.i586 #1
>> > [  635.216199] Hardware name: Microsoft Corporation Virtual
>> Machine/Virtual Machine, BIOS 090006  05/23/2012
>> > [  635.216233] EIP: wp_page_copy+0x8e/0x750
>> > [  635.216253] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5
>> ff 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89>
>> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
>> > [  635.216293] EAX: d13ff000 EBX: c3743f28 ECX: 00000000 EDX: c10c9000
>> > [  635.216314] ESI: c10c9000 EDI: d13ff004 EBP: c3743eec ESP: c3743ea8
>> > [  635.216336] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS:
>> 00210282
>> > [  635.216368] CR0: 80050033 CR2: d13ff000 CR3: 03add000 CR4: 003406d0
>> > [  635.216389] Call Trace:
>> > [  635.216407]  ? reuse_swap_page+0x83/0x390
>> > [  635.216425]  do_wp_page+0x87/0x6e0
>> > [  635.216438]  ? __do_sys_fstat64+0x4a/0x60
>> > [  635.216453]  handle_mm_fault+0x808/0xe30
>> > [  635.216468]  do_page_fault+0x19f/0x4d0
>> > [  635.216484]  ? do_kern_addr_fault+0x80/0x80
>> > [  635.216500]  common_exception_read_cr2+0x15a/0x15f
>> > [  635.216521] EIP: 0xb7b28104
[redacted]
Comment 18 Taketo Kabe 2020-02-14 14:30:05 UTC
Created attachment 287373 [details]
panic.txt
Comment 19 Taketo Kabe 2020-02-14 14:30:05 UTC
Created attachment 287375 [details]
kernel-i586.config
Comment 20 Taketo Kabe 2020-02-14 14:30:05 UTC
Created attachment 287377 [details]
readelf-kcore.txt
Comment 21 Baoquan He 2020-02-14 14:49:19 UTC
On 02/14/20 at 11:26pm, kkabe@vega.pgw.jp wrote:
> bhe@redhat.com sed in <20200213081941.GA19207@MiWiFi-R3L-srv>
> 
> >> On 02/13/20 at 01:22pm, kabe@vega.pgw.jp wrote:
> >> > bhe@redhat.com sed in <20200212073123.GG8965@MiWiFi-R3L-srv>
> >> > 
> >> > >> On 02/11/20 at 04:41pm, Andrew Morton wrote:
> >> > >> > On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang
> <richardw.yang@linux.intel.com> wrote:
> >> > >> > 
> >> > >> > > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
> >> > >> > > >On 02/10/20 at 02:09pm, Baoquan He wrote:
> >> > >> > > >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
> >> > >> > > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He
> <bhe@redhat.com> wrote:
> >> > >> > > >> > 
> >> > >> > > >> > > Hi Andrew,
> >> > >> > > >> > > 
> >> > >> > > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
> >> > >> > > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000
> bugzilla-daemon@bugzilla.kernel.org wrote:
> >> > >> > > >> > > > 
> >> > >> > > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> >> > >> > > >> > > > > 
> >> > >> > > >> > > > 
> >> > >> > > >> > > > An oops during mem hotadd.  Could someone please take a
> look when
> >> > >> > > >> > > > convenient?
> >> > >> > > >> > > 
> >> > >> > > >> > > This has been addressed by Wei Yang's patch, please check
> it here:
> >> > >> > > >> > > 
> >> > >> > > >> > >
> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> >> > >> > > >> > > 
> >> > >> > > >> > 
> >> > >> > > >> > hm, OK, thanks.  It's unfortunate that a 5.5 fix is buried
> in a
> >> > >> > > >> > six-patch series which is still in progress!  Can we please
> merge that
> >> > >> > > >> > as a standalone fix with a cc:stable, Fixes:, etc?
> >> > >> > > >
> >> > >> > > >Maybe can add Fixes tag as follow when merge:
> >> > >> > > >
> >> > >> > > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section
> hotplug")
> >> > >> > > >
> >> > >> > 
> >> > >> > The reporter (cc'ed here) is still seeing issues:
> >> > >> > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> >> > >> > 
> >> > >> > Could we please continue this investigation via emailed
> reply-to-all,
> >> > >> > rather than via the bugzilla interface?
> >> > >> 
> >> > >> Yes, people prefer mailing list to discuss issues.
> >> > >> 
> >> > >> Hi T.Kabe, 
> >> > >> 
> >> > >> Could you provide the call trace again after below patch is applied?
> >> > >> The comment #9 in bugzilla is not very clear to me.
> >> > >> 
> >> > >> mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
> >> > >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> >> > >> 
> >> > >> And, as you said, applying above patch, and do not call
> >> > >> __free_pages_core() in generic_online_page() will work. I doubt it,
> >> > >> because without __free_pages_core(), your added pages are not added
> >> > >> into buddy for managing. I think we should make clear this problem
> >> > >> firstly, in order not to introduce new problem by improper work
> around,
> >> > >> then check next.
> >> > >> 
> >> > >> Thanks
> >> > >> Baoquan
> >> > 
> >> > Got it, I restarted off fresh from kernel-5.6-rc1,
> >> > applied patch
> >> > >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> >> > and got the following panic.
> >> > 
> >> > Diag printk's for add_memory() et al is not there, but I guess
> >> > memory hot-add request from hypervisor is returning "success", 
> >> > corrupting something else and bombing out later.
> >> > 
> >> > 
> >> > [   24.289967] Not activating Mandatory Access Control as
> /sbin/tomoyo-init does not exist.
> >> > [  302.263730] hv_balloon: Max. dynamic memory size: 1048576 MB
> >> > [  635.216014] BUG: unable to handle page fault for address: d13ff000
> >> > [  635.216058] #PF: supervisor write access in kernel mode
> >> > [  635.216076] #PF: error_code(0x0002) - not-present page
> >> > [  635.216106] *pde = 00000000
> >> 
> >> Thanks for the info. What ARCH is your system?  Could you attach your
> >> kernel config and paste the output of executing 'readelf /proc/kcore'?
> 
> Arch is i386(i586), non-PAE.
> 
> I'll attach the "readelf -a /proc/kcore", dmesg and .config .
> The stack trace is different this time also;
> it seems to have slightly difference panic trace every time 
> after handle_mm_fault().

Sorry, I wasn't clear. The output of 'readelf -l /proc/kcore' is enough,
together with the relevant call trace.

> 
> I've temporary added pr_info() before and after add_memory() in hv_baloon.ko,
> so it says it's taining the kernel.
> add_memory() itself is returning 0 (success).
> 
>
Comment 22 Baoquan He 2020-02-14 15:01:31 UTC
On 02/14/20 at 10:48pm, Baoquan He wrote:
> On 02/14/20 at 11:26pm, kkabe@vega.pgw.jp wrote:
> > bhe@redhat.com sed in <20200213081941.GA19207@MiWiFi-R3L-srv>
> > 
> > >> Thanks for the info. What ARCH is your system?  Could you attach your
> > >> kernel config and paste the output of executing 'readelf /proc/kcore'?
> > 
> > Arch is i386(i586), non-PAE.
> > 
> > I'll attach the "readelf -a /proc/kcore", dmesg and .config .
> > The stack trace is different this time also;
> > it seems to have slightly difference panic trace every time 
> > after handle_mm_fault().
> 
> Sorry, I didn't say it clearly. 'readelf -l /proc/kcore' is OK, and the
> relevant call trace.

No need to provide them; I can find them in the 'readelf -a' output. I will
check and see if I can find anything. Thanks for the info.

> 
> > 
> > I've temporary added pr_info() before and after add_memory() in
> hv_baloon.ko,
> > so it says it's taining the kernel.
> > add_memory() itself is returning 0 (success).
> > 
> > 
> 
>
Comment 23 Baoquan He 2020-02-17 04:49:07 UTC
On 02/14/20 at 11:26pm, kkabe@vega.pgw.jp wrote:
> bhe@redhat.com sed in <20200213081941.GA19207@MiWiFi-R3L-srv>
> 
> >> 
> >> Thanks for the info. What ARCH is your system?  Could you attach your
> >> kernel config and paste the output of executing 'readelf /proc/kcore'?
> 
> Arch is i386(i586), non-PAE.

Sorry, I went through the code roughly and didn't find a clue. Not sure if
David has an idea about it.

By the way, may I know why you would like to run an i386 guest on Hyper-V?

I found people discussing 32-bit kernel support upstream;
below is Linus's point of view.
https://lore.kernel.org/linux-fsdevel/CAHk-=wiGbz3oRvAVFtN-whW-d2F-STKsP1MZT4m_VeycAr1_VQ@mail.gmail.com/

> 
> I'll attach the "readelf -a /proc/kcore", dmesg and .config .
> The stack trace is different this time also;
> it seems to have slightly difference panic trace every time 
> after handle_mm_fault().
> 
> I've temporary added pr_info() before and after add_memory() in hv_baloon.ko,
> so it says it's taining the kernel.
> add_memory() itself is returning 0 (success).
> 
> 
> >> The pmd entry is not filled, I want to check which address range the
> kernel
> >> is acessing, and please attach the log of dmesg. Probably it's hot added
> >> page area, I guess, since this time the preceding trace is different
> >> with comment #9.
> >> 
> >> > [  635.216139] Oops: 0002 [#1] SMP
> >> > [  635.216171] CPU: 0 PID: 470 Comm: systemd-udevd Not tainted
> 5.6.0-rc1.el8.i586 #1
> >> > [  635.216199] Hardware name: Microsoft Corporation Virtual
> Machine/Virtual Machine, BIOS 090006  05/23/2012
> >> > [  635.216233] EIP: wp_page_copy+0x8e/0x750
> >> > [  635.216253] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85
> e5 ff 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6
> <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
> >> > [  635.216293] EAX: d13ff000 EBX: c3743f28 ECX: 00000000 EDX: c10c9000
> >> > [  635.216314] ESI: c10c9000 EDI: d13ff004 EBP: c3743eec ESP: c3743ea8
> >> > [  635.216336] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS:
> 00210282
> >> > [  635.216368] CR0: 80050033 CR2: d13ff000 CR3: 03add000 CR4: 003406d0
> >> > [  635.216389] Call Trace:
> >> > [  635.216407]  ? reuse_swap_page+0x83/0x390
> >> > [  635.216425]  do_wp_page+0x87/0x6e0
> >> > [  635.216438]  ? __do_sys_fstat64+0x4a/0x60
> >> > [  635.216453]  handle_mm_fault+0x808/0xe30
> >> > [  635.216468]  do_page_fault+0x19f/0x4d0
> >> > [  635.216484]  ? do_kern_addr_fault+0x80/0x80
> >> > [  635.216500]  common_exception_read_cr2+0x15a/0x15f
> >> > [  635.216521] EIP: 0xb7b28104
Comment 24 Taketo Kabe 2020-02-17 05:31:27 UTC
bhe@redhat.com sed in <20200217044850.GD4816@MiWiFi-R3L-srv>

>> Sorry, I roughly went through code, didn't get clue. Not sure if David
>> have idea about it.
>> 
>> By the way, may I know why you would like to run i386 guest on Hyper-V?
>> 
>> Found people are talking about the 32bit kernel supporting in upstream,
>> below is Linus's point of view.
>>
>> https://lore.kernel.org/linux-fsdevel/CAHk-=wiGbz3oRvAVFtN-whW-d2F-STKsP1MZT4m_VeycAr1_VQ@mail.gmail.com/
>> 

<offtopic>
Using Hyper-V for testing out a new kernel is convenient and faster
before testing it out on a real i386 machine.
And I do want bugs squashed; meanwhile "hv_balloon.hot_add=0" will be 
a workaround.

I agree HIGHMEM64G (PAE) is going to be a deprecated feature, but I do miss
HIGHMEM4G (needed for 1GB memory support).
</offtopic>
Comment 25 Taketo Kabe 2020-02-17 05:46:31 UTC
bhe@redhat.com sed in <20200212073123.GG8965@MiWiFi-R3L-srv>

>> On 02/11/20 at 04:41pm, Andrew Morton wrote:
>> > On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang
>> <richardw.yang@linux.intel.com> wrote:
>> > 
>> > > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
>> > > >On 02/10/20 at 02:09pm, Baoquan He wrote:
>> > > >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
>> > > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com>
>> wrote:
>> > > >> > 
>> > > >> > > Hi Andrew,
>> > > >> > > 
>> > > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
>> > > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000
>> bugzilla-daemon@bugzilla.kernel.org wrote:
>> > > >> > > > 
>> > > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
>> > > >> > > > > 
>> > > >> > > > 
>> > > >> > > > An oops during mem hotadd.  Could someone please take a look
>> when
>> > > >> > > > convenient?
>> > > >> > > 
>> > > >> > > This has been addressed by Wei Yang's patch, please check it
>> here:
>> > > >> > > 
>> > > >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
>> > > >> > > 
>> > > >> > 
>> > > >> > hm, OK, thanks.  It's unfortunate that a 5.5 fix is buried in a
>> > > >> > six-patch series which is still in progress!  Can we please merge
>> that
>> > > >> > as a standalone fix with a cc:stable, Fixes:, etc?
>> > > >
>> > > >Maybe can add Fixes tag as follow when merge:
>> > > >
>> > > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>> > > >
>> > 
>> > The reporter (cc'ed here) is still seeing issues:
>> > https://bugzilla.kernel.org/show_bug.cgi?id=206401
>> > 
>> > Could we please continue this investigation via emailed reply-to-all,
>> > rather than via the bugzilla interface?
>> 
>> Yes, people prefer mailing list to discuss issues.


I found perplexing behavior in populate_section_memmap().

populate_section_memmap() calls alloc_pages(), and if that fails,
falls back to vmalloc().

But according to the trace, populate_section_memmap() seems to
discard the alloc_pages() result and always fall back to vmalloc(),
which could be the wrong area to use.

I sprinkled pr_info() in mm/sparse.c:populate_section_memmap() as below:

===========================================
struct page * __meminit populate_section_memmap(unsigned long pfn,
                unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
{
        struct page *page, *ret;
        unsigned long memmap_size = sizeof(struct page) * PAGES_PER_SECTION;

        page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
        if (page) {
                goto got_map_page;
        }
pr_info("%s: alloc_pages() returned 0x%p (should be 0), reverting to vmalloc(memmap_size=%lu)\n", __func__, page, memmap_size);
BUG_ON(page != 0);

        ret = vmalloc(memmap_size);
pr_info("%s: vmalloc(%lu) returned 0x%p\n", __func__, memmap_size, ret);
        if (ret) {
                goto got_map_ptr;
        }

        return NULL;
got_map_page:
        ret = (struct page *)pfn_to_kaddr(page_to_pfn(page));
pr_info("%s: allocated struct page *page=0x%p\n", __func__, page);
got_map_ptr:

pr_info("%s: returning struct page * =0x%p\n", __func__, ret);
        return ret;
}
==================================================

and got the following panic.
It even ignores the BUG_ON() (perhaps it was optimized out).

Is this worth investigating?
The disassembly doesn't reveal anything suspicious, but I have a feeling that
the disassembly I'm looking at differs from what the CPU is actually executing.
The code is too trivial for this to be a compiler bug.


==================================================
[root@localhost ~]# readelf -l /proc/kcore

Elf file type is CORE (Core file)
Entry point 0x0
There are 3 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  NOTE           0x000094 0x00000000 0x00000000 0x01304 0x00000     0
  LOAD           0xaff2000 0xcaff0000 0xffffffff 0x3400e000 0x3400e000 RWE 0x1000
  LOAD           0x002000 0xc0000000 0x00000000 0xa7f0000 0xa7f0000 RWE 0x1000


[  302.784196] hv_balloon: Max. dynamic memory size: 1048576 MB
[  643.475080] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x10000) << PAGE_SHIFT)=0x10000000, (HA_CHUNK << PAGE_SHIFT)=134217728)
[  643.513804] populate_section_memmap: alloc_pages() returned 0xb1a7c4b2 (should be 0), reverting to vmalloc(memmap_size=655360)
[  643.513849] populate_section_memmap: vmalloc(655360) returned 0x11b0e715
[  643.513872] populate_section_memmap: returning struct page * =0x11b0e715
[  643.525352] populate_section_memmap: alloc_pages() returned 0xb1a7c4b2 (should be 0), reverting to vmalloc(memmap_size=655360)
[  643.536698] populate_section_memmap: vmalloc(655360) returned 0xf2ba6510
[  643.536722] populate_section_memmap: returning struct page * =0xf2ba6510
[  643.536749] hv_balloon: hv_mem_hot_add: add_memory() returned 0
[  645.394458] BUG: unable to handle page fault for address: d13ff000
[  645.394518] #PF: supervisor write access in kernel mode
[  645.394565] #PF: error_code(0x0002) - not-present page
[  645.394584] *pde = 00000000
[  645.394601] Oops: 0002 [#1] SMP
[  645.394614] CPU: 0 PID: 361 Comm: systemd-udevd Not tainted 5.6.0-rc1.el8.i586 #1
[  645.394636] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  05/23/2012
[  645.394670] EIP: wp_page_copy+0x8e/0x750
[  645.394690] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
[  645.394739] EAX: d13ff000 EBX: c752df28 ECX: 00000000 EDX: c5e0d000
[  645.394767] ESI: c5e0d000 EDI: d13ff004 EBP: c752deec ESP: c752dea8
[  645.394790] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
[  645.394815] CR0: 80050033 CR2: d13ff000 CR3: 08e5a000 CR4: 003406d0
[  645.394840] Call Trace:
[  645.394852]  ? reuse_swap_page+0x83/0x390
[  645.394873]  do_wp_page+0x87/0x6e0
[  645.394885]  handle_mm_fault+0x808/0xe30
[  645.394893]  do_page_fault+0x19f/0x4d0
[  645.394901]  ? do_kern_addr_fault+0x80/0x80
[  645.394915]  common_exception_read_cr2+0x15a/0x15f
[  645.394930] EIP: 0xb7aaf8bb
[  645.394944] Code: 24 0c e3 2c 89 d7 83 e2 03 74 11 7a 04 aa 49 74 1f aa 49 74 1b 83 f2 01 75 02 aa 49 89 ca c1 e9 02 83 e2 03 69 c0 01 01 01 01 <f3> ab 89 d1 f3 aa 8b 44 24 08 5f c3 66 90 66 90 66 90 66 90 90 f3
[  645.394973] EAX: 00000000 EBX: b7f05f60 ECX: 0000000d EDX: 00000000
[  645.394988] ESI: 02194db4 EDI: 02194db4 EBP: b7f05db4 ESP: bffed978
[  645.395003] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00210206
[  645.395018] Modules linked in: rfkill intel_rapl_msr intel_rapl_common crc32_pclmul snd_pcm snd_timer snd soundcore intel_rapl_perf sg pcspkr hv_netvsc i2c_piix4 hyperv_fb hv_utils hv_balloon joydev ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod t10_pi ata_generic hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw hv_vmbus libata
[  645.395101] CR2: 00000000d13ff000
[  645.395121] ---[ end trace 3bb1d66cb8b20841 ]---
[  645.395144] EIP: wp_page_copy+0x8e/0x750
[  645.395157] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
[  645.395206] EAX: d13ff000 EBX: c752df28 ECX: 00000000 EDX: c5e0d000
[  645.395235] ESI: c5e0d000 EDI: d13ff004 EBP: c752deec ESP: c752dea8
[  645.395261] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
[  645.395278] CR0: 80050033 CR2: d13ff000 CR3: 08e5a000 CR4: 003406d0
[  645.395308] Kernel panic - not syncing: Fatal exception
[  645.395329] Kernel Offset: 0x3e00000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff)
[  645.395354] ---[ end Kernel panic - not syncing: Fatal exception ]---
==================================================
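
The sizes in the log above can be cross-checked with a little arithmetic. The sketch below is a userspace illustration, not kernel code. It assumes 4 KiB pages, 64 MiB SPARSEMEM sections (the usual non-PAE x86_32 value), and a 40-byte struct page; the last figure is only inferred from 655360 / 16384 and is not read from the attached .config. The helper names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for a non-PAE i386 build (not taken from the .config). */
#define PAGE_SHIFT        12u  /* 4 KiB pages */
#define SECTION_SIZE_BITS 26u  /* 64 MiB sections (assumption) */
#define STRUCT_PAGE_SIZE  40u  /* inferred from 655360 / 16384 (assumption) */

/* Number of pages covered by one memory section. */
static uint32_t pages_per_section(void)
{
    return 1u << (SECTION_SIZE_BITS - PAGE_SHIFT);
}

/* Bytes of struct page array needed to describe one section. */
static uint32_t memmap_size_bytes(void)
{
    return STRUCT_PAGE_SIZE * pages_per_section();
}

/* Number of sections spanned by a hot-add request, rounded up. */
static uint32_t sections_in(uint32_t bytes)
{
    uint32_t section_bytes = 1u << SECTION_SIZE_BITS;

    assert(bytes > 0);
    return (bytes + section_bytes - 1u) / section_bytes;
}
```

Under these assumptions memmap_size_bytes() gives 655360, the value printed as memmap_size, and sections_in(134217728) gives 2, which would match the two populate_section_memmap() calls logged for the single 128 MiB add_memory() request.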
Comment 26 Baoquan He 2020-02-17 07:45:20 UTC
On 02/17/20 at 02:46pm, kkabe@vega.pgw.jp wrote:
> bhe@redhat.com sed in <20200212073123.GG8965@MiWiFi-R3L-srv>
> 
> >> On 02/11/20 at 04:41pm, Andrew Morton wrote:
> >> > On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang
> <richardw.yang@linux.intel.com> wrote:
> >> > 
> >> > > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
> >> > > >On 02/10/20 at 02:09pm, Baoquan He wrote:
> >> > > >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
> >> > > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com>
> wrote:
> >> > > >> > 
> >> > > >> > > Hi Andrew,
> >> > > >> > > 
> >> > > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
> >> > > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000
> bugzilla-daemon@bugzilla.kernel.org wrote:
> >> > > >> > > > 
> >> > > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> >> > > >> > > > > 
> >> > > >> > > > 
> >> > > >> > > > An oops during mem hotadd.  Could someone please take a look
> when
> >> > > >> > > > convenient?
> >> > > >> > > 
> >> > > >> > > This has been addressed by Wei Yang's patch, please check it
> here:
> >> > > >> > > 
> >> > > >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> >> > > >> > > 
> >> > > >> > 
> >> > > >> > hm, OK, thanks.  It's unfortunate that a 5.5 fix is buried in a
> >> > > >> > six-patch series which is still in progress!  Can we please merge
> that
> >> > > >> > as a standalone fix with a cc:stable, Fixes:, etc?
> >> > > >
> >> > > >Maybe can add Fixes tag as follow when merge:
> >> > > >
> >> > > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> >> > > >
> >> > 
> >> > The reporter (cc'ed here) is still seeing issues:
> >> > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> >> > 
> >> > Could we please continue this investigation via emailed reply-to-all,
> >> > rather than via the bugzilla interface?
> >> 
> >> Yes, people prefer mailing list to discuss issues.
> 
> 
> I found perplexing behavior in populate_section_memmap().
> 
> populate_section_memmap() calls alloc_pages(), and if that fails,
> falls back to vmalloc().
> 
> But according to the trace, populate_section_memmap() seems to
> throw out the alloc_pages() result and always falls back to vmalloc(),
> which could be a wrong area to use.
> 
> I sprinkled pr_info() in mm/sparse.c:populate_section_memmap() as below:
> 
> ===========================================
> struct page * __meminit populate_section_memmap(unsigned long pfn,
>                 unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
> {
>         struct page *page, *ret;
>         unsigned long memmap_size = sizeof(struct page) * PAGES_PER_SECTION;
> 
>         page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
>         if (page) {
>                 goto got_map_page;
>         }
> pr_info("%s: alloc_pages() returned 0x%p (should be 0), reverting to
> vmalloc(memmap_size=%lu)\n", __func__, page, memmap_size);
> BUG_ON(page != 0);
> 
>         ret = vmalloc(memmap_size);
> pr_info("%s: vmalloc(%lu) returned 0x%p\n", __func__, memmap_size, ret);
>         if (ret) {
>                 goto got_map_ptr;
>         }
> 
>         return NULL;
> got_map_page:
>         ret = (struct page *)pfn_to_kaddr(page_to_pfn(page));
> pr_info("%s: allocated struct page *page=0x%p\n", __func__, page);
> got_map_ptr:
> 
> pr_info("%s: returning struct page * =0x%p\n", __func__, ret);
>         return ret;
> }
> ==================================================
> 
> and got a following panic.
> It even ignores BUG_ON() (perhaps optimized out).
> 
> Is this worth investigating?
> Disassembly doesn't reveal anything suspicious, but I have feeling that
> I'm looking at disassembly different than that the CPU is seeing.
> It's too trivial to be a compiler bug.
> 
> 
> ==================================================
> [root@localhost ~]# readelf -l /proc/kcore
> 
> Elf file type is CORE (Core file)
> Entry point 0x0
> There are 3 program headers, starting at offset 52
> 
> Program Headers:
>   Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
>   NOTE           0x000094 0x00000000 0x00000000 0x01304 0x00000     0
>   LOAD           0xaff2000 0xcaff0000 0xffffffff 0x3400e000 0x3400e000 RWE
>   0x1000

This should be the vmalloc area. The region covers
[0xcaff0000, 0xcaff0000+0x3400e000], i.e. [0xcaff0000, 0xfeffe000]

>   LOAD           0x002000 0xc0000000 0x00000000 0xa7f0000 0xa7f0000 RWE
>   0x1000
This should be the direct mapping starting from 0xc0000000. It covers the boot
memory you set for the guest kernel, 168M: [0xc0000000, 0xca7f0000]

Since the system only detects your boot memory, max_pfn corresponds to 168M, so
VMALLOC_START = high_memory + VMALLOC_OFFSET;

So any hot-added memory will be treated as high memory. Sorry, I have
forgotten most of the details of i386; this is just my rough understanding
of it.
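
The two LOAD ranges and the VMALLOC_START relation can be checked numerically. This is a minimal userspace sketch, assuming VMALLOC_OFFSET is 8 MiB as on x86_32; seg_end() and in_range() are hypothetical helpers written for this illustration, and the segment values are copied from the readelf output quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* 8 MiB gap between the end of lowmem and the vmalloc area (assumption). */
#define VMALLOC_OFFSET (8u * 1024u * 1024u)

/* Exclusive end address of a segment given its VirtAddr and MemSiz. */
static uint32_t seg_end(uint32_t vaddr, uint32_t memsz)
{
    assert(memsz <= UINT32_MAX - vaddr); /* must not wrap around 4 GiB */
    return vaddr + memsz;
}

/* True if addr lies in [start, end). */
static int in_range(uint32_t addr, uint32_t start, uint32_t end)
{
    return addr >= start && addr < end;
}
```

With the quoted values, seg_end(0xc0000000, 0x0a7f0000) is 0xca7f0000 (the end of the direct mapping), adding VMALLOC_OFFSET gives 0xcaff0000 (the start of the second LOAD segment), and seg_end(0xcaff0000, 0x3400e000) is 0xfeffe000.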


> 
> 
> [  302.784196] hv_balloon: Max. dynamic memory size: 1048576 MB
> [  643.475080] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0,
> ((start_pfn=0x10000) << PAGE_SHIFT)=0x10000000, (HA_CHUNK <<
> PAGE_SHIFT)=134217728)
> [  643.513804] populate_section_memmap: alloc_pages() returned 0xb1a7c4b2
> (should be 0), reverting to vmalloc(memmap_size=655360)

This pr_info is truly weird.

> [  643.513849] populate_section_memmap: vmalloc(655360) returned 0x11b0e715
> [  643.513872] populate_section_memmap: returning struct page * =0x11b0e715

But here the returned page address is 0x11b0e715, which is also bizarre.
Kernel address is above 3G, right?
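
One possible explanation for these values: since v4.15 a plain %p in printk prints a hashed value rather than the raw pointer, while %px prints the raw address. If that applies here, the numbers in the log are not real addresses, and the 0xb1a7c4b2 printed identically on both calls might stand for the same underlying pointer, possibly NULL; that last point is an assumption and not verified here. The userspace sketch below only tests whether a logged value could be a raw kernel address, assuming a 3G/1G split; looks_like_kernel_vaddr() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Start of kernel virtual addresses with a 3G/1G split (assumption). */
#define PAGE_OFFSET 0xc0000000u

/* True if value could be a raw kernel virtual address for this split. */
static int looks_like_kernel_vaddr(uint32_t value, uint32_t page_offset)
{
    assert((page_offset & 0xfffu) == 0); /* split must be page aligned */
    return value >= page_offset;
}
```

Of the three logged values, only 0xf2ba6510 passes this check; 0xb1a7c4b2 and 0x11b0e715 do not, which is what hashed output would look like. Re-running with %px instead of %p would show the raw values.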

> [  643.525352] populate_section_memmap: alloc_pages() returned 0xb1a7c4b2
> (should be 0), reverting to vmalloc(memmap_size=655360)
> [  643.536698] populate_section_memmap: vmalloc(655360) returned 0xf2ba6510
> [  643.536722] populate_section_memmap: returning struct page * =0xf2ba6510

Here, the returned page address looks regular.

> [  643.536749] hv_balloon: hv_mem_hot_add: add_memory() returned 0
> [  645.394458] BUG: unable to handle page fault for address: d13ff000
> [  645.394518] #PF: supervisor write access in kernel mode
> [  645.394565] #PF: error_code(0x0002) - not-present page
> [  645.394584] *pde = 00000000
> [  645.394601] Oops: 0002 [#1] SMP
> [  645.394614] CPU: 0 PID: 361 Comm: systemd-udevd Not tainted
> 5.6.0-rc1.el8.i586 #1
> [  645.394636] Hardware name: Microsoft Corporation Virtual Machine/Virtual
> Machine, BIOS 090006  05/23/2012
> [  645.394670] EIP: wp_page_copy+0x8e/0x750
> [  645.394690] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff
> 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08
> 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
> [  645.394739] EAX: d13ff000 EBX: c752df28 ECX: 00000000 EDX: c5e0d000
> [  645.394767] ESI: c5e0d000 EDI: d13ff004 EBP: c752deec ESP: c752dea8
> [  645.394790] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
> [  645.394815] CR0: 80050033 CR2: d13ff000 CR3: 08e5a000 CR4: 003406d0
> [  645.394840] Call Trace:
> [  645.394852]  ? reuse_swap_page+0x83/0x390
> [  645.394873]  do_wp_page+0x87/0x6e0
> [  645.394885]  handle_mm_fault+0x808/0xe30
> [  645.394893]  do_page_fault+0x19f/0x4d0
> [  645.394901]  ? do_kern_addr_fault+0x80/0x80
> [  645.394915]  common_exception_read_cr2+0x15a/0x15f
> [  645.394930] EIP: 0xb7aaf8bb
> [  645.394944] Code: 24 0c e3 2c 89 d7 83 e2 03 74 11 7a 04 aa 49 74 1f aa 49
> 74 1b 83 f2 01 75 02 aa 49 89 ca c1 e9 02 83 e2 03 69 c0 01 01 01 01 <f3> ab
> 89 d1 f3 aa 8b 44 24 08 5f c3 66 90 66 90 66 90 66 90 90 f3
> [  645.394973] EAX: 00000000 EBX: b7f05f60 ECX: 0000000d EDX: 00000000
> [  645.394988] ESI: 02194db4 EDI: 02194db4 EBP: b7f05db4 ESP: bffed978
> [  645.395003] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00210206
> [  645.395018] Modules linked in: rfkill intel_rapl_msr intel_rapl_common
> crc32_pclmul snd_pcm snd_timer snd soundcore intel_rapl_perf sg pcspkr
> hv_netvsc i2c_piix4 hyperv_fb hv_utils hv_balloon joydev ip_tables ext4
> mbcache jbd2 sr_mod cdrom sd_mod t10_pi ata_generic hyperv_keyboard
> hid_hyperv hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw
> hv_vmbus libata
> [  645.395101] CR2: 00000000d13ff000
> [  645.395121] ---[ end trace 3bb1d66cb8b20841 ]---
> [  645.395144] EIP: wp_page_copy+0x8e/0x750
> [  645.395157] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff
> 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08
> 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
> [  645.395206] EAX: d13ff000 EBX: c752df28 ECX: 00000000 EDX: c5e0d000
> [  645.395235] ESI: c5e0d000 EDI: d13ff004 EBP: c752deec ESP: c752dea8
> [  645.395261] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
> [  645.395278] CR0: 80050033 CR2: d13ff000 CR3: 08e5a000 CR4: 003406d0
> [  645.395308] Kernel panic - not syncing: Fatal exception
> [  645.395329] Kernel Offset: 0x3e00000 from 0xc1000000 (relocation range:
> 0xc0000000-0xcafeffff)
> [  645.395354] ---[ end Kernel panic - not syncing: Fatal exception ]---
> ==================================================
>
Comment 27 David Hildenbrand 2020-02-17 08:00:36 UTC
On 17.02.20 06:31, kkabe@vega.pgw.jp wrote:
> bhe@redhat.com sed in <20200217044850.GD4816@MiWiFi-R3L-srv>
> 
>>> Sorry, I roughly went through code, didn't get clue. Not sure if David
>>> have idea about it.
>>>
>>> By the way, may I know why you would like to run i386 guest on Hyper-V?
>>>
>>> Found people are talking about the 32bit kernel supporting in upstream,
>>> below is Linus's point of view.
>>>
>>> https://lore.kernel.org/linux-fsdevel/CAHk-=wiGbz3oRvAVFtN-whW-d2F-STKsP1MZT4m_VeycAr1_VQ@mail.gmail.com/
>>>
> 
> <offtopic>
> Using Hyper-V for testing out a new kernel is convenient and faster
> before testing it out on a real i386 machine.
> And I do want bugs squashed; meanwhile "hv_balloon.hot_add=0" will be 
> a workaround.
> 
> I agree HIGHMEM64G (PAE) is going to be a deprecated feature, but I do miss
> HIGHMEM4G (needed for 1GB memory support).
> </offtopic>
> 

Could it be that we are hotplugging highmem, but when onlining memory,
it will be onlined to ZONE_NORMAL and not ZONE_HIGHMEM? That could
explain why we fail at random points in time, when somebody stumbles
over such a page.
Comment 28 osalvador 2020-02-17 09:35:04 UTC
On Mon, Feb 17, 2020 at 02:46:27PM +0900, kkabe@vega.pgw.jp wrote:
> ===========================================
> struct page * __meminit populate_section_memmap(unsigned long pfn,
>                 unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
> {
>         struct page *page, *ret;
>         unsigned long memmap_size = sizeof(struct page) * PAGES_PER_SECTION;
> 
>         page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
>         if (page) {
>                 goto got_map_page;
>         }
> pr_info("%s: alloc_pages() returned 0x%p (should be 0), reverting to
> vmalloc(memmap_size=%lu)\n", __func__, page, memmap_size);
> BUG_ON(page != 0);
> 
>         ret = vmalloc(memmap_size);
> pr_info("%s: vmalloc(%lu) returned 0x%p\n", __func__, memmap_size, ret);
>         if (ret) {
>                 goto got_map_ptr;
>         }
> 
>         return NULL;
> got_map_page:
>         ret = (struct page *)pfn_to_kaddr(page_to_pfn(page));
> pr_info("%s: allocated struct page *page=0x%p\n", __func__, page);
> got_map_ptr:
> 
> pr_info("%s: returning struct page * =0x%p\n", __func__, ret);
>         return ret;
> }

Could you please replace %p with %px? With the former, pointers are hashed, so it is
harder to make sense of the values.

David could be right about ZONE_NORMAL vs ZONE_HIGHMEM.
IIUC, default_kernel_zone_for_pfn and default_zone_for_pfn seem to only deal with
[ZONE_DMA, ZONE_NORMAL] or ZONE_MOVABLE.
Although I really fail to see how this could cause the crash.

Could you also please capture /proc/zoneinfo before and after hotplugging memory?

And add this delta on top of your debugging patch?

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0a54ffac8c68..2b9c821d7cf0 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -574,6 +574,7 @@ EXPORT_SYMBOL_GPL(restore_online_page_callback);
 
 void generic_online_page(struct page *page, unsigned int order)
 {
+       pr_info("generic_online_page: page: %px order: %u\n", page, order);
        kernel_map_pages(page, 1 << order, 1);
        __free_pages_core(page, order);
        totalram_pages_add(1UL << order);
@@ -774,6 +775,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
        zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
        move_pfn_range_to_zone(zone, pfn, nr_pages, NULL);
 
+       pr_info("%s: pfn: %lx - %lx (zone: %s)\n", __func__, pfn, pfn + nr_pages, zone->name);
+
        arg.start_pfn = pfn;
        arg.nr_pages = nr_pages;
        node_states_check_changes_online(nr_pages, zone, &arg);
Comment 29 Baoquan He 2020-02-17 10:13:30 UTC
On 02/17/20 at 10:34am, Oscar Salvador wrote:
> On Mon, Feb 17, 2020 at 02:46:27PM +0900, kkabe@vega.pgw.jp wrote:
> Could you please replace %p with %px. Wih the first, pointers are hashed so
> it is trickier
> to get an overview of the meaning.
> 
> David could be right about ZONE_NORMAL vs ZONE_HIGHMEM.
> IIUC, default_kernel_zone_for_pfn and default_zone_for_pfn seem to only deal
> with
> (ZONE_DMA,ZONE_NORMAL] or ZONE_MOVABLE.

Ah, I think you both have spotted the problem.
 
On i386, without memory hot-add, normal memory only includes the memory
below 896M, which is added to the normal zone. The rest is added to the
highmem zone.
 
How does this influence page allocation?
 
Hugely. As we know, on i386 normal memory can be accessed with
virt_to_phys, namely PAGE_OFFSET + phys, but highmem has to be accessed
with kmap. However, memory that is hot-added later is all put into the
normal zone, so accessing it will stumble into the vmalloc area, I would say.
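
The arithmetic above can be sketched in user space. This is only a model, not kernel code: it assumes the common 3G/1G split (PAGE_OFFSET of 0xC0000000) and the 896M lowmem limit mentioned above, and model_va() and phys_is_lowmem() are hypothetical names standing in for the __va()-style direct-map translation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for a 3G/1G i386 split; real values depend on the kernel config. */
#define PAGE_OFFSET   0xC0000000u   /* start of the kernel direct map */
#define LOWMEM_LIMIT  (896u << 20)  /* 896 MiB of physical memory is direct-mapped */

/* Model of the direct-map translation: PAGE_OFFSET + phys, truncated to 32 bits. */
static uint32_t model_va(uint64_t phys)
{
    return (uint32_t)(PAGE_OFFSET + phys);
}

/* True if phys lies inside the direct map, i.e. the translation above is valid. */
static bool phys_is_lowmem(uint64_t phys)
{
    return phys < LOWMEM_LIMIT;
}
```

Under these assumptions, a physical address at 1 GiB (0x40000000) gives 0xC0000000 + 0x40000000, which no longer fits in 32 bits, so memory at that address cannot be reached through the direct map and would have to be treated as highmem.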
 
So i386 doesn't support memory hot-add well. Not sure if the change below
can make it work normally.
 
We can just adjust the hot-add code as we have done for boot memory:
iterate over the zones including highmem, if allowed, when hot-adding memory.
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 475d0d68a32c..1380392d9ef5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -716,7 +716,10 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	int zid;
 
-	for (zid = 0; zid <= ZONE_NORMAL; zid++) {
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		if (zid == ZONE_MOVABLE)
+			continue;
+
 		struct zone *zone = &pgdat->node_zones[zid];
 
 		if (zone_intersects(zone, start_pfn, nr_pages))
Comment 30 Baoquan He 2020-02-17 10:17:56 UTC
On 02/17/20 at 06:13pm, Baoquan He wrote:
> On 02/17/20 at 10:34am, Oscar Salvador wrote:
> > On Mon, Feb 17, 2020 at 02:46:27PM +0900, kkabe@vega.pgw.jp wrote:
> > Could you please replace %p with %px. Wih the first, pointers are hashed so
> it is trickier
> > to get an overview of the meaning.
> > 
> > David could be right about ZONE_NORMAL vs ZONE_HIGHMEM.
> > IIUC, default_kernel_zone_for_pfn and default_zone_for_pfn seem to only
> deal with
> > (ZONE_DMA,ZONE_NORMAL] or ZONE_MOVABLE.
> 
> Ah, I think you both have spotted the problem.
>  
> In i386, if w/o momory hot add, normal memory will only include those
> below 896M and they are added into normal zone. The left are added into
> highmem zone.
>  
> How this influence the page allocation?
>  
> Very huge. As we know, in i386, normal memory can be accessed with
> virt_to_phys, namely PAGE_OFFSET + phys. But highmem has to be accessed

Hmm, here I mean __pa and __va.

> with kmap. However, the later hot added memory are all put into normal
> memmory, accessing into them will stump into vmalloc area, I would say.
>  
> So, i386 doesn't support memory hot add well.  Not sure if below change
> can make it work normally.
>  
> We can just adjus the hot adding code as we have done for boot memmory.
> Iterate zone from highmem if allowed when hot add memory.
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 475d0d68a32c..1380392d9ef5 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -716,7 +716,10 @@ static struct zone *default_kernel_zone_for_pfn(int nid,
> unsigned long start_pfn
>       struct pglist_data *pgdat = NODE_DATA(nid);
>       int zid;
>  
> -     for (zid = 0; zid <= ZONE_NORMAL; zid++) {
> +     for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> +             if (zid == ZONE_MOVABLE)
> +                     continue;
> +
>               struct zone *zone = &pgdat->node_zones[zid];
>  
>               if (zone_intersects(zone, start_pfn, nr_pages))
> 
>
Comment 31 David Hildenbrand 2020-02-17 10:24:31 UTC
On 17.02.20 11:13, Baoquan He wrote:
> On 02/17/20 at 10:34am, Oscar Salvador wrote:
>> On Mon, Feb 17, 2020 at 02:46:27PM +0900, kkabe@vega.pgw.jp wrote:
>> Could you please replace %p with %px. Wih the first, pointers are hashed so
>> it is trickier
>> to get an overview of the meaning.
>>
>> David could be right about ZONE_NORMAL vs ZONE_HIGHMEM.
>> IIUC, default_kernel_zone_for_pfn and default_zone_for_pfn seem to only deal
>> with
>> (ZONE_DMA,ZONE_NORMAL] or ZONE_MOVABLE.
> 
> Ah, I think you both have spotted the problem.
>  
> In i386, if w/o momory hot add, normal memory will only include those
> below 896M and they are added into normal zone. The left are added into
> highmem zone.
>  
> How this influence the page allocation?
>  
> Very huge. As we know, in i386, normal memory can be accessed with
> virt_to_phys, namely PAGE_OFFSET + phys. But highmem has to be accessed
> with kmap. However, the later hot added memory are all put into normal
> memmory, accessing into them will stump into vmalloc area, I would say.
>  
> So, i386 doesn't support memory hot add well.  Not sure if below change
> can make it work normally.
>  
> We can just adjus the hot adding code as we have done for boot memmory.
> Iterate zone from highmem if allowed when hot add memory.
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 475d0d68a32c..1380392d9ef5 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -716,7 +716,10 @@ static struct zone *default_kernel_zone_for_pfn(int nid,
> unsigned long start_pfn
>       struct pglist_data *pgdat = NODE_DATA(nid);
>       int zid;
>  
> -     for (zid = 0; zid <= ZONE_NORMAL; zid++) {
> +     for (zid = 0; zid < MAX_NR_ZONES; zid++) {

ZONE_DEVICE? :/

> +             if (zid == ZONE_MOVABLE)
> +                     continue;
> +
>               struct zone *zone = &pgdat->node_zones[zid];
>  
>               if (zone_intersects(zone, start_pfn, nr_pages))
> 
> 

What if somebody explicitly onlines memory to the normal zone from user
space? Could that trigger crashes?

This doesn't look like it ever worked reliably. Can we just disable
memory hotplug in case we have PAE (especially as continued i386
support is questionable)?
Comment 32 Michal Hocko 2020-02-17 10:33:40 UTC
On Fri 14-02-20 23:26:29, kkabe@vega.pgw.jp wrote:
[...]
> [root@localhost ~]# [  302.391125] hv_balloon: Max. dynamic memory size:
> 1048576 MB

Is this saying that the system might hotplug up to 1TB of memory on this
32b system?

Btw. the hotplug support on highmem systems is quite likely to be broken
and/or full of corner cases. I seriously doubt this is something anybody
should be running in production without a _lot_ of work.

Is there any real use case for running Hyper-V hotplug on a 32b system?
Comment 33 Baoquan He 2020-02-17 10:34:01 UTC
On 02/17/20 at 11:24am, David Hildenbrand wrote:
> On 17.02.20 11:13, Baoquan He wrote:
> > On 02/17/20 at 10:34am, Oscar Salvador wrote:
> >> On Mon, Feb 17, 2020 at 02:46:27PM +0900, kkabe@vega.pgw.jp wrote:
> >> Could you please replace %p with %px. Wih the first, pointers are hashed
> so it is trickier
> >> to get an overview of the meaning.
> >>
> >> David could be right about ZONE_NORMAL vs ZONE_HIGHMEM.
> >> IIUC, default_kernel_zone_for_pfn and default_zone_for_pfn seem to only
> deal with
> >> (ZONE_DMA,ZONE_NORMAL] or ZONE_MOVABLE.
> > 
> > Ah, I think you both have spotted the problem.
> >  
> > In i386, if w/o momory hot add, normal memory will only include those
> > below 896M and they are added into normal zone. The left are added into
> > highmem zone.
> >  
> > How this influence the page allocation?
> >  
> > Very huge. As we know, in i386, normal memory can be accessed with
> > virt_to_phys, namely PAGE_OFFSET + phys. But highmem has to be accessed
> > with kmap. However, the later hot added memory are all put into normal
> > memmory, accessing into them will stump into vmalloc area, I would say.
> >  
> > So, i386 doesn't support memory hot add well.  Not sure if below change
> > can make it work normally.
> >  
> > We can just adjus the hot adding code as we have done for boot memmory.
> > Iterate zone from highmem if allowed when hot add memory.
> >  
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 475d0d68a32c..1380392d9ef5 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -716,7 +716,10 @@ static struct zone *default_kernel_zone_for_pfn(int
> nid, unsigned long start_pfn
> >     struct pglist_data *pgdat = NODE_DATA(nid);
> >     int zid;
> >  
> > -   for (zid = 0; zid <= ZONE_NORMAL; zid++) {
> > +   for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> 
> ZONE_DEVICE? :/

Not sure if ZONE_DEVICE will be supported on a 32-bit system.


> 
> > +           if (zid == ZONE_MOVABLE)
> > +                   continue;
> > +
> >             struct zone *zone = &pgdat->node_zones[zid];
> >  
> >             if (zone_intersects(zone, start_pfn, nr_pages))
> > 
> > 
> 
> What if somebody onlines memory from user space explicitly to the normal
> zone? We can trigger crashes?

It seems the current i386 code doesn't support that, unless we change it
too. Without reserving virtual address space, any memory added later has
to be highmem.

> 
> This doesn't look like it ever worked reliably, can we just disable
> memory hotplug in case we have PAE? (especially, as continued i386
> support is questionable)

This is not PAE, this is only HIGHMEM4G.
Comment 34 David Hildenbrand 2020-02-17 10:38:27 UTC
On 17.02.20 11:33, Baoquan He wrote:
> On 02/17/20 at 11:24am, David Hildenbrand wrote:
>> On 17.02.20 11:13, Baoquan He wrote:
>>> On 02/17/20 at 10:34am, Oscar Salvador wrote:
>>>> On Mon, Feb 17, 2020 at 02:46:27PM +0900, kkabe@vega.pgw.jp wrote:
>>>> Could you please replace %p with %px. Wih the first, pointers are hashed
>>>> so it is trickier
>>>> to get an overview of the meaning.
>>>>
>>>> David could be right about ZONE_NORMAL vs ZONE_HIGHMEM.
>>>> IIUC, default_kernel_zone_for_pfn and default_zone_for_pfn seem to only
>>>> deal with
>>>> (ZONE_DMA,ZONE_NORMAL] or ZONE_MOVABLE.
>>>
>>> Ah, I think you both have spotted the problem.
>>>  
>>> In i386, if w/o momory hot add, normal memory will only include those
>>> below 896M and they are added into normal zone. The left are added into
>>> highmem zone.
>>>  
>>> How this influence the page allocation?
>>>  
>>> Very huge. As we know, in i386, normal memory can be accessed with
>>> virt_to_phys, namely PAGE_OFFSET + phys. But highmem has to be accessed
>>> with kmap. However, the later hot added memory are all put into normal
>>> memmory, accessing into them will stump into vmalloc area, I would say.
>>>  
>>> So, i386 doesn't support memory hot add well.  Not sure if below change
>>> can make it work normally.
>>>  
>>> We can just adjus the hot adding code as we have done for boot memmory.
>>> Iterate zone from highmem if allowed when hot add memory.
>>>  
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 475d0d68a32c..1380392d9ef5 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -716,7 +716,10 @@ static struct zone *default_kernel_zone_for_pfn(int
>>> nid, unsigned long start_pfn
>>>     struct pglist_data *pgdat = NODE_DATA(nid);
>>>     int zid;
>>>  
>>> -   for (zid = 0; zid <= ZONE_NORMAL; zid++) {
>>> +   for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>>
>> ZONE_DEVICE? :/
> 
> Not sure if ZONE_DEVICE will be supported on 32 bit system.
> 
> 
>>
>>> +           if (zid == ZONE_MOVABLE)
>>> +                   continue;
>>> +
>>>             struct zone *zone = &pgdat->node_zones[zid];
>>>  
>>>             if (zone_intersects(zone, start_pfn, nr_pages))
>>>
>>>
>>
>> What if somebody onlines memory from user space explicitly to the normal
>> zone? We can trigger crashes?
> 
> Seems the current i386 code doesn't support it. Unless we change that
> too. If not reserving virtual address space, later added any memory has
> to be highmem.
> 
>>
>> This doesn't look like it ever worked reliably, can we just disable
>> memory hotplug in case we have PAE? (especially, as continued i386
>> support is questionable)
> 
> This is not PAE, this is only HIGHMEM4G.
> 

Ah, okay. Anyhow, highmem combined with hotplug seems to be in a
questionable state. I'd vote for disabling it if possible.
Comment 35 Baoquan He 2020-02-17 11:21:09 UTC
On 02/17/20 at 11:38am, David Hildenbrand wrote:
> On 17.02.20 11:33, Baoquan He wrote:
> > On 02/17/20 at 11:24am, David Hildenbrand wrote:
> >> On 17.02.20 11:13, Baoquan He wrote:
> >>> On 02/17/20 at 10:34am, Oscar Salvador wrote:
> >>>> On Mon, Feb 17, 2020 at 02:46:27PM +0900, kkabe@vega.pgw.jp wrote:
> >>>> Could you please replace %p with %px. Wih the first, pointers are hashed
> so it is trickier
> >>>> to get an overview of the meaning.
> >>>>
> >>>> David could be right about ZONE_NORMAL vs ZONE_HIGHMEM.
> >>>> IIUC, default_kernel_zone_for_pfn and default_zone_for_pfn seem to only
> deal with
> >>>> (ZONE_DMA,ZONE_NORMAL] or ZONE_MOVABLE.
> >>>
> >>> Ah, I think you both have spotted the problem.
> >>>  
> >>> In i386, if w/o momory hot add, normal memory will only include those
> >>> below 896M and they are added into normal zone. The left are added into
> >>> highmem zone.
> >>>  
> >>> How this influence the page allocation?
> >>>  
> >>> Very huge. As we know, in i386, normal memory can be accessed with
> >>> virt_to_phys, namely PAGE_OFFSET + phys. But highmem has to be accessed
> >>> with kmap. However, the later hot added memory are all put into normal
> >>> memmory, accessing into them will stump into vmalloc area, I would say.
> >>>  
> >>> So, i386 doesn't support memory hot add well.  Not sure if below change
> >>> can make it work normally.
> >>>  

Please try the code below instead and see if it works. However, as David
and Michal said in their replies, if there is no real use case, we may not
be so eager to support memory hotplug on i386.


diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 475d0d68a32c..9faf47bd026e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -715,15 +715,20 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	int zid;
+	enum zone_type default_zone = ZONE_NORMAL;
 
-	for (zid = 0; zid <= ZONE_NORMAL; zid++) {
+#ifdef CONFIG_HIGHMEM
+	default_zone = ZONE_HIGHMEM;
+#endif
+
+	for (zid = 0; zid <= default_zone; zid++) {
 		struct zone *zone = &pgdat->node_zones[zid];
 
 		if (zone_intersects(zone, start_pfn, nr_pages))
 			return zone;
 	}
 
-	return &pgdat->node_zones[ZONE_NORMAL];
+	return &pgdat->node_zones[default_zone];
 }
 
 static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
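
For illustration, the effect of the patch above can be modelled in user space. This is a simplified sketch, not the kernel code: struct zone_model, zone_intersects_model() and pick_kernel_zone() are hypothetical stand-ins, and the zone list assumes an i386 HIGHMEM build without ZONE_DEVICE. Passing ZONE_NORMAL as default_zone models the old behaviour, and passing ZONE_HIGHMEM models the patched one.

```c
#include <stdbool.h>

/* Assumed zone order for an i386 HIGHMEM build (no ZONE_DMA32, no ZONE_DEVICE). */
enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE, MAX_NR_ZONES };

/* Simplified stand-in for struct zone: only the spanned pfn range. */
struct zone_model {
    unsigned long start_pfn;
    unsigned long spanned_pages;   /* 0 means the zone is empty */
};

/* Does [start_pfn, start_pfn + nr_pages) overlap the zone's spanned range? */
static bool zone_intersects_model(const struct zone_model *z,
                                  unsigned long start_pfn,
                                  unsigned long nr_pages)
{
    if (z->spanned_pages == 0)
        return false;
    if (start_pfn >= z->start_pfn + z->spanned_pages)
        return false;
    if (start_pfn + nr_pages <= z->start_pfn)
        return false;
    return true;
}

/* Pick the first zone up to default_zone that overlaps the range, else default_zone. */
static enum zone_type pick_kernel_zone(const struct zone_model zones[MAX_NR_ZONES],
                                       enum zone_type default_zone,
                                       unsigned long start_pfn,
                                       unsigned long nr_pages)
{
    for (int zid = 0; zid <= (int)default_zone; zid++) {
        if (zone_intersects_model(&zones[zid], start_pfn, nr_pages))
            return (enum zone_type)zid;
    }
    return default_zone;
}
```

In this model, a guest booted with 160MB has an empty highmem zone, so a range hot-added at 1GB intersects no existing zone and falls through to the default: ZONE_NORMAL before the change, ZONE_HIGHMEM after it.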
Comment 36 Taketo Kabe 2020-02-17 11:21:36 UTC
mhocko@kernel.org sed in <20200217103335.GI31531@dhcp22.suse.cz>

>> On Fri 14-02-20 23:26:29, kkabe@vega.pgw.jp wrote:
>> [...]
>> > [root@localhost ~]# [  302.391125] hv_balloon: Max. dynamic memory size:
>> 1048576 MB
>> 
>> Is this saying that the system might hotplug up to 1TB of memory on this
>> 32b system?

Probably. The hypervisor API uses 64-bit values, which is why
I added a printk to add_memory() to see whether it overflows 4GB.
I guess drivers/hv/hv_balloon.c:hv_mem_hot_add() needs a check so that it
does not hot-add memory beyond 4GB on non-PAE systems and so on,
but that's another story.
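
As a rough sketch of what such a check could look like (user-space C, not the hv_balloon code; pfn_range_addressable() and PAGE_SHIFT_MODEL are hypothetical names, and 4 KiB pages are assumed):

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT_MODEL 12   /* assumed 4 KiB pages, as on i386 */

/*
 * Hypothetical helper: true if every page in [start_pfn, start_pfn + nr_pages)
 * has a physical address below 2^phys_bits (32 for a non-PAE kernel).
 * phys_bits is expected to be greater than PAGE_SHIFT_MODEL and at most 64.
 */
static bool pfn_range_addressable(uint64_t start_pfn, uint64_t nr_pages,
                                  unsigned int phys_bits)
{
    uint64_t max_pfn = UINT64_C(1) << (phys_bits - PAGE_SHIFT_MODEL);

    /* Written this way so that start_pfn + nr_pages cannot overflow. */
    return start_pfn < max_pfn && nr_pages <= max_pfn - start_pfn;
}
```

With 32 physical address bits the limit is pfn 0x100000 (4GB), so the advertised maximum of 1048576 MB (2^28 pages) would be rejected, while a 128MB block at 1GB would pass.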



>> Btw. the hotplug support on highmem systems is quite likely to be broken
>> and/or full of corner cases. I seriously doubt this is something anybody
>> should be running in production without a _lot_ of work.
>> 
>> Is there any real usecase to run HyperV hotplug on 32b system?
>> -- 
>> Michal Hocko
>> SUSE Labs
Comment 37 Michal Hocko 2020-02-17 12:47:39 UTC
On Mon 17-02-20 19:20:54, Baoquan He wrote:
> On 02/17/20 at 11:38am, David Hildenbrand wrote:
> > On 17.02.20 11:33, Baoquan He wrote:
> > > On 02/17/20 at 11:24am, David Hildenbrand wrote:
> > >> On 17.02.20 11:13, Baoquan He wrote:
> > >>> On 02/17/20 at 10:34am, Oscar Salvador wrote:
> > >>>> On Mon, Feb 17, 2020 at 02:46:27PM +0900, kkabe@vega.pgw.jp wrote:
> > >>>> Could you please replace %p with %px. Wih the first, pointers are
> hashed so it is trickier
> > >>>> to get an overview of the meaning.
> > >>>>
> > >>>> David could be right about ZONE_NORMAL vs ZONE_HIGHMEM.
> > >>>> IIUC, default_kernel_zone_for_pfn and default_zone_for_pfn seem to
> only deal with
> > >>>> (ZONE_DMA,ZONE_NORMAL] or ZONE_MOVABLE.
> > >>>
> > >>> Ah, I think you both have spotted the problem.
> > >>>  
> > >>> In i386, if w/o momory hot add, normal memory will only include those
> > >>> below 896M and they are added into normal zone. The left are added into
> > >>> highmem zone.
> > >>>  
> > >>> How this influence the page allocation?
> > >>>  
> > >>> Very huge. As we know, in i386, normal memory can be accessed with
> > >>> virt_to_phys, namely PAGE_OFFSET + phys. But highmem has to be accessed
> > >>> with kmap. However, the later hot added memory are all put into normal
> > >>> memmory, accessing into them will stump into vmalloc area, I would say.
> > >>>  
> > >>> So, i386 doesn't support memory hot add well.  Not sure if below change
> > >>> can make it work normally.
> > >>>  
> 
> Please try below code instead, see if it works. However, as David and
> and Michal said in other reply, if no real use case, we may not be so
> eager to support mem hotplug on i386. 

Yes please. Can we just mark it broken until there is a real use case?
Convoluting the code even more for something that is not in use just
adds a maintenance burden, and memory hotplug is seriously
understaffed already.

This is likely a fallout of the hotplug rework (c6f03e2903c9e) from 2
years ago. I cannot really say whether the code worked reasonably before
the rework because I never considered hotplug on 32b to be something to
even try TBH. Mostly because lowmem is unlikely to ever benefit from
hotplug and adding more highmem just makes all the lowmem problems even
worse so this is dubious in itself.

That being said, I am willing to investigate further if there is a real
use case for this, but considering that nobody has noticed the breakage in
almost 3 years, I simply suspect that this is not really interesting
and that marking it explicitly BROKEN is a better option.
Comment 38 Taketo Kabe 2020-02-18 06:24:54 UTC
Created attachment 287453 [details]
dmesg

bhe@redhat.com sed in <20200217112054.GA9823@MiWiFi-R3L-srv>

>> On 02/17/20 at 11:38am, David Hildenbrand wrote:
>> > On 17.02.20 11:33, Baoquan He wrote:
>> > > On 02/17/20 at 11:24am, David Hildenbrand wrote:
>> > >> On 17.02.20 11:13, Baoquan He wrote:
>> > >>> On 02/17/20 at 10:34am, Oscar Salvador wrote:
>> > >>>> On Mon, Feb 17, 2020 at 02:46:27PM +0900, kkabe@vega.pgw.jp wrote:
>> > >>>> Could you please replace %p with %px. With the first, pointers are
>> > >>>> hashed so it is trickier to get an overview of the meaning.
>> > >>>>
>> > >>>> David could be right about ZONE_NORMAL vs ZONE_HIGHMEM.
>> > >>>> IIUC, default_kernel_zone_for_pfn and default_zone_for_pfn seem to
>> > >>>> only deal with (ZONE_DMA,ZONE_NORMAL] or ZONE_MOVABLE.
>> > >>>
>> > >>> Ah, I think you both have spotted the problem.
>> > >>>  
>> > >>> In i386, w/o memory hot add, normal memory will only include memory
>> > >>> below 896M, which is added into the normal zone. The rest is added
>> > >>> into the highmem zone.
>> > >>>  
>> > >>> How does this influence page allocation?
>> > >>>  
>> > >>> Hugely. As we know, in i386, normal memory can be accessed with
>> > >>> virt_to_phys, namely PAGE_OFFSET + phys. But highmem has to be
>> > >>> accessed with kmap. However, the later hot-added memory is all put
>> > >>> into normal memory, so accessing it will stumble into the vmalloc
>> > >>> area, I would say.
>> > >>>  
>> > >>> So, i386 doesn't support memory hot add well.  Not sure if the
>> > >>> change below can make it work normally.
>> > >>>  
>> 
>> Please try the code below instead and see if it works. However, as David
>> and Michal said in the other reply, if there is no real use case, we may
>> not be so eager to support mem hotplug on i386.
>> 
>> 
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 475d0d68a32c..9faf47bd026e 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -715,15 +715,20 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
>>  {
>>      struct pglist_data *pgdat = NODE_DATA(nid);
>>      int zid;
>> +    enum zone_type default_zone = ZONE_NORMAL;
>>  
>> -    for (zid = 0; zid <= ZONE_NORMAL; zid++) {
>> +#ifdef CONFIG_HIGHMEM
>> +    default_zone = ZONE_HIGHMEM;
>> +#endif
>> +
>> +    for (zid = 0; zid <= default_zone; zid++) {
>>              struct zone *zone = &pgdat->node_zones[zid];
>>  
>>              if (zone_intersects(zone, start_pfn, nr_pages))
>>                      return zone;
>>      }
>>  
>> -    return &pgdat->node_zones[ZONE_NORMAL];
>> +    return &pgdat->node_zones[default_zone];
>>  }
>>  
>>  static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
>> 
>> 

Tried out the above patch.
It seems to be working; no panic, total memory has increased and
the hot-added memory is added as HIGHMEM.
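
To make the effect of the change easier to see, here is a minimal user-space
sketch of the zone selection loop. It is only a model, not kernel code:
struct zone is reduced to two fields, zone_intersects() is a simplified
re-implementation, and pick_kernel_zone() is a hypothetical name that takes
the default zone as a parameter instead of using #ifdef CONFIG_HIGHMEM.

```c
#include <assert.h>
#include <stdbool.h>

/* Zone order as on i386 with CONFIG_HIGHMEM (no ZONE_DMA32). */
enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE, MAX_NR_ZONES };

/* Simplified stand-in for the kernel's struct zone. */
struct zone {
	unsigned long start_pfn;
	unsigned long spanned_pages;
};

/* True if [start_pfn, start_pfn + nr_pages) overlaps a non-empty zone. */
static bool zone_intersects(const struct zone *z, unsigned long start_pfn,
			    unsigned long nr_pages)
{
	if (z->spanned_pages == 0)
		return false;
	if (start_pfn >= z->start_pfn + z->spanned_pages)
		return false;
	if (start_pfn + nr_pages <= z->start_pfn)
		return false;
	return true;
}

/*
 * Hypothetical model of default_kernel_zone_for_pfn(): walk the kernel
 * zones up to default_zone and fall back to default_zone if none overlaps.
 */
static enum zone_type pick_kernel_zone(const struct zone zones[MAX_NR_ZONES],
				       unsigned long start_pfn,
				       unsigned long nr_pages,
				       enum zone_type default_zone)
{
	assert(default_zone < ZONE_MOVABLE);

	for (int zid = 0; zid <= (int)default_zone; zid++) {
		if (zone_intersects(&zones[zid], start_pfn, nr_pages))
			return (enum zone_type)zid;
	}
	return default_zone;
}
```

With the spans from the "before hot-add" zoneinfo in this comment (DMA at
pfn 1 spanning 4095 pages, Normal at pfn 4096 spanning 126960 pages, HighMem
empty), a section at pfn 0x28000 overlaps nothing, so the result is simply
the default: ZONE_NORMAL when called with the old limit, ZONE_HIGHMEM when
called with the new one.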

I had to back out the first section of Oscar's patch
https://bugzilla.kernel.org/show_bug.cgi?id=206401#c28
since it spams the console too much and bogs down systemd.

A minimal install with 168MB of memory worked, so this time the sample
runs the anaconda installer starting at 512MB.
Eventually memory was hot-added up to around 1.2GB.

The weird pr_info() output from populate_section_memmap() still remains, though...

The 2nd parameter of add_memory() (phys_addr_t, 32 bits on non-PAE) is
going up to 0x60000000, so drivers/hv/hv_balloon.c:hv_mem_hot_add() may need
a limit check so that it does not overflow 4GB under heavier usage.
(Yes, you should limit it in the hypervisor dialog, but the default is 1TB.)
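
A rough sketch of the kind of limit check meant here, as a user-space model.
The helper name hot_add_range_fits_32bit() is hypothetical and does not exist
in hv_balloon.c; it assumes 4 KiB pages and that the resulting physical
address has to fit in 32 bits, as described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* 4 KiB pages, as on i386 */

/*
 * Hypothetical helper (not part of hv_balloon.c): returns true when every
 * byte of the range starting at start_pfn and covering nr_pages pages has
 * a physical address that fits in 32 bits.
 */
static bool hot_add_range_fits_32bit(uint64_t start_pfn, uint64_t nr_pages)
{
	uint64_t start, end;

	assert(nr_pages > 0);

	/* Do the arithmetic in 64 bits so the shift itself cannot wrap. */
	start = start_pfn << PAGE_SHIFT;
	end = start + (nr_pages << PAGE_SHIFT);	/* exclusive end */

	return end - 1 <= UINT32_MAX;
}
```

For the largest request seen in the log below (start_pfn=0x60000 with a
128MB chunk, i.e. 0x8000 pages) this returns true; a request starting at
pfn 0x100000 (4GB) or one that straddles 4GB would be rejected.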


Do we need modifications to arch/x86/mm/init_32.c:arch_add_memory()
so that the hot-added memory is always in the highmem area?
Currently it just shifts the given parameters by PAGE_SHIFT and calls the
generic __add_pages().


======================= readelf -l /proc/kcore:
Elf file type is CORE (Core file)
Entry point 0x0
There are 3 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  NOTE           0x000094 0x00000000 0x00000000 0x01304 0x00000     0
  LOAD           0x207f2000 0xe07f0000 0xffffffff 0x1e80e000 0x1e80e000 RWE 0x1000
  LOAD           0x002000 0xc0000000 0x00000000 0x1fff0000 0x1fff0000 RWE 0x1000
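
The program headers above can be used to check the "lowmem direct mapping
runs into vmalloc" theory with a small calculation. This is only a sketch
under assumptions: PAGE_OFFSET is taken to be 0xc0000000 (the VirtAddr of
the lowmem LOAD segment), and the LOAD segment at 0xe07f0000 is assumed to
be the vmalloc area. Both helper names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET   0xC0000000u	/* VirtAddr of the lowmem LOAD segment */
#define VMALLOC_START 0xE07F0000u	/* VirtAddr of the other LOAD segment (assumed vmalloc) */
#define VMALLOC_SIZE  0x1E80E000u	/* MemSiz of that segment */

/* Address a physical page would get if treated as directly mapped lowmem. */
static uint32_t direct_map_addr(uint32_t phys)
{
	assert(phys <= UINT32_MAX - PAGE_OFFSET);	/* would wrap otherwise */
	return PAGE_OFFSET + phys;
}

/* True if vaddr lies inside the segment assumed to be the vmalloc area. */
static bool in_vmalloc_segment(uint32_t vaddr)
{
	return vaddr >= VMALLOC_START && vaddr - VMALLOC_START < VMALLOC_SIZE;
}
```

The first hot-added chunk at physical 0x28000000 would get the direct-map
address 0xe8000000, which lies inside the 0xe07f0000 segment. So if that
memory were onlined as ZONE_NORMAL, its "lowmem" addresses would overlap
that range, which matches the explanation quoted earlier.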

======================== dmesg excerpt:
[  302.503487] hv_balloon: Max. dynamic memory size: 1048576 MB
[  303.171640] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x28000) << PAGE_SHIFT)=0x28000000, (HA_CHUNK << PAGE_SHIFT)=134217728)
[  303.173031] populate_section_memmap: alloc_pages() returned 0x56164d26 (should be 0), reverting to vmalloc(memmap_size=655360)
[  303.173031] populate_section_memmap: vmalloc(655360) returned 0x912eede0
[  303.173031] populate_section_memmap: returning struct page * =0x912eede0
[  303.173032] populate_section_memmap: alloc_pages() returned 0x56164d26 (should be 0), reverting to vmalloc(memmap_size=655360)
[  303.173032] populate_section_memmap: vmalloc(655360) returned 0x900acc37
[  303.173032] populate_section_memmap: returning struct page * =0x900acc37
[  303.173033] hv_balloon: hv_mem_hot_add: add_memory() returned 0
[  303.213109] online_pages: pfn: 28000 - 2c000 (zone: HighMem)
[  303.223135] Built 1 zonelists, mobility grouping on.  Total pages: 123131
[  303.223139] online_pages: pfn: 2c000 - 30000 (zone: HighMem)
....
[  305.124224] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x60000) << PAGE_SHIFT)=0x60000000, (HA_CHUNK << PAGE_SHIFT)=134217728)
[  305.124239] populate_section_memmap: alloc_pages() returned 0x56164d26 (should be 0), reverting to vmalloc(memmap_size=655360)
[  305.124240] populate_section_memmap: vmalloc(655360) returned 0x5dd5170c
[  305.124240] populate_section_memmap: returning struct page * =0x5dd5170c
[  305.124254] populate_section_memmap: alloc_pages() returned 0x56164d26 (should be 0), reverting to vmalloc(memmap_size=655360)
[  305.124254] populate_section_memmap: vmalloc(655360) returned 0xf8ef699a
[  305.124254] populate_section_memmap: returning struct page * =0xf8ef699a
[  305.124256] hv_balloon: hv_mem_hot_add: add_memory() returned 0
[  305.143791] online_pages: pfn: 60000 - 64000 (zone: HighMem)
[  305.153186] online_pages: pfn: 64000 - 68000 (zone: HighMem)
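
The numbers in this excerpt are consistent with each other, which a short
calculation shows. The sketch below assumes 4 KiB pages and takes the
section size from the online_pages lines (0x2c000 - 0x28000 = 0x4000 pages);
the per-struct-page size is only inferred from the logged memmap_size, and
order_for_size() is a simplified user-space stand-in for get_order().

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT     12		/* 4 KiB pages (assumed) */
#define PAGE_SIZE      (1u << PAGE_SHIFT)
#define SECTION_PAGES  0x4000u		/* from "pfn: 28000 - 2c000" */
#define HA_CHUNK_BYTES 134217728u	/* HA_CHUNK << PAGE_SHIFT from the log */
#define MEMMAP_SIZE    655360u		/* memmap_size from the log */

/* Number of memory sections covered by one hot-add chunk. */
static uint32_t sections_per_chunk(void)
{
	return (HA_CHUNK_BYTES >> PAGE_SHIFT) / SECTION_PAGES;
}

/* Size of one struct page as implied by the logged memmap_size. */
static uint32_t bytes_per_struct_page(void)
{
	assert(MEMMAP_SIZE % SECTION_PAGES == 0);
	return MEMMAP_SIZE / SECTION_PAGES;
}

/* Smallest order such that (PAGE_SIZE << order) >= size. */
static unsigned int order_for_size(uint32_t size)
{
	uint32_t pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
	unsigned int order = 0;

	while ((1u << order) < pages)
		order++;
	return order;
}
```

Each 128MB chunk covers two sections, which is why populate_section_memmap()
logs twice per add_memory() call. Each section's memmap is 40 bytes per page,
and the alloc_pages() attempt is for order 8 (1MB of contiguous pages), so
falling back to vmalloc() under memory pressure is not surprising.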

======================= /proc/zoneinfo before hot-add
Node 0, zone      DMA
  per-node stats
      nr_inactive_anon 12069
      nr_active_anon 11288
      nr_inactive_file 13748
      nr_active_file 17527
      nr_unevictable 6734
      nr_slab_reclaimable 4337
      nr_slab_unreclaimable 8457
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_nodes 2262
      workingset_refault 223120
      workingset_activate 208515
      workingset_restore 137786
      workingset_nodereclaim 707
      nr_anon_pages 26686
      nr_mapped    10129
      nr_file_pages 34688
      nr_dirty     1
      nr_writeback 231
      nr_writeback_temp 0
      nr_shmem     942
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_file_hugepages 0
      nr_file_pmdmapped 0
      nr_anon_transparent_hugepages 0
      nr_unstable  0
      nr_vmscan_write 71210
      nr_vmscan_immediate_reclaim 3265
      nr_dirtied   6588
      nr_written   77555
      nr_kernel_misc_reclaimable 0
  pages free     403
        min      1049
        low      1055
        high     1061
        spanned  4095
        present  3998
        managed  3979
        protection: (0, 357, 357, 357)
      nr_free_pages 403
      nr_zone_inactive_anon 371
      nr_zone_active_anon 321
      nr_zone_inactive_file 544
      nr_zone_active_file 683
      nr_zone_unevictable 164
      nr_zone_write_pending 0
      nr_mlock     164
      nr_page_table_pages 11
      nr_kernel_stack 360
      nr_bounce    0
      nr_zspages   583
      nr_free_cma  0
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
  node_unreclaimable:  0
  start_pfn:           1
Node 0, zone   Normal
  pages free     1400
        min      592
        low      740
        high     888
        spanned  126960
        present  126960
        managed  105057
        protection: (0, 0, 0, 0)
      nr_free_pages 1400
      nr_zone_inactive_anon 11695
      nr_zone_active_anon 10961
      nr_zone_inactive_file 13204
      nr_zone_active_file 16844
      nr_zone_unevictable 6570
      nr_zone_write_pending 235
      nr_mlock     6570
      nr_page_table_pages 514
      nr_kernel_stack 1272
      nr_bounce    0
      nr_zspages   22175
      nr_free_cma  0
  pagesets
    cpu: 0
              count: 44
              high:  186
              batch: 31
  vm stats threshold: 6
  node_unreclaimable:  0
  start_pfn:           4096
Node 0, zone  HighMem
  pages free     0
        min      32
        low      32
        high     32
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 0)
Node 0, zone  Movable
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 0)

============================ /proc/zoneinfo after hot-add
Node 0, zone      DMA
  per-node stats
      nr_inactive_anon 13438
      nr_active_anon 10249
      nr_inactive_file 6955
      nr_active_file 26815
      nr_unevictable 6734
      nr_slab_reclaimable 4442
      nr_slab_unreclaimable 8670
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_nodes 2174
      workingset_refault 635931
      workingset_activate 594855
      workingset_restore 486703
      workingset_nodereclaim 1247
      nr_anon_pages 25862
      nr_mapped    12441
      nr_file_pages 38352
      nr_dirty     8
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     2136
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_file_hugepages 0
      nr_file_pmdmapped 0
      nr_anon_transparent_hugepages 0
      nr_unstable  0
      nr_vmscan_write 123858
      nr_vmscan_immediate_reclaim 12156
      nr_dirtied   7219
      nr_written   130953
      nr_kernel_misc_reclaimable 0
  pages free     1380
        min      23
        low      28
        high     33
        spanned  4095
        present  3998
        managed  3979
        protection: (0, 410, 1306, 1306)
      nr_free_pages 1380
      nr_zone_inactive_anon 27
      nr_zone_active_anon 102
      nr_zone_inactive_file 122
      nr_zone_active_file 238
      nr_zone_unevictable 164
      nr_zone_write_pending 0
      nr_mlock     164
      nr_page_table_pages 13
      nr_kernel_stack 328
      nr_bounce    0
      nr_zspages   660
      nr_free_cma  0
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
  node_unreclaimable:  0
  start_pfn:           1
Node 0, zone   Normal
  pages free     20635
        min      633
        low      791
        high     949
        spanned  126960
        present  126960
        managed  105057
        protection: (0, 0, 7168, 7168)
      nr_free_pages 20635
      nr_zone_inactive_anon 8967
      nr_zone_active_anon 7980
      nr_zone_inactive_file 6309
      nr_zone_active_file 5881
      nr_zone_unevictable 6570
      nr_zone_write_pending 8
      nr_mlock     6570
      nr_page_table_pages 537
      nr_kernel_stack 1176
      nr_bounce    0
      nr_zspages   25936
      nr_free_cma  0
  pagesets
    cpu: 0
              count: 97
              high:  186
              batch: 31
  vm stats threshold: 6
  node_unreclaimable:  0
  start_pfn:           4096
Node 0, zone  HighMem
  pages free     199096
        min      128
        low      473
        high     818
        spanned  262144
        present  262144
        managed  229376
        protection: (0, 0, 0, 0)
      nr_free_pages 199096
      nr_zone_inactive_anon 4444
      nr_zone_active_anon 2167
      nr_zone_inactive_file 524
      nr_zone_active_file 20691
      nr_zone_unevictable 0
      nr_zone_write_pending 0
      nr_mlock     0
      nr_page_table_pages 0
      nr_kernel_stack 0
      nr_bounce    0
      nr_zspages   122
      nr_free_cma  0
  pagesets
    cpu: 0
              count: 67
              high:  378
              batch: 63
  vm stats threshold: 8
  node_unreclaimable:  0
  start_pfn:           163840
Node 0, zone  Movable
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 0)
Comment 39 Taketo Kabe 2020-02-18 06:24:55 UTC
Created attachment 287455 [details]
ha00.patch
Comment 40 Michal Hocko 2020-02-18 08:47:04 UTC
On Tue 18-02-20 15:24:48, kkabe@vega.pgw.jp wrote:
[...]
> Tried out the above patch.
> It seems to be working; no panic, total memory has increased and
> the hot-added memory is added as HIGHMEM.

I was about to post a patch to mark hotplug broken on 32b but it seems
you do care about this setup. Could you describe your usecase please?
Comment 41 Taketo Kabe 2020-02-18 09:19:05 UTC
mhocko@kernel.org said in <20200218084700.GD21113@dhcp22.suse.cz>

>> On Tue 18-02-20 15:24:48, kkabe@vega.pgw.jp wrote:
>> [...]
>> > Tried out the above patch.
>> > It seems to be working; no panic, total memory has increased and
>> > the hot-added memory is added as HIGHMEM.
>> 
>> I was about to post a patch to mark hotplug broken on 32b but it seems
>> you do care about this setup. Could you describe your usecase please?

My usecase is testing out the kernel on Hyper-V before loading it on
a real i686 machine. A Hyper-V machine is faster for shaking out other bugs.
So memory hot-add is not a hard requirement for me,
but having hot-add may be handy for seeing an application's memory
requirement (as the anaconda test revealed).

If we're disabling it, we have to announce it somewhere;
where is appropriate? `modinfo hv_balloon`'s "hot_add" description?
Comment 42 David Hildenbrand 2020-02-18 09:27:05 UTC
On 18.02.20 10:19, kkabe@vega.pgw.jp wrote:
> mhocko@kernel.org said in <20200218084700.GD21113@dhcp22.suse.cz>
> 
>>> On Tue 18-02-20 15:24:48, kkabe@vega.pgw.jp wrote:
>>> [...]
>>>> Tried out the above patch.
>>>> It seems to be working; no panic, total memory has increased and
>>>> the hot-added memory is added as HIGHMEM.
>>>
>>> I was about to post a patch to mark hotplug broken on 32b but it seems
>>> you do care about this setup. Could you describe your usecase please?
> 
> My usecase is testing out the kernel on Hyper-V before loading it on
> real i686 machine. Hyper-V machine is faster to skim out other bugs.
> So memory hot-add is not a must requirement for me,
> but having hot-add may be handy to see the application memory requirement.
> (as in the anaconda test revealed)
> 
> If we're disabling it, we have to announce it somewhere;
> where is appropriate? `modinfo hv_balloon`'s "hot_add" description?
> 

I'd really vote to just disable that. Basic testing of kernels can be
done without memory hotadd. If I am not wrong, doing a "online_movable"
or "online_kernel" from user space could still make us trigger crashes
and would have to be fenced.
Comment 43 Michal Hocko 2020-02-18 10:05:37 UTC
On Tue 18-02-20 18:19:00, kkabe@vega.pgw.jp wrote:
> mhocko@kernel.org said in <20200218084700.GD21113@dhcp22.suse.cz>
> 
> >> On Tue 18-02-20 15:24:48, kkabe@vega.pgw.jp wrote:
> >> [...]
> >> > Tried out the above patch.
> >> > It seems to be working; no panic, total memory has increased and
> >> > the hot-added memory is added as HIGHMEM.
> >> 
> >> I was about to post a patch to mark hotplug broken on 32b but it seems
> >> you do care about this setup. Could you describe your usecase please?
> 
> My usecase is testing out the kernel on Hyper-V before loading it on
> real i686 machine. Hyper-V machine is faster to skim out other bugs.
> So memory hot-add is not a must requirement for me,
> but having hot-add may be handy to see the application memory requirement.
> (as in the anaconda test revealed)

OK, thanks for the clarification. I am not sure that this qualifies
as a sufficient reason to maintain the code though.

> If we're disabling it, we have to announce it somewhere;
> where is appropriate? `modinfo hv_balloon`'s "hot_add" description?

This should behave the same way as when CONFIG_MEMORY_HOTPLUG is
not enabled. And from a very cursory look hv_balloon.c already checks
for the config.

---
From 562f21abeda508f199c34358e50fbaa518cd5ed8 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 18 Feb 2020 08:04:13 +0100
Subject: [PATCH] memory_hotplug: disable the functionality for 32b

Memory hotplug is broken for 32b systems at least since c6f03e2903c9
("mm, memory_hotplug: remove zone restrictions") which has considerably
reworked how memory can be associated with movable/kernel zones. The
same is not really trivial to achieve in 32b where only lowmem is the
kernel zone. While we can work around this immediate problem there are
likely other land mines hidden in other places.

It is also quite dubious that there is a real usecase for the memory
hotplug on 32b in the first place. Low memory is just too small to be
hotplugable (for hot add) and generally unusable for hotremove. Adding
more memory to highmem is also dubious because it would increase the
low mem or vmalloc space pressure for memmaps.

Restrict the functionality to 64b systems. This will help future
development to focus on usecases that have real life application.  We
can remove this restriction in future in presence of a real life usecase
of course but until then make it explicit that hotplug on 32b is broken
and requires a non trivial amount of work to fix.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index ab80933be65f..2d5fe9e92969 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -154,6 +154,7 @@ config MEMORY_HOTPLUG
 	bool "Allow for memory hot-add"
 	depends on SPARSEMEM || X86_64_ACPI_NUMA
 	depends on ARCH_ENABLE_MEMORY_HOTPLUG
+	depends on 64BIT || BROKEN
 
 config MEMORY_HOTPLUG_SPARSE
 	def_bool y
Comment 44 David Hildenbrand 2020-02-18 10:11:58 UTC
On 18.02.20 11:05, Michal Hocko wrote:
> On Tue 18-02-20 18:19:00, kkabe@vega.pgw.jp wrote:
>> mhocko@kernel.org said in <20200218084700.GD21113@dhcp22.suse.cz>
>>
>>>> On Tue 18-02-20 15:24:48, kkabe@vega.pgw.jp wrote:
>>>> [...]
>>>>> Tried out the above patch.
>>>>> It seems to be working; no panic, total memory has increased and
>>>>> the hot-added memory is added as HIGHMEM.
>>>>
>>>> I was about to post a patch to mark hotplug broken on 32b but it seems
>>>> you do care about this setup. Could you describe your usecase please?
>>
>> My usecase is testing out the kernel on Hyper-V before loading it on
>> real i686 machine. Hyper-V machine is faster to skim out other bugs.
>> So memory hot-add is not a must requirement for me,
>> but having hot-add may be handy to see the application memory requirement.
>> (as in the anaconda test revealed)
> 
> OK, thanks for the clarification. I am not sure that this qualifies
> as a sufficient reason to maintain the code though.
> 
>> If we're disabling it, we have to announce it somewhere;
>> where is appropriate? `modinfo hv_balloon`'s "hot_add" description?
> 
> This should behave the same way as when CONFIG_MEMORY_HOTPLUG is
> not enabled. And from a very cursory look hv_balloon.c already checks
> for the config.
> 
> ---
> From 562f21abeda508f199c34358e50fbaa518cd5ed8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Tue, 18 Feb 2020 08:04:13 +0100
> Subject: [PATCH] memory_hotplug: disable the functionality for 32b
> 
> Memory hotplug is broken for 32b systems at least since c6f03e2903c9
> ("mm, memory_hotplug: remove zone restrictions") which has considerably
> reworked how memory can be associated with movable/kernel zones. The
> same is not really trivial to achieve in 32b where only lowmem is the
> kernel zone. While we can work around this immediate problem there are
> likely other land mines hidden in other places.
> 
> It is also quite dubious that there is a real usecase for the memory
> hotplug on 32b in the first place. Low memory is just too small to be
> hotplugable (for hot add) and generally unusable for hotremove. Adding
> more memory to highmem is also dubious because it would increase the
> low mem or vmalloc space pressure for memmaps.
> 
> Restrict the functionality to 64b systems. This will help future
> development to focus on usecases that have real life application.  We
> can remove this restriction in future in presence of a real life usecase
> of course but until then make it explicit that hotplug on 32b is broken
> and requires a non trivial amount of work to fix.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ab80933be65f..2d5fe9e92969 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -154,6 +154,7 @@ config MEMORY_HOTPLUG
>       bool "Allow for memory hot-add"
>       depends on SPARSEMEM || X86_64_ACPI_NUMA
>       depends on ARCH_ENABLE_MEMORY_HOTPLUG
> +     depends on 64BIT || BROKEN
>  
>  config MEMORY_HOTPLUG_SPARSE
>       def_bool y
> 

Acked-by: David Hildenbrand <david@redhat.com>
Comment 45 Baoquan He 2020-02-19 03:23:32 UTC
On 02/18/20 at 11:05am, Michal Hocko wrote:
> On Tue 18-02-20 18:19:00, kkabe@vega.pgw.jp wrote:
> > mhocko@kernel.org said in <20200218084700.GD21113@dhcp22.suse.cz>
> > 
> > >> On Tue 18-02-20 15:24:48, kkabe@vega.pgw.jp wrote:
> > >> [...]
> > >> > Tried out the above patch.
> > >> > It seems to be working; no panic, total memory has increased and
> > >> > the hot-added memory is added as HIGHMEM.
> > >> 
> > >> I was about to post a patch to mark hotplug broken on 32b but it seems
> > >> you do care about this setup. Could you describe your usecase please?
> > 
> > My usecase is testing out the kernel on Hyper-V before loading it on
> > real i686 machine. Hyper-V machine is faster to skim out other bugs.
> > So memory hot-add is not a must requirement for me,
> > but having hot-add may be handy to see the application memory requirement.
> > (as in the anaconda test revealed)
> 
> OK, thanks for the clarification. I am not sure that this qualifies
> as a sufficient reason to maintain the code though.
> 
> > If we're disabling it, we have to announce it somewhere;
> > where is appropriate? `modinfo hv_balloon`'s "hot_add" description?
> 
> This should behave the same way as when CONFIG_MEMORY_HOTPLUG is
> not enabled. And from a very cursory look hv_balloon.c already checks
> for the config.
> 
> ---
> From 562f21abeda508f199c34358e50fbaa518cd5ed8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Tue, 18 Feb 2020 08:04:13 +0100
> Subject: [PATCH] memory_hotplug: disable the functionality for 32b
> 
> Memory hotplug is broken for 32b systems at least since c6f03e2903c9
> ("mm, memory_hotplug: remove zone restrictions") which has considerably
> reworked how memory can be associated with movable/kernel zones. The
> same is not really trivial to achieve in 32b where only lowmem is the
> kernel zone. While we can work around this immediate problem there are
> likely other land mines hidden in other places.
> 
> It is also quite dubious that there is a real usecase for the memory
> hotplug on 32b in the first place. Low memory is just too small to be
> hotplugable (for hot add) and generally unusable for hotremove. Adding
> more memory to highmem is also dubious because it would increase the
> low mem or vmalloc space pressure for memmaps.
> 
> Restrict the functionality to 64b systems. This will help future
> development to focus on usecases that have real life application.  We
> can remove this restriction in future in presence of a real life usecase
> of course but until then make it explicit that hotplug on 32b is broken
> and requires a non trivial amount of work to fix.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

No objection to this, ack.
Acked-by: Baoquan He <bhe@redhat.com>

At least in our distros, we took i386 off our ARCH lists a very long
time ago, hence I personally haven't followed the i386 code for a long
time either. This can save us time when maintaining the mem hotplug
code. Thanks for making this patch.

Thanks
Baoquan

> ---
>  mm/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ab80933be65f..2d5fe9e92969 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -154,6 +154,7 @@ config MEMORY_HOTPLUG
>       bool "Allow for memory hot-add"
>       depends on SPARSEMEM || X86_64_ACPI_NUMA
>       depends on ARCH_ENABLE_MEMORY_HOTPLUG
> +     depends on 64BIT || BROKEN
>  
>  config MEMORY_HOTPLUG_SPARSE
>       def_bool y
> -- 
> 2.24.1
> 
> -- 
> Michal Hocko
> SUSE Labs
>
Comment 46 Baoquan He 2020-02-19 03:39:55 UTC
On 02/18/20 at 03:24pm, kkabe@vega.pgw.jp wrote:
> bhe@redhat.com said in <20200217112054.GA9823@MiWiFi-R3L-srv>
> >> Please try the code below instead and see if it works. However, as David
> >> and Michal said in the other reply, if there is no real use case, we may
> >> not be so eager to support mem hotplug on i386.
> >> 
> >> 
> >> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >> index 475d0d68a32c..9faf47bd026e 100644
> >> --- a/mm/memory_hotplug.c
> >> +++ b/mm/memory_hotplug.c
> >> @@ -715,15 +715,20 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
> >>  {
> >>    struct pglist_data *pgdat = NODE_DATA(nid);
> >>    int zid;
> >> +  enum zone_type default_zone = ZONE_NORMAL;
> >>  
> >> -  for (zid = 0; zid <= ZONE_NORMAL; zid++) {
> >> +#ifdef CONFIG_HIGHMEM
> >> +  default_zone = ZONE_HIGHMEM;
> >> +#endif
> >> +
> >> +  for (zid = 0; zid <= default_zone; zid++) {
> >>            struct zone *zone = &pgdat->node_zones[zid];
> >>  
> >>            if (zone_intersects(zone, start_pfn, nr_pages))
> >>                    return zone;
> >>    }
> >>  
> >> -  return &pgdat->node_zones[ZONE_NORMAL];
> >> +  return &pgdat->node_zones[default_zone];
> >>  }
> >>  
> >>  static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
> >> 
> >> 
> 
> Tried out the above patch.
> It seems to be working; no panic, total memory has increased and
> the hot-added memory is added as HIGHMEM.

> Minimal install of 168MB memory worked, so this time the sample is
> running anaconda installer starting at 512MB.
> Eventually memory was hot-added to around 1.2GB.
> 
> The weird pr_info() from populate_section_memmap() is still remaining
> though...
> 
> 2nd parameter of add_memory() (phys_addr_t, 32bit on non-PAE) is 
> going up to 0x60000000, so drivers/hv/hv_balloon.c:hv_mem_hot_add() may need
> limit check to not overflow 4GB for heavier usage.
> (Yes you should limit it in hypervisor dialog, but default is 1TB)
> 
> 
> Do we need modifications for arch/x86/mm/init_32.c:arch_add_memory()
> so that the hot-added memory is always in highmem area?
> Currently it just >>PAGE_SHIFT given parameters and call generic
> __add_pages().

Hmm, it may not always be hot added into the highmem area; if possible, it
can be added into the movable area. From my point of view, the above change
is enough to make it work.

Sorry, man. i386 is too old; as you see, people are more willing to
deprecate it so that they can focus on 64-bit archs. You can still patch
your kernel with the above code change for a while, but possibly not for
very long.

Thanks
Baoquan
Comment 47 Andrew Morton 2020-02-19 21:46:47 UTC
On Tue, 18 Feb 2020 11:05:32 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> Subject: [PATCH] memory_hotplug: disable the functionality for 32b
> 
> Memory hotplug is broken for 32b systems at least since c6f03e2903c9
> ("mm, memory_hotplug: remove zone restrictions") which has considerably
> reworked how memory can be associated with movable/kernel zones. The
> same is not really trivial to achieve in 32b where only lowmem is the
> kernel zone. While we can work around this immediate problem there are
> likely other land mines hidden in other places.
> 
> It is also quite dubious that there is a real usecase for the memory
> hotplug on 32b in the first place. Low memory is just too small to be
> hotplugable (for hot add) and generally unusable for hotremove. Adding
> more memory to highmem is also dubious because it would increase the
> low mem or vmalloc space pressure for memmaps.
> 
> Restrict the functionality to 64b systems. This will help future
> development to focus on usecases that have real life application.  We
> can remove this restriction in future in presence of a real life usecase
> of course but until then make it explicit that hotplug on 32b is broken
> and requires a non trivial amount of work to fix.

(cc linux-arch)

(and linux-arm-kernel, as ARM is a major 32-bit user)

Does anyone see a problem with disabling memory hotplug on 32-bit builds?