When running with low memory (e.g. an initial 160 MB) under memory pressure on Hyper-V, the kernel panics after about 5 minutes, when the hypervisor requests a memory hot-add.

Related bug for kernel-4.19.95: https://bugzilla.kernel.org/show_bug.cgi?id=206181#c12

[ 302.169238] hv_balloon: Max. dynamic memory size: 1048576 MB
[ 302.367821] BUG: unable to handle page fault for address: 00280000
[ 302.367862] #PF: supervisor write access in kernel mode
[ 302.367900] #PF: error_code(0x0002) - not-present page
[ 302.367933] *pde = 00000000
[ 302.367961] Oops: 0002 [#1] SMP
[ 302.367984] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.5.0-1.el8.i586 #1
[ 302.368030] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012
[ 302.368091] Workqueue: events hot_add_req [hv_balloon]
[ 302.368129] EIP: memset+0xb/0x20
[ 302.368161] Code: f9 01 72 0b 8a 0e 88 0f 8d b4 26 00 00 00 00 8b 45 f0 83 c4 04 5b 5e 5f 5d c3 90 8d 74 26 00 55 89 e5 57 89 c7 53 89 c3 89 d0 <f3> aa 89 d8 5b 5f 5d c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 3e
[ 302.368209] EAX: ffffffff EBX: 00280000 ECX: 000a0000 EDX: ffffffff
[ 302.368236] ESI: 00010000 EDI: 00280000 EBP: c993fdec ESP: c993fde4
[ 302.368253] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010202
[ 302.368275] CR0: 80050033 CR2: 00280000 CR3: 053eb000 CR4: 003406d0
[ 302.368298] Call Trace:
[ 302.368308] page_init_poison+0x1d/0x30
[ 302.368319] sparse_add_section+0x137/0x1ab
[ 302.368356] __add_pages+0x9a/0x100
[ 302.368373] arch_add_memory+0x39/0x40
[ 302.368394] add_memory_resource+0x155/0x200
[ 302.368412] ? irq_work_queue+0x1f/0x30
[ 302.368429] __add_memory+0x7e/0xf0
[ 302.368447] add_memory+0x2c/0x40
[ 302.368462] hot_add_req+0x564/0x5c0 [hv_balloon]
[ 302.368477] process_one_work+0x176/0x310
[ 302.368498] worker_thread+0x39/0x3c0
[ 302.368516] kthread+0xf0/0x110
[ 302.368530] ? rescuer_thread+0x2f0/0x2f0
[ 302.368551] ? kthread_park+0x90/0x90
[ 302.368570] ret_from_fork+0x2e/0x40
[ 302.368588] Modules linked in: rfkill intel_rapl_msr intel_rapl_common crc32_pclmul sg snd_pcm snd_timer snd intel_rapl_perf soundcore pcspkr hv_utils hv_netvsc hv_balloon hyperv_fb i2c_piix4 joydev ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod ata_generic hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw hv_vmbus libata
[ 302.368659] CR2: 0000000000280000
[ 302.368668] ---[ end trace ecd710eeebcc6d97 ]---
[ 302.368690] EIP: memset+0xb/0x20
[ 302.368704] Code: f9 01 72 0b 8a 0e 88 0f 8d b4 26 00 00 00 00 8b 45 f0 83 c4 04 5b 5e 5f 5d c3 90 8d 74 26 00 55 89 e5 57 89 c7 53 89 c3 89 d0 <f3> aa 89 d8 5b 5f 5d c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 3e
[ 302.368769] EAX: ffffffff EBX: 00280000 ECX: 000a0000 EDX: ffffffff
[ 302.368800] ESI: 00010000 EDI: 00280000 EBP: c993fdec ESP: c993fde4
[ 302.368827] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010202
[ 302.368856] CR0: 80050033 CR2: 00280000 CR3: 053eb000 CR4: 003406d0
[ 302.368924] Kernel panic - not syncing: Fatal exception
[ 302.368949] Kernel Offset: 0x2800000 from 0xc1000000 (relocation range: 0xc0000000-0xca7effff)
[ 302.368972] ---[ end Kernel panic - not syncing: Fatal exception ]---
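For readers decoding the "#PF: error_code(0x0002)" line in the oops: the low bits of the x86 page-fault error code say whether the page was present, whether the access was a write, and whether it came from user mode. The following is a minimal userspace sketch of that decoding. The macro, struct and function names (PF_PROT, pf_info, decode_pf, describe_pf) are made up for this illustration and are not the kernel's own definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Architectural x86 page-fault error code bits (names are illustrative). */
#define PF_PROT  (1u << 0)  /* set: protection violation; clear: page not present */
#define PF_WRITE (1u << 1)  /* set: write access; clear: read access */
#define PF_USER  (1u << 2)  /* set: user mode; clear: supervisor mode */
#define PF_RSVD  (1u << 3)  /* set: reserved bit set in a paging structure */
#define PF_INSTR (1u << 4)  /* set: fault on instruction fetch */

struct pf_info {
    bool present;
    bool write;
    bool user;
    bool rsvd;
    bool instr;
};

/* Split an error code into its individual flags. */
static struct pf_info decode_pf(unsigned int code)
{
    struct pf_info info = {
        .present = (code & PF_PROT) != 0,
        .write   = (code & PF_WRITE) != 0,
        .user    = (code & PF_USER) != 0,
        .rsvd    = (code & PF_RSVD) != 0,
        .instr   = (code & PF_INSTR) != 0,
    };
    return info;
}

/* Format a one-line description similar in spirit to the "#PF:" lines. */
static int describe_pf(unsigned int code, char *buf, size_t len)
{
    struct pf_info info = decode_pf(code);

    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s %s access, %s page",
                    info.user ? "user" : "supervisor",
                    info.write ? "write" : "read",
                    info.present ? "present" : "not-present");
}
```

Calling describe_pf(0x0002, buf, sizeof(buf)) describes a supervisor write to a not-present page, which agrees with the two "#PF:" lines in the log.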
Created attachment 287103 [details]
Don't free pages just onlined on memory hot-add

This patch fixed the panic that occurred after 5 minutes.
Created attachment 287165 [details]
Patch to add diags for memset() call

Patch to see the parameters passed in to memset().
The patch in #c1 didn't completely fix the issue. Something is going wrong in the memset() call in page_init_poison().
The patch in https://bugzilla.kernel.org/show_bug.cgi?id=206401#c2 adds printk diagnostics for the memset() call. It still panics after around 10 minutes; the first argument of memset() is printed as expected (it is passed in as %ebx), but the panic dump shows %ebx holding 0x280000, whose source is unknown.

If I write the memset() call inside page_init_poison() as C code (as commented out in the patch in #c2), the panic doesn't happen and the hot-added memory is recognized properly.

[ 302.572775] hv_balloon: Max. dynamic memory size: 1048576 MB
[ 657.870102] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x10000) << PAGE_SHIFT)=0x10000000, (HA_CHUNK << PAGE_SHIFT)=134217728)
[ 657.883219] sparse_add_section: page_init_poison(pfn_to_page(start_pfn=65536)=0xdf4a3955, (sizeof(struct page)=40 * nr_pages)=655360)
[ 657.896495] page_init_poison: poisoning(0xdf4a3955 size=655360)
[ 657.896521] __memset_generic: (0xdf4a3955, -1, 655360)
[ 657.896542] BUG: unable to handle page fault for address: 00280000
[ 657.896588] #PF: supervisor write access in kernel mode
[ 657.896629] #PF: error_code(0x0002) - not-present page
[ 657.896650] *pde = 00000000
[ 657.896669] Oops: 0002 [#1] SMP
[ 657.896682] CPU: 0 PID: 495 Comm: kworker/0:0 Tainted: G E 5.5.0-2.el8.i586 #6
[ 657.896713] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012
[ 657.896750] Workqueue: events hot_add_req [hv_balloon]
[ 657.896769] EIP: memset+0x16/0x21
[ 657.896794] Code: 00 00 00 00 8b 45 f0 83 c4 04 5b 5e 5f 5d c3 90 8d 74 26 00 55 89 e5 57 53 89 c3 83 ec 08 84 d2 0f 85 0f 00 00 00 89 df 89 d0 <f3> aa 8d 65 f8 89 d8 5b 5f 5d c3 0f be c2 51 50 53 68 50 7a 31 c7
[ 657.896860] EAX: ffffffff EBX: 00280000 ECX: 000a0000 EDX: ffffffff
[ 657.896885] ESI: 000a0000 EDI: 00280000 EBP: c9d31dbc ESP: c9d31dac
[ 657.896910] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286
[ 657.896934] CR0: 80050033 CR2: 00280000 CR3: 077ca000 CR4: 003406d0
[ 657.896964] Call Trace:
[ 657.896971] page_init_poison.cold.6+0x21/0x29
[ 657.896983] sparse_add_section+0x168/0x1df
[ 657.896993] __add_pages+0x9a/0x100
[ 657.897003] arch_add_memory+0x39/0x40
[ 657.897013] add_memory_resource+0x155/0x200
[ 657.897024] __add_memory+0x7e/0xf0
[ 657.897033] add_memory+0x2c/0x40
[ 657.897043] hot_add_req+0x564/0x5c0 [hv_balloon]
[ 657.897055] process_one_work+0x176/0x310
[ 657.897065] worker_thread+0x39/0x3c0
[ 657.897074] kthread+0xf0/0x110
[ 657.897083] ? rescuer_thread+0x2f0/0x2f0
[ 657.897094] ? kthread_park+0x90/0x90
[ 657.897104] ret_from_fork+0x2e/0x40
[ 657.897119] Modules linked in: rfkill intel_rapl_msr intel_rapl_common crc32_pclmul snd_pcm snd_timer snd intel_rapl_perf soundcore pcspkr hv_netvsc hyperv_fb sg hv_balloon(E) hv_utils i2c_piix4 joydev ip_tables ext4 mbcache jbd2 sd_mod sr_mod cdrom ata_generic hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc ata_piix hv_vmbus crc32c_intel serio_raw libata
[ 657.897211] CR2: 0000000000280000
[ 657.897225] ---[ end trace ccb1bf2b4a48fd4e ]---
[ 657.897240] EIP: memset+0x16/0x21
[ 657.897253] Code: 00 00 00 00 8b 45 f0 83 c4 04 5b 5e 5f 5d c3 90 8d 74 26 00 55 89 e5 57 53 89 c3 83 ec 08 84 d2 0f 85 0f 00 00 00 89 df 89 d0 <f3> aa 8d 65 f8 89 d8 5b 5f 5d c3 0f be c2 51 50 53 68 50 7a 31 c7
[ 657.897319] EAX: ffffffff EBX: 00280000 ECX: 000a0000 EDX: ffffffff
[ 657.897349] ESI: 000a0000 EDI: 00280000 EBP: c9d31dbc ESP: c9d31dac
[ 657.897368] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286
[ 657.897402] CR0: 80050033 CR2: 00280000 CR3: 077ca000 CR4: 003406d0
[ 657.897418] Kernel panic - not syncing: Fatal exception
[ 657.897426] Kernel Offset: 0x5a00000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff)
[ 657.897477] ---[ end Kernel panic - not syncing: Fatal exception ]---
Created attachment 287259 [details]
use C code for page_init_poison() instead of memset()

not for production

The generated asm code:

c11f8591 <page_init_poison.cold.6>:
        pr_info("%s: poisoning(0x%p size=%u)\n", __func__, p, size);
c11f8591:       52                      push   %edx
c11f8592:       89 d6                   mov    %edx,%esi
c11f8594:       89 c3                   mov    %eax,%ebx
c11f8596:       50                      push   %eax
c11f8597:       68 60 29 87 c1          push   $0xc1872960
c11f859c:       68 d4 e9 ae c1          push   $0xc1aee9d4
c11f85a1:       e8 74 b5 ec ff          call   c10c3b1a <printk>
c11f85a6:       31 c0                   xor    %eax,%eax
c11f85a8:       83 c4 10                add    $0x10,%esp
        for (; size; size--) {
c11f85ab:       39 f0                   cmp    %esi,%eax
c11f85ad:       0f 84 24 fc ff ff       je     c11f81d7 <page_init_poison+0x17>
                *p++ = PAGE_POISON_PATTERN;
c11f85b3:       c6 04 03 ff             movb   $0xff,(%ebx,%eax,1)
c11f85b7:       83 c0 01                add    $0x1,%eax
c11f85ba:       eb ef                   jmp    c11f85ab <page_init_poison.cold.6+0x1a>
I stopped using memset() in page_init_poison() and tried substituting C code for it; after 10 minutes, memory hot-add kicks in and it still tries to write to 0x00280000, which comes out of the blue.
%eax has the proper value, at least on entry to the function, as the pr_info() dump shows.
I can't figure out why %ebx is clobbered to 0x00280000.
This isn't -fPIC code, so GLOBAL_OFFSET_TABLE shouldn't be an issue here.

[ 20.642141] __memset_generic: (0x8df6368b, -1, 4096)
[ 21.696914] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
[ 22.388843] __memset_generic: (0x20d15bd3, -52, 4096)
[ 41.909261] __memset_generic: (0x8b330031, -1, 4096)
[ 41.935897] __memset_generic: (0xb5798dd0, -1, 4096)
[ 302.901147] hv_balloon: Max. dynamic memory size: 1048576 MB
[ 642.922869] __memset_generic: (0xd3993b1f, -1, 4096)
[ 647.684524] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x10000) << PAGE_SHIFT)=0x10000000, (HA_CHUNK << PAGE_SHIFT)=134217728)
[ 647.713291] sparse_add_section: page_init_poison(pfn_to_page(start_pfn=65536)=0xd0c5af0b, (sizeof(struct page)=40 * nr_pages)=655360)
[ 647.713360] page_init_poison: poisoning(0xd0c5af0b size=655360)
[ 647.725456] BUG: unable to handle page fault for address: 00280000
[ 647.725478] #PF: supervisor write access in kernel mode
[ 647.725501] #PF: error_code(0x0002) - not-present page
[ 647.725527] *pde = 00000000
[ 647.725556] Oops: 0002 [#1] SMP
[ 647.725571] CPU: 0 PID: 459 Comm: kworker/0:1 Tainted: G E 5.5.0-2.el8.i586 #23
[ 647.725614] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012
[ 647.725649] Workqueue: events hot_add_req [hv_balloon]
[ 647.725672] EIP: page_init_poison.cold.6+0x22/0x2b
[ 647.725698] Code: b5 ec ff 83 c4 3c c9 c3 52 89 d6 89 c3 50 68 60 29 47 c3 68 d4 e9 6e c3 e8 74 b5 ec ff 31 c0 83 c4 10 39 f0 0f 84 24 fc ff ff <c6> 04 03 ff 83 c0 01 eb ef 8b 43 04 a8 01 0f 85 6f 01 00 00 8b 43
[ 647.725770] EAX: 00000000 EBX: 00280000 ECX: ca400f00 EDX: ca3f6e8c
[ 647.725795] ESI: 000a0000 EDI: 00000004 EBP: c2573ddc ESP: c2573dd4
[ 647.725815] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010287
[ 647.725835] CR0: 80050033 CR2: 00280000 CR3: 08caa000 CR4: 003406d0
[ 647.725853] Call Trace:
[ 647.725862] sparse_add_section+0x168/0x1df
[ 647.725877] __add_pages+0x9a/0x100
[ 647.725895] arch_add_memory+0x39/0x40
[ 647.725909] add_memory_resource+0x155/0x200
[ 647.725927] ? irq_work_queue+0x1f/0x30
[ 647.725937] __add_memory+0x7e/0xf0
[ 647.725956] add_memory+0x2c/0x40
[ 647.725968] hot_add_req+0x564/0x5c0 [hv_balloon]
[ 647.725985] process_one_work+0x176/0x310
[ 647.726004] worker_thread+0x39/0x3c0
[ 647.726030] kthread+0xf0/0x110
[ 647.726043] ? rescuer_thread+0x2f0/0x2f0
[ 647.726072] ? kthread_park+0x90/0x90
[ 647.726101] ret_from_fork+0x2e/0x40
[ 647.726113] Modules linked in: rfkill intel_rapl_msr intel_rapl_common snd_pcm snd_timer snd sg crc32_pclmul soundcore intel_rapl_perf hv_utils pcspkr hv_balloon(E) i2c_piix4 joydev hv_netvsc hyperv_fb ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod ata_generic hid_hyperv hyperv_keyboard hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw libata hv_vmbus
[ 647.726197] CR2: 0000000000280000
[ 647.726230] ---[ end trace 9d9aa98b89d59f21 ]---
[ 647.726287] EIP: page_init_poison.cold.6+0x22/0x2b
[ 647.726336] Code: b5 ec ff 83 c4 3c c9 c3 52 89 d6 89 c3 50 68 60 29 47 c3 68 d4 e9 6e c3 e8 74 b5 ec ff 31 c0 83 c4 10 39 f0 0f 84 24 fc ff ff <c6> 04 03 ff 83 c0 01 eb ef 8b 43 04 a8 01 0f 85 6f 01 00 00 8b 43
[ 647.726393] EAX: 00000000 EBX: 00280000 ECX: ca400f00 EDX: ca3f6e8c
[ 647.726412] ESI: 000a0000 EDI: 00000004 EBP: c2573ddc ESP: c2573dd4
[ 647.726429] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010287
[ 647.726442] CR0: 80050033 CR2: 00280000 CR3: 08caa000 CR4: 003406d0
[ 647.726455] Kernel panic - not syncing: Fatal exception
[ 647.726473] Kernel Offset: 0x1c00000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff)
[ 647.726493] ---[ end Kernel panic - not syncing: Fatal exception ]---
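One piece of arithmetic that may be worth checking against the diagnostics above: the faulting address 0x00280000 equals start_pfn (65536) times sizeof(struct page) (40), and the size 655360 equals 0xa0000, the value seen in %ecx/%esi in the register dumps. That is the result plain pointer arithmetic of the form base + pfn would give if the base were zero. Whether pfn_to_page() really computes from a zero base at this point is only an assumption here; the logs do not prove it. Note also that printk's %p normally prints a hashed value rather than the raw pointer, so the values shown after "poisoning(" may not be the actual addresses. The sketch below only reproduces the arithmetic; the constant and function names are hypothetical, and the numbers are copied from the log, not from kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the diagnostic lines in the log (not kernel definitions). */
#define LOG_START_PFN    0x10000u  /* start_pfn=65536 */
#define LOG_SIZEOF_PAGE  40u       /* sizeof(struct page)=40 */
#define LOG_POISON_SIZE  655360u   /* size passed to page_init_poison() */

/*
 * 32-bit address of element 'pfn' in an array of struct page starting at
 * 'memmap_base'. This is plain array indexing, an assumed simplification
 * of what pfn_to_page() does.
 */
static uint32_t page_addr(uint32_t memmap_base, uint32_t pfn,
                          uint32_t sizeof_page)
{
    return memmap_base + pfn * sizeof_page;
}

/* Number of struct page entries covered by 'bytes' of memmap. */
static uint32_t pages_covered(uint32_t bytes, uint32_t sizeof_page)
{
    assert(sizeof_page != 0);
    return bytes / sizeof_page;
}
```

With a zero base, page_addr(0, LOG_START_PFN, LOG_SIZEOF_PAGE) gives 0x280000, the address in CR2, and pages_covered(LOG_POISON_SIZE, LOG_SIZEOF_PAGE) gives 16384 pages.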
sprinkling couple of "volatile"s made the compiler emit code not using %ebx. Still, -4(%ebp) (first argument of page_init_poison()) seems to be corrupted somewhere. I'm out of idea. I believe printk() doesn't clobber stack beyond their own. [ 17.823414] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist. [ 17.920401] __memset_generic: (0x25102f96, -52, 4096) [ 42.250068] __memset_generic: (0xa93cc8c2, -1, 4096) [ 302.493577] hv_balloon: Max. dynamic memory size: 1048576 MB [ 302.809823] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x10000) << PAGE_SHIFT)=0x10000000, (HA_CHUNK << PAGE_SHIFT)=134217728) [ 302.817265] sparse_add_section: page_init_poison(pfn_to_page(start_pfn=65536)=0x3d1d0f24, (sizeof(struct page)=40 * nr_pages)=655360) [ 302.823824] page_init_poison: poisoning(0x3d1d0f24 size=655360) <<<<<<<correct "page" parameter [ 302.823866] page_init_poison: eax(-0x4(%ebp)) = 0x280000 <<<<<<<corrupt [ 302.823899] BUG: unable to handle page fault for address: 00280000 [ 302.823948] #PF: supervisor write access in kernel mode [ 302.823992] #PF: error_code(0x0002) - not-present page [ 302.824036] *pde = 00000000 [ 302.824059] Oops: 0002 [#1] SMP [ 302.824072] CPU: 0 PID: 209 Comm: kworker/0:3 Tainted: G E 5.5.0-2.el8.i586 #27 [ 302.824095] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012 [ 302.824123] Workqueue: events hot_add_req [hv_balloon] [ 302.824151] EIP: page_init_poison.cold.6+0x44/0x4c [ 302.824169] Code: 8b 45 fc 50 68 60 29 c7 c3 68 f8 e9 ee c3 e8 5c b5 ec ff 8b 55 f8 83 c4 1c 85 d2 0f 84 0c fc ff ff 8b 45 fc 83 ea 01 8d 48 01 <c6> 00 ff 89 4d fc eb e7 8b 43 04 a8 01 0f 85 6f 01 00 00 8b 43 10 [ 302.824215] EAX: 00280000 EBX: 00010000 ECX: 00280001 EDX: 0009ffff [ 302.824234] ESI: c5e00000 EDI: 00000004 EBP: c57f1ddc ESP: c57f1dd4 [ 302.824262] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010216 [ 302.824291] CR0: 80050033 CR2: 00280000 CR3: 07ac6000 
CR4: 003406d0 [ 302.824324] Call Trace: [ 302.824340] sparse_add_section+0x168/0x1df [ 302.824361] __add_pages+0x9a/0x100 [ 302.824376] arch_add_memory+0x39/0x40 [ 302.824393] add_memory_resource+0x155/0x200 [ 302.824407] ? irq_work_queue+0x1f/0x30 [ 302.824420] __add_memory+0x7e/0xf0 [ 302.824437] add_memory+0x2c/0x40 [ 302.824450] hot_add_req+0x564/0x5c0 [hv_balloon] [ 302.824461] process_one_work+0x176/0x310 [ 302.824473] worker_thread+0x39/0x3c0 [ 302.824481] kthread+0xf0/0x110 [ 302.824491] ? rescuer_thread+0x2f0/0x2f0 [ 302.824504] ? kthread_park+0x90/0x90 [ 302.824518] ret_from_fork+0x2e/0x40 [ 302.824525] Modules linked in: rfkill intel_rapl_msr intel_rapl_common crc32_pclmul snd_pcm snd_timer snd intel_rapl_perf soundcore pcspkr hv_netvsc sg hyperv_fb i2c_piix4 hv_balloon(E) hv_utils joydev ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod ata_generic hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw libata hv_vmbus [ 302.824651] CR2: 0000000000280000 [ 302.824671] ---[ end trace fc6908983c929182 ]--- [ 302.824692] EIP: page_init_poison.cold.6+0x44/0x4c [ 302.824714] Code: 8b 45 fc 50 68 60 29 c7 c3 68 f8 e9 ee c3 e8 5c b5 ec ff 8b 55 f8 83 c4 1c 85 d2 0f 84 0c fc ff ff 8b 45 fc 83 ea 01 8d 48 01 <c6> 00 ff 89 4d fc eb e7 8b 43 04 a8 01 0f 85 6f 01 00 00 8b 43 10 [ 302.824788] EAX: 00280000 EBX: 00010000 ECX: 00280001 EDX: 0009ffff [ 302.824817] ESI: c5e00000 EDI: 00000004 EBP: c57f1ddc ESP: c57f1dd4 [ 302.824847] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010216 [ 302.824861] CR0: 80050033 CR2: 00280000 CR3: 07ac6000 CR4: 003406d0 [ 302.824887] Kernel panic - not syncing: Fatal exception [ 302.824902] Kernel Offset: 0x2400000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff) [ 302.824930] ---[ end Kernel panic - not syncing: Fatal exception ]--- 0000003f <page_init_poison.cold.6>: unsigned char * volatile p = (unsigned char *)page; 3f: 89 45 fc mov %eax,-0x4(%ebp) pr_info("%s: poisoning(0x%p 
size=%u)\n", __func__, p, size); 42: 8b 45 fc mov -0x4(%ebp),%eax 45: 52 push %edx 46: 50 push %eax <<<<<<has correct value 47: 68 00 00 00 00 push $0x0 4c: 68 08 01 00 00 push $0x108 51: 89 55 f8 mov %edx,-0x8(%ebp) 54: e8 fc ff ff ff call 55 <page_init_poison.cold.6+0x16> asm volatile ("mov -0x4(%%ebp),%0" : "=mr" (eax)); 59: 8b 45 fc mov -0x4(%ebp),%eax pr_info("%s: eax(-0x4(%%ebp)) = 0x%lx\n", __func__, eax); 5c: 50 push %eax <<<<<<corrupt to 0x280000 5d: 68 00 00 00 00 push $0x0 62: 68 28 01 00 00 push $0x128 67: e8 fc ff ff ff call 68 <page_init_poison.cold.6+0x29> for (; size; size--) { 6c: 8b 55 f8 mov -0x8(%ebp),%edx pr_info("%s: eax(-0x4(%%ebp)) = 0x%lx\n", __func__, eax); 6f: 83 c4 1c add $0x1c,%esp for (; size; size--) { 72: 85 d2 test %edx,%edx 74: 0f 84 14 00 00 00 je 8e <__dump_page.cold.7+0x3> *p++ = PAGE_POISON_PATTERN; 7a: 8b 45 fc mov -0x4(%ebp),%eax for (; size; size--) { 7d: 83 ea 01 sub $0x1,%edx *p++ = PAGE_POISON_PATTERN; 80: 8d 48 01 lea 0x1(%eax),%ecx 83: c6 00 ff movb $0xff,(%eax) 86: 89 4d fc mov %ecx,-0x4(%ebp) 89: eb e7 jmp 72 <page_init_poison.cold.6+0x33>
I suspected speculative execution is doing something wrong, so inserted a serialization opcode (cpuid), but stack value seems to be already corrupt. [ 21.746337] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist. [ 22.653412] __memset_generic: (0x9895d2fb, -52, 4096) [ 23.387076] __memset_generic: (0x73d71684, -1, 4096) [ 42.498008] __memset_generic: (0xcbc859df, -1, 4096) [ 42.507944] __memset_generic: (0xb78a3cd6, -1, 4096) [ 57.065106] __memset_generic: (0x330820c3, -1, 4096) [ 302.573533] hv_balloon: Max. dynamic memory size: 1048576 MB [ 626.396603] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x10000) << PAGE_SHIFT)=0x10000000, (HA_CHUNK << PAGE_SHIFT)=134217728) [ 626.445711] sparse_add_section: page_init_poison(pfn_to_page(start_pfn=65536)=0xa8a43056, (sizeof(struct page)=40 * nr_pages)=655360) [ 626.454676] page_init_poison: poisoning(0xa8a43056 size=655360) <<<<<<correct [ 626.454691] page_init_poison: -0xc(%ebp) = 0x280000 <<<<<<corrupt [ 626.454701] BUG: unable to handle page fault for address: 00280000 [ 626.454713] #PF: supervisor write access in kernel mode [ 626.454723] #PF: error_code(0x0002) - not-present page [ 626.454733] *pde = 00000000 [ 626.454741] Oops: 0002 [#1] SMP [ 626.454750] CPU: 0 PID: 253 Comm: kworker/0:6 Tainted: G E 5.5.0-2.el8.i586 #31 [ 626.454769] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012 [ 626.454786] Workqueue: events hot_add_req [hv_balloon] [ 626.454797] EIP: page_init_poison.cold.6+0x44/0x4c [ 626.454808] Code: c0 0f a2 8b 45 f4 50 68 60 29 07 c3 68 f7 ec 2e c3 e8 49 b5 ec ff 83 c4 1c 85 f6 0f 84 fe fb ff ff 8b 45 f4 83 ee 01 8d 50 01 <c6> 00 ff 89 55 f4 eb e7 8b 43 04 a8 01 0f 85 6f 01 00 00 8b 43 10 [ 626.454839] EAX: 00280000 EBX: 756e6547 ECX: 00000007 EDX: 00280001 [ 626.454850] ESI: 0009ffff EDI: 00000004 EBP: c96dfddc ESP: c96dfdd0 [ 626.454870] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010216 [ 626.454882] 
CR0: 80050033 CR2: 00280000 CR3: 035ca000 CR4: 003406d0 [ 626.454896] Call Trace: [ 626.454903] sparse_add_section+0x168/0x1df [ 626.454914] __add_pages+0x9a/0x100 [ 626.454927] arch_add_memory+0x39/0x40 [ 626.454937] add_memory_resource+0x155/0x200 [ 626.454952] __add_memory+0x7e/0xf0 [ 626.454958] add_memory+0x2c/0x40 [ 626.454964] hot_add_req+0x564/0x5c0 [hv_balloon] [ 626.454976] process_one_work+0x176/0x310 [ 626.454984] worker_thread+0x39/0x3c0 [ 626.454992] kthread+0xf0/0x110 [ 626.454999] ? rescuer_thread+0x2f0/0x2f0 [ 626.455007] ? kthread_park+0x90/0x90 [ 626.455018] ret_from_fork+0x2e/0x40 [ 626.455039] Modules linked in: rfkill intel_rapl_msr intel_rapl_common sg crc32_pclmul snd_pcm snd_timer snd intel_rapl_perf soundcore hv_netvsc pcspkr hv_balloon(E) hv_utils hyperv_fb i2c_piix4 joydev ip_tables ext4 mbcache jbd2 sr_mod cdrom ata_generic sd_mod hid_hyperv hv_storvsc scsi_transport_fc hyperv_keyboard ata_piix hv_vmbus crc32c_intel serio_raw libata [ 626.455120] CR2: 0000000000280000 [ 626.455128] ---[ end trace f0ac4774afc72c0d ]--- [ 626.455143] EIP: page_init_poison.cold.6+0x44/0x4c [ 626.455157] Code: c0 0f a2 8b 45 f4 50 68 60 29 07 c3 68 f7 ec 2e c3 e8 49 b5 ec ff 83 c4 1c 85 f6 0f 84 fe fb ff ff 8b 45 f4 83 ee 01 8d 50 01 <c6> 00 ff 89 55 f4 eb e7 8b 43 04 a8 01 0f 85 6f 01 00 00 8b 43 10 [ 626.455201] EAX: 00280000 EBX: 756e6547 ECX: 00000007 EDX: 00280001 [ 626.455219] ESI: 0009ffff EDI: 00000004 EBP: c96dfddc ESP: c96dfdd0 [ 626.455239] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010216 [ 626.455262] CR0: 80050033 CR2: 00280000 CR3: 035ca000 CR4: 003406d0 [ 626.455290] Kernel panic - not syncing: Fatal exception [ 626.455335] Kernel Offset: 0x1800000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff) [ 626.455365] ---[ end Kernel panic - not syncing: Fatal exception ]--- c11f85a1 <page_init_poison.cold.6>: unsigned char * volatile p = (unsigned char *)page; c11f85a1: 89 45 f4 mov %eax,-0xc(%ebp) pr_info("%s: poisoning(0x%p 
size=%u)\n", __func__, p, size); c11f85a4: 8b 45 f4 mov -0xc(%ebp),%eax c11f85a7: 89 d6 mov %edx,%esi c11f85a9: 52 push %edx c11f85aa: 50 push %eax <<<<<<correct c11f85ab: 68 60 29 87 c1 push $0xc1872960 c11f85b0: 68 d8 e9 ae c1 push $0xc1aee9d8 c11f85b5: e8 60 b5 ec ff call c10c3b1a <printk> asm volatile ("cpuid" : : "a" (0) : "ebx","ecx","edx"); /* serialize */ c11f85ba: 31 c0 xor %eax,%eax c11f85bc: 0f a2 cpuid asm volatile ("mov -0xc(%%ebp),%0" : "=r" (eax)); c11f85be: 8b 45 f4 mov -0xc(%ebp),%eax pr_info("%s: -0xc(%%ebp) = 0x%lx\n", __func__, eax); c11f85c1: 50 push %eax <<<<<<<corrupt to 0x280000 c11f85c2: 68 60 29 87 c1 push $0xc1872960 c11f85c7: 68 f7 ec ae c1 push $0xc1aeecf7 c11f85cc: e8 49 b5 ec ff call c10c3b1a <printk> c11f85d1: 83 c4 1c add $0x1c,%esp for (; size; size--) { c11f85d4: 85 f6 test %esi,%esi c11f85d6: 0f 84 fe fb ff ff je c11f81da <page_init_poison+0x1a> *p++ = PAGE_POISON_PATTERN; c11f85dc: 8b 45 f4 mov -0xc(%ebp),%eax for (; size; size--) { c11f85df: 83 ee 01 sub $0x1,%esi *p++ = PAGE_POISON_PATTERN; c11f85e2: 8d 50 01 lea 0x1(%eax),%edx c11f85e5: c6 00 ff movb $0xff,(%eax) c11f85e8: 89 55 f4 mov %edx,-0xc(%ebp) c11f85eb: eb e7 jmp c11f85d4 <page_init_poison.cold.6+0x33>
Please see http://lkml.kernel.org/r/20200210060923.GC8965@MiWiFi-R3L-srv It is hoped that https://lore.kernel.org/linux-mm/20200209104826.3385-7-bhe@redhat.com/ will address this bug. Are you able to test that patch? Thanks.
Applied https://lore.kernel.org/linux-mm/20200209104826.3385-7-bhe@redhat.com/ patch; page_init_poison() seems to be working now, but panics at different place. I suspect related bug for kernel-4.19.95: https://bugzilla.kernel.org/show_bug.cgi?id=206181#c12 And I'm still at loss why the register/stack was corrupted. [ 25.342662] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist. [ 25.421668] __memset_generic: (0x876ab121, -52, 4096) [ 41.549826] __memset_generic: (0xc66e2fca, -1, 4096) [ 302.439480] hv_balloon: Max. dynamic memory size: 1048576 MB [ 302.886990] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x10000) << PAGE_SHIFT)=0x10000000, (HA_CHUNK << PAGE_SHIFT)=134217728) [ 302.896657] sparse_add_section: page_init_poison(pfn_to_page(start_pfn=65536)=0x98f4054f, (sizeof(struct page)=40 * nr_pages)=655360) [ 302.896716] page_init_poison: poisoning(0xfc490b07 size=655360) [ 302.907080] page_init_poison: -0xc(%ebp) = 0xfc490b07 [ 303.182662] sparse_add_section: page_init_poison(pfn_to_page(start_pfn=81920)=0xa4eb0440, (sizeof(struct page)=40 * nr_pages)=655360) [ 303.182696] page_init_poison: poisoning(0x4aa5620d size=655360) [ 303.182720] page_init_poison: -0xc(%ebp) = 0x4aa5620d [ 303.462739] BUG: unable to handle page fault for address: d2bff000 [ 303.462763] #PF: supervisor write access in kernel mode [ 303.462772] #PF: error_code(0x0002) - not-present page [ 303.462787] *pde = 00000000 [ 303.462796] Oops: 0002 [#1] SMP [ 303.462806] CPU: 0 PID: 349 Comm: systemd-udevd Tainted: G E 5.5.0-2.el8.i586 #40 [ 303.462824] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012 [ 303.462843] EIP: wp_page_copy+0x8e/0x750 [ 303.462855] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 a9 de e5 ff 89 45 bc 89 f8 e8 9f de e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29 [ 303.462888] EAX: d2bff000 EBX: c746bf0c ECX: 018a3e00 
EDX: c358c000 [ 303.462907] ESI: c358c000 EDI: d2bff004 EBP: c746bed0 ESP: c746be8c [ 303.462923] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282 [ 303.462938] CR0: 80050033 CR2: d2bff000 CR3: 0947d000 CR4: 003406d0 [ 303.462953] Call Trace: [ 303.462967] ? reuse_swap_page+0x83/0x390 [ 303.462978] do_wp_page+0x87/0x6e0 [ 303.462989] handle_mm_fault+0x808/0xe30 [ 303.463000] ? __sys_recvmsg+0x3c/0x80 [ 303.463010] __do_page_fault+0x18e/0x3d0 [ 303.463022] ? __do_page_fault+0x3d0/0x3d0 [ 303.463033] do_page_fault+0x28/0xd0 [ 303.463043] ? __do_page_fault+0x3d0/0x3d0 [ 303.463068] common_exception_read_cr2+0x15a/0x15f [ 303.463078] EIP: 0xb7b03e44 [ 303.463086] Code: 8d 04 31 89 44 24 10 39 30 0f 85 77 01 00 00 8b 51 08 8b 41 0c 39 4a 0c 0f 85 35 01 00 00 39 48 08 0f 85 2c 01 00 00 89 42 0c <89> 50 08 81 fb ef 03 00 00 76 0b 8b 59 10 85 db 0f 85 da 01 00 00 [ 303.463115] EAX: 018a4688 EBX: 00000141 ECX: 01827258 EDX: b7c2a878 [ 303.463137] ESI: 00000140 EDI: 00000100 EBP: b7c2a7a0 ESP: bfa74a20 [ 303.463148] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00210246 [ 303.463160] Modules linked in: rfkill intel_rapl_msr intel_rapl_common crc32_pclmul intel_rapl_perf snd_pcm snd_timer snd soundcore pcspkr hv_utils hv_netvsc i2c_piix4 hyperv_fb sg hv_balloon(E) joydev ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod ata_generic hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw libata hv_vmbus [ 303.463218] CR2: 00000000d2bff000 [ 303.463230] ---[ end trace 53dd1f0742b34512 ]--- [ 303.463244] EIP: wp_page_copy+0x8e/0x750 [ 303.463257] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 a9 de e5 ff 89 45 bc 89 f8 e8 9f de e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29 [ 303.463288] EAX: d2bff000 EBX: c746bf0c ECX: 018a3e00 EDX: c358c000 [ 303.463304] ESI: c358c000 EDI: d2bff004 EBP: c746bed0 ESP: c746be8c [ 303.463321] DS: 007b ES: 007b FS: 00d8 GS: 00e0 
SS: 0068 EFLAGS: 00210282 [ 303.463337] CR0: 80050033 CR2: d2bff000 CR3: 0947d000 CR4: 003406d0 [ 303.463352] Kernel panic - not syncing: Fatal exception [ 303.463364] Kernel Offset: 0x2c00000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff) [ 303.463383] ---[ end Kernel panic - not syncing: Fatal exception ]---
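As a cross-check of the numbers in this log: the hot-add chunk is 134217728 bytes, and sparse_add_section() is reported twice, for start_pfn 65536 and 81920, each time with nr_pages * 40 = 655360 bytes of memmap. The small sketch below reproduces that arithmetic. It assumes a 4096-byte page size, which is not printed in the log, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size; the log does not print it. */
#define ASSUMED_PAGE_SIZE 4096u

/* Number of whole pages in a region of 'bytes' bytes. */
static uint32_t bytes_to_pages(uint64_t bytes, uint32_t page_size)
{
    assert(page_size != 0);
    return (uint32_t)(bytes / page_size);
}

/* Number of sections needed to cover 'chunk_pages' pages. */
static uint32_t sections_in_chunk(uint32_t chunk_pages,
                                  uint32_t pages_per_section)
{
    assert(pages_per_section != 0);
    return (chunk_pages + pages_per_section - 1) / pages_per_section;
}
```

With these assumptions the chunk is 32768 pages, the two start_pfn values are 16384 pages apart, and the chunk covers two such sections, which matches the two sparse_add_section lines before the fault.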
Created attachment 287293 [details]
mm/hotplug: do not __free_pages_core in generic_online_page

The patch from https://lore.kernel.org/linux-mm/20200209104826.3385-7-bhe@redhat.com/ plus this one seem to fix Hyper-V memory hot-add.
With 2 patches in https://bugzilla.kernel.org/show_bug.cgi?id=206401#c10 , plain text-based installation seems to work, but under severe memory pressure running anaconda installer, it still panics. This time sparse_add_section()->compaction_alloc() is the path, so it seems I've hit a different bug. [ 303.803241] invalid opcode: 0000 [#1] SMP [ 303.803250] CPU: 0 PID: 452 Comm: kworker/0:5 Not tainted 5.5.0-2.el8.i586 #1 [ 303.803261] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012 [ 303.803281] Workqueue: events hot_add_req [hv_balloon] [ 303.803290] EIP: isolate_freepages_block+0x314/0x350 [ 303.803299] Code: 8d 5c c3 d8 66 90 03 75 0c 03 5d cc 39 75 e4 0f 87 89 fd ff ff e9 e3 fd ff ff 8d 74 26 00 ba 50 e0 eb c2 89 d8 e8 8c 57 00 00 <0f> 0b 8d 76 00 8d bc 27 00 00 00 00 89 75 cc c7 45 dc 00 00 00 00 [ 303.803326] EAX: c2eeecd2 EBX: dc5bfd80 ECX: 00000007 EDX: c2ebe050 [ 303.803337] ESI: 0001fff0 EDI: dfb99c78 EBP: dfb99b34 ESP: dfb99af0 [ 303.803353] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010092 [ 303.803365] CR0: 80050033 CR2: 0184f6e4 CR3: 031ca000 CR4: 003406d0 [ 303.803386] Call Trace: [ 303.803396] compaction_alloc+0x862/0x920 [ 303.803405] ? isolate_freepages_block+0x350/0x350 [ 303.803415] migrate_pages+0xc6/0xa20 [ 303.803425] ? __ClearPageMovable+0xa0/0xa0 [ 303.803440] ? isolate_freepages_block+0x350/0x350 [ 303.803448] compact_zone+0x708/0xc50 [ 303.803455] ? __switch_to_asm+0x28/0x50 [ 303.803465] ? __switch_to_asm+0x34/0x50 [ 303.803478] try_to_compact_pages+0x122/0x2e0 [ 303.803493] __alloc_pages_direct_compact+0x6a/0x120 [ 303.803507] __alloc_pages_slowpath+0x39d/0xc10 [ 303.803519] ? update_load_avg+0xac/0x760 [ 303.803528] ? __switch_to_asm+0x34/0x50 [ 303.803539] ? __switch_to_asm+0x34/0x50 [ 303.803551] ? __switch_to_asm+0x28/0x50 [ 303.803561] ? __switch_to_asm+0x34/0x50 [ 303.803569] ? __switch_to_asm+0x28/0x50 [ 303.803577] ? __switch_to_asm+0x34/0x50 [ 303.803587] ? 
__switch_to_asm+0x28/0x50 [ 303.803597] __alloc_pages_nodemask+0x27a/0x2b0 [ 303.803608] ? __switch_to_asm+0x28/0x50 [ 303.803618] populate_section_memmap+0x16/0x4d [ 303.803630] sparse_add_section+0xe5/0x18e [ 303.803641] __add_pages+0x9a/0x100 [ 303.803652] arch_add_memory+0x39/0x40 [ 303.803660] add_memory_resource+0x155/0x200 [ 303.803670] __add_memory+0x7e/0xf0 [ 303.803678] ? boot_override_clock+0x20/0x47 [ 303.803689] add_memory+0x2c/0x40 [ 303.803700] hot_add_req+0x3ae/0x5c0 [hv_balloon] [ 303.803715] process_one_work+0x176/0x310 [ 303.803730] worker_thread+0x39/0x3c0 [ 303.803741] kthread+0xf0/0x110 [ 303.803752] ? rescuer_thread+0x2f0/0x2f0 [ 303.803760] ? kthread_park+0x90/0x90 [ 303.803768] ret_from_fork+0x2e/0x40 [ 303.803782] Modules linked in: vfat fat xfs libfc rfkill zram sg intel_rapl_msr intel_rapl_common intel_rapl_perf pcspkr hv_utils hv_balloon i2c_piix4 joydev ext4 mbcache jbd2 loop nls_utf8 isofs sr_mod cdrom sd_mod ata_generic 8021q garp mrp stp llc hv_netvsc hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc hyperv_fb ata_piix crc32_pclmul serio_raw hv_vmbus libata sunrpc xts lrw dm_crypt dm_round_robin dm_multipath dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_zero dm_mod linear raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_intel raid1 raid0 iscsi_ibft squashfs cramfs be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 libcxgbi libcxgb iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi edd [ 303.803968] ---[ end trace 3a57a22ff51d9741 ]--- [ 303.803994] EIP: isolate_freepages_block+0x314/0x350 [ 303.804019] Code: 8d 5c c3 d8 66 90 03 75 0c 03 5d cc 39 75 e4 0f 87 89 fd ff ff e9 e3 fd ff ff 8d 74 26 00 ba 50 e0 eb c2 89 d8 e8 8c 57 00 00 <0f> 0b 8d 76 00 8d bc 27 00 00 00 00 89 75 cc c7 45 dc 00 00 00 00 [ 303.804073] EAX: c2eeecd2 EBX: dc5bfd80 ECX: 00000007 EDX: c2ebe050 [ 303.804086] ESI: 0001fff0 EDI: dfb99c78 EBP: dfb99b34 ESP: dfb99af0 [ 303.804112] DS: 007b 
ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010092 [ 303.804139] CR0: 80050033 CR2: 0184f6e4 CR3: 031ca000 CR4: 003406d0 [ 303.804169] Kernel panic - not syncing: Fatal exception [ 303.804194] Kernel Offset: 0x1400000 from 0xc1000000 (relocation range: 0xc0000000-0xe07effff) [ 303.804238] ---[ end Kernel panic - not syncing: Fatal exception ]---
On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang <richardw.yang@linux.intel.com> wrote:

> On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
> >On 02/10/20 at 02:09pm, Baoquan He wrote:
> >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
> >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com> wrote:
> >> >
> >> > > Hi Andrew,
> >> > >
> >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
> >> > > > On Tue, 04 Feb 2020 11:25:48 +0000
> bugzilla-daemon@bugzilla.kernel.org wrote:
> >> > > >
> >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> >> > > > >
> >> > > >
> >> > > > An oops during mem hotadd. Could someone please take a look when
> >> > > > convenient?
> >> > >
> >> > > This has been addressed by Wei Yang's patch, please check it here:
> >> > >
> >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> >> > >
> >> >
> >> > hm, OK, thanks. It's unfortunate that a 5.5 fix is buried in a
> >> > six-patch series which is still in progress! Can we please merge that
> >> > as a standalone fix with a cc:stable, Fixes:, etc?
> >
> >Maybe can add Fixes tag as follow when merge:
> >
> >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>
>

The reporter (cc'ed here) is still seeing issues:
https://bugzilla.kernel.org/show_bug.cgi?id=206401

Could we please continue this investigation via emailed reply-to-all,
rather than via the bugzilla interface?
On 02/11/20 at 04:41pm, Andrew Morton wrote:
> On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang <richardw.yang@linux.intel.com>
> wrote:
>
> > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
> > >On 02/10/20 at 02:09pm, Baoquan He wrote:
> > >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
> > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com> wrote:
> > >> >
> > >> > > Hi Andrew,
> > >> > >
> > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
> > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
> > >> > > >
> > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> > >> > > > >
> > >> > > >
> > >> > > > An oops during mem hotadd. Could someone please take a look when
> > >> > > > convenient?
> > >> > >
> > >> > > This has been addressed by Wei Yang's patch, please check it here:
> > >> > >
> > >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> > >> > >
> > >> >
> > >> > hm, OK, thanks. It's unfortunate that a 5.5 fix is buried in a
> > >> > six-patch series which is still in progress! Can we please merge that
> > >> > as a standalone fix with a cc:stable, Fixes:, etc?
> > >
> > >Maybe can add Fixes tag as follow when merge:
> > >
> > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> > >
>
> The reporter (cc'ed here) is still seeing issues:
> https://bugzilla.kernel.org/show_bug.cgi?id=206401
>
> Could we please continue this investigation via emailed reply-to-all,
> rather than via the bugzilla interface?

Yes, people prefer mailing list to discuss issues.

Hi T.Kabe,

Could you provide the call trace again after below patch is applied?
The comment #9 in bugzilla is not very clear to me.

mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com

And, as you said, applying above patch, and do not call
__free_pages_core() in generic_online_page() will work. I doubt it,
because without __free_pages_core(), your added pages are not added
into buddy for managing. I think we should make clear this problem
firstly, in order not to introduce new problem by improper work around,
then check next.

Thanks
Baoquan
On 12.02.20 08:31, Baoquan He wrote:
> On 02/11/20 at 04:41pm, Andrew Morton wrote:
>> On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang <richardw.yang@linux.intel.com>
>> wrote:
>>
>>> On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
>>>> On 02/10/20 at 02:09pm, Baoquan He wrote:
>>>>> On 02/09/20 at 09:56pm, Andrew Morton wrote:
>>>>>> On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> On 02/09/20 at 09:32pm, Andrew Morton wrote:
>>>>>>>> On Tue, 04 Feb 2020 11:25:48 +0000 bugzilla-daemon@bugzilla.kernel.org
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=206401
>>>>>>>>>
>>>>>>>>
>>>>>>>> An oops during mem hotadd. Could someone please take a look when
>>>>>>>> convenient?
>>>>>>>
>>>>>>> This has been addressed by Wei Yang's patch, please check it here:
>>>>>>>
>>>>>>> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
>>>>>>>
>>>>>>
>>>>>> hm, OK, thanks. It's unfortunate that a 5.5 fix is buried in a
>>>>>> six-patch series which is still in progress! Can we please merge that
>>>>>> as a standalone fix with a cc:stable, Fixes:, etc?
>>>>
>>>> Maybe can add Fixes tag as follow when merge:
>>>>
>>>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>>>>
>>
>> The reporter (cc'ed here) is still seeing issues:
>> https://bugzilla.kernel.org/show_bug.cgi?id=206401
>>
>> Could we please continue this investigation via emailed reply-to-all,
>> rather than via the bugzilla interface?
>
> Yes, people prefer mailing list to discuss issues.
>
> Hi T.Kabe,
>
> Could you provide the call trace again after below patch is applied?
> The comment #9 in bugzilla is not very clear to me.
>
> mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
>
> And, as you said, applying above patch, and do not call
> __free_pages_core() in generic_online_page() will work.
I doubt it,
> because without __free_pages_core(), your added pages are not added
> into buddy for managing.

Removing __free_pages_core() from generic_online_page() is just plain
wrong and would break memory hotplug in general. So that is certainly
not the right fix.

HV supports memory sections that are fully added, but only parts of
them are actually backed by the hypervisor, "online" and exposed to the
buddy. When onlining memory, it will online the backed parts via
hv_online_page()->generic_online_page(). When requested to hot add more
memory, the guest will online remaining parts that are now backed via
handle_pg_range()->hv_bring_pgs_online().

So if generic_online_page() fails it's either because

1. HV guest driver has a bug and tries to online something it shouldn't
2. HV hypervisor has a bug and does not back memory properly before
   hot-adding
3. Memory hotplug code has a bug and does not properly add the memory
   block/sections

Please note the switch to using generic_online_page() in

commit 30a9c246b9f6fe0591e8afb05758a3e3b096fabe
Author: David Hildenbrand <david@redhat.com>
Date: Sat Nov 30 17:53:55 2019 -0800

    hv_balloon: use generic_online_page()

    Let's use the generic onlining function - which will now also take care
    of calling kernel_map_pages().

However, the old code ended up calling

__online_page_free() -> __free_reserved_page() -> __free_page()

And the new one ends up calling

__free_pages_core() -> __free_pages()

So I don't think it's related to that. Especially, looking at the kernel
messages, I can see that the kernel crashes when adding memory, not when
onlining it?

So I do think there is still something wrong in the SPARSE hot-add code
if you keep seeing issues.
bhe@redhat.com sed in <20200212073123.GG8965@MiWiFi-R3L-srv>

>> On 02/11/20 at 04:41pm, Andrew Morton wrote:
>> > On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang <richardw.yang@linux.intel.com> wrote:
>> >
>> > > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
>> > > >On 02/10/20 at 02:09pm, Baoquan He wrote:
>> > > >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
>> > > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com> wrote:
>> > > >> >
>> > > >> > > Hi Andrew,
>> > > >> > >
>> > > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
>> > > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
>> > > >> > > >
>> > > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
>> > > >> > > > >
>> > > >> > > >
>> > > >> > > > An oops during mem hotadd. Could someone please take a look when
>> > > >> > > > convenient?
>> > > >> > >
>> > > >> > > This has been addressed by Wei Yang's patch, please check it here:
>> > > >> > >
>> > > >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
>> > > >> > >
>> > > >> >
>> > > >> > hm, OK, thanks. It's unfortunate that a 5.5 fix is buried in a
>> > > >> > six-patch series which is still in progress! Can we please merge that
>> > > >> > as a standalone fix with a cc:stable, Fixes:, etc?
>> > > >
>> > > >Maybe can add Fixes tag as follow when merge:
>> > > >
>> > > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>> > > >
>> >
>> > The reporter (cc'ed here) is still seeing issues:
>> > https://bugzilla.kernel.org/show_bug.cgi?id=206401
>> >
>> > Could we please continue this investigation via emailed reply-to-all,
>> > rather than via the bugzilla interface?
>>
>> Yes, people prefer mailing list to discuss issues.
>>
>> Hi T.Kabe,
>>
>> Could you provide the call trace again after below patch is applied?
>> The comment #9 in bugzilla is not very clear to me.
>>
>> mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
>> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
>>
>> And, as you said, applying above patch, and do not call
>> __free_pages_core() in generic_online_page() will work. I doubt it,
>> because without __free_pages_core(), your added pages are not added
>> into buddy for managing. I think we should make clear this problem
>> firstly, in order not to introduce new problem by improper work around,
>> then check next.
>>
>> Thanks
>> Baoquan

Got it, I restarted off fresh from kernel-5.6-rc1,
applied patch
>> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
and got the following panic.

Diag printk's for add_memory() et al are not there, but I guess
the memory hot-add request from the hypervisor is returning "success",
corrupting something else and bombing out later.


[ 24.289967] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
[ 302.263730] hv_balloon: Max. dynamic memory size: 1048576 MB
[ 635.216014] BUG: unable to handle page fault for address: d13ff000
[ 635.216058] #PF: supervisor write access in kernel mode
[ 635.216076] #PF: error_code(0x0002) - not-present page
[ 635.216106] *pde = 00000000
[ 635.216139] Oops: 0002 [#1] SMP
[ 635.216171] CPU: 0 PID: 470 Comm: systemd-udevd Not tainted 5.6.0-rc1.el8.i586 #1
[ 635.216199] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012
[ 635.216233] EIP: wp_page_copy+0x8e/0x750
[ 635.216253] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
[ 635.216293] EAX: d13ff000 EBX: c3743f28 ECX: 00000000 EDX: c10c9000
[ 635.216314] ESI: c10c9000 EDI: d13ff004 EBP: c3743eec ESP: c3743ea8
[ 635.216336] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
[ 635.216368] CR0: 80050033 CR2: d13ff000 CR3: 03add000 CR4: 003406d0
[ 635.216389] Call Trace:
[ 635.216407] ? reuse_swap_page+0x83/0x390
[ 635.216425] do_wp_page+0x87/0x6e0
[ 635.216438] ? __do_sys_fstat64+0x4a/0x60
[ 635.216453] handle_mm_fault+0x808/0xe30
[ 635.216468] do_page_fault+0x19f/0x4d0
[ 635.216484] ? do_kern_addr_fault+0x80/0x80
[ 635.216500] common_exception_read_cr2+0x15a/0x15f
[ 635.216521] EIP: 0xb7b28104
[ 635.216538] Code: 29 f9 89 4c 24 10 83 f9 0f 0f 86 92 00 00 00 8b 45 40 8d 14 3e 8b 4c 24 0c 39 48 0c 75 74 8b 4c 24 0c 81 7c 24 10 ef 03 00 00 <89> 42 08 89 4a 0c 89 55 40 89 50 0c 76 0e c7 42 10 00 00 00 00 c7
[ 635.216591] EAX: b7c4e7d8 EBX: 000011a0 ECX: b7c4e7d8 EDX: 01994178
[ 635.216606] ESI: 01993168 EDI: 00001010 EBP: b7c4e7a0 ESP: bfcc9f00
[ 635.216628] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00210293
[ 635.216661] Modules linked in: rfkill intel_rapl_msr intel_rapl_common snd_pcm snd_timer snd soundcore crc32_pclmul intel_rapl_perf sg pcspkr hv_netvsc joydev i2c_piix4 hyperv_fb hv_utils hv_balloon ip_tables ext4 mbcache jbd2 sd_mod t10_pi sr_mod cdrom ata_generic hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw hv_vmbus libata
[ 635.216758] CR2: 00000000d13ff000
[ 635.216769] ---[ end trace dee4a93859538102 ]---
[ 635.216785] EIP: wp_page_copy+0x8e/0x750
[ 635.216811] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
[ 635.216847] EAX: d13ff000 EBX: c3743f28 ECX: 00000000 EDX: c10c9000
[ 635.216864] ESI: c10c9000 EDI: d13ff004 EBP: c3743eec ESP: c3743ea8
[ 635.216883] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
[ 635.216899] CR0: 80050033 CR2: d13ff000 CR3: 03add000 CR4: 003406d0
[ 635.216914] Kernel panic - not syncing: Fatal exception
[ 635.216926] Kernel Offset: 0x1400000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff)
[ 635.216946] ---[ end Kernel panic - not syncing: Fatal exception ]---
On 02/13/20 at 01:22pm, kabe@vega.pgw.jp wrote:
> bhe@redhat.com sed in <20200212073123.GG8965@MiWiFi-R3L-srv>
>
> >> On 02/11/20 at 04:41pm, Andrew Morton wrote:
> >> > On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang <richardw.yang@linux.intel.com> wrote:
> >> >
> >> > > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
> >> > > >On 02/10/20 at 02:09pm, Baoquan He wrote:
> >> > > >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
> >> > > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com> wrote:
> >> > > >> >
> >> > > >> > > Hi Andrew,
> >> > > >> > >
> >> > > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
> >> > > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
> >> > > >> > > >
> >> > > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> >> > > >> > > > >
> >> > > >> > > >
> >> > > >> > > > An oops during mem hotadd. Could someone please take a look when
> >> > > >> > > > convenient?
> >> > > >> > >
> >> > > >> > > This has been addressed by Wei Yang's patch, please check it here:
> >> > > >> > >
> >> > > >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> >> > > >> > >
> >> > > >> >
> >> > > >> > hm, OK, thanks. It's unfortunate that a 5.5 fix is buried in a
> >> > > >> > six-patch series which is still in progress! Can we please merge that
> >> > > >> > as a standalone fix with a cc:stable, Fixes:, etc?
> >> > > >
> >> > > >Maybe can add Fixes tag as follow when merge:
> >> > > >
> >> > > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> >> > > >
> >> >
> >> > The reporter (cc'ed here) is still seeing issues:
> >> > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> >> >
> >> > Could we please continue this investigation via emailed reply-to-all,
> >> > rather than via the bugzilla interface?
> >>
> >> Yes, people prefer mailing list to discuss issues.
> >>
> >> Hi T.Kabe,
> >>
> >> Could you provide the call trace again after below patch is applied?
> >> The comment #9 in bugzilla is not very clear to me.
> >>
> >> mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
> >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> >>
> >> And, as you said, applying above patch, and do not call
> >> __free_pages_core() in generic_online_page() will work. I doubt it,
> >> because without __free_pages_core(), your added pages are not added
> >> into buddy for managing. I think we should make clear this problem
> >> firstly, in order not to introduce new problem by improper work around,
> >> then check next.
> >>
> >> Thanks
> >> Baoquan
>
> Got it, I restarted off fresh from kernel-5.6-rc1,
> applied patch
> >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> and got the following panic.
>
> Diag printk's for add_memory() et al is not there, but I guess
> memory hot-add request from hypervisor is returning "success",
> corrupting something else and bombing out later.
>
>
> [ 24.289967] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
> [ 302.263730] hv_balloon: Max. dynamic memory size: 1048576 MB
> [ 635.216014] BUG: unable to handle page fault for address: d13ff000
> [ 635.216058] #PF: supervisor write access in kernel mode
> [ 635.216076] #PF: error_code(0x0002) - not-present page
> [ 635.216106] *pde = 00000000

Thanks for the info. What ARCH is your system? Could you attach your
kernel config and paste the output of executing 'readelf /proc/kcore'?

The pmd entry is not filled, I want to check which address range the kernel
is accessing, and please attach the log of dmesg. Probably it's hot added
page area, I guess, since this time the preceding trace is different
from comment #9.

> [ 635.216139] Oops: 0002 [#1] SMP
> [ 635.216171] CPU: 0 PID: 470 Comm: systemd-udevd Not tainted 5.6.0-rc1.el8.i586 #1
> [ 635.216199] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012
> [ 635.216233] EIP: wp_page_copy+0x8e/0x750
> [ 635.216253] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
> [ 635.216293] EAX: d13ff000 EBX: c3743f28 ECX: 00000000 EDX: c10c9000
> [ 635.216314] ESI: c10c9000 EDI: d13ff004 EBP: c3743eec ESP: c3743ea8
> [ 635.216336] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
> [ 635.216368] CR0: 80050033 CR2: d13ff000 CR3: 03add000 CR4: 003406d0
> [ 635.216389] Call Trace:
> [ 635.216407] ? reuse_swap_page+0x83/0x390
> [ 635.216425] do_wp_page+0x87/0x6e0
> [ 635.216438] ? __do_sys_fstat64+0x4a/0x60
> [ 635.216453] handle_mm_fault+0x808/0xe30
> [ 635.216468] do_page_fault+0x19f/0x4d0
> [ 635.216484] ? do_kern_addr_fault+0x80/0x80
> [ 635.216500] common_exception_read_cr2+0x15a/0x15f
> [ 635.216521] EIP: 0xb7b28104
> [ 635.216538] Code: 29 f9 89 4c 24 10 83 f9 0f 0f 86 92 00 00 00 8b 45 40 8d 14 3e 8b 4c 24 0c 39 48 0c 75 74 8b 4c 24 0c 81 7c 24 10 ef 03 00 00 <89> 42 08 89 4a 0c 89 55 40 89 50 0c 76 0e c7 42 10 00 00 00 00 c7
> [ 635.216591] EAX: b7c4e7d8 EBX: 000011a0 ECX: b7c4e7d8 EDX: 01994178
> [ 635.216606] ESI: 01993168 EDI: 00001010 EBP: b7c4e7a0 ESP: bfcc9f00
> [ 635.216628] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00210293
> [ 635.216661] Modules linked in: rfkill intel_rapl_msr intel_rapl_common snd_pcm snd_timer snd soundcore crc32_pclmul intel_rapl_perf sg pcspkr hv_netvsc joydev i2c_piix4 hyperv_fb hv_utils hv_balloon ip_tables ext4 mbcache jbd2 sd_mod t10_pi sr_mod cdrom ata_generic hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw hv_vmbus libata
> [ 635.216758] CR2: 00000000d13ff000
> [ 635.216769] ---[ end trace dee4a93859538102 ]---
> [ 635.216785] EIP: wp_page_copy+0x8e/0x750
> [ 635.216811] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
> [ 635.216847] EAX: d13ff000 EBX: c3743f28 ECX: 00000000 EDX: c10c9000
> [ 635.216864] ESI: c10c9000 EDI: d13ff004 EBP: c3743eec ESP: c3743ea8
> [ 635.216883] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
> [ 635.216899] CR0: 80050033 CR2: d13ff000 CR3: 03add000 CR4: 003406d0
> [ 635.216914] Kernel panic - not syncing: Fatal exception
> [ 635.216926] Kernel Offset: 0x1400000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff)
> [ 635.216946] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> --
> kabe
>
Created attachment 287371: dmesg.txt

bhe@redhat.com sed in <20200213081941.GA19207@MiWiFi-R3L-srv>

>> On 02/13/20 at 01:22pm, kabe@vega.pgw.jp wrote:
>> > bhe@redhat.com sed in <20200212073123.GG8965@MiWiFi-R3L-srv>
>> >
>> > >> On 02/11/20 at 04:41pm, Andrew Morton wrote:
>> > >> > On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang <richardw.yang@linux.intel.com> wrote:
>> > >> >
>> > >> > > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
>> > >> > > >On 02/10/20 at 02:09pm, Baoquan He wrote:
>> > >> > > >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
>> > >> > > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com> wrote:
>> > >> > > >> >
>> > >> > > >> > > Hi Andrew,
>> > >> > > >> > >
>> > >> > > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
>> > >> > > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
>> > >> > > >> > > >
>> > >> > > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
>> > >> > > >> > > > >
>> > >> > > >> > > >
>> > >> > > >> > > > An oops during mem hotadd. Could someone please take a look when
>> > >> > > >> > > > convenient?
>> > >> > > >> > >
>> > >> > > >> > > This has been addressed by Wei Yang's patch, please check it here:
>> > >> > > >> > >
>> > >> > > >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
>> > >> > > >> > >
>> > >> > > >> >
>> > >> > > >> > hm, OK, thanks. It's unfortunate that a 5.5 fix is buried in a
>> > >> > > >> > six-patch series which is still in progress! Can we please merge that
>> > >> > > >> > as a standalone fix with a cc:stable, Fixes:, etc?
>> > >> > > >
>> > >> > > >Maybe can add Fixes tag as follow when merge:
>> > >> > > >
>> > >> > > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>> > >> > > >
>> > >> >
>> > >> > The reporter (cc'ed here) is still seeing issues:
>> > >> > https://bugzilla.kernel.org/show_bug.cgi?id=206401
>> > >> >
>> > >> > Could we please continue this investigation via emailed reply-to-all,
>> > >> > rather than via the bugzilla interface?
>> > >>
>> > >> Yes, people prefer mailing list to discuss issues.
>> > >>
>> > >> Hi T.Kabe,
>> > >>
>> > >> Could you provide the call trace again after below patch is applied?
>> > >> The comment #9 in bugzilla is not very clear to me.
>> > >>
>> > >> mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
>> > >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
>> > >>
>> > >> And, as you said, applying above patch, and do not call
>> > >> __free_pages_core() in generic_online_page() will work. I doubt it,
>> > >> because without __free_pages_core(), your added pages are not added
>> > >> into buddy for managing. I think we should make clear this problem
>> > >> firstly, in order not to introduce new problem by improper work around,
>> > >> then check next.
>> > >>
>> > >> Thanks
>> > >> Baoquan
>> >
>> > Got it, I restarted off fresh from kernel-5.6-rc1,
>> > applied patch
>> > >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
>> > and got the following panic.
>> >
>> > Diag printk's for add_memory() et al is not there, but I guess
>> > memory hot-add request from hypervisor is returning "success",
>> > corrupting something else and bombing out later.
>> >
>> >
>> > [ 24.289967] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
>> > [ 302.263730] hv_balloon: Max. dynamic memory size: 1048576 MB
>> > [ 635.216014] BUG: unable to handle page fault for address: d13ff000
>> > [ 635.216058] #PF: supervisor write access in kernel mode
>> > [ 635.216076] #PF: error_code(0x0002) - not-present page
>> > [ 635.216106] *pde = 00000000
>>
>> Thanks for the info. What ARCH is your system? Could you attach your
>> kernel config and paste the output of executing 'readelf /proc/kcore'?

Arch is i386(i586), non-PAE.

I'll attach the "readelf -a /proc/kcore", dmesg and .config .
The stack trace is different this time also;
it seems to have a slightly different panic trace every time
after handle_mm_fault().

I've temporarily added pr_info() before and after add_memory() in hv_balloon.ko,
so it says it's tainting the kernel.
add_memory() itself is returning 0 (success).

>> The pmd entry is not filled, I want to check which address range the kernel
>> is accessing, and please attach the log of dmesg. Probably it's hot added
>> page area, I guess, since this time the preceding trace is different
>> from comment #9.
>>
>> > [ 635.216139] Oops: 0002 [#1] SMP
>> > [ 635.216171] CPU: 0 PID: 470 Comm: systemd-udevd Not tainted 5.6.0-rc1.el8.i586 #1
>> > [ 635.216199] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012
>> > [ 635.216233] EIP: wp_page_copy+0x8e/0x750
>> > [ 635.216253] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29
>> > [ 635.216293] EAX: d13ff000 EBX: c3743f28 ECX: 00000000 EDX: c10c9000
>> > [ 635.216314] ESI: c10c9000 EDI: d13ff004 EBP: c3743eec ESP: c3743ea8
>> > [ 635.216336] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
>> > [ 635.216368] CR0: 80050033 CR2: d13ff000 CR3: 03add000 CR4: 003406d0
>> > [ 635.216389] Call Trace:
>> > [ 635.216407] ? reuse_swap_page+0x83/0x390
>> > [ 635.216425] do_wp_page+0x87/0x6e0
>> > [ 635.216438] ? __do_sys_fstat64+0x4a/0x60
>> > [ 635.216453] handle_mm_fault+0x808/0xe30
>> > [ 635.216468] do_page_fault+0x19f/0x4d0
>> > [ 635.216484] ? do_kern_addr_fault+0x80/0x80
>> > [ 635.216500] common_exception_read_cr2+0x15a/0x15f
>> > [ 635.216521] EIP: 0xb7b28104
[redacted]
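A rough way to sanity-check the faulting address quoted above is to split it the way non-PAE i386 paging would. This is only a sketch under stated assumptions that do not come from the thread: 4 KiB pages, the default 3G/1G split (PAGE_OFFSET 0xc0000000, which at least matches the start of the relocation range printed in the log), and two-level paging where one page-directory entry covers 4 MiB. The helper name decode_lowmem_address() is made up for illustration. If d13ff000 is a direct-map address, it would correspond to a physical offset of roughly 276 MB, well above the roughly 160 MB of initial memory mentioned in the original report, which would fit the guess that the access lands in a hot-added area whose page-directory entry was never filled (*pde = 00000000).

```python
# Assumptions (NOT taken from the thread): 4 KiB pages, default 3G/1G split,
# classic two-level non-PAE paging (one page-directory entry maps 4 MiB).
PAGE_OFFSET = 0xC0000000
PDE_SHIFT = 22   # 4 MiB per page-directory entry without PAE
PAGE_SHIFT = 12  # 4 KiB pages


def decode_lowmem_address(vaddr: int) -> dict:
    """Split a kernel direct-map virtual address into paging components.

    Hypothetical helper for illustration; it only does arithmetic and
    cannot tell whether the address really belongs to the direct map.
    """
    if vaddr < PAGE_OFFSET:
        raise ValueError("address is below PAGE_OFFSET, not a kernel direct-map address")
    paddr = vaddr - PAGE_OFFSET
    return {
        "pde_index": vaddr >> PDE_SHIFT,             # index into the page directory
        "pte_index": (vaddr >> PAGE_SHIFT) & 0x3FF,  # index into the page table
        "phys_addr": paddr,
        "pfn": paddr >> PAGE_SHIFT,
        "phys_mib": paddr / (1 << 20),
    }


# CR2 value from the 5.6-rc1 oops quoted in this thread
info = decode_lowmem_address(0xD13FF000)
print(f"pde index : {info['pde_index']:#x}")
print(f"pte index : {info['pte_index']:#x}")
print(f"phys addr : {info['phys_addr']:#010x}")
print(f"pfn       : {info['pfn']:#x}")
print(f"phys MiB  : {info['phys_mib']:.2f}")
```

The script only restates the logged CR2 value in other units; it does not show why the mapping is missing, and the conclusion changes if the real PAGE_OFFSET or paging mode differs from the assumptions above.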
Created attachment 287373: panic.txt
Created attachment 287375: kernel-i586.config
Created attachment 287377: readelf-kcore.txt
On 02/14/20 at 11:26pm, kkabe@vega.pgw.jp wrote:
> bhe@redhat.com sed in <20200213081941.GA19207@MiWiFi-R3L-srv>
>
> >> On 02/13/20 at 01:22pm, kabe@vega.pgw.jp wrote:
> >> > bhe@redhat.com sed in <20200212073123.GG8965@MiWiFi-R3L-srv>
> >> >
> >> > >> On 02/11/20 at 04:41pm, Andrew Morton wrote:
> >> > >> > On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang <richardw.yang@linux.intel.com> wrote:
> >> > >> >
> >> > >> > > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
> >> > >> > > >On 02/10/20 at 02:09pm, Baoquan He wrote:
> >> > >> > > >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
> >> > >> > > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com> wrote:
> >> > >> > > >> >
> >> > >> > > >> > > Hi Andrew,
> >> > >> > > >> > >
> >> > >> > > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
> >> > >> > > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
> >> > >> > > >> > > >
> >> > >> > > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> >> > >> > > >> > > > >
> >> > >> > > >> > > >
> >> > >> > > >> > > > An oops during mem hotadd. Could someone please take a look when
> >> > >> > > >> > > > convenient?
> >> > >> > > >> > >
> >> > >> > > >> > > This has been addressed by Wei Yang's patch, please check it here:
> >> > >> > > >> > >
> >> > >> > > >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> >> > >> > > >> > >
> >> > >> > > >> >
> >> > >> > > >> > hm, OK, thanks. It's unfortunate that a 5.5 fix is buried in a
> >> > >> > > >> > six-patch series which is still in progress! Can we please merge that
> >> > >> > > >> > as a standalone fix with a cc:stable, Fixes:, etc?
> >> > >> > > >
> >> > >> > > >Maybe can add Fixes tag as follow when merge:
> >> > >> > > >
> >> > >> > > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> >> > >> > > >
> >> > >> >
> >> > >> > The reporter (cc'ed here) is still seeing issues:
> >> > >> > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> >> > >> >
> >> > >> > Could we please continue this investigation via emailed reply-to-all,
> >> > >> > rather than via the bugzilla interface?
> >> > >>
> >> > >> Yes, people prefer mailing list to discuss issues.
> >> > >>
> >> > >> Hi T.Kabe,
> >> > >>
> >> > >> Could you provide the call trace again after below patch is applied?
> >> > >> The comment #9 in bugzilla is not very clear to me.
> >> > >>
> >> > >> mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
> >> > >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> >> > >>
> >> > >> And, as you said, applying above patch, and do not call
> >> > >> __free_pages_core() in generic_online_page() will work. I doubt it,
> >> > >> because without __free_pages_core(), your added pages are not added
> >> > >> into buddy for managing. I think we should make clear this problem
> >> > >> firstly, in order not to introduce new problem by improper work around,
> >> > >> then check next.
> >> > >>
> >> > >> Thanks
> >> > >> Baoquan
> >> >
> >> > Got it, I restarted off fresh from kernel-5.6-rc1,
> >> > applied patch
> >> > >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> >> > and got the following panic.
> >> >
> >> > Diag printk's for add_memory() et al is not there, but I guess
> >> > memory hot-add request from hypervisor is returning "success",
> >> > corrupting something else and bombing out later.
> >> >
> >> >
> >> > [ 24.289967] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
> >> > [ 302.263730] hv_balloon: Max. dynamic memory size: 1048576 MB
> >> > [ 635.216014] BUG: unable to handle page fault for address: d13ff000
> >> > [ 635.216058] #PF: supervisor write access in kernel mode
> >> > [ 635.216076] #PF: error_code(0x0002) - not-present page
> >> > [ 635.216106] *pde = 00000000
> >>
> >> Thanks for the info. What ARCH is your system? Could you attach your
> >> kernel config and paste the output of executing 'readelf /proc/kcore'?
>
> Arch is i386(i586), non-PAE.
>
> I'll attach the "readelf -a /proc/kcore", dmesg and .config .
> The stack trace is different this time also;
> it seems to have a slightly different panic trace every time
> after handle_mm_fault().

Sorry, I didn't say it clearly. 'readelf -l /proc/kcore' is OK, and the
relevant call trace.

>
> I've temporarily added pr_info() before and after add_memory() in hv_balloon.ko,
> so it says it's tainting the kernel.
> add_memory() itself is returning 0 (success).
>
>
On 02/14/20 at 10:48pm, Baoquan He wrote:
> On 02/14/20 at 11:26pm, kkabe@vega.pgw.jp wrote:
> > bhe@redhat.com sed in <20200213081941.GA19207@MiWiFi-R3L-srv>
> >
> > >> On 02/13/20 at 01:22pm, kabe@vega.pgw.jp wrote:
> > >> > bhe@redhat.com sed in <20200212073123.GG8965@MiWiFi-R3L-srv>
> > >> >
> > >> > >> On 02/11/20 at 04:41pm, Andrew Morton wrote:
> > >> > >> > On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang <richardw.yang@linux.intel.com> wrote:
> > >> > >> >
> > >> > >> > > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
> > >> > >> > > >On 02/10/20 at 02:09pm, Baoquan He wrote:
> > >> > >> > > >> On 02/09/20 at 09:56pm, Andrew Morton wrote:
> > >> > >> > > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com> wrote:
> > >> > >> > > >> >
> > >> > >> > > >> > > Hi Andrew,
> > >> > >> > > >> > >
> > >> > >> > > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote:
> > >> > >> > > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
> > >> > >> > > >> > > >
> > >> > >> > > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> > >> > >> > > >> > > > >
> > >> > >> > > >> > > >
> > >> > >> > > >> > > > An oops during mem hotadd. Could someone please take a look when
> > >> > >> > > >> > > > convenient?
> > >> > >> > > >> > >
> > >> > >> > > >> > > This has been addressed by Wei Yang's patch, please check it here:
> > >> > >> > > >> > >
> > >> > >> > > >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> > >> > >> > > >> > >
> > >> > >> > > >> >
> > >> > >> > > >> > hm, OK, thanks. It's unfortunate that a 5.5 fix is buried in a
> > >> > >> > > >> > six-patch series which is still in progress! Can we please merge that
> > >> > >> > > >> > as a standalone fix with a cc:stable, Fixes:, etc?
> > >> > >> > > >
> > >> > >> > > >Maybe can add Fixes tag as follow when merge:
> > >> > >> > > >
> > >> > >> > > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> > >> > >> > > >
> > >> > >> >
> > >> > >> > The reporter (cc'ed here) is still seeing issues:
> > >> > >> > https://bugzilla.kernel.org/show_bug.cgi?id=206401
> > >> > >> >
> > >> > >> > Could we please continue this investigation via emailed reply-to-all,
> > >> > >> > rather than via the bugzilla interface?
> > >> > >>
> > >> > >> Yes, people prefer mailing list to discuss issues.
> > >> > >>
> > >> > >> Hi T.Kabe,
> > >> > >>
> > >> > >> Could you provide the call trace again after below patch is applied?
> > >> > >> The comment #9 in bugzilla is not very clear to me.
> > >> > >>
> > >> > >> mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
> > >> > >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> > >> > >>
> > >> > >> And, as you said, applying above patch, and do not call
> > >> > >> __free_pages_core() in generic_online_page() will work. I doubt it,
> > >> > >> because without __free_pages_core(), your added pages are not added
> > >> > >> into buddy for managing. I think we should make clear this problem
> > >> > >> firstly, in order not to introduce new problem by improper work around,
> > >> > >> then check next.
> > >> > >>
> > >> > >> Thanks
> > >> > >> Baoquan
> > >> >
> > >> > Got it, I restarted off fresh from kernel-5.6-rc1,
> > >> > applied patch
> > >> > >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com
> > >> > and got the following panic.
> > >> >
> > >> > Diag printk's for add_memory() et al is not there, but I guess
> > >> > memory hot-add request from hypervisor is returning "success",
> > >> > corrupting something else and bombing out later.
> > >> >
> > >> >
> > >> > [ 24.289967] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
> > >> > [ 302.263730] hv_balloon: Max. dynamic memory size: 1048576 MB
> > >> > [ 635.216014] BUG: unable to handle page fault for address: d13ff000
> > >> > [ 635.216058] #PF: supervisor write access in kernel mode
> > >> > [ 635.216076] #PF: error_code(0x0002) - not-present page
> > >> > [ 635.216106] *pde = 00000000
> > >>
> > >> Thanks for the info. What ARCH is your system? Could you attach your
> > >> kernel config and paste the output of executing 'readelf /proc/kcore'?
> >
> > Arch is i386(i586), non-PAE.
> >
> > I'll attach the "readelf -a /proc/kcore", dmesg and .config .
> > The stack trace is different this time also;
> > it seems to have a slightly different panic trace every time
> > after handle_mm_fault().
>
> Sorry, I didn't say it clearly. 'readelf -l /proc/kcore' is OK, and the
> relevant call trace.

No need to provide them, can find them from the 'readelf -a'. Will check
and see if I can find anything. Thanks for the info.

>
> >
> > I've temporarily added pr_info() before and after add_memory() in hv_balloon.ko,
> > so it says it's tainting the kernel.
> > add_memory() itself is returning 0 (success).
> >
> >
>
>
On 02/14/20 at 11:26pm, kkabe@vega.pgw.jp wrote: > bhe@redhat.com sed in <20200213081941.GA19207@MiWiFi-R3L-srv> > > >> On 02/13/20 at 01:22pm, kabe@vega.pgw.jp wrote: > >> > bhe@redhat.com sed in <20200212073123.GG8965@MiWiFi-R3L-srv> > >> > > >> > >> On 02/11/20 at 04:41pm, Andrew Morton wrote: > >> > >> > On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang > <richardw.yang@linux.intel.com> wrote: > >> > >> > > >> > >> > > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote: > >> > >> > > >On 02/10/20 at 02:09pm, Baoquan He wrote: > >> > >> > > >> On 02/09/20 at 09:56pm, Andrew Morton wrote: > >> > >> > > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He > <bhe@redhat.com> wrote: > >> > >> > > >> > > >> > >> > > >> > > Hi Andrew, > >> > >> > > >> > > > >> > >> > > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote: > >> > >> > > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000 > bugzilla-daemon@bugzilla.kernel.org wrote: > >> > >> > > >> > > > > >> > >> > > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401 > >> > >> > > >> > > > > > >> > >> > > >> > > > > >> > >> > > >> > > > An oops during mem hotadd. Could someone please take a > look when > >> > >> > > >> > > > convenient? > >> > >> > > >> > > > >> > >> > > >> > > This has been addressed by Wei Yang's patch, please check > it here: > >> > >> > > >> > > > >> > >> > > >> > > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com > >> > >> > > >> > > > >> > >> > > >> > > >> > >> > > >> > hm, OK, thanks. It's unfortunate that a 5.5 fix is buried > in a > >> > >> > > >> > six-patch series which is still in progress! Can we please > merge that > >> > >> > > >> > as a standalone fix with a cc:stable, Fixes:, etc? 
> >> > >> > > > > >> > >> > > >Maybe can add Fixes tag as follow when merge: > >> > >> > > > > >> > >> > > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section > hotplug") > >> > >> > > > > >> > >> > > >> > >> > The reporter (cc'ed here) is still seeing issues: > >> > >> > https://bugzilla.kernel.org/show_bug.cgi?id=206401 > >> > >> > > >> > >> > Could we please continue this investigation via emailed > reply-to-all, > >> > >> > rather than via the bugzilla interface? > >> > >> > >> > >> Yes, people prefer mailing list to discuss issues. > >> > >> > >> > >> Hi T.Kabe, > >> > >> > >> > >> Could you provide the call trace again after below patch is applied? > >> > >> The comment #9 in bugzilla is not very clear to me. > >> > >> > >> > >> mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM > >> > >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com > >> > >> > >> > >> And, as you said, applying above patch, and do not call > >> > >> __free_pages_core() in generic_online_page() will work. I doubt it, > >> > >> because without __free_pages_core(), your added pages are not added > >> > >> into buddy for managing. I think we should make clear this problem > >> > >> firstly, in order not to introduce new problem by improper work > around, > >> > >> then check next. > >> > >> > >> > >> Thanks > >> > >> Baoquan > >> > > >> > Got it, I restarted off fresh from kernel-5.6-rc1, > >> > applied patch > >> > >> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com > >> > and got the following panic. > >> > > >> > Diag printk's for add_memory() et al is not there, but I guess > >> > memory hot-add request from hypervisor is returning "success", > >> > corrupting something else and bombing out later. > >> > > >> > > >> > [ 24.289967] Not activating Mandatory Access Control as > /sbin/tomoyo-init does not exist. > >> > [ 302.263730] hv_balloon: Max. 
dynamic memory size: 1048576 MB > >> > [ 635.216014] BUG: unable to handle page fault for address: d13ff000 > >> > [ 635.216058] #PF: supervisor write access in kernel mode > >> > [ 635.216076] #PF: error_code(0x0002) - not-present page > >> > [ 635.216106] *pde = 00000000 > >> > >> Thanks for the info. What ARCH is your system? Could you attach your > >> kernel config and paste the output of executing 'readelf /proc/kcore'? > > Arch is i386(i586), non-PAE. Sorry, I roughly went through code, didn't get clue. Not sure if David have idea about it. By the way, may I know why you would like to run i386 guest on Hyper-V? Found people are talking about the 32bit kernel supporting in upstream, below is Linus's point of view. https://lore.kernel.org/linux-fsdevel/CAHk-=wiGbz3oRvAVFtN-whW-d2F-STKsP1MZT4m_VeycAr1_VQ@mail.gmail.com/ > > I'll attach the "readelf -a /proc/kcore", dmesg and .config . > The stack trace is different this time also; > it seems to have slightly difference panic trace every time > after handle_mm_fault(). > > I've temporary added pr_info() before and after add_memory() in hv_baloon.ko, > so it says it's taining the kernel. > add_memory() itself is returning 0 (success). > > > >> The pmd entry is not filled, I want to check which address range the > kernel > >> is acessing, and please attach the log of dmesg. Probably it's hot added > >> page area, I guess, since this time the preceding trace is different > >> with comment #9. 
> >> > >> > [ 635.216139] Oops: 0002 [#1] SMP > >> > [ 635.216171] CPU: 0 PID: 470 Comm: systemd-udevd Not tainted > 5.6.0-rc1.el8.i586 #1 > >> > [ 635.216199] Hardware name: Microsoft Corporation Virtual > Machine/Virtual Machine, BIOS 090006 05/23/2012 > >> > [ 635.216233] EIP: wp_page_copy+0x8e/0x750 > >> > [ 635.216253] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 > e5 ff 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 > <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29 > >> > [ 635.216293] EAX: d13ff000 EBX: c3743f28 ECX: 00000000 EDX: c10c9000 > >> > [ 635.216314] ESI: c10c9000 EDI: d13ff004 EBP: c3743eec ESP: c3743ea8 > >> > [ 635.216336] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: > 00210282 > >> > [ 635.216368] CR0: 80050033 CR2: d13ff000 CR3: 03add000 CR4: 003406d0 > >> > [ 635.216389] Call Trace: > >> > [ 635.216407] ? reuse_swap_page+0x83/0x390 > >> > [ 635.216425] do_wp_page+0x87/0x6e0 > >> > [ 635.216438] ? __do_sys_fstat64+0x4a/0x60 > >> > [ 635.216453] handle_mm_fault+0x808/0xe30 > >> > [ 635.216468] do_page_fault+0x19f/0x4d0 > >> > [ 635.216484] ? do_kern_addr_fault+0x80/0x80 > >> > [ 635.216500] common_exception_read_cr2+0x15a/0x15f > >> > [ 635.216521] EIP: 0xb7b28104
bhe@redhat.com sed in <20200217044850.GD4816@MiWiFi-R3L-srv>

>> Sorry, I roughly went through code, didn't get clue. Not sure if David
>> have idea about it.
>>
>> By the way, may I know why you would like to run i386 guest on Hyper-V?
>>
>> Found people are talking about the 32bit kernel supporting in upstream,
>> below is Linus's point of view.
>>
>> https://lore.kernel.org/linux-fsdevel/CAHk-=wiGbz3oRvAVFtN-whW-d2F-STKsP1MZT4m_VeycAr1_VQ@mail.gmail.com/
>>

<offtopic>
Using Hyper-V for testing out a new kernel is convenient and faster
before testing it out on a real i386 machine.
And I do want bugs squashed; meanwhile "hv_balloon.hot_add=0" will be
a workaround.

I agree HIGHMEM64G (PAE) is going to be a deprecated feature, but I do miss
HIGHMEM4G (needed for 1GB memory support).
</offtopic>
bhe@redhat.com sed in <20200212073123.GG8965@MiWiFi-R3L-srv> >> On 02/11/20 at 04:41pm, Andrew Morton wrote: >> > On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang >> <richardw.yang@linux.intel.com> wrote: >> > >> > > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote: >> > > >On 02/10/20 at 02:09pm, Baoquan He wrote: >> > > >> On 02/09/20 at 09:56pm, Andrew Morton wrote: >> > > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com> >> wrote: >> > > >> > >> > > >> > > Hi Andrew, >> > > >> > > >> > > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote: >> > > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000 >> bugzilla-daemon@bugzilla.kernel.org wrote: >> > > >> > > > >> > > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401 >> > > >> > > > > >> > > >> > > > >> > > >> > > > An oops during mem hotadd. Could someone please take a look >> when >> > > >> > > > convenient? >> > > >> > > >> > > >> > > This has been addressed by Wei Yang's patch, please check it >> here: >> > > >> > > >> > > >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com >> > > >> > > >> > > >> > >> > > >> > hm, OK, thanks. It's unfortunate that a 5.5 fix is buried in a >> > > >> > six-patch series which is still in progress! Can we please merge >> that >> > > >> > as a standalone fix with a cc:stable, Fixes:, etc? >> > > > >> > > >Maybe can add Fixes tag as follow when merge: >> > > > >> > > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") >> > > > >> > >> > The reporter (cc'ed here) is still seeing issues: >> > https://bugzilla.kernel.org/show_bug.cgi?id=206401 >> > >> > Could we please continue this investigation via emailed reply-to-all, >> > rather than via the bugzilla interface? >> >> Yes, people prefer mailing list to discuss issues. I found perplexing behavior in populate_section_memmap(). populate_section_memmap() calls alloc_pages(), and if that fails, falls back to vmalloc(). 
But according to the trace, populate_section_memmap() seems to
throw out the alloc_pages() result and always falls back to vmalloc(),
which could be the wrong area to use.

I sprinkled pr_info() in mm/sparse.c:populate_section_memmap() as below:

===========================================
struct page * __meminit populate_section_memmap(unsigned long pfn,
		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
{
	struct page *page, *ret;
	unsigned long memmap_size = sizeof(struct page) * PAGES_PER_SECTION;

	page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
	if (page) {
		goto got_map_page;
	}
	pr_info("%s: alloc_pages() returned 0x%p (should be 0), reverting to vmalloc(memmap_size=%lu)\n", __func__, page, memmap_size);
	BUG_ON(page != 0);

	ret = vmalloc(memmap_size);
	pr_info("%s: vmalloc(%lu) returned 0x%p\n", __func__, memmap_size, ret);
	if (ret) {
		goto got_map_ptr;
	}

	return NULL;
got_map_page:
	ret = (struct page *)pfn_to_kaddr(page_to_pfn(page));
	pr_info("%s: allocated struct page *page=0x%p\n", __func__, page);
got_map_ptr:

	pr_info("%s: returning struct page * =0x%p\n", __func__, ret);
	return ret;
}
==================================================

and got the following panic.
It even ignores BUG_ON() (perhaps optimized out).

Is this worth investigating?
Disassembly doesn't reveal anything suspicious, but I have a feeling that
I'm looking at disassembly different from what the CPU is seeing.
It's too trivial to be a compiler bug.


==================================================
[root@localhost ~]# readelf -l /proc/kcore

Elf file type is CORE (Core file)
Entry point 0x0
There are 3 program headers, starting at offset 52

Program Headers:
  Type  Offset    VirtAddr   PhysAddr   FileSiz    MemSiz     Flg Align
  NOTE  0x000094  0x00000000 0x00000000 0x01304    0x00000        0
  LOAD  0xaff2000 0xcaff0000 0xffffffff 0x3400e000 0x3400e000 RWE 0x1000
  LOAD  0x002000  0xc0000000 0x00000000 0xa7f0000  0xa7f0000  RWE 0x1000

[ 302.784196] hv_balloon: Max.
dynamic memory size: 1048576 MB [ 643.475080] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x10000) << PAGE_SHIFT)=0x10000000, (HA_CHUNK << PAGE_SHIFT)=134217728) [ 643.513804] populate_section_memmap: alloc_pages() returned 0xb1a7c4b2 (should be 0), reverting to vmalloc(memmap_size=655360) [ 643.513849] populate_section_memmap: vmalloc(655360) returned 0x11b0e715 [ 643.513872] populate_section_memmap: returning struct page * =0x11b0e715 [ 643.525352] populate_section_memmap: alloc_pages() returned 0xb1a7c4b2 (should be 0), reverting to vmalloc(memmap_size=655360) [ 643.536698] populate_section_memmap: vmalloc(655360) returned 0xf2ba6510 [ 643.536722] populate_section_memmap: returning struct page * =0xf2ba6510 [ 643.536749] hv_balloon: hv_mem_hot_add: add_memory() returned 0 [ 645.394458] BUG: unable to handle page fault for address: d13ff000 [ 645.394518] #PF: supervisor write access in kernel mode [ 645.394565] #PF: error_code(0x0002) - not-present page [ 645.394584] *pde = 00000000 [ 645.394601] Oops: 0002 [#1] SMP [ 645.394614] CPU: 0 PID: 361 Comm: systemd-udevd Not tainted 5.6.0-rc1.el8.i586 #1 [ 645.394636] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012 [ 645.394670] EIP: wp_page_copy+0x8e/0x750 [ 645.394690] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29 [ 645.394739] EAX: d13ff000 EBX: c752df28 ECX: 00000000 EDX: c5e0d000 [ 645.394767] ESI: c5e0d000 EDI: d13ff004 EBP: c752deec ESP: c752dea8 [ 645.394790] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282 [ 645.394815] CR0: 80050033 CR2: d13ff000 CR3: 08e5a000 CR4: 003406d0 [ 645.394840] Call Trace: [ 645.394852] ? reuse_swap_page+0x83/0x390 [ 645.394873] do_wp_page+0x87/0x6e0 [ 645.394885] handle_mm_fault+0x808/0xe30 [ 645.394893] do_page_fault+0x19f/0x4d0 [ 645.394901] ? 
do_kern_addr_fault+0x80/0x80 [ 645.394915] common_exception_read_cr2+0x15a/0x15f [ 645.394930] EIP: 0xb7aaf8bb [ 645.394944] Code: 24 0c e3 2c 89 d7 83 e2 03 74 11 7a 04 aa 49 74 1f aa 49 74 1b 83 f2 01 75 02 aa 49 89 ca c1 e9 02 83 e2 03 69 c0 01 01 01 01 <f3> ab 89 d1 f3 aa 8b 44 24 08 5f c3 66 90 66 90 66 90 66 90 90 f3 [ 645.394973] EAX: 00000000 EBX: b7f05f60 ECX: 0000000d EDX: 00000000 [ 645.394988] ESI: 02194db4 EDI: 02194db4 EBP: b7f05db4 ESP: bffed978 [ 645.395003] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00210206 [ 645.395018] Modules linked in: rfkill intel_rapl_msr intel_rapl_common crc32_pclmul snd_pcm snd_timer snd soundcore intel_rapl_perf sg pcspkr hv_netvsc i2c_piix4 hyperv_fb hv_utils hv_balloon joydev ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod t10_pi ata_generic hyperv_keyboard hid_hyperv hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw hv_vmbus libata [ 645.395101] CR2: 00000000d13ff000 [ 645.395121] ---[ end trace 3bb1d66cb8b20841 ]--- [ 645.395144] EIP: wp_page_copy+0x8e/0x750 [ 645.395157] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29 [ 645.395206] EAX: d13ff000 EBX: c752df28 ECX: 00000000 EDX: c5e0d000 [ 645.395235] ESI: c5e0d000 EDI: d13ff004 EBP: c752deec ESP: c752dea8 [ 645.395261] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282 [ 645.395278] CR0: 80050033 CR2: d13ff000 CR3: 08e5a000 CR4: 003406d0 [ 645.395308] Kernel panic - not syncing: Fatal exception [ 645.395329] Kernel Offset: 0x3e00000 from 0xc1000000 (relocation range: 0xc0000000-0xcafeffff) [ 645.395354] ---[ end Kernel panic - not syncing: Fatal exception ]--- ==================================================
On 02/17/20 at 02:46pm, kkabe@vega.pgw.jp wrote: > bhe@redhat.com sed in <20200212073123.GG8965@MiWiFi-R3L-srv> > > >> On 02/11/20 at 04:41pm, Andrew Morton wrote: > >> > On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang > <richardw.yang@linux.intel.com> wrote: > >> > > >> > > On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote: > >> > > >On 02/10/20 at 02:09pm, Baoquan He wrote: > >> > > >> On 02/09/20 at 09:56pm, Andrew Morton wrote: > >> > > >> > On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@redhat.com> > wrote: > >> > > >> > > >> > > >> > > Hi Andrew, > >> > > >> > > > >> > > >> > > On 02/09/20 at 09:32pm, Andrew Morton wrote: > >> > > >> > > > On Tue, 04 Feb 2020 11:25:48 +0000 > bugzilla-daemon@bugzilla.kernel.org wrote: > >> > > >> > > > > >> > > >> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206401 > >> > > >> > > > > > >> > > >> > > > > >> > > >> > > > An oops during mem hotadd. Could someone please take a look > when > >> > > >> > > > convenient? > >> > > >> > > > >> > > >> > > This has been addressed by Wei Yang's patch, please check it > here: > >> > > >> > > > >> > > >> > > http://lkml.kernel.org/r/20200209104826.3385-7-bhe@redhat.com > >> > > >> > > > >> > > >> > > >> > > >> > hm, OK, thanks. It's unfortunate that a 5.5 fix is buried in a > >> > > >> > six-patch series which is still in progress! Can we please merge > that > >> > > >> > as a standalone fix with a cc:stable, Fixes:, etc? > >> > > > > >> > > >Maybe can add Fixes tag as follow when merge: > >> > > > > >> > > >Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") > >> > > > > >> > > >> > The reporter (cc'ed here) is still seeing issues: > >> > https://bugzilla.kernel.org/show_bug.cgi?id=206401 > >> > > >> > Could we please continue this investigation via emailed reply-to-all, > >> > rather than via the bugzilla interface? > >> > >> Yes, people prefer mailing list to discuss issues. > > > I found perplexing behavior in populate_section_memmap(). 
> > populate_section_memmap() calls alloc_pages(), and if that fails, > falls back to vmalloc(). > > But according to the trace, populate_section_memmap() seems to > throw out the alloc_pages() result and always falls back to vmalloc(), > which could be a wrong area to use. > > I sprinkled pr_info() in mm/sparse.c:populate_section_memmap() as below: > > =========================================== > struct page * __meminit populate_section_memmap(unsigned long pfn, > unsigned long nr_pages, int nid, struct vmem_altmap *altmap) > { > struct page *page, *ret; > unsigned long memmap_size = sizeof(struct page) * PAGES_PER_SECTION; > > page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size)); > if (page) { > goto got_map_page; > } > pr_info("%s: alloc_pages() returned 0x%p (should be 0), reverting to > vmalloc(memmap_size=%lu)\n", __func__, page, memmap_size); > BUG_ON(page != 0); > > ret = vmalloc(memmap_size); > pr_info("%s: vmalloc(%lu) returned 0x%p\n", __func__, memmap_size, ret); > if (ret) { > goto got_map_ptr; > } > > return NULL; > got_map_page: > ret = (struct page *)pfn_to_kaddr(page_to_pfn(page)); > pr_info("%s: allocated struct page *page=0x%p\n", __func__, page); > got_map_ptr: > > pr_info("%s: returning struct page * =0x%p\n", __func__, ret); > return ret; > } > ================================================== > > and got a following panic. > It even ignores BUG_ON() (perhaps optimized out). > > Is this worth investigating? > Disassembly doesn't reveal anything suspicious, but I have feeling that > I'm looking at disassembly different than that the CPU is seeing. > It's too trivial to be a compiler bug. 
>
>
> ==================================================
> [root@localhost ~]# readelf -l /proc/kcore
>
> Elf file type is CORE (Core file)
> Entry point 0x0
> There are 3 program headers, starting at offset 52
>
> Program Headers:
>   Type  Offset    VirtAddr   PhysAddr   FileSiz    MemSiz     Flg Align
>   NOTE  0x000094  0x00000000 0x00000000 0x01304    0x00000        0
>   LOAD  0xaff2000 0xcaff0000 0xffffffff 0x3400e000 0x3400e000 RWE 0x1000

This should be the vmalloc area. The region covers
[0xcaff0000, 0xcaff0000+0x3400e000], i.e.
[0xcaff0000, 0xfeffe000].

>   LOAD  0x002000  0xc0000000 0x00000000 0xa7f0000  0xa7f0000  RWE 0x1000

This should be the direct mapping starting from 0xc0000000. It covers the
boot memory you set for the guest kernel, 168M:
[0xc0000000, 0xca7f0000]

Since the system only detects your boot memory, max_pfn is at 168M, and
VMALLOC_START = high_memory + VMALLOC_OFFSET. So any hot-added memory will
be taken as high memory. Sorry, I have forgotten most of the details of
i386; this is just my rough understanding of it.

>
>
> [ 302.784196] hv_balloon: Max. dynamic memory size: 1048576 MB
> [ 643.475080] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x10000) << PAGE_SHIFT)=0x10000000, (HA_CHUNK << PAGE_SHIFT)=134217728)
> [ 643.513804] populate_section_memmap: alloc_pages() returned 0xb1a7c4b2 (should be 0), reverting to vmalloc(memmap_size=655360)

This pr_info is truly weird.

> [ 643.513849] populate_section_memmap: vmalloc(655360) returned 0x11b0e715
> [ 643.513872] populate_section_memmap: returning struct page * =0x11b0e715

But here the returned page address is 0x11b0e715, which is also bizarre.
Kernel addresses are above 3G, right?

> [ 643.525352] populate_section_memmap: alloc_pages() returned 0xb1a7c4b2 (should be 0), reverting to vmalloc(memmap_size=655360)
> [ 643.536698] populate_section_memmap: vmalloc(655360) returned 0xf2ba6510
> [ 643.536722] populate_section_memmap: returning struct page * =0xf2ba6510

Here, the returned page address looks regular.
> [ 643.536749] hv_balloon: hv_mem_hot_add: add_memory() returned 0 > [ 645.394458] BUG: unable to handle page fault for address: d13ff000 > [ 645.394518] #PF: supervisor write access in kernel mode > [ 645.394565] #PF: error_code(0x0002) - not-present page > [ 645.394584] *pde = 00000000 > [ 645.394601] Oops: 0002 [#1] SMP > [ 645.394614] CPU: 0 PID: 361 Comm: systemd-udevd Not tainted > 5.6.0-rc1.el8.i586 #1 > [ 645.394636] Hardware name: Microsoft Corporation Virtual Machine/Virtual > Machine, BIOS 090006 05/23/2012 > [ 645.394670] EIP: wp_page_copy+0x8e/0x750 > [ 645.394690] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff > 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 > 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29 > [ 645.394739] EAX: d13ff000 EBX: c752df28 ECX: 00000000 EDX: c5e0d000 > [ 645.394767] ESI: c5e0d000 EDI: d13ff004 EBP: c752deec ESP: c752dea8 > [ 645.394790] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282 > [ 645.394815] CR0: 80050033 CR2: d13ff000 CR3: 08e5a000 CR4: 003406d0 > [ 645.394840] Call Trace: > [ 645.394852] ? reuse_swap_page+0x83/0x390 > [ 645.394873] do_wp_page+0x87/0x6e0 > [ 645.394885] handle_mm_fault+0x808/0xe30 > [ 645.394893] do_page_fault+0x19f/0x4d0 > [ 645.394901] ? 
do_kern_addr_fault+0x80/0x80 > [ 645.394915] common_exception_read_cr2+0x15a/0x15f > [ 645.394930] EIP: 0xb7aaf8bb > [ 645.394944] Code: 24 0c e3 2c 89 d7 83 e2 03 74 11 7a 04 aa 49 74 1f aa 49 > 74 1b 83 f2 01 75 02 aa 49 89 ca c1 e9 02 83 e2 03 69 c0 01 01 01 01 <f3> ab > 89 d1 f3 aa 8b 44 24 08 5f c3 66 90 66 90 66 90 66 90 90 f3 > [ 645.394973] EAX: 00000000 EBX: b7f05f60 ECX: 0000000d EDX: 00000000 > [ 645.394988] ESI: 02194db4 EDI: 02194db4 EBP: b7f05db4 ESP: bffed978 > [ 645.395003] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00210206 > [ 645.395018] Modules linked in: rfkill intel_rapl_msr intel_rapl_common > crc32_pclmul snd_pcm snd_timer snd soundcore intel_rapl_perf sg pcspkr > hv_netvsc i2c_piix4 hyperv_fb hv_utils hv_balloon joydev ip_tables ext4 > mbcache jbd2 sr_mod cdrom sd_mod t10_pi ata_generic hyperv_keyboard > hid_hyperv hv_storvsc scsi_transport_fc ata_piix crc32c_intel serio_raw > hv_vmbus libata > [ 645.395101] CR2: 00000000d13ff000 > [ 645.395121] ---[ end trace 3bb1d66cb8b20841 ]--- > [ 645.395144] EIP: wp_page_copy+0x8e/0x750 > [ 645.395157] Code: 03 00 00 8b 45 d0 85 c0 0f 84 46 05 00 00 e8 d9 85 e5 ff > 89 45 bc 89 f8 e8 cf 85 e5 ff 8b 55 bc 8d 78 04 8b 0a 83 e7 fc 89 d6 <89> 08 > 8b 8a fc 0f 00 00 89 88 fc 0f 00 00 89 c1 29 f9 89 55 bc 29 > [ 645.395206] EAX: d13ff000 EBX: c752df28 ECX: 00000000 EDX: c5e0d000 > [ 645.395235] ESI: c5e0d000 EDI: d13ff004 EBP: c752deec ESP: c752dea8 > [ 645.395261] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282 > [ 645.395278] CR0: 80050033 CR2: d13ff000 CR3: 08e5a000 CR4: 003406d0 > [ 645.395308] Kernel panic - not syncing: Fatal exception > [ 645.395329] Kernel Offset: 0x3e00000 from 0xc1000000 (relocation range: > 0xc0000000-0xcafeffff) > [ 645.395354] ---[ end Kernel panic - not syncing: Fatal exception ]--- > ================================================== >
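The address arithmetic in the reply above can be checked with a small userspace sketch. The segment values are copied from the quoted `readelf -l /proc/kcore` output; the 8 MiB `VMALLOC_OFFSET_32` gap is an assumption taken from how 32-bit x86 defines `VMALLOC_START`, and all function and macro names here are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the readelf -l /proc/kcore output quoted in the thread. */
#define PAGE_OFFSET_32     0xc0000000u  /* VirtAddr of the direct-map LOAD segment */
#define DIRECT_MAP_SIZE    0x0a7f0000u  /* MemSiz of the direct-map LOAD segment */
#define VMALLOC_SEG_START  0xcaff0000u  /* VirtAddr of the other LOAD segment */
#define VMALLOC_SEG_SIZE   0x3400e000u  /* MemSiz of the other LOAD segment */

/* Assumption: 32-bit x86 leaves an 8 MiB gap between high_memory and VMALLOC_START. */
#define VMALLOC_OFFSET_32  (8u * 1024u * 1024u)

/* End of the direct mapping, i.e. where high_memory would sit for this guest. */
uint32_t high_memory_addr(void)
{
	return PAGE_OFFSET_32 + DIRECT_MAP_SIZE;
}

/* VMALLOC_START = high_memory + VMALLOC_OFFSET, as stated in the reply. */
uint32_t vmalloc_start(void)
{
	return high_memory_addr() + VMALLOC_OFFSET_32;
}

/* End of the vmalloc LOAD segment reported by /proc/kcore. */
uint32_t vmalloc_seg_end(void)
{
	return VMALLOC_SEG_START + VMALLOC_SEG_SIZE;
}

/* Boot-time lowmem in whole MiB (rounded down). */
uint32_t boot_lowmem_mib(void)
{
	return DIRECT_MAP_SIZE >> 20;
}

/* Cross-check: the computed VMALLOC_START should match the kcore segment start. */
void check_kcore_layout(void)
{
	assert(vmalloc_start() == VMALLOC_SEG_START);
}
```

Under these assumptions the direct map ends at 0xca7f0000, adding 8 MiB gives 0xcaff0000, and the vmalloc segment ends at 0xfeffe000, which matches the ranges given in the reply.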
On 17.02.20 06:31, kkabe@vega.pgw.jp wrote: > bhe@redhat.com sed in <20200217044850.GD4816@MiWiFi-R3L-srv> > >>> Sorry, I roughly went through code, didn't get clue. Not sure if David >>> have idea about it. >>> >>> By the way, may I know why you would like to run i386 guest on Hyper-V? >>> >>> Found people are talking about the 32bit kernel supporting in upstream, >>> below is Linus's point of view. >>> >>> https://lore.kernel.org/linux-fsdevel/CAHk-=wiGbz3oRvAVFtN-whW-d2F-STKsP1MZT4m_VeycAr1_VQ@mail.gmail.com/ >>> > > <offtopic> > Using Hyper-V for testing out a new kernel is convenient and faster > before testing it out on a real i386 machine. > And I do want bugs squashed; meanwhile "hv_balloon.hot_add=0" will be > a workaround. > > I agree HIGHMEM64G (PAE) is going to be a deprecated feature, but I do miss > HIGHMEM4G (needed for 1GB memory support). > </offtopic> > Could it be that we are hotplugging highmem, but when onlining memory, it will be onlined to ZONE_NORMAL and not ZONE_HIGHMEM? That could explain why we fail at random points in time, when somebody stumbles over such a page.
On Mon, Feb 17, 2020 at 02:46:27PM +0900, kkabe@vega.pgw.jp wrote:
> ===========================================
> struct page * __meminit populate_section_memmap(unsigned long pfn,
> 		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
> {
> 	struct page *page, *ret;
> 	unsigned long memmap_size = sizeof(struct page) * PAGES_PER_SECTION;
>
> 	page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
> 	if (page) {
> 		goto got_map_page;
> 	}
> 	pr_info("%s: alloc_pages() returned 0x%p (should be 0), reverting to vmalloc(memmap_size=%lu)\n", __func__, page, memmap_size);
> 	BUG_ON(page != 0);
>
> 	ret = vmalloc(memmap_size);
> 	pr_info("%s: vmalloc(%lu) returned 0x%p\n", __func__, memmap_size, ret);
> 	if (ret) {
> 		goto got_map_ptr;
> 	}
>
> 	return NULL;
> got_map_page:
> 	ret = (struct page *)pfn_to_kaddr(page_to_pfn(page));
> 	pr_info("%s: allocated struct page *page=0x%p\n", __func__, page);
> got_map_ptr:
>
> 	pr_info("%s: returning struct page * =0x%p\n", __func__, ret);
> 	return ret;
> }

Could you please replace %p with %px? With the former, pointers are hashed,
so it is trickier to get an overview of the meaning.

David could be right about ZONE_NORMAL vs ZONE_HIGHMEM.
IIUC, default_kernel_zone_for_pfn and default_zone_for_pfn seem to only deal with
(ZONE_DMA,ZONE_NORMAL] or ZONE_MOVABLE.

Although I really fail to see how this could cause the crash.

Could you also please capture /proc/zoneinfo before and after hotplugging memory?

And add this delta on top of your debugging patch?
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0a54ffac8c68..2b9c821d7cf0 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -574,6 +574,7 @@ EXPORT_SYMBOL_GPL(restore_online_page_callback);
 
 void generic_online_page(struct page *page, unsigned int order)
 {
+	pr_info("generic_online_page: page: %px order: %u\n", page, order);
 	kernel_map_pages(page, 1 << order, 1);
 	__free_pages_core(page, order);
 	totalram_pages_add(1UL << order);
@@ -774,6 +775,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
 	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL);
 
+	pr_info("%s: pfn: %lx - %lx (zone: %s)\n", __func__, pfn, pfn + nr_pages, zone->name);
+
 	arg.start_pfn = pfn;
 	arg.nr_pages = nr_pages;
 	node_states_check_changes_online(nr_pages, zone, &arg);
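The observation that default_kernel_zone_for_pfn only considers zones up to ZONE_NORMAL can be illustrated with a userspace model. This is a simplified sketch, not the kernel code: `struct zone_span`, `model_default_kernel_zone()` and `example_hotadd_zone()` are hypothetical names, the fallback to ZONE_NORMAL reflects my reading of mm/memory_hotplug.c around v5.6, and the zone spans are illustrative values loosely derived from the 168M guest in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Zone order as on a 32-bit x86 kernel with highmem (assumed for this model). */
enum zone_id { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE, MAX_NR_ZONES };

/* Hypothetical stand-in for the span fields of struct zone. */
struct zone_span {
	unsigned long start_pfn;
	unsigned long nr_pages;   /* 0 means the zone is empty */
};

/* Model of zone_intersects(): an empty zone never intersects anything. */
static bool zone_intersects(const struct zone_span *z,
			    unsigned long start_pfn, unsigned long nr_pages)
{
	if (z->nr_pages == 0)
		return false;
	if (start_pfn >= z->start_pfn + z->nr_pages ||
	    start_pfn + nr_pages <= z->start_pfn)
		return false;
	return true;
}

/*
 * Model of the unpatched lookup: only ZONE_DMA..ZONE_NORMAL are examined,
 * and ZONE_NORMAL is the fallback when nothing intersects.
 */
enum zone_id model_default_kernel_zone(const struct zone_span zones[MAX_NR_ZONES],
				       unsigned long start_pfn,
				       unsigned long nr_pages)
{
	int zid;

	assert(nr_pages > 0);
	for (zid = 0; zid <= ZONE_NORMAL; zid++) {
		if (zone_intersects(&zones[zid], start_pfn, nr_pages))
			return (enum zone_id)zid;
	}
	return ZONE_NORMAL;
}

/* Illustrative guest: 16M DMA, NORMAL up to pfn 0xa7f0, empty HIGHMEM and MOVABLE. */
enum zone_id example_hotadd_zone(void)
{
	const struct zone_span zones[MAX_NR_ZONES] = {
		[ZONE_DMA]     = { 0x0,    0x1000 },
		[ZONE_NORMAL]  = { 0x1000, 0x97f0 },
		[ZONE_HIGHMEM] = { 0, 0 },
		[ZONE_MOVABLE] = { 0, 0 },
	};

	/* Hot-added range from the hv_balloon log: start_pfn 0x10000, 128M. */
	return model_default_kernel_zone(zones, 0x10000, 0x8000);
}
```

In this model the hot-added range at pfn 0x10000 lands in ZONE_NORMAL purely through the fallback, since it intersects neither of the examined zones.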
On 02/17/20 at 10:34am, Oscar Salvador wrote:
> On Mon, Feb 17, 2020 at 02:46:27PM +0900, kkabe@vega.pgw.jp wrote:
> > ===========================================
> > struct page * __meminit populate_section_memmap(unsigned long pfn,
> > 		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
> > {
> > 	struct page *page, *ret;
> > 	unsigned long memmap_size = sizeof(struct page) * PAGES_PER_SECTION;
> >
> > 	page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
> > 	if (page) {
> > 		goto got_map_page;
> > 	}
> > 	pr_info("%s: alloc_pages() returned 0x%p (should be 0), reverting to vmalloc(memmap_size=%lu)\n", __func__, page, memmap_size);
> > 	BUG_ON(page != 0);
> >
> > 	ret = vmalloc(memmap_size);
> > 	pr_info("%s: vmalloc(%lu) returned 0x%p\n", __func__, memmap_size, ret);
> > 	if (ret) {
> > 		goto got_map_ptr;
> > 	}
> >
> > 	return NULL;
> > got_map_page:
> > 	ret = (struct page *)pfn_to_kaddr(page_to_pfn(page));
> > 	pr_info("%s: allocated struct page *page=0x%p\n", __func__, page);
> > got_map_ptr:
> >
> > 	pr_info("%s: returning struct page * =0x%p\n", __func__, ret);
> > 	return ret;
> > }
>
> Could you please replace %p with %px. Wih the first, pointers are hashed so it is trickier
> to get an overview of the meaning.
>
> David could be right about ZONE_NORMAL vs ZONE_HIGHMEM.
> IIUC, default_kernel_zone_for_pfn and default_zone_for_pfn seem to only deal with
> (ZONE_DMA,ZONE_NORMAL] or ZONE_MOVABLE.

Ah, I think you both have spotted the problem.

On i386, without memory hot-add, normal memory only includes the memory
below 896M, and it is added to the normal zone. The rest is added to the
highmem zone.

How does this influence page allocation?

Hugely. As we know, on i386, normal memory can be accessed with
virt_to_phys, namely PAGE_OFFSET + phys. But highmem has to be accessed
with kmap. However, the later hot-added memory is all put into normal
memory, so accessing it will stumble into the vmalloc area, I would say.
So, i386 doesn't support memory hot-add well. I'm not sure whether the
change below can make it work normally.

We can just adjust the hot-add code as we have done for boot memory:
iterate zones from highmem, if allowed, when hot-adding memory.

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 475d0d68a32c..1380392d9ef5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -716,7 +716,10 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	int zid;
 
-	for (zid = 0; zid <= ZONE_NORMAL; zid++) {
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		if (zid == ZONE_MOVABLE)
+			continue;
+
 		struct zone *zone = &pgdat->node_zones[zid];
 
 		if (zone_intersects(zone, start_pfn, nr_pages))
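The "stumble into the vmalloc area" reasoning can be checked against numbers already posted in this thread. The sketch below is userspace arithmetic only: `lowmem_va()` and `lowmem_pa()` are stand-ins for the kernel's direct-map `__va()`/`__pa()` (phys plus or minus PAGE_OFFSET), and the helper names are hypothetical. The hot-add range comes from the hv_balloon debug line, the vmalloc range from the `readelf -l /proc/kcore` output, and the faulting address from the oops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT_4K      12u
#define PAGE_OFFSET_32     0xc0000000u

/* From the hv_balloon debug line: start_pfn=0x10000, chunk of 134217728 bytes. */
#define HOTADD_START_PFN   0x10000u
#define HOTADD_NR_PAGES    (134217728u >> PAGE_SHIFT_4K)

/* vmalloc segment as reported by readelf -l /proc/kcore in the thread. */
#define VMALLOC_SEG_START  0xcaff0000u
#define VMALLOC_SEG_END    0xfeffe000u

/* Stand-in for the direct-map __va(): only meaningful for real lowmem. */
uint32_t lowmem_va(uint32_t phys)
{
	return phys + PAGE_OFFSET_32;
}

/* Stand-in for the direct-map __pa(). */
uint32_t lowmem_pa(uint32_t virt)
{
	assert(virt >= PAGE_OFFSET_32);
	return virt - PAGE_OFFSET_32;
}

/* Does a virtual address fall inside the vmalloc segment from /proc/kcore? */
bool in_vmalloc_seg(uint32_t virt)
{
	return virt >= VMALLOC_SEG_START && virt < VMALLOC_SEG_END;
}

/* Does a pfn fall inside the hot-added chunk from the hv_balloon log? */
bool pfn_in_hotadd_chunk(uint32_t pfn)
{
	return pfn >= HOTADD_START_PFN &&
	       pfn < HOTADD_START_PFN + HOTADD_NR_PAGES;
}
```

With these values, the first hot-added page would get the direct-map address 0xd0000000, which lies inside the vmalloc segment. Reading the faulting address d13ff000 as a direct-map address gives pfn 0x113ff, which is inside the hot-added chunk. That is consistent with the explanation above, although it does not prove it.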
On 02/17/20 at 06:13pm, Baoquan He wrote:
> On 02/17/20 at 10:34am, Oscar Salvador wrote:
> > On Mon, Feb 17, 2020 at 02:46:27PM +0900, kkabe@vega.pgw.jp wrote:
> > > ===========================================
> > > struct page * __meminit populate_section_memmap(unsigned long pfn,
> > >                 unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
> > > {
> > >         struct page *page, *ret;
> > >         unsigned long memmap_size = sizeof(struct page) * PAGES_PER_SECTION;
> > >
> > >         page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
> > >         if (page) {
> > >                 goto got_map_page;
> > >         }
> > >         pr_info("%s: alloc_pages() returned 0x%p (should be 0), reverting to vmalloc(memmap_size=%lu)\n", __func__, page, memmap_size);
> > >         BUG_ON(page != 0);
> > >
> > >         ret = vmalloc(memmap_size);
> > >         pr_info("%s: vmalloc(%lu) returned 0x%p\n", __func__, memmap_size, ret);
> > >         if (ret) {
> > >                 goto got_map_ptr;
> > >         }
> > >
> > >         return NULL;
> > > got_map_page:
> > >         ret = (struct page *)pfn_to_kaddr(page_to_pfn(page));
> > >         pr_info("%s: allocated struct page *page=0x%p\n", __func__, page);
> > > got_map_ptr:
> > >
> > >         pr_info("%s: returning struct page * =0x%p\n", __func__, ret);
> > >         return ret;
> > > }
> >
> > Could you please replace %p with %px. With the first, pointers are hashed
> > so it is trickier to get an overview of the meaning.
> >
> > David could be right about ZONE_NORMAL vs ZONE_HIGHMEM.
> > IIUC, default_kernel_zone_for_pfn and default_zone_for_pfn seem to only
> > deal with (ZONE_DMA,ZONE_NORMAL] or ZONE_MOVABLE.
>
> Ah, I think you both have spotted the problem.
>
> In i386, without memory hot add, normal memory only includes the memory
> below 896M, which is added into the normal zone. The rest is added into
> the highmem zone.
>
> How does this influence the page allocation?
>
> Hugely. As we know, in i386, normal memory can be accessed with
> virt_to_phys, namely PAGE_OFFSET + phys. But highmem has to be accessed

Hmm, here I mean __pa and __va.

> with kmap. However, the later hot-added memory is all put into normal
> memory, so accessing it will stumble into the vmalloc area, I would say.
>
> So, i386 doesn't support memory hot add well. Not sure if the change below
> can make it work normally.
>
> We can just adjust the hot-add code as we have done for boot memory:
> iterate over the zones up to highmem, if allowed, when hot-adding memory.
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 475d0d68a32c..1380392d9ef5 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -716,7 +716,10 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
>  	struct pglist_data *pgdat = NODE_DATA(nid);
>  	int zid;
>  
> -	for (zid = 0; zid <= ZONE_NORMAL; zid++) {
> +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> +		if (zid == ZONE_MOVABLE)
> +			continue;
> +
>  		struct zone *zone = &pgdat->node_zones[zid];
>  
>  		if (zone_intersects(zone, start_pfn, nr_pages))
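The direct-mapping arithmetic described above can be sketched in user space. This is only an illustration, not kernel code: `PAGE_OFFSET` is assumed to be the default 3G/1G split value, and `direct_map_va()` and `is_direct_mapped()` are hypothetical helpers that mimic what `__va()` computes and what "lowmem" means.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed i386 default (3G/1G split); not taken from the report itself. */
#define PAGE_OFFSET 0xC0000000u

/* Hypothetical stand-in for __va(): virt = PAGE_OFFSET + phys. */
static uint32_t direct_map_va(uint32_t phys)
{
    /* Above 1 GiB the sum would wrap around the 32-bit address space. */
    assert(phys < (UINT32_MAX - PAGE_OFFSET) + 1u);
    return PAGE_OFFSET + phys;
}

/* True only if phys lies inside the lowmem window direct-mapped at boot. */
static bool is_direct_mapped(uint32_t phys, uint32_t lowmem_bytes)
{
    return phys < lowmem_bytes;
}
```

With illustrative values of 512 MiB of boot memory and a hot-added block at physical 0x28000000, `direct_map_va()` yields 0xE8000000 while `is_direct_mapped()` is false: the computed address lies past the end of the direct map, which is the situation described above as stumbling into the vmalloc area.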
On 17.02.20 11:13, Baoquan He wrote:
> On 02/17/20 at 10:34am, Oscar Salvador wrote:
[...]
> So, i386 doesn't support memory hot add well. Not sure if below change
> can make it work normally.
>
> We can just adjust the hot adding code as we have done for boot memory.
> Iterate zone from highmem if allowed when hot add memory.
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 475d0d68a32c..1380392d9ef5 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -716,7 +716,10 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
>  	struct pglist_data *pgdat = NODE_DATA(nid);
>  	int zid;
>  
> -	for (zid = 0; zid <= ZONE_NORMAL; zid++) {
> +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {

ZONE_DEVICE? :/

> +		if (zid == ZONE_MOVABLE)
> +			continue;
> +
>  		struct zone *zone = &pgdat->node_zones[zid];
>  
>  		if (zone_intersects(zone, start_pfn, nr_pages))

What if somebody onlines memory from user space explicitly to the normal
zone? We can trigger crashes?

This doesn't look like it ever worked reliably, can we just disable
memory hotplug in case we have PAE? (especially, as continued i386
support is questionable)
On Fri 14-02-20 23:26:29, kkabe@vega.pgw.jp wrote:
[...]
> [root@localhost ~]# [ 302.391125] hv_balloon: Max. dynamic memory size: 1048576 MB

Is this saying that the system might hotplug up to 1TB of memory on this
32b system?

Btw. the hotplug support on highmem systems is quite likely to be broken
and/or full of corner cases. I seriously doubt this is something anybody
should be running in production without a _lot_ of work.

Is there any real usecase to run HyperV hotplug on 32b system?
On 02/17/20 at 11:24am, David Hildenbrand wrote:
> On 17.02.20 11:13, Baoquan He wrote:
[...]
> > -	for (zid = 0; zid <= ZONE_NORMAL; zid++) {
> > +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>
> ZONE_DEVICE? :/

Not sure if ZONE_DEVICE will be supported on 32-bit systems.

[...]
> What if somebody onlines memory from user space explicitly to the normal
> zone? We can trigger crashes?

Seems the current i386 code doesn't support it, unless we change that
too. If no virtual address space is reserved for it, any memory added
later has to be highmem.

> This doesn't look like it ever worked reliably, can we just disable
> memory hotplug in case we have PAE? (especially, as continued i386
> support is questionable)

This is not PAE, this is only HIGHMEM4G.
On 17.02.20 11:33, Baoquan He wrote:
> On 02/17/20 at 11:24am, David Hildenbrand wrote:
[...]
>> What if somebody onlines memory from user space explicitly to the normal
>> zone? We can trigger crashes?
>
> Seems the current i386 code doesn't support it. Unless we change that
> too. If not reserving virtual address space, later added any memory has
> to be highmem.
>
>> This doesn't look like it ever worked reliably, can we just disable
>> memory hotplug in case we have PAE? (especially, as continued i386
>> support is questionable)
>
> This is not PAE, this is only HIGHMEM4G.

Ah, okay. Anyhow, highmem combined with hotplug seems to be in a
questionable state. I'd vote for disabling it if possible.
On 02/17/20 at 11:38am, David Hildenbrand wrote:
> On 17.02.20 11:33, Baoquan He wrote:
[...]
> >>> So, i386 doesn't support memory hot add well. Not sure if below change
> >>> can make it work normally.

Please try the code below instead and see if it works. However, as David
and Michal said in their replies, if there is no real use case, we may not
be so eager to support mem hotplug on i386.

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 475d0d68a32c..9faf47bd026e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -715,15 +715,20 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	int zid;
+	enum zone_type default_zone = ZONE_NORMAL;
 
-	for (zid = 0; zid <= ZONE_NORMAL; zid++) {
+#ifdef CONFIG_HIGHMEM
+	default_zone = ZONE_HIGHMEM;
+#endif
+
+	for (zid = 0; zid <= default_zone; zid++) {
 		struct zone *zone = &pgdat->node_zones[zid];
 
 		if (zone_intersects(zone, start_pfn, nr_pages))
 			return zone;
 	}
 
-	return &pgdat->node_zones[ZONE_NORMAL];
+	return &pgdat->node_zones[default_zone];
 }
 
 static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
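The effect of the patch above on zone selection can be modeled in user space. The sketch below is a simplified stand-in, not the kernel implementation: `struct zone`, `zone_intersects()` and `pick_kernel_zone()` are reduced re-creations for illustration, and the `highmem` flag plays the role of `CONFIG_HIGHMEM`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced model of the kernel's zone list; the order matters for the loop. */
enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE, MAX_NR_ZONES };

struct zone {
    unsigned long zone_start_pfn;
    unsigned long spanned_pages;   /* 0 means the zone is empty */
};

/* True if [start_pfn, start_pfn + nr_pages) overlaps a non-empty zone. */
static bool zone_intersects(const struct zone *z, unsigned long start_pfn,
                            unsigned long nr_pages)
{
    if (z->spanned_pages == 0)
        return false;
    if (start_pfn + nr_pages <= z->zone_start_pfn)
        return false;
    return start_pfn < z->zone_start_pfn + z->spanned_pages;
}

/* Hypothetical model of the patched default_kernel_zone_for_pfn(). */
static enum zone_type pick_kernel_zone(const struct zone *zones,
                                       unsigned long start_pfn,
                                       unsigned long nr_pages, bool highmem)
{
    enum zone_type default_zone = highmem ? ZONE_HIGHMEM : ZONE_NORMAL;

    assert(zones != NULL);
    for (int zid = 0; zid <= (int)default_zone; zid++)
        if (zone_intersects(&zones[zid], start_pfn, nr_pages))
            return (enum zone_type)zid;
    return default_zone;   /* nothing overlaps: fall back to the default */
}
```

For a layout with DMA at pfn 1 (4095 pages), Normal at pfn 4096 (126960 pages) and an empty HighMem zone, a block at pfn 0x28000 overlaps nothing. The model then returns ZONE_HIGHMEM when `highmem` is set and ZONE_NORMAL otherwise, which is the behavioural difference the patch is after.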
mhocko@kernel.org sed in <20200217103335.GI31531@dhcp22.suse.cz>

>> On Fri 14-02-20 23:26:29, kkabe@vega.pgw.jp wrote:
>> [...]
>> > [root@localhost ~]# [ 302.391125] hv_balloon: Max. dynamic memory size: 1048576 MB
>>
>> Is this saying that the system might hotplug up to 1TB of memory on this
>> 32b system?

Probably. The hypervisor API uses 64-bit values, which is why I added the
add_memory() printk to see if it's overflowing 4GB.
I guess drivers/hv/hv_balloon.c:hv_mem_hot_add() needs a check so that it
does not hot-add memory beyond 4GB on non-PAE systems and so on,
but that's another story.

>> Btw. the hotplug support on highmem systems is quite likely to be broken
>> and/or full of corner cases. I seriously doubt this is something anybody
>> should be running in production without a _lot_ of work.
>>
>> Is there any real usecase to run HyperV hotplug on 32b system?
>> --
>> Michal Hocko
>> SUSE Labs
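The kind of limit check suggested above could look roughly like the following sketch. It is not code from hv_balloon: `hot_add_range_fits()` is a hypothetical helper, and the 4 KiB page size and the choice of 32 physical address bits for a non-PAE kernel are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KiB pages, assumed */

/*
 * Hypothetical helper: true if the page frames
 * [start_pfn, start_pfn + nr_pages) are all addressable
 * with phys_bits bits of physical address.
 */
static bool hot_add_range_fits(uint64_t start_pfn, uint64_t nr_pages,
                               unsigned phys_bits)
{
    uint64_t max_pfn;

    assert(phys_bits > PAGE_SHIFT && phys_bits < 64);
    max_pfn = UINT64_C(1) << (phys_bits - PAGE_SHIFT);

    if (nr_pages == 0 || start_pfn >= max_pfn)
        return false;
    /* Compare against the remaining room to avoid overflow in the sum. */
    return nr_pages <= max_pfn - start_pfn;
}
```

With 32 address bits, a 128 MiB chunk (0x8000 pages) starting at pfn 0x60000 fits, while the same chunk starting at pfn 0xFC000 would cross the 4 GiB boundary and is rejected.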
On Mon 17-02-20 19:20:54, Baoquan He wrote:
[...]
> Please try below code instead, see if it works. However, as David and
> Michal said in other reply, if no real use case, we may not be so
> eager to support mem hotplug on i386.

Yes please. Can we just mark it broken until there is a real usecase?
Convoluting the code even more for something that is not in use is just
adding a maintenance burden, and memory hotplug is seriously
understaffed already.

This is likely a fallout of the hotplug rework (c6f03e2903c9e) from 2
years ago. I cannot really say whether the code worked reasonably before
the rework because I never considered hotplug on 32b to be something to
even try TBH. Mostly because lowmem is unlikely to ever benefit from
hotplug and adding more highmem just makes all the lowmem problems even
worse, so this is dubious in itself.

That being said, I am willing to investigate further if there is a real
usecase for this, but considering that nobody has noticed the breakage in
almost 3 years, I simply suspect that this is not really interesting
and marking it explicitly BROKEN is a better option.
Created attachment 287453 [details]
dmesg

bhe@redhat.com sed in <20200217112054.GA9823@MiWiFi-R3L-srv>

>> On 02/17/20 at 11:38am, David Hildenbrand wrote:
[...]
>> Please try below code instead, see if it works. However, as David and
>> Michal said in other reply, if no real use case, we may not be so
>> eager to support mem hotplug on i386.
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 475d0d68a32c..9faf47bd026e 100644
[...]

Tried out the above patch.
It seems to be working; no panic, total memory has increased and
the hot-added memory is added as HIGHMEM.

I had to back out the first section of Oscar's patch
https://bugzilla.kernel.org/show_bug.cgi?id=206401#c28
since it spams the console too much and bogs down systemd.

A minimal install with 168MB of memory worked, so this time the sample is
running the anaconda installer starting at 512MB.
Eventually memory was hot-added up to around 1.2GB.

The weird pr_info() output from populate_section_memmap() still remains,
though...

The 2nd parameter of add_memory() (phys_addr_t, 32bit on non-PAE)
is going up to 0x60000000, so
drivers/hv/hv_balloon.c:hv_mem_hot_add() may need a limit check so that it
does not overflow 4GB under heavier usage.
(Yes, you should limit it in the hypervisor dialog, but the default is 1TB.)

Do we need modifications to
arch/x86/mm/init_32.c:arch_add_memory()
so that the hot-added memory is always in the highmem area?
Currently it just shifts the given parameters right by PAGE_SHIFT and
calls the generic __add_pages().
======================= readelf -l /proc/kcore:

Elf file type is CORE (Core file)
Entry point 0x0
There are 3 program headers, starting at offset 52

Program Headers:
  Type   Offset     VirtAddr   PhysAddr   FileSiz    MemSiz     Flg Align
  NOTE   0x000094   0x00000000 0x00000000 0x01304    0x00000        0
  LOAD   0x207f2000 0xe07f0000 0xffffffff 0x1e80e000 0x1e80e000 RWE 0x1000
  LOAD   0x002000   0xc0000000 0x00000000 0x1fff0000 0x1fff0000 RWE 0x1000

======================== dmesg excerpt:

[ 302.503487] hv_balloon: Max. dynamic memory size: 1048576 MB
[ 303.171640] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x28000) << PAGE_SHIFT)=0x28000000, (HA_CHUNK << PAGE_SHIFT)=134217728)
[ 303.173031] populate_section_memmap: alloc_pages() returned 0x56164d26 (should be 0), reverting to vmalloc(memmap_size=655360)
[ 303.173031] populate_section_memmap: vmalloc(655360) returned 0x912eede0
[ 303.173031] populate_section_memmap: returning struct page * =0x912eede0
[ 303.173032] populate_section_memmap: alloc_pages() returned 0x56164d26 (should be 0), reverting to vmalloc(memmap_size=655360)
[ 303.173032] populate_section_memmap: vmalloc(655360) returned 0x900acc37
[ 303.173032] populate_section_memmap: returning struct page * =0x900acc37
[ 303.173033] hv_balloon: hv_mem_hot_add: add_memory() returned 0
[ 303.213109] online_pages: pfn: 28000 - 2c000 (zone: HighMem)
[ 303.223135] Built 1 zonelists, mobility grouping on. Total pages: 123131
[ 303.223139] online_pages: pfn: 2c000 - 30000 (zone: HighMem)
....
[ 305.124224] hv_balloon: hv_mem_hot_add: calling add_memory(nid=0, ((start_pfn=0x60000) << PAGE_SHIFT)=0x60000000, (HA_CHUNK << PAGE_SHIFT)=134217728)
[ 305.124239] populate_section_memmap: alloc_pages() returned 0x56164d26 (should be 0), reverting to vmalloc(memmap_size=655360)
[ 305.124240] populate_section_memmap: vmalloc(655360) returned 0x5dd5170c
[ 305.124240] populate_section_memmap: returning struct page * =0x5dd5170c
[ 305.124254] populate_section_memmap: alloc_pages() returned 0x56164d26 (should be 0), reverting to vmalloc(memmap_size=655360)
[ 305.124254] populate_section_memmap: vmalloc(655360) returned 0xf8ef699a
[ 305.124254] populate_section_memmap: returning struct page * =0xf8ef699a
[ 305.124256] hv_balloon: hv_mem_hot_add: add_memory() returned 0
[ 305.143791] online_pages: pfn: 60000 - 64000 (zone: HighMem)
[ 305.153186] online_pages: pfn: 64000 - 68000 (zone: HighMem)

======================= /proc/zoneinfo before hot-add

Node 0, zone      DMA
  per-node stats
      nr_inactive_anon 12069
      nr_active_anon 11288
      nr_inactive_file 13748
      nr_active_file 17527
      nr_unevictable 6734
      nr_slab_reclaimable 4337
      nr_slab_unreclaimable 8457
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_nodes 2262
      workingset_refault 223120
      workingset_activate 208515
      workingset_restore 137786
      workingset_nodereclaim 707
      nr_anon_pages 26686
      nr_mapped    10129
      nr_file_pages 34688
      nr_dirty     1
      nr_writeback 231
      nr_writeback_temp 0
      nr_shmem     942
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_file_hugepages 0
      nr_file_pmdmapped 0
      nr_anon_transparent_hugepages 0
      nr_unstable  0
      nr_vmscan_write 71210
      nr_vmscan_immediate_reclaim 3265
      nr_dirtied   6588
      nr_written   77555
      nr_kernel_misc_reclaimable 0
  pages free     403
        min      1049
        low      1055
        high     1061
        spanned  4095
        present  3998
        managed  3979
        protection: (0, 357, 357, 357)
      nr_free_pages 403
      nr_zone_inactive_anon 371
      nr_zone_active_anon 321
      nr_zone_inactive_file 544
      nr_zone_active_file 683
      nr_zone_unevictable 164
      nr_zone_write_pending 0
      nr_mlock     164
      nr_page_table_pages 11
      nr_kernel_stack 360
      nr_bounce    0
      nr_zspages   583
      nr_free_cma  0
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
  node_unreclaimable:  0
  start_pfn:           1
Node 0, zone   Normal
  pages free     1400
        min      592
        low      740
        high     888
        spanned  126960
        present  126960
        managed  105057
        protection: (0, 0, 0, 0)
      nr_free_pages 1400
      nr_zone_inactive_anon 11695
      nr_zone_active_anon 10961
      nr_zone_inactive_file 13204
      nr_zone_active_file 16844
      nr_zone_unevictable 6570
      nr_zone_write_pending 235
      nr_mlock     6570
      nr_page_table_pages 514
      nr_kernel_stack 1272
      nr_bounce    0
      nr_zspages   22175
      nr_free_cma  0
  pagesets
    cpu: 0
              count: 44
              high:  186
              batch: 31
  vm stats threshold: 6
  node_unreclaimable:  0
  start_pfn:           4096
Node 0, zone  HighMem
  pages free     0
        min      32
        low      32
        high     32
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 0)
Node 0, zone  Movable
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 0)

============================ /proc/zoneinfo after hot-add

Node 0, zone      DMA
  per-node stats
      nr_inactive_anon 13438
      nr_active_anon 10249
      nr_inactive_file 6955
      nr_active_file 26815
      nr_unevictable 6734
      nr_slab_reclaimable 4442
      nr_slab_unreclaimable 8670
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_nodes 2174
      workingset_refault 635931
      workingset_activate 594855
      workingset_restore 486703
      workingset_nodereclaim 1247
      nr_anon_pages 25862
      nr_mapped    12441
      nr_file_pages 38352
      nr_dirty     8
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     2136
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_file_hugepages 0
      nr_file_pmdmapped 0
      nr_anon_transparent_hugepages 0
      nr_unstable  0
      nr_vmscan_write 123858
      nr_vmscan_immediate_reclaim 12156
      nr_dirtied   7219
      nr_written   130953
      nr_kernel_misc_reclaimable 0
  pages free     1380
        min      23
        low      28
        high     33
        spanned  4095
        present  3998
        managed  3979
        protection: (0, 410, 1306, 1306)
      nr_free_pages 1380
      nr_zone_inactive_anon 27
      nr_zone_active_anon 102
      nr_zone_inactive_file 122
      nr_zone_active_file 238
      nr_zone_unevictable 164
      nr_zone_write_pending 0
      nr_mlock     164
      nr_page_table_pages 13
      nr_kernel_stack 328
      nr_bounce    0
      nr_zspages   660
      nr_free_cma  0
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
  node_unreclaimable:  0
  start_pfn:           1
Node 0, zone   Normal
  pages free     20635
        min      633
        low      791
        high     949
        spanned  126960
        present  126960
        managed  105057
        protection: (0, 0, 7168, 7168)
      nr_free_pages 20635
      nr_zone_inactive_anon 8967
      nr_zone_active_anon 7980
      nr_zone_inactive_file 6309
      nr_zone_active_file 5881
      nr_zone_unevictable 6570
      nr_zone_write_pending 8
      nr_mlock     6570
      nr_page_table_pages 537
      nr_kernel_stack 1176
      nr_bounce    0
      nr_zspages   25936
      nr_free_cma  0
  pagesets
    cpu: 0
              count: 97
              high:  186
              batch: 31
  vm stats threshold: 6
  node_unreclaimable:  0
  start_pfn:           4096
Node 0, zone  HighMem
  pages free     199096
        min      128
        low      473
        high     818
        spanned  262144
        present  262144
        managed  229376
        protection: (0, 0, 0, 0)
      nr_free_pages 199096
      nr_zone_inactive_anon 4444
      nr_zone_active_anon 2167
      nr_zone_inactive_file 524
      nr_zone_active_file 20691
      nr_zone_unevictable 0
      nr_zone_write_pending 0
      nr_mlock     0
      nr_page_table_pages 0
      nr_kernel_stack 0
      nr_bounce    0
      nr_zspages   122
      nr_free_cma  0
  pagesets
    cpu: 0
              count: 67
              high:  378
              batch: 63
  vm stats threshold: 8
  node_unreclaimable:  0
  start_pfn:           163840
Node 0, zone  Movable
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 0)
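The numbers in the dmesg and zoneinfo output above can be cross-checked with a little page-frame arithmetic. The helpers below are hypothetical and assume 4 KiB pages (`PAGE_SHIFT` of 12); they only convert between page frame numbers, byte addresses and sizes.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KiB pages, assumed */

/* Page frame number to physical byte address. */
static uint64_t pfn_to_phys(uint64_t pfn)
{
    assert(pfn <= (UINT64_MAX >> PAGE_SHIFT));   /* no overflow on shift */
    return pfn << PAGE_SHIFT;
}

/* Number of pages to size in MiB. */
static uint64_t pages_to_mib(uint64_t pages)
{
    return pfn_to_phys(pages) >> 20;
}

/* First pfn past a zone that starts at start_pfn and spans `spanned` pages. */
static uint64_t zone_end_pfn(uint64_t start_pfn, uint64_t spanned)
{
    return start_pfn + spanned;
}
```

Under that assumption, start_pfn 0x28000 corresponds to physical 0x28000000 (640 MiB), the HighMem start_pfn of 163840 is the same frame in decimal, and its 262144 spanned pages amount to 1024 MiB ending at pfn 0x68000, which agrees with the last online_pages line in the excerpt.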
Created attachment 287455 [details] ha00.patch
On Tue 18-02-20 15:24:48, kkabe@vega.pgw.jp wrote:
[...]
> Tried out the above patch.
> It seems to be working; no panic, total memory has increased and
> the hot-added memory is added as HIGHMEM.

I was about to post a patch to mark hotplug broken on 32b but it seems
you do care about this setup. Could you describe your usecase please?
mhocko@kernel.org sed in <20200218084700.GD21113@dhcp22.suse.cz>

>> On Tue 18-02-20 15:24:48, kkabe@vega.pgw.jp wrote:
>> [...]
>> > Tried out the above patch.
>> > It seems to be working; no panic, total memory has increased and
>> > the hot-added memory is added as HIGHMEM.
>>
>> I was about to post a patch to mark hotplug broken on 32b but it seems
>> you do care about this setup. Could you describe your usecase please?

My usecase is testing out the kernel on Hyper-V before loading it on
real i686 machine. Hyper-V machine is faster to skim out other bugs.
So memory hot-add is not a must requirement for me,
but having hot-add may be handy to see the application memory requirement.
(as in the anaconda test revealed)

If we're disabling it, we have to announce it somewhere;
where is appropriate? `modinfo hv_balloon`'s "hot_add" description?
On 18.02.20 10:19, kkabe@vega.pgw.jp wrote:
> mhocko@kernel.org sed in <20200218084700.GD21113@dhcp22.suse.cz>
>
>>> On Tue 18-02-20 15:24:48, kkabe@vega.pgw.jp wrote:
>>> [...]
>>>> Tried out the above patch.
>>>> It seems to be working; no panic, total memory has increased and
>>>> the hot-added memory is added as HIGHMEM.
>>>
>>> I was about to post a patch to mark hotplug broken on 32b but it seems
>>> you do care about this setup. Could you describe your usecase please?
>
> My usecase is testing out the kernel on Hyper-V before loading it on
> real i686 machine. Hyper-V machine is faster to skim out other bugs.
> So memory hot-add is not a must requirement for me,
> but having hot-add may be handy to see the application memory requirement.
> (as in the anaconda test revealed)
>
> If we're disabling it, we have to announce it somewhere;
> where is appropriate? `modinfo hv_balloon`'s "hot_add" description?
>

I'd really vote to just disable that. Basic testing of kernels can be
done without memory hotadd. If I am not wrong, doing an "online_movable"
or "online_kernel" from user space could still trigger crashes and would
have to be fenced.
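For context on the "online_movable" / "online_kernel" remark: user space selects the online policy by writing one of those strings to a memory block's `state` file under `/sys/devices/system/memory/`. The following is a minimal sketch of such a write, not code from the thread. The helper names and the block number used in the usage note are hypothetical, and a real write requires root and changes system state.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path of one memory block's "state" file.
 * Returns the snprintf() result (length the full path needs). */
int memory_block_state_path(char *buf, size_t len, unsigned int block)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "/sys/devices/system/memory/memory%u/state", block);
}

/* Ask the kernel to online a block with the given policy string,
 * e.g. "online_kernel" or "online_movable".
 * Returns 0 on success, -1 on any failure. */
int set_memory_block_state(unsigned int block, const char *state)
{
    char path[64];
    FILE *f;
    int ok;

    if (memory_block_state_path(path, sizeof(path), block) >= (int)sizeof(path))
        return -1;              /* path did not fit in the buffer */

    f = fopen(path, "w");
    if (!f)
        return -1;              /* no such block, or no permission */

    ok = fputs(state, f) >= 0;
    if (fclose(f) != 0)         /* the kernel reports rejection at flush time */
        ok = 0;
    return ok ? 0 : -1;
}
```

A call such as `set_memory_block_state(32, "online_movable")` (block 32 is only an example) is the kind of user-space request that, per the comment above, would also need to be fenced on 32-bit.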
On Tue 18-02-20 18:19:00, kkabe@vega.pgw.jp wrote:
> mhocko@kernel.org sed in <20200218084700.GD21113@dhcp22.suse.cz>
>
> >> On Tue 18-02-20 15:24:48, kkabe@vega.pgw.jp wrote:
> >> [...]
> >> > Tried out the above patch.
> >> > It seems to be working; no panic, total memory has increased and
> >> > the hot-added memory is added as HIGHMEM.
> >>
> >> I was about to post a patch to mark hotplug broken on 32b but it seems
> >> you do care about this setup. Could you describe your usecase please?
>
> My usecase is testing out the kernel on Hyper-V before loading it on
> real i686 machine. Hyper-V machine is faster to skim out other bugs.
> So memory hot-add is not a must requirement for me,
> but having hot-add may be handy to see the application memory requirement.
> (as in the anaconda test revealed)

OK, thanks for the clarification. I am not sure that this qualifies
as a sufficient reason to maintain the code though.

> If we're disabling it, we have to announce it somewhere;
> where is appropriate? `modinfo hv_balloon`'s "hot_add" description?

This should behave the same way as when the CONFIG_MEMORY_HOTPLUG is
not enabled. And from a very cursory look hv_balloon.c already checks
for the config.

---
From 562f21abeda508f199c34358e50fbaa518cd5ed8 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 18 Feb 2020 08:04:13 +0100
Subject: [PATCH] memory_hotplug: disable the functionality for 32b

Memory hotplug is broken for 32b systems at least since c6f03e2903c9
("mm, memory_hotplug: remove zone restrictions") which has considerably
reworked how memory can be associated with movable/kernel zones. The
same is not really trivial to achieve in 32b where only lowmem is the
kernel zone. While we can tweak this immediate problem around there are
likely other land mines hidden at other places.

It is also quite dubious that there is a real usecase for the memory
hotplug on 32b in the first place. Low memory is just too small to be
hotpluggable (for hot add) and generally unusable for hotremove. Adding
more memory to highmem is also dubious because it would increase the
low mem or vmalloc space pressure for memmaps.

Restrict the functionality to 64b systems. This will help future
development to focus on usecases that have real life application. We
can remove this restriction in future in presence of a real life usecase
of course but until then make it explicit that hotplug on 32b is broken
and requires a non trivial amount of work to fix.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index ab80933be65f..2d5fe9e92969 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -154,6 +154,7 @@ config MEMORY_HOTPLUG
 	bool "Allow for memory hot-add"
 	depends on SPARSEMEM || X86_64_ACPI_NUMA
 	depends on ARCH_ENABLE_MEMORY_HOTPLUG
+	depends on 64BIT || BROKEN
 
 config MEMORY_HOTPLUG_SPARSE
 	def_bool y
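To make the effect of the added `depends on 64BIT || BROKEN` line concrete: in Kconfig, several `depends on` lines on one symbol are ANDed together, so MEMORY_HOTPLUG becomes unselectable on 32-bit unless BROKEN is set. The sketch below restates that dependency expression as a plain C predicate. It is only a model for illustration; `struct kconfig_syms` and `memory_hotplug_selectable` are made-up names, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the Kconfig symbols named in the patch. */
struct kconfig_syms {
    bool sparsemem;                  /* SPARSEMEM */
    bool x86_64_acpi_numa;           /* X86_64_ACPI_NUMA */
    bool arch_enable_memory_hotplug; /* ARCH_ENABLE_MEMORY_HOTPLUG */
    bool is_64bit;                   /* 64BIT */
    bool broken;                     /* BROKEN */
};

/* Model of the three "depends on" lines after the patch:
 * each line must hold, i.e. they are combined with logical AND. */
bool memory_hotplug_selectable(const struct kconfig_syms *c)
{
    assert(c != NULL);
    return (c->sparsemem || c->x86_64_acpi_numa) &&
           c->arch_enable_memory_hotplug &&
           (c->is_64bit || c->broken);
}
```

With `sparsemem` and `arch_enable_memory_hotplug` set but `is_64bit` and `broken` both false (an i386 build), the predicate is false, which corresponds to CONFIG_MEMORY_HOTPLUG being off and drivers such as hv_balloon taking their existing no-hotplug path.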
On 18.02.20 11:05, Michal Hocko wrote:
> On Tue 18-02-20 18:19:00, kkabe@vega.pgw.jp wrote:
>> mhocko@kernel.org sed in <20200218084700.GD21113@dhcp22.suse.cz>
>>
>>>> On Tue 18-02-20 15:24:48, kkabe@vega.pgw.jp wrote:
>>>> [...]
>>>>> Tried out the above patch.
>>>>> It seems to be working; no panic, total memory has increased and
>>>>> the hot-added memory is added as HIGHMEM.
>>>>
>>>> I was about to post a patch to mark hotplug broken on 32b but it seems
>>>> you do care about this setup. Could you describe your usecase please?
>>
>> My usecase is testing out the kernel on Hyper-V before loading it on
>> real i686 machine. Hyper-V machine is faster to skim out other bugs.
>> So memory hot-add is not a must requirement for me,
>> but having hot-add may be handy to see the application memory requirement.
>> (as in the anaconda test revealed)
>
> OK, thanks for the clarification. I am not sure that this qualifies
> as a sufficient reason to maintain the code though.
>
>> If we're disabling it, we have to announce it somewhere;
>> where is appropriate? `modinfo hv_balloon`'s "hot_add" description?
>
> This should behave the same way as when the CONFIG_MEMORY_HOTPLUG is
> not enabled. And from a very cursory look hv_balloon.c already checks
> for the config.
>
> ---
> From 562f21abeda508f199c34358e50fbaa518cd5ed8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Tue, 18 Feb 2020 08:04:13 +0100
> Subject: [PATCH] memory_hotplug: disable the functionality for 32b
>
> Memory hotplug is broken for 32b systems at least since c6f03e2903c9
> ("mm, memory_hotplug: remove zone restrictions") which has considerably
> reworked how memory can be associated with movable/kernel zones. The
> same is not really trivial to achieve in 32b where only lowmem is the
> kernel zone. While we can tweak this immediate problem around there are
> likely other land mines hidden at other places.
>
> It is also quite dubious that there is a real usecase for the memory
> hotplug on 32b in the first place. Low memory is just too small to be
> hotpluggable (for hot add) and generally unusable for hotremove. Adding
> more memory to highmem is also dubious because it would increase the
> low mem or vmalloc space pressure for memmaps.
>
> Restrict the functionality to 64b systems. This will help future
> development to focus on usecases that have real life application. We
> can remove this restriction in future in presence of a real life usecase
> of course but until then make it explicit that hotplug on 32b is broken
> and requires a non trivial amount of work to fix.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ab80933be65f..2d5fe9e92969 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -154,6 +154,7 @@ config MEMORY_HOTPLUG
>  	bool "Allow for memory hot-add"
>  	depends on SPARSEMEM || X86_64_ACPI_NUMA
>  	depends on ARCH_ENABLE_MEMORY_HOTPLUG
> +	depends on 64BIT || BROKEN
>
>  config MEMORY_HOTPLUG_SPARSE
>  	def_bool y
>

Acked-by: David Hildenbrand <david@redhat.com>
On 02/18/20 at 11:05am, Michal Hocko wrote:
> On Tue 18-02-20 18:19:00, kkabe@vega.pgw.jp wrote:
> > mhocko@kernel.org sed in <20200218084700.GD21113@dhcp22.suse.cz>
> >
> > >> On Tue 18-02-20 15:24:48, kkabe@vega.pgw.jp wrote:
> > >> [...]
> > >> > Tried out the above patch.
> > >> > It seems to be working; no panic, total memory has increased and
> > >> > the hot-added memory is added as HIGHMEM.
> > >>
> > >> I was about to post a patch to mark hotplug broken on 32b but it seems
> > >> you do care about this setup. Could you describe your usecase please?
> >
> > My usecase is testing out the kernel on Hyper-V before loading it on
> > real i686 machine. Hyper-V machine is faster to skim out other bugs.
> > So memory hot-add is not a must requirement for me,
> > but having hot-add may be handy to see the application memory requirement.
> > (as in the anaconda test revealed)
>
> OK, thanks for the clarification. I am not sure that this qualifies
> as a sufficient reason to maintain the code though.
>
> > If we're disabling it, we have to announce it somewhere;
> > where is appropriate? `modinfo hv_balloon`'s "hot_add" description?
>
> This should behave the same way as when the CONFIG_MEMORY_HOTPLUG is
> not enabled. And from a very cursory look hv_balloon.c already checks
> for the config.
>
> ---
> From 562f21abeda508f199c34358e50fbaa518cd5ed8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Tue, 18 Feb 2020 08:04:13 +0100
> Subject: [PATCH] memory_hotplug: disable the functionality for 32b
>
> Memory hotplug is broken for 32b systems at least since c6f03e2903c9
> ("mm, memory_hotplug: remove zone restrictions") which has considerably
> reworked how memory can be associated with movable/kernel zones. The
> same is not really trivial to achieve in 32b where only lowmem is the
> kernel zone. While we can tweak this immediate problem around there are
> likely other land mines hidden at other places.
>
> It is also quite dubious that there is a real usecase for the memory
> hotplug on 32b in the first place. Low memory is just too small to be
> hotpluggable (for hot add) and generally unusable for hotremove. Adding
> more memory to highmem is also dubious because it would increase the
> low mem or vmalloc space pressure for memmaps.
>
> Restrict the functionality to 64b systems. This will help future
> development to focus on usecases that have real life application. We
> can remove this restriction in future in presence of a real life usecase
> of course but until then make it explicit that hotplug on 32b is broken
> and requires a non trivial amount of work to fix.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

No objection to this, ack.

Acked-by: Baoquan He <bhe@redhat.com>

At least in our distros, we took i386 off our ARCH lists a very long
time ago, hence I personally haven't followed i386 code for a long time
either. This can save us time when maintaining the mem hotplug code.
Thanks for making this patch.

Thanks
Baoquan

> ---
>  mm/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ab80933be65f..2d5fe9e92969 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -154,6 +154,7 @@ config MEMORY_HOTPLUG
>  	bool "Allow for memory hot-add"
>  	depends on SPARSEMEM || X86_64_ACPI_NUMA
>  	depends on ARCH_ENABLE_MEMORY_HOTPLUG
> +	depends on 64BIT || BROKEN
>
>  config MEMORY_HOTPLUG_SPARSE
>  	def_bool y
> --
> 2.24.1
>
> --
> Michal Hocko
> SUSE Labs
>
On 02/18/20 at 03:24pm, kkabe@vega.pgw.jp wrote:
> bhe@redhat.com sed in <20200217112054.GA9823@MiWiFi-R3L-srv>
> >> Please try below code instead, see if it works. However, as David
> >> and Michal said in other reply, if no real use case, we may not be so
> >> eager to support mem hotplug on i386.
> >>
> >>
> >> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >> index 475d0d68a32c..9faf47bd026e 100644
> >> --- a/mm/memory_hotplug.c
> >> +++ b/mm/memory_hotplug.c
> >> @@ -715,15 +715,20 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
> >>  {
> >>  	struct pglist_data *pgdat = NODE_DATA(nid);
> >>  	int zid;
> >> +	enum zone_type default_zone = ZONE_NORMAL;
> >>
> >> -	for (zid = 0; zid <= ZONE_NORMAL; zid++) {
> >> +#ifdef CONFIG_HIGHMEM
> >> +	default_zone = ZONE_HIGHMEM;
> >> +#endif
> >> +
> >> +	for (zid = 0; zid <= default_zone; zid++) {
> >>  		struct zone *zone = &pgdat->node_zones[zid];
> >>
> >>  		if (zone_intersects(zone, start_pfn, nr_pages))
> >>  			return zone;
> >>  	}
> >>
> >> -	return &pgdat->node_zones[ZONE_NORMAL];
> >> +	return &pgdat->node_zones[default_zone];
> >>  }
> >>
> >>  static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
> >>
> >>
>
> Tried out the above patch.
> It seems to be working; no panic, total memory has increased and
> the hot-added memory is added as HIGHMEM.
> Minimal install of 168MB memory worked, so this time the sample is
> running anaconda installer starting at 512MB.
> Eventually memory was hot-added to around 1.2GB.
>
> The weird pr_info() from populate_section_memmap() is still remaining
> though...
>
> 2nd parameter of add_memory() (phys_addr_t, 32bit on non-PAE) is
> going up to 0x60000000, so drivers/hv/hv_balloon.c:hv_mem_hot_add() may need
> limit check to not overflow 4GB for heavier usage.
> (Yes you should limit it in hypervisor dialog, but default is 1TB)
>
>
> Do we need modifications for arch/x86/mm/init_32.c:arch_add_memory()
> so that the hot-added memory is always in highmem area?
> Currently it just >>PAGE_SHIFT given parameters and call generic
> __add_pages().

Hmm, it may not always be hot added into the highmem area; if possible,
it can be added into the movable area. From my point of view, the above
change is enough to make it work.

Sorry, man. The i386 is too old; as you see, people are more willing to
deprecate it so that they can focus on 64-bit architectures. You can
still patch your kernel with the above code change for a while, but
possibly not for very long.

Thanks
Baoquan
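The zone-selection change quoted above can be tried outside the kernel with a small user-space model. This is a sketch under stated assumptions, not the kernel implementation. The `_model` types and functions are hypothetical stand-ins for `struct zone`, `zone_intersects()` and `default_kernel_zone_for_pfn()`, and the `have_highmem` flag stands in for `#ifdef CONFIG_HIGHMEM`.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified zone indices in the same order the zoneinfo dump lists them. */
enum zone_type_model { MZ_DMA, MZ_NORMAL, MZ_HIGHMEM, MZ_MOVABLE, MZ_MAX };

/* Only the two fields the selection logic needs. */
struct zone_model {
    unsigned long start_pfn;
    unsigned long spanned_pages;
};

/* True if [start_pfn, start_pfn + nr_pages) overlaps the zone's span.
 * An empty zone never intersects anything. */
bool zone_intersects_model(const struct zone_model *z,
                           unsigned long start_pfn, unsigned long nr_pages)
{
    if (z->spanned_pages == 0)
        return false;
    if (start_pfn >= z->start_pfn + z->spanned_pages ||
        start_pfn + nr_pages <= z->start_pfn)
        return false;
    return true;
}

/* Pick the kernel zone for a new range. The highest candidate is
 * HighMem when have_highmem is set, Normal otherwise. */
enum zone_type_model default_kernel_zone_model(const struct zone_model *zones,
                                               unsigned long start_pfn,
                                               unsigned long nr_pages,
                                               bool have_highmem)
{
    enum zone_type_model default_zone = have_highmem ? MZ_HIGHMEM : MZ_NORMAL;
    int zid;

    assert(nr_pages > 0);
    for (zid = 0; zid <= (int)default_zone; zid++) {
        if (zone_intersects_model(&zones[zid], start_pfn, nr_pages))
            return (enum zone_type_model)zid;
    }
    return default_zone;  /* nothing intersects: fall back to the default */
}
```

Fed with the spans from the zoneinfo dump taken before hot-add (DMA from pfn 1 spanning 4095 pages, Normal from pfn 4096 spanning 126960 pages, HighMem empty), a range starting at pfn 163840 intersects no zone. The model then returns Normal without the highmem flag and HighMem with it, which matches the report that the hot-added memory lands in HIGHMEM once the patch is applied. The range length used to try this is arbitrary.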
On Tue, 18 Feb 2020 11:05:32 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> Subject: [PATCH] memory_hotplug: disable the functionality for 32b
>
> Memory hotplug is broken for 32b systems at least since c6f03e2903c9
> ("mm, memory_hotplug: remove zone restrictions") which has considerably
> reworked how memory can be associated with movable/kernel zones. The
> same is not really trivial to achieve in 32b where only lowmem is the
> kernel zone. While we can tweak this immediate problem around there are
> likely other land mines hidden at other places.
>
> It is also quite dubious that there is a real usecase for the memory
> hotplug on 32b in the first place. Low memory is just too small to be
> hotpluggable (for hot add) and generally unusable for hotremove. Adding
> more memory to highmem is also dubious because it would increase the
> low mem or vmalloc space pressure for memmaps.
>
> Restrict the functionality to 64b systems. This will help future
> development to focus on usecases that have real life application. We
> can remove this restriction in future in presence of a real life usecase
> of course but until then make it explicit that hotplug on 32b is broken
> and requires a non trivial amount of work to fix.

(cc linux-arch)

(and linux-arm-kernel, as ARM is a major 32-bit user)

Does anyone see a problem with disabling memory hotplug on 32-bit
builds?