Created attachment 190641 [details]
call graph

S4 resume will fail and reboot about 1 out of 100 times. On the VIA Nano, we narrowed it down to a double fault in mwait_play_dead(). An Intel Haswell appears to fail similarly, but we did not narrow it all the way down.

from smpboot.c:

static inline void mwait_play_dead(void)
...
        mwait_ptr = &current_thread_info()->flags;
..
        while (1) {
                clflush(mwait_ptr);   /* <--- double fault occurs here */
                mb();
                __monitor(mwait_ptr, 0, 0);
                mb();
                __mwait(eax, 0);
...
        }
}

The double fault occurs when the boot cpu is managing the resume (resume_target_kernel()) and the non-boot cpus are in the above loop (from disable_nonboot_cpus()). If a "spurious" wakeup out of the mwait happens, the non-boot cpu may tablewalk for mwait_ptr again, but it appears that mwait_ptr is no longer present in the page tables (the boot cpu is in set_up_temporary_mapping() during this time). This causes a pagefault-pagefault-doublefault.

A possible fix is to have mwait_ptr point somewhere else that stays paged during the resume process. One experiment where we had mwait_ptr = mwait_play_dead appeared to fix it. However, that is probably not ideal for performance, as it is pointing to code, and all the cpus will clflush that same address.

Attached is a drawn call graph of the observed failure case.
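For reference, the experiment was nothing more than repointing the monitor at a kernel-text address; a minimal sketch of the change inside mwait_play_dead() (not the exact diff we ran):

        /* experiment: monitor a kernel-text address, which stays mapped across
         * the whole resume, instead of this CPU's thread_info flags */
        void *mwait_ptr = (void *)mwait_play_dead;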
..
        while (1) {
                clflush(mwait_ptr);   /* <--- double fault occurs here */
                mb();
                __monitor(mwait_ptr, 0, 0);
                mb();
                __mwait(eax, 0);
...

Not sure if this is worth pointing out, but there is an additional mb() in front of the clflush in my tree. Did you miss it while typing, or is it indeed not there in your tree?
(In reply to Aaron Lu from comment #1)
> Not sure if this is worth pointing out, but there is an additional mb() in
> front of the clflush in my tree. Did you miss it while typing, or is it
> indeed not there in your tree?

Sorry, I missed that while typing. The mb() is there.
Adding the author HPA for comments.
Hi Varun,

(In reply to Varun Koyyalagunta from comment #0)
> The double fault occurs when the boot cpu is managing the resume
> (resume_target_kernel()) and the non-boot cpus are in the above loop (from
> disable_nonboot_cpus()). If a "spurious" wakeup out of the mwait happens,
> the non-boot cpu may tablewalk for mwait_ptr again, but it appears that
> mwait_ptr is no longer present in the page tables (the boot cpu is in
> set_up_temporary_mapping() during this time). This causes a
> pagefault-pagefault-doublefault.

Thanks for your info. I have two questions:

1. How do you confirm that the boot cpu is in set_up_temporary_mapping() when the non-boot cpu dies? set_up_temporary_mapping() creates a temporary page table for the boot cpu, not for the non-boot cpus, and at that stage no page tables have been invalidated or replaced by the temporary one; that only happens once restore_image is invoked by the boot cpu. So set_up_temporary_mapping() does not affect the mapping of the non-boot cpus.

2. What does 'double fault' mean here? Is it a message like "unable to handle xxx" followed by a panic? Can you please upload the panic log or a screenshot? If possible, please also print the value of mwait_ptr before __monitor (e.g. printk("%p\n", mwait_ptr);), so we can see whether the mapping for mwait_ptr has been broken.

The possible cause of this problem I'm thinking of is that, when the boot cpu copies pages back to their original positions in restore_image, it might break the page table of the non-boot cpus; if one of the non-boot cpus has been woken up at that time, it might find mwait_ptr missing from its own page table and then panic.

thanks,
yu
1) Sorry, you are right, the boot cpu is in restore_image, specifically core_restore_code, when the non-boot cpu fails. We confirmed this using our internal cpu hardware debug tools, which create a trace of the processor execution.

2) By double fault, I meant an x86 double fault exception.

1166 I:0x0001046FD3:0x3BAE0F:"CLFLUSH DS:[RBX]"; // WB:
1166 D:0x0001046FD3:0x61EF4B; // LONG MODE, CPL=0, thread 1, step 0
// DATA ADDRESS TRANSLATION, virtual address = 0xFFFF8800BA800010
// DATA TABLE WALK, virtual address = 0xFFFF8800BA800010
// pdptc hit
pderead 0x009BEBFEA0 0x2D9D9063; // WB: PDE, LOAD
pderead 0x009BEBFEA4 0x00000000; // WB: PDE, LOAD
pteread 0x002D9D9000 0x00000000; // WB: PTE, LOAD
pteread 0x002D9D9004 0x00000000; // WB: PTE, LOAD
// EXCEPTION 14: PAGE NOT PRESENT, EC=00000h, ADDR=0xFFFF8800BA800010h
// Load 8-B 0xFFFFFFFFFF5780E0
// DATA TABLE WALK, virtual address = 0xFFFFFFFFFF5780E0
pml4read 0x0001C12FF8 0x01C15067; // WB: PML4, LOAD
pml4read 0x0001C12FFC 0x00000000; // WB: PML4, LOAD
pdptread 0x0001C15FF8 0x01C17067; // WB: PDPT, LOAD
pdptread 0x0001C15FFC 0x00000000; // WB: PDPT, LOAD
pderead 0x0001C17FD0 0x01C18067; // WB: PDE, LOAD
pderead 0x0001C17FD4 0x00000000; // WB: PDE, LOAD
pteread 0x0001C18BC0 0x01E8D161; // WB: PTE, LOAD
pteread 0x0001C18BC4 0x80000000; // WB: PTE, LOAD
memread 0x0001E8D0E0 0x0010E940; // WB: DESCRIPTOR, IDT, LOAD
memread 0x0001E8D0E4 0x81788E00; // WB: DESCRIPTOR, IDT, LOAD
// Load 8-B 0xFFFFFFFFFF5780E8
memread 0x0001E8D0E8 0xFFFFFFFF; // WB: DESCRIPTOR, IDT, LOAD
memread 0x0001E8D0EC 0x00000000; // WB: DESCRIPTOR, IDT, LOAD
// Load 8-B 0xFFFF8800BFC89010
memread 0x00BFC89010 0x0000FFFF; // WB: DESCRIPTOR, GDT, LOAD
memread 0x00BFC89014 0x00AF9B00; // WB: DESCRIPTOR, GDT, LOAD
// Store 8-B 0xFFFF8800BA803E88
// DATA ADDRESS TRANSLATION, virtual address = 0xFFFF8800BA803E88
// DATA TABLE WALK, virtual address = 0xFFFF8800BA803E88
pml4read 0x0001C12880 0x01FD2067; // WB: PML4, STORE
pml4read 0x0001C12884 0x00000000; // WB: PML4, STORE
pdptread 0x0001FD2010 0x9BEBF063; // WB: PDPT, STORE
pdptread 0x0001FD2014 0x00000000; // WB: PDPT, STORE
pderead 0x009BEBFEA0 0x2D9D9063; // WB: PDE, STORE
pderead 0x009BEBFEA4 0x00000000; // WB: PDE, STORE
pteread 0x002D9D9018 0x00000000; // WB: PTE, STORE
pteread 0x002D9D901C 0x00000000; // WB: PTE, STORE
// EXCEPTION 14: PAGE NOT PRESENT, EC=00002h, ADDR=0xFFFF8800BA803E88h
// EXCEPTION 8: DOUBLE FAULT
...
Oops, I accidentally hit reply with just half the message. That CLFLUSH above is from our execution trace on core 1. It is the CLFLUSH before the MONITOR/MWAIT. The trace shows the double fault exception happening after two bad page faults. The mwait_ptr is in RBX, which has value 0xFFFF8800BA800010. We don't have the panic log, as I think it did not contain useful information. We may have a screenshot, but it isn't on hand at the moment.
(In reply to Varun Koyyalagunta from comment #6)
> Oops, I accidentally hit reply with just half the message.
>
> That CLFLUSH above is from our execution trace on core 1. It is the CLFLUSH
> before the MONITOR/MWAIT. The trace shows the double fault exception
> happening after two bad page faults. The mwait_ptr is in RBX, which has
> value 0xFFFF8800BA800010.
>
> We don't have the panic log, as I think it did not contain useful
> information. We may have a screenshot, but it isn't on hand at the moment.

OK, this value 0xFFFF8800BA800010 is very valuable. Since it is a direct-mapping address in the kernel page table, I think this is what I suspected in comment #4: the page table cpu1 is using has been overwritten.

One possible solution is to have the non-boot cpus also use the same temporary page table as the boot cpu. It might need some assembly code to be added before the non-boot cpus go into mwait_play_dead. I'll try.
Created attachment 200111 [details]
trial patch to enable the temporary direct-mapping for non-boot cpus before they die

Please check whether the patch makes sense. Thanks.
Please do not apply the above patch, since there is a bug in it. I'll do more investigation and reply later.
BTW, the reason cpu1 gets a "spurious" wakeup is that cpu0 has just written a value (when restoring pages) to the monitored address mwait_ptr.
Created attachment 201321 [details]
force nonboot cpus to switch to safe pagetable when restoring page frames

Hi,
please check whether this patch works for you.
Yu
(In reply to Chen Yu from comment #11)
> Created attachment 201321 [details]
> force nonboot cpus to switch to safe pagetable when restoring page frames
>
> Hi,
> please check whether this patch works for you.
> Yu

The patch is not working for me. I'm working off of 4.3.0-rc3. I made one change to mwait_play_dead: I replaced the __mwait(...) with an 'invlpg mwait_ptr'. This makes the failure happen very quickly, as the non-boot cpus never go to sleep and are constantly doing a tablewalk for mwait_ptr.

With the invlpg change and the above patch, on a Haswell, I ran 'pm-hibernate'. The OS crashes and reboots usually within 1 to 2 iterations of 'pm-hibernate'.

With the same setup, I tried the invlpg change with the workaround of having mwait_ptr point to code ('mwait_ptr = mwait_play_dead'). That ran for 200 iterations of pm-hibernate without crashing.
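For reference, the modified loop looks roughly like this (a sketch of the reproduction hack, not the exact diff; the inline asm is just one way to issue invlpg on the monitored address, and the mb() from comment #1 is included):

        while (1) {
                mb();
                clflush(mwait_ptr);
                mb();
                __monitor(mwait_ptr, 0, 0);
                mb();
                /* reproduction hack: instead of sleeping in __mwait(), drop the
                 * TLB entry so this CPU re-walks the page tables for mwait_ptr
                 * on every iteration */
                asm volatile("invlpg (%0)" : : "r" (mwait_ptr) : "memory");
        }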
(In reply to Varun Koyyalagunta from comment #12)
> (In reply to Chen Yu from comment #11)
> > Created attachment 201321 [details]
> > force nonboot cpus to switch to safe pagetable when restoring page frames
> >
> > Hi,
> > please check whether this patch works for you.
> > Yu
>
> The patch is not working for me. I'm working off of 4.3.0-rc3. I made one
> change to mwait_play_dead: I replaced the __mwait(...) with an 'invlpg
> mwait_ptr'. This makes the failure happen very quickly, as the non-boot cpus
> never go to sleep and are constantly doing a tablewalk for mwait_ptr.
>
> With the invlpg change and the above patch, on a Haswell, I ran
> 'pm-hibernate'. The OS crashes and reboots usually within 1 to 2 iterations
> of 'pm-hibernate'.
>
> With the same setup, I tried the invlpg change with the workaround of having
> mwait_ptr point to code ('mwait_ptr = mwait_play_dead'). That ran for 200
> iterations of pm-hibernate without crashing.

Cool, thanks for your method to reproduce it. The reason setting mwait_ptr = mwait_play_dead solves this problem might be that mwait_play_dead is in the kernel text mapping, while the current thread_info is located in dynamically allocated address space. But I still wonder: if the dynamically allocated address space for the non-boot cpus is also overwritten, will the non-boot cpus crash too?
After internal discussion, we decided to change the monitor address to the zero page, for the following reasons:

1. The cause of this panic is an inconsistency in the memory layout during hibernation: the mapping for dynamically allocated page frames has changed, and when the boot CPU copies data into one of those page frames it incorrectly wakes up a non-boot CPU that is monitoring that address. Since the memory layout has changed and the page tables for that address are not the same before and after hibernation, the access violation is triggered.

2. Yes, as Varun suggested before, we could change mwait_ptr to a kernel text address. According to the semantics of hibernation, the kernel text remains the same across hibernation, so its page tables also remain the same, and no exception is triggered even if the non-boot CPUs are woken up while the boot CPU is writing data to the text section.

3. A more graceful solution is to monitor the zero page. The boot CPU will then not wake up the non-boot CPUs incorrectly, and even if the non-boot CPUs are woken up, they can fall asleep again quickly.

Here's the proposed patch, do you have a chance to try it? Thanks.
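The core of the idea is just changing where the monitor points; a minimal sketch inside mwait_play_dead() (not the attached patch itself, and assuming the existing empty_zero_page symbol is used):

        /*
         * Sketch: monitor the empty zero page, which nobody writes to, rather
         * than this CPU's thread_info flags, which live in page frames the
         * boot CPU rewrites while restoring the hibernation image.
         */
        void *mwait_ptr = (void *)empty_zero_page;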
Created attachment 219111 [details]
monitor the zero page address
Created attachment 219161 [details]
use zero page for monitor address when cpu is offline
(In reply to Chen Yu from comment #16)
> Created attachment 219161 [details]
> use zero page for monitor address when cpu is offline

This worked for me on a Haswell. I ran 500 hibernations with the patch. I also ran 1000 hibernations where I changed the mwait to an invlpg, so the cpu never goes to sleep and the monitor constantly tablewalks. That also worked.
(In reply to Varun Koyyalagunta from comment #17)
> (In reply to Chen Yu from comment #16)
> > Created attachment 219161 [details]
> > use zero page for monitor address when cpu is offline
>
> This worked for me on a Haswell. I ran 500 hibernations with the patch. I
> also ran 1000 hibernations where I changed the mwait to an invlpg, so the
> cpu never goes to sleep and the monitor constantly tablewalks. That also
> worked.

Thank you very much. Would you please also test with the following method, without any patch applied:

1. First boot the kernel normally.
2. After booting into the kernel, edit the default grub option to add 'maxcpus=1' and run update-grub. (This way, the resume kernel does not have to offline the non-boot CPUs during resume.)
3. Do a stress test on hibernation to see if there is any problem. (Although the resume kernel only has one boot CPU, after resume all the non-boot CPUs should be woken up again.)

I used the following script, FYI:

#!/bin/bash
echo platform > /sys/power/disk
echo none > /sys/power/pm_test
iteration=1
while [ $iteration != 101 ]
do
    echo
    echo $iteration
    rtcwake -m disk -s 60
    sleep 30
    iteration=$(($iteration + 1))
done
(In reply to Chen Yu from comment #18)
> (In reply to Varun Koyyalagunta from comment #17)
> > (In reply to Chen Yu from comment #16)
> > > Created attachment 219161 [details]
> > > use zero page for monitor address when cpu is offline
> >
> > This worked for me on a Haswell. I ran 500 hibernations with the patch. I
> > also ran 1000 hibernations where I changed the mwait to an invlpg, so the
> > cpu never goes to sleep and the monitor constantly tablewalks. That also
> > worked.
>
> Thank you very much. Would you please also test with the following method,
> without any patch applied:

But please test both with and without your invlpg patch.
A more straightforward solution is to use HLT in the "play dead" loops invoked by disable_nonboot_cpus() in general; energy consumption shouldn't matter there.
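Roughly, the hlt-based fallback that already exists in arch/x86/kernel/smpboot.c looks like this (a sketch from memory, not a patch):

        static inline void hlt_play_dead(void)
        {
                if (__this_cpu_read(cpu_info.x86) >= 4)
                        wbinvd();

                while (1)
                        native_halt();
        }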
Created attachment 219731 [details]
Use hlt when disable nonboot CPUs
Varun, could you please check whether the patch from comment #21 works for you (with the invlpg change applied)? Thanks!
We'll provide another solution later, based on monitoring the last cache line of the zero page.
> But please test both with and without your invlpg patch.

I ran into some system stability issues, so this took a while to test. But for me, it looks like maxcpus=1 does not work, with or without the patch. I can upload the crash's call trace if that's helpful.
(In reply to Varun Koyyalagunta from comment #24)
> > But please test both with and without your invlpg patch.
>
> I ran into some system stability issues, so this took a while to test. But
> for me, it looks like maxcpus=1 does not work, with or without the patch. I
> can upload the crash's call trace if that's helpful.

OK, so the maxcpus=1 approach might not be stable. Anyway, could you check whether this patch works for you? It is an improvement over comment #16, to reduce the risk of cache conflicts.
Created attachment 220081 [details]
zero page with least cacheline access
(In reply to Chen Yu from comment #26)
> Created attachment 220081 [details]
> zero page with least cacheline access

I'll try this out. But I noticed this patch keeps the clflush. With regard to cache contention: on all the parts I've tested, a clflush will trigger the monitor. So with more than one non-boot CPU, they could keep waking each other up with clflushes when they share the same monitor address. This wasn't a problem with the original code because each CPU had a different monitor address.

Another side note: I'm not sure what purpose the wbinvd and memory barriers serve here.
(In reply to Varun Koyyalagunta from comment #27)
> (In reply to Chen Yu from comment #26)
> > Created attachment 220081 [details]
> > zero page with least cacheline access
>
> I'll try this out. But I noticed this patch keeps the clflush. With regard
> to cache contention: on all the parts I've tested, a clflush will trigger
> the monitor. So with more than one non-boot CPU, they could keep waking each
> other up with clflushes when they share the same monitor address. This
> wasn't a problem with the original code because each CPU had a different
> monitor address.

So, do you mean this ping-pong scenario:
1. CPU1 waits at page0.
2. CPU2 clflushes page0, waking CPU1 up, then CPU2 waits at page0.
3. CPU1 is woken up and clflushes page0, thus waking CPU2 up again.
Then the non-boot CPUs never sleep for long.

I'm not sure whether clflush would wake the CPU from mwait, since I checked the SDM and only a 'store' to the address is said to wake the CPU. But if it is true, then I think we should either allocate a new 'zero page' and have each CPU monitor a different part of that page, or use hlt directly. Could you also confirm whether comment #21 works for you?

> Another side note: I'm not sure what purpose the wbinvd and memory barriers
> serve here.

Not sure, I just left them there, since the original git log said the mb() does no harm...
Comment on attachment 220081 [details]
zero page with least cacheline access

This patch is incorrect.
Created attachment 220241 [details]
v2-zero page with least cacheline access
Comment #21 and comment #30 are two candidate solutions; let's see whether both of them work and then choose one as the final solution.
> So, do you mean this ping-pong scenario:
> 1. CPU1 waits at page0.
> 2. CPU2 clflushes page0, waking CPU1 up, then CPU2 waits at page0.
> 3. CPU1 is woken up and clflushes page0, thus waking CPU2 up again.
> Then the non-boot CPUs never sleep for long.

Yes. I tried the following test:
CPU0 mwaits on page0
CPUX clflushes page0
where I varied the X in CPUX over {1,2,3}. On a Haswell, Skylake, and Sandy Bridge, CPU0 was woken up by the clflush.

> I'm not sure whether clflush would wake the CPU from mwait, since I checked
> the SDM and only a 'store' to the address is said to wake the CPU.

One possible implementation of a clflush is a read-invalidate snoop, which is what a store might look like.

> I think we should either allocate a new 'zero page' and have each CPU
> monitor a different part of that page,

I don't think you need to allocate a new page; just use a different cacheline in the page for each cpu, e.g. 'mwait_ptr = empty_zero_page + 64 * cpu_number'.

> Could you also confirm whether comment #21 works for you?

I will try it.
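Something like this is what I have in mind (a hypothetical sketch; smp_processor_id() stands in for "cpu_number" above, and the modulo is only needed if there are more CPUs than cache lines in the page):

        /* hypothetical: give each dying CPU its own cache line in the zero
         * page, so one CPU's clflush/monitor cannot wake a different CPU */
        void *mwait_ptr = (void *)empty_zero_page +
                (smp_processor_id() % (PAGE_SIZE / L1_CACHE_BYTES)) * L1_CACHE_BYTES;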
> > I think we should either allocate a new 'zero page' and have each CPU
> > monitor a different part of that page,
>
> I don't think you need to allocate a new page; just use a different
> cacheline in the page for each cpu, e.g. 'mwait_ptr = empty_zero_page + 64 *
> cpu_number'.

Forgot to mention the easier solution: remove the clflush. I don't think it's necessary.
It's an erratum fix for certain CPUs.
The erratum mentioned in the comments:

> AAI65. MONITOR/MWAIT May Have Excessive False Wakeups
>
> Problem: Normally, if MWAIT is used to enter a C-state that is C1 or higher,
> a store to the address range armed by the MONITOR instruction will cause the
> processor to exit MWAIT. Due to this erratum, false wakeups may occur when
> the monitored address range was recently written prior to executing the
> MONITOR instruction.
>
> Implication: Due to this erratum, performance and power savings may be
> impacted due to excessive false wakeups.
>
> Workaround: Execute a CLFLUSH Instruction immediately before every MONITOR
> instruction when the monitored location may have been recently written.

It seems like it would be safe to take out the clflush, as the suggested monitor address in empty_zero_page should not have been written to recently. Also, I think the continuous ping-ponging of false wakeups from doing the clflush would be a bigger power-savings problem.

If we still want the clflush, it can probably be moved outside the monitor-mwait loop. That will still cause a spurious wakeup, but it stops the continuous ping-ponging.
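In other words, something along these lines (a sketch only, not a tested patch; it keeps the erratum workaround but issues the flush once, before the monitor loop):

        /* sketch: flush the monitored line once before entering the loop,
         * instead of on every pass; at most one spurious wakeup, and no
         * clflush ping-pong between dead CPUs afterwards */
        clflush(mwait_ptr);
        while (1) {
                mb();
                __monitor(mwait_ptr, 0, 0);
                mb();
                __mwait(eax, 0);
        }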
(In reply to Varun Koyyalagunta from comment #35)
> It seems like it would be safe to take out the clflush, as the suggested
> monitor address in empty_zero_page should not have been written to recently.
> Also, I think the continuous ping-ponging of false wakeups from doing the
> clflush would be a bigger power-savings problem.

Sounds reasonable. Monitoring a different address per CPU in the zero page would be compatible with the current semantics. The reason I was thinking of allocating a new zero page and monitoring different addresses is:

1. If we monitor different addresses, there is no need to remove the clflush.

2. As HPA mentioned in https://patchwork.kernel.org/patch/9158227/:
> One implementation of MWAIT is for the waiting processor to claim the
> line in an exclusive state, which would be highly undesirable for the
> zero page.
So I'm afraid that an offline CPU putting a cacheline of the zero page into the exclusive state might impact the performance of other online CPUs that read the zero page frequently. To get rid of this, we could just put another suspend-only zero page behind the existing zero page.

3. If we want to monitor different addresses in the current zero page, then the number of CPUs is limited to PAGE_SIZE/L1_CACHE_LINE, which is 64 on x86_64 and not enough for servers, so more zero pages might be required.

BTW, since hlt forces the CPU into C1, if the power consumption is not much higher than the current mwait solution during hibernation, hlt would be a simpler solution.
(In reply to Chen Yu from comment #36)
> 2. As HPA mentioned in https://patchwork.kernel.org/patch/9158227/:
> > One implementation of MWAIT is for the waiting processor to claim the
> > line in an exclusive state, which would be highly undesirable for the
> > zero page.
> So I'm afraid that an offline CPU putting a cacheline of the zero page into
> the exclusive state might impact the performance of other online CPUs that
> read the zero page frequently.

The monitor implementations I'm familiar with keep the line in a shared state. If an exclusive state were used, then when multiple cpus are monitor-mwaiting the same address, only one could have it exclusive at a time, so they'd continuously wake each other up as each tries to keep its monitored line exclusive. I can't be sure what all real processors do, but I don't think any reasonable power-saving processor would implement mwait that way.
(In reply to Chen Yu from comment #36)
> 3. If we want to monitor different addresses in the current zero page, then
> the number of CPUs is limited to PAGE_SIZE/L1_CACHE_LINE, which is 64 on
> x86_64 and not enough for servers, so more zero pages might be required.

Yeah, things will get hairy if we need a line per processor.

> BTW, since hlt forces the CPU into C1, if the power consumption is not much
> higher than the current mwait solution during hibernation, hlt would be a
> simpler solution.

I'm still testing the hlt patch; I'll get back to you on that. As for power consumption, it all depends on how long these processors need to be dead. Hlt is fine for short times. And hlt would even save power over a deep-sleep mwait if the cpu is woken up right away, since going down into C5/C6 and waking up from it takes energy and time.
The hlt patch is working for me.
Actually, the hlt patch missed one scenario: if some CPUs were already offline before the hibernation, those CPUs will remain in mwait even with the hlt patch applied, causing the same issue during page restore. So here is a series of patches to deal with that situation; hope this one also works for you.
Created attachment 220951 [details]
0001-fix-frozen-cpus-overwrite [PATCH][1/4]
Created attachment 220961 [details]
0002-add-arch-disable-nonboot [PATCH][2/4]
Created attachment 220971 [details]
0003-resume-flag [PATCH][3/4]
Created attachment 220981 [details]
0004-using-hlt-and-kick-offlinecpus [PATCH][4/4]
(In reply to Chen Yu from comment #40)
> So here is a series of patches to deal with that situation; hope this one
> also works for you.

This works for me too.
FYI, I'm using kernel 4.3.0-rc3. I've tried some newer kernels, but I ran into strange hibernation problems, so for the purposes of this bug I've been testing on the older kernel. The signature of __cpu_up/down has changed since 4.3, so I modified your patch slightly to match the older function signature.
(In reply to Varun Koyyalagunta from comment #45)
> (In reply to Chen Yu from comment #40)
> > So here is a series of patches to deal with that situation; hope this one
> > also works for you.
>
> This works for me too.
> FYI, I'm using kernel 4.3.0-rc3. I've tried some newer kernels, but I ran
> into strange hibernation problems, so for the purposes of this bug I've been
> testing on the older kernel. The signature of __cpu_up/down has changed
> since 4.3, so I modified your patch slightly to match the older function
> signature.

Newer kernels have a problem since commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata), because it splits the text mapping into dynamically allocated page tables, which can break the hibernation resume process. We are currently looking for a solution.

Thanks for your feedback; I'll send the patchset out soon.
Latest patch at:
https://patchwork.kernel.org/patch/9208541/
https://patchwork.kernel.org/patch/9202321/
Would you please check whether it works?
(In reply to Chen Yu from comment #47)
> Latest patch at:
> https://patchwork.kernel.org/patch/9208541/
> https://patchwork.kernel.org/patch/9202321/
> Would you please check whether it works?

Sorry, please ignore that one, I replied to the wrong thread.
https://patchwork.kernel.org/patch/9217459/
Closing because the patch has been merged:

commit 406f992e4a372dafbe3c2cff7efbb2002a5c8ebd
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Thu Jul 14 03:55:23 2016 +0200

    x86 / hibernate: Use hlt_play_dead() when resuming from hibernation