Bug 106371

Summary: mwait_play_dead faults on invalid monitor address
Product: Power Management Reporter: Varun Koyyalagunta (cpudebug)
Component: Hibernation/SuspendAssignee: Chen Yu (yu.c.chen)
Status: CLOSED CODE_FIX    
Severity: normal CC: aaron.lu, hpa, lenb, rafaeln.dev, rjw, rui.zhang, yu.c.chen
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.19 Subsystem:
Regression: No Bisected commit-id:
Attachments: call graph
trial patch to enable the temporary direct-mapping for non-boot cpus before they die
force nonboot cpus to switch to safe pagetable when restoring page frames
monitor the zero page address
use zero page for monitor address when cpu is offline
Use hlt when disable nonboot CPUs
zero page with least cacheline access
v2-zero page with least cacheline access
0001-fix-frozen-cpus-overwrite
0002-add-arch-disable-nonboot
0003-resume-flag
0004-using-hlt-and-kick-offlinecpus

Description Varun Koyyalagunta 2015-10-20 19:10:15 UTC
Created attachment 190641 [details]
call graph

S4 resume will fail and reboot about 1 out of 100 times. On the Via Nano, we narrowed it down to a double fault in mwait_play_dead(). An Intel Haswell appears to fail similarly, but we did not narrow it all the way down.

from smpboot.c:
  static inline void mwait_play_dead(void)
    ...
    mwait_ptr = &current_thread_info()->flags;
    ..
    while (1) {
      clflush(mwait_ptr); //** <--- double fault occurs here **//
      mb();
      __monitor(mwait_ptr, 0, 0);
      mb();
      __mwait(eax, 0);
      ...
    }
  }

The double fault occurs when the boot cpu is managing the resume (resume_target_kernel()) and the non-boot cpus are in the above loop (from disable_nonboot_cpus()). If a "spurious"-wakeup out of the mwait happens, the non-boot cpu may tablewalk for the mwait_ptr again, but it appears that mwait_ptr is no longer present in the page tables (the boot cpu is in set_up_temporary_mapping() during this time). This causes a pagefault-pagefault-doublefault.

A possible fix is to have mwait_ptr point to somewhere else that stays paged during the resume process. One experiment where we had `mwait_ptr = mwait_play_dead` appeared to fix it. However, that is probably not ideal for performance, as it is pointing to code, and all the cpus will clflush that same address.

Attached is a drawn call graph of the observed failure case.
Comment 1 Aaron Lu 2015-10-21 02:20:49 UTC
    ..
    while (1) {
      clflush(mwait_ptr); //** <--- double fault occurs here **//
      mb();
      __monitor(mwait_ptr, 0, 0);
      mb();
      __mwait(eax, 0);
      ...

Not sure if this is worth to point out, but I saw there is an additional mb() in front of clflush, do you miss it while typing or it's indeed not there in your tree?
Comment 2 Varun Koyyalagunta 2015-10-21 05:29:40 UTC
(In reply to Aaron Lu from comment #1)
> Not sure if this is worth to point out, but I saw there is an additional
> mb() in front of clflush, do you miss it while typing or it's indeed not
> there in your tree?

Sorry, I missed that while typing. The mb() is there.
Comment 3 Aaron Lu 2015-10-21 05:44:57 UTC
Adding the author HPA for comments.
Comment 4 Chen Yu 2016-01-15 12:28:28 UTC
Hi Varun,
(In reply to Varun Koyyalagunta from comment #0)
> 
> The double fault occurs when the boot cpu is managing the resume
> (resume_target_kernel()) and the non-boot cpus are in the above loop (from
> disable_nonboot_cpus()). If a "spurious"-wakeup out of the mwait happens,
> the non-boot cpu may tablewalk for the mwait_ptr again, but it appears that
> mwait_ptr is no longer present in the page tables (the boot cpu is in
> set_up_temporary_mapping() during this time). This causes a
> pagefault-pagefault-doublefault.
> 
thanks for your info, 
I have 2 questions:

1.how do you confirm boot cpu is right in set_up_temporary_mapping when non-boot cpu dies?  since set_up_temporary_mapping is to create a temperary page-table for boot cpu, but not for non-boot cpus, besides at this stage in set_up_temporary_mapping, no page tables will be invalid/replaced by the temperary  one util restore_image is invoked by boot cpu, so set_up_temporary_mapping does not affect the mapping of nonboot cpus.

2. what does 'double fault' mean? is it a string like:
"unable to handle xxx " then panic? Can you please upload the panic log or screenshot?  if possible, please also print the value of mwait_ptr  before __monitor(printk("%p\n", mwait_ptr);, then we can see if the mapping for mwait_ptr has been broken. 

The possible cause of this problm I'm thinking of , might be that, if the boot cpu is doing the copy of pages to their original positions in restore_image, it might breaks the pagetable of non-boot cpus, so if one of the non-boot cpus has been woken up at this time, it might found mwait_ptr not in its own page table, then panic.

thanks,
yu
Comment 5 Varun Koyyalagunta 2016-01-15 18:17:35 UTC
1) Sorry, you are right, the boot cpu is in restore_image, specifically core_restore_code, when the non-boot cpu fails. We confirmed this using our internal cpu hardware debug tools, which create a trace of the processor execution.

2) By double fault, I meant an x86 double fault exception.

    1166   I:0x0001046FD3:0x3BAE0F:"CLFLUSH DS:[RBX]"; // WB:
    1166   D:0x0001046FD3:0x61EF4B;
   // LONG MODE, CPL=0, thread 1, step 0
   // DATA ADDRESS TRANSLATION, virtual address = 0xFFFF8800BA800010
   //   DATA TABLE WALK, virtual address = 0xFFFF8800BA800010
   //   pdptc hit
   pderead    0x009BEBFEA0    0x2D9D9063; // WB: PDE, LOAD
   pderead    0x009BEBFEA4    0x00000000; // WB: PDE, LOAD
   pteread    0x002D9D9000    0x00000000; // WB: PTE, LOAD
   pteread    0x002D9D9004    0x00000000; // WB: PTE, LOAD
   // EXCEPTION 14: PAGE NOT PRESENT, EC=00000h, ADDR=0xFFFF8800BA800010h
   // Load 8-B 0xFFFFFFFFFF5780E0
   //   DATA TABLE WALK, virtual address = 0xFFFFFFFFFF5780E0
   pml4read   0x0001C12FF8    0x01C15067; // WB: PML4, LOAD
   pml4read   0x0001C12FFC    0x00000000; // WB: PML4, LOAD
   pdptread   0x0001C15FF8    0x01C17067; // WB: PDPT, LOAD
   pdptread   0x0001C15FFC    0x00000000; // WB: PDPT, LOAD
   pderead    0x0001C17FD0    0x01C18067; // WB: PDE, LOAD
   pderead    0x0001C17FD4    0x00000000; // WB: PDE, LOAD
   pteread    0x0001C18BC0    0x01E8D161; // WB: PTE, LOAD
   pteread    0x0001C18BC4    0x80000000; // WB: PTE, LOAD
   memread    0x0001E8D0E0    0x0010E940; // WB: DESCRIPTOR, IDT, LOAD
   memread    0x0001E8D0E4    0x81788E00; // WB: DESCRIPTOR, IDT, LOAD
   // Load 8-B 0xFFFFFFFFFF5780E8
   memread    0x0001E8D0E8    0xFFFFFFFF; // WB: DESCRIPTOR, IDT, LOAD
   memread    0x0001E8D0EC    0x00000000; // WB: DESCRIPTOR, IDT, LOAD
   // Load 8-B 0xFFFF8800BFC89010
   memread    0x00BFC89010    0x0000FFFF; // WB: DESCRIPTOR, GDT, LOAD
   memread    0x00BFC89014    0x00AF9B00; // WB: DESCRIPTOR, GDT, LOAD
   // Store 8-B 0xFFFF8800BA803E88
   // DATA ADDRESS TRANSLATION, virtual address = 0xFFFF8800BA803E88
   //   DATA TABLE WALK, virtual address = 0xFFFF8800BA803E88
   pml4read   0x0001C12880    0x01FD2067; // WB: PML4, STORE
   pml4read   0x0001C12884    0x00000000; // WB: PML4, STORE
   pdptread   0x0001FD2010    0x9BEBF063; // WB: PDPT, STORE
   pdptread   0x0001FD2014    0x00000000; // WB: PDPT, STORE
   pderead    0x009BEBFEA0    0x2D9D9063; // WB: PDE, STORE
   pderead    0x009BEBFEA4    0x00000000; // WB: PDE, STORE
   pteread    0x002D9D9018    0x00000000; // WB: PTE, STORE
   pteread    0x002D9D901C    0x00000000; // WB: PTE, STORE
   // EXCEPTION 14: PAGE NOT PRESENT, EC=00002h, ADDR=0xFFFF8800BA803E88h
   // EXCEPTION 8: DOUBLE FAULT
   ...
Comment 6 Varun Koyyalagunta 2016-01-15 18:27:13 UTC
Oops, I accidentally hit reply with just half the message.

That CLFLUSH above is from our execution trace on core 1. It is the CLFLUSH before the MONITOR/MWAIT. The trace shows the double fault exception happening after two bad page faults. The mwait_ptr is in RBX, which has value 0xFFFF8800BA800010.

We don't have the panic log, as I think it did not contain useful information. We may have a screenshot, but it isn't on hand at the moment.
Comment 7 Chen Yu 2016-01-16 03:43:51 UTC
(In reply to Varun Koyyalagunta from comment #6)
> Oops, I accidentally hit reply with just half the message.
> 
> That CLFLUSH above is from our execution trace on core 1. It is the CLFLUSH
> before the MONITOR/MWAIT. The trace shows the double fault exception
> happening after two bad page faults. The mwait_ptr is in RBX, which has
> value 0xFFFF8800BA800010.
> 
> We don't have the panic log, as I think it did not contain useful
> information. We may have a screenshot, but it isn't on hand at the moment.

OK this value 0xFFFF8800BA800010 is already very valuable, since this is a direct-mapping address in kernel pagetable, I think it is what I suspect at #Comment 4, the page table cpu1 is using has been overwritten. Currently one of the solutions for this problem can be that, to leverage nonboot cpus also use the same temorary page table as the boot cpu. It might need some assembly code to be added before non boot cpus go to mwait_play_dead. I'll try..
Comment 8 Chen Yu 2016-01-17 09:59:36 UTC
Created attachment 200111 [details]
trial patch to enable the temporary direct-mapping for non-boot cpus before they die

Please help check if the patch make any sense? thanks
Comment 9 Chen Yu 2016-01-18 02:10:40 UTC
please do not apply above patch since there is one bug in it. I'll do more investigation and reply later.
Comment 10 Chen Yu 2016-01-18 09:09:44 UTC
BTW, the reason why cpu1 is woken up by "spurious" is because that, cpu0 right wrote a value(when restoring pages) to the address of monitor address mwait_ptr.
Comment 11 Chen Yu 2016-01-24 16:15:21 UTC
Created attachment 201321 [details]
force nonboot cpus to switch to safe pagetable when restoring page frames

Hi,
please help try if this patch works for you? 
Yu
Comment 12 Varun Koyyalagunta 2016-01-28 19:19:57 UTC
(In reply to Chen Yu from comment #11)
> Created attachment 201321 [details]
> force nonboot cpus to switch to safe pagetable when restoring page frames
> 
> Hi,
> please help try if this patch works for you? 
> Yu

The patch is not working for me. I'm working off of 4.3.0-rc3. I made one change to mwait_play_dead - I replaced the '__mwait(...)' with an 'invlpg mwait_ptr'. This makes the failure happen very quickly, as the non-boot cpus never go to sleep and are constantly doing a tablewalk for the mwait_ptr.

With the invlpg and the above patch, on a haswell, I ran 'pm-hibernate'. The OS crashes and reboots usually within 1 to 2 iterations of 'pm-hibernate'.

With the same setup, I tried the invlpg change with the workaround of having mwait_ptr point to code - 'mwait_ptr = mwait_play_dead'. This ran for 200 iterations of pm-hibernate without crashing.
Comment 13 Chen Yu 2016-01-29 02:03:18 UTC
(In reply to Varun Koyyalagunta from comment #12)
> (In reply to Chen Yu from comment #11)
> > Created attachment 201321 [details]
> > force nonboot cpus to switch to safe pagetable when restoring page frames
> > 
> > Hi,
> > please help try if this patch works for you? 
> > Yu
> 
> The patch is not working for me. I'm working off of 4.3.0-rc3. I made one
> change to mwait_play_dead - I replaced the '__mwait(...)' with an 'invlpg
> mwait_ptr'. This makes the failure happen very quickly, as the non-boot cpus
> never go to sleep and are constantly doing a tablewalk for the mwait_ptr.
> 
> With the invlpg and the above patch, on a haswell, I ran 'pm-hibernate'. The
> OS crashes and reboots usually within 1 to 2 iterations of 'pm-hibernate'.
> 
> With the same setup, I tried the invlpg change with the workaround of having
> mwait_ptr point to code - 'mwait_ptr = mwait_play_dead'. This ran for 200
> iterations of pm-hibernate without crashing.

Cool, thanks for your method to reproduce it. The reason why set mwait_ptr = mwait_play_dead  solve this problem might be that,  mwait_play_dead is a kernel text mapping area, while current_thread is located in dynamically allocated address space. But I still wonder if the dynamically allocated address space for nonboot cpus would be overlapped, if it does, will nonboot cpu crash too?
Comment 14 Chen Yu 2016-06-05 16:05:54 UTC
After discussion internally, we decide to change the monitor address to zero, because of the following reasons:

1. The cause of this panic is because of the inconsistence of memory layout during hibernation, which means, the mapping for dynamic page frames has changed, and when the boot CPU tries to copy the data to one page frame, it wakes up the nonboot CPU incorrectly because the nonboot CPUs are just monitoring this address, and since the memory layout has changed and the page tables for this address are not the same before/after hibernation, the violation of address access is triggered.


2.Yes, as Varun suggested before, we can change the mwait_ptr  to a kernel text address, since according to the semantics of hibernation, the kernel text remains the same across hibernation, thus the page tables also remain the same, we will not trigger any exception even nonboot CPUs are woken up when boot cpu is writing data to text section.

3.A more graceful solution is to set the monitor to zero, thus the boot CPU will not wake up the nonboot CPUs incorrectly, even if the nonboot CPUs are woken up, they can fall asleep again quickly.

here's the proposal patch, do you have a chance to try? thanks/
Comment 15 Chen Yu 2016-06-05 16:06:47 UTC
Created attachment 219111 [details]
monitor the zero page address
Comment 16 Chen Yu 2016-06-06 03:35:02 UTC
Created attachment 219161 [details]
use zero page for monitor address when cpu is offline
Comment 17 Varun Koyyalagunta 2016-06-07 23:15:00 UTC
(In reply to Chen Yu from comment #16)
> Created attachment 219161 [details]
> use zero page for monitor address when cpu is offline

This worked for me on a haswell. I ran 500 hibernations of the patch.
I also ran 1000 hibernations where I changed the mwait to an invlpg, so the cpu never goes to sleep and the monitor would constantly tablewalk. That also worked.
Comment 18 Chen Yu 2016-06-08 03:48:23 UTC
(In reply to Varun Koyyalagunta from comment #17)
> (In reply to Chen Yu from comment #16)
> > Created attachment 219161 [details]
> > use zero page for monitor address when cpu is offline
> 
> This worked for me on a haswell. I ran 500 hibernations of the patch.
> I also ran 1000 hibernations where I changed the mwait to an invlpg, so the
> cpu never goes to sleep and the monitor would constantly tablewalk. That
> also worked.

Thank you very much.Would you please also test with the following method, without any patch applied:
1. first boot the kernel normally
2. after boot into the kernel, edit the default grub option, to add a 'maxcpus=1'
   and update-grub, (in this way, the resume kernel does not have to offline the nonboot CPUs during resume)
3. do a stress test on hibernation to see if there is any problem.
(Although the resume kernel only have one boot CPU, but after resumed , all nonboot CPUs should be woken up.)

I used the following script FYI:

#!/bin/bash
echo platform > /sys/power/disk
echo none > /sys/power/pm_test

iteration=1
while [ $iteration != 101 ]
        do
                echo
                echo $iteration

                rtcwake -m disk -s 60
                sleep 30

                iteration=$(($iteration + 1))
done
Comment 19 Chen Yu 2016-06-08 03:51:40 UTC
(In reply to Chen Yu from comment #18)
> (In reply to Varun Koyyalagunta from comment #17)
> > (In reply to Chen Yu from comment #16)
> > > Created attachment 219161 [details]
> > > use zero page for monitor address when cpu is offline
> > 
> > This worked for me on a haswell. I ran 500 hibernations of the patch.
> > I also ran 1000 hibernations where I changed the mwait to an invlpg, so the
> > cpu never goes to sleep and the monitor would constantly tablewalk. That
> > also worked.
> 
> Thank you very much.Would you please also test with the following method,
> without any patch applied:
But please test both with and without your invlpg patch.
Comment 20 Chen Yu 2016-06-13 11:57:52 UTC
A more straightforward solution is to use HLT in the "play dead" loops invoked by
disable_nonboot_cpus() in general, and it shouldn't matter energy-consumption.
Comment 21 Chen Yu 2016-06-13 11:58:54 UTC
Created attachment 219731 [details]
Use hlt when disable nonboot CPUs
Comment 22 Chen Yu 2016-06-13 12:17:38 UTC
Varun, could you please check if #Comment 21 work for you?(with invlpg applied) thanks!
Comment 23 Chen Yu 2016-06-14 00:23:16 UTC
We'll provide another solution based on monitor on last cache line of page zero later.
Comment 24 Varun Koyyalagunta 2016-06-14 20:44:01 UTC
> But please test both with and without your invlpg patch.
I ran into some system stability issues, so this took a while to test. But for me, it looks like with or without the patch, maxcpus=1 does not work. I can upload the crash's call trace if that's helpful.
Comment 25 Chen Yu 2016-06-15 11:14:15 UTC
(In reply to Varun Koyyalagunta from comment #24)
> > But please test both with and without your invlpg patch.
> I ran into some system stability issues, so this took a while to test. But
> for me, it looks like with or without the patch, maxcpus=1 does not work. I
> can upload the crash's call trace if that's helpful.

ok, so the solution of maxcpus=1  might not be stable.
Anyway, could you help check if this patch work for you? And this is a improvement for #Comment 16, to reduce the risk of cache confliction.
Comment 26 Chen Yu 2016-06-15 11:16:01 UTC
Created attachment 220081 [details]
zero page with least cacheline access
Comment 27 Varun Koyyalagunta 2016-06-15 16:20:18 UTC
(In reply to Chen Yu from comment #26)
> Created attachment 220081 [details]
> zero page with least cacheline access

I'll try this out. But I noticed this patch keeps the clflush. With regards to cache contention, on all the parts I've tested, a clflush will trigger the monitor. So with more than one nonboot CPU, they could keep waking each other up with clflushes when they share the same monitor address. This wasn't a problem with the original code because each CPU had a different monitor address.

Another side note - I'm not sure what purpose the wbinvd and memory barriers serve here.
Comment 28 Chen Yu 2016-06-16 02:05:19 UTC
(In reply to Varun Koyyalagunta from comment #27)
> (In reply to Chen Yu from comment #26)
> > Created attachment 220081 [details]
> > zero page with least cacheline access
> 
> I'll try this out. But I noticed this patch keeps the clflush. With regards
> to cache contention, on all the parts I've tested, a clflush will trigger
> the monitor. So with more than one nonboot CPU, they could keep waking each
> other up with clflushes when they share the same monitor address. This
> wasn't a problem with the original code because each CPU had a different
> monitor address.

So, do you mean this ping-pong scenario:
1. CPU1 wait at page0
2. CPU2 cflush page0, wake CPU1 up, the CPU2 wait at page0
3. CPU1 is woken up, and invoke cflush page0, thus wake up CPU2 again.
then the nonboot CPUs never sleep for long.

I'm not sure if cflush would wake up the CPU from mwait, since I checked in the SDM, only ''store'' operation to the address will wake up the CPU, but if it true, then
I think we either should allocate a new 'zero page' and monitor the different part in this page, or we should use hlt directly, could you also confirm if #Comment 21
work for you?
> 
> Another side note - I'm not sure what purpose the wbinvd and memory barriers
> serve here.
Not sure, just leave it there as the original git log said mb() make no harm ...
Comment 29 Chen Yu 2016-06-16 05:59:04 UTC
Comment on attachment 220081 [details]
zero page with least cacheline access

This patch is incorrect.
Comment 30 Chen Yu 2016-06-16 08:56:32 UTC
Created attachment 220241 [details]
v2-zero page with least cacheline access
Comment 31 Chen Yu 2016-06-16 08:57:39 UTC
#Comment 21 and #Comment 30 are two solutions, let's see if both of them work and choose one of them as final solution.
Comment 32 Varun Koyyalagunta 2016-06-16 17:34:54 UTC
> So, do you mean this ping-pong scenario:
> 1. CPU1 wait at page0
> 2. CPU2 cflush page0, wake CPU1 up, the CPU2 wait at page0
> 3. CPU1 is woken up, and invoke cflush page0, thus wake up CPU2 again.
> then the nonboot CPUs never sleep for long.

Yes. I tried the following test:

CPU0 mwaits on page0
CPUX clflushes page0

Where I varied the X in CPUX in {1,2,3}.
On a haswell, skylake, and sandybridge, CPU0 was woken up by the clflush.

> I'm not sure if clflush would wake up the CPU from mwait, since I checked in
> the SDM, only ''store'' operation to the address will wake up the CPU, but

One possible implementation of a clflush is a read-invalidate snoop, which is what a store might look like.

> I think we either should allocate a new 'zero page' and monitor the
> different part in this page,

I don't think you need to allocate a new page, but just use a different cacheline in the page for each cpu, eg 'mwait_ptr = empty_zero_page + 64 * cpu_number'

> could you also confirm if #Comment 21 work for you?

I will try it.
Comment 33 Varun Koyyalagunta 2016-06-16 17:37:41 UTC
> > I think we either should allocate a new 'zero page' and monitor the
> > different part in this page,
> 
> I don't think you need to allocate a new page, but just use a different
> cacheline in the page for each cpu, eg 'mwait_ptr = empty_zero_page + 64 *
> cpu_number'
> 

Forgot to mention the easier solution - remove the clflush. I don't think it's necessary.
Comment 34 H. Peter Anvin 2016-06-16 19:16:11 UTC
It's an erratum fix for certain CPUs.
Comment 35 Varun Koyyalagunta 2016-06-16 20:19:58 UTC
The erratum mentioned in the comments -
> AAI65. MONITOR/MWAIT May Have Excessive False Wakeups
>
> Problem: Normally, if MWAIT is used to enter a C-state that is C1 or higher,
> a store to the address range armed by the MONITOR instruction will cause the
> processor to exit MWAIT. Due to this erratum, false wakeups may occur when
> the monitored address range was recently written prior to executing the
> MONITOR instruction.
>
> Implication: Due to this erratum, performance and power savings may be
> impacted due to excessive false wakeups.
>
> Workaround: Execute a CLFLUSH Instruction immediately before every MONITOR
> instruction when the monitored location may have been recently written.

It seems like it would be safe to take out the clflush, as the suggested
monitor address in empty_zero_page should not have been written to recently.
Also, I think the continous ping-ponging of false wakeups from doing the
clflush would be a bigger power savings problem.

If we still want the clflush, it can probably be moved outside the
monitor-mwait loop. That will still cause a spurious wakeup, but stop the
continuous ping-ponging.
Comment 36 Chen Yu 2016-06-17 02:42:03 UTC
(In reply to Varun Koyyalagunta from comment #35) 
> It seems like it would be safe to take out the clflush, as the suggested
> monitor address in empty_zero_page should not have been written to recently.
> Also, I think the continous ping-ponging of false wakeups from doing the
> clflush would be a bigger power savings problem.
> 
Sounds reasonable.

Monitor different address in a zero page would be compatible with current semantics. The reason why I was thinking of allocate new zero page and monitor different address is because:

1. if we monitor different address, there is no need to remove cflush.

2. As HPA mentioned in https://patchwork.kernel.org/patch/9158227/:
> One implementation of MWAIT is for the waiting processor to claim the
> line in an exclusive state, which would be highly undesirable for the
> zero page.
so I'm afraid the setting of cacheline in zero page as exclusive state by offline CPU, might impact the performance of other online CPUs who read zero page frequently, in order to get ride of this, just put another suspend-used-only zero page behind zero page.

3.If we want to monitor different address in current zero page, then the CPU number is limited to  PAGE_SIZE/L1_CACHE_LINE, which is 64 on a x86_64, which is not enough for servers, so maybe more zero pages are required.


BTW, since hlt force the CPU enter C1, if the power consumption is not much higher than current mwait solution when doing hibernation, thus hlt would be a simpler solution.
Comment 37 Varun Koyyalagunta 2016-06-17 17:40:29 UTC
(In reply to Chen Yu from comment #36)
> 2. As HPA mentioned in https://patchwork.kernel.org/patch/9158227/:
> > One implementation of MWAIT is for the waiting processor to claim the
> > line in an exclusive state, which would be highly undesirable for the
> > zero page.
> so I'm afraid the setting of cacheline in zero page as exclusive state by
> offline CPU, might impact the performance of other online CPUs who read zero
> page frequently

The monitor implementations I'm familiar with keep the line in a shared state. If an exclusive state was used, when multiple cpus are monitor-mwaiting the same address, only one could have it exclusive at at time. So they'd continuously wake up each other as each tries to keep its monitored line exclusive. I can't be sure what all real processors do, but I don't think any reasonable power-saving processor would implement mwait that way.
Comment 38 Varun Koyyalagunta 2016-06-17 17:53:13 UTC
(In reply to Chen Yu from comment #36)
> 3.If we want to monitor different address in current zero page, then the CPU
> number is limited to  PAGE_SIZE/L1_CACHE_LINE, which is 64 on a x86_64,
> which is not enough for servers, so maybe more zero pages are required.

Yeah, things will get hairy if we need a line per processor.

> BTW, since hlt force the CPU enter C1, if the power consumption is not much
> higher than current mwait solution when doing hibernation, thus hlt would be
> a simpler solution.

I'm still testing the hlt patch. I'll get back to you on that. As for power consumption, it all depends on how long these processors need to be dead for. Hlt is fine for short times. And hlt would even be more of a power savings over the deep sleep mwait if the cpu is woken up right away, as going down into a C5/C6 and waking up from it does take energy and time.
Comment 39 Varun Koyyalagunta 2016-06-20 16:42:54 UTC
The hlt patch is working for me.
Comment 40 Chen Yu 2016-06-22 10:10:16 UTC
Actually the hlt patch missed one scenario, that is, if some CPUs are offline before the hibernation, these CPUs will remain in mwait even with hlt patch applied,  thus cause the same issue when doing page restore.

So here's a series of patch to deal with situation, hope this one also work for you.
Comment 41 Chen Yu 2016-06-22 10:12:30 UTC
Created attachment 220951 [details]
0001-fix-frozen-cpus-overwrite

[PATCH][1/4]
Comment 42 Chen Yu 2016-06-22 10:13:06 UTC
Created attachment 220961 [details]
0002-add-arch-disable-nonboot

[PATCH][2/4]
Comment 43 Chen Yu 2016-06-22 10:13:33 UTC
Created attachment 220971 [details]
0003-resume-flag

[PATCH][3/4]
Comment 44 Chen Yu 2016-06-22 10:14:03 UTC
Created attachment 220981 [details]
0004-using-hlt-and-kick-offlinecpus

[PATCH][4/4]
Comment 45 Varun Koyyalagunta 2016-06-24 15:16:05 UTC
(In reply to Chen Yu from comment #40)
> So here's a series of patch to deal with situation, hope this one also work
> for you.
This works for me too.
FYI, I'm using kernel 4.3.0-rc3. I've tried some newer kernels, but I ran into strange hibernation problems, so for the purposes of this bug I've been testing on the older kernel. The signature of __cpu_up/down has changed since 4.3, so I modified your patch slightly to match the older function signature.
Comment 46 Chen Yu 2016-06-25 01:03:33 UTC
(In reply to Varun Koyyalagunta from comment #45)
> (In reply to Chen Yu from comment #40)
> > So here's a series of patch to deal with situation, hope this one also work
> > for you.
> This works for me too.
> FYI, I'm using kernel 4.3.0-rc3. I've tried some newer kernels, but I ran
> into strange hibernation problems, so for the purposes of this bug I've been
> testing on the older kernel. The signature of __cpu_up/down has changed
> since 4.3, so I modified your patch slightly to match the older function
> signature.

Nwer kernel has some problems since the commit of 
commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata)
because it split text mapping into dynamic allocated page tables, which might broken hibernation resume process.Currently we are seeking for solutions. Thanks for your feedback, I'll send the patchset out soon.
Comment 47 Chen Yu 2016-07-01 04:59:19 UTC
Latest patch at:
https://patchwork.kernel.org/patch/9208541/
https://patchwork.kernel.org/patch/9202321/
Would you please help check if it works?
Comment 48 Chen Yu 2016-07-01 04:59:58 UTC
(In reply to Chen Yu from comment #47)
> Latest patch at:
> https://patchwork.kernel.org/patch/9208541/
> https://patchwork.kernel.org/patch/9202321/
> Would you please help check if it works?

Sorry please ignore this one, I replied to the wrong thread.
Comment 50 Chen Yu 2016-08-01 03:21:15 UTC
Close because patch has been merged:

commit 406f992e4a372dafbe3c2cff7efbb2002a5c8ebd
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Thu Jul 14 03:55:23 2016 +0200

    x86 / hibernate: Use hlt_play_dead() when resuming from hibernation