Bug 10927
Summary: | Dell laptops: reboot instead of resume | ||
---|---|---|---|
Product: | Power Management | Reporter: | Kirill A. Shutemov (kirill) |
Component: | Hibernation/Suspend | Assignee: | Rafael J. Wysocki (rjw) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | acpi-bugzilla, bunk, hpa, jason, mingo, pavel, rjw, sam |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | Subsystem: | ||
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 7216, 10492 | ||
Attachments: |
acpidump
Debug patch Debug patch #2 Debug patch #3 Debug patch #4 Debug patch #5 Segment normalization patch Segment normalization patch (corrected) Normalization patch #3 Normalization patch #4 Normalization patch #5 Patch changing the meaning of acpi_wakeup_address Patch changing the meaning of acpi_wakeup_address #2 Modified normalization patch #5 Normalization patch #6 Normalization patch with the acpi_wakeup_address change, #7 |
Description
Kirill A. Shutemov
2008-06-16 07:31:22 UTC
Created attachment 16506 [details]
acpidump
So, commit e44b7b7525ad9d43163ab5e60c784325419e0ea6 Author: Pavel Machek <pavel@suse.cz> Date: Thu Apr 10 23:28:10 2008 +0200 x86: move suspend wakeup code to C Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Ingo Molnar <mingo@elte.hu> is the first bad commit? Yes. Is the system (kernel) 32-bit or 64-bit? 64-bit Can you boot the kernel with "acpi_sleep=s3_beep" in the command line and see if it beeps? It doesn't beep. Thanks. Now, there are two problems possible and I'll have to figure a reliable way of verifying which of them manifests itself. First, it's possible that we are called with a wrong values of CS:IP in the beginning, so the "movl $wakeup_stack_end, %esp" in arch/x86/kernel/acpi/realmode/wakeup.S places the stack at a wrong location. Second, the trampoline segment might be wrong. Would it be possible to try a 32-bit version of the failing kernel on this box? Hi!
> Now, there are two problems possible and I'll have to figure a reliable way
> of
> verifying which of them manifests itself.
>
> First, it's possible that we are called with a wrong values of CS:IP in the
> beginning, so the "movl $wakeup_stack_end, %esp" in
> arch/x86/kernel/acpi/realmode/wakeup.S places the stack at a wrong location.
>
> Second, the trampoline segment might be wrong.
One way to check that is place jmp $ at the begining -- if it hangs
instead of reboot, it reaches the beggining.
Then place jmp $ after %esp setup, and after first push to see how far
it goes...
Pavel
Created attachment 16528 [details]
Debug patch
So, if we comment out the "pushl $0 ; popfl" thing and the CS:IP are not as expected, we should detect a wrong signature and just halt?
Kirill, please apply the attached patch and see if anything changes, thanks.
> Kirill, please apply the attached patch and see if anything changes, thanks.
No. It still reboots.
Created attachment 16529 [details]
Debug patch #2
Thanks for testing.
Please apply this patch instead and see if it changes anything.
I have played with loop inserting. If we insert loop before the "pushl $0 ; popfl" it hangs, but it reboots if we insert after. If we comment out the "pushl $0 ; popfl" it reachs "calll main"(hangs if we insert loop before it), but it reboots then. It hangs with Debug patch #2 Okay, so the stack memory or stack pointer is buggered... (In reply to comment #14) > It hangs with Debug patch #2 This is consistent with your observations in Comment #13. What happens if you comment out the "pushl $0 ; popfl" along with the "calll main", without the debug patch #2? (In reply to comment #15) > Okay, so the stack memory or stack pointer is buggered... Hm, well, why the value of DS is correct in that case? What about SS? My point is, if the value of DS is correct, then the value of SS should be correct too, as we set them both in the same way. What am I missing? Looking at it, *how* do you know DS is actually correct? It seems to me that all we have is the *assumption* that the entrypoint is at CS:0, which isn't inherently a given. (In reply to comment #20) > It seems to me that all we have is the *assumption* that the entrypoint is at > CS:0, which isn't inherently a given. Indeed we assume so, but the signature test should fail if that's not the case and it doesn't. Well, I'm certainly missing something ... How do you know it's not failing? Something that does come to mind is if it at all possible we're not actually in real mode when this code is entered? (In reply to comment #22) > How do you know it's not failing? Because if it had failed, the kernel with the patch from Comment #10 would have hanged, but it hangs with the patch from Comment #12 instead. (In reply to comment #23) > Something that does come to mind is if it at all possible we're not actually > in > real mode when this code is entered? In that case we would get an exception on an attempt to load DS, ES and SS with a code segment selector, I think. No, that's not true at all. Most code segment selectors are also valid data segment selectors, but will trap on a write through that segment. Do we have any way to get data out of this box, e.g. a UART serial port? (In reply to comment #26) > No, that's not true at all. Most code segment selectors are also valid data > segment selectors, but will trap on a write through that segment. Well, that might be going on, then. Still, I wonder why the old code worked in that case. It wasn't so much different AFAICS ... *Grumble* My $LARGE_POLITICAL_ENTITY for an ICE for stuff like this. We need to get some info out. What do we have available?
> Something that does come to mind is if it at all possible we're not actually
> in
> real mode when this code is entered?
Hmm, that would be bad and against the spec. Not sure if that's
possible. Anything else than CS:0 would be against the spec, too,
IIRC.
Now... what is different from the "old" code... the old code did not
need a stack in real mode at all (if it did not need to call video
bios etc) -- so if you commented out pushl 0; popfl, it went back to
protected mode without using stack at all. (And yes, some strange
machines needed that).
I'm not sure what is going on here; one crazy explanation would be
"%rsp has high bits set"...
Hmm, can we get away with staying on whatever stack is provided to us
by BIOS?
Pavel
(In reply to comment #16) > What happens if you comment out the "pushl $0 ; popfl" along with the "calll > main", without the debug patch #2? It reboots on "pushw $0", line 89. (In reply to comment #29) > *Grumble* > > My $LARGE_POLITICAL_ENTITY for an ICE for stuff like this. > > We need to get some info out. What do we have available? 1. We know that the value of DS is correct, so the value of SS should be correct too, because it's the same value, and therefore our assumption that we've been called at CS:0 is satisfied. 2. We know that write accesses to the stack segment cause reboots. Created attachment 16535 [details]
Debug patch #3
Kirill, please apply this patch instead of the debug patch #2 and see if that reboots.
1. We know that the value of DS is correct, so the value of SS should be correct too, because it's the same value, and therefore our assumption that we've been called at CS:0 is satisfied. 2. We know that write accesses to the stack segment cause reboots. --- Assuming that is correct, this means we're either in Protected Mode or we're in real mode, but a transition from protected mode to real mode was botched and there are broken segment attributes bits still in the descriptor registers. The SMSW instruction can be used to tell if we're in protected mode or not. bugme-daemon@bugzilla.kernel.org wrote: > I'm not sure what is going on here; one crazy explanation would be > "%rsp has high bits set"... Impossible, unless we're in *long* mode... -hpa (In reply to comment #33) > Created an attachment (id=16535) [details] > Debug patch #3 > > Kirill, please apply this patch instead of the debug patch #2 and see if that > reboots. No. It hangs. bugme-daemon@bugzilla.kernel.org wrote: > (In reply to comment #33) >> Created an attachment (id=16535) > --> (http://bugzilla.kernel.org/attachment.cgi?id=16535&action=view) > [details] >> Debug patch #3 >> >> Kirill, please apply this patch instead of the debug patch #2 and see if >> that >> reboots. > > No. It hangs. Okay, I'm confused about this patch; what was the intent with it? That was to check if it would trap on an attempt to write to a know location in the data segment. It wouldn't, so it means DS is rather not loaded with a code segment selector and obviously the same applies to SS. Botched tansition from protected mode to real mode seems to be most likely to me at the moment. Is there any chance to test 32-bit kernel? With this applied to 32-bit kernel, we should be able to resume without touching stack at all. (Doing it in 64-bit kernel is not so easy, because we put address on the stack and ret, instead of jmp...) diff --git a/arch/x86/kernel/acpi/realmode/wakeup.S b/arch/x86/kernel/acpi/realmode/wakeup.S index f9b77fb..b91b2ea 100644 --- a/arch/x86/kernel/acpi/realmode/wakeup.S +++ b/arch/x86/kernel/acpi/realmode/wakeup.S @@ -42,22 +42,6 @@ _start: movl $wakeup_stack_end, %esp - /* Clear the EFLAGS */ - pushl $0 - popfl - - /* Check header signature... */ - movl signature, %eax - cmpl $0x51ee1111, %eax - jne bogus_real_magic - - /* Check we really have everything... */ - movl end_signature, %eax - cmpl $0x65a22c82, %eax - jne bogus_real_magic - - /* Call the C code */ - calll main /* Do any other stuff... */ Created attachment 16536 [details]
Debug patch #4
Now, let's try to write to the location pointed to by %sx.
Kirill, please test if it hangs with this patch instead of the previous one.
(In reply to comment #40) > Created an attachment (id=16536) [details] > Debug patch #4 > > Now, let's try to write to the location pointed to by %sx. > > Kirill, please test if it hangs with this patch instead of the previous one. > $ make CHK include/linux/version.h CHK include/linux/utsrelease.h CALL scripts/checksyscalls.sh CHK include/linux/compile.h AS arch/x86/kernel/acpi/realmode/wakeup.o arch/x86/kernel/acpi/realmode/wakeup.S: Assembler messages: arch/x86/kernel/acpi/realmode/wakeup.S:97: Error: `(%sp)' is not a valid base/index expression make[3]: *** [arch/x86/kernel/acpi/realmode/wakeup.o] Памылка 1 make[2]: *** [arch/x86/kernel/acpi/realmode/wakeup.bin] Памылка 2 make[1]: *** [arch/x86/kernel/acpi] Памылка 2 make: *** [arch/x86/kernel] Памылка 2 bugme-daemon@bugzilla.kernel.org wrote: > Created an attachment (id=16536) > --> (http://bugzilla.kernel.org/attachment.cgi?id=16536&action=view) > Debug patch #4 > > Now, let's try to write to the location pointed to by %sx. Uhm, (%sp) is not a valid addressing form. Only %bx, %bp, %si, and %di can be used to form 16-bit addresses; addresses which include %bp are implicitly %ss-based. -hoa Created attachment 16537 [details]
Debug patch #5
My bad, sorry. This one should compile, hopefully.
(In reply to comment #43) > Created an attachment (id=16537) [details] > Debug patch #5 It hangs too. bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=10927 > > > > > > ------- Comment #44 from kirill@shutemov.name 2008-06-18 13:59 ------- > (In reply to comment #43) >> Created an attachment (id=16537) > --> (http://bugzilla.kernel.org/attachment.cgi?id=16537&action=view) > [details] >> Debug patch #5 > > It hangs too. What happens if you change "(%bx)" to "%ss:(%bx)"? -hpa (In reply to comment #45) > What happens if you change "(%bx)" to "%ss:(%bx)"? Reboot. bugme-daemon@bugzilla.kernel.org wrote: > (In reply to comment #45) >> What happens if you change "(%bx)" to "%ss:(%bx)"? > > Reboot. Okay, it's definitely %ss being buggered, and the only thing that can bugger up %ss but not %ds downstream of that code is a botched transition out of protected mode. Rewriting all the real-mode code to not use the stack is obviously not realistic (can't make any BIOS calls that way), but it would be possible to dart into protected mode using only %cs: references to set up a GDT, then setting all the segments to sane, real-mode-like segments, and darting back out. -hpa I wonder, what happens if we load %ss with 0 initially and then with the %cs value? The null selector is invalid in protected mode so that might cause the CPU to wipe out the hidden part of the register ... bugme-daemon@bugzilla.kernel.org wrote: > ------- Comment #48 from rjw@sisk.pl 2008-06-18 14:53 ------- > I wonder, what happens if we load %ss with 0 initially and then with the %cs > value? The null selector is invalid in protected mode so that might cause > the > CPU to wipe out the hidden part of the register ... Absolutely nothing. If you were in protected mode, loading %ds and %ss with the same value would set the attribute bits to the same thing. Since the attribute bits didn't change, you're in real mode, and there are no null selectors in real mode. -hpa Hidden parts should really be refreshed as soon as anything is written there, but perhaps this helps? (Implementation of Rafael's idea) --- a/arch/x86/kernel/acpi/realmode/wakeup.S +++ b/arch/x86/kernel/acpi/realmode/wakeup.S @@ -34,6 +34,11 @@ _start: cli cld + xorw %ax, %ax + movw %ax, %ds + movw %ax, %es + movw %ax, %ss + /* Set up segments */ movw %cs, %ax movw %ax, %ds (In reply to comment #49) Fun, fun. :-( Okay, perhaps I'll be able to write this patch, but since I haven't done anything similar for years, I'll need some assistance. Created attachment 16538 [details]
Segment normalization patch
Segment normalization patch. Try this one on for size.
Created attachment 16539 [details]
Segment normalization patch (corrected)
Correction to segment normalization patch...
bugme-daemon@bugzilla.kernel.org wrote: > Created an attachment (id=16538) > --> (http://bugzilla.kernel.org/attachment.cgi?id=16538&action=view) > Segment normalization patch > > Segment normalization patch. Try this one on for size. Don't use this one... minor thinko. -hpa Note: if the "corrected" patch works, we should tidy it up to minimize the dependencies. In particular, if we let the kernel pre-patch the segmentation information, we can get rid of any dependency on writes. We can reduce it to requiring only that one segment is readable; it's not 100% clear which one to use -- DS,ES,FS,GS could be null, SS could have too short of a limit, and CS could be non-readable. CS is still probably the safest one, though. (In reply to comment #53) > Created an attachment (id=16539) [details] > Segment normalization patch (corrected) > > Correction to segment normalization patch... > $ make CHK include/linux/version.h CHK include/linux/utsrelease.h CALL scripts/checksyscalls.sh CHK include/linux/compile.h LD arch/x86/kernel/acpi/realmode/wakeup.elf arch/x86/kernel/acpi/realmode/wakeup.o: In function `wakeup_gdt': arch/x86/kernel/acpi/realmode/wakeup.S:(.data+0x0): undefined reference to `wakeup_gdt_len' make[3]: *** [arch/x86/kernel/acpi/realmode/wakeup.elf] Error 1 make[2]: *** [arch/x86/kernel/acpi/realmode/wakeup.bin] Error 2 make[1]: *** [arch/x86/kernel/acpi] Error 2 make: *** [arch/x86/kernel] Error 2 bugme-daemon@bugzilla.kernel.org wrote: > arch/x86/kernel/acpi/realmode/wakeup.o: In function `wakeup_gdt': > arch/x86/kernel/acpi/realmode/wakeup.S:(.data+0x0): undefined reference to > `wakeup_gdt_len' Just add the line: wakeup_gdt_len = 3*8 ... anywhere in the file... -hpa It works! I have problem with video restore so I have tried to boot with acpi_sleep=s3_bios and acpi_sleep=s3_mode. It reboots in both case. :( acpi_sleep=s3_beep works fine. Is it related to this bug? This is with the patch? Yes. It could be related - clearly the resume code is entered with state scrambled enough that the segment descriptors aren't even sanitized, there is other state that could be in a bizarre state. However, we should almost certainly explicitly set up a real-mode IDT; that might help and is something we can do easily enough. Created attachment 16549 [details]
Normalization patch #3
Please try this patch. It is the segment normalization patch #2 with the addition of explicitly setting the IDT.
(In reply to comment #63) > Created an attachment (id=16549) [details] > Normalization patch #3 > > Please try this patch. It is the segment normalization patch #2 with the > addition of explicitly setting the IDT. > It doesn't reboot, but it hangs with acpi_sleep=s3_bios. s3_mode doesn't hang or reboot system, but screen still black. Well, see suspend.sf.net for video debugging instructions. Unless video resume worked before, this is separate bug. Ok. Will this patch be in 2.6.26? (In reply to comment #63) > Created an attachment (id=16549) [details] > Normalization patch #3 > > Please try this patch. It is the segment normalization patch #2 with the > addition of explicitly setting the IDT. hpa, many thanks for the patch. If I had tried to prepare one, it would have taken much more time. It is a regression fix, so it would be nice to push it for 2.6.26. Will you take care of it or should I push it upstream?
> > Created an attachment (id=16549)
> --> (http://bugzilla.kernel.org/attachment.cgi?id=16549&action=view)
> [details]
> > Normalization patch #3
> >
> > Please try this patch. It is the segment normalization patch #2 with the
> > addition of explicitly setting the IDT.
>
> hpa, many thanks for the patch. If I had tried to prepare one, it would have
> taken much more time.
>
> It is a regression fix, so it would be nice to push it for 2.6.26. Will you
> take care of it or should I push it upstream?
yes, it is a regression fix, but it works around extremely broken
BIOS, and I'd call it slightly risky: we go (cripled?) real mode ->
protected mode -> real mode -> protected mode now.
(Note that even old code would die horribly as soon as you enabled
acpi_sleep=s3_bios/s3_mode...)
I'd prefer this to go into 2.6.27.
(And yes, hpa's assembly-coding skills are wonderful).
I'd like to clean it up first to not require a writable code segment. On Thursday, 19 of June 2008, bugme-daemon@bugzilla.kernel.org wrote: > ------- Comment #68 from pavel@suse.cz 2008-06-19 14:50 ------- > > > > Created an attachment (id=16549) > --> (http://bugzilla.kernel.org/attachment.cgi?id=16549&action=view) > > --> (http://bugzilla.kernel.org/attachment.cgi?id=16549&action=view) > [details] > > > Normalization patch #3 > > > > > > Please try this patch. It is the segment normalization patch #2 with the > > > addition of explicitly setting the IDT. > > > > hpa, many thanks for the patch. If I had tried to prepare one, it would > have > > taken much more time. > > > > It is a regression fix, so it would be nice to push it for 2.6.26. Will > you > > take care of it or should I push it upstream? > > yes, it is a regression fix, but it works around extremely broken > BIOS, and I'd call it slightly risky: we go (cripled?) real mode -> > protected mode -> real mode -> protected mode now. And what the risk is, actually? > (Note that even old code would die horribly as soon as you enabled > acpi_sleep=s3_bios/s3_mode...) > > I'd prefer this to go into 2.6.27. Shall we let the upstream decide, perhaps? (In reply to comment #69) > I'd like to clean it up first to not require a writable code segment. OK BTW, the patch fixes my Dell D620 as well. Handled-By : H. Peter Anvin <hpa@kernel.org> Tested-by: Kirill A. Shutemov <kirill@shutemov.name> Created attachment 16591 [details]
Normalization patch #4
Updated patch which doesn't require a writable code segment, and with proper metadata for submission.
Please test this out, and if it works, please add it to the ACPI tree.
Created attachment 16592 [details]
Normalization patch #5
Erk, fix trivial braindamage
(In reply to comment #77) > Created an attachment (id=16592) [details] > Normalization patch #5 > > Erk, fix trivial braindamage > It doesn't work. My laptop reboots. Okay... wtf... perhaps I'm confused as to the meaning of acpi_wakeup_address? In particular, what variable do I use to get the *physical* address of the wakeup page in low memory? acpi_wakeup_address is a virtual address (virt_to_phys() is applied to it before it's passed to the firmware). And we don't cache the physical address itself anywhere AFAICS. Created attachment 16596 [details]
Patch changing the meaning of acpi_wakeup_address
Please apply this patch additionally and see if that helps.
(In reply to comment #82) > Created an attachment (id=16596) [details] > Patch changing the meaning of acpi_wakeup_address > > Please apply this patch additionally and see if that helps. > No. Still reboots. Created attachment 16598 [details]
Patch changing the meaning of acpi_wakeup_address #2
Please boot with this patch applied instead of the previous one and see what is printed after "ACPI: Wake-up address set to ".
Created attachment 16599 [details] Modified normalization patch #5 This (slightly modified) normalization patch on top of the patch from Comment #84 works correctly on my test box. Kirill, can you test it, please? Yes, it works. Created attachment 16601 [details] Normalization patch #6 Strip trailing whitespace. Tested-by: Kirill A. Shutemov <kirill@shutemov.name> Okay, I'm confused... what changed, other than formatting? Nevermind, see the parenthization error. That'd do it. Rafael: will you push this upstream? (I'd rather it flows to you even though I'll come right back our way.) Created attachment 16604 [details] Normalization patch with the acpi_wakeup_address change, #7 OK Now, because it's necessary to change the meaning of acpi_wakeup_address for the normalization patch to work, I've merged the normalization patch #6 with the patch from Comment #82 and the result is attached. That's the one I'm going to push upstream. I tested it, but Kirill please verify that it works for you. Well, it's not technically necessary, we could to the virt_to_phys() in acpi_save_state_mem() just as well. Whatever. Still, it does make sense to cache the physical address, IMO. (In reply to comment #91) > Created an attachment (id=16604) [details] > Normalization patch with the acpi_wakeup_address change, #7 > > I tested it, but Kirill please verify that it works for you. The patch works properly. Thanks for testing, the patch is in -mm, so hopefully it'll be merged soon. Thanks a lot. Little late in responding, but the latest patch works for a D620. Thank you! marking RESOLVED, as the patch in comment #91 seems to work, (and is queued in the ACPI test tree) Sadly, the patch has been reported (by Ingo Molnar) to cause a regression on ASUS A8N-E. But Ingo said later that the problew was unrelated to this patch. References : http://marc.info/?l=linux-kernel&m=121446903218446&w=4 Okay, then. Then I consider this issue closed. Regressions list annotation: Handled-By : Rafael J. Wysocki <rjw@sisk.pl> Patch : http://bugzilla.kernel.org/attachment.cgi?id=16604&action=view *** Bug 10260 has been marked as a duplicate of this bug. *** |