Bug 11963

Summary: S3: second resume fails unless BIOS "Intel TXT Feature" disabled - Thinkpad W500
Product: Virtualization
Reporter: Nickolai Zeldovich (nickolai)
Component: kvm
Assignee: virtualization_kvm
Status: CLOSED DOCUMENTED
Severity: normal
CC: alan, avi, chihchun, dwmw2, joe3572, johann-nikolaus.andreae, perher, vedran
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.26, 2.6.27-4, 2.6.28-rc3-git
Subsystem:
Regression: No
Bisected commit-id:
Bug Depends on:    
Bug Blocks: 56331    
Attachments: acpidump for Thinkpad W500 with VT-d disabled in BIOS
acpidump for Thinkpad W500 with TXT disabled in BIOS
acpidump for Thinkpad W500 with TXT enabled in BIOS
dmidecode for Thinkpad W500 with TXT disabled in BIOS
dmidecode for Thinkpad W500 with TXT enabled in BIOS
acpidump for Thinkpad W500 with TXT enabled in BIOS, take 2
use the RTC cmos area to track where the suspend/resume hangs
output from dmesg after rebooting after a second resume hang
Dmesg log from booting 2.6.31-rc2 with CONFIG_DMAR
dmesg.log for 2.6.31-rc with CONFIG_DMAR=y and intel_iommu=igfx_off
dmesg log after first resume from kernel image bzImage-x200s

Description Nickolai Zeldovich 2008-11-06 08:59:56 UTC
This laptop fails to resume from a second suspend-to-RAM, although resuming from the first suspend works just fine.

Adding acpi_sleep=s3_beep generates beeps during the first resume, but no beeps during the second resume attempt.  On second resume attempt, the disk light flickers briefly, but the suspend LED (crescent shape) stays solid-on, and the laptop exhibits no other signs of activity.

I tried to enable PM tracing (echo 1 > /sys/power/pm_trace), but since the kernel doesn't even beep the second time around, I don't get any "hash match" dmesg entries on reboot.

This happens in a kernel built from recent git (75fa67706cce5272bcfc51ed646f2da21f3bdb6e).  I disabled SMP, USB, and ATA/SATA in the kernel config (there appears to be an unrelated bug where AHCI cannot talk to the disk even after the first suspend).

What other information would be useful in this case?
Comment 1 Len Brown 2008-11-06 11:35:24 UTC
did this also fail the same way with 2.6.26.stable or 2.6.27.stable?
Comment 2 Nickolai Zeldovich 2008-11-06 11:42:20 UTC
Fails the same way using Ubuntu 8.10's 2.6.27-4 and 2.6.27-7 kernels; never tried 2.6.26 on this machine.  If there are specific prior versions that would be of particular interest in debugging this failure, I could try them out.
Comment 3 Nickolai Zeldovich 2008-11-06 13:16:51 UTC
While trying to figure out that other bug I briefly alluded to (that AHCI doesn't work after the first resume), I noticed that the system is already in a very bad state even after the first resume, so it might not be very surprising that the second suspend goes bad.

In particular, both ahci and e1000e are broken after the first resume.  After enabling AHCI and e1000e in the kernel build, and going into suspend the first time (echo mem > /sys/power/state), ahci prints errors like this on resume:

ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x100)
ata1.00: revalidation failed (errno=-5)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100)
ata2.00: revalidation failed (errno=-5)
[[ 5 seconds later.. ]]
Clocksource tsc unstable (delta = -499994004 ns)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100)
ata2.00: revalidation failed (errno=-5)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x100)
ata1.00: revalidation failed (errno=-5)
[[ 5 seconds later.. ]]
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x100)
ata1.00: revalidation failed (errno=-5)
ata1.00: disabled
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100)
ata2.00: revalidation failed (errno=-5)
ata2.00: disabled
ata2: exception Emask 0x60 SAct 0x0 SErr 0x800 action 0x6 frozen t4
ata2: irq_stat 0x20000000, host bus error
ata2: SError: { HostInt }
ata2: hard resetting link
ata1: exception Emask 0x60 SAct 0x0 SErr 0x800 action 0x6 frozen t4
ata1: irq_stat 0x20000000, host bus error
ata1: SError: { HostInt }
ata1: hard resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1: EH complete
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2: EH complete
sd 0:0:0:0: [sda] START_STOP FAILED
sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
PM: Device 0:0:0:0 failed to resume: error 262144

e1000e hardware doesn't seem to be transmitting for some reason either:

0000:00:19.0: eth0: Detected Tx Unit Hang:
  TDH                  <0>
  TDT                  <3>
  next_to_use          <3>
  next_to_clean        <0>
buffer_info[next_to_clean]:
  time_stamp           <fffc91ef>
  next_to_watch        <0>
  jiffies              <fffc9ba8>
  next_to_watch.status <0>

ifconfig eth0 reports no packets going in or out of the interface, although /proc/interrupts shows interrupts coming in for eth0 periodically.
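
As a rough reading of the Tx Unit Hang dump above (an interpretation, not something the driver prints in this form): TDH and TDT are the transmit descriptor head and tail, so TDT - TDH is the number of descriptors queued that the hardware has not consumed, and jiffies - time_stamp is how long the oldest buffer has been waiting. A minimal shell sketch of that arithmetic, using the values from the dump; the variable names are illustrative, and converting ticks to seconds needs the kernel's HZ setting, which is not given in this report:

```shell
# Values copied from the "Detected Tx Unit Hang" dump.
tdh=0
tdt=3
time_stamp=0xfffc91ef
jiffies=0xfffc9ba8

# Descriptors handed to the NIC but not yet fetched by it.
pending=$(( $tdt - $tdh ))

# Ticks the oldest buffer has been waiting. The mask keeps the result
# within 32 bits in case the jiffies counter wrapped in between.
elapsed=$(( ($jiffies - $time_stamp) & 0xffffffff ))

echo "pending descriptors: $pending"
echo "ticks since time_stamp: $elapsed"
```

With these values the sketch reports 3 pending descriptors and 2489 ticks, which fits the picture of a transmit unit that never advanced its head pointer after resume.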

The only other thing that looks out-of-ordinary in dmesg output around suspend/resume to me is the following, during resume:

ACPI: EC: non-query interrupt received, switching to interrupt mode

Unfortunately the fact that I have neither disk nor network makes it difficult to extract any significant amount of logs from the machine; I just typed up the above.
Comment 4 Nickolai Zeldovich 2008-11-09 06:40:31 UTC
After giving this some thought, I realized the AHCI/e1000e problem was a separate bug: it would appear that the VT-d IOMMU isn't being re-initialized on resume.  Disabling VT-d in the BIOS (and falling back to SWIOTLB) makes that problem go away, but unfortunately the second resume still fails.

In the meantime, I have tested v2.6.26 (also fails on second resume), and tried a few kernel options (acpi_osi=Linux acpi_serialize libata.noacpi=1) to no avail.

I'll attach the acpidump output for the machine in a moment; disassembling and trying to recompile it with iasl generates some warnings and errors, but I can't tell if they're significant.
Comment 5 Nickolai Zeldovich 2008-11-09 06:42:23 UTC
Created attachment 18747 [details]
acpidump for Thinkpad W500 with VT-d disabled in BIOS
Comment 6 Nickolai Zeldovich 2008-11-09 13:04:05 UTC
Turns out the second-resume-failure only happens when the "Intel TXT Feature" is enabled in the BIOS.  If I disable that setting, the kernel can suspend and resume many (more than 3) times.

For anyone else that might be having this problem, try going into the BIOS, into "Security", into "Security Chip", and set "Intel(TM) TXT Feature" to disabled.

Unfortunately, I don't understand what the BIOS is doing with Intel's TXT hardware to cause this problem.  Perhaps it expects the kernel to properly enable the TPM chip?  Or perhaps the kernel is supposed to install some special trusted S3 resume handler?  I don't know why it only affects the second resume.

Moreover, I don't know whether this failure mode is specific to this BIOS or to TXT support in general.  However, I'll attach my dmidecode and acpidump output, both with TXT enabled and disabled, just in case.

(As for this bug report, it would be nice if the kernel at least logged a warning saying things might break..)
Comment 7 Nickolai Zeldovich 2008-11-09 13:04:59 UTC
Created attachment 18760 [details]
acpidump for Thinkpad W500 with TXT disabled in BIOS
Comment 8 Nickolai Zeldovich 2008-11-09 13:05:24 UTC
Created attachment 18761 [details]
acpidump for Thinkpad W500 with TXT enabled in BIOS
Comment 9 Nickolai Zeldovich 2008-11-09 13:07:17 UTC
Created attachment 18762 [details]
dmidecode for Thinkpad W500 with TXT disabled in BIOS
Comment 10 Nickolai Zeldovich 2008-11-09 13:07:35 UTC
Created attachment 18763 [details]
dmidecode for Thinkpad W500 with TXT enabled in BIOS
Comment 11 Len Brown 2008-11-11 21:30:00 UTC
If you enter BIOS setup and select the option to restore optimal defaults...

Does that result in "Intel TXT Feature" being enabled or disabled?
Comment 12 Nickolai Zeldovich 2008-11-12 07:28:00 UTC
Asking the BIOS to load default settings results in the "Intel TXT Feature" being disabled.
Comment 13 ykzhao 2008-11-13 01:31:10 UTC
Will you please confirm whether the problem still exists if legacy (IDE) mode is selected for the SATA device and the e1000 driver is not loaded? (Of course the TXT feature should still be enabled.)

    Will you please also confirm whether suspend/resume works on Windows when the TXT feature is enabled?
    Thanks.
Comment 14 Nickolai Zeldovich 2008-11-13 06:38:46 UTC
Yes, the problem occurs if I enable legacy (PCI-IDE) mode for the disk controller instead of AHCI in the bios, and don't load e1000.  The problem also occurs if I have no disk or network drivers loaded at all.

As for whether Windows works with TXT enabled, it's a bit complicated, but the answer is "probably yes".  Here's the fine print:

On my W500, I never tried running Windows, and it would take me a while to install it for this test.  So I don't know for sure what the answer is.

However, I have another Thinkpad T500 running Windows XP, which mostly differs from the W500 in the graphics card and display panel.  On that machine, I enabled Intel TXT in the BIOS, and Windows is able to suspend-resume at least 3 times in a row.  So the answer there seems to be "yes".
Comment 15 ykzhao 2008-11-25 00:20:32 UTC
Hi, Nickolai
    Thanks for the confirmation.
    It seems that the problem exists regardless of which disk controller mode (AHCI or legacy IDE) is used. If the TXT feature is disabled, the problem disappears. But there is no problem on Windows regardless of the TXT feature.
    As there is no beep on the second resume, maybe the BIOS doesn't transfer control to the waking vector set in the FACS table.
    Will you please attach the output of acpidump after the first suspend/resume cycle? (Please enable the Intel TXT feature.)
    Thanks.
   
Comment 16 Nickolai Zeldovich 2008-11-25 07:37:08 UTC
When TXT is enabled, the output of acpidump is exactly the same before and after the first resume (and it still hangs after the second resume).  I will attach the output shortly; it is virtually identical to the one in comment #8 except for some bits in the ASF! table (perhaps I changed some unrelated BIOS settings in the meantime?  I'm not sure why.)
Comment 17 Nickolai Zeldovich 2008-11-25 07:38:18 UTC
Created attachment 19014 [details]
acpidump for Thinkpad W500 with TXT enabled in BIOS, take 2
Comment 18 ykzhao 2008-12-04 00:37:41 UTC
Created attachment 19144 [details]
use the RTC cmos area to track where the suspend/resume hangs

Hi, Nickolai
    Sorry for the late response.
    Will you please try the debug patch on the latest kernel and do the following test? (TXT should be enabled in the BIOS, and the boot option "acpi_sleep=s3_beep" should also be added.)
    a. do one cycle of suspend/resume
    b. echo 25 > /proc/cmos
    c. echo mem > /sys/power/state (Second suspend)
    d. press the power button and see whether the system can be resumed. 

    If it can't be resumed, please reboot the system. After the system is rebooted, please cat /proc/cmos and attach the output of dmesg.
   thanks.
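
The steps above can be sketched as a small shell helper. This is only an illustration under assumptions: /proc/cmos exists only with the debug patch from attachment 19144 applied, the function names `write_to` and `cmos_trace_test` are hypothetical, and step d (pressing the power button) remains manual. Setting DRY_RUN=1 prints the actions instead of performing them.

```shell
# Write VALUE ($1) to PATH ($2), or only print the action when DRY_RUN=1.
write_to() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "echo $1 > $2"
    else
        echo "$1" > "$2"
    fi
}

# Steps a-c of the procedure; each suspend returns once the machine resumes.
cmos_trace_test() {
    write_to mem /sys/power/state || return 1   # a. first suspend/resume cycle
    write_to 25 /proc/cmos || return 1          # b. arm the CMOS tracking (debug patch only)
    write_to mem /sys/power/state || return 1   # c. second suspend
    # d. press the power button by hand; if the resume hangs, reboot, then
    #    run "cat /proc/cmos" and save the output of dmesg.
}
```

Running `DRY_RUN=1 cmos_trace_test` first is a cheap way to review the sequence before trying it as root on the affected machine.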
Comment 19 Zhang Rui 2008-12-09 21:30:57 UTC
Well, we know nothing about VT, and I don't have any idea how to debug this issue.
Reassigning to the VT category to see if they know more about this.
Comment 20 Nickolai Zeldovich 2008-12-09 21:36:37 UTC
In response to comment 18, I will try to get to it in the near future.

In response to comment 19, I don't think this has anything to do with VT.  As I mentioned in comment 4, the IOMMU (VT-d) problem turned out to be an independent problem.
Comment 21 Zhang Rui 2008-12-09 21:59:45 UTC
(In reply to comment #20)
> In response to comment 18, I will try to get to it in the near future.
> 
> In response to comment 19, I don't think this has anything to do with VT.  As I
> mentioned in comment 4, the IOMMU (VT-d) problem turned out to be an
> independent problem.
> 
Here, I said "VT" but I meant the "Intel(TM) TXT Feature". :)
Maybe that's not accurate, but I'm sure the virtualization people know much more about this. :p
Comment 22 Avi Kivity 2008-12-10 01:05:11 UTC
No, they don't.  Who's the TXT maintainer?
Comment 23 Nickolai Zeldovich 2008-12-11 15:16:02 UTC
Created attachment 19256 [details]
output from dmesg after rebooting after a second resume hang

I tried the procedure suggested in comment 18.  The system hung after the second resume attempt, as before.  After rebooting, the BIOS gave an error about a CMOS checksum mismatch; this might have overwritten whatever value you were looking for in there.  After booting up, /proc/cmos contained:

    mcount=9, time=bbb004

The output of dmesg after rebooting is attached.
Comment 24 Rex Tsai 2009-03-19 10:48:00 UTC
I'd like to confirm that this issue happens on a Thinkpad X200 (7458-RY9), BIOS version 6DET38WW (2.02) 12/19/2008, with the 2.6.28 kernel.
Comment 25 Per Hermansson 2009-04-05 16:36:29 UTC
Also happens for me on Thinkpad X200s (74695KG) with BIOS 6DET33WW (1.10) (10/27/2008) and kernel 2.6.29
Comment 26 joe3572 2009-05-25 16:52:07 UTC
Using Acer Aspire One kernel 2.6.26 I have the same thing happening.  Apparently if I leave the laptop for 5 minutes (every time it takes exactly 5 minutes) it recovers and boots as normal.  Also found that after the first suspend I need to reinitialize ehci_hcd otherwise it comes out of suspend right away.

Any idea where the 5 minutes delay may come from?

Thanks
Comment 27 joe3572 2009-05-25 19:43:28 UTC
Using Acer Aspire One kernel 2.6.26 I have the same thing happening.  Apparently if I leave the laptop for 5 minutes (every time it takes exactly 5 minutes) it recovers.  Also found that after the first suspend I need to reinitialize ehci_hcd otherwise it comes out of suspend right away.

Any idea where the 5 minutes delay may come from?

Thanks
Comment 28 Rex Tsai 2009-06-29 11:41:28 UTC
There is a newer BIOS released for X200, please check here 

http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=MIGR-70347
http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=MIGR-70348

Can anyone confirm whether this BIOS fixes the issue? :)
Comment 29 Per Hermansson 2009-06-29 13:14:54 UTC
I can confirm that the issue remains with the latest BIOS version for x200s (3.03 05/18/2009) using the 2.6.30 kernel.
Comment 30 David Woodhouse 2009-06-29 14:57:00 UTC
I'm using an x200s with BIOS version 6DET33WW (1.10) and cannot reproduce the problem with v2.6.31-rc1. (Actually slightly newer; commit a679128d)

I have TXT enabled in the BIOS, and the machine can suspend/resume repeatedly whether I enable the IOMMU in Linux or not.
Comment 31 David Woodhouse 2009-07-05 00:11:18 UTC
Please confirm whether this works for you in 2.6.31-rc2
Comment 32 Per Hermansson 2009-07-05 13:27:17 UTC
When testing with 2.6.31-rc2 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.31-rc2/ I'm still only able to resume the second time when Intel TXT is turned off in BIOS.
Comment 33 David Woodhouse 2009-07-05 19:14:30 UTC
Do you have CONFIG_DMAR enabled in your kernel and VT-d enabled in the BIOS?
Can you show dmesg output when it boots? The dmesg in comment #18 seems not to have a DMAR table in ACPI at all.
Comment 34 Per Hermansson 2009-07-06 18:46:42 UTC
Created attachment 22238 [details]
Dmesg log from booting 2.6.31-rc2 with CONFIG_DMAR

VT-d has always been enabled but CONFIG_DMAR wasn't enabled so I recompiled 2.6.31-rc2 with it. For some reason Intel graphics doesn't work with this so I could only test in rescue mode (it still fails to resume the second time).
I'm attaching the dmesg log.
Comment 35 David Woodhouse 2009-07-08 18:00:00 UTC
The graphics driver is known to be broken. You need to use intel_iommu=igfx_off (or update from Linus' git tree again and use the CONFIG_DMAR_BROKEN_GFX_WA option).
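
To confirm that the workaround parameter actually reached the running kernel, the command line can be inspected. A minimal sketch, assuming the parameter is passed as the single word intel_iommu=igfx_off; `has_kernel_param` is a hypothetical helper name, not an existing tool:

```shell
# Succeed if the command-line string $1 contains the exact parameter $2.
has_kernel_param() {
    # Unquoted $1 is split on whitespace, one word per kernel parameter.
    for word in $1; do
        [ "$word" = "$2" ] && return 0
    done
    return 1
}

# Example use on a live system:
#   has_kernel_param "$(cat /proc/cmdline)" intel_iommu=igfx_off \
#       && echo "igfx_off is set"
```

The match is deliberately exact, so a combined form such as intel_iommu=on,igfx_off would not be detected by this sketch and would need a looser comparison.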
Comment 36 Per Hermansson 2009-07-09 07:22:15 UTC
Created attachment 22285 [details]
dmesg.log for 2.6.31-rc with CONFIG_DMAR=y and intel_iommu=igfx_off

Thanks; with intel_iommu=igfx_off, graphics works now. I'm attaching the dmesg log after the first resume.
Comment 37 David Woodhouse 2009-07-09 16:24:56 UTC
I'm very confused that I can't reproduce this. Just to make sure, can you test the kernel at http://david.woodhou.se/bzImage-x200s ? 

You don't need any iommu-related arguments on its command line; it should work around the graphics bug automatically.

md5sum is 06b3ed3da19ab83f8d4371d91f4288a7
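
One way to check a downloaded image against the checksum quoted above is sketched below. `verify_md5` is a hypothetical helper name, and the sketch assumes the coreutils `md5sum` tool is installed.

```shell
# Succeed if the MD5 digest of file $1 equals the expected hex string $2.
verify_md5() {
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(md5sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}

# Example use:
#   verify_md5 bzImage-x200s 06b3ed3da19ab83f8d4371d91f4288a7 \
#       && echo "checksum matches"
```

A mismatch most likely means a truncated or corrupted download, in which case test results from that image would not be comparable.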

I can suspend and resume multiple times, with TXT and VT-d enabled in the BIOS.
BIOS version is as I said before (from dmidecode):
        Version: 6DET33WW (1.10 )
        Release Date: 10/27/2008

I don't know how to make sure all our BIOS settings are in sync. The 'reset to default options' doesn't actually seem to do anything. If we were using a sensible firmware like OpenFirmware, we'd be able to list the settings -- but these crappy BIOSes don't let us do that.

Can you try my kernel, and boot into runlevel 3 and run 'echo mem > /sys/power/state' from a VT?
Comment 38 Per Hermansson 2009-07-10 16:26:09 UTC
Created attachment 22298 [details]
dmesg log after first resume from kernel image bzImage-x200s

Tried with your image with same result. Attached dmesg log after resuming from the first echo mem > /sys/power/state.
Comment 39 David Woodhouse 2009-07-10 18:02:00 UTC
So, to summarise:

We have the same hardware.
We have the same BIOS.
We have, as far as we can tell, the same BIOS settings.
We have booted the same kernel.

Yours fails, mine works.

I don't know where to go from here, given that I'm not even sure we get back into the kernel -- it could be happening in the BIOS.

Do you want to try upgrading the BIOS to see if it persists?
Comment 40 Per Hermansson 2009-07-10 20:59:11 UTC
As far as I can tell I have a much newer version than you (3.03 05/18/2009), so it would be a downgrade, but if that is possible I'll try it and see whether it works.

For example in 2.04 one fix is
 - (Fix) Fixed unnecessary resource claiming for the TPM 1.2
And the 3.* series adds a lot of new security features. 

I'll try with an older BIOS and see if that works.
Comment 41 David Woodhouse 2009-07-11 10:25:12 UTC
Oh, sorry -- I was looking at comment #25 but you've upgraded since then.
Comment 42 Per Hermansson 2009-07-11 12:02:08 UTC
I've done some extensive testing on different setups.

First a little background: 
I dual boot Windows 7 with a partition encrypted with BitLocker, using the TPM claimed by Windows. Before downgrading the BIOS I decrypted the partition in case the computer would be bricked (TPM still claimed by Windows though).

Here's the result:
After downgrading the BIOS to 1.10, multiple resumes WORKED on all kernels tested (from 2.6.28 to 2.6.31-rc1 (bzImage-x200s)).
I then tested with BitLocker turned on, with the same result. I also tested with BitLocker turned off and the TPM cleared, again with the same result (everything works).

After this I re-upgraded the BIOS to 3.03 and tried with the TPM both cleared and claimed by Windows, and now multiple resumes work even here (with Intel TXT enabled) using the same kernels as above.

Conclusion:
Either I've broken the Intel TXT or the mysterious issue I had disappeared when I either turned off BitLocker or downgraded BIOS.
Comment 43 Per Hermansson 2009-07-17 17:48:11 UTC
I've traced the problem down to the Intel Anti-Theft module. When Intel TXT is enabled and the Intel AT-p module is also enabled, resume fails the second time. This was probably the setup I had previously.

Without Intel AT-p everything seems to work independent of the Intel TXT setting with 2.6.31-rc3. Though I'm still not sure why it didn't work with the 1.10 BIOS which lacks this feature.