Bug 11963
Description
Nickolai Zeldovich
2008-11-06 08:59:56 UTC
Did this also fail the same way with 2.6.26.stable or 2.6.27.stable?

Fails the same way using Ubuntu 8.10's 2.6.27-4 and 2.6.27-7 kernels; never tried 2.6.26 on this machine. If there are specific prior versions that would be of particular interest in debugging this failure, I could try them out.

While trying to figure out that other bug I briefly alluded to (that AHCI doesn't work after the first resume), I noticed that the system is already in a very bad state even after the first resume, so it might not be very surprising that the second suspend goes bad. In particular, both ahci and e1000e are broken after the first resume.

After enabling AHCI and e1000e in the kernel build, and going into suspend the first time (echo mem > /sys/power/state), ahci prints errors like this on resume:

ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x100)
ata1.00: revalidation failed (errno=-5)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100)
ata2.00: revalidation failed (errno=-5)
[[ 5 seconds later.. ]]
Clocksource tsc unstable (delta = -499994004 ns)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100)
ata2.00: revalidation failed (errno=-5)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x100)
ata1.00: revalidation failed (errno=-5)
[[ 5 seconds later.. ]]
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x100)
ata1.00: revalidation failed (errno=-5)
ata1.00: disabled
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100)
ata2.00: revalidation failed (errno=-5)
ata2.00: disabled
ata2: exception Emask 0x60 SAct 0x0 SErr 0x800 action 0x6 frozen t4
ata2: irq_stat 0x20000000, host bus error
ata2: SError: { HostInt }
ata2: hard resetting link
ata1: exception Emask 0x60 SAct 0x0 SErr 0x800 action 0x6 frozen t4
ata1: irq_stat 0x20000000, host bus error
ata1: SError: { HostInt }
ata1: hard resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1: EH complete
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2: EH complete
sd 0:0:0:0: [sda] START_STOP FAILED
sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
PM: Device 0:0:0:0 failed to resume: error 262144

e1000e hardware doesn't seem to be transmitting for some reason either:

0000:00:19.0: eth0: Detected Tx Unit Hang:
  TDH <0>
  TDT <3>
  next_to_use <3>
  next_to_clean <0>
buffer_info[next_to_clean]:
  time_stamp <fffc91ef>
  next_to_watch <0>
  jiffies <fffc9ba8>
  next_to_watch.status <0>

ifconfig eth0 reports no packets going in or out of the interface, although /proc/interrupts shows interrupts coming in for eth0 periodically.

The only other thing that looks out-of-ordinary in dmesg output around suspend/resume to me is the following, during resume:

ACPI: EC: non-query interrupt received, switching to interrupt mode

Unfortunately the fact that I have neither disk nor network makes it difficult to extract any significant amount of logs from the machine; I just typed up the above.

After giving this some thought, I realized the AHCI/e1000e problem was a separate bug: it would appear that the VT-d IOMMU isn't being re-initialized on resume. Disabling VT-d in the BIOS (and falling back to SWIOTLB) makes that problem go away, but unfortunately the second resume still fails.

In the meantime, I have tested v2.6.26 (also fails on second resume), and tried a few kernel options (acpi_osi=Linux acpi_serialize libata.noacpi=1) to no avail.

I'll attach the acpidump output for the machine in a moment; disassembling and trying to recompile it with iasl generates some warnings and errors, but I can't tell if they're significant.

Created attachment 18747 [details]
acpidump for Thinkpad W500 with VT-d disabled in BIOS
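For anyone comparing resume logs across kernels or BIOS settings, the libata failures quoted above can be tallied mechanically. The following is only a rough sketch: it assumes the dmesg output was saved as text with one message per line, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   ata1.00: failed to IDENTIFY (I/O error, err_mask=0x100)
IDENTIFY_FAILURE = re.compile(
    r"^(ata\d+\.\d+): failed to IDENTIFY \(.*err_mask=0x[0-9a-fA-F]+\)"
)

# Matches lines such as:
#   ata1.00: disabled
DISABLED = re.compile(r"^(ata\d+\.\d+): disabled$")


def count_identify_failures(log_text):
    """Return a Counter mapping device name to its number of IDENTIFY failures."""
    counts = Counter()
    for line in log_text.splitlines():
        match = IDENTIFY_FAILURE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


def disabled_devices(log_text):
    """Return the sorted names of devices that libata gave up on."""
    names = set()
    for line in log_text.splitlines():
        match = DISABLED.match(line.strip())
        if match:
            names.add(match.group(1))
    return sorted(names)
```

A device that shows up in `disabled_devices` has exhausted its retries, which in the log above happens after the third IDENTIFY failure for each of ata1.00 and ata2.00.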
Turns out the second-resume failure only happens when the "Intel TXT Feature" is enabled in the BIOS. If I disable that setting, the kernel can suspend and resume many (more than 3) times. For anyone else that might be having this problem, try going into the BIOS, into "Security", into "Security Chip", and set "Intel(TM) TXT Feature" to disabled.

Unfortunately, I don't understand what the BIOS is doing with Intel's TXT hardware to cause this problem. Perhaps it expects the kernel to properly enable the TPM chip? Or the kernel is supposed to install some special trusted S3 resume handler? I don't know why it only affects the second resume. Moreover, I don't know whether this failure mode is specific to this BIOS or to TXT support in general. However, I'll attach my dmidecode and acpidump output, both with TXT enabled and disabled, just in case.

(As for this bug report, it would be nice if the kernel at least logged a warning saying things might break.)

Created attachment 18760 [details]
acpidump for Thinkpad W500 with TXT disabled in BIOS
Created attachment 18761 [details]
acpidump for Thinkpad W500 with TXT enabled in BIOS
Created attachment 18762 [details]
dmidecode for Thinkpad W500 with TXT disabled in BIOS
Created attachment 18763 [details]
dmidecode for Thinkpad W500 with TXT enabled in BIOS
If you enter BIOS setup and select the option to restore optimal defaults... Does that result in "Intel TXT Feature" being enabled or disabled?

Asking the BIOS to load default settings results in the "Intel TXT Feature" being disabled.

Will you please confirm whether the problem still exists if the legacy (IDE) mode is selected for the SATA device and the driver for e1000 is not loaded? (Of course the TXT feature is still enabled.)
Will you please confirm whether suspend/resume works well on Windows if the TXT feature is enabled?
Thanks.

Yes, the problem occurs if I enable legacy (PCI-IDE) mode for the disk controller instead of AHCI in the BIOS, and don't load e1000. The problem also occurs if I have no disk or network drivers loaded at all.

As for whether Windows works with TXT enabled, it's a bit complicated, but the answer is "probably yes". Here's the fine print:

On my W500, I never tried running Windows, and it would take me a while to install it for this test. So I don't know for sure what the answer is.

However, I have another Thinkpad T500 running Windows XP, which mostly differs from the W500 in the graphics card and display panel. On that machine, I enabled Intel TXT in the BIOS, and Windows is able to suspend-resume at least 3 times in a row. So the answer there seems to be "yes".

Hi, Nickolai
Thanks for the confirmation. It seems that the problem still exists no matter which IDE mode is used. If the TXT feature is disabled, the problem disappears. But there is no problem on Windows regardless of the TXT feature.
As there is no beep on the second resume, maybe the BIOS doesn't transfer control to the waking vector set in the FACS table.
Will you please attach the output of acpidump after the first suspend/resume cycle? (Please enable the Intel TXT feature.)
Thanks.

When TXT is enabled, the output of acpidump is exactly the same before and after the first resume (and it still hangs after the second resume). I will attach the output shortly; it is virtually identical to the one in comment #8 except for some bits in the ASF! table (perhaps I changed some unrelated BIOS settings in the meantime? I'm not sure why.)

Created attachment 19014 [details]
acpidump for Thinkpad W500 with TXT enabled in BIOS, take 2
Created attachment 19144 [details]
use the RTC cmos area to track where the suspend/resume hangs
Hi, Nickolai
Sorry for the late response.
Will you please try the debug patch on the latest kernel and do the following test? (TXT should be enabled in the BIOS, and the boot option "acpi_sleep=s3_beep" should also be added.)
a. do one cycle of suspend/resume
b. echo 25 > /proc/cmos
c. echo mem > /sys/power/state (Second suspend)
d. press the power button and see whether the system can be resumed.
If it can't be resumed, please reboot the system. After the system is rebooted, please cat /proc/cmos and attach the output of dmesg.
thanks.
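The steps above could be scripted roughly as follows. This is only a sketch, not part of the debug patch: it assumes the patch is applied so that /proc/cmos exists, it must run as root, and the function names are invented for illustration. The paths are parameters so the logic can be tried against ordinary files first.

```python
from pathlib import Path


def write_value(path, value):
    """Rough equivalent of `echo VALUE > PATH`."""
    Path(path).write_text(f"{value}\n")


def run_second_suspend_test(state_path="/sys/power/state",
                            cmos_path="/proc/cmos",
                            marker=25):
    """Perform steps a to c; step d (pressing the power button) stays manual."""
    # Step a: one full suspend/resume cycle. On a real system this write
    # does not return until the machine has been resumed.
    write_value(state_path, "mem")
    # Step b: store the marker value through the debug patch's /proc/cmos file.
    write_value(cmos_path, marker)
    # Step c: the second suspend, the one whose resume is expected to hang.
    write_value(state_path, "mem")


def read_cmos_report(cmos_path="/proc/cmos"):
    """After the reboot: return the contents of /proc/cmos for the bug report."""
    return Path(cmos_path).read_text().strip()
```

If the second resume hangs, the script never gets past step c, so `read_cmos_report` is meant to be called separately after rebooting, alongside capturing dmesg.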
Well, we know nothing about VT and I don't have any idea how to debug this issue. Re-assigning to the VT category to see if they know more about this.

In response to comment 18, I will try to get to it in the near future.

In response to comment 19, I don't think this has anything to do with VT. As I mentioned in comment 4, the IOMMU (VT-d) problem turned out to be an independent problem.

(In reply to comment #20)
> In response to comment 18, I will try to get to it in the near future.
>
> In response to comment 19, I don't think this has anything to do with VT. As I
> mentioned in comment 4, the IOMMU (VT-d) problem turned out to be an
> independent problem.

Here, I said "VT" but I meant the "Intel(TM) TXT Feature". :) Maybe that's not accurate, but I'm sure the virtualization people know much more about this. :p

No, they don't. Who's the TXT maintainer?

Created attachment 19256 [details]
output from dmesg after rebooting after a second resume hang

I tried the procedure suggested in comment 18. The system hung after the second resume attempt, as before. After rebooting, the BIOS gave an error about a CMOS checksum mismatch; this might have overwritten whatever value you were looking for in there. After booting up, /proc/cmos contained:

mcount=9, time=bbb004

The output of dmesg after rebooting is attached.

I'd like to confirm that this issue happens on Thinkpad X200 (7458-RY9), BIOS version 6DET38WW (2.02) 12/19/2008, with the 2.6.28 kernel.

Also happens for me on Thinkpad X200s (74695KG) with BIOS 6DET33WW (1.10) (10/27/2008) and kernel 2.6.29.

Using Acer Aspire One, kernel 2.6.26, I have the same thing happening. Apparently if I leave the laptop for 5 minutes (every time it takes exactly 5 minutes) it recovers and boots as normal. Also found that after the first suspend I need to reinitialize ehci_hcd, otherwise it comes out of suspend right away. Any idea where the 5-minute delay may come from? Thanks

There is a newer BIOS released for the X200, please check here:
http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=MIGR-70347
http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=MIGR-70348
Can anyone confirm whether this BIOS fixes the issue? :)

I can confirm that the issue remains with the latest BIOS version for the x200s (3.03 05/18/2009) using the 2.6.30 kernel.

I'm using an x200s with BIOS version 6DET33WW (1.10) and cannot reproduce the problem with v2.6.31-rc1. (Actually slightly newer; commit a679128d.) I have TXT enabled in the BIOS, and the machine can suspend/resume repeatedly whether I enable the IOMMU in Linux or not.

Please confirm whether this works for you in 2.6.31-rc2.

When testing with 2.6.31-rc2 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.31-rc2/ I'm still only able to resume the second time when Intel TXT is turned off in the BIOS.

Do you have CONFIG_DMAR enabled in your kernel and VT-d enabled in the BIOS? Can you show dmesg output when it boots? The dmesg in comment #18 seems not to have a DMAR table in ACPI at all.

Created attachment 22238 [details]
Dmesg log from booting 2.6.31-rc2 with CONFIG_DMAR
VT-d has always been enabled but CONFIG_DMAR wasn't enabled so I recompiled 2.6.31-rc2 with it. For some reason Intel graphics doesn't work with this so I could only test in rescue mode (it still fails to resume the second time).
I'm attaching the dmesg log.
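Whether an option such as CONFIG_DMAR is enabled can be read from the kernel's build configuration. A small sketch follows, assuming the configuration text is available as a file (many distributions ship it as /boot/config-<release>, but that location is an assumption here, and the function name is hypothetical):

```python
def config_option_state(config_text, option):
    """Return the option's value ('y', 'm', ...) or None if unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options are recorded as a comment of this exact form.
        if line == f"# {option} is not set":
            return None
        # The trailing '=' keeps CONFIG_DMAR from matching longer names
        # such as CONFIG_DMAR_BROKEN_GFX_WA.
        if line.startswith(f"{option}="):
            return line.split("=", 1)[1]
    return None
```

For example, calling `config_option_state(open(path).read(), "CONFIG_DMAR")` on the running kernel's config file would distinguish a kernel built with `CONFIG_DMAR=y` from one where the option is not set.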
The graphics driver is known to be broken. You need to use intel_iommu=igfx_off (or update from Linus' git tree again and use the CONFIG_DMAR_BROKEN_GFX_WA option). Created attachment 22285 [details]
dmesg.log for 2.6.31-rc with CONFIG_DMAR=y and intel_iommu=igfx_off
Thanks. Tested with intel_iommu=igfx_off, and graphics works now. I'm attaching the dmesg log after the first resume.
I'm very confused that I can't reproduce this. Just to make sure, can you test the kernel at http://david.woodhou.se/bzImage-x200s ? You don't need any iommu-related arguments on its command line; it should work around the graphics bug automatically. md5sum is 06b3ed3da19ab83f8d4371d91f4288a7

I can suspend and resume multiple times, with TXT and VT-d enabled in the BIOS. BIOS version is as I said before (from dmidecode):

Version: 6DET33WW (1.10 )
Release Date: 10/27/2008

I don't know how to make sure all our BIOS settings are in sync. The 'reset to default options' doesn't actually seem to do anything. If we were using a sensible firmware like OpenFirmware, we'd be able to list the settings -- but these crappy BIOSes don't let us do that.

Can you try my kernel, and boot into runlevel 3 and run 'echo mem > /sys/power/state' from a VT?

Created attachment 22298 [details]
dmesg log after first resume from kernel image bzImage-x200s
Tried with your image with same result. Attached dmesg log after resuming from the first echo mem > /sys/power/state.
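When two people are meant to boot the same image, as with the test kernel above, comparing the file against the posted md5sum rules out a corrupted or mismatched download. A minimal sketch using Python's standard hashlib; the function names are illustrative only:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected_hex):
    """Compare a file's MD5 with a published value, ignoring case and spaces."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Note that an MD5 match only shows that both sides have the same bytes; it is not protection against deliberate tampering.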
So, to summarise: We have the same hardware. We have the same BIOS. We have, as far as we can tell, the same BIOS settings. We have booted the same kernel. Yours fails, mine works.

I don't know where to go from here, given that I'm not even sure we get back into the kernel -- it could be happening in the BIOS. Do you want to try upgrading the BIOS to see if it persists?

As far as I can tell I have a much newer version than you (3.03 05/18/2009), so it would be downgrading, but if that is possible I'll try it and see if that works. For example, in 2.04 one fix is:
- (Fix) Fixed unnecessary resource claiming for the TPM 1.2
And the 3.* series adds a lot of new security features. I'll try with an older BIOS and see if that works.

Oh, sorry -- I was looking at comment #25 but you've upgraded since then.

I've done some extensive testing on different setups. First a little background: I dual boot Windows 7 with a partition encrypted with BitLocker, using the TPM claimed by Windows. Before downgrading the BIOS I decrypted the partition in case the computer would be bricked (TPM still claimed by Windows though).

Here's the result: After downgrading the BIOS to 1.10, multiple resumes WORKED on all kernels tested (from 2.6.28 to 2.6.31-rc1 (bzImage-x200s)). I then tested turning on BitLocker, with the same result. I also tested turning off BitLocker and clearing the TPM, again with the same result (everything works).

After this I re-upgraded the BIOS to 3.03 and tried with the TPM both cleared and claimed by Windows, and now multiple resumes even work here (with Intel TXT enabled) using the same kernels as above.

Conclusion: Either I've broken the Intel TXT or the mysterious issue I had disappeared when I either turned off BitLocker or downgraded the BIOS.

I've traced the problem down to the Intel Anti-Theft module. When Intel TXT is enabled and the Intel AT-p module is enabled, resume fails the second time. This was probably the setup I had previously. Without Intel AT-p everything seems to work independent of the Intel TXT setting with 2.6.31-rc3. Though I'm still not sure why it didn't work with the 1.10 BIOS, which lacks this feature.