Created attachment 112391 [details]
dmesg output for kernel 3.11.6, after two suspend to RAM

The machine won't turn off completely when I try to power off my laptop. The kernel prints:

    Reboot: Power down

and hangs there indefinitely.

I tried 3.10.17, 3.11.6, and 3.12-rc6, with and without the kernel boot parameter acpi_osi="!Windows 2012". No luck at all. Suspend to RAM works correctly. Reboot works correctly too.

See the attachments for dumps of dmidecode, acpidump, and dmesg.
Created attachment 112401 [details] Output of acpidump
Created attachment 112411 [details] Output of dmidecode
Created attachment 112421 [details] dmesg output on kernel 3.12-rc6
Created attachment 112431 [details] Kernel config for 3.11.6
Did power off work on this machine before? Please try the kernel parameter "reboot=kbd", or reboot=pci, reboot=efi, reboot=triple, or reboot=bios.
Power off works fine on stock Windows 8. With the upgrade to BIOS version 2.22 (which I am using now), it works fine on Windows 7 too. I'm not sure if this is related, but users with earlier BIOS versions reported that Windows 7 was unable to power off the machine (and showed a BSOD on the subsequent boot) if the USB 3.0 driver was installed. That problem has been fixed in the newer BIOS version.

None of the above kernel parameters seems to fix this problem. I've tested them on kernel 3.11.6 and 3.12-rc6.
I just got this machine a few days ago, so I haven't used earlier kernels on it. I'll try testing it on an earlier version of the kernel later today.
(In reply to Chang Liu from comment #6)
> Power off works fine on stock Windows 8. With the upgrade to BIOS version
> 2.22 (which I am using now), it works fine on Windows 7 too. I'm not sure if
> this is related but users with earlier versions of BIOS reported a problem
> of Windows 7 unable to power off the machine (and shows BSOD on subsequent
> boot) if it has USB 3.0 driver installed. The problem has been fixed with
> newer BIOS version.

OK. How about not loading the xhci driver on Linux?

> None of the above kernel parameters seem to fix this problem. I've tested
> them on kernel 3.11.6 and 3.12-rc6.

Could you check an older kernel, e.g. v3.3? Some other reporters found that power off stopped working on other machines between v3.3 and v3.5.

BTW, is VT-d enabled in the BIOS? If yes, please disable it and try again.
(In reply to Lan Tianyu from comment #8)
> Ok. How about not loading xhci driver on the linux?

Sure. I tried to rmmod the xhci kernel module. The problem persists. I also tried compiling the kernel (version 3.7) without xhci. No luck either.

> BTW, do you enable vt-d in the Bios? If yes, please disable it and try again.

I couldn't find a VT-d option in the BIOS, so I'm not sure how to disable it.

> Could you check the older kernel E.G, v3.3. Some other reporters find the
> power off can't work on the other machine between v3.3~v3.5.

Yes. I tested kernel 3.3.8. The problem is gone! The machine powered off just fine. So I tested a bunch of other kernel releases. Here are the results:

VERSION      WORKS?
3.3.8        YES
3.4.44       YES
3.4.51       YES
3.5.0-rc1    NO
3.5.0        NO
3.5.1        NO
3.7.0        NO

So the regression was evidently introduced in the 3.5 cycle. I looked at the git log and saw several ACPI-related commits. I'll run git-bisect to locate the commit that introduced the bug.

Any pointers on where I should be looking? In your opinion, which commits are the most likely to have caused this regression?
OK. The git-bisect is done. Looks like

b566a22c23327f18ce941ffad0ca907e50a53d41 is the first bad commit
commit b566a22c23327f18ce941ffad0ca907e50a53d41
Author: Khalid Aziz <khalid.aziz@hp.com>
Date:   Fri Apr 27 13:00:33 2012 -0600

    PCI: disable Bus Master on PCI device shutdown

    Disable Bus Master bit on the device in pci_device_shutdown() to ensure PCI
    devices do not continue to DMA data after shutdown. This can cause memory
    corruption in case of a kexec where the current kernel shuts down and
    transfers control to a new kernel while a PCI device continues to DMA to
    memory that does not belong to it any more in the new kernel.

    I have tested this code on two laptops, two workstations and a 16-socket
    server. kexec worked correctly on all of them.

    Signed-off-by: Khalid Aziz <khalid.aziz@hp.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

:040000 040000 599871e51d03f3859da648de41a0db25eb046c13 63b1f26a5c20523bb7fd56039e4f641ad240b925 M	drivers
Great job! Looking at the code of v3.12-rc5, the issue seems to be caused by the PCI bus master bit being cleared during shutdown. Please try this patch on v3.12-rc5.

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 98f7b9b..20758ef 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -392,8 +392,8 @@ static void pci_device_shutdown(struct device *dev)
 	 * Turn off Bus Master bit on the device to tell it to not
 	 * continue to do DMA. Don't touch devices in D3cold or unknown states.
 	 */
-	if (pci_dev->current_state <= PCI_D3hot)
-		pci_clear_master(pci_dev);
+//	if (pci_dev->current_state <= PCI_D3hot)
+//		pci_clear_master(pci_dev);
 }
 
 #ifdef CONFIG_PM
Yes. I commented out the offending lines on torvalds' latest kernel tree (as of c9ca72fc568403db192e199b752c9c253e5f5fd9) and recompiled the kernel. The problem is fixed! The machine powers off perfectly. If I don't comment out these lines, the machine hangs at power off.

The original patch that introduced the regression seems to be about fixing kexec. I have very limited knowledge of both kexec and PCI, but I feel the commit cannot simply be reverted. It looks like we need to be more careful when clearing the PCI bus master bit during shutdown. Perhaps I should inform the original authors of the related commits?
I've added Mr. Konstantin Khlebnikov and Mr. Bjorn Helgaas to the CC list, as they authored/reviewed the relevant patches (commits 6e0eda3, 20f2420, 7897e60). I cannot find the author of the original commit b566a22 (Mr./Ms. Khalid Aziz) on bugzilla though.
Yes, it's better to inform the PCI developers and the original commit author.

BTW, is it possible to check which PCI device causes this issue?
Clearing the bus master bit may break devices. Alan Cox warned us about this: https://lkml.org/lkml/2012/6/6/545

But keeping it set may break kexec if a driver doesn't shut down its device correctly. I think xhci needs some quirk for this hardware...
(In reply to Lan Tianyu from comment #14)
> Yes, it's better to info PCI guys and origin commit author.
>
> BTW, it's possible to check which pci device cause this issue?

Yes! By brute-force enumeration I was able to find the very device that is causing the hang:

Intel Corporation Lynx Point-LP SATA Controller, vendor:device == 8086:9c03

It took me hours of compiling, rebooting, compiling, rebooting...
Created attachment 112711 [details] This patch adds a blacklist for the offending device and fixes this problem. Tested on latest torvalds tree. Fixes the power off hang in my machine.
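For readers without access to the attachment, the blacklist idea can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the attached patch: `struct pci_id_pair`, `no_clear_master_list` and `on_no_clear_master_list()` are hypothetical names. Only the 8086:9c03 ID pair comes from this report.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ID pair; this is not the kernel's struct pci_device_id. */
struct pci_id_pair {
    uint16_t vendor;
    uint16_t device;
};

/* Illustrative table of devices whose Bus Master bit is left alone at shutdown. */
const struct pci_id_pair no_clear_master_list[] = {
    { 0x8086, 0x9c03 }, /* Intel Lynx Point-LP SATA Controller (from this report) */
};

/* Hypothetical helper: returns true if vendor:device is in the table above. */
bool on_no_clear_master_list(uint16_t vendor, uint16_t device)
{
    size_t n = sizeof(no_clear_master_list) / sizeof(no_clear_master_list[0]);

    for (size_t i = 0; i < n; i++) {
        if (no_clear_master_list[i].vendor == vendor &&
            no_clear_master_list[i].device == device)
            return true;
    }
    return false;
}
```

In this sketch a shutdown path would call `on_no_clear_master_list()` first and skip clearing Bus Master when it returns true. The drawback is that each newly affected device needs its own table entry.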
Patch sent to the linux-pci mailing list: http://marc.info/?l=linux-pci&m=138309785519515&w=2
Did you test kexec? Will this bring back the issue that commit b566a22c fixed?

BTW, the patch you sent to the PCI mailing list is missing the Signed-off-by line.
(In reply to Lan Tianyu from comment #19)
> Do you test the kexec? Will this trigger the previous issue commit b566a22c
> fix.
>
> BTW, your patch sent to PCI maillist loses signed-off-by.

Kexec doesn't work on this machine, even without the patch applied (I tested 3.11.6 and 3.12-rc7), possibly because some devices won't reinitialize themselves properly. On 3.11.6, kexec -e triggers an immediate reboot. On 3.12-rc7, both with and without this patch applied, kexec -e gets halfway into booting the new kernel, but then a reboot is triggered as well.

I'm not sure whether this patch has helped the kexec situation on this machine, but since it has no effect on machines without this defective SATA controller (and/or its firmware), I would say it is unlikely to negatively affect kexec on other machines.

I'll resend with the Signed-off-by line and this kexec comment.
Hi,

I am new to this and have been looking for a fix for the exact same problem, as I have the same machine (V5-573g). I am happy somebody has found the cause and others have provided a patch. However, as I have never patched a kernel, I am not sure whether I should try it or just wait for the fix to land in a later release.

Do you think this fix will be included in 3.12-rc8, or should I try to apply it on my own? If so, could you give me any references on how to do so properly? I am on Mint 15.

Best,
Felix
Hi Felix,

I've sent this patch to the linux-pci mailing list, but so far nobody has commented :( I'm not familiar with Linux kernel development, but I have heard from other sources that a patch needs to go through several reviews before being merged into mainline. So it looks very unlikely that this patch will make it into 3.12-rc8 or even 3.13, which means you (and I, since I'm also affected by this bug) are on our own. :<

To compile a kernel you need, in general, to unpack the kernel tarball, apply the necessary patches, configure the kernel, build it, and install the kernel and its modules. See, for example:

https://wiki.archlinux.org/index.php/Kernels/Compilation/Traditional

Before the make mrproper line, make sure to patch your kernel with the patch provided in this bug report:

patch -Np1 -i "${path}/0001-PCI-Blacklist-certain-hardware-from-clearing-Bus-Mas.patch"

If you don't feel like going through the hassle, you can file a bug report with Mint or perhaps its upstream, Debian. Say something like: this bug is fixed but the patch hasn't made it upstream, please add it to the Debian kernel tree. Make sure you include the link to this bug report. Perhaps distros are willing to include the patch in their own kernel trees. (Big distros like Debian and Red Hat apply many out-of-tree patches to their shipped kernels.)

Chang
Created attachment 114351 [details] A better patch. Use this instead of the first one. As per Takao's suggestion in the mailing list, use a more generic mechanism instead of blacklisting.
Chang - I don't believe the SATA controller is 'defective'. As I said when it was originally proposed - the bus master disable is simply wrong. It's a miracle it doesn't break a lot more stuff.
(In reply to Alan from comment #24)
> Chang - I don't believe the SATA controller is 'defective'. As I said when
> it was originally proposed - the bus master disable is simply wrong. It's a
> miracle it doesn't break a lot more stuff.

Yeah, I agree that disabling bus master is not a good idea, and that the proper solution would be to fix the driver(s) so that they stop doing DMA when told to. But in the absence of such a fix (and since I cannot get in touch with the original author, we don't even know which driver is wrong), perhaps a reasonable compromise is to let each driver decide whether it wants its Bus Master bit disabled or not? This is what I did in my second patch sent to the linux-pci mailing list.

Any comments are very welcome!
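To make the "let the driver decide" idea concrete, here is a minimal userspace model in C. It is a sketch under assumptions, not the second patch itself, whose contents are not shown in this thread. The `model_*` types and the `keep_bus_master` flag are hypothetical names. Only the "state <= D3hot" check mirrors the kernel code quoted earlier.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of PCI power states, ordered like the kernel's D0..D3cold. */
enum model_power_state {
    MODEL_D0, MODEL_D1, MODEL_D2, MODEL_D3HOT, MODEL_D3COLD, MODEL_UNKNOWN
};

/* Hypothetical per-driver opt-out flag. */
struct model_driver {
    bool keep_bus_master; /* true: do not clear Bus Master at shutdown */
};

struct model_device {
    enum model_power_state state;
    const struct model_driver *driver; /* may be NULL if no driver is bound */
    bool bus_master;                   /* models the Bus Master bit */
};

/* Models the shutdown path: clear Bus Master unless the driver opted out. */
void model_shutdown(struct model_device *dev)
{
    if (dev->driver != NULL && dev->driver->keep_bus_master)
        return; /* the driver asked to keep Bus Master enabled */

    /* As in the quoted kernel code, leave D3cold/unknown devices alone. */
    if (dev->state <= MODEL_D3HOT)
        dev->bus_master = false;
}
```

Compared with a device blacklist, this puts the decision in the driver that knows the hardware, at the cost of touching every driver that needs the exception.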
The link is http://article.gmane.org/gmane.linux.kernel.pci/26537
Tested on my machine. Bug fixed by commit 4fc9bbf98, PCI: Disable Bus Master only on kexec reboot, merged to upstream during 3.13-rc4.
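Going by the commit title, the merged fix clears Bus Master only when a kexec reboot is in progress, so a normal power off no longer touches the bit. The userspace C model below sketches that condition. It is an assumption-laden illustration, not the kernel source: the `kx_*` names are hypothetical, and the kexec flag is passed in as a plain parameter here.

```c
#include <stdbool.h>

/* Simplified model of PCI power states, ordered like the kernel's D0..D3cold. */
enum kx_power_state {
    KX_D0, KX_D1, KX_D2, KX_D3HOT, KX_D3COLD, KX_UNKNOWN
};

struct kx_device {
    enum kx_power_state state;
    bool bus_master; /* models the Bus Master bit */
};

/*
 * Models the shutdown path after the fix: Bus Master is cleared only for a
 * kexec reboot, and only for devices not in D3cold or an unknown state.
 */
void kx_shutdown(struct kx_device *dev, bool kexec_in_progress)
{
    if (kexec_in_progress && dev->state <= KX_D3HOT)
        dev->bus_master = false;
}
```

With `kexec_in_progress` false, as on a plain power off, the device keeps Bus Master set, which matches the behaviour that made this machine power off again. With it true, the protection the original commit added for kexec is kept.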