Bug 63861 - [FIXED]Unable to poweroff on Acer Aspire V5-573G
Status: VERIFIED CODE_FIX
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: x86-64 Linux
Importance: P1 high
Assigned To: Lan Tianyu
Depends on:
Blocks:
Reported: 2013-10-27 14:13 UTC by Chang Liu
Modified: 2014-01-01 07:15 UTC (History)

See Also:
Kernel Version: 3.5.0 --- 3.12-rc7
Tree: Mainline
Regression: Yes


Attachments
dmesg output for kernel 3.11.6, after two suspend to RAM (66.65 KB, text/plain)
2013-10-27 14:13 UTC, Chang Liu
Output of acpidump (401.82 KB, text/plain)
2013-10-27 14:14 UTC, Chang Liu
Output of dmidecode (9.66 KB, text/plain)
2013-10-27 14:15 UTC, Chang Liu
dmesg output on kernel 3.12-rc6 (54.06 KB, text/plain)
2013-10-27 14:21 UTC, Chang Liu
Kernel config for 3.11.6 (138.86 KB, text/plain)
2013-10-27 14:23 UTC, Chang Liu
This patch adds a blacklist for the offending device and fixes this problem. (3.22 KB, patch)
2013-10-30 00:57 UTC, Chang Liu
A better patch. Use this instead of the first one. (3.21 KB, patch)
2013-11-12 11:52 UTC, Chang Liu

Description Chang Liu 2013-10-27 14:13:19 UTC
Created attachment 112391 [details]
dmesg output for kernel 3.11.6, after two suspend to RAM

The machine won't turn off completely when I try to power off my laptop. The kernel prints:
Reboot: Power down
and hangs there indefinitely.

I tried 3.10.17, 3.11.6, and 3.12-rc6, with and without the kernel boot parameter acpi_osi="!Windows 2012". No luck at all.

Suspend to RAM works correctly. Reboot works correctly too.

See attachments for dumps of dmidecode, acpidump, and dmesg.
Comment 1 Chang Liu 2013-10-27 14:14:32 UTC
Created attachment 112401 [details]
Output of acpidump
Comment 2 Chang Liu 2013-10-27 14:15:18 UTC
Created attachment 112411 [details]
Output of dmidecode
Comment 3 Chang Liu 2013-10-27 14:21:55 UTC
Created attachment 112421 [details]
dmesg output on kernel 3.12-rc6
Comment 4 Chang Liu 2013-10-27 14:23:11 UTC
Created attachment 112431 [details]
Kernel config for 3.11.6
Comment 5 Lan Tianyu 2013-10-28 02:27:10 UTC
Did power off ever work on this machine before?

Please try the kernel parameter "reboot=kbd", or reboot=pci, reboot=efi, reboot=triple, reboot=bios.
Comment 6 Chang Liu 2013-10-28 05:25:45 UTC
Power off works fine on stock Windows 8. With the upgrade to BIOS version 2.22 (which I am using now), it works fine on Windows 7 too. I'm not sure if this is related, but users with earlier BIOS versions reported a problem where Windows 7 was unable to power off the machine (and showed a BSOD on the subsequent boot) if the USB 3.0 driver was installed. That problem has been fixed in newer BIOS versions.

None of the above kernel parameters seems to fix this problem. I've tested them on kernel 3.11.6 and 3.12-rc6.
Comment 7 Chang Liu 2013-10-28 05:37:40 UTC
I just got this machine a few days ago, so I haven't used earlier kernels on it. I'll try testing it on an earlier version of the kernel later today.
Comment 8 Lan Tianyu 2013-10-28 05:45:31 UTC
(In reply to Chang Liu from comment #6)
> Power off works fine on stock Windows 8. With the upgrade to BIOS version
> 2.22 (which I am using now), it works fine on Windows 7 too. I'm not sure if
> this is related but users with earlier versions of BIOS reported a problem
> of Windows 7 unable to power off the machine (and shows BSOD on subsequent
> boot) if it has USB 3.0 driver installed. The problem has been fixed with
> newer BIOS version.

OK. How about not loading the xhci driver on Linux?

> 
> None of the above kernel parameters seem to fix this problem. I've tested
> them on kernel 3.11.6 and 3.12-rc6.

Could you check an older kernel, e.g. v3.3? Some other reporters found that power off stopped working on other machines somewhere between v3.3 and v3.5.

BTW, is VT-d enabled in the BIOS? If yes, please disable it and try again.
Comment 9 Chang Liu 2013-10-28 12:56:11 UTC
(In reply to Lan Tianyu from comment #8)
> Ok. How about not loading xhci driver on the linux?
> 

Sure. I tried to rmmod the xhci kernel module. The problem persists. I also tried compiling the kernel (version 3.7) without xhci. No luck either.

> BTW, do you enable vt-d in the Bios? If yes, please disable it and try again.

I couldn't find a VT-d option in the BIOS, so I'm not sure how to disable it.

> Could you check the older kernel E.G, v3.3. Some other reporters find the
> power off can't work on the other machine between v3.3~v3.5.
> 

Yes. I tested on kernel 3.3.8. The problem is gone! The machine powered off just fine. So I tested a bunch of other kernel releases. Here are the results:

VERSION      WORKS?
3.3.8        YES
3.4.44       YES
3.4.51       YES
3.5.0-rc1    NO
3.5.0        NO
3.5.1        NO
3.7.0        NO

So the regression was evidently introduced in the 3.5 cycle. I looked at the git log and saw several ACPI-related commits. I'll be running git-bisect to locate the commit that introduced the bug. Any pointers on where I should be looking? If this is a valid question: in your opinion, which commit(s) are most likely to have caused this regression?
Comment 10 Chang Liu 2013-10-29 01:55:10 UTC
OK. The git-bisect is done. Looks like b566a22c23327f18ce941ffad0ca907e50a53d41 is the first bad commit
commit b566a22c23327f18ce941ffad0ca907e50a53d41
Author: Khalid Aziz <khalid.aziz@hp.com>
Date:   Fri Apr 27 13:00:33 2012 -0600

    PCI: disable Bus Master on PCI device shutdown
    
    Disable Bus Master bit on the device in pci_device_shutdown() to ensure PCI
    devices do not continue to DMA data after shutdown.  This can cause memory
    corruption in case of a kexec where the current kernel shuts down and
    transfers control to a new kernel while a PCI device continues to DMA to
    memory that does not belong to it any more in the new kernel.
    
    I have tested this code on two laptops, two workstations and a 16-socket
    server.  kexec worked correctly on all of them.
    
    Signed-off-by: Khalid Aziz <khalid.aziz@hp.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

:040000 040000 599871e51d03f3859da648de41a0db25eb046c13 63b1f26a5c20523bb7fd56039e4f641ad240b925 M	drivers
Comment 11 Lan Tianyu 2013-10-29 07:40:03 UTC
Great job!!!
Checking the code of v3.12-rc5, the issue seems to be caused by the PCI Bus Master bit being cleared during shutdown.

Please try this patch on v3.12-rc5.

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 98f7b9b..20758ef 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -392,8 +392,8 @@ static void pci_device_shutdown(struct device *dev)
         * Turn off Bus Master bit on the device to tell it to not
         * continue to do DMA. Don't touch devices in D3cold or unknown states.
         */
-       if (pci_dev->current_state <= PCI_D3hot)
-               pci_clear_master(pci_dev);
+//     if (pci_dev->current_state <= PCI_D3hot)
+//             pci_clear_master(pci_dev);
 }
 
 #ifdef CONFIG_PM
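
For context on what the disabled lines do: Bus Master is bit 2 (mask 0x4) of the PCI command register, and clearing it tells the device to stop initiating DMA. The following is only a minimal user-space sketch of that bit manipulation. The mock_* names are hypothetical stand-ins, not kernel API, and no real hardware is touched.

```c
#include <stdint.h>

/* Command register bits as defined by the PCI specification. */
#define PCI_COMMAND_IO      0x1  /* respond to I/O space accesses */
#define PCI_COMMAND_MEMORY  0x2  /* respond to memory space accesses */
#define PCI_COMMAND_MASTER  0x4  /* allow the device to initiate DMA */

/* Hypothetical stand-in for a device's 16-bit command register. */
struct mock_pci_dev {
    uint16_t command;
};

/* Clear only the Bus Master bit, leaving the other bits untouched. */
static void mock_clear_master(struct mock_pci_dev *dev)
{
    dev->command &= (uint16_t)~PCI_COMMAND_MASTER;
}

/* Report whether the device may still initiate DMA. */
static int mock_is_bus_master(const struct mock_pci_dev *dev)
{
    return (dev->command & PCI_COMMAND_MASTER) != 0;
}
```

In this model, a device that starts with I/O, memory and Bus Master enabled keeps the first two after mock_clear_master() and loses only the third.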
Comment 12 Chang Liu 2013-10-29 08:17:15 UTC
Yes. I commented out the offending lines on torvalds' latest kernel tree (as of c9ca72fc568403db192e199b752c9c253e5f5fd9) and recompiled the kernel. The problem is fixed! The machine powers off perfectly. If I don't comment out these lines, the machine hangs at power off.

The original patch that introduced the regression seems to be about fixing kexec. I have very limited knowledge of both kexec and PCI, but I feel the commit cannot simply be reverted. It looks like we need to be more careful when clearing the PCI Bus Master bit during shutdown. Perhaps I should inform the original authors of the related commits?
Comment 13 Chang Liu 2013-10-29 08:34:11 UTC
I've added Mr. Konstantin Khlebnikov and Mr. Bjorn Helgaas to the CC list as they authored/reviewed the relevant patches (commit 6e0eda3, 20f2420, 7897e60). I cannot find the author (Mr./Ms. Khalid Aziz) of the original commit (b566a22) on bugzilla though.
Comment 14 Lan Tianyu 2013-10-29 09:07:27 UTC
Yes, it's better to inform the PCI guys and the original commit author.

BTW, is it possible to check which PCI device causes this issue?
Comment 15 Konstantin Khlebnikov 2013-10-29 09:42:10 UTC
Clearing the master bit may break devices. Alan Cox warned us about this: https://lkml.org/lkml/2012/6/6/545
But keeping it set may break kexec if a driver doesn't shut down its device correctly.

I think, xhci needs some quirk for this hardware...
Comment 16 Chang Liu 2013-10-30 00:56:14 UTC
(In reply to Lan Tianyu from comment #14)
> Yes, it's better to info PCI guys and origin commit author.
> 
> BTW, it's possible to check which pci device cause this issue?

Yes! By brute-force enumeration I was able to find the very device that is causing the hang:
Intel Corporation Lynx Point-LP SATA Controller, vendor:device == 8086:9c03
Took me hours of compiling, rebooting, compiling, rebooting...
Comment 17 Chang Liu 2013-10-30 00:57:56 UTC
Created attachment 112711 [details]
This patch adds a blacklist for the offending device and fixes this problem.

Tested on the latest torvalds tree. Fixes the power off hang on my machine.
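
The attached patch is not reproduced in this report. As a rough illustration of what an ID-based blacklist could look like, here is a self-contained sketch that matches a vendor:device pair against a small table. The struct, table and function names are hypothetical and are not taken from the patch. Only the 8086:9c03 pair comes from comment 16.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vendor:device pair, in the form quoted in comment 16. */
struct pci_id_pair {
    uint16_t vendor;
    uint16_t device;
};

/* Devices that should keep Bus Master set at shutdown (illustrative). */
static const struct pci_id_pair keep_master_ids[] = {
    { 0x8086, 0x9c03 },  /* Intel Lynx Point-LP SATA Controller */
};

/* Return 1 if the given device is on the list, 0 otherwise. */
static int keep_master_on_shutdown(uint16_t vendor, uint16_t device)
{
    size_t n = sizeof(keep_master_ids) / sizeof(keep_master_ids[0]);

    for (size_t i = 0; i < n; i++) {
        if (keep_master_ids[i].vendor == vendor &&
            keep_master_ids[i].device == device)
            return 1;
    }
    return 0;
}
```

A shutdown path built on this idea would skip clearing Bus Master whenever keep_master_on_shutdown() returns 1 for the device at hand.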
Comment 18 Chang Liu 2013-10-30 01:57:47 UTC
Patch sent to the linux-pci mailing list: http://marc.info/?l=linux-pci&m=138309785519515&w=2
Comment 19 Lan Tianyu 2013-10-30 02:19:13 UTC
Did you test kexec? Will this bring back the issue that commit b566a22c fixed?

BTW, the patch you sent to the PCI mailing list is missing a Signed-off-by line.
Comment 20 Chang Liu 2013-10-30 03:12:36 UTC
(In reply to Lan Tianyu from comment #19)
> Do you test the kexec? Will this trigger the previous issue commit b566a22c
> fix.
> 
> BTW, your patch sent to PCI maillist loses signed-off-by.

Kexec doesn't work on this machine, even without the patch applied (I tested unpatched 3.11.6 and 3.12-rc7).
This is possibly because some devices don't reinitialize themselves properly. On 3.11.6, kexec -e triggers an immediate reboot. On 3.12-rc7, both with and without this patch applied, kexec -e gets halfway into booting the new kernel, but then a reboot is triggered as well. I'm not sure whether this patch has helped the kexec situation on this machine, but since it has no effect on machines without this defective SATA controller (and/or its firmware), I would say it is unlikely to negatively affect kexec on other machines.

I'll resend with the Signed-off-by line and this kexec comment.
Comment 21 Jility 2013-11-03 15:49:51 UTC
Hi, I am new to this and have been looking for a fix for the exact same problem, as I have the same machine (V5-573G). I am happy somebody has found the cause and provided a patch. However, as I have never patched a kernel, I am not sure whether I should try it or just wait for the fix to land in a future release.
Do you think this fix will be included in 3.12-rc8, or should I try to apply it on my own? If so, could you give me any references on how to do so properly? I am on Mint 15.

Best, Felix
Comment 22 Chang Liu 2013-11-04 01:15:36 UTC
Hi Felix,

I've sent this patch to the linux-pci mailing list, but so far nobody has commented :( I'm not familiar with Linux kernel development, but I have heard from other sources that a patch needs to go through several reviews before being merged into mainline. It therefore looks very unlikely that this patch will make it into 3.12-rc8 or even 3.13, so you (and I, since I'm also affected by this bug) are on our own. :<

To compile a kernel you need, in general, to unpack the kernel tarball, apply the necessary patches, configure the kernel, build it, and install the kernel and its modules. See, for example, https://wiki.archlinux.org/index.php/Kernels/Compilation/Traditional (before the make mrproper line, make sure to apply the patch provided in this bug report):
patch -Np1 -i "${path}/0001-PCI-Blacklist-certain-hardware-from-clearing-Bus-Mas.patch"

If you don't feel like going through the hassle, you can file a bug report with Mint or perhaps its upstream, Debian. Say something like: this bug is fixed but the patch hasn't made it upstream, please add it to the Debian kernel tree. Make sure you include the link to this bug report. Perhaps distros will be willing to include the patch in their own kernel trees. (Big distros like Debian and Red Hat apply many out-of-tree patches to the kernels they ship.)

Chang
Comment 23 Chang Liu 2013-11-12 11:52:09 UTC
Created attachment 114351 [details]
A better patch. Use this instead of the first one.

As per Takao's suggestion on the mailing list, this uses a more generic mechanism instead of blacklisting.
Comment 24 Alan 2013-11-12 16:12:46 UTC
Chang - I don't believe the SATA controller is 'defective'. As I said when it was originally proposed - the bus master disable is simply wrong. It's a miracle it doesn't break a lot more stuff.
Comment 25 Chang Liu 2013-11-13 12:00:50 UTC
(In reply to Alan from comment #24)
> Chang - I don't believe the SATA controller is 'defective'. As I said when
> it was originally proposed - the bus master disable is simply wrong. It's a
> miracle it doesn't break a lot more stuff.

Yeah, I agree that disabling bus master is not a good idea, and that the proper solution would be to fix the driver(s) so that they stop doing DMA when told to. But in the absence of such a fix (and since I cannot get in touch with the original author, we don't even know which driver is at fault), perhaps a reasonable compromise is to let each driver decide whether it wants its Bus Master bit disabled or not? This is what I did in my second patch, sent to the linux-pci mailing list. Any comments are very welcome!
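
The second patch itself is not quoted in this report. Purely to illustrate the "let the driver decide" idea, the sketch below models a per-driver opt-out flag that the shutdown path consults before clearing Bus Master. Every name here (mock_driver, keep_master_on_shutdown, mock_shutdown) is an assumption made for the example and may not match what the actual patch does.

```c
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND_MASTER 0x4  /* Bus Master bit of the command register */

/* Hypothetical driver descriptor; the flag name is invented here. */
struct mock_driver {
    const char *name;
    int keep_master_on_shutdown;  /* 1: driver asks to keep Bus Master set */
};

/* Hypothetical device with a command register and an optional driver. */
struct mock_device {
    uint16_t command;
    const struct mock_driver *driver;  /* NULL if no driver is bound */
};

/* Clear Bus Master unless the bound driver has opted out. */
static void mock_shutdown(struct mock_device *dev)
{
    if (dev->driver != NULL && dev->driver->keep_master_on_shutdown)
        return;
    dev->command &= (uint16_t)~PCI_COMMAND_MASTER;
}
```

The appeal of this shape is that the default behaviour stays unchanged for every driver that does not set the flag, so only drivers known to misbehave need to be touched.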
Comment 26 Chang Liu 2013-11-13 12:20:30 UTC
The link is http://article.gmane.org/gmane.linux.kernel.pci/26537
Comment 27 Chang Liu 2013-12-18 11:01:24 UTC
Tested on my machine. The bug is fixed by commit 4fc9bbf98 ("PCI: Disable Bus Master only on kexec reboot"), merged upstream during 3.13-rc4.
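
The commit is not quoted in this report, so the following is only a sketch of the behaviour its title describes: Bus Master is cleared at shutdown only when a kexec reboot is under way, which leaves an ordinary power off untouched. The types and the mock_device_shutdown() function are hypothetical, and the power-state check mirrors the condition visible in the diff in comment 11.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_MASTER 0x4  /* Bus Master bit of the command register */

/* Power states, ordered so that D3hot and shallower compare <= MOCK_D3HOT. */
enum mock_power_state {
    MOCK_D0, MOCK_D1, MOCK_D2, MOCK_D3HOT, MOCK_D3COLD, MOCK_UNKNOWN
};

/* Hypothetical device model: command register plus current power state. */
struct mock_dev {
    uint16_t command;
    enum mock_power_state current_state;
};

/*
 * Clear Bus Master only for a kexec reboot, and only for devices that
 * are not in D3cold or an unknown state.
 */
static void mock_device_shutdown(struct mock_dev *dev, bool kexec_in_progress)
{
    if (kexec_in_progress && dev->current_state <= MOCK_D3HOT)
        dev->command &= (uint16_t)~PCI_COMMAND_MASTER;
}
```

With kexec_in_progress false, as in a normal power off, the command register is left alone, which matches the observation in comment 12 that skipping the clear lets this machine power off.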
