Bug 156341 - Nvidia fails to power on again, resulting in AML_INFINITE_LOOP/lockups (multiple laptops affected)
Status: NEEDINFO
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: Rafael J. Wysocki

Reported: 2016-09-08 10:25 UTC by Peter Wu
Modified: 2019-11-18 23:55 UTC
CC list: 64 users

Kernel Version: 4.4.0; 4.6; 4.7 (with nouveau + PCI/PM patches)
Tree: Mainline
Regression: No

Attachments
dmesg for v4.7-rc5 (triggered runtime-resume via writing "on" to (nvidia device)/power/control) (129.04 KB, text/plain)
2016-09-08 10:25 UTC, Peter Wu
acpidump for Clevo P651RA (BIOS 1.05.07) (672.72 KB, text/plain)
2016-09-12 12:07 UTC, Peter Wu
attachment-10037-0.html (774 bytes, text/html)
2016-11-03 13:10 UTC, Sam McLeod
acpidump for Dell XPS15 9560 KabyLake i7-7700HQ/GTX1050M BIOS 1.0.3 (998.13 KB, text/plain)
2017-02-15 07:53 UTC, Andrzej
acpidump for ASUS N552VW-FI056T SkyLake i7-6700HQ/GTX 960M BIOS 3.0.0 (973.66 KB, text/plain)
2017-02-15 21:11 UTC, Giambattista Bloisi
Patch for XPS 9560 (1.10 KB, patch)
2017-02-21 01:26 UTC, Tobias Schumacher
acpidump for HP zBook Studio G3 (1009.15 KB, text/plain)
2017-03-01 22:17 UTC, Bruno Pagani
acpidump (WS72 with MS-1776 motherboard & Quadro M2000M). (890.43 KB, text/plain)
2017-04-09 07:19 UTC, Px Laurent
i7-6700HQ CPU. Nvidia GeForce GTX965m. Clevo model N150RF. (682.03 KB, text/plain)
2017-05-17 14:10 UTC, Alexander
acpidump for Samsung Notebook Spin 7 (565.53 KB, text/plain)
2017-05-23 22:57 UTC, Yaohan Chen
MSI GP62 7RD acpidump (1.03 MB, text/plain)
2017-05-26 10:44 UTC, Robert Brock
Dmesg on MSI GP72 6QE after Juan mofication (61.38 KB, text/plain)
2017-06-17 11:12 UTC, Remy LABENE
dmidecode Laptop MSI GP72 6QE (23.44 KB, text/plain)
2017-06-17 14:15 UTC, Remy LABENE
acpidump Alienware 15R3 (1.11 MB, text/plain)
2017-07-19 23:48 UTC, taijian
dmidecode Alienware 15R3 (26.20 KB, text/plain)
2017-07-19 23:48 UTC, taijian
dmesg with irqbalance deamon (57.56 KB, text/plain)
2017-10-31 22:48 UTC, Remy LABENE
dmesg with MS-16K2-based laptop (acpi_osi overrides in effect) (17.14 KB, application/octet-stream)
2017-11-01 21:17 UTC, Zack Weinberg
dmidecode with MS-16K2-based laptop (acpi_osi overrides in effect) (4.78 KB, application/gzip)
2017-11-01 21:19 UTC, Zack Weinberg
acpidump with MS-16K2-based laptop (acpi_osi overrides in effect) (231.03 KB, application/gzip)
2017-11-01 21:19 UTC, Zack Weinberg
acpidump for MSI GE62 7RE-210FR (1.04 MB, text/plain)
2017-12-12 16:35 UTC, Etienne URBAH
dmesg from thinkpad t440 laptop (61.19 KB, text/plain)
2018-07-04 15:54 UTC, colin.wu
pcie link issue workaround (4.47 KB, patch)
2018-09-03 12:48 UTC, Karol Herbst
acpidump from anomalous Alienware 13 R2 (949.91 KB, text/plain)
2018-09-26 08:54 UTC, Maik Freudenberg
pci driver quirk to re-train link on resume (1.79 KB, text/plain)
2018-09-26 09:03 UTC, Maik Freudenberg
acpidump HP Omen 15 dc0307ng (1.83 MB, text/plain)
2018-11-05 16:30 UTC, Matthias Fulz
attachment-17158-0.html (2.04 KB, text/html)
2018-11-23 08:35 UTC, Alexander
DMESG|grep ACPI kernel boot log (10.62 KB, text/plain)
2018-12-02 19:44 UTC, david.kremer.dk

Description Peter Wu 2016-09-08 10:25:10 UTC
Created attachment 232611 [details]
dmesg for v4.7-rc5 (triggered runtime-resume via writing "on" to (nvidia device)/power/control)

See also https://www.spinics.net/lists/linux-pci/msg53694.html ("Kernel Freeze with American Megatrends BIOS") for more details (acpidump, lspci, some analysis, etc.).

Steps to reproduce on the affected machines:

 1. Load nouveau.
 2. Wait for it to runtime suspend.
 3. Invoke 'lspci'; this resumes the Nvidia PCI device via nouveau. (Alternatively: write "on" to /sys/bus/pci/devices/0000:01:00.0/power/control)
 4. lspci never returns; a few moments later an AML_INFINITE_LOOP is reported.
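
The alternative trigger in the steps above can be sketched in Python. This is only a minimal sketch: it assumes the Nvidia GPU sits at 0000:01:00.0 as in the path above, and the helper names force_runtime_resume and runtime_status are hypothetical. On an affected machine the write may never return, so it should only be run where a lockup is acceptable.

```python
from pathlib import Path

# Assumption: 0000:01:00.0 is the Nvidia GPU, as in the steps above.
DEFAULT_DEVICE = Path("/sys/bus/pci/devices/0000:01:00.0")


def runtime_status(device_dir: Path = DEFAULT_DEVICE) -> str:
    """Return power/runtime_status, e.g. "suspended" or "active"."""
    return (device_dir / "power" / "runtime_status").read_text().strip()


def force_runtime_resume(device_dir: Path = DEFAULT_DEVICE) -> str:
    """Write "on" to power/control and return the value read back.

    "on" forbids runtime suspend, so the kernel resumes the device.
    On an affected machine this write may hang instead of returning.
    """
    control = device_dir / "power" / "control"
    control.write_text("on\n")
    return control.read_text().strip()
```

Checking runtime_status() before calling force_runtime_resume() confirms that the device really was suspended, which is the precondition for step 3.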

Affected machines from
https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-234494238
- Clevo P651RA (and other Clevo P6xxRx models).
- MSI GE62 Apache Pro
- Gigabyte P35V5
- Razer Blade 14" (2016)
- Dell Inspiron 7559

These *new* laptops all have a Skylake CPU (i7-6700HQ) and an Nvidia GTX 9xxM GPU. Originally it was only observed for laptops with AMI BIOSes, but later we found a Dell laptop as well. The workaround acpi_osi="!Windows 2015" prevents Linux from reporting Windows 10 compatibility and helps *in some cases* because the ACPI code falls back to a different approach to power on the device (or PCIe link?).

Attached is one of the more interesting dmesg dumps that could be obtained; it shows how the system breaks down over time. (This was v4.7-rc5 with PCI/PM D3cold + nouveau power resource/PM refcount leaks patches, but the problem was also visible on unpatched 4.4.0, for example.)
Comment 1 Zhang Rui 2016-09-12 10:51:42 UTC
Let's focus on one platform first.
Anyone who encounters this problem and can respond quickly, please attach the acpidump of the platform.
Comment 2 Zhang Rui 2016-09-12 10:54:29 UTC
Okay, let's focus on Clevo_P651RA first.
Comment 3 Zhang Rui 2016-09-12 10:55:33 UTC
I don't see how to download the acpidump file at https://github.com/Lekensteyn/acpi-stuff/blob/master/dsl/Clevo_P651RA/acpidump.txt
Can you please attach it here?
Comment 4 Peter Wu 2016-09-12 12:07:08 UTC
Created attachment 233091 [details]
acpidump for Clevo P651RA (BIOS 1.05.07)

You can download the file via the "Raw" link on Github. I have attached a copy of the acpidump.

Of interest is the \_SB.PCI0.PGON method. See also this extract:
https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt#L94
Comment 5 Len Brown 2016-09-19 23:49:11 UTC
Does this still fail if you use the proprietary nvidia driver?
Comment 6 Lv Zheng 2016-09-20 02:36:24 UTC
Peter:
Could you first try this: attachment 239241 [details]

Rui:
Do you have a PCI contact? Can we have them look at the issue first?
From this link:
https://www.spinics.net/lists/linux-pci/msg53694.html 
Looks like a PCI power management gap if the attachment 239241 [details] doesn't help.

Thanks
Lv
Comment 7 Peter Wu 2016-09-20 08:31:09 UTC
(In reply to Len Brown from comment #5)
> Does this still fail if you use the proprietary nvidia driver?

I have not tried the proprietary driver, but AFAIK the blob makes no attempt to put the device into the D3 state.


(In reply to Lv Zheng from comment #6)
> Peter:
> Should you first try this: attachment 239241 [details]

I can try, but would it really help? Not all firmware has this loop; some will just assume that the link state is correct. This is the affected loop:

    While ((\_SB.PCI0.PEG0.LNKS < 0x07)) {
        Local0 = 0x20
        While (Local0) {
            If ((\_SB.PCI0.PEG0.LNKS < 0x07)) {
                Stall (0x64)
                Local0--
            } Else {
                Break
            }
        }

        If ((Local0 == Zero)) {
            \_SB.PCI0.PEG0.RTLK = One
            Stall (0x64)
        }
    }

In one trace I observed that the outer loop was executed 29 times, which means a delay of about 29 * (32 * 100us + 100us) = 95.7ms.
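
The arithmetic above, and the reason the loop never terminates when the link stays down, can be checked with a small Python model. This is a hedged sketch, not the firmware's actual behaviour: read_lnks and retrain are hypothetical stand-ins for reading LNKS and setting RTLK, and max_outer is an artificial bound that the AML does not have.

```python
STALL_US = 0x64        # Stall (0x64): busy-wait for 100 microseconds
INNER_RETRIES = 0x20   # Local0 = 0x20: up to 32 polls per outer pass
LINK_READY = 0x07      # the loop exits once LNKS is no longer below 0x07


def wait_for_link(read_lnks, retrain, max_outer):
    """Model the AML loop; return (link_up, total_stall_us).

    max_outer is an artificial bound for this model only; the AML
    spins until LNKS changes, hence AML_INFINITE_LOOP.
    """
    stalled = 0
    for _ in range(max_outer):
        if read_lnks() >= LINK_READY:
            return True, stalled
        local0 = INNER_RETRIES
        while local0:
            if read_lnks() < LINK_READY:
                stalled += STALL_US
                local0 -= 1
            else:
                break
        if local0 == 0:
            retrain()            # RTLK = One
            stalled += STALL_US  # Stall (0x64) after the retrain request
    return read_lnks() >= LINK_READY, stalled


# A link that never comes up, observed for 29 outer passes:
up, us = wait_for_link(lambda: 0, lambda: None, 29)
print(up, us / 1000, "ms")
```

With a link that never reaches 0x07, 29 passes account for 95700 us of stall time, matching the 95.7ms figure; without the artificial bound the loop would simply keep going.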
Comment 8 Lv Zheng 2016-09-21 05:53:35 UTC
Do you mean it's already long enough (95.7ms) for this case, and waiting longer won't solve the issue?
I don't know; I just want to rule out possible causes of the bug.

I'm not a PCI expert. So let me ask.
From the following AML, RTLK/LNKS belong to a PCI register space:
    OperationRegion (SANV, SystemMemory, 0x5FF9BD98, 0x0135)
    Field (SANV, AnyAcc, Lock, Preserve)
    {
        ASLB,   32, 
        IMON,   8, 
        IGDS,   8, 
        IBTT,   8, 
        IPAT,   8, 
        IPSC,   8, 
        IBIA,   8, 
        ISSC,   8, 
        IDMS,   8, 
        IF1E,   8, 
        HVCO,   8, 
        GSMI,   8, 
        PAVP,   8, 
        CADL,   8, 
        CSTE,   16, 
        NSTE,   16, 
        NDID,   8, 
        DID1,   32, 
        DID2,   32, 
        DID3,   32, 
        DID4,   32, 
        DID5,   32, 
        DID6,   32, 
        DID7,   32, 
        DID8,   32, 
        DID9,   32, 
        DIDA,   32, 
        DIDB,   32, 
        DIDC,   32, 
        DIDD,   32, 
        DIDE,   32, 
        DIDF,   32, 
        DIDX,   32, 
        NXD1,   32, 
        NXD2,   32, 
        NXD3,   32, 
        NXD4,   32, 
        NXD5,   32, 
        NXD6,   32, 
        NXD7,   32, 
        NXD8,   32, 
        NXDX,   32, 
        LIDS,   8, 
        KSV0,   32, 
        KSV1,   8, 
        BRTL,   8, 
        ALSE,   8, 
        ALAF,   8, 
        LLOW,   8, 
        LHIH,   8, 
        ALFP,   8, 
        IMTP,   8, 
        EDPV,   8, 
        SGMD,   8, 
        SGFL,   8, 
        SGGP,   8, 
        HRE0,   8, 
        HRG0,   32, 
        HRA0,   8, 
        PWE0,   8, 
        PWG0,   32, 
        PWA0,   8, 
        P1GP,   8, 
        HRE1,   8, 
        HRG1,   32, 
        HRA1,   8, 
        PWE1,   8, 
        PWG1,   32, 
        PWA1,   8, 
        P2GP,   8, 
        HRE2,   8, 
        HRG2,   32, 
        HRA2,   8, 
        PWE2,   8, 
        PWG2,   32, 
        PWA2,   8, 
        DLPW,   16, 
        DLHR,   16, 
        EECP,   8, 
        XBAS,   32, <- XBAS
        GBAS,   16, 
        NVGA,   32, 
        NVHA,   32, 
        AMDA,   32, 
        LTRX,   8, 
        OBFX,   8, 
        LTRY,   8, 
        OBFY,   8, 
        LTRZ,   8, 
        OBFZ,   8, 
        SMSL,   16, 
        SNSL,   16, 
        P0UB,   8, 
        P1UB,   8, 
        P2UB,   8, 
        PCSL,   8, 
        PBGE,   8, 
        M64B,   64, 
        M64L,   64, 
        CPEX,   32, 
        EEC1,   8, 
        EEC2,   8, 
        SBN0,   8, 
        SBN1,   8, 
        SBN2,   8, 
        M32B,   32, 
        M32L,   32, 
        P0WK,   32, 
        P1WK,   32, 
        P2WK,   32, 
        MXD1,   32, 
        MXD2,   32, 
        MXD3,   32, 
        MXD4,   32, 
        MXD5,   32, 
        MXD6,   32, 
        MXD7,   32, 
        MXD8,   32, 
        PXFD,   8, 
        EBAS,   32, 
        DGVS,   32, 
        DGVB,   32, 
        HYSS,   32
    }

        OperationRegion (RPCX, SystemMemory, Add (\XBAS, 0x8000), 0x1000)
        Field (RPCX, ByteAcc, NoLock, Preserve)
        {
            Offset (0x04), 
            CMDR,   8, 
            Offset (0x84), 
            D0ST,   2, 
            Offset (0xAA), 
            CEDR,   1, 
            Offset (0xB0), 
                ,   5, 
            RTLK,   1, <- RTLK
            Offset (0xC9), 
                ,   2, 
            LREN,   1, 
            Offset (0x216), 
            LNKS,   4, <- LNKS
        }
Can you infer what it is from the above AML?

Thanks
Comment 9 Lv Zheng 2016-09-21 05:56:59 UTC
It looks like the AML code in PGON prior to this loop should always make the condition true. What the platform needs to do is to wait.
So IMO, the code prior to this loop is more important for root-causing this issue.
Comment 10 Peter Wu 2016-09-21 12:23:08 UTC
(In reply to Lv Zheng from comment #8)
> Do you mean it's already long enough (95.7ms) for this case, and waiting
> longer won't solve the issue?

That would be the theoretical delay. In practice, I have several seconds of processing due to ACPI debug logging (ACPI_NAMESPACE, ACPI_DB_NAMES). The logs stop after 46 seconds, maybe because I used SysRq+B for a forced reboot (reset).

> I'm not a PCI expert. So let me ask.
> From the following AML, RTLK/LNKS belong to a PCI register space:
>     OperationRegion (SANV, SystemMemory, 0x5FF9BD98, 0x0135)
>     Field (SANV, AnyAcc, Lock, Preserve)
>     {
[snip]
> Can you infer what it is from the above AML?

XBAS is the PCIe MMIO Base Address register. I guessed that "RTLK" means "Retrain Link" (see PCIe spec 7.8.7 Link Control Register) and that "LNKS" means PCIe Link speed. I posted these on:

https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt
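
As a cross-check of that guess, the offsets in the RPCX region can be decoded with a short Python sketch. It relies on the standard PCIe ECAM layout (bus << 20 | device << 15 | function << 12) and on Link Control sitting at offset 0x10 of the PCI Express capability with Retrain Link as bit 5. The capability base of 0xA0 is an assumption that is not stated anywhere in this report; the helper names are hypothetical.

```python
RPCX_OFFSET = 0x8000   # Add (\XBAS, 0x8000) in the RPCX OperationRegion
RTLK_BYTE = 0xB0       # Offset (0xB0) in the RPCX field list
RTLK_BIT = 5           # five reserved bits precede RTLK

# Assumption (not stated in the report): the root port's PCI Express
# capability starts at 0xA0.
ASSUMED_PCIE_CAP = 0xA0
LINK_CONTROL = 0x10    # Link Control offset within the PCIe capability
RETRAIN_LINK_BIT = 5   # Retrain Link bit of Link Control


def decode_ecam_offset(offset):
    """Split an ECAM offset into (bus, device, function)."""
    return (offset >> 20) & 0xFF, (offset >> 15) & 0x1F, (offset >> 12) & 0x07


def rtlk_is_retrain_link(cap_base=ASSUMED_PCIE_CAP):
    """True if RTLK lands on the Retrain Link bit for this capability base."""
    return (RTLK_BYTE, RTLK_BIT) == (cap_base + LINK_CONTROL, RETRAIN_LINK_BIT)


bus, dev, fn = decode_ecam_offset(RPCX_OFFSET)
print(f"RPCX covers {bus:02x}:{dev:02x}.{fn}")
print("RTLK == Retrain Link:", rtlk_is_retrain_link())
```

Under that assumption, RPCX is the configuration space of 00:01.0 (the port above the GPU at 01:00.0) and RTLK coincides with the Retrain Link bit.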

(In reply to Lv Zheng from comment #9)
> It looks like the AML code in PGON prior to this loop should always make the
> condition true. What the platform needs to do is to wait.
> So IMO, the code prior to this loop is more important for root-causing
> this issue.

The loop is indeed just a consequence; the root cause lies in the difference between invoking the "LKEN" code (problematic, see line 120 of notes.txt) and the fallback code (see line 141 of notes.txt).

However, I am at a loss as to why it would be so significant. Note that I am no PCI expert either; the notes were based on the PCIe spec, ACPI tables and lots of guesswork.

Do you need more info?
Comment 11 Lv Zheng 2016-09-22 02:43:58 UTC
Let me re-assign it to Power-management category and reset the assignee to involve more developers.

Thanks
Lv
Comment 12 Rafael J. Wysocki 2016-09-27 00:09:41 UTC
Peter, one question: Why is this not regarded as a nouveau problem?
Comment 13 Peter Wu 2016-09-27 09:28:34 UTC
(In reply to Rafael J. Wysocki from comment #12)
> Peter, one question: Why is this not regarded as a nouveau problem?

Something changed in Windows 10 that made firmware authors write this specific DSDT workaround. If Linux advertises itself as Windows 7 for example, the problematic code is not triggered. (Some laptops also work when advertising "non-Windows 10", such as Windows 8).

It could be a missing piece in the nouveau driver, but exactly how to tackle that is not known. In a minimal module that uses the new PCI port runtime PM ("PR3 support") introduced with v4.8, I could also trigger the lockups.

Are you aware of changes to the policies in Windows 10 that could explain the different methods of putting a device into D3? Timing-wise or other API changes?
Comment 14 Rafael J. Wysocki 2016-09-28 00:31:04 UTC
(In reply to Peter Wu from comment #13)
> Are you aware of changes to the policies in Windows 10 that could explain the
> different methods of putting a device into D3? Timing-wise or other API
> changes?

Not at the moment, but I'm going to ask around.
Comment 15 Rafael J. Wysocki 2016-09-29 21:20:00 UTC
One difference between Windows 10 and Windows 7 I know about is that Windows 10 supports power management of PCIe ports and I bet the ASL in comment #7 is needed to cope with that.

That PCIe ports PM appears to be different from what we're going to do in 4.8+, though, which may be the source of the problem.
Comment 16 Peter Wu 2016-09-29 21:59:04 UTC
(In reply to Rafael J. Wysocki from comment #15)
> One difference between Windows 10 and Windows 7 I know about is that Windows
> 10 supports power management of PCIe ports and I bet the ASL in comment #7
> is needed to cope with that.
> 
> That PCIe ports PM appears to be different from what we're going to do in
> 4.8+, though, which may be the source of the problem.

The invoked ACPI methods (_ON/_OFF on the power resource) match between Linux and Windows 10. From a packet capture with the WinDbg kernel debugger:
https://lekensteyn.nl/files/p651ra-acpi-debug/acpi-evals.txt

Maybe some extra modifications are needed to the PCIe registers? (No idea, just guessing.)
Comment 17 Sam McLeod 2016-10-24 10:30:13 UTC
Tested against 4.9-rc2 on Fedora 25; the problem still exists.
Comment 18 Peter Wu 2016-11-03 12:55:50 UTC
(In reply to Rafael J. Wysocki from comment #15)
> That PCIe ports PM appears to be different from what we're going to do in
> 4.8+, though, which may be the source of the problem.

This is not the source of the problem; the issue already existed with older kernels.

The list of affected models keeps growing, there have been reports from additional HP, Dell and Asus laptops. All of these have in common a Skylake CPU (i7-6700HQ) and some NVIDIA GPU (Maxwell cards, GTX 950M/960M/965M/970M, Quadro M1000M). Some of the laptops are listed at the updated list in
https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-234494238

Any idea what to look into? Patches, documentation or other possible hints?

Failing a long-term solution, I am considering a temporary ACPI hack that patches the affected ACPI method to disable the conditional OSYS check:
https://github.com/Bumblebee-Project/bbswitch/issues/134#issuecomment-258117908
Comment 19 Sam McLeod 2016-11-03 13:10:19 UTC
Created attachment 243561 [details]
attachment-10037-0.html

Auto-reply: I'm out of the office at present and will be back in on the 7th, please contact systems@infoxchange.org if you require a response.
Comment 20 Rafael J. Wysocki 2016-11-16 21:49:18 UTC
(In reply to Peter Wu from comment #18)
> (In reply to Rafael J. Wysocki from comment #15)
> > That PCIe ports PM appears to be different from what we're going to do in
> > 4.8+, though, which may be the source of the problem.
> 
> This is not the source of the problem; the issue already existed with older
> kernels.
> 
> The list of affected models keeps growing, there have been reports from
> additional HP, Dell and Asus laptops. All of these have in common a Skylake
> CPU (i7-6700HQ) and some NVIDIA GPU (Maxwell cards, GTX 950M/960M/965M/970M,
> Quadro M1000M). Some of the laptops are listed at the updated list in
> https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-234494238
> 
> Any idea what to look into? Patches, documentation or other possible hints?

You said that acpi_osi="!Windows 2015" helped in some cases.  I guess the other cases (where it doesn't help) are Windows 10 only systems?
Comment 21 Rafael J. Wysocki 2016-11-16 21:52:49 UTC
And what if we simply avoided using ACPI PM with the affected device on those systems?
Comment 22 Peter Wu 2016-11-16 23:42:41 UTC
> You said that acpi_osi="!Windows 2015" helped in some cases.  I guess the
> other cases (where it doesn't help) are Windows 10 only systems?

Not sure, I did not check whether these systems support only Windows 10 (and not 7, 8 or 8.1). Some others require acpi_osi=! acpi_osi="Windows 2009" to avoid the problematic code path in the ACPI table.

(In reply to Rafael J. Wysocki from comment #21)
> And what if we simply avoided using ACPI PM with the affected device on
> those systems?

You mean acpi=off? Avoiding runtime PM in nouveau would be sufficient but kills battery life. One interesting observation is that turning off the ACPI power resource (via PCIe port PM) or system sleep seems not to trigger the issue (compared to using nouveau). Maybe I'm dreaming; I have to retest this just to be sure.

Do you have tips for tracing PCI register activities? (E.g. read/write pm regs)
Comment 23 Billy 2017-01-22 17:32:16 UTC
Hi everyone, I'm hoping to provide some helpful information here. I'm affected by this bug, in that I can't login to gnome unless I either blacklist the nouveau module or add "nouveau.runpm=0" to my kernel parameters. I've got some files here that I hope are of use to you:

Link to laptop: https://www.newegg.com/Product/Product.aspx?Item=N82E16834234412
Link to call trace where I can't login: https://paste.fedoraproject.org/533827/14851039/
Tar archive with system info: http://wbrawner.com/files/ASUSTeK_COMPUTER_INC.-X550VX.tar.gz

For what it's worth, I can't even get to the login screen with the proprietary nVidia drivers. Please let me know if I can otherwise be of assistance.
Comment 24 Kyle Agronick 2017-01-28 23:37:14 UTC
I just wanted to report that this issue is present on Lenovo W541 with 4.9.4-1. You can see my full bug report here for the symptoms: https://bugzilla.opensuse.org/show_bug.cgi?id=1022443

Putting acpi_osi=! acpi_osi="Windows 2009" in the Grub boot line fixed it.

Here are my GPUs:
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
01:00.0 VGA compatible controller: NVIDIA Corporation GK107GLM [Quadro K1100M] (rev a1)

cpuinfo prints:
Vendor ID: GenuineIntel
Hardware Raw: 
Brand: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
Hz Advertised: 2.8000 GHz
Hz Actual: 2.8000 GHz
Hz Advertised Raw: (2800000000, 0)
Hz Actual Raw: (2800000000, 0)
Arch: X86_64
Bits: 64
Count: 8
Raw Arch String: x86_64
L2 Cache Size: 6144 KB
L2 Cache Line Size: 0
L2 Cache Associativity: 0
Stepping: 3
Model: 60
Family: 6
Processor Type: 0
Extended Model: 0
Extended Family: 0
Flags: abm, acpi, aes, aperfmperf, apic, arat, arch_perfmon, avx, avx2, bmi1, bmi2, bts, clflush, cmov, constant_tsc, cx16, cx8, de, ds_cpl, dtes64, dtherm, dts, eagerfpu, epb, ept, erms, est, f16c, flexpriority, fma, fpu, fsgsbase, fxsr, ht, ida, invpcid, lahf_lm, lm, mca, mce, mmx, monitor, movbe, msr, mtrr, nonstop_tsc, nopl, nx, pae, pat, pbe, pcid, pclmulqdq, pdcm, pdpe1gb, pebs, pge, pln, pni, popcnt, pse, pse36, pts, rdrand, rdtscp, rep_good, sdbg, sep, smep, smx, ss, sse, sse2, sse4_1, sse4_2, ssse3, syscall, tm, tm2, tpr_shadow, tsc, tsc_adjust, tsc_deadline_timer, vme, vmx, vnmi, vpid, x2apic, xsave, xsaveopt, xtopology, xtpr
Comment 25 Giambattista Bloisi 2017-02-12 15:58:21 UTC
This issue is present on ASUS n552vw-fi056t (Core i7-6700HQ and NVIDIA GeForce GTX 960M) with kernel 4.9.9.

Putting acpi_osi=! acpi_osi="Windows 2009" in the GRUB boot line worked around the problem, although some functionality is lost (screen dimming shortcuts).
Comment 26 Andrzej 2017-02-15 04:54:57 UTC
The issue is also present on the KabyLake Dell XPS15 9560 with an i7-7700HQ and an NVidia GTX1050M. It manifests itself as complete freezes if the Intel card is used for X and the NVidia card is disabled with bumblebee. Then running nvidia-smi or lspci causes a freeze. The freezes do not happen if the NVidia card is enabled using bbswitch.

Some info from lspci:

00:02.0 VGA compatible controller: Intel Corporation Device 591b (rev 04)
01:00.0 3D controller: NVIDIA Corporation Device 1c8d (rev a1)
Comment 27 Andrzej 2017-02-15 04:56:21 UTC
I would like to add that acpi_osi="!Windows 2015" does not solve the problem, while acpi_osi=! acpi_osi="Windows 2009" does (it does disable the touchpad though).
Comment 28 Zhang Rui 2017-02-15 06:30:53 UTC
@Andrzej, please attach the acpidump output of your laptop.
Comment 29 Andrzej 2017-02-15 07:53:25 UTC
Created attachment 254763 [details]
acpidump for Dell XPS15 9560 KabyLake i7-7700HQ/GTX1050M BIOS 1.0.3

Here comes the acpidump for my system: Dell XPS15 9560 KabyLake i7-7700HQ/GTX1050M BIOS 1.0.3
Comment 30 Giambattista Bloisi 2017-02-15 21:11:46 UTC
Created attachment 254777 [details]
acpidump for ASUS N552VW-FI056T SkyLake i7-6700HQ/GTX 960M BIOS 3.0.0
Comment 31 Andrzej 2017-02-19 07:29:19 UTC
Is there anything else I can do to help debug this issue?
Comment 32 Tobias Schumacher 2017-02-21 01:26:08 UTC
Created attachment 254843 [details]
Patch for XPS 9560

I am facing the same problem on an XPS 9560 and had a look at the acpidump, here the corresponding check is as follows:

If ((OSYS <= 0x07D9) || ((OSYS == 0x07DF) && (_REV == 0x05)))

So, telling the BIOS that we support ACPI Rev. 5 should be sufficient for this model to allow powering down the Nvidia without locking up. There is already some code which does this for other XPS and Latitude models in drivers/acpi/blacklist.c, I extended it for the XPS 9560. I also sent the patch to the LKML.
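
The effect of that check can be tabulated with a short Python sketch. It is only a model of the condition quoted above: the reading of 0x07D9 and 0x07DF as the values the firmware stores in OSYS for "Windows 2009" and "Windows 2015" is an assumption based on their decimal values (2009 and 2015), and the function name check_passes is hypothetical.

```python
OSYS_2009 = 0x07D9  # decimal 2009; assumed to correspond to "Windows 2009"
OSYS_2015 = 0x07DF  # decimal 2015; assumed to correspond to "Windows 2015"


def check_passes(osys: int, rev: int) -> bool:
    """Python model of:
    If ((OSYS <= 0x07D9) || ((OSYS == 0x07DF) && (_REV == 0x05)))
    """
    return osys <= OSYS_2009 or (osys == OSYS_2015 and rev == 0x05)


# (OSYS, _REV) combinations and whether the condition holds.
for osys, rev in [(OSYS_2015, 2), (OSYS_2015, 5), (OSYS_2009, 2)]:
    print(hex(osys), rev, check_passes(osys, rev))
```

In this model, an OS that looks like Windows 2015 only satisfies the condition when _REV is 5, which is why reporting ACPI revision 5 is an alternative to hiding Windows 10 support through acpi_osi.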
Comment 33 Andrzej 2017-02-21 04:48:48 UTC
Thanks Tobias! I tried your patch against 4.10 kernel and indeed it does solve the freeze on Dell XPS 9560. 

I did experience problems when disabling the card on battery with TLP on. What solved it was adding the NVidia card to TLP RUNTIME_PM_BLACKLIST.
Comment 34 Tobias Schumacher 2017-02-27 22:37:51 UTC
Update: the patch didn't get accepted, but I got the hint to try booting with acpi_rev_override. 

acpi_rev_override=5 instead of the acpi_osi stuff works for me (currently on kernel 4.8), no lockups and the touchpad issues also seem to be gone.
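
Since several workarounds are mentioned in this report, a small Python sketch can show which of them a given kernel command line carries. It only parses text; the function name acpi_workarounds is hypothetical, and it assumes the quoting style used in the comments here (either acpi_osi="Windows 2009" or "acpi_osi=Windows 2009").

```python
import shlex


def acpi_workarounds(cmdline: str) -> dict:
    """Collect acpi_osi values and note whether acpi_rev_override is set."""
    osi_values = []
    rev_override = False
    for token in shlex.split(cmdline):
        key, _, value = token.partition("=")
        if key == "acpi_osi":
            osi_values.append(value)
        elif key == "acpi_rev_override":
            rev_override = True
    return {"acpi_osi": osi_values, "acpi_rev_override": rev_override}


print(acpi_workarounds('quiet acpi_osi=! acpi_osi="Windows 2009"'))
print(acpi_workarounds("quiet acpi_rev_override=5"))
```

On a running system the input would typically come from /proc/cmdline, which makes it easy to state in a report exactly which workaround was active.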
Comment 35 Bruno Pagani 2017-03-01 22:17:07 UTC
Created attachment 255035 [details]
acpidump for HP zBook Studio G3

I’m attaching acpidump for HP zBook Studio G3. Things are likely happening in the SSDT3 table, but can’t understand what is the problem. Maybe something around that PEGS function…
Comment 36 Peter Wu 2017-03-01 23:17:31 UTC
(In reply to Bruno Pagani from comment #35)
> Created attachment 255035 [details]
> acpidump for HP zBook Studio G3
> 
> I’m attaching acpidump for HP zBook Studio G3. Things are likely happening
> in the SSDT3 table, but can’t understand what is the problem. Maybe
> something around that PEGS function…

PEGS only reads from an address (should not have side-effects). The problem is in PGON where the "LKEN" function is somehow problematic and the fallback ACPI code (_OSI="Windows 2009" for your model) avoids the issue.
(BTW, I have another friend with same laptop, that workaround worked for him.)
Comment 37 Bruno Pagani 2017-03-01 23:21:22 UTC
(In reply to Peter Wu from comment #36)
> (In reply to Bruno Pagani from comment #35)
> > Created attachment 255035 [details]
> > acpidump for HP zBook Studio G3
> > 
> > I’m attaching acpidump for HP zBook Studio G3. Things are likely happening
> > in the SSDT3 table, but can’t understand what is the problem. Maybe
> > something around that PEGS function…
> 
> PEGS only reads from an address (should not have side-effects). The problem
> is in PGON where the "LKEN" function is somehow problematic and the fallback
> ACPI code (_OSI="Windows 2009" for your model) avoids the issue.
> (BTW, I have another friend with same laptop, that workaround worked for
> him.)

Thanks Peter!

However, _OSI="Windows 2009" must have downsides, right? Would it only affect the Nvidia card's PCIe port PM, or something like that?
Comment 38 Peter Wu 2017-03-02 16:39:59 UTC
(In reply to Bruno Pagani from comment #37)
> However, _OSI="Windows 2009" must have downsides, right? Would it only
> affect the Nvidia card's PCIe port PM, or something like that?

Maybe hotkeys behave differently, but I did not check this. (The effects are model-specific; effectively you are telling the firmware that the OS is Windows 7, which may not support firmware features for newer Windows versions.)
Comment 39 Giambattista Bloisi 2017-03-12 22:15:30 UTC
Just as an update: on my ASUS N552VW-FI056T, "acpi_rev_override=5" works better than the acpi_osi="!Windows 2015" workaround: the lockup disappears and the backlight function keys work.
Comment 40 Pranav Sharma 2017-03-28 12:01:46 UTC
I tested acpi_rev_override=5 on my Dell Inspiron 7559, and it got stuck while booting. acpi_osi="!Windows 2015" is still needed for this laptop. I am on Ubuntu 16.04, using the nvidia drivers with prime in intel mode.
Comment 41 Bruno Pagani 2017-03-29 00:03:19 UTC
(In reply to Peter Wu from comment #22)
> (In reply to Rafael J. Wysocki from comment #21)
> > And what if we simply avoided using ACPI PM with the affected device on
> > those systems?
> 
> You mean acpi=off? Avoiding runtime pm nouveau would be sufficient but kills
> battery life.

That would indeed be a terrible fix.

> One interesting observation is that turning off the ACPI power
> resource (via PCIe port PM) or system sleep seems not to trigger the issue.
> (Compared to using nouveau.) Maybe I'm dreaming, have to retest this just to
> be sure.

So what I’m observing so far:

– nouveau works (doesn’t cause lockups), thanks to port PM I guess (it doesn’t work anymore with pcie_port_pm=off → provokes a lockup at loading FWIR, can retest if needed). However, the power savings seem to be lower than before, and in particular my fan spins up and down constantly in idle-like situations, which is really annoying.

– bbswitch with pcie_port_pm=off can turn off the card, but any attempt to turn it ON again causes this lockup. However, power savings seem good (but not optimal, since there is no port PM) and there is no fan spinning.

I still need to try using acpi_osi=! acpi_osi="Windows 2009", but this is not normal and probably has downsides (at least this workaround has downsides for other people).

Regarding fixing this: has there been any progress on whether the differences in PCIe Port PM implementation between Windows and Linux could be responsible here?
Comment 42 Remy LABENE 2017-03-29 09:52:24 UTC
Duplicate bug here https://bugzilla.kernel.org/show_bug.cgi?id=194431 with MSI GP72 6QE

The GRUB options acpi_osi=! acpi_osi="Windows 2009" together with the nouveau driver are the best solution at the moment.
Comment 43 Lv Zheng 2017-03-30 02:57:01 UTC
(In reply to Peter Wu from comment #36)
> (In reply to Bruno Pagani from comment #35)
> > Created attachment 255035 [details]
> > acpidump for HP zBook Studio G3
> > 
> > I’m attaching acpidump for HP zBook Studio G3. Things are likely happening
> > in the SSDT3 table, but can’t understand what is the problem. Maybe
> > something around that PEGS function…
> 
> PEGS only reads from an address (should not have side-effects). The problem
> is in PGON where the "LKEN" function is somehow problematic and the fallback
> ACPI code (_OSI="Windows 2009" for your model) avoids the issue.
> (BTW, I have another friend with same laptop, that workaround worked for
> him.)

What's the problem in PGON?
Is it related to another issue you reported:
https://github.com/Bumblebee-Project/bbswitch/issues/142

Thanks
Lv
Comment 44 Peter Wu 2017-03-30 13:03:23 UTC
(In reply to Lv Zheng from comment #43)
> (In reply to Peter Wu from comment #36)
[..]
> > PEGS only reads from an address (should not have side-effects). The problem
> > is in PGON where the "LKEN" function is somehow problematic and the fallback
> > ACPI code (_OSI="Windows 2009" for your model) avoids the issue.
> > (BTW, I have another friend with same laptop, that workaround worked for
> > him.)
> 
> What's the problem in PGON?
> Is it related to another issue you reported:
> https://github.com/Bumblebee-Project/bbswitch/issues/142

It is unrelated to that issue, the namespace lookup seems to work well. Here are related reports:
https://github.com/Bumblebee-Project/Bumblebee/issues/764
https://github.com/Bumblebee-Project/bbswitch/issues/137
https://github.com/Bumblebee-Project/bbswitch/issues/148

Any ideas where to look? If it helps, here is a summary of the executed ASL:
https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt#L171
Comment 45 Px Laurent 2017-04-09 07:19:20 UTC
Created attachment 255791 [details]
acpidump (WS72 with MS-1776 motherboard & Quadro M2000M).
Comment 46 Maik Freudenberg 2017-04-20 15:58:53 UTC
Might be unrelated but on affected machines I more than once saw this in dmesg:
>ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
Coincidence or hint?
Comment 47 Bruno Pagani 2017-04-29 14:01:58 UTC
(In reply to Maik Freudenberg from comment #46)
> Might be unrelated but on affected machines I more than once saw this in
> dmesg:
> >ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
> Coincidence or hint?

Strange. Not sure if it is a coincidence, but I would expect newer machines to support ASPM. I’m seeing that too, but I am currently booting with acpi_osi=! acpi_osi="Windows 2009" and need to check without it. If that changes things, then this issue is really annoying, because it means we’re losing some power savings to get other ones working…
Comment 48 Bruno Pagani 2017-04-29 14:26:56 UTC
So just checked, and I was already having this before. Still not sure what the implications are (here or in general).
Comment 49 Golden G. Richard III 2017-05-03 14:44:39 UTC
In case it's helpful, on an Alienware 13 R3 running kernel 4.10.0-20, "acpi_rev_override=5" doesn't work, but the Windows acpi option fixes the hang.
Comment 50 Alexander 2017-05-17 14:10:45 UTC
Created attachment 256601 [details]
i7-6700HQ CPU. Nvidia GeForce GTX965m. Clevo model N150RF.

Linux Mint 18.1 Serena, 4.11.1-041101-generic (earlier 4.4 kernel also affected, update does nothing). Also tried latest Elementary OS and LXLE with same problem, I thought it was related to MDM, but now I suspect Xorg.

i7-6700HQ CPU. System hangs on lspci, lshw and inxi commands, so when probing hardware I guess. Has problems with freezing both with Nouveau and several versions of proprietary nvidia drivers. Laptop is a Clevo model N150RF with Nvidia GeForce GTX965m. 

dmesg also shows

INFO: task upowerd:1580 blocked for more than 120 seconds.
      Not tainted 4.11.1-041101-generic #201705140931
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
---
iwlwifi 0000:03:00.0: Getting the temperature timed out
---
INFO: task kworker/6:1:73 blocked for more than 120 seconds.
      Not tainted 4.11.1-041101-generic #201705140931
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Comment 51 Alexander 2017-05-17 14:54:31 UTC
using boot flags acpi_osi=! "acpi_osi=Windows 2009"

the following error messages show:

nouveau 0000:01:00.0: bus: MMIO read of 00000000 FAULT at 40912c [ IBUS ]

---

nouveau 0000:01:00.0: DRM: Pointer to flat panel table invalid

---

nouveau 0000:01:00.0: priv: GPC0: 419df4 00000000 (1940820e)
Comment 52 Peter Wu 2017-05-17 16:01:02 UTC
(In reply to Alexander from comment #51)
> nouveau 0000:01:00.0: DRM: Pointer to flat panel table invalid

Not a problem, it can safely be ignored. Xorg/lspci/lshw/etc. are the victims; the freezes are caused by a bad interaction between PCI or ACPI and the platform firmware. A proper solution and the exact root cause are still unknown.
Comment 53 Yaohan Chen 2017-05-23 22:57:51 UTC
Created attachment 256687 [details]
acpidump for Samsung Notebook Spin 7
Comment 54 Yaohan Chen 2017-05-23 23:02:02 UTC
On a Samsung Notebook Spin 7, I am getting a black screen when resuming from suspend, with Xorg using 100% CPU and the keyboard/mouse not responding. Setting the acpi_osi parameter did not help. I was told on the NVIDIA forum that it has to do with this bug, so I've attached an acpidump.

https://devtalk.nvidia.com/default/topic/1009973/linux/black-screen-and-keyboard-freeze-after-resume-from-suspend-geforce-940mx-nvidia-381-22-linux-4-11-2-/post/5152837/#5152837
Comment 55 Robert Brock 2017-05-26 10:44:05 UTC
Created attachment 256725 [details]
MSI GP62 7RD acpidump

I'm also seeing this on an MSI GP62 7RD. The parameters 'acpi_osi=! "acpi_osi=Windows 2009"' sort the issue for me, with no ill effect on hotkeys, trackpad, backlight, etc.
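Since the workaround is quoted in two spellings in this thread (acpi_osi="Windows 2009" and "acpi_osi=Windows 2009"), here is a minimal Python sketch for checking whether a given kernel command line carries it. The helper names are made up for illustration, and shlex only approximates the kernel's own quote handling, so treat the result as a hint rather than proof.

```python
import shlex
from pathlib import Path


def acpi_osi_args(cmdline):
    """Return the acpi_osi= values found on a kernel command line, in order.

    shlex is only an approximation of the kernel's parser (assumption), but it
    handles both quoting styles seen in this thread.
    """
    values = []
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        if key == "acpi_osi" and sep:
            values.append(value)
    return values


def has_win2009_workaround(cmdline):
    """True if both 'acpi_osi=!' and 'acpi_osi=Windows 2009' are present."""
    values = acpi_osi_args(cmdline)
    return "!" in values and "Windows 2009" in values


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        current = proc_cmdline.read_text().strip()
        print("workaround active:", has_win2009_workaround(current))
```

Reading /proc/cmdline shows what the running kernel was actually booted with, which helps when a dmesg is attached without saying which options were in effect.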
Comment 56 Juan Cuzmar 2017-06-16 15:55:35 UTC
I have an Asus GL553VD and to fix this issue I had to edit the DSDT.
I finally created a patch to override the tables and added the file to the GRUB configuration.
More information here: https://askubuntu.com/a/923216/680254

So it's the typical DSDT
Comment 57 Juan Cuzmar 2017-06-16 15:58:58 UTC
So it's the typical DSDT malfunction on Linux
(I'm sorry, I hit tab+enter before finishing my post)
Comment 58 Remy LABENE 2017-06-17 11:12:01 UTC
Created attachment 257047 [details]
Dmesg on MSI GP72 6QE after Juan mofication

I tried the modification proposed by Juan on my MSI GP72 6QE. The modified table is the SSDT6. I removed the GRUB options acpi_osi=! acpi_osi="Windows 2009"; the system is stable despite some boot errors.
Comment 59 Peter Wu 2017-06-17 11:25:56 UTC
Note that DSDT/SSDT modifications and the acpi_osi options are only *workarounds* specific to a model. They do not solve the root cause.
Comment 60 Px Laurent 2017-06-17 13:08:58 UTC
I tried the modification proposed by Juan on an MSI WS72 laptop with an MS-1776 motherboard and a Quadro. The modified table is also the SSDT6, as for Remy LABENE (MSI GP72 6QE, 2017-06-17 11:12:01 UTC), but that does not solve the boot error.

What is your motherboard model, Remy LABENE? I can't find it on Google.
Comment 61 Remy LABENE 2017-06-17 14:15:44 UTC
Created attachment 257049 [details]
dmidecode Laptop MSI GP72 6QE

See the dmidecode file for the model (MS-1795).
Comment 62 Remy LABENE 2017-06-17 15:53:02 UTC
I upgraded to the latest BIOS (E1795IMS.119) and booted on battery only. Some errors are present (see below), but the system doesn't freeze with the "nouveau" driver and 3D use.
....
[0.717662] ACPI Error: No handler for Region [EC__] (ffff8917360f0120) [EmbeddedControl] (20160930/evregion-166)
[    0.717729] ACPI Error: Region EmbeddedControl (ID=3) has no handler (20160930/exfldio-299)
[    0.717790] ACPI Error: Method parse/execution failed [\_SB.PCI0.LPCB.EC._REG] (Node ffff8917360f1370), AE_NOT_EXIST (20160930/psparse-543)
[    0.719570] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
....
nouveau 0000:01:00.0: DRM: VRAM: 2048 MiB
[    5.829676] nouveau 0000:01:00.0: DRM: GART: 1048576 MiB
[    5.829679] nouveau 0000:01:00.0: DRM: Pointer to TMDS table invalid
[    5.829679] nouveau 0000:01:00.0: DRM: DCB version 4.0
[    5.829680] nouveau 0000:01:00.0: DRM: Pointer to flat panel table invalid
Comment 63 Juan Cuzmar 2017-06-19 15:26:31 UTC
(In reply to Peter Wu from comment #59)
> Note that DSDT/SSDT modifications and the acpi_osi options are only
> *workarounds* specific for a model. It does not solve the root cause.

Yeah, the problem is that the latest motherboards have a new DSDT, which causes the firmware to not work correctly with Linux.
Comment 64 Px Laurent 2017-06-21 05:33:21 UTC
Yes, I also have a sound problem that I suspect is also related to ACPI/DSDT (power management, crackling sound).
Comment 65 taijian 2017-07-19 23:44:58 UTC
I have another laptop that is affected by this same problem, even though I do not have an NVIDIA dGPU. My system is an Alienware 15R3 i7-7700HQ/RX470.

Observed problems from this thread:
+ ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
+ ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
+ trying to wake the dGPU after a 'hard' suspend will crash the graphical session

Also, amdgpu pm does not work and the dGPU does not auto-suspend. If I force-suspend it via acpi_call, then the graphical session will crash upon trying to re-wake it, as described in this thread.
Comment 66 taijian 2017-07-19 23:48:07 UTC
Created attachment 257617 [details]
acpidump Alienware 15R3
Comment 67 taijian 2017-07-19 23:48:44 UTC
Created attachment 257619 [details]
dmidecode Alienware 15R3
Comment 68 Remy LABENE 2017-10-31 21:16:02 UTC
The new debian 9.1 kernel adds the "irqbalance" package. The system seems more stable.
Comment 69 Juan Cuzmar 2017-10-31 21:21:01 UTC
(In reply to Remy LABENE from comment #68)
> The new debian 9.1 kernel adds the "irqbalance" package. The system seems
> more stable.

Can you please run uname -arm?
Comment 70 Remy LABENE 2017-10-31 22:42:41 UTC
Linux PRT-MSIGP72 4.9.0-4-amd64 #1 SMP Debian 4.9.51-1 (2017-09-28) x86_64 GNU/Linux
Comment 71 Remy LABENE 2017-10-31 22:48:22 UTC
Created attachment 260453 [details]
dmesg with irqbalance deamon
Comment 72 Bruno Pagani 2017-10-31 23:06:43 UTC
(In reply to Remy LABENE from comment #71)
> Created attachment 260453 [details]
> dmesg with irqbalance deamon

This is a boot with `acpi_osi=! "acpi_osi=Windows 2009" acpi_rev_override=5`, so not sure what you are trying to say.
Comment 73 Remy LABENE 2017-10-31 23:22:58 UTC
Yes, the system is unstable without `acpi_osi=! "acpi_osi=Windows 2009" acpi_rev_override=5`, but now bbswitch and the nvidia driver work perfectly, as does the webcam.
Comment 74 Zack Weinberg 2017-11-01 21:17:51 UTC
Created attachment 260457 [details]
dmesg with MS-16K2-based laptop (acpi_osi overrides in effect)

I'm also experiencing this problem with kernel 4.13 (Debian unstable) on a shiny new laptop made by ZaReason, based on the MSI MS-16K2 motherboard (if the BIOS is to be believed, anyway, which I'm not 100% sure of -- you'll see why when you look at the dmidecode output).  `acpi_osi=! acpi_osi="Windows 2009"` successfully works around the problem, `acpi_rev_override=5` doesn't.

Without the workaround, logging in graphically and `lspci` after the Nvidia chip is suspended will trigger the syndrome.  The NMI watchdog reports a hang reading I/O registers, from inside the Nouveau power management code.

[   26.252604] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   26.312496] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   26.312499] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   26.312501] nouveau 0000:01:00.0: DRM: resuming object tree...
[   47.206122] INFO: rcu_sched self-detected stall on CPU
[   47.206127]   7-...: (5249 ticks this GP) idle=9f6/140000000000001/0 softirq=3582/3582 fqs=2624 
[   47.206127]    (t=5250 jiffies g=497 c=496 q=1129)
[   47.206129] NMI backtrace for cpu 7
[   47.206130] CPU: 7 PID: 611 Comm: systemd-logind Not tainted 4.13.0-1-amd64 #1 Debian 4.13.10-1
[   47.206154] Hardware name: White Brand Company White Brand Product/MS-16K2, BIOS E16K2ID6.311 05/12/2017
[   47.206155] Call Trace:
[   47.206156]  <IRQ>
[   47.206159]  ? dump_stack+0x5c/0x85
[   47.206160]  ? nmi_cpu_backtrace+0xbf/0xd0
[   47.206161]  ? irq_force_complete_move+0x140/0x140
[   47.206162]  ? nmi_trigger_cpumask_backtrace+0xf4/0x120
[   47.206164]  ? rcu_dump_cpu_stacks+0x9c/0xd5
[   47.206165]  ? rcu_check_callbacks+0x7a9/0x8f0
[   47.206166]  ? update_wall_time+0x45d/0x720
[   47.206168]  ? tick_sched_do_timer+0x40/0x40
[   47.206169]  ? update_process_times+0x28/0x50
[   47.206169]  ? tick_sched_handle+0x23/0x60
[   47.206170]  ? tick_sched_timer+0x34/0x70
[   47.206171]  ? __hrtimer_run_queues+0xdc/0x220
[   47.206173]  ? hrtimer_interrupt+0xa6/0x1f0
[   47.206174]  ? smp_apic_timer_interrupt+0x34/0x50
[   47.206175]  ? apic_timer_interrupt+0x82/0x90
[   47.206175]  </IRQ>
[   47.206177]  ? ioread32+0x2b/0x30
[   47.206224]  ? nv04_timer_read+0x42/0x60 [nouveau]
[   47.206241]  ? nvkm_pmu_reset+0x67/0x160 [nouveau]
[   47.206250]  ? nvkm_subdev_preinit+0x2f/0x100 [nouveau]
[   47.206267]  ? nvkm_device_init+0x5d/0x260 [nouveau]
[   47.206282]  ? nvkm_udevice_init+0x41/0x60 [nouveau]
[   47.206291]  ? nvkm_object_init+0x3b/0x180 [nouveau]
[   47.206300]  ? nvkm_object_init+0xab/0x180 [nouveau]
[   47.206309]  ? nvkm_object_init+0xab/0x180 [nouveau]
[   47.206325]  ? nouveau_do_resume+0x3b/0xe0 [nouveau]
[   47.206341]  ? nouveau_pmops_runtime_resume+0x89/0x170 [nouveau]
[   47.206342]  ? pci_restore_standard_config+0x40/0x40
[   47.206343]  ? pci_pm_runtime_resume+0x73/0xa0
[   47.206345]  ? __rpm_callback+0xc1/0x1f0
[   47.206346]  ? pci_restore_standard_config+0x40/0x40
[   47.206347]  ? rpm_callback+0x1f/0x70
[   47.206348]  ? pci_restore_standard_config+0x40/0x40
[   47.206349]  ? rpm_resume+0x4af/0x6c0
[   47.206350]  ? evdev_ioctl_handler+0x72/0xb60 [evdev]
[   47.206352]  ? __pm_runtime_resume+0x47/0x70
[   47.206366]  ? nouveau_drm_ioctl+0x35/0xc0 [nouveau]
[   47.206368]  ? do_vfs_ioctl+0x9f/0x600
[   47.206369]  ? syscall_trace_enter+0x11a/0x2c0
[   47.206370]  ? SyS_ioctl+0x74/0x80
[   47.206371]  ? do_syscall_64+0x7c/0xf0
[   47.206372]  ? entry_SYSCALL64_slow_path+0x25/0x25
[   72.348354] watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [systemd-logind:611]
[   72.348356] Modules linked in: ctr ccm rfcomm bnep binfmt_misc nls_ascii nls_cp437 vfat fat snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek snd_hda_codec_generic arc4 coretemp kvm_intel msi_wmi sparse_keymap iwlmvm kvm irqbypass rtsx_usb_ms intel_cstate memstick mac80211 intel_uncore intel_rapl_perf joydev evdev nouveau i915 iwlwifi snd_hda_intel efi_pstore serio_raw snd_hda_codec pcspkr snd_hda_core mxm_wmi snd_hwdep uvcvideo efivars ttm videobuf2_vmalloc videobuf2_memops cfg80211 btusb videobuf2_v4l2 snd_pcm hci_uart drm_kms_helper btrtl videobuf2_core btbcm btqca btintel snd_timer drm videodev mei_me snd iTCO_wdt media sg iTCO_vendor_support i2c_algo_bit soundcore shpchp mei intel_pch_thermal bluetooth ac battery drbg ansi_cprng wmi video ecdh_generic
[   72.348404]  rfkill intel_lpss_acpi intel_lpss tpm_crb acpi_pad acpi_als kfifo_buf button industrialio parport_pc ppdev lp parport efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb algif_skcipher af_alg dm_crypt dm_mod hid_generic usbhid sd_mod rtsx_usb_sdmmc mmc_core rtsx_usb mfd_core crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd psmouse ahci libahci xhci_pci i2c_i801 libata xhci_hcd nvme alx mdio nvme_core scsi_mod usbcore usb_common fan thermal i2c_hid hid
[   72.348435] CPU: 7 PID: 611 Comm: systemd-logind Not tainted 4.13.0-1-amd64 #1 Debian 4.13.10-1
[   72.348436] Hardware name: White Brand Company White Brand Product/MS-16K2, BIOS E16K2ID6.311 05/12/2017
[   72.348436] task: ffff8ccbfb333040 task.stack: ffff9ab2021b8000
[   72.348439] RIP: 0010:ioread32+0x2b/0x30
[   72.348439] RSP: 0018:ffff9ab2021bbb80 EFLAGS: 00000296 ORIG_RAX: ffffffffffffff10
[   72.348440] RAX: 00000000ffffffff RBX: ffff8ccbfa67f400 RCX: 0000000000000018
[   72.348441] RDX: ffff9ab20510a014 RSI: ffff9ab20510a014 RDI: ffff9ab205009410
[   72.348441] RBP: 00000000ffffffff R08: 0000000000000002 R09: ffff9ab2021bbb84
[   72.348441] R10: 0000000000000000 R11: 00000000000003d1 R12: 00000000ffffffff
[   72.348442] R13: ffffffffffffffff R14: ffff8ccbfa35bf00 R15: ffff9ab2021bbde0
[   72.348442] FS:  00007fe0536b2a00(0000) GS:ffff8ccc0edc0000(0000) knlGS:0000000000000000
[   72.348443] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   72.348443] CR2: 00007fb9eb2af1a0 CR3: 0000000475345000 CR4: 00000000003406e0
[   72.348444] Call Trace:
[   72.348465]  ? nv04_timer_read+0x42/0x60 [nouveau]
[   72.348481]  ? nvkm_pmu_reset+0x67/0x160 [nouveau]
[   72.348491]  ? nvkm_subdev_preinit+0x2f/0x100 [nouveau]
[   72.348506]  ? nvkm_device_init+0x5d/0x260 [nouveau]
[   72.348521]  ? nvkm_udevice_init+0x41/0x60 [nouveau]
[   72.348530]  ? nvkm_object_init+0x3b/0x180 [nouveau]
[   72.348538]  ? nvkm_object_init+0xab/0x180 [nouveau]
[   72.348547]  ? nvkm_object_init+0xab/0x180 [nouveau]
[   72.348562]  ? nouveau_do_resume+0x3b/0xe0 [nouveau]
[   72.348577]  ? nouveau_pmops_runtime_resume+0x89/0x170 [nouveau]
[   72.348578]  ? pci_restore_standard_config+0x40/0x40
[   72.348579]  ? pci_pm_runtime_resume+0x73/0xa0
[   72.348581]  ? __rpm_callback+0xc1/0x1f0
[   72.348581]  ? pci_restore_standard_config+0x40/0x40
[   72.348582]  ? rpm_callback+0x1f/0x70
[   72.348583]  ? pci_restore_standard_config+0x40/0x40
[   72.348584]  ? rpm_resume+0x4af/0x6c0
[   72.348586]  ? evdev_ioctl_handler+0x72/0xb60 [evdev]
[   72.348587]  ? __pm_runtime_resume+0x47/0x70
[   72.348600]  ? nouveau_drm_ioctl+0x35/0xc0 [nouveau]
[   72.348602]  ? do_vfs_ioctl+0x9f/0x600
[   72.348603]  ? syscall_trace_enter+0x11a/0x2c0
[   72.348604]  ? SyS_ioctl+0x74/0x80
[   72.348605]  ? do_syscall_64+0x7c/0xf0
[   72.348606]  ? entry_SYSCALL64_slow_path+0x25/0x25
[   72.348607] Code: 48 81 ff ff ff 03 00 77 20 48 81 ff 00 00 01 00 76 05 0f b7 d7 ed c3 48 c7 c6 b0 b9 63 83 e8 2d ff ff ff b8 ff ff ff ff c3 8b 07 <c3> 0f 1f 40 00 48 81 fe ff ff 03 00 48 89 f2 77 1f 48 81 fe 00 


Attached: dmesg (with acpi_osi overrides in effect).  Will shortly attach dmidecode and acpidump output.
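For anyone who wants to check whether the GPU is runtime-suspended before running something like lspci, here is a small Python sketch that reads the runtime PM attributes from sysfs. The device path 0000:01:00.0 is taken from the log above and will differ on other machines; the function names are hypothetical. As far as I know, reading these two attributes does not touch the device itself, unlike reading its config space, but that is an assumption worth verifying on an affected machine.

```python
from pathlib import Path


def read_runtime_pm(device_dir):
    """Read power/control and power/runtime_status for one PCI device.

    Missing attributes are reported as None instead of raising.
    """
    power = Path(device_dir) / "power"
    state = {}
    for name in ("control", "runtime_status"):
        attr = power / name
        state[name] = attr.read_text().strip() if attr.exists() else None
    return state


def may_wake_on_access(state):
    """True if the device is suspended under automatic runtime PM.

    In that state, a config-space access (e.g. from lspci) makes the kernel
    resume the device first, which is the step that hangs in this report.
    """
    return state.get("control") == "auto" and state.get("runtime_status") == "suspended"


if __name__ == "__main__":
    # Path is an assumption based on the log above.
    gpu = "/sys/bus/pci/devices/0000:01:00.0"
    status = read_runtime_pm(gpu)
    print(status, "-> access may trigger a resume:", may_wake_on_access(status))
```

Writing "on" to power/control keeps the device powered and is how the runtime resume in the first attachment of this bug was triggered.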
Comment 75 Zack Weinberg 2017-11-01 21:19:20 UTC
Created attachment 260459 [details]
dmidecode with MS-16K2-based laptop (acpi_osi overrides in effect)

dmidecode output for the same laptop described in previous comment.  Somebody forgot to fill in most of the OEM fields.
Comment 76 Zack Weinberg 2017-11-01 21:19:50 UTC
Created attachment 260461 [details]
acpidump with MS-16K2-based laptop (acpi_osi overrides in effect)
Comment 77 Etienne URBAH 2017-12-12 16:35:26 UTC
Created attachment 261127 [details]
acpidump for MSI GE62 7RE-210FR

Kernel option 'acpi_osi=! acpi_osi="Windows 2009"' permits 'lspci' to succeed, and 'nouveau' to successfully manage an external display with resolution 3840 x 2160 at 60Hz through DisplayPort.
Comment 78 Rafael Delboni 2017-12-24 14:04:18 UTC
Same issue is present on Gigabyte Aero 15X (Core i7-7700HQ and NVIDIA GeForce GTX 1070 MaxQ) with kernel 4.10.0-42-generic.

Adding acpi_osi=! acpi_osi="Windows 2009" to the GRUB configuration worked around the problem; no major functionality is lost, except for the screen dimming shortcuts.
Comment 79 Bruno Pagani 2018-01-29 12:50:05 UTC
Current status is NEEDINFO, question is “from who”? If there is anything we could provide to help debugging, please tell us.
Comment 80 Isaac A. 2018-02-07 03:09:28 UTC
I can confirm that I'm having this issue with my MSI GL72 6QD (i5 6300HQ, GTX 950m), and can confirm that putting acpi_osi=! acpi_osi="Windows 2009" in grub did fix the issue.

I'd be happy to provide debug logs if needed.
Comment 81 Etienne URBAH 2018-03-17 02:54:55 UTC
With Linux kernel 4.16.0-041600rc5 from http://kernel.ubuntu.com/~kernel-ppa/mainline :
-  'lspci' still fails to answer, and makes the whole machine freeze after some time.
-  Kernel option 'acpi_osi=! acpi_osi="Windows 2009"' is still a good workaround.
Comment 82 Maik Freudenberg 2018-03-17 03:45:45 UTC
(In reply to Bruno Pagani from comment #79)
> Current status is NEEDINFO, question is “from who”? If there is anything we
> could provide to help debugging, please tell us.
Maybe time to sum up the info we have
- nvidia gpu fails to power on
- on runtime resume
? also on system suspend/resume?
? due to a loop in acpi method, does overriding that alone help?
- waiting for something to come alive
little info, some questions.
"from whom" - anyone that can provide.
Comment 83 Maik Freudenberg 2018-03-17 03:52:18 UTC
...and maybe try this first
https://bugs.acpica.org/show_bug.cgi?id=1333#c86
to rule out any side effects.
Comment 84 Ivan Shapovalov 2018-03-27 16:37:44 UTC
(In reply to Maik Freudenberg from comment #83)
> ...and maybe try this first
> https://bugs.acpica.org/show_bug.cgi?id=1333#c86
> first to rule out any side effects.

As expected, that did not help.

BTW, T540p (GT730M) here, similar lockups happen in 50% of suspend/resume cycles and sometimes just before power-down/reboot. Surprisingly though, never during actual work.
Comment 85 Etienne URBAH 2018-04-18 18:43:44 UTC
With Linux kernel 4.17.0-041700rc1 from http://kernel.ubuntu.com/~kernel-ppa/mainline, systematically :

-  Inside a Linux console, 'lspci' fails to answer, and makes the whole machine immediately freeze.

-  Graphical login fails, and makes the whole machine immediately freeze.

-  Kernel option 'acpi_osi=! acpi_osi="Windows 2009"' is still a good workaround.
Comment 86 Iseulde 2018-04-29 00:55:19 UTC
I hope this bug will get fixed soon enough. Tried everything on my Alienware 15 R3 (NVidia 1060), nothing helps. Tried also the workaround suggested above (Kernel option 'acpi_osi=! acpi_osi="Windows 2009"'), but it won't fix the black screen after suspend.
Comment 87 colin.wu 2018-07-04 15:41:08 UTC
Got the same issue on a ThinkPad T440 with an NVIDIA GF117.
Comment 88 colin.wu 2018-07-04 15:54:06 UTC
Created attachment 277165 [details]
dmesg from thinkpad t440 laptop
Comment 89 Robert Bowman 2018-07-24 14:33:32 UTC
I get the following error trying to install Arch linux on Alienware 15

MMIO read of 00000000 FAULT at 022554 [ IBUS ]

This halts the install process and leaves me in a command shell, but I have experienced no issues running Ubuntu 18.04 with its GUI.
Comment 90 Bruno Pagani 2018-07-25 10:04:24 UTC
@Robert Bowman: Well, a command shell is the standard Arch Linux installation method, so not sure whether you actually have an issue.

Regarding the error message you are reporting, there is already a bug report for it: https://bugs.freedesktop.org/show_bug.cgi?id=100423
Comment 91 Victor 2018-07-26 11:15:10 UTC
Got the same issue on my HP ZBook Studio G5 x360 with a Nvidia Quadro P1000 (Mobile) GPU using the proprietary nvidia drivers on kernel 4.17.9-1-ARCH, the workaround of setting kernel options to 'acpi_osi=! acpi_osi="Windows 2009"' works.
Comment 92 Maik Freudenberg 2018-09-03 12:05:22 UTC
Since this issue seems to be getting bigger (more recent notebooks by ASUS, Dell/Alienware and Toshiba seem affected), I compared those ACPI tables with the unaffected one from my notebook and noticed this in the _ON method of the PEGP scope:
            TREN = One
            LNKD = Zero
            While (LNKS < 0x07)
            {

If I counted the bits correctly, this relates to the root port to which the NVIDIA GPU is connected, with
TREN=link retrain bit
LNKD=link disable bit
LNKS=link width
So it sets the link to retrain, enables it and then waits for it to reach x8 width. On affected machines, I couldn't find anything that sets the retrain bit. Is that what changed in Windows 10: it always retrains, so the ACPI code depends on that?
Maybe someone could check by patching the kernel to set the bit in the resume iter of the pci driver?
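To make the bit names above easier to check against a register dump, here is a short Python sketch that decodes the standard PCI Express Link Control and Link Status registers. The bit positions are the ones defined for the PCI Express capability; whether the ASL fields TREN, LNKD and LNKS really map onto them on a given machine is only the reading given in this comment, so that mapping is an assumption.

```python
# Link Control register bits (PCI Express capability, offset 0x10)
LNKCTL_LINK_DISABLE = 1 << 4   # what the comment calls LNKD (assumed mapping)
LNKCTL_RETRAIN_LINK = 1 << 5   # what the comment calls TREN (assumed mapping)

# Link Status register fields (PCI Express capability, offset 0x12)
LNKSTA_SPEED_MASK = 0x000F     # Current Link Speed, bits 3:0
LNKSTA_WIDTH_MASK = 0x03F0     # Negotiated Link Width, bits 9:4
LNKSTA_WIDTH_SHIFT = 4
LNKSTA_TRAINING = 1 << 11      # Link Training in progress


def decode_link_control(lnkctl):
    """Return the two Link Control bits discussed in this comment."""
    return {
        "link_disable": bool(lnkctl & LNKCTL_LINK_DISABLE),
        "retrain_link": bool(lnkctl & LNKCTL_RETRAIN_LINK),
    }


def decode_link_status(lnksta):
    """Split a 16-bit Link Status value into speed code, width and training flag."""
    return {
        "speed_code": lnksta & LNKSTA_SPEED_MASK,
        "width": (lnksta & LNKSTA_WIDTH_MASK) >> LNKSTA_WIDTH_SHIFT,
        "training": bool(lnksta & LNKSTA_TRAINING),
    }


if __name__ == "__main__":
    # 0x1083 is a made-up sample value: speed code 3, width x8, not training.
    print(decode_link_status(0x1083))
    print(decode_link_control(0x0020))
```

The sample values are invented for illustration; real ones would come from the LnkCtl/LnkSta lines of lspci -vv on the root port.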
Comment 93 Karol Herbst 2018-09-03 12:43:57 UTC
(In reply to Maik Freudenberg from comment #92)
> Since this issue seems to get bigger, more recent notebooks by ASUS,
> Dell/Alienware, Toshiba seem affected, I compared those acpi tables with the
> unaffected one from my notebook and noticed this in the _ON method of the
> PEGP scope:
>             TREN = One
>             LNKD = Zero
>             While (LNKS < 0x07)
>             {
> 
> If I counted the bits correctly, this is related to the root port to which
> the nvidia gpu is connected with
> TREN=link retrain bit
> LNKD=link disable bit
> LNKS=link width
> So it's setting the link to retrain, enables it and then waits for it to
> reach x8 width. On affected machines, I couldn't find it to set the retrain
> bit, is that what changed in Windows 10, always retraining so the acpi
> depends on that?
> Maybe someone could check by patching the kernel to set the bit in the
> resume iter of the pci driver?

Okay, interesting, but I don't think this is actually causing the issues. While digging around in some nouveau code, I pinpointed the devinit scripts we run to initialize those GPUs as the first action after which the runtime suspend+resume issues occur.

On my laptop the GPU simply doesn't respond to any PCIe request made. Anyway, what those scripts do is to lower the PCIe link speed from 8.0 (what the GPU boots into) to 2.5 as one of the first actions.

Not doing that or setting the link to 8.0 "fixes" the issues for me:
https://github.com/karolherbst/linux/commit/08936d832bb3505d9431912d8be03796d71f55b1.patch

I would be very interested to know for how many other laptops this patch helps as well. I noticed though that if I run the secboot bits with this patch, resuming can still fail afterwards, but I didn't look deeper on why that happens, as this is quite problematic to pinpoint.

But in the end, I think the root cause of those issues is that the host and the GPU simply disagree about the state of the PCIe link, and then things just randomly fail.
Comment 94 Karol Herbst 2018-09-03 12:48:07 UTC
Created attachment 278269 [details]
pcie link issue workaround

better to attach the patch
Comment 95 Karol Herbst 2018-09-03 12:50:54 UTC
Another note: when the resume fails, the ACPI code wasn't able to read out the GPU's state either, so most likely all values are garbage and simply contain -1, which would mean that LNKS will never be below 0x07 anyway.

So everything I was reading out via ACPI returned 0xffff, but maybe that's just the case for me (Dell XPS 9560).
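The point about all-ones reads can be illustrated with a toy Python model of the `While (LNKS < 0x07)` loop. This is only a sketch: the 4-bit field width and the function names are assumptions, and the iteration bound exists only so that the model terminates, which the firmware loop does not guarantee.

```python
def is_all_ones(value, width_bits):
    """True if every bit of a width_bits-wide read is set (e.g. 0xFFFF for 16 bits)."""
    return value == (1 << width_bits) - 1


def poll_until_at_least(read_field, threshold=0x07, max_polls=1000):
    """Model `While (field < threshold)` with an upper bound on iterations.

    Returns the number of reads that were below the threshold before the loop
    exited, or None if the bound was hit (the AML loop would keep spinning).
    """
    for polls in range(max_polls):
        if read_field() >= threshold:
            return polls
    return None


if __name__ == "__main__":
    # A field that reads back as all ones satisfies the exit condition at once,
    # even though the link was never confirmed to be up.
    print(poll_until_at_least(lambda: 0xF))
    # A field stuck at zero never satisfies it.
    print(poll_until_at_least(lambda: 0x0, max_polls=10))
```

So garbage reads would make the wait loop fall through immediately, while a field stuck below the threshold would match the AML_INFINITE_LOOP from the bug title.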
Comment 96 Maik Freudenberg 2018-09-03 13:27:47 UTC
If I understand correctly, that patch only applies to Pascal, any reason to leave Maxwell out?
In my case LNKS/TREN/LNKD worked on the config space of the upstream port, namely
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x8 Controller (rev 06)
and not the GPU.
Comment 97 Karol Herbst 2018-09-03 13:39:10 UTC
(In reply to Maik Freudenberg from comment #96)
> If I understand correctly, that patch only applies to Pascal, any reason to
> leave Maxwell out?

it also affects maxwell, but we had the PCIe stuff already wired up there, just not for Pascal.
The more important part of that patch is the nvkm_pcie_fini function which gets called before runtime suspending the GPU.

> In my case LNKS/TREN/LNKD worked on the config space of the upstream port,
> namely
> 00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor
> PCI Express x8 Controller (rev 06)
> and not the GPU.

oh, I see. But in any case, if the upstream port <-> GPU communication is broken, we can't rely on anything. And this was the situation I noticed on my machine.

I am mainly interested in whether setting the PCIe link to the "default" state from the GPU's perspective is a working workaround, and my hope is that this will also give us better insight into what the real cause is.
Comment 98 Maik Freudenberg 2018-09-03 13:54:11 UTC
Which kernel versions can this patch be applied on?
Comment 99 Karol Herbst 2018-09-03 14:03:08 UTC
(In reply to Maik Freudenberg from comment #98)
> Which kernel versions can this patch be applied on?

I think all of the most recent ones should just work? The branch the patch is from was 4.17 based, but it may also apply on older/newer kernels as well.
Comment 100 Josh Farwell 2018-09-20 06:00:48 UTC
Karol, I tried your kernel patch and the results were very promising. No more kernel lockups!!

My computer is a Gigabyte Aero 15X v8 with an i7-8750H and a GTX 1070. I am running Fedora 28 with kernel version 4.18.

I have been experiencing the hard kernel lockups when nouveau is loaded and the GPU has entered D3 power state. Running `lspci` or trying to suspend the machine locks it up, as do other programs such as the Power Manager in GNOME or the Steam client. Trying to unload the nouveau driver once it has been loaded also results in a kernel lockup. I can use the acpi_osi="Windows 2009" workaround but then the nouveau driver seems to never put the card into the low power state.

My use case is infrequent CUDA and gaming, so my desire is to use the proprietary drivers when I need them and turn the card off when I don't. I am trying to use nouveau as a workaround to power off the card, as the older methods (bbswitch) also give me kernel lockups. I am using the current draw from tlp-stat to figure out when the card is on or off. Luckily, it draws almost an amp(!) so it's easy to tell.

With the PCIe link speed patch applied to nouveau, the kernel lockup issues disappear under certain conditions. If I load nouveau during boot and run X on the Intel card, the card never turns off when it isn't in use, and xorg-x11-drv-nouveau reports a crash after a while. However, if I load nouveau *after* X has started up, it does power down the card and seems to be stable.

Suspend and resume works. Unloading the nouveau module works. Running lspci works.

I am getting some interesting results when I run lspci. Instead of a hard kernel lockup, the lspci stops and "thinks" for a moment corresponding with an increase in current draw. This is indicating to me that the card is turning back on when something tries to get a response from it. After some seconds, nouveau will turn the card off again.

I can dynamically load and unload both the nvidia module and nouveau, which makes this a suitable workaround for me. I am curious if setting the link speed to 8.0 would make bbswitch work, and may try it as an experiment.
Comment 101 Mark Morris 2018-09-20 21:44:38 UTC
Josh, I have also applied the "PCIE link speed patch" on the exact same Gigabyte hardware.  I patched kernel version 4.19 RC4.  I allow the Nouveau module to load normally on boot and it effectively blocks the NVIDIA driver load.  The journalctl dump seems to confirm all of this is prior to X starting, but I do not have the subsequent crash that you noted.  I have been able to remove the acpi_osi=! and acpi_osi="Windows 2009" options and do not have any repeatable lock-ups associated with lspci.  I have observed the same behavior where it does pause briefly when lspci or inxi -G is executed as if waiting on a response from the NVIDIA device.

As for power usage, I find it difficult to confirm how much power the NVIDIA hardware is drawing and whether it is ever fully off, but I can confirm that I see 30-40 W total draw with the NVIDIA driver in control and only 13-20 W with nouveau.  In both cases I am just running powertop and have allowed the post boot activities to quiet down so there is no significant load on the system.  I don't do anything special with when I load nouveau. I was also unable to get bbswitch to work and have removed it and moved on.

Karol, you seemed to imply this patch was more of a tool to investigate as opposed to a fix, so if there is anything I can do to help with further diagnostics I will be happy to do so. I don't really understand this code so I will need specifics.  I also am not seeing any adverse behavior during suspend/resume or normal shutdown.
Comment 102 Maik Freudenberg 2018-09-26 08:38:40 UTC
While investigating, I found an anomaly in the form of an Alienware 13 R2. This device also fails to properly initialize the nvidia device after resume. The gpu is connected to a pci bridge
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #1 (rev f1) (prog-if 00 [Normal decode])

This was limited by the manufacturer to gen2 (5GT/s) speed:
LnkCap: Port #1, Speed 5GT/s, Width x4, ASPM not supported

Thus, this device can never reach 8GT/s so the patch doesn't have any effect.
The only noticeable difference on the gpu is:
01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2)
after boot:
LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
after resume:
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
The bridge stays the same, though, at LinkCap 5GT/s

Yet, the acpi still contains
While (LNKS < 0x07)
    {
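To compare the LnkCap lines before and after resume without eyeballing them, here is a small Python sketch that parses them. The regular expression is written against the exact lspci output quoted above and is an assumption for other lspci versions; the function names are made up.

```python
import re

# Matches e.g. "LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, ..."
LNKCAP_RE = re.compile(
    r"LnkCap:\s*Port #(?P<port>\d+), Speed (?P<speed>[\d.]+)GT/s, Width x(?P<width>\d+)"
)


def parse_lnkcap(line):
    """Extract port, speed (GT/s) and width from one LnkCap line, or None."""
    m = LNKCAP_RE.search(line)
    if m is None:
        return None
    return {
        "port": int(m.group("port")),
        "speed_gts": float(m.group("speed")),
        "width": int(m.group("width")),
    }


def lnkcap_changed(before_line, after_line):
    """Return the fields whose values differ between two LnkCap lines."""
    before, after = parse_lnkcap(before_line), parse_lnkcap(after_line)
    if before is None or after is None:
        return None
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}


if __name__ == "__main__":
    boot = "LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us"
    resume = "LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us"
    print(lnkcap_changed(boot, resume))
```

For the two GPU lines above, this reports only the speed as changed (5 to 8 GT/s), while the bridge line stays at 5 GT/s, x4.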
Comment 103 Maik Freudenberg 2018-09-26 08:53:34 UTC
(continued)
Yet, the acpi still contains
While (LNKS < 0x07)
    {

So the question is what LNKS really is, given that it needs to reach 0111. On that device, it can be neither speed nor width.
There are also other notebooks that fail to work after resume where the gpu is connected to the same x4 bridge
Intel Corporation Sunrise Point-LP PCI Express Root Port
like
Asus R558U
ASUS zenbook ux310uq
Asus X705UQ
Asus x556
Alpha Centurion Ultra
MSI cx62
HP Spectre
Toshiba Satellite Pro a50
...
but on those, the bridge is not limited to gen2.
So is something really wrong with that bridge?
Comment 104 Maik Freudenberg 2018-09-26 08:54:44 UTC
Created attachment 278771 [details]
acpidump from anomalous Alienware 13 R2
Comment 105 Maik Freudenberg 2018-09-26 09:03:39 UTC
Created attachment 278773 [details]
pci driver quirk to re-train link on resume

Just for completeness, attaching a hacky driver quirk patch to re-train the link on resume. On my otherwise unaffected notebook, this raises the link speed of the GPU from 0001 (2.5 GT/s) to 0003 (8GT/s) on resume.
It does not have any effect on the aforementioned Alienware, but at least it showed that this device was already at the maximum reachable speed of 5GT/s.
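For reference, the speed codes mentioned here (0001, 0003) follow the PCI Express link speed encoding, which can be written down as a tiny Python lookup. This is a sketch for reading register dumps, not part of the attached quirk; the function names are made up.

```python
# PCI Express link speed encoding (Current Link Speed / Max Link Speed fields).
LINK_SPEED_GTS = {
    0x1: 2.5,   # Gen1
    0x2: 5.0,   # Gen2
    0x3: 8.0,   # Gen3
    0x4: 16.0,  # Gen4
}


def link_speed_gts(code):
    """Translate a 4-bit link speed code to GT/s, or None if unknown/reserved."""
    return LINK_SPEED_GTS.get(code & 0xF)


def describe_transition(before_code, after_code):
    """Describe a speed change such as the one seen after re-training."""
    before, after = link_speed_gts(before_code), link_speed_gts(after_code)
    return f"{before} GT/s -> {after} GT/s"


if __name__ == "__main__":
    # The transition reported on the unaffected notebook: 0001 -> 0003.
    print(describe_transition(0x1, 0x3))
```

A code of 0x2 corresponds to the 5GT/s cap reported for the Alienware's bridge.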
Comment 106 Maik Freudenberg 2018-09-26 13:44:32 UTC
> Asus R558U
> ASUS zenbook ux310uq
> Asus X705UQ
> Asus x556
> Alpha Centurion Ultra
> MSI cx62
> HP Spectre
> Toshiba Satellite Pro a50

I forgot: none of those machines have any ACPI switches. They are plain Windows 10-only machines without an acpi_osi workaround, which doesn't really improve the situation.
Comment 107 Karol Herbst 2018-09-28 23:36:34 UTC
(In reply to Maik Freudenberg from comment #105)
> Created attachment 278773 [details]
> pci driver quirk to re-train link on resume
> 
> Just for completeness, attaching a hacky driver quirk patch to re-train the
> link on resume. On my otherwise unaffected notebook, this raises the link
> speed of the gpu from 0001 (2.5 GT/s) to 0003 (8GT/s) on resume.
> Does not have any effect on the aforementioned Alienware, but at least it
> told that this device was already at the maximum reachable speed of 5GT/s.

yeah, this sounds correct. The thing seems to be that some GPUs actually come up with 8GT/s, but some state saving or something makes the bus think the device is at 2.5GT/s. Maybe even the PCI config space of the device thinks so? I don't know for sure, but it would explain quite a lot, because if devices disagree on the link settings, nothing is expected to work anyway.

Will test your quirk when I find some time to test it.
Comment 108 Maik Freudenberg 2018-09-29 16:48:43 UTC
There are currently two paths I'm trying to follow:
1) unhandled chipset bug
From my observations, the first batch of machines with this issue from back when this bug report was created always had their gpu connected to a pci bridge
00:1c.0 PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #3 [8086:a112] (rev f1)
The connected gpus were different models from Maxwell to Pascal.
The second batch of machines now all have their gpu connected to a pci bridge
00:1c.0 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root Port #1 [8086:9d10] (rev f1)
connected gpus are mainly Maxwell 940MX but also 930M, 960M and even a Kepler mobile Quadro.
So my question to those people in this thread with @intel.com addresses, Lv Zheng,  Rafael J. Wysocki, are the errata sheets of those chipsets/controllers easily accessible to rule out a chipset bug?

2) State of config space and link on suspend
Since Karol mentioned state saving and config space, my next guess is that when the kernel restores the config space of bridge and gpu on resume, that state may already be flawed, so nothing can really be done to get the communication right. That would point to having to somehow sanitize the pci config on suspend, e.g. tuning the speed down to 2.5GT/s if that's not already been done. This takes some information gathering on the state of the pci registers on suspend beforehand.

Karol, only test the patch if you're bored, though the printed out values of the current speed settings might be of interest, I don't expect it to do anything.
If the devices disagree about the link state/setting, this should be fixable by triggering a link equalization though I didn't really observe that so far.

The interesting point of the mentioned Alienware is that it's nailed down to gen2, so at least any fancy gen3 capabilities (if there are any, didn't look into it, yet) are out of the game.
Comment 109 Karol Herbst 2018-09-30 20:22:57 UTC
well, you can try out my patch on gen2 when you replace the 8_0 with 5_0 in the patch. Calls look like that: "nvkm_pcie_set_link(pci, NVKM_PCIE_SPEED_8_0, 16);". 16 is for the amount of lanes, but we completely ignore this value anyway (there are some macbooks where the binary driver actually changes the width, but usually it seems to be super unstable on other chips).

My GPU is connected with a "00:01.0 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 05)"

I seriously don't think it is only an issue on a few controllers; it might even happen with any of them.
Comment 110 Karol Herbst 2018-10-04 08:44:13 UTC
(In reply to Maik Freudenberg from comment #105)
> Created attachment 278773 [details]
> pci driver quirk to re-train link on resume
> 
> Just for completeness, attaching a hacky driver quirk patch to re-train the
> link on resume. On my otherwise unaffected notebook, this raises the link
> speed of the gpu from 0001 (2.5 GT/s) to 0003 (8GT/s) on resume.
> Does not have any effect on the aforementioned Alienware, but at least it
> told that this device was already at the maximum reachable speed of 5GT/s.

with that quirk it still fails and I get this inside dmesg:

[  280.463651] pci_raw_set_power_state: 66 callbacks suppressed
[  280.463657] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[  280.524318] nouveau 0000:01:00.0: nvquirk: max speed: 16
[  280.524319] nouveau 0000:01:00.0: nvquirk: current speed: 16
[  280.524320] nouveau 0000:01:00.0: nvquirk: gpu current speed: 000f
[  280.656526] nouveau 0000:01:00.0: nvquirk: 2. max speed: 16
[  280.656530] nouveau 0000:01:00.0: nvquirk: 2. current speed: 16
[  280.656536] nouveau 0000:01:00.0: nvquirk: 2. gpu current speed: 000f
[  280.656547] nouveau 0000:01:00.0: quirk_nvidia_resume+0x0/0x150 took 129123 usecs
[  280.656590] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[  280.656594] nouveau 0000:01:00.0: DRM: couldn't wake up GPU!
Comment 111 Karol Herbst 2018-10-04 08:45:37 UTC
although you can ignore the two top lines, that is nouveau being stupid... maybe I should remove the silly runpm code inside nouveau and test again (it still has the pre _PR3 code in it and runs it on all cards)
Comment 112 Karol Herbst 2018-10-04 10:52:53 UTC
okay, tested with most of the runpm code removed inside nouveau:

first cycle:
Oct 04 10:52:14 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: max speed: 16
Oct 04 10:52:14 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: current speed: 16
Oct 04 10:52:14 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: gpu current speed: 0001
Oct 04 10:52:15 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: 2. max speed: 16
Oct 04 10:52:15 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: 2. current speed: 16
Oct 04 10:52:15 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: 2. gpu current speed: 0001
Oct 04 10:52:15 kherbst.pingu kernel: nouveau 0000:01:00.0: quirk_nvidia_resume+0x0/0x150 took 128846 usecs

second cycle:
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: Refused to change power state, currently in D3
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: max speed: 16
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: current speed: 16
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: gpu current speed: 000f <== different
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: 2. max speed: 16
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: 2. current speed: 16
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: 2. gpu current speed: 000f
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: quirk_nvidia_resume+0x0/0x150 took 128599 usecs

no idea how random it is, that I get one working cycle with the quirk.
Comment 113 Maik Freudenberg 2018-10-04 12:42:55 UTC
The data you get is even more confusing,
>gpu current speed: 000f
simply means that the gpu is turned off, the config space contains all 0xff which is consistent with
>Refused to change power state, currently in D3
Which raises the question, how does your patch get around that by setting the speed on an otherwise dead device?
>nvquirk: current speed: 16
means the bus is reporting a speed of 8GT/s, though I found this is a somehow cached value, I don't really know when the kernel updates that, it doesn't change when changing the speed of the device.
>gpu current speed: 0001
>2. gpu current speed: 0001
means that before and after training, the gpu is at 2.5GT/s so training does not have any effect on your device.

Like with the cached bus speed, I noticed the kernel at some point introduced a kind of config space caching, which leads to a 'zombie mode'. In an unrelated bug where my gpu failed to power on, with a 3.x kernel, the config space was always 0xff which is consistent, with 4.x, it would sometimes be 0xff and sometimes present a cached config space while the device always being off.
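
The reading of those log values can be written down as a small lookup. This is a sketch under assumptions: I assume the quirk prints the bus speed as the kernel's enum pci_bus_speed code in hex (where 0x16 stands for 8GT/s) and the gpu speed as the low four bits of the Link Status register; the function names are made up for illustration.

```python
# PCIe entries of the kernel's enum pci_bus_speed (include/linux/pci.h).
BUS_SPEED = {0x14: "2.5GT/s", 0x15: "5GT/s", 0x16: "8GT/s", 0x17: "16GT/s"}

# Current Link Speed codes from the PCIe Link Status register.
LINK_SPEED = {1: "2.5GT/s", 2: "5GT/s", 3: "8GT/s", 4: "16GT/s"}

def describe_bus_speed(text):
    """Interpret the 'current speed' value, assumed to be printed in hex."""
    return BUS_SPEED.get(int(text, 16), "unknown")

def describe_gpu_speed(text):
    """Interpret the 'gpu current speed' value (low nibble of Link Status)."""
    code = int(text, 16)
    if code == 0xF:
        # Reads from a powered-off device return all ones, so 0xffff & 0xf == 0xf.
        return "no response (all-ones read)"
    return LINK_SPEED.get(code, "unknown")

print(describe_bus_speed("16"), "|", describe_gpu_speed("0001"), "|", describe_gpu_speed("000f"))
```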
Comment 114 Karol Herbst 2018-10-05 14:49:52 UTC
(In reply to Maik Freudenberg from comment #113)
> The data you get is even more confusing,
> >gpu current speed: 000f
> simply means that the gpu is turned off, the config space contains all 0xff
> which is consistent with
> >Refused to change power state, currently in D3
> Which raises the question, how does your patch get around that by setting
> the speed on an otherwise dead device?

I set it before suspending the GPU.

> >nvquirk: current speed: 16
> means the bus is reporting a speed of 8GT/s, though I found this is a
> somehow cached value, I don't really know when the kernel updates that, it
> doesn't change when changing the speed of the device.
> >gpu current speed: 0001
> >2. gpu current speed: 0001
> means that before and after training, the gpu is at 2.5GT/s so training does
> not have any effect on your device.
> 
> Like with the cached bus speed, I noticed the kernel at some point
> introduced a kind of config space caching, which leads to a 'zombie mode'.
> In an unrelated bug where my gpu failed to power on, with a 3.x kernel, the
> config space was always 0xff which is consistent, with 4.x, it would
> sometimes be 0xff and sometimes present a cached config space while the
> device always being off.

yeah... might be related. I think we kind of have to do something before touching the GPU after invoking the ACPI methods to power it back on. The GPU might be in some random hw state and we shouldn't touch it before we are sure that the PCIe link is somewhat in a sane state.

I know this all works more or less reliably without Nouveau loaded (in fact, we have to run some script from the vbios with the help of some firmware stored in the vbios as well, both signed and verified before execution) and if I skip that, runtime suspend/resume works as well, but the GPU isn't in a usable state for nouveau then.

There is some memory stuff going on, but also the PCIe configuration is touched. It might be that we need some information from Nvidia for that (already working on that), but maybe there is a nice solution without their help?

Anyway, we touch the PCIe configuration when loading nouveau and we might have to change something before suspending or do something special on resuming.
Comment 115 Maik Freudenberg 2018-10-07 03:44:04 UTC
Dear Rafael,
Due to the known problems with the bridge
00:1c.0 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root Port #1 [8086:9d10] (rev f1)
being
https://bugzilla.kernel.org/show_bug.cgi?id=201069
https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23
I renew my question for errata sheets.
Or is the answer to whether or not that information is confidential confidential?
Comment 116 Matthias Fulz 2018-11-05 16:30:07 UTC
Created attachment 279327 [details]
acpidump HP Omen 15 dc0307ng

I've checked the tables for PGON and found the following in ssdt7:

        Method (PGON, 1, Serialized)
        {
            PION = Arg0
            If ((PION == Zero))
            {
                If ((SGGP == Zero))
                {
                    Return (Zero)
                }
            }
            ElseIf ((PION == One))
            {
                If ((P1GP == Zero))
                {
                    Return (Zero)
                }
            }
            ElseIf ((PION == 0x02))
            {
                If ((P2GP == Zero))
                {
                    Return (Zero)
                }
            }

            PEBA = \XBAS /* External reference */
            PDEV = GDEV (PION)
            PFUN = GFUN (PION)
            Name (SCLK, Package (0x03)
            {
                One,
                0x0100,
                Zero
            })
            If ((DerefOf (SCLK [Zero]) != Zero))
            {
                PCRA (0xDC, 0x100C, ~DerefOf (SCLK [One]))
                Sleep (0x10)
            }

            If ((CCHK (PION, One) == Zero))
            {
                Return (Zero)
            }

            GPPR (PION, One)
            RTEN (PION)
            If ((PBGE != Zero))
            {
                If (SBDL (PION))
                {
                    PUAB (PION)
                    CBDL = GUBC (PION)
                    MBDL = GMXB (PION)
                    If ((CBDL > MBDL))
                    {
                        CBDL = MBDL /* \_SB_.PCI0.MBDL */
                    }

                    PDUB (PION, CBDL)
                }
            }

            \_SB.PCI0.PEG0.LREN = \_SB.PCI0.PEG0.PEGP.LTRE
            \_SB.PCI0.PEG0.CEDR = One
            While ((\_SB.PCI0.PEG0.LNKS < 0x03))
            {
                Sleep (One)
            }

            If ((PION == Zero))
            {
                S0VI = H0VI /* \_SB_.PCI0.H0VI */
                S0DI = H0DI /* \_SB_.PCI0.H0DI */
                LCT0 = ((ELC0 & 0x43) | (LCT0 & 0xFFBC))
            }
 
The interesting part here: \_SB.PCI0.PEG0.LNKS < 0x03
As far as I understand it should be always 0x04 as it is set only once inside the ssdt12:

    Scope (\_SB.PCI0.PEG0)
    {
        OperationRegion (MSID, SystemMemory, EBAS, 0x0500)
        Field (MSID, DWordAcc, Lock, Preserve)
        {
            VEID,   16,
            Offset (0x40),
            NVID,   32,
            Offset (0x4C),
            ATID,   32,
            Offset (0x88),
            PASM,   2,
            Offset (0x48B),
                ,   1,
            NHDA,   1
        }

        OperationRegion (RPCX, SystemMemory, ((\XBAS + 0x8000) + Zero), 0x1000)
        Field (RPCX, ByteAcc, NoLock, Preserve)
        {
            Offset (0x04),
            CMDR,   8,
            Offset (0x19),
            PRBN,   8,
            Offset (0x84),
            D0ST,   2,
            Offset (0xAA),
            CEDR,   1,
            Offset (0xB0),
            ASPM,   2,
                ,   2,
            LNKD,   1,
            Offset (0xC9),
                ,   2,
            LREN,   1,
            Offset (0x216),
            LNKS,   4

Here is the only place I can find LNKS initialized to 4.

I'll try to remove the while loop and compile a custom dsdt for a test.

Question: Can some acpi expert tell me what this LNKS is for?
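
One note on reading the Field block above: in ASL, an entry like "LNKS,   4" declares a field that is 4 bits wide; it does not assign the value 4. With the preceding Offset (0x216), LNKS is the low four bits of the byte at offset 0x216 in the RPCX region, and its value comes from the hardware at run time. The sketch below mimics that extraction on a plain byte buffer; the helper name and the sample byte are made up, and what the register at 0x216 means is not established in this thread.

```python
def read_field(region, byte_offset, bit_offset, bit_width):
    """Extract a bit field the way an ASL Field entry addresses it (little-endian)."""
    raw = int.from_bytes(region[byte_offset:byte_offset + 4], "little")
    return (raw >> bit_offset) & ((1 << bit_width) - 1)

# Stand-in for the 0x1000-byte RPCX region; the content is invented for the example.
rpcx = bytearray(0x1000)
rpcx[0x216] = 0xA7   # low nibble 0x7
rpcx[0xC9] = 0x04    # bit 2 set

lnks = read_field(rpcx, 0x216, 0, 4)   # Offset (0x216), LNKS, 4
lren = read_field(rpcx, 0xC9, 2, 1)    # Offset (0xC9), 2 skipped bits, LREN, 1
print(hex(lnks), lren)
```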
Comment 117 Matthias Fulz 2018-11-05 22:15:29 UTC
It seems that overriding ssdt7 is not working. Or I don't know how to do it correctly.

Does anybody have any hints on what I could try as a workaround for this issue?

Setting acpi_osi or rev didn't help
Comment 118 Matthias Fulz 2018-11-07 00:22:11 UTC
Some more hints:

After reboot NVIDIA card off:
cat /proc/acpi/bbswitch
0000:01:00.0 OFF

lspci -> working

Second call to lspci -> complete freeze


Reboot:
Card off:
cat /proc/acpi/bbswitch
0000:01:00.0 OFF

optirun lspci -> working
optirun lspci -> working
optirun lspci -> working

Reboot:
Card off:
cat /proc/acpi/bbswitch
0000:01:00.0 OFF

lspci -> working

Manual cycle card power (inside acpidbg):
execute \_SB.PCI0.PGON
Evaluating \_SB.PCI0.PGON
4ACPI Warning: \_SB.PCI0.PGON: Insufficient arguments - Caller passed 0, method requires 1 (20180531/nsarguments-235)                                                                                                                
Evaluation of \_SB.PCI0.PGON returned object 000000003cba7e41, external buffer length 18
 [Integer] = 0000000000000000

execute \_SB.PCI0.PGOF
Evaluating \_SB.PCI0.PGOF
4ACPI Warning: \_SB.PCI0.PGOF: Insufficient arguments - Caller passed 0, method requires 1 (20180531/nsarguments-235)                                                                                                                
Evaluation of \_SB.PCI0.PGOF returned object 000000003cba7e41, external buffer length 18
 [Integer] = 0000000000000000

Card off:
cat /proc/acpi/bbswitch
0000:01:00.0 OFF

lspci -> working


When I cycle the card via ACPI calls before calling lspci, everything works fine.

To me it looks as if lspci, lshw, etc. are doing something, or rather missing something, that is done when calling the acpi methods directly.

Does anybody know what exactly these tools are doing (or the kernel, etc.) which  is different from the acpi calls?

Sorry if this all isn't very clear, I'm totally new to all this acpi stuff and just trying to get around the freezes.
Comment 119 Matthias Fulz 2018-11-07 00:24:33 UTC
Sorry, I copied and pasted from the wrong terminal.
Here are the correct acpi calls:

- execute \_SB.PCI0.PGON Zero
Evaluating \_SB.PCI0.PGON
Evaluation of \_SB.PCI0.PGON returned object 000000003cba7e41, external buffer length 18
 [Integer] = 0000000000000000

- execute \_SB.PCI0.PGOF Zero
Evaluating \_SB.PCI0.PGOF
Evaluation of \_SB.PCI0.PGOF returned object 000000003cba7e41, external buffer length 18
 [Integer] = 0000000000000000
Comment 120 Karol Herbst 2018-11-08 00:59:19 UTC
(In reply to Matthias Fulz from comment #118)
> 
> Does anybody know what exactly these tools are doing (or the kernel, etc.)
> which  is different from the acpi calls?
> 


yes, they do a more fine grained suspending process:
1. put GPU into D3 state via PCI config space
2. ACPI call to suspend the GPU
3. ACPI call to suspend the bus

but that's not actually what is triggering the issue. I was digging a bit into what Nouveau is doing before suspending. To fully use the GPU you have to run some parts of the vbios: scripts that you push through signed firmware embedded inside the vbios on the GPU. It turns out that running them touches the PCI link settings, which causes the resume process to fail. More or less.

The bigger problem here is, we don't have anything to "revert" what this embedded script touches and those scripts are quite huge, so we need to come up with something working for all GPUs after that was executed.

I had a small hack to change the PCIe link speed back to the boot default which fixed the issue a little bit, but caused issues later on.

I am currently in touch with Nvidia about that issue and we might get some more information on that in the future, which would help us to fix this issue completely.
Comment 121 Matthias Fulz 2018-11-08 08:33:56 UTC
Thanks for the Info.

If this is any help: My setup is running nvidia binary drivers / bumblebee.
I saw your patch but because I blacklisted nouveau, I'm unable to try it out.

Yesterday I ran into the issue and got messages on these two ACPI methods, if this could be any help:

\_SB.PCI0.PEG0.PEGP._ON, AE_AML_LOOP_TIMEOUT
\_SB.PCI0.PEG0.PEGP._PS0, AE_AML_LOOP_TIMEOUT

Will try to debug them later at home a bit.
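
These AE_AML_LOOP_TIMEOUT messages would be consistent with a polling loop like the "While (LNKS < 0x03) { Sleep (One) }" in the PGON method from comment 116, if the field never reaches the threshold. Whether _ON and _PS0 end up in that exact loop on this machine is an assumption. The sketch below only shows the difference a deadline makes for such a poll; all names in it are made up for illustration.

```python
import time

def wait_for_link(read_lnks, threshold, timeout_s=1.0, interval_s=0.001,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll read_lnks() until it reaches threshold; give up after timeout_s."""
    deadline = clock() + timeout_s
    while read_lnks() < threshold:
        if clock() >= deadline:
            return False          # link never came up, report instead of spinning
        sleep(interval_s)
    return True

# A link that comes up on the third read, then one that never does.
readings = iter([0, 1, 3])
print(wait_for_link(lambda: next(readings), 0x03))
print(wait_for_link(lambda: 0, 0x03, timeout_s=0.05))
```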
Comment 122 houkime 2018-11-12 06:28:10 UTC
+1 affected machine to the piggybank of knowledge

Asus Vivobook Pro N580GD-FI110
Intel Core i5,
Nvidia GTX1050M

Symptoms:
* Fails to reboot, shutdown or suspend-resume.
* lspci sometimes completes after some time, but makes touchpad and/or keyboard non-functional
* systemd output on reboot shows a bunch of processes being endlessly caught in soft and sometimes hard lockups

Information collection is limited because it is my friend's laptop which he currently uses and not mine.

For now the problem was solved by installing nvidia proprietary, then deleting xf86-video-nouveau (Arch-based system) and blacklisting nouveau kernel module.
(Not sure it is ethical to persuade to go back to nouveau to do more tests)
Comment 123 Matthias Fulz 2018-11-23 00:05:22 UTC
Another information:
I'm trying nouveau with nvidia PRIME and there are no lockups with this setup.

The issue I have now is that the power consumption with discrete card render offload is very high, around 15W, compared to 7W when switching off the card via acpi.

cat /sys/kernel/debug/vgaswitcheroo/switch
0:DIS: :DynOff:0000:01:00.0
1:IGD:+:Pwr:0000:00:02.0

xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x8d cap: 0xb, Source Output, Sink Output, Sink Offload crtcs: 4 outputs: 4 associated providers: 1 name:Intel
Provider 1: id: 0x65 cap: 0x7, Source Output, Sink Output, Source Offload crtcs: 4 outputs: 2 associated providers: 1 name:nouveau

I can switch off the card completely via acpi_call but then the lockups are popping up again...
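
For anyone scripting around this, the vgaswitcheroo lines above split cleanly on colons. A minimal sketch, assuming the five-column layout shown in the output (id, name, active marker, power state, PCI address); the function name and dictionary keys are my own.

```python
def parse_switch_line(line):
    """Parse one line of /sys/kernel/debug/vgaswitcheroo/switch."""
    # Limit the split to 4 so the colons inside the PCI address stay intact.
    idx, name, active, power, pci = line.strip().split(":", 4)
    return {
        "id": int(idx),
        "name": name,
        "active": active == "+",
        "power": power,
        "pci": pci,
    }

for entry in ("0:DIS: :DynOff:0000:01:00.0", "1:IGD:+:Pwr:0000:00:02.0"):
    print(parse_switch_line(entry))
```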
Comment 124 Alexander 2018-11-23 08:34:58 UTC
Created attachment 279627 [details]
attachment-17158-0.html

Has anyone else noticed that whilst GRUB boot parameters seem to resolve
the problem (perhaps with some decrease in performance) the system will
once again freeze upon resuming from a suspend (closing laptop lid)?

Comment 125 Matthias Fulz 2018-11-23 21:07:43 UTC
(In reply to Alexander from comment #124)
> Created attachment 279627 [details]
> attachment-17158-0.html
> 
> Has anyone else noticed that whilst GRUB boot parameters seem to resolve
> the problem (perhaps with some decrease in performance) the system will
> once again freeze upon resuming from a suspend (closing laptop lid)?
> 

I can confirm this issue.
Suspending, even with working osi settings, is going to cause the freeze on resume.
Comment 126 Ramazan 2018-11-29 13:00:54 UTC
Same issue. I had to add
acpi_osi=! acpi_osi="Windows 2009"

to the boot options as a workaround. But that caused another issue: my touchpad doesn't work anymore.

Model: ASUS FX553VD 
GTX 1050 + Intel HD

Dmesg output has these:

[    0.000000] ACPI: Core revision 20170831
[    0.000000] ACPI Error: [PRT0] Namespace lookup failure, AE_ALREADY_EXISTS (20170831/dswload-378)
[    0.000000] ACPI Exception: AE_ALREADY_EXISTS, During name lookup/catalog (20170831/psobject-252)
[    0.000000] ACPI Exception: AE_ALREADY_EXISTS, (SSDT:SataTabl) while loading table (20170831/tbxfload-228)
[    0.000000] ACPI Error: 1 table load failures, 10 successful (20170831/tbxfload-246)

[    9.722102] tpm_crb MSFT0101:00: [Firmware Bug]: ACPI region does not cover the entire command/response buffer. [mem 0xfed40000-0xfed4087f flags 0x200] vs fed40080 f80
[    9.722107] tpm_crb MSFT0101:00: [Firmware Bug]: ACPI region does not cover the entire command/response buffer. [mem 0xfed40000-0xfed4087f flags 0x200] vs fed40080 f80
Comment 127 david.kremer.dk 2018-12-01 11:22:38 UTC
Just wanted to add that I have the same problem with a CLEVO 955ER model. Laptop freezes using lspci/lshw/... with bumblebee switch=OFF on the discrete GPU.

Specs:
CLEVO 955ER laptop 
NVIDIA Corporation GP104M [GeForce GTX 1070 Mobile]
Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz CPU
Distro Archlinux / Kernel 4.14.84-1-lts #1 SMP

More infos here: https://github.com/Bumblebee-Project/Bumblebee/issues/1007

Crossing fingers for a fix to be backported to the LTS branch for christmas :]
Comment 128 Kristóf Marussy 2018-12-01 19:43:29 UTC
I am having similar issues on an ASUS GL504GM (Intel i7 8570H, Nvidia GTX1060) on Linux 4.19.5. My system is stable with the nouveau driver if I use acpi_osi=! acpi_osi="Windows 2009", or if I patch SSDT9 (it appears to handle the dGPU power management) by removing any code conditional on OSI > 2009. However, in that case, the power draw of the laptop is around 28W when the Nvidia GPU is powered off (state DynOff according to vga_switcheroo), which is lower than the 43W drawn when the Nvidia GPU is powered on, but higher than the 16W drawn with an unpatched SSDT9 and no acpi_osi setting. The latter configuration locks up on lspci or when initiating suspend or poweroff.

So the GPU seems to have 3 distinct power states (of course, the proprietary nvidia driver can set lots of different clock speeds that nouveau cannot, but they only come into play when the dGPU is turned on):
1. Fully powered off via ACPI.
2. Partially powered off by vga_switcheroo.
3. Fully powered on.

In states 1 and 2, vga_switcheroo reports DynOff, while in state 3 it reports DynPwr. In contrast, bbswitch reports OFF only in state 1, and reports ON in states 2 and 3. It is possible to reach state 1 using bbswitch, but it leads to an unstable system even with the patched SSDT9.
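
The mapping described above can be expressed as a small lookup, which may help when comparing reports from different machines. This is only a sketch of the three states as numbered in this comment; the function name is made up, and other combinations are reported as unknown rather than guessed.

```python
def classify_state(switcheroo_power, bbswitch_state):
    """Map (vga_switcheroo state, bbswitch state) to state 1, 2 or 3, else None."""
    if switcheroo_power == "DynOff" and bbswitch_state == "OFF":
        return 1   # fully powered off via ACPI
    if switcheroo_power == "DynOff" and bbswitch_state == "ON":
        return 2   # partially powered off by vga_switcheroo
    if switcheroo_power == "DynPwr" and bbswitch_state == "ON":
        return 3   # fully powered on
    return None    # combination not described in this comment

print(classify_state("DynOff", "ON"))
```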

After applying patches https://bugzilla.kernel.org/attachment.cgi?id=278269 and https://bugzilla.kernel.org/attachment.cgi?id=278773, my system is stable after booting up and the card is actually in state 1 (I can check the state by looking at the power draw). After a suspend-resume cycle, the card is in state 2. It is possible to return to state 1 by "power cycling" the card: I run DRI_PRIME=1 glxgears to power the card on for PRIME offload and then exit glxgears to power it down. The suspend-resume-offload-exit cycle can be repeated without locking up the system, and lspci also works.

The situation is less rosy if I try to connect an external monitor. If I plug in my monitor to the HDMI port, which is wired to the Nvidia GPU, when the card is in state 1, the system locks up. I can avoid the lockup and use the external monitor if I run DRI_PRIME=1 glxgears to put the card into state 3 while plugging in the HDMI cable. After that, glxgears may be exited and the HDMI output continues to function.

However, after removing the external monitor this dance cannot be repeated: my monitor just displays "Unsupported output format", as if the card's internal state were corrupted and it sent garbage to the HDMI port. Although some parts of the card must be working correctly because even in this state I can run DRI_PRIME=1 glxgears without problems and see the gears spinning on the internal display of the laptop.
Comment 129 Kristóf Marussy 2018-12-01 20:16:59 UTC
Played around a little bit more with suspend-resume cycles and HDMI connection. It seems that after a suspend-resume-glxgears cycle, I can connect and disconnect my external monitor reliably (no "Unsupported output format" message), so that behavior seems to trigger either only after boot, or randomly (and I didn't manage to trigger it again while playing around).

Moreover, after powering on my card by plugging in the monitor, power usage settles around 37W as opposed to 43W, so I might have been too hasty when comparing the power draw of my card in various states. As the discrepancy is quite a bit smaller than I have thought (37W when powered up vs 28W when DynOff with a patched SSDT9), it might be explained by the fact that in "state 2", the card is indeed fully powered, but nouveau thinks it is off, so utilization is 0%. Whereas the card in "state 1" is indeed off, hence the issue when trying to power it up.
Comment 130 Bruno Pagani 2018-12-02 10:51:17 UTC
The difference between state 1 and 2 is likely the PCIe port (to which the card is attached) being turned off or not from what I understood of all this.

And yes, the issue is about powering on the card again; nouveau has issues setting the card configuration correctly at this stage. Powering off works, but powering on does not, resulting in a lockup (at worst) or a (partially) unusable card.
Comment 131 Kristóf Marussy 2018-12-02 12:48:08 UTC
Played around even more. If I set "su -c 'env DISPLAY=:0 DRI_PRIME=1 glxinfo' my_username >/dev/null" as a post-suspend hook in systemd, my GPU returns to state 1 after resume. It would be better to set this hook in KDE, so I don't have to jump through hoops to run it on the X server with the correct authentication cookie, but I couldn't find a way to do that. I guess this would be much better done from the kernel, but I am not familiar enough with the nouveau source code to write a patch.

Resuming from hibernation, on the other hand, is completely broken and leads to a lockup.
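
In case someone wants to try the same workaround, here is a rough sketch of such a hook. It assumes the usual systemd-sleep convention that executables in /usr/lib/systemd/system-sleep/ are called with "pre" or "post" as the first argument; the user name is the placeholder from the command above, and the function names are made up.

```python
#!/usr/bin/env python3
import subprocess
import sys

USER = "my_username"  # placeholder, replace with the user owning the X session

def build_command(user):
    """Build the power-cycling command quoted in the comment above."""
    return ["su", "-c", "env DISPLAY=:0 DRI_PRIME=1 glxinfo", user]

def main(argv):
    # Only act after resume; the "pre" call before suspend is ignored.
    if len(argv) > 1 and argv[1] == "post":
        subprocess.run(build_command(USER), stdout=subprocess.DEVNULL, check=False)
    return 0

if __name__ == "__main__":
    main(sys.argv)
```

The script would have to be executable and is run as root by systemd, which is why it switches to the session user with su.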
Comment 132 david.kremer.dk 2018-12-02 19:44:44 UTC
Created attachment 279803 [details]
DMESG|grep ACPI kernel boot log

My ACPI kernel log
Comment 133 Kristóf Marussy 2018-12-03 20:23:59 UTC
Back with some more info -- although I start to feel that this is getting a bit futile. Hibernation appears to be working again if GuC loading is disabled in i915, so that was probably a separate issue for me. Currently, it seems the machine is fairly stable when running applications with DRI_PRIME=1 or plugging in an external monitor, and power cycling the card with `DRI_PRIME=1 glxinfo` can put it back to state 1 after sleep or hibernation.

However, resuming after sleep with an external display connected (i.e., the card is powered on before and after suspend) leads to a non-responsive X server and display manager. I can still switch to another terminal and observe DRI notifier timeout errors from nouveau in dmesg if that rings a bell for someone. Restarting X fixes the problem somewhat, but connecting a display after that freezes the computer for good, during which the monitor complains about unsupported input. At any rate, unplugging the external display to let the card power down before sleep looks like a workaround (this is, of course, after applying the kernel patches in this thread).
Comment 134 Antoine Audras 2019-01-28 21:14:40 UTC
Hi,

I am running Arch on an Aero 15x v8 with proprietary nvidia drivers and bumblebee and I have noted something when using X that I don't see in the thread (well, it echoes the two consecutive lspci launches crashing the system). I don't know if it'll be of any use to you though.

When starting X using just the X command without root privileges, X complains because it wasn't launched with root privileges (can't open /dev/tty0).

And when doing that a second time it just freezes the system right after printing the initialization messages ("Using system config [...]").

Once again I launched X unprivileged both times

But launching any application with optirun between the 2 X calls prevents the second launch from crashing the system.

So I guess optirun must "reset" some value somewhere that has been set by an unprivileged operation.

It can be reproduced by launching lspci, lshw or X (e.g.: lspci, X -> crash; lspci, sudo optirun cat /etc/fstab, X -> no crash)

So what do lspci, lshw and X do that can be "reset" by launching optirun?
Comment 135 Bruno Pagani 2019-01-29 10:00:20 UTC
(In reply to Antoine Audras from comment #134)
> So what do lspci, lshw and X do that can be "reset" by launching optirun?

They wake up the card, but currently nouveau/bbswitch don’t do that properly. I guess that loading the proprietary driver then restores a correct card state maybe?
Comment 136 Karol Herbst 2019-05-19 18:09:36 UTC
I have posted some patches on the Nouveau mailing list in order to fix this issue. I also wanted some PCI folks to look at it, but nobody has replied yet...

I am sure that this is in no way a driver issue, but an issue within the PCI subsystem or the PCIe controller, so we are kind of reluctant to push such a "workaround" into a driver as other devices might run into such issues as well? Or maybe it's some Nvidia hardware screwup? Anyways, I have not enough knowledge about how all that PCI stuff works, so I can't come to a final conclusion here.

mailing list thread: https://lists.freedesktop.org/archives/nouveau/2019-May/032353.html
Comment 138 Victor 2019-09-21 11:36:24 UTC
What's the current status on this bug, any reply/update on those patches?
Comment 139 Matthias Fulz 2019-09-28 21:53:48 UTC
If someone is interested in a workaround for using either nvidia or the intel card, I've worked out the following setup:

One grub entry for intel only:

modprobe.blacklist=nvidia_drm modprobe.blacklist=nvidia_modeset modprobe.blacklist=nvidia snd_hda_intel.enable=1,0,0

With the above parameters added to the grub entry and only the intel GPU in use, my omen laptop runs at 8-9W (25% display brightness), which gives me around 7-8h :)

Every other setup just goes with around 15W ....


Second grub entry for using the nvidia card together with the binary blob:

I'm using the following grub parameter to tell that I want to use the discrete card:

discretevga=y

For that I have a shell script that checks for this cmdline param and activates the needed xorg conf:

#!/bin/bash
# Enable the nvidia xorg conf only when booted with the discretevga parameter.

conf=/etc/X11/xorg.conf.d/20-nvidia.conf

# Start from a clean state on every boot.
rm -f "$conf"

if grep -q 'discretevga' /proc/cmdline
then
    # Discrete card requested: activate the nvidia xorg conf.
    ln -s /usr/local/etc/x11-nvidia.conf "$conf"
else
    # Intel only: make sure no nvidia module stays loaded.
    rmmod nvidia_drm
    rmmod nvidia_modeset
    rmmod nvidia
fi

exit 0


here is the x11-nvidia.conf that I'm using:

Section "Module"
        Load "modesetting"
EndSection

Section "Device"
        Identifier "Nvidia Card"
        Driver "nvidia"
        BusID "1:0:0"
        Option "AllowEmptyInitialConfiguration"
EndSection


The obvious drawback is the reboot needed to switch between the cards, but at least I can use each of them on its own without any issue.

And for me 90% of the time I'm just using the intel card, so most important is low power consumption.

Hopefully this helps some of you guys ;)
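
A minimal sketch of the cmdline check used by the script above, assuming the parameter name is a plain word such as discretevga. The function name has_cmdline_param and its optional file argument are hypothetical and only there so it can be tried against a test file. Unlike a plain substring grep, it only matches a whole parameter (bare or with =value):

```shell
#!/bin/sh
# has_cmdline_param NAME [FILE]
# Succeeds if NAME appears as a whole kernel parameter, either bare or as
# NAME=value. FILE defaults to /proc/cmdline (hypothetical override for tests).
has_cmdline_param() {
    name="$1"
    file="${2:-/proc/cmdline}"
    # Put every parameter on its own line, then match the whole line.
    tr -s ' ' '\n' < "$file" | grep -Eqx -- "$name(=.*)?"
}
```

In the script this would be used as `if has_cmdline_param discretevga; then ...`, so a parameter like nodiscretevga would not be mistaken for it.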
Comment 140 arpie 2019-10-23 22:09:55 UTC
After hours of experimenting on this laptop :

Computer : PC Specialist OptimusIX 15 (aka Clevo N8xxEP6)
BIOS : American Megatrends 1.07.13
OS : Arch Linux
GPU : NVIDIA GTX 1060 Mobile 

Until recently, any attempt to use bumblebee or acpi commands to power down the GPU has resulted in a system freeze with lspci, suspend, power cable plug in, etc.  No kernel command-line parameters seem to have any effect.

I have discovered that the system freeze is closely linked to the interaction between the nvidia graphics card on pci address 0000:01:00.0 and its associated sound card at pci address 0000:01:00.1  (I don't actually know what that sound card is doing - I presume it's for the HDMI port?)

If I completely disable the audio card using :
echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.1/remove

Then the system hangs are completely cured - I can acpi _OFF or _ON or _PS3 or _PS0 to my heart's content and the gfx card will power up and down perfectly, lspci behaves perfectly normally (without any lag), and suspend/resume and power cable plug/unplug all work.  Even better, kernel power management on the PCI bus seems to work perfectly too, but only kicks in when I rmmod nvidia.  So far, bumblebee and bbswitch also seem to be totally happy.

Can anybody else confirm similar findings?

Bear in mind that the audio card needs to be removed BEFORE the kernel loads any audio modules.  I do it like this :

[Unit]
Description=Nvidia Audio Card OnBoot Disabler
Before=bumblebeed.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/sh -c "echo 1 > /sys/bus/pci/devices/0000:01:00.1/remove"
ExecStop=/usr/bin/sh -c "echo 1 > /sys/bus/pci/rescan"

[Install]
WantedBy=sysinit.target
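
A minimal sketch of a more defensive version of the removal step, assuming the audio function reports NVIDIA's PCI vendor ID (0x10de) and an HD audio class (0x0403xx) in sysfs. The function name remove_nvidia_audio and the SYSFS_PCI override are hypothetical; the override only exists so the function can be tried against a fake directory tree instead of the real /sys:

```shell
#!/bin/sh
# remove_nvidia_audio ADDR
# Removes the PCI function at ADDR only if it is an NVIDIA HD audio device.
# SYSFS_PCI is a hypothetical override; it defaults to /sys/bus/pci/devices.
remove_nvidia_audio() {
    addr="$1"
    dev="${SYSFS_PCI:-/sys/bus/pci/devices}/$addr"

    # Nothing to do if the function is already gone.
    [ -d "$dev" ] || { echo "absent"; return 0; }

    vendor=$(cat "$dev/vendor")
    class=$(cat "$dev/class")

    # 0x10de is NVIDIA's vendor ID; class 0x0403xx is an HD audio device.
    case "$vendor:$class" in
        0x10de:0x0403*) ;;
        *) echo "refused"; return 1 ;;
    esac

    # Same write as in the service file above.
    echo 1 > "$dev/remove" && echo "removed"
}
```

On a machine where the address differs from 0000:01:00.1, this refuses to remove an unrelated device instead of silently dropping it.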
Comment 141 Matthias Fulz 2019-10-24 22:09:04 UTC
(In reply to arpie from comment #140)
> After hours of experimenting on this laptop :
> 
> Computer : PC Specialist OptimusIX 15 (aka Clevo N8xxEP6)
> BIOS : American Megatrends 1.07.13
> OS : Arch Linux
> GPU : NVIDIA GTX 1060 Mobile 
> 
> Until recently, any attempt to use bumblebee or acpi commands to power down
> the GPU has resulted in a system freeze with lspci, suspend, power cable
> plug in, etc.  No kernel command-line parameters seem to have any effect.
> 
> I have discovered that the system freeze is closely linked to the
> interaction between the nvidia graphics card on pci address 0000:01:00.0 and
> its associated sound card at pci address 0000:01:00.1  (I don't actually
> know what that sound card is doing - I presume it's for the HDMI port?)
> 
> If I completely disable the audio card using :
> echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.1/remove
> 
> Then the system hangs are completely cured - I can acpi _OFF or _ON or _PS3
> or _PS0 to my heart's content and the gfx card will power up and down
> perfectly, lspci behaves perfectly normally (without any lag), and
> suspend/resume and power cable plug/unplug all works.  Even better, kernel
> power management on the PCI bus seems to work perfectly too, but only kicks
> in when I rmmod nvidia.  So far, bumblebee and bbswitch also seem to be
> totally happy.
> 
> Can anybody else confirm similar findings?
> 
> Bear in mind that the audio card needs to be removed BEFORE the kernel loads
> any audio modules.  I do it like this :
> 
> [Unit]
> Description=Nvidia Audio Card OnBoot Disabler
> Before=bumblebeed.service
> 
> [Service]
> Type=oneshot
> RemainAfterExit=yes
> ExecStart=/usr/bin/sh -c "echo 1 > /sys/bus/pci/devices/0000:01:00.1/remove"
> ExecStop=/usr/bin/sh -c "echo 1 > /sys/bus/pci/rescan"
> 
> [Install]
> WantedBy=sysinit.target

Not working for me. Still freezing with this.
Comment 142 arpie 2019-10-25 13:38:19 UTC
(In reply to Matthias Fulz from comment #141)
> (In reply to arpie from comment #140)
[snip]
> > If I completely disable the audio card using :
> > echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.1/remove
> > 
> > Then the system hangs are completely cured 
> 
> Not working for me. Still freezing with this.

Any chance of more details?  When and how is it freezing?  Is it any different from before?  What are your machine/card details (looks like you haven't posted these anywhere above)?

Also, are you absolutely sure you've disabled the audio card during boot *before the kernel notices it is there*?  The only reliable way I've found to check if this is the case, is to run powertop, and look in the 'Device Status' tab for listings of 'Audio codec hwXXXXX: nvidia'.  If that is showing up, then the nvidia sound card is still active and will cause hangs.  My solution only works if the audio card is removed/disabled before the audio system initialises during boot (hence the WantedBy=sysinit.target in my service file).

I think I should have also mentioned that in order for the kernel to do the PM, you need to do something like :

echo auto | sudo tee /sys/bus/pci/devices/0000:01:00.0/power/control

I have TLP installed, which does this for me.

Now a few days have passed, I admit I have had a few freezes when using bbswitch.  But if I disable bbswitch and just use bumblebee with no power management, all is well (so far).  If I want to power down the nvidia GFX card I just manually modprobe -r nvidia and the kernel does the rest.   Using this solution, I see a drop from about 20W to 10W when the card powers off, with no ACPI calls at all (or, rather, none that I am aware of - I have no idea what the kernel is actually doing behind the scenes).

I am sure that there must be a 'proper' solution where the correct ACPI commands are used to power off/on both the nvidia video and audio at the same time but finding such a solution is far beyond me...
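
As an alternative to looking in powertop, here is a minimal sketch that lists any NVIDIA audio function still present on the PCI bus. It assumes the vendor and class values that sysfs exposes (0x10de for NVIDIA, 0x0403xx for HD audio); the function name list_nvidia_audio and its optional directory argument are hypothetical:

```shell
#!/bin/sh
# list_nvidia_audio [DIR]
# Prints the PCI address of every NVIDIA HD audio function found in DIR
# (default: /sys/bus/pci/devices). No output means none is enumerated.
list_nvidia_audio() {
    root="${1:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        # Skip entries without readable vendor/class attributes.
        [ -r "$dev/vendor" ] && [ -r "$dev/class" ] || continue
        [ "$(cat "$dev/vendor")" = "0x10de" ] || continue
        case "$(cat "$dev/class")" in
            0x0403*) basename "$dev" ;;
        esac
    done
}
```

If this prints an address after boot, the removal ran too late or not at all. Note that it only checks PCI enumeration, not whether an audio module had already bound to the device earlier.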
Comment 143 Matthias Fulz 2019-10-25 16:17:25 UTC
(In reply to arpie from comment #142)
> (In reply to Matthias Fulz from comment #141)
> > (In reply to arpie from comment #140)
> [snip]
> > > If I completely disable the audio card using :
> > > echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.1/remove
> > > 
> > > Then the system hangs are completely cured 
> > 
> > Not working for me. Still freezing with this.
> 
> Any chance of more details?  When and how is it freezing?  Is it any
> different from before?  What are your machine/card details (looks like you
> haven't posted these anywhere above)?
> 

I've got a HP OMEN 15 with a nvidia GTX 1050 running archlinux

> Also, are you absolutely sure you've disabled the audio card during boot
> *before the kernel notices it is there*?  The only reliable way I've found
> to check if this is the case, is to run powertop, and look in the 'Device
> Status' tab for listings of 'Audio codec hwXXXXX: nvidia'.  If that is
> showing up, then the nvidia sound card is still active and will cause hangs.
> My solution only works if the audio card is removed/disabled before the
> audio system initialises during boot (hence the WantedBy=sysinit.target in
> my service file).
> 

I've used your service file together with bumblebee and bbswitch.

> I think I should have also mentioned that in order for the kernel to do the
> PM, you need to do something like :
> 
> echo auto | sudo tee /sys/bus/pci/devices/0000:01:00.0/power/control
> 
> I have TLP installed, which does this for me.
> 

Ok this step was missing.

> Now a few days have passed, I admit I have had a few freezes when using
> bbswitch.  But if I disable bbswitch and just use bumblebee with no power
> management, all is well (so far).  If I want to power down the nvidia GFX
> card I just manually modprobe -r nvidia and the kernel does the rest.  
> Using this solution, I see a drop from about 20W to 10W when the card powers
> off, with no ACPI calls at all (or, rather, none that I am aware of - I have
> no idea what the kernel is actually doing behind the scenes).
> 

Ah I see.
Then I think this is basically somehow similar to my workaround using the snd_hda_intel module parameter.
The nvidia card will just be completely "powered off" by not using it in any way (no module loaded)

> I am sure that there must be a 'proper' solution where the correct ACPI
> commands are used to power off/on both the nvidia video and audio at the
> same time but finding such a solution is far beyond me...

I think some ACPI / PM guys should definitely check the audio part of the GPU as there could be some issues related to this bug.


I will try it perhaps once again and give feedback here.
But honestly, these tests are really harmful for me: very often some random files get truncated to zero during the crash and I then have to restore backups...
Comment 144 Matthias Fulz 2019-10-25 16:42:56 UTC
Ok here are more tests:

Disabling the audio part with your suggestion is working.
No NVIDIA audio in powertop nor in lspci.

Your solution is basically loading / unloading the nvidia module now, which indeed is working, but not the optimus part I think?

As soon as I try acpi calls, the freeze happens again after running lspci once or twice.

And here the real problem starts:

Just loading and afterwards unloading the nvidia module wakes up the card from the real disabled state. Even though powertop reports 0% usage of the nvidia card, my power consumption does not go below 11W again.

So the issue here is: your way does not trigger the real shutdown of the nvidia card, as you're just unloading the module.

The difference when using bbswitch (which leads to freezes for you as well) is that it does the acpi calls and really powers down the card, which leads to the freezes...

For some users it might be fully ok to just load / unload nvidia, as it makes a difference for the power consumption.

But for me it's around 1/3 missing runtime, which really hurts me :)


But perhaps you could try my workaround with two boot entries and check the power consumption on your side, when running intel only?

For me it's around 7-8W intel only and around 11W when using your workaround.
But again, your solution is not really disabling the nvidia card; it's more like just not using it and letting it stay in idle mode with limited PM.
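
To compare the two setups with the same measurement, here is a minimal sketch that reads the current battery draw in watts from sysfs. It assumes the battery is exposed as /sys/class/power_supply/BAT0 and provides either power_now (microwatts) or current_now plus voltage_now (microamps and microvolts); not every battery driver provides these files. The function name battery_watts and its optional directory argument are hypothetical:

```shell
#!/bin/sh
# battery_watts [BATTERY_DIR]
# Prints the current battery draw in watts with one decimal place.
# BATTERY_DIR defaults to /sys/class/power_supply/BAT0 (an assumption).
battery_watts() {
    bat="${1:-/sys/class/power_supply/BAT0}"
    if [ -r "$bat/power_now" ]; then
        # power_now is reported in microwatts.
        awk '{ printf "%.1f\n", $1 / 1000000 }' "$bat/power_now"
    elif [ -r "$bat/current_now" ] && [ -r "$bat/voltage_now" ]; then
        # microamps * microvolts = 1e-12 watts.
        awk -v v="$(cat "$bat/voltage_now")" \
            '{ printf "%.1f\n", $1 * v / 1000000000000 }' "$bat/current_now"
    else
        return 1
    fi
}
```

Run it on battery power after the machine has been idle for a while. With a fixed battery capacity, runtime scales roughly with 1/power, so 7.5W versus 11W gives about 7.5/11 = 0.68 of the runtime, which is consistent with the "around 1/3 missing" figure above.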
Comment 145 Kai-Heng Feng 2019-10-25 18:00:33 UTC
Laptops with Skylake SoC and later shouldn't need bbswitch. PCIe port PM will disable the power of the card.
After nvidia.ko gets unloaded, make sure "power/control" is "auto" for its video (e.g. 01:00.0) and audio (e.g. 01:00.1) functions and its upstream bridge (use lspci -t to check).
 
In addition to that, these two commits are also required for mainline kernel users:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=52525b7a3cf82adec5c6cf0ecbd23ff228badc94
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=bacd861452d2be86a4df341b12e32db7dac8021e
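
A minimal sketch of the "power/control" step, assuming the GPU is function .0, its audio device is function .1 on the same slot, and the upstream bridge is the parent directory of the GPU in the resolved sysfs path. The function name set_runtime_pm_auto and its optional directory argument are hypothetical, and it relies on `readlink -f` being available:

```shell
#!/bin/sh
# set_runtime_pm_auto GPU_ADDR [DIR]
# Writes "auto" to power/control of the GPU, its audio function and the
# upstream bridge. DIR defaults to /sys/bus/pci/devices.
set_runtime_pm_auto() {
    gpu="$1"
    root="${2:-/sys/bus/pci/devices}"

    # Entries in DIR are symlinks; the resolved path nests each device
    # under its parent bridge.
    gpu_path=$(readlink -f "$root/$gpu") || return 1
    bridge_path=$(dirname "$gpu_path")

    # Audio function: same bus and slot as the GPU, function number 1.
    audio_path="$bridge_path/${gpu%.*}.1"

    for dev in "$gpu_path" "$audio_path" "$bridge_path"; do
        if [ -e "$dev/power/control" ]; then
            echo auto > "$dev/power/control"
            echo "$(basename "$dev"): auto"
        fi
    done
}
```

It would be called as root, e.g. `set_runtime_pm_auto 0000:01:00.0`, after the nvidia module has been unloaded. Devices without a power/control file (for example a removed audio function) are skipped.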
Comment 146 arpie 2019-10-25 18:36:49 UTC
(In reply to Matthias Fulz from comment #143)
> (In reply to arpie from comment #142)
> > (In reply to Matthias Fulz from comment #141)
> > > (In reply to arpie from comment #140)
> > [snip]
> 
> Ah I see.
> Then I think this is basically somehow similar to my workaround using the
> snd_hda_intel module parameter.
> The nvidia card will just be completely "powered off" by not using it in any
> way (no module loaded)

Yes, now I've read your workaround more closely, I think you're right it is basically achieving the same thing.

> > I am sure that there must be a 'proper' solution where the correct ACPI
> > commands are used to power off/on both the nvidia video and audio at the
> > same time but finding such a solution is far beyond me...
> 
> I think some ACPI / PM guys should definitely check the audio part of the GPU
> as there could be some issues related to this bug.

Judging by comment 145, they are already way ahead of us!
Comment 147 arpie 2019-10-25 18:45:17 UTC
(In reply to Matthias Fulz from comment #144)
> Ok here are more tests:
> 
> Disabling the audio part with your suggestion is working.
> No NVIDIA audio in powertop nor in lspci.
> 
> Your solution is basically loading / unloading the nvidia module now, which
> indeed is working, but not the optimus part I think?

Optimus IS working here, no problem at all.
 
[snip]
> For some users it might be fully ok to just use load / unload nvidia as it
> makes a difference for the power consumption.
> 
> But for me it's around 1/3 missing runtime, which really hurts me :)
> 
> But perhaps you could try my workaround with two boot entries and check the
> power consumption on your side, when running intel only?

I will try this maybe tonight when I will have more time to spare.  I too would be interested in gaining 20-30% battery life!  But, then again, I wouldn't want to have to reboot to be able to use the dGPU (I use it for blender3d).
 
> for me it's around 7-8W intel only and around 11W when using your workaround.
> But again your solution is just not really disabling the nvidia card,
> instead it's more like just not using it and let it stay in idle mode with
> limited PM.

Yes, and no... as far as I can see from my tests, it is not staying in idle mode, it is being fully powered down by the kernel when not in use, and fully powered up again when I use optirun.
Comment 148 arpie 2019-10-25 18:56:42 UTC
(In reply to Kai-Heng Feng from comment #145)

Thank you very much for this optimistic-sounding info, Kai-Heng Feng.

> Laptops with Skylake SoC and later shouldn't need bbswitch. PCIe port PM
> will disable the power of the card.

How do I check if I have Skylake SoC?

I am actually currently using a modified bbswitch where I have disabled the acpi calls.  The point of this is to force bumblebee to automatically load and unload the nvidia modules before and after using optirun.  I suspect there is an easier way but for now this works for me.

> After nvidia.ko gets unloaded, make sure "power/control" is "auto" for its
> video (e.g. 01:00.0) and audio (e.g. 01:00.1) functions and its upstream
> bridge (use lspci -t to check).

I initially tried what you describe here but the audio part was preventing power management from happening because it was permanently flagged in use by the snd_hda_intel module.  Hence my work-around.
  
> In addition to that, these two commits are also required for mainline kernel
> users:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> ?id=52525b7a3cf82adec5c6cf0ecbd23ff228badc94
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> ?id=bacd861452d2be86a4df341b12e32db7dac8021e

I am not interested in attempting to compile the kernel so will wait for these two commits to make it into the stable release.  Reading their descriptions, especially the second one, sounds like it is the perfect fix for what I am experiencing.
Comment 149 Matthias Fulz 2019-10-25 19:43:10 UTC
(In reply to Kai-Heng Feng from comment #145)
> Laptops with Skylake SoC and later shouldn't need bbswitch. PCIe port PM
> will disable the power of the card.
> After nvidia.ko gets unloaded, make sure "power/control" is "auto" for its
> video (e.g. 01:00.0) and audio (e.g. 01:00.1) functions and its upstream
> bridge (use lspci -t to check).
>  
> In addition to that, these two commits are also required for mainline kernel
> users:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> ?id=52525b7a3cf82adec5c6cf0ecbd23ff228badc94
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> ?id=bacd861452d2be86a4df341b12e32db7dac8021e

Ok, I have a Coffee Lake machine and will try these two commits and report back.
Comment 150 adikurthy 2019-11-10 11:32:00 UTC
(In reply to Kai-Heng Feng from comment #145)
> Laptops with Skylake SoC and later shouldn't need bbswitch. PCIe port PM
> will disable the power of the card.
> After nvidia.ko gets unloaded, make sure "power/control" is "auto" for its
> video (e.g. 01:00.0) and audio (e.g. 01:00.1) functions and its upstream
> bridge (use lspci -t to check).
>  
> In addition to that, these two commits are also required for mainline kernel
> users:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> ?id=52525b7a3cf82adec5c6cf0ecbd23ff228badc94
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> ?id=bacd861452d2be86a4df341b12e32db7dac8021e


I have an i7 8750H with a GTX 1050 Mobile. I applied these two patches on top of Linus' tree. After I switched all "power/control" to "auto", everything works now. Card powers down, suspend/resume works. 

Thank you for figuring this out. Before this I was getting lockups with bbswitch/acpi_call during boot. I had to do crazy workarounds to get away with this during early boot and suspend/resume. Those days are gone now!
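
To check whether the card really powered down, here is a minimal sketch that prints the runtime PM state of a few PCI functions. It assumes the standard power/control and power/runtime_status sysfs attributes; the function name pm_status and the SYSFS_PCI override are hypothetical:

```shell
#!/bin/sh
# pm_status ADDR...
# Prints power/control and power/runtime_status for each PCI address.
# SYSFS_PCI is a hypothetical override; it defaults to /sys/bus/pci/devices.
pm_status() {
    root="${SYSFS_PCI:-/sys/bus/pci/devices}"
    for addr in "$@"; do
        pm="$root/$addr/power"
        if [ -r "$pm/control" ] && [ -r "$pm/runtime_status" ]; then
            echo "$addr control=$(cat "$pm/control") status=$(cat "$pm/runtime_status")"
        else
            # Device not enumerated (e.g. removed) or attributes unavailable.
            echo "$addr missing"
        fi
    done
}
```

For example, `pm_status 0000:00:01.0 0000:01:00.0 0000:01:00.1` should eventually report status=suspended for all three once nothing is using the card. Reading these files does not wake the device, unlike running lspci.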
Comment 151 Matthias Fulz 2019-11-12 23:22:47 UTC
I've tried the patches and can confirm that the lockups are gone when just using bumblebee (unloading the nvidia module).

But the problems are still the same:
1.) Just unloading the nvidia modules keeps the power consumption at 13-14W, which is 5-6W more (almost double) compared to ~8W with intel only.

2.) When using an acpi call to power off the nvidia card completely, which drops the power consumption to 8W, the lockups are back.

So for me the patches are not working :(
Comment 152 Matthias Fulz 2019-11-12 23:26:45 UTC
Of course I've set the power to auto for everything
Comment 153 Matthias Fulz 2019-11-12 23:31:06 UTC
More infos:

powertop shows this when using intel gpu only (my workarounds from post https://bugzilla.kernel.org/show_bug.cgi?id=156341#c139)

              0.0%        PCI Device: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16)
              0.0%        PCI Device: Intel Corporation Cannon Lake PCH HECI Controller
              0.0%        PCI Device: NVIDIA Corporation GP107M [GeForce GTX 1050 Mobile]
              0.0%        PCI Device: Intel Corporation Cannon Lake PCH SPI Controller
              0.0%        PCI Device: Intel Corporation Cannon Lake PCH cAVS
              0.0%        PCI Device: NVIDIA Corporation GP107GL High Definition Audio Controller
              0.0%        PCI Device: Intel Corporation Cannon Lake PCH Shared SRAM
              0.0%        PCI Device: Intel Corporation Cannon Lake PCH SMBus Controller
              0.0%        PCI Device: Intel Corporation Cannon Lake PCH PCI Express Root Port #14
 
When using the patches with the nvidia modules unloaded and power set to auto, it's still saying 100%.
Comment 154 Matthias Fulz 2019-11-18 23:55:08 UTC
Ok, after I removed the patches and the normal kernel update to 5.3.11 happened, I'm experiencing the same higher power consumption that happened during the test before.

It could be related to something other than the patches.

But I'm unable to find out atm. where it comes from :(
The PC does not go below 10W with 5.3.11;
on 5.3.8 it drops to 7-8W.
