Bug 212039

Summary: PCI_COMMAND randomly flips to 0 after system resume, breaks the xHCI, HMB NVMe, and all other PCI devices
Product: Power Management Reporter: Kai-Heng Feng (kai.heng.feng)
Component: Hibernation/SuspendAssignee: Rafael J. Wysocki (rjw)
Status: RESOLVED WILL_NOT_FIX    
Severity: normal CC: alexander.usyskin, avi.shalev, mathias.nyman, rjw, rui.zhang, sasha.neftin, tomas.winkler, tomasw, vitaly.lifshits, yu.c.chen
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: mainline, linux-next Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg
lspci -vvnnxxx
dmesg with mei_me disabled
dmidecode
attachment-20658-0.html

Description Kai-Heng Feng 2021-03-03 17:08:37 UTC
After system resume, the read on PCI_COMMAND may randomly be 0 in pci_read_config_word(), which is called by __pci_set_master().

So it only enables Bus Master in next pci_read_config_word(), disabling "Memory Space", "I/O Space" at the same time.

This makes xHCI driver fail to access MMIO register, render USB unusable.
Comment 1 Kai-Heng Feng 2021-03-03 17:13:01 UTC
Created attachment 295605 [details]
dmesg

Most relevant part:
[13425.286713] xhci_hcd 0000:00:14.0: __pci_set_master 0 0x402
[13425.286829] xhci_hcd 0000:00:14.0: __pci_set_master 1 0x402
[13425.287198] xhci_hcd 0000:00:14.0: __pci_set_master 2 0x0
[13425.287413] xhci_hcd 0000:00:14.0: Controller not ready at resume -19
[13425.287415] xhci_hcd 0000:00:14.0: PCI post-resume error -19!
[13425.287417] xhci_hcd 0000:00:14.0: HC died; cleaning up
[13425.287422] PM: dpm_run_callback(): pci_pm_resume+0x0/0xf0 returns -19
[13425.287431] xhci_hcd 0000:00:14.0: PM: failed to resume async: error -19

Here's the patch to illustrate the bug:
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 16a17215f633..a53b8a965d87 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4249,6 +4249,11 @@ static void __pci_set_master(struct pci_dev *dev, bool enable)
        u16 old_cmd, cmd;
 
        pci_read_config_word(dev, PCI_COMMAND, &old_cmd);
+       pci_info(dev, "%s 0 0x%0x", __func__, old_cmd);
+       pci_read_config_word(dev, PCI_COMMAND, &old_cmd);
+       pci_info(dev, "%s 1 0x%0x", __func__, old_cmd);
+       pci_read_config_word(dev, PCI_COMMAND, &old_cmd);
+       pci_info(dev, "%s 2 0x%0x", __func__, old_cmd);
        if (enable)
                cmd = old_cmd | PCI_COMMAND_MASTER;
        else
Comment 2 Kai-Heng Feng 2021-03-03 17:13:50 UTC
Created attachment 295607 [details]
lspci -vvnnxxx
Comment 3 Kai-Heng Feng 2021-03-03 17:14:15 UTC
Also, the issue is not reproducible on Intel RVP.
Comment 4 Mathias Nyman 2021-03-04 10:54:45 UTC
Is the sudden 0 return for PCI_COMMAND word read related to a change in the
PCI_STATUS word?

As these two are both part of the same 32bit register could it be possible
that somthing setting the status accidentally overwrites the command?

PCI is not really my area, wanted to give my two cents anyway
Comment 5 Kai-Heng Feng 2021-03-04 12:27:49 UTC
I don't think so, because there are cases that the first read is 0, but it flips back to its original value on next read.

Also, there doesn't seem to be anything touches PCI_STATUS in PCI resume path.

As of now it seems like disabling Intel ME completely can make the issue go away.
Comment 6 Rafael J. Wysocki 2021-03-17 16:08:55 UTC
(In reply to Kai-Heng Feng from comment #0)
> After system resume, the read on PCI_COMMAND may randomly be 0 in
> pci_read_config_word(), which is called by __pci_set_master().

In the attached dmesg this is suspend-to-idle, so I wonder if you have any insights in what in particular can cause this failure to happen.

There's a little difference between the handling of suspend-to-idle and PM-runtime, so I'm wondering why this is not visible in the working-state PM but just on system resume.

> So it only enables Bus Master in next pci_read_config_word(), disabling
> "Memory Space", "I/O Space" at the same time.
> 
> This makes xHCI driver fail to access MMIO register, render USB unusable.
Comment 7 Kai-Heng Feng 2021-03-17 16:17:41 UTC
It's caused by ME. But I have zero clue on what really happened.

This is the workaround [1] I found that can magically make the issue go away. It makes ME the last to suspend and first to resume, also no async suspend/resume by imposing device links to other PCH devices.

[1] https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/unstable/commit/?id=6518ab6907040003c96107ebd247c58eab579f9e
Comment 8 Kai-Heng Feng 2021-03-17 16:19:58 UTC
Sasha, do you know if the workaround is something should be done? If this is a proper quirk we should do, I hope we can put it to e1000e driver...
Comment 9 Rafael J. Wysocki 2021-03-17 16:23:28 UTC
(In reply to Kai-Heng Feng from comment #7)
> It's caused by ME. But I have zero clue on what really happened.

What exactly does ME stand for in this context?

> This is the workaround [1] I found that can magically make the issue go
> away. It makes ME the last to suspend and first to resume, also no async
> suspend/resume by imposing device links to other PCH devices.

So this would be a NIC.

Does reverting commit 6cecf02e77ab make the issue go away or is the workaround still needed after reverting it?

> [1]
> https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/unstable/
> commit/?id=6518ab6907040003c96107ebd247c58eab579f9e
Comment 10 Kai-Heng Feng 2021-03-17 16:26:25 UTC
(In reply to Rafael J. Wysocki from comment #9)
> (In reply to Kai-Heng Feng from comment #7)
> > It's caused by ME. But I have zero clue on what really happened.
> 
> What exactly does ME stand for in this context?

Intel Management Engine.

> 
> > This is the workaround [1] I found that can magically make the issue go
> > away. It makes ME the last to suspend and first to resume, also no async
> > suspend/resume by imposing device links to other PCH devices.
> 
> So this would be a NIC.
> 
> Does reverting commit 6cecf02e77ab make the issue go away or is the
> workaround still needed after reverting it?

The workaround is still needed. The commit just makes the fail rate increased from 1% to 10%.

> 
> > [1]
> >
> https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/unstable/
> > commit/?id=6518ab6907040003c96107ebd247c58eab579f9e
Comment 11 Rafael J. Wysocki 2021-03-17 17:04:20 UTC
(In reply to Kai-Heng Feng from comment #10)
> (In reply to Rafael J. Wysocki from comment #9)
> > (In reply to Kai-Heng Feng from comment #7)
> > > It's caused by ME. But I have zero clue on what really happened.
> > 
> > What exactly does ME stand for in this context?
> 
> Intel Management Engine.

OK, thanks! [But it would help to resolve the TLA at least once in the report - or git commit for that matter.]

If it is caused by the ME, I wonder why the issue is reported against xHCI.  Surely other PCH devices are affected too?

> > > This is the workaround [1] I found that can magically make the issue go
> > > away. It makes ME the last to suspend and first to resume, also no async
> > > suspend/resume by imposing device links to other PCH devices.
> > 
> > So this would be a NIC.
> > 
> > Does reverting commit 6cecf02e77ab make the issue go away or is the
> > workaround still needed after reverting it?
> 
> The workaround is still needed. The commit just makes the fail rate
> increased from 1% to 10%.

I wonder why PM-runtime isn't affected then.  Or maybe it is, but failures are less likely to occur in that context.

Anyway, it looks like the suspend and resume of ME cannot be carried out in parallel with the suspend and resume of any other PCH devices, but does it have to be the last one to suspend and the first one to resume for the issue to go away?
Comment 12 Kai-Heng Feng 2021-03-18 04:16:45 UTC
(In reply to Rafael J. Wysocki from comment #11)
> (In reply to Kai-Heng Feng from comment #10)
> > (In reply to Rafael J. Wysocki from comment #9)
> > > (In reply to Kai-Heng Feng from comment #7)
> > > > It's caused by ME. But I have zero clue on what really happened.
> > > 
> > > What exactly does ME stand for in this context?
> > 
> > Intel Management Engine.
> 
> OK, thanks! [But it would help to resolve the TLA at least once in the
> report - or git commit for that matter.]
> 
> If it is caused by the ME, I wonder why the issue is reported against xHCI. 
> Surely other PCH devices are affected too?

Yes other PCI devices are also affected, xHCI is the most affected one.

> 
> > > > This is the workaround [1] I found that can magically make the issue go
> > > > away. It makes ME the last to suspend and first to resume, also no
> async
> > > > suspend/resume by imposing device links to other PCH devices.
> > > 
> > > So this would be a NIC.
> > > 
> > > Does reverting commit 6cecf02e77ab make the issue go away or is the
> > > workaround still needed after reverting it?
> > 
> > The workaround is still needed. The commit just makes the fail rate
> > increased from 1% to 10%.
> 
> I wonder why PM-runtime isn't affected then.  Or maybe it is, but failures
> are less likely to occur in that context.
> 
> Anyway, it looks like the suspend and resume of ME cannot be carried out in
> parallel with the suspend and resume of any other PCH devices, but does it
> have to be the last one to suspend and the first one to resume for the issue
> to go away?

Logically I think it's the case, because ME is at the center of the PCH and has even higher privilege than OS, so it should be the last one to suspend and the first one to resume.

In reality, I didn't test that because disabling async suspend for just one device is hard, as device_suspend() on non-async suspend doesn't wait for already dispatched asysc suspends to complete, so there are still parallel suspend operations happened at the same time.
Comment 13 Sasha Neftin 2021-03-18 06:42:48 UTC
(In reply to Kai-Heng Feng from comment #8)
> Sasha, do you know if the workaround is something should be done? If this is
> a proper quirk we should do, I hope we can put it to e1000e driver...

Hello,
How it related to e1000e? Do you have an option to disable ME via BIOS?
In case xHCI device affected - why do you suggest add w/a to e1000e. We agreed to increase polling time for ULP (you know we much duscussed) - I would suggest going by another way.
Comment 14 Kai-Heng Feng 2021-03-18 06:52:28 UTC
(In reply to Sasha Neftin from comment #13)
> (In reply to Kai-Heng Feng from comment #8)
> > Sasha, do you know if the workaround is something should be done? If this
> is
> > a proper quirk we should do, I hope we can put it to e1000e driver...
> 
> Hello,
> How it related to e1000e?

Because ME is more or less controlled by e1000e?

> Do you have an option to disable ME via BIOS?

According to ODM, ME can only be disabled at hardware level.
Disabling AMT/vPRO in BIOS doesn't disable ME.

> In case xHCI device affected - why do you suggest add w/a to e1000e. We
> agreed to increase polling time for ULP (you know we much duscussed) - I
> would suggest going by another way.

It's to ensure e1000e suspend/resume routine (and also ME related routines) doesn't overlap with other device driver's suspend/resume routine.

Of course we can add a giant lock on PCI core to wait until ME is fully functional on resume.
Comment 15 Sasha Neftin 2021-03-18 09:00:06 UTC
(In reply to Sasha Neftin from comment #13)
> (In reply to Kai-Heng Feng from comment #8)
> > Sasha, do you know if the workaround is something that should be done? If
> this is
> > a proper quirk we should do, I hope we can put it to e1000e driver...
> 
> Hello,
> How it related to e1000e? Do you have an option to disable ME via BIOS?
> In case xHCI device affected - why do you suggest add w/a to e1000e. We
> agreed to increase polling time for ULP (you know we much discussed) - I
> would suggest going by another way.

I would suggest to try to mask MEIvia  .config:
Device Drivers -> Misc devices -> Intel Managment Engine Interface remove <m>
Comment 16 Kai-Heng Feng 2021-03-18 12:18:10 UTC
(In reply to Sasha Neftin from comment #15)
> (In reply to Sasha Neftin from comment #13)
> > (In reply to Kai-Heng Feng from comment #8)
> > > Sasha, do you know if the workaround is something that should be done? If
> > this is
> > > a proper quirk we should do, I hope we can put it to e1000e driver...
> > 
> > Hello,
> > How it related to e1000e? Do you have an option to disable ME via BIOS?
> > In case xHCI device affected - why do you suggest add w/a to e1000e. We
> > agreed to increase polling time for ULP (you know we much discussed) - I
> > would suggest going by another way.
> 
> I would suggest to try to mask MEIvia  .config:
> Device Drivers -> Misc devices -> Intel Managment Engine Interface remove <m>

It doesn't help. Probably because mei_me is just an "Interface"?
Comment 17 Kai-Heng Feng 2021-03-18 12:21:05 UTC
Created attachment 295909 [details]
dmesg with mei_me disabled
Comment 18 Sasha Neftin 2021-03-20 12:49:56 UTC
(In reply to Kai-Heng Feng from comment #14)
> (In reply to Sasha Neftin from comment #13)
> > (In reply to Kai-Heng Feng from comment #8)
> > > Sasha, do you know if the workaround is something should be done? If this
> > is
> > > a proper quirk we should do, I hope we can put it to e1000e driver...
> > 
> > Hello,
> > How it related to e1000e?
> 
> Because ME is more or less controlled by e1000e?
> 
e1000e do not control ME. We need to find ME/AMT contact and work with them to disable the ME. I do not see another way.

> > Do you have an option to disable ME via BIOS?
> 
> According to ODM, ME can only be disabled at hardware level.
> Disabling AMT/vPRO in BIOS doesn't disable ME.
> 
> > In case xHCI device affected - why do you suggest add w/a to e1000e. We
> > agreed to increase polling time for ULP (you know we much duscussed) - I
> > would suggest going by another way.
> 
> It's to ensure e1000e suspend/resume routine (and also ME related routines)
> doesn't overlap with other device driver's suspend/resume routine.
> 
> Of course we can add a giant lock on PCI core to wait until ME is fully
> functional on resume.
Comment 19 Kai-Heng Feng 2021-03-20 14:11:04 UTC
> e1000e do not control ME. We need to find ME/AMT contact and work with them
> to disable the ME. I do not see another way.

Ok, at least ME ULP is more or less related e1000e...
Comment 20 Zhang Rui 2021-03-24 14:53:41 UTC
CC Tomas.
Comment 21 Tomas Winkler 2021-03-24 15:55:18 UTC
Can you please collect additional information 
1. dmidcode
2. MEI firmware version (cat /sys/class/mei/mei0/fw_ver)
Comment 22 Kai-Heng Feng 2021-03-24 17:27:11 UTC
Created attachment 296033 [details]
dmidecode
Comment 23 Kai-Heng Feng 2021-03-24 17:28:03 UTC
$ cat /sys/class/mei/mei0/fw_ver
0:15.0.22.1622
0:15.0.22.1622
0:15.0.22.1622
Comment 24 Kai-Heng Feng 2021-03-25 08:39:10 UTC
I was told that currently [1] is best approach to tackle the issue. Can I get some ACKs from Intel if I send the patch to mailing list?

[1] https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/unstable/commit/?id=6518ab6907040003c96107ebd247c58eab579f9e
Comment 25 Tomas Winkler 2021-03-28 09:49:46 UTC
(In reply to Kai-Heng Feng from comment #24)
> I was told that currently [1] is best approach to tackle the issue. Can I
> get some ACKs from Intel if I send the patch to mailing list?
> 
> [1]
> https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/unstable/
> commit/?id=6518ab6907040003c96107ebd247c58eab579f9e

We will need to investigate it. This doesn't look like a device driver issue, I will need to propagate the findings internally.
Comment 26 Sasha Neftin 2021-03-28 09:49:55 UTC
Created attachment 296101 [details]
attachment-15001-0.html

Out of office. Expected delayed response
Comment 27 Zhang Rui 2021-05-28 07:46:59 UTC
Hi, Tomas,

Do you happen to get any update on this?
Comment 28 Zhang Rui 2021-06-16 07:25:42 UTC
ping...
Comment 29 Alexander Usyskin 2021-07-13 09:22:00 UTC
We are investigating it internally together with team now
Comment 30 Alexander Usyskin 2021-07-13 09:24:26 UTC
(In reply to Kai-Heng Feng from comment #3)
> Also, the issue is not reproducible on Intel RVP.

What is the system that you reproduce it on?
Comment 31 Kai-Heng Feng 2021-07-13 10:56:55 UTC
(In reply to Alexander Usyskin from comment #30)
> (In reply to Kai-Heng Feng from comment #3)
> > Also, the issue is not reproducible on Intel RVP.
> 
> What is the system that you reproduce it on?

It's OptiPlex 7490 AIO.
Comment 32 Zhang Rui 2022-06-21 07:18:46 UTC
May I know the latest status of this issue?
Comment 33 Sasha Neftin 2022-06-21 07:19:28 UTC
Created attachment 301238 [details]
attachment-20658-0.html

Out of office. Expected delayed response
Comment 34 Kai-Heng Feng 2022-07-28 14:36:44 UTC
(In reply to Zhang Rui from comment #32)
> May I know the latest status of this issue?
Looks like it's a hardware defect in B0 stepping.

B1 stepping doesn't seem to have this issue, so I'll drop the workaround we applied in Ubuntu kernel.