Bug 55211

Summary: ASPM enabled on pre-1.1 PCIe devices; causes ath5k and other device malfunctions
Product: Drivers Reporter: Roman Yepishev (roman.yepishev)
Component: PCIAssignee: drivers_pci (drivers_pci)
Status: RESOLVED CODE_FIX    
Severity: normal CC: bjorn, roman.yepishev, yinghai
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.8.0 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: fix aspm sequence
remove not needed checking
Original 3.8 kernel dmesg
3.8 kernel dmesg with patch applied
remove not needed checking
First patch fix aspm sequence
Second patch remove not needed checking

Description Roman Yepishev 2013-03-14 10:04:48 UTC
This bug report is based on investigation for https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1137513

I have an old Acer Aspire One AOA150 netbook which has a BIOS that does not disable ASPM for ath5k wireless device, 168c:001c, Atheros Communications Inc. AR242x / AR542x PCI-Express adapter.

2 years ago in 6ccf15a1a76d2ff915cdef6ae4d12d0170087118 ath5k driver got this:

   pci_disable_link_state(pdev, PCIE_LINK_STATE_L0S);

The card was working fine until Ubuntu raring 13.04 switched to 3.8 and the same symptoms that were fixed by the commit above started to appear - random md5 generator instead of file downloads, SSH disconnects, etc. The card had LnkCtl: ASPM L0s L1 Enabled; according to lspci in 3.8+ while it had ASPM disabled in kernels <= 3.7.

I have tested all the available builds of 3.8 mainline kernel and found that it was introduced early in 3.8 series. Then I did a bisect between 3.7 and 3.8 and found that the first bad commit is:

commit 8c33f51df406e1a1f7fa4e9b244845b7ebd61fa6
Author: Taku Izumi <izumi.taku@jp.fujitsu.com>
Date:   Tue Oct 30 15:27:13 2012 +0900

    PCI/ACPI: Request _OSC control before scanning PCI root bus
    
    This patch moves up the code block to request _OSC control in order to
    separate ACPI work and PCI work in acpi_pci_root_add().
    
    Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

I see the following in dmesg:
[    0.145833] pci_root PNP0A08:00: ACPI _OSC support notification failed, disabling PCIe ASPM
[    0.145840] pci_root PNP0A08:00: Unable to request _OSC control (_OSC support mask: 0x08)
...
[    0.181256] ACPI _OSC control for PCIe not granted, disabling ASPM

So it looks like it is actually doing something, yet ASPM remains enabled causing the card to malfunction during data transfers.
Comment 1 Yinghai Lu 2013-03-14 23:19:47 UTC
Created attachment 95431 [details]
fix aspm sequence

Please check if attached patch fixes the problem.
Comment 2 Yinghai Lu 2013-03-15 01:07:00 UTC
Created attachment 95441 [details]
remove not needed checking

Looks like we can go further to make hotplug working with disabling
aspm in quirks and drivers.

Could just remove aspm_disabled checking.
Comment 3 Roman Yepishev 2013-03-15 08:00:13 UTC
Thank you!

The "fix aspm sequence" patch fixes the issue.

03:00.0 Class [0200]: Device [168c:001c] (rev 01)
	Subsystem: Device [105b:e008]
        ...
        LnkCtl:	ASPM Disabled; RCB 128 bytes Disabled- Retrain- CommClk+
Comment 4 Yinghai Lu 2013-03-16 20:53:00 UTC
Can you post boog log with and without patch?
Comment 5 Roman Yepishev 2013-03-16 22:34:36 UTC
Created attachment 95581 [details]
Original 3.8 kernel dmesg
Comment 6 Roman Yepishev 2013-03-16 22:35:09 UTC
Created attachment 95591 [details]
3.8 kernel dmesg with patch applied
Comment 7 Roman Yepishev 2013-03-16 22:39:00 UTC
Done.

As I see, the difference between the output is:
-pci_root PNP0A08:00: ACPI _OSC support notification failed, disabling PCIe ASPM
+pci_root PNP0A08:00: ACPI _OSC support notification failed, PCIe ASPM will be disabled
...
+pci 0000:03:00.0: disabling ASPM on pre-1.1 PCIe device.  You can enable it with 'pcie_aspm=force'
...
-ACPI _OSC control for PCIe not granted, disabling ASPM
+pci_root PNP0A08:00: ACPI _OSC support notification failed, PCIe ASPM disabled

03:00.0 is Atheros Communications Inc. AR242x / AR542x Wireless Network Adapter (PCI-Express)
Comment 8 Yinghai Lu 2013-03-17 03:21:43 UTC
Created attachment 95601 [details]
remove not needed checking

ok, that is from sanity_checking, that is why old sequence is happened to
work.

Anyway please test new patch that should be more complete and handle
hotplug case.

Thanks
Comment 9 Roman Yepishev 2013-03-17 15:50:23 UTC
Re-tested on two laptops - AOA150, ath5k device got ASPM disabled and on a Lenovo E420 I got ASPM disabled for iwlwifi-driven Intel Corporation Centrino Wireless-N 1000 [8086:0084].

Also I found http://article.gmane.org/gmane.linux.kernel.pci/20640 where a question was raised why pci_disable_link_state stopped doing anything for iwlwifi devices too.
Comment 10 Yinghai Lu 2013-03-17 17:19:41 UTC
On Sun, Mar 17, 2013 at 8:50 AM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=55211

> --- Comment #9 from Roman Yepishev <roman.yepishev@gmail.com>  2013-03-17
> 15:50:23 ---
> Re-tested on two laptops - AOA150, ath5k device got ASPM disabled and on a
> Lenovo E420 I got ASPM disabled for iwlwifi-driven Intel Corporation Centrino
> Wireless-N 1000 [8086:0084].
>
> Also I found http://article.gmane.org/gmane.linux.kernel.pci/20640 where a
> question was raised why pci_disable_link_state stopped doing anything for
> iwlwifi devices too.
>

good.

Did you test first patch or second patch?

Please test second patch only on affected platforms.

old logic is quite strange, on boot path and hotplug path it have
different aspm_disabled setting.
as it could set aspm_disabled after pci root bus scanning.

Second patch will not restore to old logic, and just remove not needed
aspm_disabled checking
for disabling path.

So second patch is right fix, but it need more test.

Thanks

Yinghai
Comment 11 Yinghai Lu 2013-03-17 17:28:17 UTC
Created attachment 95651 [details]
First patch fix aspm sequence

restore old logic
Comment 12 Yinghai Lu 2013-03-17 17:30:14 UTC
Created attachment 95661 [details]
Second patch remove not needed checking

this one should be complete fix that still keep hotplug and boot path have
same aspm_disabled set.
Comment 13 Roman Yepishev 2013-03-17 18:22:12 UTC
I got a bit confused regarding the fixes, so tested everything with the netbook/ath5k:

Booting without patches:
LnkCtl: ASPM L0s L1 Enabled;

Booting with first patch only,  "fix aspm sequence", id 95651
LnkCtl: ASPM Disabled;

Booting with second patch only, "remove not needed checking", id 95661
LnkCtl: ASPM Disabled;

Booting with first+second patch applied:
LnkCtl: ASPM Disabled;

So, in any case the issue appeared fixed. As I don't have any hotpluggable PCI-X device I am not sure how to test the hotplug interaction.
Comment 14 Bjorn Helgaas 2013-04-02 17:40:21 UTC
I think this regression has nothing to do with pci_disable_link_state().

When aspm_disabled is set, pci_disable_link_state() doesn't do anything.

In both 3.7 and 3.8, aspm_disabled is set in acpi_pci_root_add() before
any driver probe routines are run, so it looks like calling
pci_disable_link_state() from a driver had no effect even in 3.7.  This
is a problem, of course, but not the one Roman is seeing, because ath5k
calls pci_disable_link_state() from the driver probe routine.

There are also PCI_FIXUP_FINAL quirks for 82575 and 82598 NICs that call
pci_disable_link_state().  In 3.7, these quirks are run before
aspm_disabled is set, but 8c33f51d moved the pcie_no_aspm() call up
before we start scanning the bus, so in 3.8, aspm_disabled is set
*before* we run them.  I think that means 8c33f51d broke all these
quirks.  That's also a problem, of course, but this isn't the one Roman
is seeing either.

I think the problem Roman is seeing happens when
pcie_aspm_init_link_state() calls pcie_aspm_sanity_check() during device
enumeration.  In 3.8, the fact that aspm_disabled is already set by the
time we get here means we skip the check for pre-1.1 PCIe devices, and
I think *this* is what Roman is seeing.  The dmesg log attached at
comment #6 (https://bugzilla.kernel.org/attachment.cgi?id=95591) confirms
that Roman's ath5k is indeed a pre-1.1 device:

  pci 0000:03:00.0: disabling ASPM on pre-1.1 PCIe device.  ...
  ath5k 0000:03:00.0: registered as 'phy0'

I'm not comfortable with any of Yinghai's patches yet.  I know they fix
Roman's issue, but I'm worried they may introduce other regressions, so
I plan to simply revert 8c33f51d for now.
Comment 15 Bjorn Helgaas 2013-04-08 16:23:50 UTC
This should be fixed by this commit, which appeared in v3.9-rc6:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/acpi/pci_root.c?id=b8178f130e25c1bdac1c33e0996f1ff6e20ec08e
Comment 16 Bjorn Helgaas 2013-04-30 19:38:33 UTC
This should be resolved in v3.9.  Please reopen if you disagree.

Note that bug #57331 is a similar but different ASPM problem where a driver uses pci_disable_link_state(..., L0S | L1 | CKLPM), but the device still enters L1.