Bug 53811

Summary:	Lenovo Thinkpad X220: acpiphp broken from 3.6 onwards
Product:	ACPI	Reporter:	ALLurGroceries (allurgroceries)
Component:	Config-Hotplug	Assignee:	Huang Ying (ying.huang)
Status:	CLOSED CODE_FIX
Severity:	normal	CC:	allurgroceries, bjorn, rjw, ying.huang
Priority:	P1
Hardware:	All
OS:	Linux
Kernel Version:	3.6	Subsystem:
Regression:	Yes	Bisected commit-id:
Attachments:	lspci reverse the blacklist test dmesg_good.txt dmesg_fail.txt Debug patch for acpihp hotplug pci acpidump Disable runtime power management for hotplugable root port Debug patch 3 for acpihp hotplug Debug patch 4 for acpihp hotplug Debug patch 5 for acpiphp hotplug

Description ALLurGroceries 2013-02-14 06:55:28 UTC

Created attachment 93291 [details]
lspci

Distribution: Debian Sid AMD64
Hardware: Lenovo X220, i5-2520M, 8GB

Hotplugging the expresscard no longer works after 3.5.

Bisected to commit 71a83bd727cc31c5fe960c3758cb396267ff710e

"PCI/PM: add runtime PM support to PCIe port"

"Because runtime suspend is broken for some chipsets, a black list is used to disable runtime PM support for these chipsets."

I'm not sure how the blacklist is supposed to look, so I reversed the test and the problem is gone. Here's a diff against 3.7.7:

diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index ebb3e5f..b427aad 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -225,7 +225,7 @@ static int __devinit pcie_portdrv_probe(struct pci_dev *dev,
 	 * it by default.
 	 */
 	dev->d3cold_allowed = false;
-	if (!pci_match_id(port_runtime_pm_black_list, dev))
+	if (pci_match_id(port_runtime_pm_black_list, dev))
 		pm_runtime_put_noidle(&dev->dev);
 
 	return 0;

Comment 1 ALLurGroceries 2013-02-14 06:57:30 UTC

Created attachment 93301 [details]
reverse the blacklist test

Comment 2 Bjorn Helgaas 2013-02-15 17:44:38 UTC

Created attachment 93401 [details]
dmesg_good.txt

Comment 3 Bjorn Helgaas 2013-02-15 17:45:32 UTC

Created attachment 93411 [details]
dmesg_fail.txt

Comment 4 Huang Ying 2013-02-26 07:57:31 UTC

Created attachment 94101 [details]
Debug patch for acpihp hotplug

Comment 5 Huang Ying 2013-02-26 08:01:12 UTC

(In reply to comment #4)
> Created an attachment (id=94101) [details]
> Debug patch for acpihp hotplug

Hi, ALLurGroceries,

Can you try the above patch?

Comment 6 ALLurGroceries 2013-02-26 17:53:48 UTC

(In reply to comment #5)
> (In reply to comment #4)
> > Created an attachment (id=94101) [details] [details]
> > Debug patch for acpihp hotplug
> 
> Hi, ALLurGroceries,
> 
> Can you try the above patch?

Hi Huang,

I tried your patch but I am not sure what it did. I don't get more debug output from acpiphp or pci_hotplug on a hotplug event:

[   70.038722] acpiphp_glue: _handle_hotplug_event_bridge: Bus check notify on \_SB_.PCI0.EXP4
[   70.038736] acpiphp_glue: _handle_hotplug_event_bridge: re-enumerating slots under \_SB_.PCI0.EXP4
[   70.038750] acpiphp_glue: acpiphp_check_bridge: 0 enabled, 0 disabled

Comment 7 ALLurGroceries 2013-02-26 18:04:37 UTC

(In reply to comment #6)
> Hi Huang,
> 
> I tried your patch but I am not sure what it did. I don't get more debug
> output
> from acpiphp or pci_hotplug on a hotplug event:
> 
> [   70.038722] acpiphp_glue: _handle_hotplug_event_bridge: Bus check notify
> on
> \_SB_.PCI0.EXP4
> [   70.038736] acpiphp_glue: _handle_hotplug_event_bridge: re-enumerating
> slots
> under \_SB_.PCI0.EXP4
> [   70.038750] acpiphp_glue: acpiphp_check_bridge: 0 enabled, 0 disabled

I tried hotplugging again and this time it worked! I'm not sure what happened.

Plug:

[  281.211439] acpiphp_glue: _handle_hotplug_event_bridge: Bus check notify on \_SB_.PCI0.EXP4
[  281.211449] acpiphp_glue: _handle_hotplug_event_bridge: re-enumerating slots under \_SB_.PCI0.EXP4
[  281.211509] pci 0000:05:00.0: [197b:2360] type 00 class 0x010601
[  281.211578] pci 0000:05:00.0: reg 10: [io  0x0000-0x0007]
[  281.211604] pci 0000:05:00.0: reg 14: [io  0x0000-0x0003]
[  281.211630] pci 0000:05:00.0: reg 18: [io  0x0000-0x0007]
[  281.211656] pci 0000:05:00.0: reg 1c: [io  0x0000-0x0003]
[  281.211681] pci 0000:05:00.0: reg 20: [io  0x0000-0x000f]
[  281.211708] pci 0000:05:00.0: reg 24: [mem 0x00000000-0x00001fff]
[  281.211828] pci 0000:05:00.0: PME# supported from D3hot
[  281.219542] pci 0000:05:00.0: BAR 5: assigned [mem 0xf1c00000-0xf1c01fff]
[  281.219563] pci 0000:05:00.0: BAR 4: assigned [io  0x4000-0x400f]
[  281.219576] pci 0000:05:00.0: BAR 0: assigned [io  0x4010-0x4017]
[  281.219588] pci 0000:05:00.0: BAR 2: assigned [io  0x4018-0x401f]
[  281.219601] pci 0000:05:00.0: BAR 1: assigned [io  0x4020-0x4023]
[  281.219613] pci 0000:05:00.0: BAR 3: assigned [io  0x4024-0x4027]
[  281.219641] pci 0000:05:00.0: no hotplug settings from platform
[  281.219860] ahci 0000:05:00.0: enabling device (0000 -> 0003)
[  281.235240] ahci 0000:05:00.0: AHCI 0001.0000 32 slots 1 ports 3 Gbps 0x1 impl SATA mode
[  281.235260] ahci 0000:05:00.0: flags: 64bit ncq pm led clo pmp pio slum part 
[  281.235924] scsi9 : ahci
[  281.236061] ata10: SATA max UDMA/133 abar m8192@0xf1c00000 port 0xf1c00100 irq 19
[  281.236130] acpiphp_glue: acpiphp_check_bridge: 1 enabled, 0 disabled
[  281.554635] ata10: SATA link down (SStatus 0 SControl 300)


Unplug:

[  358.759492] acpiphp_glue: _handle_hotplug_event_bridge: Bus check notify on \_SB_.PCI0.EXP4
[  358.759505] acpiphp_glue: _handle_hotplug_event_bridge: re-enumerating slots under \_SB_.PCI0.EXP4
[  359.262150] ata10: failed to stop engine (-5)
[  359.262602] acpiphp_glue: acpiphp_check_bridge: 0 enabled, 1 disabled

Comment 8 Rafael J. Wysocki 2013-02-26 23:07:57 UTC

I need to see the output of lspci -vvv from the affected machine and the output of

$ find /sys/devices/pci0000\:00/ -name control -print -exec cat {} \;

Comment 9 Rafael J. Wysocki 2013-02-26 23:12:25 UTC

And the output of acpidump from it pretty please.

Do subsequent attempts to hotplug/unplug work with the patch applied?

Comment 10 ALLurGroceries 2013-02-26 23:24:24 UTC

Created attachment 94141 [details]
pci

Comment 11 ALLurGroceries 2013-02-26 23:26:15 UTC

Created attachment 94151 [details]
acpidump

Comment 12 ALLurGroceries 2013-02-26 23:29:20 UTC

(In reply to comment #8)
> I need to see the output of lspci -vvv from the affected machine and the
> output
> of

I attached lspci -vvv in the initial report:
https://bugzilla.kernel.org/attachment.cgi?id=93291

> $ find /sys/devices/pci0000\:00/ -name control -print -exec cat {} \;

https://bugzilla.kernel.org/attachment.cgi?id=94141

(In reply to comment #9)
> And the output of acpidump from it pretty please.

https://bugzilla.kernel.org/attachment.cgi?id=94151

> Do subsequent attempts to hotplug/unplug work with the patch applied?

Yes, with the one card I have on hand now:
05:00.0 SATA controller: JMicron Technology Corp. JMB360 AHCI Controller (rev
02)

Should I try more cards? I have one or two more that are not USB bridged (those
worked fine without any patches, if I remember correctly).

Comment 13 Rafael J. Wysocki 2013-02-26 23:58:12 UTC

Thanks!

One card should be enough.

Does the first hotplug attempt always reliably fail (with the patch applied)?

Comment 14 ALLurGroceries 2013-02-27 01:02:53 UTC

(In reply to comment #13)
> Does the first hotplug attempt always reliably fail (with the patch applied)?

Hm, no. I tried around 20+ insert/eject cycles and had no problems.

I rebooted a few times and the first hotplug didn't fail.

I was ready to chalk it up to user error when just now I hotplugged again and it failed. Let me know what else I can do to attack this.

Here's my dmesg output (failed hotplug at 55.822641):

[   52.024976] ata7: failed to stop engine (-5)
[   52.025393] acpiphp_glue: acpiphp_check_bridge: 0 enabled, 1 disabled
[   55.822621] acpiphp_glue: _handle_hotplug_event_bridge: Bus check notify on \_SB_.PCI0.EXP4
[   55.822632] acpiphp_glue: _handle_hotplug_event_bridge: re-enumerating slots under \_SB_.PCI0.EXP4
[   55.822641] acpiphp_glue: acpiphp_check_bridge: 0 enabled, 0 disabled
[   82.389043] acpiphp_glue: _handle_hotplug_event_bridge: Bus check notify on \_SB_.PCI0.EXP4
[   82.389054] acpiphp_glue: _handle_hotplug_event_bridge: re-enumerating slots under \_SB_.PCI0.EXP4
[   82.389063] acpiphp_glue: acpiphp_check_bridge: 0 enabled, 0 disabled
[   86.733002] acpiphp_glue: _handle_hotplug_event_bridge: Bus check notify on \_SB_.PCI0.EXP4
[   86.733013] acpiphp_glue: _handle_hotplug_event_bridge: re-enumerating slots under \_SB_.PCI0.EXP4
[   86.733073] pci 0000:05:00.0: [197b:2360] type 00 class 0x010601
[   86.733143] pci 0000:05:00.0: reg 10: [io  0x0000-0x0007]
[   86.733169] pci 0000:05:00.0: reg 14: [io  0x0000-0x0003]
[   86.733195] pci 0000:05:00.0: reg 18: [io  0x0000-0x0007]
[   86.733221] pci 0000:05:00.0: reg 1c: [io  0x0000-0x0003]
[   86.733248] pci 0000:05:00.0: reg 20: [io  0x0000-0x000f]
[   86.733275] pci 0000:05:00.0: reg 24: [mem 0x00000000-0x00001fff]
[   86.733395] pci 0000:05:00.0: PME# supported from D3hot
[   86.741022] pci 0000:05:00.0: BAR 5: assigned [mem 0xf1c00000-0xf1c01fff]
[   86.741044] pci 0000:05:00.0: BAR 4: assigned [io  0x4000-0x400f]
[   86.741056] pci 0000:05:00.0: BAR 0: assigned [io  0x4010-0x4017]
[   86.741069] pci 0000:05:00.0: BAR 2: assigned [io  0x4018-0x401f]
[   86.741081] pci 0000:05:00.0: BAR 1: assigned [io  0x4020-0x4023]
[   86.741093] pci 0000:05:00.0: BAR 3: assigned [io  0x4024-0x4027]
[   86.741121] pci 0000:05:00.0: no hotplug settings from platform
[   86.741342] ahci 0000:05:00.0: enabling device (0000 -> 0003)
[   86.756768] ahci 0000:05:00.0: AHCI 0001.0000 32 slots 1 ports 3 Gbps 0x1 impl SATA mode
[   86.756773] ahci 0000:05:00.0: flags: 64bit ncq pm led clo pmp pio slum part 
[   86.757333] scsi7 : ahci
[   86.757461] ata8: SATA max UDMA/133 abar m8192@0xf1c00000 port 0xf1c00100 irq 19
[   86.757530] acpiphp_glue: acpiphp_check_bridge: 1 enabled, 0 disabled
[   87.079951] ata8: SATA link down (SStatus 0 SControl 300)

Comment 15 Rafael J. Wysocki 2013-02-27 14:08:02 UTC

Please remove the debug patch, do

# echo on > /sys/devices/pci0000:00/0000:00:1c.3/power/control

and see if hotplug works after that.

Comment 16 ALLurGroceries 2013-02-28 00:29:59 UTC

(In reply to comment #15)
> Please remove the debug patch, do
> 
> # echo on > /sys/devices/pci0000:00/0000:00:1c.3/power/control
> 
> and see if hotplug works after that.

It works.

Comment 17 Rafael J. Wysocki 2013-02-28 01:03:01 UTC

OK, thanks.

Why exactly do you have 'auto' in that file?

The kernel doesn't put it in there by default, so either you or your distro did that.

Comment 18 ALLurGroceries 2013-02-28 01:14:17 UTC

(In reply to comment #17)
> Why exactly do you have 'auto' in that file?
> 
> The kernel doesn't put it in there by default, so either you or your distro
> did
> that.

Where would it be configured?

Comment 19 Rafael J. Wysocki 2013-02-28 01:21:23 UTC

Somewhere in the init scripts (or systemd configuration nowadays).

The distro I use doesn't do that, so I don't know really. :-)

Do you use powertop?

Anyway, you can make your init scripts write "on" to /sys/devices/pci0000:00/0000:00:1c.3/power/control during boot as a workaround for the time being.

Comment 20 ALLurGroceries 2013-02-28 01:26:28 UTC

(In reply to comment #18)
> Where would it be configured?

I found it. It's configured by laptop mode tools.

I changed its config file and now it is not managing runtime PM.

But power consumption is higher.

Comment 21 Rafael J. Wysocki 2013-02-28 21:54:59 UTC

Yes, it is.

So here's a tradeoff: either you can enable the laptop mode's power saving, but then you need to wake up the port manually (like in comment #15) before inserting the hot-plug device, or you can disable it, but power consumption will be higher.

Can you please check if the number of ACPI interrupts (in /proc/interrupts) increases when you try to insert the removable device when there's 'auto' in the 'power/control' file of the port?

Just do "cat /proc/interrupts | grep acpi" before and after inserting the card and see if the number in the second column changes.

Comment 22 ALLurGroceries 2013-03-01 20:08:56 UTC

(In reply to comment #21)
> Can you please check if the number of ACPI interrupts (in /proc/interrupts)
> increases when you try to insert the removable device when there's 'auto' in
> the 'power/control' file of the port?
> 
> Just do "cat /proc/interrupts | grep acpi" before and after inserting the
> card
> and see if the number in the second column changes.

On 3.7.9, unpatched from a fresh boot:

cat /sys/devices/pci0000:00/0000:00:1c.3/power/control
auto

Before hotplug:

cat /proc/interrupts | grep acpi
  9:        663          0          0          0   IO-APIC-fasteoi   acpi

After hotplug:

cat /proc/interrupts | grep acpi
  9:        712          0          0          0   IO-APIC-fasteoi   acpi

Comment 23 Rafael J. Wysocki 2013-03-01 22:13:23 UTC

Can you also look into the gpe* files in /sys/firmware/acpi/interrupts/ and see which of the numbers increase when you insert the card (in the configuration like in comment #22)?

Comment 24 Huang Ying 2013-03-03 13:08:33 UTC

(In reply to comment #14)
> (In reply to comment #13)
> > Does the first hotplug attempt always reliably fail (with the patch
> applied)?
> 
> Hm, no. I tried around 20+ insert/eject cycles and had no problems.
> 
> I rebooted a few times and the first hotplug didn't fail.
> 
> I was ready to chalk it up to user error when just now I hotplugged again and
> it failed. Let me know what else I can do to attack this.
> 
> Here's my dmesg output (failed hotplug at 55.822641):

[snip]

With the debug patch, could you find any pattern which will trigger the failing?  Or just randomly?  Is it possible for you to try more times?  And can you provide the model number of your express card?

Comment 25 ALLurGroceries 2013-03-03 18:37:27 UTC

(In reply to comment #23)
> Can you also look into the gpe* files in /sys/firmware/acpi/interrupts/ and
> see
> which of the numbers increase when you insert the card (in the configuration
> like in comment #22)?

-/sys/firmware/acpi/interrupts/gpe01:       2   enabled
+/sys/firmware/acpi/interrupts/gpe01:       3   enabled

-/sys/firmware/acpi/interrupts/gpe11:     777   enabled
+/sys/firmware/acpi/interrupts/gpe11:     816   enabled

-/sys/firmware/acpi/interrupts/gpe_all:     780
+/sys/firmware/acpi/interrupts/gpe_all:     820


(In reply to comment #24)
> [snip]
> 
> With the debug patch, could you find any pattern which will trigger the
> failing?  Or just randomly?  Is it possible for you to try more times?  And
> can
> you provide the model number of your express card?

As far as I can tell it fails at random. I cannot reliably trigger it.

The expresscard I've been using is labeled "I/Ocrest e-Sata / USB2.0", the chipset is JMicron Technology Corp. JMB360

Comment 26 Huang Ying 2013-03-04 08:28:46 UTC

Created attachment 94341 [details]
Disable runtime power management for hotplugable root port

Comment 27 Huang Ying 2013-03-04 08:29:26 UTC

(In reply to comment #26)
> Created an attachment (id=94341) [details]
> Disable runtime power management for hotplugable root port

Hi, ALLurGroceries,

Could you try the above patch?

Comment 28 ALLurGroceries 2013-03-09 20:33:22 UTC

(In reply to comment #27)
> (In reply to comment #26)
> > Created an attachment (id=94341) [details] [details]
> > Disable runtime power management for hotplugable root port
> 
> Hi, ALLurGroceries,
> 
> Could you try the above patch?

Very sorry for the long delay, I hadn't had an opportunity where I could reboot until now.

Hotplugging works with your patch.

[   60.766010] acpiphp_glue: _handle_hotplug_event_bridge: Bus check notify on \_SB_.PCI0.EXP4
[   60.766021] acpiphp_glue: _handle_hotplug_event_bridge: re-enumerating slots under \_SB_.PCI0.EXP4
[   60.766082] pci 0000:05:00.0: [197b:2360] type 00 class 0x010601
[   60.766153] pci 0000:05:00.0: reg 10: [io  0x0000-0x0007]
[   60.766179] pci 0000:05:00.0: reg 14: [io  0x0000-0x0003]
[   60.766205] pci 0000:05:00.0: reg 18: [io  0x0000-0x0007]
[   60.766231] pci 0000:05:00.0: reg 1c: [io  0x0000-0x0003]
[   60.766258] pci 0000:05:00.0: reg 20: [io  0x0000-0x000f]
[   60.766285] pci 0000:05:00.0: reg 24: [mem 0x00000000-0x00001fff]
[   60.766406] pci 0000:05:00.0: PME# supported from D3hot
[   60.773741] acpiphp_glue: bus exists... trim
[   60.773900] acpiphp_glue: acpi_bus_trim return 0
[   60.774111] pci 0000:05:00.0: BAR 5: assigned [mem 0xf1c00000-0xf1c01fff]
[   60.774129] pci 0000:05:00.0: BAR 4: assigned [io  0x4000-0x400f]
[   60.774143] pci 0000:05:00.0: BAR 0: assigned [io  0x4010-0x4017]
[   60.774157] pci 0000:05:00.0: BAR 2: assigned [io  0x4018-0x401f]
[   60.774170] pci 0000:05:00.0: BAR 1: assigned [io  0x4020-0x4023]
[   60.774183] pci 0000:05:00.0: BAR 3: assigned [io  0x4024-0x4027]
[   60.774212] pci 0000:05:00.0: no hotplug settings from platform
[   60.774515] ahci 0000:05:00.0: enabling device (0000 -> 0003)
[   60.789727] ahci 0000:05:00.0: AHCI 0001.0000 32 slots 1 ports 3 Gbps 0x1 impl SATA mode
[   60.789746] ahci 0000:05:00.0: flags: 64bit ncq pm led clo pmp pio slum part 
[   60.790629] scsi6 : ahci
[   60.790864] ata7: SATA max UDMA/133 abar m8192@0xf1c00000 port 0xf1c00100 irq 19
[   60.790921] acpiphp_glue: acpiphp_check_bridge: 1 enabled, 0 disabled
[   61.108940] ata7: SATA link down (SStatus 0 SControl 300)

Comment 29 Bjorn Helgaas 2013-04-03 21:26:50 UTC

To summarize, we use acpiphp on this machine because the BIOS did not grant us permission to use PCIe native hotplug.  acpiphp relies on an SCI generated by the root port (00:1c.3 in this case) when a hotplug event occurs.  If runtime power management turns off the root port, the SCI doesn't happen and we don't notice the hotplug event.

The comment #26 patch avoids this problem by noticing that there is a hotplug slot registered on bus 05 (the secondary bus of 00:1c.3), and keeping us from turning off 00:1c.3.  acpiphp registered this slot because the ACPI EXP4 device for 00:1c.3 contains an _ADR method and a SLOT device with an _RMV method.

However, I don't think this is a complete solution because acpiphp does not claim all root ports that can generate hotplug SCI events.  Bug #54981 is one example.  Root port 00:1c.4 on this machine is another.  The BIOS clearly expects SCI interrupts from EXP5 (the 00:1c.4 root port, which leads to the SDXC/MMC Host Controller at 0d:00.0), but acpiphp doesn't claim it because it has no _RMV method, so we may still turn it off.

Comment 30 Huang Ying 2013-04-04 12:21:30 UTC

(In reply to comment #29)
> To summarize, we use acpiphp on this machine because the BIOS did not grant
> us
> permission to use PCIe native hotplug.  acpiphp relies on an SCI generated by
> the root port (00:1c.3 in this case) when a hotplug event occurs.  If runtime
> power management turns off the root port, the SCI doesn't happen and we don't
> notice the hotplug event.

In comment #3 dmesg_fail.txt, when card is hot-added:

 [   38.893118] acpiphp_glue: _handle_hotplug_event_bridge: Bus check notify on \_SB_.PCI0.EXP4
 [   38.893121] acpiphp_glue: _handle_hotplug_event_bridge: re-enumerating slots under \_SB_.PCI0.EXP4
 [   38.893125] acpiphp_glue: acpiphp_check_bridge: 0 enabled, 0 disabled

It appears that the SCI is generated when hotplug event occurs even if the root port is in low power state.  But because the configuration space of subordinate devices is not accessible when the root port is in low power state, the hotplugged device is not detected.

> The comment #26 patch avoids this problem by noticing that there is a hotplug
> slot registered on bus 05 (the secondary bus of 00:1c.3), and keeping us from
> turning off 00:1c.3.  acpiphp registered this slot because the ACPI EXP4
> device
> for 00:1c.3 contains an _ADR method and a SLOT device with an _RMV method.
> 
> However, I don't think this is a complete solution because acpiphp does not
> claim all root ports that can generate hotplug SCI events.  Bug #54981 is one
> example.

In #54981, if acpiphp is used, PCIe hotplug does not work regardless we put the root port into low power state or not.  If pciehp is used, the pciehp will register the hotplug slots, so that we will prevent putting the root port into low power state.  So I think the comment #26 patch is a practical solution.

> Root port 00:1c.4 on this machine is another.  The BIOS clearly
> expects SCI interrupts from EXP5 (the 00:1c.4 root port, which leads to the
> SDXC/MMC Host Controller at 0d:00.0), but acpiphp doesn't claim it because it
> has no _RMV method, so we may still turn it off.

Although I can not verify it on every machine, I think in general the SCI interrupt can be generated for the device under a root port even if we put the root port into low power state.  In my x220, a Renesas PCIe express card (cold plugged) can generate SCI when new USB device is plugged even if the root port 00:1c.3 is in low power state.  I can not test 00:1c.4, because it appears that runtime PM for SDXC/MMC HC at 0d:00.0 does not work for me.  But I think it should work.

Comment 31 Bjorn Helgaas 2013-04-04 15:22:55 UTC

> https://bugzilla.kernel.org/show_bug.cgi?id=53811

> It appears that the SCI is generated when hotplug event occurs even if the
> root
> port is in low power state.  But because the configuration space of
> subordinate
> devices is not accessible when the root port is in low power state, the
> hotplugged device is not detected.

If the problem is with config space access, not with the SCI, why
can't you just put the root port back in D0 before scanning its
secondary bus?  That seems like a more direct fix.

>> However, I don't think this is a complete solution because acpiphp does not
>> claim all root ports that can generate hotplug SCI events.  Bug #54981 is
>> one
>> example.
>
> In #54981, if acpiphp is used, PCIe hotplug does not work regardless we put
> the
> root port into low power state or not.  If pciehp is used, the pciehp will
> register the hotplug slots, so that we will prevent putting the root port
> into
> low power state.  So I think the comment #26 patch is a practical solution.

My point here is not about power; only that acpiphp cannot reliably
identify all slots, so we can't rely on slot presence as a way to
decide what to do about power.  On the machine in bug #54981, we must
use acpiphp because the BIOS doesn't let us use pciehp.

Comment 32 Huang Ying 2013-04-05 10:12:26 UTC

(In reply to comment #31)
> > https://bugzilla.kernel.org/show_bug.cgi?id=53811
> 
> > It appears that the SCI is generated when hotplug event occurs even if the
> root
> > port is in low power state.  But because the configuration space of
> subordinate
> > devices is not accessible when the root port is in low power state, the
> > hotplugged device is not detected.
> 
> If the problem is with config space access, not with the SCI, why
> can't you just put the root port back in D0 before scanning its
> secondary bus?  That seems like a more direct fix.

comment #4 patch does exactly like that.  ALLurGroceries tried that patch and it works for him in comment #7.  But in comment #14, ALLurGroceries reported that it does not work reliably enough.  So I write comment #26 patch to fix the issue in a more conservative way, and it works reliable for ALLurGroceries.
 
> >> However, I don't think this is a complete solution because acpiphp does
> not
> >> claim all root ports that can generate hotplug SCI events.  Bug #54981 is
> one
> >> example.
> >
> > In #54981, if acpiphp is used, PCIe hotplug does not work regardless we put
> the
> > root port into low power state or not.  If pciehp is used, the pciehp will
> > register the hotplug slots, so that we will prevent putting the root port
> into
> > low power state.  So I think the comment #26 patch is a practical solution.
> 
> My point here is not about power; only that acpiphp cannot reliably
> identify all slots, so we can't rely on slot presence as a way to
> decide what to do about power.  On the machine in bug #54981, we must
> use acpiphp because the BIOS doesn't let us use pciehp.

If acpiphp cannot identify the slots, acpiphp cannot work regardless whether the bridge is put into low power state.  If acpiphp can identify the slots, with the help of comment #26 patch, acpiphp can work properly and the root port will not be put into low power state.  So I think the point is to make acpiphp identify all slots.  And comment #26 patch is a valid fix.

Comment 33 Bjorn Helgaas 2013-04-05 16:56:54 UTC

>> If the problem is with config space access, not with the SCI, why
>> can't you just put the root port back in D0 before scanning its
>> secondary bus?  That seems like a more direct fix.
>
> comment #4 patch does exactly like that.  ALLurGroceries tried that patch and
> it works for him in comment #7.  But in comment #14, ALLurGroceries reported
> that it does not work reliably enough.  So I write comment #26 patch to fix
> the
> issue in a more conservative way, and it works reliable for ALLurGroceries.

The comment #4 patch adds "pm_runtime_get_sync(&bridge->pci_dev->dev)"
before re-enumerating.  If that is supposed to keep the bridge in D0
but doesn't, we need to figure out why.  I don't know how PM works,
but maybe Rafael will chime in.

> If acpiphp cannot identify the slots, acpiphp cannot work regardless whether
> the bridge is put into low power state.  If acpiphp can identify the slots,
> with the help of comment #26 patch, acpiphp can work properly and the root
> port
> will not be put into low power state.  So I think the point is to make
> acpiphp
> identify all slots.  And comment #26 patch is a valid fix.

Please look at the bug #54981 DSDT and explain either (1) why we
should use pciehp rather than acpiphp on that machine, or (2) how we
can make acpiphp identify the ExpressCard slot.

My contention is that the "slot identification" model currently used
by acpiphp is fundamentally broken and that we should change acpiphp
so it doesn't depend on identifying slots.  I don't believe it is
possible to identify all possible slots by looking at the DSDT, and
therefore, I don't want to make changes like the comment #26 patch
that depend on identifying slots.

Comment 34 Rafael J. Wysocki 2013-04-05 19:50:29 UTC

On Friday, April 05, 2013 10:56:30 AM Bjorn Helgaas wrote:
> >> If the problem is with config space access, not with the SCI, why
> >> can't you just put the root port back in D0 before scanning its
> >> secondary bus?  That seems like a more direct fix.
> >
> > comment #4 patch does exactly like that.  ALLurGroceries tried that patch
> and
> > it works for him in comment #7.  But in comment #14, ALLurGroceries
> reported
> > that it does not work reliably enough.  So I write comment #26 patch to fix
> the
> > issue in a more conservative way, and it works reliable for ALLurGroceries.
> 
> The comment #4 patch adds "pm_runtime_get_sync(&bridge->pci_dev->dev)"
> before re-enumerating.  If that is supposed to keep the bridge in D0
> but doesn't, we need to figure out why.  I don't know how PM works,
> but maybe Rafael will chime in.

Quite frankly, this looks like unreliable hardware to me in the face of
comment #14.  It looks like with that patch applied the problem is randomly
reproducible, which suggests that we probably don't wait for a sufficiently
long time during the PCIe port resume.  Or there's a race I'm not seeing at
the moment.

So it might be a good idea to repeat the experiment with a similar patch,
but with increased timeouts (and the "put" part need not be "sync").

Comment 35 Huang Ying 2013-04-06 06:00:21 UTC

(In reply to comment #34)
> On Friday, April 05, 2013 10:56:30 AM Bjorn Helgaas wrote:
> > >> If the problem is with config space access, not with the SCI, why
> > >> can't you just put the root port back in D0 before scanning its
> > >> secondary bus?  That seems like a more direct fix.
> > >
> > > comment #4 patch does exactly like that.  ALLurGroceries tried that patch
> and
> > > it works for him in comment #7.  But in comment #14, ALLurGroceries
> reported
> > > that it does not work reliably enough.  So I write comment #26 patch to
> fix the
> > > issue in a more conservative way, and it works reliable for
> ALLurGroceries.
> > 
> > The comment #4 patch adds "pm_runtime_get_sync(&bridge->pci_dev->dev)"
> > before re-enumerating.  If that is supposed to keep the bridge in D0
> > but doesn't, we need to figure out why.  I don't know how PM works,
> > but maybe Rafael will chime in.
> 
> Quite frankly, this looks like unreliable hardware to me in the face of
> comment #14.  It looks like with that patch applied the problem is randomly
> reproducible, which suggests that we probably don't wait for a sufficiently
> long time during the PCIe port resume.  Or there's a race I'm not seeing at
> the moment.
> 
> So it might be a good idea to repeat the experiment with a similar patch,
> but with increased timeouts (and the "put" part need not be "sync").

OK.  I will try this.

Comment 36 Huang Ying 2013-04-08 01:35:46 UTC

Created attachment 97651 [details]
Debug patch 3 for acpihp hotplug

Hi, ALLurGroceries,

Could you try this new patch?  The comment #26 patch is considered not good enough.

Comment 37 ALLurGroceries 2013-04-09 18:21:00 UTC

(In reply to comment #36)
> Created an attachment (id=97651) [details]
> Debug patch 3 for acpihp hotplug
> 
> Hi, ALLurGroceries,
> 
> Could you try this new patch?  The comment #26 patch is considered not good
> enough.

I have tested it for a day and I haven't gotten a hotplug failure yet.

However, I am getting random hard lockups with 3.9-rc6. I haven't tested rc6 without your patch, so I will do that now and report back.

Comment 38 ALLurGroceries 2013-04-13 21:30:03 UTC

(In reply to comment #37)
> I am getting random hard lockups with 3.9-rc6. I haven't tested rc6
> without your patch, so I will do that now and report back.

I am still getting lockups even without the patch. But hotplugging works just fine without the patch applied on 3.9-rc6.

Comment 39 Huang Ying 2013-04-14 07:51:31 UTC

Created attachment 98541 [details]
 Debug patch 4 for acpihp hotplug

Comment 40 Huang Ying 2013-04-14 07:59:09 UTC

(In reply to comment #38)
> (In reply to comment #37)
> > I am getting random hard lockups with 3.9-rc6. I haven't tested rc6
> > without your patch, so I will do that now and report back.
> 
> I am still getting lockups even without the patch. But hotplugging works just
> fine without the patch applied on 3.9-rc6.

Thanks a lot!  Oh, I checked 3.9-rc6, and found it fixed the issue in another way, disable runtime PM for PCIe port totally.  But I still hope I can fix the bug without disabling runtime PM for PCIe port totally.  So some people can benefit from the power saving from runtime PM support of PCIe port.

Finally, I got a eSATA expresscard as follow:

05:00.0 SATA controller: JMicron Technology Corp. JMB362 SATA Controller (rev 10)

The label on card is:

"Orico PNE2-EC54 EXPRESS TO eSATA 2 PORT"

It appears the brand is different from yours, but the chipset is similar.

I tested it on my x220 with

a) 3.9-rc5
b) 3.9-rc5 + comment #4 patch
c) 3.9-rc5 + comment #39 patch

a) has same bug as you.  But b) and c) works good for me for tens hotadd/hotremove.

So is it possible for you to test b) another time and c)?

Comment 41 ALLurGroceries 2013-04-15 01:17:27 UTC

(In reply to comment #40)
> a) 3.9-rc5
> b) 3.9-rc5 + comment #4 patch
> c) 3.9-rc5 + comment #39 patch
> 
> a) has same bug as you.  But b) and c) works good for me for tens
> hotadd/hotremove.
> 
> So is it possible for you to test b) another time and c)?

Sure, I will try rc5 with comment #39 patch. I get this error compiling rc5 with comment #4 patch (it says it cleanly applies with fuzz):
  CC [M]  drivers/pci/hotplug/acpiphp_glue.o
drivers/pci/hotplug/acpiphp_glue.c: In function ‘_handle_hotplug_event_func’:
drivers/pci/hotplug/acpiphp_glue.c:1242:23: error: ‘bridge’ undeclared (first use in this function)
drivers/pci/hotplug/acpiphp_glue.c:1242:23: note: each undeclared identifier is reported only once for each function it appears in
make[4]: *** [drivers/pci/hotplug/acpiphp_glue.o] Error 1

Comment 42 Huang Ying 2013-04-15 04:11:11 UTC

Created attachment 98681 [details]
Debug patch 5 for acpiphp hotplug

Comment 43 Huang Ying 2013-04-15 04:18:28 UTC

(In reply to comment #41)
> (In reply to comment #40)
> > a) 3.9-rc5
> > b) 3.9-rc5 + comment #4 patch
> > c) 3.9-rc5 + comment #39 patch
> > 
> > a) has same bug as you.  But b) and c) works good for me for tens
> > hotadd/hotremove.
> > 
> > So is it possible for you to test b) another time and c)?
> 
> Sure, I will try rc5 with comment #39 patch. I get this error compiling rc5
> with comment #4 patch (it says it cleanly applies with fuzz):
>   CC [M]  drivers/pci/hotplug/acpiphp_glue.o
> drivers/pci/hotplug/acpiphp_glue.c: In function ‘_handle_hotplug_event_func’:
> drivers/pci/hotplug/acpiphp_glue.c:1242:23: error: ‘bridge’ undeclared (first
> use in this function)
> drivers/pci/hotplug/acpiphp_glue.c:1242:23: note: each undeclared identifier
> is
> reported only once for each function it appears in
> make[4]: *** [drivers/pci/hotplug/acpiphp_glue.o] Error 1

Sorry, I have updated comment #4 patch with comment #42 patch, which can be applied to 3.9-rc5 cleanly.  Could you use comment #42 patch in test b) above?

Comment 44 ALLurGroceries 2013-04-15 06:18:04 UTC

(In reply to comment #43)
> Sorry, I have updated comment #4 patch with comment #42 patch, which can be
> applied to 3.9-rc5 cleanly.  Could you use comment #42 patch in test b)
> above?

Both of the patches you asked me to test work fine on rc5. I tested 50 consecutive insert/ejects each in addition to normal use. I hope my expresscard spring isn't getting tired... :P

Comment 45 Huang Ying 2013-04-15 06:32:58 UTC

(In reply to comment #44)
> (In reply to comment #43)
> > Sorry, I have updated comment #4 patch with comment #42 patch, which can be
> > applied to 3.9-rc5 cleanly.  Could you use comment #42 patch in test b)
> above?
> 
> Both of the patches you asked me to test work fine on rc5. I tested 50
> consecutive insert/ejects each in addition to normal use. I hope my
> expresscard
> spring isn't getting tired... :P

Thanks a lot!

Comment 46 Huang Ying 2013-05-02 00:38:39 UTC

Hi,  ALLurGroceries, as you have tested, kernel v3.9-rc6 has fixed this bug in another way, that should be: commit de7d5f7 "PCI/PM: Disable runtime PM of PCIe ports".  The commit has been merged by kernel v3.9.  So, I close the bug report, please reopen if not fixed.