Bug 215143

Summary: PCI: pciehp: clear cmd_busy bit when Command Completed in polling mode
Product: Drivers Reporter: Liguang (zhangliguang)
Component: PCIAssignee: drivers_pci (drivers_pci)
Status: NEW ---    
Severity: normal CC: lukas
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.10.23 Subsystem:
Regression: No Bisected commit-id:
Attachments: hotplug timeout log

Description Liguang 2021-11-26 07:42:44 UTC
Created attachment 299725 [details]
hotplug timeout log

This patch fixes this problem that on driver probe from system startup,
pciehp checks the Presence Detect State bit in the Slot Status register
to bring up an occupied slot or bring down an unoccupied slot. If empty
slot's power status is on, turn power off. The Hot-Plug interrupt isn't
requested yet, so avoid triggering a notification by calling
pcie_disable_notification().

Both the CCIE and HPIE bits are masked in pcie_disable_notification(),
when we issue a hotplug command, pcie_wait_cmd() will polling the
Command Completed bit instead of waiting for an interrupt. But cmd_busy
bit was not cleared when Command Completed which results in timeouts
like this in pciehp_power_off_slot() and pcie_init_notification():

  pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x01c0
(issued 2264 msec ago)
  pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x05c0
(issued 2288 msec ago)

After some debug, the first timeout occurs:

    pciehp_power_off_slot

        pcie_write_cmd(ctrl, PCI_EXP_SLTCTL_PWR_OFF, PCI_EXP_SLTCTL_PCC)

            pcie_do_write_cmd

                    /*
                     * Always wait for any previous command that might still be in progress
                     */
                    pcie_wait_cmd(ctrl);  // at the beginning

                        if (!ctrl->cmd_busy)  // cmd_busy was not zero
                            return;


Why cmd_busy was not flase, because in function pcie_disable_notification cmd_busy was not setting correctly:

    pcie_disable_notification  //  Both the CCIE and HPIE bits are masked

        pcie_write_cmd

            pcie_do_write_cmd

                ctrl->cmd_busy = 1  // cmd_busy was setting to 1

                pcie_wait_cmd

                    pcie_poll_cmd // pcie_wait_cmd() will polling the Command Completed bit instead of waiting for an interrupt

                         After debug we found Command Completed can be signaled without cmd_busy was setting to 0.


If we use interrupt mode pciehp_isr:

        pciehp_isr

            if (events & PCI_EXP_SLTSTA_CC) {
                ctrl->cmd_busy = 0;  //  cmd_busy was setting to zero, this was left in pcie_poll_cmd.
Comment 1 Liguang 2021-11-26 07:48:11 UTC
#lspci -tv
-+-[0000:fc]-+-00.0  Device 1ded:8001
 |           \-0a.0  Device 1ded:8003
 +-[0000:b0]---00.0-[b1-b2]--+-00.0  Mellanox Technologies MT27800 Family [ConnectX-5]
 |                           \-00.1  Mellanox Technologies MT27800 Family [ConnectX-5]
 +-[0000:98]-+-00.0-[99]--
 |           +-01.0-[9a]--
 |           +-02.0-[9b]--
 |           \-03.0-[9c]--
 +-[0000:80]---00.0-[81]--
 +-[0000:7c]-+-00.0  Device 1ded:8001
 |           \-0a.0  Device 1ded:8003
 +-[0000:30]-+-00.0-[31]--
 |           +-01.0-[32]--
 |           +-02.0-[33]--
 |           \-03.0-[34]--
 +-[0000:10]---00.0-[11]--
 \-[0000:00]-+-00.0-[01-02]----00.0-[02]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family
             +-01.0-[03]----00.0  Renesas Technology Corp. uPD720201 USB 3.0 Host Controller
             +-02.0-[04]----00.0  Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller
             \-03.0-[05]--
Comment 2 Liguang 2021-11-26 07:50:22 UTC
[   36.384101] pcieport 0000:00:03.0: Adding to iommu group 4
[   36.392032] pcieport 0000:00:03.0: PME: Signaling with IRQ 87
[   36.399782] pcieport 0000:00:03.0: pciehp: Slot Capabilities      : 0x0022005a
[   36.409054] pcieport 0000:00:03.0: pciehp: Slot Status            : 0x0000
[   36.417986] pcieport 0000:00:03.0: pciehp: Slot Control           : 0x01ca
[   36.761412] pcieport 0000:00:03.0: pciehp: Slot #4 AttnBtn- PwrCtrl+ MRL- AttnInd+ PwrInd+ HotPlug+ Surprise- Interlock+ NoCompl- IbPresDis- LLActRep+
[   36.779026] pcieport 0000:00:03.0: pciehp: pciehp_get_power_status: SLOTCTRL 88 value read 1ca
[   36.789833] pcieport 0000:00:03.0: pciehp: pciehp_check_link_active: lnk_status = 1011
[   36.857413] pcieport 0000:00:03.0: pciehp: pcie_disable_notification: SLOTCTRL 88 write cmd 0
[   38.745412] pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x01c0 (issued 1948 msec ago)
[   38.817411] pcieport 0000:00:03.0: pciehp: pciehp_power_off_slot: SLOTCTRL 88 write cmd 400
[   40.705412] pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x05c0 (issued 1948 msec ago)
[   40.718993] pcieport 0000:00:03.0: pciehp: pcie_enable_notification: SLOTCTRL 88 write cmd 1038
[   40.730110] pcieport 0000:00:03.0: pciehp: pciehp_check_link_active: lnk_status = 1011
Comment 3 Lukas Wunner 2021-11-26 17:41:27 UTC
The vendor ID 1ded (Alibaba) and the device IDs 8000 (Root Port), 8001 and 8003 (Co-processor) are currently not registered in the PCI IDs database. You may want to add them there:
https://pci-ids.ucw.cz/
Comment 4 Liguang 2021-11-28 07:07:14 UTC
thanks.