Bug 215590
Description
Rahul Kumar
2022-02-11 14:29:33 UTC
Created attachment 300465 [details]
kern log with all 3 patches
Created attachment 300466 [details]
lspci_before_suspend
Created attachment 300467 [details]
lspci_after_suspend
We have a workaround for this, but it is obviously not a generic solution. It would be helpful if you could advise what the generic solution would be here.

Created attachment 301174 [details]
Force reset bus
Here is the workaround
Created attachment 301175 [details]
force SBR
The actual patch - the previous one was wrong.
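For illustration only - this is not the attached patch. Forcing a Secondary Bus Reset on the hotplug bridge generally means toggling the SBR bit in the bridge's Bridge Control register and waiting for the link to settle; the in-tree helpers pci_bridge_secondary_bus_reset() and pci_reset_bus() do this with the appropriate delays. A minimal sketch:

/*
 * Illustrative sketch of a forced Secondary Bus Reset -- not the attached
 * workaround patch. Assumes <linux/pci.h> and <linux/delay.h>.
 */
static void sketch_force_sbr(struct pci_dev *bridge)
{
	u16 ctrl;

	/* Assert the Secondary Bus Reset bit in the Bridge Control register. */
	pci_read_config_word(bridge, PCI_BRIDGE_CONTROL, &ctrl);
	pci_write_config_word(bridge, PCI_BRIDGE_CONTROL,
			      ctrl | PCI_BRIDGE_CTL_BUS_RESET);
	msleep(2);		/* keep SBR asserted briefly */

	/* Deassert SBR and give devices behind the bridge time to come back. */
	pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, ctrl);
	msleep(1000);
}

In a real kernel this would normally go through pci_bridge_secondary_bus_reset() or pci_reset_bus() rather than open-coding the register writes.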
Would it be possible for you to provide dmesg for a contemporary kernel such as v5.18? A lot has changed since v5.4 and it's difficult to determine what the behavior might be today compared to the v5.4.2 kernel you've used.

(In reply to Lukas Wunner from comment #7)
> Would it be possible for you to provide dmesg for a contemporary kernel such
> as v5.18? A lot has changed since v5.4 and it's difficult to determine what
> the behavior might be today compared to the v5.4.2 kernel you've used.

Unfortunately the system can only run on the 5.4 and 5.11 kernels to support hotplug. Would a log from 5.11 help?

Hm, why is that? A v5.11 kernel is certainly much closer to what we have today, so yes, that would help. Please add the following to the kernel command line:

log_buf_len=10M pciehp.pciehp_debug=1 dyndbg="file pciehp* +p" ignore_loglevel

(In reply to Lukas Wunner from comment #9)
> Hm, why is that? A v5.11 kernel is certainly much closer to what we have
> today, so yes, that would help. Please add the following to the kernel
> command line:
>
> log_buf_len=10M pciehp.pciehp_debug=1 dyndbg="file pciehp* +p"
> ignore_loglevel

It's just an internal constraint from the customer, who 'likes' this kernel for its stability, I guess. Anyway - I will provide the logs with the required info from the 5.11 kernel soon. This kernel also includes your patch 'PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock'.

If you could provide output of "lspci -vvv" (a) before suspend, (b) after suspend (with my patch) and (c) after suspend (with my patch + Anatoli's workaround), that would be ideal. Thanks!

Created attachment 301183 [details]
lspci before S3
Created attachment 301184 [details]
lspci after S3
Created attachment 301185 [details]
lspci after S3 with Anatoli's work around
(In reply to Lukas Wunner from comment #11)
> If you could provide output of "lspci -vvv" (a) before suspend, (b) after
> suspend (with my patch) and (c) after suspend (with my patch + Anatoli's
> workaround), that would be ideal. Thanks!

Done in the 3 files above - see their descriptions at the top. Thanks.

The dmesg output (https://bugzilla.kernel.org/attachment.cgi?id=300465) shows tons of AER messages right after the GPU driver starts probing. It looks like the electrical connection to the expansion card is flaky.

On resume, there's a fatal uncorrectable error at 67.797528. That's obviously the reason why the link fails to come up. AER then tries and fails to reset the port:

[   69.684921] pcieport 0000:00:01.1: AER: Root Port link has been reset
[   69.691438] pcieport 0000:00:01.1: AER: Device recovery failed

The first message comes from aer_root_reset() and the second from pcie_do_recovery(). You should try to find out why recovery failed. That's ultimately the reason why you need to explicitly perform a Secondary Bus Reset in the workaround patch.

The code in pcie_do_recovery() has changed significantly since v5.4. What we had in v5.11 is much closer to what we have today, so dmesg output for v5.11 would definitely help.

When I submitted the comment yesterday, I got a "mid-air collision" error from bugzilla. I'm not sure whether I accidentally deleted a comment from you when submitting mine, please check. Sorry for the inconvenience.

Created attachment 301186 [details]
dmesg 5.11 with work around
Thanks for the analysis, Lukas. I am attaching a dmesg from 5.11 with the workaround for you to look at. Let me know if you also need a dmesg without the workaround.
Regarding AER - our SBIOS is a custom test SBIOS which generates AER errors on purpose to test the system behaviour under this condition. So it's not physically flaky connectivity but an intended hack.
(In reply to Andrey Grodzovsky from comment #17)
> Regarding AER - our SBIOS is a custom test SBIOS which generates AER errors
> on purpose to test the system behaviour under this condition. So it's not
> physically flaky connectivity but an intended hack.

Just a correction here: the SBIOS only seems to enable AER, the errors are not purposely generated - the errors we are seeing are real ones. AER treats it as a transaction error because we are adding the card to the slot on the go.

Yes, dmesg output without the workaround would be useful. With usec timestamps please. (The previous log only contains timestamps with 1 second resolution.)

Created attachment 301188 [details]
Here is the dmesg 5.11 without work around
(In reply to Lukas Wunner from comment #19)
> Yes, dmesg output without the workaround would be useful. With usec
> timestamps please. (The previous log only contains timestamps with 1 second
> resolution.)

Hey. Log attached above.

@Andrey Grodzovsky:

You wrote that with the workaround patch applied, you're able to re-plug the extension card after resume and the hotplug port is properly brought up. I'm not seeing that in dmesg_wa.txt. I can only see there that pciehp brings down the slot on resume because the extension card is gone, but there are no log entries showing that the slot is brought up again.

Upon resume, a race is visible between pciehp and AER:

In dmesg_no_wa.txt, the race is won by pciehp. It detects that the slot is empty and kicks the IRQ thread pciehp_ist() to bring the slot down:

[  159.973097] pcieport 0000:00:01.1: pciehp: pciehp_check_link_active: lnk_status = 1101
[  159.973267] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
[  159.973288] pcieport 0000:00:01.1: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:01:00

These log messages correspond to the following call stack:

pciehp_resume()
  pciehp_check_presence()
    pciehp_card_present_or_link_active()
      pciehp_check_link_active()
    pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
...
pciehp_ist()
  pciehp_handle_presence_or_link_change()

And pciehp_ist() is holding ctrl->reset_lock, which prevents the concurrent AER thread from performing a reset while the slot is brought down:

[  159.973365] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
[  159.973478] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[  159.973481] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00004000/04400000
[  159.973484] pcieport 0000:00:01.1:    [14] CmpltTO (First)
[  170.387351] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
[  170.387367] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[  170.387370] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00004000/04400000
[  170.387372] pcieport 0000:00:01.1:    [14] CmpltTO (First)
[  171.115213] pci 0000:03:00.0: AER: can't recover (no error_detected callback)

The last message above is a little concerning: At this point, the GPU driver has already been unbound. In report_error_detected(), we return PCI_ERS_RESULT_NO_AER_DRIVER if the device is not bound to a driver or if its io_state is perm_failure. Both conditions are true here. I think we need to change the logic in report_error_detected() to return PCI_ERS_RESULT_NONE in those cases and thus allow forward progress. For unbound devices, it may be necessary to call pci_restore_state() after the Secondary Bus Reset.

My expectation would be that recovery fails due to the GPU device being unbound and permanently failed. Curiously, the log messages claim success:

[  171.115953] pcieport 0000:00:01.1: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0
[  172.136373] pcieport 0000:00:01.1: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
[  172.136387] pcieport 0000:00:01.1: AER: Root Port link has been reset (0)
[  172.136407] pcieport 0000:00:01.1: AER: device recovery successful

That's very strange. Have you modified the logic in report_error_detected()? Any custom patches in your v5.11 tree? What's going on here?

Comparing all of the above to dmesg_wa.txt, it's apparent that the race is won by AER there instead of pciehp. AER resets the slot right after coming out of suspend:

[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00004000/04400000
[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1:    [14] CmpltTO (First)
[Wed Jun 15 15:29:39 2022] [drm] PCI error: detected callback, state(2)!!
[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0

Only then does pciehp get a chance to check slot status and bring it down:

[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: pciehp: pciehp_check_link_active: lnk_status = 1101
[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:01:00

However, there's something strange going on: While pciehp is still busy bringing down the slot, a slot reset is performed concurrently:

[Wed Jun 15 15:29:40 2022] amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
[Wed Jun 15 15:29:40 2022] pcieport 0000:00:01.1: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
[Wed Jun 15 15:29:40 2022] pcieport 0000:00:01.1: AER: Root Port link has been reset (0)
[Wed Jun 15 15:29:40 2022] amdgpu 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible)
...
[Wed Jun 15 15:29:51 2022] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (1332715520).
[Wed Jun 15 15:29:51 2022] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.

That is supposed to be prevented by the reset_lock. I don't see how this can happen.

So the bottom line is:
(a) We may need to change the handling of unbound and permanently failed devices in report_error_detected().
(b) I can't really make sense of the logs because the messages do not correlate with the code. Also, neither dmesg_wa.txt nor dmesg_no_wa.txt show that the extension card is re-plugged and the slot is brought up again.

Both logs show a bunch of pciehp_check_link_active messages at the bottom; I'm unsure where they're originating from, but the DLLLA bit is not set in any of those messages, so the link never comes up. Also, both lspci_after_s3_wa.txt and lspci_after_s3.txt show DLActive- for the hotplug port 00:01.1.
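For illustration, the change suggested above could look roughly like this in report_error_detected() in drivers/pci/pcie/err.c - an untested sketch against the v5.11 code layout, not a submitted patch:

	/*
	 * Untested sketch, not a submitted patch: let unbound or permanently
	 * failed devices vote PCI_ERS_RESULT_NONE instead of
	 * PCI_ERS_RESULT_NO_AER_DRIVER, so that pcie_do_recovery() can keep
	 * making forward progress.
	 */
	if (!pci_dev_set_io_state(dev, state) ||
	    !dev->driver ||
	    !dev->driver->err_handler ||
	    !dev->driver->err_handler->error_detected) {
		vote = PCI_ERS_RESULT_NONE;
	} else {
		err_handler = dev->driver->err_handler;
		vote = err_handler->error_detected(dev, state);
	}

Whether unbound devices additionally need pci_restore_state() after the Secondary Bus Reset, as suggested above, would have to be verified separately.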
(In reply to Lukas Wunner from comment #22)
> That's very strange. Have you modified the logic in report_error_detected()?
> Any custom patches in your v5.11 tree? What's going on here?

We haven't touched the logic of report_error_detected() and it's exactly the same as in the 5.11 kernel:
https://elixir.bootlin.com/linux/v5.11/source/drivers/pci/hotplug/pciehp_hpc.c#L837

Yes - there are a few custom patches, but they are not related to the pciehp code. I will attach them here just to be sure, so you can take a look.

> However, there's something strange going on: While pciehp is still busy
> bringing down the slot, a slot reset is performed concurrently:
> [...]
> That is supposed to be prevented by the reset_lock. I don't see how this can
> happen.

I will try to look more into this.

> Also, neither dmesg_wa.txt nor dmesg_no_wa.txt show that the extension card
> is re-plugged and the slot is brought up again.

I am re-attaching the log from 5.11 with the workaround and a subsequent successful hot plug - please take a look.

Created attachment 301190 [details]
dmesg 5.11 with work around and subsequent hot plug of same card
Attaching the log with hotplug after resume as requested.
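For orientation, the reset_lock serialization Lukas refers to looks roughly like this in v5.11's drivers/pci/hotplug/pciehp_hpc.c (paraphrased excerpt, not verbatim):

	/* pciehp_ist() -- the IRQ thread handling presence/link-change events: */
	down_read(&ctrl->reset_lock);
	/* ... slot bring-up / bring-down happens while the read side is held ... */
	up_read(&ctrl->reset_lock);

	/* pciehp_reset_slot() -- entered from the reset paths (e.g. the AER recovery seen above): */
	down_write(&ctrl->reset_lock);
	/* ... mask PDC/DLLSC events, perform the Secondary Bus Reset, unmask ... */
	up_write(&ctrl->reset_lock);

Since the writer should be excluded while the IRQ thread holds the read side, the interleaving of pciehp_reset_slot messages with the slot bring-down in dmesg_wa.txt is indeed unexpected.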
Created attachment 301191 [details]
PCI/PME: Fix race for NV23 on PME polling
Created attachment 301192 [details]
PCI: Add support for PME state on suspended hot-add devices
Created attachment 301193 [details]
PCI: Fixes state transitions of NV23 dGPU from D3cold into D0
Attached above are the 3 other PCI workarounds we still have in our branch - I think they are not related to our issue? Besides these, your deadlock fix patch 'PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock' is the only other custom PCI patch we have in that branch.

Lukas - just a gentle ping to recheck the logs we attached.

Created attachment 302942 [details]
The dmesg for S3 resume time with AER.
Here is the dmesg for the 5.18 kernel.
Created attachment 302943 [details]
The dmesg for S3 resume time with AER.
Here is the dmesg for the 5.18 kernel with amdgpu blacklisted.
Created attachment 302945 [details]
The dmesg for S3 resume time with AER.
The issue has been observed in the latest kernel 6.0.
Created attachment 303186 [details]
The dmesg 6.1.0-rc5 for S3 resume time with AER
The issue has been observed in the latest kernel with amdgpu blacklisted.
Created attachment 303187 [details]
The dmesg 6.1.0-rc5 with the down_write_nested rework of ctrl->reset_lock, and a subsequent hot plug of the same AMD card.
Attaching the log with hotplug after resume.
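A note on the down_write_nested() rework mentioned in the attachment above (general observation, not a review of the attached patch): down_write_nested() behaves exactly like down_write() at run time; the extra subclass argument only changes how lockdep classifies the acquisition. A nesting annotation can therefore silence a "possible circular locking dependency" report without changing the actual lock ordering, so it is worth double-checking that the AB-BA ordering itself is gone. Sketch:

	/*
	 * Runtime behaviour is identical to down_write(&ctrl->reset_lock);
	 * only lockdep's bookkeeping differs because of the subclass.
	 */
	down_write_nested(&ctrl->reset_lock, SINGLE_DEPTH_NESTING);
	/* ... critical section ... */
	up_write(&ctrl->reset_lock);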
Created attachment 303188 [details]
The patch to resolve the issue for secondary bus reset.
@Lukas Wunner:
Does this patch make sense for this case at all?
Created attachment 303524 [details]
The dmesg 6.2.0-rc2 for S3 resume time with AER
Created attachment 303525 [details]
The dmesg 6.2.0-rc2 with the down_write_nested rework of ctrl->reset_lock, and a subsequent hot plug of the same AMD card.
Created attachment 303634 [details]
The dmesg 6.2.0-rc4 for S3 resume with AER

The dmesg 6.2.0-rc4 for the patch:
https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e37d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/

Created attachment 303639 [details]
The dmesg 6.2.0-rc5 for S3 resume with AER
Created attachment 303640 [details]
The patch for testing
The merged patch "PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock"
for the kernel 6.2.0-rc5
Created attachment 304099 [details]
[PATCH v2] PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock
My apologies for the delay. Could you test if this patch fixes the issue for you?
It should apply cleanly on top of any recent kernel. If you test with anything older than v6.3-rc1, be sure to also cherry-pick commit 74ff8864cc84 ("PCI: hotplug: Allow marking devices as disconnected during bind/unbind").
Thanks a lot.
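For readers following the thread, the AB-BA ordering this patch addresses is roughly the following (paraphrased - see the patch description for the authoritative version): the pciehp IRQ thread takes reset_lock and then needs device_lock to unbind the driver while bringing the slot down, whereas the reset/error-recovery path takes device_lock first and then wants reset_lock for the slot reset:

pciehp IRQ thread:
  pciehp_ist()
    down_read(&ctrl->reset_lock)            <-- takes reset_lock
    pciehp_handle_presence_or_link_change()
      pciehp_unconfigure_device()
        pci_stop_and_remove_bus_device()
          device_release_driver()
            device_lock()                   <-- then waits for device_lock

reset / error-recovery path:
  pci_slot_reset() or pci_reset_function()
    device_lock()                           <-- takes device_lock
    pci_reset_hotplug_slot()
      pciehp_reset_slot()
        down_write(&ctrl->reset_lock)       <-- then waits for reset_lock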
Created attachment 304111 [details]
The dmesg for S3 resume time with AER
The dmesg for current kernel 6.3.0-rc5 on the same setup.
Thanks Lukas. The patch has been tested with the current kernel 6.3.0-rc5 on the same setup. The deadlock between reset_lock and device_lock has been fixed. See details in the dmesg log: dmesg_6.3.0-rc5_fix.txt