Bug 201469
Description
RussianNeuroMancer
2018-10-18 12:19:56 UTC
Unfortunately, this is the famous PCI bus problem...
We can't talk to our NIC anymore. We have tried many times to see what can be done from the driver side, and our HW guys told us that this is not driver related. There is noise on the PCI lines and because of that, we can't access our device.
We have heard reports that claim that it started from a specific version and worked fine in an earlier version of the software, but every time we had such reports, the user could reproduce on the older software as well. Just less reliably.
What you can do here is to try to enable PCI AER to see if the bus driver spews a few things.
Another thing that can help is to try to see if there is a patch in the PCI tree that can be causing this.
From the iwlwifi side, we can't do much.

> We have heard reports that claim that it started from a specific version and
> worked fine in an earlier version of the software, but every time we had such
> reports, the user could reproduce on the older software as well. Just less
> reliably.
I absolutely sure I never have this issue for at least two years. Since 4.18rc1 it's reproducible on every attempt.
Also, please check https://bugzilla.kernel.org/show_bug.cgi?id=102281
As you can see a lot of people use suspend on this device since year 2015. If this issue happen at least sometimes (even rarely) for a three years, someone else would report it before I did.
> try to enable PCI AER to see if the bus driver spews a few things
How to do so?
> Another thing that can help is to try to see if there is a patch in the PCI
> tree that can be causing this.
What commit you would recommend to revert? I'm not developer, so I honestly don't know where to look.

(In reply to RussianNeuroMancer from comment #2)
> > We have heard reports that claim that it started from a specific version and
> > worked fine in an earlier version of the software, but every time we had such
> > reports, the user could reproduce on the older software as well. Just less
> > reliably.
>
> I absolutely sure I never have this issue for at least two years. Since
> 4.18rc1 it's reproducible on every attempt.
>
> Also, please check https://bugzilla.kernel.org/show_bug.cgi?id=102281
> As you can see a lot of people use suspend on this device since year 2015.
> If this issue happen at least sometimes (even rarely) for a three years,
> someone else would report it before I did.
All this may be true, but what I said still holds.
The only thing I can suggest is that you take an old kernel (which you claim to be working), and install our latest driver using the backport tree. See https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/core_release#how_to_install_the_driver for instruction on how to do that.
Then, if you'll reproduce the problem, then maybe we will have some indication. But we won't be able to do much from iwlwifi.
>
> > try to enable PCI AER to see if the bus driver spews a few things
>
> How to do so?
Kernel configuration.
>
> > Another thing that can help is to try to see if there is a patch in the PCI
> > tree that can be causing this.
>
> What commit you would recommend to revert? I'm not developer, so I honestly
> don't know where to look.
You didn't get the point. I don't know either, I am not a PCI developer.
You can try to bisect the whole kernel if you like. Or maybe focus git bisect to PCI and drivers/wireless/intel/iwlwifi
In the meantime I'll close this since we really can't do much. We will still be notified if you add something.

> You can try to bisect the whole kernel if you like.
Bisect point to commit https://github.com/torvalds/linux/commit/9ab105deb60fa76d66cae5548819b4e8703d2056
Verification:
Upstream 4.18.0 - issue is reproducible
Upstream 4.20rc7 - issue is reproducible
4.18.0 without mentioned commit - issue is not reproducible
4.20rc7 without mentioned commit - issue is not reproducible
What is next steps?

Please post the output of lspci and sudo lspci -xxxvvv with and without this commit.
Thanks

Created attachment 280119 [details]
lspci output from upstream Linux 4.20rc7
Created attachment 280121 [details]
lspci -xxxvvv output from upstream Linux 4.20rc7
Created attachment 280123 [details]
lspci output from patched Linux 4.20rc7
Created attachment 280125 [details]
lspci -xxxvvv output from patched Linux 4.20rc7
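Comparing captures like the four attached above comes down to finding which +/- flags differ on a given lspci line. The following is a minimal sketch of that comparison, assuming lspci -vvv style output in which each flag ends in "+" or "-". The function names are hypothetical, and the two sample lines are the L1SubCtl1 lines quoted in this thread.

```python
def parse_l1subctl1(line):
    """Return {flag_name: enabled} for the +/- flags on an L1SubCtl1 line."""
    _, _, rest = line.partition("L1SubCtl1:")
    flags = {}
    for token in rest.split():
        # Flags look like "ASPM_L1.2+"; key=value fields such as
        # "T_CommonMode=0us" do not end in +/- and are skipped.
        if token[-1] in "+-":
            flags[token[:-1]] = token[-1] == "+"
    return flags


def changed_flags(before, after):
    """Return {flag: (before_state, after_state)} for flags that differ."""
    a, b = parse_l1subctl1(before), parse_l1subctl1(after)
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}


# Sample lines as quoted in the thread (patched vs. upstream 4.20-rc7).
PATCHED = ("L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ "
           "T_CommonMode=0us LTR1.2_Threshold=163840ns")
UPSTREAM = ("L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ "
            "T_CommonMode=0us LTR1.2_Threshold=163840ns")

if __name__ == "__main__":
    print(changed_flags(PATCHED, UPSTREAM))
```

Feeding it the same line from a "before suspend" and an "after resume" capture would show whether a flag changed across the suspend cycle.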
Hi Bjorn,
the submitter here bisected a problem in which the Intel wireless device can't access its config space.
He came to the conclusion that https://github.com/torvalds/linux/commit/9ab105deb60fa76d66cae5548819b4e8703d2056 caused the problem.
I asked him to attach the config space before and after he reverts your patch.
I can't see any difference in the L1 PM substates.
Can you shed more light?

I can indeed see that in the patched 4.20-rc7 kernel:
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ T_CommonMode=0us LTR1.2_Threshold=163840ns
And in the upstream kernel:
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ T_CommonMode=0us LTR1.2_Threshold=163840ns
Note the difference in the L1.2 value.
Is this related to suspend / resume? Or do you see the bug even when you do a fresh boot without suspend / resume?
thanks.

> Is this related to suspend / resume?
Yes, as per original bug description "stop working after wakeup from suspend".
> Or you see the bug even when you do a fresh boot without suspend / resume?
No, bug never occur on fresh boot.

Created attachment 280175 [details]
test patch
First of all, I'm very sorry about the problem, and thank you very much for having done the bisection. That is a tremendous help in debugging.
I assume the attached dmesg from v4.19-rc8 has the problem. Can you attach the dmesg from a kernel that does not have the problem? Also, please attach an acpidump.
I suspect what's happening is that this:
acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
acpi PNP0A08:00: _OSC failed (AE_ERROR); disabling ASPM
means host->native_ltr == 0, i.e., the OS doesn't have control over the LTR capability, which means pdev->ltr_path == 0 for every device, which means that after 9ab105deb60f, we clear out PCI_L1SS_CAP_ASPM_L1_2 so we won't use L1.2.
I don't understand yet why leaving L1.2 disabled should cause a functional problem. I would expect that with L1.2 disabled, everything would still work but use more power.
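The chain of implications suspected above can be modeled in a few lines. This is a hypothetical Python model, not kernel code: the function name and its arguments are made up for illustration, and the capability bit values are assumed from memory of the kernel's pci_regs.h header rather than taken from this thread.

```python
# Assumed bit positions in the L1 PM Substates Capabilities register.
PCI_L1SS_CAP_PCIPM_L1_2 = 0x01
PCI_L1SS_CAP_PCIPM_L1_1 = 0x02
PCI_L1SS_CAP_ASPM_L1_2 = 0x04
PCI_L1SS_CAP_ASPM_L1_1 = 0x08


def usable_l1ss_cap(raw_cap, native_ltr):
    """Model the masking step: without OS control of LTR there is no LTR
    path to the device, so ASPM L1.2 is removed from the usable capabilities.
    """
    # Simplification: ltr_path is treated as following native_ltr directly.
    ltr_path = native_ltr
    if not ltr_path:
        return raw_cap & ~PCI_L1SS_CAP_ASPM_L1_2
    return raw_cap


if __name__ == "__main__":
    all_substates = 0x0F
    # _OSC failed, so the OS does not own LTR: ASPM L1.2 is masked out.
    print(hex(usable_l1ss_cap(all_substates, native_ltr=False)))
    # OS owns LTR: nothing is masked.
    print(hex(usable_l1ss_cap(all_substates, native_ltr=True)))
```

In this model only the ASPM L1.2 bit is affected; the PCI-PM substates and ASPM L1.1 stay available, which matches the flag pattern seen in the lspci output on the upstream kernel.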
Oh, and I forgot to ask if you could try v4.19-rc8 with the debug patch attached to comment #14, and attach the resulting dmesg log.

I asked internally at Intel and of course, if LTR is not configured, we don't use L1.2 for *ASPM* because we can't use that without LTR values.
But ASPM doesn't seem really relevant here since the problem happens only after suspend resume, so we are probably debugging a PCI-PM problem?

Created attachment 280253 [details]
dmesg with Linux 4.17.19
> Can you attach the dmesg from a kernel that does not have the problem?
dmesg of a few suspend/resume cycles with Linux 4.17.19 is attached.

Created attachment 280255 [details]
acpidump
> Also, please attach an acpidump.
Sure, acpidump is attached.

(In reply to Bjorn Helgaas from comment #15)
> Oh, and I forgot to ask if you could try v4.19-rc8 with the debug patch
> attached to comment #14, and attach the resulting dmesg log.
RussianNeuroMancer, are you able to provide this additional log from the patched kernel?
(If not, I'm affected by the same issue on my Latitude 7350 and could also produce this; let me know.)

> RussianNeuroMancer, are you able to provide this additional log from the
> patched kernel?
Yes, tomorrow. Kernel is still compiling.

Created attachment 280263 [details]
dmesg with test patch 1
> Oh, and I forgot to ask if you could try v4.19-rc8 with the debug patch
> attached to comment #14, and attach the resulting dmesg log.
Resulting dmesg is attached.

Created attachment 280297 [details]
test patch (use LTR if already enabled by platform)
Can you please try this patch? If it helps, please attach the output of "lspci -vvs 00:1c.0; lspci -vvs 01:00.0" from before and after the suspend/resume.
I still can't explain the problem, but if this patch helps, I'll pore over the spec again and see if I can figure something out.
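The idea in the test patch's title, "use LTR if already enabled by platform", can be sketched as a bit test on the Device Control 2 register. This is only a hypothetical model of that idea and not the attached patch: the helper names are invented, and the value of the LTR enable bit is an assumption from memory of the kernel's pci_regs.h header.

```python
# Assumed value of the "LTR mechanism enable" bit in Device Control 2.
PCI_EXP_DEVCTL2_LTR_EN = 0x0400


def ltr_already_enabled(devctl2):
    """True if the LTR enable bit is set in a Device Control 2 value."""
    return bool(devctl2 & PCI_EXP_DEVCTL2_LTR_EN)


def ltr_path_with_patch(native_ltr, devctl2):
    """Hypothetical model: treat LTR as usable if the OS owns it, or if
    firmware had already enabled it on the device before the OS looked.
    """
    return native_ltr or ltr_already_enabled(devctl2)


if __name__ == "__main__":
    # Platform enabled LTR, OS did not get control via _OSC.
    print(ltr_path_with_patch(native_ltr=False, devctl2=0x0400))
    # Neither the platform nor the OS enabled LTR.
    print(ltr_path_with_patch(native_ltr=False, devctl2=0x0000))
```

Note that the bit has to be tested on the value read from the control register, which is the point of the variable-name correction raised in the next comment.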
(In reply to Bjorn Helgaas from comment #22)
> Created attachment 280297 [details]
> test patch (use LTR if already enabled by platform)
>
> Can you please try this patch?
I believe this is a typo:
+ pcie_capability_read_dword(dev, PCI_EXP_DEVCTL2, &ctl);
+ if (cap & PCI_EXP_DEVCTL2_LTR_EN) {
"cap" here should be "ctl", right?

Created attachment 280339 [details]
lspci -vvs before suspend (Linux 4.20 with comment #22 patch, on Dell Latitude 7350)
(In reply to Bjorn Helgaas from comment #22)
> Created attachment 280297 [details]
> test patch (use LTR if already enabled by platform)
>
> Can you please try this patch? If it helps, please attach the output of
> "lspci -vvs 00:1c.0; lspci -vvs 01:00.0" from before and after the
> suspend/resume.
This fixes the issue on my Dell Latitude 7350 (with the change in comment #23). It has identical hardware as the Dell Venue Pro 7140 at these two PCI bus addresses.
I'm attaching the requested lspci -vvs output. The only difference before and after suspend is the LTR max snoop/no snoop latencies for the Wi-Fi adapter.

Created attachment 280341 [details]
lspci -vvs after suspend (Linux 4.20 with comment #22 patch, on Dell Latitude 7350)

Thanks a lot for testing this! You're right about the typo in the patch.
I think we have at least two issues here.
1) Linux has no support for saving/restoring the Max Latency values in the LTR Capability. This results in the latencies being zero after you resume, as you see in the lspci output. The device still *works* after resume, but power consumption should increase because the device is effectively requesting the best possible service, so we probably don't use the L1.2 state at all.
2) Linux has no support for programming the Max Latency values for hot-added devices. When using ACPI hotplug, firmware may do this, but for native PCIe hotplug (pciehp), the new device should again be requesting the best possible service, resulting in more power consumption than necessary. The platform is supposed to supply a _DSM method with information required to program these values.

(In reply to RussianNeuroMancer from comment #7)
> Created attachment 280121 [details]
> lspci -xxxvvv output from upstream Linux 4.20rc7
RussianNeuroMancer, I'm pretty sure this lspci output was captured before the suspend because the LTR max latencies are non-zero. Could I trouble you to collect similar output after the resume of v4.20-rc7, when iwlwifi isn't working?
David, if you're able to capture this info, I'd like to see your "lspci -xxxvvv" output from both before and after the suspend/resume (on an upstream kernel without the test patch).
I think the comment 22 patch is probably something we need to do, but I still can't connect the fact that ASPM L1.2 is disabled with iwlwifi being completely non-functional. I'm hoping we can find something else that explains that.

Created attachment 280377 [details]
lspci -xxxvvv before suspend (Linux 4.20, on Dell Latitude 7350)
Created attachment 280379 [details]
lspci -xxxvvv after suspend (Linux 4.20, on Dell Latitude 7350)
(In reply to Bjorn Helgaas from comment #27)
> (In reply to RussianNeuroMancer from comment #7)
> > Created attachment 280121 [details]
> > lspci -xxxvvv output from upstream Linux 4.20rc7
>
> RussianNeuroMancer, I'm pretty sure this lspci output was captured before
> the suspend because the LTR max latencies are non-zero. Could I trouble you
> to collect similar output after the resume of v4.20-rc7, when iwlwifi isn't
> working?
>
> David, if you're able to capture this info, I'd like to see your "lspci
> -xxxvvv" output from both before and after the suspend/resume (on an
> upstream kernel without the test patch).
The output from my system is attached, with upstream kernel version 4.20.
Unfortunately, the only PCI device on this system with LTR capability seems to be the Wi-Fi adapter (the integrated SD card reader uses USB 3.0 instead). Because of this bug, the Wi-Fi adapter does not provide valid data to lspci after resume from suspend.

Created attachment 280381 [details]
dmesg (Linux 4.20, on Dell Latitude 7350)
Created attachment 280383 [details]
acpidump (on Dell Latitude 7350)
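The "does not provide valid data" symptom shows up elsewhere in this thread as all-ones values: "rev ff" and "prog-if ff" in lspci output, and "CSR_GP_CNTRL 0xffffffff" in the iwlwifi warning. Reads from a PCI device that no longer responds come back with every bit set, so a capture can be screened for that pattern. A minimal sketch, with a hypothetical helper name:

```python
def looks_unreachable(value, width_bits=32):
    """True if a register read came back with every bit set for its width,
    which is what a read from a non-responding PCI device returns.
    """
    return value == (1 << width_bits) - 1


if __name__ == "__main__":
    # 32-bit register value reported in the iwlwifi warning.
    print(looks_unreachable(0xFFFFFFFF))
    # 8-bit revision ID after a failed resume ("rev ff").
    print(looks_unreachable(0xFF, width_bits=8))
    # 8-bit revision ID from a working capture ("rev 59").
    print(looks_unreachable(0x59, width_bits=8))
```

An all-ones value can in principle also be legitimate register content, so this is a hint that the device has dropped off the bus and not proof.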
Created attachment 280409 [details]
lspci -xxxvvv before suspend (Linux 4.18, on Dell Latitude 7350)
(In reply to Bjorn Helgaas from comment #27)
> I think the comment 22 patch is probably something we need to do, but I
> still can't connect the fact that ASPM L1.2 is disabled with iwlwifi being
> completely non-functional. I'm hoping we can find something else that
> explains that.
The bisection point that causes the iwlwifi issue on the Dell Venue Pro 7140 is not the same one that causes it on the Dell Latitude 7350. It works for me with Linux 4.18rc1, Linux 4.18, and Linux 4.18.16. It's not until Linux 4.19 that the card can't access its config space and Wi-Fi stops working.
In Linux 4.18, what I am observing though is that the system *boots* with ASPM L1.2 disabled on the Wi-Fi adapter and PCIe bridge. But if I suspend the system, then after it resumes I actually see that ASPM L1.2 has become enabled on both.
In case this is relevant, lspci -xxxvvv is attached for Linux 4.18 before/after suspend.

Created attachment 280411 [details]
lspci -xxxvvv after suspend (Linux 4.18, on Dell Latitude 7350)
Created attachment 280419 [details]
lspci -vvs before suspend (Linux 4.18.1 with comment #22 patch, on Dell Venue 11 Pro 7140)
> Can you please try this patch? If it helps, please attach the output of
> "lspci -vvs 00:1c.0; lspci -vvs 01:00.0" from before and after the
> suspend/resume.
I tested this patch on top of 4.18.1, output before suspend and after resume is attached.

Created attachment 280421 [details]
lspci -vvs before resume (Linux 4.18.1 with comment #22 patch, on Dell Venue 11 Pro 7140)

Created attachment 280423 [details]
lspci -xxxvvv before suspend (Linux 4.20 upstream, on Dell Venue 11 Pro 7140)
Created attachment 280425 [details]
lspci -vvs after resume (Linux 4.18.1 with comment #22 patch, on Dell Venue 11 Pro 7140)

Created attachment 280427 [details]
lspci -xxxvvv after resume (Linux 4.20 upstream, on Dell Venue 11 Pro 7140)
> RussianNeuroMancer, I'm pretty sure this lspci output was captured before the
> suspend because the LTR max latencies are non-zero. Could I trouble you to
> collect similar output after the resume of v4.20-rc7, when iwlwifi isn't
> working?
Sure, output is attached.

Using git bisect, I found the power management change in 4.19rc1 which causes Wi-Fi to actually stop working after suspend/resume on the Dell Latitude 7350.
Prior to the commit below, when the Dell Latitude 7350 is suspended/resumed, ASPM L1.2 becomes disabled on the Wi-Fi adapter (due to commit 9ab105d) but Wi-Fi still works. With the commit below, the Wi-Fi adapter's config space can no longer be accessed after suspend/resume.
The difference with the Dell Venue Pro 7140 (which is specifically mentioned in the commit message below) is that it only takes commit 9ab105d to prevent it from accessing the Wi-Fi adapter's config space after suspend/resume.

commit 6f9db69ad93cd6ab77d5571cf748ff7cdcfb0285
Author: Tristian Celestin <tristiancelestin@fastmail.com>
Date: Fri Jun 15 04:50:18 2018 -0400

    ACPI / PM: Default to s2idle in all machines supporting LP S0

    The Dell Venue Pro 7140 supports the Low Power S0 Idle state, but
    does not support any of the _DSM functions that the current heuristic
    checks for.

    Since suspend-to-mem can not be safely performed on this machine,
    and since the bitfield check can't cover this case, it is safer to
    enable s2idle by default by checking for the presence of the _DSM
    alone and removing the bitfield check.

    Signed-off-by: Tristian Celestin <tristiancelestin@fastmail.com>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

So before this commit Dell 7350 perform suspend-to-mem by default?
(In reply to David Ward from comment #39)
> Prior to the commit below, when the Dell Latitude 7350 is suspended/resumed,
> ASPM L1.2 becomes disabled on the Wi-Fi adapter (due to commit 9ab105d) [...]
meant to say "ASPM L1.2 becomes enabled"

(In reply to RussianNeuroMancer from comment #40)
> So before this commit Dell 7350 perform suspend-to-mem by default?
Yes; with this commit removed on the Dell Latitude 7350:
$ cat /sys/power/mem_sleep
s2idle [deep]
$ (sudo lspci -vvs 00:1c.0; sudo lspci -vvs 01:00.0) | grep -e rev -e L1SubCtl1
00:1c.0 PCI bridge: Intel Corporation Wildcat Point-LP PCI Express Root Port #3 (rev e3) (prog-if 00 [Normal decode])
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+
01:00.0 Network controller: Intel Corporation Wireless 7265 (rev 59)
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+
$ systemctl suspend
$ (sudo lspci -vvs 00:1c.0; sudo lspci -vvs 01:00.0) | grep -e rev -e L1SubCtl1
00:1c.0 PCI bridge: Intel Corporation Wildcat Point-LP PCI Express Root Port #3 (rev e3) (prog-if 00 [Normal decode])
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
01:00.0 Network controller: Intel Corporation Wireless 7265 (rev 59)
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
But with this commit applied:
$ cat /sys/power/mem_sleep
[s2idle] deep
$ (sudo lspci -vvs 00:1c.0; sudo lspci -vvs 01:00.0) | grep -e rev -e L1SubCtl1
00:1c.0 PCI bridge: Intel Corporation Wildcat Point-LP PCI Express Root Port #3 (rev e3) (prog-if 00 [Normal decode])
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+
01:00.0 Network controller: Intel Corporation Wireless 7265 (rev 59)
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+
$ systemctl suspend
[ 107.490445] iwlwifi 0000:01:00.0: iwlwifi transaction failed, dumping registers
<...>
$ (sudo lspci -vvs 00:1c.0; sudo lspci -vvs 01:00.0) | grep -e rev -e L1SubCtl1
00:1c.0 PCI bridge: Intel Corporation Wildcat Point-LP PCI Express Root Port #3 (rev e3) (prog-if 00 [Normal decode])
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+
01:00.0 Network controller: Intel Corporation Wireless 7265 (rev ff) (prog-if ff)
So it's still the case that whenever iwlwifi breaks after resume, ASPM L1.2 has been disabled; but the reason(s) ASPM L1.2 became disabled may be a bit different.

(In reply to David Ward from comment #39)
> Using git bisect, I found the power management change in 4.19rc1 which
> causes Wi-Fi to actually stop working after suspend/resume on the Dell
> Latitude 7350.
>
> Prior to the commit below, when the Dell Latitude 7350 is suspended/resumed,
> ASPM L1.2 becomes disabled on the Wi-Fi adapter (due to commit 9ab105d) but
> Wi-Fi still works. With the commit below, the Wi-Fi adapter's config space
> can no longer be accessed after suspend/resume.
>
> The difference with the Dell Venue Pro 7140 (which is specifically mentioned
> in the commit message below) is that it only takes commit 9ab105d to prevent
> it from accessing the Wi-Fi adapter's config space after suspend/resume.
>
>
>
> commit 6f9db69ad93cd6ab77d5571cf748ff7cdcfb0285
> Author: Tristian Celestin <tristiancelestin@fastmail.com>
> Date: Fri Jun 15 04:50:18 2018 -0400
>
> ACPI / PM: Default to s2idle in all machines supporting LP S0
Without reverting this, can you echo "deep" to /sys/power/mem_sleep and retest?
It should do suspend-to-RAM instead of suspend-to-idle then.

(In reply to Rafael J. Wysocki from comment #42)
> > commit 6f9db69ad93cd6ab77d5571cf748ff7cdcfb0285
> > Author: Tristian Celestin <tristiancelestin@fastmail.com>
> > Date: Fri Jun 15 04:50:18 2018 -0400
> >
> > ACPI / PM: Default to s2idle in all machines supporting LP S0
>
> Without reverting this, can you echo "deep" to /sys/power/mem_sleep and
> retest?
>
> It should do suspend-to-RAM instead of suspend-to-idle then.
Yes, this does what you'd expect: manually changing from suspend-to-idle to suspend-to-RAM this way has the same effect as reverting the commit.
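The /sys/power/mem_sleep output shown above lists the available suspend variants and marks the active one with square brackets. When collecting test results from several machines, it can help to record which variant was active. A minimal sketch of parsing that content, with a hypothetical function name; the two sample strings are the ones from the transcript above.

```python
def parse_mem_sleep(text):
    """Return (active_mode, available_modes) from /sys/power/mem_sleep content.

    The active mode is the one wrapped in square brackets.
    """
    modes, active = [], None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        modes.append(token)
    return active, modes


if __name__ == "__main__":
    # With the commit removed: suspend-to-RAM is the default.
    print(parse_mem_sleep("s2idle [deep]"))
    # With the commit applied: suspend-to-idle is the default.
    print(parse_mem_sleep("[s2idle] deep"))
```

On a live system the input would come from reading the sysfs file itself; the strings are hard-coded here so the sketch runs anywhere.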
When the system resumes from suspend, the Wi-Fi adapter works and it now has ASPM L1.2 enabled. It remains this way even after suspending/resuming the system several times. (ASPM L1.2 is disabled from when the system boots until it is first suspended.)

RussianNeuroMancer, I noticed in the dmesg output you originally posted in October that your system was running BIOS version A16. Just to be sure, do you see the same behavior with the current BIOS version A17?

> Yes; with this commit removed on the Dell Latitude 7350:
> $ cat /sys/power/mem_sleep
> s2idle [deep]
Very interesting. Resume from deep sleep doesn't work for me at all on Dell 7140. Can't wakeup by power button nor by LID open event.
> I noticed in the dmesg output you originally posted in October that your
> system was running BIOS version A16. Just to be sure, do you see the same
> behavior with the current BIOS version A17?
I flashed BIOS A17 and I still see same behaviour.

(In reply to Emmanuel Grumbach from comment #1)
> We can't talk to our NIC anymore. We have tried many times to see what can
> be done from the driver side, and our HW guys told us that this is not
> driver related. There is noise on the PCI lines and because of that, we
> can't access our device.
As a test, I booted into the upstream Linux 4.20 kernel, but blacklisted the iwlmvm and iwlwifi modules from the kernel command line.
At the shell, I confirmed that these modules were not loaded, and that no network interface was created for the adapter. The output of "lspci -xxxvvv" showed that the adapter was present, but no "Kernel driver in use" line appeared. It also showed that ASPM L1.2 was disabled, and the configuration space was printed successfully.
I suspended and resumed the system; then I looked at the output of "lspci -xxxvvv" again. It was still able to print the configuration space of the adapter (versus attachment 280379 [details] where it could not); and it showed that ASPM L1.2 remained disabled.
I'm very underinformed about the interplay between the Wi-Fi driver and the PCIe subsystem in the kernel, but does this test isolate the root issue? Or would a similar one?

The wifi driver won't be messing with the config space. Only with its own registers that are memory mapped. Meaning that any change to the config space will be done solely by the PCI bus driver. This is why this bug is currently assigned to the maintainer of the PCI bus driver.

Created attachment 281117 [details]
ASPM patches
Finally coming back to this...
David, based on comment #24 and comment #25, I think the problem on the Dell Latitude 7350 should be fixed by this patch, which is on my pci/aspm branch and headed for v5.1. If you could verify, that would be great. This is intended to fix the latency differences you observed via lspci.

RussianNeuroMancer, you have a Dell Venue 11 Pro 7140 where iwlwifi stopped working after suspend/resume in v4.18-rc1; specifically, it was broken by 9ab105deb60f ("PCI/ASPM: Disable ASPM L1.2 Substate if we don't have LTR").
The patch attached to comment #48 *should* fix that problem. Can you confirm/deny?
Apparently there's also some problem related to deep sleep (comment #45). That sounds like something separate from this ASPM/LTR issue because the resume doesn't work at all. If that's still a problem, can you open a separate bugzilla for it?

> The patch attached to comment #48 *should* fix that problem. Can you
> confirm/deny?
I build 5.0rc6 with patches from comment #48 and I can confirm that WiFi adapter still works after wakeup from suspend freeze.
> Apparently there's also some problem related to deep sleep (comment #45).
> That sounds like something separate from this ASPM/LTR issue because the
> resume doesn't work at all. If that's still a problem, can you open a
> separate bugzilla for it?
Of course I can open separate bugreport about this, but I not sure if there is point on doing so as S3 is never supposed to work on this device - it doesn't work even with preinstalled Windows 8, and also doesn't work with Windows 10? Also 6f9db69ad93cd6ab77d5571cf748ff7cdcfb0285 happened exactly because S3 doesn't work on this tablet.

RussianNeuroMancer, thanks very much for testing the patch. If David confirms that it also fixes the problem on his machine, I *think* we can consider this issue resolved?
I don't really understand the S3 issue, so if it's resolved to your satisfaction, e.g., by 6f9db69ad93c ("ACPI / PM: Default to s2idle in all machines supporting LP S0"), I guess nothing else needs to be done there.

> I *think* we can consider this issue resolved?
I think so. Thank you for fixing this :)
> I don't really understand the S3 issue
I re-tested S3 - Dell Venue 11 Pro 7140 power off on S3 attempt (at least with Linux 5.0rc6). Maybe situation with Dell Latitude 7350 is better, and maybe 6f9db69ad93c should cover only 7140 and not change 7350 behaviour, but let's see what David can say about S3 stability on his tablet.
David, is S3 reliable for you on 7350? Check if /sys/power/mem_sleep is deep while testing this.

(In reply to Bjorn Helgaas from comment #48)
> David, based on comment #24 and comment #25, I think the problem on the Dell
> Latitude 7350 should be fixed by this patch, which is on my pci/aspm branch
> and headed for v5.1. If you could verify, that would be great. This is
> intended to fix the latency differences you observed via lspci.
Bjorn, yes with these patches Wi-Fi continues to work after suspend, even in s2idle mode. Thanks!
A couple of other things:
(In reply to Bjorn Helgaas from comment #51)
> RussianNeuroMancer, thanks very much for testing the patch. If David
> confirms that it also fixes the problem on his machine, I *think* we can
> consider this issue resolved?
I believe these two patches were necessary to make ASPM / LTR work correctly after suspend. However, I'm a bit concerned that this is simply going to mask a still unresolved problem lurking in either the firmware or driver for this Wi-Fi adapter. There's still no explanation as to why this affected Wi-Fi at all.
I realize that issues having similar symptoms in the past did not turn up any issues in the iwlwifi driver, but I feel like this possibility was completely dismissed here without even being considered. Bjorn, I'm interested in your thoughts on comment #45 -- does that test seem to indicate anything? Or does this not give meaningful information if there's no driver loaded for the PCI device?
> I don't really understand the S3 issue, so if it's resolved to your
> satisfaction, e.g., by 6f9db69ad93c ("ACPI / PM: Default to s2idle in all
> machines supporting LP S0"), I guess nothing else needs to be done there.
My issue with this patch is that it may have had a broader impact than intended. According to the commit message, it was targeted at the Dell Venue 11 Pro 7140; but it also changes the default sleep mode on the Dell Latitude 7350 too, which doesn't have the same issue and seems completely functional after suspend-to-RAM. Do you think this is enough to file an issue about it?

(In reply to David Ward from comment #53)
> Bjorn, I'm interested in your thoughts on comment #45 --
Whoops, I meant comment #46.

I have similar hardware, and attempted to verify this fix was working on a Dell Inspiron 15-3573 obtained in January 2019, with kernel 4.20.13. Unfortunately, after suspending, the iwlwifi is not available, and the Network Manager shows 'Device not ready' until the device is rebooted.
I understand that's a symptom, not the actual problem, but the error:
Timeout waiting for hardware access (CSR_GP_CNTRL 0xffffffff)
is still present.
I can run whatever commands are required to verify, before or after, as necessary, just let me know.
Below is the dmesg output that appears relevant..
--
user1@endor:~$ uname -a
Linux endor 4.20.13-042013-generic #201902270533 SMP Wed Feb 27 10:35:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
user1@endor:~$
---
[ 51.987725] wlp1s0: deauthenticating from 0c:80:63:f6:b9:87 by local choice (Reason: 3=DEAUTH_LEAVING)
[ 52.138708] PM: suspend entry (deep)
[ 52.138711] PM: Syncing filesystems ... done.
[ 52.161906] Freezing user space processes ... (elapsed 0.014 seconds) done.
[ 52.176795] OOM killer disabled.
[ 52.176796] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 52.178428] printk: Suspending console(s) (use no_console_suspend to debug)
[ 52.452781] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 52.452901] sd 0:0:0:0: [sda] Stopping disk
[ 52.789937] ACPI: Preparing to enter system sleep state S3
[ 52.810669] PM: Saving platform NVS memory
[ 52.810721] Disabling non-boot CPUs ...
[ 52.824725] IRQ 125: no longer affine to CPU1
[ 52.824733] IRQ 128: no longer affine to CPU1
[ 52.825759] smpboot: CPU 1 is now offline
[ 52.831050] ACPI: Low-level resume complete
[ 52.831181] PM: Restoring platform NVS memory
[ 52.835372] Enabling non-boot CPUs ...
[ 52.835487] x86: Booting SMP configuration:
[ 52.835489] smpboot: Booting Node 0 Processor 1 APIC 0x2
[ 52.836489] x86/cpu: Activated the Intel User Mode Instruction Prevention (UMIP) CPU feature
[ 52.836960] cache: parent cpu1 should not be sleeping
[ 52.837259] CPU1 is up
[ 52.841173] ACPI: Waking up from system sleep state S3
[ 52.945957] pci_raw_set_power_state: 11 callbacks suppressed
[ 52.945964] iwlwifi 0000:01:00.0: Refused to change power state, currently in D3
[ 53.035367] sd 0:0:0:0: [sda] Starting disk
[ 53.038632] ACPI: button: The lid device is not compliant to SW_LID.
[ 53.270465] usb 1-5: reset high-speed USB device number 2 using xhci_hcd
[ 53.350379] ata2: SATA link down (SStatus 4 SControl 300)
[ 53.510224] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 53.511922] ata1.00: configured for UDMA/133
[ 53.512117] ata1.00: Enabling discard_zeroes_data
[ 53.546562] usb 1-6: reset high-speed USB device number 3 using xhci_hcd
[ 53.822717] usb 1-7: reset full-speed USB device number 4 using xhci_hcd
[ 53.977731] acpi LNXPOWER:06: Turning OFF
[ 53.979085] acpi LNXPOWER:05: Turning OFF
[ 53.980348] acpi LNXPOWER:04: Turning OFF
[ 53.981505] acpi LNXPOWER:03: Turning OFF
[ 53.982383] acpi LNXPOWER:02: Turning OFF
[ 53.983232] acpi LNXPOWER:01: Turning OFF
[ 53.984122] acpi LNXPOWER:00: Turning OFF
[ 53.984498] OOM killer enabled.
[ 53.984500] Restarting tasks ...
[ 53.993058] Bluetooth: hci0: read Intel version: 370810011003110e00
[ 53.994296] Bluetooth: hci0: Intel Bluetooth firmware file: intel/ibt-hw-37.8.10-fw-1.10.3.11.e.bseq
[ 54.013081] done.
[ 54.013965] thermal thermal_zone6: failed to read out thermal zone (-61)
[ 54.052848] PM: suspend exit
[ 54.142200] IPv6: ADDRCONF(NETDEV_UP): wlp1s0: link is not ready
[ 54.166525] ------------[ cut here ]------------
[ 54.166530] Timeout waiting for hardware access (CSR_GP_CNTRL 0xffffffff)
[ 54.166598] WARNING: CPU: 1 PID: 690 at drivers/net/wireless/intel/iwlwifi/pcie/trans.c:2003 iwl_trans_pcie_grab_nic_access+0x1ee/0x220 [iwlwifi]
[ 54.166599] Modules linked in: ccm hid_multitouch hid_generic spi_pxa2xx_platform snd_soc_skl 8250_dw snd_soc_hdac_hda snd_hda_ext_core snd_soc_skl_ipc intel_rapl i2c_designware_platform i2c_designware_core snd_soc_sst_ipc snd_soc_sst_dsp intel_telemetry_pltdrv snd_soc_acpi_intel_match snd_soc_acpi intel_punit_ipc intel_telemetry_core intel_pmc_ipc snd_soc_core x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi coretemp snd_compress kvm_intel ac97_bus snd_pcm_dmaengine snd_hda_codec_realtek snd_hda_codec_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel dell_laptop snd_hda_codec dell_smm_hwmon snd_hda_core snd_hwdep nls_iso8859_1 aesni_intel snd_pcm aes_x86_64 crypto_simd cryptd glue_helper intel_cstate intel_rapl_perf snd_seq_dummy snd_seq_oss uvcvideo arc4 videobuf2_vmalloc videobuf2_memops snd_seq_midi videobuf2_v4l2 videobuf2_common snd_seq_midi_event btusb snd_rawmidi joydev btrtl input_leds videodev iwlmvm rtsx_usb_ms btbcm media memstick btintel serio_raw
[ 54.166626] mac80211 bluetooth dell_wmi dell_smbios dcdbas iwlwifi ecdh_generic wmi_bmof dell_wmi_descriptor snd_seq snd_seq_device snd_timer idma64 virt_dma intel_lpss_pci intel_lpss snd cfg80211 mei_me mei soundcore processor_thermal_device intel_soc_dts_iosf mac_hid intel_hid int3400_thermal sparse_keymap int3403_thermal acpi_thermal_rel int3406_thermal int340x_thermal_zone sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq libcrc32c dm_mirror dm_region_hash dm_log rtsx_usb_sdmmc rtsx_usb i915 kvmgt vfio_mdev mdev vfio_iommu_type1 vfio kvm irqbypass i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm drm_panel_orientation_quirks cfbfillrect cfbimgblt cfbcopyarea i2c_i801 sdhci_pci psmouse cqhci sdhci i2c_hid wmi fb ahci libahci fbdev hid i2c_core video pinctrl_geminilake pinctrl_intel
[ 54.166664] CPU: 1 PID: 690 Comm: NetworkManager Not tainted 4.20.13-042013-generic #201902270533
[ 54.166665] Hardware name: Dell Inc. Inspiron 15-3573/0XT9M4, BIOS 1.5.0 10/03/2018
[ 54.166674] RIP: 0010:iwl_trans_pcie_grab_nic_access+0x1ee/0x220 [iwlwifi]
[ 54.166676] Code: 42 e6 49 8d 57 08 bf 00 20 00 00 e8 9c 02 eb e4 e9 36 ff ff ff 89 c6 48 c7 c7 a8 4f a2 c0 c6 05 f2 88 02 00 01 e8 14 27 e9 e4 <0f> 0b e9 f1 fe ff ff 48 8b 7b 30 48 c7 c1 10 50 a2 c0 31 d2 31 f6
[ 54.166676] RSP: 0018:ffffa97900de7170 EFLAGS: 00010082
[ 54.166678] RAX: 0000000000000000 RBX: ffff94c5b9530018 RCX: 0000000000000006
[ 54.166678] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff94c5bba96440
[ 54.166679] RBP: ffffa97900de7198 R08: 0000000000000001 R09: 000000000000039b
[ 54.166680] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000000
[ 54.166681] R13: ffff94c5b953a2bc R14: ffffa97900de71a8 R15: 00000000ffffffff
[ 54.166682] FS: 00007f2887b8ffc0(0000) GS:ffff94c5bba80000(0000) knlGS:0000000000000000
[ 54.166683] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 54.166684] CR2: 00007fbcd56f8be8 CR3: 0000000176048000 CR4: 0000000000340ee0
[ 54.166685] Call Trace:
[ 54.166697] iwl_write_prph+0x3d/0x90 [iwlwifi]
[ 54.166705] iwl_pcie_apm_init+0x1db/0x240 [iwlwifi]
[ 54.166713] iwl_trans_pcie_start_hw+0x52/0x1d0 [iwlwifi]
[ 54.166724] iwl_mvm_up+0x3c/0xa80 [iwlmvm]
[ 54.166730] ? skb_dequeue+0x59/0x70
[ 54.166733] ? wireless_nlevent_flush+0x78/0x80
[ 54.166741] __iwl_mvm_mac_start+0x29b/0x300 [iwlmvm]
[ 54.166749] iwl_mvm_mac_start+0x4c/0x130 [iwlmvm]
[ 54.166752] ? inetdev_event+0x47/0x500
[ 54.166754] ? __fib6_clean_all+0x75/0xa0
[ 54.166782] drv_start+0x48/0x100 [mac80211]
[ 54.166801] ieee80211_do_open+0x434/0x840 [mac80211]
[ 54.166819] ieee80211_open+0x52/0x60 [mac80211]
[ 54.166822] __dev_open+0xd5/0x170
[ 54.166824] __dev_change_flags+0x184/0x1f0
[ 54.166826] dev_change_flags+0x27/0x60
[ 54.166828] do_setlink+0x30e/0xe10
[ 54.166831] ? __nla_parse+0xf1/0x120
[ 54.166833] ? nla_parse+0x11/0x20
[ 54.166835] ? inet6_validate_link_af+0x4f/0x70
[ 54.166836] ? __nla_parse+0x38/0x120
[ 54.166839] rtnl_newlink+0x7a8/0x900
[ 54.166842] ? cpumask_next_and+0x1e/0x20
[ 54.166844] ? cpumask_next+0x1b/0x20
[ 54.166845] ? __snmp6_fill_stats64.isra.57+0xf6/0x120
[ 54.166848] ? pskb_expand_head+0x73/0x2f0
[ 54.166852] ? __kmalloc_node_track_caller+0xcb/0x2b0
[ 54.166853] ? pskb_expand_head+0x73/0x2f0
[ 54.166855] ? __kmalloc_reserve.isra.51+0x31/0x90
[ 54.166857] ? security_sock_rcv_skb+0x2f/0x50
[ 54.166859] ? skb_queue_tail+0x43/0x50
[ 54.166862] ? __netlink_sendskb+0x56/0x70
[ 54.166863] ? netlink_unicast+0x212/0x260
[ 54.166867] ? security_capset+0x30/0x70
[ 54.166871] ? ns_capable_common+0x6c/0x70
[ 54.166872] ? ns_capable+0x13/0x20
[ 54.166874] rtnetlink_rcv_msg+0x213/0x300
[ 54.166876] ? rtnl_calcit.isra.31+0x100/0x100
[ 54.166878] netlink_rcv_skb+0x52/0x130
[ 54.166880] rtnetlink_rcv+0x15/0x20
[ 54.166881] netlink_unicast+0x1a4/0x260
[ 54.166882] netlink_sendmsg+0x20d/0x3c0
[ 54.166885] sock_sendmsg+0x3e/0x50
[ 54.166886] ___sys_sendmsg+0x295/0x2f0
[ 54.166888] ? rtnl_unlock+0xe/0x10
[ 54.166890] ? dev_forward_change+0x140/0x140
[ 54.166893] ? sysctl_head_finish.part.27+0x28/0x40
[ 54.166895] ? proc_sys_call_handler+0xc9/0x100
[ 54.166898] ? __fget_light+0x54/0x60
[ 54.166899] __sys_sendmsg+0x5c/0xa0
[ 54.166901] __x64_sys_sendmsg+0x1f/0x30
[ 54.166904] do_syscall_64+0x5a/0x110
[ 54.166907] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 54.166909] RIP: 0033:0x7f288536f607
[ 54.166911] Code: 44 00 00 41 54 55 41 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 0b ea ff ff 44 89 e2 41 89 c0 48 89 ee 89 df b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 44 ea ff ff 48
[ 54.166912] RSP: 002b:00007fffd3e2b7c0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[ 54.166913] RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f288536f607
[ 54.166913] RDX: 0000000000000000 RSI: 00007fffd3e2b820 RDI: 0000000000000007
[ 54.166914] RBP: 00007fffd3e2b820 R08: 0000000000000000 R09: 00007f28850da1b0
[ 54.166915] R10: 0000562ed3393010 R11: 0000000000000293 R12: 0000000000000000
[ 54.166916] R13: 00007fffd3e2b820 R14: 00007fffd3e2b9a4 R15: 0000000000000000
[ 54.166917] ---[ end trace ccf34690ffa429e2 ]---

Created attachment 281439 [details]
lspci -xxxvvv before suspend (Linux 4.20.13, on Dell Inspiron 15-3573)
Created attachment 281441 [details]
lspci -xxxvvv after suspend (Linux 4.20.13, on Dell Inspiron 15-3573)
vjek, the patch from comment #48 is not upstream yet and is not in v4.20.13 either (AFAIK). So I think it is expected that you will still see this problem. If you have the time and interest, you could test the next-20190301 kernel, which does include those patches, or you could apply them to v4.20.13 and test that (the patches might not apply completely cleanly to v4.20.13, but it should be pretty close).

Understood, I'll try the next-20190301 kernel and advise on the results.

Unfortunately, it's not good news. dmesg output and two lspci attachments, but the card is still not available after a suspend, with 5.0.0-rc8-next-20190301. No crash, though, so I guess that's positive.
--
[20190301 15:11:56] user1@endor ~ :uname -a
Linux endor 5.0.0-rc8-next-20190301 #2 SMP Fri Mar 1 13:46:40 MST 2019 x86_64 x86_64 x86_64 GNU/Linux
--
[ 628.304819] wlp1s0: deauthenticating from 0c:80:63:f6:b9:87 by local choice (Reason: 3=DEAUTH_LEAVING)
[ 628.355922] PM: suspend entry (deep)
[ 628.355925] PM: Syncing filesystems ... done.
[ 628.368196] Freezing user space processes ... (elapsed 0.008 seconds) done.
[ 628.376601] OOM killer disabled.
[ 628.376619] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 628.378247] printk: Suspending console(s) (use no_console_suspend to debug)
[ 628.757967] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 628.758152] sd 0:0:0:0: [sda] Stopping disk
[ 629.094746] ACPI: Preparing to enter system sleep state S3
[ 629.116032] PM: Saving platform NVS memory
[ 629.116089] Disabling non-boot CPUs ...
[ 629.116472] IRQ 125: no longer affine to CPU1
[ 629.117484] smpboot: CPU 1 is now offline
[ 629.122342] ACPI: Low-level resume complete
[ 629.122457] PM: Restoring platform NVS memory
[ 629.127359] Enabling non-boot CPUs ...
[ 629.127463] x86: Booting SMP configuration:
[ 629.127466] smpboot: Booting Node 0 Processor 1 APIC 0x2
[ 629.129552] CPU1 is up
[ 629.134906] ACPI: Waking up from system sleep state S3
[ 629.246441] iwlwifi 0000:01:00.0: Refused to change power state, currently in D3
[ 629.333452] sd 0:0:0:0: [sda] Starting disk
[ 629.335151] ACPI: button: The lid device is not compliant to SW_LID.
[ 629.570692] usb 1-5: reset high-speed USB device number 2 using xhci_hcd
[ 629.649798] ata2: SATA link down (SStatus 4 SControl 300)
[ 629.810705] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 629.812612] ata1.00: configured for UDMA/133
[ 629.812789] ata1.00: Enabling discard_zeroes_data
[ 629.847120] usb 1-7: reset full-speed USB device number 4 using xhci_hcd
[ 630.123080] usb 1-6: reset high-speed USB device number 3 using xhci_hcd
[ 630.276570] acpi LNXPOWER:06: Turning OFF
[ 630.277921] acpi LNXPOWER:05: Turning OFF
[ 630.279241] acpi LNXPOWER:04: Turning OFF
[ 630.280418] acpi LNXPOWER:03: Turning OFF
[ 630.281537] acpi LNXPOWER:02: Turning OFF
[ 630.282671] acpi LNXPOWER:01: Turning OFF
[ 630.283789] acpi LNXPOWER:00: Turning OFF
[ 630.284094] OOM killer enabled.
[ 630.284095] Restarting tasks ... done.
[ 630.306666] Bluetooth: hci0: read Intel version: 370810011003110e00
[ 630.307093] thermal thermal_zone6: failed to read out thermal zone (-61)
[ 630.310187] Bluetooth: hci0: Intel Bluetooth firmware file: intel/ibt-hw-37.8.10-fw-1.10.3.11.e.bseq
[ 630.343231] PM: suspend exit
[ 630.447435] iwlwifi 0000:01:00.0: Error, can not clear persistence bit
[ 630.461934] iwlwifi 0000:01:00.0: Error, can not clear persistence bit
[ 630.462309] iwlwifi 0000:01:00.0: Error, can not clear persistence bit
[ 630.635507] Bluetooth: hci0: Intel firmware patch completed and activated
[ 634.074609] dell_wmi: Unknown WMI event type 0x12
[ 634.577466] dell_wmi: Unknown WMI event type 0x12
[ 641.025959] iwlwifi 0000:01:00.0: Error, can not clear persistence bit
[ 641.026425] iwlwifi 0000:01:00.0: Error, can not clear persistence bit
... (these last two Error messages repeat every 10 seconds thereafter)

Created attachment 281447 [details]
lspci -xxxvvv before suspend (Linux 5.0.0-rc8, on Dell Inspiron 15-3573)
Created attachment 281449 [details]
lspci -xxxvvv after suspend (Linux 5.0.0-rc8, on Dell Inspiron 15-3573)
@vjek This is a different issue. Please open a new bug and add linuxwifi@intel.com to the bug.

Sorry, I started composing this long ago but got distracted until now.

(In reply to David Ward from comment #54)
> (In reply to David Ward from comment #53)
> > Bjorn, I'm interested in your thoughts on comment #45 --
>
> Whoops, I meant comment #46.

If I understand correctly, you demonstrated that with the identical kernel:

  - if iwlwifi/iwlmvm are not loaded, the device remains accessible after suspend/resume
  - if they are loaded, the device stops responding after suspend/resume

I agree, that is a very interesting test.

What does /sys/module/pcie_aspm/parameters/policy contain?

In general the PCI core does ASPM configuration regardless of whether a driver is bound to the device, but it only calls pcie_aspm_powersave_config_link() when a driver claims a device. So it's possible there's a problem there, or it's possible there's something in iwlwifi, e.g., in

  iwl_pcie_apm_init()
  iwl_pcie_apm_config()
  iwl_mvm_config_ltr()

The only thing that iwlwifi does with the config space is to read the ASPM configuration to know if L1 is enabled. If it is, it informs the firmware that it can use L1.

	pcie_capability_read_word(trans_pcie->pci_dev, PCI_EXP_LNKCTL, &lctl);
	if (lctl & PCI_EXP_LNKCTL_ASPM_L1)
		iwl_set_bit(trans, CSR_GIO_REG, CSR_GIO_REG_VAL_L0S_ENABLED);
	else
		iwl_clear_bit(trans, CSR_GIO_REG, CSR_GIO_REG_VAL_L0S_ENABLED);
	trans->pm_support = !(lctl & PCI_EXP_LNKCTL_ASPM_L0S);

	pcie_capability_read_word(trans_pcie->pci_dev, PCI_EXP_DEVCTL2, &cap);
	trans->ltr_enabled = cap & PCI_EXP_DEVCTL2_LTR_EN;
	IWL_DEBUG_POWER(trans, "L1 %sabled - LTR %sabled\n",
			(lctl & PCI_EXP_LNKCTL_ASPM_L1) ? "En" : "Dis",
			trans->ltr_enabled ? "En" : "Dis");

We don't write anything to the config space.

(In reply to Emmanuel Grumbach from comment #65)
> The only thing that iwlwifi does [skip]
> We don't write anything to the config space.

It seems we haven't been able to find a volunteer to fix this bug for 8 months. And since it's reproducible in 5.1.7-1-default:

> Jun 25 22:34:15 my.local kernel: iwlwifi 0000:3b:00.0: Error, can not clear
> persistence bit

it didn't get fixed magically by random refactoring or improvement either.

Still the same in 2021 with 5.11.7-200.fc33.x86_64.

@gustavo, as described in comment 63, what you are experiencing is different from this bug. I have opened bug 212457 for your issue. Can you please visit it and upload the output of 'dmesg' (after resuming from suspend)? Then please follow up with other requests for details/testing in that bug. Thank you.
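
For anyone comparing the attached lspci dumps against the iwlwifi snippet quoted earlier in this thread, the same decision logic can be tried outside the kernel. The following is only a minimal userspace sketch, not driver code: it takes raw Link Control and Device Control 2 register values and reports what the quoted code would conclude from them. The three bit masks are the ones defined in include/uapi/linux/pci_regs.h; the names `aspm_view`, `decode_aspm` and `format_aspm` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bit masks as defined in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_LNKCTL_ASPM_L0S 0x0001 /* Link Control: L0s enabled */
#define PCI_EXP_LNKCTL_ASPM_L1  0x0002 /* Link Control: L1 enabled */
#define PCI_EXP_DEVCTL2_LTR_EN  0x0400 /* Device Control 2: LTR enabled */

/* Hypothetical container for the values the quoted snippet derives. */
struct aspm_view {
    bool l1_enabled;  /* selects iwl_set_bit() vs iwl_clear_bit() on CSR_GIO_REG */
    bool pm_support;  /* mirrors trans->pm_support: true when L0s is NOT enabled */
    bool ltr_enabled; /* mirrors trans->ltr_enabled */
};

/* Apply the same bit tests as the quoted snippet to raw register values. */
static struct aspm_view decode_aspm(uint16_t lnkctl, uint16_t devctl2)
{
    struct aspm_view v;

    v.l1_enabled = (lnkctl & PCI_EXP_LNKCTL_ASPM_L1) != 0;
    v.pm_support = !(lnkctl & PCI_EXP_LNKCTL_ASPM_L0S);
    v.ltr_enabled = (devctl2 & PCI_EXP_DEVCTL2_LTR_EN) != 0;
    return v;
}

/* Build the same text as the IWL_DEBUG_POWER() line, without the newline. */
static int format_aspm(const struct aspm_view *v, char *buf, size_t len)
{
    assert(v != NULL && buf != NULL && len > 0);
    return snprintf(buf, len, "L1 %sabled - LTR %sabled",
                    v->l1_enabled ? "En" : "Dis",
                    v->ltr_enabled ? "En" : "Dis");
}
```

As a usage example, `decode_aspm(0x0042, 0x0400)` reports L1 enabled, `pm_support` true and LTR enabled. Those two register values are made up for the example and are not taken from the attachments in this bug. Note that this only reproduces what the driver reads; as stated above, the driver does not write to the config space.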