Bug 215129
Summary: | Linux kernel hangs during power down | ||
---|---|---|---|
Product: | Networking | Reporter: | Martin Stolpe (martin.stolpe) |
Component: | Other | Assignee: | Stephen Hemminger (stephen) |
Status: | NEW --- | ||
Severity: | normal | CC: | aperotti, benjamin, enricod, hkallweit1, martin.stolpe, northernfreevatar, regressions |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 5.15 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: | Kernel log after timeout occurred |
Description
Martin Stolpe
2021-11-24 21:14:53 UTC
Wireless card is a QCA6174 based card with ath10k_pci driver.

CC'ing the author of the patch.

The hint to ath10k_pci is misleading here, the actual issue is in the interaction between net core and Intel network driver (igb). In a nutshell the issue is:
- The core changes result in network driver's runtime_resume() being called from a context where RTNL is held.
- This conflicts with a few Intel drivers taking RTNL in their resume path.
This has been initially discussed e.g. here, but there's no tangible result yet.
https://lore.kernel.org/lkml/20210809032809.1224002-1-acelan.kao@canonical.com/#t
What you can do as workaround for the time being: Disable Runtime Power Management for the network adapter:
echo on > /sys/class/net/<interface>/device/power/control

I've blacklisted the igb driver and the problem doesn't occur. Thanks!

Right, blacklisting the driver also works, however just for people who don't need the wired network.

Following patch should solve the issue, however I have no test hw. Could you please test it (after removing igb blacklisting)?
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index dd208930f..8073cce73 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -9254,7 +9254,7 @@ static int __maybe_unused igb_suspend(struct device *dev)
 	return __igb_shutdown(to_pci_dev(dev), NULL, 0);
 }
 
-static int __maybe_unused igb_resume(struct device *dev)
+static int __maybe_unused __igb_resume(struct device *dev, bool rpm)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct net_device *netdev = pci_get_drvdata(pdev);
@@ -9297,17 +9297,24 @@ static int __maybe_unused igb_resume(struct device *dev)
 
 	wr32(E1000_WUS, ~0);
 
-	rtnl_lock();
+	if (!rpm)
+		rtnl_lock();
 	if (!err && netif_running(netdev))
 		err = __igb_open(netdev, true);
 
 	if (!err)
 		netif_device_attach(netdev);
-	rtnl_unlock();
+	if (!rpm)
+		rtnl_unlock();
 
 	return err;
 }
 
+static int __maybe_unused igb_resume(struct device *dev)
+{
+	return __igb_resume(dev, false);
+}
+
 static int __maybe_unused igb_runtime_idle(struct device *dev)
 {
 	struct net_device *netdev = dev_get_drvdata(dev);
@@ -9326,7 +9333,7 @@ static int __maybe_unused igb_runtime_suspend(struct device *dev)
 
 static int __maybe_unused igb_runtime_resume(struct device *dev)
 {
-	return igb_resume(dev);
+	return __igb_resume(dev, true);
 }
 
 static void igb_shutdown(struct pci_dev *pdev)
@@ -9442,7 +9449,7 @@ static pci_ers_result_t igb_io_error_detected(struct pci_dev *pdev,
  * @pdev: Pointer to PCI device
  *
  * Restart the card from scratch, as if from a cold-boot. Implementation
- * resembles the first-half of the igb_resume routine.
+ * resembles the first-half of the __igb_resume routine.
  **/
 static pci_ers_result_t igb_io_slot_reset(struct pci_dev *pdev)
 {
@@ -9482,7 +9489,7 @@ static pci_ers_result_t igb_io_slot_reset(struct pci_dev *pdev)
  *
  * This callback is called when the error recovery driver tells us that
  * its OK to resume normal operation. Implementation resembles the
- * second-half of the igb_resume routine.
+ * second-half of the __igb_resume routine.
  */
 static void igb_io_resume(struct pci_dev *pdev)
 {
-- 
2.34.1

I've tried the patch with 5.15.5 and the problem does not occur. Thank you!

Just a quick follow-up: I can confirm that this patch fixes the issue on my hardware as well, thank you!
Is there any chance that this gets merged into the main kernel soon? So far I haven't seen this patch in linux or linux-next and the issue is really rather annoying :).
Cheers,
Benjamin

The patch is on its way via the Intel network driver tree:
https://kernel.googlesource.com/pub/scm/linux/kernel/git/tnguy/net-queue/+/refs/heads/dev-queue

(In reply to Heiner Kallweit from comment #5)
> Following patch should solve the issue, however I have no test hw. Could you
> please test it (after removing igb blacklisting)?

Tested on 5.15.7 with igb nics: worked like a charm, thanks!

Hi, this is your Linux kernel regression tracker speaking.

(In reply to Heiner Kallweit from comment #8)
> The patch is on its way via the Intel network driver tree:
> https://kernel.googlesource.com/pub/scm/linux/kernel/git/tnguy/net-queue/+/
> refs/heads/dev-queue

thx for the patch, but what is taking this patch so long to get upstreamed (which is a requirement to get it backported to stable)? Or was it merged and I just missed it? Or were problems found?

Reminder, it's a regression in 5.15.y we are talking about. This is the only Linux version currently distributed by kernel.org that is available for users that need something from 5.11 or later and want a stable and secure kernel at the same time.

*** Bug 215359 has been marked as a duplicate of this bug. ***

Bug is still present in 5.11, some igb fixes made it through but not this one

I meant 5.15.11 just released :)

I've same issue but with "igc".

(In reply to Enrico Demarin from comment #12)
> Bug is still present in 5.15.11, some igb fixes made it through but not this
> one

I recently poked the developers and the fix is now on its way:
https://lore.kernel.org/netdev/b4be04bbd6a20855526b961ef80669bd2647564c.camel@intel.com/

(In reply to Mikhail Kondrashov from comment #14)
> I've same issue but with "igc".

Related, but different and fixed by:
https://lore.kernel.org/netdev/20211214003949.666642-1-vinicius.gomes@intel.com/

Sadly this patch is not on the way yet it seems :-/ /me grumbles

I confirm this is fixed in 5.15.12
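
For readers who want to see the locking problem from this thread in isolation, here is a rough userspace sketch. It is not kernel code: fake_rtnl, fake_resume() and resume_with_rtnl_held() are hypothetical names made up for illustration, and a POSIX error-checking mutex stands in for RTNL so that taking the lock twice from the same thread reports EDEADLK instead of hanging the way the kernel did. The rpm flag follows the shape of __igb_resume(dev, rpm) in the patch above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>	/* EDEADLK: what a recursive lock attempt returns here */
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for RTNL (assumption, not the kernel's lock). */
static pthread_mutex_t fake_rtnl;

/* Set up the mutex as error-checking so self-deadlock is reported. */
static void fake_rtnl_init(void)
{
	pthread_mutexattr_t attr;
	int rc;

	rc = pthread_mutexattr_init(&attr);
	assert(rc == 0);
	rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	assert(rc == 0);
	rc = pthread_mutex_init(&fake_rtnl, &attr);
	assert(rc == 0);
	pthread_mutexattr_destroy(&attr);
	(void)rc;
}

/*
 * Same shape as __igb_resume(dev, rpm) in the patch: take the lock only
 * when not called from the runtime PM path (rpm == false).
 */
static int fake_resume(bool rpm)
{
	int err;

	if (!rpm) {
		err = pthread_mutex_lock(&fake_rtnl);
		if (err)
			return err;	/* EDEADLK if this thread already holds it */
	}

	/* ... device bring-up would happen here ... */

	if (!rpm)
		pthread_mutex_unlock(&fake_rtnl);

	return 0;
}

/* Simulates the net core calling into resume while already holding RTNL. */
static int resume_with_rtnl_held(bool rpm)
{
	int err;

	pthread_mutex_lock(&fake_rtnl);
	err = fake_resume(rpm);
	pthread_mutex_unlock(&fake_rtnl);

	return err;
}
```

After fake_rtnl_init(), resume_with_rtnl_held(false) models the pre-patch behaviour and returns EDEADLK, while resume_with_rtnl_held(true) models the patched runtime resume path and returns 0. A plain fake_resume(false) without the lock held, like a system resume, also returns 0.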