Bug 215129

Summary: Linux kernel hangs during power down
Product: Networking
Reporter: Martin Stolpe (martin.stolpe)
Component: Other
Assignee: Stephen Hemminger (stephen)
Status: NEW
Severity: normal
CC: aperotti, benjamin, enricod, hkallweit1, martin.stolpe, northernfreevatar, regressions
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.15
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: Kernel log after timeout occured

Description Martin Stolpe 2021-11-24 21:14:53 UTC
Created attachment 299703
Kernel log after timeout occured

On my system the kernel is waiting for a task during shutdown which doesn't complete.

The commit which causes this behavior is: [f32a213765739f2a1db319346799f130a3d08820] ethtool: runtime-resume netdev parent before ethtool ioctl ops

This bug also causes the system to become unresponsive after starting Steam: https://steamcommunity.com/app/221410/discussions/2/3194736442566303600/
Comment 1 Martin Stolpe 2021-11-24 22:05:10 UTC
Wireless card is a QCA6174 based card with ath10k_pci driver.
Comment 2 Artem S. Tashkinov 2021-11-25 09:15:57 UTC
CC'ing the author of the patch.
Comment 3 Heiner Kallweit 2021-11-25 10:06:05 UTC
The hint to ath10k_pci is misleading here; the actual issue is in the interaction between the net core and the Intel network driver (igb).

In a nutshell the issue is:
- The core changes result in network driver's runtime_resume() being called from a context where RTNL is held.
- This conflicts with a few Intel drivers taking RTNL in their resume path.
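
The conflict can be illustrated with a small userspace sketch. This is only a model, not kernel code: `model_lock`, `model_driver_resume()` and `model_core_ioctl()` are made-up names, and the lock is a single-threaded stand-in for RTNL that returns -EDEADLK on a second acquisition, where a real non-recursive mutex would simply block forever.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of a non-recursive lock such as RTNL.
 * A real mutex would block forever on a second acquisition from the same
 * context; the model returns -EDEADLK instead so the problem is observable. */
struct model_lock {
	bool held;
};

static int model_lock_acquire(struct model_lock *l)
{
	if (l->held)
		return -EDEADLK;
	l->held = true;
	return 0;
}

static void model_lock_release(struct model_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Models a driver resume callback that takes the lock itself. */
static int model_driver_resume(struct model_lock *rtnl)
{
	int err = model_lock_acquire(rtnl);

	if (err)
		return err;
	/* ... the device would be reopened here ... */
	model_lock_release(rtnl);
	return 0;
}

/* Models a core path that resumes the device while already holding the lock. */
static int model_core_ioctl(struct model_lock *rtnl)
{
	int err = model_lock_acquire(rtnl);

	if (err)
		return err;
	err = model_driver_resume(rtnl);
	model_lock_release(rtnl);
	return err;
}
```

Calling `model_driver_resume()` on its own succeeds, while `model_core_ioctl()` reports the nested acquisition. In the kernel that second case is the hang seen at shutdown.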

This was initially discussed here, for example, but there's no tangible result yet.
https://lore.kernel.org/lkml/20210809032809.1224002-1-acelan.kao@canonical.com/#t

What you can do as a workaround for the time being:
Disable Runtime Power Management for the network adapter:
echo on > /sys/class/net/<interface>/device/power/control
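
If the workaround needs to be applied from a program rather than a shell, the same write can be sketched in C. The helper names `rpm_control_path()` and `rpm_disable()` are hypothetical, the interface name is whatever your system uses, and the write needs root. Writing "on" keeps the device powered; "auto" would re-enable runtime PM.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path of the runtime PM control file for an interface.
 * Returns 0 on success, -1 on an invalid name or a too-small buffer. */
static int rpm_control_path(const char *ifname, char *buf, size_t len)
{
	int n;

	assert(ifname != NULL && buf != NULL);
	/* Interface names are never empty and never contain '/'. */
	if (ifname[0] == '\0' || strchr(ifname, '/'))
		return -1;
	n = snprintf(buf, len, "/sys/class/net/%s/device/power/control", ifname);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "on" to the control file, i.e. disable runtime PM for the device.
 * Returns 0 on success, -1 if the file cannot be opened or written. */
static int rpm_disable(const char *ifname)
{
	char path[256];
	FILE *f;
	int ok;

	if (rpm_control_path(ifname, path, sizeof(path)))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs("on\n", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

The setting does not persist across reboots, so it has to be reapplied on every boot until a fixed kernel is installed.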
Comment 4 Martin Stolpe 2021-11-25 21:55:22 UTC
I've blacklisted the igb driver and the problem doesn't occur. Thanks!
Comment 5 Heiner Kallweit 2021-11-27 10:19:33 UTC
Right, blacklisting the driver also works, however just for people who don't need the wired network.
Following patch should solve the issue, however I have no test hw. Could you please test it (after removing igb blacklisting)?


diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index dd208930f..8073cce73 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -9254,7 +9254,7 @@ static int __maybe_unused igb_suspend(struct device *dev)
 	return __igb_shutdown(to_pci_dev(dev), NULL, 0);
 }
 
-static int __maybe_unused igb_resume(struct device *dev)
+static int __maybe_unused __igb_resume(struct device *dev, bool rpm)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct net_device *netdev = pci_get_drvdata(pdev);
@@ -9297,17 +9297,24 @@ static int __maybe_unused igb_resume(struct device *dev)
 
 	wr32(E1000_WUS, ~0);
 
-	rtnl_lock();
+	if (!rpm)
+		rtnl_lock();
 	if (!err && netif_running(netdev))
 		err = __igb_open(netdev, true);
 
 	if (!err)
 		netif_device_attach(netdev);
-	rtnl_unlock();
+	if (!rpm)
+		rtnl_unlock();
 
 	return err;
 }
 
+static int __maybe_unused igb_resume(struct device *dev)
+{
+	return __igb_resume(dev, false);
+}
+
 static int __maybe_unused igb_runtime_idle(struct device *dev)
 {
 	struct net_device *netdev = dev_get_drvdata(dev);
@@ -9326,7 +9333,7 @@ static int __maybe_unused igb_runtime_suspend(struct device *dev)
 
 static int __maybe_unused igb_runtime_resume(struct device *dev)
 {
-	return igb_resume(dev);
+	return __igb_resume(dev, true);
 }
 
 static void igb_shutdown(struct pci_dev *pdev)
@@ -9442,7 +9449,7 @@ static pci_ers_result_t igb_io_error_detected(struct pci_dev *pdev,
  *  @pdev: Pointer to PCI device
  *
  *  Restart the card from scratch, as if from a cold-boot. Implementation
- *  resembles the first-half of the igb_resume routine.
+ *  resembles the first-half of the __igb_resume routine.
  **/
 static pci_ers_result_t igb_io_slot_reset(struct pci_dev *pdev)
 {
@@ -9482,7 +9489,7 @@ static pci_ers_result_t igb_io_slot_reset(struct pci_dev *pdev)
  *
  *  This callback is called when the error recovery driver tells us that
  *  its OK to resume normal operation. Implementation resembles the
- *  second-half of the igb_resume routine.
+ *  second-half of the __igb_resume routine.
  */
 static void igb_io_resume(struct pci_dev *pdev)
 {
-- 
2.34.1
Comment 6 Martin Stolpe 2021-11-29 20:53:59 UTC
I've tried the patch with 5.15.5 and the problem does not occur. Thank you!
Comment 7 Benjamin Radel 2021-12-07 10:38:24 UTC
Just a quick follow-up: I can confirm that this patch fixes the issue on my hardware as well, thank you! Is there any chance that this gets merged into the main kernel soon? So far I haven't seen this patch in linux or linux-next and the issue is really rather annoying :).

Cheers, Benjamin
Comment 8 Heiner Kallweit 2021-12-07 11:38:00 UTC
The patch is on its way via the Intel network driver tree:
https://kernel.googlesource.com/pub/scm/linux/kernel/git/tnguy/net-queue/+/refs/heads/dev-queue
Comment 9 aperotti 2021-12-10 15:01:50 UTC
(In reply to Heiner Kallweit from comment #5)
> Following patch should solve the issue, however I have no test hw. Could you
> please test it (after removing igb blacklisting)?

Tested on 5.15.7 with igb nics: worked like a charm, thanks!
Comment 10 The Linux kernel's regression tracker (Thorsten Leemhuis) 2021-12-17 09:55:39 UTC
Hi, this is your Linux kernel regression tracker speaking.

(In reply to Heiner Kallweit from comment #8)
> The patch is on its way via the Intel network driver tree:
> https://kernel.googlesource.com/pub/scm/linux/kernel/git/tnguy/net-queue/+/
> refs/heads/dev-queue

thx for the patch, but what is taking this patch so long to get upstreamed (which is a requirement to get it backported to stable)? Or was it merged and I just missed it? Or were problems found?

Reminder, it's a regression in 5.15.y we are talking about. This is the only Linux version currently distributed by kernel.org that is available for users that need something from 5.11 or later and want a stable and secure kernel at the same time.
Comment 11 Enrico Demarin 2021-12-19 20:59:11 UTC
*** Bug 215359 has been marked as a duplicate of this bug. ***
Comment 12 Enrico Demarin 2021-12-22 16:29:42 UTC
Bug is still present in 5.11, some igb fixes made it through but not this one
Comment 13 Enrico Demarin 2021-12-22 16:30:16 UTC
I meant 5.15.11 just released :)
Comment 14 Mikhail Kondrashov 2021-12-23 03:06:23 UTC
I've same issue but with "igc".
Comment 15 The Linux kernel's regression tracker (Thorsten Leemhuis) 2021-12-23 06:51:15 UTC
(In reply to Enrico Demarin from comment #12)
> Bug is still present in 5.15.11, some igb fixes made it through but not this
> one

I recently poked the developers and the fix is now on its way:
https://lore.kernel.org/netdev/b4be04bbd6a20855526b961ef80669bd2647564c.camel@intel.com/

(In reply to Mikhail Kondrashov from comment #14)
> I've same issue but with "igc".

Related, but different and fixed by:
https://lore.kernel.org/netdev/20211214003949.666642-1-vinicius.gomes@intel.com/

Sadly this patch is not on the way yet it seems :-/
/me grumbles
Comment 16 Enrico Demarin 2022-01-13 21:31:21 UTC
I confirm this is fixed in 5.15.12