Bug 215359

Summary: Linux 5.15.10 hung task when stopping NetworkManager
Product: Networking Reporter: Enrico Demarin (enricod)
Component: OtherAssignee: Stephen Hemminger (stephen)
Status: RESOLVED DUPLICATE    
Severity: high CC: enricod, regressions
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.15.10 Subsystem:
Regression: No Bisected commit-id:
Attachments: Output from ver_linux
cpuinfo.txt
lspci

Description Enrico Demarin 2021-12-18 15:36:21 UTC
Created attachment 300069 [details]
Output from ver_linux

Moving from kernel 5.14.21 to 5.15.x , I noticed sporadic hangs mostly related to NetworkManager. Rebooting the box triggers the behaviour in all cases, NetworkManager hangs when receiving a stop job, and eventually these messages are logged:

Dec 18 10:04:07 pyjamarama kernel: INFO: task kworker/15:2:251 blocked for more than 245 seconds.
Dec 18 10:04:07 pyjamarama kernel:       Not tainted 5.15.10-MANJARO #1
Dec 18 10:04:07 pyjamarama kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 18 10:04:07 pyjamarama kernel: task:kworker/15:2    state:D stack:    0 pid:  251 ppid:     2 flags:0x00004000
Dec 18 10:04:07 pyjamarama kernel: Workqueue: ipv6_addrconf addrconf_verify_work
Dec 18 10:04:07 pyjamarama kernel: Call Trace:
Dec 18 10:04:07 pyjamarama kernel:  <TASK>
Dec 18 10:04:07 pyjamarama kernel:  __schedule+0x311/0x940
Dec 18 10:04:07 pyjamarama kernel:  schedule+0x59/0xc0
Dec 18 10:04:07 pyjamarama kernel:  schedule_preempt_disabled+0x11/0x20
Dec 18 10:04:07 pyjamarama kernel:  __mutex_lock.constprop.0+0x258/0x460
Dec 18 10:04:07 pyjamarama kernel:  addrconf_verify_work+0xa/0x20
Dec 18 10:04:07 pyjamarama kernel:  process_one_work+0x1e8/0x3c0
Dec 18 10:04:07 pyjamarama kernel:  worker_thread+0x50/0x3b0
Dec 18 10:04:07 pyjamarama kernel:  ? process_one_work+0x3c0/0x3c0
Dec 18 10:04:07 pyjamarama kernel:  kthread+0x132/0x160
Dec 18 10:04:07 pyjamarama kernel:  ? set_kthread_struct+0x40/0x40
Dec 18 10:04:07 pyjamarama kernel:  ret_from_fork+0x22/0x30
Dec 18 10:04:07 pyjamarama kernel:  </TASK>
Dec 18 10:04:07 pyjamarama kernel: INFO: task NetworkManager:681 blocked for more than 245 seconds.
Dec 18 10:04:07 pyjamarama kernel:       Not tainted 5.15.10-MANJARO #1
Dec 18 10:04:07 pyjamarama kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 18 10:04:07 pyjamarama kernel: task:NetworkManager  state:D stack:    0 pid:  681 ppid:     1 flags:0x00004004
Dec 18 10:04:07 pyjamarama kernel: Call Trace:
Dec 18 10:04:07 pyjamarama kernel:  <TASK>
Dec 18 10:04:07 pyjamarama kernel:  __schedule+0x311/0x940
Dec 18 10:04:07 pyjamarama kernel:  schedule+0x59/0xc0
Dec 18 10:04:07 pyjamarama kernel:  schedule_preempt_disabled+0x11/0x20
Dec 18 10:04:07 pyjamarama kernel:  __mutex_lock.constprop.0+0x258/0x460
Dec 18 10:04:07 pyjamarama kernel:  ? delay_halt+0x3b/0x60
Dec 18 10:04:07 pyjamarama kernel:  igb_resume+0xff/0x1d0 [igb]
Dec 18 10:04:07 pyjamarama kernel:  pci_pm_runtime_resume+0xaa/0xc0
Dec 18 10:04:07 pyjamarama kernel:  ? pci_pm_freeze_noirq+0x100/0x100
Dec 18 10:04:07 pyjamarama kernel:  __rpm_callback+0x44/0x110
Dec 18 10:04:07 pyjamarama kernel:  ? pci_pm_freeze_noirq+0x100/0x100
Dec 18 10:04:07 pyjamarama kernel:  rpm_callback+0x35/0x70
Dec 18 10:04:07 pyjamarama kernel:  rpm_resume+0x55b/0x800
Dec 18 10:04:07 pyjamarama kernel:  __pm_runtime_resume+0x4a/0x80
Dec 18 10:04:07 pyjamarama kernel:  dev_ethtool+0xd4/0x2d60
Dec 18 10:04:07 pyjamarama kernel:  ? enqueue_task_fair+0x8e/0x690
Dec 18 10:04:07 pyjamarama kernel:  ? inet_ioctl+0xde/0x1e0
Dec 18 10:04:07 pyjamarama kernel:  dev_ioctl+0x1ae/0x560
Dec 18 10:04:07 pyjamarama kernel:  ? _copy_from_user+0x28/0x60
Dec 18 10:04:07 pyjamarama kernel:  sock_do_ioctl+0xe4/0x130
Dec 18 10:04:07 pyjamarama kernel:  sock_ioctl+0x204/0x310
Dec 18 10:04:07 pyjamarama kernel:  __x64_sys_ioctl+0x82/0xb0
Dec 18 10:04:07 pyjamarama kernel:  do_syscall_64+0x3b/0x90
Dec 18 10:04:07 pyjamarama kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Dec 18 10:04:07 pyjamarama kernel: RIP: 0033:0x7f8828d4c59b
Dec 18 10:04:07 pyjamarama kernel: RSP: 002b:00007ffd61b19498 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Dec 18 10:04:07 pyjamarama kernel: RAX: ffffffffffffffda RBX: 00007ffd61b19520 RCX: 00007f8828d4c59b
Dec 18 10:04:07 pyjamarama kernel: RDX: 00007ffd61b19540 RSI: 0000000000008946 RDI: 0000000000000013
Dec 18 10:04:07 pyjamarama kernel: RBP: 00007ffd61b19690 R08: 0000000000000007 R09: 003066307334706e
Dec 18 10:04:07 pyjamarama kernel: R10: 0000000000000021 R11: 0000000000000246 R12: 0000000000000002
Dec 18 10:04:07 pyjamarama kernel: R13: 0000000000000000 R14: 00007ffd61b19540 R15: 00007ffd61b19520
Dec 18 10:04:07 pyjamarama kernel:  </TASK>


I bisected the kernel from 5.14.21 to 5.15.0 ( when the issue started ) and it points to this commit:

commit f32a213765739f2a1db319346799f130a3d08820
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date:   Sun Aug 1 12:36:48 2021 +0200

    ethtool: runtime-resume netdev parent before ethtool ioctl ops
    
    If a network device is runtime-suspended then:
    - network device may be flagged as detached and all ethtool ops (even if not
      accessing the device) will fail because netif_device_present() returns
      false
    - ethtool ops may fail because device is not accessible (e.g. because being
      in D3 in case of a PCI device)
    
    It may not be desirable that userspace can't use even simple ethtool ops
    that not access the device if interface or link is down. To be more friendly
    to userspace let's ensure that device is runtime-resumed when executing the
    respective ethtool op in kernel.
    
    Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

I reverted that patch in 5.15.10 and it seems to have cured the hang on reboot behaviour, however i'll be testing more to see if the other random hangs disappear too.

thanks for your attention
Comment 1 Enrico Demarin 2021-12-18 16:11:08 UTC
Created attachment 300071 [details]
cpuinfo.txt
Comment 2 Enrico Demarin 2021-12-18 16:11:46 UTC
Created attachment 300073 [details]
lspci
Comment 3 The Linux kernel's regression tracker (Thorsten Leemhuis) 2021-12-19 04:48:50 UTC
Hi, this is your Linux kernel regression tracker speaking. 

Not totally sure, but I guess this will be fixed by this patch:
https://lore.kernel.org/netdev/6bb28d2f-4884-7696-0582-c26c35534bae@gmail.com/

Will write a mail later why I takes to long to get this mainlined and backported.

This bug thus is like a duplicate of Bug 215129
Comment 4 Enrico Demarin 2021-12-19 14:53:27 UTC
Thanks Thorsten - it seems indeed a duplicate - ( the discussion seemed to hint that more work was needed - but will try the patch )
Comment 5 Enrico Demarin 2021-12-19 20:58:38 UTC
Hello, I re-applied the commit in question and applied the patch detailed in Bug 215129, which resolves the issue as well.
Comment 6 Enrico Demarin 2021-12-19 20:59:11 UTC

*** This bug has been marked as a duplicate of bug 215129 ***