Bug 205557 - parts of kernel hangs completely (mutex-lock)
Summary: parts of kernel hangs completely (mutex-lock)
Status: RESOLVED DUPLICATE of bug 205155
Alias: None
Product: Power Management
Classification: Unclassified
Component: Run-Time-PM (show other bugs)
Hardware: All Linux
: P1 high
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-11-17 19:04 UTC by Tormen
Modified: 2019-12-05 10:18 UTC (History)
2 users (show)

See Also:
Kernel Version: 5.3.7
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (kernel target of journalctl) from last kernel network layer mutex lockup (105.77 KB, text/plain)
2019-11-17 19:05 UTC, Tormen
Details

Description Tormen 2019-11-17 19:04:01 UTC
Hi,

Please help: This really annoying bug seems to always get the complete network part of the kernel into a MUTEX-LOCK :((

As most processes these days do something with network... this basically renders the machine useless. E.g. a "ping 1.1.1.1" will be stuck and become an unkillable process!
Same for all programs with network access.

Only way to get out of this mess: SYSRQ + R-E-I-S-U-B :((

I am not sure which driver to blame here. As it seems network related and the only network drivers active during the problem happened are INTEL-WIRELESS, I classified it as such.

I'll attach the dmesg from the last occurance.
I have also the complete systemd journal of this and of many other occurances.

I don't know yet a way to reproduce it, but it happens 1 every couple of days, sometimes even several times a day.

Please advise if you need any other info.

Thanks a lot for any hint or help,

Tormen
Comment 1 Tormen 2019-11-17 19:05:20 UTC
Created attachment 285959 [details]
dmesg (kernel target of journalctl) from last kernel network layer mutex lockup
Comment 2 Tormen 2019-11-17 19:14:19 UTC
As this bug forces to SYSRQ reboot the machine everytime it occurs I find it pretty bad / important.

I had another look at the log and these were all the reported hung_task_timeouts:

Nov 17 16:26:44 huit kernel: INFO: task kworker/10:3:834 blocked for more than 122 seconds.
Nov 17 16:26:44 huit kernel: INFO: task sway_statusline:1138 blocked for more than 122 seconds.
Nov 17 16:26:44 huit kernel: INFO: task ThreadPoolForeg:3973 blocked for more than 122 seconds.
Nov 17 16:26:44 huit kernel: INFO: task ip:3979 blocked for more than 122 seconds.
Nov 17 16:28:47 huit kernel: INFO: task kworker/10:3:834 blocked for more than 245 seconds.
Nov 17 16:28:47 huit kernel: INFO: task wpa_supplicant:887 blocked for more than 122 seconds.
Nov 17 16:28:47 huit kernel: INFO: task sway_statusline:1138 blocked for more than 245 seconds.
Nov 17 16:28:47 huit kernel: INFO: task ThreadPoolForeg:3973 blocked for more than 245 seconds.
Nov 17 16:28:47 huit kernel: INFO: task ThreadPoolForeg:3977 blocked for more than 122 seconds.
Nov 17 16:28:47 huit kernel: INFO: task ip:3979 blocked for more than 245 seconds.

The worst and the one that should interest the kernel community I assume should be `kworker`.

And this is the first kworker hung_task stack trace:

Nov 17 16:26:44 huit kernel: INFO: task kworker/10:3:834 blocked for more than 122 seconds.
Nov 17 16:26:44 huit kernel:       Tainted: G           OE     5.3.7-arch1-1-ARCH #1
Nov 17 16:26:44 huit kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 17 16:26:44 huit kernel: kworker/10:3    D    0   834      2 0x80004080
Nov 17 16:26:44 huit kernel: Workqueue: events rtl_work_func_t [r8152]
Nov 17 16:26:44 huit kernel: Call Trace:
Nov 17 16:26:44 huit kernel:  ? __schedule+0x27f/0x6d0
Nov 17 16:26:44 huit kernel:  schedule+0x43/0xd0
Nov 17 16:26:44 huit kernel:  rpm_resume+0x18b/0x790
Nov 17 16:26:44 huit kernel:  ? wait_woken+0x70/0x70
Nov 17 16:26:44 huit kernel:  rpm_resume+0x302/0x790
Nov 17 16:26:44 huit kernel:  __pm_runtime_resume+0x3b/0x60
Nov 17 16:26:44 huit kernel:  usb_autopm_get_interface+0x18/0x50
Nov 17 16:26:44 huit kernel:  rtl_work_func_t+0x6b/0x290 [r8152]
Nov 17 16:26:44 huit kernel:  ? __schedule+0x287/0x6d0
Nov 17 16:26:44 huit kernel:  process_one_work+0x1d1/0x3a0
Nov 17 16:26:44 huit kernel:  worker_thread+0x4a/0x3d0
Nov 17 16:26:44 huit kernel:  kthread+0xfb/0x130
Nov 17 16:26:44 huit kernel:  ? process_one_work+0x3a0/0x3a0
Nov 17 16:26:44 huit kernel:  ? kthread_park+0x80/0x80
Nov 17 16:26:44 huit kernel:  ret_from_fork+0x35/0x40

... and this points to rpm_resume and then __pm_runtime_resume.

Still not 100% sure were to best address this bug though in the kernel bugzilla.
Please advise.
Comment 3 Tormen 2019-11-17 19:17:30 UTC
I find this bug pretty SEVERE (as it can leave you data-loss as you cannot save anymore (e.g. when writing on a website)), hence the Importance P1 high.
Comment 4 Ronan Pigott 2019-11-30 02:15:41 UTC
I think I may be experiencing the same issue.

The repeated part in my journal is very similar:

Nov 29 17:53:31 rxps kernel: INFO: task kworker/7:1:172 blocked for more than 122 seconds.
Nov 29 17:53:31 rxps kernel:       Tainted: G           OE     5.3.13-arch1-1 #1
Nov 29 17:53:31 rxps kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 29 17:53:31 rxps kernel: kworker/7:1     D    0   172      2 0x80004000
Nov 29 17:53:31 rxps kernel: Workqueue: events rtl_work_func_t [r8152]
Nov 29 17:53:31 rxps kernel: Call Trace:
Nov 29 17:53:31 rxps kernel:  ? __schedule+0x27f/0x6d0
Nov 29 17:53:31 rxps kernel:  schedule+0x43/0xd0
Nov 29 17:53:31 rxps kernel:  rpm_resume+0x18b/0x790
Nov 29 17:53:31 rxps kernel:  ? wait_woken+0x70/0x70
Nov 29 17:53:31 rxps kernel:  rpm_resume+0x302/0x790
Nov 29 17:53:31 rxps kernel:  __pm_runtime_resume+0x3b/0x60
Nov 29 17:53:31 rxps kernel:  usb_autopm_get_interface+0x18/0x50
Nov 29 17:53:31 rxps kernel:  rtl8152_set_mac_address+0x61/0x1c0 [r8152]
Nov 29 17:53:31 rxps kernel:  set_ethernet_addr+0x83/0x90 [r8152]
Nov 29 17:53:31 rxps kernel:  rtl8152_reset_resume+0x48/0x60 [r8152]
Nov 29 17:53:31 rxps kernel:  usb_resume_interface.isra.0+0x46/0xd0
Nov 29 17:53:31 rxps kernel:  usb_resume_both+0xe8/0x140
Nov 29 17:53:31 rxps kernel:  ? usb_runtime_suspend+0x70/0x70
Nov 29 17:53:31 rxps kernel:  __rpm_callback+0x7b/0x130
Nov 29 17:53:31 rxps kernel:  rpm_callback+0x4f/0x70
Nov 29 17:53:31 rxps kernel:  ? usb_runtime_suspend+0x70/0x70
Nov 29 17:53:31 rxps kernel:  rpm_resume+0x591/0x790
Nov 29 17:53:31 rxps kernel:  rpm_resume+0x302/0x790
Nov 29 17:53:31 rxps kernel:  __pm_runtime_resume+0x3b/0x60
Nov 29 17:53:31 rxps kernel:  usb_autopm_get_interface+0x18/0x50
Nov 29 17:53:31 rxps kernel:  rtl_work_func_t+0x6b/0x290 [r8152]
Nov 29 17:53:31 rxps kernel:  ? __schedule+0x287/0x6d0
Nov 29 17:53:31 rxps kernel:  process_one_work+0x1d1/0x3a0
Nov 29 17:53:31 rxps kernel:  worker_thread+0x4a/0x3d0
Nov 29 17:53:31 rxps kernel:  kthread+0xfb/0x130
Nov 29 17:53:31 rxps kernel:  ? process_one_work+0x3a0/0x3a0
Nov 29 17:53:31 rxps kernel:  ? kthread_park+0x80/0x80
Nov 29 17:53:31 rxps kernel:  ret_from_fork+0x1f/0x40

I can probably reproduce with an untainted kernel, if you'd prefer. It's tainted by bbswitch atm.

Some basic info:
$ sudo dmidecode -s system-product-name
XPS 15 9560
$ uname -r
5.3.13-arch1-1
$ lspci | grep Network
02:00.0 Network controller: Intel Corporation Wireless 8265 / 8275 (rev 78)

I have a Dell DA200 USB-C to Ethernet adapter I use with this laptop, and I've only experienced the lockup using that adapter. When I unplug the adapter, ksoftirqd will fill 100% cpu, my laptop becomes quickly unusable, and can only be shut down by a hard power-off.
Comment 5 Ronan Pigott 2019-12-02 22:49:40 UTC
I think this issue is actually a duplicate of #205155.
Comment 6 Tormen 2019-12-05 10:18:16 UTC
@ronan, that is actually true.

And I am also kind of sorry for this "cross-post"... but 

 (a) I thought my initial bug report was not as concise as it could have been.

 (b) I got no reaction whatsoever in weeks on a clearly broken driver so I was kind of desperate and honestly did not know what to do other than to try it again.

But I made peace with this ever since. So I will close this bug report.

Tormen

*** This bug has been marked as a duplicate of bug 205155 ***

Note You need to log in before you can comment on or make changes to this bug.