Bug 201379 - watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [gnome-shell:1112]
Summary: watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [gnome-shell:1112]
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Watchdog
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: drivers_watchdog@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-10-11 18:32 UTC by Cristian Aravena Romero
Modified: 2020-02-24 17:39 UTC
CC List: 2 users

See Also:
Kernel Version: Ubuntu 4.18.0-8.9-generic 4.18.7
Subsystem:
Regression: No
Bisected commit-id:



Description Cristian Aravena Romero 2018-10-11 18:32:59 UTC
Hello,

Open bug in launchpad.net
https://bugs.launchpad.net/bugs/1796385

"I can use the system for a while, then at random, the screen blinks and freezes. Must reboot.
Seems to happen both with Wayland and Xorg."

Best regards,
--
Cristian Aravena Romero (caravena)
Comment 1 Guenter Roeck 2018-10-12 11:41:09 UTC
Talk about shooting the messenger. The kernel's soft lockup detector (kernel/watchdog.c) is reporting a stalled CPU; it only reports the problem, it does not cause it. It is also not a problem with the watchdog driver subsystem, which is not even involved here.
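For anyone hitting this, a minimal sketch of how to make the next soft lockup report actionable (assuming a stock kernel with the standard lockup-detector sysctls; availability can vary with kernel configuration):

# The detector fires at roughly 2 * watchdog_thresh seconds (default 10s, hence the ~22s reports).
sysctl kernel.watchdog_thresh
# Dump backtraces from all CPUs when a soft lockup is detected, so the report shows where it hangs.
sysctl -w kernel.softlockup_all_cpu_backtrace=1
# Optionally panic on soft lockup so the full machine state is captured (e.g. via kdump).
sysctl -w kernel.softlockup_panic=1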
Comment 2 Cristian Aravena Romero 2018-10-12 11:49:55 UTC
@Guenter,

Could you move this report to wherever it belongs in bugzilla, if 'watchdog' is not the right place for it?

Best regards,
--
Cristian Aravena Romero (caravena)
Comment 3 Guenter Roeck 2018-10-12 11:57:40 UTC
The bug in launchpad, unless I am missing something, does not provide a single actionable traceback. I don't think it is even possible to identify where exactly the CPU hangs unless additional information is provided. There is no traceback in dmesg, and OopsText doesn't include one either.

Given that, it is not possible to identify the responsible subsystem, much less to fix the underlying problem. The only thing we can say for sure is that it is _not_ a watchdog driver problem.
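As a hedged suggestion for gathering that information: the full kernel log around the lockup is what is needed. Something along these lines usually captures it (assuming systemd's journal is persistent across boots; adjust the grep context as needed):

# Kernel messages from the current boot, with context around the soft lockup report.
journalctl -k -b | grep -B 5 -A 60 'soft lockup'
# Or from the previous boot, if the machine had to be reset.
journalctl -k -b -1 | grep -B 5 -A 60 'soft lockup'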
Comment 4 Guenter Roeck 2018-10-12 11:59:53 UTC
Also, I don't think I have permission to change any of the bug status fields.
Comment 5 Cristian Aravena Romero 2018-10-12 12:15:48 UTC
@Guenter,

I can change it, but I do not know which 'Product' and 'Component' to use.

Best regards,
--
Cristian Aravena Romero (caravena)
Comment 6 Guenter Roeck 2018-10-12 17:12:33 UTC
Unfortunately we do not have enough information to determine 'Product' and 'Component'.

The only information we have is that the hanging process is gnome-shell (or at least that this was the case in at least one instance), that the screen blinks and freezes when the problem is observed, and that the hanging CPU served most of the graphics card interrupts. If that pattern is persistent, it _might_ suggest that graphics (presumably the Radeon graphics driver and/or the graphics hardware) is involved. This would be even more likely if the observed PCIe errors point to the graphics card (I am not sure whether the information provided shows the PCIe bus tree; if so, I have not found it).
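If someone with access to the affected machine wants to check, a rough sketch (assuming a Radeon/amdgpu GPU and standard tools; the driver name may differ):

# Which CPU services the GPU interrupts (look for the column with the largest counts).
grep -iE 'radeon|amdgpu' /proc/interrupts
# PCIe bus tree, to see whether the reported PCIe errors point at the graphics card.
lspci -tv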
Comment 7 amer.hwaitat@gmail.com 2019-01-28 11:47:42 UTC
Message from syslogd@amer at Jan 27 19:26:19 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [swapper/5:0]

Message from syslogd@amer at Jan 27 19:26:19 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#1 stuck for 27s! [dmeventd:71548]

Message from syslogd@amer at Jan 27 19:27:30 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [6_scheduler:64928]

Message from syslogd@amer at Jan 27 19:31:25 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [ksoftirqd/5:34]

Message from syslogd@amer at Jan 27 19:32:42 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 33s! [swift-object-up:11358]

Message from syslogd@amer at Jan 27 19:33:55 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 24s! [dmeventd:71548]

Message from syslogd@amer at Jan 27 19:34:25 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 65s! [kworker/2:0:59993]

Message from syslogd@amer at Jan 27 19:37:50 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 24s! [kworker/u256:3:8447]

Message from syslogd@amer at Jan 27 19:37:50 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [ksoftirqd/5:34]

Message from syslogd@amer at Jan 27 19:37:51 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [systemd:11968]

The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.

On my side I upgraded to VMware 14 instead of using VMware 12, with RHEL 7.6 (Maipo) as the guest.
Comment 8 Tony Freeman 2020-02-18 17:19:25 UTC
I'm seeing the same here.  

Using Red Hat 7.7

kernel version: 3.10.0-1062.8.1.el7.x86_64


All 10 of my machines are on the same hardware and RHEL 7 release.  Just the one machine is reporting this problem.  Eventually the machine becomes unusable and a hard reboot is needed.
Comment 9 amer.hwaitat@gmail.com 2020-02-18 19:39:36 UTC
I had the same problem with OSP on RHEL 7.6

It seems that in my case it was a network problem. If that is relevant to you, please check network connectivity and/or I/O latency on your VM, and check the logs for either; a quick sketch follows.
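For example (the peer address here is hypothetical; iostat comes from the sysstat package):

# Look for packet loss or latency spikes towards a peer node.
ping -c 20 192.168.0.10
# Extended disk statistics every 5 seconds; high await/%util values point at I/O latency problems.
iostat -x 5 3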

good luck
Comment 10 amer.hwaitat@gmail.com 2020-02-18 20:34:02 UTC
You can also check the RAID card, as it is a common source of faults on servers, and check disk I/O latency: if there is a disk problem it will affect performance. I ran into this on HDDs as well, for some services that needed an SSD, as a vendor recommended to me. A quick check is sketched below.
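A vendor-neutral way to check the disks themselves (assuming smartmontools is installed and /dev/sda is the disk in question; disks behind a hardware RAID controller usually need the vendor tool or extra smartctl -d options):

# Overall SMART health verdict for the drive.
smartctl -H /dev/sda
# Full SMART attributes, including reallocated/pending sectors and the error log.
smartctl -a /dev/sda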

I checked /var/log/messages and audit.log for errors; if you are using OSP on your machines, also check the related nova.log, etc.

BR
Comment 11 Tony Freeman 2020-02-21 23:42:28 UTC
I opened up the server and reset cards and vacuumed it out a couple of days ago.  I'll know if things are okay next week.
Comment 12 amer.hwaitat@gmail.com 2020-02-22 04:09:31 UTC
Hi,

grep -i error /var/log/messages >> ~/messages-errors.txt
grep -i error /var/log/nova/nova.log >> ~/nova-errors.txt
grep -i error /var/log/audit/audit.log >> ~/audit-errors.txt

You will then find the .txt files in your home directory.

Take the time to trace the errors; you may find answers there.

If you have Dell PowerEdge servers, RAID controller (RC) problems are common.

Otherwise you may have to check the network connections between the server and the switch, replace the UTP cables, and ping the other servers. RabbitMQ in OSP depends on heartbeats to keep the servers in sync; if the heartbeat fails, it can cause this (see the check sketched below).
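If RabbitMQ is suspected, a quick check of the cluster state (run on a controller node; this is the standard rabbitmqctl command, nothing OSP-specific):

# Shows cluster members, which nodes are running, and any network partitions.
rabbitmqctl cluster_status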
 
Best regards
Amer Hwitat
Comment 13 amer.hwaitat@gmail.com 2020-02-22 05:49:34 UTC
Hi,

You can also check the journal with journalctl:

journalctl -l | grep -i error > journal-errors.txt


cheers
Comment 14 Tony Freeman 2020-02-24 17:39:17 UTC
Thanks ... I reviewed the log files this morning and had a look at the output from journalctl.  It appears the system is good to go.  I guess blowing out the machine and re-seating everything in its slots helped.  Thank you for your assistance!
