Hello, Open bug in launchpad.net https://bugs.launchpad.net/bugs/1796385 "I can use the system for a while, then at random, the screen blinks and freezes. Must reboot. Seems to happen both with Wayland and Xorg." Best regards, -- Cristian Aravena Romero (caravena)
Talk about shooting the messenger. The kernel's watchdog (kernel/watchdog.c) reports a stalled CPU. This is not a problem with the kernel's watchdog. The kernel watchdog just reports the problem. It is also not a problem with the watchdog subsystem, which is not even involved.
@Guenter, Could you move this report to the corresponding product/component in bugzilla, if it does not belong under 'watchdog'? Best regards, -- Cristian Aravena Romero (caravena)
The bug in launchpad, unless I am missing something, does not provide a single actionable traceback. I don't think it is even possible to identify where exactly the CPU hangs unless additional information is provided. There is no traceback in dmesg, and OopsText doesn't include it either. Given that, it is not possible to identify the responsible subsystem, much less to fix the underlying problem. The only thing we can say for sure is that it is _not_ a watchdog driver problem.
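When no traceback lands in dmesg, the kernel can be asked to dump one on the next lockup. A minimal sketch of the relevant sysctl knobs (both exist on reasonably recent kernels; the file name and the choice to panic are illustrative):

```
# /etc/sysctl.d/99-lockup-debug.conf (illustrative)
# On a soft lockup, dump backtraces from all CPUs, not just the stuck one.
kernel.softlockup_all_cpu_backtrace = 1
# Optionally panic on a hard lockup so kdump can capture a vmcore.
kernel.hardlockup_panic = 1
```

Apply with `sysctl -p /etc/sysctl.d/99-lockup-debug.conf` (as root); the next lockup should then leave a usable traceback in the logs.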
Also, I don't think I have permission to change any of the bug status fields.
@Guenter, I changed it, but I do not know which 'Product' and 'Component' to select. Best regards, -- Cristian Aravena Romero (caravena)
Unfortunately we do not have information to determine 'Product' and 'Component'. The only information we have is that the hanging process is gnome-shell (or at least that this was the case in at least one instance), that the screen blinks and freezes when the problem is observed, and that the hanging CPU served most of the graphics card interrupts. If it is persistent, it _might_ suggest that graphics (presumably the Radeon graphics driver and/or the graphics hardware) is involved. This would be even more likely if the observed PCIe errors point to the graphics card (not sure if the information provided shows the PCIe bus tree; if so I have not found it).
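One way to check whether a single CPU is serving most of the graphics interrupts is to compare the per-CPU columns in /proc/interrupts. A sketch; the sample line in the heredoc is made up for illustration, and on a live system you would pipe in `grep -i radeon /proc/interrupts` instead:

```shell
# Find the CPU with the highest interrupt count for the radeon line.
# Columns 2..5 are the per-CPU counters (a 4-CPU sample is assumed here).
out=$(awk '/radeon/ {
    max = 0; cpu = -1
    for (i = 2; i <= 5; i++)
        if ($i + 0 > max) { max = $i + 0; cpu = i - 2 }
    print "busiest CPU for " $NF ": CPU" cpu " (" max " irqs)"
}' <<'EOF'
           CPU0       CPU1       CPU2       CPU3
  29:         12          0     981234          7   IR-PCI-MSI 512000-edge      radeon
EOF
)
echo "$out"    # -> busiest CPU for radeon: CPU2 (981234 irqs)
```

If the busiest CPU here matches the CPU reported as stuck, that would support the theory that the graphics driver or hardware is involved.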
Message from syslogd@amer at Jan 27 19:26:19 ... kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [swapper/5:0]
Message from syslogd@amer at Jan 27 19:26:19 ... kernel:NMI watchdog: BUG: soft lockup - CPU#1 stuck for 27s! [dmeventd:71548]
Message from syslogd@amer at Jan 27 19:27:30 ... kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [6_scheduler:64928]
Message from syslogd@amer at Jan 27 19:31:25 ... kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [ksoftirqd/5:34]
Message from syslogd@amer at Jan 27 19:32:42 ... kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 33s! [swift-object-up:11358]
Message from syslogd@amer at Jan 27 19:33:55 ... kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 24s! [dmeventd:71548]
Message from syslogd@amer at Jan 27 19:34:25 ... kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 65s! [kworker/2:0:59993]
Message from syslogd@amer at Jan 27 19:37:50 ... kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 24s! [kworker/u256:3:8447]
Message from syslogd@amer at Jan 27 19:37:50 ... kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [ksoftirqd/5:34]
Message from syslogd@amer at Jan 27 19:37:51 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [systemd:11968]
The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.
From my side, the fix was to upgrade the virtual machine to VMware 14 instead of staying on VMware 12, on RHEL 7.6 (Maipo).
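In a VM, soft-lockup reports can be spurious: if the host overcommits CPUs or briefly descheduls the guest, the guest's clock jumps and the watchdog fires even though nothing is actually stuck. A common, hedged mitigation is to raise the watchdog threshold; `kernel.watchdog_thresh` is a real sysctl, but the value below is just an example:

```
# /etc/sysctl.d/99-watchdog.conf (illustrative)
# Default is 10; the soft-lockup warning fires at 2 * watchdog_thresh seconds,
# so this raises the trigger from 20s to 60s.
kernel.watchdog_thresh = 30
```

This only papers over scheduling jitter; a lockup that persists past the raised threshold still points at a genuine problem in the guest or host.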
I'm seeing the same here. Using Red Hat 7.7, kernel version 3.10.0-1062.8.1.el7.x86_64. All 10 of my machines are on the same hardware and RHEL 7 release, yet just the one machine is reporting this problem. Eventually the machine becomes unusable and a hard reboot is needed.
I had the same problem with OSP on RHEL 7.6. In my case it turned out to be a network problem. If it's relevant, please check network connectivity and/or I/O latency on your VM, and check the logs for those. Good luck.
You can also check the RAID card, as it is a common point of failure and causes problems on servers, and check disk I/O latency: a failing disk will hurt performance. I also hit this on HDDs for some services that needed an SSD, as recommended to me by the vendor. I checked /var/log/messages and audit.log for errors; if you are using OSP on your machines, check the related nova.log as well, etc. BR
I opened up the server and reset cards and vacuumed it out a couple of days ago. I'll know if things are okay next week.
Hi,

cat /var/log/messages | grep -i error >> messages-errors.txt
cat /var/log/nova/nova.log | grep -i error >> nova-errors.txt
cat /var/log/audit/audit.log | grep -i error >> audit-errors.txt

In your home directory you will then find the .txt files. Take time to trace the errors; maybe you will find answers. If you have PowerEdge servers (Dell), a failing RC (RAID controller) is a common problem. Otherwise you may have to check the network connections between the server and the switch, replace the UTP cables, and ping the other servers: RabbitMQ in OSP depends on a heartbeat to sync between servers, and if the heartbeat fails it causes this. Best regards, Amer Hwitat
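The three grep commands above can be wrapped in one small loop. A sketch; `sweep_errors` is a hypothetical helper name, and the demo below uses a throwaway file in place of the real logs so it can run without root:

```shell
#!/bin/sh
# sweep_errors: case-insensitively grep "error" in each readable log file
# passed as an argument, writing <basename>-errors.txt to the current dir.
sweep_errors() {
    for f in "$@"; do
        [ -r "$f" ] || continue              # skip logs that do not exist here
        grep -i error "$f" > "$(basename "$f")-errors.txt"
    done
}

# demo with a throwaway log instead of /var/log/messages etc.
tmp=$(mktemp)
printf 'all quiet\nkernel: I/O ERROR on sda\n' > "$tmp"
sweep_errors "$tmp"
cat "$(basename "$tmp")-errors.txt"          # -> kernel: I/O ERROR on sda
```

On a real box you would call it as `sweep_errors /var/log/messages /var/log/nova/nova.log /var/log/audit/audit.log`.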
Hi, you can also check the journal:

journalctl -l | grep -i error > journal-errors.txt

Cheers
Thanks ... I reviewed the log files this morning and had a look at the output from journalctl. The system appears to be good to go. I guess going through and blowing out the machine and re-seating everything into its slots helped. Thank you for your assistance!