Bug 213809

Summary: iTCO_wdt always reboots system after 60 seconds of uptime when running Linux 5.14-rc or 5.13.4
Product: Drivers
Reporter: Michael Marley (michael)
Component: Watchdog
Assignee: drivers_watchdog (drivers_watchdog)
Status: RESOLVED CODE_FIX
Severity: high
CC: debbugs, frederic, jdelvare, linux, michael, seblu
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 5.14-rc2, 5.13.4
Subsystem:
Regression: Yes
Bisected commit-id:

Description Michael Marley 2021-07-21 01:44:49 UTC
After installing Linux kernel 5.13.4 on my system, I found that it would always reboot after exactly 60 seconds of uptime (twice the configured watchdog interval).  I am using the built-in watchdog support of systemd, if that matters.  Previous versions of the kernel did not have a problem.

I was able to reproduce this on a Haswell system (iTCO_wdt iTCO_wdt.1.auto: Found a Lynx Point_LP TCO device (Version=2, TCOBASE=0x1860)) and a Broadwell system (iTCO_wdt iTCO_wdt.1.auto: Found a Wildcat Point_LP TCO device (Version=2, TCOBASE=0x1860)).  I didn't try it on any other systems.

After discovering that the watchdog was causing the reboots, I scanned the commit log for 5.13.4 and found "watchdog: iTCO_wdt: Account for rebooting on second timeout" (5e65819a006ec8a8df2f8639dc26ef0cfaa95ae7), which looked rather suspicious.  I then found that reverting that commit restores the correct watchdog behavior.  That commit is also present as cb011044e34c293e139570ce5c01aed66a34345c in Linus's tree.
Comment 1 Jean Delvare 2021-08-03 14:22:14 UTC
Same problem here on a Lynx Point TCO device. Reverting the faulty commit on top of kernel 5.13.7 solves the problem for me.
Comment 2 Jean Delvare 2021-08-03 15:34:22 UTC
Turns out a fix is already available:

https://lkml.org/lkml/2021/7/26/349
Comment 3 Michael Marley 2021-08-04 00:04:35 UTC
Thanks, that patch solves the issue on my systems.
Comment 4 Seb Lu 2021-09-13 00:08:46 UTC
I'm affected by a similar issue on Atom C2350 CPUs with 5.14 mainline and 5.14.x stable kernels.

The kernel reboots after a short time. Stopping userland from using the watchdog is a workaround that avoids the reboot loop.
There is no such reboot with the 5.13 kernel branch up to 5.13.14.

I built a 5.14.3 kernel with the patch and it doesn't fix the issue.
Do you think this is related, or do I need to open a new bug report?

Regards,
Comment 5 Guenter Roeck 2021-09-13 04:25:53 UTC
You could try to revert commit cb011044e34c. If it helps, it is likely the same problem.
Comment 6 Javier S. Pedro 2021-10-03 17:35:47 UTC
Something is still broken even after the fix aec42642d91. I have a Broadwell system ("9 Series TCO device (Version=2, TCOBASE=0x1860)"), and before commit cb011044e34c (say, 5.12.x) I would see the following behavior:

I set the timeout to 30 (wdctl -s 30)
wdctl prints:
Timeout:       30 seconds
Timeleft:      30 seconds

I enable the watchdog
After 30 seconds, system reboots.
--------

After commit cb011044e34c, plus the fix aec42642d91, I see:

I set the timeout to 30 (wdctl -s 30)
wdctl prints:
Timeout:       30 seconds
Timeleft:      15 seconds

After only 15 seconds, system reboots. 
-------

In addition to the confusing UI (showing a different value for timeout/timeleft right after setting the watchdog), it is rebooting _early_ (at only half the timeout!). That is much more problematic than rebooting late...

On my system during probe() I can see that 
SMI_EN is 0x80020023 (i.e. TCO_EN is off but GBL_SMI_EN is on),
so we do the tmrval /= 2. 
However my system definitely reboots on the first timeout somehow, so it doesn't need that. 

Any ideas? Is there any other interesting register that could help understand the difference? I don't have any system where the divide by two is actually needed, and Intel manuals are not much help.
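
For reference, the register value above can be decoded with a short script. This is a minimal sketch, assuming TCO_EN is bit 13 and GBL_SMI_EN is bit 0 of SMI_EN (the positions given in Intel PCH datasheets). The function names are hypothetical, and timer_would_be_halved() only restates the condition described in this comment (halve unless both bits are set); it is not the driver's exact code.

```python
# Assumed bit positions in the SMI_EN register (from Intel PCH datasheets).
GBL_SMI_EN = 1 << 0   # global SMI enable
TCO_EN = 1 << 13      # TCO timer SMI enable


def decode_smi_en(value):
    """Return the two SMI_EN flags relevant to the first/second-timeout question."""
    return {
        "GBL_SMI_EN": bool(value & GBL_SMI_EN),
        "TCO_EN": bool(value & TCO_EN),
    }


def timer_would_be_halved(value):
    """Hypothetical restatement of the check: halve unless both bits are set."""
    both = TCO_EN | GBL_SMI_EN
    return (value & both) != both


if __name__ == "__main__":
    smi_en = 0x80020023  # value observed during probe() in this comment
    print(decode_smi_en(smi_en))
    print(timer_would_be_halved(smi_en))
```

For 0x80020023 this reports GBL_SMI_EN set and TCO_EN clear, which is the state in which the timer value gets divided by two.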
Comment 7 Frederic MASSOT 2021-10-05 16:08:21 UTC
Hi,

I am using a Debian kernel. I hope it is fine to add to this bug report, in case it helps.

I tested kernel 5.14.6 on a PC and I have the same bug: the PC restarts after one minute. There is no problem with kernel 5.10.46.

The watchdog is managed by systemd.

Motherboard: Asus P7P55D-E
CPU: Intel Core i7 CPU 870 @ 2.93GHz

Extract from the logs:
iTCO_vendor_support: vendor-support=0
iTCO_wdt iTCO_wdt.1.auto: Found a P55 TCO device (Version=2, TCOBASE=0x0860)
iTCO_wdt iTCO_wdt.1.auto: initialized. heartbeat=30 sec (nowayout=0)


$ sudo wdctl 
Device:        /dev/watchdog0
Identity:      iTCO_wdt [version 0]
Timeout:       30 seconds
Pre-timeout:    0 seconds
Timeleft:      15 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0


$ cat /etc/systemd/system.conf.d/10-watchdog.conf 
#[Manager]
#RuntimeWatchdogSec=30

I rebooted without activating the watchdog to get the correct information.


With the kernel 5.10, if I run a "watch wdctl" just after booting, the value of Timeleft changes between 15 and 30. With the kernel 5.14, the value of Timeleft changes between 0 and 15.
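
The same observation can be scripted without opening /dev/watchdog0 (which would start the watchdog). This is a rough sketch, assuming the kernel exposes the watchdog sysfs attributes (CONFIG_WATCHDOG_SYSFS) under /sys/class/watchdog/watchdog0; the helper names are hypothetical, and the "halved" check is only a heuristic over several samples.

```python
from pathlib import Path


def read_watchdog_times(sysfs_dir="/sys/class/watchdog/watchdog0"):
    """Read timeout and timeleft (seconds) from sysfs, if present.

    Assumption: the attributes exist only when the kernel is built with
    CONFIG_WATCHDOG_SYSFS and the driver reports timeleft.
    Returns an empty dict when nothing is available.
    """
    base = Path(sysfs_dir)
    result = {}
    for name in ("timeout", "timeleft"):
        attr = base / name
        if attr.exists():
            result[name] = int(attr.read_text().strip())
    return result


def timeleft_looks_halved(timeout, samples):
    """Heuristic: True if no sampled timeleft ever exceeds half the timeout."""
    return max(samples) * 2 <= timeout


if __name__ == "__main__":
    print(read_watchdog_times())
```

With a 30-second timeout, samples that stay between 0 and 15 (the 5.14 behavior described above) are flagged, while samples that reach values above 15 (the 5.10 behavior) are not.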

There is nothing recorded in the logs at the time of the reboot that could give more info.


Regards.
Comment 8 Guenter Roeck 2021-10-08 01:11:35 UTC
Enough is enough. I created a revert of the offending commits.

https://patchwork.kernel.org/project/linux-watchdog/patch/20211008003302.1461733-1-linux@roeck-us.net/

in case someone wants to confirm that it works.
Comment 9 Seb Lu 2021-10-21 15:25:43 UTC
Before trying your patch, I tested the latest stable kernel, 5.14.14, without it, and the issue looks fixed. The timer is effectively reset (at around half the timeout).
Maybe the revert is not needed anymore?
Comment 10 Guenter Roeck 2021-10-21 16:11:04 UTC
Kernel versions v5.14.6 and later include commit aec42642d91, and it was stated twice above that this commit does not fix the problem for everyone. There was no additional fix to the driver after v5.14.6.
Comment 11 Seb Lu 2021-10-21 16:24:05 UTC
I tried aec42642d91 against 5.14.3 and reboots were still present on my boards.  So I agree, it did not fix the issue for me either.

Something changed in 5.14.14, because the very same boards are not rebooting anymore. That's why I wanted to mention it here.
Maybe the changes are somewhere other than the iTCO code. I see changes since 5.14.5 in drivers/watchdog/watchdog_dev.c, which seems to be shared between watchdog drivers.
Comment 12 Seb Lu 2021-10-21 16:46:28 UTC
OK then, you were correct: I still have boards with 5.14.14 that eventually reboot. Not as fast, but still. So I'm building a kernel with your patch and will give you feedback.
Comment 13 Seb Lu 2021-10-21 20:55:06 UTC
No more reboot issues on any of my boards when patch [1] is applied. The timer is always reset at half the timeout.

Thanks Guenter!


[1] https://patchwork.kernel.org/project/linux-watchdog/patch/20211008003302.1461733-1-linux@roeck-us.net/
Comment 14 Javier S. Pedro 2021-10-23 15:36:07 UTC
If you use systemd's watchdog support with an interval of N seconds, by design it tries to ping the watchdog exactly every N/2 seconds. Since the broken cb011044e34c causes reboot also exactly after N/2 seconds instead of after the intended N seconds, your system will reboot randomly, depending on your luck or the skew between your clocks :) (i.e. depending on whether systemd gets to ping the watchdog in the microseconds before it actually reboots or not).  

So every N/2 seconds you have a ~50% chance of a random reboot; that's why the time until you see the reboot varies. 

To make it consistent one can use watchdog-test (from linux-src/tools/testing/selftests/watchdog/watchdog-test.c). E.g., 

./watchdog-test -t 30 -p 20 

Results:
 after cb011044e34c , reboot in only 15 seconds (broken).
 after aec42642d91 , reboot in only 15 seconds (broken). 
 after the revert, no reboot (correct). 

So from my side this revert definitely fixes the bug. And "timeleft" value works correctly again.
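
The timing argument above can be illustrated with a small model. This is only a sketch under simplifying assumptions: the watchdog fires a fixed "effective timeout" after the last ping, pings arrive at a fixed interval plus a constant jitter, and a tie counts as an expiry. The function name and parameters are hypothetical and nothing here talks to real hardware.

```python
def first_expiry(effective_timeout, ping_interval, jitter=0.0, horizon=600.0):
    """Return the time (s) at which the modeled watchdog fires, or None.

    effective_timeout: seconds after the last ping at which the reboot happens
    ping_interval:     nominal seconds between pings from userspace
    jitter:            constant lateness (+) or earliness (-) of each ping
    horizon:           how long to simulate, in seconds
    """
    last_ping = 0.0
    while last_ping < horizon:
        next_ping = last_ping + ping_interval + jitter
        deadline = last_ping + effective_timeout
        if next_ping >= deadline:
            return deadline  # the ping arrives too late
        last_ping = next_ping
    return None


if __name__ == "__main__":
    # watchdog-test -t 30 -p 20 with the timeout effectively halved to 15 s
    print(first_expiry(15, 20))
    # same test after the revert: effective timeout is the full 30 s
    print(first_expiry(30, 20))
    # systemd-style pinging at N/2 against a halved timeout: jitter decides
    print(first_expiry(15, 15, jitter=0.001))
    print(first_expiry(15, 15, jitter=-0.001))
```

In this model the halved timeout with a 20-second ping interval always expires at 15 seconds, the full 30-second timeout never expires, and pinging at exactly N/2 against a halved timeout survives or fails depending on a millisecond of jitter, which is the race described above.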
Comment 15 Javier S. Pedro 2021-11-24 13:45:47 UTC
At least the 5.15.x series seems to have the revert already, and it is working as expected.
Feel free to close the bug (I'm not the reporter).
Comment 16 Frederic MASSOT 2021-11-29 17:24:55 UTC
Hi,

I am using a Debian kernel. I had this bug with kernel 5.14.6 as noted in a previous comment.

I updated this afternoon to the Debian kernel 5.15.3-1 and I have no more reboots. The watchdog is managed by systemd.


Regards.
Comment 17 Jean Delvare 2021-12-03 07:52:55 UTC
It should be noted that the watchdog implementation in systemd is flawed by design so it shouldn't be trusted when testing whether a kernel driver works properly. Better use the standalone watchdog-test tool.
Comment 18 Jean Delvare 2022-06-08 08:15:16 UTC
Fixed in v5.15 by:

commit 6e7733ef0bb9372d5491168635f6ecba8ac3cb8a
Author: Guenter Roeck
Date:   Thu Oct 7 17:33:02 2021 -0700

    Revert "watchdog: iTCO_wdt: Account for rebooting on second timeout"