After installing Linux kernel 5.13.4 on my system, I found that it would always reboot after exactly 60 seconds of uptime (twice the configured watchdog interval). I am using the built-in watchdog support of systemd, if that matters. Previous versions of the kernel did not have a problem. I was able to reproduce this on a Haswell system (iTCO_wdt iTCO_wdt.1.auto: Found a Lynx Point_LP TCO device (Version=2, TCOBASE=0x1860)) and a Broadwell system (iTCO_wdt iTCO_wdt.1.auto: Found a Wildcat Point_LP TCO device (Version=2, TCOBASE=0x1860)). I didn't try it on any other systems. After discovering that the watchdog was causing the reboots, I scanned the commit log for 5.13.4 and found "watchdog: iTCO_wdt: Account for rebooting on second timeout" (5e65819a006ec8a8df2f8639dc26ef0cfaa95ae7), which looked rather suspicious. I then found that reverting that commit restores the correct watchdog behavior. That commit is also present as cb011044e34c293e139570ce5c01aed66a34345c in Linus's tree.
Same problem here on a Lynx Point TCO device. Reverting the faulty commit on top of kernel 5.13.7 solves the problem for me.
Turns out a fix is already available: https://lkml.org/lkml/2021/7/26/349
Thanks, that patch solves the issue on my systems.
I'm affected by a similar issue on Atom C2350 CPUs with 5.14 mainline and 5.14.x stable kernels. The system reboots after a short time. Stopping userland from using the watchdog is a workaround that fixes the reboot loop. There is no such reboot with the 5.13 kernel branch up to 5.13.14. I built 5.14.3 with the patch and it doesn't fix the issue. Do you think this is related, or do I need to open a new bug report? Regards,
You could try to revert commit cb011044e34c. If it helps, it is likely the same problem.
Something is still broken even after the fix aec42642d91. I have a Broadwell system ("9 Series TCO device (Version=2, TCOBASE=0x1860)"), and before commit cb011044e34c (say 5.12.x) I would see the following behavior:

I set the timeout to 30 (wdctl -s 30)
wdctl prints:
Timeout: 30 seconds
Timeleft: 30 seconds
I enable the watchdog
After 30 seconds, system reboots.

--------

After commit cb011044e34c, plus the fix aec42642d91, I see:

I set the timeout to 30 (wdctl -s 30)
wdctl prints:
Timeout: 30 seconds
Timeleft: 15 seconds
After only 15 seconds, system reboots.

-------

In addition to the confusing UI (showing a different value for timeout/timeleft right after setting the watchdog), it is rebooting _early_ (at only half the timeout!). That is much more problematic than rebooting late...

On my system during probe() I can see that SMI_EN is 0x80020023 (i.e. TCO_EN is off but GBL_SMI_EN is on), so we do the tmrval /= 2. However my system definitely reboots on the first timeout somehow, so it doesn't need that.

Any ideas? Is there any other interesting register that could help understand the difference? I don't have any system where the divide by two is actually needed, and Intel manuals are not much help.
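For anyone comparing register dumps, here is a minimal Python sketch that decodes the SMI_EN value quoted above. The bit positions (GBL_SMI_EN = bit 0, TCO_EN = bit 13) and the halving condition are assumptions based on a reading of the driver after aec42642d91, so please check them against iTCO_wdt.c. The function names are made up for illustration.

```python
# Assumed bit positions in SMI_EN; verify against iTCO_wdt.c / the PCH datasheet.
GBL_SMI_EN = 1 << 0
TCO_EN = 1 << 13


def decode_smi_en(value):
    """Report the two SMI_EN bits the driver looks at during probe()."""
    return {
        "GBL_SMI_EN": bool(value & GBL_SMI_EN),
        "TCO_EN": bool(value & TCO_EN),
    }


def timer_is_halved(value):
    """Assumed condition for 'tmrval /= 2': both bits set means no halving."""
    mask = TCO_EN | GBL_SMI_EN
    return (value & mask) != mask


if __name__ == "__main__":
    smi_en = 0x80020023  # value reported in the comment above
    print(decode_smi_en(smi_en))
    print("timer halved:", timer_is_halved(smi_en))
```

With 0x80020023 this reports GBL_SMI_EN set and TCO_EN clear, which matches the halving described in the comment.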
Hi, I am using a Debian kernel. I hope it is OK to add to this bug, in case it helps. I tested kernel 5.14.6 on a PC and I have the same bug: the PC restarts after one minute. There is no problem with kernel 5.10.46. The watchdog is managed by systemd.

Motherboard: Asus P7P55D-E
CPU: Intel Core i7 CPU 870 @ 2.93GHz

Extract from the logs:
iTCO_vendor_support: vendor-support=0
iTCO_wdt iTCO_wdt.1.auto: Found a P55 TCO device (Version=2, TCOBASE=0x0860)
iTCO_wdt iTCO_wdt.1.auto: initialized. heartbeat=30 sec (nowayout=0)

$ sudo wdctl
Device:        /dev/watchdog0
Identity:      iTCO_wdt [version 0]
Timeout:       30 seconds
Pre-timeout:    0 seconds
Timeleft:      15 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0

$ cat /etc/systemd/system.conf.d/10-watchdog.conf
#[Manager]
#RuntimeWatchdogSec=30

I rebooted without activating the watchdog to get the correct information.

With kernel 5.10, if I run "watch wdctl" just after booting, the value of Timeleft changes between 15 and 30. With kernel 5.14, the value of Timeleft changes between 0 and 15. There is nothing recorded in the logs at the time of the reboot that could give more info.

Regards.
Enough is enough. I created a revert of the offending commits. https://patchwork.kernel.org/project/linux-watchdog/patch/20211008003302.1461733-1-linux@roeck-us.net/ in case someone wants to confirm that it works.
Before trying your patch, I tested the latest 5.14.14 stable kernel without it, and the issue looks fixed. The timer is effectively reset (at around half the timeout). Maybe the revert is not needed anymore?
Kernel versions v5.14.6 and later include commit aec42642d91, and it was stated twice above that this commit does not fix the problem for everyone. There was no additional fix to the driver after v5.14.6.
I tried aec42642d91 on top of 5.14.3 and the reboots were still present on my boards. So I agree, it did not fix the issue for me either. Something changed in 5.14.14, though, because the very same boards are not rebooting anymore. That's why I wanted to mention it here. Maybe the changes are somewhere other than in the iTCO code. I see changes since 5.14.5 in drivers/watchdog/watchdog_dev.c, which seems to be shared between watchdog drivers.
OK then, you were correct: I still have boards with 5.14.14 that eventually reboot. Not as fast, but still. So I'm building a kernel with your patch and will give you feedback.
No more reboot issues on any of my boards when patch [1] is applied. The timer is always reset at half the timeout. Thanks Guenter! [1] https://patchwork.kernel.org/project/linux-watchdog/patch/20211008003302.1461733-1-linux@roeck-us.net/
If you use systemd's watchdog support with an interval of N seconds, by design it tries to ping the watchdog exactly every N/2 seconds. Since the broken cb011044e34c causes reboot also exactly after N/2 seconds instead of after the intended N seconds, your system will reboot randomly, depending on your luck or the skew between your clocks :) (i.e. depending on whether systemd gets to ping the watchdog in the microseconds before it actually reboots or not). So every N/2 seconds you have a ~50% chance of a random reboot; that's why the time until you see the reboot varies.

To make it consistent one can use watchdog-test (from linux-src/tools/testing/selftests/watchdog/watchdog-test.c). E.g.,

./watchdog-test -t 30 -p 20

Results:
after cb011044e34c, reboot in only 15 seconds (broken).
after aec42642d91, reboot in only 15 seconds (broken).
after the revert, no reboot (correct).

So from my side this revert definitely fixes the bug. And "timeleft" value works correctly again.
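The timing argument above can be checked with a small Python model. This is only a sketch under simplifying assumptions: the watchdog is armed at t=0, pings arrive at exact multiples of the ping interval, and a ping that lands exactly on the deadline is counted as too late (on real hardware that case is the race described above). The function name and parameters are hypothetical.

```python
def first_reboot(effective_timeout, ping_interval, horizon):
    """Return the time of the first reboot, or None if none occurs by `horizon`.

    effective_timeout: seconds the hardware really waits after a ping
    ping_interval:     seconds between keepalive pings from userspace
    A ping arriving exactly at the deadline is treated as too late (worst case).
    """
    deadline = effective_timeout
    t = ping_interval
    while t <= horizon:
        if t >= deadline:
            return deadline          # watchdog fired before (or as) the ping arrived
        deadline = t + effective_timeout  # ping re-arms the timer
        t += ping_interval
    return deadline if deadline <= horizon else None


if __name__ == "__main__":
    # watchdog-test -t 30 -p 20 with the timeout honored in full vs. halved
    print("full timeout:  ", first_reboot(30, 20, 300))
    print("halved timeout:", first_reboot(15, 20, 300))
    # systemd-style pinging at N/2 with a halved timeout (worst-case tie)
    print("systemd, halved:", first_reboot(15, 15, 300))
```

In this model a 20-second ping interval never reaches a 30-second deadline, while a halved 15-second deadline expires before the first ping, which is the same pattern as the watchdog-test results listed above.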
At least the 5.15.x series seems to have the revert already, and it is working as expected. Feel free to close the bug (I'm not the reporter).
Hi, I am using a Debian kernel. I had this bug with kernel 5.14.6, as noted in a previous comment. I updated this afternoon to Debian kernel 5.15.3-1 and I have no more reboots. The watchdog is managed by systemd. Regards.
It should be noted that the watchdog implementation in systemd is flawed by design so it shouldn't be trusted when testing whether a kernel driver works properly. Better use the standalone watchdog-test tool.
Fixed in v5.15 by:

commit 6e7733ef0bb9372d5491168635f6ecba8ac3cb8a
Author: Guenter Roeck
Date:   Thu Oct 7 17:33:02 2021 -0700

    Revert "watchdog: iTCO_wdt: Account for rebooting on second timeout"