Bug 213809
| Summary: | iTCO_wdt always reboots system after 60 seconds of uptime when running Linux 5.14-rc or 5.13.4 | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Michael Marley (michael) |
| Component: | Watchdog | Assignee: | drivers_watchdog (drivers_watchdog) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | high | CC: | debbugs, frederic, jdelvare, linux, michael, seblu |
| Priority: | P1 | | |
| Hardware: | Intel | | |
| OS: | Linux | | |
| Kernel Version: | 5.14-rc2, 5.13.4 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
Description
Michael Marley
2021-07-21 01:44:49 UTC
Same problem here on a Lynx Point TCO device. Reverting the faulty commit on top of kernel 5.13.7 solves the problem for me.

Turns out a fix is already available: https://lkml.org/lkml/2021/7/26/349

Thanks, that patch solves the issue on my systems.

I'm affected by a similar issue on Atom C2350 CPUs with 5.14 mainline and stable 5.14.x kernels. The kernel reboots after a short time. Stopping userland from using the watchdog is a workaround that fixes the reboot loop. There is no such reboot with the 5.13 kernel branch up to 5.13.14. I built a 5.14.3 with the patch and it doesn't fix the issue. Do you think this is related, or do I need to open a new bug report? Regards,

You could try to revert commit cb011044e34c. If it helps, it is likely the same problem.

Something is still broken even after the fix aec42642d91. I have a Broadwell system ("9 Series TCO device (Version=2, TCOBASE=0x1860)"), and before commit cb011044e34c (say 5.12.x) I would see the following behavior:

I set the timeout to 30 (wdctl -s 30).
wdctl prints:
    Timeout: 30 seconds
    Timeleft: 30 seconds
I enable the watchdog.
After 30 seconds, the system reboots.

--------

After commit cb011044e34c, plus the fix aec42642d91, I see:

I set the timeout to 30 (wdctl -s 30).
wdctl prints:
    Timeout: 30 seconds
    Timeleft: 15 seconds
After only 15 seconds, the system reboots.

-------

In addition to the confusing UI (showing different values for timeout and timeleft right after setting the watchdog), it is rebooting _early_ (at only half the timeout!). That is much more problematic than rebooting late...

On my system, during probe() I can see that SMI_EN is 0x80020023 (i.e. TCO_EN is off but GBL_SMI_EN is on), so we do the tmrval /= 2. However, my system definitely reboots on the first timeout somehow, so it doesn't need that. Any ideas? Is there any other interesting register that could help understand the difference? I don't have any system where the divide by two is actually needed, and the Intel manuals are not much help.
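The arithmetic behind the "Timeout: 30 / Timeleft: 15" observation can be sketched as below. This is only a model for illustration, not the driver's code: the bit positions of `TCO_EN` and `GBL_SMI_EN` are assumptions chosen to be consistent with the SMI_EN value 0x80020023 reported above, the halving condition is assumed to be "TCO_EN and GBL_SMI_EN are not both set", and the helper names `halves_timer` and `seconds_until_reset` are hypothetical.

```c
#include <stdint.h>

/* Assumed bit positions; consistent with SMI_EN = 0x80020023 having
 * GBL_SMI_EN set and TCO_EN clear, as reported in the comment above. */
#define GBL_SMI_EN (1u << 0)
#define TCO_EN     (1u << 13)

/* Hypothetical helper: returns 1 when the (assumed) condition for
 * programming half the requested timeout is met, i.e. when TCO_EN and
 * GBL_SMI_EN are not both set in SMI_EN. */
static int halves_timer(uint32_t smi_en)
{
    return (smi_en & (TCO_EN | GBL_SMI_EN)) != (TCO_EN | GBL_SMI_EN);
}

/* Hypothetical helper: seconds until the machine resets on hardware that
 * resets on the FIRST timer expiry. If the timer was halved on the
 * assumption that the reset happens on the second expiry, the reset
 * arrives after only half the requested timeout. */
static unsigned int seconds_until_reset(unsigned int timeout_secs,
                                        uint32_t smi_en)
{
    return halves_timer(smi_en) ? timeout_secs / 2 : timeout_secs;
}
```

With the reported register value, `seconds_until_reset(30, 0x80020023u)` evaluates to 15 in this model, which matches the early reboot described above; a value with both bits set leaves the full 30 seconds.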
Hi, I am using a Debian kernel. I'm adding my findings to this bug in case they help. I tested kernel 5.14.6 on a PC and I have the same bug: the PC restarts after one minute. There is no problem with kernel 5.10.46. The watchdog is managed by systemd.

Motherboard: Asus P7P55D-E
CPU: Intel Core i7 CPU 870 @ 2.93GHz

Extract from the logs:

    iTCO_vendor_support: vendor-support=0
    iTCO_wdt iTCO_wdt.1.auto: Found a P55 TCO device (Version=2, TCOBASE=0x0860)
    iTCO_wdt iTCO_wdt.1.auto: initialized. heartbeat=30 sec (nowayout=0)

    $ sudo wdctl
    Device:        /dev/watchdog0
    Identity:      iTCO_wdt [version 0]
    Timeout:       30 seconds
    Pre-timeout:    0 seconds
    Timeleft:      15 seconds
    FLAG           DESCRIPTION               STATUS BOOT-STATUS
    KEEPALIVEPING  Keep alive ping reply          1           0
    MAGICCLOSE     Supports magic close char      0           0
    SETTIMEOUT     Set timeout (in seconds)       0           0

    $ cat /etc/systemd/system.conf.d/10-watchdog.conf
    #[Manager]
    #RuntimeWatchdogSec=30

I rebooted without activating the watchdog to get the correct information. With kernel 5.10, if I run "watch wdctl" just after booting, the value of Timeleft changes between 15 and 30. With kernel 5.14, the value of Timeleft changes between 0 and 15. Nothing is recorded in the logs at the time of the reboot that could give more information. Regards.

Enough is enough. I created a revert of the offending commits: https://patchwork.kernel.org/project/linux-watchdog/patch/20211008003302.1461733-1-linux@roeck-us.net/ in case someone wants to confirm that it works.

Before trying your patch, I tested the latest 5.14.14 stable kernel without it, and the issue looks fixed. The timer is indeed reset (at around half the timeout). Maybe the revert is not needed anymore?

Kernel versions v5.14.6 and later include commit aec42642d91, and it was stated twice above that this commit does not fix the problem for everyone. There was no additional fix to the driver after v5.14.6.

I tried aec42642d91 against 5.14.3 and reboots were still present on my boards.
So I agree: it did not fix the issue for me either. Something changed in 5.14.14, though, because the very same boards are not rebooting anymore. That's why I wanted to mention it here. Maybe the changes are somewhere other than the iTCO code. I see changes since 5.14.5 in drivers/watchdog/watchdog_dev.c, which seems to be shared between watchdog drivers.

OK then, you were correct: I still have boards running 5.14.14 that eventually reboot. Not as fast, but still. So I'm building a kernel with your patch and will report back.

No more reboot issues on any of my boards when patch [1] is applied. The timer is always reset at half the timeout. Thanks Guenter!

[1] https://patchwork.kernel.org/project/linux-watchdog/patch/20211008003302.1461733-1-linux@roeck-us.net/

If you use systemd's watchdog support with an interval of N seconds, by design it tries to ping the watchdog exactly every N/2 seconds. Since the broken cb011044e34c also causes a reboot exactly after N/2 seconds instead of after the intended N seconds, your system will reboot randomly, depending on your luck or the skew between your clocks :) (i.e. depending on whether or not systemd gets to ping the watchdog in the microseconds before it actually reboots). So every N/2 seconds you have a ~50% chance of a random reboot; that's why the time until you see the reboot varies.

To make it consistent, one can use watchdog-test (from linux-src/tools/testing/selftests/watchdog/watchdog-test.c). E.g.:

    ./watchdog-test -t 30 -p 20

Results:
after cb011044e34c, reboot in only 15 seconds (broken).
after aec42642d91, reboot in only 15 seconds (broken).
after the revert, no reboot (correct).

So from my side this revert definitely fixes the bug, and the "timeleft" value works correctly again.

At least the 5.15.x series seems to have the revert already, and it is working as expected. Feel free to close the bug (I'm not the reporter).

Hi, I am using a Debian kernel. I had this bug with kernel 5.14.6, as noted in a previous comment.
I updated this afternoon to the Debian kernel 5.15.3-1 and I have no more reboots. The watchdog is managed by systemd. Regards.

It should be noted that the watchdog implementation in systemd is flawed by design, so it shouldn't be trusted when testing whether a kernel driver works properly. Better to use the standalone watchdog-test tool.

Fixed in v5.15 by:

commit 6e7733ef0bb9372d5491168635f6ecba8ac3cb8a
Author: Guenter Roeck
Date: Thu Oct 7 17:33:02 2021 -0700

    Revert "watchdog: iTCO_wdt: Account for rebooting on second timeout"
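For anyone revisiting the watchdog-test results reported earlier in this thread (-t 30 -p 20), the timing can be modelled with a small sketch. This is a simplified whole-second simulation under stated assumptions, not a description of the driver or of systemd: `first_reset` is a hypothetical helper, `reset_after` is the time the hardware actually waits after the last ping, and a ping that lands in the same second as the expiry is treated as arriving too late.

```c
/*
 * Hypothetical model of a watchdog being pinged periodically.
 *
 * reset_after: seconds the hardware actually waits after the last ping
 * ping_every:  keepalive interval used by userspace (0 = never ping)
 * horizon:     how many seconds to simulate
 *
 * Returns the second at which the machine resets, or 0 if the pings
 * keep it alive for the whole simulated horizon.
 */
static unsigned int first_reset(unsigned int reset_after,
                                unsigned int ping_every,
                                unsigned int horizon)
{
    unsigned int last_ping = 0;

    for (unsigned int t = 1; t <= horizon; t++) {
        /* Expiry is checked first, so a tie counts as a reset. */
        if (t - last_ping >= reset_after)
            return t;
        if (ping_every != 0 && t % ping_every == 0)
            last_ping = t;
    }
    return 0;
}
```

In this model, a 30-second timeout that is halved to 15 seconds with a 20-second ping interval resets at second 15, while the full 30 seconds with the same ping interval never resets, in line with the results reported for the broken and reverted kernels. The systemd case (ping interval equal to the halved timeout) is exactly the tie that the model resolves as a reset; on real hardware it is a race.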