I have a hardware watchdog. What is a good way to test that it actually works? Is there a standard script that puts the system into an infinite loop, or hogs all resources, to the point of triggering the hardware watchdog?
3 Answers
One simple way to test a watchdog is to trigger a kernel panic. This can be done as root with:
echo c > /proc/sysrq-trigger
After the panic, nothing resets the watchdog timer anymore, so the timer expires and the watchdog resets the system.
SysRq is a 'magical' key combo you can hit which the kernel will respond to regardless of whatever else it is doing, unless it is completely locked up. It can also be used by echoing letters to /proc/sysrq-trigger, like we're doing here.
In this case, the letter c means perform a system crash and take a crashdump if configured.
You can find the documentation for SysRq here.
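One thing worth checking before crashing the box on purpose: the kernel.panic sysctl. If it is non-zero, the kernel reboots by itself after a panic, and you can't easily tell whether the watchdog did anything. A minimal sketch of such a check follows; panic_reboot_policy and PANIC_SYSCTL are names made up for this illustration, not standard tools or variables.

```shell
#!/bin/sh
# Sketch: report what the kernel will do by itself after a panic, so a reset
# can be attributed to the watchdog. PANIC_SYSCTL is an illustrative variable
# that only exists to make the path overridable.
PANIC_SYSCTL=${PANIC_SYSCTL:-/proc/sys/kernel/panic}

panic_reboot_policy() {
    secs=$(cat "$PANIC_SYSCTL") || return 1
    if [ "$secs" -eq 0 ]; then
        # 0: the kernel loops forever after a panic and never reboots itself.
        echo "kernel.panic=0: kernel stays halted after a panic, so a reset points to the watchdog"
    elif [ "$secs" -lt 0 ]; then
        # Negative: the kernel reboots immediately.
        echo "kernel.panic=$secs: kernel reboots immediately after a panic"
    else
        # Positive: the kernel reboots after that many seconds.
        echo "kernel.panic=$secs: kernel reboots itself after $secs seconds"
    fi
}
```

Call panic_reboot_policy first; if it reports a non-zero value that is shorter than your watchdog timeout, the reboot you see after echo c is probably the kernel's own.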
If you want to test that the watchdog works and you are running a watchdog process in userspace (as opposed to a kernel watchdog that is running in a kernel thread), simply kill or stop the process. This will simulate an unresponsive system. Note that a hardware watchdog is not one where the hardware is resetting the timer. A hardware watchdog is one where the timer is implemented in hardware. Either way, it is software which periodically resets the watchdog timer.
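Stopping the process can be sketched like this. SIGSTOP is used because it cannot be caught, so the daemon gets no chance to shut the watchdog down cleanly on its way out. freeze_process is a hypothetical helper name, and the daemon name in the usage note is an assumption about your system.

```shell
#!/bin/sh
# Sketch: freeze a userspace watchdog daemon so it stops resetting the timer.
# freeze_process is an illustrative helper, not part of any watchdog package.
freeze_process() {
    pid=$1
    # SIGSTOP cannot be caught or ignored, so the process cannot react to it
    # (for example by disarming the watchdog) before it stops running.
    if kill -STOP "$pid" 2>/dev/null; then
        echo "frozen: $pid"
    else
        echo "could not signal: $pid" >&2
        return 1
    fi
}
```

As root, something like freeze_process "$(pidof watchdog)" would do it, assuming a single daemon that is actually called watchdog. If you change your mind before the timer expires, kill -CONT on the same PID resumes it.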
Another typical way to test the watchdog if it is armed but not active is by simply opening it:
# cat >> /dev/watchdog
This assumes that there is no watchdog process running. Once it is opened, the watchdog timer starts. Because cat has merely opened it and won't write anything to it (it is reading from stdin, waiting for input that will never come), the timer will eventually expire and the system will reset.
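For contrast, here is a rough sketch of what a well-behaved client does with the same device: it writes to it periodically, and writes the character V just before closing. WDT_DEV, PINGS, INTERVAL and pet_then_disarm are names invented for this example. Whether V actually disarms the timer depends on the driver supporting "magic close" and on nowayout not being set, so treat that part as an assumption about your hardware.

```shell
#!/bin/sh
# Sketch: pet a watchdog a few times, then try to disarm it on close.
# WDT_DEV, PINGS and INTERVAL are illustrative names, not standard variables.
WDT_DEV=${WDT_DEV:-/dev/watchdog}
PINGS=${PINGS:-5}
INTERVAL=${INTERVAL:-1}

pet_then_disarm() {
    {
        # The device is open on fd 3 for the whole group; opening it is
        # what starts the timer, exactly as in the cat example.
        i=0
        while [ "$i" -lt "$PINGS" ]; do
            # Any write counts as a keepalive and restarts the countdown.
            printf '.' >&3
            sleep "$INTERVAL"
            i=$((i + 1))
        done
        # 'V' as the last write before close asks the driver to stop the
        # timer. Drivers without magic close, or with nowayout, ignore it.
        printf 'V' >&3
    } 3> "$WDT_DEV"
}
```

INTERVAL has to be shorter than the watchdog timeout. If the machine stays up while this runs but resets some time after it returns, the keepalives worked and the close did not disarm the timer, which is itself a decent test.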
Even under heavy load, a system won't normally lock up to the extent that the watchdog times out. The watchdog is designed to restart the system when it becomes unrecoverable, or at least fails to recover for a certain amount of time. This could be caused by hard lockups, such as extremely high memory pressure or a swap target on a device that is blocking.
A computer has a finite number of CPU cores. Ignoring SMT, each core can only run one process at any given instant. To prevent a process from taking up too much time, the kernel scheduler specifies a timeslice. Every process can run during its timeslice, but once it has expired (or the process has voluntarily yielded or has performed a blocking system call), the kernel will automatically preempt it and start running another process. This design means that a process can never stop another process from running just by using a lot of CPU. This is the case as long as the kernel is healthy.
A process that is running in userspace can always be preempted by the kernel. There is nothing it can do to prevent that. It can decide to use up its entire timeslice and thus guarantee the system will be at 100% load, but it cannot prevent the kernel from giving other processes a turn. However, if it's running in kernelspace due to a page fault, signal, syscall, etc., and a problem causes the kernel to be unable to satisfy the request and return to userspace, then the process will be locked up and the scheduler will not get a chance to run. If this happens on all cores, the watchdog process will not get a chance to run either, the timer will eventually expire, and the system will be reset.
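You can see the userspace half of this for yourself with a small experiment: spin one busy loop per core and check that an unrelated loop still gets scheduled. This is only a sketch; heartbeat_under_load is a made-up name, and the one-second sleep merely stands in for the interval between watchdog pings.

```shell
#!/bin/sh
# Sketch: saturate every core with userspace busy loops and check that a
# separate "heartbeat" loop is still scheduled. Names are illustrative only.
heartbeat_under_load() {
    beats_wanted=${1:-3}
    ncpu=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    pids=""
    n=0
    while [ "$n" -lt "$ncpu" ]; do
        # Pure userspace spin: uses its whole timeslice and never blocks.
        ( while :; do :; done ) &
        pids="$pids $!"
        n=$((n + 1))
    done
    beats=0
    while [ "$beats" -lt "$beats_wanted" ]; do
        sleep 1   # stands in for the gap between two watchdog pings
        beats=$((beats + 1))
    done
    # Clean up the hogs; $pids is left unquoted on purpose so it splits.
    kill $pids 2>/dev/null
    wait $pids 2>/dev/null
    echo "heartbeats delivered: $beats of $beats_wanted"
}
```

On a healthy kernel the function returns after roughly beats_wanted seconds even though every core is at 100%, which is exactly why plain CPU load will not trip the watchdog.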
Or exec 3> /dev/watchdog in a Bourne-like shell to just open that device file. – Stéphane Chazelas, Jun 20, 2023 at 16:59
It used to be that perl -e '1 while 1' would keep a system pretty busy, but that's a yawner for a multiprocessor system with 64 cores or whatever, and it does not really exercise the memory. One method (on Linux) is to get process creation ahead of the out-of-memory killer so that the system eventually thrashes itself to death. One way to do this is to allocate as much memory as possible in a single process, spawn a bunch of worker threads, and have those threads make random memory accesses across all the allocated memory. Then you run a few shell loops that try to start a lot of such processes, all using up as much memory as possible. Someone may have written a script along these lines already.
https://github.com/thrig/scripts/blob/master/bench/usemem.c
It uses a custom RNG that could probably be replaced with rand() calls in the threads. I haven't compiled it on Linux in a while, so various -D defines might be necessary to placate the compiler.
That won't trigger a watchdog unless the system is insanely slow already. You'd have to get an unrecoverable OOM, and while simply reading large buffers of memory in a random pattern will certainly stress test your CPU cache and will slow things down, it won't prevent the kernel scheduler from getting to the watchdog process. The only way to trigger the watchdog with a high system load is to ensure that the timeslice for the active KSE on each core never expires. – forest, Aug 26, 2022 at 0:36