Farm's slurmdbd having intermittent issues

Incident Report for Farm HPC cluster

Investigating

Farm's slurmdbd is having intermittent issues. If you see an error like the one below, the problem has recurred, and we will restart slurmdbd to bring it back into service.

"""sacctmgr: error: _open_persist_conn: failed to open persistent connection to host:monitoring-ib:6819: Connection timed out
sacctmgr: error: Sending PersistInit msg: Connection timed out"""
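
If you want to tell this error apart from other failures in your own scripts, the following is a minimal sketch, not something provided by Slurm or by us. The function name `looks_like_slurmdbd_timeout` is hypothetical, and the patterns are taken only from the two error lines quoted above, so differently worded messages will not match.

```python
import re

# Hypothetical helper: patterns are derived only from the error text quoted
# in this report, not from any Slurm API.
_TIMEOUT_PATTERNS = (
    re.compile(
        r"_open_persist_conn: failed to open persistent connection "
        r"to host:\S+: Connection timed out"
    ),
    re.compile(r"Sending PersistInit msg: Connection timed out"),
)


def looks_like_slurmdbd_timeout(output: str) -> bool:
    """Return True if the text contains either of the quoted timeout lines."""
    return any(pattern.search(output) for pattern in _TIMEOUT_PATTERNS)
```

You could pass it the captured stderr of a failed `sacctmgr` call. A match suggests the failure is this incident rather than a problem with your own command.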

We have a support case open with SchedMD and will update this issue as we learn more.
Posted Apr 23, 2025 - 17:22 PDT
This incident affects: Slurm.