Investigating - Farm's slurmdbd is having intermittent issues. If you see an error like the one below, the problem has occurred again, and we will restart slurmdbd to bring it back into service.

"""sacctmgr: error: _open_persist_conn: failed to open persistent connection to host:monitoring-ib:6819: Connection timed out
sacctmgr: error: Sending PersistInit msg: Connection timed out"""

We have a support case open with SchedMD and will update this issue as we learn more.
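For reference, a minimal sketch of the restart-and-verify steps, assuming slurmdbd runs under systemd on its database host and that sacctmgr is available from a login node (this is illustrative, not a confirmed description of Farm's setup):

    # On the slurmdbd host (assumed to be systemd-managed)
    sudo systemctl restart slurmdbd
    sudo systemctl status slurmdbd --no-pager

    # From a login node, confirm the persistent connection works again
    sacctmgr show cluster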

Apr 23, 2025 - 17:22 PDT
Component status (uptime over the past 90 days):

Login: Operational, 92.02% uptime
Storage: Operational, 91.99% uptime
File transfer node: Operational, 100.0% uptime
high2,med2,low2: Operational, 99.97% uptime
high,med,low: Operational, 99.97% uptime
bmh,bmm: Operational, 99.97% uptime
bigmemh,bigmemm: Operational, 99.97% uptime
bgpu: Operational, 99.97% uptime
gpuh,gpum: Operational, 99.97% uptime
Email: Operational, 100.0% uptime
Virtualization: Operational, 100.0% uptime
Proxmox Virtualization Nodes: Operational, 100.0% uptime
Ganetti cluster: Operational, 100.0% uptime
Slurm: Partial Outage, 99.27% uptime
Software: Operational, 91.8% uptime

Scheduled Maintenance

Mandatory NVIDIA driver update May 6, 2025 08:00-20:00 PDT

NVIDIA has notified users of a high-severity vulnerability in its Linux GPU drivers that could allow an unprivileged user to escalate permissions (https://nvidia.custhelp.com/app/answers/detail/a_id/5630). Under UCOP IS-3 policy, we are required to patch affected systems as soon as possible.

As a result, we will be patching NVIDIA drivers on all HPC GPU systems and rebooting them starting at 8:00 a.m. on May 6th, 2025. Jobs currently using HPC GPUs will be killed when the nodes reboot, and new GPU jobs will not be able to start until patching is complete. We expect the maintenance to last until 6:00 p.m. on the same day.
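To see which of your own jobs would be affected, here is a hedged example using standard Slurm commands; the GPU partition names are taken from the component list above, and the job ID is illustrative:

    # List your running and pending jobs on the GPU partitions
    squeue -u $USER -p bgpu,gpuh,gpum

    # Inspect the GPU resources requested by a specific job
    scontrol show job 1234567 | grep -i -E 'tres|gres'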

Please email hpc-help@ucdavis.edu with any questions.

Posted on Apr 25, 2025 - 16:04 PDT
Apr 25, 2025

No incidents reported today.

Apr 24, 2025

No incidents reported.

Apr 23, 2025
Resolved - nas-12-2 recovery was successful. We were able to scrape enough data from the failing drives that ZFS was able to rebuild onto new drives.
Apr 23, 17:19 PDT
Monitoring - nas-12-2's resilver finished ahead of schedule, so we have re-enabled it for use in Farm. One disk kicked off another resilver, but it is the only disk in that vdev with any issues, so we feel pretty comfortable allowing that to happen in the background.

If you run into issues with nas-12-2, please open a Farm Support Ticket.

Apr 11, 16:10 PDT
Update - Disk replacements were successfully performed yesterday, and data reconstruction onto them is in progress. ZFS is currently estimating it will finish in a little over three days, so the best-case estimate is that nas-12-2 will be available for use late Saturday evening. We will provide more updates as the reconstruction progresses.
Apr 9, 10:28 PDT
Update - The ZFS pool scrub (data verification) is in progress. As you can imagine, 409 TB of data takes a while to verify. The current ETA is that it will finish sometime late tonight. This scrub has caused 3 additional hard drives to drop out. The executive decision has been made to replace those drives before allowing users to access the pool. The estimate is an additional 3 days for those drives to have all the data reconstructed onto them, so our best-guess ETA for return-to-service is late this week. We will post additional updates as the disk replacement proceeds.
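For those wondering where these ETAs come from: scrub and resilver progress is reported by zpool status. A minimal sketch follows; the pool name is a placeholder, not the actual pool name on nas-12-2:

    # Show scrub/resilver progress, estimated completion, and per-device errors
    zpool status -v tank    # 'tank' is a placeholder pool name

    # Check progress periodically
    watch -n 60 'zpool status | grep -A 2 scan:'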
Apr 7, 14:49 PDT
Identified - The ZFS scrub has been started and is being watched carefully.
Apr 4, 17:47 PDT
Monitoring - As tends to happen with failing hard drives, data recovery often goes slower than hoped. Two drives had 100% of the data recovered, and a third had 99.99% recovered. The last drive failed too hard to recover data from, but that is okay; ZFS should be able to reconstruct everything it needs from the first three. A ZFS scrub (data verification) is in progress. When this finishes, likely early next week, we will know for sure the state of all the data on nas-12-2.
Apr 4, 14:31 PDT
Update - Summary: nas-12-2 could be online Friday at the earliest, but more likely early next week.

In consultation with Adam Getchell, the decision has been made to do a low-level disk copy from the old, failing drives to new drives. This will minimize the potential for data loss.

This process is expected to finish Thursday at the earliest. Subsequently, the new disks will be added back to the ZFS pool, and we will trigger a full ZFS data scrub. When that finishes, we will know exactly how much, if any, data loss there is and which files are impacted. That data scrub will take a minimum of 24 hours, so the earliest nas-12-2 could be back in service is late Friday. It is more likely the scrub will run through the weekend, so a more realistic return-to-service is early next week.
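The update does not name the copying tool; GNU ddrescue is a common choice for an error-tolerant, low-level copy off a failing drive, so the following is a hedged sketch rather than a description of what was actually run (device paths are illustrative):

    # First pass: copy everything readable, skipping bad areas quickly
    ddrescue -f -n /dev/sdX /dev/sdY rescue.map   # sdX = failing drive, sdY = replacement

    # Later passes: retry the remaining bad areas a few times
    ddrescue -f -r3 /dev/sdX /dev/sdY rescue.map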

Apr 2, 13:59 PDT
Identified - nas-12-2 has suffered from multiple disk failures. Admins are investigating the best path forward.

The following group directories are currently unavailable:

awhitehegrp
millermrgrp
millsgrp
runciegrp
weimergrp
yujingrp

The following home directories are unavailable:

aavalos7
awhitehe
barao
bcbaikie
bcweimer
berdeja
crice
crios
cschles
dglemay
djprince
dkblaufu
drbandoy
eabernat
ecgranad
edkoch
emmaluu
eoziolor
fengq
hahudson
hemstrow
hxhu
jagill
jajpark
jamcgirr
jassim
jcariute
jdowen
jenwash
jmiller1
jroach
jrwashab
jxnliu
katng23
ljcohen
madarm11
mam12n
mary363
millermr
mlyjones
mmosmond
motch
mtreiber
namcnabb
nmariano
nreid
pjseba
profeta
prvasque
psbapat
rsbrenna
sakre
saumyaw
scsastry
seboles
sejoslin
smhigdon
spatel23
tmbolt
vfbetsis
vpdunne
wolfie12
xmixu
yoxue
ytakim
ywdong

Mar 31, 18:09 PDT
Apr 22, 2025

No incidents reported.

Apr 21, 2025

No incidents reported.

Apr 20, 2025

No incidents reported.

Apr 19, 2025

No incidents reported.

Apr 18, 2025

No incidents reported.

Apr 17, 2025

No incidents reported.

Apr 16, 2025

No incidents reported.

Apr 15, 2025

No incidents reported.

Apr 14, 2025

No incidents reported.

Apr 13, 2025

No incidents reported.

Apr 12, 2025

No incidents reported.

Apr 11, 2025