Investigating - Farm's slurmdbd (the Slurm database daemon) is having intermittent issues. If you see an error like the one below, the problem has recurred, and we will restart slurmdbd to bring it back into service.
"""sacctmgr: error: _open_persist_conn: failed to open persistent connection to host:monitoring-ib:6819: Connection timed out
sacctmgr: error: Sending PersistInit msg: Connection timed out"""
We have a support case open with SchedMD and will update this issue as we learn more.
Apr 23, 2025 - 17:22 PDT
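For users who want to check from a script whether the slurmdbd problem has recurred, the following is a minimal sketch rather than an official Farm tool. It assumes `sacctmgr` is on your PATH, uses `sacctmgr show cluster` as an arbitrary lightweight query, and matches on substrings of the error text quoted above. The function names and the 60-second timeout are hypothetical choices made for illustration.

```python
import subprocess

# Substrings taken from the error text quoted in the incident above.
DBD_ERROR_MARKERS = (
    "failed to open persistent connection",
    "Sending PersistInit msg: Connection timed out",
)


def looks_like_slurmdbd_outage(output: str) -> bool:
    """Return True if the text contains either slurmdbd connection error marker."""
    return any(marker in output for marker in DBD_ERROR_MARKERS)


def check_slurmdbd(timeout_s: float = 60.0) -> bool:
    """Run `sacctmgr show cluster` and report whether slurmdbd seems reachable.

    Returns False when the known error text appears, when the command exits
    with a non-zero status, or when it runs past timeout_s (an assumed limit).
    Raises FileNotFoundError if sacctmgr is not installed on this machine.
    """
    try:
        result = subprocess.run(
            ["sacctmgr", "show", "cluster"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    combined = result.stdout + result.stderr
    return result.returncode == 0 and not looks_like_slurmdbd_outage(combined)
```

Calling `check_slurmdbd()` on a login node returns `False` when the symptoms described above are present. A `False` result only suggests the issue has recurred; it does not confirm the cause.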
Login: Operational (99.98% uptime over the past 90 days)
Storage: Operational (100.0% uptime over the past 90 days)
File transfer node: Operational (100.0% uptime over the past 90 days)
high2,med2,low2: Operational (100.0% uptime over the past 90 days)
high,med,low: Operational (100.0% uptime over the past 90 days)
bmh,bmm: Operational (100.0% uptime over the past 90 days)
bigmemh,bigmemm: Operational (100.0% uptime over the past 90 days)
bgpu: Operational (100.0% uptime over the past 90 days)
gpuh,gpum: Operational (100.0% uptime over the past 90 days)
Email: Operational (100.0% uptime over the past 90 days)
Virtualization: Operational (100.0% uptime over the past 90 days)
Proxmox Virtualization Nodes: Operational (100.0% uptime over the past 90 days)
Ganetti cluster: Operational (100.0% uptime over the past 90 days)
Slurm: Operational (83.54% uptime over the past 90 days)
Software: Operational (100.0% uptime over the past 90 days)
Resolved - This issue has been resolved. System administrators have fixed the replication issues in the account provisioning process.
Jul 23, 14:25 PDT
Monitoring - System administrators have fixed an issue with the account synchronization process and will continue to monitor the database for inconsistencies. Users should now be able to log in without issue.
Jul 23, 09:10 PDT
Investigating - Some users report being unable to log in to Farm's head node, although they can still use Open OnDemand. System administrators are looking into the issue.
Jul 23, 08:08 PDT