Investigating - Farm's slurmdbd is having intermittent issues. If you see an error like the one below, the problem has occurred again, and we will restart slurmdbd to bring it back into service.

"""sacctmgr: error: _open_persist_conn: failed to open persistent connection to host:monitoring-ib:6819: Connection timed out
sacctmgr: error: Sending PersistInit msg: Connection timed out"""

We have a support case open with SchedMD and will update this issue as we learn more.
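For reference, a minimal sketch of the restart-and-verify steps, assuming slurmdbd runs under systemd on its database host and that sacctmgr is available from a login node (this is illustrative, not a confirmed description of Farm's setup):

    # On the slurmdbd host (assumed to be systemd-managed)
    sudo systemctl restart slurmdbd
    sudo systemctl status slurmdbd --no-pager

    # From a login node, confirm the persistent connection works again
    sacctmgr show cluster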

Apr 23, 2025 - 17:22 PDT
Component status (uptime over the past 90 days):

Login: Operational, 92.02% uptime
Storage: Operational, 91.99% uptime
File transfer node: Operational, 100.0% uptime
high2,med2,low2: Operational, 99.97% uptime
high,med,low: Operational, 99.97% uptime
bmh,bmm: Operational, 99.97% uptime
bigmemh,bigmemm: Operational, 99.97% uptime
bgpu: Operational, 99.97% uptime
gpuh,gpum: Operational, 99.97% uptime
Email: Operational, 100.0% uptime
Virtualization: Operational, 100.0% uptime
Proxmox Virtualization Nodes: Operational, 100.0% uptime
Ganetti cluster: Operational, 100.0% uptime
Slurm: Partial Outage, 99.27% uptime
Software: Operational, 91.8% uptime

Scheduled Maintenance

Mandatory NVIDIA driver update May 6, 2025 08:00-20:00 PDT

NVIDIA has notified users of a high-severity vulnerability in its Linux GPU drivers that could allow an unprivileged user to escalate permissions (https://nvidia.custhelp.com/app/answers/detail/a_id/5630). Under UCOP IS-3 policy, we are required to patch affected systems as soon as possible.

As a result, we will be patching NVIDIA drivers on all HPC GPU systems and rebooting them starting at 8:00 a.m. on May 6th, 2025. Jobs currently using HPC GPUs will be killed when the nodes reboot, and new GPU jobs will not be able to start until patching is complete. We expect the maintenance to last until 6:00 p.m. on the same day.
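To see which of your own jobs would be affected, here is a hedged example using standard Slurm commands; the GPU partition names are taken from the component list above, and the job ID is illustrative:

    # List your running and pending jobs on the GPU partitions
    squeue -u $USER -p bgpu,gpuh,gpum

    # Inspect the GPU resources requested by a specific job
    scontrol show job 1234567 | grep -i -E 'tres|gres'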

Please email hpc-help@ucdavis.edu with any questions.

Posted on Apr 25, 2025 - 16:04 PDT
Apr 25, 2025

No incidents reported today.

Apr 24, 2025

No incidents reported.

Apr 23, 2025
Resolved - nas-12-2 recovery was successful. We were able to scrape enough data from the failing drives that ZFS was able to rebuild onto new drives.
Apr 23, 17:19 PDT
Monitoring - nas-12-2's resilver finished ahead of schedule, so we have re-enabled it for use in Farm. One disk kicked off another resilver, but it is the only disk in that vdev with any issues, so we feel pretty comfortable allowing that to happen in the background.

If you run into issues with nas-12-2, please open a Farm Support Ticket.

Apr 11, 16:10 PDT
Update - Disk replacements were successfully performed yesterday, and data reconstruction onto them is in progress. ZFS is currently estimating it will finish in a little over three days, so the best-case estimate is that nas-12-2 will be available for use late Saturday evening. We will provide more updates as the reconstruction progresses.
Apr 9, 10:28 PDT
Update - The ZFS pool scrub (data verification) is in progress. As you can imagine, 409 TB of data takes a while to verify. The current ETA is that it will finish sometime late tonight. This scrub has caused 3 additional hard drives to drop out. The executive decision has been made to replace those drives before allowing users to access the pool. The estimate is an additional 3 days for those drives to have all the data reconstructed onto them, so our best-guess ETA for return-to-service is late this week. We will post additional updates as the disk replacement proceeds.
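For those wondering where these ETAs come from: scrub and resilver progress is reported by zpool status. A minimal sketch follows; the pool name is a placeholder, not the actual pool name on nas-12-2:

    # Show scrub/resilver progress, estimated completion, and per-device errors
    zpool status -v tank    # 'tank' is a placeholder pool name

    # Check progress periodically
    watch -n 60 'zpool status | grep -A 2 scan:'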
Apr 7, 14:49 PDT
Identified - The ZFS scrub has been started and is being watched carefully.
Apr 4, 17:47 PDT
Monitoring - As tends to happen with failing hard drives, data recovery often goes slower than hoped. Two drives had 100% of the data recovered, and a third had 99.99% recovered. The last drive failed too hard to recover data from, but that is okay; ZFS should be able to reconstruct everything it needs from the first three. A ZFS scrub (data verification) is in progress. When this finishes, likely early next week, we will know for sure the state of all the data on nas-12-2.
Apr 4, 14:31 PDT
Update - Summary: nas-12-2 could be online Friday at the earliest, but more likely early next week.

In consultation with Adam Getchell, the decision has been made to do a low-level disk copy from the old, failing drives to new drives. This will minimize the potential for data loss.

This process is expected to finish Thursday at the earliest. Subsequently, the new disks will be added back to the ZFS pool, and we will trigger a full ZFS data scrub. When that finishes, we will know exactly how much, if any, data loss there is and which files are impacted. That data scrub will take a minimum of 24 hours, so the earliest nas-12-2 could be back in service is late Friday. It is more likely the scrub will run through the weekend, so a more realistic return-to-service is early next week.
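The update does not name the copying tool; GNU ddrescue is a common choice for an error-tolerant, low-level copy off a failing drive, so the following is a hedged sketch rather than a description of what was actually run (device paths are illustrative):

    # First pass: copy everything readable, skipping bad areas quickly
    ddrescue -f -n /dev/sdX /dev/sdY rescue.map   # sdX = failing drive, sdY = replacement

    # Later passes: retry the remaining bad areas a few times
    ddrescue -f -r3 /dev/sdX /dev/sdY rescue.map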

Apr 2, 13:59 PDT
Identified - nas-12-2 has suffered from multiple disk failures. Admins are investigating the best path forward.

The following group directories are currently unavailable:

awhitehegrp
millermrgrp
millsgrp
runciegrp
weimergrp
yujingrp

The following home directories are unavailable:

aavalos7
awhitehe
barao
bcbaikie
bcweimer
berdeja
crice
crios
cschles
dglemay
djprince
dkblaufu
drbandoy
eabernat
ecgranad
edkoch
emmaluu
eoziolor
fengq
hahudson
hemstrow
hxhu
jagill
jajpark
jamcgirr
jassim
jcariute
jdowen
jenwash
jmiller1
jroach
jrwashab
jxnliu
katng23
ljcohen
madarm11
mam12n
mary363
millermr
mlyjones
mmosmond
motch
mtreiber
namcnabb
nmariano
nreid
pjseba
profeta
prvasque
psbapat
rsbrenna
sakre
saumyaw
scsastry
seboles
sejoslin
smhigdon
spatel23
tmbolt
vfbetsis
vpdunne
wolfie12
xmixu
yoxue
ytakim
ywdong

Mar 31, 18:09 PDT
Apr 22, 2025

No incidents reported.

Apr 21, 2025

No incidents reported.

Apr 20, 2025

No incidents reported.

Apr 19, 2025

No incidents reported.

Apr 18, 2025

No incidents reported.

Apr 17, 2025

No incidents reported.

Apr 16, 2025

No incidents reported.

Apr 15, 2025

No incidents reported.

Apr 14, 2025

No incidents reported.

Apr 13, 2025

No incidents reported.

Apr 12, 2025

No incidents reported.

Apr 11, 2025