Farm: nas-12-2 down due to multiple disk failures

Incident Report for Farm HPC cluster

Identified

The ZFS scrub has been started and is being watched carefully.
Posted 2 days ago. Apr 04, 2025 - 17:47 PDT

Monitoring

As tends to happen with failing hard drives, data recovery went slower than hoped. Two drives had 100% of their data recovered, and a third had 99.99% recovered. The last drive failed too badly to recover any data from it, but that is okay: ZFS should be able to reconstruct everything it needs from the first three. A ZFS scrub (data verification) is in progress. When it finishes, likely early next week, we will know for sure the state of all the data on nas-12-2.
Posted 2 days ago. Apr 04, 2025 - 14:31 PDT

Update

Summary: nas-12-2 could be online Friday at the earliest, but more likely early next week.

In consultation with Adam Getchell, the decision has been made to do a low-level disk copy from the old, failing drives to new drives. This will minimize the potential for data loss.

This process is expected to finish Thursday at the earliest. Afterward, the new disks will be added back to the ZFS pool, and we will trigger a full ZFS data scrub. When that finishes, we will know exactly how much data, if any, has been lost and which files are impacted. The scrub will take a minimum of 24 hours, so the earliest nas-12-2 could be back in service is late Friday. It is more likely the scrub will run through the weekend, so a more realistic return-to-service estimate is early next week.
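
For anyone planning work around this outage, the timeline arithmetic above can be sketched as follows. This is only an illustration: the exact copy-completion time (`copy_done`) and the 72-hour "runs through the weekend" scrub duration are assumptions chosen for the example, not figures from the admins. Only the 24-hour minimum scrub time comes from this update.

```python
from datetime import datetime, timedelta

# Assumption (hypothetical): the low-level disk copy finishes late on
# Thursday, Apr 03, 2025. The update only says "Thursday at the earliest".
copy_done = datetime(2025, 4, 3, 18, 0)

# From the update: the full ZFS scrub takes a minimum of 24 hours.
MIN_SCRUB = timedelta(hours=24)

# Assumption (hypothetical): a scrub that "runs through the weekend"
# is modeled here as taking 72 hours.
WEEKEND_SCRUB = timedelta(hours=72)


def earliest_return(copy_finished: datetime, scrub_duration: timedelta) -> datetime:
    """Return the earliest time the server could be back in service.

    The scrub cannot start until the disk copy is done, so the estimate
    is simply the copy-completion time plus the scrub duration.
    """
    return copy_finished + scrub_duration


best_case = earliest_return(copy_done, MIN_SCRUB)
weekend_case = earliest_return(copy_done, WEEKEND_SCRUB)

print("Best case:       ", best_case.isoformat(sep=" "))
print("Weekend-long run:", weekend_case.isoformat(sep=" "))
```

Under these assumptions the best case lands on Friday evening and the longer scrub ends on Sunday evening, which is consistent with a return to service early the following week.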
Posted 4 days ago. Apr 02, 2025 - 13:59 PDT

Identified

nas-12-2 has suffered multiple disk failures. Admins are investigating the best path forward.

The following group directories are currently unavailable:

awhitehegrp
millermrgrp
millsgrp
runciegrp
weimergrp
yujingrp

The following home directories are unavailable:

aavalos7
awhitehe
barao
bcbaikie
bcweimer
berdeja
crice
crios
cschles
dglemay
djprince
dkblaufu
drbandoy
eabernat
ecgranad
edkoch
emmaluu
eoziolor
fengq
hahudson
hemstrow
hxhu
jagill
jajpark
jamcgirr
jassim
jcariute
jdowen
jenwash
jmiller1
jroach
jrwashab
jxnliu
katng23
ljcohen
madarm11
mam12n
mary363
millermr
mlyjones
mmosmond
motch
mtreiber
namcnabb
nmariano
nreid
pjseba
profeta
prvasque
psbapat
rsbrenna
sakre
saumyaw
scsastry
seboles
sejoslin
smhigdon
spatel23
tmbolt
vfbetsis
vpdunne
wolfie12
xmixu
yoxue
ytakim
ywdong
Posted 6 days ago. Mar 31, 2025 - 18:09 PDT
This incident affects: Login and Storage.