Update - Summary: nas-12-2 could be online Friday at the earliest, but more likely early next week.
In consultation with Adam Getchell, we have decided to do a low-level disk copy from the old, failing drives to new drives. This will minimize the potential for data loss.
This process is expected to finish Thursday at the earliest. The new disks will then be added back to the ZFS pool, and we will trigger a full ZFS data scrub. When the scrub finishes, we will know exactly how much data, if any, was lost and which files are affected. The scrub will take a minimum of 24 hours, so the earliest nas-12-2 could be back in service is late Friday. It is more likely that the scrub will run through the weekend, so a more realistic return to service is early next week.
Apr 02, 2025 - 13:59 PDT
Identified - nas-12-2 has suffered multiple disk failures. Admins are investigating the best path forward.
The following group directories are currently unavailable:
Completed -
The scheduled maintenance has been completed.
Mar 31, 18:00 PDT
In progress -
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Mar 31, 17:00 PDT
Scheduled -
Some group and home directories are unable to mount on the login node. An emergency reboot of the login node is scheduled for 5pm today. This will not impact any sbatch jobs, though it will cause all srun jobs launched from the login node to fail.
Mar 31, 10:58 PDT
Resolved -
nas-5-2 is once again correctly serving data.
Mar 27, 11:59 PDT
Monitoring -
nas-5-2 has been rebooted and verified to be back in service. It is taking a very high load of writes, so access will be sluggish until backed-up jobs catch up.
Mar 27, 09:39 PDT
Identified -
nas-5-2 has crashed. Any home or group directories shared from it are currently hung. Admins are investigating.
Mar 27, 09:08 PDT