Performance Degradation on Storage, Home Directories
Incident Report for Farm HPC cluster
Monitoring
The reboot of nas-4-1 seems to have cleared up the issues with the login node. At this point, a reboot of the login node does not appear to be required.

Please open a ticket if you are still having problems.
Posted Nov 04, 2024 - 10:26 PST
Identified
Unfortunately, the same problem has recurred, so nas-4-1 and Farm's login node will be rebooted in a few minutes. Jobs submitted with sbatch should pause and then resume once the NFS server recovers. srun jobs will be killed when the login node is rebooted.

Admins are still working to determine which combination of client and server kernel versions triggers this kernel bug.
Posted Nov 04, 2024 - 09:31 PST
Monitoring
Both nas-4-1 and the login node have been downgraded to a kernel version that should not contain this bug. Admins are monitoring.
Posted Oct 23, 2024 - 11:04 PDT
Update
System administrators will be restarting the nas-4-1 storage server and the Farm login node. The emergency downtime is expected to last approximately one hour.
Posted Oct 23, 2024 - 10:02 PDT
Identified
The same kernel bug on nas-4-1 has been identified as the root cause. Admins are investigating the best path forward.
Posted Oct 23, 2024 - 09:47 PDT
Investigating
System administrators are still seeing hung processes on Farm storage and are continuing to look into the root cause.
Posted Oct 23, 2024 - 08:57 PDT
This incident affects: Login and Storage.