Monitoring - The reboot of nas-4-1 seems to have cleared up the issues with the login node. At this point, a reboot of the login node does not appear to be required.
Please open a ticket if you are still having problems.
Nov 04, 2024 - 10:26 PST
Identified - Unfortunately, the same problem recurred, so nas-4-1 and Farm's login node will be rebooted in a few minutes. Jobs submitted with sbatch should pause and then resume when the NFS server recovers. srun jobs will be killed when the login node is rebooted.
Admins are still working to determine which combination of client and server kernel versions triggers this kernel bug.
Nov 04, 2024 - 09:31 PST
Monitoring - Both nas-4-1 and the login node have been downgraded to a kernel version that should not contain this bug. Admins are monitoring.
Oct 23, 2024 - 11:04 PDT
Update - System administrators will be restarting the nas-4-1 storage server and the Farm login node. The emergency downtime is expected to last approximately one hour.
Oct 23, 2024 - 10:02 PDT
Identified - The same kernel bug on nas-4-1 has been identified as the root cause. Admins are investigating the best path forward.
Oct 23, 2024 - 09:47 PDT
Investigating - System administrators are still seeing hung processes on Farm storage and are continuing to look into the root cause.
Oct 23, 2024 - 08:57 PDT