Performance Degradation on Storage, Home Directories

Incident Report for Farm HPC cluster

Monitoring

The reboot of nas-4-1 seems to have cleared up the issues with the login node. At this point, a reboot of the login node does not appear to be required.

Please open a ticket if you are still having problems.

Posted Nov 04, 2024 - 10:26 PST

Identified

Unfortunately, the same problem has recurred, so nas-4-1 and Farm's login node will be rebooted in a few minutes. Jobs submitted with sbatch should pause and then resume once the NFS server recovers. srun jobs will be killed when the login node is rebooted.

Admins are still working to determine which combination of client and server kernel versions triggers this kernel bug.

Posted Nov 04, 2024 - 09:31 PST

Monitoring

Both nas-4-1 and the login node have been downgraded to a kernel version that should not contain this bug. Admins are monitoring.

Posted Oct 23, 2024 - 11:04 PDT

Update

System administrators will be restarting the nas-4-1 storage server and the Farm login node. The emergency downtime should last approximately one hour.

Posted Oct 23, 2024 - 10:02 PDT

Identified

The same kernel bug on nas-4-1 has been identified as the root cause. Admins are investigating the best path forward.

Posted Oct 23, 2024 - 09:47 PDT

Investigating

System administrators are still seeing hung processes on Farm storage and are continuing to look into the root cause.

Posted Oct 23, 2024 - 08:57 PDT

This incident affects: Login and Storage.