Monitoring - The reboot of nas-4-1 seems to have cleared up the issues with the login node. At this point, a reboot of the login node does not appear to be required.

Please open a ticket if you are still having problems.

Nov 04, 2024 - 10:26 PST
Identified - Unfortunately, the same problem reoccurred, so nas-4-1 and Farm's login node will be rebooted in a few minutes. Jobs running with sbatch should just pause and resume when the NFS server recovers. srun jobs will be killed when the login node is rebooted.

Admins are still working to determine which combination of client and server kernel versions triggers this kernel bug.

Nov 04, 2024 - 09:31 PST
Monitoring - Both nas-4-1 and the login node have been downgraded to a kernel version that should not contain this bug. Admins are monitoring.
Oct 23, 2024 - 11:04 PDT
Update - System administrators will be restarting nas-4-1 storage and the Farm login node. The emergency downtime is expected to last approximately one hour.
Oct 23, 2024 - 10:02 PDT
Identified - The same kernel bug on nas-4-1 has been identified as the root cause. Admins are investigating the best path forward.
Oct 23, 2024 - 09:47 PDT
Investigating - System administrators are still seeing hung processes on Farm storage and are continuing to look into the root cause.
Oct 23, 2024 - 08:57 PDT
Login: Operational, 99.79% uptime over the past 90 days
Storage: Operational, 92.48% uptime over the past 90 days
File transfer node: Operational, 100.0% uptime over the past 90 days
high2,med2,low2: Operational, 100.0% uptime over the past 90 days
high,med,low: Operational, 100.0% uptime over the past 90 days
bmh,bmm: Operational, 100.0% uptime over the past 90 days
bigmemh,bigmemm: Operational, 100.0% uptime over the past 90 days
bgpu: Operational, 100.0% uptime over the past 90 days
gpuh,gpum: Operational, 100.0% uptime over the past 90 days
Email: Operational, 100.0% uptime over the past 90 days
Virtualization: Operational, 100.0% uptime over the past 90 days
Proxmox Virtualization Nodes: Operational, 100.0% uptime over the past 90 days
Ganetti cluster: Operational, 100.0% uptime over the past 90 days
Slurm: Operational, 99.9% uptime over the past 90 days
Past Incidents
Nov 25, 2024
Resolved - nas-6-0 storage appears healthy again.
Nov 25, 11:56 PST
Monitoring - A fix for storage nas-6-0 has been identified, and system administrators are monitoring the storage array for file system integrity.
Nov 25, 11:33 PST
Investigating - System administrators are seeing performance issues on storage device nas-6-0, possibly due to hard drive faults. We are currently looking into the issue to determine a root cause.
Nov 25, 08:34 PST
Nov 24, 2024

No incidents reported.

Nov 23, 2024

No incidents reported.

Nov 22, 2024

No incidents reported.

Nov 21, 2024

No incidents reported.

Nov 20, 2024

No incidents reported.

Nov 19, 2024

No incidents reported.

Nov 18, 2024

No incidents reported.

Nov 17, 2024

No incidents reported.

Nov 16, 2024

No incidents reported.

Nov 15, 2024

No incidents reported.

Nov 14, 2024
Resolved - nas-6-0 has been brought back into service and is correctly serving data again.
Nov 14, 13:19 PST
Investigating - nas-6-0 is currently having issues and not serving NFS data. Admins are investigating.
Nov 14, 09:42 PST
Nov 13, 2024

No incidents reported.

Nov 12, 2024

No incidents reported.

Nov 11, 2024

No incidents reported.