Reviewed and approved by the HPC Advisory Group on February 13, 2025.
Introduction
Ganymede2 (G2) is a high-performance computing (HPC) cluster that is centrally managed by the HPC team while allowing faculty, departments, and schools to purchase compute nodes that are incorporated into the cluster. These guidelines outline the rules and requirements for adding and managing condo nodes within the G2 system.
A Ganymede2 condo is a set of HPC compute nodes contributed by individual faculty, departments, or schools. These nodes are integrated into the broader G2 cluster and are managed centrally under uniform configurations.
Management guidelines
Since G2 is a unified HPC cluster, all condo nodes:
- Have access to the WEKA and MooseFS storage systems.
- Utilize the SLURM batch processing system for job scheduling.
- Support Open OnDemand to enable interactive use.
- Are subject to cluster-wide policies that maintain consistency and efficiency in resource utilization.
- The software environment on all condo nodes will adhere to standard Rocky Linux, SLURM, and other centrally managed configurations.
- Containerized workloads are supported and recommended (see the example sketch after this list).
- Customization of condo nodes is limited to ensure they remain usable as HPC compute nodes. (These nodes are not workstations or servers; see the appendix below for a comparison of workstations, servers, and HPC compute nodes.)
- All condo nodes will be centrally managed by cluster administrators to ensure uniformity in operation, security, and maintenance.
- Owners are required to comply with system-wide maintenance schedules and updates.
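Containerized workloads typically run inside ordinary SLURM batch jobs. Below is a minimal sketch of such a job written as a Python batch script (sbatch reads the `#SBATCH` comment lines regardless of the interpreter named in the shebang). The partition name, container image path, and the choice of Apptainer as the container runtime are assumptions for illustration only; check with the HPC team for the runtime and partitions actually supported on G2.

```python
#!/usr/bin/env python3
# Minimal sketch: run a containerized workload inside a SLURM batch job.
# The partition name, image path, and command below are placeholders.
#SBATCH --job-name=container-demo
#SBATCH --partition=condo-placeholder
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:30:00

import subprocess

# Assumption: Apptainer is the container runtime available on the compute
# nodes; substitute whatever runtime the HPC team supports.
image = "/path/to/my-workflow.sif"  # placeholder container image
command = ["apptainer", "exec", image, "python3", "--version"]

subprocess.run(command, check=True)
```

Submit the script from a login node (for example, `sbatch container-demo.py`); the same `#SBATCH` directives work identically in a shell-script wrapper if that is preferred.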
Access to nodes and resource sharing
- Condo owners have exclusive queues that allow them to run jobs on their own nodes.
- Other users are permitted to run jobs on condo nodes when they are not in use by the owner. These users submit their jobs to “preempt” queues (currently cpu-preempt and gpu-preempt) and the system schedules these jobs on idle nodes (see the example sketch after this list).
- If a job from a non-owner is running on a condo node and the owner submits a job, the non-owner job will be terminated or suspended to allow the owner’s job to start immediately.
- The fundamental objective of the Ganymede2 condo cluster is resource sharing. When a condo owner or research group is not actively utilizing their condo nodes, other users’ jobs will run on the available compute resources. This ensures optimal utilization of the system and benefits the broader HPC community.
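To make good use of those spare cycles, a preemptible job should assume it can be stopped at any time. The sketch below is a Python-based SLURM batch script that targets the cpu-preempt queue named above, asks SLURM to requeue it with `--requeue`, and checkpoints its progress so a requeued run resumes where it left off. The time limit, checkpoint file name, and work loop are illustrative only, and whether a preempted job is requeued, cancelled, or suspended ultimately depends on the cluster’s preemption configuration.

```python
#!/usr/bin/env python3
# Sketch of a preemptible job for the cpu-preempt queue. --requeue marks the
# job as eligible for requeuing if it is preempted; the checkpoint file lets
# a requeued run resume instead of starting over.
#SBATCH --job-name=spare-cycles
#SBATCH --partition=cpu-preempt
#SBATCH --ntasks=1
#SBATCH --time=04:00:00
#SBATCH --requeue

import json
import os
import time

CHECKPOINT = "progress.json"  # placeholder checkpoint file

# Resume from the last saved step if a checkpoint exists (e.g. after a requeue).
start = 0
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        start = json.load(f)["step"]

for step in range(start, 1000):
    time.sleep(1)  # stand-in for one unit of real work
    with open(CHECKPOINT, "w") as f:
        json.dump({"step": step + 1}, f)
```

Owners submit to their exclusive queues in the same way; only the partition name changes.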
Hardware lifespan and warranty requirements
Any computing hardware has a limited life; for servers and storage, five to seven years is typical. To guard against disruptions in system availability due to hardware issues, we aim to operate G2 with hardware covered by the manufacturer’s warranty and support. At the end of the equipment’s life, it will be retired from service.
We ask all condo owners to purchase new hardware with five years of initial warranty/support and to purchase extended support in later years to ensure smooth operation. This is the best way to keep G2 a reliable system. When operated this way, the HPC team will give high priority to any system hardware issues.
If a condo or one of its components is out of warranty, our first preference is to obtain warranty/support for it. If that is not possible, then the HPC staff cannot offer any availability assurances for the condo or the component. In case of hardware failure, the staff will work with the condo owner to locate and order repair parts, install them, and return the condo or component to service. The HPC staff will perform this work on a “best-effort” basis (i.e., when time is available and urgent tasks on in-production systems are not pending). Researchers should still be able to continue their work because spare cycles in other G2 condos will be available.
If there is a critical security vulnerability reported for the condo systems or a component, and no non-vulnerable replacement part, firmware update, or software security fix is available, then that component or system must be shut down, since insecure operation is not permissible. (See the Security Control Standards Catalog, Texas Department of Information Resources, version 2.1, effective May 18, 2023, section System and Information Integrity, SI-2 Flaw Remediation. Also see the UTD Information Security Office Server Standard document, currently available at https://utdallas.app.box.com/v/ServerStandard.)
These guidelines are subject to periodic review and modification by the HPC team, in consultation with the faculty advisory group, to reflect changes in technology, the computing environment, and evolving user needs.
Appendix: Workstation vs server vs HPC compute node
| | Workstation | Server | HPC compute node |
|---|---|---|---|
| Primarily used through | Graphical desktop environment | Remote shell access | Batch jobs |
| How many programs can run? | Several by one user | Several by several users | Several by several users, one by one user, or one spanning several nodes by one user |
| How many users can use it? | One at a time | Several simultaneously | Several sequentially or simultaneously (if resources are available) |
| System configuration | Flexible, user controlled | Flexible, sysadmin controlled | Defined, cluster admin controlled |
| Storage | Local or on a file server | Local or on a file server | Cluster storage system, typically a high-performance parallel filesystem |