Accelerate FM pre-training on Amazon SageMaker HyperPod (Amazon EKS) | Amazon Web Services
Introduction
Hello everyone, my name is Alex Iankoulski, and I am a Principal Solutions Architect in the Worldwide Specialist Organization. Today, I am excited to present Amazon EKS support in SageMaker HyperPod and how it can help accelerate your machine learning workloads.
What is SageMaker HyperPod?
SageMaker HyperPod is a SageMaker service that offers purpose-built machine learning infrastructure designed with built-in resiliency features. Workloads on SageMaker HyperPod can be orchestrated either via SLURM or Kubernetes, and today's focus will be on the Kubernetes interface, provided by Amazon EKS (Elastic Kubernetes Service).
Despite being a managed SageMaker service, HyperPod lets customers access the underlying instances directly, using tools such as AWS Systems Manager (SSM) or SSH.
Typical HyperPod Cluster
A typical HyperPod cluster spans two accounts: the compute instances run in an AWS-managed service account, while the EKS cluster and supporting resources live in the customer account. In the demo I'll share with you today, we have four distinct instance groups in the compute cluster:
- Generic CPU Group: ml.m5.2xlarge instances.
- GPU Group: ml.g5.2xlarge instances, each with one EFA interface and one NVIDIA A10G GPU.
- High-Performance GPU Group: ml.p5.48xlarge instances, each with 32 EFA adapters and eight NVIDIA H100 GPUs.
- Trainium Instance Group: Each instance has 16 EFA adapters and 16 Neuron devices, for a total of 32 NeuronCores per node.
In the user's VPC, we will utilize Amazon Elastic Container Registry (ECR) and a shared FSx for Lustre volume, along with Amazon CloudWatch Container Insights.
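Once such a cluster is running, the per-group topology described above can be inspected with standard Kubernetes tooling. A minimal sketch (the label key is the standard Kubernetes well-known instance-type label, and the GPU and Neuron resource names are the ones exposed by the NVIDIA and AWS Neuron device plugins; actual output depends on your cluster):

```bash
# List nodes along with their instance types to see the instance groups
kubectl get nodes -L node.kubernetes.io/instance-type

# Show allocatable GPU and Neuron devices per node
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
GPU:.status.allocatable.nvidia\.com/gpu,\
NEURON:.status.allocatable.aws\.amazon\.com/neuron
```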
Features of HyperPod
The features of HyperPod fall into three main experiences:
Admin Experience: This involves creating, updating, and deleting clusters, accessing clusters via Kubernetes commands (kubectl), and utilizing the admin UI experience via the AWS console.
Resiliency Features: HyperPod includes deep health checks, ongoing monitoring of nodes, and automatic job resumption in case of hardware failures.
Scientist Experience: The HyperPod CLI allows users unfamiliar with the Kubernetes API to manage jobs on HyperPod clusters easily.
It’s worth noting that you can start from scratch or attach HyperPod compute to existing infrastructures for ease of transition.
Creating a HyperPod Cluster
To create a HyperPod cluster, you first need to set up essential resources in your VPC and then provision an EKS cluster. The required lifecycle scripts are uploaded to an S3 bucket and assist with node initialization.
A cluster configuration file captures the instance group definitions and enables deep health checks and automatic node recovery.
Once you execute the create cluster command, nodes are provisioned, and users can interact with them using kubectl.
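As an illustration, the cluster configuration might look like the fragment below. The field names follow the SageMaker CreateCluster API, but the group name, counts, ARNs, and S3 path are hypothetical placeholders, not the demo's actual values:

```json
{
  "ClusterName": "hyperpod-eks-demo",
  "Orchestrator": {
    "Eks": { "ClusterArn": "arn:aws:eks:us-west-2:111122223333:cluster/demo-eks" }
  },
  "InstanceGroups": [
    {
      "InstanceGroupName": "gpu-g5",
      "InstanceType": "ml.g5.2xlarge",
      "InstanceCount": 4,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://my-lifecycle-bucket/scripts/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecRole",
      "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"]
    }
  ],
  "NodeRecovery": "Automatic"
}
```

The create cluster command then takes this file as input, e.g. `aws sagemaker create-cluster --cli-input-json file://cluster-config.json`.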
The aws-do-hyperpod Project
To facilitate onboarding and ease the interaction with HyperPod infrastructure, we created the aws-do-hyperpod container project. This open-source project bundles helpful tools such as the HyperPod CLI for managing job submissions.
After building the container, you can open a shell in it and run commands like `hyperpod list clusters` for cluster management.
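As a sketch of that flow, projects in the aws-do family follow the do-framework convention of simple action scripts (the script names below follow that convention; the `hyperpod` invocation is the one quoted above):

```bash
./build.sh   # build the aws-do-hyperpod container image
./run.sh     # start the container
./exec.sh    # open a shell inside the running container

# inside the container, use the bundled HyperPod CLI:
hyperpod list clusters
```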
Demo of Distributed Training
In the demo, we will run distributed training on various nodes. We have selected tasks based on the capabilities of each instance group:
- M5 Instances: Running an ImageNet CPU example.
- G5 Instances: Running the same example on GPUs.
- P5 Instances: Running a Llama 2 FSDP pre-training.
- Trainium Instances: Running a Llama 3 pre-training.
With a single command, all training jobs can be kicked off simultaneously, with each task routed to its designated compute resource.
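The routing of each job to its designated instance group can be expressed with standard Kubernetes node selection. A minimal sketch of a pod spec fragment (the container name and image are illustrative assumptions; the instance-type label is the standard Kubernetes well-known label, and the GPU and EFA resource names are those of the NVIDIA and EFA device plugins):

```yaml
# Illustrative fragment: pin a training pod to the P5 instance group
# and request all eight GPUs and the EFA adapters on the node.
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: ml.p5.48xlarge
  containers:
    - name: trainer
      image: 111122223333.dkr.ecr.us-west-2.amazonaws.com/llama2-fsdp:latest  # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 8
          vpc.amazonaws.com/efa: 32
```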
Monitoring and Observability
During training, we can monitor utilization metrics from the command line using tools like `htop`, `nvtop`, and `neuron-top` for detailed insight into CPU, GPU, and NeuronCore utilization.
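These tools run on the training nodes themselves; one way to reach them is through `kubectl` (a sketch only; the pod names below are placeholders for whatever your jobs are called):

```bash
# Open an interactive session on a GPU training pod and watch GPU utilization
kubectl exec -it llama2-fsdp-worker-0 -- nvtop

# Watch NeuronCore utilization on a Trainium training pod
kubectl exec -it llama3-trn-worker-0 -- neuron-top
```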
Additionally, Amazon CloudWatch Container Insights visualizes metrics including node health status and CPU, memory, and network utilization.
Conclusion
In conclusion, SageMaker HyperPod offers purpose-built infrastructure for accelerating your machine learning workloads. With its built-in resiliency features, it recovers from hardware failures automatically and provides flexible job orchestration options.
A big thank you to my colleagues who contributed to the development of SageMaker HyperPod, and thank you for your attention. I look forward to seeing the innovative solutions you will create using AWS and SageMaker HyperPod.
Keywords
- Amazon SageMaker HyperPod
- Amazon EKS
- Machine Learning Infrastructure
- Resiliency
- Orchestration
- Distributed Training
- Amazon CloudWatch
- CPU/GPU Utilization
FAQ
1. What is Amazon SageMaker HyperPod?
- Amazon SageMaker HyperPod is a service that provides dedicated machine learning infrastructure with built-in resiliency features.
2. How does HyperPod ensure resiliency?
- HyperPod includes deep health checks and automatic job resumption features to handle hardware failures seamlessly.
3. Can I use HyperPod with an existing EKS cluster?
- Yes, HyperPod can be attached to an existing EKS infrastructure to leverage your current resources.
4. What types of workloads can I run on HyperPod?
- HyperPod is designed to handle various workloads, including distributed training and large-scale machine learning tasks using different instance types.
5. How can I monitor the performance of my HyperPod cluster?
- Performance can be monitored through Amazon CloudWatch Container Insights and various command-line tools for CPU, GPU, and memory usage.