Amazon EKS Supports Ultra-Scale AI and ML Workloads With 100K Nodes Per Cluster


Amazon EKS now handles 100,000 worker nodes in a single Kubernetes cluster. This means users can run massive AI jobs with up to 1.6 million AWS Trainium accelerators or 800,000 NVIDIA GPUs. The move targets large-scale AI/ML models and artificial general intelligence (AGI) ambitions.
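The headline figures imply a fixed accelerator count per node. As a rough check, the sketch below divides the quoted totals by the node limit. It assumes every node in the cluster carries the same number of accelerators, which the article does not state, and the function name is purely illustrative.

```python
# Figures quoted in the article.
MAX_NODES = 100_000
MAX_TRAINIUM = 1_600_000
MAX_NVIDIA_GPUS = 800_000


def accelerators_per_node(total_accelerators: int, nodes: int) -> float:
    """Average accelerators per node, assuming all nodes are identical."""
    return total_accelerators / nodes


trainium_per_node = accelerators_per_node(MAX_TRAINIUM, MAX_NODES)
gpus_per_node = accelerators_per_node(MAX_NVIDIA_GPUS, MAX_NODES)

print(f"Trainium accelerators per node: {trainium_per_node:g}")
print(f"NVIDIA GPUs per node: {gpus_per_node:g}")
```

Under that uniform-node assumption, the totals work out to 16 Trainium accelerators or 8 NVIDIA GPUs per node.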

Anthropic and Amazon’s own teams are already running at this scale. Anthropic runs its Claude models on EKS clusters that mix Trainium, NVIDIA GPUs, and Graviton CPUs. The company reported performance and latency gains: the share of write API calls completing within 15 ms rose from 35% to 90%.

Amazon’s AGI group uses EKS alongside SageMaker HyperPod to train the Nova family of foundation models. The setup boosts resiliency and cuts downtime during huge training jobs.


Architectural updates to EKS include a reworked etcd storage layer and an optimized control plane, enabling faster pod operations and better resource orchestration at massive scale.

These improvements aim to accelerate innovation, reduce costs, and give customers flexibility with frameworks while maintaining Kubernetes API compatibility.

Rohit Prasad, SVP & Head Scientist at Amazon AGI, said:

“Amazon EKS and SageMaker HyperPod have been instrumental in helping us push the boundaries of foundational AI model training at unprecedented scale, while delivering the high resiliency our workloads demand. This technological foundation has not only accelerated our innovation timeline but has become the cornerstone of our strategy to build the next generation of AGI capabilities that will transform how the world interacts with AI.”

Nova DasSarma, Technical Lead for Anthropic Infrastructure, added:

“Working with AWS, we’ve enhanced our AI infrastructure capabilities with Amazon EKS support for clusters of up to 100K nodes. This combination of EKS’ industry-leading scale and AWS accelerated compute options helps strengthen our foundation for safe and scalable AI.”

Amazon EKS is pushing Kubernetes into new territory for AI infrastructure, with clusters built for workloads of this size and complexity. Interested customers can get more information from their AWS account teams.
