HPC Network Engineering Lead for GPU Clusters

Company

Well-known company

Location

Remote, Colombia

Type

Full-time

Responsibilities

Define and own a multi-year architectural vision and roadmap for InfiniBand/RDMA and high-speed Ethernet fabrics supporting massive GPU clusters and distributed AI/LLM workloads across the client portfolio
Govern evaluation and standardization of cluster network topologies such as Fat-tree, Clos, Rail-optimized, and Dragonfly, and set decision frameworks aligned to scale, performance, and cost constraints
Establish and enforce engineering standards for host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
Drive strategic performance engineering across RDMA/RoCE, NCCL/MSCCL, and collective communication for multi-node GPU training, and oversee resolution of the hardest systemic performance issues
Define the reference architecture for Kubernetes networking on GPU clusters, including CNI plugins, network policies, multi-NI...
        
