About the Role
As a Lead Inference Platform Engineer, you will:
- Optimize LLMs and ML models for high-performance inference using techniques such as quantization, pruning, distillation, and hardware-specific tuning
- Deploy and scale inference workloads on GPUs across AWS, Azure, GCP, and internal Kubernetes clusters, ensuring predictable performance during peak traffic, especially business hours
- Implement routing and failover strategies for OpenAI/Anthropic/Vertex AI traffic
- Integrate models into production-grade APIs supporting TR products and enterprise workflows
- Develop highly optimized environments and eliminate performance bottlenecks to reduce latency
- Collaborate with Platform Engineering teams (Landing Zones, Network, Storage, Compute, AI) to ensure inference workloads align with TR’s cloud-native patterns (AWS, Azure, GCP, OCI)
- Build and optimize containerized inference pipelines...