This is a remote position; however, the candidate must reside within 30 miles of one of the following locations: Portland, ME; Boston, MA; Chicago, IL; Dallas, TX; San Francisco Bay Area, CA; or Seattle, WA.
About the Role/Team
We are the backbone of the AI organization, building the high-performance compute foundation that powers our generative AI and machine learning initiatives. Our team bridges the gap between hardware and software, ensuring that our researchers and data scientists have a reliable, scalable, and efficient platform to train and deploy models. We focus on maximizing GPU utilization, minimizing inference latency, and creating a seamless "paved road" for AI development.
You are a systems thinker who loves solving hard infrastructure challenges. You will architect the underlying platform that serves our production AI workloads, ensuring they are resilient, secure, and cost-effective. By optimizing our compute layer and deployment pipelines, you will directly increase the velocity of the entire AI product team, transforming how we deliver AI at scale.
Platform Architecture: Design and maintain a robust, Kubernetes-based AI platform that supports distributed training and high-throughput inference serving.
Inference Optimization: Engineer low-latency serving solutions for LLMs and other models, optimizing inference engines (such as vLLM) to maximize throughput and minimize cost per token.
Compute Orchestration: Manage and scale GPU clusters in the cloud (AWS/Azure), implementing efficient scheduling, auto-scaling, and spot instance management to control costs.
Operational Excellence (MLOps): Build and maintain "Infrastructure as Code" (Terraform) and CI/CD pipelines to automate the lifecycle of model deployments and infrastructure provisioning.
Reliability & Observability: Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure.
Developer Experience: Create tools and abstraction layers (SDKs, CLI tools) that allow data scientists to self-serve compute resources without managing underlying infrastructure.
Security & Compliance: Ensure all AI infrastructure meets strict security standards, including encryption of sensitive data and effective access controls (IAM, VPCs).
Machine Learning Infrastructure: 2+ years of experience building and maintaining ML infrastructure for production workloads.
Production Expertise: Proven experience managing large-scale production clusters (Kubernetes) and distributed systems.
Automation-First Mindset: Strong advocate for “Everything as Code”; skilled at automating repetitive tasks using Python, Go, or Bash.
Experience using Jupyter, MLflow, and other related ML tools.
CI/CD & Deployment: Experience designing and maintaining CI/CD pipelines for ML workloads and containerized applications.
Monitoring & Observability: Hands-on experience implementing monitoring, logging, and alerting at scale for production ML systems.
Security & Compliance: Familiarity with securing ML infrastructure, enforcing access controls, and maintaining compliance standards.
Collaboration: Experience working closely with data scientists, ML engineers, and DevOps teams to operationalize models.
Performance Optimization: Ability to profile, debug, and optimize ML workloads and infrastructure for throughput, latency, and cost efficiency.
Scalable Architecture Design: Experience designing scalable and reliable infrastructure to support high-traffic ML applications.
Core Engineering: Expert proficiency in Python and Go; comfortable digging into lower-level system performance.
Orchestration & Containers: Mastery of Kubernetes (EKS/AKS), Helm, Docker, and container runtimes. Experience with Ray is a huge plus.
Infrastructure as Code: Advanced skills with Terraform.
Cloud Platforms: Experience with at least one major cloud platform (AWS, Azure, or GCP).
Observability: Proficiency with Prometheus, Grafana, and tracing tools (OpenTelemetry).
Networking: Understanding of service mesh (Istio), load balancing, and high-performance networking.