Engineer – DevOps, Kubernetes & GPU Computing

AMD

  • Full Time

To apply for this job, please visit careers.amd.com.

Key Responsibilities

  • Deploy, configure, and maintain HPC clusters using Slurm

  • Manage GPU compute infrastructure, including CUDA and/or ROCm environments

  • Support high-speed interconnects and parallel storage systems

  • Design, implement, and maintain CI/CD pipelines (Buildkite, GitHub Actions, Jenkins)

  • Automate infrastructure using Ansible, Terraform, Python, and Bash

  • Deploy and manage containerized workloads with Docker, Kubernetes, and Helm

  • Monitor system performance and reliability using Prometheus, Grafana, and Checkmk

  • Troubleshoot Linux, networking, and distributed systems issues

  • Collaborate with engineering and DevOps teams to improve workflows and document best practices

Required Skills & Qualifications

  • Strong experience with Linux system administration

  • Hands-on expertise with Slurm or similar HPC workload schedulers

  • Proficiency in DevOps tools and CI/CD pipelines

  • Experience with GPU computing environments (CUDA/ROCm)

  • Solid scripting skills in Python and Bash

  • Experience with Docker, Kubernetes, and Helm

  • Knowledge of monitoring, logging, and observability tools

  • Strong analytical, troubleshooting, and communication skills

Preferred Qualifications

  • Experience managing large-scale HPC clusters

  • Background in Site Reliability Engineering (SRE) or platform engineering

  • Familiarity with cloud-native infrastructure and hybrid environments

  • Experience with performance optimization and workload tuning

  • Experience with Agile development and cross-functional collaboration

Education & Experience

  • Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related field

  • Multiple years of industry experience in HPC, DevOps, Linux systems, or infrastructure engineering

Job Overview