Key Responsibilities
- Deploy, configure, and maintain HPC clusters using Slurm
- Manage GPU compute infrastructure, including CUDA and/or ROCm environments
- Support high-speed interconnects and parallel storage systems
- Design, implement, and maintain CI/CD pipelines (Buildkite, GitHub Actions, Jenkins)
- Automate infrastructure using Ansible, Terraform, Python, and Bash
- Deploy and manage containerized workloads with Docker, Kubernetes, and Helm
- Monitor system performance and reliability using Prometheus, Grafana, and Checkmk
- Troubleshoot Linux, networking, and distributed systems issues
- Collaborate with engineering and DevOps teams to improve workflows and document best practices
Required Skills & Qualifications
- Strong experience with Linux system administration
- Hands-on expertise with Slurm or similar HPC workload schedulers
- Proficiency in DevOps tools and CI/CD pipelines
- Experience with GPU computing environments (CUDA/ROCm)
- Solid scripting skills in Python and Bash
- Experience with Docker, Kubernetes, and Helm
- Knowledge of monitoring, logging, and observability tools
- Strong analytical, troubleshooting, and communication skills
Preferred Qualifications
- Experience managing large-scale HPC clusters
- Background in Site Reliability Engineering (SRE) or platform engineering
- Familiarity with cloud-native infrastructure and hybrid environments
- Experience with performance optimization and workload tuning
- Agile development and cross-functional collaboration experience
Education & Experience
- Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related field
- Multiple years of industry experience in HPC, DevOps, Linux systems, or infrastructure engineering
