Key Responsibilities
- Deploy, configure, and maintain HPC clusters using Slurm
- Manage GPU compute infrastructure, including CUDA and/or ROCm environments
- Support high-speed interconnects and parallel storage systems
- Design, implement, and maintain CI/CD pipelines (Buildkite, GitHub Actions, Jenkins)
- Automate infrastructure using Ansible, Terraform, Python, and Bash
- Deploy and manage containerized workloads with Docker, Kubernetes, and Helm
- Monitor system performance and reliability using Prometheus, Grafana, and Checkmk
- Troubleshoot Linux, networking, and distributed systems issues
- Collaborate with engineering and DevOps teams to improve workflows and document best practices
Required Skills & Qualifications
- Strong experience with Linux system administration
- Hands-on expertise with Slurm or similar HPC workload schedulers
- Proficiency in DevOps tools and CI/CD pipelines
- Experience with GPU computing environments (CUDA/ROCm)
- Solid scripting skills in Python and Bash
- Experience with Docker, Kubernetes, and Helm
- Knowledge of monitoring, logging, and observability tools
- Strong analytical, troubleshooting, and communication skills
Preferred Qualifications
- Experience managing large-scale HPC clusters
- Background in Site Reliability Engineering (SRE) or platform engineering
- Familiarity with cloud-native infrastructure and hybrid environments
- Experience with performance optimization and workload tuning
- Agile development and cross-functional collaboration experience
Education & Experience
- Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related field
- Multiple years of industry experience in HPC, DevOps, Linux systems, or infrastructure engineering
