The Role
We are looking for a highly skilled Linux / HPC Systems Engineer to design, operate, and scale high-performance computing (HPC) environments alongside modern DevOps infrastructure. This role blends hands-on expertise in Slurm-managed HPC clusters, GPU compute platforms, and Kubernetes-based orchestration with strong automation and CI/CD practices.
The ideal candidate is comfortable working in fast-paced, collaborative environments, takes ownership of complex systems with minimal supervision, and is passionate about building reliable, scalable, and high-performance infrastructure.
The Person
You are an experienced infrastructure engineer with a strong foundation in DevOps, Site Reliability Engineering (SRE), or platform engineering. You bring deep technical expertise in Linux systems, Kubernetes, and automation, along with practical experience supporting GPU-accelerated workloads and HPC environments.
You thrive on solving complex problems, communicate clearly across technical teams, and consistently drive execution from design through production.
Key Responsibilities
- Deploy, configure, and operate HPC clusters using Slurm
- Manage GPU compute environments, high-speed interconnects, and parallel storage systems
- Design, build, and maintain CI/CD pipelines using tools such as Buildkite, GitHub Actions, and Jenkins
- Automate infrastructure provisioning and configuration using Ansible, Terraform, Python, and Bash
- Deploy and manage containerized workloads using Docker, Kubernetes, and Helm
- Monitor system health, performance, and reliability using Grafana, Prometheus, and Checkmk
- Collaborate with cross-functional teams to optimize workflows, resolve issues, and document best practices
Preferred Experience & Skills
- Strong hands-on experience with Slurm or equivalent HPC schedulers
- Proven expertise in DevOps, CI/CD pipelines, and infrastructure automation
- Experience managing GPU compute stacks (CUDA and/or ROCm)
- Advanced Linux administration, shell scripting, and distributed systems troubleshooting
- Containerization and orchestration experience with Docker, Kubernetes, and Helm
- Agile, collaborative mindset with excellent verbal and written communication skills
Education & Experience
- Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related technical field
