Role: DevOps and Automation Engineer (Software Infrastructure Team)
As a key member of our software infrastructure team, you will design, build, and optimize systems that support large-scale GPU clusters interconnected via NVLink and InfiniBand. These clusters run some of the fastest, most complex HPC and AI workloads in the world.
What You Will Do:
- Develop and maintain robust CI/CD pipelines for rapid, reliable integration and deployment across intricate systems.
- Create automation tools and workflows to streamline software releases, manage dependencies, and boost system reliability.
- Modularize infrastructure components to enable independent release cycles and accelerate development.
- Automate provisioning, scaling, and management of GPU cluster infrastructure.
- Implement automated software updates and proactive system health monitoring to maximize uptime and availability.
- Troubleshoot and resolve operational issues across distributed environments.
- Manage firmware and software rollouts with minimal downtime and consistent execution.
- Collaborate with global engineering teams to align infrastructure tooling with project goals and deliverables.
What We Are Looking For:
- Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related technical field.
- 5+ years of experience managing infrastructure or systems within high-performance or distributed computing environments.
- Strong expertise in scripting and automation using Python, Ansible, and shell.
- Hands-on experience with modern CI/CD platforms and infrastructure-as-code tools.
- Deep understanding of Linux systems, networking concepts, and distributed system architecture.
- Proven ability to decompose monolithic systems into scalable, loosely coupled components.
- Effective communication skills and the adaptability to work across multinational teams in multiple time zones.
What Will Make You Stand Out:
- Experience with cluster management tools like Slurm.
- Familiarity with NVIDIA DGX/HGX systems or other GPU-accelerated cluster environments.
- Knowledge of observability and monitoring tools such as Prometheus and Grafana.
- Demonstrated leadership in improving DevOps processes and team productivity.
