Automation Engineer – DevOps – Fabric Networking – GPU

NVIDIA

  • Full Time

To apply for this job, please visit nvidia.wd5.myworkdayjobs.com.

Role: DevOps and Automation Engineer (Software Infrastructure Team)

As a key member of our software infrastructure team, you will design, build, and optimize systems that support large-scale GPU clusters interconnected via NVLink and InfiniBand. These clusters run some of the fastest, most complex HPC and AI workloads in the world.

What You'll Do:

  • Develop and maintain robust CI/CD pipelines for rapid, reliable integration and deployment across intricate systems.

  • Create automation tools and workflows to streamline software releases, manage dependencies, and boost system reliability.

  • Modularize infrastructure components to enable independent release cycles and accelerate development.

  • Automate provisioning, scaling, and management of GPU cluster infrastructure.

  • Implement automated software updates and proactive system health monitoring to maximize uptime and availability.

  • Troubleshoot and resolve operational issues across distributed environments.

  • Manage firmware and software rollouts with minimal downtime and consistent execution.

  • Collaborate with global engineering teams to align infrastructure tooling with project goals and deliverables.

What We're Looking For:

  • Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related technical field.

  • 5+ years of experience managing infrastructure or systems within high-performance or distributed computing environments.

  • Strong expertise in scripting and automation using Python, Ansible, and shell.

  • Hands-on experience with modern CI/CD platforms and infrastructure-as-code tools.

  • Deep understanding of Linux systems, networking concepts, and distributed system architecture.

  • Proven ability to decompose monolithic systems into scalable, loosely coupled components.

  • Effective communication skills and the adaptability to work with multinational teams across multiple time zones.

What Will Make You Stand Out:

  • Experience with cluster management tools like Slurm.

  • Familiarity with NVIDIA DGX/HGX systems or other GPU-accelerated cluster environments.

  • Knowledge of observability and monitoring tools such as Prometheus and Grafana.

  • Demonstrated leadership in DevOps process improvement and team productivity enhancement.
