NVIDIA

Full Time

Posted 5 months ago

To apply for this job please visit nvidia.wd5.myworkdayjobs.com.

Role: DevOps and Automation Engineer (Software Infrastructure Team)

As a key member of our software infrastructure team, you will design, build, and optimize systems that support large-scale GPU clusters interconnected via NVLink and InfiniBand. These clusters run some of the fastest, most complex HPC and AI workloads in the world.

What You Will Do:

Develop and maintain robust CI/CD pipelines for rapid, reliable integration and deployment across intricate systems.
Create automation tools and workflows to streamline software releases, manage dependencies, and boost system reliability.
Modularize infrastructure components to enable independent release cycles and accelerate development.
Automate provisioning, scaling, and management of GPU cluster infrastructure.
Implement automated software updates and proactive system health monitoring to maximize uptime and availability.
Troubleshoot and resolve operational issues across distributed environments.
Manage firmware and software rollouts with minimal downtime and consistent execution.
Collaborate with global engineering teams to align infrastructure tooling with project goals and deliverables.

What We Are Looking For:

Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related technical field.
5+ years of experience managing infrastructure or systems within high-performance or distributed computing environments.
Strong expertise in scripting and automation using Python, Ansible, and shell scripting.
Hands-on experience with modern CI/CD platforms and infrastructure-as-code tools.
Deep understanding of Linux systems, networking concepts, and distributed system architecture.
Proven ability to decompose monolithic systems into scalable, loosely coupled components.
Effective communication skills and adaptability to work across multinational, multi-time-zone teams.

What Will Make You Stand Out:

Experience with cluster management tools like Slurm.
Familiarity with NVIDIA DGX/HGX systems or other GPU-accelerated cluster environments.
Knowledge of observability and monitoring tools such as Prometheus and Grafana.
Demonstrated leadership in DevOps process improvement and team productivity enhancement.

Job Overview

Industry
- Information Technology
Experience
- 5-6 Years
Qualification
- Bachelor's Degree
- Certificate
