AI and HPC Cluster Group Manager



Software Engineering, Data Science
Multiple locations
Posted on Thursday, April 4, 2024

NVIDIA pioneered accelerated computing to tackle challenges no one else can solve. Our work in AI and digital twins is transforming the world's largest industries and profoundly impacting society.

It’s a unique legacy of innovation that’s fuelled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work.

NVIDIA Networking is looking for an AI and HPC Cluster Group Manager to join the Cloud Solutions group. In this role, you will build, manage, and maintain the biggest cluster in NVIDIA Networking R&D to validate and test the next-generation networking cloud technology and reference architectures that are released to our customers. We are currently working on next-generation Blackwell GPU platform AI clouds with our XDR (800G InfiniBand) and Spectrum-X800 technology. Come join the team and see how you can make a lasting impact on the world.

What you’ll be doing:

  • Lead a group responsible for building, managing, and maintaining SW R&D clusters composed of Linux, Windows, and VMware systems; x86 and Arm CPUs; GPUs; and Ethernet and InfiniBand technologies.

  • Work closely with the engineering and architecture teams to understand, plan, and build new clusters for validating and testing new NVIDIA Networking technology solutions.

  • Drive the design and implementation of automated systems to deploy, configure, maintain, and monitor these clusters.

  • Drive the design and implementation of resource management systems for multiuser environments with different needs on these clusters.

  • Manage the R&D lab, including inventory, power, space, and cooling.

  • Build, expand, and mentor the team to address growing demands and requirements.

  • Innovate! Influence NVIDIA Networking cluster management tools so that they shine in customers' eyes.

What We Need to See:

  • A degree in Computer Science, Engineering, or a related field.

  • 5+ years of managerial experience, including managing other managers.

  • 10+ years of relevant overall professional experience.

  • Experience in data center management at a multidisciplinary company, including handling power, cooling, and space.

  • Experience in managing HPC/AI clusters.

  • Deep understanding of operating systems, computer networks, and high-performance hardware.

  • Deep knowledge of distributed resource scheduling systems and orchestration tools such as Slurm and K8s.

  • Strong organizational and project management skills, comfortable with multitasking in a dynamic environment with shifting priorities and changing requirements.

  • Enthusiastic and ambitious personality, encouraging a positive and productive work environment.

Ways to Stand Out From the Crowd:

  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software.

  • Familiarity with CUDA and managing GPU-accelerated computing systems.

  • Experience with and knowledge of InfiniBand.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people on the planet working for us. If you're creative and autonomous, we want to hear from you! NVIDIA is committed to fostering a diverse work environment and is proud to be an equal-opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) based on race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

In 2020, NVIDIA acquired Mellanox, a leading supplier of end-to-end Ethernet and InfiniBand intelligent interconnect solutions and services for servers, storage, and hyper-converged infrastructure. Mellanox intelligent interconnect solutions increase data center efficiency by providing the highest throughput and lowest latency, delivering data faster to applications, and unlocking system performance.

We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, perform essential job functions, and receive other benefits and privileges of employment. Please contact us to request accommodation.