About NVIDIA
NVIDIA is a leading technology company specializing in graphics processing units (GPUs) and AI systems.
Job Summary
We are seeking highly skilled and motivated software engineers to join our vLLM & MLPerf team. You will define and build benchmarks for MLPerf Inference, the industry-leading suite for measuring system-level inference performance, and contribute to vLLM, pushing its performance on those benchmarks to the limit on bleeding-edge NVIDIA GPUs.
Key Responsibilities
- Design and implement highly efficient inference systems for large-scale deployments of generative AI models.
- Define inference benchmarking methodologies and build tools that will be adopted across the industry.
- Develop, profile, debug, and optimize low-level system components and algorithms to improve throughput and minimize latency for the MLPerf Inference benchmarks on bleeding-edge NVIDIA GPUs (see the measurement sketch after this list).
- Productionize inference systems with uncompromising software quality.
- Collaborate with researchers and engineers to bring innovative model architectures, inference techniques, and quantization methods to production.
- Contribute to the design of APIs, abstractions, and UX that make it easier to scale model deployment while maintaining usability and flexibility.
- Participate in design discussions, code reviews, and technical planning to keep the product aligned with business goals.
- Stay current with the latest advances in system-level inference optimization, generate novel research ideas, and translate them into practical, robust systems.
- Exploration and academic publication are encouraged.
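As a rough illustration of the throughput and latency work mentioned above, here is a minimal sketch using vLLM's offline Python API. The model name and prompts are placeholders, and an official MLPerf Inference measurement would drive the system through the LoadGen harness rather than a hand-rolled timer.

```python
# Minimal throughput sketch using vLLM's offline API. The model name and
# prompts are illustrative placeholders; official MLPerf Inference runs
# are driven by the LoadGen harness, not a hand-rolled timer.
import time

from vllm import LLM, SamplingParams

prompts = ["Explain GPU memory coalescing in one paragraph."] * 64
sampling = SamplingParams(temperature=0.0, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

generated = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s over {elapsed:.2f} s")
```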
Requirements
- Bachelor’s, Master’s, or PhD degree in Computer Science/Engineering, Software Engineering, a related field, or equivalent experience.
- 5+ years of experience in software development, preferably with Python and C++.
- Deep understanding of deep learning algorithms, distributed systems, parallel computing, and high-performance computing principles.
- Hands-on experience with ML frameworks (e.g., PyTorch) and inference engines (e.g., vLLM and SGLang).
- Experience optimizing compute, memory, and communication performance for large-model deployments.
- Familiarity with GPU programming, CUDA, NCCL, and performance profiling tools (see the profiling sketch after this list).
- Ability to work closely with both research and engineering teams, translating state-of-the-art research into concrete designs and robust code, and proposing novel research ideas of your own.
- Excellent problem-solving skills, with the ability to debug complex systems.
- A passion for building high-impact software that pushes the boundaries of what’s possible with large-scale AI.
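For illustration, here is a minimal sketch with torch.profiler, one of the profiling tools this kind of work relies on. The repeated matmul is a toy stand-in for a real inference workload; in practice, NVIDIA Nsight Systems/Compute would typically complement it.

```python
# Minimal GPU profiling sketch using torch.profiler; the repeated matmul
# is a toy stand-in for a real inference workload.
import torch
from torch.profiler import ProfilerActivity, profile

x = torch.randn(4096, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        y = x @ x
    torch.cuda.synchronize()  # ensure all kernels finish inside the trace

# Top CUDA-time consumers, aggregated by operator
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```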
Preferred Qualifications
- Background in building and optimizing LLM inference engines such as vLLM and SGLang.
- Experience building ML compilers such as Triton or Torch Dynamo/Inductor (a short torch.compile sketch follows this list).
- Experience working with cloud platforms (e.g., AWS, GCP, or Azure), containerization tools (e.g., Docker), and orchestration infrastructures (e.g., Kubernetes, Slurm).
- Exposure to DevOps practices, CI/CD pipelines, and infrastructure as code.
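As a hypothetical example of the Dynamo/Inductor stack in action, the sketch below compiles a toy function with torch.compile, which traces it with Dynamo and generates fused kernels with Inductor by default; the function itself is purely illustrative.

```python
# Minimal torch.compile sketch: Dynamo captures the function's graph and
# Inductor (the default backend) generates fused kernels for it.
import torch

def toy_mlp(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Illustrative placeholder for a real model component
    return torch.nn.functional.gelu(x @ w)

compiled = torch.compile(toy_mlp)

x = torch.randn(1024, 1024, device="cuda")
w = torch.randn(1024, 1024, device="cuda")
out = compiled(x, w)  # first call compiles; later calls reuse the kernels
```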
To apply for this job, please visit nvidia.wd5.myworkdayjobs.com.