Full Time
Remote
Posted 2 weeks ago

About Runwayml

We are a global AI research and technology company focused on building Universal Simulation systems. The research we are doing and the tools we are building are maturing rapidly and are quickly becoming the foundation for how we will all soon approach making anything: from images to videos, scripted media to documentaries, graphic design to architecture, interactive games to social media, new forms of learning, and the future of entertainment itself. Everyone will be empowered to make anything. There will no longer be any barriers to entry. Our team consists of creative, open-minded, caring, and ambitious people who are determined to change the world. We aspire to continuously build impossible things, and our ability to do so relies on building an incredible team. If you are driven to do the same, we’d love to hear from you.

Job Summary

We’re looking for an Engineering Manager (Backend) to lead the team responsible for Runway’s machine learning platform. You should have experience leading high-performing engineering teams and be deeply interested in the intersection of machine learning and distributed systems. You will manage our current team of 5, which is still growing. You’ll have a chance to work closely with our Research and Machine Learning teams to build out our data processing, training, and evaluation systems. This role is on a small team with a big impact.

Key Responsibilities

Build the platform infrastructure for ML at scale.
Lead the platform engineering team that powers Runway’s machine learning pipeline—from data processing through model training to production inference.
Work closely with research teams to build robust, scalable systems that let them move fast.
Keep training jobs running smoothly.
Build monitoring, alerting, and automation around critical multi-day training runs on hundreds of GPUs.
Your systems catch problems before they derail expensive compute jobs.
Enable model evaluation and exploration.
Maintain the platform that lets researchers inspect training data, visualize outputs, and evaluate model checkpoints.
Build tools that bridge raw infrastructure and research workflows.
Scale production inference.
Own the inference pipeline serving Runway’s products.
Implement monitoring and alerting for performance and reliability.
Lead GPU capacity planning to balance cost and user experience as demand grows.

Requirements

Platform engineering foundation.
5+ years building distributed systems, data pipelines, and infrastructure at scale.
Experience managing engineering teams of 3-8 people.
Production infrastructure expertise.
Experience with cloud platforms (AWS/GCP) and container orchestration (Kubernetes/ECS), and with operating services at scale.
You’ve built reliable systems that handle large data volumes and complex workloads.
Proven experience with monitoring and reliability.
Experience building comprehensive monitoring and alerting.
You know what metrics matter and how to surface the right information to different teams.
Collaborative mindset.

To apply for this job, please visit job-boards.greenhouse.io.