Software Engineer, Infrastructure
4 months ago
The MRS ML Infra team focuses on the performance and efficiency of ML infrastructure for large-scale AI training and inference workflows in the recommendation domain.
In this role, the engineer optimizes the end-to-end stack for training and inference of large-scale recommendation models. Opportunities range from distributed systems to model/system co-design to GPU system optimizations.
We are looking for someone with prior experience in high-performance infrastructure and performance optimization. The candidate should not only identify and lead the execution of short- and mid-term opportunities for performance and efficiency optimization, but also drive long-term strategies in areas such as model/system co-design and performance automation.
RESPONSIBILITIES
- Drive performance and efficiency optimizations hands-on by identifying and delivering large optimizations across MRS models and systems.
- Drive cross-functional (XFN) collaboration and alignment with multiple partner or product ML teams.
- Lead the technical direction and roadmap for the SGP perf and efficiency team.
- Provide mentorship and guidance to grow junior engineers on the team.
MINIMUM QUALIFICATIONS
- BS/MS in Electrical Engineering, Computer Science or a related field or equivalent experience.
- 7+ years of experience in AI infrastructure or system performance.
- Hands-on experience with deep system performance optimization, for example in distributed systems, high-performance GPU systems, or memory/cache optimization.
- Strong written and verbal communication skills to align cross-functional partners and drive team execution.
- Previous experience mentoring and growing junior engineers as either a tech lead or a manager.
- Strong debugging skills in complex systems that span multiple components or subsystems.
PREFERRED QUALIFICATIONS
- Hands-on experience with large-scale AI infrastructure systems (for example, GPU training systems).
- Experience with training and inference for large models such as LLMs or recommendation models.
- Experience in high-performance computing, including communication optimization, CUDA kernel optimization, and distributed training and inference.
Meta builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology. People who choose to build their careers by building with us at Meta help shape a future that will take us beyond what digital connection makes possible today—beyond the constraints of screens, the limits of distance, and even the rules of physics.