As a System Design Engineering Manager specializing in AI Systems and High-Performance Computing (HPC), you’ll play a pivotal role in shaping the future of AI-driven solutions. Your leadership will drive innovation, optimize performance, and foster collaboration across cross-functional teams. Let’s delve into the details:
Key Responsibilities:
Technical Leadership:
- Lead a team of engineers and system architects to develop the platform for DL training and inference.
- Define the technical vision, strategy, and roadmap for DL/AI systems within the HPC domain.
AI Infrastructure Design and Optimization:
- Collaborate with hardware and software teams to design and optimize AI infrastructure.
- Ensure seamless integration of AI workloads with existing HPC clusters.
GPU Cluster Management and Scalability:
- Oversee the management of GPU-based clusters.
- Scale AI infrastructure to handle large-scale training and inference workloads.
Performance Tuning and Benchmarking:
- Drive performance improvements by analyzing bottlenecks and optimizing system components.
- Benchmark AI models and algorithms on HPC clusters.
Collaboration and Communication:
- Work closely with product managers, researchers, and stakeholders to align AI initiatives with business goals.
- Communicate technical progress, risks, and opportunities to senior leadership.
Requirements:
- Proven track record in managing engineering teams, preferably in AI, HPC, or related fields.
- Familiarity with AI frameworks (TensorFlow, PyTorch, etc.) and HPC tools (Slurm, OpenMPI, etc.).
- Strong understanding of GPU architectures, CUDA programming, and parallel computing.
- Knowledge of containerization (Docker, Kubernetes) and cloud-based AI deployments.
- Ability to mentor and develop team members.
- Excellent decision-making, problem-solving, and conflict resolution skills.
- Effective communication with both technical and non-technical stakeholders.
- Experience presenting technical concepts to executive leadership.