We are seeking a highly skilled Site Reliability Engineer (SRE) with a strong background in maintaining self-hosted Kubernetes clusters. Your primary focus will be ensuring the stability and reliability of our production environment. A smoothly running infrastructure gives our AI researchers a steady and dependable platform for their work.
Job Description
- Work closely with AI researchers to understand their workflow and infrastructure needs, optimizing the cluster configurations accordingly.
- Implement monitoring, alerting, and self-healing systems to ensure high availability and performance of the clusters.
- Collaborate with development teams to design and implement best practices for infrastructure as code (IaC).
- Drive automation initiatives to reduce manual toil and improve system resilience and scalability.
- Document system designs and procedures, and provide guidance to researchers on advanced cluster usage.
Job Requirements
- Bachelor's degree or higher in Computer Science, Engineering, or a related field.
- Proven experience in managing self-hosted Kubernetes clusters in a production environment.
- Strong understanding of containerization, orchestration, and the Kubernetes ecosystem.
- Familiarity with AI workflows; a machine learning/deep learning research background is a plus.
- Proficiency in at least one programming language (e.g., Python, Go) and scripting skills for automation.
- Good working attitude and strong problem-solving, critical-thinking, and communication skills.