As a Production Operations Engineer, you will play a critical role in operationalizing, maintaining, scaling, and optimizing our AI-driven applications and supporting infrastructure. With a blend of software development and infrastructure skills, you will work closely with cross-functional teams, including software engineers, data scientists, and platform engineers, to ensure the delivery and operation of highly available, low-latency, high-performing AI products.
Your expertise will be crucial in developing solutions, automating processes, monitoring system health, troubleshooting, and managing incidents to ensure our products deliver a seamless experience for our clients.
Key Responsibilities:
- Software Development and Operations: Collaborate with software engineers to design, implement, and maintain scalable, efficient, and secure systems using a React, Python, Docker, and Kubernetes stack.
Optimize application performance by profiling and tuning frontend and backend services for speed, scalability, and resilience.
- System Monitoring & Maintenance: Monitor production systems and services, ensuring optimal uptime and performance.
Implement monitoring tools and dashboards for proactive incident detection.
- Infrastructure Automation: Automate repetitive tasks, deployment processes, and infrastructure provisioning using tools such as Ansible, Terraform, or similar.
Develop and maintain CI/CD pipelines to facilitate smooth deployments.
- Incident Management & Troubleshooting: Respond to system incidents, troubleshoot issues, and work towards timely resolutions.
Conduct root cause analysis (RCA) of system failures and develop strategies to prevent future incidents.
- Performance Optimization: Optimize AI model deployment and data pipelines for speed, efficiency, and cost-effectiveness.
Collaborate with data scientists and engineers to ensure AI systems are running efficiently in production environments.
- Scalability & Reliability: Design and implement scalable infrastructure solutions for AI applications.
Ensure system reliability, fault tolerance, and high availability through effective architecture and best practices.
- Security & Compliance: Work with security teams to ensure all systems are compliant with company security protocols and industry standards.
Implement security best practices across production environments.
Required Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
- 5+ years of experience in a combination of software development, production operations, DevOps, infrastructure engineering, or security roles.
- Strong application development experience in JavaScript and Python, particularly with RESTful API design and development using a service-oriented architecture.
- Strong experience with cloud platforms (AWS, GCP, or Azure).
- Proficiency in containerization and container orchestration technologies (e.g., Docker, Kubernetes).
- Solid understanding of CI/CD pipelines and automation tools (GitHub Actions, Argo CD, Jenkins, GitLab CI, etc.).
- Experience with infrastructure as code (Terraform, Ansible, etc.).
- Hands-on experience with monitoring and logging tools (Datadog, Prometheus, Grafana, ELK stack, etc.).
- Strong experience with Bash and similar scripting languages.
- Solid understanding of frontend frameworks, particularly React, and their interaction with backend services.
- Strong problem-solving skills and attention to detail.
- Experience in handling large-scale distributed systems.
Preferred Qualifications:
- Proficiency in working with NoSQL databases (e.g., MongoDB) and an understanding of document-based data models.
- Knowledge of object storage systems and experience with S3-compatible APIs (e.g., MinIO) for storing and managing large-scale unstructured data.
- Experience in supporting AI/ML pipelines and production systems.
- Knowledge of data engineering and distributed data systems (e.g., Kafka, Spark, Hadoop).
- Understanding of GPU-based infrastructure for AI workloads.
- Familiarity with security best practices in cloud and AI environments.
Benefits:
- Flexible work environment and a culture that promotes work-life balance.