Responsibilities:
- Lead the monitoring, analysis, and continuous improvement of the product runtime environment across all stages.
- Collaborate with cross-functional teams (developers, product managers, user support) to proactively identify and resolve performance issues.
- Oversee the implementation and management of incident and change management workflows, ensuring they align with best practices and business requirements.
- Provide strategic direction in developing and refining operational processes and preventive measures for system reliability.
- Manage the operations team, ensuring high availability and top-tier support service delivery on a 24/7 basis.
- Drive improvements in automation and monitoring to minimize downtime and enhance system performance.
- Prepare and deliver presentations and reports to senior management, detailing operational activities, improvement efforts, and results.
- Mentor junior engineers and guide them in adopting best practices for monitoring, incident management, and process improvement.
Experience and Skills Needed:
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- At least 4 years of experience in an Operations Engineer role, with significant exposure to cloud infrastructure, monitoring, and incident management.
- Extensive experience with full-stack monitoring, cloud-based monitoring tools, and ITSM tools for incident and change management.
- Proficiency in scripting and in infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible), with a focus on process automation that minimizes errors and downtime.
- Strong leadership and mentoring skills, with the ability to influence and motivate the team.
- Excellent problem-solving and communication skills, including the ability to explain complex technical issues to non-technical audiences.
- Certification in a cloud platform (AWS, Azure, or Google Cloud) is highly preferred.