Responsibilities:
- System reliability: Design, build, and maintain scalable and reliable infrastructure to ensure high availability, performance, and resilience of systems and applications.
- Automation and tooling: Develop automation tools and scripts to streamline operational tasks, deployment processes, and monitoring of system health.
- Incident response: Lead and participate in incident response and resolution, including root cause analysis, post-incident reviews, and proactive measures to prevent recurrence.
- Performance optimisation: Identify and resolve performance bottlenecks, carry out capacity planning, and optimise system resources to meet service level objectives.
- Monitoring and alerting: Implement and maintain monitoring systems to proactively detect and respond to system anomalies, performance degradation, and security threats.
- Collaboration: Work with development teams to ensure reliability considerations are integrated into the software development lifecycle and infrastructure design.
- Continuous improvement: Drive continuous improvement through the implementation of best practices, reliability engineering principles, and the adoption of new technologies.
Note: By proceeding to apply for this job post, you confirm that you have read, understood, and agreed to the WPH DATA PROTECTION NOTICE FOR JOB APPLICANTS at the link below.
https://www.wphdigital.com/notices