Responsibilities:
- Manage high-severity and high-customer-impact incidents, focusing on fast recovery
- Provide 24/7 application L2 support (12-hour shifts)
- Ensure SIT and UAT approval and UAT BU sign-off
- Champion production resilience and availability, focusing on superior client experience, by working with the operations team and technology development teams
- Drive the implementation of Site Reliability Engineering (SRE) and Chaos Engineering design for all strategic systems
- Drive effective communication between business and technology with regard to production service reliability and performance
- Drive continuous improvements in processes or systems leveraging Site Reliability Engineering methods
- Respond to, evaluate and analyse production incidents to minimise their impact, and devise innovative solutions to prevent them in the future
- Improve the reliability and availability of systems by gathering hard data and designing systems for increased service reliability and performance
- Provide expert advice and training to our engineers on which technology solutions and advanced reliability techniques to use in each situation
Requirements:
- At least 4 years of experience
- Experience driving major production incidents and organising incident retrospective meetings
- Experience with Core Java 8, Cloud Foundry, non-relational databases, and Linux/Unix systems
- Experience with high availability, high-scale, and performant systems
- Experience with Python and Unix scripting
To apply, please click the Apply button or send us your updated profile at [email protected]
EA Licence No.:18S9405 / EA Reg. No.:R1330864
Percept Solutions is undergoing a growth phase and is on the lookout for talent. Applicants are encouraged to follow Percept Solutions on LinkedIn at https://www.linkedin.com/company/percept-solutions/ to stay up to date on our upcoming roles and events.