As a Site Reliability Engineer, you will operate and maintain LANDI Global's infrastructure. Your main responsibilities will be to:
· Build, operate, and maintain our platform infrastructure across various environments
· Collaborate with the R&D team to ensure the availability, reliability, and scalability of our platforms
· Implement and configure monitoring and alerting systems, together with resolution procedures, to provide visibility and ensure timely remediation of platform failures
· Implement and maintain Disaster Recovery plans to ensure business continuity
· Analyse platform performance and cost, and present optimizations for both
· Design and implement automated testing, continuous integration, and continuous delivery frameworks and processes for deployment efficiency
· Manage change management and incident reporting processes to anticipate and respond to incidents in conformance with platform SLAs
· Provide operational support for the platforms and help resolve production issues as an escalation point for the team
· Participate in a 24/7 on-call rotation
· Support deployment of environments as new clients are onboarded
EXPERIENCE
· At least 5 years of experience in a similar capacity
· Excellent oral and written communication skills in English
PREFERRED SKILLS
Candidates should ideally have experience in some of the following technologies:
· Experience in various cloud technologies (e.g. AWS, Azure)
· Experience in distributed Linux/Unix operating systems
· Experience in high-level programming or scripting languages
· Experience in monitoring tools (e.g. Prometheus, Grafana, Zabbix)
· Experience in configuration management tools (e.g. Ansible, Chef, Puppet)
· Experience in SQL databases (e.g. Postgres, MySQL)
· Experience in load balancing and reverse proxies (e.g. Nginx)
· Experience in CI/CD tools (e.g. Jenkins, GitLab)
· Experience in containerization (e.g. Docker, K8s)