Responsibilities
· Design, build, and maintain infrastructure that supports our application development, with a focus on reliability, scalability, and performance.
· Implement and automate deployment, monitoring, and scaling processes to ensure the smooth operation of our systems and services.
· Monitor system performance and reliability metrics, troubleshoot issues, and implement solutions to prevent downtime and improve efficiency.
· Collaborate with our team's engineers to design, develop, and deploy reliable and scalable applications.
· Develop and maintain tools and scripts for automation, configuration management, and monitoring of our infrastructure and applications.
· Respond to incidents and emergencies to minimize downtime and ensure the reliability of our systems.
· Continuously evaluate and improve our infrastructure, processes, and practices to enhance reliability, scalability, and efficiency.
· Stay up to date with industry trends, best practices, and emerging technologies in site reliability engineering and cloud computing.
Qualifications
· Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
· Experience working in a similar role.
· Programming skills in languages such as Python, Java, or similar.
· Hands-on experience with cloud platforms such as AWS, Azure, or GCP.
· Experience with containerization technologies such as Docker and container orchestration platforms such as Kubernetes.
· Proficiency in Linux system administration, shell scripting, and network troubleshooting.
· Experience with infrastructure as code tools such as Terraform, Ansible, or similar.
· Knowledge of CI/CD pipelines and automated testing frameworks.
· Strong analytical and problem-solving skills.
· Excellent communication and collaboration skills.