Job Description:
As a Cloud Site Reliability Engineer (SRE), you will play a key role in ensuring the reliability, scalability, and performance of our cloud-based systems and services. Working closely with cross-functional teams, you will proactively monitor, optimize, and troubleshoot cloud infrastructure, applications, and services to minimize downtime and deliver a seamless user experience. Your expertise in cloud technologies and dedication to automation and best practices will contribute to the stability and growth of our cloud operations.
Key Responsibilities:
· Reliability and Availability: Implement best practices for high availability and disaster recovery across cloud environments.
Monitor system performance, availability, and incident response to ensure minimal downtime.
Create and maintain robust monitoring and alerting systems.
· Automation and Infrastructure as Code (IaC): Develop and maintain automation scripts and Infrastructure as Code (IaC) templates for provisioning and managing cloud resources.
Automate routine tasks to increase operational efficiency and reduce manual interventions.
· Scalability and Performance Optimization: Collaborate with development teams to design and implement scalable and performant cloud architectures.
Conduct performance analysis and tuning to optimize system response times and resource utilization.
· Incident Response and Troubleshooting: Participate in incident response activities, including root cause analysis, resolution, and post-incident reviews.
Troubleshoot complex issues across the cloud stack and coordinate with relevant teams for resolution.
· Security and Compliance: Implement security best practices and compliance measures in cloud environments.
Collaborate with security teams to ensure data protection and compliance with industry standards.
· Capacity Planning: Monitor resource utilization and forecast capacity requirements to support business growth.
Implement scaling strategies to accommodate changing workloads.
· Documentation and Knowledge Sharing: Maintain comprehensive documentation of cloud configurations, processes, and procedures.
Share knowledge and best practices with team members and contribute to a culture of continuous learning.
Qualifications/Requirements:
· Bachelor's Degree in Computer Science, Information Technology, or a related field.
· Proven experience in cloud operations, SRE, or a related role.
· Proficiency in cloud platforms such as AWS or Azure.
· Certification in cloud platforms (e.g. AWS Certified DevOps Engineer, Azure DevOps Engineer Expert).
· Experience with containerization and orchestration tools (e.g. Docker, Kubernetes).
· Knowledge of infrastructure monitoring and logging tools (e.g. Prometheus, Grafana, ELK stack).
· Strong scripting and programming skills (e.g. Python, Bash, Go).
· Familiarity with CI/CD pipelines and automation tools (e.g. Jenkins, GitLab CI/CD).
· Experience with GCC (Government on Commercial Cloud).
· Excellent problem-solving and communication skills.
· Ability to work collaboratively in a cross-functional and fast-paced environment.