Role Summary:
- As an Embedded SRE, you will work closely with our development teams to build and maintain scalable, highly available, and fault-tolerant systems.
- You will be integral to shaping a culture of reliability: implementing monitoring and alerting systems, managing changes, automating toil, and improving overall system resilience through reliability planning and incident management.
Experience:
- Minimum 2 years of object-oriented development experience in Java/Spring Boot or .NET.
- Experience with application and infrastructure monitoring using tools such as New Relic, AppDynamics, Grafana, Kibana, Prometheus, or Splunk.
- Experience with Python and shell scripting preferred.
Key Responsibilities:
- Monitoring & Alerting: Implement and improve observability through monitoring and alerting systems. Design and manage dashboards and alerts that proactively track health, performance, and SLIs against agreed-upon SLOs.
- Change Management: Champion progressive rollout strategies, detect and resolve problems efficiently, and safely manage rollback/forward processes to maintain system integrity during changes.
- Automation: Develop and implement automation strategies to reduce manual errors, minimize toil, and enhance team velocity.
- Reliability Planning: Engage in planning for both organic and inorganic growth, ensuring optimal utilization of computing resources and scalability of our infrastructure.
- Incident Management: Provide on-call support to address and mitigate incidents rapidly. Lead the practice of chaos engineering to ensure preparedness and effective response strategies.
- Release Engineering: Manage and optimize the continuous delivery process, focusing on small, frequent, and safe releases.
- Reliability Consulting: Work alongside development teams to set and review SLIs, SLOs, and error budgets. Offer insights on architectural decisions, system observability, and application performance to enhance reliability.
- Cross-Functional Collaboration: Embrace the "You build it, you run it" philosophy, working across functional teams to ensure reliability and performance from development through production.
- Problem Management: Conduct blameless post-mortems to learn from failures, uncover insights, and socialize findings across the company to foster a culture of continuous improvement.
Qualifications:
- Bachelor's degree in Computer Science, Engineering, or related field.
- Proven experience as an SRE or in a similar operational engineering role.
- Strong understanding of cloud technologies, CI/CD processes, and infrastructure as code.
- Proficiency in scripting languages and automation tools.
- Excellent problem-solving, communication, and teamwork skills.
- Strong experience in programming and cloud-native development, especially on Azure Kubernetes Service (AKS).
- Familiarity with Azure components and services.
- Proactive approach to identifying problems, performance bottlenecks, and areas for improvement.