- Understand the end-to-end product topology from both infrastructure and application perspectives. Identify risks early and, whenever possible, ensure they are addressed before they become actual problems.
- Run the production environment by monitoring availability and taking a holistic view of system health.
- Measure and optimize performance, and solve issues across the entire stack: hardware, software, application, and network.
- Identify parts of the system that do not scale or are unstable, provide mitigating measures, and drive long-term resolution of these problems.
- Become a subject matter expert (SME) on VAS Issuer products and analyze complex systems from a reliability and resilience perspective.
- Engage regularly with stakeholders to discuss roadmaps and supportability. Drive the agenda for better operational functioning and understand the agile way of carrying out tasks and initiatives.
- Represent the SRE organization in design reviews and operational readiness exercises for new and existing services.
- Perform code bug fixes in production and recommend architectural improvements during issue and incident analysis.
- Actively look for opportunities to improve the availability, reliability, and performance of the system by applying lessons learned from monitoring and observation.
- Design and implement creative solutions to operations problems, incidents, and outages so that these problems stay fixed, driving down the burden of toil.
- Provide technical assistance in running blameless root cause analyses on incidents and outages, aggressively looking for answers that will prevent each incident from recurring.
- Provide Level 3 on-call support (based on rotation).
- Spread SRE culture and create standard SRE documentation and report templates. Provide guidance and technical expertise to junior team members, encourage a learning culture within the group, and foster innovation.
Preferred Skills
- Experience supporting production Windows and/or Linux environments, including process management, user management, distilling log files, and debugging performance issues.
- Ability to develop tools and scripts to support automation needs.
- Coding experience beyond simple scripting.
- 7+ years of development experience with Java, SQL, automation, bug fixing, and production and application operations.
- Experience with log analysis tools and observability applications such as Grafana, Tableau, and Splunk.
- Excellent knowledge of Docker and Kubernetes, including the design, build, and maintenance of Kubernetes (k8s) environments.
- Knowledge of two or more platforms such as Kafka, NGINX, CDN, Redis/Hazelcast, middleware software, Elastic, and various SQL and NoSQL database platforms.
- Good working knowledge of TCP/IP, routing, and data centers.
- Linux systems engineering capabilities and network analysis expertise are great to have.
- Strong work ethic, leadership skills, excellent judgment, good time management in prioritizing work, and the ability to work in a fast-paced, team-oriented environment.
- Outstanding analytical and problem-solving skills, willingness to investigate complex problems, and a proactive approach to spotting problems, areas for improvement, and performance bottlenecks.
- Strong critical and strategic thinking skills to handle both the big picture and crucial technical decisions.
- Ability to read and understand production code in any language so that you have a deeper understanding of our technology and ways to optimize it.
- Experience in designing, integrating, and developing web services and REST/JSON APIs.
- Knowledge of Java or related technologies, which aids in fixing bugs, understanding the supported products, and resolving integration-related issues.
- Excellent understanding of systems and product architecture, covering both application components and infrastructure such as networks, load balancers, firewalls, and gateway services.
- Experience supporting and working on web and mobile applications and troubleshooting problems in a cross-functional environment.
- Strong collaboration skills and the ability to take ownership of problems when navigating ambiguity.
- Willingness to work outside with a strong technical aptitude and excellent communication skills.
This is a hybrid position. Hybrid employees can alternate between working remotely and working in the office. Employees in hybrid roles are expected to work from the office 2-3 set days a week (determined by leadership/site), with a general guidepost of being in the office 50% or more of the time based on business needs.