Your Impact
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. At Goldman Sachs, SRE is responsible for improving the availability and reliability of some of the firm’s most critical platform services, and ensures they meet the requirements of our internal and external users. We are looking for engineers who are motivated to collaborate with our businesses to build and run sustainable production systems, which can evolve and adapt to changes in our fast-paced, global business environment.
The SRE team develops and maintains platforms and tools that help other engineering teams at Goldman Sachs build and operate reliable and resilient systems. The platforms we offer range from central logging and tracing to monitoring and alerting. We also provide tools to drive adoption of, and improvements in, capacity planning, operational readiness assessments, production incident postmortems, SLIs / SLOs, and deployment automation, including canary releases.
The products and services we provide to our internal customers are used by thousands of engineers every day. We believe that reliability is the most important feature of any system, and we are devoted to giving our engineers the platforms and tools they need to build and operate reliable products.
How You Will Fulfil Your Potential
As a developer in the SRE team, you will work with internal customers, product owners, and SREs to design, develop, and support the platforms and tools that enable other engineering teams to run reliable, large-scale production systems spanning cloud and on-prem datacenters.
Responsibilities
- Design, develop, and support SRE platforms and tools
- Create and support automation solutions and build out monitoring and alerting to improve the reliability of the platforms and tools we operate
- Collaborate with other teams to onboard them onto SRE-owned platforms and tools and help them implement SRE best practices
- Adhere to and drive SRE disciplines and processes across the global team
Basic Qualifications
- Degree in computer science or engineering with at least 3 years of industry experience
- Proficiency in at least one major programming language, preferably Java or Go, and JavaScript / TypeScript
- Excellent programming skills including debugging, testing, and optimizing code
- Strong problem-solving / analytical skills
- Experience with algorithms and data structures, as well as software and system design
- Experience automating operational tasks
- Comfortable with technical ownership, managing multiple stakeholders, and working as part of a global team
Preferred Experience
- Experience with distributed systems design, maintenance, and troubleshooting
- Experience with databases / data stores such as PostgreSQL, MongoDB, and Elasticsearch
- Proficiency in using Terraform for infrastructure deployment and management
- Knowledge of cloud-native solutions in AWS or GCP
- Systems experience in Linux and networking, especially in scaling for performance and debugging complex distributed systems
- Experience with monitoring and alerting systems