Engineering, Site Reliability Engineer, Associate (133692)

Goldman Sachs Services (singapore) Pte. Ltd.

Job Type   /   Job Level

Full-time   /   Others/Any

Job Location

Singapore

Your Impact

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. At Goldman Sachs, SRE is responsible for improving the availability and reliability of some of the firm’s most critical platform services, and for ensuring they meet the requirements of our internal and external users. We are looking for engineers who are motivated to collaborate with our businesses to build and run sustainable production systems that can evolve and adapt to changes in our fast-paced, global business environment.

The SRE team develops and maintains platforms and tools that help other engineering teams at Goldman Sachs build and operate reliable and resilient systems. The platforms we offer range from central logging and tracing to monitoring and alerting. We also provide tools to drive adoption of, and improvements to, capacity planning, operational readiness assessments, production incident postmortems, SLIs / SLOs, and deployment automation, including canary releases.

The products and services we provide to our internal customers are used by thousands of engineers every day. We believe that reliability is the most important feature of any system, and we are devoted to giving our engineers the platforms and tools they need to build and operate reliable products.

How You Will Fulfil Your Potential

As a developer in the SRE team, you will work with internal customers, product owners, and SREs to design, develop, and support the platforms and tools we provide to other engineering teams. These platforms and tools enable those teams to run reliable, large-scale production systems spanning cloud and on-prem data centers.

Responsibilities

Design, develop, and support SRE platforms and tools
Create and support automation solutions and build out monitoring and alerting to improve the reliability of the platforms and tools we operate
Collaborate with other teams to onboard them onto SRE-owned platforms and tools, and help them implement SRE best practices
Adhere to and drive SRE disciplines and processes across the global team

Basic Qualifications

Degree in computer science or engineering with at least 3 years of industry experience
Proficiency in at least one major programming language, preferably Java or Go, and JavaScript / TypeScript
Excellent programming skills including debugging, testing, and optimizing code
Strong problem-solving / analytical skills
Experience with algorithms and data structures, as well as software and system design
Experience automating operational tasks
Comfortable with technical ownership, managing multiple stakeholders, and working as part of a global team

Preferred Experience

Experience with distributed systems design, maintenance, and troubleshooting
Experience with databases / data stores like PostgreSQL, MongoDB, and Elasticsearch
Proficiency in using Terraform for infrastructure deployment and management
Knowledge of cloud-native solutions in AWS or GCP
Systems experience in Linux and networking, especially in scaling for performance and debugging complex distributed systems
Experience with monitoring and alerting systems
