Site Reliability Engineer

Asia Gulf Cloud Pte. Ltd.

Job Type   /   Job Level

Full-time   /   Others/Any

Job Location

Singapore, Singapore, Singapore

About SGB:

SGB is a new digital bank that will offer a secure and integrated platform to access and manage conventional and digital assets and financial solutions, including round-the-clock real-time settlement, trading connectivity, custody and asset management. It serves global investors, innovators and institutions looking for a differentiated digital banking experience.

SGB is licensed by the Central Bank of Bahrain (CBB).

About the Team:

The Site Reliability Engineering (SRE) team is responsible for ensuring the stability, reliability, and performance of the digital bank's services and infrastructure. Key responsibilities include system availability, performance and capacity management, incident management, change management, automation, CI/CD, backup and disaster recovery, security and compliance.

Responsibilities:

● Design and set SLIs and SLOs for systems across SGB, and drive stability-related workstreams with cross-functional teams.

● Define change management processes, and drive their implementation and continuous improvement with relevant teams.

● Define incident management processes, including incident response and resolution workflows and root cause analysis, and drive corrective actions.

● Define and improve backup and disaster recovery strategies to ensure the stability and availability of systems in SGB.

● Design and maintain CI/CD pipelines, collaborate with developers to improve the pipelines' efficiency, and support testing and deployment activities.

● Ensure security and compliance of development, operation and change management practices through collaboration with relevant teams.

Qualifications required:

● A bachelor's degree in computer science, information systems, or equivalent.

● Strong sense of responsibility and passion for system operation and stability work; excellent communication, problem-solving, and critical thinking skills.

● Extensive experience in system and service stability work; familiar with high-availability architecture and backup and disaster recovery strategies.

● Extensive hands-on experience with monitoring platforms such as Zabbix, Prometheus and Grafana, and automation tools such as Ansible and Terraform.

● Familiar with AWS Cloud, Linux OS, TCP/IP, load balancers, NGINX, the HTTP protocol, databases and storage systems.

● Solid programming skills; well versed in at least one of the following programming languages: Python, Java, Golang.
