Assurity Trusted Solutions (ATS) is a wholly-owned subsidiary of the Government Technology Agency (GovTech), incorporated to operate the National Authentication Framework (NAF) and National Certification Authority (NCA). We seek to be the Source of Trust in the use of digital services and are committed to improving Trust and High Assurance of digital services by providing secure and convenient identity management solutions.
We are looking for a Cloud Infrastructure Engineer (Large Language Model) to join us in our mission to improve Singapore’s competitiveness as a trusted ICT hub for citizens and businesses.
You will be working on:
- Design, deploy, and optimize Kubernetes clusters using the Nvidia software stack to support large language model applications.
- Collaborate with cross-functional teams to integrate Nvidia GPU resources effectively within Kubernetes environments, ensuring optimal performance.
- Implement and manage infrastructure as code (IaC) for Nvidia GPU configurations, focusing on scalability and high availability.
- Monitor, troubleshoot, and resolve issues related to both Kubernetes clusters and Nvidia GPU resources to maintain a reliable and performant infrastructure.
- Stay abreast of industry best practices and emerging technologies related to Kubernetes and the Nvidia GPU ecosystem.
- Work closely with development teams to automate deployment processes and streamline workflows, leveraging Nvidia GPU capabilities.
- Implement security best practices to safeguard Kubernetes environments, Nvidia GPU resources, and sensitive data.
- Participate in the on-call rotation and respond promptly to incidents, minimizing downtime for language model applications.
- Contribute to capacity planning and performance tuning activities, considering the demands of large-scale language model applications utilizing Nvidia GPU acceleration.
- Document infrastructure configurations, processes, and procedures, facilitating knowledge sharing and team member onboarding.
Join us and discover a meaningful and exciting career with Assurity Trusted Solutions!
The remuneration package will be commensurate with your qualifications and experience. Interested applicants, please click "Apply Now".
Thank you for your interest. Please note that only shortlisted candidates will be notified.
By submitting your documents, you agree that your personal data may be collected, used and disclosed by Assurity Trusted Solutions Pte. Ltd. (ATS), GovTech and their service providers and agents for the purposes of assessing your suitability for a vacancy with ATS and GovTech and contacting you for future career opportunities. You warrant that where you have disclosed personal data of third parties (e.g. next-of-kin, friends or referees) to ATS, GovTech and their service providers and agents in connection with the abovementioned purposes, you have obtained the prior consent of such third parties for ATS, GovTech and their service providers and agents to collect, use and disclose such personal data for such purposes, in accordance with any applicable laws, regulations and/or guidelines.
To succeed in this role, you will ideally have:
- Proven experience in designing, implementing, and managing on-premises infrastructure solutions.
- Strong knowledge of server virtualisation, storage systems and network infrastructure.
- Hands-on experience with cloud-native technologies and deployment strategies.
- Proven experience designing, deploying, and managing Kubernetes clusters on platforms such as SUSE Rancher or Red Hat OpenShift.
- Strong understanding of containerization (e.g., Docker), orchestration tools such as Kubernetes, and Nvidia GPU acceleration technologies.
- Proficiency in scripting, automation and configuration management using tools such as Ansible, Terraform, or similar.
- Familiarity with infrastructure-as-code principles and tools (e.g., Helm, Kubernetes manifests).
- Experience with large-scale language model applications, particularly leveraging Nvidia GPU acceleration, is highly desirable.
- Solid knowledge of networking concepts, Kubernetes networking models, and integration with Nvidia GPU resources.
- Excellent problem-solving and troubleshooting skills, with a proactive approach to system optimization.
- Strong communication skills for effective collaboration in a team-oriented, agile environment.