
HPC (High Performance Computing) System Engineer

Homebrew Computer Company Pte. Ltd.

We are seeking an experienced HPC Engineer to design, deploy, and maintain a high-performance computing (HPC) cluster for our AI training workloads. The successful candidate will be responsible for setting up a GPU-based training cluster together with our Research team and ensuring that it works well with our model training algorithms.


Key Responsibilities:

  • Design and deploy a GPU-based HPC cluster using industry-standard components (e.g., NVIDIA DGX/HGX or similar), including node-level design (e.g., NVLink, SXM).
  • Configure and optimise the cluster for high-performance computing, focusing on AI workloads (e.g., PyTorch, Torch or similar).
  • Implement and manage cluster management software (e.g., Kubeflow, Slurm or similar).
  • Design the cluster network for high bandwidth and low latency (InfiniBand, Ethernet RDMA, and/or RoCE), using scalable and efficient network topologies (Fat Tree, Dragonfly, and/or Torus).
  • Troubleshoot and resolve issues related to cluster performance, hardware failures, and software glitches.

Requirements:

  • Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field
  • Extensive experience in designing, assembling, and configuring high-performance computing systems
  • Proficient in selecting and integrating HPC hardware components, including CPUs, GPUs, memory, storage, and interconnects
  • Strong knowledge of HPC software stacks, including operating systems, drivers, and specialised applications
  • Experience in designing and operating AI training clusters, including the selection and integration of the necessary hardware and software components
  • Expertise in conducting comprehensive benchmarking tests and analysing performance data
  • [Plus] Strong networking knowledge, including experience with high-speed interconnects and RDMA (e.g., InfiniBand, RoCE)
  • [Plus] Experience with setting up and managing NVIDIA multi-node training clusters for machine learning applications
