Seeking a Singaporean to join as Systems Engineer (HPC, Linux) on a 1-year renewable project assignment
Mon to Fri: Office hours
Job Scope:
As a Systems Engineer specializing in High-Performance Computing (HPC), you will be a key contributor to the administration, operation, maintenance, design, implementation, and optimization of complex computational systems. Leveraging your expertise in HPC technologies, you will collaborate with cross-functional teams to ensure the seamless integration and performance of high-performance computing environments.
Responsibilities:
System Administration, Operation, Maintenance, Design and Implementation:
Administer, operate, maintain, design, and implement high-performance computing systems to meet the organization's computational needs.
Collaborate with stakeholders to understand performance requirements and hardware specifications.
Parallel Computing:
Implement and optimize parallel computing techniques to enhance system performance.
Leverage parallel programming languages and frameworks for efficient task execution.
Cluster Management:
Manage and optimize HPC clusters, ensuring scalability and reliability.
Implement and maintain cluster management tools for efficient resource utilization.
Performance Tuning:
Analyze and fine-tune system configurations, hardware, and software for optimal performance.
Identify and resolve performance bottlenecks in HPC applications.
Job Scheduling:
Utilize job scheduling systems to allocate computational resources and manage workloads efficiently.
Collaborate with users to understand job requirements and prioritize computing tasks.
Networking and Interconnects:
Configure and optimize high-speed interconnects, such as InfiniBand, for fast data transfer between nodes.
Collaborate with network administrators to ensure seamless communication within HPC environments.
Distributed File Systems:
Implement and manage distributed file systems for efficient data storage and retrieval.
Optimize data access and transfer mechanisms to support large-scale computations.
Fault Tolerance and Reliability:
Implement strategies for fault tolerance to ensure system reliability during long-running computations.
Troubleshoot and resolve system issues to minimize downtime.
Documentation:
Create and maintain detailed documentation of HPC system configurations, processes, and best practices.
Develop user guides and training materials for HPC users.
Stay Updated:
Keep abreast of emerging trends and advancements in HPC technologies.
Evaluate and recommend new hardware and software solutions to enhance system capabilities.
Job Requirements:
- Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
- Proven experience as a Systems Engineer with a focus on High-Performance Computing.
- Knowledge of HPC architectures, technologies, and parallel programming languages.
Technical Proficiency:
- Familiarity with Linux (RHEL, CentOS), cluster management tools, job scheduling systems, and distributed file systems.
- Experience with high-speed interconnects (e.g., InfiniBand) and networking in HPC environments.
Problem-Solving Skills:
- Strong analytical and problem-solving skills to address complex HPC challenges.
Communication:
- Excellent communication and collaboration skills to work effectively in interdisciplinary teams.
1-year renewable project assignment