RESPONSIBILITIES:
- Support High-Performance Computing (HPC) and IT Service Management (ITSM).
- As a Systems Engineer specializing in HPC, you will be a key contributor to the design, implementation, and optimization of complex computational systems. Leveraging your expertise in HPC technologies, you will collaborate with cross-functional teams to ensure the seamless integration and performance of HPC environments.
System Design and Implementation:
- Design, implement, and maintain high-performance computing systems to meet the organization's computational needs.
- Collaborate with stakeholders to understand performance requirements and hardware specifications.
Parallel Computing:
- Implement and optimize parallel computing techniques to enhance system performance.
- Leverage parallel programming languages and frameworks for efficient task execution.
Cluster Management:
- Manage and optimize HPC clusters, ensuring scalability and reliability.
- Implement and maintain cluster management tools for efficient resource utilization.
Performance Tuning:
- Analyze and fine-tune system configurations, hardware, and software for optimal performance.
- Identify and resolve performance bottlenecks in HPC applications.
Job Scheduling:
- Utilize job scheduling systems to allocate computational resources and manage workloads efficiently.
- Collaborate with users to understand job requirements and prioritize computing tasks.
Networking and Interconnects:
- Configure and optimize high-speed interconnects, such as InfiniBand, for fast data transfer between nodes.
- Collaborate with network administrators to ensure seamless communication within HPC environments.
Distributed File Systems:
- Implement and manage distributed file systems for efficient data storage and retrieval.
- Optimize data access and transfer mechanisms to support large-scale computations.
Fault Tolerance and Reliability:
- Implement strategies for fault tolerance to ensure system reliability during long-running computations.
- Troubleshoot and resolve system issues to minimize downtime.
Documentation:
- Create and maintain detailed documentation of HPC system configurations, processes, and best practices.
- Develop user guides and training materials for HPC users.
Stay Updated:
- Keep abreast of emerging trends and advancements in HPC technologies.
- Evaluate and recommend new hardware and software solutions to enhance system capabilities.
REQUIREMENTS:
- Bachelor’s or master’s degree in Computer Science, Information Technology, or a related field.
- Proven experience as a Systems Engineer with a focus on High-Performance Computing.
- Knowledge of HPC architectures, technologies, and parallel programming languages.
Technical Proficiency:
- Familiarity with cluster management tools, job scheduling systems, and distributed file systems.
- Experience with high-speed interconnects (e.g., InfiniBand) and networking in HPC environments.
Problem-Solving Skills:
- Strong analytical and problem-solving skills to address complex HPC challenges.
Communication:
- Excellent communication and collaboration skills to work effectively in interdisciplinary teams.