ROLES AND RESPONSIBILITIES
The Global Operations Centre (GOC) HPC Engineer is a technical specialist responsible for the daily operation and maintenance of the company's high-performance computing (HPC) environment. The HPC Engineer collaborates closely with senior engineers, monitors system health, troubleshoots issues (especially those related to NVIDIA H100 GPUs, InfiniBand, and Mellanox hardware), assists the Global Operations Centre, and creates clear documentation to ensure smooth and efficient operations.
- Assist in the deployment, configuration, and maintenance of HPC hardware and software components.
- Monitor the health and performance of HPC systems, identifying and resolving issues proactively.
- Participate in the on-call rotation to ensure 24/7 availability and responsiveness to critical issues.
- Provide technical support to the GOC Support Specialist team in troubleshooting HPC-related problems.
- Analyze system logs, performance data, and user reports to diagnose and resolve issues.
- Document incident details, resolutions, and lessons learned to enhance future problem-solving.
- Create and maintain comprehensive SOPs for common HPC tasks, incident response procedures, and system configurations.
- Ensure documentation is clear, accurate, and up to date, contributing to knowledge sharing within the team.
- Communicate effectively with the GOC team, IT stakeholders, and end-users to ensure clear understanding of issues and resolutions.
- Participate in team meetings, project discussions, and knowledge-sharing sessions to foster a collaborative environment.
SKILLS AND EXPERIENCE
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- 8+ years of experience in HPC system administration, Linux/Unix environments, and troubleshooting complex technical problems.
- Strong understanding of HPC architecture, networking, storage, and job scheduling systems.
- In-depth knowledge of InfiniBand fabric topology and Mellanox hardware capabilities.
- Proficiency in Linux/Unix operating systems and command-line tools.
- Experience with scripting languages (e.g., Bash, Python) for automation and problem-solving.
- Familiarity with HPC software, administration, and tools (e.g., Slurm, Kubernetes).
- Excellent problem-solving and analytical skills.