Homebrew is looking for an Infrastructure Engineer to help run our GPU Training Cluster and internal GPU Cloud. Please note that this is an on-premise role, as we build our own infrastructure.
Responsibilities
- Design and maintain the organisation's infrastructure, including compute and storage nodes, high-bandwidth networking infrastructure, and security and monitoring infrastructure
- Design and maintain software for infrastructure management and orchestration (e.g. OpenStack, Kubeflow, Proxmox)
- Participate in incident response and resolution to ensure high availability and performance
- Develop and maintain solutions for day-to-day operational administration, system/data backup, disaster recovery, and security/performance monitoring
- Collaborate with the Engineering team to implement DevSecOps practices (e.g. IaC, CI/CD)
Requirements
- Familiarity with on-premise infrastructure (e.g. racks with power, storage, compute, and network nodes)
- Ability to perform basic to intermediate hardware troubleshooting, servicing, and repairs
- [Plus] Experience with Slurm, Kubeflow or alternative cluster orchestration tools
- [Plus] Experience with OpenStack, VMware, Proxmox, or alternative cloud orchestration tools
- [Plus] Experience with designing GPU Clusters or HPC systems (inter-cluster networking)
- [Plus] Familiarity with software-defined storage technologies (Ceph, ZFS, NFS, etc.)