Riot Games was established in 2006 by entrepreneurial gamers who believe that player-focused game development can result in great games. In 2009, Riot released its debut title League of Legends to critical and player acclaim. It is the most played PC game in the world, with over 100 million players every month. Players form the foundation of our community and it’s for them that we continue to evolve and improve the League of Legends experience.
We’re looking for humble but ambitious, razor-sharp professionals who can teach us a thing or two. We promise to return the favor. Like us, you take play seriously; you’re passionate about games. We embrace those who see things differently, aren’t afraid to experiment, and who have a healthy disregard for constraints.
That's where you come in.
Service Reliability Specialist - Riot Operations Center
The Riot Operations Center (ROC) manages the 24x7 monitoring and response components of Riot's player-facing services. We are the first line of defense when things go wrong with any of our live services. We leverage technical familiarity with best-practice processes to rapidly remediate incidents. The team also helps create best practices in alerting, monitoring, and operational processes, and mentors other Riot teams on them.
As a Service Reliability Specialist, you will work closely with the Live Operations team and Riot globally to establish and maintain a high-performing and highly available game service for players. You will monitor and support all aspects of LIVE production environments, development environments, and general system needs. Your technical skills and grasp of system integration will help you diagnose and communicate potential issues to Rioters and the community, improving the quality of the player experience. You will be a craft master in operations and triage. The team can look to you to lead the resolution in tricky situations, as a proactive individual focused on solving the day-to-day problems that affect any aspect of running live games. You will also be involved in projects of moderate complexity that help continuously improve overall service quality in the incident management and observability problem spaces.
Responsibilities:
- Triage and investigate live incidents, and lead the team through them
- Execute technical return-to-service actions in a fast-paced distributed systems environment, specifically microservices, to quickly restore service and protect player experience
- Monitor the health of Riot’s distributed services using observability tools, and identify gaps in alerting, runbook steps, processes, or tools
- Develop, audit, and maintain runbooks to keep documentation up to date
- Create training material and onboard new team members
- Onboard and mentor team members in improving their craft, and serve as a technical point of escalation on a day-to-day basis
- Provide support and coordination during major launches, events, and release deployments
- Build relationships with members from the wider Riot organizations to drive communication and strategic alignment
- Be trained as an Incident Commander and able to drive incidents to resolution
- Contribute to project work with little or no guidance to develop automation scripts, utilities and new processes to continuously improve the incident management process
- Document details of incident response, and conduct incident retrospectives and quality checks as needed to identify problems and improve overall incident management and response
- Participate in post-incident RCA meetings as required
Required Qualifications:
- Computer Science/IT Systems/Information Technology diploma, associate degree or equivalent
- 4+ years in a Service Reliability Administration or equivalent role (System Analyst, System Administrator/Engineer, Live Operations, Network Administrator, NOC Engineer, etc.)
- Speak with authority on incident management and have a good understanding of ITIL processes
- Familiarity with the core concepts of operating systems, networking, SDLC and Agile methodologies
- Expert-level troubleshooting skills, with experience triaging incidents in a high-capacity, high-availability, and highly distributed environment
- Experience with the following tools/platforms:
  - Monitoring solutions, e.g. Datadog, New Relic, Nagios, Elasticsearch, Grafana
  - Event management tools, e.g. BigPanda, Moogsoft
  - ITIL-based ticketing systems, e.g. ServiceNow, JIRA
Desired Qualifications:
- Understanding of relational databases like MySQL and of CI/CD pipelines, especially Jenkins
- Experience working on deployments in a live environment is a plus
- Experience working in container-based ecosystems like Docker and with a container scheduler like Kubernetes, Amazon EKS/ECS, or GKE
- AWS Cloud Services experience/certification/training or equivalent, Linux+ and Network+, or equivalents
- Experience building automation scripts/utilities/jobs using Python, PowerShell, JavaScript, or Bash
- Familiarity with Site Reliability Engineering (SRE) principles and best practices