o Responsible for ensuring the reliability, availability, and performance of critical services. This includes providing scalable and resilient systems, automating repetitive processes, managing incidents, and fostering collaboration between development and operations.
o System Reliability and Performance:
§ Maintain the availability and reliability of services across distributed systems.
§ Implement monitoring and alerting systems to detect issues before they affect users.
§ Proactively analyze and resolve performance bottlenecks, ensuring optimal system performance.
o Automation and Process Improvement:
§ Develop automation scripts for tasks such as deployments, monitoring, backups, and scaling.
§ Automate the provisioning and scaling of infrastructure using Infrastructure as Code (IaC) tools.
§ Create self-healing systems that detect and recover from failures automatically.
o Monitoring and Performance Management Tools:
§ Hands-on experience with system and application monitoring tools such as AWS CloudWatch and AWS CloudTrail.
o Capacity Planning and Scaling:
§ Monitor resource usage and plan for future capacity based on growth projections.
§ Implement systems to dynamically scale resources based on traffic patterns and system loads.
o Log Management and Analysis:
§ Expertise in managing logs for audit, security, and performance purposes using tools such as AWS CloudWatch Logs.
o Security and Compliance:
§ Knowledge of security best practices for system hardening, patching, and vulnerability management.
§ Familiarity with AWS security tools such as AWS Security Hub, AWS IAM, and Amazon Inspector.
o Skill Requirements:
§ AWS resource utilization analysis, capacity planning, and forecasting.
§ Proficiency in AWS administration.
§ Proficiency in at least one scripting language (e.g., Python, Bash).
§ Strong understanding of Linux/Unix systems, including performance tuning and kernel optimization.
§ Experience with load balancing, caching, and database performance optimization.
§ Minimum of 1-2 years of experience.