· Design and implement monitoring solutions using APM products.
· Expert-level, hands-on administration experience with infrastructure and application monitoring tools (SolarWinds, Instana).
· Identify vital metrics and create alerts and policies in APM tools.
· Enable auto-resolution from APM tools.
· Create and maintain monitoring dashboards to provide real-time visibility into system health and performance.
· Collaborate with development and operations teams to define and implement alerting rules based on established best practices and specific system requirements.
· Monitor system performance, availability, and capacity to proactively identify and address potential issues.
· Continuously analyze monitoring data to identify opportunities for optimization and efficiency improvements.
· Experience in patching and upgrading APM tools.
· Participate in Sev1 calls and help troubleshoot using APM tools.
· Collaborate with cross-functional teams to ensure the reliability, scalability, and performance of our infrastructure.
· Document monitoring and alerting configurations, processes, and best practices.
· Good to have: Scripting skills.
· Bachelor’s degree in computer science or a related technical discipline, or equivalent practical experience.
· Proven experience as a Site Reliability Engineer (SRE) or a similar role with a focus on monitoring and alerting.
· Proficiency with APM tools and technologies such as SolarWinds, IBM Instana, Prometheus, Grafana, etc.
· Experience in creating and maintaining monitoring dashboards and writing alerting rules.
· Understanding of cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (e.g., Kubernetes) is a plus.
· Good communication and teamwork skills.