The System Engineer combines software development and system engineering to build and run
distributed solutions in a secured multi-tier heterogeneous environment to safeguard, provide, and
continuously improve the software and systems behind the organization’s cloud platform solutions.
Job Description
• Keep a vigilant eye on the availability, latency, performance, and capacity of these systems.
Ultimately, you will view software as the primary tool for optimizing systems, building infrastructure,
and removing mundane work through automation.
• As part of the Cloud Engineering Team, the SRE Engineer engages in and improves the full lifecycle of
cloud platform solutions, from design and deployment through operation and refinement, with accuracy
and in compliance with organization policies and security requirements.
• The SRE Engineer treats operations as a software problem and therefore will code to automate
repetitive tasks and optimize cloud operations.
• Support services before they go live through activities like system design consulting, developing
software platforms, and launch reviews. Maintain post-live cloud operations by measuring and
monitoring availability, latency, and overall system health, taking prompt remediation action where needed.
• Scale sustainably through mechanisms like automation, and evolve services and solutions on IaaS,
CaaS, and PaaS by pushing for changes that improve reliability and velocity.
• Deploy product updates as required and implement integrations as the need arises. Specify,
document, and develop new product features, and write automation scripts.
• Work with open-source technologies, CI/CD and SCM tools, and source control such as Bitbucket
as necessary, and implement containers for the organization (e.g. Docker and Kubernetes). Stay current
with industry trends and propose new ways to improve the business.
• Takes accountability for considering business and regulatory compliance risks and takes appropriate
steps to mitigate them.
• Maintains awareness of industry trends on regulatory compliance, emerging threats, and technologies
to understand the risk and better safeguard the company.
• Highlights any potential concerns or risks and proactively shares risk management best practices.
Job Scope and Responsibilities
• Serve as a primary point of responsibility for the overall health and performance of the VMware
Cloud Foundation platform.
• Function well in a fast-paced, rapidly changing environment where issues must be resolved as
they arise
• Experience with VMware virtualization is a MUST (vSphere, NSX-T, vSAN, VCF, vROps, vRNI, vRLI)
• Experience using vROps, vRNI, and vRLI for troubleshooting and incident analysis
• Understanding of NSX-T configuration and of using NSX-T for incident troubleshooting
• Knowledge of, and ability to use, the NSX-T Load Balancer
• Knowledge of renewing certificates in NSX-T
• Able to configure and use hardware alerts with the available tools (VMware/HP/Palo Alto)
• Able to understand and use VM functions, datastores, and backup applications
• Knowledge and understanding of storage functions in VMs, and the ability to manage the allocation
and distribution of presented storage (e.g. vSAN), is required
• Prior experience with any one of the following cloud platforms: vCA, AWS, or Azure
• Run the production environment by monitoring availability and taking a holistic view of system health
• Measure and optimize system performance to push our capabilities forward, get ahead of
internal customer needs, and innovate for continual improvement
• Experience with general performance tuning and optimization of all aspects of platforms and services
(systems, network).
• Gather and analyze metrics from operating systems as well as applications to assist in performance
tuning and fault finding (via vROps, vRLI, vRNI)
• Enforce best practices for metrics gathering, monitoring, and alerting
• Participate in platform management, capacity planning, and incident recovery
• Provide network administration and troubleshooting via vROps and NSX-T
• Perform deep dives into both systemic and latent reliability issues
• Create sustainable systems and services through automation and uplift
• Networking knowledge is a plus
• Willing to work during and outside working hours when required (standby)