About the role
The Reliability Engineer ensures the stability of the manufacturing plant, systems health, lifecycle management, and user satisfaction. Prioritizing the reliability, performance, and efficiency of digital capabilities and infrastructure is a must. All employees involved in the development and maintenance of these services must work collaboratively to ensure that they meet and exceed customers' and stakeholders' expectations. The Reliability Engineer plays a crucial role, serving as a bridge between development and operations and taking a proactive approach to prevent downtime, optimize system performance, and rapidly deploy new features and capabilities. The team currently includes a Program Manager, a TECH Delivery Lead, a few experts, a POD of 5 engineers, a few TECH PMs, and a local Digital M&S team. This will change over time as program delivery progresses and the future support model takes shape.
Role Responsibilities
System Reliability and Availability:
- Develop and implement strategies to ensure the high availability and reliability of all technology services.
- Monitor system health, analyze performance metrics, and proactively identify and resolve potential issues.
Incident Management and Resolution:
- Lead incident response efforts to swiftly resolve service disruptions, minimizing customer impact.
- Conduct thorough cross-functional post-mortem analyses and implement preventive measures to avoid future incidents.
Continuous Improvement and Deployment:
- Employ continuous integration and deployment practices to automate the build, test, and deployment processes, enhancing the speed and reliability of technology capabilities.
- Innovate and iterate on processes and tools to improve operational efficiency and system resilience.
Security and Compliance:
- Ensure all systems and processes adhere to industry best practices and to regulatory security and data protection standards (lifecycle management).
- Implement and maintain security measures to safeguard against unauthorized access, data breaches, and other cyber threats.
Collaboration and Communication:
- Work closely with the assigned digital unit, Digital Technology, Digital architects, and business stakeholders to align technical solutions with organizational goals.
- Serve as a subject matter expert in reliability engineering, providing guidance and support to teams across the organization.
Learning and Development:
- Stay abreast of the latest trends and technologies in reliability engineering, cloud computing, and software development practices.
- Foster a culture of learning and continuous improvement within the team and across the organization.
Expected Outcomes:
- Achieve and maintain system uptime and reliability targets defined by organizational objectives and Service Level Agreements (SLAs).
- Reduce the frequency and duration of service incidents and downtime.
- Streamline deployment processes, achieving faster time-to-market for new features and improvements.
- Keep obsolescence under control.
- Enhance system security and ensure compliance with all relevant regulations and standards.
- Cultivate a collaborative, innovative, and learning-oriented environment within the technology department.
Job Requirements
- Ability, knowledge, and experience to define, control, monitor, audit, and apply quality principles and rules in compliance with the overall Company Quality policy.
- Ability, knowledge, understanding, and experience to handle and apply IT security concepts, principles, and rules in compliance with the overall Company Security policy.
- Ability to set up the operational service model based on the ITIL framework and to ensure its delivery, performance, and continuous improvement according to measurable agreements with service stakeholders.
- Ability, knowledge, and experience to understand and handle end-to-end infrastructure processes, operations, ecosystem, and efficiency. Understands how to map infrastructure to cloud and drive infrastructure through automation.
- Ability, knowledge, understanding, and experience in specific Information Technology areas of expertise (VMware, Windows systems, Linux systems, Backup and Storage, Network, …).
- Provision modern infrastructure using a DevOps approach: infrastructure as code and declarative tools (Terraform, Ansible...). Architecture changes are also made through code.
- Use modern application/infrastructure design to simplify operations, such as auto-scaling resources based on app events or deploying self-healing infrastructures.
- Most of the activity is on configuration management using declarative tools, as opposed to programmatic approaches.
- Automate all other tasks, such as ticket resolution, using DevOps practices. This eliminates manual changes: all changes (e.g., configuration) are recorded in a source control tool (e.g., GitHub) and deployed through a CI/CD pipeline (e.g., GitHub Actions).
- Applies knowledge and deep understanding of the Company’s business, industry environment, trends, and market dynamics to contribute to business strategy direction and support its realization through operational excellence.
- Effectively copes with and adapts to change.
- Acts as a cooperative and approachable team player, able to relate well to people at all levels inside and outside the organization; finds common ground to solve problems effectively for the good of all, even in difficult times.
- Understands the transformative power, key roadblocks, challenges, and components of digital technologies and capabilities, as well as their implications both for the Company overall and for the relevant area of expertise. Assesses and provides advice on the use of digital technologies and capabilities in the relevant area of expertise. Applies these skills to solving existing business problems, executing the Company’s roadmap, and creating new opportunities in the relevant area of expertise.
- Is able to have a positive impact on others and to get people to change attitudes, behaviours, or mindsets.
- Looks at a problem from different points of view, develops alternative solutions, and selects the best solution given the problem, the environment, and other parameters.