Your Role and Responsibilities
Automation: Develop and maintain automation tools and scripts to streamline deployment, monitoring, and management of the infrastructure and/or services.
Documentation: Maintain up-to-date documentation for the infrastructure, processes, and procedures.
Collaboration: Work closely with development teams, product managers, and other stakeholders to understand requirements and ensure the reliability of the platform.
Continuous Improvement: Participate in post-incident reviews, retrospectives, and other forums to identify areas for improvement and drive continuous improvement initiatives.
Required Technical and Professional Expertise
- Experience with Cloud Platforms: Strong experience with cloud platforms such as AWS, Azure, or Google Cloud Platform, including expertise in deploying and managing services in these environments and in managing and troubleshooting containerized applications.
- Automation and Scripting: Strong scripting skills (e.g., Python, Bash) and experience with configuration management tools (e.g., Ansible, Chef, Puppet) to automate deployment and management tasks.
- Troubleshooting and Problem Solving: Strong troubleshooting skills and the ability to quickly identify and resolve complex issues in a production environment, including experience with incident response and post-incident analysis.
Preferred Technical and Professional Expertise
- DevOps Culture: Experience working in a DevOps culture and mindset, including a strong understanding of the collaboration between development and operations teams to achieve business goals.
- Container Orchestration: Proficiency in container orchestration tools such as Kubernetes and OpenShift, including experience in deploying, managing, and troubleshooting containerized applications.
- Monitoring and Logging: Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack) to monitor the health and performance of infrastructure and applications.
- Experience with Scalable Architectures: Experience designing and implementing scalable architectures for cloud-based applications, including knowledge of best practices for scalability, performance, and reliability.
- Experience with Monitoring and Observability: Experience with advanced monitoring and observability practices, including using tools such as Prometheus, Grafana, and Kubernetes-native monitoring solutions to gain insights into system performance and behavior.