Expoint - all jobs in one place

IBM Site Reliability Engineer 
United States, California, San Jose 
562899057

06.05.2024

Your Role and Responsibilities
Automation: Develop and maintain automation tools and scripts to streamline deployment, monitoring, and management of the infrastructure and/or services.
Documentation: Maintain up-to-date documentation for the infrastructure, processes, and procedures.
Collaboration: Work closely with development teams, product managers, and other stakeholders to understand requirements and ensure the reliability of the platform.
Continuous Improvement: Participate in post-incident reviews, retrospectives, and other forums to identify areas for improvement and drive continuous improvement initiatives.

Required Technical and Professional Expertise

  • Experience with Cloud Platforms: Strong experience with cloud platforms such as AWS, Azure, or Google Cloud Platform, including expertise in deploying and managing services in these environments.
  • Managing and troubleshooting containerized applications.
  • Automation and Scripting: Strong scripting skills (e.g., Python, Bash) and experience with configuration management tools (e.g., Ansible, Chef, Puppet) to automate deployment and management tasks.
  • Troubleshooting and Problem Solving: Strong troubleshooting skills and the ability to quickly identify and resolve complex issues in a production environment, including experience with incident response and post-incident analysis.


Preferred Technical and Professional Expertise

  • DevOps Culture: Experience working in a DevOps culture and mindset, including a strong understanding of the collaboration between development and operations teams to achieve business goals.
  • Container Orchestration: Proficiency in container orchestration tools such as Kubernetes and OpenShift, including experience in deploying, managing, and troubleshooting containerized applications.
  • Monitoring and Logging: Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack) to monitor the health and performance of infrastructure and applications.
  • Experience with Scalable Architectures: Experience designing and implementing scalable architectures for cloud-based applications, including knowledge of best practices for scalability, performance, and reliability.
  • Experience with Monitoring and Observability: Experience with advanced monitoring and observability practices, including using tools such as Prometheus, Grafana, and Kubernetes-native monitoring solutions to gain insights into system performance and behavior.