A career in IBM Software means you'll be part of a team that transforms our customers' challenges into solutions.
Seeking new possibilities and always staying curious? We are a team dedicated to creating the world's leading AI-powered, cloud-native software solutions for our customers. Our renowned legacy creates endless global opportunities for our IBMers, so the door is always open for those who want to grow their career.
As a Site Reliability Engineer, you will specialize in ensuring the reliability and resiliency of our systems. Bringing a unique blend of knowledge and skills in both software and platform systems, you will play a key role in analyzing business needs, identifying and solving problems, designing and building automation tooling, designing and implementing monitoring solutions, deploying and managing production changes, and maintaining well-engineered information systems and ecosystems.
Responsibilities:
Reliability and Resilience:
o Specialize in ensuring the reliability and resiliency of services based on a micro-service architecture, with the goal of delivering a high-availability environment.
Problem Analysis and Resolution:
o Analyze business needs and proactively identify and solve problems to enhance system performance and stability. Troubleshoot failures and execute recovery quickly to minimize impact to customers.
End-to-End Engineering:
o Play a pivotal role in advising, designing, building, testing, deploying, and maintaining well-engineered cloud environments.
Technical Skills/Exposure:
SRE/DevOps experience in a SaaS/cloud environment is strongly desired (Linux administration experience will be considered).
Experience with at least one major public cloud provider or large-scale private/hybrid cloud using container orchestration.
Experience with one or more monitoring/observability tools (Prometheus & Grafana are preferred).
Programming skills in at least one language, preferably Python or Node.js; others, such as Go, Bash, or Ruby, will also be considered.
Experience with source control management (Git, Subversion, etc.).
Professional Experience:
Ability to manage multiple tasks, ensuring that commitments/timelines are met.
Ability to partner with internal stakeholders when designing and building automation to improve operations.
Ability to perform under pressure.
Be growth-minded, goal-oriented, and forward-thinking to provide solutions for complex technical problems.
Fluent in spoken and written English.
This role requires some after-hours on-call work.
Experience with a modern configuration management framework (Ansible, Chef, Puppet).
Experience with deployment (CI/CD) pipeline tools, preferably ArgoCD.
Experience with Infrastructure as Code tools, preferably Terraform.