A career in IBM Software means you’ll be part of a team that transforms our customers’ challenges into solutions.
Seeking new possibilities and always staying curious, we are a team dedicated to creating the world’s leading AI-powered, cloud-native software solutions for our customers. Our renowned legacy creates endless global opportunities for our IBMers, so the door is always open for those who want to grow their career.
IBM’s product and technology landscape includes Research, Software, and Infrastructure. Entering this domain positions you at the heart of IBM, where growth and innovation thrive.
As a Site Reliability Engineer, you will work in an agile, collaborative environment to build, deploy, configure, and maintain systems for the IBM client business. In this role, you will lead the problem resolution process for our clients, from analysis and troubleshooting to deploying the latest software updates and fixes. Working closely with our worldwide teams, you will gain first-hand experience with the latest technologies and be supported by a global team of IBMers as you grow your technical skills and develop your career.
Your primary responsibilities include:
•Infrastructure & Cloud Management:
- Design, build, and manage scalable cloud infrastructure using IBM Cloud, AWS, GCP, and Azure.
- Implement Infrastructure as Code using Terraform.
- Deploy and configure applications using container orchestration platforms like Kubernetes/OpenShift.
•Automation & CI/CD:
- Develop and maintain automation scripts and tools using Python, Groovy, and Ansible.
- Build and manage robust CI/CD pipelines using tools like Jenkins, IBM Continuous Delivery, and ArgoCD.
•System Monitoring & Reliability:
- Monitor health and performance of production systems (24x7 observability).
- Use tools like Instana, Grafana/Prometheus, and New Relic to build alerts and dashboards.
- Troubleshoot and resolve production issues in collaboration with engineering and support teams.
•Security & Compliance:
- Perform regular patching and upgrades, and collaborate with product support to resolve issues.
•Database & Middleware:
- Manage open-source middleware and databases such as PostgreSQL, CouchDB, Redis, Kafka, and Spark.
- Participate in incident response and on-call rotations.
•1-3+ years of experience as a DevOps Engineer or SRE.
•Experience with at least one major public cloud provider or large-scale private/hybrid cloud using container orchestration.
•Experience with a modern configuration management and/or infrastructure management framework (Ansible, Puppet, Chef, Terraform, etc.).
•Production experience with one or more monitoring frameworks (Prometheus, Nagios, etc.).
•Strong scripting skills in at least one language (Bash, Python, Ruby, etc.).
•Experience with source control management tools (Git, Subversion, etc.).
•Familiarity with Kubernetes or OpenShift platforms.
•Good understanding of CI/CD processes and tools (e.g., Jenkins).
•Solid grasp of monitoring, observability, and troubleshooting production environments.
•Hands-on experience with Linux systems administration.
•Excellent collaboration, communication, and problem-solving skills.
•A bachelor’s degree in Computer Science, Information Technology, or a related field is a plus.
•Relevant SRE certifications.
•Monitoring/Observability: knowledge of, or experience in, crafting alerts and dashboards using tools such as Instana, New Relic, and Grafana/Prometheus.