A career in IBM Software means you'll be part of a team that transforms our customers’
challenges into solutions.
Seeking new possibilities and always staying curious? We are a team dedicated to
creating the world's leading AI-powered, cloud-native software solutions for our
customers. Our renowned legacy creates endless global opportunities for our IBMers,
so the door is always open for those who want to grow their careers.
As a Site Reliability Engineer, you will specialize in ensuring the reliability and resiliency of our systems. Bringing a unique blend of knowledge and skills in both software and platform systems, you will play a key role in analysing business needs, identifying and solving problems, designing and building automation tooling, designing and implementing monitoring solutions, deploying and managing production changes, and maintaining well-engineered information systems and ecosystems.
- Write correct, clean code, consistently following stated best practices.
- Maintain a global anycast network as Infrastructure as Code (IaC). Own global routing, including maintaining appropriate peering and Internet Exchange (IX) relationships, and maintain BGP policies that constrain route advertisements as necessary to control network latency.
- Take responsibility for codebase development, including working in other areas of the codebase.
- Participate in problem solving during outages or escalated situations, Severity 1 through Severity 3 (SEV1-3).
- Participate in the postmortem process to identify root causes.
- Provide on-call support for your area, including multiple systems.
- Develop small-to-medium features from technical design through completion.
- Anticipate and incorporate interruptions (incident response, bug squashing) and remain on track with project plans.
- Follow standard SRE processes, such as maintaining SLOs and SLIs. Reduce toil where possible and increase the reliability of our systems and networks.
- Maintain physical (bare-metal) infrastructure in multiple data centres worldwide. Maintain cloud infrastructure, including any network attachments, as Infrastructure as Code (IaC) using Terraform or similar technologies.
- Proficient in writing clean code following best practices.
- Experience with Infrastructure as Code (IaC).
- Working knowledge of BGP and global routing policies.
- Familiarity with Internet Exchange peering and related practices.
- Experience maintaining network infrastructure across global data centres.
- Ability to provide on-call support and respond to high-priority incidents.
- Experience with incident response, including root cause analysis and postmortem processes.
- Ability to develop and complete small-to-medium technical features.
- Experience using Terraform or equivalent tools for cloud infrastructure management.
- Solid understanding of cloud network architectures and network attachments.
- Strong collaboration and debugging skills in large, distributed systems.
- Experience in cross-functional codebases beyond networking (e.g., application or platform layers).
- Ability to balance interruptions (e.g., incidents, bugs) while delivering project work.
- Prior experience working in an SRE or DevOps environment.
- Exposure to observability practices (monitoring, alerting, metrics collection).