A career in IBM Software means you’ll be part of a team that transforms our customers’ challenges into solutions.
Seeking new possibilities and always staying curious, we are a team dedicated to creating the world’s leading AI-powered, cloud-native software solutions for our customers. Our renowned legacy creates endless global opportunities for our IBMers, so the door is always open for those who want to grow their career.
IBM’s product and technology landscape includes Research, Software, and Infrastructure. Entering this domain positions you at the heart of IBM, where growth and innovation thrive.
As a Site Reliability Engineering (SRE) Program Director, you will play a pivotal role in leading and driving the SRE program within our organization. You will be responsible for ensuring the reliability, scalability, and performance of the systems and applications that support IBM Software SaaS offerings. The successful candidate will have a strong technical background, exceptional leadership skills, and a proven track record of implementing and optimizing SRE best practices in SaaS environments.
Key Responsibilities:
- Lead the SRE program strategy and execution across multiple SaaS offerings
- Drive reliability engineering practices to ensure high availability and performance of services
- Collaborate with engineering, product, and operations teams to embed SRE principles into the software development lifecycle
- Oversee incident management processes, including root cause analysis and continuous improvement
- Champion automation, observability, and proactive monitoring across systems
- Guide the adoption of container orchestration and infrastructure-as-code practices
- Mentor and grow a high-performing, globally distributed SRE team
Qualifications:
- Proven experience in a leadership role within Site Reliability Engineering, with a focus on supporting SaaS and/or PaaS solutions
- Strong understanding of cloud computing platforms (e.g., IBM Cloud, AWS, Azure, GCP) and infrastructure as code
- In-depth knowledge of system architecture, networking, and security principles
- Strong experience with incident management, post-incident analysis, and root cause analysis in a multi-tenant SaaS context
- Expertise in implementing and managing container orchestration platforms (e.g., Kubernetes) for multi-tenant environments
- Certification in Site Reliability Engineering or a related field
- Excellent communication skills and the ability to collaborate effectively with cross-functional teams
- Demonstrated success in leading SRE transformations within organizations, particularly in the context of SaaS platforms