A career in IBM CIO means you’ll be part of a team that transforms IBM's capability to deliver to the marketplace. You will seek new possibilities and remain curious.
IBM is a team dedicated to creating the world’s leading AI-powered, cloud-native software solutions for our customers. Our renowned legacy creates endless global opportunities for our IBMers, so the door is always open for those who want to grow their career.
IBM’s product and technology landscape includes Research, Software, and Infrastructure. Entering this domain positions you at the heart of IBM, where growth and innovation thrive.
As a Senior Site Reliability Engineer, you will lead work alongside your peer SREs in an agile, collaborative, and dynamic team. You will build, deploy, configure, and maintain systems for the IBM Internal Developer Experience. In this role, you will lead the problem resolution process for our developers, from analysis and troubleshooting to deploying the latest software updates and fixes. You will grow your expertise in cloud-native operations and DevSecOps practices.
Your primary responsibilities include:
Lead design and implementation of scalable, secure GitHub Enterprise infrastructure.
Mentor junior SREs and foster a culture of reliability and continuous learning.
Drive automation across deployment, monitoring, and incident response workflows.
Own incident management processes, reducing toil and improving MTTR.
Champion infrastructure as code and CI/CD best practices.
Act as a technical liaison between vendor development and operations teams.
Ensure platform architecture adheres to enterprise security standards and compliance requirements.
Lead efforts in vulnerability management, access control, audit logging, and secure configuration.
Collaborate with security and compliance teams to align with regulatory frameworks.
Collaborate with user groups to ensure issues are resolved and roadmap needs are met.
Operate in a technically demanding and fast-evolving environment.
Be comfortable with occasional shift work and off-hours support to maintain platform reliability.
Embrace continuous learning and adaptability in a global, collaborative team.
Demonstrate ownership, resilience, and a proactive mindset in high-pressure situations.
•SRE Experience: 5-8 years in SRE, DevSecOps, or infrastructure engineering roles.
•System Monitoring and Troubleshooting: Strong skills in monitoring/observability, issue response, and troubleshooting for optimal system performance.
•Automation Proficiency: Proficiency in automation for production environment changes, streamlining processes for efficiency, and reducing toil.
•Linux Proficiency: Strong Linux administration and troubleshooting skills, including host virtualization.
•Operation and Support Experience: Demonstrated experience in handling day-to-day operations, alert management, incident support, migration tasks, and break-fix support.
•Automation scripting: Proficiency in scripting (Python, Bash) and automation tools (Ansible).
•Container platforms: Deep experience with container platforms (Docker, Kubernetes, OpenShift).
•CI/CD design & implementation: Proven track record in the design, management, and maintenance of secure CI/CD pipelines.
•Enterprise platforms: Experience supporting enterprise-grade platforms (e.g., GitHub Enterprise).
•Source code management: Mastery of SCM principles and experience with github.com and the git CLI.
•English: Fluent in written and spoken English.
•Complex custom clustered systems: Experience with complex bare-metal cloud-hosted systems.
•Kubernetes/OpenShift: Experience in working with production Kubernetes/OpenShift environments.
•DevSecOps secure engineering: Familiarity with secure build architecture and with building and operating secure hosting and pipelines.
•Automation/Scripting: In-depth experience with Ansible, Python, Terraform, and CI/CD tools such as Jenkins and GitHub Actions.
•Monitoring/Observability: Hands-on experience crafting alerts and dashboards using tools such as Instana, New Relic, and Grafana/Prometheus.
•Desired tech alignment: Ruby, MySQL, IBM Cloud, Ansible, and Golang.