A career in IBM Consulting is rooted in long-term relationships and close collaboration with clients across the globe. You'll work with visionaries across multiple industries to improve the hybrid cloud, data and AI journey for the most innovative and valuable companies in the world. Your ability to accelerate impact and make meaningful change for your clients is enabled by our strategic partner ecosystem and our robust technology platforms across the IBM portfolio. Curiosity and a constant quest for knowledge serve as the foundation for success in IBM Consulting. In your role, you'll be encouraged to challenge the norm, investigate ideas outside of your role, and come up with creative solutions resulting in groundbreaking impact for a wide network of clients. Our culture of evolution and empathy centers on long-term career growth and development opportunities in an environment that embraces your unique skills and experience.
We are seeking a highly skilled and motivated Site Reliability Architect with foundational AI knowledge to join our growing Application Operations team. In this role, you will focus on ensuring system reliability, automating operational processes, and leveraging emerging AI tools to enhance operational efficiency. You will be responsible for implementing SRE best practices, building robust automation solutions, and troubleshooting complex system issues while collaborating with cross-functional teams to maintain highly available and scalable systems. You will work at the intersection of traditional operations engineering, automation, and modern AI-enhanced tooling to build resilient systems that deliver exceptional reliability and performance.
Key Responsibilities:
• Implement and maintain SRE practices including SLIs, SLOs, error budgets, and reliability monitoring
• Design, build, and maintain automation solutions for deployment, monitoring, incident response, and system maintenance
• Troubleshoot complex system issues across distributed environments and implement sustainable solutions
• Develop and maintain observability solutions including monitoring, alerting, logging, and tracing systems
• Reduce toil through scripting, infrastructure as code, and process improvements
• Collaborate with development teams to improve system reliability through design reviews and reliability engineering practices
• Participate in on-call rotations and lead incident response efforts, including post-incident reviews and improvement implementation
• Build and maintain CI/CD pipelines and deployment automation tools
• Leverage AI-enhanced tools and basic machine learning concepts to improve operational insights and automate routine tasks
• Implement capacity planning and performance optimization strategies
• Maintain and improve system security, compliance, and operational governance
• Analyze system performance data and operational metrics to identify trends and improvement opportunities
• Stay current with emerging trends in SRE practices, automation tools, and AI-enhanced operational capabilities
• Ensure systems are designed for scalability, reliability, and maintainability
• Collaborate with operations teams to integrate reliability practices into existing operational workflows
Qualifications:
• 3+ years of experience as a Site Reliability Architect, in DevOps, or in a similar operations role
• Strong problem-solving and analytical troubleshooting skills across complex distributed systems
• Experience with observability platforms (Splunk, Dynatrace, New Relic, Datadog)
• Hands-on experience building automation solutions using scripting languages (Python, Bash, Go, or similar)
• Experience with SRE principles including observability, monitoring, incident management, and reliability practices
• Proficiency with infrastructure as code and configuration management tools (Terraform, Ansible, CloudFormation)
• Experience with containerization technologies (Docker, Kubernetes)
• Knowledge of CI/CD pipelines and deployment automation
• Familiarity with Agile development methodologies
• Basic understanding of AI/ML concepts and interest in leveraging AI tools for operational improvements
• Experience with monitoring and observability platforms (Prometheus, Grafana, ELK stack, or similar)
• Strong understanding of Linux/Unix systems administration
• Experience with cloud platforms (AWS, Azure, GCP) and cloud-native technologies
• Knowledge of networking, security, and system performance optimization
• Experience working with large-scale, distributed systems
• Experience with Ansible, Red Hat OpenShift, and Kubernetes orchestration and management
• Knowledge of incident management platforms and ITSM tools (ServiceNow, PagerDuty, Jira Service Management)
• Experience with database administration and performance tuning
• Familiarity with GitOps practices and tools (ArgoCD, Flux)
• Experience with chaos engineering and reliability testing practices
• Understanding of microservices architecture and service mesh technologies
• Experience with performance testing and capacity planning tools
• Excellent problem-solving and communication skills
• Desire to grow skills and work in a continuous learning environment
• Interest in exploring AI/ML applications for operational use cases