A career in IBM Consulting is rooted in long-term relationships and close collaboration with clients across the globe. You'll work with visionaries across multiple industries to improve the hybrid cloud and AI journey for the most innovative and valuable companies in the world. Your ability to accelerate impact and make meaningful change for your clients is enabled by our strategic partner ecosystem and our robust technology platforms across the IBM portfolio.
We are seeking a Site Reliability Engineer (SRE) to ensure high availability and performance of our services through proactive monitoring, incident management, and automation. The role requires expertise in tools like Prometheus and Grafana, strong automation skills using Ansible and Terraform, and experience handling system outages effectively. A deep understanding of MQ, system configurations, and cloud operations is essential, along with proficiency in cloud platforms such as IBM Cloud and AWS. You will collaborate with engineering teams to reduce toil, enhance service stability, and drive continuous operational improvement.
Key Responsibilities:
* Ensure high availability, resilience, and performance of MQ services on IBM Cloud/AWS.
* Proactively monitor systems, manage incidents, and lead recovery with root cause analysis.
* Automate operational tasks to reduce toil and improve service efficiency.
* Collaborate with engineering teams to design, deploy, and maintain reliable cloud infrastructure.
* Continuously improve observability, deployment pipelines, and change management practices.
* Deep knowledge of MQ architecture and cloud operations.
* Strong automation skills using Ansible and Terraform.
* Experience with observability tools like Prometheus, Grafana, and Instana.
* Proficiency with at least one programming/scripting language (Python, Go, Node.js, Shell).
* Familiarity with Git and source control workflows.
* Understanding of CI/CD pipelines and cloud-native infrastructure.
* Prior experience as an SRE or DevOps engineer in SaaS or cloud environments (AWS).
* Proven incident management skills, with the ability to perform under pressure.
* Strong problem-solving ability and commitment to service reliability goals.
* Effective collaboration with cross-functional teams to drive automation and operational excellence.
* Strong written and verbal communication skills in English.
* Willingness to participate in on-call support rotation.
* Hands-on experience with Kubernetes/OpenShift in production.
* Experience with ArgoCD or similar GitOps-based deployment tools.
* Infrastructure as Code expertise with Terraform.