Working in IBM Cloud gives you the platform to learn, develop, and use your skills every day on the latest cloud-related technology products and services. You'll be working in an environment where we understand that we thrive best when we play to our strengths. That's why developing our people is key to our success, and why the door is always open for those ready to advance their careers.
Curiosity and courageous thinking are both vital when working in IBM Cloud, as we stay committed to remaining at the forefront of cloud technology. Our renowned legacy means we are leading the way in everything from analytics and security through to unmatched hardware and software designs. We provide our clients with full end-to-end transformation as we build IBM's next-generation cloud platform, which is focused on delivering performance and predictability at a global scale.
IBM's product and technology landscape includes Research, Software, and Infrastructure. Entering this domain positions you at the heart of IBM, where growth and innovation thrive.
As a Site Reliability Engineer, you'll serve as a leader in defining solutions for clients.
You'll identify insights and tasks that can be automated. You'll have the opportunity to find points of improvement in technical processes and propose new ways to carry them out through automation, help our customers resolve their pain points, and, through co-creation, define solutions that improve the efficiency of their operations.
Your primary responsibilities include:
Strategic Design and Analysis of Distributed Systems: Design, analyze, and troubleshoot large-scale distributed systems.
Proactive Reliability Management and Incident Response: Participate in the on-call rotation, engage with product teams to fix production outages, and carry forward action items to improve ongoing reliability.
Empowering Tools and Automation for Enhanced Reliability: Develop effective tooling, alerts, and responses to both identify and address reliability risks, including automatic problem detection and mitigation.
- 2+ years of experience developing or operating complex cloud-scale application or infrastructure environments.
- Proven experience with operating systems: RHEL, CentOS Linux, and Windows Server
- Hands-on experience with container technologies: Kubernetes, Docker, etc.
- Working knowledge of one or more virtualization technologies: Citrix Hypervisor, VMware vSphere, Ubuntu KVM, etc.
- Hands-on experience building automation: Bash, PowerShell, Python, or Go
- Working knowledge of one or more key infrastructure tools/products: Active Directory, Ansible, Chef, etc.
- Working knowledge of monitoring technologies: Zabbix, Splunk, etc.
- Working knowledge of network and storage technologies
- Working knowledge of ServiceNow, JIRA, Confluent, and GitHub
- Experience with Message Queues
- Experience with PostgreSQL/MySQL databases and NoSQL databases
- Ready to work in shifts