A career in IBM Software means you’ll be part of a team that transforms our customers’ challenges into solutions.
Seeking new possibilities and always staying curious, we are a team dedicated to creating the world’s leading AI-powered, cloud-native software solutions for our customers. Our renowned legacy creates endless global opportunities for our IBMers, so the door is always open for those who want to grow their career.
IBM’s product and technology landscape includes Research, Software, and Infrastructure. Entering this domain positions you at the heart of IBM, where growth and innovation thrive.
Your Role and Responsibilities
As a Site Reliability Engineer, you will work in an agile, collaborative environment to build, deploy, configure, and maintain systems for the IBM client business. In this role, you will lead the problem resolution process for our clients, from analysis and troubleshooting, to deploying the latest software updates & fixes.
The Site Reliability Engineering (SRE) team ensures the service is highly available and fully optimized in a 24/7 environment. As an SRE, you will play a crucial role in ensuring the reliability and resiliency of our systems. If you are passionate about optimizing, building automation, solving problems, testing, deploying, and managing highly scalable environments, this is the perfect opportunity for you.
In this role, you will be part of a global SRE team that works closely with our development and product teams to increase the quality and reliability of our products and services, and that deploys and manages Kubernetes clusters on IBM Cloud and other cloud platforms (AWS, Azure). As an SRE, you must be willing to work in a fast-paced cloud environment, share rotational on-call duty coverage with the global Ops team, and support the back-end cloud infrastructure components.
Key Responsibilities:
- Maintain highly available products and services on the cloud
- Identify issues and drive them to resolution while ensuring minimal downtime
- Monitor health and performance of production systems
- Automate repetitive tasks using scripts and tools to reduce manual intervention
- Collaborate with development teams to roll out new services and ensure their stability and reliability
- Improve operational practices to ensure efficiency and innovation
- Share knowledge, ideas, and solutions with the global team
Qualifications:
- Experience with Linux system administration
- Understanding of containerization technologies
- Experience with maintaining Kubernetes-based applications on cloud infrastructure
- General scripting and automation skills in at least one language or tool (Bash, Python, Go, Jenkins, Ansible)
- Familiarity with the usage of one or more Cloud Platforms (IBM Cloud, Amazon Web Services, Microsoft Azure)
- Strong debugging and problem-solving skills
- Passion for building and maintaining reliable and resilient systems
- Basic understanding of networking
- Understanding of cloud storage and networking
- Experience with Infrastructure as Code
- Experience with any source version control system
- Experience with observability (Prometheus, Grafana, Sysdig, etc.)