- The IBM Cloud Site Reliability Engineering (SRE) team provides infrastructure and operations solutions that keep cloud-based software infrastructure scalable, highly reliable, and highly secure. This enables our clients to meet their on-demand IT and security needs and to disrupt their industries (Financial, Manufacturing, Insurance, and more).
- Above all, we are looking for applicants who desire creative freedom and who will thrive in an open, vibrant, flexible, and collaborative environment.
As a Site Reliability Engineer, you will work in an agile, collaborative environment to build, deploy, configure, and maintain systems for the IBM client business. In this role, you will lead the problem resolution process for our clients, from analysis and troubleshooting to deploying the latest software updates and fixes.
Your primary responsibilities include:
• 24x7 Observability: Be part of a worldwide team that monitors the health of production systems and services around the clock, ensuring continuous reliability and an optimal customer experience.
• Cross-Functional Troubleshooting: Collaborate with engineering teams to provide initial assessments and possible workarounds for production issues. Troubleshoot and resolve production issues effectively.
• Deployment and Configuration: Use Continuous Integration/Continuous Delivery (CI/CD) tools to deploy services and configuration changes at enterprise scale.
• Security and Compliance Implementation: Implement security measures that meet or exceed industry standards and regulations such as GDPR, SOC 2, ISO 27001, PCI, HIPAA, and FBA.
• Maintenance and Support: Apply Couchbase security patches and upgrades, support Cassandra and MongoDB on the pager-duty rotation, and collaborate with Couchbase product support on issue resolution.
This is a shift-rotation position: you will work either a Sunday-to-Thursday or a Tuesday-to-Saturday rotation.
- Design, develop, and own tooling and automation that monitor and improve the availability, scalability, latency, and efficiency of highly secure, confidential computing cloud services.
- Deploy and manage infrastructure and services in IBM’s Cloud ecosystem.
- During your workday, as part of a global team using a follow-the-sun model, you will handle both real-time alerts and customer-reported problems.
- Participate in scrums, sprint planning, and retrospectives; be an active member of the team and provide feedback and improvement ideas.
- Work collaboratively with the extended IBM teams, learn new technologies, and apply the skills you learn.
- Respond with urgency to incidents, perform root cause analysis, and build a knowledge base to enable sharing with other teams.
- Bachelor's Degree in Computer Science or related field
- Experience using Linux, GitHub, Bash, Python, Node.js, Docker, Kubernetes, and Ansible
- Experience developing tests and reliable automation for common, repeated tasks
- Demonstrated experience with REST APIs and automation
- Proficiency in cloud computing and services, specifically logging and monitoring
- Strong debugging, problem determination, and isolation skills
- Ability to communicate effectively with global, cross-functional teams and customers
- Team player who works collaboratively, innovates, and learns quickly