The IBM Cloud Networking Tribe is looking for a talented, innovative, and enthusiastic software engineering professional who will build the next-generation IaaS to help our customers succeed. The IBM Cloud Networking Tribe has a global cloud presence that continues to grow and expand its reach. Our Network Services engineering team is responsible for delivering virtual network services with top-notch performance, first-rate security, fail-safe reliability, and exceptional quality.
An IBM Cloud Network SRE Engineer will be the key individual responsible for troubleshooting customer issues, building tools to improve reliability, working with senior engineers and architects across teams to produce root cause analyses (RCAs) for issues, and interfacing with customers to understand their use cases and build the tools needed to support them.
We are a global team, so communication skills (both verbal and written) are critical, as is the flexibility to work with team members in other time zones.
As a Site Reliability Engineer, you will work in an agile, collaborative environment to build, deploy, configure, and maintain systems for the IBM client business. In this role, you will lead the problem resolution process for our clients, from analysis and troubleshooting, to deploying the latest software updates & fixes.
Your primary responsibilities include:
•24x7 Observability: Be part of a worldwide team that monitors the health of production systems and services around the clock, ensuring continuous reliability and optimal customer experience.
•Cross-Functional Troubleshooting: Collaborate with engineering teams to provide initial assessments and possible workarounds for production issues. Troubleshoot and resolve production issues effectively.
•Deployment and Configuration: Use Continuous Integration/Continuous Delivery (CI/CD) tools to deploy services and configuration changes at enterprise scale.
•Security and Compliance Implementation: Implement security measures that meet or exceed industry standards and regulations such as GDPR, SOC2, ISO 27001, PCI, HIPAA, and FBA.
•Maintenance and Support: Apply Couchbase security patches and upgrades, support Cassandra and Mongo as part of the pager duty rotation, and collaborate with Couchbase Product support on issue resolution.
• 8+ years of industry experience.
• 8-12 years of relevant experience in development/automation.
• Strong programming/scripting experience in Shell, Perl, Golang, or Python.
• Strong experience in cloud networking, with an understanding of how to troubleshoot latency issues, packet drops, and other network-related issues.
• Ability to scale and learn new areas.
•Cloud Networking: Hands-on experience debugging and troubleshooting networking issues, and good experience with L2/L3 protocols.
•Kubernetes/OpenShift: Experience working with production Kubernetes/OpenShift environments is strongly preferred.
•Automation/Scripting: In-depth experience with Ansible, Python, Terraform, and CI/CD tools such as Jenkins, IBM Continuous Delivery, and ArgoCD.
•Monitoring/Observability: Hands-on experience crafting alerts and dashboards using tools such as Instana, New Relic, and Grafana/Prometheus.