IBM Systems helps IT leaders think differently about their infrastructure. IBM servers and storage are no longer inanimate: they can understand, reason, and learn so our clients can innovate while avoiding IT issues. Our systems power the world’s most important industries and our clients are the architects of the future. Join us to help build our leading-edge technology portfolio designed for cognitive business and optimized for cloud computing.
- Provide technical leadership in the design, development, and maintenance of scalable build systems and deployment pipelines for AI/ML components, setting standards for quality, reliability, and performance.
- Mentor and guide a team of engineers, promoting best practices in C++, Python, CI/CD, and infrastructure automation.
- Design and implement robust build automation systems that support large, distributed AI/C++/Python codebases.
- Develop tools and scripts to enable developers and researchers to rapidly iterate, test, and deploy across diverse environments.
- Integrate C++ components with Python-based AI workflows, ensuring compatibility, performance, and maintainability.
- Lead the creation of portable, reproducible development environments, ensuring parity between development and production systems.
- Maintain and extend CI/CD pipelines for Linux and z/OS, applying best practices in automated testing, artifact management, and release validation.
- Collaborate with cross-functional teams — including AI researchers, system architects, and mainframe engineers — to align infrastructure with strategic and technical goals.
- Proactively monitor and improve build performance, automation coverage, and system reliability, identifying opportunities for innovation and optimization.
- Contribute to internal documentation, process improvements, and knowledge sharing to scale impact across teams and foster a culture of continuous improvement.
- Expert-level programming skills in C++ and Python, with a strong grasp of both compiled and interpreted language paradigms; able to provide architectural guidance and code-level mentorship.
- Demonstrated leadership in building and maintaining complex automation pipelines (CI/CD) using tools like Jenkins or GitLab CI, including the ability to define strategy, review team contributions, and drive implementation.
- In-depth experience with build tools and systems such as CMake, Make, Meson, or Ninja, including development of custom scripts and support for cross-compilation in heterogeneous environments.
- Proven experience leading multi-platform development efforts, particularly on Linux and IBM z/OS, with a deep understanding of platform-specific toolchains, constraints, and performance considerations.
- Expertise in integrating native C++ code with Python using tools like pybind11 or Cython, ensuring high-performance, maintainable interoperability across language boundaries.
- Strong diagnostic and debugging skills, with the ability to lead teams in resolving build-time, runtime, and integration issues in large-scale, multi-component systems.
- Proficiency in shell scripting (e.g., Bash, Zsh) and system-level operations, with the ability to coach others in scripting best practices.
- Familiarity with containerization technologies like Docker, and a track record of leading the adoption or optimization of container-based development and deployment workflows.
- Excellent communication and collaboration skills, with the ability to coordinate across disciplines, align technical efforts with strategic goals, and foster a high-performing engineering culture.
- Working knowledge of AI/ML frameworks such as PyTorch, TensorFlow, or ONNX, with an understanding of how to integrate them into scalable, production-grade environments; able to guide teams in best practices for deployment and optimization.
- Experience developing or maintaining software on IBM z/OS mainframe systems, with the ability to mentor others in navigating legacy-modern hybrid ecosystems.
- Familiarity with z/OS build and packaging workflows, including leading efforts to streamline and modernize tooling where appropriate.
- Solid understanding of system performance tuning in high-throughput compute and I/O environments (e.g., large-scale model training or inference pipelines), and the ability to direct optimization strategies.
- Knowledge of GPU computing and low-level profiling/debugging tools, with experience driving performance-critical initiatives.
- Experience managing long-lifecycle enterprise systems, ensuring forward- and backward-compatibility across releases and deployments through proactive planning and versioning strategies.
- Background contributing to or maintaining open-source projects in infrastructure, DevOps, or AI tooling domains, with a focus on community engagement and sustainability.
- Proficiency in distributed systems, microservice architectures, and REST APIs, including guiding architectural decisions that balance performance, maintainability, and scalability.
- Proven experience leading integration of MLOps pipelines with CI/CD frameworks, ensuring seamless, secure, and automated deployment of AI/ML models into production workflows.
- Exceptional communication and stakeholder management skills, capable of clearly articulating technical strategies and trade-offs to non-technical audiences.
- Demonstrated ability to foster collaboration and alignment across diverse, cross-functional teams, including AI researchers, DevOps engineers, and enterprise architects.
- Track record of ensuring compliance with industry standards, security policies, and best practices in enterprise-scale AI engineering.
- Commitment to maintaining high standards of code quality, performance, and security, with the ability to review and enforce standards across a team or organization.