We’re building Astra Serverless, the next generation of distributed, scalable, fault-tolerant, serverless NoSQL data services — powered by Apache Cassandra and extended with native Vector and AI capabilities across multi-cloud environments.
Our customers depend on our platform to serve real-time, mission-critical workloads on a global scale. Ensuring reliability, performance, and correctness under unpredictable workloads is a non-trivial challenge — and that’s where you come in.
As an engineer on the Quality Engineering and Performance team, you’ll develop and evolve the system-level testing frameworks that validate a distributed database-as-a-service under massive, AI-driven workloads. You’ll help ensure that new features, performance improvements, and AI-driven extensions meet the highest standards of scalability and resilience.
Why this role?
You’ll work at the intersection of distributed systems engineering and test architecture, hands-on designing and building automation and frameworks that simulate complex multi-cloud deployments, chaos scenarios, and performance stress conditions.
This is not QA-as-usual: you’ll engineer the test systems that validate an elastic database platform capable of scaling to thousands of heterogeneous nodes, self-healing under failure, and integrating real-time vector search and analytics.
If you thrive on deep technical challenges, bring curiosity and analytical, systems-level thinking, and enjoy building tools other engineers rely on, this role will feel like home.
What you’ll do
- Design and develop frameworks for end-to-end and chaos testing of distributed, serverless Cassandra-based systems.
- Engineer automation that validates data correctness, fault tolerance, and performance across complex multi-region and multi-cloud topologies.
- Collaborate closely with your peers in local and remote feature development teams to model real-world scenarios and integrate automated validation into the delivery pipeline.
- Continuously evolve the test infrastructure for scale, speed, and observability — leveraging Kubernetes, Docker, and cloud-native toolchains.
- Profile and tune distributed workloads to uncover systemic bottlenecks and verify that service-level goals are consistently met.
- Contribute code to shared testing frameworks and participate in design and code reviews across teams.
- Own the full cycle of quality engineering — from test design and execution to insights and continuous improvement.
What you’ll bring
- Experience with system-level Java and Python development for testing distributed or cloud systems: replication, partitioning, consistency, and eventual convergence.
- Eagerness to learn about and apply chaos testing, fault injection, and resilience validation.
- Ability to analyze complex logs and metrics to isolate performance and reliability issues.
- Familiarity with Linux, Kubernetes, Docker, and CI/CD pipelines (Jenkins, GitHub Actions, etc.).
- Familiarity with NoSQL technologies (Cassandra, DynamoDB, ScyllaDB, etc.) and cloud platforms (AWS, GCP, Azure), including multi-cloud topologies.
- Curiosity-driven mindset, strong communication skills, and a focus on collaboration and craftsmanship.
- Understanding of vector search, AI embeddings, or data-intensive workloads.