Site Reliability Engineer | Scrabble & Jigsaw
Job Description
Senior Site Reliability Engineer
Infrastructure for AI-Driven Mobile Test Automation
About Drizz
Drizz (drizz.dev) is a Vision AI-powered mobile test automation platform serving Fortune 100 enterprises. We use vision-language models, computer vision, and automation agents to make mobile testing autonomous. No selectors, no scripts, no maintenance.
Our crawler engine navigates iOS, Android, and mobile web applications, discovers user journeys, and converts them into executable end-to-end tests. The platform uses vision models and automation to interact with applications similarly to how a human tester would.
Enterprise clients run thousands of automated test sessions across real mobile devices every day on the Drizz platform.
Why This Role Exists
The Drizz platform runs large-scale distributed systems that coordinate automated test execution across multiple device cloud providers such as Genymotion, BrowserStack, LambdaTest, and Appetize. These systems include asynchronous worker pipelines, event-driven execution flows, AI inference pipelines, and distributed storage systems.
As we scale from pilot customers to production deployments across Fortune 100 environments, we need someone who owns the reliability, observability, and operational excellence of this infrastructure.
This role exists to ensure the platform remains reliable, observable, and efficient as the system scales.
What You’ll Own
Infrastructure Reliability
• Monitoring, alerting, and incident response for distributed systems composed of services, asynchronous workers, and event-driven execution pipelines
• Capacity planning and autoscaling for large numbers of concurrent test executions
• Designing resilience and failover strategies when external device providers experience outages
• Defining and improving service reliability targets including SLIs, SLOs, and error budgets
• Improving system stability across infrastructure and execution pipelines
Observability and Debugging
• Instrumenting telemetry across execution steps, perception pipelines, and agent workflows
• Tracking system behavior across asynchronous execution paths and distributed workers
• Building tools, agents, and workflows that clearly explain why a test execution failed, without manual log digging
• Enabling session replay and debugging from execution checkpoints
Location: Bangalore, India
Type: Full-time
Team: Engineering
Reports to: CTO
Cost and Performance Optimization
• Optimizing model inference usage and multi-model execution costs
• Managing device cloud utilization and session lifecycle efficiency
• Profiling latency across distributed execution pipelines
• Improving cost efficiency of large-scale test runs
Platform Reliability Engineering
• Improving reliability of CI/CD systems and deployment pipelines
• Managing infrastructure across cloud environments including multi-cloud and hybrid/on-prem deployments
• Operating and improving database reliability, including query performance, migrations, backups, and recovery
• Strengthening operational security and reliability for enterprise deployments
Technical Context
The systems you will work on have characteristics that differ from traditional test infrastructure.
Distributed execution systems
Test sessions are coordinated through distributed systems that manage large numbers of concurrent execution workers interacting with mobile devices.
Asynchronous processing pipelines
Most execution steps are processed through asynchronous worker pipelines to support high concurrency and parallel test execution.
Event-driven execution flows
Execution events propagate through event-driven pipelines that coordinate workflows, recovery flows, and result processing.
AI inference in the execution flow
Execution flows involve vision-language model calls, so inference latency, availability, and cost directly affect system performance.
Enterprise workloads
Drizz is used by large engineering organizations that expect strong reliability, auditability, and operational clarity, often across multi-cloud or hybrid infrastructure environments.
What We’re Looking For
• 3+ years of experience in Site Reliability Engineering, infrastructure engineering, or platform engineering
• Experience operating production distributed systems
• Experience with asynchronous processing systems and event-driven architectures
• Strong experience with cloud infrastructure such as AWS, GCP, or similar platforms
• Familiarity with containerized services and distributed worker systems
• Experience operating production databases and storage systems (PostgreSQL, MySQL, DynamoDB, Redis, or similar)
• Experience building monitoring, alerting, and observability systems
• Experience working with infrastructure as code tools such as Terraform or Pulumi
• Experience defining service reliability targets such as SLIs, SLOs, and error budgets
• Experience handling incidents, running postmortems, and improving systems after production failures
• Ability to read and instrument application code (Python, Node.js, or similar)
• Strong focus on reliability, operational clarity, and cost efficiency
Even Better If You Have
• Experience operating large-scale workflow orchestration or distributed job systems
• Experience with AI or ML inference infrastructure
• Familiarity with mobile device cloud platforms
• Experience running high-concurrency systems
• Experience working in fast-moving engineering teams or startup environments
Why Drizz
You will help design and operate the reliability foundation of a platform that runs thousands of automated test sessions across real mobile devices every day.
The observability systems you design will determine how easily engineers understand and debug automated test runs. The reliability improvements you implement will determine whether large organizations trust Drizz in their CI pipelines.
You will work directly with the founding team and have significant ownership over the reliability and infrastructure of the platform as it scales.
drizz.dev
