Leadership – Site Reliability & Platform Architect | Scrabble

Posted on 18-07-2025

Job Description

Site Reliability & Platform Architect
Location: Bangalore
Type: Full-Time
Experience: 7–14 years
Position: Leadership – Site Reliability & Platform Architect
About the Role
We’re scaling a high-performance SaaS platform powering logistics automation for 500+
global enterprises (Unilever, Apple, Koch, etc.). As our systems grow in complexity and
scale, we need a Site Reliability & Platform Architect to lead the next evolution of our
cloud infrastructure, DevOps maturity, and backend platform architecture.
This role blends deep DevOps/SRE expertise with backend architectural thinking — ideal
for someone who understands infrastructure and code, and loves building resilient,
observable, and scalable systems from the ground up.
What you will do
Own infrastructure architecture: Design and evolve cloud-native systems for scalability,
availability, cost efficiency, and security.
Lead backend platform design: Collaborate with product and engineering teams to design
performant, modular, and reliable backend systems.
CI/CD & Deployment Strategy: Build and scale deployment pipelines, optimize rollouts with
blue-green/canary deploys, and ensure smooth delivery processes.
Orchestrate systems: Manage containerized workloads via Kubernetes (EKS/GKE), ECS, or
other orchestration tools.
Observability & Performance: Standardize monitoring, tracing, and logging across systems;
lead capacity planning and performance tuning.
Infrastructure as Code (IaC): Define and maintain scalable infrastructure using tools
like Terraform and Helm.
Mentor & Lead: Guide engineering teams in cloud architecture, system design, and
operational excellence.
Champion reliability: Define SLOs, SLIs, incident response, root cause analysis, and
proactive fault mitigation.
Own system security: Define and enforce best practices for infrastructure and application
security, access control, secrets management, and compliance readiness.
Requirements
4+ years of backend or infrastructure experience in high-scale, production-grade systems.
3+ years of hands-on backend development experience (e.g., Ruby, Node.js, Python, or
Java).
3+ years of system design, API development, and performance optimization.
1+ years in a technical leadership role focused on DevOps/SRE/platform engineering.
Proven experience architecting and running infrastructure on AWS (preferred), GCP, or
Azure.
Deep understanding of cloud-native architecture, microservices, and distributed systems.
Hands-on with Docker, Kubernetes, Terraform, and observability tools (Prometheus,
Grafana, ELK, OpenTelemetry).
Strong programming/scripting skills in Python, Go, or Bash; comfortable reading
production backend code (Ruby/Node).
Experience with relational and NoSQL databases like Postgres, MongoDB, Redis.
A solution-first mindset with the ability to balance technical depth and business
context.
Bonus Points For:
Experience with service mesh, multi-region HA systems, or event-driven architectures
Background in security, compliance, or cost optimization
Ideal Candidate:
Has led backend engineering teams and been deeply involved in designing and scaling core
systems.
Has evolved into owning end-to-end platform reliability, infrastructure architecture, and
DevOps maturity.
Has a strong grasp of backend fundamentals (data modeling, APIs, async jobs, caching) and
knows what it takes to make systems fast, resilient, and observable.
Has spent time in production: debugging incidents, tuning performance, and building systems
that can self-heal and scale.
Brings a mindset that blends builder, architect, and operator: someone who thinks in systems,
works with code, and owns uptime.
Hiring Process
We aim to complete the process within 7–10 days, with structured feedback at each step:
Submit your project write-up (see below)
Introductory Call with the VP of Engineering (30 minutes to 1 hour)
Discuss your experience across backend, DevOps, and platform engineering.
We’ll share more context on our architecture, current challenges, and vision.
Review a relevant architecture project you've previously worked on (for details, see the
Project Write-Up section below).
Technical Deep Dive (1 hour)
A focused discussion on system design, infrastructure architecture, CI/CD, security,
observability, and scaling.
We'll explore your past work and how you make technical decisions under real constraints.
Hands-On Architecture Task (4–5 hours)
Mandatory real-world assignment based on an actual GoComet problem statement, covering
both architecture and infrastructure.
Designed to evaluate your approach to platform reliability, infrastructure design, and
backend architecture.
We'll assess your clarity of thought, decision-making, trade-offs, and ability to design
scalable and secure systems.
Culture & Leadership Conversation with the CTO (1 hour)
Discuss ownership mindset, leadership style, and long-term alignment.
Explore how you collaborate with cross-functional teams and lead technical initiatives at
scale.
Project Write-Up (Mandatory – Used Across All Rounds)
Before the interview process begins, please submit a detailed write-up of the most
challenging or interesting project where you played a key role as an SRE, DevOps
engineer, or platform architect.
Your write-up should cover:
Project Overview: The system/problem you were solving, the scale involved, and the
business context.
Architecture & Infrastructure: System architecture, infra layout
(cloud/services/regions), containerization, orchestration (Kubernetes, ECS, etc.), and
deployment pipelines.
Reliability Engineering: How you addressed fault tolerance, incident response,
performance bottlenecks, or disaster recovery.
Tooling & Automation: Tools you used or built (e.g., Terraform, Prometheus, Datadog,
Jenkins, Helm), why they were chosen, and how they were implemented.
Security & Compliance: How you approached secrets management, infrastructure security,
auditability, and platform resilience.
Impact: Quantifiable improvements in uptime, deployment velocity, cost, observability, or
scale.
Collateral: Include architecture diagrams, Terraform/Helm samples, dashboards, alerting
configs, runbooks, postmortems, or any other supporting material.
Note: This will be used throughout the interview process to evaluate your technical
depth, system thinking, and ownership mindset.