Site Reliability & Platform Architect<br /> Location: Bangalore<br /> Type: Full-Time<br /> Experience: 7–14 years<br /> Position: Leadership – Site Reliability & Platform Architect<br /> About the Role<br /> We’re scaling a high-performance SaaS platform powering logistics automation for 500+ global enterprises (Unilever, Apple, Koch, etc.). As our systems grow in complexity and scale, we need a Site Reliability & Platform Architect to lead the next evolution of our cloud infrastructure, DevOps maturity, and backend platform architecture.<br /> This role blends deep DevOps/SRE expertise with backend architectural thinking. It is ideal for someone who understands both infrastructure and code, and loves building resilient, observable, and scalable systems from the ground up.<br /> What You Will Do<br /> Own infrastructure architecture: Design and evolve cloud-native systems for scalability, availability, cost efficiency, and security.<br /> Lead backend platform design: Collaborate with product and engineering teams to design performant, modular, and reliable backend systems.<br /> CI/CD & Deployment Strategy: Build and scale deployment pipelines, optimize rollouts with blue-green and canary deployments, and ensure smooth delivery processes.<br /> Orchestrate systems: Manage containerized workloads via Kubernetes (EKS/GKE), ECS, or other orchestration tools.<br /> Observability & Performance: Standardize monitoring, tracing, and logging across systems; lead capacity planning and performance tuning.<br /> Infrastructure as Code (IaC): Define and maintain scalable infrastructure using tools like Terraform and Helm.<br /> Mentor & Lead: Guide engineering teams in cloud architecture, system design, and operational excellence.<br /> Champion reliability: Define SLOs and SLIs, and establish practices for incident response, root cause analysis, and proactive fault mitigation.<br /> Own system security: Define and enforce best 
practices for infrastructure and application security, access control, secrets management, and compliance readiness.<br /> Requirements<br /> 4+ years of backend or infrastructure experience in high-scale, production-grade systems.<br /> 3+ years of hands-on backend development experience (e.g., Ruby, Node.js, Python, or Java).<br /> 3+ years of experience in system design, API development, and performance optimization.<br /> 1+ years in a technical leadership role focused on DevOps/SRE/platform engineering.<br /> Proven experience architecting and running infrastructure on AWS (preferred), GCP, or Azure.<br /> Deep understanding of cloud-native architecture, microservices, and distributed systems.<br /> Hands-on experience with Docker, Kubernetes, Terraform, and observability tools (Prometheus, Grafana, ELK, OpenTelemetry).<br /> Strong programming/scripting skills in Python, Go, or Bash; comfortable reading production backend code (Ruby/Node).<br /> Experience with relational and NoSQL databases such as Postgres, MongoDB, and Redis.<br /> A solution-first mindset with the ability to balance technical depth and business context.<br /> Bonus Points For:<br /> Experience with service mesh, multi-region HA systems, or event-driven architectures.<br /> Background in security, compliance, or cost optimization.<br /> Ideal Candidate:<br /> Has led backend engineering teams and been deeply involved in designing and scaling core systems.<br /> Has evolved into owning end-to-end platform reliability, infrastructure architecture, and DevOps maturity.<br /> Has a strong grasp of backend fundamentals (data modeling, APIs, async jobs, caching) and knows what it takes to make systems fast, resilient, and observable.<br /> Has spent time in production: debugging incidents, tuning performance, and building systems that can self-heal and scale.<br /> Has a mindset that blends builder, architect, and operator: someone who thinks in systems, works with code, and owns 
uptime.<br /> Hiring Process<br /> We aim to complete the process within 7–10 days, with structured feedback at each step:<br /> Submit your project write-up (see below)<br /> Introductory Call with the VP of Engineering (30 minutes to 1 hour)<br /> Discuss your experience across backend, DevOps, and platform engineering.<br /> We’ll share more context on our architecture, current challenges, and vision.<br /> Review a relevant architecture project you have previously worked on (see the Project Write-Up section below for details).<br /> Technical Deep Dive (1 hour)<br /> A focused discussion on system design, infrastructure architecture, CI/CD, security, observability, and scaling.<br /> We'll explore your past work and how you make technical decisions under real constraints.<br /> Hands-On Architecture Task (4–5 hours)<br /> A mandatory real-world assignment based on an actual GoComet problem statement, covering both architecture and infrastructure.<br /> Designed to evaluate your approach to platform reliability, infrastructure design, and backend architecture.<br /> We'll assess your clarity of thought, decision-making, handling of trade-offs, and ability to design scalable and secure systems.<br /> Culture & Leadership Conversation with the CTO (1 hour)<br /> Discuss ownership mindset, leadership style, and long-term alignment.<br /> Explore how you collaborate with cross-functional teams and lead technical initiatives at scale.<br /> Project Write-Up (Mandatory – Used Across All Rounds)<br /> Before the interview process begins, please submit a detailed write-up of the most challenging or interesting project where you played a key role as an SRE, DevOps engineer, or platform architect.<br /> Your write-up should cover:<br /> Project Overview: The system/problem you were solving, the scale involved, and the business context.<br /> Architecture & Infrastructure: System architecture, infrastructure layout (cloud/services/regions), containerization, 
orchestration (Kubernetes, ECS, etc.), and deployment pipelines.<br /> Reliability Engineering: How you addressed fault tolerance, incident response, performance bottlenecks, or disaster recovery.<br /> Tooling & Automation: Tools you used or built (e.g., Terraform, Prometheus, Datadog, Jenkins, Helm), why they were chosen, and how they were implemented.<br /> Security & Compliance: How you approached secrets management, infrastructure security, auditability, and platform resilience.<br /> Impact: Quantifiable improvements in uptime, deployment velocity, cost, observability, or scale.<br /> Collateral: Include architecture diagrams, Terraform/Helm samples, dashboards, alerting configs, runbooks, postmortems, or any other supporting material.<br /> Note: This write-up will be used throughout the interview process to evaluate your technical depth, system thinking, and ownership mindset.