Site Reliability Engineer - Never Install | Scrabble

full-time

Posted on 03-10-2025

Job Description

Site Reliability Engineer (SRE)

Job Summary

Join our SRE team to improve the reliability, observability, and incident response capabilities of our Virtual Desktop Infrastructure (VDI) platform. You will work across our entire stack, from streaming servers to cloud orchestration, to ensure enterprise-grade reliability and performance.

Responsibilities

Design and implement comprehensive monitoring and alerting systems.
Build observability for our distributed architecture including streaming servers, microservices, and orchestration.
Respond to and resolve service outages with a focus on rapid recovery.
Create runbooks, incident response procedures, and post-mortem processes.
Implement Service Level Indicator (SLI)/Service Level Objective (SLO) frameworks for enterprise customer Service Level Agreement (SLA) compliance.
Monitor and optimize performance across multiple cloud providers (Azure, OCI).
Build automation for deployment, scaling, and recovery processes.
Design disaster recovery and business continuity procedures.
Collaborate with engineering teams to improve system reliability and reduce Mean Time to Recovery (MTTR).
Implement chaos engineering and reliability testing practices.

Qualifications

3+ years of experience in Site Reliability Engineering (SRE), DevOps, or production systems.
Strong experience with monitoring tools (e.g., Prometheus, Grafana, ELK, or similar).
Experience with incident management and on-call responsibilities.
Knowledge of distributed systems reliability patterns and practices.
Understanding of cloud platforms (Azure, AWS, GCP, OCI) and their monitoring tools.
Proficiency in Kubernetes, container orchestration, and microservices.
Strong scripting and automation skills (e.g., Python, Go, Bash, or similar).
Excellent troubleshooting and debugging skills across the full stack.

Preferred Skills

Experience with enterprise SLA management and reporting.
Knowledge of streaming protocols, real-time systems, or VDI platforms.
Experience with multi-cloud architectures and failover strategies.
Understanding of network protocols and performance optimization.
Background in high-availability systems or financial services.
Familiarity with infrastructure as code and GitOps practices.
Knowledge of security monitoring and compliance frameworks.

Experience

At least 3 years of relevant experience in SRE or related fields.

Environment

Location: Bengaluru (In-office preferred)
