SRE Manager - R30523 | ScaleneWorks INC

Full-time
Posted on November 8, 2025

Job Description

SRE Manager

Company Overview

Not specified

Job Summary

As an SRE Manager for iHotelier, you will lead a team responsible for ensuring the availability, scalability, and performance of mission-critical hospitality services. This role combines technical leadership, operational excellence, and strategic planning to deliver a seamless booking experience for thousands of hotels worldwide. You will define and enforce SRE best practices, drive automation, and partner with cross-functional teams to maintain reliability across iHotelier’s complex ecosystem.

Responsibilities

  • Lead and mentor a global team of SREs, fostering a culture of reliability and continuous improvement.
  • Define and enforce SRE best practices, including error budgets, Service Level Objectives (SLOs), and Service Level Indicators (SLIs).
  • Drive automation initiatives to reduce toil and improve deployment velocity.
  • Oversee incident response, root cause analysis, and post-mortems for iHotelier services.
  • Manage on-call rotations and ensure effective escalation processes.
  • Implement observability frameworks (monitoring, logging, alerting) using Datadog, Grafana, Prometheus, and Splunk.
  • Own Continuous Integration/Continuous Deployment (CI/CD) pipelines and deployment strategies using ArgoCD, Jenkins, and Kubernetes.
  • Ensure compliance with security and privacy standards for hospitality data.
  • Optimize cloud infrastructure (Azure) for cost and performance.
  • Govern ArgoCD/Jenkins workflows, including pull request (PR) and backout PR processes and the prod1/prod1-pci branch patterns.
  • Maintain WLI/runbooks for Kafka lag, URM Router, Email Engine, EQC Provider Booking, Cache Invalidator, and Couchbase maintenance.
  • Collaborate with Research & Development (R&D), DevOps, and Product teams to design resilient architectures.
  • Align with business stakeholders to prioritize reliability improvements.
  • Participate in capacity planning for peak booking periods and ensure operational readiness.
  • Support the monitoring tools currently in production and enhance alerting dashboards for proactive issue detection.

Qualifications

  • Bachelor’s or Master’s degree in Computer Science or related field.
  • 10+ years in software engineering/operations, with 4+ years in SRE leadership.
  • Proven track record managing large-scale distributed systems.
  • Strong knowledge of Linux and Windows operating systems, cloud-native environments, and container orchestration (Kubernetes, Azure AKS).
  • Experience with SLO/SLA management, automation, and operational readiness testing.
  • Hands-on experience with monitoring tools (Datadog, Grafana, Prometheus, Splunk) and incident management platforms (ServiceNow).
  • Familiarity with CI/CD pipelines, infrastructure-as-code (Terraform), and GitOps tools (Flux).
  • Knowledge of networking fundamentals and API performance optimization.

Preferred Skills

  • Experience leading SRE or DevOps teams in a high-availability Software as a Service (SaaS) environment.
  • Familiarity with hospitality systems or booking platforms.
  • Knowledge of Content Delivery Network (CDN) technologies (Akamai, Cloudflare) and containerization (Docker).
  • Strong collaboration and communication skills.

Experience

  • Minimum of 10 years of experience in software engineering/operations.
  • At least 4 years of experience in SRE leadership.

Environment

Not specified

Salary

Not specified

Growth Opportunities

Not specified

Benefits

Not specified