SRE Manager - R30523 | ScaleneWorks INC
Full-time
Posted on November 8, 2025
Job Description
Job Summary
As an SRE Manager for iHotelier, you will lead a team responsible for ensuring the availability, scalability, and performance of mission-critical hospitality services. This role combines technical leadership, operational excellence, and strategic planning to deliver a seamless booking experience for thousands of hotels worldwide. You will define and enforce SRE best practices, drive automation, and partner with cross-functional teams to maintain reliability across iHotelier’s complex ecosystem.
Responsibilities
- Lead and mentor a global team of SREs, fostering a culture of reliability and continuous improvement.
- Define and enforce SRE best practices, including error budgets, Service Level Objectives (SLOs), and Service Level Indicators (SLIs).
- Drive automation initiatives to reduce toil and improve deployment velocity.
- Oversee incident response, root cause analysis, and post-mortems for iHotelier services.
- Manage on-call rotations and ensure effective escalation processes.
- Implement observability frameworks (monitoring, logging, alerting) using Datadog, Grafana, Prometheus, and Splunk.
- Own Continuous Integration/Continuous Deployment (CI/CD) pipelines and deployment strategies using ArgoCD, Jenkins, and Kubernetes.
- Ensure compliance with security and privacy standards for hospitality data.
- Optimize cloud infrastructure (Azure) for cost and performance.
- Govern ArgoCD/Jenkins workflows, including the PR and backout-PR process and the prod1/prod1-pci branch patterns.
- Maintain WLI/runbooks for Kafka lag, URM Router, Email Engine, EQC Provider Booking, Cache Invalidator, and Couchbase maintenance.
- Collaborate with Research & Development (R&D), DevOps, and Product teams to design resilient architectures.
- Align with business stakeholders to prioritize reliability improvements.
- Participate in capacity planning for peak booking periods and ensure operational readiness.
- Support monitoring tools currently in production and enhance alert dashboards for proactive detection.
Qualifications
- Bachelor’s or Master’s degree in Computer Science or related field.
- 10+ years in software engineering/operations, with 4+ years in SRE leadership.
- Proven track record managing large-scale distributed systems.
- Strong knowledge of Linux and Windows operating systems, cloud-native environments, and container orchestration (Kubernetes, Azure Kubernetes Service (AKS)).
- Experience with SLO/SLA management, automation, and operational readiness testing.
- Hands-on experience with monitoring tools (Datadog, Grafana, Prometheus, Splunk) and incident management platforms (ServiceNow).
- Familiarity with CI/CD pipelines, infrastructure-as-code (Terraform), and GitOps tools (Flux).
- Knowledge of networking fundamentals and API performance optimization.
Preferred Skills
- Experience leading SRE or DevOps teams in a high-availability Software as a Service (SaaS) environment.
- Familiarity with hospitality systems or booking platforms.
- Knowledge of Content Delivery Network (CDN) technologies (Akamai, Cloudflare) and containerization (Docker).
- Strong collaboration and communication skills.
Experience
- Minimum of 10 years of experience in software engineering/operations.
- At least 4 years of experience in SRE leadership.