Position title

Lead Cloud Engineer / Lead Site Reliability Engineer (SRE)

Description

About Us

ITOrizon is a global consulting and technology company that helps enterprises design,
implement, and optimize complex supply chain and digital transformation initiatives.
Headquartered in Atlanta, USA, with offices in India (Bengaluru) and the UAE (Sharjah), we
partner with global clients across retail, manufacturing, logistics, and distribution sectors.

We combine deep domain expertise with modern technology to deliver practical, scalable
solutions. Our teams work across strategy, implementation, and managed services, helping
organizations adopt leading platforms such as Oracle, Manhattan, Blue Yonder, and our
next-generation composable enterprise platform, Karolium.

Role Overview

As the Lead Cloud Engineer / SRE, you’ll be responsible for the reliability, scalability, and
performance of our cloud-native platforms. You’ll lead proactive reliability engineering (PRE)
and SRE practices, manage multi-cloud infrastructure, and ensure high availability across our
SaaS and PaaS offerings. Expertise in Kubernetes and hands-on experience managing K8s
clusters are non-negotiable.

Responsibilities

Own production uptime, reliability, and performance across SaaS and PaaS platforms
Lead PRE/SRE initiatives including incident response, root cause analysis, and postmortems
Design and implement robust monitoring, alerting, and observability systems
Manage and optimize Kubernetes clusters across multiple cloud environments
Oversee cloud infrastructure across AWS and Azure (both required) and GCP (good to have)
Drive patching, upgrades, and disaster recovery planning and execution
Collaborate with DevOps, Engineering, and Security teams to ensure operational excellence
Define and enforce SLAs, SLOs, and error budgets
Mentor junior engineers and foster a culture of reliability and resilience

Qualifications

8–12 years of experience in Cloud Engineering, SRE, or Infrastructure roles
Proven leadership in PRE/SRE functions within enterprise-grade environments
Cloud operations certification (AWS or Azure preferred)
Expert-level knowledge of Kubernetes and hands-on experience managing K8s clusters
Strong experience with incident management, disaster recovery, and patching strategies
Proficiency in monitoring tools (e.g., Prometheus, Grafana, Datadog, ELK)
Solid scripting and automation skills (Python, Bash, Terraform, Ansible, Go, etc.)
Excellent problem-solving, communication, and stakeholder management abilities

Good to have

Experience with managed or enterprise Kubernetes platforms (e.g., OpenShift, EKS, AKS)
Exposure to compliance frameworks (SOC2, ISO 27001, etc.)
Familiarity with CI/CD pipelines and GitOps practices

Why Join Us?