Site Reliability Engineer - Observability
About This Role
We are hiring an SRE focused on observability, automation, and runtime reliability for AI platforms and internal agentic systems. This is not a generic SOC role. It is an engineering role for someone who builds telemetry, automates findings-to-fix loops, improves production readiness, and keeps AI systems measurable, resilient, and controllable in production.
Tech Stack
- Python for automation and workflow integration
- Observability tooling: metrics, logs, traces, OpenTelemetry, and Datadog or comparable stacks
- AWS logging, telemetry, IAM-aware diagnostics, and infrastructure scripting
- CI/CD integration for runtime checks, rollback drills, and policy validation
- Nice to have: Wiz, CrowdStrike, Orca, GuardDuty, WAF / RASP-style controls, MCP / agent telemetry
Responsibilities
- Design and operate the telemetry and observability layer for AI platforms, including audit trails, tool-call logs, correlation IDs, traces, and runtime visibility across service boundaries.
- Build automated findings-to-fix loops for AI and cloud platforms, integrating signals from tooling such as Wiz, Astrix, or future AI security products into pragmatic remediation workflows.
- Implement reliability and hardening controls for internal AI systems, including alerting, health checks, rollback drills, kill-switch validation, rate limiting, and drift detection.
- Codify detections, policies, and operational checks as code where they reduce toil, prevent regressions, and improve platform control.
- Review platform and AI-application changes from a reliability and application-hardening perspective, especially around secrets, telemetry, external calls, risky MCP usage, and production readiness.
- Own AI-platform-specific operational readiness and partner with central IT / EAS / SOC teams for escalations, handoffs, and shared incident workflows when needed.
- Continuously improve production readiness through automation, post-incident learning, and repeatable playbooks for AI runtime issues.