Site Reliability Engineer: AI Production Readiness, Automated Testing, Observability

TAT IT Technolgies
Abu Dhabi, UAE
Contract
Mid-Senior
Today
engineering, design, project management, maintenance, quality control, technical

Full Job Posting

Overview

Urgent requirement for a Site Reliability Engineer (AI production readiness, automated testing, observability, SLIs, resilience) in the banking domain, for our banking clients in Abu Dhabi, UAE.

This hybrid role combines SRE and automated testing to ensure AI-driven cloud applications are production-ready, resilient, and compliant with banking standards.

Must-have requirements

  • Strong expertise in Python-based testing frameworks (PyTest, Robot Framework, or similar) and experience with Azure / AWS cloud platforms.
  • Hands-on experience with observability tools (Prometheus, Grafana, ELK, Datadog) and with defining and implementing SLIs/SLOs for distributed systems.
  • Practical exposure to chaos engineering and load testing frameworks (Gremlin, Locust, JMeter) and familiarity with AI/ML evaluation tools for production readiness.
  • Strong background in security and compliance automation within regulated industries (banking/finance).

Role Overview

We are seeking a Site Reliability Engineer (AI Production Readiness) to ensure our AI-driven cloud applications are production-ready, resilient, and compliant with banking standards.

This hybrid role combines SRE practices with automated testing expertise, focusing on reliability, observability, and proactive validation of both application logic and infrastructure.

Key Responsibilities

  • Automated Validation Frameworks: Design and implement Python-based automated testing frameworks to validate AI application logic, APIs, and cloud infrastructure.
  • Resilience Engineering: Conduct chaos testing, load testing, and fault injection to ensure systems withstand failures and maintain service continuity.
  • SLI/SLO Definition: Establish clear Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for AI workloads, ensuring measurable reliability targets.
  • Observability & Monitoring: Build proactive monitoring, alerting, and logging pipelines across Azure and AWS environments to detect anomalies before they impact users.
  • Security & Compliance: Implement automated compliance checks aligned with banking regulations, ensuring secure deployment pipelines and audit readiness.
  • AI Evaluation Tools: Integrate AI-specific evaluation frameworks to continuously assess model performance, fairness, and reliability in production.

Skills: reliability, AI, automated testing
