Site Reliability Engineer - AI Agents

TALENTMATE

UAE

Mid-Senior

engineering, design, project management, maintenance, quality control, technical

Job Description

Building the Future of Open Finance
Payward - the parent company behind Kraken, NinjaTrader, Breakout, xStocks, Payward Services and CF Benchmarks - has spent the last 15 years building one of the most modern and globally accessible financial infrastructure platforms in the industry, built to advance an open, global financial system.
Before you apply, we encourage you to explore our culture page to understand what drives us and how we work.
The team
Founded in 2011, Kraken is one of the world's longest-standing crypto platforms, trusted by over 10 million individuals and institutions across the globe.
It offers spot trading, margin, futures, staking, and OTC services, with products built for both individual investors and institutional clients.
The AI Infrastructure team sits within the Data organization and is responsible for building, operating, and scaling the systems that power AI agents in production — both internal tools and external-facing products.
Working closely with the AI and Agent Systems teams, this group ensures that the orchestration, execution, and model-serving layers underpinning agentic workflows are reliable, observable, and built to scale.
This team operates at the intersection of data infrastructure and applied AI — a space that moves fast and demands engineers who can bring production discipline to emerging technology.
You'll partner across Data Engineering, ML, and product-facing teams to harden agent infrastructure and keep it running at the standards our users expect.
Importantly, this is a platform engineering team.
Beyond operating infrastructure, the team is responsible for building the APIs, SDKs, and platform capabilities that enable AI, Data, and Engineering teams to safely and efficiently consume agent infrastructure as a service.
Success in this role requires thinking beyond infrastructure operations and toward developer experience, platform adoption, and long-term scalability.
The opportunity
Design, build, and operate the infrastructure layer supporting AI agent workflows in production
Ensure reliability, scalability, and observability of agentic systems across internal and external products
Design and develop platform services, APIs, SDKs, and self-service capabilities that allow engineering teams to easily consume AI infrastructure and agent platform services
Manage and maintain the compute, orchestration, and serving infrastructure powering model inference and agent execution
Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads
Utilize Infrastructure as Code (IaC) tools such as Terraform to provision and manage cloud (AWS) infrastructure components
Build and maintain CI/CD pipelines that support rapid, reliable deployment of AI services and agent workflows
Define and implement guardrails, failure handling, and recovery patterns specific to agentic and LLM-powered systems
Collaborate with AI and Data Engineering teams to translate experimental agent prototypes into hardened production systems
Manage containerized workloads using Kubernetes, ensuring efficient deployment, scaling, and orchestration of AI services
Implement access controls and security best practices across AI infrastructure environments
Document architecture, runbooks, and best practices to support knowledge sharing across the team

What You Bring

5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, Platform Engineer, or similar role in a production environment
Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
Experience building developer platforms, internal tooling, APIs, or SDKs consumed by engineering teams at scale
Strong understanding of platform engineering principles, including developer experience, self-service infrastructure, and API-driven platform design
Proficiency with Infrastructure as Code tools, particularly Terraform
Experience with containerization and orchestration, particularly Kubernetes and Docker
Solid understanding of cloud infrastructure, preferably AWS
Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
Experience designing and operating observability, monitoring, and alerting systems
Experience implementing incident response procedures and participating in on-call rotations
Strong collaboration skills working across data, AI, and engineering teams
High ownership mindset in a fast-moving, high-stakes production environment
Nice to haves
Experience building or operating infrastructure for agent-based or LLM-powered systems
Familiarity with agent orchestration frameworks (e.g., LangGraph, CrewAI, or similar)
Background in data infrastructure, including familiarity with Airflow, Kafka, Spark, or data lake tooling
Experience with CI
