AI Platform Engineer

Tata Consultancy Services
Dubai, UAE
Full-time
Mid-Senior
Today
Cloud Computing, Infrastructure as Code (IaC), CI/CD, Kubernetes, Docker, Ansible
Full Job Posting

Company

– TCS (MEA)

Location

– Dubai

Job type

– Full time

About Us

Tata Consultancy Services (TCS) is an IT services, consulting and business solutions organization that has been partnering with many of the world’s largest businesses in their transformation journeys for over 50 years.

TCS offers a consulting-led, cognitive-powered, integrated portfolio of business, technology, and engineering services and solutions.

This is delivered through its unique Location Independent Agile™ delivery model, recognized as a benchmark of excellence in software development.

A part of the Tata group, India's largest multinational business group, TCS employs over 616,171 of the world’s best-trained consultants, representing 157 nationalities across 53 countries.

For more information, visit www.tcs.com and follow TCS news at @TCS_News.

AI Platform Operations

Provide day-to-day monitoring and operational support of Gernas, its dependent AI applications, deployed AI agents, agent runtime environments, AI gateways, and supporting Azure and AWS cloud services.

Ensure continuous availability and consistent service performance across production environments, proactively identifying degradation signals before they become user-visible incidents.

Maintain operational dashboards covering platform health, agent execution, LLM consumption, gateway throughput, and integration endpoints, and ensure that ownership of every production component is unambiguous and documented.

Incident and Problem Management

Lead L2 and L3 incident response across the Gernas platform and its surrounding AI products.

Conduct structured root-cause analysis, drive service restoration within agreed P1–P4 service-level commitments, and coordinate major incidents involving multiple engineering, cloud, security, and vendor stakeholders.

Maintain problem records for recurring or systemic issues, ensure that post-incident reviews are completed with clear corrective and preventive actions, and own the closure of those actions through to engineering remediation.

Act as the bridge between the L1 AI Operations Centre and L3 engineering teams during complex incidents.

AI Agent and Workflow Support

Troubleshoot the behaviour of single-agent and multi-agent workflows, including agent orchestration, tool invocation, skill execution, prompt construction, memory and context handling, planning loops, and inter-agent coordination.

Diagnose failures across Microsoft Agent Framework and LangGraph execution, including state-graph traversal, conditional routing, tool-call mismatches, retries, and human-in-the-loop checkpoints.

Investigate integrations with enterprise systems, MCP-exposed tools, and downstream APIs, and work with engineering teams to harden agent designs where production data reveals weaknesses.

LLM and Model Service Support

Support the operational stability of the model-serving layer across Azure OpenAI PTU deployments, AWS Bedrock, and Core42 Compass.

Monitor and troubleshoot PTU capacity utilisation, token consumption patterns, quota and throttling behaviour, latency profiles, content filtering outcomes, and model endpoint availability.

Coordinate with the AI Hub team on routing decisions between frontier and in-region models, assess cost and capacity headroom, and manage capacity escalations and vendor tickets where service degradation originates upstream.

AKS and Runtime Support

Provide hands-on operational support of the Azure Kubernetes Service estate hosting Gernas agent runtimes and supporting services.

Troubleshoot pod, deployment, service, ingress, namespace, secret, and configuration issues; investigate autoscaling behaviour and resource utilisation anomalies; and analyse container logs, network connectivity, and runtime failures.

Work closely with the Cloud Platform team on cluster-level concerns, image security, node health, and platform upgrades, ensuring AKS-hosted workloads remain stable through patching and lifecycle events.

API, AI Gateway and MCP Support

Operate and troubleshoot the Azure API Management layer that serves as both the enterprise API gateway and the MCP Gateway for Gernas.

Diagnose issues with API policies, authentication, authorisation, routing, rate limiting, quotas, caching, and backend connectivity.

Provide deep support for MCP servers, MCP tools, and the MCP Gateway pattern — including tool discovery, schema validation, protocol-level failures, and the coordination between MCP clients hosted in agents and the underlying tool endpoints.

Ensure that the gateway remains a secure, observable, and policy-compliant control plane for all AI traffic.

Observability and Performance Management

Use Comet Opik as the primary observability surface for agent and LLM execution, working with traces, prompts, agent execution paths, latency breakdowns, token usage, errors, and model quality indicators.

Build and maintain operational dashboards, alert rules, and correlation views that combine Opik telemetry with Azure Monitor, Application Insights, Log Analytics, and CloudWatch data.

Lead performance optimisation initiatives where trace evidence shows hotspots in prompts, tools, retrieval steps, or model selection, and ensure that observability coverage keeps pace with platform evolution.

Voice AI Support

Support the production operation of ElevenLabs-based voice AI capabilities, including speech generation, voice-agent connectivity, real-time audio session handling, and API consumption patterns.

Investigate latency, audio quality, dropped sessions, and integration failures across the voice channel and its dependent platforms, and coordinate with ElevenLabs and integration partners on upstream issues.

Release and Change Management

Validate releases prior to and immediately following deployment, exercising production verification scripts, smoke tests, and rollback procedures.

Maintain release readiness through clear configuration management, environment parity checks, and pre-deployment risk reviews.

Operate within FAB's change-management framework, ensuring that all production changes, including patches, upgrades, configuration adjustments, and model or prompt rotations, pass through the appropriate change controls and post-implementation review.

Security, Risk and Compliance

Uphold the security and compliance posture of Gernas and dependent AI products.

Manage identity and access controls, secrets, certificates, and managed identities across the platform; coordinate vulnerability remediation and patching cycles; and maintain audit evidence for internal audit, supervisory reviews, and external assurance.

Operate responsible-AI controls, including content filtering, PII and PCI detection, data egress controls, and model-access governance, and ensure that secure integration patterns are followed across every API, MCP tool, and external dependency.

Service Improvement and Automation

Drive continuous reduction of manual support effort through automation of routine operational tasks, self-healing patterns, monitoring enhancements, and proactive remediation.

Maintain a current and high-quality library of support playbooks, runbooks, knowledge articles, and standard operating procedures.

Identify and lead service-improvement initiatives that lift platform reliability metrics, reduce incident volume, and shorten mean time to resolution.

Stakeholder and Vendor Coordination

Operate as a credible technical counterpart to business units, engineering teams, the Cloud Platform team, Cybersecurity, Architecture, AI Governance, and Service Management.

Lead vendor engagement with Microsoft, AWS, Core42, ElevenLabs, and other technology partners on incidents, capacity reviews, roadmap items, and product issues, ensuring that vendor accountability is exercised and that escalations are progressed effectively.

On-Call and Operational Readiness

Participate in a 24×7 support model, including a structured on-call rotation, major incident leadership, disaster-recovery exercises, business-continuity testing, and production-readiness assessments for new agents, models, and integrations entering the platform.

Treat operational readiness as a release gate rather than an afterthought, and ensure that nothing reaches production without explicit operational sign-off.

TECHNICAL SKILLS: a minimum of 5–7 years of working experience is mandatory

LLM Operations

Kubernetes and Containers

Observability

DevOps and Automation

Voice AI

Security and Governance

Education

Bachelor's degree in Computer Science, Artificial Intelligence, Information Technology, Engineering, or a closely related discipline.

A relevant master's degree in AI, Machine Learning, or Cloud Computing will be considered an advantage.

Professional Experience

Approximately 5–7 years of overall IT experience, with significant time spent in cloud application support, platform engineering, DevOps, Site Reliability Engineering, production operations, or senior technical support roles.

Of this, a minimum of 2–3 years of relevant experience supporting AI, machine learning, Generative AI, conversational AI, cloud-native platforms, or other data-intensive enterprise systems is required.

Operational Skills

Communication and Leadership

Thank you for your interest in applying for this position with TCS.

We will review your application and get back to you if we are considering you for this opportunity.

Privacy Note

https://www.tcs.com/connect-with-tcs/privacy-policy
