AI Infrastructure Engineer
Overview
The AI Infrastructure Engineer is a platform specialist responsible for architecting, building, and operating high-performance AI infrastructure to support advanced AI workloads, including LLMs, GenAI, Computer Vision, and MLOps.
This role will focus on managing GPU clusters (NVIDIA A100/H100), deploying and maintaining Red Hat OpenShift AI (RHODS), and ensuring secure, scalable, and cost-efficient AI platforms across SDD’s Sovereign Cloud and hybrid/multi-cloud environments.
The engineer will enable enterprise-grade AI adoption for 200+ government entities.
GPU & AI Platform Architecture
Design and implement GPU-based compute clusters.
Define reference architectures for LLM hosting, Vector Databases, MLOps, and high-performance storage/networking.
Deliverable: Fully operational GPU-based AI infrastructure.
KPI: GPU Cluster Uptime and Performance Utilization.
KPI: Reduction in Cost per Training/Inference Workload.
GPU Cluster Operations
Install, configure, and optimize core components: CUDA, cuDNN, NCCL, NVIDIA Drivers, and GPU Operators.
Implement GPU partitioning, scheduling, and performance tuning for high-end GPUs (e.g., A100/H100).
Deliverable: High-availability architecture for all AI workloads.
Deliverable: Complete documentation and runbooks.
OpenShift AI (RHODS) Management
Deploy, configure, and maintain the Red Hat OpenShift AI (RHODS) platform for multi-tenant use.
Manage the integration of the NVIDIA GPU Operator for efficient GPU scheduling, and support data scientists with notebooks, training, and inference endpoints.
Deliverable: Production-ready OpenShift AI (RHODS) platform.
KPI: AI Project Onboarding Speed.
LLM & Model Serving
Build and manage infrastructure for hosting and serving open-source LLMs (Llama, Falcon, Mistral), with support for RAG pipelines, LoRA adapters, and Vector Databases (Milvus, pgvector).
Deliverable: Multi-model LLM serving environment for entities.
KPI: MLOps Pipeline Success Rate and Deployment Frequency.
MLOps & Automation
Implement IaC (Terraform, Ansible) and GitOps for the automated lifecycle management of the AI platform (node onboarding, scaling, model rollout/rollback).
Build robust MLOps pipelines for data prep, training, evaluation, and monitoring (using tools like MLflow/Kubeflow).
Deliverable: Infrastructure automation via Terraform & Ansible.
KPI: Automation Coverage for AI Infrastructure.
Required Qualifications & Experience
- Experience: 7–12 years in Cloud Infrastructure, DevOps, ML Infrastructure, or Platform Engineering.
- Deep Hands-On Expertise:
  - GPU Systems (NVIDIA A100/H100), Linux, Containers, and Kubernetes.
  - OpenShift AI (RHODS) or equivalent Kubernetes GPU orchestration.
  - LLM Hosting (Llama, Mistral, Falcon, etc.) and supporting Vector Databases/RAG systems.
- Strong Experience In: TensorFlow, PyTorch, Hugging Face, Distributed Training (DDP, DeepSpeed), and MLOps Stacks (MLflow, Kubeflow).
Essential Skills & Competencies
- Technical: Deep understanding of GPU compute, HPC architectures, and ML performance profiling. Strong skills in IaC (Terraform/Ansible), CI/CD, and OpenShift/Kubernetes operators.
- Soft Skills: Strong troubleshooting, optimization, and performance engineering mindset. Excellent cross-functional collaboration and documentation skills.
Preferred Certifications
- NVIDIA Deep Learning / AI Infrastructure Certification
- Red Hat OpenShift AI specialization
- Kubernetes CKA/CKAD
- Azure AI or Oracle Cloud AI certifications
- Terraform & Ansible certifications