Senior AI Infrastructure & Platform Engineer
About the Role
Manage and optimize GPU-based infrastructure, deploy workloads using orchestration tools, and collaborate with teams on AI/ML solutions.
Full Job Posting
Key Responsibilities
- Deploy, maintain, and optimize GPU-based compute clusters and infrastructure.
- Manage and operate GPU orchestration tools and platforms such as:
  • Nvidia AI Enterprise Suite
  • Nvidia GPU and Network Operators
  • Nvidia NIMs and Blueprints
- Configure, deploy, and maintain compute workloads using scheduling and orchestration tools including:
  • Slurm
  • Vanilla Kubernetes
- Install, configure, and maintain the underlying OS (e.g. Canonical Ubuntu) and supporting system software.
- Monitor and troubleshoot infrastructure performance, availability, and reliability; ensure high uptime for AI/ML workloads.
- Work with data scientists, ML engineers, and dev teams to define infrastructure requirements, resource allocation, and deployment workflows.
- Develop automation scripts, CI/CD pipelines, and best practices for infrastructure provisioning and management.
- Document architecture, configurations, and operational procedures; enforce security, compliance, and backup policies.
Required Skills & Experience
- Proven experience managing GPU-based AI/ML infrastructure and compute clusters.
- Hands-on experience with:
  • Nvidia GPU/Network Operators, NIMs, Blueprints
- Strong experience with Slurm and/or Kubernetes orchestration.
- Solid Linux system administration skills, preferably on Ubuntu or similar distributions.
- Strong scripting/automation ability (e.g. Bash, Python, or relevant tooling) for provisioning, deployment, and maintenance.
- Excellent troubleshooting and performance-tuning skills.
- Experience collaborating with ML/data science teams and integrating infrastructure with their workflows.
- Strong understanding of networking, security, resource allocation, and cluster management best practices.
Preferred Qualifications
- Previous experience working in a high-performance computing (HPC) or AI-focused infrastructure team.
- Knowledge of containerization, container orchestration, and GPUs in cloud or on-prem environments.
- Experience with CI/CD, infrastructure-as-code (e.g. Terraform, Ansible), monitoring tools, and logging setups.
- Familiarity with workload scheduling, job queuing, resource quotas, and GPU-shared environments.
More from this employer
More jobs at Deepsource Technologies
Infrastructure Subject Matter Expert for BCDR & DR Automation
Riyadh, KSA
Ensure IT infrastructure supports Business Continuity and Disaster Recovery objectives, focusing on automation, performance testing, and compliance with regulatory requirements.
Senior DBA
Riyadh, KSA
The role involves managing enterprise databases, ensuring performance and security, and requires experience in database administration and high availability concepts.
Cloud Infrastructure Automation Engineer
Riyadh, KSA
Design and manage VMware environments, automate infrastructure with Terraform and Ansible, deploy Kubernetes, and provide L2/L3 support in cloud environments.
Application SME (BCDR & DR Automation)
Riyadh, KSA
Responsible for application and database disaster recovery strategy, automation integration, documentation, and compliance, requiring strong technical leadership and collaborati...
Cloud Administrator (Azure / Alibaba) - Saudi National - Riyadh, KSA
Riyadh, KSA
Seeking a Cloud Administrator to manage Azure and Alibaba Cloud environments, ensuring security, performance, and infrastructure support with Fortinet experience.
Fortinet Security Engineer
Dubai, UAE
The role involves configuring and troubleshooting FortiGate firewalls, managing security policies, and ensuring compliance in enterprise environments with strong FortiAnalyzer e...
Senior Fortinet Security Engineer
Dubai, UAE
Seeking a candidate with expertise in FortiGate, FortiAnalyzer, and FortiManager for centralized firewall management and security compliance in retail environments.
Senior IBM WebMethods ESB Architect
Riyadh, KSA
Architect end-to-end ESB integration solutions using IBM WebMethods, ensuring compliance with banking regulations and collaborating with stakeholders on design and governance.
