Senior Infrastructure Engineer (HPC)

CONNECT Professional Services

Riyadh, KSA

Full-time

Mid-Senior

Yesterday

engineering, design, project management, maintenance, quality control, technical

Full Job Posting

Job Summary

Deploying, configuring, and managing large-scale High-Performance Computing (HPC) environments.

Demonstrating practical expertise across Linux administration (RHEL and Ubuntu), NVIDIA GPU infrastructure, Slurm workload scheduling, Kubernetes, CI/CD automation, and the NVIDIA Enterprise software ecosystem.

Key Responsibilities

Design, implement, and maintain end-to-end HPC clusters, including compute nodes, storage layers, high-speed networking (InfiniBand/RoCE), and management infrastructure.
Provision and administer NVIDIA Base Command Manager (BCM) for bare-metal cluster deployment, operating system lifecycle management, and GPU fleet monitoring.
Deploy, maintain, and integrate the NVIDIA AI Enterprise Suite with MLOps frameworks, including NeMo, Triton, and RAPIDS.
Manage NVIDIA GPU Operator and Network Operator within Kubernetes environments to automate GPU driver and CUDA lifecycle management, DCGM exporter deployment, and MIG configuration.
Configure and support NVIDIA NIM inference services and implement NVIDIA Blueprint reference architectures for production AI workloads.
Install, administer, and optimize Slurm environments, including partitions, QoS policies, fair-share scheduling, node accounting, MPI integration, and hybrid Slurm-on-Kubernetes scheduling.
Build and manage Kubernetes clusters using kubeadm, including high-availability control planes, etcd backup strategies, and zero-downtime upgrades.
Administer and maintain Red Hat Enterprise Linux (RHEL) and Canonical Ubuntu systems across all cluster nodes.
Develop and maintain CI/CD pipelines using GitLab CI and GitHub Actions to automate infrastructure provisioning and software delivery.
Analyze and optimize GPU and CPU performance, troubleshooting bottlenecks across hardware, drivers, MPI fabric, and application layers.
Implement monitoring and observability solutions using Prometheus, Grafana, and DCGM, and establish alerting and capacity-planning mechanisms.
Ensure adherence to security best practices through system hardening, kernel patching, RBAC implementation, and compliance monitoring across the HPC environment.

Requirements

Bachelor's degree in Computer Science, Information Technology, Computer Engineering, or a related field.
Minimum of 10 years of hands-on experience in High-Performance Computing (HPC) and infrastructure engineering.
Active Red Hat Certified Engineer (RHCE) certification.
Active Certified Kubernetes Administrator (CKA) certification.
Proven experience designing, deploying, and managing large-scale HPC environments.
Strong hands-on expertise with NVIDIA Base Command Manager (BCM) and the NVIDIA AI Enterprise ecosystem.
Experience with NVIDIA GPU Operator, Network Operator, NVIDIA NIMs, and NVIDIA Blueprints.
Extensive experience administering Slurm and managing workload scheduling in HPC environments.
Strong knowledge of Kubernetes cluster deployment and administration, including high availability and lifecycle management.
Solid experience with Red Hat Enterprise Linux (RHEL) and Canonical Ubuntu LTS administration.
Proficiency in CUDA, GPU drivers, and GPU infrastructure management.
Experience building and maintaining CI/CD pipelines using GitLab CI and/or GitHub Actions.
Familiarity with high-speed networking technologies, including InfiniBand and RoCE.
Experience with monitoring and observability tools such as Prometheus, Grafana, and NVIDIA DCGM.
Strong understanding of infrastructure security, system hardening, RBAC, and compliance best practices.
Excellent troubleshooting, performance optimization, and problem-solving skills.
Strong communication and collaboration skills with the ability to work effectively in cross-functional teams.
