
Senior Infrastructure Engineer (HPC)

CONNECT Professional Services
Riyadh, KSA
Full-time
Mid-Senior
Yesterday
engineering, design, project management, maintenance, quality control, technical

Full Job Posting

Job Summary

Deploying, configuring, and managing large-scale High-Performance Computing (HPC) environments.

Demonstrating practical expertise across Linux administration (RHEL and Ubuntu), NVIDIA GPU infrastructure, Slurm workload scheduling, Kubernetes, CI/CD automation, and the NVIDIA Enterprise software ecosystem.

Key Responsibilities

  • Design, implement, and maintain end-to-end HPC clusters, including compute nodes, storage layers, high-speed networking (InfiniBand/RoCE), and management infrastructure.
  • Provision and administer NVIDIA Base Command Manager (BCM) for bare-metal cluster deployment, operating system lifecycle management, and GPU fleet monitoring.
  • Deploy, maintain, and integrate the NVIDIA AI Enterprise Suite with MLOps frameworks, including NeMo, Triton, and RAPIDS.
  • Manage NVIDIA GPU Operator and Network Operator within Kubernetes environments to automate GPU driver and CUDA lifecycle management, DCGM exporter deployment, and MIG configuration.
  • Configure and support NVIDIA NIM inference services and implement NVIDIA Blueprint reference architectures for production AI workloads.
  • Install, administer, and optimize Slurm environments, including partitions, QoS policies, fair-share scheduling, node accounting, MPI integration, and hybrid Slurm-on-Kubernetes scheduling.
  • Build and manage Kubernetes clusters using kubeadm, including high-availability control planes, etcd backup strategies, and zero-downtime upgrades.
  • Administer and maintain Red Hat Enterprise Linux (RHEL) and Canonical Ubuntu systems across all cluster nodes.
  • Develop and maintain CI/CD pipelines using GitLab CI and GitHub Actions to automate infrastructure provisioning and software delivery.
  • Analyze and optimize GPU and CPU performance, troubleshooting bottlenecks across hardware, drivers, MPI fabric, and application layers.
  • Implement monitoring and observability solutions using Prometheus, Grafana, and DCGM, and establish alerting and capacity-planning mechanisms.
  • Ensure adherence to security best practices through system hardening, kernel patching, RBAC implementation, and compliance monitoring across the HPC environment.

Requirements

  • Bachelor's degree in Computer Science, Information Technology, Computer Engineering, or a related field.
  • Minimum of 10 years of hands-on experience in High-Performance Computing (HPC) and infrastructure engineering.
  • Active Red Hat Certified Engineer (RHCE) certification.
  • Active Certified Kubernetes Administrator (CKA) certification.
  • Proven experience designing, deploying, and managing large-scale HPC environments.
  • Strong hands-on expertise with NVIDIA Base Command Manager (BCM) and the NVIDIA AI Enterprise ecosystem.
  • Experience with NVIDIA GPU Operator, Network Operator, NVIDIA NIMs, and NVIDIA Blueprints.
  • Extensive experience administering Slurm and managing workload scheduling in HPC environments.
  • Strong knowledge of Kubernetes cluster deployment and administration, including high availability and lifecycle management.
  • Solid experience with Red Hat Enterprise Linux (RHEL) and Canonical Ubuntu LTS administration.
  • Proficiency in CUDA, GPU drivers, and GPU infrastructure management.
  • Experience building and maintaining CI/CD pipelines using GitLab CI and/or GitHub Actions.
  • Familiarity with high-speed networking technologies, including InfiniBand and RoCE.
  • Experience with monitoring and observability tools such as Prometheus, Grafana, and NVIDIA DCGM.
  • Strong understanding of infrastructure security, system hardening, RBAC, and compliance best practices.
  • Excellent troubleshooting, performance optimization, and problem-solving skills.
  • Strong communication and collaboration skills with the ability to work effectively in cross-functional teams.
