Senior Infrastructure Engineer (HPC)
About the Role
Deploying, configuring, and managing large-scale High-Performance Computing (HPC) environments. Demonstrating practical expertise across Linux administration (RHEL and Ubuntu), NVIDIA GPU infrastructure, Slurm workload scheduling, Kubernetes, CI/CD automation, and the NVIDIA Enterprise software ecosystem.
Key Responsibilities
- Design, implement, and maintain end-to-end HPC clusters, including compute nodes, storage layers, high-speed networking (InfiniBand/RoCE), and management infrastructure.
- Provision and administer NVIDIA Base Command Manager (BCM) for bare-metal cluster deployment, operating system lifecycle management, and GPU fleet monitoring.
- Deploy, maintain, and integrate the NVIDIA AI Enterprise Suite with MLOps frameworks, including NeMo, Triton, and RAPIDS.
- Manage NVIDIA GPU Operator and Network Operator within Kubernetes environments to automate GPU driver and CUDA lifecycle management, DCGM exporter, and MIG configuration.
- Configure and support NVIDIA NIM inference services and implement NVIDIA Blueprint reference architectures for production AI workloads.
- Install, administer, and optimize Slurm environments, including partitions, QoS policies, fair-share scheduling, node accounting, MPI integration, and hybrid Slurm-on-Kubernetes scheduling.
- Build and manage Kubernetes clusters using kubeadm, including high-availability control planes, etcd backup strategies, and zero-downtime upgrades.
- Administer and maintain Red Hat Enterprise Linux (RHEL) and Canonical Ubuntu systems across all cluster nodes.
- Develop and maintain CI/CD pipelines using GitLab CI and GitHub Actions to automate infrastructure provisioning and software delivery.
- Analyze and optimize GPU and CPU performance, troubleshooting bottlenecks across hardware, drivers, MPI fabric, and application layers.
- Implement monitoring and observability solutions using Prometheus, Grafana, and DCGM, and establish alerting and capacity-planning mechanisms.
- Ensure adherence to security best practices through system hardening, kernel patching, RBAC implementation, and compliance monitoring across the HPC environment.
Requirements
- Bachelor's degree in Computer Science, Information Technology, Computer Engineering, or a related field.
- Minimum of 10 years of hands-on experience in High-Performance Computing (HPC) and infrastructure engineering.
- Active Red Hat Certified Engineer (RHCE) certification.
- Active Certified Kubernetes Administrator (CKA) certification.
- Proven experience designing, deploying, and managing large-scale HPC environments.
- Strong hands-on expertise with NVIDIA Base Command Manager (BCM) and the NVIDIA AI Enterprise ecosystem.
- Experience with NVIDIA GPU Operator, Network Operator, NVIDIA NIMs, and NVIDIA Blueprints.
- Extensive experience administering Slurm and managing workload scheduling in HPC environments.
- Strong knowledge of Kubernetes cluster deployment and administration, including high availability and lifecycle management.
- Solid experience with Red Hat Enterprise Linux (RHEL) and Canonical Ubuntu LTS administration.
- Proficiency in CUDA, GPU drivers, and GPU infrastructure management.
- Experience building and maintaining CI/CD pipelines using GitLab CI and/or GitHub Actions.
- Familiarity with high-speed networking technologies, including InfiniBand and RoCE.
- Experience with monitoring and observability tools such as Prometheus, Grafana, and NVIDIA DCGM.
- Strong understanding of infrastructure security, system hardening, RBAC, and compliance best practices.
- Excellent troubleshooting, performance optimization, and problem-solving skills.
- Strong communication and collaboration skills with the ability to work effectively in cross-functional teams.
More jobs at CONNECT Professional Services
Senior Enterprise Networks & Data Center Engineer
Riyadh, KSA
Position Summary Highly skilled and motivated Senior Enterprise Networks & Data Center Engineer to join our Advanced Services team. The successful candidate will be responsible for the design, implementation, migration,
IT Service Management (ITSM) Expert (Ivanti)
Riyadh, KSA
Job Summary: Responsible for the design, governance, and continuous improvement of IT Service Management (ITSM) processes and platforms, ensuring full alignment with ITIL 4 best practices, enterprise standards, and opera