Principal Engineer - HPC Operations

Core42 · Abu Dhabi, UAE · 1 month ago · Mid-Senior · Full-time
Git · Kubernetes · Scala · VAT
Via LinkedIn

About This Role

About Us

Core42, a leader in AI-powered cloud and digital infrastructure, is driving transformative technology solutions globally. Leveraging advanced resources and partnerships, Core42 empowers clients to harness sovereign AI infrastructure, especially in sectors with stringent regulatory needs. With a mission to redefine digital transformation, we combine sovereign capabilities with scalable, high-performance compute infrastructure, positioning ourselves at the forefront of AI innovation in the Middle East and beyond.

The Opportunity

We are seeking a highly skilled Principal Engineer – HPC Operations to oversee the daily operations and support of high-performance computing clusters designed to power large-scale AI and ML workloads. This role ensures stable, secure, and high-performing infrastructure leveraging technologies such as Slurm, Kubernetes, and modern MLOps platforms. The ideal candidate will bring deep technical expertise in HPC and a strong operational mindset to drive continuous improvement and automation across globally distributed environments. Responsibilities will extend to collaborating with multidisciplinary teams, leading complex projects, implementing cutting-edge technologies, and providing mentorship to operations engineers.

Key Responsibilities

  • Oversee the daily operational management of HPC infrastructure, including compute, storage, networking, and scheduler components (e.g., Slurm, Kubernetes).
  • Drive efforts to optimize the efficiency and performance of HPC systems, ensuring maximum resource utilization and minimizing downtime.
  • Serve as the primary technical contact for planned HPC deployments in scope.
  • Serve as the primary technical escalation point for L2 support teams, ensuring rapid and effective resolution of incidents and service requests.
  • Continuously monitor system health, performance, and resource utilization using advanced monitoring tools (e.g., Prometheus, Grafana, DCGM).
  • Manage user environments for AI/ML workloads, including container orchestration (e.g., Docker, Kubernetes) and workflow tools (e.g., MLflow, Kubeflow).
  • Define and enforce job scheduling policies, priorities, and partitions within Slurm and/or Kubernetes environments to ensure resource fairness, efficiency, and workload optimization.
  • Lead root cause analysis (RCA) of operational issues, contributing to post-mortem documentation and driving continuous improvement initiatives.
  • Provide mentorship and technical guidance to junior engineers, fostering skills development and knowledge sharing across teams.
  • Participate in on-call rotation as necessary.
  • Ensure adherence to security and operational policies, assisting in audits and maintaining documentation for change and incident management processes.

Required Skills / Qualifications

Minimum Experience

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
  • Minimum of 8 years of experience in HPC operations, systems engineering, or DevOps roles, with at least 2 years in a leadership or ownership capacity.
  • Advanced expertise in configuring, optimizing, and maintaining complex HPC environments, including hardware, software, and storage systems.
  • Hands-on experience managing Slurm clusters and/or Kubernetes-based environments for AI/ML workloads.
  • In-depth knowledge of GPU resource management, workload schedulers, and performance tuning for AI/ML workloads.
  • Proficiency with monitoring and observability frameworks such as Prometheus, Grafana, and DCGM.
  • Strong scripting and automation skills, including Python, Bash, Ansible, and Terraform.
  • Solid understanding of Linux (RHEL/CentOS/Ubuntu), networking technologies (RDMA, InfiniBand, RoCE), and storage solutions (NFS, Lustre, Ceph).
