
Lead Engineer - HPC Operations

Core42
Abu Dhabi, UAE
Full-time
Mid-Senior

Full Job Posting

Overview

Lead Engineer - HPC Operations, Core42, Abu Dhabi, UAE

About Us

Core42, a leader in AI-powered cloud and digital infrastructure, is driving transformative technology solutions globally.

Leveraging advanced resources and partnerships, Core42 empowers clients to harness sovereign AI infrastructure, especially in sectors with stringent regulatory needs.

With a mission to redefine digital transformation, we combine sovereign capabilities with scalable, high-performance compute infrastructure, positioning ourselves at the forefront of AI innovation in the Middle East and beyond.

The Role

We are seeking a highly skilled Lead Engineer – HPC Operations to oversee the daily operations and support of high-performance computing clusters designed to power large-scale AI and ML workloads.

This role ensures stable, secure, and high-performing infrastructure leveraging technologies such as Slurm, Kubernetes, and modern MLOps platforms.

The ideal candidate will bring deep technical expertise in HPC and a strong operational mindset to drive continuous improvement and automation across globally distributed environments.

Responsibilities will extend to collaborating with multidisciplinary teams, leading complex projects, implementing cutting-edge technologies, and providing mentorship to operations engineers.

Responsibilities

  • Oversee the daily operational management of HPC infrastructure, including compute, storage, networking, and scheduler components (e.g., Slurm, Kubernetes).
  • Drive efforts to optimize the efficiency and performance of HPC systems, ensuring maximum resource utilization and minimizing downtime.
  • Serve as the primary technical escalation point for L2 support teams, ensuring rapid and effective resolution of incidents and service requests.
  • Continuously monitor system health, performance, and resource utilization using advanced monitoring tools (e.g., Prometheus, Grafana, DCGM).
  • Manage user environments for AI/ML workloads, including container orchestration (e.g., Docker, Kubernetes) and workflow tools (e.g., MLflow, Kubeflow).
  • Define and enforce job scheduling policies, priorities, and partitions within Slurm and/or Kubernetes environments to ensure resource fairness, efficiency, and workload optimization.
  • Lead root cause analysis (RCA) of operational issues, contributing to post-mortem documentation and driving continuous improvement initiatives.
  • Provide mentorship and technical guidance to junior engineers, fostering skills development and knowledge sharing across teams.
  • Participate in on-call rotation as necessary.
  • Ensure adherence to security and operational policies, assisting in audits and maintaining documentation for change and incident management processes.

Qualifications

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
  • Minimum of 8 years of experience in HPC operations, systems engineering, or DevOps roles, with at least 2 years in a leadership or ownership capacity.
  • Advanced expertise in configuring, optimizing, and maintaining complex HPC environments, including hardware, software, and storage systems.
  • Hands-on experience managing Slurm clusters and/or Kubernetes-based environments for AI/ML workloads.
  • In-depth knowledge of GPU resource management, workload schedulers, and performance tuning for AI/ML workloads.
  • Proficiency with monitoring and observability frameworks such as Prometheus, Grafana, and DCGM.
  • Strong scripting and automation skills, including Python, Bash, Ansible, and Terraform.
  • Solid understanding of Linux (RHEL/CentOS/Ubuntu), networking technologies (RDMA, InfiniBand, RoCE), and storage solutions (NFS, Lustre, Ceph).

What working at Core42 offers

With a diverse team of 1,100+ employees from 68 nationalities, we foster an inclusive, innovative and collaborative environment.

At Core42, we foster a culture grounded in trust, accountability and high performance.

We are united by our values: Grit, where we overcome challenges with resilience and determination; Passion, which drives us to pursue excellence in everything we do; and Impact, as we aim to inspire progress and create meaningful change.

Our team members thrive in an environment where each person’s contributions propel us forward, and together, we commit to achieving extraordinary results.

  • Competitive Salary: We offer an attractive salary package based on your skills and experience.
  • Yearly Bonus: In recognition of your contributions, you will receive a performance-based annual bonus.
  • Exclusive Discount Cards: Access special benefits with Esaad and Fazaa cards, offering discounts across a wide range of services.
  • Premium Family Insurance: We provide comprehensive health coverage, including dental, vision and life insurance, ensuring the well-being of you and your family.
  • Learning & Development: We offer access to top-tier learning platforms to help you grow in your career. Learn at your own pace with unlimited access to premium courses.
