Lead Engineer - HPC Operations

Core42

Abu Dhabi, UAE

Full-time

Mid-Senior

Today

engineering, design, project management, maintenance, quality control, technical


Full Job Posting

Overview

Lead Engineer - HPC Operations, Core42, Abu Dhabi - UAE

About Us

Core42, a leader in AI-powered cloud and digital infrastructure, is driving transformative technology solutions globally.

Leveraging advanced resources and partnerships, Core42 empowers clients to harness sovereign AI infrastructure, especially in sectors with stringent regulatory needs.

With a mission to redefine digital transformation, we combine sovereign capabilities with scalable, high-performance compute infrastructure, positioning ourselves at the forefront of AI innovation in the Middle East and beyond.

The Role

We are seeking a highly skilled Lead Engineer – HPC Operations to oversee the daily operations and support of high-performance computing clusters designed to power large-scale AI and ML workloads.

This role ensures stable, secure, and high-performing infrastructure leveraging technologies such as Slurm, Kubernetes, and modern MLOps platforms.

The ideal candidate will bring deep technical expertise in HPC and a strong operational mindset to drive continuous improvement and automation across globally distributed environments.

Responsibilities also include collaborating with multidisciplinary teams, leading complex projects, implementing cutting-edge technologies, and mentoring operations engineers.

Responsibilities

Oversee the daily operational management of HPC infrastructure, including compute, storage, networking, and scheduler components (e.g., Slurm, Kubernetes).
Drive efforts to optimize the efficiency and performance of HPC systems, ensuring maximum resource utilization and minimizing downtime.
Serve as the primary technical escalation point for L2 support teams, ensuring rapid and effective resolution of incidents and service requests.
Continuously monitor system health, performance, and resource utilization using advanced monitoring tools (e.g., Prometheus, Grafana, DCGM).
Manage user environments for AI/ML workloads, including container orchestration (e.g., Docker, Kubernetes) and workflow tools (e.g., MLflow, Kubeflow).
Define and enforce job scheduling policies, priorities, and partitions within Slurm and/or Kubernetes environments to ensure resource fairness, efficiency, and workload optimization.
Lead root cause analysis (RCA) of operational issues, contributing to post-mortem documentation and driving continuous improvement initiatives.
Provide mentorship and technical guidance to junior engineers, fostering skills development and knowledge sharing across teams.
Participate in on-call rotation as necessary.
Ensure adherence to security and operational policies, assisting in audits and maintaining documentation for change and incident management processes.

Qualifications

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
Minimum of 8 years of experience in HPC operations, systems engineering, or DevOps roles, with at least 2 years in a leadership or ownership capacity.
Advanced expertise in configuring, optimizing, and maintaining complex HPC environments, including hardware, software, and storage systems.
Hands-on experience managing Slurm clusters and/or Kubernetes-based environments for AI/ML workloads.
In-depth knowledge of GPU resource management, workload schedulers, and performance tuning for AI/ML workloads.
Proficiency with monitoring and observability frameworks such as Prometheus, Grafana, and DCGM.
Strong scripting and automation skills, including Python, Bash, Ansible, and Terraform.
Solid understanding of Linux (RHEL/CentOS/Ubuntu), networking technologies (RDMA, InfiniBand, RoCE), and storage solutions (NFS, Lustre, Ceph).

What working at Core42 offers

With a diverse team of 1,100+ employees from 68 nationalities, we foster an inclusive, innovative and collaborative environment.
At Core42, we foster a culture grounded in trust, accountability and high performance.
We are united by our values: Grit, where we overcome challenges with resilience and determination; Passion, which drives us to pursue excellence in everything we do; and Impact, as we aim to inspire progress and create meaningful change.
Our team members thrive in an environment where each person’s contributions propel us forward, and together, we commit to achieving extraordinary results.
Competitive Salary: We offer an attractive salary package based on your skills and experience.
Yearly Bonus: In recognition of your contributions, you will receive a performance-based annual bonus.
Exclusive Discount Cards: Access special benefits with Esaad and Fazaa cards, offering discounts across a wide range of services.
Premium Family Insurance: We provide comprehensive health coverage, including dental, vision and life insurance, ensuring the well-being of you and your family.
Learning & Development: We offer access to top-tier learning platforms to help you grow in your career. Learn at your own pace with unlimited access to premium courses.

