
AI/HPC Level 1 Support Engineer

AIHostingHub · Dubai, UAE · Posted 2 weeks ago · Entry level · Full-time
Via LinkedIn

About This Role

Company Description

AIHostingHub is the UAE's leading provider of cutting-edge AI and High-Performance Computing (HPC) infrastructure. We specialize in building large-scale AI data centers and delivering GPU-as-a-Service, from small deployments to massive clusters. As a trusted professional services partner for industry giants like Supermicro and VAST Data in the GCC, we provide the technology, expertise, and support to fuel your most ambitious projects.

Our Services

* AI/HPC Data Centers: Custom-built, scalable environments optimized for the most demanding AI workloads.

* GPU as a Service: On-demand access to massive GPU clusters, from 2,048 GPUs to over 16,384 GPUs per cluster.

* Cybersecurity MSSP: Fortinet- and AttackIQ-powered 24/7 managed security to protect your critical infrastructure and data.

* Expert Professional Services: End-to-end support from design and deployment to optimization, directly from GCC-based partners.

AIHostingHub prides itself on delivering customized security solutions, dedicated support, and strategic guidance, ensuring that clients can operate confidently in the digital landscape. Explore the future of cybersecurity with AIHostingHub, where protection is the top priority.

Role Description

This is a full-time, on-site role based in Dubai for an AI/HPC Level 1 Support Engineer. The role involves providing first-level troubleshooting, technical support, and customer support for AI and high-performance computing (HPC) infrastructure. Responsibilities include monitoring system performance, resolving operational issues, assisting clients with inquiries, and maintaining operational documentation. The engineer will also collaborate with internal teams and escalate issues to higher-level support when necessary.

Key Responsibilities

  • Monitor dashboards (Grafana, ticketing system) for GPU node health, InfiniBand link flaps, temperature, and power anomalies.
  • Log, categorize, and prioritize incidents (P1–P4) in line with SLA response times (1 hour for urgent, 2 hours for high).
  • Perform onsite smart‑hands tasks: cable patching, component replacement, fibre cleaning, visual inspections.
  • Execute post‑repair validation scripts (CUDA P2P, NCCL local, DCGMI, STREAM) after RMA.
  • Coordinate with vendors (NVIDIA, Supermicro) for warranty replacements.
  • Escalate unresolved issues to Level 2 AI/HPC Engineers.
  • Maintain operational logs, asset records, and maintenance documentation.

Required Qualifications

  • 1–3 years in datacenter, NOC, or HPC support.
  • Familiarity with GPU servers, InfiniBand/Ethernet cabling, and fibre optics.
  • Basic Linux command line (dmesg, nvidia‑smi, grep, uptime).
  • Understanding of incident management and SLA targets (response/resolution times).
  • Ability to work rotating shifts in a 24/7 operation (including weekends).
  • Strong communication and documentation skills.

Preferred

  • Experience with DCGM, Grafana, or ticketing systems (Jira/ServiceNow).
  • Knowledge of liquid cooling CDUs or Proxmox/Ceph is a plus.

We Offer

  • Structured career progression to L2/L3 roles.
  • Training on HGX platforms and AI validation frameworks.