
Freelance Agent Evaluation Engineer

Mindrift · Saudi Arabia, KSA · 4 days ago · Senior

Skills

engineering, design, project management

About This Role

Overview

  • We're building a dataset to evaluate AI coding agents: how well a model handles real-world developer tasks.
  • You'll create challenging tasks and evaluation criteria within realistic simulated environments:
  • Build virtual companies following a high-level plan - codebase, infrastructure, and context (conversations, documentation, tickets) that form a realistic environment with development history
  • Assemble and calibrate tasks from intermediate states of the virtual company: craft the prompt, define evaluation criteria, and ensure the task is solvable and the evaluation is fair
  • Design tasks set in isolated environments - emulations of a developer's workstation: a Linux machine with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation, etc.), and a real web application codebase
  • Write tests that accept all correct solutions and reject incorrect ones - neither too strict (breaking on valid approaches) nor too lenient (passing bad ones)
  • Iterate with an AI agent on tests - verifying they catch real problems, don't miss bad solutions, and don't break on good ones
  • Review code written by agents, analyze why an agent failed or succeeded, and design edge cases and adversarial scenarios
  • Iterate based on feedback from expert QA reviewers who score your work on quality criteria
  • A significant part of the work is done together with AI - it's very hard to create tasks that challenge frontier models without using frontier models.

Why this is hard

  • Frontier models are already good at coding. Creating a task that genuinely challenges the best models is non-trivial. You need to deeply understand where models fail and what scenarios reveal the difference between a good and a bad solution.
  • Tasks have many valid solutions. Writing tests that accept all correct solutions and reject incorrect ones is harder than it sounds.

How it works

  • Apply → Pass qualification(s) → Join a project → Complete tasks → Get paid
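
To make the point about tests concrete, here is a minimal sketch of a check that accepts every correct solution and rejects an incorrect one. The task ("remove duplicates from a list, keeping first occurrences in order"), the function names, and the test cases are all hypothetical illustrations and are not taken from the actual project.

```python
from typing import Callable, List

Dedupe = Callable[[List[int]], List[int]]


def solution_with_set(items: List[int]) -> List[int]:
    """One valid approach: track seen values in a set."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def solution_with_dict(items: List[int]) -> List[int]:
    """Another valid approach: dicts preserve insertion order."""
    return list(dict.fromkeys(items))


def bad_solution(items: List[int]) -> List[int]:
    """Incorrect: removes duplicates but loses the original order."""
    return sorted(set(items))


def passes_lenient_check(dedupe: Dedupe) -> bool:
    """Too lenient: only checks that the result has no duplicates."""
    result = dedupe([3, 1, 3, 2, 1])
    return len(set(result)) == len(result)


def passes_behavioral_check(dedupe: Dedupe) -> bool:
    """Checks observable behavior only, so any correct implementation passes."""
    cases = [
        ([], []),
        ([3, 1, 3, 2, 1], [3, 1, 2]),
        ([1, 1, 1], [1]),
    ]
    return all(dedupe(list(inp)) == expected for inp, expected in cases)


if __name__ == "__main__":
    for fn in (solution_with_set, solution_with_dict, bad_solution):
        print(fn.__name__, passes_lenient_check(fn), passes_behavioral_check(fn))
```

In this sketch the lenient check lets the order-breaking solution through, while the behavioral check rejects it and still accepts both valid implementations. A check that instead demanded one specific implementation would be the opposite failure: too strict.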
