Freelance Agent Evaluation Engineer
About This Role
Overview
- We're building a dataset to evaluate AI coding agents, measuring how well a model handles real-world developer tasks.
- You'll create challenging tasks and evaluation criteria within realistic simulated environments:
  - Build virtual companies following a high-level plan: codebase, infrastructure, and context (conversations, documentation, tickets) that form a realistic environment with development history
  - Assemble and calibrate tasks from intermediate states of the virtual company: craft the prompt, define evaluation criteria, and ensure the task is solvable and the evaluation is fair
  - Design tasks set in isolated environments that emulate a developer's workstation: a Linux machine with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation, etc.), and a real web application codebase
  - Write tests that accept all correct solutions and reject incorrect ones, neither too strict (breaking on valid approaches) nor too lenient (passing bad ones)
  - Iterate with an AI agent on tests, verifying they catch real problems, don't miss bad solutions, and don't break on good ones
  - Review code written by agents, analyze why an agent failed or succeeded, and design edge cases and adversarial scenarios
  - Iterate based on feedback from expert QA reviewers who score your work on quality criteria
- A significant part of the work is done together with AI - it's very hard to create tasks that challenge frontier models without using frontier models.
Why this is hard
- Frontier models are already good at coding. Creating a task that genuinely challenges the best models is non-trivial. You need to deeply understand where models fail and what scenarios reveal the difference between a good and a bad solution.
- Tasks have many valid solutions. Writing tests that accept all correct solutions and reject incorrect ones is harder than it sounds.
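
To make the "too strict vs. too lenient" point concrete, here is a minimal sketch in Python. Everything in it is hypothetical and not taken from the actual project: the toy task (drop duplicate emails case-insensitively, keeping the first occurrence), the three candidate solutions, and the three checks are illustrative assumptions only.

```python
from typing import Callable, List

Solution = Callable[[List[str]], List[str]]

# Hypothetical task: remove duplicate emails case-insensitively,
# keeping the first occurrence exactly as it was written.

def solution_loop(emails: List[str]) -> List[str]:
    """Valid approach 1: explicit loop with a set of seen keys."""
    seen, out = set(), []
    for e in emails:
        key = e.lower()
        if key not in seen:
            seen.add(key)
            out.append(e)
    return out

def solution_dict(emails: List[str]) -> List[str]:
    """Valid approach 2: insertion-ordered dict keyed by lowercase email."""
    first = {}
    for e in emails:
        first.setdefault(e.lower(), e)
    return list(first.values())

def solution_buggy(emails: List[str]) -> List[str]:
    """Incorrect: de-duplicates case-sensitively."""
    return list(dict.fromkeys(emails))

def lenient_check(solve: Solution) -> bool:
    # Too lenient: only exercises exact duplicates, so the buggy solution passes.
    return solve(["a@x.io", "a@x.io"]) == ["a@x.io"]

def strict_check(solve: Solution) -> bool:
    # Too strict: demands lowercased output, which the task never asked for.
    return solve(["A@x.io"]) == ["a@x.io"]

def behavioral_check(solve: Solution) -> bool:
    # Checks only the behavior the task specifies, not how it is implemented.
    cases = [
        (["a@x.io", "A@x.io", "b@x.io"], ["a@x.io", "b@x.io"]),
        (["b@x.io", "a@x.io", "B@X.IO"], ["b@x.io", "a@x.io"]),
        ([], []),
    ]
    return all(solve(list(inp)) == expected for inp, expected in cases)
```

In this sketch, `behavioral_check` accepts both valid approaches and rejects the buggy one, `lenient_check` lets the buggy solution through, and `strict_check` rejects a valid solution. Real tasks in a full web application codebase would need far richer checks, but the same failure modes apply.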
How it works
- Apply → Pass qualification(s) → Join a project → Complete tasks → Get paid