RLHF, data-annotation and evals pod for a UK AI safety lab scaling human feedback

Challenge

The lab was bottlenecked on human data: preference ranking, instruction-tuning labels, red-team prompts and rubric-based evals for a coding and reasoning model. Onshore specialist annotators were costing them roughly £30 to £35 an hour where they could find them at all, and quality on anonymous crowd platforms was inconsistent enough to poison training runs. They needed reviewers who could actually read code and reason about edge cases, plus a calibrated QA layer and an audit trail that would satisfy their own safety and governance team. Building that team directly meant payroll, management and tooling overhead they did not want to carry.

OSCABE approach

OSCABE built an AI Training Team of twelve specialists plus a dedicated QA lead (13 people in total), blending coding-literate reviewers from our India pool with strong-English domain evaluators based in Cairo and Amman for follow-the-sun coverage. Every reviewer cleared our five-step vetting, including a domain-specific work sample, and the QA lead ran inter-annotator agreement scoring and gold-set checks so the lab could trust the calibration. The pod plugged directly into the client's annotation stack and rubric, with a 4 to 6 hour daily overlap for live sessions to resolve ambiguous prompts. It ran as a Managed Remote Team under one UK contract, compliant with UK GDPR, with full visibility into who was labelling what.

Outcome

The lab roughly tripled its weekly throughput of high-quality preference and eval data within about a month of ramp, and brought cost per annotation hour down by roughly 55% against their onshore baseline. Inter-annotator agreement settled into a range their research team was comfortable feeding into training, and the audit trail cleared internal governance review. The pod has since taken on red-teaming and domain-expert evaluation for a second model, and the lab treats the QA lead as an extension of their own data team. Turnaround on a typical eval batch dropped from about two weeks to a few days.

Inside the engagement

The full evidence: team, timeline, stack, vetting, security, costs and before/after metrics.

The problem

The client, a UK AI safety lab, was bottlenecked on human data for a coding and reasoning model: preference ranking, instruction-tuning labels, red-team prompts and rubric-based evaluations. Onshore specialist annotators cost roughly GBP 30 to GBP 35 an hour where they could be found at all, and anonymous crowd platforms produced quality inconsistent enough to poison training runs. The lab needed reviewers who could genuinely read code and reason about edge cases, a calibrated QA layer, and an audit trail their internal safety and governance team would accept. Building an operation of that kind in-house meant payroll, management and tooling overhead they did not want to carry.

Team composition

OSCABE stood up a 13-person AI training team built for coverage and calibration:

1 QA lead (8 yrs, ran inter-annotator agreement scoring and gold-set checks)
5 coding-literate reviewers (4 to 7 yrs, able to read and reason about real code)
4 domain evaluators (strong-English reasoning and rubric specialists)
2 red-team prompt writers (adversarial and safety-focused)
1 data and tooling coordinator (kept the annotation stack and exports clean)

Reviewers were split across India, Cairo and Amman for follow-the-sun coverage, with a 4 to 6 hour overlap window for live disambiguation.

Timeline

Ramp to high-quality throughput took five weeks:

Week 1: QA lead and first reviewers onboarded onto the client's rubric and annotation stack; gold set established.
Weeks 2 to 3: full pod live; calibration rounds run against the gold set until inter-annotator agreement reached the lab's target range.
Weeks 4 to 5: throughput scaled to roughly triple the lab's prior weekly volume, with the audit trail reviewed and cleared by internal governance.
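
The calibration rounds in weeks 2 to 3 can be pictured as a simple gate: score each reviewer against the gold set and send anyone below target back for another round. The sketch below is illustrative only. The function names, the dictionary layout and the 0.8 target are assumptions made for the example; the lab's actual target range is not stated in this case study.

```python
def gold_set_accuracy(reviewer_labels, gold_labels):
    """Fraction of gold items the reviewer labelled the same way as the gold set.

    Both arguments map item id -> label. Gold items the reviewer has not
    labelled yet are skipped rather than counted as misses.
    """
    scored = [item for item in gold_labels if item in reviewer_labels]
    if not scored:
        return 0.0
    hits = sum(reviewer_labels[item] == gold_labels[item] for item in scored)
    return hits / len(scored)


def needs_recalibration(accuracy_by_reviewer, target):
    """Return the reviewer ids whose gold-set accuracy is below the target."""
    return sorted(r for r, acc in accuracy_by_reviewer.items() if acc < target)


# Hypothetical data: four gold items and two reviewers.
gold = {"p1": "A", "p2": "B", "p3": "A", "p4": "B"}
submissions = {
    "reviewer_1": {"p1": "A", "p2": "B", "p3": "A", "p4": "B"},
    "reviewer_2": {"p1": "A", "p2": "A", "p3": "A", "p4": "A"},
}
accuracy = {r: gold_set_accuracy(labels, gold) for r, labels in submissions.items()}
print(needs_recalibration(accuracy, target=0.8))  # hypothetical 0.8 target
```

In practice a gate like this would sit alongside chance-corrected agreement scores rather than replace them, since raw accuracy can look healthy on a skewed label distribution.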

Tech stack

The pod worked inside the client's own annotation and RLHF tooling (a Label Studio-style interface with custom rubric plugins), backed by Python data pipelines, Jupyter for spot analysis, and structured JSONL exports feeding the lab's training and eval harness. Agreement scoring used Krippendorff's alpha and Cohen's kappa against versioned gold sets.
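
For readers unfamiliar with agreement scoring, here is a minimal sketch of Cohen's kappa for one reviewer against a gold set, written in plain Python. It is not the pod's actual pipeline: the function name and sample labels are assumptions, and Krippendorff's alpha, which handles more than two raters and missing labels, is not shown.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both raters pick the same category
    # if each labelled independently at their own observed rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical preference labels: which of two model outputs is better.
reviewer = ["A", "B", "A", "A", "B", "A"]
gold = ["A", "B", "B", "A", "B", "A"]
print(round(cohens_kappa(reviewer, gold), 3))
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, which is why it is a more conservative signal than a simple match rate.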

How OSCABE vetted the team

Every reviewer cleared OSCABE's five-step vetting. An instant AI shortlist filtered for coding literacy and reasoning ability; a senior OSCABE specialist ran a technical interview with a live, domain-specific work sample (reading code, ranking model outputs, applying a sample rubric); a communication assessment confirmed clear written rationale in English; and background and reference checks validated history. For reviewers handling code-correctness judgements we additionally confirmed engineering credentials. The QA lead was selected specifically for prior experience running calibrated annotation programmes.

What was delivered

Within about a month the lab roughly tripled its weekly volume of high-quality preference and eval data, with inter-annotator agreement settling into a range the research team was comfortable feeding into training. Turnaround on a typical eval batch dropped from about two weeks to a few days. The pod later took on red-teaming and domain-expert evaluation for a second model, and produced a documented, auditable record of who labelled what against which rubric version.
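
To make the idea of an auditable record concrete, the sketch below shows one way a label could be serialised to JSONL so that reviewer, item and rubric version travel together. Every field name here is hypothetical; the client's real export schema is not described in this case study.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LabelRecord:
    # All field names are illustrative assumptions, not the client's schema.
    item_id: str
    reviewer_id: str
    rubric_version: str
    label: str
    rationale: str
    labelled_at: str  # ISO 8601 timestamp, UTC


def to_jsonl(records):
    """Serialise records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def labels_by_rubric_version(jsonl_text, rubric_version):
    """Return (item_id, reviewer_id) pairs labelled under a given rubric version."""
    rows = (json.loads(line) for line in jsonl_text.splitlines() if line.strip())
    return [
        (row["item_id"], row["reviewer_id"])
        for row in rows
        if row["rubric_version"] == rubric_version
    ]


records = [
    LabelRecord("item-001", "rev-07", "v1.2", "A", "Output A handles the empty-list case.", "2024-01-15T09:30:00Z"),
    LabelRecord("item-002", "rev-03", "v1.3", "B", "Output B follows the rubric on tone.", "2024-01-16T11:05:00Z"),
]
export = to_jsonl(records)
print(labels_by_rubric_version(export, "v1.3"))
```

Keeping the rubric version on every row means a later rubric change never has to be inferred from timestamps when someone audits a batch.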

Client workflow and collaboration

The QA lead acted as an extension of the lab's own data team, running daily calibration check-ins during the overlap window and escalating ambiguous prompts to the lab's researchers for adjudication. Rubric changes were versioned and rolled out with a re-calibration pass so quality never drifted silently. Weekly throughput and agreement reports went to the Head of Human Data.

Tools used

The client's annotation platform; Slack and Notion for guidelines and disambiguation threads; Google Meet for live calibration sessions; Git for rubric and guideline version control; and dashboards tracking throughput, agreement and reviewer-level quality.

Security and compliance model

The engagement ran under a single UK contract, with ISO 27001-aligned controls and UK GDPR compliance. Reviewers accessed the annotation stack through SSO with least-privilege, project-scoped permissions on managed devices. Any personal data in prompts was handled under documented data-handling rules, with cross-border transfers covered by the UK IDTA and a transfer risk assessment. All reviewers signed NDAs, and labelled data and derived IP were assigned to the client. Full reviewer-level audit logging satisfied the lab's governance requirements.

Cost comparison

Item	Onshore baseline	OSCABE pod
Effective cost per annotation hour	around GBP 32	around GBP 14
13-FTE-equivalent annual run	around GBP 780,000	around GBP 350,000
Reduction per annotation hour	n/a	roughly 55%
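
The reduction figure can be checked directly from the rounded numbers in the table. This is a small arithmetic sketch; the function name is ours, and the inputs are the approximate values quoted above, so the results are approximate too.

```python
def percent_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline * 100


hourly = percent_reduction(32, 14)            # GBP per annotation hour
annual = percent_reduction(780_000, 350_000)  # GBP per 13-FTE-equivalent year

print(f"hourly: {hourly:.1f}%  annual: {annual:.1f}%")
```

Both figures land in the mid-fifties, which is consistent with the "roughly 55%" quoted for the engagement.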

Before and after

Metric	Before	After
Cost per annotation hour	around GBP 32	around GBP 14
Weekly high-quality volume	baseline	roughly 3x
Eval batch turnaround	about 2 weeks	a few days
Inter-annotator agreement	inconsistent	within research target
Governance audit	unmet	cleared

What OSCABE managed vs what the client managed

OSCABE managed:

Recruiting, five-step vetting, employment and payment of all 13 specialists.
Day-to-day team management, calibration discipline and follow-the-sun coverage.
Access controls, managed devices and the compliance baseline.

The client managed:

The rubric, guidelines and what counted as a correct judgement.
Adjudication of escalated ambiguous cases and acceptance of agreement thresholds.
Use of the labelled data in their own training and eval harness.

Why remote worked

Annotation quality is driven by calibration and reviewer judgement, not co-location. A shared rubric, versioned gold sets and a daily live disambiguation window during the overlap gave the lab tighter agreement than the anonymous crowd platforms had delivered. Follow-the-sun coverage across India, Cairo and Amman meant batches moved overnight, turning a two-week cycle into a few days, while a single UK contract and full audit logging kept the whole operation governable from London.

“The difference was reviewers who could actually reason about our rubric, not just click through it. We got research-grade human feedback at roughly half the cost, with a QA layer our safety team could audit, and we scaled it in weeks.”

- Head of Human Data, AI safety lab (UK)