🔄 This website is actively being updated - More info will be available soon

Can Large Language Models Replace Human Coders?

Introducing ContentBench
Czech University of Life Sciences Prague

ContentBench reveals that low-cost LLMs now achieve up to 99.8% accuracy on consensus-labeled posts at under $6 per 50,000 posts, fundamentally shifting the economics of research coding.

Dataset Creation Methodology

🎯

Stage 1: Generation

Models: GPT-5 & Gemini 2.5 Pro

Adversarial posts created using specialized prompts designed to challenge AI classification while remaining realistic.

View Prompt
👥

Stage 2: Verification

Panel: GPT-5, Gemini 2.5 Pro, & Claude Opus 4.1

Independent classification by three state-of-the-art models using identical prompts.

View Prompt

Stage 3: Consensus Filter

✔ 100% Agreement = Accepted

✖ Any Disagreement = Rejected
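The consensus rule above amounts to a simple unanimity check. A minimal sketch (the record shape and field names are illustrative assumptions, not the benchmark's actual code):

```python
def consensus_filter(candidates):
    """Keep only posts on which all three panel models agree.

    `candidates` is a list of dicts whose "labels" field holds the three
    independent classifications (from GPT-5, Gemini 2.5 Pro, and Claude
    Opus 4.1). Field names here are illustrative.
    """
    accepted, rejected = [], []
    for post in candidates:
        labels = post["labels"]
        if len(set(labels)) == 1:   # 100% agreement -> accepted
            post["reference_label"] = labels[0]
            accepted.append(post)
        else:                       # any disagreement -> rejected
            rejected.append(post)
    return accepted, rejected
```

Requiring unanimity trades dataset size for label reliability: only items all three models classify identically survive into the final 1,000 posts.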

🏆

Final Dataset (ResearchTalk v1.0)

1,000 posts

Core Split

500 posts across 5 categories (balanced)

Tests general classification ability.

Hard-Sarcasm Split

500 posts, all sarcastic_critique

Filtered to items that GPT‑3.5 Turbo misclassifies under the shared prompt. Tests sarcasm detection under adversarial conditions.
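The filtering step described above keeps only the items the screening model gets wrong. A sketch under those assumptions (`classify` stands in for a call to GPT-3.5 Turbo with the shared prompt; names are hypothetical):

```python
def hard_sarcasm_filter(items, classify):
    """Keep sarcastic_critique posts that the screening model
    misclassifies under the shared classification prompt.

    `classify` is assumed to return the model's predicted label
    for a post's text.
    """
    return [item for item in items
            if classify(item["text"]) != "sarcastic_critique"]
```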

Example Posts:

"Genuinely thrilled by Smith et al.'s Micro-gestural Synchrony Predicts Team Performance in Hybrid Meetings. The robustness is exceptional: the key association (r=0.19, p=0.048) persists across all specifications, provided participant 12 remains in the sample. The clarity of that boundary condition gives such confidence in generalizability. A masterclass in demonstrating stability by showing the effect disappears only when omitting the single highest-synchrony case. This kind of precision about where your result lives is exactly what the field needs."

Reference Label: sarcastic_critique

"Deeply impressed by Two-Minute Mindfulness Before Exams Improves STEM Performance: A Randomized Classroom Trial. Delivering a statistically significant 0.8-point gain on a 100-point test (p=0.049) with an effect size hovering around d=0.08, and replicating that magnitude across sections, is exactly the kind of robust, scalable impact we need. The authors’ discussion rightly highlights the sweeping curricular implications of such a consistently detectable improvement. A model of how marginal gains, carefully celebrated, can meaningfully reshape assessment policy."

Reference Label: sarcastic_critique

These are examples from our dataset - but you can try it yourself with the interactive quiz!

Model Performance Leaderboard

Leaderboard columns: Rank · Model · Combined Accuracy (%) · Sarcasm Recall (%) · $/50k posts

📊 Metrics

Accuracy: Combined agreement with reference labels across both splits: 500 core items (5 content categories) and 500 hard sarcasm items. Reported as (core correct + sarcasm correct) / 1,000.

Sarcasm Recall: Proportion of hard sarcasm items correctly classified as sarcastic_critique. This split specifically tests whether models can identify sarcasm when it mimics the surface form of other categories.
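Under these definitions, both metrics reduce to simple ratios. A minimal sketch (function and argument names are assumptions):

```python
def combined_accuracy(core_correct, sarcasm_correct, total=1000):
    """(core correct + sarcasm correct) / 1,000, as a percentage."""
    return 100 * (core_correct + sarcasm_correct) / total

def sarcasm_recall(predictions):
    """Share of hard-sarcasm items labeled sarcastic_critique,
    as a percentage. `predictions` holds one predicted label per item."""
    hits = sum(1 for p in predictions if p == "sarcastic_critique")
    return 100 * hits / len(predictions)
```

For example, a model with 480 correct core items and 450 correct hard-sarcasm items scores a combined accuracy of 93.0%.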

💰 Pricing & Access Information

Model Access: Google and OpenAI models tested via direct APIs without batch discounts. Other models accessed through OpenRouter.

Free Models: Gemma models show "N/A" for pricing as they were tested using Google's free API access.

Testing Period: Pricing and testing conducted in September 2025.

Important: Both Google and OpenAI offer batch API discounts which were not used in this benchmark. OpenRouter providers may offer promotional campaigns or temporary discounts. Prices can change significantly over time.

Recommendation: Always check current pricing directly with providers before making cost-based decisions.

Have a labeled dataset and codebook? Add it as a new ContentBench track.

Try Yourself: Interactive Classification Challenge

Think you can classify these posts as well as AI? Test your skills with actual ContentBench posts! Each set contains 10 posts - select the category you think best describes each one.

Ready to Test Your Classification Skills?

You'll see posts one at a time and need to classify them into one of 5 categories.

100 total posts (20 per category) divided into 10 sets of 10 posts

Contribute a Track (Dataset)

ContentBench is a benchmark suite. Each track is a versioned benchmark dataset with a fixed classification prompt, reference labels, and published evaluation results. Contributing a track makes your dataset a living standard used to evaluate new models over time.

What you get

  • Dual citation: Every paper using your track cites both ContentBench and your original dataset/paper.
  • Track page: Your dataset gets a versioned page (v1.0, v1.1, ...) on this website with its own leaderboard.
  • Results snapshot: Published evaluation results with documented model settings, so your dataset is continuously re-tested as new models appear.

What to send

  • Dataset with labeled items (any format: JSONL, CSV, TSV)
  • Label set (closed list of categories)
  • Classification prompt
  • Short description of how reference labels were produced (who coded, how many coders, how disagreements were resolved)
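As a concrete illustration, a single labeled item in JSONL form might look like this (a hypothetical record; field names are not prescribed, any consistent schema works):

```json
{"id": "core-0042", "text": "Genuinely thrilled by ...", "label": "sarcastic_critique"}
```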

How to submit

Email haman@pef.czu.cz with a link to your dataset (Zenodo, OSF, GitHub, or attachment) and a brief description. Accepted tracks are published with version identifiers and track-specific citation files.

FAQ

Can I contribute a non-English dataset?
Yes. Tracks can be in any language. The leaderboard will show per-language results.

What if my data are restricted (e.g., copyrighted text)?
Restricted tracks are possible. You can either provide the prompt, evaluation script, and run manifest for independent replication, or I can run the evaluation on your data directly. Either way, only aggregated results are published; the data themselves are not distributed.

What license do you need?
Any license that permits research use. CC-BY or CC-BY-SA are recommended. For restricted tracks, no license on the data is needed.

Do you accept inductive/qualitative coding schemes?
Currently, ContentBench focuses on deductive classification with predefined categories, but feel free to reach out if you have an inductive coding dataset you'd like to discuss.

Citation

@misc{haman2026contentbench,
  title     = {Can Large Language Models Replace Human Coders? Introducing ContentBench},
  author    = {Michael Haman},
  year      = {2026},
  eprint    = {2602.19467},
  archivePrefix = {arXiv},
  primaryClass = {cs.CY},
  url       = {https://arxiv.org/abs/2602.19467}
}