
ContentBench: Testing LLMs for Content Analysis

Can We Trust Large Language Models to Replace Human Coders?
Czech University of Life Sciences Prague

ContentBench reveals that low-cost LLMs now achieve ~90% accuracy on double-validated consensus posts while processing thousands of posts per dollar, fundamentally shifting the economics of content analysis research.

Dataset Creation Methodology

🎯 Stage 1: Generation

Models: o3 & Gemini 2.5 Pro

Adversarial posts created using specialized prompts designed to challenge AI classification while remaining realistic.

👥 Stage 2: Verification

Panel: o3, Gemini 2.5 Pro, & Claude Opus 4

Independent classification by three state-of-the-art models using identical prompts.

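As an illustration of this verification step, here is a minimal sketch that sends one identical classification prompt to each panel model and collects the returned label. It is not the benchmark's actual code: the model identifiers, the OpenRouter-style endpoint, and the CLASSIFICATION_PROMPT constant are placeholders (the benchmark itself used direct Google and OpenAI APIs where noted in the pricing section).

# Minimal sketch of Stage 2 (Python, openai>=1.0 client).
# Model identifiers, endpoint, and prompt text are placeholders.
from openai import OpenAI

PANEL = ["openai/o3", "google/gemini-2.5-pro", "anthropic/claude-opus-4"]  # hypothetical IDs
CLASSIFICATION_PROMPT = "Classify the post into exactly one of the 7 categories ..."  # placeholder

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

def classify_post(model: str, post_text: str) -> str:
    """Ask one verification model for a single category label."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": CLASSIFICATION_PROMPT},
            {"role": "user", "content": post_text},
        ],
    )
    return resp.choices[0].message.content.strip().lower()

def panel_labels(post_text: str) -> list[str]:
    """Collect one label from each of the three panel models."""
    return [classify_post(model, post_text) for model in PANEL]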

Stage 3: Consensus Filter

✔ 100% Agreement = Accepted

✖ Any Disagreement = Rejected
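In code, the acceptance rule for a single verification run is a unanimity check over the three labels. A minimal sketch, reusing the hypothetical panel_labels helper from the Stage 2 sketch:

def consensus_label(post_text: str) -> str | None:
    """Return the shared label if all three panel models agree, else None."""
    labels = panel_labels(post_text)                        # one label per verification model
    return labels[0] if len(set(labels)) == 1 else None     # any disagreement -> rejected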

🏆 Final Dataset

117 posts

Double-validated consensus: posts on which the verification panel agreed unanimously in two independent verification runs.

The dataset covers 7 distinct categories: sarcastic critique, implied misconduct, absurdity as critique, legitimate scientific practice, genuine praise, genuine critique, and neutral query.
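A sketch of how the double-validated set could be assembled is shown below. The underscore-style label strings mirror the ground-truth labels in the example posts that follow, and the requirement that both independent runs return the same label is an assumption made for illustration.

# The 7 ground-truth categories as underscore-style labels
# (exact label strings are assumed from the example posts below).
CATEGORIES = {
    "sarcastic_critique", "implied_misconduct", "absurdity_as_critique",
    "legitimate_scientific_practice", "genuine_praise", "genuine_critique",
    "neutral_query",
}

def double_validated(posts: list[str]) -> list[tuple[str, str]]:
    """Keep a post only if two independent verification runs are each
    unanimous and return the same valid category label."""
    accepted = []
    for post in posts:
        first, second = consensus_label(post), consensus_label(post)
        if first is not None and first == second and first in CATEGORIES:
            accepted.append((post, first))
    return accepted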

Example Posts:

"A groundbreaking study! The authors have achieved a p-value of 0.049 with a sample size of 8. This is the kind of robust, reproducible science we need more of."

Ground Truth: sarcastic_critique

"The sample size of n=10 is perfect. Large enough to run the stats, small enough to not be representative of anything. A masterclass in research design."

Ground Truth: sarcastic_critique

These are examples from our dataset - but you can try it yourself with the interactive quiz!

Model Performance Leaderboard

Leaderboard columns: Rank | Model | Accuracy (%) | Posts per $1
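Both leaderboard metrics are straightforward to compute. The sketch below shows the arithmetic with made-up token counts and prices; these are illustrative values only, not ContentBench measurements.

def accuracy_pct(predictions: list[str], ground_truth: list[str]) -> float:
    """Accuracy (%) = share of the 117 posts given the correct category label."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return 100.0 * correct / len(ground_truth)

def posts_per_dollar(avg_in_tokens: float, avg_out_tokens: float,
                     price_in_per_m: float, price_out_per_m: float) -> float:
    """Posts per $1 = 1 / (average cost of classifying one post)."""
    cost_per_post = (avg_in_tokens * price_in_per_m +
                     avg_out_tokens * price_out_per_m) / 1_000_000
    return 1.0 / cost_per_post

# Illustrative numbers only: ~400 input and ~20 output tokens per post at
# $0.10 / $0.40 per million tokens comes to roughly 20,800 posts per $1.
print(round(posts_per_dollar(400, 20, 0.10, 0.40)))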

💰 Pricing & Access Information

Model Access: Google and OpenAI models tested via direct APIs without batch discounts. Other models accessed through OpenRouter.

Free Models: Gemma models show "N/A" for pricing as they were tested using Google's free API access.

Testing Period: Pricing and testing conducted in mid-July 2025.

Important: Both Google and OpenAI offer batch API discounts, which were not used in this benchmark. OpenRouter providers may offer promotional campaigns or temporary discounts. Prices can change significantly over time.

Recommendation: Always check current pricing directly with providers before making cost-based decisions.

📊 Verification Model Information

Dataset Creation: o3 (High Reasoning), Claude Opus 4, and Gemini 2.5 Pro were used as verification models to create the consensus dataset.

Methodological Note: These models necessarily achieve 100% accuracy because only posts where all three models agreed were included in the final dataset. Consider this when comparing their performance to other models.

Try Yourself: Interactive Classification Challenge

Think you can classify these posts as well as AI? Test your skills with actual ContentBench posts! Each set contains up to 10 posts - select the category you think best describes each one.

Ready to Test Your Classification Skills?

You'll see posts one at a time and need to classify them into one of 7 categories.

117 total posts divided into 11 sets of 10 posts + 1 set of 7 posts
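A small sketch of how the quiz sets could be split (the function and variable names are placeholders):

def make_quiz_sets(posts: list[str], set_size: int = 10) -> list[list[str]]:
    """Split posts into consecutive sets of `set_size`; the final set keeps
    the remainder (117 posts -> 11 sets of 10 + 1 set of 7)."""
    return [posts[i:i + set_size] for i in range(0, len(posts), set_size)]

sets = make_quiz_sets([f"post_{i}" for i in range(117)])
assert len(sets) == 12 and len(sets[-1]) == 7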

Citation

@article{contentbench2025,
  author    = {Michael Haman},
  title     = {ContentBench: Can We Trust Large Language Models for Content Analysis and Replace Human Coders?},
  journal   = {Journal Name},
  year      = {2025},
}