Think you can classify these posts as well as AI? Test your skills with actual ContentBench posts! Each set contains 10 posts; select the category you think best describes each one.
Generation models: GPT-5 & Gemini 2.5 Pro
Adversarial posts were created using specialized prompts designed to challenge AI classification while remaining realistic.
Verification panel: GPT-5, Gemini 2.5 Pro, & Claude Opus 4.1
Each post was classified independently by the three state-of-the-art models using identical prompts.
✔ 100% agreement = Accepted
✖ Any disagreement = Rejected
Final dataset: 1,000 posts
Double-validated consensus: each post's label was agreed upon unanimously in two independent verification runs.
The dataset covers 5 distinct categories: sarcastic critique, genuine critique, genuine praise, neutral query, and procedural statement.
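The acceptance rule above can be summarized in a few lines. Here is a minimal sketch of the double-validated consensus filter, assuming each verification run produces one label per panel model; the function names and data layout are hypothetical, not the benchmark's actual pipeline code.

```python
from __future__ import annotations

PANEL = ["gpt-5", "gemini-2.5-pro", "claude-opus-4.1"]  # verification models
CATEGORIES = {
    "sarcastic_critique", "genuine_critique", "genuine_praise",
    "neutral_query", "procedural_statement",
}

def run_label(labels_by_model: dict[str, str]) -> str | None:
    """Return the unanimous label for one verification run, or None on any disagreement."""
    labels = {labels_by_model[m] for m in PANEL}
    if len(labels) == 1 and labels <= CATEGORIES:
        return labels.pop()
    return None  # any disagreement -> rejected

def accept_post(run1: dict[str, str], run2: dict[str, str]) -> str | None:
    """Double-validated consensus: the same unanimous label in both independent runs."""
    first, second = run_label(run1), run_label(run2)
    return first if first is not None and first == second else None

# Example: a post that would be kept in the dataset
votes = {"gpt-5": "sarcastic_critique",
         "gemini-2.5-pro": "sarcastic_critique",
         "claude-opus-4.1": "sarcastic_critique"}
print(accept_post(votes, votes))  # -> sarcastic_critique
```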
Example Posts:
"Genuinely thrilled by Smith et al.'s Micro-gestural Synchrony Predicts Team Performance in Hybrid Meetings. The robustness is exceptional: the key association (r=0.19, p=0.048) persists across all specifications, provided participant 12 remains in the sample. The clarity of that boundary condition gives such confidence in generalizability. A masterclass in demonstrating stability by showing the effect disappears only when omitting the single highest-synchrony case. This kind of precision about where your result lives is exactly what the field needs."
Reference Label: sarcastic_critique
"Delighted by Ambient Citrus Scent and Cooperative Behavior in Open-Plan Offices: A Randomized Cross-Over Trial. The authors deliver decisive evidence, with cooperation rising from 50.0% to 50.7% (d=0.12, p=0.047) across 62 office-days averaged over two sites during routine operations, convincingly establishing scent as a lever for organizational design. It is hard to overstate how transformative a 0.7 percentage point shift can be when interpreted with such confidence. A model of how minimal effects can illuminate big questions."
Reference Label: sarcastic_critique
These are examples from our dataset, but you can try it yourself with the interactive quiz!
| Rank | Model | Accuracy (%) | Posts per $1 |
|---|---|---|---|
Model Access: Google and OpenAI models were tested via direct APIs without batch discounts; other models were accessed through OpenRouter.
Free Models: Gemma models show "N/A" for pricing because they were tested using Google's free API access.
Testing Period: Pricing and testing were conducted in September 2025.
Important: Both Google and OpenAI offer batch API discounts which were not used in this benchmark. OpenRouter providers may offer promotional campaigns or temporary discounts. Prices can change significantly over time.
Recommendation: Always check current pricing directly with providers before making cost-based decisions.
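For readers wondering how the "Posts per $1" column relates to per-token pricing, here is a rough illustration. The prices and token counts below are hypothetical and are not ContentBench's actual accounting; they only show the shape of the calculation.

```python
def posts_per_dollar(
    input_price_per_mtok: float,   # USD per 1M input tokens
    output_price_per_mtok: float,  # USD per 1M output tokens
    avg_input_tokens: int,         # prompt + post, per classification
    avg_output_tokens: int,        # model reply, per classification
) -> float:
    """Estimate how many posts can be classified for one US dollar."""
    cost_per_post = (
        avg_input_tokens * input_price_per_mtok / 1_000_000
        + avg_output_tokens * output_price_per_mtok / 1_000_000
    )
    return 1.0 / cost_per_post

# Hypothetical: $1.25 / $10.00 per 1M tokens, ~900 input and ~20 output tokens per post
print(round(posts_per_dollar(1.25, 10.00, 900, 20)))  # ~755 posts per $1
```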
Dataset Creation: GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1 were used as verification models to create the consensus dataset.
Methodological Note: These models necessarily achieve 100% accuracy because only posts where all three models agreed were included in the final dataset. Consider this when comparing their performance to other models.
Ready to test yourself against the models? In the interactive quiz, you'll see posts one at a time and classify each into one of the 5 categories.
The quiz uses 100 posts in total (20 per category), divided into 10 sets of 10.
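The page does not say how posts are assigned to sets; the sketch below assumes each set is balanced with 2 posts per category, and the function and field names are hypothetical rather than the site's actual code.

```python
import random

CATEGORIES = ["sarcastic_critique", "genuine_critique", "genuine_praise",
              "neutral_query", "procedural_statement"]

def build_quiz_sets(posts: list[dict], n_sets: int = 10, per_category: int = 20, seed: int = 0):
    """Split 100 labeled posts (20 per category) into 10 sets of 10,
    giving every set 2 posts from each category."""
    rng = random.Random(seed)
    by_cat = {c: [p for p in posts if p["label"] == c] for c in CATEGORIES}
    for c in CATEGORIES:
        assert len(by_cat[c]) == per_category, f"expected {per_category} posts for {c}"
        rng.shuffle(by_cat[c])
    per_set = per_category // n_sets  # 2 posts from each category per set
    sets = []
    for i in range(n_sets):
        chunk = [p for c in CATEGORIES
                 for p in by_cat[c][i * per_set:(i + 1) * per_set]]
        rng.shuffle(chunk)  # mix categories within the set
        sets.append(chunk)
    return sets

# quiz_sets = build_quiz_sets(all_posts)  # -> 10 sets of 10 posts each
```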
After finishing a set, you'll see a summary screen, for example: "Set Complete! Your accuracy: 70%. AI average on this set: 85%."