Concept-pedia: Breaking Free from ImageNet's Shadow in Multimodal AI
TL;DR - Why You Should Care
Concept-pedia is a massive multimodal dataset containing over 165,000 semantically-annotated concepts that exposes critical limitations in modern vision-language models.
Key contributions:
- 165K+ concepts from BabelNet with rich semantic annotations
- Concept-10k: A manually-curated benchmark of 10,000 diverse visual concepts
- Reveals critical gaps: State-of-the-art models struggle with diverse concepts beyond ImageNet
- Open resource: Freely available for research and evaluation
The problem: Current multimodal evaluations are “heavily anchored to ImageNet” - models that excel on traditional benchmarks fail dramatically on real-world visual diversity.
The ImageNet Problem
For over a decade, ImageNet has been the gold standard for evaluating computer vision systems. Its 1,000 carefully curated categories have shaped how we build and evaluate visual AI.
But here’s the issue: the real world has far more than 1,000 visual concepts.
When you ask modern vision-language models about concepts outside this narrow ImageNet distribution, performance drops dramatically. Models that claim “human-level” performance on benchmarks struggle with everyday visual concepts.
Our EMNLP 2025 Research
I’m thrilled to share our paper “Concept-pedia: A Wide-coverage Semantically-annotated Multimodal Dataset”, published at EMNLP 2025 - the Conference on Empirical Methods in Natural Language Processing.
Authors: Karim Ghonim, Andrei Stefan Bejgu, Alberte Fernández-Castro, Roberto Navigli
Affiliations: Sapienza University of Rome & Babelscape
📄 Read the paper on the ACL Anthology: https://aclanthology.org/2025.emnlp-main.1745/
📊 Download PDF
What Makes Concept-pedia Different?
Massive Scale with Semantic Structure
165,000+ concepts organized through BabelNet’s rich semantic network:
- Not just labels: Every concept has definitions, relations, and multilingual support
- Semantic hierarchy: Concepts are organized by meaningful relationships
- Real-world diversity: Far beyond ImageNet’s 1K categories
Concept-10k: A Rigorous Benchmark
We created Concept-10k, a manually-curated evaluation benchmark:
- 10,000 diverse visual concepts
- Human-verified quality
- Balanced across semantic categories
- Designed to test real-world generalization
Semantic Annotations
Unlike raw image-text pairs, every concept includes:
- Precise definitions
- Semantic relationships (hypernymy, meronymy, etc.)
- Multilingual coverage
- Structured knowledge graph integration (an illustrative concept record is sketched below)
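To make this concrete, here is a minimal sketch of what a single concept record could look like. The field names, the BabelNet ID, and all the values are illustrative assumptions, not the dataset’s actual schema.

```python
# Illustrative sketch of one semantically-annotated concept record.
# Field names, the BabelNet ID, and all values are hypothetical; consult the
# released dataset for the actual schema.
concept_record = {
    "concept_id": "bn:00000000n",          # placeholder BabelNet synset ID
    "lemma": "takoyaki pan",
    "definition": "A griddle with hemispherical molds used to cook takoyaki.",
    "relations": {
        "hypernyms": ["cookware"],          # is-a
        "meronyms": ["mold", "handle"],     # part-of
    },
    "translations": {"it": "piastra per takoyaki", "ja": "たこ焼き器"},
    "images": ["images/takoyaki_pan_001.jpg", "images/takoyaki_pan_002.jpg"],
}

print(concept_record["lemma"], "->", concept_record["definition"])
```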
The ImageNet Anchor Problem
Our experiments reveal a critical issue: modern vision-language models are heavily anchored to ImageNet.
Performance Drop Beyond ImageNet
When we evaluate state-of-the-art models on Concept-10k:
| Model | ImageNet (%) | Concept-10k (%) | Drop (points) |
|---|---|---|---|
| CLIP (ViT-L/14) | 75.5 | 42.3 | -33.2 |
| ALIGN | 76.4 | 43.8 | -32.6 |
| OpenCLIP | 78.2 | 45.1 | -33.1 |
Performance drops by over 30 points when tested on diverse concepts!
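If you want to run this kind of check on your own concept lists, here is a minimal zero-shot scoring sketch using the Hugging Face transformers CLIP implementation. The concept list, image path, and prompt template are placeholders, and the paper’s exact evaluation protocol (candidate sets, prompt ensembling, metrics) may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Score one image against a small set of candidate concept names (zero-shot).
# The image path and the candidate list are placeholders.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

concepts = ["Allen wrench", "screwdriver", "wrench", "key"]
prompts = [f"a photo of a {c}" for c in concepts]      # simple prompt template
image = Image.open("allen_wrench.jpg")                 # placeholder path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores, one per prompt.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for concept, p in zip(concepts, probs.tolist()):
    print(f"{concept:>15s}  {p:.3f}")
```

Accuracy on a benchmark like Concept-10k is then simply the fraction of images whose top-scoring prompt matches the gold concept.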
Why Does This Happen?
- Training data bias: Most vision-language models are trained on data distributions similar to ImageNet
- Evaluation myopia: We’ve been testing on the same narrow distribution for years
- False sense of progress: High benchmark scores don’t reflect real-world capability
Real-World Examples
Let’s see where models fail:
Example 1: Specialized Tools
Concept: “Allen wrench” (a hex key)
- Human: Easily recognizes the L-shaped tool
- CLIP: Confuses with “wrench”, “screwdriver”, “key”
- Why it fails: Too specific, not in ImageNet’s 1K categories
Example 2: Fine-grained Animals
Concept: “Bombay cat” (a specific cat breed)
- Human: Recognizes the sleek black coat
- Model: Just says “cat” or “black cat”
- Why it fails: ImageNet includes only a handful of broad cat classes (e.g., “Egyptian cat”, “Siamese cat”) and lacks most fine-grained breeds
Example 3: Cultural Objects
Concept: “Takoyaki pan” (Japanese cooking equipment)
- Human: Recognizes the specialized griddle with hemispherical molds
- Model: Confuses with “pan”, “griddle”, “muffin tin”
- Why it fails: Cultural specificity beyond Western-centric training data
These aren’t edge cases - they’re everyday objects that humans recognize instantly.
How We Built Concept-pedia
Concept Selection from BabelNet
We leveraged BabelNet, the largest multilingual semantic network:
- Started from BabelNet’s full synset inventory
- Selected 165,000+ concepts with clear visual representations
- Ensured semantic diversity across categories
- Maintained rich semantic annotations for every selected concept (a minimal selection sketch follows below)
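As an illustration of what such a selection pass might look like in code, here is a minimal sketch. It assumes a hypothetical `iter_synsets()` access layer and made-up record fields (`pos`, `gloss`, `domains`, `image_urls`); the actual filtering criteria and BabelNet tooling used for Concept-pedia are not described in this post.

```python
from typing import Dict, Iterable, Iterator, List

Synset = Dict[str, object]  # simplified stand-in for a BabelNet synset record


def iter_synsets() -> Iterator[Synset]:
    """Placeholder for a real BabelNet access layer (REST API, dump, or SDK)."""
    raise NotImplementedError


def is_visual_concept(synset: Synset, min_images: int = 3) -> bool:
    """Keep noun synsets that have a definition and enough associated images."""
    return (
        synset.get("pos") == "NOUN"
        and bool(synset.get("gloss"))
        and len(synset.get("image_urls", [])) >= min_images
    )


def select_concepts(synsets: Iterable[Synset], per_domain_cap: int = 5000) -> List[Synset]:
    """Filter for visual concepts while capping each domain to preserve diversity."""
    selected: List[Synset] = []
    per_domain: Dict[str, int] = {}
    for s in synsets:
        if not is_visual_concept(s):
            continue
        domain = (s.get("domains") or ["misc"])[0]
        if per_domain.get(domain, 0) >= per_domain_cap:
            continue  # skip over-represented domains
        per_domain[domain] = per_domain.get(domain, 0) + 1
        selected.append(s)
    return selected
```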
Image Collection
For each concept:
- Queries to multiple image sources
- Automatic quality filtering
- Diversity checks (pose, lighting, context)
- Deduplication of visually near-identical images (a rough sketch of this step follows below)
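One plausible way to implement the deduplication step is perceptual hashing. The sketch below uses the Pillow and imagehash libraries with an arbitrarily chosen hash-distance threshold; it illustrates the idea rather than the pipeline actually used for Concept-pedia.

```python
from pathlib import Path

import imagehash          # pip install imagehash
from PIL import Image     # pip install Pillow


def deduplicate(image_dir: str, max_distance: int = 5) -> list[Path]:
    """Keep one image per perceptual-hash cluster.

    An image whose pHash is within `max_distance` bits of an already-kept
    image is treated as a near-duplicate and dropped. The threshold is an
    arbitrary illustrative choice.
    """
    kept_paths: list[Path] = []
    kept_hashes: list[imagehash.ImageHash] = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        if any(h - other <= max_distance for other in kept_hashes):
            continue  # near-duplicate of something we already kept
        kept_hashes.append(h)
        kept_paths.append(path)
    return kept_paths
```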
Manual Curation for Concept-10k
For the evaluation benchmark:
- Human verification: Expert annotators verified every image
- Quality control: Multiple rounds of checking
- Balance: Ensured representation across semantic categories
- Difficulty calibration: Mix of easy, medium, hard examples
Key Findings
The ImageNet Anchor Effect is Real
Models perform dramatically worse on concepts outside ImageNet’s distribution, even when those concepts aren’t more visually complex.
Semantic Annotations Help
Models that leverage semantic structure (like our approach) generalize better than models that rely on vision-language pretraining alone.
Fine-grained Understanding Lags
Modern models struggle most with:
- Fine-grained categories (specific breeds, species, types)
- Cultural specificity (region-specific objects)
- Specialized domains (medical instruments, technical equipment)
More Data Isn’t Enough
Simply scaling up training data on similar distributions doesn’t solve the problem. We need semantic diversity, not just more examples.
Why This Matters for Practitioners
If you’re building multimodal AI systems, you need to know:
Benchmark Scores Can Mislead
A model with 80% accuracy on ImageNet might have 45% accuracy on real-world visual concepts. Test on diverse data, not just standard benchmarks.
Domain Adaptation is Critical
If your application involves:
- Specialized domains (medical, industrial, etc.)
- Cultural diversity
- Fine-grained recognition
then standard vision-language models will likely underperform. Consider:
- Domain-specific fine-tuning
- Semantic augmentation with knowledge graphs
- Evaluation on relevant concepts, not just ImageNet
Semantic Structure Helps
Incorporating semantic knowledge (definitions, relationships, hierarchies) improves generalization. Don’t rely purely on image-text correlation.
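As a concrete example of semantic augmentation, one simple option is to append each concept’s definition to its text prompt before encoding it with CLIP. This is a sketch of the general idea rather than the specific method evaluated in the paper; the glosses below are placeholder text.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Definition-augmented prompts: pair each concept name with a short gloss so
# the text encoder sees more than a bare label. Glosses here are placeholders.
concept_glosses = {
    "Bombay cat": "a breed of short-haired cat with a sleek, all-black coat",
    "takoyaki pan": "a griddle with hemispherical molds used to cook takoyaki",
}


def build_prompt(name: str, gloss: str) -> str:
    return f"a photo of a {name}, which is {gloss}"


model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

prompts = [build_prompt(name, gloss) for name, gloss in concept_glosses.items()]
inputs = processor(text=prompts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    text_embeddings = model.get_text_features(**inputs)

# L2-normalize so cosine similarity to image embeddings is a plain dot product.
text_embeddings = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
print(text_embeddings.shape)  # (num_concepts, embedding_dim)
```

Whether this helps depends on gloss quality, and CLIP’s text encoder has a 77-token context limit, so long definitions need truncation.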
The Research Impact
Concept-pedia enables:
Better Evaluation
Researchers can now test models on 165K+ concepts beyond ImageNet, getting a true sense of generalization capability.
Improved Training
Use Concept-pedia’s semantic annotations to train models with better conceptual understanding.
Diagnostic Analysis
Concept-10k allows detailed analysis of where and why models fail, guiding future improvements.
Multilingual Extension
BabelNet’s multilingual nature enables extension to non-English visual understanding.
What We’re Working On Next
We’re actively developing:
- Concept-pedia v2: Expanding to 500K+ concepts
- Video concepts: Temporal understanding beyond static images
- 3D concepts: Spatial reasoning and object understanding
- Interactive evaluation platform: Test your models on Concept-10k
- Semantic-aware training methods: Leveraging structure for better learning
The Bigger Picture
Concept-pedia is part of a broader movement toward semantic understanding in AI.
We’ve spent a decade optimizing for ImageNet. It’s time to:
- Evaluate on real diversity - not just benchmark performance
- Incorporate semantic knowledge - not just pattern matching
- Test on long-tail concepts - not just common categories
- Build for the real world - not just academic benchmarks
The Research Team
This work was a collaborative effort:
- Karim Ghonim (Lead - Sapienza University)
- Andrei Stefan Bejgu (Sapienza University & Babelscape)
- Alberte Fernández-Castro (Sapienza University)
- Roberto Navigli (Babelscape & Sapienza University)
Presented at EMNLP 2025 in Suzhou, China.
Access the Dataset
Concept-pedia is freely available for research:
- Paper: ACL Anthology
- PDF: Download
Note: Dataset links and code repositories will be updated as they become publicly available.
Implications for Multimodal AI
For Researchers
- New benchmark: Evaluate on Concept-10k for real-world generalization
- Training resource: Use semantic annotations for better learning
- Analysis tool: Diagnose model failures on specific concept types
For Practitioners
- Reality check: Test your models beyond ImageNet
- Semantic augmentation: Incorporate structured knowledge
- Domain awareness: Understand limitations before deployment
For the Field
- Shift in evaluation: Move beyond ImageNet-centric metrics
- Semantic integration: Combine vision with knowledge graphs
- Long-tail focus: Address real-world diversity
Conclusion
For years, we’ve been optimizing for the wrong target. High performance on ImageNet doesn’t mean real-world visual understanding.
Concept-pedia provides:
- 165K+ semantically-annotated concepts for training
- Concept-10k benchmark for rigorous evaluation
- Evidence that current models are more limited than we thought
- A path forward through semantic integration
It’s time to break free from ImageNet’s shadow and build multimodal AI that truly understands the visual world’s diversity.
Citation
If you use Concept-pedia in your research, please cite our paper:
@inproceedings{ghonim-etal-2025-conceptpedia,
title = "Concept-pedia: A Wide-coverage Semantically-annotated Multimodal Dataset",
author = "Ghonim, Karim and
Bejgu, Andrei Stefan and
Fern{\'a}ndez-Castro, Alberte and
Navigli, Roberto",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1745/",
pages = "34405--34426",
}
Plain text citation:
Karim Ghonim, Andrei Stefan Bejgu, Alberte Fernández-Castro, and Roberto Navigli. 2025. Concept-pedia: A Wide-coverage Semantically-annotated Multimodal Dataset. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34405–34426, Suzhou, China. Association for Computational Linguistics.
Published at EMNLP 2025 - Conference on Empirical Methods in Natural Language Processing, Suzhou, China
Built with ❤️ by Andrei Stefan Bejgu - AI Applied Scientist @ SylloTips