Concept-pedia: Breaking Free from ImageNet's Shadow in Multimodal AI

TL;DR - Why You Should Care

Look, I’m going to be straight with you: most vision-language models are living in an ImageNet bubble, and Concept-pedia proves it.

We built a massive dataset with 165,000+ semantically-annotated concepts and found something wild - models that supposedly achieve “human-level” performance on standard benchmarks completely fall apart when you test them on real-world visual diversity.

What we’re releasing:

  • 165K+ concepts from BabelNet with rich semantic structure (all on Hugging Face)
  • Concept-10k: Our manually-verified benchmark with 10,000 diverse visual concepts
  • Three fine-tuned SigLIP models ready to use for zero-shot classification
  • Everything is open: Free for research and commercial use

The bottom line: If your ImageNet accuracy is 80% but your Concept-10k score is 45%, you don’t have a general vision model - you have an ImageNet classifier. Time to fix that.


The ImageNet Problem

For over a decade, ImageNet has been the gold standard for computer vision. Its 1,000 categories became THE benchmark everyone optimized for.

But here’s the thing: the real world doesn’t have just 1,000 visual concepts.

Try asking state-of-the-art models about concepts outside ImageNet’s distribution and watch what happens. That model bragging about 85% ImageNet accuracy? It’ll confidently tell you a Bombay cat is just a “black cat” and an Allen wrench is a “screwdriver.” Not great when you’re building real applications.

We’re not talking about obscure edge cases here. These are everyday objects that humans recognize instantly. The problem? The entire field has been optimizing for a test that doesn’t reflect reality.

Our EMNLP 2025 Research

I’m thrilled to share our paper “Concept-pedia: A Wide-coverage Semantically-annotated Multimodal Dataset”, published at EMNLP 2025 - the Conference on Empirical Methods in Natural Language Processing.

What Makes Concept-pedia Different?

Figure: Examples of taxonomical concept population in Concept-pedia across different categories (Cat, Emotion, Church, Pasta, Macaque, Train), showing the rich semantic structure from BabelNet.

We’re talking 165,000+ concepts with actual semantic structure

Unlike most datasets that just throw images and labels together, we built Concept-pedia on top of BabelNet - the world’s largest multilingual semantic network. What does that mean practically? Every single concept comes with definitions, relationships to other concepts, and support for multiple languages. It’s not “here’s a picture, here’s what we think it is” - it’s “here’s a concept that exists in a web of human knowledge, and here’s what it looks like.”

And we’re not talking about 1,000 ImageNet categories repeated in different poses. We have concepts ranging from specific cat breeds to architectural elements to types of pasta you’ve probably never heard of.

Concept-10k: The benchmark we actually tested on

Creating a huge dataset is one thing. Making sure it’s actually useful? That’s different. We manually went through and curated Concept-10k - 10,000 concepts that are diverse, human-verified, and designed to test whether models actually understand visual concepts or just memorized ImageNet.

We had expert annotators verify every single image. Multiple rounds. We made sure the difficulty was balanced (mix of easy, medium, and genuinely hard examples) and that we covered the full range of semantic categories. This isn’t a toy benchmark - when models fail here, it tells you something real about their limitations.

The semantic annotations are what make this powerful

Most vision-language datasets give you image-text pairs. Cool. We give you that PLUS the semantic relationships. Hypernymy (is-a relationships), meronymy (part-of relationships), connections to Wikipedia, WordNet, you name it.

This isn’t just for show - having this structure means you can actually reason about concepts, not just pattern match. Your model can understand that a “Bombay cat” is a type of “cat” which is a type of “feline” which is a type of “mammal.” Try doing that with CLIP trained on web-scraped captions.
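
To make the hierarchy concrete, here's a minimal sketch of walking an is-a chain. The dictionary below is illustrative only - it is not the released Concept-pedia schema, and the field names are made up for the example:

# Illustrative toy graph; in Concept-pedia the relationships come from BabelNet
concept_graph = {
    "Bombay cat": {"is_a": "cat"},
    "cat": {"is_a": "feline"},
    "feline": {"is_a": "mammal"},
    "mammal": {"is_a": None},
}

def hypernym_chain(concept, graph):
    """Follow is-a links from a concept up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = graph.get(concept, {}).get("is_a")
    return chain

print(" -> ".join(hypernym_chain("Bombay cat", concept_graph)))
# Bombay cat -> cat -> feline -> mammal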

The ImageNet Anchor Problem

Our experiments reveal a critical issue: modern vision-language models are heavily anchored to ImageNet.

Performance Drop Beyond ImageNet

When we evaluate state-of-the-art models on Concept-10k:

Model              ImageNet accuracy    Concept-10k accuracy    Drop
CLIP (ViT-L/14)    75.5%                42.3%                   -33.2%
ALIGN              76.4%                43.8%                   -32.6%
OpenCLIP           78.2%                45.1%                   -33.1%

Performance drops by over 30 points when tested on diverse concepts!

Figure: Comparison of concept and category distributions. Concept-10k covers 28 semantic categories with 9,837 unique concepts, far exceeding ImageNet-1k's 11 categories and 1,000 concepts, and its distribution is more balanced across categories.

Why Does This Happen?

Three words: we’ve been lazy. Well, not lazy exactly - but we’ve been optimizing for the wrong thing for so long that nobody questioned it.

Most vision-language models get trained on data that looks suspiciously like ImageNet. Maybe the images come from the web instead of Flickr, but the distribution? Pretty similar. Common objects. Western-centric. Same biases, bigger scale.

Then we evaluate on… ImageNet. Or benchmarks that are basically “ImageNet but slightly different.” We’ve been testing on variations of the same exam for a decade, and then acting surprised when our models can’t handle concepts outside that narrow bubble.

The real problem? Those impressive benchmark scores gave everyone a false sense of progress. “Look, we hit 85% on ImageNet!” Cool, but can your model tell a moka pot from a french press? Because my grandma can, and she’s never seen a neural network in her life.

Real-World Examples

Let’s see where models fail:

Example 1: Specialized Tools

Concept: “Allen wrench” (a specific type of hex key)

  • Human: Easily recognizes the L-shaped tool
  • CLIP: Confuses with “wrench”, “screwdriver”, “key”
  • Why it fails: Too specific, not in ImageNet’s 1K categories

Example 2: Fine-grained Animals

Concept: “Bombay cat” (a specific cat breed)

  • Human: Recognizes the sleek black coat
  • Model: Just says “cat” or “black cat”
  • Why it fails: ImageNet has “Egyptian cat” but lacks fine-grained breeds

Example 3: Cultural Objects

Concept: “Takoyaki pan” (Japanese cooking equipment)

  • Human: Recognizes the specialized griddle with hemispheric molds
  • Model: Confuses with “pan”, “griddle”, “muffin tin”
  • Why it fails: Cultural specificity beyond Western-centric training data

These aren’t edge cases - they’re everyday objects that humans recognize instantly.

Figure: Examples of annotation quality in Concept-pedia. Correct annotations are verified by expert linguists, while ambiguous cases are filtered out (e.g., distinguishing "church" from "altar" when both appear in the same image).

How We Built Concept-pedia

Starting with BabelNet’s semantic goldmine

BabelNet is massive - we’re talking about millions of concepts across hundreds of languages. But not every concept is visual. “Democracy”? Great concept, hard to photograph. So we had to filter.

We started with their full knowledge graph and pulled out concepts that actually have clear visual representations. Things you can point a camera at. That still left us with 165,000+ concepts spanning everything from animals to architecture to food to specialized tools.

The key was maintaining the semantic annotations through this process. We didn’t just want labels - we wanted the full context: definitions, relationships, multilingual mappings, connections to Wikipedia. All of it.

Figure: Link propagation examples. Our methodology uses Wikipedia hyperlinks and BabelNet's semantic structure to automatically annotate images with precise concepts, ensuring high-quality annotations at scale.

Getting the images right

Finding images for 165,000 concepts isn’t trivial. We queried multiple sources for each concept, then hit them with automatic quality filters (blurry images? Gone. Watermarks everywhere? Nope.). We checked for diversity too - different angles, lighting conditions, contexts. Nobody wants a cat breed dataset where every photo is a professional studio shot.

Deduplication was huge. The internet loves copying the same image everywhere, so we had to be aggressive about catching duplicates.
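
If you are building a similar pipeline, perceptual hashing is one common way to catch near-duplicates. This is a generic sketch using the imagehash library, not a description of our exact deduplication step:

from PIL import Image
import imagehash

def find_near_duplicates(image_paths, max_distance=5):
    """Flag pairs of images whose perceptual hashes are within max_distance bits."""
    seen = {}
    duplicates = []
    for path in image_paths:
        h = imagehash.phash(Image.open(path))
        for other_path, other_hash in seen.items():
            if h - other_hash <= max_distance:  # Hamming distance between hashes
                duplicates.append((path, other_path))
        seen[path] = h
    return duplicates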

The human touch for Concept-10k

For the evaluation benchmark, automation wasn’t enough. We brought in expert annotators and had them verify every single image across 10,000 concepts. Multiple rounds of review. We weren’t just checking “is this the right label?” - we were checking “is this actually a good example? Is it ambiguous? Would a human struggle with this?”

We also calibrated difficulty. Some concepts are easy (most people can spot a golden retriever). Some are hard (distinguishing between types of wrenches requires domain knowledge). The benchmark needed both.

What We Actually Learned

The ImageNet anchor is real, and it’s worse than we thought

Remember those 30+ point drops in performance? That’s not a bug, it’s the whole point. Models don’t just perform “a bit worse” on unfamiliar concepts - they completely faceplant. And here’s the kicker: the concepts they’re failing on aren’t even more visually complex than ImageNet categories. A Bombay cat isn’t harder to recognize than an Egyptian cat. The model just never learned to care about that distinction.

Semantic structure actually matters (who knew?)

When we compared models that use semantic annotations vs pure vision-language pretraining, the difference was clear. Having access to the knowledge graph - understanding that concepts have relationships and hierarchies - legitimately helps with generalization.

It’s almost like… treating visual understanding as part of broader knowledge helps you understand things better? Shocking, I know.

Fine-grained recognition is where everything falls apart

If there’s one thing that consistently breaks modern vision models, it’s fine-grained understanding. Specific dog breeds? Nope. Different types of the same tool? Forget it. Region-specific cultural objects? Not a chance.

Medical instruments, technical equipment, subspecies of animals - these are all areas where models basically give up and output the closest generic category they know. It’s like asking someone who only studied from flashcards to handle nuance. They can’t.

Scaling isn’t the solution (sorry, big tech)

I know the instinct is “just add more data” but that’s not it. We tested this. Throwing more examples of the same distribution at the problem doesn’t fix the fundamental issue.

What you need is semantic diversity, not scale. A million more images of “dog” doesn’t teach your model about specific breeds if all those images are labelled “dog.” You need the structure, the relationships, the actual understanding that different concepts exist and matter.

If You’re Building Multimodal AI, Pay Attention

Your benchmark scores are lying to you

That CLIP model you’re using that claims 80% ImageNet accuracy? In your specific domain, it might be sitting at 45%. Or worse.

I’ve seen people deploy models in production based purely on ImageNet scores, then act shocked when the thing can’t tell medical instruments apart or consistently fails on region-specific products. Test on data that actually looks like what you’ll see in production, not the same academic benchmarks everyone else uses.

Domain adaptation isn’t optional anymore

If you’re working in healthcare, industrial inspection, e-commerce with diverse products, cultural heritage - basically anything that isn’t “generic web images” - you need to assume standard models will underperform.

Fine-tuning helps, but it’s not magic. You’re still building on a foundation that fundamentally doesn’t understand fine-grained distinctions. Better approach? Start with models that have semantic grounding (like ours) or invest in seriously good domain-specific data collection.

And for the love of god, evaluate on YOUR concepts, not ImageNet. Your stakeholders don’t care if the model knows “Egyptian cat” when your actual use case needs to distinguish between different manufacturing defects.

Semantic structure is your friend

Image-text correlation can only get you so far. When you incorporate actual semantic knowledge - hierarchies, relationships, definitions - generalization improves dramatically.

Think about it: if your model knows that “Siamese cat” is-a “cat” is-a “feline” is-a “mammal,” it can reason about things it’s never seen. Without that structure, it’s just pattern matching pixels to tokens and hoping for the best.

What This Enables (And Where We’re Going)

Concept-pedia isn’t just a dataset - it’s a different way of thinking about visual understanding.

For researchers, it means you can finally test your models on something other than ImageNet variants. 165K+ concepts spanning actual diversity. When your model fails, Concept-10k tells you exactly where and why - fine-grained categories? Cultural concepts? Specialized domains? You’ll know.

And because everything’s grounded in BabelNet, you can extend to multilingual scenarios without starting from scratch. The semantic structure is already there.

For training, the semantic annotations are the real value. Instead of just feeding models image-text pairs and hoping they figure out relationships, you can give them the structure directly. “This is a Bombay cat, which is a type of cat, which is a feline…” The hierarchy matters.
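
One lightweight way to put that hierarchy to work - a sketch, not the training recipe from the paper - is to fold the hypernym chain into the text prompts you hand to a zero-shot classifier. The chains below are hand-written for illustration; in practice you would pull them from BabelNet via each concept's bn_id:

# Hand-written hypernym chains, for illustration only
hierarchy = {
    "Bombay cat": ["cat", "feline", "mammal"],
    "Persian cat": ["cat", "feline", "mammal"],
    "Maine Coon cat": ["cat", "feline", "mammal"],
}

def enrich_with_hierarchy(concept):
    """Turn a bare label into a hierarchy-aware prompt."""
    parents = hierarchy.get(concept, [])
    if not parents:
        return concept
    return f"{concept}, a type of {', a type of '.join(parents)}"

candidate_concepts = [enrich_with_hierarchy(c) for c in hierarchy]
print(candidate_concepts[0])
# Bombay cat, a type of cat, a type of feline, a type of mammal

Whether prompts like these help your particular model is something to measure, but the point stands: the structure is there to be used instead of being buried in web captions.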

What’s next for us

We’re expanding to 500K+ concepts for v2. We’re also working on temporal understanding (video concepts, not just static images) and spatial reasoning (3D object understanding).

We’re building an interactive evaluation platform so you can test your own models on Concept-10k without downloading everything. And we’re developing semantic-aware training methods that actually leverage the knowledge graph instead of just including it as metadata.

The bigger point

Look, the field spent a decade optimizing for ImageNet. Can’t blame anyone - it was the benchmark we had, and it drove real progress. But we’ve reached the point where ImageNet performance and real-world capability have diverged so much that the benchmark is actively misleading.

Concept-pedia is our push to evaluate on actual diversity, incorporate semantic knowledge instead of just pattern matching, and build for real-world deployment instead of academic leaderboards. The visual world has way more than 1,000 concepts. Our models should too.

The Research Team

This work was a collaborative effort:

  • Karim Ghonim (Lead - Sapienza University)
  • Andrei Stefan Bejgu (Sapienza University & Babelscape)
  • Alberte Fernández-Castro (Sapienza University)
  • Roberto Navigli (Babelscape & Sapienza University)

Presented at EMNLP 2025 in Suzhou, China.

Getting Started with Concept-pedia on Hugging Face

The entire Concept-pedia ecosystem is now available on Hugging Face, making it dead simple to use these models and datasets in your own projects. Whether you’re training a new vision-language model, evaluating your existing system, or just exploring the dataset, here’s everything you need to know.

What’s Available on Hugging Face

We’ve released three fine-tuned SigLIP models and two comprehensive datasets:

Models (Vision-Language):

  • sapienzanlp/siglip-base-patch16-256-ft-concept-pedia (0.2B params) - Fast and efficient
  • sapienzanlp/siglip-large-patch16-256-ft-concept-pedia (0.7B params) - Better accuracy
  • sapienzanlp/siglip-so400m-patch14-384-ft-concept-pedia (0.9B params) - Best performance

Datasets:

  • sapienzanlp/Concept-10k - Text annotations and metadata (34.3K rows covering roughly 10K concepts)
  • sapienzanlp/Concept-10k-imgs - Full image dataset with visual content (4.26 GB)

All models are trained on the full Concept-pedia dataset, giving them knowledge of 165K+ visual concepts beyond traditional ImageNet categories.

Quick Start: Using the Models

Here’s how to get started with zero-shot image classification using our models. This example shows you how to classify an image into one of several possible concepts:

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load the base model (fastest option)
model_name = "sapienzanlp/siglip-base-patch16-256-ft-concept-pedia"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Load your image
image = Image.open("your_image.jpg")

# Define candidate concepts - can be anything!
candidate_concepts = [
    "Bombay cat",
    "Persian cat",
    "Siamese cat",
    "Maine Coon cat",
    "tabby cat"
]

# Process the inputs
inputs = processor(
    text=candidate_concepts,
    images=image,
    return_tensors="pt",
    padding=True
)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits_per_image
    probs = logits.softmax(dim=1)

# Print results
print("Classification results:")
for concept, prob in zip(candidate_concepts, probs[0]):
    print(f"  {concept}: {prob.item():.1%}")

The beauty of this approach? You can test any visual concept you want, not just the 1,000 categories in ImageNet. Want to distinguish between types of pasta, breeds of dogs, or specific tools? Just change the candidate_concepts list.

Why These Models Are Different

Remember that ImageNet anchor problem? Our models were trained specifically to avoid it. Instead of optimizing for ImageNet’s 1,000 categories, we trained on the full 165K concept distribution.

This means they can actually distinguish between specific cat breeds (Bombay cat vs Persian cat vs Scottish Fold), handle specialized domains (medical equipment, industrial tools, architectural elements), recognize culturally-specific objects, and work with long-tail concepts that most models have never seen.

They’re not perfect - nothing is - but they’re substantially better at real-world diversity than models anchored to ImageNet.

Working with the Concept-10k Dataset

The dataset comes in two flavors - one with just metadata and one with images. Here’s how to load and explore them:

from datasets import load_dataset

# Load the text/metadata dataset (lightweight)
dataset = load_dataset("sapienzanlp/Concept-10k")

# Look at the first example
example = dataset['test'][0]
print(f"Concept: {example['concept']}")
print(f"Category: {example['category']}")
print(f"Caption: {example['caption']}")
print(f"BabelNet ID: {example['bn_id']}")

Each entry includes the concept name (“Allen wrench”, “Bombay cat”, whatever), its semantic category (ARTIFACT, ANIMAL, FOOD, etc.), a natural language caption describing it, and a BabelNet ID that links it to the full knowledge graph. The image dataset adds the actual visual content.

Exploring the Image Dataset

For the full visual experience with images:

from datasets import load_dataset
from PIL import Image

# Load the image dataset
img_dataset = load_dataset("sapienzanlp/Concept-10k-imgs")

# Browse examples
for i in range(5):
    example = img_dataset['train'][i]

    # Access the image
    img = example['jpg']

    # Show or save it
    img.show()  # Opens in default viewer
    # Or save: img.save(f"concept_{i}.jpg")

    print(f"Image {i}: {example['__key__']}")

The image dataset is about 4.26 GB, so it might take a few minutes to download the first time. After that, it’s cached locally.

Real-World Usage Examples

Example 1: Finding Similar Concepts

Given a query image and a list of candidate concepts, here's how to rank the concepts by how well they match:

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

def find_similar_concepts(query_image_path, concept_database):
    """
    Find the most similar concepts to a query image.

    Args:
        query_image_path: Path to query image
        concept_database: List of concept names to search

    Returns:
        Ranked list of (concept, score) tuples
    """
    # Load model
    model_name = "sapienzanlp/siglip-base-patch16-256-ft-concept-pedia"
    processor = AutoProcessor.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Load image
    image = Image.open(query_image_path)

    # Process
    inputs = processor(
        text=concept_database,
        images=image,
        return_tensors="pt",
        padding=True
    )

    # Get scores
    with torch.no_grad():
        outputs = model(**inputs)
        scores = outputs.logits_per_image[0].softmax(dim=0)

    # Rank results
    results = sorted(
        zip(concept_database, scores.tolist()),
        key=lambda x: x[1],
        reverse=True
    )

    return results

# Example usage
concepts = [
    "espresso machine", "coffee grinder", "french press",
    "moka pot", "pour over coffee maker", "cold brew maker"
]

results = find_similar_concepts("kitchen_appliance.jpg", concepts)

print("Top 3 matches:")
for concept, score in results[:3]:
    print(f"  {concept}: {score:.1%}")

Example 2: Evaluating Your Own Model

Use Concept-10k as a benchmark to test how well your model handles diverse concepts:

from collections import defaultdict
from datasets import load_dataset
from tqdm import tqdm

def evaluate_on_concept10k(your_model, your_processor):
    """Evaluate any vision-language model on Concept-10k"""

    # Load the images and the metadata; the two repos are assumed to be
    # aligned row-by-row (verify this for the version you download)
    img_data = load_dataset("sapienzanlp/Concept-10k-imgs")['train']
    meta_data = load_dataset("sapienzanlp/Concept-10k")['test']

    # Group images by their gold concept for efficiency
    concept_groups = defaultdict(list)
    for img_example, meta_example in zip(img_data, meta_data):
        concept_groups[meta_example['concept']].append(img_example['jpg'])

    correct = 0
    total = 0

    # Test each concept
    for concept, images in tqdm(concept_groups.items()):
        for img in images:
            # Plug in your model's prediction logic here
            # (your_processor is available for any preprocessing you need)
            prediction = your_model.predict(img)

            if prediction == concept:
                correct += 1
            total += 1

    accuracy = correct / total
    print(f"Accuracy on Concept-10k: {accuracy:.2%}")
    return accuracy

Example 3: Dataset Analysis

Want to understand what’s in the dataset? Here’s a quick analysis script:

from datasets import load_dataset
from collections import Counter
import matplotlib.pyplot as plt

# Load dataset
dataset = load_dataset("sapienzanlp/Concept-10k")
test_data = dataset['test']

# Analyze categories
categories = [ex['category'] for ex in test_data]
category_counts = Counter(categories)

# Plot distribution
plt.figure(figsize=(12, 6))
plt.bar(list(category_counts.keys()), list(category_counts.values()))
plt.xticks(rotation=45, ha='right')
plt.title('Concept Distribution across Categories')
plt.xlabel('Category')
plt.ylabel('Number of Concepts')
plt.tight_layout()
plt.savefig('concept_distribution.png')

# Find longest concepts
concepts = [ex['concept'] for ex in test_data]
longest = sorted(concepts, key=len, reverse=True)[:10]

print("Longest concept names:")
for i, concept in enumerate(longest, 1):
    print(f"  {i}. {concept} ({len(concept)} chars)")

# Category breakdown
print(f"\nTotal categories: {len(category_counts)}")
print(f"Total rows: {len(test_data)}")
print(f"Unique concepts: {len(set(concepts))}")
print(f"Average rows per category: {len(test_data) / len(category_counts):.1f}")

Understanding the Dataset Structure

The full Concept-10k dataset has 34,345 rows spread across 28 semantic categories. We’re talking artifacts (tools, equipment), food (dishes, ingredients, cuisines), animals (species, breeds), plants, locations, structures, people (occupations, roles), organizations, diseases, substances, media, and more. Basically everything you might actually encounter in images.

The BabelNet ID (bn_id) in each entry is your gateway to the full knowledge graph. Through that ID, you can pull semantic relationships (is-a, part-of, related-to), get definitions in dozens of languages, and connect to Wikipedia, WordNet, and other structured resources. It’s not just “here’s a label” - it’s “here’s where this concept sits in human knowledge.”
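
If you want to follow a bn_id out into that graph programmatically, the BabelNet HTTP API is the usual route. The sketch below assumes the getOutgoingEdges endpoint and the response fields documented for it; the version prefix and field names should be double-checked against the current API docs, and you need a free key from babelnet.org:

import requests

BABELNET_KEY = "YOUR_API_KEY"  # register at babelnet.org to get one

def get_outgoing_edges(bn_id, key=BABELNET_KEY, version="v5"):
    """Fetch semantic relations (hypernymy, meronymy, ...) for a BabelNet synset."""
    url = f"https://babelnet.io/{version}/getOutgoingEdges"
    response = requests.get(url, params={"id": bn_id, "key": key})
    response.raise_for_status()
    return response.json()

# Explore the relations of a concept from Concept-10k (synset id is illustrative)
edges = get_outgoing_edges("bn:00015267n")
for edge in edges[:5]:
    print(edge.get("pointer", {}).get("shortName"), "->", edge.get("target"))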

Quick Performance Tips

Pick your model based on what you actually need. The base model (0.2B params) is fast enough for real-time stuff, the large model (0.7B) gives you better accuracy for production, and the SO400M model (0.9B) is when you need the absolute best performance and don’t care about inference speed.

The text dataset is tiny (a few MB) and downloads instantly. The image dataset is 4.26 GB, so the first download takes a few minutes. If you're memory-constrained or only need a subset, stream it:

# Stream the dataset lazily instead of downloading everything up front
dataset = load_dataset("sapienzanlp/Concept-10k-imgs", streaming=True)

# Take just the first 100 examples
from itertools import islice

for example in islice(dataset['train'], 100):
    # Process each example, e.g. example['jpg']
    pass

For inference: batch your images together, use GPU if you have one (model.to("cuda")), and for god’s sake cache your processor and model instead of reloading them every time.
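
Putting those three tips together, a minimal batched-inference sketch looks something like this (the image paths are placeholders):

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load once and reuse - don't reload per request
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "sapienzanlp/siglip-base-patch16-256-ft-concept-pedia"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)

concepts = ["moka pot", "french press", "espresso machine"]
image_paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]  # placeholders

# Batch all images into a single forward pass
images = [Image.open(p).convert("RGB") for p in image_paths]
inputs = processor(
    text=concepts, images=images, return_tensors="pt", padding=True
).to(device)

with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=1)  # one row per image

for path, row in zip(image_paths, probs):
    print(path, "->", concepts[int(row.argmax())])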

Why Use Concept-pedia for Your Project?

Stop using ImageNet-trained models for everything. Seriously. Here’s when Concept-pedia is the better choice:

  1. Your domain isn’t well-covered by ImageNet: Building a medical diagnosis tool? Industrial quality inspection system? Cultural heritage preservation app? ImageNet won’t cut it.

  2. You need fine-grained recognition: If distinguishing between a Golden Retriever and a Labrador matters, or you need to tell apart a cappuccino from a flat white, you need fine-grained understanding.

  3. You want actual zero-shot capability: Not “zero-shot on similar stuff to training data” but real zero-shot - throw any concept at it and get reasonable results.

  4. You’re building multilingual systems: BabelNet integration means your visual concepts come with multilingual support out of the box.

  5. You care about real-world diversity: ImageNet is super Western-centric. If you’re building for global users, you need concepts from different cultures.

  6. You want semantic grounding: Connecting visual concepts to knowledge graphs unlocks explainability, reasoning, and integration with other AI systems.

Common Pitfalls and How to Avoid Them

Pitfall 1: Testing on ImageNet after training on Concept-pedia

If you fine-tune on Concept-pedia and then evaluate on ImageNet, you might see a performance drop. That’s expected! Concept-pedia is designed for broader coverage, not ImageNet-specific optimization.

Solution: Evaluate on Concept-10k or your specific domain, not ImageNet.

Pitfall 2: Using too many candidate concepts at once

The models work best with 10-100 candidate concepts per query. If you have 10,000+ concepts, consider using a retrieval stage first.

Solution: Use semantic search or clustering to narrow down candidates before classification.
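
A sketch of that two-stage setup, assuming the fine-tuned checkpoints expose the standard SiglipModel get_text_features / get_image_features methods from transformers (worth verifying on your version):

import torch

def retrieve_then_classify(image, all_concepts, model, processor, top_k=50):
    """Stage 1: embed image and concepts, keep the top_k closest concepts.
    Stage 2: run normal zero-shot classification on that shortlist only."""
    # Text embeddings for the full concept list (cache these in practice)
    text_inputs = processor(text=all_concepts, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**text_inputs)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    image_inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        img_emb = model.get_image_features(**image_inputs)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    # Stage 1: cosine-similarity retrieval
    sims = (img_emb @ text_emb.T)[0]
    top_idx = sims.topk(min(top_k, len(all_concepts))).indices
    shortlist = [all_concepts[i] for i in top_idx]

    # Stage 2: full forward pass over the shortlist
    inputs = processor(text=shortlist, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=1)[0]
    return sorted(zip(shortlist, probs.tolist()), key=lambda x: x[1], reverse=True)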

Pitfall 3: Assuming perfect accuracy on rare concepts

Even our models struggle with extremely rare or ambiguous visual concepts. They’re better than ImageNet-anchored models, but not perfect.

Solution: Use confidence thresholds and human-in-the-loop verification for critical applications.
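
As a concrete example, here is a simple abstain rule on top of the probabilities from the quick-start snippet (the 0.5 threshold is arbitrary - tune it on your own data):

def classify_with_threshold(probs, concepts, threshold=0.5):
    """Return the top concept only if the model is confident enough,
    otherwise flag the example for human review."""
    best_prob, best_idx = probs.max(dim=0)
    if best_prob.item() >= threshold:
        return concepts[int(best_idx)]
    return "NEEDS_HUMAN_REVIEW"

# probs[0] is the softmax output from the quick-start example
# print(classify_with_threshold(probs[0], candidate_concepts))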

Integration Examples

LangChain Integration:

from langchain.tools import Tool
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

def create_concept_classifier_tool():
    model_name = "sapienzanlp/siglip-base-patch16-256-ft-concept-pedia"
    model = AutoModel.from_pretrained(model_name)
    processor = AutoProcessor.from_pretrained(model_name)

    def classify(image_path: str, concepts: str) -> str:
        # concepts should be comma-separated
        concept_list = [c.strip() for c in concepts.split(',')]
        image = Image.open(image_path)

        inputs = processor(
            text=concept_list,
            images=image,
            return_tensors="pt",
            padding=True
        )
        with torch.no_grad():
            outputs = model(**inputs)
            probs = outputs.logits_per_image.softmax(dim=1)[0]

        # Return the best-matching concept and its score
        best = int(probs.argmax())
        return f"{concept_list[best]} ({probs[best].item():.1%})"

    return Tool(
        name="ConceptClassifier",
        func=classify,
        description="Classifies images into fine-grained visual concepts"
    )

FastAPI Endpoint:

from fastapi import FastAPI, File, UploadFile, Query
from transformers import AutoModel, AutoProcessor
from PIL import Image
from io import BytesIO
from typing import List
import torch

app = FastAPI()

# Load the model and processor once at startup
@app.on_event("startup")
async def load_model():
    model_name = "sapienzanlp/siglip-base-patch16-256-ft-concept-pedia"
    app.state.model = AutoModel.from_pretrained(model_name)
    app.state.processor = AutoProcessor.from_pretrained(model_name)

@app.post("/classify")
async def classify_image(
    file: UploadFile = File(...),
    concepts: List[str] = Query(["cat", "dog", "bird"])
):
    # Read the uploaded image
    image_bytes = await file.read()
    image = Image.open(BytesIO(image_bytes))

    # Score the image against the candidate concepts
    inputs = app.state.processor(
        text=concepts,
        images=image,
        return_tensors="pt",
        padding=True
    )

    with torch.no_grad():
        outputs = app.state.model(**inputs)
        probs = outputs.logits_per_image.softmax(dim=1)[0]

    results = {
        concept: float(prob)
        for concept, prob in zip(concepts, probs)
    }

    return {"predictions": results}

Access the Dataset and Models

Everything is freely available for research and commercial use:

Hugging Face Resources:

  • Models: sapienzanlp/siglip-base-patch16-256-ft-concept-pedia, sapienzanlp/siglip-large-patch16-256-ft-concept-pedia, sapienzanlp/siglip-so400m-patch14-384-ft-concept-pedia
  • Datasets: sapienzanlp/Concept-10k (annotations and metadata), sapienzanlp/Concept-10k-imgs (images)

Paper:

  • "Concept-pedia: A Wide-coverage Semantically-annotated Multimodal Dataset" (EMNLP 2025): https://aclanthology.org/2025.emnlp-main.1745/

Who Should Care About This?

If you’re in research, Concept-10k gives you a benchmark that actually tests real-world generalization instead of ImageNet memorization. The semantic annotations let you train models that learn structured knowledge, not just pixel-text correlations. And when models fail, you can diagnose exactly which concept types are problematic.

If you’re building production systems, this is your reality check. Test on Concept-10k before deploying, incorporate the semantic structure if you can, and understand your model’s limitations before your users find them for you.

For the field overall, we need to shift evaluation beyond ImageNet-centric metrics. We need to integrate vision with knowledge graphs. We need to care about long-tail concepts and real-world diversity. Concept-pedia is one step in that direction.

The Bottom Line

We spent a decade building models that ace ImageNet and fail in the real world. That 30+ point performance drop on Concept-10k? That’s the gap between what we think our models can do and what they actually can do.

Concept-pedia gives you 165K+ semantically-annotated concepts for training, Concept-10k for honest evaluation, and evidence that our current approaches are way more limited than the benchmarks suggested. The semantic structure shows a path forward - combine vision with knowledge graphs instead of just scaling up image-text pairs.

All the code and data is on Hugging Face. The models are ready to use. The benchmark is waiting.

Time to build multimodal AI that actually handles real-world visual diversity, not just ImageNet variations.


Citation

If you use Concept-pedia in your research, please cite our paper:

@inproceedings{ghonim-etal-2025-conceptpedia,
    title     = "Concept-pedia: A Wide-coverage Semantically-annotated Multimodal Dataset",
    author    = "Ghonim, Karim and
                 Bejgu, Andrei Stefan and
                 Fern{\'a}ndez-Castro, Alberte and
                 Navigli, Roberto",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month     = nov,
    year      = "2025",
    address   = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url       = "https://aclanthology.org/2025.emnlp-main.1745/",
    pages     = "34405--34426",
}

Plain text citation:

Karim Ghonim, Andrei Stefan Bejgu, Alberte Fernández-Castro, and Roberto Navigli. 2025. Concept-pedia: A Wide-coverage Semantically-annotated Multimodal Dataset. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34405–34426, Suzhou, China. Association for Computational Linguistics.


Published at EMNLP 2025 - Conference on Empirical Methods in Natural Language Processing, Suzhou, China

Built with ❤️ by Andrei Stefan Bejgu - AI Applied Scientist @ SylloTips