Concept-pedia: Breaking Free from ImageNet's Shadow in Multimodal AI

TL;DR - Why You Should Care

Concept-pedia is a massive multimodal dataset covering more than 165,000 semantically-annotated concepts, and it exposes critical limitations in modern vision-language models.

Key contributions:

  • 165K+ concepts from BabelNet with rich semantic annotations
  • Concept-10k: A manually-curated benchmark of 10,000 diverse visual concepts
  • Reveals critical gaps: State-of-the-art models struggle with diverse concepts beyond ImageNet
  • Open resource: Freely available for research and evaluation

The problem: Current multimodal evaluations are “heavily anchored to ImageNet” - models that excel on standard benchmarks fail dramatically when confronted with the real world’s visual diversity.


The ImageNet Problem

For over a decade, ImageNet has been the gold standard for evaluating computer vision systems. Its 1,000 carefully curated categories have shaped how we build and evaluate visual AI.

But here’s the issue: the real world has far more than 1,000 visual concepts.

When you ask modern vision-language models about concepts outside this narrow ImageNet distribution, performance drops dramatically. Models that claim “human-level” performance on benchmarks struggle with everyday visual concepts.

Our EMNLP 2025 Research

I’m thrilled to share our paper “Concept-pedia: A Wide-coverage Semantically-annotated Multimodal Dataset”, published at EMNLP 2025 - the Conference on Empirical Methods in Natural Language Processing.

What Makes Concept-pedia Different?

Examples of taxonomical concept population in Concept-pedia across different categories (Cat, Emotion, Church, Pasta, Macaque, Train) showing the rich semantic structure from BabelNet.

Massive Scale with Semantic Structure

165,000+ concepts organized through BabelNet’s rich semantic network:

  • Not just labels: Every concept has definitions, relations, and multilingual support
  • Semantic hierarchy: Concepts are organized by meaningful relationships
  • Real-world diversity: Far beyond ImageNet’s 1K categories

Concept-10k: A Rigorous Benchmark

We created Concept-10k, a manually-curated evaluation benchmark:

  • 10,000 diverse visual concepts
  • Human-verified quality
  • Balanced across semantic categories
  • Designed to test real-world generalization

Semantic Annotations

Unlike raw image-text pairs, every concept includes:

  • Precise definitions
  • Semantic relationships (hypernymy, meronymy, etc.)
  • Multilingual coverage
  • Structured knowledge graph integration
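
To make these annotations concrete, here is a minimal sketch of what a single concept record could look like in code. It is an illustrative assumption, not the released schema; field names and types may differ in the actual dataset.

from dataclasses import dataclass, field

@dataclass
class ConceptEntry:
    """Hypothetical layout of one Concept-pedia record (illustrative only)."""
    babelnet_id: str                                        # BabelNet synset ID, e.g. "bn:...n"
    lemma: str                                              # canonical English lemma
    definition: str                                         # precise gloss / definition
    hypernyms: list[str] = field(default_factory=list)      # "is-a" parents in the hierarchy
    meronyms: list[str] = field(default_factory=list)       # "part-of" relations
    translations: dict[str, str] = field(default_factory=dict)  # language code -> lemma
    image_urls: list[str] = field(default_factory=list)     # images attached to the concept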

The ImageNet Anchor Problem

Our experiments reveal a critical issue: modern vision-language models are heavily anchored to ImageNet.

Performance Drop Beyond ImageNet

When we evaluate state-of-the-art models on Concept-10k:

Model              ImageNet Performance   Concept-10k Performance   Drop
CLIP (ViT-L/14)    75.5%                  42.3%                     -33.2%
ALIGN              76.4%                  43.8%                     -32.6%
OpenCLIP           78.2%                  45.1%                     -33.1%

Performance drops by over 30 points when tested on diverse concepts!
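
To illustrate the kind of zero-shot evaluation behind these numbers, the sketch below scores one image against a handful of candidate concepts with an off-the-shelf CLIP model from Hugging Face. It is not the paper's evaluation script; the image path and the candidate labels are placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Placeholder candidate labels; in practice these would be Concept-10k concepts.
concepts = ["Allen wrench", "screwdriver", "Bombay cat", "takoyaki pan"]
prompts = [f"a photo of a {c}" for c in concepts]

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_concepts)
probs = logits.softmax(dim=-1).squeeze(0)
print(concepts[probs.argmax().item()], round(probs.max().item(), 3))

Accuracy on a benchmark like Concept-10k is then simply the fraction of images whose top-scoring concept matches the gold annotation.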

Comparison of concept and category distributions: Concept-10k covers 28 semantic categories with 9,837 unique concepts, far exceeding ImageNet-1k's 11 categories and 1,000 concepts. The distribution is more balanced across diverse categories.

Why Does This Happen?

  • Training data bias: Most vision-language models are trained on data distributions similar to ImageNet
  • Evaluation myopia: We’ve been testing on the same narrow distribution for years
  • False sense of progress: High benchmark scores don’t reflect real-world capability

Real-World Examples

Let’s see where models fail:

Example 1: Specialized Tools

Concept: “Allen wrench” (an L-shaped hex key)

  • Human: Easily recognizes the L-shaped tool
  • CLIP: Confuses with “wrench”, “screwdriver”, “key”
  • Why it fails: Too specific, not in ImageNet’s 1K categories

Example 2: Fine-grained Animals

Concept: “Bombay cat” (a specific cat breed)

  • Human: Recognizes the sleek black coat
  • Model: Just says “cat” or “black cat”
  • Why it fails: ImageNet covers only a handful of cat breeds (e.g., “Egyptian cat”, “Persian cat”), and the Bombay is not among them

Example 3: Cultural Objects

Concept: “Takoyaki pan” (Japanese cooking equipment)

  • Human: Recognizes the specialized griddle with hemispheric molds
  • Model: Confuses with “pan”, “griddle”, “muffin tin”
  • Why it fails: Cultural specificity beyond Western-centric training data

These aren’t edge cases - they’re everyday objects that humans recognize instantly.

Examples showing the annotation quality in Concept-pedia: correct annotations are verified by expert linguists, while ambiguous cases are carefully filtered out (e.g., distinguishing "church" from "altar" when both appear in the same image).

How We Built Concept-pedia

Concept Selection from BabelNet

We leveraged BabelNet, the largest multilingual semantic network:

  • Started with 165,000+ visual concepts
  • Filtered for concepts with clear visual representations
  • Ensured semantic diversity across categories
  • Maintained rich semantic annotations

Link propagation examples: Our methodology uses Wikipedia hyperlinks and BabelNet's semantic structure to automatically annotate images with precise concepts, ensuring high-quality annotations at scale.
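
For readers who want to explore the underlying semantic network themselves, the sketch below queries the BabelNet HTTP API for a synset's glosses. This is not the pipeline used to build Concept-pedia, and the endpoint version, response field names, synset ID, and API key are assumptions to verify against the official BabelNet documentation.

import requests

# Assumed endpoint shape; check the current API version and parameters
# in the official BabelNet docs before relying on this.
API_URL = "https://babelnet.io/v9/getSynset"
params = {
    "id": "bn:00000010n",   # placeholder synset ID
    "key": "YOUR_API_KEY",  # personal BabelNet API key
}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()
synset = response.json()

# Field names below are assumptions about the JSON payload.
for gloss in synset.get("glosses", []):
    print(gloss.get("language"), gloss.get("gloss"))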

Image Collection

For each concept:

  • Query multiple image sources
  • Automatic quality filtering
  • Diversity checks (pose, lighting, context)
  • Deduplication
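
The deduplication step is the easiest to sketch. Below is a minimal example based on perceptual hashing with the imagehash library; it is one common approach rather than the exact method used for Concept-pedia, and the directory path and distance threshold are placeholders.

from pathlib import Path

import imagehash
from PIL import Image

def deduplicate(image_dir: str, max_distance: int = 4) -> list[Path]:
    """Keep one image per near-duplicate cluster (Hamming distance <= max_distance)."""
    kept_hashes: list[imagehash.ImageHash] = []
    kept_paths: list[Path] = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))            # perceptual hash of the image
        if all(h - other > max_distance for other in kept_hashes):
            kept_hashes.append(h)
            kept_paths.append(path)
    return kept_paths

unique_images = deduplicate("images/allen_wrench")       # placeholder directory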

Manual Curation for Concept-10k

For the evaluation benchmark:

  • Human verification: Expert annotators verified every image
  • Quality control: Multiple rounds of checking
  • Balance: Ensured representation across semantic categories
  • Difficulty calibration: A mix of easy, medium, and hard examples

Key Findings

The ImageNet Anchor Effect is Real

Models perform dramatically worse on concepts outside ImageNet’s distribution, even when those concepts aren’t more visually complex.

Semantic Annotations Help

Models that leverage semantic structure (like our approach) show better generalization than pure vision-language pretraining.

Fine-grained Understanding Lags

Modern models struggle most with:

  • Fine-grained categories (specific breeds, species, types)
  • Cultural specificity (region-specific objects)
  • Specialized domains (medical instruments, technical equipment)

More Data Isn’t Enough

Simply scaling up training data on similar distributions doesn’t solve the problem. We need semantic diversity, not just more examples.

Why This Matters for Practitioners

If you’re building multimodal AI systems, you need to know:

Benchmark Scores Can Mislead

A model with 80% accuracy on ImageNet might have 45% accuracy on real-world visual concepts. Test on diverse data, not just standard benchmarks.

Domain Adaptation is Critical

If your application involves:

  • Specialized domains (medical, industrial, etc.)
  • Cultural diversity
  • Fine-grained recognition

Standard vision-language models will likely underperform. Consider:

  • Domain-specific fine-tuning
  • Semantic augmentation with knowledge graphs
  • Evaluation on relevant concepts, not just ImageNet

Semantic Structure Helps

Incorporating semantic knowledge (definitions, relationships, hierarchies) improves generalization. Don’t rely purely on image-text correlation.
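
One lightweight way to exploit that structure is to fold a concept's definition into the zero-shot prompt instead of using the bare label. The sketch below illustrates the idea with the Hugging Face CLIP API; the gloss dictionary is a hypothetical stand-in for definitions that would come from a resource like Concept-pedia, and this is not the paper's method.

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical gloss lookup; real definitions would come from the dataset.
glosses = {
    "Bombay cat": "a breed of short-haired cat with a sleek, entirely black coat",
    "takoyaki pan": "a griddle with hemispherical molds used to cook takoyaki",
}

def build_prompt(label: str) -> str:
    gloss = glosses.get(label)
    return f"a photo of a {label}, {gloss}" if gloss else f"a photo of a {label}"

def text_embeddings(labels: list[str]) -> torch.Tensor:
    prompts = [build_prompt(label) for label in labels]
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

Image embeddings from get_image_features can then be compared against these definition-aware text embeddings by cosine similarity, exactly as in standard zero-shot classification.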

The Research Impact

Concept-pedia enables:

Better Evaluation

Researchers can now test models on 165K+ concepts beyond ImageNet, getting a true sense of generalization capability.

Improved Training

Use Concept-pedia’s semantic annotations to train models with better conceptual understanding.

Diagnostic Analysis

Concept-10k allows detailed analysis of where and why models fail, guiding future improvements.

Multilingual Extension

BabelNet’s multilingual nature enables extension to non-English visual understanding.

What We’re Working On Next

We’re actively developing:

  • Concept-pedia v2: Expanding to 500K+ concepts
  • Video concepts: Temporal understanding beyond static images
  • 3D concepts: Spatial reasoning and object understanding
  • Interactive evaluation platform: Test your models on Concept-10k
  • Semantic-aware training methods: Leveraging structure for better learning

The Bigger Picture

Concept-pedia is part of a broader movement toward semantic understanding in AI.

We’ve spent a decade optimizing for ImageNet. It’s time to:

  • Evaluate on real diversity - not just benchmark performance
  • Incorporate semantic knowledge - not just pattern matching
  • Test on long-tail concepts - not just common categories
  • Build for the real world - not just academic benchmarks

The Research Team

This work was a collaborative effort:

  • Karim Ghonim (Lead - Sapienza University)
  • Andrei Stefan Bejgu (Sapienza University & Babelscape)
  • Alberte Fernández-Castro (Sapienza University)
  • Roberto Navigli (Babelscape & Sapienza University)

Presented at EMNLP 2025 in Suzhou, China.

Access the Dataset

Concept-pedia is freely available for research:

Note: Dataset links and code repositories will be updated as they become publicly available.

Implications for Multimodal AI

For Researchers

  • New benchmark: Evaluate on Concept-10k for real-world generalization
  • Training resource: Use semantic annotations for better learning
  • Analysis tool: Diagnose model failures on specific concept types

For Practitioners

  • Reality check: Test your models beyond ImageNet
  • Semantic augmentation: Incorporate structured knowledge
  • Domain awareness: Understand limitations before deployment

For the Field

  • Shift in evaluation: Move beyond ImageNet-centric metrics
  • Semantic integration: Combine vision with knowledge graphs
  • Long-tail focus: Address real-world diversity

Conclusion

For years, we’ve been optimizing for the wrong target. High performance on ImageNet doesn’t mean real-world visual understanding.

Concept-pedia provides:

  • 165K+ semantically-annotated concepts for training
  • Concept-10k benchmark for rigorous evaluation
  • Evidence that current models are more limited than we thought
  • A path forward through semantic integration

It’s time to break free from ImageNet’s shadow and build multimodal AI that truly understands the visual world’s diversity.


Citation

If you use Concept-pedia in your research, please cite our paper:

@inproceedings{ghonim-etal-2025-conceptpedia,
    title     = "Concept-pedia: A Wide-coverage Semantically-annotated Multimodal Dataset",
    author    = "Ghonim, Karim and
                 Bejgu, Andrei Stefan and
                 Fern{\'a}ndez-Castro, Alberte and
                 Navigli, Roberto",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month     = nov,
    year      = "2025",
    address   = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url       = "https://aclanthology.org/2025.emnlp-main.1745/",
    pages     = "34405--34426",
}

Plain text citation:

Karim Ghonim, Andrei Stefan Bejgu, Alberte Fernández-Castro, and Roberto Navigli. 2025. Concept-pedia: A Wide-coverage Semantically-annotated Multimodal Dataset. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34405–34426, Suzhou, China. Association for Computational Linguistics.


Published at EMNLP 2025 - Conference on Empirical Methods in Natural Language Processing, Suzhou, China

Built with ❤️ by Andrei Stefan Bejgu - AI Applied Scientist @ SylloTips