LAION-Beyond: Reproducible Vision-Language Models Meet Concepts Out of Pre-Training

1Peng Cheng Laboratory, 2Sun Yat-sen University, 3Jinan University, 4École polytechnique fédérale de Lausanne (EPFL)
CVPR 2025

Indicates Equal Contribution, Listed Alphabetically By Last Name

* Corresponding Author

Abstract

Contrastive Language-Image Pre-training (CLIP) models, as a milestone of modern multimodal intelligence, have attracted significant research interest regarding their generalization mechanisms. However, existing studies remain confined to knowledge covered by pre-training and rarely address how the model generalizes to the countless open-world concepts absent from the pre-training regime. This paper investigates this Out-of-Pre-training (OOP) generalization problem from a holistic perspective. We propose the LAION-Beyond benchmark to isolate the evaluation of OOP concepts from pre-training knowledge, focusing on OpenCLIP and its reproducible variants derived from LAION datasets. Empirical analysis shows that although the image features of OOP concepts exhibit clear category margins, zero-shot transfer to these concepts largely fails due to poor image-text alignment. To address this, we develop the "name-tuning" methodology with its theoretical merits for OOP generalization, and propose few-shot name learning (FSNL) and zero-shot name learning (ZSNL) algorithms that achieve OOP generalization in a data-efficient manner. Their superiority is further verified in our comprehensive experiments.

Key Highlights

First Systematic Study

First research to systematically explore vision-language models' generalization to concepts absent from pre-training data

Novel Benchmark

LAION-Beyond: the first multi-domain benchmark specifically designed to evaluate OOP concept generalization

Novel Algorithms

FSNL and ZSNL: two data-efficient algorithms that effectively address the OOP generalization problem

Research Background

Modern vision-language models (like CLIP) have demonstrated remarkable zero-shot and few-shot learning capabilities through large-scale image-text pair pre-training. However, how these models perform when faced with concepts never encountered during pre-training remains a critical yet under-explored question.

Research Distinction: We distinguish between IP concepts (In-Pre-training) that appear in pre-training data and OOP concepts (Out-of-Pre-training) that are absent from pre-training data.

Figure 1: Comparison between IP and OOP generalization. The former evaluates OpenCLIP's generalization on visual concepts seen during pre-training, whereas the latter assesses its generalization on concepts absent from pre-training.

The LAION-Beyond Benchmark

We constructed the LAION-Beyond benchmark as the first dedicated dataset for evaluating vision-language models' generalization to Out-of-Pre-training concepts:

  • Scale: 106,052 images across 674 OOP concepts and 51,330 images across 324 IP concepts
  • Coverage: Spans 9 diverse domains including Plants & Fungi, Insects & Spiders, Animals, Pokemon, FolkArt, Landmark, Attire, Food, and Architecture
  • Multiple Versions: Provides subsets corresponding to LAION-400M, LAION-2B, and LAION-5B to support neural scaling law research
Figure 2a: The statistics of OOP and IP concepts and their images in LAION-Beyond (400M), (2B), and (5B).
Figure 2b: Statistics of the train/val/test splits in LAION-Beyond (400M).

Key Findings

1. Strong Image Feature Representation for OOP Concepts

OpenCLIP's image encoder can extract features with clear clustering boundaries for OOP concepts, even though these concepts never appeared during pre-training:

  • Cluster visualization shows distinct class boundaries for OOP concept image features
  • The normalized clustering-accuracy gap between OOP and IP concepts stays within roughly 3% in most domains (see Table 1)
Figure 3: The t-SNE visualization for (a) image features from 20 OOP classes drawn from Plants & Fungi; (b) image features from 10 OOP classes and 10 IP classes.
| Split | Anim | Arch | Atti | Folk | Food | Insect | Ladmk | Plant | Pokem | Avg |
|-------|------|------|------|------|------|--------|-------|-------|-------|-----|
| IP classes | 40.27 | 91.04 | 82.09 | 78.02 | 81.72 | 50.44 | 93.01 | 55.71 | 34.07 | 68.15 |
| OOP classes | 37.27 | 81.06 | 68.92 | 76.60 | 80.65 | 48.30 | 86.39 | 53.17 | 35.80 | 63.13 |
| IP-OOP gap | 3.00 | 10.02 | 13.83 | 1.42 | 1.07 | 2.14 | 6.62 | 2.54 | 1.73 | 5.02 |
Table 1. Normalized clustering accuracy of features extracted from IP-class and OOP-class test images across the 9 domains.
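
The normalized clustering accuracy reported in Table 1 follows the standard unsupervised protocol; the sketch below is our own illustration (not the authors' evaluation code) that runs k-means on frozen image features and resolves the cluster-to-class assignment with Hungarian matching. The `features` and `labels` arrays are hypothetical, pre-extracted inputs.

```python
# Minimal sketch: normalized clustering accuracy on frozen image features.
# `features` (N, D) and `labels` (N,) are hypothetical, pre-extracted inputs.
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    n_classes = int(labels.max()) + 1
    # Cluster the image features into as many groups as ground-truth classes.
    cluster_ids = KMeans(n_clusters=n_classes, n_init=10,
                         random_state=0).fit_predict(features)

    # Co-occurrence matrix between cluster assignments and true classes.
    overlap = np.zeros((n_classes, n_classes), dtype=np.int64)
    for c, y in zip(cluster_ids, labels):
        overlap[c, y] += 1

    # Hungarian matching maps each cluster to the class it overlaps most.
    rows, cols = linear_sum_assignment(overlap.max() - overlap)
    return overlap[rows, cols].sum() / labels.size
```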

2. Image-Text Alignment Failure

Despite strong image encoding capabilities, OpenCLIP fails to achieve cross-modal alignment for OOP concepts:

  • Zero-shot classification accuracy for OOP concepts is significantly lower than for IP concepts
  • The image-text alignment issue persists even with increasing pre-training data scale
  • Root cause: the token embeddings of OOP concept names received no image-text alignment signal during pre-training
Figure 4a: OpenCLIP's zero-shot inference accuracy on OOP and IP classes in LAION-Beyond.
Figure 4b: EVA-CLIP's zero-shot inference accuracy on OOP and IP classes in LAION-Beyond.
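
For context, the zero-shot numbers in Figure 4 come from standard prompt-based inference; below is a minimal sketch using the open_clip library. The checkpoint tag, class names, and image path are placeholders rather than the paper's exact setup.

```python
# Sketch of prompt-based zero-shot inference with open_clip.
# The model tag, class names, and image path are placeholders.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_names = ["concept_a", "concept_b"]        # hypothetical OOP concept names
prompts = [f"a photo of a {name}." for name in class_names]

with torch.no_grad():
    text_feats = model.encode_text(tokenizer(prompts))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0)
    image_feat = model.encode_image(image)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    # Cosine similarity picks the class; for OOP names the text side is
    # poorly aligned, which is the failure mode Figure 4 quantifies.
    pred = (image_feat @ text_feats.T).argmax(dim=-1).item()

print(class_names[pred])
```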

Our Approach

Based on our findings, we developed two novel algorithms to address the OOP generalization problem:

Few-Shot Name Learning (FSNL)

For scenarios with a few image-text pairs for OOP concepts:

  • Optimizes only the name embeddings of OOP concepts, keeping other parameters fixed
  • Enhances training with context augmentation through similar concept shuffling
  • Combines theoretical guarantees with empirical validation
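
The core of FSNL is name tuning: the only trainable parameters are the embeddings standing in for the OOP concept names, while the prompt context and both encoders stay frozen. The sketch below is a simplified paraphrase of that idea, not the released implementation; `encode_text_from_embeddings` is a hypothetical helper that runs CLIP's frozen text transformer on token embeddings instead of token IDs, and the context-augmentation step is omitted.

```python
# Simplified FSNL-style "name tuning" sketch (our paraphrase, not the authors' code).
# Only the embeddings standing in for the OOP concept names are trained; the
# image encoder, text encoder, and prompt context stay frozen.
import torch
import torch.nn.functional as F

def fsnl_step(name_emb, context_emb, image_feats, labels,
              encode_text_from_embeddings, optimizer, temperature=0.01):
    """One optimization step on a few-shot batch.

    name_emb:     (C, d) learnable embeddings, one per OOP class name.
    context_emb:  (L, d) frozen embeddings of the prompt template tokens.
    image_feats:  (B, D) frozen, L2-normalized image features of the shots.
    labels:       (B,) class indices of the shots.
    """
    # Splice each learnable name embedding into the frozen prompt template.
    prompts = torch.stack(
        [torch.cat([context_emb, n.unsqueeze(0)], dim=0) for n in name_emb])
    text_feats = encode_text_from_embeddings(prompts)           # (C, D)
    text_feats = F.normalize(text_feats, dim=-1)

    logits = image_feats @ text_feats.T / temperature           # (B, C)
    loss = F.cross_entropy(logits, labels)

    optimizer.zero_grad()
    loss.backward()          # gradients flow only into name_emb
    optimizer.step()
    return loss.item()
```

In practice one would initialize `name_emb` from the token embedding of the raw concept string and optimize it with a small learning rate over the few available shots.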

Zero-Shot Name Learning (ZSNL)

For scenarios with no image-text pairs for OOP concepts:

  • Leverages Novel Class Discovery (NCD) techniques to guide OOP class image clustering
  • Implements image-text bipartite graph matching
  • Optimizes OOP name embeddings using high-confidence samples
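
A central step of a ZSNL-style pipeline is the bipartite matching between discovered image clusters and candidate OOP names; the sketch below illustrates one way to realize it with the Hungarian algorithm, assuming cluster centroids and candidate name features are already computed and L2-normalized (our illustration, not the released implementation).

```python
# Sketch of the image-text bipartite matching step in a ZSNL-style pipeline.
# Clusters of unlabeled OOP images are matched one-to-one to candidate OOP
# concept names by maximizing centroid / text-feature cosine similarity.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clusters_to_names(cluster_centroids: np.ndarray,
                            name_text_feats: np.ndarray) -> dict:
    """cluster_centroids: (K, D) L2-normalized mean image features per cluster.
    name_text_feats:      (K, D) L2-normalized text features of candidate names.
    Returns {cluster_id: name_id} from one-to-one Hungarian matching."""
    similarity = cluster_centroids @ name_text_feats.T   # (K, K) cosine scores
    # The Hungarian solver minimizes cost, so negate the similarities.
    rows, cols = linear_sum_assignment(-similarity)
    return dict(zip(rows.tolist(), cols.tolist()))
```

High-confidence images from each matched cluster can then serve as pseudo image-text pairs and feed the same name-tuning objective used in FSNL.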

Experimental Results

OOP Few-Shot Learning

FSNL outperforms all baseline methods in 8 of the 9 domains (all except Food), and its performance improves steadily as the number of shots increases.

Figure 5: OOP few-shot learning performance (1, 2, 4, 8, 16 shots) of different baselines on the Animals, Landmark, and Pokemon test sets of LAION-Beyond (400M).

OOP and IP Concept Balance

FSNL maintains high recognition rates for OOP concepts while preserving original performance on IP concepts, achieving the best of both worlds.

| Baseline | Metric | Animals | Architecture | Attire | FolkArt | Food | Insects_Spider | Landmark | Plants_Fungi | Pokemon | Avg | Extra subnet |
|----------|--------|---------|--------------|--------|---------|------|----------------|----------|--------------|---------|-----|--------------|
| OpenCLIP | OOP | 19.70 | 22.49 | 16.18 | 27.07 | 8.88 | 16.86 | 25.65 | 15.71 | 15.47 | 18.67 | No |
| | IP | 41.66 | 48.60 | 64.57 | 49.67 | 56.93 | 33.28 | 93.41 | 33.71 | 58.58 | 53.38 | |
| | H-mean | 26.75 | 30.75 | 25.88 | 35.04 | 15.36 | 22.38 | 40.25 | 21.43 | 24.48 | 26.92 | |
| CoOp | OOP | 38.31 | 76.54 | 60.28 | 68.35 | 61.67 | 47.18 | 85.24 | 48.25 | 39.24 | 58.34 | No |
| | IP | 26.56 | 46.43 | 43.29 | 42.04 | 32.48 | 17.69 | 86.54 | 16.67 | 32.45 | 38.24 | |
| | H-mean | 31.37 | 57.80 | 50.39 | 52.06 | 42.55 | 25.73 | 85.89 | 24.78 | 35.52 | 45.12 | |
| CoCoOp | OOP | 23.82 | 38.53 | 29.58 | 31.29 | 19.94 | 23.16 | 45.81 | 20.35 | 18.10 | 27.84 | Yes |
| | IP | 36.82 | 41.30 | 39.40 | 33.19 | 36.15 | 30.76 | 85.27 | 28.29 | 37.22 | 40.93 | |
| | H-mean | 28.93 | 39.87 | 33.79 | 32.21 | 25.70 | 26.42 | 59.60 | 23.67 | 24.36 | 32.73 | |
| CLIP-Adapter | OOP | 42.80 | 81.40 | 73.72 | 78.20 | 83.48 | 51.00 | 92.03 | 54.17 | 63.82 | 68.69 | Yes |
| | IP | 35.78 | 46.60 | 57.43 | 44.01 | 52.31 | 23.86 | 89.64 | 22.68 | 48.30 | 46.73 | |
| | H-mean | 38.98 | 59.27 | 64.56 | 56.32 | 64.32 | 32.51 | 90.82 | 31.97 | 54.99 | 54.86 | |
| Learning-to-Name | OOP | 26.18 | 41.64 | 43.24 | 54.11 | 44.18 | 31.21 | 50.13 | 24.76 | 37.99 | 39.27 | Yes |
| | IP | 32.86 | 40.16 | 51.24 | 39.89 | 42.17 | 28.90 | 88.21 | 26.05 | 45.11 | 43.84 | |
| | H-mean | 29.14 | 40.89 | 46.90 | 45.92 | 43.15 | 30.01 | 63.93 | 25.39 | 41.25 | 40.73 | |
| FSNL (ours) | OOP | 51.77 | 88.03 | 80.47 | 86.23 | 90.85 | 65.03 | 95.57 | 63.85 | 77.88 | 77.74 | No |
| | IP | 41.66 | 48.60 | 64.57 | 49.67 | 56.93 | 33.28 | 93.41 | 33.71 | 58.58 | 53.38 | |
| | H-mean | 46.17 | 62.63 | 71.65 | 63.03 | 70.00 | 44.03 | 94.48 | 44.12 | 68.87 | 62.55 | |
Table 2. OOP-to-IP open-vocabulary prediction. OOP, IP, and H-mean denote, respectively, the accuracy of models trained with 4-shot OOP few-shot learning on OOP classes, their accuracy on the IP image set, and the harmonic mean of the two.

Open-World Vocabulary Transfer

In evaluations with mixed OOP and IP concepts, FSNL significantly outperforms all baseline methods.

| Method | Anim | Arch | Atti | Folk | Food | Insect | Ladmk | Plant | Pokem | Avg |
|--------|------|------|------|------|------|--------|-------|-------|-------|-----|
| OpenCLIP | 19.40 | 25.21 | 25.23 | 27.75 | 18.86 | 16.43 | 46.10 | 16.47 | 26.89 | 24.70 |
| CoOp | 23.99 | 57.93 | 43.85 | 52.33 | 32.63 | 26.37 | 80.07 | 27.21 | 22.78 | 40.80 |
| CoCoOp | 18.86 | 29.40 | 23.78 | 22.54 | 17.09 | 17.78 | 50.14 | 16.93 | 18.38 | 23.88 |
| CLIP-Adap | 29.51 | 64.60 | 58.92 | 59.98 | 64.14 | 29.23 | 85.05 | 32.89 | 54.47 | 53.20 |
| L2Name | 21.78 | 33.02 | 25.26 | 27.09 | 25.14 | 21.13 | 67.05 | 22.89 | 26.47 | 29.98 |
| FSNL (ours) | 36.35 | 71.00 | 63.75 | 69.27 | 68.09 | 39.70 | 92.54 | 40.53 | 65.29 | 60.72 |
| Δ | +6.84 | +6.40 | +5.42 | +9.29 | +3.95 | +10.47 | +7.49 | +7.64 | +10.82 | +7.52 |
Table 3. Open-world transfer results (accuracy over all OOP-class and IP-class test images) across the 9 domains. Linear Probe and TaskRes are excluded because they cannot transfer the training vocabulary. Δ is the absolute margin by which FSNL exceeds the second-best method.

OOP Zero-Shot Learning

ZSNL substantially outperforms the baseline methods in zero-shot learning for OOP concepts across the 13 domain settings drawn from LAION-Beyond (400M) and (5B), with FolkArt (400M) as the only exception.

| Split | Method | Anim | Arch | Atti | Folk | Food | Insect | Ladmk | Plant | Pokem |
|-------|--------|------|------|------|------|------|--------|-------|-------|-------|
| (400M) | OpenCLIP | 19.7 | 22.5 | 16.2 | 27.1 | 8.9 | 16.9 | 25.7 | 15.7 | 15.5 |
| | TransCLIP | 21.8 | 25.2 | 19.1 | 29.3 | 10.1 | 17.3 | 29.5 | 16.8 | 19.6 |
| | ZSNL (ours) | 25.7 | 39.9 | 34.9 | 27.6 | 29.4 | 24.2 | 43.9 | 26.4 | 62.7 |
| | Δ | +3.9 | +14.7 | +15.8 | -1.7 | +19.3 | +6.9 | +14.4 | +9.6 | +43.1 |
| (5B) | OpenCLIP | 25.9 | - | - | - | - | 32.2 | - | 34.3 | 21.7 |
| | TransCLIP | 31.1 | - | - | - | - | 35.1 | - | 35.1 | 28.1 |
| | ZSNL (ours) | 45.6 | - | - | - | - | 59.8 | - | 70.7 | 72.5 |
| | Δ | +14.5 | - | - | - | - | +24.7 | - | +35.6 | +42.4 |
Table 4. Zero-shot learning accuracy (%) on OOP classes drawn from the domains of LAION-Beyond (400M) and (5B). Δ is the absolute margin by which ZSNL exceeds the second-best method.

FSNL Performance Across Model Scales

FSNL demonstrates consistent performance improvements across different model scales, including larger models and different CLIP variants (OpenAI CLIP, EVA-CLIP).

Figure 6: FSNL performance under the neural scaling law, demonstrating consistent improvement across larger models and different CLIP variants (OpenAI, EVA). Light-colored circles show each model's zero-shot results on the OOP data of LAION-Beyond, while dark-colored circles show the results of the same models after FSNL tuning.

Impact and Significance

Our research fills a critical gap in vision-language model research, with far-reaching impacts:

  • Theoretical Contribution: First theoretical framework for the OOP generalization problem
  • Practical Value: Efficient algorithms for handling new concepts in the open world
  • Future Direction: Paving the way for truly open-world multimodal systems

The LAION-Beyond benchmark and our two algorithms not only reveal the capabilities and limitations of vision-language models with OOP concepts but also provide effective solutions for OOP generalization, offering significant value for advancing open-world multimodal AI.

BibTeX

@inproceedings{chen2025laionbeyond,
    title={Reproducible Vision-Language Models Meet Concepts Out of Pre-Training},
    author={Chen, Ziliang and Huang, Xin and Fan, Xiaoxuan and Wang, Keze and Zhou, Yuyu and Guan, Quanlong and Lin, Liang},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages={xxxx--xxxx},
    year={2025}
}