Contrastive Language-Image Pre-training (CLIP) models, as a milestone of modern multimodal intelligence, have attracted significant research interest regarding their generalization mechanisms. Existing studies, however, remain confined to the scope of pre-training knowledge and have hardly addressed how these models generalize to the countless open-world concepts absent from the pre-training regime. This paper investigates this Out-of-Pre-training (OOP) generalization problem from a holistic perspective. We propose the LAION-Beyond benchmark to isolate the evaluation of OOP concepts from pre-training knowledge, focusing on OpenCLIP and its reproducible variants derived from LAION datasets. Empirical analysis shows that although the image features of OOP concepts exhibit significant category margins, zero-shot transfer to these concepts largely fails due to poor image-text alignment. To address this, we elaborate the "name-tuning" methodology together with its theoretical merits for OOP generalization, and propose few-shot name learning (FSNL) and zero-shot name learning (ZSNL) algorithms that achieve OOP generalization in a data-efficient manner. Comprehensive experiments further verify their superiority.
First Systematic Study
First research to systematically explore vision-language models' generalization to concepts absent from pre-training data
Novel Benchmark
LAION-Beyond: the first multi-domain benchmark specifically designed to evaluate OOP concept generalization
Novel Algorithms
FSNL and ZSNL: innovative approaches to effectively solve the OOP generalization problem
Modern vision-language models such as CLIP have demonstrated remarkable zero-shot and few-shot learning capabilities through large-scale pre-training on image-text pairs. However, how these models behave when faced with concepts never encountered during pre-training remains a critical yet under-explored question.
Research Distinction: We distinguish IP (In-Pre-training) concepts, which appear in the pre-training data, from OOP (Out-of-Pre-training) concepts, which do not.
We constructed the LAION-Beyond benchmark as the first dedicated dataset for evaluating vision-language models' generalization to Out-of-Pre-training concepts:
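The core construction constraint is that OOP concept names must never appear in the LAION captions. The following is a minimal, hypothetical sketch of that filtering criterion only (names and helpers here are illustrative; the actual benchmark pipeline also handles alias expansion, multilingual captions, and deduplication):

```python
# Hypothetical sketch of the OOP filtering criterion: a concept qualifies as
# Out-of-Pre-training only if its name never occurs in the pre-training captions.
# The real LAION-Beyond construction is more involved than this.
import re
from typing import Iterable

def is_oop_concept(name: str, captions: Iterable[str]) -> bool:
    """True if `name` never appears as a whole phrase in any caption."""
    pattern = re.compile(r"\b" + re.escape(name.lower()) + r"\b")
    return not any(pattern.search(caption.lower()) for caption in captions)

# Toy usage with an in-memory caption list:
captions = ["a photo of a golden retriever", "the eiffel tower at night"]
print(is_oop_concept("pikachu", captions))       # True  -> OOP candidate
print(is_oop_concept("eiffel tower", captions))  # False -> IP concept
```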
OpenCLIP's image encoder can extract features with clear clustering boundaries for OOP concepts, even though these concepts never appeared during pre-training:
| | Anim | Arch | Atti | Folk | Food | Insect | Ladmk | Plant | Pokem | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| IP classes | 40.27 | 91.04 | 82.09 | 78.02 | 81.72 | 50.44 | 93.01 | 55.71 | 34.07 | 68.15 |
| OOP classes | 37.27 | 81.06 | 68.92 | 76.60 | 80.65 | 48.30 | 86.39 | 53.17 | 35.80 | 63.13 |
| IP-OOP gap | 3.00 | 10.02 | 13.83 | 1.42 | 1.07 | 2.14 | 6.62 | 2.54 | 1.73 | 5.02 |
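One common way to obtain per-domain numbers of this kind is a linear probe on frozen image features. Below is a sketch under that assumption, using open_clip and scikit-learn with a LAION-400M ViT-B/32 checkpoint; the paper's exact probing protocol may differ.

```python
# Sketch: quantify how separable frozen OpenCLIP image features are for a set
# of classes via a linear probe. One plausible protocol, not necessarily the
# one behind the table above.
import torch
import torch.nn.functional as F
import open_clip
from sklearn.linear_model import LogisticRegression

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
model.eval()

@torch.no_grad()
def embed(images):
    """Encode a list of PIL images into L2-normalized OpenCLIP features."""
    x = torch.stack([preprocess(im) for im in images])
    return F.normalize(model.encode_image(x), dim=-1).cpu().numpy()

def linear_probe_accuracy(train_imgs, train_labels, test_imgs, test_labels):
    probe = LogisticRegression(max_iter=1000)
    probe.fit(embed(train_imgs), train_labels)
    return 100.0 * probe.score(embed(test_imgs), test_labels)
```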
Despite strong image encoding capabilities, OpenCLIP fails to achieve cross-modal alignment for OOP concepts:
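Concretely, cross-modal alignment is probed with the standard zero-shot protocol: each class name is wrapped in a prompt, encoded by the text tower, and every image is assigned to the most similar text embedding. A minimal sketch with open_clip (same LAION-400M ViT-B/32 checkpoint assumed as above):

```python
# Standard zero-shot classification with OpenCLIP. For OOP concepts this is
# where accuracy collapses, even though the image features themselves cluster well.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def zero_shot_predict(images, class_names):
    """Return the predicted class index for each PIL image in `images`."""
    text = tokenizer([f"a photo of a {name}" for name in class_names])
    text_feat = F.normalize(model.encode_text(text), dim=-1)
    image_feat = F.normalize(
        model.encode_image(torch.stack([preprocess(im) for im in images])), dim=-1)
    return (image_feat @ text_feat.T).argmax(dim=-1)
```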
Based on our findings, we developed two novel algorithms to address the OOP generalization problem:
Few-Shot Name Learning (FSNL): for scenarios where a few image-text pairs of an OOP concept are available.
Zero-Shot Name Learning (ZSNL): for scenarios where no image-text pairs of OOP concepts are available at all. A minimal sketch of the underlying name-learning idea follows.
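As a rough, hedged sketch of the few-shot case: with both encoders frozen, a learnable "name" embedding is optimized for each OOP class so that it aligns with that class's few labelled images. The code below simplifies by optimizing the name directly in the joint embedding space; the actual FSNL algorithm feeds learnable name tokens through the frozen text encoder and carries additional machinery, and every function and variable name here is illustrative.

```python
# Simplified name-learning sketch: one learnable embedding per OOP class,
# fit with cross-entropy over cosine similarities to a handful of image
# features, while the CLIP encoders stay frozen. Not the paper's exact FSNL.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
model.eval()

def learn_names(few_shot_images, labels, num_classes, steps=200, lr=1e-2):
    """few_shot_images: list of PIL images; labels: LongTensor of class ids."""
    with torch.no_grad():
        img = model.encode_image(
            torch.stack([preprocess(im) for im in few_shot_images]))
        img = F.normalize(img, dim=-1)
    # One learnable name embedding per OOP class, randomly initialised.
    names = torch.nn.Parameter(0.02 * torch.randn(num_classes, img.shape[-1]))
    opt = torch.optim.Adam([names], lr=lr)
    for _ in range(steps):
        logits = 100.0 * img @ F.normalize(names, dim=-1).T  # CLIP-like temperature
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad(); loss.backward(); opt.step()
    # The learned names can then serve as text-side prototypes at test time.
    return F.normalize(names.detach(), dim=-1)
```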
FSNL outperforms all baseline methods across 8 domains (all except Food), and its performance keeps improving as the number of samples increases.
FSNL maintains high recognition rates for OOP concepts while preserving original performance on IP concepts, achieving the best of both worlds.
| Baseline | Metric | Animals | Architecture | Attire | FolkArt | Food | Insects_Spider | Landmark | Plants_Fungi | Pokemon | Avg | Extra subnet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenCLIP | OOP | 19.70 | 22.49 | 16.18 | 27.07 | 8.88 | 16.86 | 25.65 | 15.71 | 15.47 | 18.67 | No |
| | IP | 41.66 | 48.60 | 64.57 | 49.67 | 56.93 | 33.28 | 93.41 | 33.71 | 58.58 | 53.38 | |
| | H-mean | 26.75 | 30.75 | 25.88 | 35.04 | 15.36 | 22.38 | 40.25 | 21.43 | 24.48 | 26.92 | |
| CoOp | OOP | 38.31 | 76.54 | 60.28 | 68.35 | 61.67 | 47.18 | 85.24 | 48.25 | 39.24 | 58.34 | No |
| | IP | 26.56 | 46.43 | 43.29 | 42.04 | 32.48 | 17.69 | 86.54 | 16.67 | 32.45 | 38.24 | |
| | H-mean | 31.37 | 57.80 | 50.39 | 52.06 | 42.55 | 25.73 | 85.89 | 24.78 | 35.52 | 45.12 | |
| CoCoOp | OOP | 23.82 | 38.53 | 29.58 | 31.29 | 19.94 | 23.16 | 45.81 | 20.35 | 18.10 | 27.84 | Yes |
| | IP | 36.82 | 41.30 | 39.40 | 33.19 | 36.15 | 30.76 | 85.27 | 28.29 | 37.22 | 40.93 | |
| | H-mean | 28.93 | 39.87 | 33.79 | 32.21 | 25.70 | 26.42 | 59.60 | 23.67 | 24.36 | 32.73 | |
| CLIP-Adapter | OOP | 42.80 | 81.40 | 73.72 | 78.20 | 83.48 | 51.00 | 92.03 | 54.17 | 63.82 | 68.69 | Yes |
| | IP | 35.78 | 46.60 | 57.43 | 44.01 | 52.31 | 23.86 | 89.64 | 22.68 | 48.30 | 46.73 | |
| | H-mean | 38.98 | 59.27 | 64.56 | 56.32 | 64.32 | 32.51 | 90.82 | 31.97 | 54.99 | 54.86 | |
| Learning-to-Name | OOP | 26.18 | 41.64 | 43.24 | 54.11 | 44.18 | 31.21 | 50.13 | 24.76 | 37.99 | 39.27 | Yes |
| | IP | 32.86 | 40.16 | 51.24 | 39.89 | 42.17 | 28.90 | 88.21 | 26.05 | 45.11 | 43.84 | |
| | H-mean | 29.14 | 40.89 | 46.90 | 45.92 | 43.15 | 30.01 | 63.93 | 25.39 | 41.25 | 40.73 | |
| FSNL (ours) | OOP | 51.77 | 88.03 | 80.47 | 86.23 | 90.85 | 65.03 | 95.57 | 63.85 | 77.88 | 77.74 | No |
| | IP | 41.66 | 48.60 | 64.57 | 49.67 | 56.93 | 33.28 | 93.41 | 33.71 | 58.58 | 53.38 | |
| | H-mean | 46.17 | 62.63 | 71.65 | 63.03 | 70.00 | 44.03 | 94.48 | 44.12 | 68.87 | 62.55 | |
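The H-mean rows are the harmonic mean of the OOP and IP accuracies, which rewards methods that do well on both splits rather than trading one for the other; the reported values can be reproduced from the other two rows:

```python
def h_mean(oop_acc: float, ip_acc: float) -> float:
    """Harmonic mean of OOP and IP accuracy, as used in the table above."""
    return 2 * oop_acc * ip_acc / (oop_acc + ip_acc)

print(round(h_mean(19.70, 41.66), 2))  # 26.75 -> OpenCLIP, Animals
print(round(h_mean(51.77, 41.66), 2))  # 46.17 -> FSNL (ours), Animals
```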
In evaluations with mixed OOP and IP concepts, FSNL significantly outperforms all baseline methods.
| Method | Anim | Arch | Atti | Folk | Food | Insect | Ladmk | Plant | Pokem | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenCLIP | 19.40 | 25.21 | 25.23 | 27.75 | 18.86 | 16.43 | 46.10 | 16.47 | 26.89 | 24.70 |
| CoOp | 23.99 | 57.93 | 43.85 | 52.33 | 32.63 | 26.37 | 80.07 | 27.21 | 22.78 | 40.80 |
| CoCoOp | 18.86 | 29.40 | 23.78 | 22.54 | 17.09 | 17.78 | 50.14 | 16.93 | 18.38 | 23.88 |
| CLIP-Adap | 29.51 | 64.60 | 58.92 | 59.98 | 64.14 | 29.23 | 85.05 | 32.89 | 54.47 | 53.20 |
| L2Name | 21.78 | 33.02 | 25.26 | 27.09 | 25.14 | 21.13 | 67.05 | 22.89 | 26.47 | 29.98 |
| FSNL (ours) | 36.35 | 71.00 | 63.75 | 69.27 | 68.09 | 39.70 | 92.54 | 40.53 | 65.29 | 60.72 |
| Δ | +6.84 | +6.40 | +5.42 | +9.29 | +3.95 | +10.47 | +7.49 | +7.64 | +10.82 | +7.52 |
ZSNL substantially outperforms baseline zero-shot methods on OOP concepts across the 13 domain splits below (nine under the 400M pre-training split and four under the 5B split).
| Split | Method | Anim | Arch | Atti | Folk | Food | Insect | Ladmk | Plant | Pokem |
|---|---|---|---|---|---|---|---|---|---|---|
| (400M) | OpenCLIP | 19.7 | 22.5 | 16.2 | 27.1 | 8.9 | 16.9 | 25.7 | 15.7 | 15.5 |
| | TransCLIP | 21.8 | 25.2 | 19.1 | 29.3 | 10.1 | 17.3 | 29.5 | 16.8 | 19.6 |
| | ZSNL (ours) | 25.7 | 39.9 | 34.9 | 27.6 | 29.4 | 24.2 | 43.9 | 26.4 | 62.7 |
| | Δ | +3.9 | +14.7 | +15.8 | -1.7 | +19.3 | +6.9 | +14.4 | +9.6 | +43.1 |
| (5B) | OpenCLIP | 25.9 | - | - | - | - | 32.2 | - | 34.3 | 21.7 |
| | TransCLIP | 31.1 | - | - | - | - | 35.1 | - | 35.1 | 28.1 |
| | ZSNL (ours) | 45.6 | - | - | - | - | 59.8 | - | 70.7 | 72.5 |
| | Δ | +14.5 | - | - | - | - | +24.7 | - | +35.6 | +42.4 |
FSNL delivers consistent gains across model scales and CLIP variants, including larger backbones as well as OpenAI CLIP and EVA-CLIP.
Our work fills a critical gap in vision-language model research, with far-reaching implications:
The LAION-Beyond benchmark and our two algorithms not only reveal the capabilities and limitations of vision-language models on OOP concepts, but also provide effective, data-efficient solutions for OOP generalization, offering significant value for advancing open-world multimodal AI.
@inproceedings{chen2025laionbeyond,
title={Reproducible Vision-Language Models Meet Concepts Out of Pre-Training},
author={Chen, Ziliang and Huang, Xin and Fan, Xiaoxuan and Wang, Keze and Zhou, Yuyu and Guan, Quanlong and Lin, Liang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages={xxxx--xxxx},
year={2025}
}