Contrastive Language-Image Pre-training (CLIP) models, as a milestone of modern multimodal intelligence, have attracted significant research interest regarding their generalization mechanisms. Existing studies, however, remain confined to the scope of pre-training knowledge and have hardly addressed how these models generalize to the countless open-world concepts absent from the pre-training regime. This paper investigates this Out-of-Pre-training (OOP) generalization problem from a holistic perspective. We propose the LAION-Beyond benchmark to isolate the evaluation of OOP concepts from pre-training knowledge, focusing on OpenCLIP and its reproducible variants derived from LAION datasets. Empirical analysis shows that although the image features of OOP concepts exhibit significant category margins, zero-shot transfer to these concepts largely fails due to poor image-text alignment. To address this, we elaborate the "name-tuning" methodology together with its theoretical merits for OOP generalization, and propose few-shot name learning (FSNL) and zero-shot name learning (ZSNL) algorithms that achieve OOP generalization in a data-efficient manner. Comprehensive experiments further verify their superiority.
First Systematic Study
First research to systematically explore vision-language models' generalization to concepts absent from pre-training data
Novel Benchmark
LAION-Beyond: the first multi-domain benchmark specifically designed to evaluate OOP concept generalization
Novel Algorithms
FSNL and ZSNL: innovative approaches to effectively solve the OOP generalization problem
Modern vision-language models such as CLIP have demonstrated remarkable zero-shot and few-shot learning capabilities through large-scale pre-training on image-text pairs. However, how these models behave when faced with concepts never encountered during pre-training remains a critical yet under-explored question.
Research Distinction: We distinguish IP (In-Pre-training) concepts, which appear in the pre-training data, from OOP (Out-of-Pre-training) concepts, which do not.
We constructed the LAION-Beyond benchmark as the first dedicated dataset for evaluating vision-language models' generalization to Out-of-Pre-training concepts:
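The core construction constraint is that OOP concept names must never appear in the LAION captions. The following is a minimal, hypothetical sketch of that filtering criterion only (names and helpers here are illustrative; the actual benchmark pipeline also handles alias expansion, multilingual captions, and deduplication):

```python
# Hypothetical sketch of the OOP filtering criterion: a concept qualifies as
# Out-of-Pre-training only if its name never occurs in the pre-training captions.
# The real LAION-Beyond construction is more involved than this.
import re
from typing import Iterable

def is_oop_concept(name: str, captions: Iterable[str]) -> bool:
    """True if `name` never appears as a whole phrase in any caption."""
    pattern = re.compile(r"\b" + re.escape(name.lower()) + r"\b")
    return not any(pattern.search(caption.lower()) for caption in captions)

# Toy usage with an in-memory caption list:
captions = ["a photo of a golden retriever", "the eiffel tower at night"]
print(is_oop_concept("pikachu", captions))       # True  -> OOP candidate
print(is_oop_concept("eiffel tower", captions))  # False -> IP concept
```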
OpenCLIP's image encoder can extract features with clear clustering boundaries for OOP concepts, even though these concepts never appeared during pre-training:
| | Anim | Arch | Atti | Folk | Food | Insect | Ladmk | Plant | Pokem | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| IP classes | 40.27 | 91.04 | 82.09 | 78.02 | 81.72 | 50.44 | 93.01 | 55.71 | 34.07 | 68.15 |
| OOP classes | 37.27 | 81.06 | 68.92 | 76.60 | 80.65 | 48.30 | 86.39 | 53.17 | 35.80 | 63.13 |
| IP-OOP gap | 3.00 | 10.02 | 13.83 | 1.42 | 1.07 | 2.14 | 6.62 | 2.54 | 1.73 | 5.02 |
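One common way to obtain per-domain numbers of this kind is a linear probe on frozen image features. Below is a sketch under that assumption, using open_clip and scikit-learn with a LAION-400M ViT-B/32 checkpoint; the paper's exact probing protocol may differ.

```python
# Sketch: quantify how separable frozen OpenCLIP image features are for a set
# of classes via a linear probe. One plausible protocol, not necessarily the
# one behind the table above.
import torch
import torch.nn.functional as F
import open_clip
from sklearn.linear_model import LogisticRegression

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
model.eval()

@torch.no_grad()
def embed(images):
    """Encode a list of PIL images into L2-normalized OpenCLIP features."""
    x = torch.stack([preprocess(im) for im in images])
    return F.normalize(model.encode_image(x), dim=-1).cpu().numpy()

def linear_probe_accuracy(train_imgs, train_labels, test_imgs, test_labels):
    probe = LogisticRegression(max_iter=1000)
    probe.fit(embed(train_imgs), train_labels)
    return 100.0 * probe.score(embed(test_imgs), test_labels)
```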
Despite strong image encoding capabilities, OpenCLIP fails to achieve cross-modal alignment for OOP concepts:
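Concretely, cross-modal alignment is probed with the standard zero-shot protocol: each class name is wrapped in a prompt, encoded by the text tower, and every image is assigned to the most similar text embedding. A minimal sketch with open_clip (same LAION-400M ViT-B/32 checkpoint assumed as above):

```python
# Standard zero-shot classification with OpenCLIP. For OOP concepts this is
# where accuracy collapses, even though the image features themselves cluster well.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def zero_shot_predict(images, class_names):
    """Return the predicted class index for each PIL image in `images`."""
    text = tokenizer([f"a photo of a {name}" for name in class_names])
    text_feat = F.normalize(model.encode_text(text), dim=-1)
    image_feat = F.normalize(
        model.encode_image(torch.stack([preprocess(im) for im in images])), dim=-1)
    return (image_feat @ text_feat.T).argmax(dim=-1)
```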
Based on our findings, we developed two novel algorithms to address the OOP generalization problem:
Few-Shot Name Learning (FSNL): for scenarios where a few image-text pairs of an OOP concept are available.
Zero-Shot Name Learning (ZSNL): for scenarios where no image-text pairs of OOP concepts are available at all. A minimal sketch of the underlying name-learning idea follows.
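As a rough, hedged sketch of the few-shot case: with both encoders frozen, a learnable "name" embedding is optimized for each OOP class so that it aligns with that class's few labelled images. The code below simplifies by optimizing the name directly in the joint embedding space; the actual FSNL algorithm feeds learnable name tokens through the frozen text encoder and carries additional machinery, and every function and variable name here is illustrative.

```python
# Simplified name-learning sketch: one learnable embedding per OOP class,
# fit with cross-entropy over cosine similarities to a handful of image
# features, while the CLIP encoders stay frozen. Not the paper's exact FSNL.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
model.eval()

def learn_names(few_shot_images, labels, num_classes, steps=200, lr=1e-2):
    """few_shot_images: list of PIL images; labels: LongTensor of class ids."""
    with torch.no_grad():
        img = model.encode_image(
            torch.stack([preprocess(im) for im in few_shot_images]))
        img = F.normalize(img, dim=-1)
    # One learnable name embedding per OOP class, randomly initialised.
    names = torch.nn.Parameter(0.02 * torch.randn(num_classes, img.shape[-1]))
    opt = torch.optim.Adam([names], lr=lr)
    for _ in range(steps):
        logits = 100.0 * img @ F.normalize(names, dim=-1).T  # CLIP-like temperature
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad(); loss.backward(); opt.step()
    # The learned names can then serve as text-side prototypes at test time.
    return F.normalize(names.detach(), dim=-1)
```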
FSNL outperforms all baseline methods across 8 domains (all except Food), and its performance keeps improving as the number of samples increases.
FSNL maintains high recognition rates for OOP concepts while preserving original performance on IP concepts, achieving the best of both worlds.
| Baseline | Metric | Animals | Architecture | Attire | FolkArt | Food | Insects_Spider | Landmark | Plants_Fungi | Pokemon | Avg | Extra subnet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenCLIP | OOP | 19.70 | 22.49 | 16.18 | 27.07 | 8.88 | 16.86 | 25.65 | 15.71 | 15.47 | 18.67 | No |
| | IP | 41.66 | 48.60 | 64.57 | 49.67 | 56.93 | 33.28 | 93.41 | 33.71 | 58.58 | 53.38 | |
| | H-mean | 26.75 | 30.75 | 25.88 | 35.04 | 15.36 | 22.38 | 40.25 | 21.43 | 24.48 | 26.92 | |
| CoOp | OOP | 38.31 | 76.54 | 60.28 | 68.35 | 61.67 | 47.18 | 85.24 | 48.25 | 39.24 | 58.34 | No |
| | IP | 26.56 | 46.43 | 43.29 | 42.04 | 32.48 | 17.69 | 86.54 | 16.67 | 32.45 | 38.24 | |
| | H-mean | 31.37 | 57.80 | 50.39 | 52.06 | 42.55 | 25.73 | 85.89 | 24.78 | 35.52 | 45.12 | |
| CoCoOp | OOP | 23.82 | 38.53 | 29.58 | 31.29 | 19.94 | 23.16 | 45.81 | 20.35 | 18.10 | 27.84 | Yes |
| | IP | 36.82 | 41.30 | 39.40 | 33.19 | 36.15 | 30.76 | 85.27 | 28.29 | 37.22 | 40.93 | |
| | H-mean | 28.93 | 39.87 | 33.79 | 32.21 | 25.70 | 26.42 | 59.60 | 23.67 | 24.36 | 32.73 | |
| CLIP-Adapter | OOP | 42.80 | 81.40 | 73.72 | 78.20 | 83.48 | 51.00 | 92.03 | 54.17 | 63.82 | 68.69 | Yes |
| | IP | 35.78 | 46.60 | 57.43 | 44.01 | 52.31 | 23.86 | 89.64 | 22.68 | 48.30 | 46.73 | |
| | H-mean | 38.98 | 59.27 | 64.56 | 56.32 | 64.32 | 32.51 | 90.82 | 31.97 | 54.99 | 54.86 | |
| Learning-to-Name | OOP | 26.18 | 41.64 | 43.24 | 54.11 | 44.18 | 31.21 | 50.13 | 24.76 | 37.99 | 39.27 | Yes |
| | IP | 32.86 | 40.16 | 51.24 | 39.89 | 42.17 | 28.90 | 88.21 | 26.05 | 45.11 | 43.84 | |
| | H-mean | 29.14 | 40.89 | 46.90 | 45.92 | 43.15 | 30.01 | 63.93 | 25.39 | 41.25 | 40.73 | |
| FSNL (ours) | OOP | 51.77 | 88.03 | 80.47 | 86.23 | 90.85 | 65.03 | 95.57 | 63.85 | 77.88 | 77.74 | No |
| | IP | 41.66 | 48.60 | 64.57 | 49.67 | 56.93 | 33.28 | 93.41 | 33.71 | 58.58 | 53.38 | |
| | H-mean | 46.17 | 62.63 | 71.65 | 63.03 | 70.00 | 44.03 | 94.48 | 44.12 | 68.87 | 62.55 | |
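The H-mean rows are the harmonic mean of the OOP and IP accuracies, which rewards methods that do well on both splits rather than trading one for the other; the reported values can be reproduced from the other two rows:

```python
def h_mean(oop_acc: float, ip_acc: float) -> float:
    """Harmonic mean of OOP and IP accuracy, as used in the table above."""
    return 2 * oop_acc * ip_acc / (oop_acc + ip_acc)

print(round(h_mean(19.70, 41.66), 2))  # 26.75 -> OpenCLIP, Animals
print(round(h_mean(51.77, 41.66), 2))  # 46.17 -> FSNL (ours), Animals
```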
In evaluations with mixed OOP and IP concepts, FSNL significantly outperforms all baseline methods.
| Method | Anim | Arch | Atti | Folk | Food | Insect | Ladmk | Plant | Pokem | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenCLIP | 19.40 | 25.21 | 25.23 | 27.75 | 18.86 | 16.43 | 46.10 | 16.47 | 26.89 | 24.70 |
| CoOp | 23.99 | 57.93 | 43.85 | 52.33 | 32.63 | 26.37 | 80.07 | 27.21 | 22.78 | 40.80 |
| CoCoOp | 18.86 | 29.40 | 23.78 | 22.54 | 17.09 | 17.78 | 50.14 | 16.93 | 18.38 | 23.88 |
| CLIP-Adap | 29.51 | 64.60 | 58.92 | 59.98 | 64.14 | 29.23 | 85.05 | 32.89 | 54.47 | 53.20 |
| L2Name | 21.78 | 33.02 | 25.26 | 27.09 | 25.14 | 21.13 | 67.05 | 22.89 | 26.47 | 29.98 |
| FSNL (ours) | 36.35 | 71.00 | 63.75 | 69.27 | 68.09 | 39.70 | 92.54 | 40.53 | 65.29 | 60.72 |
| Δ | +6.84 | +6.40 | +5.42 | +9.29 | +3.95 | +10.47 | +7.49 | +7.64 | +10.82 | +7.52 |
ZSNL substantially outperforms baseline zero-shot methods on OOP concepts across the 13 domain splits below (nine under the 400M pre-training split and four under the 5B split).
| Split | Method | Anim | Arch | Atti | Folk | Food | Insect | Ladmk | Plant | Pokem |
|---|---|---|---|---|---|---|---|---|---|---|
| (400M) | OpenCLIP | 19.7 | 22.5 | 16.2 | 27.1 | 8.9 | 16.9 | 25.7 | 15.7 | 15.5 |
| | TransCLIP | 21.8 | 25.2 | 19.1 | 29.3 | 10.1 | 17.3 | 29.5 | 16.8 | 19.6 |
| | ZSNL (ours) | 25.7 | 39.9 | 34.9 | 27.6 | 29.4 | 24.2 | 43.9 | 26.4 | 62.7 |
| | Δ | +3.9 | +14.7 | +15.8 | -1.7 | +19.3 | +6.9 | +14.4 | +9.6 | +43.1 |
| (5B) | OpenCLIP | 25.9 | - | - | - | - | 32.2 | - | 34.3 | 21.7 |
| | TransCLIP | 31.1 | - | - | - | - | 35.1 | - | 35.1 | 28.1 |
| | ZSNL (ours) | 45.6 | - | - | - | - | 59.8 | - | 70.7 | 72.5 |
| | Δ | +14.5 | - | - | - | - | +24.7 | - | +35.6 | +42.4 |
FSNL delivers consistent gains across model scales and CLIP variants, including larger backbones as well as OpenAI CLIP and EVA-CLIP.
Our work fills a critical gap in vision-language model research, with far-reaching implications:
The LAION-Beyond benchmark and our two algorithms not only reveal the capabilities and limitations of vision-language models on OOP concepts, but also provide effective, data-efficient solutions for OOP generalization, offering significant value for advancing open-world multimodal AI.
@inproceedings{chen2025laionbeyond,
title={Reproducible Vision-Language Models Meet Concepts Out of Pre-Training},
author={Chen, Ziliang and Huang, Xin and Fan, Xiaoxuan and Wang, Keze and Zhou, Yuyu and Guan, Quanlong and Lin, Liang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages={xxxx--xxxx},
year={2025}
}