CLIP - Contrastive Language-Image Pre-Training

Created: 2022-07-01 10:02
#paper

Main idea

CLIP is a multimodal model released by OpenAI. It performs remarkably well on zero-shot learning tasks.
A CLIP model consists of two encoders, one for text and one for images, that map both modalities into the same embedding space. CLIP is then trained with a contrastive pre-training objective to predict how likely an image corresponds to a given text.
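The contrastive objective can be sketched as a symmetric cross-entropy over a batch of image-text pairs: matching pairs sit on the diagonal of the similarity matrix and are pushed to score higher than all mismatched pairs. A minimal sketch, assuming the encoders have already produced the embeddings (the function name and temperature value are illustrative, not from the paper verbatim):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Sketch of CLIP's training objective; image_emb and text_emb stand in
    for the outputs of the image and text encoders.
    """
    # L2-normalize so the dot product becomes cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature
    logits = image_emb @ text_emb.T / temperature

    # The matching image-text pairs lie on the diagonal
    n = logits.shape[0]
    labels = np.arange(n)

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image cross-entropies
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Because the loss is symmetric, both encoders receive gradient pressure to co-locate matching pairs in the shared space.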
CLIP_encoders.png
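Zero-shot classification then falls out of the shared space: encode the image, encode one text prompt per class, and pick the class whose text embedding is most similar to the image embedding. A minimal sketch with precomputed embeddings (function name and toy vectors are illustrative assumptions):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Pick the class whose text embedding is closest to the image.

    Sketch assuming the CLIP encoders already produced the embeddings;
    similarity is cosine similarity after L2 normalization.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True
    )
    sims = class_text_embs @ image_emb  # one score per candidate class
    return class_names[int(np.argmax(sims))]

# Toy example: the image embedding points toward the "cat" prompt
text_embs = np.array([[1.0, 0.0], [0.0, 1.0]])  # e.g. "a photo of a cat" / "... dog"
image = np.array([0.9, 0.1])
prediction = zero_shot_classify(image, text_embs, ["cat", "dog"])
```

No fine-tuning is needed: new classes are added simply by writing new text prompts.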

References

  1. Paper
  2. OpenAI Blog
  3. Papers with Code
  4. Hugging Face
  5. Analytics India
  6. KDNuggets

Code

  1. GitHub
  2. Example for classification
  3. Towards Data Science
  4. How to implement CLIP