CLIP - Contrastive Language-Image Pre-Training

Created: 2022-07-01 10:02
#paper

Main idea

CLIP is a multimodal model released by OpenAI. It performs remarkably well on zero-shot learning tasks.
A CLIP model consists of two encoders, one for text and one for images, that map both modalities into the same embedding space. CLIP is then trained with a contrastive pre-training objective to predict how likely an image corresponds to a given text.
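The contrastive objective can be sketched as a symmetric cross-entropy over a batch of image-text pairs: matching pairs sit on the diagonal of the similarity matrix and are pushed to score higher than all mismatched pairs. A minimal sketch, assuming the encoders have already produced the embeddings (the function name and temperature value are illustrative, not from the paper verbatim):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Sketch of CLIP's training objective; image_emb and text_emb stand in
    for the outputs of the image and text encoders.
    """
    # L2-normalize so the dot product becomes cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature
    logits = image_emb @ text_emb.T / temperature

    # The matching image-text pairs lie on the diagonal
    n = logits.shape[0]
    labels = np.arange(n)

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image cross-entropies
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Because the loss is symmetric, both encoders receive gradient pressure to co-locate matching pairs in the shared space.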
CLIP_encoders.png
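Zero-shot classification then falls out of the shared space: encode the image, encode one text prompt per class, and pick the class whose text embedding is most similar to the image embedding. A minimal sketch with precomputed embeddings (function name and toy vectors are illustrative assumptions):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Pick the class whose text embedding is closest to the image.

    Sketch assuming the CLIP encoders already produced the embeddings;
    similarity is cosine similarity after L2 normalization.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True
    )
    sims = class_text_embs @ image_emb  # one score per candidate class
    return class_names[int(np.argmax(sims))]

# Toy example: the image embedding points toward the "cat" prompt
text_embs = np.array([[1.0, 0.0], [0.0, 1.0]])  # e.g. "a photo of a cat" / "... dog"
image = np.array([0.9, 0.1])
prediction = zero_shot_classify(image, text_embs, ["cat", "dog"])
```

No fine-tuning is needed: new classes are added simply by writing new text prompts.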

References

  1. Paper
  2. OpenAI Blog
  3. Papers with Code
  4. Hugging Face
  5. Analytics India
  6. KDNuggets

Code

  1. GitHub
  2. Example for classification
  3. Towards Data Science
  4. How to implement CLIP