Vision Language Model - VLM


Contents


I. The problem

II. Overview of the VLM

III. CLIP model (Contrastive Language-Image Pretraining)

1. Overview

Information: The CLIP model was introduced in January 2021 by OpenAI in the paper "Learning Transferable Visual Models From Natural Language Supervision". It is a landmark architecture for combining language and image learning, opening up zero-shot transfer for a wide range of computer vision tasks.

Objectives of Model:

  • Train the model to predict which caption goes with which image on a dataset of 400 million (image, text) pairs, learning an efficient image representation from scratch.

  • After training, natural language can be used to reference or describe visual concepts, enabling zero-shot transfer of the model to a variety of downstream tasks.

Results achieved: The model was evaluated on more than 30 datasets covering tasks such as OCR, action recognition in videos, geo-localization, and fine-grained object classification. It performed competitively with fully supervised baselines, matching the accuracy of the original ResNet-50 on ImageNet without using any of its training examples.

2. Model architecture

Image encoder: ResNet-50 (with ResNet-D modifications) and ViT are used as the base architectures for the image encoder. In the ResNet variants, the global average pooling (GAP) layer is replaced with a transformer-style attention pooling mechanism (multi-head QKV attention over the spatial feature map).
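
For intuition, here is a minimal PyTorch sketch of such an attention pooling layer (the class name, dimensions, and the omission of positional embeddings are simplifications for illustration, not CLIP's exact implementation): the mean of the spatial features serves as the query, and multi-head QKV attention aggregates the feature map into a single image embedding.

```python
import torch
import torch.nn as nn

class AttentionPool2dSketch(nn.Module):
    """Simplified transformer-style attention pooling over a CNN feature map.

    Replaces global average pooling: the spatially averaged feature is used as
    the query, and multi-head QKV attention aggregates all spatial positions.
    (Illustrative only; CLIP's real layer also adds positional embeddings.)
    """
    def __init__(self, embed_dim: int, num_heads: int = 8, out_dim: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(embed_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W) feature map from the CNN backbone
        tokens = x.flatten(2).transpose(1, 2)         # (batch, H*W, channels)
        query = tokens.mean(dim=1, keepdim=True)      # (batch, 1, channels): GAP as query
        pooled, _ = self.attn(query, tokens, tokens)  # attend over all spatial positions
        return self.proj(pooled.squeeze(1))           # (batch, out_dim) image embedding
```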

Text encoder:

  • A Transformer with architecture modifications: a 63M-parameter, 12-layer, 512-wide model with 8 attention heads.

  • Text is converted to tokens by a BPE tokenizer with a 49,152-token vocabulary, and the sequence length is capped at 76 (a minimal sketch of such an encoder follows this list).
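
For illustration only, here is a minimal PyTorch stand-in for a text encoder with these hyperparameters (the class name, layer choices, and the use of the final token position are assumptions of the sketch, not CLIP's exact code):

```python
import torch
import torch.nn as nn

class TextEncoderSketch(nn.Module):
    """Illustrative text Transformer with the hyperparameters described above:
    12 layers, width 512, 8 heads, a 49,152-token BPE vocabulary, length <= 76."""
    def __init__(self, vocab_size=49152, width=512, layers=12, heads=8, max_len=76):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, width)
        self.pos_emb = nn.Parameter(torch.zeros(max_len, width))
        block = nn.TransformerEncoderLayer(d_model=width, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(block, num_layers=layers)
        self.ln_final = nn.LayerNorm(width)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) BPE token ids
        x = self.token_emb(tokens) + self.pos_emb[: tokens.size(1)]
        x = self.transformer(x)
        x = self.ln_final(x)
        # Use the feature at the last token position as the text representation
        # (CLIP uses the end-of-text token; this sketch simply takes the final position).
        return x[:, -1, :]                            # (batch, width) text feature
```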

Multi-modal embedding space: The image features and text features are layer-normalized and then linearly projected into a shared multi-modal embedding space, where the similarity between images and descriptions is computed.
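
A rough sketch of this shared space, with assumed shapes and parameter names (`W_img`, `W_txt`, and `logit_scale` are illustrative; in CLIP the temperature is learned as a log-parameterized scalar):

```python
import torch
import torch.nn.functional as F

def joint_embedding_logits(image_features, text_features, W_img, W_txt, logit_scale):
    """Project both modalities into the shared space and return scaled cosine similarities.

    image_features: (N, d_img) layer-normalized image encoder outputs
    text_features:  (M, d_txt) layer-normalized text encoder outputs
    W_img, W_txt:   learned linear projections into the shared embedding space
    logit_scale:    learned log-temperature scalar
    """
    img_emb = F.normalize(image_features @ W_img, dim=-1)   # project + L2-normalize
    txt_emb = F.normalize(text_features @ W_txt, dim=-1)
    return logit_scale.exp() * img_emb @ txt_emb.t()        # (N, M) similarity matrix

# Toy usage with made-up dimensions
imgs = torch.randn(4, 1024)                  # e.g. pooled image features
txts = torch.randn(3, 512)                   # text Transformer features
W_i, W_t = torch.randn(1024, 512), torch.randn(512, 512)
scale = torch.tensor(2.659)                  # log(1/0.07), CLIP's initial value
logits = joint_embedding_logits(imgs, txts, W_i, W_t, scale)   # shape (4, 3)
```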

3. Zero-shot transfer

Progress:

  1. Use the image encoder to get the image embedding, and use the text encoder to get embeddings for all class names.
  2. CLIP computes cosine similarities between the image and text embeddings, scales them by a temperature parameter, then applies softmax to get probabilities. This works like a softmax classifier where both the inputs and class embeddings are L2-normalized, there's no bias term, and temperature controls the sharpness of the output.

Concretely, this acts as a multinomial logistic regression classifier:

\[\text{logit}_{i} = f_\text{img} \cdot f_\text{text, i}\]

where:

  • \(f_\text{img}\) is the image embedding (the input features)
  • \(f_\text{text, i}\) is the embedding of the text for class \(i\) (acting as the classifier weights)
  • \(\text{logit}_{i}\) is their cosine similarity (both embeddings are L2-normalized).

Divide the logits by a temperature coefficient \(\tau\), then pass them through a softmax to get the classification probabilities.

\[P_{i} = \frac{e^{\text{logit}_i / \tau}}{\sum_{j}e^{\text{logit}_j / \tau}}\]
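
Putting the two steps and the formula together, a zero-shot classification sketch using the openai/CLIP reference package might look like the following (the model name, image path, and class list are placeholders; multiplying the similarities by the learned scale is equivalent to dividing by \(\tau\)):

```python
import torch
import clip                       # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Step 1: embed the image and one prompt per class name.
class_names = ["cat", "dog", "car"]                                     # placeholder classes
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)   # placeholder path
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Step 2: L2-normalize, take cosine similarities, apply the temperature, softmax.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * image_features @ text_features.t()
    probs = logits.softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```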

4. Evaluation of zero-shot CLIP

Objectives: The main goal is to evaluate the quality of the representation learned by CLIP during its large-scale pre-training.

More specifically:

  • Logistic Regression = Supervised baseline:
    • Logistic regression is trained on features extracted from a frozen, standard backbone (e.g. ResNet-50), using labeled training data (see the linear-probe sketch after this list).
    • It represents a simple and standard supervised learning baseline, commonly used to evaluate learned representations.
  • CLIP zero-shot = No training on the new datasets:
    • CLIP doesn't require fine-tuning or labels from the new dataset.
    • It simply matches image features with text embeddings using cosine similarity.
    • Predictions are made directly using the knowledge CLIP learned during pre-training.
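
For contrast with the zero-shot setup, a minimal linear-probe baseline could look like the sketch below (the feature files and hyperparameters are placeholders; features are assumed to be pre-extracted from a frozen backbone):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder files holding frozen-backbone features and their labels.
X_train, y_train = np.load("train_feats.npy"), np.load("train_labels.npy")
X_test, y_test = np.load("test_feats.npy"), np.load("test_labels.npy")

# Supervised baseline: logistic regression ("linear probe") on frozen features.
clf = LogisticRegression(max_iter=1000, C=1.0)
clf.fit(X_train, y_train)                     # needs labeled training data
print("linear-probe accuracy:", clf.score(X_test, y_test))
```

Zero-shot CLIP, by contrast, skips the `fit` step entirely: it only compares image embeddings against class-name text embeddings.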

5. Limitations

CLIP also struggles on some tasks, especially:

  • Fine-grained classification, like telling apart car models, flower species, or airplane types.
  • Abstract or systematic tasks, like counting objects in an image.
  • New or uncommon tasks that probably weren't in CLIP's training set.

IV. Flamingo model

V. Experimental strategy for Cervical Cancer Cytology