
The many different ways to use Large Language Models in image classification

Arthur Bastide
Published on
28/11/2024
This article explores various methods, derived from research papers, that aim to improve image classification by integrating Large Language Models (LLMs) in different ways. These methods are not designed to have LLMs perform the entire classification task, but to "boost" existing approaches. While the methods discussed here do not cover the full potential of LLMs for such use cases, they highlight three emerging "families" of classification.

You can find an implementation proposal for the last method discussed in this article on my GitHub repository.

Before diving into the core discussion, let's revisit some key concepts.

Image classification

Image classification in machine learning involves assigning a specific label (or "class," or "tag") to an image from a finite set of labels. This can be done within a single family (e.g., identifying a dog’s breed among all possible breeds) or across distinct families (e.g., determining whether an image shows a car, a house, a boat, or an animal).

The goal is to train a model to use image features (shapes, colors, textures, etc.) to predict labels. Since processing all the raw inputs of an image is impractical (e.g., a 224x224 color image has 224 x 224 x 3 inputs, i.e., over 150,000 features), convolutional neural networks (CNNs) extract and reduce image features without losing key information. While this method is highly effective for prediction, it makes the model's coefficients impossible to interpret, because the extracted features no longer correspond to human-readable information. This is one limitation LLMs can help address.
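As an illustration, here is a minimal sketch (assuming PyTorch and a recent torchvision) of a pretrained ResNet-18 used as a frozen feature extractor; the image path is a placeholder:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained CNN used as a frozen feature extractor
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()  # drop the classification head, keep the 512-d features
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

image = Image.open("dog.jpg").convert("RGB")  # hypothetical input image
with torch.no_grad():
    features = resnet(preprocess(image).unsqueeze(0))

print(features.shape)  # torch.Size([1, 512]) instead of 224 * 224 * 3 = 150,528 raw inputs
```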

Zero-shot image classification

Another key concept is zero-shot classification, which allows images to be classified into categories never explicitly seen during the model's training. This is possible thanks to models like CLIP (Contrastive Language-Image Pre-Training), a text-image embedding model able to vectorize images and text into the same embedding space. CLIP is trained on 400 million image-text pairs. For example, the embedding of a dog’s picture and that of the phrase "an image of a dog" will be very close in this space.

The zero-shot classification process is as follows:

  1. We embed each label of interest using CLIP, wrapping it in a prompt such as "a photo of {label}" (e.g., 10 labels yield 10 text embeddings).
  2. We embed the image to be classified using CLIP.
  3. We measure the cosine similarity between the image embedding and each label embedding.
  4. We select the label with the highest similarity (or the highest probability, if the similarities are passed through a softmax function).

This method can be represented mathematically as:
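$$\hat{c} \;=\; \underset{c \in C}{\arg\max}\;\; \cos\big(\phi_I(x),\, \phi_T(f(c))\big)$$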

Where x is the image to classify, C = {c1, c2, …, cn} is a predefined set of labels, ϕI and ϕT are the image and text encoders (e.g., CLIP's), and f(c) is the prompt "a photo of {c}."

This is the basic zero-shot classification method, often referred to as "Vanilla CLIP."
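As an illustration, here is a minimal sketch of Vanilla CLIP zero-shot classification using the Hugging Face transformers implementation of CLIP; the checkpoint, labels, prompt template, and image path are illustrative assumptions:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["dog", "cat", "car", "boat"]                   # C = {c1, ..., cn}
prompts = [f"a photo of a {label}" for label in labels]  # f(c)
image = Image.open("example.jpg")                        # x (hypothetical image)

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image and each prompt
probs = outputs.logits_per_image.softmax(dim=-1)
print(labels[probs.argmax().item()])                     # label with the highest probability
```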

“Enhanced zero-shot” image classification

A first, simple way LLMs can enhance zero-shot classification is through the label descriptions they can provide: instead of embedding "a photo of {label}," we embed the descriptions of the label generated by the LLM, refining its positioning in the embedding space.

This approach is represented by the equation:
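$$\hat{c} \;=\; \underset{c \in C}{\arg\max}\;\; \frac{1}{|D(c)|}\sum_{d \in D(c)} \cos\big(\phi_I(x),\, \phi_T(d)\big)$$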

Where D(c) is the set of descriptions of the label c.
It should be noted that here, an average of the cosine similarities over the class descriptions is computed. This technique, called "prompt ensembling" (which usually involves averaging embeddings instead), improves classification accuracy.
This averaging is particularly useful because the different prompting techniques (DCLIP, WaffleCLIP, CuPL) give varying results depending on the dataset; it therefore enables more robust label positioning in the embedding space.
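A minimal sketch of this description-based scoring, reusing the CLIP model and processor loaded in the previous sketch (the descriptions below are hypothetical LLM outputs):

```python
def classify_with_descriptions(image, descriptions):
    """descriptions: dict mapping each label to a list of LLM-generated descriptions."""
    image_inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        img_emb = model.get_image_features(**image_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    scores = {}
    for label, descs in descriptions.items():
        text_inputs = processor(text=descs, return_tensors="pt", padding=True)
        with torch.no_grad():
            txt_emb = model.get_text_features(**text_inputs)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        # average cosine similarity between the image and the label's descriptions
        scores[label] = (img_emb @ txt_emb.T).mean().item()
    return max(scores, key=scores.get)

descriptions = {
    "dog": ["a four-legged pet with fur and floppy or pointed ears",
            "an animal often seen on a leash, with a wagging tail"],
    "cat": ["a small feline with whiskers and slanted eyes",
            "a pet with retractable claws, often grooming itself"],
}
print(classify_with_descriptions(Image.open("example.jpg"), descriptions))
```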

Other methods go beyond optimizing the target label descriptions. For example, in their study “Enhancing Zero-Shot Image Classification with Multimodal Large Language Models,” Google Research teams focused on optimizing the embedding of the input image by combining three distinct embeddings:

  1. The embedding of the image itself.
  2. The embedding of the image's description generated by a multimodal LLM.
  3. The embedding of the label predicted by a multimodal LLM, which is provided with the full list of possible labels.

You will find below a visual representation of this approach:

The example below highlights two of the three inputs derived from the image to be classified: the description of the image from the LLM and the label predicted by the LLM ("pencil").

Even though the prediction provided as input is sometimes incorrect, including it significantly improves the model's performance across all tested datasets (e.g., on ImageNet, accuracy increased from 70.7 to 73.4).
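A rough sketch of the idea, reusing the CLIP model loaded earlier and assuming, as a simplification, that the three L2-normalized embeddings are simply averaged; the caption and guessed label stand in for the multimodal LLM's outputs:

```python
def normalize(v):
    return v / v.norm(dim=-1, keepdim=True)

def enriched_image_embedding(image, caption, guessed_label):
    """Combine the image embedding with the embeddings of an LLM-generated caption
    and of the label guessed by the LLM (simple average, as an assumption)."""
    with torch.no_grad():
        img = normalize(model.get_image_features(**processor(images=image, return_tensors="pt")))
        cap = normalize(model.get_text_features(**processor(text=[caption], return_tensors="pt", padding=True)))
        lab = normalize(model.get_text_features(**processor(text=[f"a photo of a {guessed_label}"], return_tensors="pt", padding=True)))
    return normalize((img + cap + lab) / 3)

# Hypothetical multimodal LLM outputs for the image to classify
caption = "a close-up of a thin wooden object with a graphite tip"
guessed_label = "pencil"
query = enriched_image_embedding(Image.open("example.jpg"), caption, guessed_label)
```

The resulting query embedding then replaces the plain image embedding in the zero-shot comparison against the label (or description) embeddings.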

We have just introduced the central role of LLMs in image classification: their ability to refine the positioning of both target labels and input images in a shared embedding space.

“Low-shot” image classification

The methods discussed earlier only work well if the textual descriptions of the labels (generated by the LLM) are relevant. However, for certain datasets, such as the well-known Iris Flower Dataset, the results can be quite poor. Two main reasons explain this:

  1. LLMs, being trained only on textual data, sometimes lack nuance in their understanding of visual aspects.
  2. When attempting to generate label descriptions via a single generic prompt designed for any classification task, the results can sometimes be very generic (e.g., “the {label} can have multiple colors”).

The WaffleCLIP method (Roth et al., 2023) highlights this issue: in many cases, replacing random words in label descriptions with vague and, above all, unrelated terms has little impact on accuracy.

The “Iterative Optimization with Visual Feedback” method introduces the concept of “low-shot” image classification by proposing an approach to optimize label descriptions. It incorporates two key real-life aspects: interaction with the environment and iterative optimization. Human recognition of new objects involves a dynamic process: we gradually update our knowledge based on the object’s environment, keeping only useful information and discarding irrelevant details.

The methodology, for which you can find the visual diagram above, consists of three main steps:

  1. Initialization: Initial label descriptions D0 are generated by an LLM (as described in earlier methods). The goal is to optimize these descriptions.
  2. Visual Feedback: The idea is to provide an LLM with visual knowledge from a model like CLIP:
    • A small set of images is selected for each label (e.g., a “training” set, although we will nuance this terminology).
    • A zero-shot prediction is made for each image using the initial descriptions D0​.
    • Metrics such as accuracy and a custom confusion matrix are computed (details of the equations are presented below):
      • Instead of incrementing by +1 only the predicted category, +1 is added to all labels whose cosine similarity exceeds λ times the cosine similarity of the ground-truth class description (Equation 3).
      • Once the matrix is built, the sum is calculated for each label (Equation 4), and the top-m is retained (Equation 5). These correspond to the top-m labels whose descriptions are least clear for CLIP.
      • Instead of using argmax or softmax, this matrix highlights labels for which the prediction is incorrect. The parameter λ represents the level of strictness (or leniency if λ<1, penalizing label descriptions with cosine similarities slightly below the ground-truth description’s similarity).
    • Visual feedback V(D) is then converted into textual insights using an LLM:
      • "For the label ..., with descriptions ..., the labels with the most confusion are ..., whose descriptions are ...”.
  3. Iterative Optimization: For each label, three sub-steps are performed:
    • Mutation: Using the visual feedback V(D), an LLM generates K new sets of descriptions (to ensure diversity for optimization). Irrelevant descriptions—those that are very close to ones from other most ambiguous labels (top-m)—are removed and replaced with new ones.
    • Crossover: All possible combinations of new descriptions across sets are generated, their visual feedback is evaluated, and the best description is retained for the next iteration. This ensures that the next iteration begins with the best descriptions from the previous step.
    • Memory Bank Update:
      • Descriptions are categorized as "unchanged," "deleted," or "added."
      • "Added" descriptions are stored as "positive," ensuring that the accuracy of D(i) > D(i−1) (the same is also done for "deleted" descriptions).
      • These memory banks are provided to the LLM in each iteration, along with visual feedback, to produce new description sets. This ensures that new descriptions are always generated and that deleted ones are not reused.
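Schematically, writing the confusion-matrix entry as Conf(c, c′), denoting by Xc the small set of images whose ground-truth label is c, and averaging the cosine similarities over each label's descriptions (the notation, the averaging, and the bar over the cosine denoting that average are assumptions made here for readability), the three equations can be summarized as:

$$\mathrm{Conf}(c, c') \;=\; \sum_{x \in X_c} \mathbb{1}\!\left[\, \overline{\cos}\big(\phi_I(x), \phi_T(d')\big) \;>\; \lambda \cdot \overline{\cos}\big(\phi_I(x), \phi_T(d)\big) \right] \qquad (3)$$

$$S(c') \;=\; \sum_{c \in C} \mathrm{Conf}(c, c') \qquad (4)$$

$$\text{top-}m \;=\; \underset{c' \in C}{\operatorname{arg\,top}\text{-}m}\; S(c') \qquad (5)$$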

Where d′ represents the descriptions of a label, and d represents the descriptions of the ground-truth label.
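A minimal sketch of this visual-feedback computation, assuming the image embeddings and the per-label average description embeddings have already been computed with CLIP and L2-normalized (a NumPy illustration, not the paper's code):

```python
import numpy as np

def visual_feedback(image_embs, image_labels, label_desc_embs, lam=0.9, m=3):
    """
    image_embs:      (n_images, d) L2-normalized CLIP image embeddings of the small labeled set
    image_labels:    list of ground-truth label indices, one per image
    label_desc_embs: (n_labels, d) L2-normalized average embedding of each label's descriptions
    """
    n_labels = label_desc_embs.shape[0]
    confusion = np.zeros((n_labels, n_labels))      # rows: ground truth, columns: confused labels

    sims = image_embs @ label_desc_embs.T           # cosine similarities, image x label
    for i, true_c in enumerate(image_labels):
        threshold = lam * sims[i, true_c]           # Equation 3: lambda times the ground-truth similarity
        for c_prime in range(n_labels):
            if c_prime != true_c and sims[i, c_prime] > threshold:
                confusion[true_c, c_prime] += 1

    scores = confusion.sum(axis=0)                  # Equation 4: total confusion per label
    top_m = np.argsort(scores)[::-1][:m]            # Equation 5: the m most ambiguous labels
    return confusion, top_m
```

The confusion matrix and the top-m ambiguous labels are then verbalized into the textual feedback passed to the LLM.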

Below is an example of results obtained on the Flowers102 dataset, compared with the CuPL method (one of the zero-shot methods with specific prompting techniques). The bar charts highlight the three best and three worst descriptions for the label “Prince of Wales Feathers.”

Why is it called “low-shot” image classification? This method diverges from zero-shot classification because it uses a small number of labeled images to optimize the target label descriptions. However, the ultimate goal remains to improve zero-shot predictions. Thus, this is not traditional machine-learning training: no model weights are fitted, the images only guide the refinement of the textual descriptions, so there is no real risk of overfitting. This is why the term “low-shot” is used.

“Enhanced standard” image classification

Could LLMs also improve traditional image classification models (those mentioned in the introduction)? An approach addressing this question is presented in the research paper “LLM-based Hierarchical Concept Decomposition for Interpretable Fine-Grained Image Classification.” The primary goal of this approach is to overcome a limitation mentioned earlier, the lack of interpretability of CNN models, enabling us to answer questions such as:

  • What visual elements best distinguish the labels in a class?
  • Which features are the most/least significant for each label?
  • Why did my model make this prediction?
  • Why did my model misclassify this label?

The method, schematized above, can be described as follows (using as an example the class "dog," with labels corresponding to different breeds):

  1. We determine all the visual characteristics of the category “dog” via successive prompts, building a visual tree of the category:
    • We identify "visual parts": These are the primary visual components of the category (e.g., head, eyes, coat, tail, legs). If possible, we ask the LLM to subdivide visual parts into more granular components (e.g., within the head, there is the mouth; within the mouth, there are the teeth, and so on). A granularity threshold (or maximum tree depth) is defined.
    • We identify "visual attributes": These characterize the visual parts (e.g., size, shape, color, texture). Here, the prompt provides examples, as each visual part has specific attributes (e.g., the attribute "opacity" does not characterize the visual part "nose").
    • We identify "attribute values": For each combination of visual part and visual attribute, we determine all possible values for the category. For instance, for the visual part "tail" and the visual attribute "length," possible values could be {"short," "medium," "long"} or {"<10 cm," "10-20 cm," ">20 cm"}. To identify these attribute values, the LLM considers all labels in the class (e.g., all dog breeds provided), and redundant values are removed to retain only unique ones (since some labels may have similar attribute values). These values form the leaves of the tree, and we name "visual clues" the branches, that’s to say all the combinations of visual parts, visual attributes, and attribute values.
    • Each “visual clue” is converted into natural language (descriptive phrases) and embedded using CLIP:

Using the LLM’s knowledge, we thus generate a set of descriptive phrases ("visual clues") for a given class. Below is an example of a portion of the tree (visualized using the pydot Python library) for the label "French Bulldog":

  2. Each image in the dataset is embedded into the same space using the CLIP model.
  3. For each image, a cosine similarity score is computed for each visual clue. As a result, there are as many similarity scores as there are leaves in the tree.
  4. These similarity scores are used as features for a classification model. The features are, therefore, interpretable (see the sketch after this list).
  5. One model is trained for each main "visual part," and the label most often predicted across the models is selected as the final prediction.
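To make steps 2 to 5 concrete, here is a condensed sketch (not the repository code) that turns a flat list of visual clues into interpretable features for a scikit-learn classifier, reusing the CLIP model and processor loaded earlier; the clues, the `images` and `breeds` variables, and the model choice are placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical "visual clues" (tree leaves) already phrased as descriptive sentences
visual_clues = [
    "the length of the tail is short",
    "the color of the coat is black",
    "the shape of the ears is pointed",
]

def clue_features(image):
    """Cosine similarity between the image and each visual clue: one interpretable feature per leaf."""
    with torch.no_grad():
        img = model.get_image_features(**processor(images=image, return_tensors="pt"))
        txt = model.get_text_features(**processor(text=visual_clues, return_tensors="pt", padding=True))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(0).numpy()

# `images` and `breeds` stand in for the (cropped) Stanford Dogs images and their labels
X = np.stack([clue_features(img) for img in images])
y = np.array(breeds)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
```

Each feature keeps a natural-language name (its visual clue), which is what makes the downstream importance analyses readable.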

Since this method could be relevant for some use cases at Fifty-five (particularly those related to optimizing ad creatives), and in order to illustrate its strengths compared to traditional image classification, we coded a version of this approach, available on my GitHub repository. Before sharing some output examples, here are a few preliminary notes:

  • We tested this on the well-known Stanford Dogs Dataset, which contains over 20,000 images of 120 dog breeds. A particularity of this dataset is that it includes .xml files specifying crop boundaries to focus on the key element of each image, which enhances model performance.
  • For the results below, the model was trained on only 10 dog breeds (1,715 images) to simplify the output analysis.
  • Unlike the original method, we did not split the model into sub-models dedicated to specific visual parts. Instead, our model contained a significant number of features (194), which we reduced to 110 through a grid search on the feature-importance threshold (calculated using the permutation feature importance technique, applicable to tree models).

The approach enables model interpretability at three levels:

  1. Class level:

We can analyze which visual clues are most (or least) effective at distinguishing dog breeds. Feature importance is calculated using the feature permutation technique applied to our Random Forest model:
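A sketch of this computation with scikit-learn's permutation importance, reusing clf, X, y, and visual_clues from the sketch above (the validation split and number of repeats are illustrative):

```python
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
clf.fit(X_train, y_train)

result = permutation_importance(clf, X_val, y_val, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking[:3]:
    # each feature is a readable visual clue, so the ranking is directly interpretable
    print(f"{visual_clues[idx]}: {result.importances_mean[idx]:.4f}")
```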

By leveraging our distinction between "visual parts" and "visual attributes," we can further investigate which of these are most effective at differentiating labels:

  2. Label level:

Similar analyses can be performed at the label level. For instance, below are the three most and least important features for each label (calculated using Gini importance this time, which is why even the least important features have positive values):

The confusion matrix highlights the dog breeds the model most frequently misclassified:

The most frequent confusion occurs between the labels "Lhasa" and "Scotch Terrier." By examining the top 30 features with the highest importance for each label, we find that 10 features are shared between the two labels, explaining the confusion:

  3. Prediction level:

To investigate further, we can examine a specific prediction where the model confused these two labels:

Although close, the probability of "Lhasa" (true label) for this image is lower than that of "Scotch Terrier" (predicted label). To better understand the model’s predictions, SHAP (SHapley Additive exPlanations) values provide deeper insights into a specific prediction’s probabilities, mainly for two reasons:

  • They estimate the intrinsic contribution of each feature by evaluating models built on all possible feature combinations. With coefficient weights for linear models or Gini importance for tree models, we cannot rule out that a feature's apparent importance is partly driven by its correlation with the other independent variables in the model.
  • They assign weights that are specific to a given prediction. For example, if a feature's weight is high for the overall model but its value is uninformative for a specific input, its SHAP value for that prediction will be minimal.
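A minimal sketch of such an analysis with the shap library, reusing the Random Forest and features from the earlier sketches (TreeExplainer is the explainer suited to tree-based models; the indexing convention depends on the shap version, as noted in the comments):

```python
import shap

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_val)

# Depending on the shap version, shap_values is a list with one (n_samples, n_features)
# array per class, or a single (n_samples, n_features, n_classes) array.
class_values = shap_values[0] if isinstance(shap_values, list) else shap_values[:, :, 0]

# Contribution of each visual clue to the probability of class 0 for the first validation image
for clue, value in zip(visual_clues, class_values[0]):
    print(f"{clue}: {value:+.4f}")
```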

In our example, since Scotch Terriers are mostly black, especially in our dataset, we understand the importance of visual clues like “the color of the belly is black” or “the fur color of the back is black.” The black coat of the Lhasa in this image therefore very likely explains the model’s error.

These examples showcase the potential of LLMs to enhance image classification. The improvements are not directly related to performance (CNN models typically achieve higher accuracy) but rather to model interpretability.

All these methods share a similarity: they provide descriptive information about a class and its labels, either through an iterative process involving a few images and a multimodal embedding model ("low-shot" method) or via a hierarchical tree decomposition ("enhanced standard" method). Thus, the dependency on LLM knowledge is a limitation worth noting.

Although we’ve categorized these methods into three “families” of classification, they are not mutually exclusive. For instance, we could think about using descriptions from the "low-shot" method as features for the "enhanced standard" model (even if we must ensure that we have enough images for both specific training processes).
