You can find an implementation proposal for the last method discussed in this article here, on my GitHub repository.
Before diving into the core discussion, let's revisit some key concepts.
Image classification in machine learning involves assigning a specific label (or "class," or "tag") to an image from a finite set of labels. This can be done within a single family (e.g., identifying a dog’s breed among all possible breeds) or in order to distinguish distinct families (e.g., determining whether an image shows a car, a house, a boat, or an animal).
The goal is to train a model that uses an image's features (shapes, colors, textures, etc.) to predict its label. Since processing every pixel of an image directly is impractical (a 224 x 224 color image has 224 x 224 x 3 inputs, i.e., over 150,000 features), convolutional neural networks (CNNs) extract and compress image features without losing key information. While this approach is highly effective for prediction, it makes the model's coefficients impossible to interpret, because the extracted features no longer correspond to human-readable information. This is one limitation LLMs can help address.
Another key concept is zero-shot classification, which allows images to be classified into categories never explicitly seen during the model's training. This is possible thanks to models like CLIP (Contrastive Language-Image Pre-Training), a text-image embedding model able to vectorize images and text into the same embedding space. CLIP is trained on 400 million labeled images. For example, the embedding of a dog’s picture and the phrase "an image of a dog" will be very close in this space.
The zero-shot classification process is as follows: the image to classify is embedded with the image encoder; each candidate label is inserted into a prompt such as "a photo of {label}" and embedded with the text encoder; the cosine similarity between the image embedding and each text embedding is computed; and the label whose prompt is closest to the image is assigned.
This method can be represented mathematically as:
ĉ = argmax_{c ∈ C} cos(ϕI(x), ϕT(f(c)))
Where x is the image to classify, C={c1,c2,…,cn} is a predefined set of labels, ϕI / ϕT are the image and text encoders (e.g., CLIP), and f(c) is the prompt "a photo of {c}."
This is the basic zero-shot classification method, often referred to as "Vanilla CLIP."
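To make this concrete, below is a minimal sketch of Vanilla CLIP zero-shot classification using the Hugging Face transformers library; the model checkpoint, image path, and label set are illustrative assumptions, not taken from the article's experiments:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint, labels, and image path.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["dog", "car", "house", "boat"]
prompts = [f"a photo of a {c}" for c in labels]

image = Image.open("example.jpg")
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image
# embedding and each prompt embedding; the highest one gives the prediction.
probs = outputs.logits_per_image.softmax(dim=-1)
print(labels[probs.argmax().item()])
```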
Firstly, LLMs can easily enhance zero-shot classification through the label descriptions they provide. Instead of embedding "A photo of {label}," we embed the label's description generated by the LLM, which refines the label's positioning in the embedding space.
This approach is represented by the equation:
ĉ = argmax_{c ∈ C} (1 / |D(c)|) Σ_{d ∈ D(c)} cos(ϕI(x), ϕT(d))
Where D(c) is the set of descriptions generated for the label c.
It should be noted that here, an average of the cosine similarities of the class descriptions is computed. This technique, called "prompt ensembling" (which usually involves averaging embeddings), improves classification accuracy.
This method is particularly effective because the different prompting techniques (DCLIP, WaffleCLIP, CuPL) give varying results depending on the dataset; averaging over several descriptions therefore yields a more robust positioning of the labels in the embedding space.
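As an illustration, here is a minimal sketch of this description-based ensembling with CLIP; the labels and descriptions below are hypothetical and would, in practice, be generated by an LLM:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical LLM-generated descriptions for each label.
descriptions = {
    "golden retriever": [
        "a large dog with a dense golden coat",
        "a friendly dog with floppy ears and a feathered tail",
    ],
    "siberian husky": [
        "a dog with a thick grey and white coat and blue eyes",
        "a wolf-like dog with erect triangular ears",
    ],
}

image = Image.open("example.jpg")
image_inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_emb = model.get_image_features(**image_inputs)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

scores = {}
for label, descs in descriptions.items():
    text_inputs = processor(text=descs, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    # Average of the cosine similarities between the image and each description.
    scores[label] = (image_emb @ text_emb.T).mean().item()

print(max(scores, key=scores.get))
```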
Other methods go beyond optimizing the target label descriptions. For example, in their study “Enhancing Zero-Shot Image Classification with Multimodal Large Language Models,” Google Research teams focused on optimizing the embedding of the input image by combining three distinct embeddings: the embedding of the image itself, the embedding of a description of the image generated by the LLM, and the embedding of the label predicted by the LLM.
You will find below a visual representation of this approach:
The example below highlights two of the three inputs derived from the image to be classified: the description of the image from the LLM and the label predicted by the LLM ("pencil").
Even when the prediction provided as input is incorrect, including it significantly improves the model's performance across all tested datasets (e.g., on ImageNet, accuracy increased from 70.7 to 73.4).
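A minimal sketch of how the three embeddings could be combined is shown below; a simple average of L2-normalized vectors is assumed here, and the exact weighting used in the Google Research paper may differ:

```python
import torch

def combined_embedding(image_emb, caption_emb, predicted_label_emb):
    """Combine the image embedding with the text embeddings of the
    LLM-generated caption and the LLM-predicted label.
    Assumption: a simple average of L2-normalized vectors."""
    embs = torch.stack([image_emb, caption_emb, predicted_label_emb])
    embs = embs / embs.norm(dim=-1, keepdim=True)
    combined = embs.mean(dim=0)
    return combined / combined.norm()

# The combined embedding then replaces the raw image embedding in the
# zero-shot classification step shown earlier.
```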
We have just introduced the central role of LLMs in image classification: their ability to refine the positioning of both target labels and input images in a shared embedding space.
The methods discussed earlier only work well if the textual descriptions of the labels (generated by the LLM) are relevant. However, for certain datasets, such as the well-known Iris Flower Dataset, the results can be quite poor. Two main reasons explain this:
The WaffleCLIP method (Roth et al., 2023) highlights this issue: in many cases, replacing random words in label descriptions with vague and, above all, unrelated terms has little impact on accuracy.
The “Iterative Optimization with Visual Feedback” method introduces the concept of “low-shot” image classification by proposing an approach to optimize label descriptions. It incorporates two key real-life aspects: interaction with the environment and iterative optimization. Human recognition of new objects involves a dynamic process: we gradually update our knowledge based on the object’s environment, keeping only useful information and discarding irrelevant details.
The methodology, for which you can find the visual diagram above, consists of three main steps:
Where d′ represents the descriptions of a label, and d represents the descriptions of the ground-truth label.
Below is an example of results obtained on the Flowers102 dataset, compared with the CuPL method (one of the zero-shot methods with specific prompting techniques). The bar charts highlight the three best and three worst descriptions for the label “Prince of Wales Feathers.”
Why is it called “low-shot” image classification? This method diverges from zero-shot classification because it uses a small number of labeled images to optimize the target label descriptions. However, the ultimate goal remains to improve zero-shot predictions: no model weights are trained on these images, only the label descriptions are refined, so this is not traditional Machine Learning training and there is little risk of overfitting. This is why the term “low-shot” is used.
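For intuition, here is a highly simplified sketch of such an iterative optimization loop. The scoring function, the number of iterations, and the placeholder callables embed_text (text encoder) and propose_new (LLM call) are assumptions for illustration and do not reproduce the paper's exact objective:

```python
import torch

def score_description(desc_emb, image_embs_by_label, label):
    """Discriminative score of a candidate description for `label`:
    similarity to the few labeled images of that label minus similarity
    to the images of the other labels (a simplifying assumption; the
    paper defines its own visual-feedback objective)."""
    pos = (image_embs_by_label[label] @ desc_emb).mean()
    others = torch.cat([e for l, e in image_embs_by_label.items() if l != label])
    return (pos - (others @ desc_emb).mean()).item()

def optimize_descriptions(label, candidates, image_embs_by_label,
                          embed_text, propose_new, n_iters=3, keep_top=5):
    """Keep the best-scoring descriptions at each step and ask the LLM
    for new candidates based on them. `embed_text` and `propose_new`
    are placeholders."""
    best = candidates
    for _ in range(n_iters):
        ranked = sorted(
            candidates,
            key=lambda d: score_description(embed_text(d), image_embs_by_label, label),
            reverse=True,
        )
        best = ranked[:keep_top]
        candidates = best + propose_new(label, best)
    return best
```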
Could LLMs also improve traditional image classification models (those mentioned in the introduction)? An approach addressing this question is presented in the research paper “LLM-based Hierarchical Concept Decomposition for Interpretable Fine-Grained Image Classification.” The primary goal of this approach is to overcome a limitation mentioned earlier, namely the lack of interpretability of CNN models, enabling us to answer questions such as:
The method, schematized above, can be described as follows (using as an example the class "dog," with labels corresponding to different breeds):
Using the LLM’s knowledge, we thus generate a set of descriptive phrases ("visual clues") for a given class. Below is an example of a portion of the tree (visualized using the pydot Python library) for the label "French Bulldog":
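As an illustration of how such a tree can be built and rendered with pydot, here is a minimal sketch; the decomposition content below is made up for the example and is not taken from the paper:

```python
import pydot

# Hypothetical LLM decomposition of the label "French Bulldog" into
# visual parts and visual attributes (illustrative content only).
decomposition = {
    "ears": ["large bat-like ears", "erect ears"],
    "coat": ["short smooth coat", "brindle or fawn color"],
    "muzzle": ["flat wrinkled muzzle", "black nose"],
}

graph = pydot.Dot(graph_type="graph", rankdir="LR")
graph.add_node(pydot.Node("French Bulldog", shape="box"))

for part, attributes in decomposition.items():
    graph.add_node(pydot.Node(part, shape="ellipse"))
    graph.add_edge(pydot.Edge("French Bulldog", part))
    for attribute in attributes:
        graph.add_node(pydot.Node(attribute, shape="plaintext"))
        graph.add_edge(pydot.Edge(part, attribute))

graph.write_png("french_bulldog_tree.png")  # requires Graphviz to be installed
```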
Since this method could be relevant for some use cases at Fifty-five (particularly those related to optimizing ad creatives), and in order to illustrate its strengths compared to traditional image classification, we implemented this approach; the code is available on my GitHub repository. Before sharing some output examples, here are a few preliminary notes:
The approach enables model interpretability at three levels: globally, by identifying which visual clues best separate the labels; at the label level, by identifying the features most characteristic of each breed; and at the level of an individual prediction, by explaining why the model favored one label over another.
We can analyze which visual clues are most (or least) effective at distinguishing dog breeds. Feature importance is calculated using the feature permutation technique applied to our Random Forest model:
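This kind of ranking can be obtained with scikit-learn's permutation_importance; the sketch below uses placeholder data in place of the real visual-clue similarity features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data: in the real pipeline, X holds one column per visual clue
# (e.g., image-text similarity scores) and y holds the breed labels.
rng = np.random.default_rng(0)
feature_names = [f"clue_{i}" for i in range(20)]
X = rng.normal(size=(500, 20))
y = rng.integers(0, 5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Permutation importance: how much accuracy drops when a feature is shuffled.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
ranking = sorted(zip(feature_names, result.importances_mean),
                 key=lambda t: t[1], reverse=True)
for name, score in ranking[:5]:
    print(f"{name}: {score:.4f}")
```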
By leveraging our distinction between "visual parts" and "visual attributes," we can further investigate which of these are most effective at differentiating labels:
Similar analyses can be performed at the label level. For instance, below are the three most and least important features for each label (calculated this time using Gini importance, which is always non-negative; this is why even the least important features have positive values):
The confusion matrix highlights the dog breeds the model most frequently misclassified:
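For reference, such a matrix can be produced directly from the trained classifier; the sketch below continues the placeholder example introduced earlier:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# rf, X_test and y_test come from the placeholder example above.
y_pred = rf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=rf.classes_).plot(xticks_rotation=45)
plt.tight_layout()
plt.show()
```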
The most frequent confusion occurs between the labels "Lhasa" and "Scotch Terrier." By examining the 30 features with the highest importance for each label, we find that 10 of them are shared between the two labels, which explains the confusion:
To investigate further, we can examine a specific prediction where the model confused these two labels:
Although close, the probability of "Lhasa" (true label) for this image is lower than that of "Scotch Terrier" (predicted label). To better understand the model’s predictions, SHAP (SHapley Additive exPlanations) values provide deeper insights into a specific prediction’s probabilities, mainly for two reasons:
In our example, since Scotch Terriers are mostly black, especially in our dataset, we understand the importance of visual clues like “the color of the belly is black” or “the fur color of the back is black.” Thus, the black coat of the Lhasa dog in this image very likely explains the model’s error.
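For readers who want to reproduce this kind of per-prediction analysis, here is a minimal sketch using the shap library on the placeholder Random Forest from the earlier examples:

```python
import numpy as np
import shap

# rf, X_test and feature_names come from the placeholder example above.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

# Depending on the shap version, multiclass outputs are either a list with one
# (n_samples, n_features) array per class or a single 3-D array; normalize to
# shape (n_classes, n_samples, n_features).
if isinstance(shap_values, list):
    shap_values = np.stack(shap_values)
else:
    shap_values = np.moveaxis(shap_values, -1, 0)

i = 0  # index of the test image to explain
pred = rf.predict(X_test[i:i + 1])[0]
class_idx = list(rf.classes_).index(pred)

# Features pushing the prediction toward (positive) or away from (negative)
# the predicted class, sorted by absolute contribution.
contributions = dict(zip(feature_names, shap_values[class_idx, i]))
top = sorted(contributions.items(), key=lambda t: abs(t[1]), reverse=True)[:5]
for name, value in top:
    print(f"{name}: {value:+.4f}")
```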
These examples showcase the potential of LLMs to enhance image classification. The improvements are not directly related to performance (CNN models typically achieve higher accuracy) but rather to model interpretability.
All these methods share a similarity: they provide descriptive information about a class and its labels, either through an iterative process involving a few images and a multimodal embedding model ("low-shot" method) or via a hierarchical tree decomposition ("enhanced standard" method). Thus, the dependency on LLM knowledge is a limitation worth noting.
Although we’ve categorized these methods into three “families” of classification, they are not mutually exclusive. For instance, we could think about using descriptions from the "low-shot" method as features for the "enhanced standard" model (even if we must ensure that we have enough images for both specific training processes).