As names go, “zero-shot” is unintuitive even by machine learning standards. That makes it no less fascinating, however — especially when we consider that, compared to simple classifiers, zero-shot learning more closely approximates human understanding.
Traditionally, classification labels were treated merely as identifiers: ‘Category A’ and ‘Category B’. During the training phase, classifiers were given a certain number of ‘A’ samples and a comparable number of ‘B’ samples. For example, in Duda, Hart, and Stork’s classic text Pattern Classification, the authors address the problem of classifying fish as ‘salmon’ or ‘sea bass’ by extracting features such as length and color from a labeled training set and constructing a decision boundary in feature space. After analyzing some training samples, a classifier could decide (simplistically) that all fish that are gray in color and longer than 20 cm should be classified as salmon. This formulation implicitly assumes that the classifier knows nothing about salmon and sea bass beyond the labeled samples used for training. In the real world, this is rarely true: a person may know that zebras look like striped horses without ever having seen a zebra (or a picture of one). The zero-shot technique takes such auxiliary knowledge into account, and word definitions and category descriptions are freely available on the Internet.
Zero-shot classification, then, relies fundamentally on “understanding the labels” — using semantic information contained in the label names to expand the universe of classifiable data without requiring training data for each class. The natural conclusion of this approach is obviously to use no training data at all, and to classify purely based on the meaning of the labels — indeed, zero-shot learning is also referred to as dataless classification.
So, while all classifiers are expected to generalize sufficiently to assign new samples to classes they have already observed during training, zero-shot classifiers can assign new samples to unobserved classes (for which no samples have been encountered in training).
In practice, zero-shot learning is most often associated with image classification. How is dataless image classification accomplished? Early zero-shot approaches, prior to 2010, relied on extracting attributes from images (features such as shape and color, much as Duda’s text suggested) and matching them against the attributes that describe each class. In other words, the attributes of an input image were extracted, and the image was assigned to the class with the most similar attributes. In 2013, the Deep Visual-Semantic Embedding model (DeViSE) was among the first large-scale efforts to explicitly map images into a semantic embedding space, effectively unifying text and image data. This is the basis of the modern approach to zero-shot learning: semantically capable models encode the class labels into the same vector space as the input data. Under such models, the vector representation of an image of a salmon lies close to the vector representation of the ‘salmon’ text label. This reduces the classification problem to a simple similarity search.
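To make the similarity search concrete, here is a minimal sketch in Python. It assumes some model has already placed the image and the label names into a shared vector space. The three-dimensional vectors below are made up by hand for illustration, and the function names are hypothetical rather than taken from any library; real embeddings have hundreds of dimensions and come from a trained encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_vec, label_vecs):
    """Return the label whose embedding is most similar to the image
    embedding, together with the similarity score for every label."""
    scores = {
        label: cosine_similarity(image_vec, vec)
        for label, vec in label_vecs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hand-made toy embeddings (assumption: a real system would obtain these
# from a text encoder applied to the label names).
label_vecs = {
    "salmon": [0.9, 0.1, 0.2],
    "sea bass": [0.2, 0.9, 0.1],
    "zebra": [0.1, 0.2, 0.9],
}

# Hand-made toy embedding of an input image (assumption: a real system
# would obtain this from an image encoder).
image_vec = [0.8, 0.2, 0.3]

predicted, scores = zero_shot_classify(image_vec, label_vecs)
print(predicted)
```

Adding a new class in this setup only requires adding one more entry to `label_vecs`; no image of that class is needed and nothing is retrained.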
In 2016, Facebook AI Research made further strides towards bridging vision and language using the notion of visual n-grams. Today, OpenAI’s GPT-2 and GPT-3 language models demonstrate strong zero-shot performance on text tasks, while its landmark CLIP model represents the state of the art in zero-shot image classification.
Zero-shot learning has a substantial impact on practical classification tasks. Models have traditionally been trained on manually annotated datasets that are expensive to construct: the ImageNet dataset required over 25,000 workers to annotate 14 million images for 22,000 object categories. A traditional model trained on the widely used 1,000-category ImageNet subset may accurately assign unseen images to one of those 1,000 categories, but cannot be extended to include a new category without additional labeled data and fine-tuning. In contrast, a modern zero-shot model such as CLIP can classify images on the fly with user-assigned labels, and on benchmarks such as ImageNet its reported zero-shot accuracy is comparable to that of fully supervised models such as the original ResNet-50.
At Quilt.AI, we use zero-shot learning and other ML techniques to uncover cultural meaning in Internet data. Reach out to us for more information!