When you search for a term on the Internet, say covid, you expect the results to include a number of pages containing that term. But you may also get pages that talk about the coronavirus or the pandemic without explicitly mentioning covid at all. Modern search engines do not merely look for keyword matches — they apply similarity search to ensure that you get relevant results.
The word2vec model revolutionized the field of natural language processing (NLP) in 2013 by “learning” how to represent words as vectors in a way that preserves meaning. The word2vec vector representations, or embeddings, of semantically similar words such as ‘dog’ and ‘canine’ are closer to each other (under a metric such as cosine distance) than the embeddings of semantically unrelated words. Remarkably, word2vec embeddings also support vector arithmetic: the result of vector(‘King’) - vector(‘Man’) + vector(‘Woman’) is close to vector(‘Queen’).
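To make the arithmetic concrete, here is a minimal Python sketch. The three-dimensional vectors in `toy_embeddings` are hand-picked for illustration and are not real word2vec output (real embeddings are learned and typically have hundreds of dimensions), and the helper names `cosine_similarity` and `analogy` are our own. Cosine similarity is simply one minus the cosine distance mentioned above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-d "embeddings", chosen by hand for illustration only.
# Assumed axes: royalty, maleness, femaleness.
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def analogy(a, b, c, embeddings):
    """Return the word closest to vector(a) - vector(b) + vector(c).

    The three input words are excluded from the candidates.
    """
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))


print(analogy("king", "man", "woman", toy_embeddings))  # prints "queen" for these toy vectors
```

With real embeddings the nearest neighbor is found the same way, only over a vocabulary of many thousands of words rather than five.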
The release of word2vec heralded a series of rapid innovations in the field of NLP. Barely a year later, in 2014, the GloVe model improved upon word2vec’s embeddings. GloVe is trained on a co-occurrence matrix in which each row represents a word, each column represents a “context” (i.e. a neighboring word), and each value records how often the given word co-occurs with that context in the training corpus. Training fits a vector to each word so that the dot products of word and context vectors approximate the logarithm of these co-occurrence counts; the fitted vectors serve as the embeddings. In 2017, Facebook’s fastText showed that embeddings could be improved further using sub-words, allowing the model to “guess” the meanings of previously unseen words based purely on known word parts, much like a human learner.
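As a rough illustration of the co-occurrence matrix, the sketch below counts how often words appear near each other in a tiny, made-up corpus. It covers only the counting step, not GloVe’s training; the function name `cooccurrence_counts`, the two-word window and the corpus itself are assumptions made for this example.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each word appears within `window` tokens of another word.

    Returns a nested mapping: counts[word][context_word] -> frequency.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own context
                    counts[word][tokens[j]] += 1
    return counts


# Hypothetical two-sentence corpus, for illustration only.
toy_corpus = [
    "the dog chased the cat",
    "the canine chased the cat",
]

counts = cooccurrence_counts(toy_corpus, window=2)
print(dict(counts["dog"]))
print(dict(counts["canine"]))
```

In this toy corpus ‘dog’ and ‘canine’ end up with identical rows because they appear in identical contexts, which is the intuition behind similar words receiving similar embeddings.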
The year 2018 saw the release of ELMo and BERT, neural network architectures that produce context-aware embeddings: under these models, the word ‘running’ receives different embeddings depending on whether it refers to running a marathon or running a company. While ELMo (Embeddings from Language Models) achieves context-awareness using LSTMs, BERT (Bidirectional Encoder Representations from Transformers) uses the ‘self-attention’ mechanism of Google’s then-novel Transformer architecture. Google has reported applying BERT to nearly all English-language search queries, which helps explain why Googling returns such ‘good’ results.
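The sketch below shows, in heavily simplified form, how self-attention can give the same word different vectors in different sentences. It is not BERT: it omits the learned query, key and value projections, multiple heads, positional information and stacked layers, and simply lets each word attend to the raw input vectors. The two-dimensional vectors in `static` are hypothetical values chosen for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Simplifying assumption: queries, keys and values are all the raw inputs.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all input vectors.
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ])
    return outputs


# Hypothetical context-free vectors, for illustration only.
static = {
    "running":  [1.0, 1.0],
    "a":        [0.1, 0.1],
    "marathon": [2.0, 0.0],
    "company":  [0.0, 2.0],
}

sentence_1 = ["running", "a", "marathon"]
sentence_2 = ["running", "a", "company"]

# Contextual vector for "running" (position 0) in each sentence.
ctx_1 = self_attention([static[w] for w in sentence_1])[0]
ctx_2 = self_attention([static[w] for w in sentence_2])[0]
print(ctx_1)
print(ctx_2)
```

Although ‘running’ starts from the same vector in both sentences, its output is pulled toward ‘marathon’ in one and toward ‘company’ in the other.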
In 2021, OpenAI made its CLIP model public. Building on prior work on semantically-capable models, CLIP learned to represent text and images in the same vector space; at the time of writing, it represents the state of the art in extending the notion of semantic embeddings to computer vision. CLIP assigns images their own embeddings, so that vector(image of a female monarch) is close to vector(‘Queen’). This extraordinary advance, in which the embedding of an image is close to the embedding of text that describes the image, allows un-annotated image datasets to be searched reliably using text keywords, or labeled automatically.
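Once text and images share a vector space, searching images by keyword reduces to ranking image embeddings by their similarity to a text embedding. The sketch below assumes the embeddings have already been computed; the file names and three-dimensional vectors are hypothetical stand-ins (real CLIP embeddings have hundreds of dimensions), and `search_images` is our own helper, not part of CLIP.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search_images(text_embedding, image_embeddings, top_k=2):
    """Return the names of the `top_k` images most similar to the text query."""
    ranked = sorted(
        image_embeddings,
        key=lambda name: cosine_similarity(text_embedding, image_embeddings[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical precomputed embeddings, for illustration only.
text_embedding = [0.9, 0.1, 0.8]  # stand-in for the embedding of the text "Queen"
image_embeddings = {
    "queen_portrait.jpg": [0.8, 0.2, 0.9],
    "dog_park.jpg":       [0.1, 0.9, 0.2],
    "throne_room.jpg":    [0.7, 0.1, 0.3],
}

print(search_images(text_embedding, image_embeddings, top_k=2))
```

Automatic labeling is the same computation in reverse: embed each candidate label as text and pick the label whose embedding is closest to the image’s embedding.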
Embeddings themselves are not entirely free of controversy. They are learned by the model, and since models are typically trained on large swathes of public data, embeddings tend to reflect the biases present in literature, imagery and public discourse. Nevertheless, semantic embeddings have effectively transformed problems that were almost philosophical (e.g. Which image in a dataset is most similar to a given image? Out of a set of text labels, which is the best label for a given sentence?) into a single, tractable mathematical problem: out of a vector dataset, which vector is closest to a query vector?
In the second part of this post, we’ll talk about some ways to answer this question.
At Quilt.AI, we use machine learning models to analyze semantic relationships between text, images and ideas. Reach out to us for more information!