In the previous post, we discussed how to translate a variety of questions on semantic similarity into a single, tractable mathematical problem: which vector is closest to a query vector?
One way to address the nearest-vector question is, of course, to compare the query vector with each data vector. For a 2-dimensional Euclidean space this would translate to applying the distance formula for each resulting pair of vectors; for most modern embeddings, the cosine similarity measure is preferred as a more accurate representation of semantic proximity (in fact, cosine similarity is equivalent to the Euclidean distance between normalized vectors). Regardless of the similarity measure used, this method is slow and doesn’t scale well at all — we need to iterate over each point in the dataset, calculate the distance/similarity between the query point and this data point, and then identify the closest data point. Effectively, this approach scales linearly with the size of the dataset. Can we do better?
The key to solving this problem is realizing that we can build a structure ahead of time, well before the query comes in, that sits with the database and allows us to answer the query quickly. Such a structure is called an index. Indexes are by no means specific to semantic search — they have been used for decades to speed up database queries by orders of magnitude. If you have a simple database table that maps your friends’ names to their addresses and phone numbers, you may build an index on the name column to allow queries (e.g. “What is Fred’s address?”) to be answered quickly. An index is a sorted data structure, and in the same way that sorted lists can be scanned in logarithmic time via binary search, database indexes allow the database engine to perform a log-time search rather than a linear scan through all records. Most database indexes are B-trees — hierarchical, rapidly-navigable data structures that allow the database engine to quickly locate the row corresponding to a query.
While traditional B-tree indexes are well suited to exact-match queries (“find the address corresponding to the name Fred”), they do not address similarity or closeness. But semantic similarity search, by definition, requires similar matches rather than exact ones. So, how can we build an index on a billion-vector dataset to efficiently answer the ‘find the closest vector’ query? Or, more generally, how can we find the k-closest vectors efficiently?
In the final part of this post, we’ll talk about some approaches to building the right index for semantic similarity search.
At Quilt.AI, we use machine learning models to analyze semantic relationships between text, images, and ideas. Reach out to us at [email protected] for more information.