More than 25 years ago a search technique known as “Latent Semantic Indexing” was born. It was very influential at the time, and similar techniques remain of real practical importance today. Subsequent work uses the term “Topic Modeling” instead of “Latent Semantics”.

Ever since the birth of information retrieval, the so-called “vocabulary mismatch” problem has plagued search engine implementations. In simple terms, it means that even a small mismatch between the wording of the query and the wording of the document will prevent the user from finding what they are looking for.

Perhaps the query will contain the word “a”, but the document will refer to the same concept with a synonym like “b”. Searchers and document authors may refer to the same concept with different words, and when they do, a keyword search engine will fail to make the match.

Many attempts have been made to fix this issue. One method expands each query word with a list of synonyms. In another, the synonyms are replaced by related words derived from statistical co-occurrences.
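To make the first method concrete, here is a minimal sketch of synonym-based query expansion. The synonym table and the expand_query helper are invented for illustration; a real system would use a curated thesaurus or co-occurrence statistics.

```python
# Minimal sketch of synonym-based query expansion (illustrative data only).
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "film": ["movie"],
}

def expand_query(query):
    """Append the listed synonyms after each query term."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("car film"))  # ['car', 'automobile', 'vehicle', 'film', 'movie']
```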

The issue can be more complicated still, since sometimes two consecutive words may need to be replaced by an equivalent phrase, or a whole phrase by a single word.

Some types of errors in a search engine:

  • the query or the document contains typos
  • the query and the document refer to the same concept with different synonyms
  • a word in the query has a different meaning than the same word in a document

The ingenuity of LSI was that it could relate not only queries to documents (or documents to documents) but also words to words. In this way it worked similarly to a document or query expansion mechanism. It achieves this by treating both words and documents as vectors in the same space, so they can be compared directly. In keyword search models, a document is the union of its words. In LSI, a document is the sum of the vectors representing its words.
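As a minimal sketch of this idea (the word vectors below are made up; real LSI vectors come from a singular value decomposition of the term-document matrix):

```python
import numpy as np

# Toy word vectors, invented for illustration only.
word_vecs = {
    "new":     np.array([0.9, 0.1, 0.1]),
    "york":    np.array([0.8, 0.2, 0.1]),
    "germany": np.array([0.1, 0.9, 0.2]),
}

def doc_vector(words):
    """A document (or query) vector is the sum of its word vectors."""
    return np.sum([word_vecs[w] for w in words], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = doc_vector(["new"])
doc = doc_vector(["new", "york"])
print(cosine(query, doc))  # queries, documents, and words all live in the same space
```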

LSI has some weaknesses that people should be aware of.

  • First, it does not work very well for rare words, and overly common words distort the representation of all words.

  • Second, expansion based on statistical co-occurrence is not always desirable. For example, “Germany” and “Austria” are highly similar statistically, but expanding “Germany” to “Austria” is not always what the searcher wants.

  • Third, LSI produces dense vectors, which cannot be searched efficiently with a standard inverted index.

Does it make sense then to use LSI or a similar technique?

Yes, it does, but there are a number of conditions that have to be met.

Traditional use of LSI

The traditional use of LSI is to map each sparse document (a bag of words) to a dense LSI vector. A query is represented as a dense vector in the same way. Search then amounts to computing the cosine similarity between the query and each document. (I’ll address how to make this search efficient later on.)
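A minimal sketch of that pipeline, using scikit-learn’s TruncatedSVD on a tiny invented corpus (in practice the corpus is large and the number of components is a few hundred, not two):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "new york is a large city",
    "berlin is the capital of germany",
    "vienna is the capital of austria",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)            # sparse bag-of-words (tf-idf) matrix

svd = TruncatedSVD(n_components=2)            # toy dimensionality
doc_vecs = svd.fit_transform(X)               # dense LSI document vectors

query_vec = svd.transform(vectorizer.transform(["capital of germany"]))
scores = cosine_similarity(query_vec, doc_vecs)[0]
print(scores.argsort()[::-1])                 # document indices ranked by similarity
```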

LSA query and document vectors

As I mentioned, in LSA each word is represented as a vector of a fixed dimension, for example 500. What are these dimensions? One way to think of them is as the contexts within which the word appears.

If we wanted to represent words within some fixed set of dimensions, we could do the following: use the dimensions to express properties (features) of the words. For example, one dimension could be the frequency of a word in the language. Another dimension could denote whether the word is a noun. We could also use some dimensions to represent whether the word is a country, a color, or a personal name.

This is just an example of what we can encode within a vector of fixed dimensions; a small sketch follows below. We can also see that representing each word as a vector of properties (or a set of properties) can yield a potential improvement to a search engine.
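A purely illustrative sketch of such hand-crafted property vectors (all values invented): two country names come out closer to each other than to an unrelated word.

```python
import numpy as np

# Hypothetical hand-crafted features, one vector per word:
# [frequency score, is a noun, is a country, is a color]
features = {
    "germany": np.array([0.6, 1.0, 1.0, 0.0]),
    "austria": np.array([0.4, 1.0, 1.0, 0.0]),
    "red":     np.array([0.9, 0.0, 0.0, 1.0]),  # treated as an adjective here
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(features["germany"], features["austria"]))  # high
print(cosine(features["germany"], features["red"]))      # much lower
```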

In LSA, by contrast, we cannot define what properties the numbers in the vectors represent. No single dimension is interpretable on its own.

However, we know that the following holds:

doc_vec = sum of the vectors of the words in a document

  • Given a document vector, can you find the words that best represent it? Yes: they are the words whose vectors are closest to the document vector.

  • Crucially, some of these nearby words do not occur in the document at all; in this way we discover expansion words for the document.

  • Consider this example: a query of one word. The closest document is again a document containing just that one word.
  • What is the closest document with two words? Under bag-of-words, it is simply any two-word document in which the query word appears.
  • Under LSA, it is the two-word document whose second word co-occurs with the query word. For example, if the query is “new”, the closest match is “new york” (a sketch of this follows after the list).
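A minimal numeric sketch of these two points, again with invented word vectors: the words closest to a document vector include words that do not occur in the document, and the closest two-word document to a one-word query is the one whose second word co-occurs with the query word.

```python
import numpy as np

# Toy LSI-style word vectors, invented for illustration.
word_vecs = {
    "new":     np.array([0.9, 0.1, 0.1]),
    "york":    np.array([0.8, 0.2, 0.1]),
    "city":    np.array([0.7, 0.3, 0.2]),
    "germany": np.array([0.1, 0.9, 0.2]),
    "austria": np.array([0.2, 0.8, 0.3]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words closest to a document vector may not occur in the document:
# "city" ranks just behind "new" and "york" for the document "new york".
doc_new_york = word_vecs["new"] + word_vecs["york"]
nearest = sorted(word_vecs, key=lambda w: -cosine(doc_new_york, word_vecs[w]))
print(nearest[:3])  # ['new', 'york', 'city']

# For the one-word query "new", the document "new york" beats "new germany"
# because "york" co-occurs with (is close to) "new".
query = word_vecs["new"]
doc_new_germany = word_vecs["new"] + word_vecs["germany"]
print(cosine(query, doc_new_york))      # higher
print(cosine(query, doc_new_germany))   # lower
```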

Another example: a query containing “germany” along with other words. Even though “germany” on its own is close to “austria”, the other query words pull the full query vector away, so cos(query, austria) ends up low and “austria” is not a likely expansion.
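A small numeric illustration of this, with invented vectors and invented extra query words (“reunification”, “1990”): the word “germany” alone is close to “austria”, but the full query is not.

```python
import numpy as np

# Invented vectors: "germany" and "austria" are close on their own,
# but the other query words pull the summed query vector away from "austria".
word_vecs = {
    "germany":       np.array([0.2, 0.9, 0.1]),
    "austria":       np.array([0.3, 0.8, 0.1]),
    "reunification": np.array([0.9, 0.2, 0.1]),
    "1990":          np.array([0.8, 0.1, 0.3]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = word_vecs["germany"] + word_vecs["reunification"] + word_vecs["1990"]

print(cosine(word_vecs["germany"], word_vecs["austria"]))  # high: the word alone looks expandable
print(cosine(query, word_vecs["austria"]))                 # lower: the full query is a poor match
```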

Query vs. document expansion: query expansion is riskier because it is based on a much smaller context (a handful of query words rather than a full document).

A query or document is not just a single vector but a cone around it: the vector together with its standard deviation, and the expansions that fall within one standard deviation of it.

Words in the query, in the document title, and in the document text have different statistical properties. Expanding the query based on a query log is less risky.

You can also search for similar queries and run them as well.