Suffix Arrays for Phrase Search
This article is still a work in progress and may contain inaccuracies and undeveloped sections! Use it at your own discretion!
Use cases
Imagine you have a very large collection of documents and you want to organize it (or index it) in such a way so that you can easily
- find if the collection contains documents with any phrase, no matter how long the phrase
- compute ngram frequencies for large ngrams.
Suffix arrays are one way to go about it. The approach of suffix arrays works by organizing the words in collection in such a way that a linear scan will reveal the number of times an ngram appears in the collection, no matter how long the ngram.
One quite interesting application related to the two questions above is how to extract cheaply all the phrases in a large collection of text.
What is a suffix array?
Suffix arrays are typically explained on a string of text such as “abracadabra”
Substring search problem
What we want here is given a substring such as “cadab” to find easily if it is a substring of the original “abracadabra” string. (Instead of array of letters one can consider array of words, the substring in this case is replaced by a phrase containing words.)
So the way we can go about this problem it is to take the original string and create the list of all suffixes:
abracadabra bracadabra racadabra acadabra cadabra adabra dabra abra bra ra a
Then if we search for “cadab” we can see that it will be a prefix of one of the suffixes, namely “cadabra”.
A substring is a prefix of a suffix of the complete string. Therefore, when we search for a substring, we search for a prefix of a suffix. For example, "cadabra" is a suffix, and "cada" is a prefix of length 4 in the suffix "cadabra". If we were searching for "cada", we'd find it to be the prefix of the suffix "cadabra".
Speeding up the search
One method for speeding up the search is to sort the suffixes and then search with binary search.
Reducing storage
When we generate all suffixes of a string of $n$ characters, we need $O(n^2)$ space. This can be avoided if we figure out that we can represent each suffix with a pointer (or offset) in the original string.
Thus, initially the suffixes are represented via the offsets:
0: abracadabra 1: bracadabra 2: racadabra 3: acadabra 4: cadabra 5: adabra 6: dabra 7: abra 8: bra 9: ra 10: a
After sorting we retrieve the suffix string from the offset:
sorted suffixes and the offset in the original string ----------------------------------------------------- start in original array suffix ----------------------------------------------------- 10 "a" 7 "abra" 0 "abracadabra" 3 "acadabra" 5 "adabra" 8 "bra" 1 "bracadabra" 4 "cadabra" 6 "dabra" 9 "ra" 2 "racadabra"
Sorting the suffixes
Even though we reduced the memory from the naive $O(n^2)$ to $O(n)$, if we use the obvious sorting method (say plain quick sort), it might take $O(n^2)$ to sort the suffixes. One idea is that we can use a special version of quick sort that works very well for sorting arrays of strings (this version is called multikey ternary quicksort) or let’s just call it fast string sort
Now all that is needed to adapt the implementation(s) from the above article by providing custom implementations for the following functions of strings:
string.length string.charAt(...)
The idea kind of works but a problem appears in a data that contains a lot of duplicate ngrams. For example:
abc abc abc abc
If we attempt to sort and choose the letter ‘b’ as the first pivot we reach the following state:
let extra = abc group 1 (first letter < 'b') abc extra extra extra abc extra extra abc extra abc -------------------- group 2 (first letter == 'b') bc extra extra extra bc extra extra bc extra bc -------------------- group 3 (first letter > 'b') c extra extra extra c extra extra c extra c
Now as one can see, recursing on the first group, since all suffixes there start with ‘a’, we arrive at a group that looks exactly like the second group. Since those groups of suffixes are handled independently, work will be repeated.
A pathological case is data that contains the same character, for example ‘aaaaaaaaaaaaa….aaaa’
The use of the string sorting algorithm is competitive if not many duplicates appear. But no one can guarantee that.
A clever algorithm: SA-IS
This section is still under construction.
Computing the LCP array
LCP stands for longest common prefix. What LCP means is best explained by an example.
Consider two consecutive suffixes after sorting.
i ind lcp rnk select --------------------------- 0 10 - 0 "a" 1 7 1 1 "abra" 2 0 4 2 "abracadabra" 3 3 1 3 "acadabra" 4 5 1 4 "adabra" 5 8 0 5 "bra" 6 1 3 6 "bracadabra" 7 4 0 7 "cadabra" 8 6 0 8 "dabra" 9 9 0 9 "ra" 10 2 2 10 "racadabra"
For example, in the sorted list, consider the suffix at position $i=6$. This is “bracadabra” Before this suffix, at position $i=5$ we see the suffix “bra”. Since the common prefix of “bra” and “bracadabra” is 3, we write 3 for the LCP of “bracadabra”.
Given the sorted suffixes the LCP array can be computed in linear time. See the implementation in function computeLCP
Uses of the LCP array
The LPC array is useful for
- speeding up substring search
- finding the longest repeated substring
- quickly discovering ngram counts
Computing phrases from Wikipedia
Input: 50000 documents Number of total words in those documents: 88 686 630 Size of input array in MBs: 160 (the array is an array of integer ids, each id represents a word)
The time to sort the suffixes is 50 seconds. The same time was achieved by both the fast string sort algorithm and the SA-IS algorithm with implementation by yuta256. However, the fast string sort algorithm may fail to perform well on certain type of data. So the SA-IS algorithm is preferable.
After the suffix array and the LCP (longest common prefix) array are computed it’s a matter of scanning through the LCP array and keeping track of a few counters. This is the function dumpNgrams
The top bigrams are (the scores below are counts):
united kingdom 13463.0 prime minister 13370.0 war ii 12824.0 median income 12762.0 high school 10932.0 university press 10239.0 civil war 9752.0 every females 8878.0 los angeles 8797.0 soviet union 8602.0 per square 8488.0 air force 8413.0 further reading 8190.0 square mile 8184.0 american actor 8110.0 african american 8069.0 press isbn 8027.0
Some interesting trigrams (the scores below are not counts):
dien bien phu -1357.0843587874344 guru granth sahib -1395.1764721438385 tuatha dã© danann -1419.5544827727674 chen shui bian -1431.465076962982 obi wan kenobi -1444.5153566396255 lee kuan yew -1516.8594907003933 chiang ching kuo -1548.9171440014002 gamal abdel nasser -1550.7136811332614 josip broz tito -1579.5224149565151 ursula le guin -1602.9230687266368 ebook kansas cyclopedia -1630.9021398533328 mutant ninja turtles -1645.7369039407974 castilla la mancha -1669.7628172544435 kuala lumpur malaysia -1700.1293484922126 teenage mutant ninja -1719.8348978683298
Demo
A working demo is available in the suffix-array project.