This article is still a work in progress and may contain inaccuracies and undeveloped sections!
Use it at your own discretion!


Use cases

Imagine you have a very large collection of documents and you want to organize it (or index it) in such a way so that you can easily

• find if the collection contains documents with any phrase, no matter how long the phrase
• compute ngram frequencies for large ngrams.

Suffix arrays are one way to go about it. The approach of suffix arrays works by organizing the words in collection in such a way that a linear scan will reveal the number of times an ngram appears in the collection, no matter how long the ngram.

One quite interesting application related to the two questions above is how to extract cheaply all the phrases in a large collection of text.

What is a suffix array?

Suffix arrays are typically explained on a string of text such as “abracadabra”

Substring search problem

What we want here is given a substring such as “cadab” to find easily if it is a substring of the original “abracadabra” string. (Instead of array of letters one can consider array of words, the substring in this case is replaced by a phrase containing words.)

So the way we can go about this problem it is to take the original string and create the list of all suffixes:

abracadabra
dabra
abra
bra
ra
a


Then if we search for “cadab” we can see that it will be a prefix of one of the suffixes, namely “cadabra”.

A substring is a prefix of a suffix of the complete string.
Therefore, when we search for a substring, we search for a prefix
of a suffix.
For example, "cadabra" is a suffix, and "cada" is a prefix of length
4 in the suffix "cadabra". If we were searching for "cada", we'd find
it to be the prefix of the suffix "cadabra".


One method for speeding up the search is to sort the suffixes and then search with binary search.

Reducing storage

When we generate all suffixes of a string of $n$ characters, we need $O(n^2)$ space. This can be avoided if we figure out that we can represent each suffix with a pointer (or offset) in the original string.

Thus, initially the suffixes are represented via the offsets:

0: abracadabra
6:       dabra
7:        abra
8:         bra
9:          ra
10:          a


After sorting we retrieve the suffix string from the offset:

sorted suffixes and the offset in the original string
-----------------------------------------------------
start in
original
array       suffix
-----------------------------------------------------
10         "a"
7         "abra"
8         "bra"
6         "dabra"
9         "ra"


Sorting the suffixes

Even though we reduced the memory from the naive $O(n^2)$ to $O(n)$, if we use the obvious sorting method (say plain quick sort), it might take $O(n^2)$ to sort the suffixes. One idea is that we can use a special version of quick sort that works very well for sorting arrays of strings (this version is called multikey ternary quicksort) or let’s just call it fast string sort

Now all that is needed to adapt the implementation(s) from the above article by providing custom implementations for the following functions of strings:

string.length
string.charAt(...)


The idea kind of works but a problem appears in a data that contains a lot of duplicate ngrams. For example:

abc abc abc abc


If we attempt to sort and choose the letter ‘b’ as the first pivot we reach the following state:

let extra = abc

group 1 (first letter < 'b')
abc extra extra extra
abc extra extra
abc extra
abc
--------------------
group 2 (first letter == 'b')
bc extra extra extra
bc extra extra
bc extra
bc
--------------------
group 3 (first letter > 'b')
c extra extra extra
c extra extra
c extra
c


Now as one can see, recursing on the first group, since all suffixes there start with ‘a’, we arrive at a group that looks exactly like the second group. Since those groups of suffixes are handled independently, work will be repeated.

A pathological case is data that contains the same character, for example ‘aaaaaaaaaaaaa….aaaa’

The use of the string sorting algorithm is competitive if not many duplicates appear. But no one can guarantee that.

A clever algorithm: SA-IS

This section is still under construction.


Computing the LCP array

LCP stands for longest common prefix. What LCP means is best explained by an example.

Consider two consecutive suffixes after sorting.

  i ind lcp rnk select
---------------------------
0  10   -   0 "a"
1   7   1   1 "abra"
2   0   4   2 "abracadabra"
3   3   1   3 "acadabra"
4   5   1   4 "adabra"
5   8   0   5 "bra"
6   1   3   6 "bracadabra"
7   4   0   7 "cadabra"
8   6   0   8 "dabra"
9   9   0   9 "ra"
10   2   2  10 "racadabra"


For example, in the sorted list, consider the suffix at position $i=6$. This is “bracadabra” Before this suffix, at position $i=5$ we see the suffix “bra”. Since the common prefix of “bra” and “bracadabra” is 3, we write 3 for the LCP of “bracadabra”.

Given the sorted suffixes the LCP array can be computed in linear time. See the implementation in function computeLCP

Uses of the LCP array

The LPC array is useful for

• speeding up substring search
• finding the longest repeated substring
• quickly discovering ngram counts

Computing phrases from Wikipedia

Input: 50000 documents
Number of total words in those documents: 88 686 630
Size of input array in MBs: 160 (the array is an array of integer ids, each id represents a word)


The time to sort the suffixes is 50 seconds. The same time was achieved by both the fast string sort algorithm and the SA-IS algorithm with implementation by yuta256. However, the fast string sort algorithm may fail to perform well on certain type of data. So the SA-IS algorithm is preferable.

After the suffix array and the LCP (longest common prefix) array are computed it’s a matter of scanning through the LCP array and keeping track of a few counters. This is the function dumpNgrams

The top bigrams are (the scores below are counts):

united kingdom	13463.0
prime minister	13370.0
war ii	12824.0
median income	12762.0
high school	10932.0
university press	10239.0
civil war	9752.0
every females	8878.0
los angeles	8797.0
soviet union	8602.0
per square	8488.0
air force	8413.0
square mile	8184.0
american actor	8110.0
african american	8069.0
press isbn	8027.0


Some interesting trigrams (the scores below are not counts):

dien bien phu	-1357.0843587874344
guru granth sahib	-1395.1764721438385
tuatha dã© danann	-1419.5544827727674
chen shui bian	-1431.465076962982
obi wan kenobi	-1444.5153566396255
lee kuan yew	-1516.8594907003933
chiang ching kuo	-1548.9171440014002
gamal abdel nasser	-1550.7136811332614
josip broz tito	-1579.5224149565151
ursula le guin	-1602.9230687266368
ebook kansas cyclopedia	-1630.9021398533328
mutant ninja turtles	-1645.7369039407974
castilla la mancha	-1669.7628172544435
kuala lumpur malaysia	-1700.1293484922126
teenage mutant ninja	-1719.8348978683298


Demo

A working demo is available in the suffix-array project.