Custom Similarity for ElasticSearch
This blog post describes how to write your own custom similarity for Elasticsearch and when you want to do so. I’m using as a running example the use case of measuring the overlap between user-generated clicks for two web pages. I present all the details that are relevant to computing an overlap similarity in Elasticsearch. The full source code with a test example is avalable in github. In the running example, a page about encryption is found to be similar to pages about security and wiretapping, like so:
query: page_encryption {"page_id":"page_encryption"} {"page_id":"page_clipper"} {"page_id":"page_chip"} {"page_id":"page_key"} {"page_id":"page_keys"} {"page_id":"page_secure"} {"page_id":"page_algorithm"} {"page_id":"page_escrow"} {"page_id":"page_security"} {"page_id":"page_secret"} {"page_id":"page_crypto"}
Please visit the example: Custom Similarity For ElasticSearch
Motivation
Writing a custom similarity is typically considered a last resort solution although it’s not hard from a developer perspective. The downside typically is (as with any computer code) that you need to test it, profile it, distribute it to the production servers and continue to maintain it as bugs are discovered or the Elasticsearch API changes.
So before choosing this path please make sure that existing similarities and their parameters won’t work. Make sure that extending the similarities with scripting won’t work either. A typical use case for scriptiong is when you want to give a boost to a document based on recency or location. So typically time-based boosting or location boosting is not a good case for a similarity plugin.
However, there could be situations when a custom scoring mechanism is the best option.
A good use case is when you have a well-performing similarity measure (and you are sure of that!), but this similarity is not integrated into Elasticsearch.
A good use case is recommendation systems
One of the simplest recommenation systems that is based on user clicks (or user iteraction with items) is by finding item-to-item correlations. This a type of recommendation system pioneered by Amazon which can be simply described as:
"Most" users who clicked on A also clicked on B.
Even though there are a number of ways to implement such as scoring function, many of those methods require the overlap between the vectors of user clicks for two items.
You have the following situation:
item_a = [user_1, user_50, user_32, user_65, ...] item_b = [user_23, user_50, user_52, user_64, ...]
One simple similarity is the following
clicks_overlap = #distinct users who clciked on both item_a and item_b clicks_a = #distinct users who clicked on item_a clicks_b = #distinct users who clicked on item_b similarity = min(clicks_overlap/clicks_a, clicks_overlap/clicks_b) = clicks_overlap/max(clicks_a, clicks_b)
Another similarity measure popular in the literature on recommendation systems is also based on the aggregate statistics clicks_a, clicks_b, clicks_ab. It is implemented in Mahout. Here is a blog post describing the intuition behind Mahout LLR scoring.
In this blog post I am are going to implement only the first.
One important consideration is that I’m not running on all the click data per user but only on asample. These are not simple random samples, but are aligned. The publication that introduced this method called the sampling method Condional Random Sampling
Elasticsearch plugin architecture
Elasticsearch can be easily extended with plugins. There are two kinds of plugins: site and jvm. JVM plugins are java code and allow developers to extend the actual engine, while the site plugins are html and javascript. Common use cases for JVM plugins are:
- analyzers: code that break your string into words
- scorers: code that score your documents
The current use case requires a jvm plugin.
Project on github
I’ve put a simple project on github that implements the overlap based similarity
Below I’ll describe how such a plugin is developed.
Structure
This project is a maven project since it was eaiser to get started. Many of the existing plugins use maven. In principle we can use a gradle or an sbt project.
No matter the build system used, it is important that a file called plugin-descriptor.properties appear in the top level of the built zip or jar file.
unzip overlap-similarity-plugin-0.0.1-SNAPSHOT.jar ls plugin-descriptor.properties
Sample plugin-descriptor.properties
description=Overlap Similarity Plugin version=0.0.1-SNAPSHOT jvm=true name=overlap-similarity-plugin elasticsearch.version=2.1.1 java.version=1.8 classname=stefansavev.esplugins.OverlapSimilarityPlugin
When elasticsearch installs and loads the plugin, it is looking for the plugin-descriptor.properties file.
Defining the Plugin
In my plugin I need to create a class that extends from the Elasticsearch Plugin class
public class OverlapPlugin extends Plugin { @Override public String name() { return "overlap-similarity"; } @Override public String description() { return "Overlap Similarity Plugin"; } public void onModule(final SimilarityModule module) { module.addSimilarity("overlap", OverlapSimilarityProvider.class); } }
In this particular case I’m not implementing modules() or services() functions as I’m not creating any new modules or services. However, I am adding a similarity to the existing SimilarityModule defined in the Elasticsearch codebase. Notice that onModule is not an overriden method and must take as an argument a SimilarityModule type. If we we doing an custom analyzer the signature would be using AnalysisModule instead of SimilarityModule, like so:
//if a custom analyzer was needed public class CustomAnalyzerPlugin extends Plugin{ public void onModule(final AnalysisModule module) { ... } }
The onModule function is called via Reflection and Elasticsearch uses the declaration of the module type (SimilarityModule vs. AnalysisModule) to figure out which object to pass to our plugin.
SimilarityProvider and Similarity
The custom similarity provider is specified in the custom plugin class (see onModule function). The similarity provider returns a similarity to be used:
package stefansavev.esplugins; import org.apache.lucene.search.similarities.Similarity; import org.elasticsearch.common.inject.Inject; import org.elasticsearch.common.inject.assistedinject.Assisted; import org.elasticsearch.common.settings.Settings; import org.elasticsearch.index.similarity.*; public class OverlapSimilarityProvider extends AbstractSimilarityProvider { private OverlapSimilarity similarity; @Inject public OverlapSimilarityProvider(@Assisted String name, @Assisted Settings settings) { super(name); this.similarity = new OverlapSimilarity(); } @Override public Similarity get() { return similarity; } }
The custom similarity is the most complex of the three classes we have to create. Ignoring the implementation details the custom similarity looks like this:
public class OverlapSimilarity extends Similarity { public float coord(int overlap, int maxOverlap) {...} public float queryNorm(float valueForNormalization) {...} @Override public long computeNorm(FieldInvertState state){...} @Override public Similarity.SimWeight computeWeight(float queryBoost, CollectionStatistics collectionStats, TermStatistics... termStats){...} @Override public Similarity.SimScorer simScorer(SimWeight stats, LeafReaderContext context) throws IOException{...} }
Understanding how Similarity works in Elasticsearch
That’s the most complex part of the article, so make sure to take a short break before proceeding.
Registering custom similarity with a field
Suppose our plugin is deployed. First, we need to create a name for our custom similarity
curl -XPUT 'localhost:9200/page_clicks' -d ' { "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 1, "store": "memory" }, "similarity" : { "custom_similarity" : { "type" : "overlapsimilarity" } } } } '
Second, when we create the mapping we need to specify that we want our custom similarity based on the field. In this particular case I also specify that I want to store only the user_ids (this is “index_options”: “docs”)
curl -XPUT 'localhost:9200/page_clicks/_mapping/pages' -d ' { "properties": { "page_id": { "type": "string", "index": "not_analyzed" }, "user_ids": { "type": "string", "analyzer": "standard", "index_options": "docs", "similarity": "custom_similarity" } } }
We are executing the following queries:
First, we want to get the user_ids for a particular page
curl -XGET 'localhost:9200/page_clicks/pages/_search' -d' { "query" : { "match" : { "page_id" : "page_encryption" } } } '
The result is
"hits":[{"_index":"page_clicks","_type":"page","_id":"AVIyiARRwYU3IsvU-MhY","_score":6.3752785,"_source": { "page_id": "page_encryption", "user_ids": "user_1951 user_3710 user_7497 user_8351 ..." } }]}}
Then we use the returned user_ids to formulate a similarity query with our similarity
curl -XGET 'localhost:9200/page_clicks/word/_search?pretty' -d' { "_source": [ "page" ], size: 20, "query": { "match" : { "user_ids": "user_1951 user_3710 user_7497 user_8351 ..." } }, "aggs": {} } '
Creating a benchmark dataset and obtaining results
I’m running on a dataset based on the 20-newsgroups dataset. I’m imagining each word corresponds to a page (for example in a on online dictionary each word has a page). Each document corresponds to a user_id. I downloaded the 20-newsgroups dataset and prepared the data for elasticsearch with a script.
The prepared data can be added to elasticsearch via the command:
cd example curl -s -XPOST localhost:9200/_bulk?pretty --data-binary "@data.txt"; echo
I’m executing the above two queries in a script and getting the results:
./similar_pages page_algorithm "_source":{"page":"page_algorithm"} "_source":{"page":"page_secret"} "_source":{"page":"page_encryption"} "_source":{"page":"page_clipper"} "_source":{"page":"page_escrow"} "_source":{"page":"page_key"} "_source":{"page":"page_chip"} "_source":{"page":"page_keys"} "_source":{"page":"page_crypto"} "_source":{"page":"page_secure"} "_source":{"page":"page_security"} "_source":{"page":"page_wiretap"} "_source":{"page":"page_nsa"} "_source":{"page":"page_des"} "_source":{"page":"page_encrypted"} "_source":{"page":"page_details"} "_source":{"page":"page_privacy"} "_source":{"page":"page_information"} "_source":{"page":"page_public"} "_source":{"page":"page_independent"}
It makes sense.
Back to how elasticsearch similarity works
When we run a query Elasticsearch (in this particular case) does something like this:
our_similarity = find the similarity based on the field and the specification in the mapping for each user_id (word) in the user_ids field of the query: sim_weight = our_similarity.computeWeight(user_id, global_statistics) #in sim_weight we get access to the so called normalizations or field_length #in our particular case we get access to the total number of clicks per page value_for_query_normalization = sim_weight.getValueForNormalization() query_normalization = 1/sqrt(sum of value_for_query_normalization for all words in the query) for each user_id in the user_ids field of the query: sim_weight = get the similarity weight for the user_id //tell the similarity_weight for each word about the query normalization sim_weight.normalize(query_normalization) total_score_per_page = {} #page_id -> score or document -> score for each user_id (word) in the user_ids field of the query: sim_weight = get the similarity weight for the user_id sim_scorer = our similarity scorer(sim_weight,...) selected_page_ids = get a list of pages which have this user_id (or get a list of candidate documents) for page_id (document) in selected_page_ids: total_score_per_page[page_id] += sim_scorer.score(page_id,...) sort the page_ids in total_score_per_page by their scores and return the top
The above code explains how the SimWeight and SimilarityScorer interact. The key to understanding the Elasticsearch scoring in this particular case is that we need to know the query length (number of user_ids in the query). The query length is given implicitly to the SimWeight class via a call to the normalize method. What we get there is not actually the query length but 1/sqrt(sum of some weights). The weights that are summed in the square root are actually supplied by the SimWeight scorer via getValueForNormalization method - in this case they are 1 for each user_id in the query.
Another statistic we need is the length of the user_ids field. This information is provided to the SimScorer via the simScorer method:
@Override public final SimScorer simScorer(SimWeight stats, LeafReaderContext context) throws IOException { OverlapStats overlapStats = (OverlapStats) stats; //get access to the query length NumericDocValues docNorms = context.reader().getNormValues(overlapStats.field); //get access to field lengths return new OverlapScorer(overlapStats, docNorms); }
Once we have access to the required statistics we can implement the score method in the SimScorer class that will combine them.
Running the code
The code for Custom Similarity For Elasticsearch is in github with detailed instructions how to run it.