As far as NoSQL databases go, ElasticSearch is pretty advanced.

What does it support?

  • Key-value lookups
  • Complex domain objects as JSON
  • Denormalized objects
  • Normalized objects
  • Transactions at the document level (but no transactions spanning multiple documents)
  • Arbitrary filters (search)
  • Advanced grouping of results
  • Joins are possible on the client side only (by issuing multiple queries from the client and joining there)
  • Horizontally scalable (just add more servers as your data grows)
  • SQL support via a plugin
  • Often much faster than SQL databases for search and aggregation queries
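Since joins only happen on the client, here is a minimal Python sketch of the idea. It assumes the questions and users have already been fetched with two separate queries; the sample documents and the `join_questions_with_users` helper are hypothetical and only illustrate the joining step.

```python
def join_questions_with_users(questions, users):
    """Attach the author's user document to each question (client-side join)."""
    # Index users by id once so that each lookup is a dictionary access.
    users_by_id = {u["user_id"]: u for u in users}
    joined = []
    for q in questions:
        author = users_by_id.get(q["user_id"])  # None if the user is missing
        joined.append({**q, "author": author})
    return joined


# Hypothetical results of two separate queries.
questions = [
    {"title": "How do I shard an index?", "user_id": "john_smith"},
    {"title": "What is an analyzer?", "user_id": "jane_doe"},
]
users = [{"user_id": "john_smith", "first_name": "John", "last_name": "Smith"}]

for row in join_questions_with_users(questions, users):
    name = row["author"]["first_name"] if row["author"] else "(unknown)"
    print(row["title"], "-", name)
```

Questions whose author was not returned by the second query keep an `author` of `None`, which roughly corresponds to a left join.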

The two most advanced features compared to other databases are search via the inverted index and grouping of the results. It turns out that to build a flexible analytics engine, those two properties are enough.
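To make those two properties concrete, here is a toy Python sketch of an inverted index plus a grouping step. It is only an illustration under simplifying assumptions (whitespace tokenization, in-memory data, hypothetical sample questions) and says nothing about how ElasticSearch implements either feature internally.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in doc["title"].lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*postings) if postings else set()


def group_by_tag(docs, doc_ids):
    """Count the matching documents per tag (a tiny stand-in for grouping)."""
    return Counter(tag for doc_id in doc_ids for tag in docs[doc_id]["tags"])


# Hypothetical questions keyed by id.
docs = {
    1: {"title": "Sorting a list in Java", "tags": ["java", "sorting"]},
    2: {"title": "Sorting a vector in Go", "tags": ["go", "sorting"]},
    3: {"title": "Java generics explained", "tags": ["java"]},
}

index = build_inverted_index(docs)
hits = search(index, "sorting a")
print(sorted(hits))
print(group_by_tag(docs, hits).most_common())
```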

The code at a glance

Project on github

Example

Let’s say we are building a website like Stack Overflow. The main entities are:

  • user
  • question
  • answer
  • activity_log (users posting/answering and viewing questions)
  • vote
  • job_listing

The user entity can be represented as:

entity user{
  user_id: string
  first_name: string
  last_name: string
  location: string
  web_site: string
  about_me: string
  interests: [string]
  experience: list[
    { role: optional[string]
      company: optional[string]
    }]
  job_search_status: looking|notlooking
}

The question entity can be represented as:

entity question{
  title: string
  full_text: string
  created_at: string
  last_modified_at: string
  user_id: string
}

The answer entity can be represented as:

entity answer{
  full_text: string
  created_at: string
  last_modified_at: string
}

Implementation

Creating index and collection schemas

  1. Create an index (and give it a name, e.g. mystackoverflow)
curl -XPUT 'localhost:9200/mystackoverflow' -d '
{
    "settings" : {
        "index" : {
            "number_of_shards" : 10,
            "number_of_replicas" : 1
        }
    }
}
'

1.1. Verify the index

 curl -XGET 'http://localhost:9200/mystackoverflow/'

1.2. Update the index with mappings for a user

curl -XPUT 'localhost:9200/mystackoverflow/_mapping/user' -d '
{
  "properties": {
    "first_name": {
      "type": "string",
      "analyzer": "standard"
    },
    "last_name": {
      "type": "string",
      "analyzer": "standard"
    },
    "location": {
      "type": "string",
      "analyzer": "standard"
    },
    "about_me": {
      "type": "string",
      "analyzer": "english"
    },
    "interests": {
      "type": "string",
      "index": "not_analyzed"
    },

    "created_at": {
      "type": "date",
      "format": "date_hour_minute_second_millis"
    }
  }
}
'

1.3. Update the index with mappings for a question

  • TODO: upgrade the schema (not possible on existing collections)
  • TODO: adding referential integrity

1.4. Create the activity log

class ActivityLog{
  user_id: string
  item_type: string (question or answer)
  item_id: string (id of the question or answer)
  created_at: date
  interaction_type: string (create, update, view, upvote or downvote; each applies to a question or answer)
}
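As a rough sketch, an activity log document could be built on the client like this before it is indexed. The `make_activity` helper and the sample ids are hypothetical, and the timestamp layout assumes that `created_at` uses the same `date_hour_minute_second_millis` format as the user mapping above.

```python
import json
from datetime import datetime, timezone

ITEM_TYPES = {"question", "answer"}
INTERACTION_TYPES = {"create", "update", "view", "upvote", "downvote"}


def make_activity(user_id, item_type, item_id, interaction_type, when=None):
    """Build one activity_log document as a plain dict, validating the enums."""
    if item_type not in ITEM_TYPES:
        raise ValueError(f"unknown item_type: {item_type!r}")
    if interaction_type not in INTERACTION_TYPES:
        raise ValueError(f"unknown interaction_type: {interaction_type!r}")
    when = when or datetime.now(timezone.utc)
    # yyyy-MM-dd'T'HH:mm:ss.SSS, i.e. date_hour_minute_second_millis
    millis = f"{when.microsecond // 1000:03d}"
    return {
        "user_id": user_id,
        "item_type": item_type,
        "item_id": item_id,
        "created_at": when.strftime("%Y-%m-%dT%H:%M:%S.") + millis,
        "interaction_type": interaction_type,
    }


event = make_activity("john_smith", "question", "q42", "view",
                      when=datetime(2016, 1, 2, 3, 4, 5, 678000))
print(json.dumps(event, indent=2))
```

Validating the two enumerations on the client is a design choice of this sketch; nothing in the index itself enforces them.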

Adding data

For example, adding a user:

curl -XPUT 'localhost:9200/mystackoverflow/user/john_smith?pretty' -d '
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "about_me" :   "I love to go rock climbing",
    "interests": [ "cplusplus", "go" ]
}
'

Updating data

For example, update the about_me section of a user:

# note: -XPOST here, not -XPUT
curl -XPOST 'localhost:9200/mystackoverflow/user/john_smith/_update?pretty' -d '
{
  "doc": {
        "about_me" :   "I love to go rock climbing and kayaking"
    }
}
'

Update the interests of a user:

# append a new interest without putting duplicates in the list
curl -XPOST 'localhost:9200/mystackoverflow/user/john_smith/_update?pretty' -d '
{
   "script" : "ctx._source.interests = (ctx._source.interests + new_item).unique( false )",
   "params" : {
      "new_item" : "java"
   }
}
'
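As I read it, the update script appends the new item and then removes duplicates. The same logic in plain Python, shown only to spell out the intent (the `add_unique` name is hypothetical and this code does not run inside ElasticSearch):

```python
def add_unique(interests, new_item):
    """Return interests plus new_item, without duplicates, in first-seen order."""
    # dict.fromkeys keeps insertion order and drops repeated keys.
    return list(dict.fromkeys(interests + [new_item]))


print(add_unique(["cplusplus", "go"], "java"))
print(add_unique(["cplusplus", "go"], "go"))
```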

Queries with aggregations

Number of questions per tag

curl -XGET 'localhost:9200/mystackoverflow/question/_search?pretty' -d '
{
   "size": 0,
   "aggs": {
        "group_by_tag": {
            "terms": { "field": "tags" }
        }
    }
}'
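The counts come back as a list of buckets under the aggregation name. Here is a small Python sketch of flattening them into a dictionary; the response body is a hypothetical sample with made-up numbers, and `tag_counts` is an illustrative helper.

```python
import json

# Hypothetical response body; the bucket values are made up for illustration.
sample_response = json.loads("""
{
  "hits": { "total": 3, "hits": [] },
  "aggregations": {
    "group_by_tag": {
      "buckets": [
        { "key": "java", "doc_count": 2 },
        { "key": "go", "doc_count": 1 }
      ]
    }
  }
}
""")


def tag_counts(response, agg_name="group_by_tag"):
    """Flatten the terms buckets of one aggregation into {tag: count}."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


print(tag_counts(sample_response))
```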

Top K items by number of distinct users

The next query ranks items (questions or answers) by how many distinct users interacted with them. It has the following behavior:

  • analyze the activity in a time window (from now - X days to now)
  • group the activity by item (question or answer)
  • for each group (bucket), calculate the number of distinct users
  • sort the groups by the number of distinct users in descending order
  • output the top K items with their count of distinct users

Since this is a heavy and frequently executed query, it’s best to cache it:

curl -XGET 'localhost:9200/mystackoverflow/activity_log/_search?request_cache=true&pretty' -d '
{
   "size": 0,

  "query": {
      "filtered": {
         "query": {
              "match_all": {}
         },
         "filter": {
            "range" : {
                "created_at" : {
                    "gte" : "now-1000d/d",
                    "lt" :  "now/d"
                }
            }
         }
      }
   },

   "aggs": {
        "group_by_item": {
            "terms": {
                "field": "item_id",
                "size": 2,
                "order": { "distinct_users": "desc" }
            },
            "aggs": {
                "distinct_users": {
                    "cardinality": {
                        "field": "user_id"
                    }
                }
            }
        }
    }
}'
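To check the logic of this query, the same computation can be mimicked on in-memory data. The sketch below is plain Python over hypothetical sample events. It counts distinct users exactly, whereas the cardinality aggregation returns an approximate count, and it ignores the rounding to whole days that `/d` applies in the range filter.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def top_items_by_distinct_users(events, now, days, k):
    """Return the k items with the most distinct users in [now - days, now)."""
    start = now - timedelta(days=days)
    users_per_item = defaultdict(set)
    for e in events:
        if start <= e["created_at"] < now:  # time window filter
            # One bucket per item; the set keeps each user only once.
            users_per_item[e["item_id"]].add(e["user_id"])
    ranked = sorted(
        ((item, len(users)) for item, users in users_per_item.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical activity log entries.
now = datetime(2016, 1, 10)
events = [
    {"user_id": "a", "item_id": "q1", "created_at": datetime(2016, 1, 9)},
    {"user_id": "b", "item_id": "q1", "created_at": datetime(2016, 1, 8)},
    {"user_id": "a", "item_id": "q1", "created_at": datetime(2016, 1, 7)},
    {"user_id": "a", "item_id": "q2", "created_at": datetime(2016, 1, 9)},
    {"user_id": "c", "item_id": "q3", "created_at": datetime(2015, 1, 1)},
]
print(top_items_by_distinct_users(events, now, days=5, k=2))
```

Here `days` plays the role of X and `k` the role of K (the `size` of the terms aggregation).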