ElasticSearch as a Database System
As far as NoSQL databases go, ElasticSearch is pretty advanced.
What does it support?
- Key value lookups
 - Complex domain object as JSON
 - Unnormalized objects
 - Normalized objects
 - Transactions at the document level (but no transactions spanning multiple documents)
 - Arbitrary Filters (Search)
 - Advanced grouping of results
 - Joins are possible at the client side only (by issuing multiple queries from the client and joining there)
 - Horizontally scalable (just add more servers when your data grows)
 - via a plugin ElasticSearch also supports SQL
 - often much faster than SQL databases for search and aggregation queries
 
The two most advanced features compared to other databases are search via the inverted index and grouping of the results. It turns out that to build a flexible analytics engine, those two properties are enough.
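To make these two features concrete, here is a toy sketch in plain Python (it is not ElasticSearch code and says nothing about how ElasticSearch is implemented internally). It builds an inverted index over a few made-up documents, runs a simple AND search against it, and then groups the matches by a field. The sample documents and the helper names build_inverted_index, search and group_by are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Made-up sample documents; field names follow the question entity used later.
DOCS = {
    1: {"title": "How to sort a list in Java", "tags": ["java"]},
    2: {"title": "Sort a map by value in Go", "tags": ["go"]},
    3: {"title": "Java streams and sorted list", "tags": ["java", "streams"]},
}


def build_inverted_index(docs, field):
    """Map each lowercased token of `field` to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in doc[field].lower().split():
            index[token].add(doc_id)
    return index


def search(index, *terms):
    """Return ids of documents containing all terms (a simple AND query)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def group_by(docs, doc_ids, field):
    """Count matching documents per value of a multi-valued field."""
    counts = Counter()
    for doc_id in doc_ids:
        counts.update(docs[doc_id][field])
    return counts


index = build_inverted_index(DOCS, "title")
hits = search(index, "java", "list")
print(sorted(hits))
print(group_by(DOCS, hits, "tags"))
```

The search step only touches the posting sets of the query terms, never the full document list, which is the point of an inverted index. The grouping step is the toy counterpart of the aggregations used in the queries further down.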
The code at a glance
- Delete index
 - Create index
 - Create user mappings
 - Create question mappings
 - Create users script
 - Create questions script
 - List users
 - List questions
 - Number of questions for each keyword
 - Trending questions
 
Example
Let’s say we are building a website like Stack Overflow. The main entities are:
- user
 - question
 - answer
 - activity_log (users posting/answering and viewing questions)
 - vote
 - job_listing
 
The user entity can be represented as:
entity user{
  user_id: string
  first_name: string
  last_name: string
  location: string
  web_site: string
  about_me: string
  interests: [string]
  experience: list[
    { role: optional[string]
      company: optional[string]
    }]
  job_search_status: looking|notlooking
}
The question entity can be represented as:
entity question{
  title: string
  full_text: string
  created_at: string
  last_modified_at: string
  user_id: string
}
The answer entity can be represented as:
entity answer{
  full_text: string
  created_at: string
  last_modified_at: string
}
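As a rough illustration of how such an entity turns into a JSON document, the sketch below models a subset of the user entity with Python dataclasses and serializes it, nested experience entries included. The class names, the chosen subset of fields and the to_document helper are assumptions for this example, not part of any ElasticSearch API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Experience:
    role: Optional[str] = None
    company: Optional[str] = None


@dataclass
class User:
    # Only a subset of the user entity's fields, to keep the sketch short.
    user_id: str
    first_name: str
    last_name: str
    interests: List[str] = field(default_factory=list)
    experience: List[Experience] = field(default_factory=list)
    job_search_status: str = "notlooking"  # "looking" or "notlooking"


def to_document(user: User) -> str:
    """Serialize the entity, nested objects included, to a JSON document body."""
    return json.dumps(asdict(user), sort_keys=True)


user = User("john_smith", "John", "Smith", ["cplusplus", "go"],
            [Experience(role="developer")])
print(to_document(user))
```

The nested experience list stays inside the user document, which is the unnormalized style mentioned in the feature list.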
Implementation
Creating index and collection schemas
- Create an index (and give it a name, e.g. mystackoverflow):
 
curl -XPUT 'localhost:9200/mystackoverflow' -d '
{
    "settings" : {
        "index" : {
            "number_of_shards" : 10,
            "number_of_replicas" : 1
        }
    }
}
'
1.1. Verify the index
curl -XGET 'http://localhost:9200/mystackoverflow/'
1.2. Update the index with mappings for a user
curl -XPUT 'localhost:9200/mystackoverflow/_mapping/user' -d '
{
  "properties": {
    "first_name": {
      "type": "string",
      "analyzer": "standard"
    },
    "last_name": {
      "type": "string",
      "analyzer": "standard"
    },
    "location": {
      "type": "string",
      "analyzer": "standard"
    },
    "about_me": {
      "type": "string",
      "analyzer": "english"
    },
    "interests": {
      "type": "string",
      "index": "not_analyzed"
    },
    "created_at": {
      "type": "date",
      "format": "date_hour_minute_second_millis"
    }
  }
}
'
1.3. Update the index with mappings for a question
- TODO: upgrade the schema (not possible on existing collections)
 - TODO: adding referential integrity
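The question mapping itself is not shown in this section. One possible shape, modeled on the user mapping above, is sketched below as a Python dict that is printed as JSON. Everything in it is an assumption: the analyzers, the date format, and in particular the tags field, which is not part of the question entity but is used by the keyword aggregation later on.

```python
import json

# Hypothetical question mapping, modeled on the user mapping above.
# "tags" is an assumption: it is not in the question entity, but the
# aggregation query later groups on it, so it is kept not_analyzed.
DATE_FIELD = {"type": "date", "format": "date_hour_minute_second_millis"}

QUESTION_MAPPING = {
    "properties": {
        "title": {"type": "string", "analyzer": "english"},
        "full_text": {"type": "string", "analyzer": "english"},
        "tags": {"type": "string", "index": "not_analyzed"},
        "user_id": {"type": "string", "index": "not_analyzed"},
        "created_at": dict(DATE_FIELD),
        "last_modified_at": dict(DATE_FIELD),
    }
}

# Prints a body that could be sent the same way as the user mapping, e.g.
#   curl -XPUT 'localhost:9200/mystackoverflow/_mapping/question' -d '...'
print(json.dumps(QUESTION_MAPPING, indent=2))
```

Fields that are only filtered or grouped on (tags, user_id) are left not_analyzed so that each value is treated as a single term, the same choice the user mapping makes for interests.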
 
1.4. Create the activity log
class ActivityLog{
  user_id: string
  item_type: string (question or answer)
  item_id: string (id of the question or answer)
  created_at: date
  interaction_type: string (create/update (a question or answer),
                    view (a question or answer),
                    upvote, downvote (a question or answer))
}
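A minimal sketch of how a client might build one activity log document before indexing it. The make_activity helper, the exact set of allowed values and the timestamp format (taken from the user mapping's date_hour_minute_second_millis) are assumptions; the ids in the example are made up.

```python
import json
from datetime import datetime, timezone

ITEM_TYPES = {"question", "answer"}
INTERACTION_TYPES = {"create", "update", "view", "upvote", "downvote"}


def make_activity(user_id, item_type, item_id, interaction_type, when=None):
    """Build one activity_log document, rejecting unknown types."""
    if item_type not in ITEM_TYPES:
        raise ValueError("unknown item_type: %r" % item_type)
    if interaction_type not in INTERACTION_TYPES:
        raise ValueError("unknown interaction_type: %r" % interaction_type)
    when = when or datetime.now(timezone.utc)
    millis = "%03d" % (when.microsecond // 1000)
    return {
        "user_id": user_id,
        "item_type": item_type,
        "item_id": item_id,
        # Assumed format: yyyy-MM-dd'T'HH:mm:ss.SSS, as in the user mapping.
        "created_at": when.strftime("%Y-%m-%dT%H:%M:%S.") + millis,
        "interaction_type": interaction_type,
    }


# Hypothetical ids, for illustration only.
event = make_activity("john_smith", "question", "q42", "view",
                      when=datetime(2016, 1, 2, 3, 4, 5, 678000))
print(json.dumps(event, sort_keys=True))
```

Validating on the client is one way to compensate for the missing referential integrity noted in the TODO above; ElasticSearch itself will not reject an unknown item_type.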
Adding data
For example, adding a user:
curl -XPUT 'localhost:9200/mystackoverflow/user/john_smith?pretty' -d '
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "about_me" :   "I love to go rock climbing",
    "interests": [ "cplusplus", "go" ]
}
'
Updating data
For example, update the about_me field of a user:
# note: -XPOST here, not -XPUT
curl -XPOST 'localhost:9200/mystackoverflow/user/john_smith/_update?pretty' -d '
{
  "doc": {
        "about_me" :   "I love to go rock climbing and kayaking"
    }
}
'
Update the interests (keywords) of a user:
# append the new item and drop duplicates from the list
curl -XPOST 'localhost:9200/mystackoverflow/user/john_smith/_update?pretty' -d '
{
   "script" : "ctx._source.interests = (ctx._source.interests + new_item).unique( false )",
   "params" : {
      "new_item" : "java"
   }
}
'
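The script above runs on the server. What it is meant to do to the interests list (append the new item, then drop duplicates while keeping the existing order) can be sketched in plain Python as follows. The add_unique name is an assumption, and this is an illustration of the intended result, not of the scripting engine.

```python
def add_unique(items, new_item):
    """Return items plus new_item, dropping duplicates but keeping first-seen order."""
    seen = set()
    result = []
    for item in list(items) + [new_item]:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


print(add_unique(["cplusplus", "go"], "java"))
print(add_unique(["cplusplus", "go"], "go"))
```

Running the update twice with the same new_item therefore leaves the list unchanged the second time.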
Queries with aggregations
Number of questions per keyword
curl -XGET 'localhost:9200/mystackoverflow/question/_search?pretty' -d '
{
   "size": 0,
   "aggs": {
        "group_by_tag": {
            "terms": { "field": "tags" }
        }
    }
}'
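The result of a terms aggregation comes back as a list of buckets, each with a key and a doc_count. The sketch below reads such a response into a plain dict. The sample response is made up to show the shape only (the counts are not real results), and questions_per_tag is a hypothetical helper.

```python
import json

# Made-up response in the shape a terms aggregation returns;
# the counts are illustrative, not real results.
SAMPLE_RESPONSE = json.loads("""
{
  "hits": {"total": 5, "hits": []},
  "aggregations": {
    "group_by_tag": {
      "buckets": [
        {"key": "java", "doc_count": 3},
        {"key": "go", "doc_count": 2}
      ]
    }
  }
}
""")


def questions_per_tag(response, agg_name="group_by_tag"):
    """Turn the aggregation buckets into a {tag: count} dict."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


print(questions_per_tag(SAMPLE_RESPONSE))
```

Because the query sets size to 0, the hits list is empty and only the aggregation part of the response carries data.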
Find Hot (Trending) Items with ElasticSearch
The query has the following behavior:
- analyze the activity in a time window (from now - X days to now)
 - group the activity by page
 - for each group (bucket) per page, calculate the number of distinct users
 - sort the page groups by the number of distinct users in descending order
 - output the top K pages with the count of distinct users
 
Since this is a heavy and frequently executed query, it’s best to cache it:
curl -XGET 'localhost:9200/mystackoverflow/activity_log/_search?request_cache=true&pretty' -d '
{
   "size": 0,
  "query": {
      "filtered": {
         "query": {
              "match_all": {}
         },
         "filter": {
            "range" : {
                "created_at" : {
                    "gte" : "now-1000d/d",
                    "lt" :  "now/d"
                }
            }
         }
      }
   },
   "aggs": {
        "group_by_item": {
             "terms": { 
                  "field": "item_id",
                  "size": 2,
                  "order" : { "distinct_users" : "desc" }
              },
              "aggs" : {
                  "distinct_users" : {
                      "cardinality" : {
                        "field" : "user_id"
                      }
                  }
              }
        }
    }
}'
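To check the logic of this query without a cluster, the same steps (time window, group by item, count distinct users, sort descending, take the top K) can be sketched in plain Python over an in-memory list of events. The trending function, the sample events and the tie-break on item_id are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def trending(events, now, days, k):
    """Top-k item_ids by distinct users over the `days` whole days before today."""
    end = datetime(now.year, now.month, now.day)   # like "now/d": start of today
    start = end - timedelta(days=days)             # like "now-<days>d/d"
    users_per_item = defaultdict(set)
    for event in events:
        if start <= event["created_at"] < end:     # gte start, lt end
            users_per_item[event["item_id"]].add(event["user_id"])
    # Most distinct users first; ties broken by item_id to keep output stable.
    ranked = sorted(users_per_item.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(item_id, len(users)) for item_id, users in ranked[:k]]


now = datetime(2016, 3, 10, 12, 0)
events = [
    {"user_id": "ann", "item_id": "q1", "created_at": datetime(2016, 3, 9, 8, 0)},
    {"user_id": "ann", "item_id": "q1", "created_at": datetime(2016, 3, 9, 9, 0)},
    {"user_id": "bob", "item_id": "q1", "created_at": datetime(2016, 3, 8, 9, 0)},
    {"user_id": "bob", "item_id": "q2", "created_at": datetime(2016, 3, 9, 9, 0)},
    # Created today, so it falls outside the "lt now/d" bound.
    {"user_id": "cid", "item_id": "q3", "created_at": datetime(2016, 3, 10, 9, 0)},
]
print(trending(events, now, days=7, k=2))
```

One difference worth keeping in mind: this sketch counts distinct users exactly, while the cardinality aggregation returns an approximate count, so the numbers may differ slightly on large data sets.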