As far as NoSQL databases go, Elasticsearch is pretty advanced.

What does it support?

Key-value lookups
Complex domain objects stored as JSON
Denormalized objects
Normalized objects
Transactions at the document level (but no transactions spanning multiple documents)
Arbitrary filters (search)
Advanced grouping of results (aggregations)
Joins on the client side only (by issuing multiple queries from the client and joining the results there)
Horizontal scalability (just add more servers as your data grows)
SQL support via a plugin
Often much faster than SQL databases for search and aggregation queries
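The client-side join from the list above can be sketched in a few lines of Python. This is only an illustration: the dicts stand in for the `_source` of hits returned by two separate searches, and the function name and sample data are made up for the example.

```python
# Client-side join: attach each question's author after two separate queries.
# The lists below stand in for the "_source" of hits from a question search
# and a user search; all names and values are illustrative.

def join_questions_with_users(questions, users):
    """Attach the matching user document to each question by user_id."""
    users_by_id = {u["user_id"]: u for u in users}  # index users for O(1) lookup
    joined = []
    for q in questions:
        author = users_by_id.get(q["user_id"])  # None when the user was not fetched
        joined.append({**q, "author": author})
    return joined


questions = [
    {"title": "How do I shard an index?", "user_id": "john_smith"},
    {"title": "What is an analyzer?", "user_id": "jane_doe"},
]
users = [{"user_id": "john_smith", "first_name": "John", "last_name": "Smith"}]

joined = join_questions_with_users(questions, users)
```

In practice the second query would fetch only the user IDs collected from the first result set, so the join costs two round trips regardless of the number of questions.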

The two most advanced features compared to other databases are search via the inverted index and grouping of the results. It turns out that to build a flexible analytics engine, those two properties are enough.
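To make those two properties concrete, here is a toy sketch of an inverted index plus grouping in plain Python. It is not how Elasticsearch is implemented internally (there is no analysis, scoring, or sharding here); the documents and function names are invented for the illustration.

```python
from collections import Counter, defaultdict

# Three tiny illustrative documents: free text plus a list of tags.
docs = {
    1: {"text": "rock climbing in go", "tags": ["go"]},
    2: {"text": "kayaking and climbing", "tags": ["java", "go"]},
    3: {"text": "template tricks", "tags": ["cplusplus"]},
}


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in doc["text"].lower().split():
            index[term].add(doc_id)
    return index


def search_and_group(index, docs, term):
    """Find documents containing `term`, then count the hits per tag."""
    hits = sorted(index.get(term, set()))
    counts = Counter(tag for doc_id in hits for tag in docs[doc_id]["tags"])
    return hits, counts


index = build_inverted_index(docs)
hits, tag_counts = search_and_group(index, docs, "climbing")
```

Searching is a dictionary lookup on the term, and grouping is a count over the matching documents only, which is roughly why the combination works well for analytics.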

The code at a glance

Project on GitHub

Example

Let’s say we are building a website like Stack Overflow. The main entities are:

user
question
answer
activity_log (users posting/answering and viewing questions)
vote
job_listing

The user entity can be represented as:

entity user{
  user_id: string
  first_name: string
  last_name: string
  location: string
  web_site: string
  about_me: string
  interests: [string]
  experience: [
    { role: optional[string]
      company: optional[string]
    }]
  job_search_status: looking|notlooking
}

The question entity can be represented as:

entity question{
  title: string
  full_text: string
  created_at: string
  last_modified_at: string
  user_id: string
}

The answer entity can be represented as:

entity answer{
  full_text: string
  created_at: string
  last_modified_at: string
}

Implementation

Creating index and collection schemas

1. Create an index and give it a name (e.g. mystackoverflow)

curl -XPUT 'localhost:9200/mystackoverflow' -d '
{
    "settings" : {
        "index" : {
            "number_of_shards" : 10,
            "number_of_replicas" : 1
        }
    }
}
'
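The same request can be prepared from Python using only the standard library. This is a minimal sketch: the helper name and its defaults are assumptions for illustration, and the request is built but not sent, so it runs without a cluster.

```python
import json
import urllib.request


def build_create_index_request(index_name, shards=10, replicas=1,
                               host="http://localhost:9200"):
    """Build (but do not send) the PUT request that creates an index."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }
    return urllib.request.Request(
        url=f"{host}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_index_request("mystackoverflow")
# To actually create the index against a running node:
#     urllib.request.urlopen(request)
```

Keeping the body as a dict makes it easy to vary the shard and replica counts per environment.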

1.1. Verify the index

curl -XGET 'http://localhost:9200/mystackoverflow/'

1.2. Update the index with mappings for a user

curl -XPUT 'localhost:9200/mystackoverflow/_mapping/user' -d '
{
  "properties": {
    "first_name": {
      "type": "string",
      "analyzer": "standard"
    },
    "last_name": {
      "type": "string",
      "analyzer": "standard"
    },
    "location": {
      "type": "string",
      "analyzer": "standard"
    },
    "about_me": {
      "type": "string",
      "analyzer": "english"
    },

    "interests": {
      "type": "string",
      "index": "not_analyzed"
    },

    "created_at": {
      "type": "date",
      "format": "date_hour_minute_second_millis"
    }
  }
}
'

1.3. Update the index with mappings for a question
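One possible mapping for the question entity, following the same conventions as the user mapping, is sketched below as a Python dict. The analyzer choices and the `not_analyzed` setting for user_id are assumptions, not something taken from the project; the body would be sent with a PUT to localhost:9200/mystackoverflow/_mapping/question in the same way as the user mapping.

```python
import json

DATE_FORMAT = "date_hour_minute_second_millis"  # same format as the user mapping


def question_mapping():
    """Hypothetical mapping body for the question entity."""
    return {
        "properties": {
            # Free text: analyzed with the english analyzer (an assumption).
            "title": {"type": "string", "analyzer": "english"},
            "full_text": {"type": "string", "analyzer": "english"},
            # Identifier: stored as a single exact term, like "interests".
            "user_id": {"type": "string", "index": "not_analyzed"},
            "created_at": {"type": "date", "format": DATE_FORMAT},
            "last_modified_at": {"type": "date", "format": DATE_FORMAT},
        }
    }


mapping_body = json.dumps(question_mapping(), indent=2)
```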

TODO: upgrade the schema (not possible on existing collections)
TODO: adding referential integrity

1.4. Create the activity log

entity activity_log{
  user_id: string
  item_type: question|answer
  item_id: string (ID of the question or answer)
  created_at: date
  interaction_type: create|update|view|upvote|downvote (applied to a question or answer)
}
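A small helper can build and validate activity log documents before they are indexed. This is a sketch under assumptions: the helper name is made up, the item ID "q42" is a placeholder, and the timestamp is formatted as yyyy-MM-dd'T'HH:mm:ss.SSS on the assumption that activity_log uses the same date format as the user mapping.

```python
from datetime import datetime

ITEM_TYPES = {"question", "answer"}
INTERACTION_TYPES = {"create", "update", "view", "upvote", "downvote"}


def make_activity(user_id, item_type, item_id, interaction_type, created_at):
    """Return an activity_log document, rejecting unknown enum values."""
    if item_type not in ITEM_TYPES:
        raise ValueError(f"unknown item_type: {item_type!r}")
    if interaction_type not in INTERACTION_TYPES:
        raise ValueError(f"unknown interaction_type: {interaction_type!r}")
    millis = created_at.microsecond // 1000  # truncate to milliseconds
    timestamp = created_at.strftime("%Y-%m-%dT%H:%M:%S") + f".{millis:03d}"
    return {
        "user_id": user_id,
        "item_type": item_type,
        "item_id": item_id,
        "created_at": timestamp,
        "interaction_type": interaction_type,
    }


event = make_activity("john_smith", "question", "q42", "view",
                      datetime(2016, 1, 2, 3, 4, 5, 678000))
```

Validating on the client matters here because Elasticsearch would index a misspelled interaction_type without complaint.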

Adding data

For example, adding a user:

curl -XPUT 'localhost:9200/mystackoverflow/user/john_smith?pretty' -d '
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "about_me" :   "I love to go rock climbing",
    "interests": [ "cplusplus", "go" ]
}
'

Updating data

For example, updating the about_me field of a user:

# Note -XPOST here, as opposed to -XPUT when adding the document
curl -XPOST 'localhost:9200/mystackoverflow/user/john_smith/_update?pretty' -d '
{
  "doc": {
        "about_me" :   "I love to go rock climbing and kayaking"
    }
}
'

Update the interests of a user:

# The script appends the new item and removes duplicates from the list
curl -XPOST 'localhost:9200/mystackoverflow/user/john_smith/_update?pretty' -d '
{
   "script" : "ctx._source.interests = (ctx._source.interests + new_item).unique( false )",
   "params" : {
      "new_item" : "java"
   }
}
'
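What the update script does to the interests list can be mirrored in plain Python: append the new item, then drop duplicates while keeping the first occurrence of each value. The function name is made up for the illustration; the real update runs server-side inside Elasticsearch.

```python
def add_unique(interests, new_item):
    """Append new_item and remove duplicates, preserving the original order."""
    combined = interests + [new_item]
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    return list(dict.fromkeys(combined))


updated = add_unique(["cplusplus", "go"], "java")
unchanged = add_unique(["cplusplus", "go"], "go")
```

Running the update twice with the same new_item therefore leaves the list unchanged, which makes the operation safe to retry.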

Queries with aggregations

Number of questions per keyword

curl -XGET 'localhost:9200/mystackoverflow/question/_search?pretty' -d '
{
   "size": 0,
   "aggs": {
        "group_by_tag": {
            "terms": { "field": "tags" }
        }
    }
}'
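The counts come back as buckets under the aggregation's name in the `aggregations` section of the response. A minimal sketch of reading them from Python follows; the sample response is hand-written for illustration and is not output from a real cluster.

```python
def tag_counts(response, agg_name="group_by_tag"):
    """Turn the terms-aggregation buckets into a {tag: doc_count} dict."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hand-written example of the response shape (values are illustrative).
sample_response = {
    "hits": {"total": 3, "hits": []},  # empty because the query used size 0
    "aggregations": {
        "group_by_tag": {
            "buckets": [
                {"key": "go", "doc_count": 2},
                {"key": "java", "doc_count": 1},
            ]
        }
    },
}

counts = tag_counts(sample_response)
```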

Top K items by number of distinct users

The following query has this behavior:

analyze the activity in a time window (from now - X days to now)
group the activity by item (the page of a question or answer)
for each group (bucket), calculate the number of distinct users
sort the groups by the number of distinct users in descending order
output the top K items with their count of distinct users

Since this is a heavy and frequently executed query, it’s best to cache it:

curl -XGET 'localhost:9200/mystackoverflow/activity_log/_search?request_cache=true&pretty' -d '
{
   "size": 0,

  "query": {
      "filtered": {
         "query": {
              "match_all": {}
         },
         "filter": {
            "range" : {
                "created_at" : {
                    "gte" : "now-1000d/d",
                    "lt" :  "now/d"
                }
            }
         }
      }
   },

   "aggs": {
        "group_by_item": {
             "terms": { 
                  "field": "item_id",
                  "size": 2,
                  "order" : { "distinct_users" : "desc" }
              },
              "aggs" : {
                  "distinct_users" : {
                      "cardinality" : {
                        "field" : "user_id"
                      }
                  }
              }
        }
    }
}'
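The steps this query performs can be mirrored in plain Python over an in-memory list of events, which is handy for checking expectations on a small data set. The function name and the sample events are invented for the sketch. One difference to keep in mind: the cardinality aggregation returns an approximate distinct count on large data, while this version is exact.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def top_items_by_distinct_users(events, now, days, k):
    """Top k items by distinct users within [now-days/d, now/d)."""
    end = now.replace(hour=0, minute=0, second=0, microsecond=0)  # "now/d"
    start = end - timedelta(days=days)                            # "now-Xd/d"
    users_per_item = defaultdict(set)
    for event in events:
        if start <= event["created_at"] < end:                    # range filter
            users_per_item[event["item_id"]].add(event["user_id"])
    # Order buckets by distinct user count, descending, and keep the top k.
    ranked = sorted(users_per_item.items(),
                    key=lambda item: len(item[1]), reverse=True)
    return [(item_id, len(users)) for item_id, users in ranked[:k]]


events = [
    {"item_id": "q1", "user_id": "u1", "created_at": datetime(2016, 6, 14, 9)},
    {"item_id": "q1", "user_id": "u2", "created_at": datetime(2016, 6, 13, 9)},
    {"item_id": "q1", "user_id": "u1", "created_at": datetime(2016, 6, 12, 9)},
    {"item_id": "q2", "user_id": "u1", "created_at": datetime(2016, 6, 14, 9)},
    {"item_id": "q3", "user_id": "u3", "created_at": datetime(2016, 6, 15, 10)},
    {"item_id": "q2", "user_id": "u9", "created_at": datetime(2010, 1, 1, 9)},
]

top = top_items_by_distinct_users(events, now=datetime(2016, 6, 15, 12),
                                  days=30, k=2)
```

Because the upper bound is rounded down to the start of the current day, events from today (q3 above) are excluded, just as with "lt": "now/d" in the query.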