ElasticSearch as a Database System
As far as NoSQL databases go, ElasticSearch is pretty advanced.
What does it support?
- Key value lookups
- Complex domain objects as JSON
- Denormalized objects
- Normalized objects
- Transactions at the document level (but no transactions spanning multiple documents)
- Arbitrary Filters (Search)
- Advanced grouping of results
- Joins are possible at the client side only (by issuing multiple queries from the client and joining there)
- Horizontally scalable (just add more servers when your data grows)
- SQL support via a plugin
- Often much faster than SQL databases for search and aggregation queries
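Client-side joins from the list above can be sketched as follows. This is a minimal illustration, assuming the two result lists have already been fetched by two separate queries; the function name and the sample documents are hypothetical.

```python
def join_questions_with_users(questions, users):
    """Attach the author's user document to each question (a client-side left join)."""
    users_by_id = {user["user_id"]: user for user in users}
    joined = []
    for question in questions:
        author = users_by_id.get(question["user_id"])  # None if the user is missing
        joined.append({**question, "user": author})
    return joined


# Hypothetical results of two separate queries (sample data, not real output)
questions = [
    {"title": "How do I reverse a list?", "user_id": "john_smith"},
    {"title": "What is a goroutine?", "user_id": "unknown_user"},
]
users = [{"user_id": "john_smith", "first_name": "John", "last_name": "Smith"}]

for row in join_questions_with_users(questions, users):
    print(row["title"], "->", row["user"])
```

Building a dictionary keyed by user_id keeps the join linear in the number of documents, which matters when the second query returns many users.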
The two most advanced features compared to other databases are search via the inverted index and grouping of the results. It turns out that to build a flexible analytics engine, those two properties are enough.
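To make the idea of an inverted index concrete, here is a toy sketch in Python. It is not how ElasticSearch is implemented internally (there is no analysis chain, scoring or on-disk format here); the sample documents and function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample documents: id -> text
DOCS = {
    1: "how to sort a list in go",
    2: "sort a vector in cplusplus",
    3: "go channels explained",
}


def build_inverted_index(docs):
    """Map each lowercased token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, *terms):
    """Return the sorted ids of documents containing every given term (AND semantics)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


index = build_inverted_index(DOCS)
print(search(index, "sort"))
print(search(index, "sort", "go"))
```

A lookup touches only the posting sets of the query terms rather than scanning every document, which is why arbitrary filters stay cheap as the data grows.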
The code at a glance
- Delete index
- Create index
- Create user mappings
- Create question mappings
- Create users script
- Create questions script
- List users
- List questions
- Number of questions for each keyword
- Trending questions
Example
Let’s say we are building a website like Stack Overflow. The main entities are:
- user
- question
- answer
- activity_log (users posting/answering and viewing questions)
- vote
- job_listing
The user entity can be represented as:
entity user {
  user_id: string
  first_name: string
  last_name: string
  location: string
  web_site: string
  about_me: string
  interests: [string]
  experience: list[{
    role: optional[string]
    company: optional[string]
  }]
  job_search_status: looking|notlooking
}
The question entity can be represented as:
entity question {
  title: string
  full_text: string
  created_at: string
  last_modified_at: string
  user_id: string
}
The answer entity can be represented as:
entity answer {
  full_text: string
  created_at: string
  last_modified_at: string
}
Implementation
Creating index and collection schemas
- Create an index and give it a name (e.g. mystackoverflow)
curl -XPUT 'localhost:9200/mystackoverflow' -d '
{
  "settings": {
    "index": {
      "number_of_shards": 10,
      "number_of_replicas": 1
    }
  }
}'
1.1. Verify the index
curl -XGET 'http://localhost:9200/mystackoverflow/'
1.2. Update the index with mappings for a user
curl -XPUT 'localhost:9200/mystackoverflow/_mapping/user' -d '
{
  "properties": {
    "first_name": { "type": "string", "analyzer": "standard" },
    "last_name": { "type": "string", "analyzer": "standard" },
    "location": { "type": "string", "analyzer": "standard" },
    "about_me": { "type": "string", "analyzer": "english" },
    "interests": { "type": "string", "index": "not_analyzed" },
    "created_at": { "type": "date", "format": "date_hour_minute_second_millis" }
  }
}'
1.3. Update the index with mappings for a question
- TODO: upgrade the schema (not possible on existing collections)
- TODO: adding referential integrity
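The question mapping is left open here. One possible shape is sketched below as a Python dictionary, assuming the same mapping syntax as the user mapping. The choice of analyzers is an assumption, and so is the tags field: the question entity does not list it, but the keyword aggregation needs a field of that kind.

```python
import json

# Hypothetical mapping for the question type. Field names follow the question
# entity; "tags" and the analyzer choices are assumptions, not the author's schema.
QUESTION_MAPPING = {
    "properties": {
        "title": {"type": "string", "analyzer": "english"},
        "full_text": {"type": "string", "analyzer": "english"},
        "tags": {"type": "string", "index": "not_analyzed"},
        "user_id": {"type": "string", "index": "not_analyzed"},
        "created_at": {"type": "date", "format": "date_hour_minute_second_millis"},
        "last_modified_at": {"type": "date", "format": "date_hour_minute_second_millis"},
    }
}


def mapping_body(mapping):
    """Serialize the mapping to the JSON string that would be sent as the request body."""
    return json.dumps(mapping, indent=2)


print(mapping_body(QUESTION_MAPPING))
```

The printed JSON could be sent with a PUT to localhost:9200/mystackoverflow/_mapping/question, in the same way as the user mapping.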
1.4. Create the activity log
class ActivityLog {
  user_id: string
  item_type: string         (question or answer)
  item_id: string           (id of the question or answer)
  created_at: date
  interaction_type: string  (create, update, view, upvote or downvote of a question or answer)
}
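An activity log entry could be assembled on the client before indexing, roughly as below. This is a sketch: the helper name make_activity is hypothetical, and the timestamp layout assumes the same date_hour_minute_second_millis format used in the user mapping.

```python
from datetime import datetime

ITEM_TYPES = {"question", "answer"}
INTERACTION_TYPES = {"create", "update", "view", "upvote", "downvote"}


def make_activity(user_id, item_type, item_id, interaction_type, created_at):
    """Build an activity_log document, rejecting unknown item or interaction types."""
    if item_type not in ITEM_TYPES:
        raise ValueError(f"unknown item_type: {item_type}")
    if interaction_type not in INTERACTION_TYPES:
        raise ValueError(f"unknown interaction_type: {interaction_type}")
    # Format as yyyy-MM-ddTHH:mm:ss.SSS (milliseconds, no time zone)
    millis = created_at.microsecond // 1000
    timestamp = created_at.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}"
    return {
        "user_id": user_id,
        "item_type": item_type,
        "item_id": item_id,
        "interaction_type": interaction_type,
        "created_at": timestamp,
    }


print(make_activity("john_smith", "question", "q1", "view",
                    datetime(2016, 1, 2, 3, 4, 5, 678000)))
```

Validating the two enumerated fields on the client keeps misspelled values out of the index, where they would otherwise show up as extra buckets in aggregations.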
Adding data
For example, adding a user:
curl -XPUT 'localhost:9200/mystackoverflow/user/john_smith?pretty' -d '
{
  "first_name": "John",
  "last_name": "Smith",
  "about": "I love to go rock climbing",
  "interests": [ "cplusplus", "go" ]
}'
Updating data
For example, update the about section of a user:
# notice -XPOST instead of -XPUT
curl -XPOST 'localhost:9200/mystackoverflow/user/john_smith/_update?pretty' -d '
{
  "doc": {
    "about": "I love to go rock climbing and kayaking"
  }
}'
Update the keywords (interests) of a user:
# do not put duplicates in the list
curl -XPOST 'localhost:9200/mystackoverflow/user/john_smith/_update?pretty' -d '
{
  "script": "ctx._source.interests = (ctx._source.interests + new_item).unique( false )",
  "params": {
    "new_item": "java"
  }
}'
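The effect of the update script can be reproduced in plain Python to check the expected result before running it against the index. This is a sketch of the same logic (append the new item, then drop duplicates while keeping the first occurrence), not the script engine itself; add_unique is a hypothetical name.

```python
def add_unique(interests, new_item):
    """Return a new list with new_item appended and duplicates removed, order preserved."""
    combined = interests + [new_item]
    # dict.fromkeys keeps the first occurrence of each value in insertion order
    return list(dict.fromkeys(combined))


print(add_unique(["cplusplus", "go"], "java"))
print(add_unique(["cplusplus", "go"], "go"))
```

Like the script, the function leaves the input list untouched and returns a new one.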
Queries with aggregations
Number of questions per keyword
curl -XGET 'localhost:9200/mystackoverflow/question/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "group_by_tag": {
      "terms": { "field": "tags" }
    }
  }
}'
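What the terms aggregation computes can be imitated in a few lines of Python, which helps when checking the buckets returned by the query. This is a simplified in-memory sketch: the sample questions are made up, and ties are broken alphabetically here only to make the output deterministic.

```python
from collections import Counter


def terms_aggregation(docs, field, size=10):
    """Count documents per distinct value of `field`, most frequent first."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field, [])
        if not isinstance(values, list):
            values = [values]
        counts.update(set(values))  # a document counts once per distinct value
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]


# Hypothetical question documents
questions = [
    {"title": "Sorting in Go", "tags": ["go", "sorting"]},
    {"title": "Go channels", "tags": ["go"]},
    {"title": "Sorting a vector", "tags": ["cplusplus", "sorting"]},
    {"title": "Untagged question"},
]

print(terms_aggregation(questions, "tags"))
```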
Find Hot (Trending) Items with ElasticSearch
The query has the following behavior:
- analyze the activity in a time window (from now - X days to now)
- group the activity by page (the item_id of the question or answer)
- for each group (bucket) per page, calculate the number of distinct users
- sort the page groups by the number of distinct users in descending order
- output the top K pages with the count of distinct users
Since this is a heavy and frequently executed query, it’s best to cache it:
curl -XGET 'localhost:9200/mystackoverflow/activity_log/_search?request_cache=true&pretty' -d '
{
  "size": 0,
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "range": {
          "created_at": { "gte": "now-1000d/d", "lt": "now/d" }
        }
      }
    }
  },
  "aggs": {
    "group_by_item": {
      "terms": {
        "field": "item_id",
        "size": 2,
        "order": { "distinct_users": "desc" }
      },
      "aggs": {
        "distinct_users": {
          "cardinality": { "field": "user_id" }
        }
      }
    }
  }
}'
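The same steps can be written out in Python against an in-memory list, which is useful for sanity-checking the query on a small sample. This is a sketch under a few assumptions: the function and sample data are hypothetical, both window bounds are rounded down to midnight to approximate the /d rounding, and the distinct count is exact here, whereas the cardinality aggregation returns an approximate count.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def trending_items(activities, now, window_days, top_k):
    """Return the top_k (item_id, distinct_user_count) pairs within the time window."""
    # Window: [midnight today - window_days, midnight today)
    end = now.replace(hour=0, minute=0, second=0, microsecond=0)
    start = end - timedelta(days=window_days)
    users_per_item = defaultdict(set)
    for activity in activities:
        if start <= activity["created_at"] < end:
            users_per_item[activity["item_id"]].add(activity["user_id"])
    # Sort by distinct users, descending; item_id breaks ties for determinism
    ranked = sorted(users_per_item.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(item_id, len(users)) for item_id, users in ranked[:top_k]]


# Hypothetical activity log entries
activities = [
    {"user_id": "u1", "item_id": "q1", "created_at": datetime(2016, 6, 10, 9, 0)},
    {"user_id": "u2", "item_id": "q1", "created_at": datetime(2016, 6, 11, 9, 0)},
    {"user_id": "u1", "item_id": "q1", "created_at": datetime(2016, 6, 12, 9, 0)},
    {"user_id": "u1", "item_id": "q2", "created_at": datetime(2016, 6, 12, 9, 0)},
    {"user_id": "u3", "item_id": "q3", "created_at": datetime(2016, 6, 15, 10, 0)},
    {"user_id": "u3", "item_id": "q2", "created_at": datetime(2015, 1, 1, 9, 0)},
]

print(trending_items(activities, datetime(2016, 6, 15, 12, 0), window_days=7, top_k=2))
```

Repeated views by the same user count once per item, so the ranking reflects how many different people touched an item rather than raw traffic.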