Even though Elasticsearch has the word “search” in its name, it’s more often used as a log analysis engine or a database than as a keyword search engine. In this blog post, I’ll explore its text search capabilities.

Text search can be very simple or highly complex, depending on the use case. Here are the use cases for text search, ordered from simple to highly complex:

  • Search in a database field containing a relatively small number of entries. For example, a list of company suppliers.

  • Search in a database containing a large number of strings, but the strings are just a few words. For example, search in a database of movie names.

  • Faceted search

  • Keyword search for a collection with a small number of documents.

  • Enterprise search (not solved)

  • Web search (very complex; it works because billions of queries are executed and implicitly judged by clicks)

Topics:

  • index different datatypes: numbers, dates, bools, enums, geopoints, geoshapes, range searches

bitsets and caches

"term": { "title": "brown" }   # contains term

range: from "a" to "m"

"range": { "content": { "gte": "a", "lt": "m" } }

"range": { "date": { "gte": "2014-01-01", "lt": "2041-02-01" } }

"range": { "date": { "gte": "now-1h" } }

"range": { "date": { "gte": "now-1h/h" } }   # cached version: the bound is rounded down to the hour
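To see why the rounded bound caches better, here is a minimal Python sketch of the date arithmetic. It only mimics the date math outside Elasticsearch; the function names are my own, not part of any client library.

```python
from datetime import datetime, timedelta


def one_hour_ago(now: datetime) -> datetime:
    """Rough analogue of "now-1h": the bound moves with every request."""
    return now - timedelta(hours=1)


def one_hour_ago_rounded(now: datetime) -> datetime:
    """Rough analogue of "now-1h/h": subtract one hour, then round down to the hour."""
    shifted = now - timedelta(hours=1)
    return shifted.replace(minute=0, second=0, microsecond=0)


# Two requests issued within the same hour.
first = datetime(2014, 1, 1, 10, 5, 12)
second = datetime(2014, 1, 1, 10, 47, 3)

print(one_hour_ago(first) == one_hour_ago(second))                  # False
print(one_hour_ago_rounded(first) == one_hour_ago_rounded(second))  # True
```

Both requests produce the same rounded bound, so a cached result for that filter can be reused until the hour changes.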

"bool": { "must": [ … ], "should": [ ], "must_not": [ ] }

"exists": { "field": "title" }

"bool": { "must": { "term": { "title": "rabbits" } }, "should": [ { "term": { "title": "quick" } }, { "term": { "content": "quick" } } ] }

minimum_should_match

_score is the Elasticsearch relevance score.

"should" is for relevance: each should clause that matches raises the _score.

_score of a bool query = sum(_score of each matching clause) * number of matching clauses / total number of clauses
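The formula is easier to check with numbers. Below is a small Python sketch of it as written in these notes; `bool_score` is a hypothetical helper, and the per-clause scores are made up. The coordination factor comes from the classic TF/IDF scoring, so newer Elasticsearch versions may compute bool scores differently.

```python
def bool_score(matching_scores, total_clauses):
    """Sum of the matching clause scores, scaled by the fraction of clauses that matched."""
    if total_clauses <= 0:
        raise ValueError("total_clauses must be positive")
    coord = len(matching_scores) / total_clauses
    return sum(matching_scores) * coord


# Two of four clauses matched, with (made-up) scores 1.0 and 2.0.
print(bool_score([1.0, 2.0], total_clauses=4))  # 1.5
```

A document that matches more of the clauses is rewarded twice: once through the larger sum and once through the larger fraction.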

"minimum_should_match": "75%"
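A percentage has to be turned into a whole number of clauses. As far as I know, Elasticsearch rounds the computed number down; the sketch below assumes exactly that, and `required_should_clauses` is a name I made up for illustration.

```python
def required_should_clauses(num_clauses: int, percent: int) -> int:
    """Number of optional clauses that must match for a positive percentage, rounded down."""
    return (num_clauses * percent) // 100


for n in (2, 3, 4, 8):
    print(n, "clauses ->", required_should_clauses(n, 75), "must match")
```

With "75%", a four-term query needs three matching terms, while a three-term query needs only two.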

{ "match": { "title": { "query": "KIUCK BRON FOXS!", "fuzziness": "AUTO" } } }
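With "AUTO", the allowed edit distance depends on the length of each term: terms of one or two characters must match exactly, terms of three to five characters may differ by one edit, and longer terms by two. Here is a sketch of that rule in Python. It assumes the query text has already been lowercased and stripped of punctuation by the analyzer, and `auto_fuzziness` is my own name for it.

```python
def auto_fuzziness(term: str) -> int:
    """Maximum edit distance allowed for a term under the default AUTO thresholds."""
    length = len(term)
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2


for term in ["kiuck", "bron", "foxs"]:
    print(term, "->", auto_fuzziness(term), "edit(s) allowed")
```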

https://www.elastic.co/webinars/elasticsearch-query-dsl

{ "match_phrase": { "title": { "query": "BROWN QUICK FOX!", "slop": 10 } } }

With a slop of 10, the phrase matches when the terms occur within a window of 10 words.

"bool": { "must": { }, "should": [ { }, { } ] }

dis_max query: combines information from multiple fields. _score = score of the best matching query + tie_breaker * (scores of the other matching queries). multi_match is the convenient way to write it.

"multi_match": { "query": "quick brown fox", "fields": [ "title", "content" ], "tie_breaker": 0.2 }
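The dis_max formula can be sketched in a few lines of Python. The per-field scores are made up, and `dis_max_score` is a hypothetical helper, not an Elasticsearch API.

```python
def dis_max_score(scores, tie_breaker=0.0):
    """Best score plus tie_breaker times the sum of the remaining scores."""
    if not scores:
        return 0.0
    best = max(scores)
    others = sum(scores) - best
    return best + tie_breaker * others


field_scores = [2.0, 1.0, 0.5]  # made-up per-field scores for one document
print(dis_max_score(field_scores, tie_breaker=0.0))  # 2.0: only the best field counts
print(dis_max_score(field_scores, tie_breaker=1.0))  # 3.5: plain sum of all fields
print(round(dis_max_score(field_scores, tie_breaker=0.2), 2))  # 2.3
```

A tie_breaker of 0 keeps only the best field, 1 turns the score into a plain sum, and values in between let the other fields break ties without dominating.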

"title": { "type": "string", "fields": { "stemmed": { "type": "string", "analyzer": "english" }, "autocomplete": { "type": "string", "analyzer": "edge_ngrams" } } }

title = [brown, fox, jumped]

title.stemmed = [brown, fox, jump]

title.autocomplete = [b, br, bro, brow, brown, f, fo, fox, j, ju, jum, jump, jumpe, jumped]

"multi_match": { "query": "quick brown fox", "fields": [ "title", "title.stemmed", "title.autocomplete" ], "type": "most_fields" }

Matching names, for example Reginald Kenneth Dwight:

(first:Reginald AND first:Kenneth AND first:Dwight) OR (middle:Reginald AND middle:Kenneth AND middle:Dwight) OR … The catch: term frequencies depend on the field, so the same name is scored differently depending on where it matches.

"first": { "type": "string", "copy_to": "full" }

"full": { "type": "string" }
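Conceptually, copy_to makes the values of several fields searchable through one combined field, so term statistics are computed once for the whole name. The Python sketch below only imitates that idea outside Elasticsearch. `apply_copy_to` is a hypothetical helper, and the "middle" and "last" mappings are my assumption; the notes only show "first".

```python
def apply_copy_to(doc: dict, copy_to: dict) -> dict:
    """Return the indexed view of doc: each field keeps its own value,
    and fields listed in copy_to also contribute to their target field."""
    indexed = {field: [value] for field, value in doc.items()}
    for source, target in copy_to.items():
        if source in doc:
            indexed.setdefault(target, []).append(doc[source])
    return indexed


doc = {"first": "Reginald", "middle": "Kenneth", "last": "Dwight"}
copy_to = {"first": "full", "middle": "full", "last": "full"}  # assumed mapping

indexed = apply_copy_to(doc, copy_to)
print(indexed["full"])  # ['Reginald', 'Kenneth', 'Dwight']
```

A query against "full" then sees all three names in one field, whichever field they were entered in.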

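Query context (query) vs. filter context (filter): a clause in query context contributes to the _score; a clause in filter context only answers yes or no, which is what makes it cacheable.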
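A request body that uses both contexts might be assembled like this in Python. This is a sketch: `search_body` is a made-up helper, and it assumes a version of Elasticsearch where the bool query accepts a "filter" clause.

```python
import json


def search_body(text, field, filters):
    """Build a bool query: the match clause is scored, the filters only include or exclude."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],  # query context: affects _score
                "filter": filters,                   # filter context: yes/no only
            }
        }
    }


body = search_body(
    "quick brown fox",
    "title",
    [
        {"range": {"date": {"gte": "now-1h/h"}}},
        {"exists": {"field": "title"}},
    ],
)
print(json.dumps(body, indent=2))
```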

edgeNGram only produces n-grams anchored at the beginning of a token; nGram produces them from every position in the token.
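The difference is easy to see with a toy implementation. The two Python functions below are my own sketches of the idea, not the Lucene tokenizers; the min/max gram parameters are illustrative.

```python
def edge_ngrams(token, min_gram=1, max_gram=None):
    """Prefixes of the token, from min_gram up to max_gram characters."""
    limit = len(token) if max_gram is None else min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, limit + 1)]


def ngrams(token, min_gram=1, max_gram=2):
    """Substrings of min_gram..max_gram characters starting at every position."""
    grams = []
    for start in range(len(token)):
        for n in range(min_gram, max_gram + 1):
            if start + n <= len(token):
                grams.append(token[start:start + n])
    return grams


print(edge_ngrams("fox"))  # ['f', 'fo', 'fox']
print(ngrams("fox"))       # ['f', 'fo', 'o', 'ox', 'x']
```

Edge n-grams suit search-as-you-type because users type words from the start; plain n-grams also match in the middle of a word, at the cost of a larger index.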

https://www.elastic.co/guide/en/elasticsearch/guide/current/_index_time_search_as_you_type.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html

https://www.elastic.co/blog/found-fuzzy-search

Custom stopwords and custom stemming exclusions:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ],
          "stopwords": [ "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with" ]
        }
      }
    }
  }
}
