Elasticsearch Basics: searching, indexing and full text search

backend | 7 min read | 12 Aug 20

Quickly set up and test Elasticsearch

You can run the Docker command below to start a single-node Elasticsearch instance.

docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.8.1

You can run the curl command below to check that Elasticsearch is running; it should return the node details.

curl -X GET "localhost:9200/_cat/nodes?v&pretty"

Indexing and Searching

To search in Elasticsearch, you first have to index your data into it. The source of that data could be anywhere, e.g. PostgreSQL, MySQL etc. Elasticsearch provides REST APIs to index and search data.

An index is like a table: it holds a collection of documents in a form optimized for search. A document is like a row or record in that table. For example, the index user-index contains a list of user documents.

[Figure: Elasticsearch index flow]

Indexing APIs

There are two ways to index data in Elasticsearch: one is the index API and the other is the bulk API.

Using the index API, you can add or update a single document in an index. By default, the index is created automatically if it does not exist. To know more about the index API, see the Index API link in the Reference section.

For example, the REST call below creates a user document with id 1 in the users index.

POST users/_doc/1
{
  "email": "vikash.kumar",
  "age": 26,
  "about": "love writing code, badminton lover"
}

Using the bulk API, you can create, update and delete multiple documents in a single REST API call. To know more about the bulk API, see the Bulk API link in the Reference section.

For example, the call below creates two users with id 1 and 2 in the users index.

POST _bulk
{ "index" : { "_index" : "users", "_id" : "1" } }
{ "email": "vikash.kumar", "age": 26, "about": "programmer, blogger, badminton lover"}
{ "index" : { "_index" : "users", "_id" : "2" } }
{ "email": "john.wick", "age": 31, "about": "blogging, guitar player"}
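
If you build the bulk request from application code, the body is newline-delimited JSON: one action line followed by one document line, and the whole body has to end with a newline. Here is a minimal Python sketch that builds this body for the two users above. The helper name build_bulk_body is made up for illustration, only the standard library is used, and nothing is actually sent to Elasticsearch.

```python
import json


def build_bulk_body(index_name, docs_by_id):
    """Build the newline-delimited JSON body for an "index" bulk request."""
    lines = []
    for doc_id, source in docs_by_id.items():
        # Action line: which index and which document id to write.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk body must be terminated by a newline.
    return "\n".join(lines) + "\n"


users = {
    "1": {"email": "vikash.kumar", "age": 26, "about": "programmer, blogger, badminton lover"},
    "2": {"email": "john.wick", "age": 31, "about": "blogging, guitar player"},
}
bulk_body = build_bulk_body("users", users)
print(bulk_body, end="")
```

The resulting string can be sent as the body of POST _bulk with the Content-Type header set to application/x-ndjson.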

Search APIs

Elasticsearch provides an extensive REST API to search documents.

  • You can search for documents in a single index, multiple indices or all indices.
  • You can search documents by a single field or by multiple fields.
  • You can search documents by exact keyword using a term query, or by full text using a match query.

You will see more about the full text search capabilities in the sections below.

Elasticsearch provides two syntaxes for queries: one uses the q query parameter and the other uses the request body.

The q query parameter is less advanced and does not support the full Elasticsearch query DSL, so the request body syntax is recommended. To know more about the search API, see the Search API link in the Reference section.

Query using the q query parameter:

GET users/_search?q=age:26

Query using the request body:

GET users/_search
{
 "query": {
   "term": {
     "age": 26
   }
 }
}
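
Both syntaxes can also be produced from code. The sketch below, in Python with only the standard library, builds the URL for the q syntax and the bodies for a term query and a match query. The helper names are made up for illustration and no request is sent.

```python
import json
from urllib.parse import urlencode


def q_search_path(index_name, field, value):
    """Path and query string for the q syntax, e.g. users/_search?q=age:26."""
    # urlencode percent-encodes the colon (":" becomes "%3A"), which is equivalent.
    return f"/{index_name}/_search?" + urlencode({"q": f"{field}:{value}"})


def term_query_body(field, value):
    """Request body for an exact-value (term) search."""
    return {"query": {"term": {field: value}}}


def match_query_body(field, text):
    """Request body for a full text (match) search."""
    return {"query": {"match": {field: text}}}


print(q_search_path("users", "age", 26))
print(json.dumps(term_query_body("age", 26), indent=2))
print(json.dumps(match_query_body("about", "badminton"), indent=2))
```

Keeping the query as a plain dictionary makes it easy to add more clauses later, which is one practical reason to prefer the request body syntax.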

Mapping

Fields in a document can be quite different and serve different purposes. For example, you might want to do exact search on the user email field but full text search on the user description field. Elasticsearch indexes each field according to its type and intended usage.

For example, user email will be indexed for exact search while user description will be indexed for full text search.

Telling Elasticsearch about your data types is called mapping. You know your data much better than Elasticsearch does, but it still does its best and automatically (implicitly) creates a mapping for you, as per the sample table below. Automatically detected whole numbers are mapped to long; the smaller numeric types have to be chosen explicitly. You can learn more about mapping from the Mapping link in the Reference section.

Data Type    Elasticsearch Mapping Type
---------    --------------------------
String       text and keyword
Numeric      long, integer, short, byte etc.
Boolean      boolean
Date         date
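
To get a feel for how such an implicit mapping could be derived, here is a rough Python approximation. It is a simplified sketch and not Elasticsearch's actual logic: the function names are made up, only flat documents are handled, and date detection is reduced to strings that parse as an ISO date, while the real date detection supports more formats. The active and joined fields are hypothetical additions to the user document.

```python
from datetime import date


def guess_field_mapping(value):
    """Guess a mapping for one JSON value, roughly like an implicit mapping would."""
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        try:
            date.fromisoformat(value)  # simplified date detection
            return {"type": "date"}
        except ValueError:
            # Strings get both a text mapping and a keyword sub-field.
            return {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            }
    # Arrays, nested objects and null are out of scope for this sketch.
    raise TypeError(f"unsupported value: {value!r}")


def guess_mapping(document):
    return {"properties": {name: guess_field_mapping(v) for name, v in document.items()}}


user = {"email": "vikash.kumar", "age": 26, "active": True, "joined": "2020-08-12"}
print(guess_mapping(user))
```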

You can check the current mapping of a given index using the request below. As the response shows, each string field (for example email) has been mapped to two data types, text and keyword.

GET /users/_mapping

Response:

{
  "users" : {
    "mappings" : {
      "properties" : {
        "about" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "age" : {
          "type" : "long"
        },
        "email" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

Explicit Mapping

Since you know your data best, it is better to create the mapping explicitly instead of letting Elasticsearch create it implicitly.

Elasticsearch implicitly maps a string to two different types, keyword and text, so that you can do both exact search and full text search on the same string field. But if you are sure that you will use a string field only as keyword or only as text, you can explicitly create the mapping for your index.
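
To see why the distinction matters, here is a toy Python model of the terms each type would put into the index. It is only an illustration: the regular expression is a crude stand-in for a real analyzer, and the function names are made up.

```python
import re


def keyword_terms(value):
    """keyword: the whole string is indexed as one term, unchanged."""
    return {value}


def text_terms(value):
    """text: the string is split into lowercase word tokens."""
    return set(re.findall(r"\w+", value.lower()))


about = "programmer, blogger, badminton lover"
print("blogger" in keyword_terms(about))  # False: only the full string matches
print("blogger" in text_terms(about))     # True: individual words match
print(about in keyword_terms(about))      # True: exact match on the whole value
```

So a field you only ever look up by its exact value (like email) fits keyword, while a field you search by individual words (like about) fits text.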

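For example, to map user email as keyword and about as text, you can use the create index API below. The mapping of an existing field cannot be changed, so if the users index already exists from the earlier examples, delete it first (DELETE /users) and index your documents again afterwards.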

PUT /users
{
  "mappings": {
    "properties": {
      "age":    { "type": "integer" },
      "email":  { "type": "keyword"  },
      "about":   { "type": "text"  }
    }
  }
}

Now you can recheck your mapping as shown below.

GET users/_mapping

Response:

{
  "users" : {
    "mappings" : {
      "properties" : {
        "about" : {
          "type" : "text"
        },
        "age" : {
          "type" : "integer"
        },
        "email" : {
          "type" : "keyword"
        }
      }
    }
  }
}

Full Text Search: language and analyzer

When it comes to full text search, the following terms come up:

Stemming

Stemming is the process of reducing a word to its root form. Multiple words can be reduced to a single root word, and only the root word is stored in the index.

For example, walking and walked can be stemmed to a single root word: walk.

Stemming is language-dependent, and its quality depends on the stemming rules or dictionary available for that language.
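
The idea can be sketched with a toy suffix-stripping stemmer in Python. This is a deliberately naive illustration with a made-up function name and a hand-picked suffix list; real stemmers, such as the Porter algorithm, apply far more careful rules.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def toy_stem(word):
    """Strip one common English suffix to approximate the root word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three letters so short words such as "is" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("walking", "walked", "walks", "walk"):
    print(w, "->", toy_stem(w))  # every line ends in "walk"
```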

Stopwords

Stopwords are very common words that are filtered out during analysis. These words are so common that they are not added to the index.

For example, the, an and is are some English stopwords.

Stopword lists are language-dependent, and their usefulness depends on the quality of the list for that language.
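
In code, stopword removal is just a filter over the tokens. Here is a minimal Python sketch; the stopword set is a small illustrative subset chosen for this example, and real stopword lists are longer.

```python
# A small illustrative subset; real stopword lists are longer.
STOPWORDS = {"a", "an", "the", "is", "and", "of", "to"}


def remove_stopwords(tokens):
    """Drop tokens that are stopwords, comparing case-insensitively."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


tokens = "The fox is an animal".split()
print(remove_stopwords(tokens))  # ['fox', 'animal']
```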

Synonyms

Synonyms are alternate words that have the same meaning as the original word.

For example, cheating is a synonym of fraud. A search for cheating should also match records that contain fraud, and vice versa.

Synonyms are language-dependent, and their usefulness depends on the quality of the synonym dictionary for that language.
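
One simple way to picture synonym handling is to expand the search term into its synonym group before matching. The Python sketch below does exactly that; the synonym dictionary and the function names are hypothetical, and it does not show how Elasticsearch implements synonyms internally.

```python
SYNONYM_GROUPS = [{"cheating", "fraud"}]  # hypothetical synonym dictionary


def expand(term):
    """Return the term together with all of its known synonyms."""
    expanded = {term}
    for group in SYNONYM_GROUPS:
        if term in group:
            expanded |= group
    return expanded


def matches(query_term, document_tokens):
    """True if the document contains the term or one of its synonyms."""
    lowered = (t.lower() for t in document_tokens)
    return not expand(query_term.lower()).isdisjoint(lowered)


print(matches("cheating", ["bank", "fraud", "reported"]))  # True
print(matches("fraud", ["caught", "cheating"]))            # True
print(matches("cheating", ["bank", "holiday"]))            # False
```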

Score and Ranking

Elasticsearch assigns a score to each document matched by a given query; documents with a higher score are ranked first.

A document's score depends on many factors, like:

  • the number of query terms matched in the document field.
  • the order of the query terms matched in the document field (this matters for phrase-style queries).
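
As a rough intuition for the first factor, the toy Python scorer below counts how many distinct query terms appear in a field and sorts documents by that count. This is a heavy simplification with made-up function names: Elasticsearch's default scoring (BM25) also weighs how rare a term is and how long the field is.

```python
import re


def tokens(text):
    """Lowercase word tokens, a crude stand-in for real analysis."""
    return re.findall(r"\w+", text.lower())


def toy_score(query, field_text):
    """Number of distinct query terms that occur in the field."""
    field_terms = set(tokens(field_text))
    return sum(1 for term in set(tokens(query)) if term in field_terms)


def rank(query, docs):
    """Return (doc_id, score) pairs for matching documents, best first."""
    scored = [(doc_id, toy_score(query, text)) for doc_id, text in docs.items()]
    matched = [item for item in scored if item[1] > 0]
    return sorted(matched, key=lambda item: item[1], reverse=True)


about_by_id = {
    "1": "programmer, blogger, badminton lover",
    "2": "blogging, guitar player",
}
print(rank("badminton lover guitar", about_by_id))  # [('1', 2), ('2', 1)]
```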

Analyzer

While indexing, a field of type text is analyzed by one of the text analyzers. By default, Elasticsearch uses the standard analyzer.

Elasticsearch has lots of built-in analyzers. Some of them are:

  • Standard Analyzer
  • Simple Analyzer
  • Stop Analyzer
  • Keyword Analyzer
  • Language Analyzers (english, french etc.)

For example, you can specify english as the analyzer for the about field of the users index when the index is created.

PUT /users
{
  "mappings": {
    "properties": {
      "age":    { "type": "integer" },
      "email":  { "type": "keyword"  },
      "about":   { "type": "text", "analyzer": "english"  }
    }
  }
}

You can also use the _analyze API to quickly test how different Elasticsearch analyzers process a given text. For example:

POST _analyze
{
  "analyzer": "english",
  "text": "The 2 QUICK Brown-Foxes is jumped over the lazy dog's bone."
}

Response (trimmed to the token values):

{
  "tokens" : [
    {
      "token" : "2"
    },
    {
      "token" : "quick"
    },
    {
      "token" : "brown"
    },
    {
      "token" : "fox"
    },
    {
      "token" : "jump"
    },
    {
      "token" : "over"
    },
    {
      "token" : "lazi"
    },
    {
      "token" : "dog"
    },
    {
      "token" : "bone"
    }
  ]
}

As you can see above, the words the and is are stopwords and have been removed during analysis. The words foxes, jumped and lazy have been stemmed to the root words fox, jump and lazi respectively.

Reference

Index API
Bulk API
Search API
Mapping
