Elasticsearch Basics: searching, indexing and full text search
Quickly set up and test Elasticsearch
You can run the Docker command below to start a single-node Elasticsearch instance.
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.8.1
You can run the curl command below to check that Elasticsearch is running; it should return the node details.
curl -X GET "localhost:9200/_cat/nodes?v&pretty"
Indexing and Searching
To search in Elasticsearch, you first have to index your data into it. The source of your data could be anywhere, e.g. Postgres, MySQL, etc. Elasticsearch provides REST APIs to index and search data.
An index is like a table: it holds a collection of documents in an optimized way, and a document is like a row or record in that table. For example, the index user-index contains a list of user documents.
Indexing APIs
There are two ways to index data in Elasticsearch: one is the index API and the other is the bulk API.
Using the index API, you can add or update one document in an index. By default, the index is created automatically if it does not exist. To learn more, see the index API documentation.
For example, the REST call below creates a user document with id 1 in the users index.
POST users/_doc/1
{
"email": "vikash.kumar",
"age": 26,
"about": "love writing code, badminton lover"
}
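If you prefer to call the index API from code rather than from a REST console, the same request could be built roughly like this. It is a minimal sketch using only the Python standard library; the base URL assumes the local single-node setup from the beginning of this post, and build_index_request is a hypothetical helper name, not part of any Elasticsearch client.

```python
import json
import urllib.request

# Assumption: the single-node container started earlier listens on localhost:9200.
ES_URL = "http://localhost:9200"


def build_index_request(index, doc_id, document, base_url=ES_URL):
    """Build (but do not send) the HTTP request for POST <index>/_doc/<id>."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc/{doc_id}",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request(
    "users",
    1,
    {"email": "vikash.kumar", "age": 26, "about": "love writing code, badminton lover"},
)

# To actually send it against a running node:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The request is only built here, so the snippet runs without a node; uncomment the last lines to send it.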
Using the bulk API, you can create, update and delete multiple documents in a single REST call. To learn more, see the bulk API documentation.
For example, the call below creates two users with id 1 and 2 in the users index.
POST _bulk
{ "index" : { "_index" : "users", "_id" : "1" } }
{ "email": "vikash.kumar", "age": 26, "about": "programmer, blogger, badminton lover"}
{ "index" : { "_index" : "users", "_id" : "2" } }
{ "email": "john.wick", "age": 31, "about": "blogging, guitar player"}
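The bulk body is newline-delimited JSON: one action line followed by one source line per document, and the body has to end with a newline. Below is a minimal sketch of how such a body could be assembled in Python; build_bulk_body is a hypothetical helper name, and only the index action is covered.

```python
import json


def build_bulk_body(index, documents):
    """Return the NDJSON body for the _bulk API: an action line plus a source line per document."""
    lines = []
    for doc_id, source in documents.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline character.
    return "\n".join(lines) + "\n"


users = {
    "1": {"email": "vikash.kumar", "age": 26, "about": "programmer, blogger, badminton lover"},
    "2": {"email": "john.wick", "age": 31, "about": "blogging, guitar player"},
}
bulk_body = build_bulk_body("users", users)
print(bulk_body)
```

When sending this body to POST _bulk, use the Content-Type application/x-ndjson.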
Search APIs
Elasticsearch provides an extensive REST API to search documents.
- You can search for documents in a single index, in multiple indices or in all indices.
- You can search documents by a single field or by multiple fields.
- You can search documents by exact keyword using a term query, or run a full text search using a match query.
You will see more about the full text search capabilities and features in the sections below.
Elasticsearch provides two syntaxes for queries: one uses the q query parameter and the other uses the request body.
The q query parameter is less advanced and does not support the full Elasticsearch Query DSL, so the request body syntax is recommended. To learn more, see the search API documentation.
Query using the q query parameter:
GET users/_search?q=age:26
Query using the request body:
GET users/_search
{
"query": {
"term": {
"age": 26
}
}
}
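To compare the two syntaxes from code, here is a small sketch that builds the q parameter URL and the request bodies for a term query and a match query. The helper names and the base URL are assumptions for illustration; nothing is sent to Elasticsearch.

```python
import json
from urllib.parse import urlencode

# Assumption: local single-node setup from the start of this post.
ES_URL = "http://localhost:9200"


def q_search_url(index, field, value, base_url=ES_URL):
    """URL for the q query parameter syntax, e.g. users/_search?q=age:26 (URL-encoded)."""
    return f"{base_url}/{index}/_search?" + urlencode({"q": f"{field}:{value}"})


def term_search_body(field, value):
    """Request body for an exact (keyword-style) search."""
    return json.dumps({"query": {"term": {field: value}}})


def match_search_body(field, text):
    """Request body for a full text search."""
    return json.dumps({"query": {"match": {field: text}}})


url = q_search_url("users", "age", 26)
term_body = term_search_body("age", 26)
match_body = match_search_body("about", "badminton lover")
print(url)
print(term_body)
print(match_body)
```

The colon in age:26 is percent-encoded in the URL, which is equivalent to the plain form shown above.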
Mapping
Each field in a document can be quite different and serve a different purpose. For example, you might want to do an exact search on the user email field but a full text search on the user description field. Elasticsearch indexes each field optimally depending on its type and usage.
For example, the user email will be indexed optimally for exact search, while the user description will be indexed optimally for full text search.
Telling Elasticsearch about your data types is called mapping. You know your data much better than Elasticsearch does, but Elasticsearch still does its best and automatically (implicitly) creates a mapping for you, as per the sample table below. You can learn more in the mapping documentation.
Data Type | Elasticsearch Mapping Type |
---|---|
String | text and keyword |
Numeric | long, integer, short, byte etc |
Boolean | boolean |
Date | date |
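To make the table more concrete, here is a rough Python sketch of how such an implicit mapping could be guessed from a JSON document. It is a simplified illustration, not Elasticsearch's actual logic: the date check only recognises strings that parse as an ISO calendar date, and the active and joined fields are hypothetical additions to the user document.

```python
from datetime import date


def guess_dynamic_type(value):
    """Guess a mapping type from a JSON value, following the sample table (simplified)."""
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            # Assumption: a string that parses as an ISO calendar date counts as a date.
            date.fromisoformat(value)
            return "date"
        except ValueError:
            return "text + keyword"
    raise TypeError(f"unsupported value: {value!r}")


# "active" and "joined" are hypothetical fields used only for this illustration.
document = {"email": "vikash.kumar", "age": 26, "active": True, "joined": "2020-08-01"}
guessed = {field: guess_dynamic_type(value) for field, value in document.items()}
print(guessed)
```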
You can check the current mapping of a given index using the request below. As the response shows, the string field email has been mapped to two data types, text and keyword.
GET /users/_mapping
{
"users" : {
"mappings" : {
"properties" : {
"about" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"age" : {
"type" : "long"
},
"email" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
Explicit Mapping
Since you know your data best, it is better to create the mapping explicitly instead of letting Elasticsearch create it implicitly.
Elasticsearch implicitly maps the string data type to two different types, keyword and text, so that you can do both exact search and full text search on the same string field. But if you are sure that you will only use a string field either as keyword or as text, you can explicitly create the mapping for your index.
For example, to map the user email as keyword and about as text, you can use the create index API below. The mapping of an existing field cannot be changed, so if the users index already exists, delete it first (DELETE /users) and index your documents again afterwards.
PUT /users
{
"mappings": {
"properties": {
"age": { "type": "integer" },
"email": { "type": "keyword" },
"about": { "type": "text" }
}
}
}
Now, you can recheck your updated mapping as shown below.
GET users/_mapping
{
"users" : {
"mappings" : {
"properties" : {
"about" : {
"type" : "text"
},
"age" : {
"type" : "integer"
},
"email" : {
"type" : "keyword"
}
}
}
}
}
Full Text Search: language and analyzer
When it comes to full text search, the following terms come up:
Stemming
Stemming is the process of reducing a word to its root form. Multiple words can be reduced to a single root word, and only the root word is stored in the index.
For example, walking and walked can be stemmed to a single root word: walk.
Stemming is often language-dependent, and its quality depends on the quality of the dictionary for that language.
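As a toy illustration of the idea, the sketch below strips a few common English suffixes. This is not the algorithm Elasticsearch uses; the suffix list and the minimum root length are assumptions chosen only to make the walking/walked example work.

```python
# Assumption: a tiny suffix list, just enough for this example.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping a root of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


roots = {word: toy_stem(word) for word in ["walking", "walked", "walks", "walk"]}
print(roots)
```

All four words end up with the same root, so a search for any of them can match documents containing the others.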
Stopwords
Stopwords are very common words that are filtered out during analysis. These words are so common that they are not made part of the index.
For example, the, an and is are some examples of English stopwords.
Stopwords are often language-dependent, and their quality depends on the quality of the dictionary for that language.
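Stopword removal can be sketched in a few lines of Python. The stopword set below contains only the three words mentioned above and is an assumption for illustration; the real English list in Elasticsearch is longer.

```python
# Assumption: a tiny stopword list taken from the example above.
STOPWORDS = {"the", "an", "is"}


def remove_stopwords(text):
    """Lowercase the text, split it on whitespace and drop the stopwords."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOPWORDS]


kept = remove_stopwords("The fox is an animal")
print(kept)
```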
Synonyms
Synonyms are alternate words that have the same meaning as the original word.
For example, cheating is a synonym of fraud. A user search for cheating should also match records which contain fraud, and vice versa.
Synonyms are often language-dependent, and their quality depends on the quality of the dictionary for that language.
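The cheating/fraud example can be sketched as a simple query expansion. The synonym table and the helper names are hypothetical; in Elasticsearch, synonyms are configured on the analyzer rather than written like this.

```python
# Assumption: a hand-written synonym table with a single pair.
SYNONYMS = {"cheating": {"fraud"}, "fraud": {"cheating"}}


def expand_term(term):
    """Return the term together with its known synonyms."""
    term = term.lower()
    return {term} | SYNONYMS.get(term, set())


def matches(query_term, record_text):
    """True if the record contains the query term or one of its synonyms."""
    record_tokens = set(record_text.lower().split())
    return bool(expand_term(query_term) & record_tokens)


print(matches("cheating", "bank fraud reported"))
print(matches("fraud", "caught cheating in exam"))
print(matches("cheating", "badminton lover"))
```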
Score and Ranking
Elasticsearch assigns a score to each document matched against a given query; documents with a higher score are ranked higher.
The document score depends on many factors, such as:
- the number of query terms matched against the document field.
- the order of the query terms matched against the document field.
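To get a feel for ranking, here is a toy scorer that only counts how many query terms appear in a field. It is a deliberate simplification: the real relevance score in Elasticsearch (BM25 by default) also weighs things like term frequency and field length, and term order is ignored here. The function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def toy_score(query, field_text):
    """Count how many query terms occur in the field (a stand-in for a real score)."""
    field_terms = set(tokenize(field_text))
    return sum(1 for term in tokenize(query) if term in field_terms)


def rank(query, documents):
    """Return document ids ordered from highest to lowest toy score."""
    return sorted(documents, key=lambda doc_id: toy_score(query, documents[doc_id]), reverse=True)


about_fields = {
    "1": "programmer, blogger, badminton lover",
    "2": "blogging, guitar player",
}
ranking = rank("badminton lover", about_fields)
print(ranking)
```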
Analyzer
While indexing, if a document field is of type text, that field is analyzed by one of the text analyzers. By default, Elasticsearch uses the standard analyzer.
Elasticsearch has lots of built-in analyzers. Some of them are:
- Standard Analyzer
- Simple Analyzer
- Stop Analyzer
- Keyword Analyzer
- Language Analyzer
For example, you can specify english as the analyzer for the about field of the users index. This create index request only succeeds if the users index does not exist yet.
PUT /users
{
"mappings": {
"properties": {
"age": { "type": "integer" },
"email": { "type": "keyword" },
"about": { "type": "text", "analyzer": "english" }
}
}
}
You can also use the _analyze API to quickly test how different Elasticsearch analyzers process a given text. For example:
POST _analyze
{
"analyzer": "english",
"text": "The 2 QUICK Brown-Foxes is jumped over the lazy dog's bone."
}
{
"tokens" : [
{
"token" : "2"
},
{
"token" : "quick"
},
{
"token" : "brown"
},
{
"token" : "fox"
},
{
"token" : "jump"
},
{
"token" : "over"
},
{
"token" : "lazi"
},
{
"token" : "dog"
},
{
"token" : "bone"
}
]
}
As you can see above, the words "the" and "is" are stopwords and have been removed during analysis. The words foxes, jumped and lazy have been stemmed to the root words fox, jump and lazi respectively.