Elasticsearch is one of the best-known open source tools for search and indexing. It is used by well-known organizations such as Wikipedia and LinkedIn. The project started in 2010. At its core is the Lucene indexing engine, and it provides an HTTP interface for communicating with that engine. Elasticsearch is highly scalable and fast.
In this tutorial series, I will cover Elasticsearch installation, cluster setup, index creation strategies, backups, client nodes and much more. Throughout this series of posts, I will teach you to set up a production-ready Elasticsearch cluster, even if you have no prior knowledge of Elasticsearch.
Note: This article is aimed at IT Ops/DevOps engineers.
Unlike a traditional SQL database, Elasticsearch is distributed and can scale horizontally. This type of scaling lets you add more nodes to process requests and handle the load.
To understand its distributed nature, you should first understand its basic building blocks. Let's have a look at them.
1. Indexes: All the data you store in Elasticsearch is stored in indexes. Adding data to an index is called indexing.
2. Shards: Every index is stored in one or more shards. A shard is a Lucene index, and it is the unit of scaling in Elasticsearch. The rule of thumb is to have at least one shard per node.
You can store a single index in multiple shards on a single node. However, it does not make sense to keep replicas on the same node as the shards they copy. When you add a node to the Elasticsearch cluster, it joins as a peer, and shards are migrated to the new node so that they are evenly distributed. This process is called "rebalancing".
3. Replicas: Duplicates of a shard are known as replicas. For high availability, you can distribute the shard duplicates across the cluster.
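To make the idea of shards, replicas and rebalancing more concrete, here is a toy sketch in Python. It is not how Elasticsearch actually allocates shards; the function name `allocate_shards`, the node names and the round-robin rule are all my own assumptions for illustration. It only shows the two properties described above: copies of the same shard land on different nodes, and adding nodes spreads the shards more thinly.

```python
from collections import defaultdict


def allocate_shards(nodes, num_primaries, num_replicas):
    """Toy round-robin placement (hypothetical, not Elasticsearch's allocator).

    Copy c of shard s goes to node (s + c) mod len(nodes), so the copies
    of one shard always end up on different nodes.
    Returns a dict mapping each node name to a list of shard labels,
    where "p0" is primary shard 0 and "r0" is a replica of shard 0.
    """
    if num_replicas >= len(nodes):
        raise ValueError("need more nodes than replicas per shard")
    placement = defaultdict(list)
    for shard in range(num_primaries):
        for copy in range(num_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            label = f"p{shard}" if copy == 0 else f"r{shard}"
            placement[node].append(label)
    # Keep the caller's node order and include nodes that hold nothing.
    return {node: placement[node] for node in nodes}


if __name__ == "__main__":
    # Same index (4 primaries, 1 replica each) on 2 nodes, then on 4 nodes.
    print(allocate_shards(["node-1", "node-2"], 4, 1))
    print(allocate_shards(["node-1", "node-2", "node-3", "node-4"], 4, 1))
```

With two nodes, each node holds four shard copies; with four nodes, each holds two. That drop in shards per node is the effect rebalancing aims for when a node joins.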
Node Roles
The nodes in a cluster fall under different roles: the data node, the master node and the client node. The default installation has all three roles set up on one server. However, with some fine tuning, you can run these roles on separate servers for high availability and better performance.
1. Data node: It contains all the data and shards. Its primary function is to house the data; in this design it does not receive query requests from clients directly.
2. Client node: The client node is the entry point for Elasticsearch queries. It receives all the queries and routes them to the data nodes.
3. Master node: This node maintains the cluster and updates the cluster state. All the nodes in the cluster hold a copy of the cluster state, but only the master node can update it.
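As a rough illustration of how the three roles differ in configuration, the sketch below prints the role lines you might put in each node's elasticsearch.yml. It assumes the older boolean settings `node.master` and `node.data`; newer Elasticsearch releases use a `node.roles` list instead, so check the documentation for your version. The helper name `render_node_config` is hypothetical.

```python
# Assumption: legacy boolean role settings. Newer releases replace these
# with a `node.roles` list, so treat this mapping as illustrative only.
ROLE_SETTINGS = {
    "master": {"node.master": True, "node.data": False},
    "data": {"node.master": False, "node.data": True},
    "client": {"node.master": False, "node.data": False},
}


def render_node_config(role):
    """Return the elasticsearch.yml role lines for a dedicated node (hypothetical helper)."""
    settings = ROLE_SETTINGS[role]
    # YAML booleans are lowercase, so convert Python's True/False.
    return "\n".join(f"{key}: {str(value).lower()}" for key, value in settings.items())


if __name__ == "__main__":
    for role in ("master", "data", "client"):
        print(f"# {role} node")
        print(render_node_config(role))
```

A client node is simply a node with both flags switched off: it holds no data and cannot be elected master, so all it does is receive and route requests.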
Cluster Capacity Planning
The amount of resources you need for the cluster can only be determined by the amount of data you are going to process. You can insert data into a single-node cluster and run a test while checking the CPU and memory utilization. If there is enough CPU and memory available, you can insert more data and repeat the test.
By repeated testing, you will know the number of nodes you need to have in a production cluster.
For example, if you have 500k documents and a query takes 4 seconds on a single node, you will need about four data nodes to bring the response time down to 1 second, assuming response time improves roughly in proportion to the number of data nodes.
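The arithmetic behind that example can be written down as a small helper. This is only a back-of-the-envelope sketch: it assumes response time scales linearly with the number of data nodes, which real clusters rarely do exactly, and the function name `data_nodes_needed` is my own.

```python
import math


def data_nodes_needed(measured_seconds, target_seconds):
    """Estimate data nodes from a single-node test (assumes linear scaling)."""
    if measured_seconds <= 0 or target_seconds <= 0:
        raise ValueError("response times must be positive")
    # Round up, and never go below the one node used for the test.
    return max(1, math.ceil(measured_seconds / target_seconds))


if __name__ == "__main__":
    # 4 seconds measured on one node, 1 second wanted.
    print(data_nodes_needed(4, 1))
```

Treat the result as a starting point for further load tests, not as a final number.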
Next, you need to plan your master nodes. The official Elasticsearch site recommends at least three master nodes in a production cluster, so that the cluster can maintain a quorum of two.
Finally, you will need more than one client node behind a load balancer, so that the load on the data nodes is reduced.
So as per our design, there will be two client nodes, three master nodes and four data nodes for a 500k-document workload. This is just an example and is not based on any tests.
In this Elasticsearch tutorial, we went through the basic concepts involved in Elasticsearch. In the next article, I will teach you how to set up an Elasticsearch cluster using client, master and data nodes.
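The "quorum of two" comes from the usual majority rule: more than half of the master nodes must agree. A minimal sketch, with function names of my own choosing:

```python
def master_quorum(master_nodes):
    """Smallest majority of the given number of master nodes."""
    if master_nodes < 1:
        raise ValueError("need at least one master node")
    return master_nodes // 2 + 1


def tolerated_failures(master_nodes):
    """How many master nodes can be lost while a majority remains."""
    return master_nodes - master_quorum(master_nodes)


if __name__ == "__main__":
    for count in (1, 2, 3, 5):
        print(count, master_quorum(count), tolerated_failures(count))
```

With three master nodes the quorum is two, so the cluster survives the loss of one. With only two master nodes the quorum is still two, so losing either one leaves no majority, which is why three is the practical minimum.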