Technical details on an implementation of an ElasticSearch cluster and its ingestion and visualization tools (MetricBeat, Logstash and Kibana) to build the following:
- Capture and centralization of system logging
- Centralization of Apache Web Server access logs
- Capture of tweets for analysis
Setting up the cluster

The minimal recommended number of nodes for a cluster is 3. This is to avoid a split-brain scenario:
In a cluster of 3 or more nodes, you have to tell each node the minimum number of master-eligible nodes it must see before a master can be elected. With 3 nodes, this number is 2. So if a network problem happens:
- a non-master node that doesn't see any other node won't promote itself to master.
- a master that doesn't see any other node will stop acting as master for the data it is holding.
- two nodes that still see each other, but no longer the master, will elect one of them as the new master. When the original master is reachable again, it will have stopped acting as a master and will rejoin the cluster as a non-master node.
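The quorum described above is simply the majority of master-eligible nodes, N/2 + 1 with integer division. A quick sketch of the calculation (the variable names are illustrative):

```shell
# Majority of master-eligible nodes: N/2 + 1 (integer division).
# This is the value to use for discovery.zen.minimum_master_nodes.
master_eligible=3
quorum=$(( master_eligible / 2 + 1 ))
echo "$quorum"   # 2: a lone, isolated node cannot elect itself master
```

With 5 master-eligible nodes the same formula gives 3, so the cluster survives the loss of any two of them.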
A node can take one or more of the following roles:
- Master node: one and only one in the cluster, chosen by election among the nodes that can become a master. The setting node.master=true allows a node to become a master.
- Data node: node that can hold data (Lucene indexes) and perform data-related operations on them. Assigned via the setting node.data=true.
- Ingest node: node that doesn't need to hold any data but enriches documents before they are effectively written into the indexes. Assign this role with node.ingest=true.
- Tribe node: a very special node, connecting to different clusters and able to search across multiple clusters. When acting as a tribe node, the node only does coordination and does not store data. Use the tribe.* parameters to achieve this (more information in the official ElasticSearch documentation).
For a production cluster, the recommended layout is:
- Dedicate 3 or more nodes to be master-eligible and not data nodes (thus setting node.data=false). Elastic recommends that you don't perform search, ingestion or any other client operation via the master node of the cluster; as you may not know which node is the current master, creating a group of master-eligible-only nodes lets you exclude them from the list of nodes your clients connect to for loading data into the indices.
- Dedicate as many data nodes as you need according to the estimated amount of data, turning off their master-eligibility (node.master=false).
- If you need to run heavy pipelines in the cluster, to transform the documents when they enter the cluster but before they are really added into the indexes, dedicate some nodes to this role by turning off both the master-eligibility and the data role (node.data=false; node.master=false).
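Expressed as elasticsearch.yml fragments, one per node type, this layout could look like the sketch below (the grouping in a single listing is for illustration only; each fragment goes into the configuration of its own node):

```yaml
# Dedicated master-eligible node: no data, no ingest
node.master: true
node.data: false
node.ingest: false

# Dedicated data node:
# node.master: false
# node.data: true
# node.ingest: false

# Dedicated ingest node, for heavy pipelines:
# node.master: false
# node.data: false
# node.ingest: true
```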
You will need a valid Java Virtual Machine installed on your servers before running ElasticSearch. All the configurations and operations presented below have been done on virtual machines with 1 vCPU and 4 GB RAM, running Ubuntu 16.04 LTS.
The JVM used to perform the tests is Oracle Java SE 8u101, downloaded from the Oracle Java web site.
ElasticSearch has been installed via the Elastic APT repository:
Install the Elastic packages signing public key:
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
Add the HTTPS transport for APT if not yet done:
sudo apt-get install apt-transport-https
Add the Elastic APT repository for ElasticSearch:
echo "deb https://artifacts.elastic.co/packages/5.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-5.x.list
Update the local package cache and perform the installation of the package:
sudo apt-get update && sudo apt-get install elasticsearch
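The Debian package registers a systemd unit but does not start it automatically; on Ubuntu 16.04 the service is enabled at boot and started with systemctl:

```shell
# Reload systemd units, enable the service at boot and start it now
sudo systemctl daemon-reload
sudo systemctl enable elasticsearch.service
sudo systemctl start elasticsearch.service
```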
You configure your cluster via the file /etc/elasticsearch/elasticsearch.yml on each node.
An example of configuration for a simple 3-node cluster:
# Name of the cluster, common for all nodes part of the same cluster
cluster.name: pandora
# The name to identify this instance on this node in the cluster (can be different from the hostname)
node.name: ES1
# We can use any directory we want to put the data (indices) and the cluster logs
path.data: /es/data
path.logs: /es/logs
#
# Lock the memory on startup - to reserve memory when starting up the binary
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
# Elasticsearch performs poorly when the system is swapping the memory.
#
# Bind to a specific IP address on the system only
network.host: 192.168.2.13
#http.port: 9200
#
# To discover other nodes that are part of this cluster
discovery.zen.ping.unicast.hosts: ["es1", "es2"]
# Prevent the "split brain" by configuring the majority of nodes (total number of nodes / 2 + 1):
discovery.zen.minimum_master_nodes: 2
#
# Block initial recovery after a full cluster restart until N nodes are started:
gateway.recover_after_nodes: 1
#
# Disable starting multiple nodes on a single system:
node.max_local_storage_nodes: 1
#
# Require explicit names when deleting indices:
#action.destructive_requires_name: true

When the ElasticSearch process is started on all nodes, you have a running cluster, waiting for input on port 9200.
To check not only that the ElasticSearch daemon is running but also that it is performing well, we can use the following two calls; they also illustrate how any client software will interact with the search engine.
All you need is curl installed on your system. Any other command-line tool able to send HTTP calls (GET, POST, PUT, HEAD, DELETE, ...) will do the trick.
The first example displays the general health status of the cluster's nodes. The second one lists each index and its status.
es1(ubuntu):~$ curl http://localhost:9200/_cat/nodes?v
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.2.14           26          95   7    0.01    0.00     0.00 mdi       -      ES2
192.168.2.13           49          92   4    0.02    0.01     0.00 mdi       *      ES1

es1(ubuntu):~$ curl http://localhost:9200/_cat/indices?v
health status index              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   icinga-2017.04     2Hla9BSiQUWciDvSt6xHWg   5   1     294937            0    275.9mb        137.9mb
green  open   apache-2017.04     B6i1gmQIRj-dW41BFAL4AQ   5   1     412241            0    813.1mb        406.2mb
green  open   syslog-2017.04     9gPSk2z9RxaT3WdKFRHI1g   5   1     168997            0    152.5mb         67.8mb
green  open   metricbeat-2017.04 KBeHsGj4TVqhKysguPlKew   2   1    1480507            0   1020.1mb          510mb
The ?v parameter added to the URL adds the column headers, producing a nice human-readable output.
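Because the _cat output is plain columnar text, it is easy to post-process with standard tools. As a sketch, here the sample indices listing from above is saved to a file (indices.txt is just an illustrative name) and awk sums the docs.count column:

```shell
# Save the sample _cat/indices listing shown above to a file
cat > indices.txt <<'EOF'
health status index              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   icinga-2017.04     2Hla9BSiQUWciDvSt6xHWg   5   1     294937            0    275.9mb        137.9mb
green  open   apache-2017.04     B6i1gmQIRj-dW41BFAL4AQ   5   1     412241            0    813.1mb        406.2mb
green  open   syslog-2017.04     9gPSk2z9RxaT3WdKFRHI1g   5   1     168997            0    152.5mb         67.8mb
green  open   metricbeat-2017.04 KBeHsGj4TVqhKysguPlKew   2   1    1480507            0   1020.1mb          510mb
EOF
# Skip the header line and sum column 7 (docs.count) over all indices
total=$(awk 'NR > 1 { sum += $7 } END { print sum }' indices.txt)
echo "$total"   # 2356682 documents in total
```

In a live setup you would pipe curl straight into awk instead of going through a file.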