Running Multiple Elasticsearch 6.x Instances on a Single Server

Author: Nicholas M. Hughes

You’re probably here as a result of being in the same boat as me… you just completely refuse to listen when people tell you something can’t or shouldn’t be done.

Elastic suggests using “the right size” servers for your physical cluster. What this usually means is a maximum of 64GB of RAM, lots of CPU cores for concurrency, and fast storage (such as SSDs). Much to Elastic’s chagrin, many IT departments end up with hardware for their cluster that wasn’t necessarily purchased for a specific purpose. Money ends up in the budget that has to be used… and the next step is usually “Buy the beefiest thing out there! We’ll find something to use it for.”

If you end up in this scenario, it’s easy to feel wasteful. In my case, I had 5 servers to work with, each with 256GB of RAM and 144 cores. Attempts to increase the heap allocation didn’t yield much performance gain, due in large part to factors like Compressed OOPs (push the heap much past ~32GB and the JVM loses pointer compression, so a chunk of that extra memory is wasted). What we needed was more nodes in the cluster. Unfortunately, more hardware wasn’t going to magically appear.

In these scenarios, Elastic often recommends virtualization or containerization in order to run multiple instances of the software on a single physical server. Unfortunately, constraints in my client’s environment precluded me from implementing those solutions. Why can’t I just make it work? Well, Elastic decided to make things harder on me. Support for the es.config command line parameter was dropped in recent versions. This was traditionally the mechanism by which people were running multiple instances of the software with different configuration files. Searching for solutions to the problem in online forums either yielded the old solution that doesn’t work anymore or unsolved calls for help. Things were looking bleak.

I beat my head against it for a little bit, and then realized something… I can’t pass multiple configuration files anymore, but that doesn’t stop me from littering my configuration file with variables! I found my hack!

Let’s Hack Things Up!

These instructions are for converting a pre-existing cluster. If this doesn’t apply to your situation, skip over any steps telling you to PUT to the Elasticsearch API.

Additionally, API calls are shown generically for the purposes of this document. Feel free to translate them for the cURL command line or any other tool you fancy.
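For instance, the decommission call in step 2 below might translate to curl like this. The localhost:9200 address is just an assumption; point it at whichever node’s HTTP interface you can reach, and add any authentication your cluster requires.

curl -s -X PUT 'localhost:9200/_cluster/settings' \
     -H 'Content-Type: application/json' \
     -d '{ "transient": { "cluster.routing.allocation.exclude._name": "node-1" } }'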

1. First, note that this process will involve running Elasticsearch on multiple, non-standard ports on your server. If you’re using a host firewall (IPTables, etc.), then it will need to be updated for the new ports.

2. Decommission a single “old” physical node by moving off shards:

PUT _cluster/settings
{
   "transient": {
       "cluster.routing.allocation.exclude._name": "node-1"
   }
}

3. Wait until there aren’t any shards relocating and the document count has dropped to 0:

GET _cluster/health?pretty

The jq application is super useful for parsing JSON. The example below dumps out just the document count for a node called NODENAME.

GET _nodes/NODENAME/stats/indices | jq '.. | .docs? | select(.count? != null)'
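Translated for curl (again assuming localhost:9200 as a reachable HTTP interface), the same check might look like:

curl -s 'localhost:9200/_nodes/NODENAME/stats/indices' | \
     jq '.. | .docs? | select(.count? != null)'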

4. Stop the elasticsearch service on the “old” node.

5. Delete the physical remnants of the “old” index data. Basically, go to your path.data directories and wipe them clean.
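A minimal sketch of that cleanup is below, assuming a disk layout like the /data00 through /data26 mounts described later in this post. The paths are purely illustrative; verify them against path.data in your elasticsearch.yml before deleting anything.

# WARNING: destructive! Confirm these directories match path.data first.
for d in /data{0..2}{0..6}; do
     rm -rf "${d:?}"/*
done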

6. Check cluster health again. This is mostly because I’m nervous anytime I perform a destructive process. If we’re green, then we’re good to proceed.

GET _cluster/health?pretty

7. Check that the “old” node has left the cluster.

GET _cat/nodes?v

8. Now comes the fun part. Updating configuration files! (waits for applause)

  • Create log directories for the new services. On my CentOS implementation, existing logs were created in the /var/log/elasticsearch directory. I deployed 3 instances to each physical node, so I labeled each instance with a single digit starting at zero. Choose a scheme that works for you.

mkdir /var/log/elasticsearch{0..2} && \
     chown elasticsearch. /var/log/elasticsearch*
  • Hack up your service files. Again, I’m using CentOS… so that meant creating SystemD unit files for each new labeled service and then deleting the “old” one.

for i in {0..2}; do
     cp -p /usr/lib/systemd/system/elasticsearch{,$i}.service;
done && rm -f /usr/lib/systemd/system/elasticsearch.service

The unit file provided with the Elasticsearch software has three lines that need to be updated for each new service. Replace the “X” with the appropriate label (a scripted sketch follows the listing below).

RuntimeDirectory=elasticsearchX
Environment=PID_DIR=/var/run/elasticsearchX
EnvironmentFile=/etc/sysconfig/elasticsearchX
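If you’d rather not edit each copy by hand, a sed sketch along these lines may do the trick. It assumes the three lines above end exactly as shown in the stock unit file, so eyeball the results before moving on.

for i in {0..2}; do
     sed -i \
         -e "s|^RuntimeDirectory=elasticsearch$|RuntimeDirectory=elasticsearch${i}|" \
         -e "s|/var/run/elasticsearch$|/var/run/elasticsearch${i}|" \
         -e "s|/etc/sysconfig/elasticsearch$|/etc/sysconfig/elasticsearch${i}|" \
         /usr/lib/systemd/system/elasticsearch${i}.service
done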
  • Create environment files for your variable references. These guys will probably be in different locations based upon your specific distribution. In my case, the unit file expects them in the /etc/sysconfig/ directory. Of note below are the ES_NODE_NAME, ES_PATH_DATA, ES_HTTP_PORT, and ES_TCP_PORT variables. Each Elasticsearch node running on the system will need its very own name, unique ports, and its own location to store the index data. My servers had 21 distinct disks for the original implementation, which divided nicely into three groups of seven for my purposes.

for i in {0..2}; do
     cat << EOF > /etc/sysconfig/elasticsearch${i}
PID_DIR="/var/run/elasticsearch${i}"
ES_NODE_NAME="NODENAME-${i}"
ES_PATH_DATA="/data${i}0,/data${i}1,/data${i}2,/data${i}3,/data${i}4,/data${i}5,/data${i}6"
ES_PATH_LOGS=/var/log/elasticsearch${i}
ES_HTTP_PORT=920${i}
ES_TCP_PORT=930${i}
EOF
     chmod 644 /etc/sysconfig/elasticsearch${i}
done
  • Change the configuration file to use your variable references. Edit your /etc/elasticsearch/elasticsearch.yml file and drop the references in. It’ll look something like this:

node.name: ${ES_NODE_NAME}
path.data: ${ES_PATH_DATA}
path.logs: ${ES_PATH_LOGS}
http.port: ${ES_HTTP_PORT}
transport.tcp.port: ${ES_TCP_PORT}

Note that there is one more configuration parameter that is very important: discovery needs to know about the new transport ports in order to figure out what’s going on here.

discovery.zen.ping.unicast.hosts:
     - sally:9300
     - jessy:9300
     - raphael:9300
     - donatello:9300
     - NODENAME:9300
     - NODENAME:9301
     - NODENAME:9302
  • Check your JVM options. In my case, I had heap sizes that were fairly large… much higher than the Elastic recommendations. I set a smaller heap in the jvm.options file that is used by all of the services (32766m for Compressed OOPs). You can also poke around and check if there are any other configuration items that need to be tweaked or reduced based upon your new layout.
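For reference, the heap lines in the shared /etc/elasticsearch/jvm.options file would end up looking something like this with the value mentioned above; size it to fit your own memory budget.

# One heap setting shared by all three instances on the host
-Xms32766m
-Xmx32766m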

9. Start the new services. This may take a systemctl daemon-reload or comparable command in order to recognize the new services and the removal of the old one.
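On my CentOS servers that boiled down to something like the following (the service names assume the labeling scheme from the earlier steps):

systemctl daemon-reload
for i in {0..2}; do
     systemctl enable elasticsearch${i}
     systemctl start elasticsearch${i}
done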

10. Check that the new nodes have joined the cluster.

GET _cat/nodes?v

11. Clean up the transient setting used to decommission the old node.

PUT _cluster/settings
{
   "transient" : {
       "cluster.routing.allocation.exclude._name" : ""
   }
}

12. Move on to the next physical server and do it all again!

Conclusion

This was a great exercise in working around limitations in the capabilities that a software package provides. Elastic giveth and Elastic taketh away. When they taketh my ability to run multiple instances on a server, then I giveth them heartburn.
