"After all, the engineers only needed to refuse to fix anything, and modern industry would grind to a halt." -Michael Lewis

Enable Massive Growth

How to do a Rolling Upgrade of an Elasticsearch Cluster Using Ansible

Mar 2019

You can see the source code for this blog post on GitHub.

In a previous post, we saw how to provision a multi-node elasticsearch cluster using ansible. The problem with that post is that, by the time I was done writing it, Elastic had already come out with a new version of elasticsearch. I'm being mildly facetious, but not really. They release new versions very quickly, even by the standards of modern software engineering.

It would be wise, therefore, to think about upgrading from the very beginning. The recommended way to upgrade versions of elasticsearch from 5.6 onwards is a rolling upgrade. As you can see from that article, upgrading (even in place) elasticsearch is not trivial by any stretch of the imagination.

However, it can be done, and in this post I'll show you one way to do it using ansible.

Doing the Upgrade

To start with, I'll create an ansible role using molecule to demo what is required:

$ molecule init role -r upgrade-elasticsearch-cluster -d vagrant

I'm choosing Vagrant as my VM driver and calling this role upgrade-elasticsearch-cluster.

So I don't have to reinvent the wheel I'm reusing the role that installs elasticsearch (version 6.3.0) by including it in my meta/main.yml file:

---
dependencies:
  - role: install-elasticsearch-cluster

That role is still a WIP, and in fact I changed the discovery IP addresses to be 192.168.56.101-103, hardcoded in the configuration file. To demonstrate a minimum viable product for this example I'll reuse that molecule/default/molecule.yml platform's section:

platforms:
  - name: elasticsearchNode1
    box: ubuntu/xenial64
    memory: 4096
    provider_raw_config_args:
    - "customize ['modifyvm', :id, '--uartmode1', 'disconnected']"
    interfaces:
    - auto_config: true
      network_name: private_network
      ip: 192.168.56.101
      type: static
  - name: elasticsearchNode2
    box: ubuntu/xenial64
    memory: 4096
    provider_raw_config_args:
    - "customize ['modifyvm', :id, '--uartmode1', 'disconnected']"
    interfaces:
    - auto_config: true
      network_name: private_network
      ip: 192.168.56.102
      type: static
  - name: elasticsearchNode3
    box: ubuntu/xenial64
    memory: 4096
    provider_raw_config_args:
    - "customize ['modifyvm', :id, '--uartmode1', 'disconnected']"
    interfaces:
    - auto_config: true
      network_name: private_network
      ip: 192.168.56.103
      type: static

This creates three virtual machines that will be provisioned with elasticsearch and join the same cluster. Also be sure to add some host variables to your molecule.yml file:

provisioner:
  name: ansible
  inventory:
    host_vars:
      elasticsearchNode1:
        node_name: es_node_1
        is_master_node: true
        es_version_to_upgrade_to: elasticsearch-6.5.3.deb
      elasticsearchNode2:
        node_name: es_node_2
        is_master_node: true
        es_version_to_upgrade_to: elasticsearch-6.5.3.deb
      elasticsearchNode3:
        node_name: es_node_3
        is_master_node: false
        es_version_to_upgrade_to: elasticsearch-6.5.3.deb

We will eventually be upgrading to version 6.5.3, and that will become clear soon.

We will create a boolean flag that allows us to upgrade elasticsearch and call it upgrade_es, change your defaults/main.yml file to look like this:

---
upgrade_es: false

Then, you can make your tasks/main.yml file look like:

- include: upgrade_es.yml
  when: upgrade_es

And create a tasks/upgrade_es.yml file, which will house all of the upgrade logic:

---
# tasks file for upgrade-elasticsearch-cluster
- name: ensure elasticsearch already present
  service:
    name: elasticsearch
    state: started
  become: yes

- name: ensure elasticsearch is up and available
  wait_for:
    host: 127.0.0.1
    port: 9200
    delay: 5

- name: get elasticsearch to upgrade to
  get_url:
    dest: "/etc/{{ es_version_to_upgrade_to }}"
    url: "https://artifacts.elastic.co/downloads/elasticsearch/{{ es_version_to_upgrade_to }}"
    checksum: "sha512:https://artifacts.elastic.co/downloads/elasticsearch/{{ es_version_to_upgrade_to }}.sha512"
  become: yes

- name: perform upgrade process as root
  block:
    - name: disable shard allocation
      uri:
        url: http://127.0.0.1:9200/_cluster/settings
        body: '{"persistent":{"cluster.routing.allocation.enable":"none"}}' # specify no shard allocation
        body_format: json
        method: PUT
      
    - name: stop non essential indexing to speed up shard recovery
      uri:
        url: http://127.0.0.1:9200/_flush/synced
        method: POST
      ignore_errors: yes

    - name: get cluster id
      uri:
        url: http://127.0.0.1:9200
      register: pre_upgrade_cluster_info

    - name: shut down node
      service:
        name: elasticsearch
        state: stopped

    - name: upgrade node
      apt:
        deb: "/etc/{{ es_version_to_upgrade_to }}"

    - name: bring up node
      service:
        name: elasticsearch
        state: started
      notify: wait for elasticsearch to start

    - meta: flush_handlers

    - name: validate it joins cluster
      uri:
        url: http://127.0.0.1:9200
      register: post_upgrade_cluster_info
      until: pre_upgrade_cluster_info.json.cluster_uuid == post_upgrade_cluster_info.json.cluster_uuid
      retries: 3
      delay: 10

    - name: reenable shard allocation
      uri: 
        url: http://127.0.0.1:9200/_cluster/settings
        body: '{"persistent":{"cluster.routing.allocation.enable":null}}' # reenabling the setting removes shard allocation
        body_format: json
        method: PUT

    - name: wait for elasticsearch to recover
      script: check_es_health.py
      register: es_recovery_response
      until: es_recovery_response.rc == 0
      retries: 15
      delay: 10
  become: yes

There is one script that has to run located at files/check_es_health.py. This is pretty simple:

#!/usr/bin/python

import urllib2
import sys

response = urllib2.urlopen("http://127.0.0.1:9200/_cat/health")
body = response.read()
response.close()
if "green" in body:
    sys.exit(0)
else:
    sys.exit(1)

Finally, there is one handler in handlers/main.yml file that looks like:

---
# handlers file for upgrade-elasticsearch-cluster
- name: wait for elasticsearch to start
  wait_for:
    host: 127.0.0.1
    port: 9200
    delay: 5

To run the source code from github, you will have to first leave the upgrade_es flag to be false. Then run:

$ molecule create && molecule converge

After the VMs come up and they have working elasticsearch instances, you need to add the serial: 1 flag at the top of the molecule/default/playbook.yml file. Then switch the upgrade_es flag to true, and run:

$ molecule converge

The virtual machines will then be upgraded one by one, stopping until the instance comes up and is available.

Nick Fisher is a software engineer in the Pacific Northwest. He focuses on building highly scalable and maintainable backend systems.