Managing the Component Store (Consul)

This section describes how to manage the cluster component store (Consul).

Overview

With Apcera release 3.0, a distributed Key/Value storage system, Consul, has been added to the cluster.

Consul is used within the Apcera cluster as the backend storage for Vault (also new in Apcera 3.0), which is used by the cluster for securely storing secrets, including internal security keys and X.509 certificates for HTTPS connections. In future releases Consul will become the default storage location for almost all Apcera cluster metadata.

Provisioning

Consul uses a distributed consensus protocol, Raft, to maintain the Key/Value data in a reliable way. With Raft (and other consensus protocols), a "quorum" of Consul servers must be available at all times to elect a Consul Leader. If a quorum of servers is not available, the Consul service is not functional. As a result, Consul has specific deployment requirements to ensure continued operation during failure scenarios. For more details, see the Consul documentation at https://www.consul.io/docs/internals/consensus.html

Datacenters

Consul operates the consensus protocol only within individual Consul datacenters. Apcera configures Consul datacenters to match the Apcera datacenter tags. If no datacenter tagging is configured, all Consul servers are configured as a single Consul datacenter. Consul expects low-latency connections (10 ms or less round-trip time) between servers in a single Consul datacenter, so the Apcera datacenter tags must be configured to match your network topology.
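As a quick sanity check of that latency expectation, you can measure the round-trip time from one Consul server to another host in the same intended datacenter. The address below is only a placeholder for one of your own Consul hosts.

# Measure round-trip latency to another Consul server in the same datacenter
ping -c 10 10.0.1.12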

Multiple Datacenters in an Apcera cluster

Within an Apcera cluster, only one datacenter will be considered the master or "brain" datacenter for all Consul data. That datacenter will be the authoritative datacenter for all cluster metadata, and any outage to that datacenter will cause other datacenters to enter a reduced functionality state.

The primary datacenter is selected with the cluster configuration setting chef->continuum->consul->master_datacenter. This setting must be configured in any Apcera deployment with Consul in multiple datacenters. If Consul is deployed in only a single datacenter, this setting is not required.
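For example, a minimal cluster.conf fragment setting the primary datacenter might look like the following (a sketch only; the datacenter name us-west is a placeholder for whichever Apcera datacenter tag should act as the primary):

chef: {
  "continuum": {
    "consul": {
      "master_datacenter": "us-west"
    }
  }
}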

Any other datacenter that contains components requiring Consul has a secondary Consul cluster installed, which exists primarily to forward all Consul transactions to the primary datacenter. For example, in a Store3-enabled cluster, a remote datacenter that contains a Package Manager must also contain at least one host with the kv-store role.

Server Count

Each Consul datacenter must have an odd number of servers. This requirement is common amongst consensus protocol implementations, because any even number of servers can be susceptible to a "split brain" failure where network connectivity loss between servers results in two sets of active servers each containing exactly half the total number of servers.

In most deployments a single Consul datacenter will have either three or five Consul servers.
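For reference, a quorum is a simple majority of the servers in a datacenter, floor(N/2) + 1, which is why odd counts are used:

  • 3 servers: quorum is 2, so the datacenter tolerates the loss of 1 server.
  • 5 servers: quorum is 3, so the datacenter tolerates the loss of 2 servers.
  • 4 servers: quorum is 3, so the datacenter still tolerates only 1 failure and gains nothing over 3 servers.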

In AWS, in order to achieve the desired failover behavior between Availability Zones, it is strongly recommended that the Apcera platform be installed only in AWS Regions with three or more Availability Zones. In any region with only two Availability Zones there is an elevated risk of a failure of the entire Consul cluster due to an outage impacting a single Availability Zone. For example, it is possible to install three servers across two AZs in an unbalanced topology, but an outage of the AZ containing two servers would leave the other AZ unable to serve Consul requests, because it would be unable to perform a leader election.

Configuration

Here is an example cluster.conf entry that would install the cluster using store3 (Consul).

chef: {
  "continuum": {
    "component_database": {
      "kind": "store3"
    }
  }
}

Monitoring

The Zabbix monitoring system in an Apcera cluster will be automatically configured with various tests against the Consul servers. The list below explains the possible alarms in more detail.

  1. Consul Cluster Status
    • Description: The Consul agent on this host is unable to find a current Consul cluster leader.
    • Possible Causes: Loss of network connectivity between servers, or too many failed servers in the Consul cluster.
    • Troubleshooting Steps: Use the Consul diagnostic commands (see the sketch after this list) to determine which peers exist and their status. Investigate any peers that are not active.
  2. Consul Cluster at risk of losing quorum
    • Description: This Consul datacenter is within a single server of losing quorum. This is a warning that in the current state any additional failure will result in a total outage of Consul in this datacenter.
    • Possible Causes: One or more other servers have failed, or stale servers exist in the Consul configuration.
    • Troubleshooting Steps: Restore the failed Consul servers to normal operation, or replace them. If no failed servers exist, investigate the Consul configuration to see if there are stale servers listed which no longer exist.
  3. Consul Cluster accepting writes
    • Description: An attempt to write data to Consul from this host failed.
    • Possible Causes: If the Consul cluster is otherwise operating normally, this would indicate an unexpected failure inside Consul.
    • Troubleshooting Steps: If the Consul cluster is otherwise operating normally, investigate via the diagnostic commands and contact Apcera Support if needed.
  4. Consul Cluster accepting reads
    • Description: An attempt to read data from Consul on this host failed.
    • Possible Causes: If the Consul cluster is otherwise operating normally, this would indicate an unexpected failure inside Consul.
    • Troubleshooting Steps: If the Consul cluster is otherwise operating normally, investigate via the diagnostic commands and contact Apcera Support if needed.
  5. Consul Cluster monitoring keys must be increasing
    • Description: Data being read from this Consul datacenter has stopped updating. The data written and read by the tests listed above is a timestamp; if that timestamp stops increasing, this alarm is triggered.
    • Possible Causes: If the Consul cluster is otherwise operating normally, this would indicate an unexpected failure inside Consul.
    • Troubleshooting Steps: If the Consul cluster is otherwise operating normally, investigate via the diagnostic commands and contact Apcera Support if needed.
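If any of these alarms trigger, a reasonable first step (a sketch, not a complete procedure) is to log into one of the Consul servers as root and run the commands described under Diagnostic Commands below:

# Check cluster membership and local agent status
consul members
consul info
# Load the master token (see Consul Security), then list the raft peers
export CONSUL_HTTP_TOKEN=`jq -r .acl_master_token /etc/consul.d/master-acl.json`
consul operator raft -list-peers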

Consul Security

Apcera configures Consul to require authentication tokens for all Consul transactions. Some of the diagnostic commands listed below require a token. Consul tokens can have various permissions granted to them which restrict what they are allowed to do. The "master" token is the only pre-generated token which is guaranteed to exist and is always available. Apcera places this token into /etc/consul.d/master-acl.json as part of the Consul configuration. When logged into one of the Consul servers, you can load that token into your shell environment by running:

export CONSUL_HTTP_TOKEN=`jq -r .acl_master_token /etc/consul.d/master-acl.json`
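With the token exported, the Consul CLI reads it from the CONSUL_HTTP_TOKEN environment variable, so authenticated commands such as the following (from the Diagnostic Commands section) can be run directly:

consul operator raft -list-peers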

In a Store3-enabled cluster, every component which uses Store3 will have a Consul token, and those tokens will be managed by Vault. Components will fetch their Consul tokens from Vault during component startup.

Diagnostic Commands

This is a list of some useful diagnostic commands for Consul. Running some of these commands requires a Consul token; see Consul Security. Note that all of these commands are run on a Consul host as root.

  • consul members - Lists the members of the Consul cluster as known by the local Consul agent
  • consul info - Lists details of this Consul agent's status
  • consul operator raft -list-peers - Lists the configured raft peers in the Consul cluster, and their current states. (requires auth token)
  • curl http://localhost:8500/v1/acl/list?token=${CONSUL_HTTP_TOKEN} | json_pp -json_opt canonical,pretty - Lists all Consul ACL policies and the associated tokens. Note: This contains tokens which are sensitive and should not be exposed.
  • sv status consul - Shows the current status of the Consul agent, according to the runit system which manages Consul.
  • sv restart consul - Restarts the Consul agent.

Backups & Restores

Because Consul is a critical datastore for cluster metadata, backups of that data are an essential part of the Consul deployment.

Backups are performed nightly automatically, and are also performed during cluster deployments.

Backups are stored on the Consul servers in /var/lib/postgresql/backups/consul. If the cluster configuration includes a S3 bucket for storing database backups, the Consul backups are copied to that bucket as well. Timestamped directories are created at the time of the backup, with the files placed inside.
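For example, listing that directory on a Consul server shows the timestamped backup directories, newest first:

ls -lt /var/lib/postgresql/backups/consul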

Two types of backups are performed:

  • Snapshot backups: These are complete snapshots of the entire Consul cluster. Snapshots are a built-in feature of Consul, and provide a simple binary object which can be used to restore the Consul data to the state it was in when the snapshot was created.
    • Files: The name of the snapshot file will be consul-snapshot-<hostid>.snap.
    • Restoring: To restore a snapshot backup, use the consul command: consul snapshot restore <filename> (see the example after this list).
  • consul-backinator backups: consul-backinator is a tool for performing partial backups of Consul data. It can back up and restore subsets of the Consul data without impacting other portions of the data. It can also restore keys from a backup into an alternate prefix inside Consul, allowing a backup to be imported and examined without impacting existing data.
    • Files: The consul-backinator backups are created with names that indicate what portion of the Consul data is included in the file. The names of the files will be consul-backup-<what-data>-<hostid>.bak. consul-backinator also creates cryptographic signatures of the files to ensure they have not been corrupted; those file names end with .sig.
    • Restoring: To restore a consul-backinator backup, use the consul-backinator command: consul-backinator restore -file <filename> -delete true -prefix <prefix of the data being restored>. The delete and prefix parameters cause the tool to delete all existing keys from the listed prefix before performing the restore. Not including these parameters could result in corrupting the cluster data. See the example after this list.
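The following sketch illustrates both restore commands. The timestamp and host id are placeholders, the vault prefix is only an example of a prefix that might be restored, and a restore should not be attempted without reading the note below and consulting Apcera Support.

cd /var/lib/postgresql/backups/consul/<timestamp>
# Restore a complete snapshot of the Consul data
consul snapshot restore consul-snapshot-<hostid>.snap
# Restore only the (example) vault prefix from a consul-backinator backup,
# deleting any existing keys under that prefix first
consul-backinator restore -file consul-backup-vault-<hostid>.bak -delete true -prefix vault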

Note about restoring backups

Restoring backups of Consul data should only be necessary under extreme circumstances. It is strongly recommended that you consult with Apcera Support before attempting a restore. If performing a restore, you must stop all components inside the cluster that use the Consul data which you are restoring. This most likely means stopping all cluster services, or at least the specific services whose data you are restoring.

In some cases, restoring the data will then require a full deploy cycle on the cluster to fully restore all services. (For example, if restoring the data in the vault prefix, many cluster services will be unable to start until a deploy has been run to reset all of the integrations between the Apcera cluster components and Vault.)