Restarting Cluster Hosts

On occasion, after performing a cluster upgrade, you may have to restart one or more cluster hosts. For example, if you are upgrading to Apcera Platform release 2.2.3, you must reboot all cluster hosts after the upgrade.

Starting with Orchestrator version 0.5.3, available with the Apcera Platform 2.4.0 release, you can automatically reboot cluster components using the orchestrator-cli reboot command. The manual reboot procedure is deprecated in favor of the automated approach; it is provided here for reference and troubleshooting.

When performing an automated or manual cluster reboot, each Instance Manager (IM) is evacuated and its jobs are rescheduled elsewhere before that IM is rebooted. Assuming all jobs have multiple instances and the cluster has sufficient capacity, there should be no job downtime, but every job is restarted during the process.

Automated Reboot Instructions

You can use the orchestrator-cli reboot [args] command to perform a rolling reboot of the cluster. Machines in the Apcera cluster are rebooted in a prescribed order that is specific to the deployed release.

To perform a rolling reboot, run the command orchestrator-cli reboot -c cluster.conf [args]. With this command, Orchestrator automatically reboots all cluster nodes. The reboot stops on any failure, including pre- and post-check failures.

When a reboot is required after an upgrade or fresh deploy, Orchestrator presents a warning message indicating that a reboot is required. Note that you will not be warned to reboot for cluster configuration changes that may require nodes to be restarted.

Reboot status

As shown below, you can view the "Reboot" status in the cluster status table using the command orchestrator-cli status -c cluster.conf. Note that the Reboot status is only accurate for release 2.4.0 and later.

[Screenshot: cluster status table showing the Name, Machine Config, and Reboot columns]

As shown in the above screenshot, there are two possible reboot states: required and blank. The following table describes each state:

Status Description
Required Indicates that the machine will be rebooted when you run the orchestrator-cli reboot command without the --force flag.
Blank Reboot is not required or has already been performed successfully.

Reboot options

When you run the orchestrator-cli reboot command without any options, the system performs a pre-check, a reboot, and a post-check for each machine whose Reboot status is "required." The system halts the entire reboot process on any failure. To resume the reboot process, simply rerun the reboot command. The system will skip machines that were already successfully rebooted (any machine whose reboot status is blank).

Option Description
--dry-run Displays all processing steps that an actual reboot command would go through. The Reboot status shown for each machine is the known status at the time the dry run is performed. Note that this option does not perform any pre- or post-checks, and does not perform any reboots.
--machines Reboots only the specified machines whose Reboot status is "required." Specify one or more machines in a comma-separated list (without spaces), identified by the string in the "Name" column of the status output (see screenshot above). Strings from the "Machine Config" column can also be used in the list; using a Machine Config string operates on an entire class of machines.
--force Reboots every machine in the cluster, even if a machine's Reboot status is not "required" (blank).

Reboot command examples

To view machine reboot status for all machines in the cluster:

orchestrator-cli status -c cluster.conf

To view the default reboot order for all machines in the cluster:

orchestrator-cli reboot --dry-run -c cluster.conf

To perform a reboot of all "required" machines:

orchestrator-cli reboot -c cluster.conf

To perform a reboot of all machines regardless of reboot status:

orchestrator-cli reboot --force -c cluster.conf

To perform a reboot of specific machines whose status is "required":

orchestrator-cli reboot --machines "instance-manager,central" -c cluster.conf

Troubleshooting automated rolling reboots

If a host reboot fails, you will see the error message "Failed to complete cluster reboot". The first step is to look in the Chef client log for FATAL-level messages. The console output identifies the specific log file; near the top you will see, for example, "Chef client log file: reboot-chef-client-271060831223112.log". A FATAL message in that log shows the specific failure; if you grep the log for FATAL, the first result is usually the specific error.
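For example, using the log file name reported in the console output above:

grep FATAL reboot-chef-client-271060831223112.log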

If you receive Gluster host reboot errors, SSH into the machine and run sudo gluster volume status all to see the volume status and any errors. If the command reports a count greater than 0, wait for the count to drop to 0, indicating that all Gluster volumes are ready, and then rerun the reboot process for the Gluster hosts.
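If you need to wait, one way to poll the volume status until it settles is with watch (a sketch; the 10-second interval is arbitrary):

sudo watch -n 10 'gluster volume status all'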

If you receive Riak host reboot errors, SSH to the Riak host that failed and run the command riak-admin member-status. This command lists all member nodes in the Riak cluster and their status. If you receive a Riak error indicating that the "ring status" cannot be read, run the command riak-admin ring-status until the output reports "Ring Ready: true", "No pending changes", and "All nodes are up and reachable." If you receive a Riak error related to transfers, run the command riak-admin transfers and wait until it reports "No transfers active."
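You can poll each of these Riak checks until it reports a healthy state, for example (the interval is arbitrary):

watch -n 10 'riak-admin ring-status'
watch -n 10 'riak-admin transfers'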

If you receive Postgres host reboot errors, ssh to the Postgres host and run the following command (see also the manual reboot instructions below):

sudo su - postgres -c "psql -c 'select client_addr, state, sent_location, write_location, flush_location, replay_location from pg_stat_replication;'"

If the number of replication streams returned does not match the expected number, rerun the command until the stream count matches the expected number. If you receive a Postgres replication error, run the command above and refer to the manual steps below.

An additional error you may see in the console output (not in the reboot-chef-client log) is “Failed to read reboot script file.” Orchestrator rolling reboots can only be used with Apcera Platform 2.4.0 and later. The fix is to upgrade your cluster to 2.4.0 and Orchestrator to 0.5.3.

Timing for reboots

Keep in mind the following timing considerations when performing an automated cluster reboot. These are typical times; the process may take longer if you have connection issues or if any of the checks need to retry.

Host Time Description
Base ~3 minutes The base time to reboot a single cluster host is approximately 3 minutes, including pre- and post-checks.
Postgres ~7 minutes It takes approximately 4 more minutes to reboot each Postgres host in your cluster (such as the Component DB and Auditlog DB hosts).
IM ~10 minutes It takes approximately 7 more minutes per Instance Manager for all the job migration and IM-specific checks.
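As a rough illustration (hypothetical cluster sizes), rebooting a cluster with 10 base hosts, 2 Postgres hosts, and 3 IMs would take approximately (10 × 3) + (2 × 7) + (3 × 10) = 74 minutes end to end.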

Keep in mind that if you run the orchestrator-cli reboot command without specifying one or more hosts, the system performs pre- and post-checks on all hosts, including those that do not require a reboot. To save time when resuming the reboot process or targeting only a few machines, use the --machines flag and specify only those machines whose Reboot status is "required."

Lastly, the "reboot_delay" setting is confiurable via cluster.conf. The default is no delay ("reboot_delay": "now"). If you want to change this you can add the following to your cluster.conf file, changing the value "now" to "2" or "3" (for example). The "reboot_delay" parameter accepts as an argument either the string “now” or an integer in minutes, such as "3".

chef: {
  "continuum": {
    "cluster_reboot": {
      "reboot_delay": "now"
    }
  }
}
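For example, to delay each host's reboot by three minutes instead of rebooting immediately, set the value to "3":

chef: {
  "continuum": {
    "cluster_reboot": {
      "reboot_delay": "3"
    }
  }
}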

In addition, you can configure the shutdown_mode for Instance Managers. The default is shown below:

chef: {
  "continuum": {
    "cluster_reboot": {
      "instance_manager": {
        "shutdown_mode": "evacuation"
      }
    }
  }
}

Manual Reboot Instructions

If you are not using Orchestrator 0.5.3 or later, you can follow the instructions below to manually restart cluster hosts.

1) Create a list of all hosts to track the status.

Run orchestrator-cli ssh to get a list of all hosts in the cluster.

Hit Control-C to exit from the menu.

Copy that list to a text editor, spreadsheet, or some other system of your choice.

As you proceed through this process, note every host you reboot.

2) Reboot the auditlog-secondary host.

This is the auditlog-database host without the auditlog-database-master tag.

Before rebooting, validate that Postgres replication is working. On the auditlog-database-master host, validate that Postgres is streaming by running the following commands:

orchestrator-cli ssh /auditlog-database-master

sudo su - postgres -c "psql -c 'select client_addr, state, sent_location, write_location, flush_location, replay_location from pg_stat_replication;'"

You should see output similar to the following:

client_addr |   state   | sent_location | write_location | flush_location | replay_location
-------------+-----------+---------------+----------------+----------------+-----------------
172.27.0.89 | streaming | 0/7915520     | 0/7915520      | 0/7915520      | 0/7915520

On the auditlog-database slave you are about to reboot, validate the replication delay is minimal:

orchestrator-cli ssh /auditlog-database

Select the host NOT tagged with auditlog-database-master.

Run this command:

sudo su - postgres -c "psql -c 'select now() - pg_last_xact_replay_timestamp() AS replication_delay;'"

You should see output similar to the following:

replication_delay
-------------------
00:00:17.959963
(1 row)

Run the following command:

orchestrator-cli ssh /auditlog-database

Select the host NOT tagged with auditlog-database-master.

Run the following command:

sudo reboot

Wait for the auditlog-database server to reboot.

After the server is back up, run the same Postgres validation commands on both hosts again to verify that Postgres replication is working.

3) Reboot the auth-server host.

NOTE: Restarting this host causes a brief outage for the API Server.

Run the following commands:

orchestrator-cli ssh /auth-server

sudo reboot

Wait for the auth-server host to reboot.

Verify that you can log out and log in via APC or the web console.

4) Reboot the auditlog-database-master.

Run the following commands:

orchestrator-cli ssh /auditlog-database-master

sudo reboot

Wait for the auditlog-database-master server to reboot.

As root on the auditlog-database-master, validate that Postgres is streaming:

sudo su - postgres -c "psql -c 'select client_addr, state, sent_location, write_location, flush_location, replay_location from pg_stat_replication;'"

You should see output similar to the following:

client_addr |   state   | sent_location | write_location | flush_location | replay_location
-------------+-----------+---------------+----------------+----------------+-----------------
172.27.0.89 | streaming | 0/7921038     | 0/7921038      | 0/7921038      | 0/7921038

As root on the auditlog-database slave, validate that the replication delay is minimal:

Run the following commands:

orchestrator-cli ssh /auditlog-database

sudo su - postgres -c "psql -c 'select now() - pg_last_xact_replay_timestamp() AS replication_delay;'"

You should see output similar to the following:

 replication_delay
-------------------
 00:00:01.905033
(1 row)

5) Reboot the component-database servers, one at a time.

These are the component-database hosts without the component-database-master tag. At this point in the process, reboot each non-master component-database server in turn, as described below. If you have more than one, wait for the server you rebooted to come back up successfully before rebooting the next.

On the component-database-master, validate that Postgres is streaming:

Run the following commands:

orchestrator-cli ssh /component-database-master

sudo su - postgres -c "psql -c 'select client_addr, state, sent_location, write_location, flush_location, replay_location from pg_stat_replication;'"

You should see output similar to the following. The number of rows should match the number of non-master database servers in your cluster:

 client_addr |   state   | sent_location | write_location | flush_location | replay_location
-------------+-----------+---------------+----------------+----------------+-----------------
 172.27.0.72 | streaming | 0/3066C648    | 0/3066C648     | 0/3066C648     | 0/3066C648
(1 row)

TROUBLESHOOTING: If the "validate Postgres is streaming" test fails: Go to the machine and check if Postgres is running. Also check the console. If the console does not work or behaves strangely, contact Apcera support.

On the component-database slave you are about to reboot, validate that the replication delay is minimal (replication delay might increase if there are no API calls being made).

Run the following command:

orchestrator-cli ssh /component-database

Select a host NOT tagged with "component-database-master." (There may be more than one; the others will be rebooted later.)

Run the following command:

sudo su - postgres -c "psql -c 'select now() - pg_last_xact_replay_timestamp() AS replication_delay;'"

Verify results similar to the following:

 replication_delay
-------------------
 00:00:02.035951
(1 row)

To test the replication delay, make a trivial change (such as stopping or starting a job) and verify that the replication delay shows a recent update.

TROUBLESHOOTING: If the "validate the replication delay" test fails: The validation of the replication delay can only fail if you have an empty cluster (nothing at all was done). Once you create one job, the replication delay test should success. If not, contact Apcera Support.

Now reboot that host:

sudo reboot

Wait for the component-database server to reboot.

After the server is back up, run the same Postgres validation commands on both hosts again to verify that Postgres replication is working.

6) Reboot the component-database-master.

NOTE: Restarting this host causes a brief outage for the API Server.

Run the following commands:

orchestrator-cli ssh /component-database-master

sudo reboot

Wait for the component-database-master server to reboot.

On the component-database-master, validate Postgres is streaming.

Run the following commands:

orchestrator-cli ssh /component-database-master

sudo su - postgres -c "psql -c 'select client_addr, state, sent_location, write_location, flush_location, replay_location from pg_stat_replication;'"

Verify results similar to the following:

 client_addr |   state   | sent_location | write_location | flush_location | replay_location
-------------+-----------+---------------+----------------+----------------+-----------------
 172.27.0.72 | streaming | 0/3066C750    | 0/3066C750     | 0/3066C750     | 0/3066C750
(1 row)

As root on the component-database slave, validate that the replication delay is minimal:

Run the following command:

orchestrator-cli ssh /component-database

Select any host other than the master.

Run the following command:

sudo su - postgres -c "psql -c 'select now() - pg_last_xact_replay_timestamp() AS replication_delay;'"

Verify results similar to the following:

 replication_delay
-------------------
 00:00:02.11448
(1 row)

7) Reboot the graphite server.

orchestrator-cli ssh /graphite-server

sudo reboot

Wait for graphite-server to reboot.

8) Reboot every other non-IM host, using best judgement for parallelism.

orchestrator-cli ssh

Select a host from the list that is NOT marked as an instance-manager and that has NOT already been rebooted. Hosts in this step can be rebooted in any order.

sudo reboot

Repeat until all non-instance-manager hosts are rebooted.

Note the following:

  • On reboot of a Riak cluster, you can use the command riak-admin member-status to verify that each replication member is operating after a reboot.
  • On reboot of a Gluster cluster, you can use the command sudo gluster volume status all to verify that volume replication exists on each cluster node.

9) Reboot all the Instance Manager (IM) hosts, one at a time.

orchestrator-cli ssh /instance-manager

Select an IM host from the list.

Run the following command:

sudo touch /var/run/instance_manager-stop && sudo sv 2 /etc/sv/continuum-instance-manager && sleep 301 && sudo reboot

Repeat the process until all IM hosts are rebooted.

NOTE: On each IM host, this command notifies the IM process to force all jobs to be relocated to other hosts. The IM waits up to five minutes for all jobs to be rescheduled elsewhere before killing any remaining jobs and exiting.
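For reference, the reboot command above breaks down as follows (the marker file's exact role is an assumption based on its name; the signal is sent via runit's sv, and the sleep matches the five-minute evacuation window):

sudo touch /var/run/instance_manager-stop      # marker file indicating an intentional IM stop (assumed purpose)
sudo sv 2 /etc/sv/continuum-instance-manager   # send SIGUSR2 to the IM service via runit, notifying the IM to evacuate its jobs
sleep 301                                      # wait just over five minutes for jobs to be rescheduled elsewhere
sudo reboot                                    # reboot the host once evacuation has had time to complete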