Restoring the Component Database

This section describes how to restore the component database (Postgres) in case of failure.

Component database master failure

If the component database master is unavailable, cluster functionality is severely limited. Running jobs are not stopped, but the Health Manager cannot function, so there is a risk of job downtime (if a job crashes for any other reason, it will not be automatically restarted). In addition, you cannot manage the cluster or any jobs using the Apcera APIs, the web console, or APC.

Cluster behavior is affected in this scenario only when the component database master is down or unavailable; the loss of a component database slave node does not have this effect.

The component database master may fail if:

  • The Central Host (where the component DB is typically deployed) is destroyed or unusable
  • The cluster cannot connect to the Central Host
  • The Postgres DB is not running
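
If you are not sure which of these applies, a quick reachability check from the orchestrator host can help narrow it down; the IP below is the first Central Host address from the example later in this section, so substitute your own:

     ping -c 3 10.75.0.1

If the host responds but cluster functionality is still degraded, the problem is more likely Postgres itself; step 1 below describes how to connect and investigate.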

If you have configured monitoring alerts, you should receive the following alerts if the component database master fails:

  • When the Postgres service fails on the master, you will receive the following alert:

    "Postgres database replication status", severity High, on the component database master host

  • When the component database master host is unreachable or unusable for more than 5 minutes, you will receive the following alert:

    "Zabbix agent on is unreachable for 5 minutes", severity Average

NOTE: It can take around 5 minutes for these alerts to arrive.

Restoring the component database

The platform does not currently have automatic failover for the component database. If the component database master fails, you must manually restore the component database using one of the slave hosts.

To restore full cluster function:

  1. SSH into your orchestrator machine.

    If the component database master host is running and accessible to the cluster, you may connect for troubleshooting purposes by running the following command:

     orchestrator-cli ssh component-database-master
    

    If you are unable to connect to the host, or if you want to fail over to a component database slave while you are resolving any issues with Postgres on the current master, you can remove the current master from your cluster.conf and redeploy.
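
    If the master host is reachable and you want to attempt a repair before deciding to fail over, a minimal first pass is to check the service status and the recent logs (a sketch; the log path assumes a Debian/Ubuntu-style Postgres layout and may differ on your deployment):

     sudo service postgresql status
     sudo tail -n 50 /var/log/postgresql/*.log

    If the service is simply stopped, you can try sudo service postgresql start before resorting to a failover.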

  2. Get the IP addresses for all component-database nodes.

     orchestrator-cli ssh component-database
    

    This command lists all component database hosts along with their machine tags; the master has an extra "component-database-master" tag. Take note of each component-database IP address and note which host is the master and which are the slaves.
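
    As a cross-check, you can also ask Postgres directly which role a host is currently playing: pg_is_in_recovery() returns 'f' on the master and 't' on a slave (a sketch, assuming the default postgres superuser is available on the host):

     sudo -u postgres psql -tAc "SELECT pg_is_in_recovery();"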

  3. SSH into each component-database slave node and verify that Postgres is online.

     sudo service postgresql status
    

    Postgres should be online on each slave, though the slaves may be in recovery.
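
    If a slave is in recovery, you can also check when it last replayed data from the master, which gives a rough sense of how far behind it is (a sketch, again assuming the default postgres superuser):

     sudo -u postgres psql -c "SELECT pg_last_xact_replay_timestamp();"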

  4. Locate the component database hosts' IP addresses in cluster.conf.

    All component database hosts are listed together under the central host entry.

    For example:

     central: {
       hosts: ['10.75.0.1', '10.75.0.2', '10.75.0.3']
       suitable_tags: [
         "component-database"
         "job-manager"
         "stagehand"
         "cluster-monitor"
         "package-manager"
         "health-manager"
         "metrics-manager"
         "events-server"
         "nats-server"
         "auth-server"
         "api-server"
       ]
     }
    
  5. Remove the component database master's IP address from the hosts list (along with the addresses of any slaves that are not functioning).

    At this point, you may also want to provision replacement machines. Add the IP addresses of any replacement machines to the hosts list from which you removed the master.
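
    For example, continuing the hosts entry shown above: if '10.75.0.1' was the failed master and you provision a replacement machine at the hypothetical address '10.75.0.4', the hosts list would become:

     hosts: ['10.75.0.2', '10.75.0.3', '10.75.0.4']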

  6. After you have a final list of component database hosts, adjust the component count in cluster.conf.

    For example:

     Central Components
     component-database: 3
     job-manager: 3
     package-manager: 3
     cluster-monitor: 3
     health-manager: 3
     metrics-manager: 3
     nats-server: 3
     events-server: 3
     auth-server: 3
     api-server: 3
    

    NOTE: If the number of component database hosts has changed (for example, if you have chosen to remove the master from the cluster without provisioning a replacement), make sure you also adjust the component counts for any other tags on the removed host.
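
    For example, if you removed the failed master without provisioning a replacement, the counts from the example above would drop from 3 to 2 for the component database and for every other component that was tagged on the removed host (an illustrative sketch only):

     component-database: 2
     job-manager: 2
     package-manager: 2

    Repeat the adjustment for each remaining component that ran on the removed host.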

  7. Once you have edited your cluster.conf, redeploy the cluster:

     orchestrator-cli deploy -c cluster.conf --dry-run
     orchestrator-cli deploy -c cluster.conf
    

    This will automatically select a new component database master and set up replication with existing and newly provisioned slaves. Once the deploy is complete, full cluster functionality is restored. If apps have crashed during the failure, the health manager will automatically restart them at this point, and no manual steps are needed.

    NOTE: It takes around 20 minutes to redeploy the cluster. Cluster functionality may be restored before the redeploy completes.
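
    Optionally, once the deploy has finished, you can confirm that replication is healthy by connecting to the new master and checking that the remaining slaves show up as connected standbys; pg_stat_replication is a standard Postgres view (a sketch, assuming the default postgres superuser):

     orchestrator-cli ssh component-database-master
     sudo -u postgres psql -c "SELECT client_addr, state FROM pg_stat_replication;"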