Troubleshooting Apcera Deployments

This section provides techniques for troubleshooting Apcera production deployments, including deployment failures, the general troubleshooting workflow, log collection, and related system administration tasks.

If cluster deployment fails

If an orchestrator-cli deploy run does not finish successfully, you should immediately work to resolve the error and re-run it. A failed deploy may result in only partial monitoring of the servers in the cluster.

If an initial deployment fails partway through, you may need to remove any VMs that were created. The proper way to do this is with the orchestrator-cli teardown command. Note that each time you tear down a cluster you will have to update your DNS records.

Use the command orchestrator-cli help to list the available command options.
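For example, a minimal recovery sequence for a failed initial deployment might look like the following sketch; the -c flag and the cluster.conf path are illustrative, so confirm the exact options with orchestrator-cli help:

# Remove the VMs created by the failed initial deployment
orchestrator-cli teardown

# Update your DNS records, then re-run the deployment
orchestrator-cli deploy -c cluster.conf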

General troubleshooting approach

The general troubleshooting approach is as follows:

1) Closely observe the output of the Orchestrator CLI during deployment.

2) Locate the error message(s) indicating where the deployment failed.

3) Determine which component failed and on which host that component is installed.

4) SSH to that host and grep the logs.

5) Alternatively, you can start the debugging process by viewing the chef-client log (or chef log) on the Orchestrator host.

To do this:

  • Run the command orchestrator-cli ssh to list all the components that Orchestrator can connect to.
  • Run ls -ltr to list the logs on that host.
  • Run less chef-client-20151211030033.log (for example) to scroll through the log and look for errors.
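Put together, a typical session might look like the following sketch; the log file name and timestamp are illustrative:

# On the Orchestrator host, list the hosts Orchestrator can connect to and SSH to the one you need
orchestrator-cli ssh

# On that host, find the most recent chef logs and page through them
ls -ltr
less chef-client-20151211030033.log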

Orchestrator agent issues

If you receive an error similar to "failed to verify agent on machine xxxxxxxx", log in to the offending machine and make sure the Orchestrator agent process is running. To restart the Orchestrator agent, run the command sudo service orchestrator-agent restart.
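For example, a quick check and restart on the affected machine might look like this:

# Verify that the Orchestrator agent process is running (the [o] avoids matching the grep itself)
ps aux | grep [o]rchestrator-agent

# Restart the agent if it is not running
sudo service orchestrator-agent restart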

Pulling the component logs

To collect the component logs, you can use a log collection facility. This facility is available with Orchestrator version 0.2.3 and later.

To use this tool:

1) Run the command orchestrator-cli collect logs all on the Orchestrator host.

This command logs in to each Apcera node, pulls the latest component logs, and packages them into a tarball for each host in the working directory on the Orchestrator machine. Once the collection completes, do the following:

2) Exit to your local machine.

3) Copy the logs using a command similar to the following: scp -r orchestrator@172.27.20.115:~/*-20151117012219.tar.gz ./. Repeat for the chef-client-*, chef-server-*, and martini-debug-* logs, as shown in the sketch below.
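For example, assuming the Orchestrator host address and timestamp shown above, and that the chef and martini logs sit alongside the tarballs (adjust the paths for your deployment), the copy commands might look like this:

# Run from your local machine
scp -r orchestrator@172.27.20.115:~/*-20151117012219.tar.gz ./
scp -r orchestrator@172.27.20.115:~/chef-client-* ./
scp -r orchestrator@172.27.20.115:~/chef-server-* ./
scp -r orchestrator@172.27.20.115:~/martini-debug-* ./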

For each component, we keep the "latest" component logs (the last 24 hours) and a few older log files before rotating them out entirely. The older log files are in the same location and their names start with "@".

For each component host, by default logs are collected from /var/log/continuum-*/current, /tmp/orchestrator-*, and /tmp/chef-*.

To specify what logs you want to collect, use the --from flag, for example:

orchestrator-cli collect all --from /var/log/some-log-dir/* /var/log/some-log-file

Layers of a Docker image fail to download

In cases where some layers of a Docker image fail to download when the image is retrieved from a Docker registry, the Job Manager's cache may be insufficiently sized, leading to cache contention. As of Apcera Platform release 3.2.2, the Job Manager cache size is set to 1000 jobs. The Job Manager cache size and related configuration settings can be set in the config.conf file. Use these entries only under the direction of Apcera Customer Support. For example:

"continuum": {
  ...
  "job_manager": {
    ...
    "db": {  
      "default_cache_size": 100,
      "job_cache_size": 1000,
      "mapped_fqn_cache_size": 2000,
      "allocator_cache_size": 1
    }
  }
}

System Administration for services outside the platform's scope of management

The Apcera Platform actively manages the resource utilization of the services under its scope of management. However, your cluster configuration may interact with services whose use of system resources is outside the platform's scope of management. For example, the Job Manager uses Redis as the persistent store for job states, and Redis's use of system resources is not managed by the platform. Over a long period of time the Redis AOF (append-only file) can grow beyond what Redis can manage. In such cases the remediation can be as simple as the system administrator deleting the file in question and then restarting the service, as sketched below.
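The following is a minimal sketch of that remediation for the Redis example; the service name and AOF path are assumptions and vary by deployment, so verify them (and back up the file) before deleting anything:

# On the host running the Redis instance used by the Job Manager
sudo service redis-server stop          # service name is an assumption
sudo rm /var/lib/redis/appendonly.aof   # AOF path is an assumption; back it up first if unsure
sudo service redis-server start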

For caches, restarting the VM instances that host Instance Managers and Package Managers can address cache management issues. Additionally, Package Managers should have their cache settings adjusted appropriately in cluster.conf (and potentially be rebuilt with greater package cache storage).

Configuring Splunk

To help troubleshoot cluster operations, you can integrate with Splunk to collect component and job logs. See configuring Splunk for details.