Troubleshooting Apcera Deployments

This section provides techniques for troubleshooting Apcera production deployments, including deployment failures, the general troubleshooting workflow, and component log collection.

If cluster deployment fails

If an orchestrator-cli deploy run does not finish successfully, you should resolve the error and re-run the deploy as soon as possible. A failed deploy may leave the servers in the cluster only partially monitored.

If an initial deployment fails partway through, you may need to remove any VMs that were created. The proper way to do this is with the command orchestrator-cli teardown. Note that after each teardown you must update your DNS records.

Use the command orchestrator-cli help to list the available commands and options.
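For example, a recovery session on the Orchestrator host might look like the following sketch (the cluster.conf file name and the -c flag are assumptions; substitute the deploy invocation you used originally):

# Remove the partially created VMs from the failed initial deploy
orchestrator-cli teardown

# Update your DNS records, fix the error, then re-run the deploy
orchestrator-cli deploy -c cluster.conf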

General troubleshooting approach

The general troubleshooting approach is as follows:

1) Closely observe the output of the Orchestrator CLI during deployment.

2) Locate where the deployment failed from the error message(s).

3) Determine which component failed and on which host that component is installed.

4) SSH to that host and grep the logs.

5) Alternatively, to start the debugging process, you may want to view the chef-client.log (or other chef-* logs) on the Orchestrator host.

To do this, as shown in the example session below:

  • Run the command orchestrator-cli ssh to list the components (hosts) that Orchestrator can connect to.
  • Run the command ls -ltr to list the logs on that host, with the newest at the bottom.
  • Run the command less chef-client-20151211030033.log (for example) to scroll through the log and find the errors.
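A minimal session might look like this (the log file name and timestamp are illustrative; yours will differ):

orchestrator-cli ssh                   # list the hosts Orchestrator can connect to, then choose one
ls -ltr                                # newest log files appear at the bottom of the listing
less chef-client-20151211030033.log   # page through the log; type /error to search for errors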

Orchestrator agent issues

If you receive an error similar to "failed to verify agent on machine xxxxxxxx", log in to the offending machine and make sure the Orchestrator agent process is running. To start (or restart) the Orchestrator agent, run the command sudo service orchestrator-agent restart.
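For example, on the offending machine (a sketch; pgrep and service are standard Linux tools, and the agent's process name is assumed to match the service name):

pgrep -fl orchestrator-agent              # check whether the agent process is running
sudo service orchestrator-agent restart   # start (or restart) the agent
pgrep -fl orchestrator-agent              # verify the agent is running again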

Pulling the component logs

To collect the component logs, you can use the Orchestrator log collection facility, which is available with Orchestrator version 0.2.3 and later.

To use this tool:

1) Run the command orchestrator-cli collect logs all on the Orchestrator host.

This logs in to each Apcera node, pulls the latest component logs, and packages them into one tarball per host in the working directory on the Orchestrator machine. Once it completes, you can then do the following:

2) Exit out to your local machine.

3) Copy the logs using a command similar to the following: scp -r orchestrator@172.27.20.115:~/*-20151117012219.tar.gz ./. You should also copy the chef-client-*, chef-server-*, and martini-debug-* logs, as in the sketch below.
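For example, reusing the Orchestrator host address and timestamp from the step above (adjust the address, user, and timestamps to your environment):

scp -r orchestrator@172.27.20.115:~/*-20151117012219.tar.gz ./
scp -r orchestrator@172.27.20.115:~/chef-client-* ./
scp -r orchestrator@172.27.20.115:~/chef-server-* ./
scp -r orchestrator@172.27.20.115:~/martini-debug-* ./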

For each component, we keep the "latest" component logs (the last 24 hours) and a few older log files before rotating them out entirely. The older log files are in the same location, and their names start with "@".

For each component host, by default logs are collected from /var/log/continuum-*/current, /tmp/orchestrator-*, and /tmp/chef-*.

To specify what logs you want to collect, use the --from flag, for example:

orchestrator-cli collect logs all --from /var/log/some-log-dir/* /var/log/some-log-file
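Once the tarballs are on your local machine, they unpack with standard tar, for example (the host name in this file name is hypothetical):

tar -xzf router-20151117012219.tar.gz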

Configuring Splunk

To help troubleshoot cluster operations, you can integrate with Splunk to collect component and job logs. See Configuring Splunk for details.