Troubleshooting Apcera Deployments
This section provides techniques for troubleshooting Apcera production deployments, including deployment failures, the general troubleshooting workflow, Orchestrator agent issues, and component log collection.
If cluster deployment fails
If an orchestrator-cli deploy run does not finish successfully, you should immediately work to resolve the error and re-run the deploy. A failed deploy may result in only partial monitoring of the servers in the cluster.
If an initial deployment fails partially, you may need to remove any VMs that were created. The proper way to do this is to use the command orchestrator-cli teardown. Note that each time you perform a teardown you will have to update your DNS records.
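For example, a minimal recovery sequence after a failed initial deployment might look like the following sketch; the -c cluster.conf argument is an assumption, so substitute the options and configuration file you normally deploy with:

orchestrator-cli teardown
orchestrator-cli deploy -c cluster.conf

Update your DNS records between the teardown and the new deploy.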
Use the command orchestrator-cli help to list the available command options.
General troubleshooting approach
The general troubleshooting approach is as follows:
1) Closely observe the output of the Orchestrator CLI during deployment.
2) Locate where the deployment failed using the error message(s).
3) Determine what component failed and on what host that component is installed.
4) SSH to that host and grep the logs.
5) Alternatively, to start the debugging process, you may want to view the chef-client.log or chef-log on the Orchestrator host.
To do this:
- Run the command orchestrator-cli ssh to list all the components (that Orchestrator can connect to).
- Run the command ls -ltr to list the logs on that host.
- Run the command less chef-client-20151211030033.log (for example) to scroll through the log and list the errors.
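Once you have SSHed to the host running the failed component, a quick search of its logs might look like the following sketch; the continuum-* path assumes the default log location described under "Pulling the component logs" below:

grep -i error /var/log/continuum-*/current
less /var/log/continuum-*/current

Grepping for errors case-insensitively usually narrows down the failure faster than paging through the full log.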
Orchestrator agent issues
If you receive an error similar to "failed to verify agent on machine 'xxxxxxxx'", log in to the offending machine and make sure the Orchestrator agent process is running. To start the Orchestrator agent, run the command sudo service orchestrator-agent restart.
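For example, a quick check-and-restart sequence might look like the following sketch; the bracketed grep pattern simply keeps grep from matching its own process:

ps aux | grep [o]rchestrator-agent
sudo service orchestrator-agent restart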
Pulling the component logs
To collect the component logs, you can use a log collection facility. This facility is available with Orchestrator version 0.2.3 and later.
To use this tool:
1) Run the command orchestrator-cli collect logs all on the Orchestrator host.
This logs in to each Apcera node, pulls the latest component logs, and packages them into a per-host tarball in the working directory on the Orchestrator machine. Once done, you can then do the following:
2) Exit out to your local machine.
3) Copy the logs using a command similar to the following:
scp -r orchestrator@172.27.20.115:~/*-20151117012219.tar.gz ./
You should also do the same for the chef-client-*, chef-server-*, and martini-debug-* logs.
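For example, assuming the chef and martini logs also sit in the orchestrator user's home directory on the Orchestrator machine (the same illustrative address is used as above), the additional copies might look like this:

scp orchestrator@172.27.20.115:~/chef-client-* ./
scp orchestrator@172.27.20.115:~/chef-server-* ./
scp orchestrator@172.27.20.115:~/martini-debug-* ./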
For each component, we keep the "latest" component logs (the last 24 hours) and a few older log files before rotating them out entirely. The older log files are in the same location and their names start with "@".
For each component host, by default logs are collected from /var/log/continuum-*/current, /tmp/orchestrator-*, and /tmp/chef-*.
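To preview what would be collected from a given host, you can list those default locations manually on the component host, for example:

ls -l /var/log/continuum-*/current /tmp/orchestrator-* /tmp/chef-*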
To specify which logs you want to collect, use the --from flag, for example:
orchestrator-cli collect all --from /var/log/some-log-dir/* /var/log/some-log-file
Configuring Splunk
To help troubleshoot cluster operations, you can integrate with Splunk to collect component and job logs. See configuring Splunk for details.