Monitoring and Managing Job Instance Health
Apcera jobs are self-healing. In addition, Apcera provides features for monitoring and managing job health.
- Component interactions
- Health checks
- Health score
- Job instance load balancing
- Health Manager HA
- Job state
- Job status
- Instance states and flags
Component interactions
Apcera is a distributed system that uses NATS publish/subscribe messaging for component communications. Apcera monitors and manages the health of job instances by coordinating activities between the Job Manager, Health Manager, and Instance Manager components.
|Job Manager||The Job Manager (JM) is the authority on job state. The JM instructs the IM to start or stop a job instance, and verifies that the job is running.|
|Health Manager||The Health Manager (HM) is a watchdog process that monitors intended state versus actual state and publishes the result.|
|Instance Manager||The Instance Manager (IM) starts, stops, and updates job instances (containers), and publishes instance heartbeat messages.|
Health checks
When an Instance Manager (IM) runs a job instance, it performs two health checks:
|App start process||On job instance start or update, and every 60 seconds thereafter, we monitor the app start process. If it is not running, we tear down the job. See Start process health check.|
|Exposed port(s)||On job instance start or IM restart, we check for any exposed ports. If an exposed port is not open (the app is not listening for a connection on the port), and the exposed port is not optional, the IM considers the job instance to be unhealthy and tears it down. See Port health check.|
Start process health check
The primary health check is to make sure that the start command you specified for the app is a running process on the container instance. If the start command exits, we consider the job instance to be in an unhealthy state and will tear it down.
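The start-process check amounts to a liveness probe against the start command's process. A minimal Python sketch of that idea (the function name and use of signal 0 are illustrative, not Apcera's implementation):

```python
import os

def start_process_alive(pid: int) -> bool:
    """Return True if the start process is still running.

    Sending signal 0 performs the existence check without
    actually delivering a signal to the process.
    """
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False          # process has exited
    except PermissionError:
        return True           # process exists but belongs to another user

# The IM runs this kind of check on instance start and every 60
# seconds thereafter; a False result tears the instance down.
```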
Port health check
If you expose a port on a job (using apc job update myjob --port-add 3306, for example), you are telling the system that "this workload should be listening on port 3306." If the system expects the job instance to be listening on the exposed port, but it isn't, we consider the job to be unhealthy and will tear it down. You will not be able to connect to the job via SSH, and you will receive the system error "Health probe for route on port(s)…failed" if you try to start the job.
You can use the --optional flag when exposing a port on a job (apc job update sample --port-add 3306 --optional). This results in the job being considered healthy even if the port does not respond to the health check. Note that ports with the --optional flag are not continuously monitored; if an application stops accepting connections on an optional port, automatic restart is not supported.
Note that if you use the web console to expose a port on a job, you have to opt in to the port health check by selecting the Include In Health Check option.
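The port check boils down to attempting a TCP connection to each exposed, non-optional port. A hedged Python sketch of that logic (the function names and the `exposed_ports` mapping of port number to optional flag are assumptions for illustration, not Apcera's API):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Probe a port the way a health check might: attempt a TCP
    connection and report whether it succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def instance_healthy(host: str, exposed_ports: dict) -> bool:
    """exposed_ports maps port number -> optional flag.
    Only non-optional ports count against instance health."""
    return all(port_open(host, port)
               for port, optional in exposed_ports.items()
               if not optional)
```

A closed non-optional port makes the instance unhealthy, while a closed optional port is ignored, matching the --optional behavior described above.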
Every 60 seconds, each IM publishes a NATS heartbeat message for each job instance it is running, indicating that the instance is alive. The IM doesn't care who receives these messages; each is an FYI published on a subject name unique to the instance. An IM is concerned only with the instances it runs.
The Health Manager subscribes to all heartbeat messages (using a wildcard subject). By default, if the Health Manager does not receive a heartbeat message for a job instance at least once within a 5-minute interval, it publishes a message that the job instance is failed. The Job Manager consumes this message and in turn publishes a message instructing the IM to restart the job instance.
The 5-minute default duration is configurable by the cluster admin in cluster.conf. The default duration is the product of heartbeat_interval (default 60 seconds) and max_missed_heartbeats (default 5). However, Apcera cautions against changing these settings, to avoid overreacting to brief network outages or other temporary errors.
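The Health Manager's side of this can be pictured as a watchdog that tracks the last heartbeat per instance and flags instances whose heartbeat is older than heartbeat_interval × max_missed_heartbeats. A minimal sketch, assuming the defaults above (class and method names are illustrative):

```python
import time

HEARTBEAT_INTERVAL = 60    # seconds between IM heartbeats (cluster.conf default)
MAX_MISSED_HEARTBEATS = 5  # cluster.conf default

# 60 s * 5 = 300 s, i.e. the 5-minute default failure window
FAILURE_WINDOW = HEARTBEAT_INTERVAL * MAX_MISSED_HEARTBEATS

class HeartbeatWatchdog:
    """Track the last heartbeat per instance; report instances whose
    heartbeat is older than the failure window as failed."""

    def __init__(self, window: float = FAILURE_WINDOW):
        self.window = window
        self.last_seen = {}

    def heartbeat(self, instance_id: str, now: float = None):
        self.last_seen[instance_id] = time.time() if now is None else now

    def failed(self, now: float = None):
        now = time.time() if now is None else now
        return [i for i, t in self.last_seen.items()
                if now - t > self.window]
```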
Health score
Apcera implements eventual consistency. This is reflected by the job instance health score, which is the ratio of running job instances to requested job instances.
A health score of 100% (using APC) or 1.00 (using the web console) means all requested job instances are running (job status is OK). A health score less than 100% (or 1.00) means not all requested instances are running. For example:

apc job health node-todo
Looking up "node-todo"... done
Retrieving job health... done
╭────────────────────┬────────────────────────────────╮
│ Job:               │ job::/sandbox/NAME::node-todo  │
├────────────────────┼────────────────────────────────┤
│ Status:            │ Running                        │
│ Health Score:      │ 100%                           │
│ Running Instances: │ 1/1                            │
╰────────────────────┴────────────────────────────────╯
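The health score is just the running-to-requested ratio. A trivial sketch (the zero-requested convention here is an illustrative choice, not documented Apcera behavior):

```python
def health_score(running: int, requested: int) -> float:
    """Ratio of running to requested job instances.
    1.0 means every requested instance is running."""
    if requested == 0:
        return 1.0  # illustrative: nothing requested counts as healthy
    return running / requested
```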
Job instance load balancing
By default, Apcera load balances job instances across all available IMs. When you start a job with multiple instances, one or more instances will likely run on separate IMs based on each IM's taint. You can override the default job scheduling behavior using job scheduling tags.
Health Manager HA
In a typical deployment you will have a single Health Manager. In large-scale deployments, you can choose to deploy two or more Health Managers. Apcera implements a Raft-based leader election algorithm so that only the master is active. In the case of failure, the backup can take over. Note that the Health Manager does not maintain any state.
Job state
The following table summarizes the various job states.
|ready||Job is successfully deployed, but has never been started.|
|started||At least one of the requested number of job instances is started.|
|stopped||All job instances are stopped. This can occur if the job is stopped gracefully by the user, or if performing a job update requires job restart. The Job Manager publishes a message instructing the IM to stop the job instance. All IMs receive the message. Only the IMs running one or more instances take action to stop the job.|
|finished||Job has run successfully.|
|errored||Job has failed to start within 3 days. Once a job is in the "errored" state, it requires human intervention, including addressing the issue(s) causing the app to crash and manually starting the job, for example, using the apc job start command.|
Job status
A started, healthy job reports one of the following statuses:
|OK||All of the requested job instances are running.|
| ||Not all of the requested job instances are running. See health score.|
Instance states and flags
The following table lists instance states.
|starting||When you start a job, the Job Manager publishes a message requesting bids from IMs to start the specified number of job instances. The IMs subscribe to the message subject. Each IM responds according to its taint (current workload, scheduling tags, and other criteria). The first IM to respond "wins" and is assigned the task of starting the job instance, performing the health checks, and publishing instance heartbeats. See also job instance load balancing.|
|failed||Job instance has failed a health check.|
|updating||Each time a job template is modified, the Job Manager publishes a job update message. The IMs use this to update instances. The HM uses it to update its intended state. Job update will trigger the check for the number of instances.|
|restarting||The restart command modifies the job state twice (start and stop) and relaunches all instances. This entails some downtime. The order in which instances restart is not determined.|
|flapping||Job has failed to start 3 times within 5 minutes. If a job instance crashes, the IM will attempt to restart it. If a job instance crashes 3 times within 5 minutes, the job is placed into the "flapping" state. Once a job is in the flapping state, the IM will wait 5 minutes before trying to start the instance again, and will then go back to watching for 3 failures within 5 minutes. If the job state remains flapping for 3 days, the job will enter the "errored" state at which point no more restarts are attempted.|
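The flapping rules above form a small state machine: three crashes inside a 5-minute window put the instance into flapping, and staying flapping for 3 days escalates to errored. A toy Python sketch of that bookkeeping (constants and class names are illustrative, not Apcera's implementation; the 5-minute retry backoff itself is handled by the IM and is not modeled here):

```python
CRASH_LIMIT = 3              # crashes that trigger flapping
CRASH_WINDOW = 5 * 60        # within this many seconds
ERROR_AFTER = 3 * 24 * 3600  # flapping this long -> errored

class FlapDetector:
    """Track crash timestamps for one instance and classify each
    new crash as restarting, flapping, or errored."""

    def __init__(self):
        self.crashes = []
        self.flapping_since = None

    def record_crash(self, now: float) -> str:
        # Keep only crashes inside the sliding window.
        self.crashes = [t for t in self.crashes if now - t <= CRASH_WINDOW]
        self.crashes.append(now)
        if (self.flapping_since is not None
                and now - self.flapping_since >= ERROR_AFTER):
            return "errored"   # no more restarts are attempted
        if len(self.crashes) >= CRASH_LIMIT:
            if self.flapping_since is None:
                self.flapping_since = now
            return "flapping"
        return "restarting"
```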