Monitoring and Managing Job Instance Health

Apcera jobs are self-healing. In addition, Apcera provides features for monitoring and managing job health.

Component interactions

Apcera is a distributed system that uses NATS publish/subscribe messaging for component communications. Apcera monitors and manages the health of job instances by coordinating activities between the Job Manager, Health Manager, and Instance Manager components.

job-manager: The Job Manager (JM) is the authority on job state. The JM instructs the IM to start or stop a job instance and verifies that the job is running.
health-manager: The Health Manager (HM) is a watchdog process that monitors the intended state of each job against its actual state and publishes the result.
instance-manager: The Instance Manager (IM) starts, stops, and updates job instances (containers), and publishes instance heartbeat messages.
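
The code examples in this section use the NATS Go client to illustrate the messaging patterns involved. The subject names and payloads are illustrative assumptions, not Apcera's internal wire protocol. A minimal sketch of the JM-to-IM command flow over publish/subscribe:

// Minimal sketch of the JM-to-IM command flow over NATS pub/sub.
// The subject name and payload are hypothetical.
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	// An IM subscribes to start commands (hypothetical subject).
	nc.Subscribe("job.command.start", func(m *nats.Msg) {
		fmt.Printf("IM: starting instance for job %s\n", string(m.Data))
	})

	// The JM publishes a start command; every subscribed IM receives it.
	nc.Publish("job.command.start", []byte("job::/sandbox/NAME::node-todo"))
	nc.Flush()
	time.Sleep(100 * time.Millisecond) // let the handler run before exit
}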

Health checks

When an Instance Manager (IM) runs a job instance, it performs two health checks:

App start process: On job instance start or update, and every 60 seconds thereafter, the IM monitors the app start process. If the process is not running, the IM tears down the job instance. See the start process health check below.
Exposed port(s): On job instance start or IM restart, the IM checks each exposed port. If an exposed port is not open (the app is not listening for a connection on the port), and the port is not marked optional, the IM considers the job instance unhealthy and tears it down. See the port health check below.

Start process health check

The primary health check is to make sure that the start command you specified for the app is a running process on the container instance. If the start command exits, we consider the job instance to be in an unhealthy state and will tear it down.
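
A minimal sketch of what a start-process liveness check can look like, assuming a Unix host where signal 0 probes for process existence. This is an illustration of the technique, not the IM's actual implementation:

// Minimal sketch of a start-process liveness check on a Unix host:
// signal 0 checks for process existence without delivering a signal.
package main

import (
	"fmt"
	"os"
	"syscall"
)

// processRunning reports whether the process with the given PID exists.
func processRunning(pid int) bool {
	proc, err := os.FindProcess(pid) // always succeeds on Unix
	if err != nil {
		return false
	}
	return proc.Signal(syscall.Signal(0)) == nil
}

func main() {
	if processRunning(os.Getpid()) {
		fmt.Println("start process is running; instance considered healthy")
	} else {
		fmt.Println("start process exited; instance would be torn down")
	}
}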

Port health check

If you expose a port on a job (using apc job update myjob --port-add 3306, for example), you are telling the system "this workload should be listening on port 3306." If the system expects the job instance to be listening on the exposed port, but it isn't, we consider the job to be unhealthy and will tear it down. You will not be able to connect to the job via SSH, and you will receive the system error "Health probe for route on port(s)…failed" if you try to start the job.

You can use the --optional flag when exposing a port on a job (apc job update sample --port-add 3306 --optional). The job is then considered healthy even if the port does not respond to the health check. Note that if you use the web console to expose a port on a job, you must opt in to the port health check by selecting the Include In Health Check option.
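
A minimal sketch of a port health check in Go: attempt a TCP connection to the exposed port and treat a failed connection as unhealthy. The address and timeout values are illustrative.

// Minimal sketch of a port health check: dial the exposed port and
// treat a refused or timed-out connection as unhealthy.
package main

import (
	"fmt"
	"net"
	"time"
)

// portOpen reports whether something is listening at addr.
func portOpen(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	if portOpen("127.0.0.1:3306", 2*time.Second) {
		fmt.Println("port 3306 open: instance healthy")
	} else {
		fmt.Println("port 3306 closed: instance unhealthy (torn down unless --optional)")
	}
}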

Heartbeats

Every 60 seconds, each IM publishes a NATS message for each job instance it is running, indicating that the instance is alive. The heartbeat is fire-and-forget: the IM does not track who receives it, and each message is published on a subject name unique to the instance. Each IM concerns itself only with the instances it is running.
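
A minimal sketch of an IM-style heartbeat publisher using the NATS Go client; the per-instance subject layout is a hypothetical stand-in for Apcera's actual subjects:

// Minimal sketch of a heartbeat publisher. The subject layout
// ("instance.heartbeat.<id>") is hypothetical.
package main

import (
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	instanceID := "6f1c2d" // illustrative instance ID
	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		// Fire-and-forget: no reply is expected from any subscriber.
		nc.Publish("instance.heartbeat."+instanceID, []byte("running"))
	}
}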

The Health Manager subscribes to all heartbeat messages (using a wildcard subject). By default, if the Health Manager does not receive a heartbeat message for a job instance at least once within a 5-minute interval, it publishes a message that the job instance has failed. The Job Manager consumes this message and in turn publishes a message for the IM to restart the job instance.

The 5-minute default duration is configurable by the cluster admin in cluster.conf. It is the product of the heartbeat_interval (default 60 seconds) and the max_missed_heartbeats (default 5). However, Apcera cautions against changing these settings to avoid overreacting to brief network outages or other temporary errors.
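
A minimal sketch of the Health Manager side, under the same assumed subject layout: a wildcard subscription records the last heartbeat per instance, and a periodic sweep flags any instance that has been quiet for longer than heartbeat_interval times max_missed_heartbeats (60 seconds times 5, or 5 minutes). The failure subject is also hypothetical.

// Minimal sketch of a heartbeat watchdog: track last-seen times via a
// wildcard subscription and flag instances that miss the deadline.
package main

import (
	"fmt"
	"strings"
	"sync"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	const heartbeatInterval = 60 * time.Second
	const maxMissedHeartbeats = 5
	deadline := heartbeatInterval * maxMissedHeartbeats // 5 minutes

	var mu sync.Mutex
	lastSeen := map[string]time.Time{}

	// Wildcard subscription: one handler sees every instance's heartbeat.
	nc.Subscribe("instance.heartbeat.*", func(m *nats.Msg) {
		id := strings.TrimPrefix(m.Subject, "instance.heartbeat.")
		mu.Lock()
		lastSeen[id] = time.Now()
		mu.Unlock()
	})

	for range time.Tick(heartbeatInterval) {
		mu.Lock()
		for id, t := range lastSeen {
			if time.Since(t) > deadline {
				// The JM would consume this and ask an IM to restart.
				nc.Publish("instance.failed", []byte(id))
				fmt.Printf("instance %s marked failed\n", id)
			}
		}
		mu.Unlock()
	}
}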

Health score

Apcera implements eventual consistency. This is reflected by the job instance Health Score, which is the ratio of running job instances to requested job instances.

A health score of 100% (using APC) or 1.00 (using the web console) means all requested job instances are running (job status is Running or OK). A health score less than 100% (or 1.00) means not all requested instances are running (job status is Warning).

apc job health node-todo
Looking up "node-todo"... done
Retrieving job health... done
╭────────────────────┬────────────────────────────────╮
│ Job:               │ job::/sandbox/NAME::node-todo  │
├────────────────────┼────────────────────────────────┤
│ Status:            │ Running                        │
│ Health Score:      │ 100%                           │
│ Running Instances: │ 1/1                            │
╰────────────────────┴────────────────────────────────╯
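
The score itself is a simple ratio. Here is a minimal sketch of the calculation and the status mapping described above; the behavior when zero instances are requested is an assumption for illustration:

// Minimal sketch of the health score: running / requested instances.
package main

import "fmt"

// healthScore returns the score (0.0 to 1.0) and the resulting status.
func healthScore(running, requested int) (float64, string) {
	if requested == 0 {
		return 1.0, "Running" // assumption: nothing requested, nothing missing
	}
	score := float64(running) / float64(requested)
	if score >= 1.0 {
		return score, "Running"
	}
	return score, "Warning"
}

func main() {
	score, status := healthScore(1, 1)
	fmt.Printf("Health Score: %.0f%%  Status: %s\n", score*100, status)
}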

Job instance load balancing

By default, Apcera load balances job instances across all available IMs. When you start a job with multiple instances, the instances are likely to run on separate IMs, based on each IM's taint. You can override the default job scheduling behavior using job scheduling tags.

Health Manager HA

In a typical deployment you will have a single Health Manager. In large-scale deployments, you can choose to deploy two or more Health Managers. Apcera implements a Raft-based leader election algorithm so that only the elected leader is active; if the leader fails, a standby takes over. Note that the Health Manager does not maintain any state.

Job state

The following table summarizes the various job states.

ready: Job is successfully deployed but has never been started.
started: At least one of the requested number of job instances is started.
stopped: All job instances are stopped. This can occur if the user stops the job gracefully, or if performing a job update requires a job restart. The Job Manager publishes a message instructing the IMs to stop the job instance. All IMs receive the message; only the IMs running one or more instances act to stop the job.
finished: Job has run to completion successfully.
errored: Job has failed to start within 3 days. Once a job is in the "errored" state, it requires human intervention: addressing the issue(s) causing the app to crash and manually starting the job, for example using the apc app start command. See "flapping" below.

Job status

If a job is started, its job status may be OK or it may be Warning.

started / Running, OK: All of the requested job instances are running.
started / Warning: Not all of the requested job instances are running. See health score above.

Instance states and flags

The following table lists instance states.

starting: When you start a job, the Job Manager publishes a message requesting bids from IMs to start the specified number of job instances. The IMs subscribe to the message subject, and each IM responds according to its taint (current workload, scheduling tags, and other criteria). The first IM to respond "wins" and is assigned the task of starting the job instance, performing the health checks, and publishing instance heartbeats. See also job instance load balancing above and the sketch after this table.
failed: Job instance has failed a health check.
updating: Each time a job template is modified, the Job Manager publishes a job update message. The IMs use it to update instances; the HM uses it to update its intended state. A job update also triggers the check of the number of running instances.
restarting: The restart command changes the job state twice (stop, then start) and relaunches all instances, which entails some downtime. The order in which instances restart is not deterministic.
flapping: Job has failed to start 3 times within 5 minutes. If a job instance crashes, the IM attempts to restart it. If the instance crashes 3 times within 5 minutes, the job is placed into the "flapping" state. Once a job is flapping, the IM waits 5 minutes before trying to start the instance again, then goes back to watching for 3 failures within 5 minutes. If the job remains flapping for 3 days, it enters the "errored" state, at which point no more restarts are attempted.
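
The "first IM to respond wins" bid flow described under the starting state maps naturally onto NATS request/reply, which delivers exactly the first reply received. A minimal sketch under hypothetical subject names:

// Minimal sketch of the bid flow: the JM requests bids and the first
// IM to respond wins. Request returns only the first reply.
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	// Two IMs subscribe; in practice each would delay its reply
	// according to its taint, so loaded IMs answer late and lose.
	for _, im := range []string{"im-1", "im-2"} {
		im := im
		nc.Subscribe("job.bid", func(m *nats.Msg) {
			m.Respond([]byte(im))
		})
	}

	// The JM asks for bids; Request returns the first reply only.
	reply, err := nc.Request("job.bid", []byte("start 1 instance"), 2*time.Second)
	if err != nil {
		panic(err)
	}
	fmt.Printf("winning IM: %s\n", string(reply.Data))
}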