Apcera Architecture and Components

Reference Architecture

Apcera is a distributed system comprising several components, each of which performs a specific function within the system. Apcera components communicate through the NATS messaging system, and those messages are signed and validated using Elliptic Curve (EC) public-private key pairs. Each component therefore has an Elliptic Curve key and communicates with the Security Server (Auth Server) to generate signed ephemeral tokens. These component class keys are stored in Vault, and components talk to Vault via TLS/TCP.
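
As a rough illustration of the kind of Elliptic Curve signing and verification described above, the following Go sketch uses the standard crypto/ecdsa package. The message format and key handling are simplified stand-ins (in a real cluster the key would be retrieved from Vault over TLS), not Apcera's actual implementation.

```go
// Illustrative sketch only: signing a component message with an EC key
// before it is published, and verifying it on the receiving side.
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

func main() {
	// In a real cluster the component key would come from Vault over TLS;
	// here we simply generate one for illustration.
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}

	msg := []byte(`{"subject":"job.start","payload":"..."}`) // hypothetical message
	digest := sha256.Sum256(msg)

	// Sign the digest; the receiver validates it with the public key.
	sig, err := ecdsa.SignASN1(rand.Reader, key, digest[:])
	if err != nil {
		panic(err)
	}

	ok := ecdsa.VerifyASN1(&key.PublicKey, digest[:], sig)
	fmt.Printf("signature valid: %v (%d bytes)\n", ok, len(sig))
}
```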

In addition, Apcera component processes run a policy engine to interpret and enforce policy and an agent for internal monitoring.


Apcera components

An Apcera component is a process on a node in a cluster that communicates with other components using messaging and is governed by the policy engine. Components act as managers to perform their functions and are not managed by other components.

| Component | Description | Count | HA Considerations |
| --- | --- | --- | --- |
| api-server | Provides HTTP API endpoints for the cluster. | 1 or more | Scale horizontally by running multiple to handle a large number of concurrent client connections coming from APC, the web console, or custom clients. |
| auditlog-db | Stores audit logs in a PostgreSQL DB. | 1 or more | Scale horizontally by running multiple. See audit log HA. |
| auth-server | Cluster Security Server for encrypted policy and NATS key storage and distribution. | 1 or more | Scale horizontally by running multiple. |
| cluster-monitor | Reports real-time cluster statistics. | 1 or more | Scale horizontally by running multiple. |
| cluster-object-storage | Gluster package storage backend. | 0 or 3 x N | Use in production for HA package-manager storage when there is no other S3-compatible storage. |
| component-db | Stores cluster artifacts in a PostgreSQL DB. | 1 or more | Scale horizontally by running multiple. |
| events-server | Streams life-cycle and resource usage events for cluster resources (a job, package, or route, for example) to subscribed clients. Also manages client event subscriptions and garbage-collects subscriptions for disconnected clients. | 1 or more | Scale horizontally by running multiple. |
| flex-auth-server | Central authority for authentication: basic-auth-server, google-auth-server, ldap-auth-server, keycloak-auth-server, app-auth-server (for App Token). | 1 or more | Scale horizontally by running multiple Flex Auth components. |
| gluster-server | Provides HA NFS persistence. | 0 or 3 x N | (Optional) Recommended for production clusters requiring persistence. The count is a multiple of 3 for replication (3 x N). See HA NFS persistence. |
| graphite-server | Storage for cluster metrics. | 1 exactly | Singleton. See Graphite storage. |
| health-manager | Calculates and reports job health. | 1 or more | Scale horizontally by running multiple HMs. |
| instance-manager | Runtime environment for job instances. | 1 or more | Scale horizontally or vertically to run more job instances. In production, run 3 or more IMs. |
| ip-manager | Provides static IP addressing for integrating with legacy systems that require fixed IP addresses. | 0 or 1 | Optional singleton. |
| job-manager | Manages jobs (workloads) deployed to the cluster. | 1 or more | Scale horizontally by running multiple. |
| kv-store | Key-value storage system for the cluster (Consul). | 1 or more | Scale horizontally by running multiple. |
| metrics-manager | Handles statsd traffic and reports cluster statistics over time. | 1 or more | Scale horizontally by running multiple. |
| monitoring | Component monitoring: Zabbix server and database (PostgreSQL DB). | 0 or 1 | Typically both components are installed on the same host. For HA, use RDS or install the DB in HA mode on a separate host. Use an external monitoring system to monitor the server. |
| nats-server | Message bus for component communications. | 1 or more | Scale horizontally by running multiple. |
| nfs-server | Provides the NFS persistence layer. | 0 or 1 | Optional singleton. For HA, use gluster-server x 3. |
| orchestrator-database, orchestrator-server | Used to install cluster software, manage cluster deployments, collect component logs, etc. Includes a PostgreSQL DB. | 1 exactly | Run on a VM host, keep cluster.conf under version control, and back up the DB regularly. See the Orchestrator documentation. |
| package-manager | Manages distribution of platform packages. | 1 or more | May run as a singleton in local mode. For HA, run multiple PMs in s3 or gluster mode. See configuring package manager. |
| redis-server | Log buffer for storing job logs. | 1 exactly | Singleton. |
| riak-node | Distributed S3-compliant package store. | 0, 3, or 5 | Use in production for HA package-manager storage when there is no other S3-compatible blob storage. The minimum number of Riak hosts is 3; the recommended number is 5. Riak is required for on-premises, non-AWS cluster deployments where HA package management is required. |
| router | HTTP router (NGINX) responsible for routing and load balancing inbound traffic. | 1 or more | Scale horizontally by running multiple to handle a high volume of inbound requests or if your network requires it. If running multiple, front them with a separate load balancer such as ELB. |
| splunk-search | Lets you search across Splunk-collected component and job logs. | 0 or 1 | Optional singleton. |
| splunk-indexer | Lets you index component and job logs for Splunk searches. | 0 or 1 | Optional singleton. |
| stagehand | Responsible for creating and updating system-provided jobs and resources. | 1 exactly | Required singleton. (Not a runtime component.) |
| tcp-router | Routes TCP traffic into the cluster (NGINX). | 0 or more | (Optional) Multiple TCP routers allowed, but auto is not supported. |
| vault | Encrypts secrets using Vault and stores them in Consul (kv-store) for high availability. | 1 or more | Scale horizontally by running multiple. |

Operational Roles

Apcera components are deployed to distinct operational roles or planes:

  • Control Plane provides communication and distributes policy for the system.
  • Routing Plane load balances traffic into the system.
  • Monitoring Plane monitors system components.
  • Management Plane commands and controls system components.
  • Runtime Plane provides the execution environment for Linux container instances.

Control plane

The Control Plane provides communication and distributes policy for the system.


Communication

Apcera components communicate using the NATS messaging system. Components communicate in peer-to-peer fashion through the exchange of messages. The NATS messaging server resides in the Management Plane, and each component is a client.
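
As a rough illustration of this subject-based, peer-to-peer messaging, the following sketch uses the open source github.com/nats-io/nats.go client; the subject name and payload are invented for illustration and are not Apcera's internal subjects.

```go
// Minimal sketch of subject-based pub/sub over NATS.
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // e.g. nats://nats-server:4222
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	// One component subscribes to a subject it is interested in...
	nc.Subscribe("cluster.job.health", func(m *nats.Msg) {
		fmt.Printf("received: %s\n", string(m.Data))
	})

	// ...while another component publishes to that subject.
	nc.Publish("cluster.job.health", []byte(`{"job":"web","instances":3}`))
	nc.Flush()
	time.Sleep(100 * time.Millisecond)
}
```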

Governance

Policy is a machine-readable representation of the intended behavior of the cluster. Virtually all surface area of Apcera is governed by policy.

The policy language is a declarative scripting language. A policy comprises a realm and one or more rules. The realm defines the set of resource(s) to which a policy rule applies. A rule is a statement evaluated by the policy engine.
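
The following is a rough, illustrative sketch of the realm-plus-rules shape described above; the realm, identity, and permissions shown are hypothetical, and the exact syntax is defined in the policy documentation.

```
on job::/dev/sandbox {
  if (ldap@example.com->name == "jdoe") {
    permit read, start, stop
  }
}
```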

Routing plane

The Routing Plane load balances traffic into the system.


HTTP Server

The HTTP Server receives inbound traffic from clients and routes it to the appropriate component or job using HTTP. Apcera implements a standard web server and adds several features to support the needs of the platform.
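
As a conceptual sketch only (not the Apcera router, which is based on NGINX), the following Go program shows the basic idea of routing and load balancing inbound HTTP traffic across a set of instance endpoints; the addresses are placeholders.

```go
// Conceptual sketch: round-robin an inbound HTTP route across instance endpoints.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func main() {
	// Hypothetical instance endpoints advertised for a route.
	backends := []*url.URL{
		{Scheme: "http", Host: "10.0.0.11:31001"},
		{Scheme: "http", Host: "10.0.0.12:31002"},
	}

	var next uint64
	proxy := &httputil.ReverseProxy{
		Director: func(r *http.Request) {
			// Pick the next backend in round-robin order.
			target := backends[atomic.AddUint64(&next, 1)%uint64(len(backends))]
			r.URL.Scheme = target.Scheme
			r.URL.Host = target.Host
		},
	}

	log.Fatal(http.ListenAndServe(":8080", proxy))
}
```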

TCP Server

The TCP Server routes TCP traffic within the system for jobs that have enabled TCP routes.

Monitoring plane

The Monitoring Plane monitors system components. Apcera integrates with a third-party monitoring server and embeds a monitoring agent in each component.


Management plane

The Management Plane commands and controls system components.


Security (Auth) Server

The Security Server is the root of trust in the Apcera Platform. It is also responsible for distributing policy to individual components so that user-defined policy rules are enforced.

Flex Auth Server

Apcera uses a token-based approach to authenticate users. The Flex Auth Server integrates with various identity providers, including Google OAuth and LDAP. The Flex Auth Server does not store credentials.

API Server

The API Server receives requests from clients (APC, the web console, the REST API) and converts them into requests within the cluster, communicating with the relevant components over NATS.
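
A minimal sketch of the request/reply pattern this implies, using the github.com/nats-io/nats.go client; the subject and payloads are hypothetical, not Apcera's internal API.

```go
// Sketch: converting a client request into a NATS request and awaiting the reply.
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	// A backend component answers requests on a well-known subject.
	nc.Subscribe("job.create", func(m *nats.Msg) {
		m.Respond([]byte(`{"status":"accepted"}`))
	})

	// The front-end component turns an HTTP request into a NATS request
	// and waits for the reply.
	reply, err := nc.Request("job.create", []byte(`{"name":"web"}`), 2*time.Second)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(reply.Data))
}
```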

Package Manager

A package consists of binary data, typically encapsulated in a tarball, plus JSON metadata.

The Package Manager (PM) is responsible for managing the assets associated with a package, including references to the location of assets within the packages. The PM connects to the persistent datastore to store and retrieve packages. Once retrieved, packages are cached locally on runtime hosts.
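
As a hypothetical illustration of the tarball-plus-metadata model, the following Go sketch defines and prints a package metadata record; the field names and storage location are invented, not Apcera's actual schema.

```go
// Hypothetical package metadata accompanying a package tarball.
package main

import (
	"encoding/json"
	"fmt"
)

type PackageMetadata struct {
	Name         string   `json:"name"`
	Version      string   `json:"version"`
	SHA256       string   `json:"sha256"`       // digest of the tarball contents
	Dependencies []string `json:"dependencies"` // packages this one requires
	Location     string   `json:"location"`     // where the blob is stored
}

func main() {
	m := PackageMetadata{
		Name:         "example/runtime",
		Version:      "1.0.0",
		SHA256:       "<sha256-of-tarball>",
		Dependencies: []string{"os/linux"},
		Location:     "s3://packages/example-runtime-1.0.0.tar.gz",
	}
	out, _ := json.MarshalIndent(m, "", "  ")
	fmt.Println(string(out))
}
```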

Job Manager

A job is the base unit in the system. A job is a workload definition that represents an inventory of packages, metadata, and associated policies. There are several types of jobs, including apps, capsules, Docker images, stagers, service gateways, and semantic pipelines.

The JM maintains the desired state of jobs within a cluster, and the set of packages that constitute a job. The JM stores job metadata and state within the database.

Health Manager

The Health Manager (HM) subscribes to the instance heartbeats sent by the IMs to determine the number of live instances of each job, and reports this to the JM, which can initiate start or stop requests to the runtime plane based on deltas from the intended state.
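
A simplified sketch of this intended-versus-actual reconciliation is shown below; the types and logic are hypothetical stand-ins for illustration only.

```go
// Sketch: compute start/stop deltas from intended vs. actual instance counts.
package main

import "fmt"

type JobState struct {
	Name     string
	Intended int // instances the job definition asks for
	Actual   int // live instances observed via heartbeats
}

// reconcile returns how many instances to start (positive) or stop (negative).
func reconcile(j JobState) int {
	return j.Intended - j.Actual
}

func main() {
	for _, j := range []JobState{
		{"web", 3, 2},
		{"worker", 2, 4},
	} {
		switch delta := reconcile(j); {
		case delta > 0:
			fmt.Printf("%s: request %d additional instance(s)\n", j.Name, delta)
		case delta < 0:
			fmt.Printf("%s: stop %d extraneous instance(s)\n", j.Name, -delta)
		default:
			fmt.Printf("%s: at intended state\n", j.Name)
		}
	}
}
```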

Events Server

The Events Server serves events pertaining to different resources in the Apcera cluster (jobs, routes, packages, etc.) to end users based on policy. The Events Server also manages subscriptions from clients and garbage-collects them when clients disconnect.

Metrics Manager

The Metrics Manager (MM) component tracks the utilization of resources (disk, CPU, RAM, network) over time. The related Cluster Monitor component provides real-time cluster statistics to the web console.
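
For illustration, the following Go sketch emits statsd-style gauges over UDP, which is the kind of traffic the MM handles; the address and metric names are placeholders.

```go
// Sketch: emitting statsd gauges over UDP using the statsd line protocol.
package main

import (
	"fmt"
	"net"
)

func main() {
	conn, err := net.Dial("udp", "metrics-manager.example:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// statsd line protocol: <metric>:<value>|<type>  ("g" = gauge)
	fmt.Fprintf(conn, "cluster.im.cpu_used:42|g\n")
	fmt.Fprintf(conn, "cluster.im.instances:17|g\n")
}
```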

Cluster Monitor

The Cluster Monitor reports current cluster statistics, including the state of job instances and IM nodes.

Vault

Refer to Securing Cluster Secrets for details.

Key-Value Store

Refer to Managing the Component Store (Consul) for details.

Stagehand

The Stagehand is a tool designed to run as part of the Orchestrator to update component jobs within Continuum. Specifically, it is used to update the following jobs when a cluster is updated:

  • Staging Coordinator
  • Stagers
  • Service Gateways
  • Semantic Pipelines
  • Web Console (Lucid)
  • Continuum Guide
  • Metrics Exporter

Runtime plane

The Runtime Plane comprises a single component: the Instance Manager (IM).


An instance is a running Linux container. Each instance runs a job workload in isolation on an Instance Manager node. The IM is responsible for managing the lifecycle of container instances and performs CRUD and monitoring operations on them.

In Apcera, a container instance is a running job within a single isolated execution context. Each container instance is instantiated with a unioned filesystem that layers only the dependencies required by the running processes in the job definition. Apcera container instances use Linux cgroups and kernel namespaces to isolate the execution context of a running job instance.
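
The kernel primitives mentioned here can be illustrated generically with Go's clone-flag support (Linux only, and typically requires root); this is not the Instance Manager's code, just a minimal demonstration of launching a process in new PID, UTS, and mount namespaces.

```go
// Linux-only sketch: run a shell in new PID, UTS, and mount namespaces.
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/sh")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// The child gets its own process tree, hostname, and mount table.
		Cloneflags: syscall.CLONE_NEWPID | syscall.CLONE_NEWUTS | syscall.CLONE_NEWNS,
	}
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr

	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```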

Additional responsibilities of the IM include:

| Function | Description |
| --- | --- |
| Scheduling | The IM receives container instance placement requests from the JM and calculates a taint value based on a number of scheduling criteria (tags, resource usage, locality). The taint value is used as a response delay: after that delay the IM responds to the placement request. It is up to the requester to follow up with the container instance creation request. (See the sketch following this table.) |
| Health | The IM periodically sends out heartbeats for each instance. The Health Manager (HM) subscribes to these heartbeats and takes action based on intended/actual state disparities. |
| Metrics | The IM periodically sends out instance metrics (resource consumption, uptime, etc.). The Metrics Manager (MM) subscribes to these metrics and persists them in metrics storage. |
| Routes | Once an instance is running, the IM sends out information about the ports/routes exposed on the instance. This is used by routers for route mapping and by other IMs to set up the destination IP/port for dynamic bindings. |
| Log Delivery | The IM monitors instance logs for new activity and delivers new logs to any configured log sinks. |
| Semantic Pipeline Setup | When the IM starts an instance that requires a semantic pipeline, it suspends instance startup until the semantic pipeline is completely set up. It also sets up network rules so that job-to-service traffic flows through the semantic pipeline. |
| Runtime Configuration | The IM fills out any templates defined in the instance job/package using templated values, including the process name, UUID, and network connections, including links (ip:port) and bindings (ip:port, credentials, scheme). |
| Network Connectivity | The IM is responsible for setting up the network configuration for container instances. This involves setting up and maintaining firewall rules (iptables) and, optionally, connecting to other components, for example to set up GRE tunnels for fixed IP routing or to provide persistence for the container. |
| Filesystem | The IM downloads the packages referenced by the instance's job and lays them out in the container filesystem. |
| Package cache | The IM reserves approximately half of its allocated disk space for caching packages locally and storing job logs. |
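
The bidding behavior described in the Scheduling row can be sketched roughly as follows, assuming a NATS request/reply pattern; the subject name, scoring inputs, and payload format are invented for illustration and are not Apcera's actual protocol.

```go
// Sketch: an IM answers placement requests after a taint-derived delay,
// so less-loaded IMs naturally respond first.
package main

import (
	"encoding/json"
	"time"

	"github.com/nats-io/nats.go"
)

// taint is a hypothetical score: the busier or less suitable the IM,
// the higher the taint, and the longer it waits before bidding.
func taint(cpuUsed, memUsed float64, tagMatch bool) time.Duration {
	score := cpuUsed + memUsed
	if !tagMatch {
		score += 1.0
	}
	return time.Duration(score*100) * time.Millisecond
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	// Listen for placement requests and respond after the taint delay.
	nc.Subscribe("instance.placement", func(m *nats.Msg) {
		delay := taint(0.35, 0.20, true)
		time.Sleep(delay)
		bid, _ := json.Marshal(map[string]interface{}{
			"im":    "im-1",
			"delay": delay.String(),
		})
		m.Respond(bid)
	})

	select {} // keep the subscriber running
}
```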

Interactions

Apcera provides various interfaces for interacting with the platform, including the APC command-line interface (CLI), the web console, and a REST API for use by client applications.

Creating and starting a job instance from source using an Apcera client involves the following component interactions.

Job creation

  • APC packages the source code in a tarball, sends a request to the API Server, and uploads the tarball to the cluster.
  • The API Server invokes the staging coordinator job to ingest the source code along with any dependent packages through package resolution.
  • The detected or specified staging pipeline is invoked, and the stager(s) compile or build the source code and push it to the PM, which stores the assets and metadata as a package.
  • The API Server sends the expected initial state of the application to the JM, including the package and package dependency information.
  • The JM broadcasts a NATS message to all IMs requesting bids for an IM to start one or more job instances (containers).
  • Each IM responds according to its taint with an offer to start an instance of the job. Each IM's taint is determined by a set of scoring rules based on the set of nodes (containers) under its control, tags asserted on the job definition, and other criteria.
  • The JM receives the bids and sends a request to the best-scoring IM to start an instance of the app.
  • The IM retrieves the relevant packages from its local cache, or from the PM on an initial deploy, and starts the appropriate number of instances. Each IM starts no more than one instance per node (container).

Job instance monitoring

  • Job instance log output (stdout and stderr) is centrally available and routed through TCP log drains, including ring buffer, syslog, and NATS.
  • The HM monitors the number of running instances of each job against the job definition.
  • The HM communicates the number of running instances to the JM.
  • The MM collects the resource utilization of nodes within the cluster and can be queried via the API Server to display resource and performance information through APC and the web console.
  • The Cluster Monitor collects and provides real-time instance data to the web console.
  • The audit log provides a full record of all cluster activities, stored in a dedicated DB in HA mode.

Job scaling and resiliency

  • If the number of running instances of a given job reported by the HM is lower than the defined state of the job, the JM will send out a request to the IMs for bids to start additional instances, similar to the initial app create/deploy and start process.
  • If the number of running instances of a given job reported by the HM is higher than the defined state of the job, the JM will send out a request to the IMs for reverse bids to shut down extraneous instances.
  • The JM will accept the “best” bids and send RPC calls to the relevant IMs to shut down the extraneous job instances.