Apcera Architecture and Components
- Reference Architecture
- Apcera components
- Operational Roles
Apcera is a distributed system comprising several components, each of which performs a specific function within the system. Apcera components communicate through the NATS messaging system, and those messages are signed and validated using Elliptic Curve (EC) public-private key pairs. Each component therefore has an EC key and communicates with the Security Server (Auth Server) to generate signed ephemeral tokens. These component class keys are stored in Vault, and components talk to Vault over TLS/TCP.
An Apcera component is a process on a node in a cluster that communicates with other components using messaging and is governed by the policy engine. Components act as managers to perform their functions and are not managed by other components.
|Component||Function||Count||Scaling notes|
|API Server||Provides HTTP API endpoints for the cluster.||1 or more||Scale horizontally by running multiple to handle a large number of concurrent client connections coming from APC, the web console, or custom clients.|
|Auditlog||Stores audit logs in a PostgreSQL DB.||1 or more||Scale horizontally by running multiple. See audit log HA.|
|Auth Server||Cluster security server for encrypted policy and NATS key storage and distribution.||1 or more||Scale horizontally by running multiple.|
|Cluster Monitor||Reports real-time cluster statistics.||1 or more||Scale horizontally by running multiple.|
|Gluster||Gluster package storage backend.||0 or 3 x N||Use in production for HA.|
||Stores cluster artifacts in a PostgreSQL DB.||1 or more||Scale horizontally by running multiple.|
|Events Server||Streams life-cycle and resource usage events for a cluster resource (a job, package, or route, for example) to subscribed clients. Also manages client event subscriptions and garbage-collects subscriptions for disconnected clients.||1 or more||Scale horizontally by running multiple.|
|Flex Auth Server||Central authority for authentication.||1 or more||Scale horizontally by running multiple Flex Auth components.|
|NFS Server (HA)||Provides HA NFS persistence.||0 or 3 x N||(Optional) Recommended for production clusters requiring persistence. The count is a multiple of 3 for replication (3 x N). See HA NFS persistence.|
|Graphite||Storage for cluster metrics.||1 exactly||Singleton. See Graphite storage.|
|Health Manager||Calculates and reports job health.||1 or more||Scale horizontally by running multiple HMs.|
|Instance Manager||Runtime environment for job instances.||1 or more||Scale horizontally or vertically to run more job instances. In production, run 3 or more IMs.|
|IP Manager||Provides static IP addressing for integrating with legacy systems that require fixed IP addresses.||0 or 1||Optional singleton.|
|Job Manager||Manages jobs (workloads) deployed to the cluster.||1 or more||Scale horizontally by running multiple.|
|KV Store||Key-value storage system for the cluster (Consul).||1 or more||Scale horizontally by running multiple.|
|Metrics Manager||Handles statsd traffic and reports cluster statistics over time.||1 or more||Scale horizontally by running multiple.|
|Monitoring (Zabbix)||Zabbix server and database (PostgreSQL DB) for component monitoring.||0 or 1||Typically both components are installed on the same host. For HA, use an RDS or install the DB in HA mode on a separate host. Use an external monitoring system to monitor the server.|
|NATS Server||Message bus for component communications.||1 or more||Scale horizontally by running multiple.|
|NFS Server||Provides the NFS persistence layer.||0 or 1||Optional singleton. For HA, use HA NFS persistence instead.|
|Orchestrator||Used to install cluster software, manage cluster deployments, collect component logs, etc. Includes a PostgreSQL DB.||1 exactly||Run on a VM host; keep the cluster configuration under version control.|
|Package Manager||Manages distribution of platform packages.||1 or more||May run as a singleton in smaller clusters; scale horizontally by running multiple.|
||Log buffer for storing job logs.||1 exactly||Singleton.|
||Distributed S3-compliant package store.||0, 3 or 5||Use in production for HA.|
|Router||HTTP router (NGINX) responsible for routing and load balancing inbound traffic.||1 or more||Scale horizontally by running multiple to handle a high volume of inbound requests or if your network requires it. If running multiple, front them with a separate load balancer such as ELB.|
|Splunk Search||Lets you search across Splunk-collected component and job logs.||0 or 1||Optional singleton.|
|Splunk Indexer||Lets you index component and job logs for Splunk searches.||0 or 1||Optional singleton.|
|Stagehand||Responsible for creating and updating system-provided jobs and resources.||1 exactly||Required singleton. (Not a runtime component.)|
|TCP Router||Routes TCP traffic into the cluster (NGINX).||0 or more||(Optional) Multiple TCP routers are allowed.|
|Vault||Encrypts cluster secrets and stores them in Consul.||1 or more||Scale horizontally by running multiple.|
Apcera components are deployed to distinct operational roles or planes:
- Control Plane provides communication and distributes policy for the system.
- Routing Plane load balances traffic into the system.
- Monitoring Plane monitors system components.
- Management Plane commands and controls system components.
- Runtime Plane provides the execution environment for Linux container instances.
The Control Plane provides communication and distributes policy for the system.
Apcera components communicate using the NATS messaging system. Components communicate in peer-to-peer fashion through the exchange of messages. The NATS messaging server resides in the Management Plane, and each component is a client.
Policy is a machine readable representation of the intended behavior of the cluster. Virtually all surface area of Apcera is governed by policy.
The policy language is a declarative scripting language. A policy comprises a realm and one or more rules. The realm defines the set of resource(s) to which a policy rule applies. A rule is a statement evaluated by the policy engine.
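The realm-plus-rules model can be sketched in a few lines. This is an illustrative sketch only, assuming a hypothetical prefix-match realm and permit/deny rule shape; it is not Apcera's actual policy grammar, which also evaluates claims about the requesting user.

```python
# Minimal sketch of realm-plus-rule evaluation: the realm is a resource
# prefix selecting what a policy applies to, and each rule permits or
# denies an action. Syntax and names here are hypothetical, not
# Apcera's actual policy grammar.

def evaluate(policies, resource, action):
    """Default-deny: permit only if a matching realm has a permitting rule."""
    for realm, rules in policies:
        if resource.startswith(realm):
            for rule_action, verdict in rules:
                if rule_action == action:
                    return verdict == "permit"
    return False

policies = [
    ("job::/dev",  [("read", "permit"), ("update", "permit")]),
    ("job::/prod", [("read", "permit"), ("update", "deny")]),
]

evaluate(policies, "job::/dev/myapp", "update")   # True
evaluate(policies, "job::/prod/web", "update")    # False -- explicitly denied
evaluate(policies, "job::/sandbox/x", "read")     # False -- no realm matches
```

Note the default-deny stance: a resource covered by no realm gets no permissions, which matches the idea that virtually all surface area is policy-governed.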
The Routing Plane load balances traffic into the system.
The HTTP Router receives inbound traffic from clients and routes it to the appropriate component or job over HTTP. Apcera implements a standard web server and adds several features to support the needs of the platform.
The TCP Router routes TCP traffic within the system for jobs that have TCP routes enabled.
The Monitoring Plane monitors system components. Apcera integrates with a third-party monitoring server and embeds a monitoring agent in each component.
The Management Plane commands and controls system components.
Security (Auth) Server
The Security Server is the root of trust in Apcera Platform. It is also responsible for distributing the policy to an individual component so that the user defined policy rules are enforced.
Flex Auth Server
Apcera uses a token-based approach to authenticate users. The Flex Auth Server integrates with various identity providers, including Google OAuth and LDAP. The Flex Auth Server does not store credentials.
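The token-based flow can be illustrated with a toy ephemeral token. Everything below is an assumption-laden sketch: the payload fields, secret, and HMAC signing are invented for the example (the real Auth Server signs tokens against component EC keys, and the token format is not documented here).

```python
import base64
import hashlib
import hmac
import json
import time

# Illustrative ephemeral-token sketch: a payload carrying a subject and an
# expiry, signed and verified with a shared secret. NOT Apcera's actual
# token format; HMAC keeps the example dependency-free.

SECRET = b"demo-secret"  # hypothetical shared secret

def issue(subject, ttl=300, now=None):
    now = int(time.time()) if now is None else now
    payload = json.dumps({"sub": subject, "exp": now + ttl}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig

def verify(token, now=None):
    """Return the claims if signature and expiry check out, else None."""
    now = int(time.time()) if now is None else now
    p64, sig = token.split(".")
    payload = base64.urlsafe_b64decode(p64)
    want = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, want):
        return None  # tampered, or signed with a different key
    claims = json.loads(payload)
    return claims if claims["exp"] > now else None  # None once expired

token = issue("component/api-server", ttl=60, now=1000)
verify(token, now=1010)   # {'sub': 'component/api-server', 'exp': 1060}
verify(token, now=2000)   # None -- token expired
```

The point of the sketch is the shape of the flow: credentials never travel with the token, only a short-lived signed claim does.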
The API Server receives requests from clients (APC, the web console, the REST API) and converts them into requests to the relevant components within the cluster over NATS.
A package consists of binary data, typically encapsulated in a tarball, together with JSON metadata.
The Package Manager (PM) is responsible for managing assets associated with a package, including references to the location of assets within the packages. The PM connects to the persistent datastore filesystem to store and retrieve packages. Once retrieved, packages are cached locally on runtime hosts.
A job is the base unit in the system. A job is a workload definition that represents an inventory of packages, metadata, and associated policies. There are several types of jobs, including: Apps, capsules, Docker images, stagers, service gateways, and semantic pipelines.
The JM maintains the desired state of jobs within a cluster, and the set of packages that constitute a job. The JM stores job metadata and state within the database.
The Health Manager (HM) subscribes to instance heartbeats sent by the IMs to determine the number of live instances of each job, and reports this to the JM, which can initiate start or stop requests to the Runtime Plane based on deltas from the intended state.
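The heartbeat-counting step can be sketched as follows. The class, field names, and TTL value are hypothetical, assumed for illustration only.

```python
# Sketch of how a Health Manager might count live instances per job from
# periodic IM heartbeats: an instance is presumed dead once its last
# heartbeat is older than a TTL. All names and values are hypothetical.

HEARTBEAT_TTL = 30  # assumed staleness cutoff, in seconds

class HealthManager:
    def __init__(self):
        self.last_seen = {}  # (job_id, instance_id) -> last heartbeat time

    def on_heartbeat(self, job_id, instance_id, now):
        self.last_seen[(job_id, instance_id)] = now

    def live_counts(self, now):
        """Number of live instances per job, ignoring stale heartbeats."""
        counts = {}
        for (job_id, _instance_id), ts in self.last_seen.items():
            if now - ts <= HEARTBEAT_TTL:
                counts[job_id] = counts.get(job_id, 0) + 1
        return counts

hm = HealthManager()
hm.on_heartbeat("job::/dev/web", "i-1", now=100)
hm.on_heartbeat("job::/dev/web", "i-2", now=100)
hm.on_heartbeat("job::/dev/db", "i-3", now=80)   # stale by t=120
hm.live_counts(now=120)   # {'job::/dev/web': 2}
```

Counting from heartbeats rather than polling each IM is what lets the HM scale horizontally: any HM replica can subscribe to the same stream and derive the same counts.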
The Events Server serves events pertaining to different resources in the Apcera cluster (jobs, routes, packages, etc.) to end users based on policy. It also manages subscriptions from clients and garbage-collects them when clients disconnect.
The Metrics Manager (MM) component tracks the utilization of resources (disk, cpu, ram, network) over time. The related Cluster Monitor component provides realtime cluster statistics to the web console.
The Cluster Monitor reports current cluster statistics which include state of job instances and IM nodes.
Refer to Securing Cluster Secrets for details.
Refer to Managing the Component Store (Consul) for details.
The Stagehand is a tool that runs as part of the Orchestrator to update component jobs within Continuum. Specifically, it updates the following jobs when a cluster is updated:
- Staging Coordinator
- Service Gateways
- Semantic Pipelines
- Web Console (Lucid)
- Continuum Guide
- Metrics Exporter
The Runtime Plane comprises a single component: the Instance Manager (IM).
An instance is a running Linux container. Each instance runs a job workload in isolation on an Instance Manager (IM) host. The IM is responsible for managing the lifecycle of container instances, performing CRUD plus monitoring operations on them.
In Apcera, a container instance is a running job within a single isolated execution context. Each container instance is instantiated with a unioned filesystem to layer only the dependencies required by the running processes in the job definition. Apcera container instances implement Linux cgroups and kernel namespacing to isolate the execution context of a running job instance.
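The unioned-filesystem layering can be pictured with a small sketch. This is only an analogy: real IMs use a kernel union filesystem, and the paths and layer contents below are made up.

```python
from collections import ChainMap

# Sketch of the unioned-filesystem idea: each package contributes a layer
# of files, and the container sees the union, with upper layers shadowing
# lower ones. A ChainMap of path->content dicts merely illustrates the
# lookup order; paths and contents are invented.

os_layer  = {"/bin/sh": "base shell", "/lib/libc.so": "libc"}
runtime   = {"/usr/bin/python3": "interpreter"}
app_layer = {"/app/main.py": "print('hi')", "/bin/sh": "patched shell"}

# The first map wins on lookup, so the app layer shadows the OS layer.
rootfs = ChainMap(app_layer, runtime, os_layer)

rootfs["/bin/sh"]   # 'patched shell' -- shadowed by the top layer
sorted(rootfs)      # union of every path across all layers
```

Layering this way means an instance pulls in only the packages its job definition actually depends on, rather than a full machine image.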
Additional responsibilities of the IM include:
|Scheduling||The IM receives container instance placement requests from the JM and calculates a taint value based on a number of scheduling criteria (tags, resource usage, locality). The taint value is used as a response delay: after that delay the IM responds to the placement request. It is up to the requester to follow up with the container instance creation request.|
|Health||The IM periodically sends out heartbeats for each instance. The Health Manager (HM) subscribes to these heartbeats and takes action based on intended/actual state disparities.|
|Metrics||The IM periodically sends out instance metrics (resource consumption, uptime, etc.). The Metrics Manager (MM) subscribes to the metrics and persists them in metrics storage.|
|Routes||Once an instance is running, the IM sends out information about the ports/routes exposed on the instance. This is used by routers for route mapping and by other IMs to set up the destination IP/port for dynamic bindings.|
|Log Delivery||The IM monitors instance logs for new activity and delivers new log lines to the configured log sinks.|
|Semantic Pipeline Setup||When the IM starts an instance that is supposed to have a semantic pipeline, it suspends the instance start process until the semantic pipeline is completely set up. It also sets up network rules so that job-to-service traffic flows through the semantic pipeline.|
|Runtime Configuration||The IM fills out any templates defined in the instance job/package using templated values, including the process name, UUID, and network connections, including links (ip:port) and bindings (ip:port, credentials, scheme).|
|Network Connectivity||The IM is responsible for setting up the network configuration for container instances. This involves setting up and maintaining firewall rules (iptables) and, optionally, connecting to other components, for example to set up GRE tunnels for fixed IP routing or to provide persistence for the container.|
|Filesystem||The IM downloads the packages referenced by the instance job and lays them out in the container filesystem.|
|Package Cache||The IM reserves approximately half of its allocated disk space for caching packages locally and storing job logs.|
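The taint-as-response-delay scheduling above can be simulated in a few lines. The scoring weights and field names below are invented for illustration; the real criteria (tags, resource usage, locality) and their weighting are internal to the IM.

```python
# Simulation of taint-as-response-delay bidding: each IM scores a
# placement request and, in the real protocol, waits that long before
# answering, so the first bid the JM hears is the lowest-taint one.
# The scoring weights here are invented for illustration.

def taint(im, job_tags):
    score = im["cpu_used"] / im["cpu_total"]          # prefer idle hosts
    score += 0.0 if job_tags <= im["tags"] else 1.0   # penalize tag mismatch
    score += 0.5 * im["instances_of_job"]             # spread a job's instances
    return score

def first_bid(ims, job_tags):
    # Lower taint answers sooner, so taking the minimum models
    # "first response wins".
    return min((taint(im, job_tags), im["name"]) for im in ims)

ims = [
    {"name": "im-1", "cpu_used": 7, "cpu_total": 8, "tags": {"ssd"}, "instances_of_job": 0},
    {"name": "im-2", "cpu_used": 2, "cpu_total": 8, "tags": {"ssd"}, "instances_of_job": 1},
    {"name": "im-3", "cpu_used": 1, "cpu_total": 8, "tags": set(), "instances_of_job": 0},
]
first_bid(ims, job_tags={"ssd"})   # (0.75, 'im-2')
```

Encoding the score as a delay is a neat decentralization trick: the JM does not need to know each IM's scoring inputs, because the messaging layer effectively sorts the bids for it.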
Creating and starting a job instance from source using an Apcera client involves the following component interactions.
- APC packages the source code in a tarball and uploads it to the cluster via a request to the API Server.
- API server invokes the staging coordinator job to ingest the source code along with any dependent packages through package resolution.
- The detected or specified staging pipeline is invoked, and the stager(s) compile or build the source code and push it to the PM, which stores the assets and metadata as a "Package."
- The API Server sends the expected initial state of the application to the JM, including the package and package dependency information.
- JM broadcasts a NATS message to all IMs requesting bids for an IM to start one or more job instances (containers).
- Each IM responds according to its taint with an offer to start an instance of the job. Each IM's taint is determined by a set of scoring rules based on the set of nodes (containers) under its control, tags asserted on the job definition, and other criteria.
- JM receives the bids and sends a request to the “best” scoring IM to start an instance of the app.
- IM retrieves the relevant packages from its local cache, or from the PM if it is an initial deploy, and starts the appropriate number of instances. Each IM will start no more than one instance per node (container).
Job instance monitoring
- Job instance log output (stdout and stderr) is centrally available and routed through TCP log drains, including a ring buffer, syslog, and NATS.
- The HM monitors the number of running instances of each job based on the job definition.
- The HM communicates the number of running instances to the JM.
- The MM collects the resource utilization of nodes within the cluster and can be queried via the API Server to display resource and performance information through APC and the web console.
- The Cluster Monitor collects and provides real-time instance data to the web console.
- The audit log provides a full record of all cluster activities, stored in a dedicated DB in HA mode.
Job scaling and resiliency
- If the number of running instances of a given job reported by the HM is lower than the defined state of the job, the JM will send out a request to the IMs for bids to start additional instances, similar to the initial app create/deploy and start process.
- If the number of running instances of a given job reported by the HM is higher than the defined state of the job, the JM will send out a request to the IMs for reverse bids to shut down extraneous instances.
- The JM accepts the "best" reverse bids and sends RPC calls to the corresponding IMs to shut down the extraneous job instances.
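The scale-up/scale-down decision above reduces to a simple delta check, sketched here. The function and action names are hypothetical; the real JM issues bid (and reverse-bid) requests over NATS rather than returning tuples.

```python
# Sketch of the reconciliation decision the JM makes from the instance
# counts the HM reports. Action names are hypothetical placeholders for
# the bid / reverse-bid requests sent over NATS.

def reconcile(desired, running):
    """Pick the action that converges running instances toward desired."""
    delta = desired - running
    if delta > 0:
        return ("request_bids_to_start", delta)          # scale up
    if delta < 0:
        return ("request_reverse_bids_to_stop", -delta)  # scale down
    return ("noop", 0)

reconcile(desired=3, running=1)   # ('request_bids_to_start', 2)
reconcile(desired=2, running=4)   # ('request_reverse_bids_to_stop', 2)
reconcile(desired=2, running=2)   # ('noop', 0)
```

Because the decision depends only on desired versus reported counts, the same loop handles initial deployment, crash recovery, and explicit scaling commands.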