Deployment Sizing Guidelines

This document provides minimum capacity sizing guidelines for running Apcera Platform in production.

Apcera cluster components

The following cluster components are required for an Apcera Platform deployment on any supported provider.

| Component | Description | Count | HA Considerations |
| --- | --- | --- | --- |
| api-server | Provides HTTP API endpoints for the cluster. | 1 or more | Scale horizontally by running multiple instances to handle a large number of concurrent client connections from APC, the web console, or custom clients. |
| auditlog-db | Stores audit logs in a PostgreSQL DB. | 1 or more | Scale horizontally by running multiple instances. See audit log HA. |
| auth-server | Cluster Security Server for encrypted policy and NATS key storage and distribution. | 1 or more | Scale horizontally by running multiple instances. |
| cluster-monitor | Reports real-time cluster statistics. | 1 or more | Scale horizontally by running multiple instances. |
| cluster-object-storage | Gluster package storage backend. | 0 or 3 x N | Use in production for HA package-manager storage when there is no other S3-compatible storage. |
| component-db | Stores cluster artifacts in a PostgreSQL DB. | 1 or more | Scale horizontally by running multiple instances. |
| events-server | Streams life-cycle and resource usage events for a cluster resource (a job, package, or route, for example) to subscribed clients. It also manages client event subscriptions and garbage-collects subscriptions for disconnected clients. | 1 or more | Scale horizontally by running multiple instances. |
| flex-auth-server | Central authority for authentication: basic-auth-server, google-auth-server, ldap-auth-server, keycloak-auth-server, app-auth-server (for App Token). | 1 or more | Scale horizontally by running multiple Flex Auth components. Installation rules differ between singleton and HA installs: unless you individually tag and enumerate each flex-auth server component, it is not installed in HA mode. For example, with 3 centrals running 3 auth-server instances, you still end up with a single ldap-auth-server on one central unless it is specifically enumerated. If you want N instances of a component and have N centrals, the numbers should match. Listing only auth-server: N gets you N auth-servers but only 1 of each flex-auth-server. |
| gluster-server | Provides HA NFS persistence. | 0 or 3 x N | Recommended for production clusters requiring persistence. The count is a multiple of 3 for replication (3 x N). See HA NFS persistence. |
| graphite-server | Storage for cluster metrics. | 1 exactly | Singleton. See Graphite storage. |
| health-manager | Calculates and reports job health. | 1 or more | Scale horizontally by running multiple HMs. |
| instance-manager | Runtime environment for job instances. | 1 or more | Scale horizontally or vertically to run more job instances. In production, run 3 or more IMs. |
| ip-manager | Provides static IP addressing for integrating with legacy systems that require fixed IP addresses. | 0 or 1 | Optional singleton. |
| job-manager | Manages jobs deployed to the cluster. | 1 or more | Scale horizontally by running multiple instances. |
| kv-store | Key-value storage system for the cluster (Consul). | 1 or more | Scale horizontally by running multiple instances. |
| metrics-manager | Handles statsd traffic and reports cluster statistics over time. | 1 or more | Scale horizontally by running multiple instances. |
| monitoring | Zabbix server and database (PostgreSQL DB) for component monitoring. | 0 or 1 | Typically both components are installed on the same host. For HA, use an RDS instance or install the DB in HA mode on a separate host. Use an external monitoring system to monitor this server. |
| nats-server | Message bus for component communications. | 1 or more | Scale horizontally by running multiple instances. |
| nfs-server | Provides the NFS persistence layer. | 0 or 1 | Optional singleton. For HA, use gluster-server x 3. |
| orchestrator-database, orchestrator-server | Used to install cluster software, manage cluster deployments, collect component logs, and so on. Includes a PostgreSQL DB. | 1 exactly | Run on a VM host, version control cluster.conf, and back up the DB regularly. See the Orchestrator documentation. |
| package-manager | Manages distribution of platform packages. | 1 or more | May run as a singleton in local mode. For HA, run multiple PMs with an S3 or Gluster backend. See configuring package manager. |
| redis-server | Log buffer for storing job logs. | 1 exactly | Singleton. |
| riak-node | Distributed S3-compliant package store. | 0, 3, or 5 | Use in production for HA package-manager storage when there is no other S3-compatible blob storage. The minimum number of Riak hosts is 3; the recommended number is 5. Riak is required for on-premises, non-AWS cluster deployments where HA package management is required. |
| router | HTTP router (NGINX) responsible for routing and load balancing inbound traffic. | 1 or more | Scale horizontally by running multiple instances to handle a high volume of inbound requests or if your network requires it. When running multiple routers, front them with a separate load balancer such as an ELB. |
| splunk-search | Lets you search across Splunk-collected component and job logs. | 0 or 1 | Optional singleton. |
| splunk-indexer | Lets you index component and job logs for Splunk searches. | 0 or 1 | Optional singleton. |
| stagehand | Responsible for creating and updating system-provided jobs and resources. | 1 exactly | Required singleton. (Not a runtime component.) |
| tcp-router | Handles TCP traffic into the cluster (NGINX). | 0 or more | Multiple TCP routers are allowed, but auto is not supported. |
| vault | Encrypts secrets using Vault and stores them in Consul (kv-store) for high availability. | 1 or more | Scale horizontally by running multiple instances. |

Minimum viable deployment

Minimum viable deployment (MVD) is the bare-minimum Apcera installation and serves as a baseline reference point. MVD provides no redundancy and is not production grade.

MVD Resource requirements

The minimum machine resources required for an MVD on any supported platform are as follows:

| Count | Machine Role | RAM | Disk | Components |
| --- | --- | --- | --- | --- |
| 1 | orchestrator | 2GB | 8GB | orchestrator-server, orchestrator-database |
| 1 | central | 4GB | 20GB | auditlog-database, component-database, api-server, job-manager, router, stagehand, cluster-monitor, auth-server, health-manager, metrics-manager, nats-server, package-manager, redis-server, tcp-router, nfs-server, events-server |
| 1 | instance-manager | 8GB | 100GB | instance-manager, graphite-server |
| 1 | monitoring | 4GB | 20GB | zabbix-server, zabbix-database |
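
For a rough footprint check, the following minimal Python sketch (illustrative only, not an Apcera tool; the host counts and sizes are copied from the table above) totals the RAM and disk an MVD requires:

```python
# Rough MVD footprint calculator. Host sizes come from the MVD table above;
# this is an illustrative sketch, not an official sizing tool.
mvd_hosts = [
    {"role": "orchestrator",     "count": 1, "ram_gb": 2, "disk_gb": 8},
    {"role": "central",          "count": 1, "ram_gb": 4, "disk_gb": 20},
    {"role": "instance-manager", "count": 1, "ram_gb": 8, "disk_gb": 100},
    {"role": "monitoring",       "count": 1, "ram_gb": 4, "disk_gb": 20},
]

total_hosts = sum(h["count"] for h in mvd_hosts)
total_ram = sum(h["count"] * h["ram_gb"] for h in mvd_hosts)
total_disk = sum(h["count"] * h["disk_gb"] for h in mvd_hosts)
print(f"{total_hosts} hosts, {total_ram}GB RAM, {total_disk}GB disk")  # 4 hosts, 18GB RAM, 148GB disk
```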

MVD Considerations

  • The monitoring host is optional.
  • The auth-server is responsible for policy and security. One or more Flex Auth Server components (such as basic-auth-server) are automatically deployed for cluster authentication.
  • The central host generally has low resource requirements. However, because this host is running several processes, you must allocate enough CPUs depending on the workloads you are running to ensure that the disk is not under contention and the host is able to handle fluctuating CPU demands. Note that the HTTP router may require high network throughput.
  • The runtime hosts (IMs) require the most resources. Additional CPU allows for more parallelism to handle CPU spikes, such as starting many jobs at the same time. The disk size ensures that as the cluster evolves and has more packages, disks do not come under contention.
  • Each IM reserves approximately 50% of its partitioned disk space for package caching, instance logs, and job metadata. The rest is used to run container workloads (job instances) and is the amount reported when viewing cluster resources with apc or the web console. This reservation is accounted for in these recommendations (see the sketch after this list).
  • The graphite-server component cannot run on the central host due to a port 80 conflict with the HTTP router. In production it is deployed to a dedicated host.
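
To make the disk reservation concrete, here is a minimal sketch (assuming the approximate 50% figure above; illustrative only, not Apcera tooling) that estimates how much of an IM's disk remains available for job instances:

```python
# Illustrative only: estimate the IM disk capacity left for container
# workloads after the ~50% reservation for packages, logs, and metadata.
IM_RESERVED_FRACTION = 0.5  # approximate reservation noted in these guidelines

def im_workload_capacity_gb(disk_gb: float) -> float:
    """Disk (GB) available to run job instances on an Instance Manager host."""
    return disk_gb * (1 - IM_RESERVED_FRACTION)

print(im_workload_capacity_gb(100))  # MVD IM with a 100GB disk -> ~50.0GB for workloads
```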

Minimum production deployment

Minimum production deployment (MPD) adds redundant Central and IM hosts, dedicated hosts for logging and metrics, and other components typically required for production workloads.

MPD Considerations

  • Monitoring is required for production clusters. You may omit any other optional component you do not need (tcp-router, nfs-server, ip-manager).
  • Commonly scaled components are deployed to the central hosts in anticipation of future cluster growth.
  • Components with an asterisk (*) are singletons.
  • For HA NFS persistence, install gluster-server hosts in multiples of 3 (3 x N) to replicate NFS data. (See RPD below.)
  • If the package-manager uses local storage, it runs as a singleton. Deploying more than one package-manager requires a remote Package Store backend (S3, Riak, or Gluster).
  • The auth-server is the Security Server component and is made redundant by running multiple instances on the central hosts.
  • Each flex-auth-server component is made redundant by running multiple instances on the central hosts; enumerate each flex-auth component individually, as described in the component table above (see the sketch after this list).
  • The Auditlog expects a dedicated disk for Postgres storage. Clusters that are to be scaled should use a dedicated Machine Role for audit or an external auditdb. Migration of the auditlog is not supported without professional services.
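
As a planning aid, the following minimal sketch (illustrative only, not Apcera tooling; it assumes one instance of auth-server and of each flex-auth component per central host, and 3 x N gluster-server hosts per replica set) shows how the HA counts work out:

```python
# Illustrative HA count helper, based on the rules described above: enumerate
# each flex-auth component individually (listing only auth-server yields just
# one of each flex-auth component), and deploy gluster-server in multiples of 3.
FLEX_AUTH_COMPONENTS = [
    "basic-auth-server", "google-auth-server", "ldap-auth-server",
    "keycloak-auth-server", "app-auth-server",
]

def ha_component_counts(num_centrals: int, gluster_replica_sets: int = 1) -> dict:
    counts = {"auth-server": num_centrals}
    counts.update({name: num_centrals for name in FLEX_AUTH_COMPONENTS})
    counts["gluster-server"] = 3 * gluster_replica_sets
    return counts

for component, count in ha_component_counts(num_centrals=3).items():
    print(f"{component}: {count}")  # e.g., ldap-auth-server: 3, gluster-server: 3
```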

MPD on AWS

The following table lists the minimum resource requirements for installing Apcera on AWS.

| Count | Machine Role | Instance Type | Components |
| --- | --- | --- | --- |
| 1 | orchestrator | t2.small | orchestrator-server, orchestrator-database |
| 3 | central | m3.medium | router, api-server, auth-server, flex-auth-server, nats-server, job-manager, package-manager, health-manager, cluster-monitor, metrics-manager, component-database, events-server, tcp-router, nfs-server, ip-manager, stagehand* |
| 1 | logs-metrics | c4.large | redis-server, graphite-server |
| 1 | audit | c4.large | auditlog-database |
| 1 | monitoring | m3.medium | zabbix-server, zabbix-database |
| 3 | instance-manager | r3.large | instance-manager |

AWS MPD Considerations

  • To potentially reduce costs, you could split the monitoring host by deploying the zabbix-database to a specialized RDS (Relational Database Service) host using db.t2.small and the zabbix-server to a t2.small EC2 host.
  • The r3.xlarge EC2 instance type provides good ECU (Elastic Compute Unit) allocation, a healthy amount of RAM, and sufficient disk storage. The m2.2xlarge type, although considered legacy and more expensive than the r3, may be used if you prefer not to use SSD disks for the IMs.
  • The package-manager component runs on each central. The HA storage backend is an AWS S3 bucket.
  • This deployment assumes that you will use an ELB in front of the multiple HTTP routers.
  • For AWS, consider using a Postgres RDS instance for the Audit Log DB.

MPD on OpenStack

The following table lists the minimum resource requirements for installing Apcera on OpenStack.

Note that this information describes a minimum production deployment on OpenStack. See the OpenStack reference cluster for a redundant production deployment.

| Count | Machine Role | CPU | RAM | Disk | Components |
| --- | --- | --- | --- | --- | --- |
| 1 | orchestrator | 1 | 2GB | 8GB | orchestrator-server, orchestrator-database |
| 3 | central | 2 | 4GB | 20GB | router, api-server, flex-auth-server, nats-server, job-manager, package-manager, health-manager, cluster-monitor, metrics-manager, component-database, events-server, auth-server, tcp-router, nfs-server, ip-manager, stagehand* |
| 1 | logs-metrics | 2 | 4GB | 50GB | redis-server, graphite-server |
| 1 | audit | 2 | 4GB | 50GB | auditlog-database |
| 1 | monitoring | 2 | 4GB | 20GB | zabbix-server, zabbix-database |
| 3 | instance-manager | 4 | 8GB | 100GB | instance-manager |

OpenStack MPD Considerations

  • Deploying more than one package-manager requires a HA Package Storage Backend. See package manager configuration.
  • The Auditlog expects a dedicated disk for Postgres storage. The Apcera-provided OpenStack configuration provides only one disk for the central host because OpenStack cannot reliably provide two dedicated disks to a host, and that disk is reserved for the Package Manager. In this case you must install the auditlog-database on dedicated hosts.
  • See the example deployment for OpenStack for RPD considerations.

MPD on vSphere

The following table lists the minimum resource requirements for installing Apcera on vSphere.

| Count | Machine Role | CPU | RAM | Disk | Components |
| --- | --- | --- | --- | --- | --- |
| 1 | orchestrator | 1 | 2GB | 8GB | orchestrator-server, orchestrator-database |
| 3 | central | 2 | 4GB | 20GB | router, api-server, flex-auth-server, nats-server, job-manager, package-manager, health-manager, cluster-monitor, metrics-manager, component-database, events-server, auth-server, tcp-router, nfs-server, ip-manager, stagehand* |
| 1 | logs-metrics | 2 | 4GB | 50GB | redis-server, graphite-server |
| 1 | audit | 2 | 4GB | 50GB | auditlog-database |
| 1 | monitoring | 2 | 4GB | 20GB | zabbix-server, zabbix-database |
| 3 | instance-manager | 4 | 8GB | 100GB | instance-manager |

vSphere MPD Considerations

  • Deploying more than one package-manager requires a HA Package Storage Backend. See package manager configuration.
  • We only create virtual machines in vSphere; Terraform does not create networks or security groups. You define which pre-existing network(s) to use in main.tf, and you are responsible for setting up firewalls. Refer to the required ports documentation for details.

Recommended production deployment

We provide only the minimum deployment requirements for going into production with Apcera. In practice, your production deployment will be based on your unique capacity planning estimates. You may work with Apcera technical staff to plan your production installation.

In general, a recommended production deployment (RPD) has the following characteristics:

  • Based on the MPD resource requirements for your chosen platform.
  • Uses HA for all possible components, including package manager and NFS Services.
  • Scales the central host to 3 or 4 nodes.
  • Deploys 3 or more Instance Manager hosts.
  • Typically has dedicated machines for the routers. (The HTTP router may require high network throughput; the TCP router may require its own IP address.)
  • Deploys the auditlog-database on dedicated machines with dedicated mounts, or uses an external auditlog-database.

Capacity planning

Each cluster machine host has finite capacity. To determine how much capacity you will need in production, factor in the capacity of each machine host, anticipated utilization, and desired level of fault tolerance.

General capacity planning questions:

1) What is your expected utilization?

2) What is your desired fault tolerance?

For example, with a 5 IM-node cluster, you may be able to tolerate 1 machine failure if you're under 80% utilization, whereas you may be able to tolerate 2 machines failing if you're below 60% utilization.
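
The following back-of-the-envelope sketch (illustrative only; it assumes identical IM hosts and expresses utilization as a fraction of total IM capacity) captures that trade-off:

```python
# Illustrative fault-tolerance check: how many IM hosts can fail while the
# remaining hosts still have enough capacity for the running workload?
import math

def tolerable_im_failures(num_ims: int, utilization: float) -> int:
    """IM failures the cluster can absorb at the given utilization fraction."""
    ims_needed_for_load = math.ceil(num_ims * utilization)
    return max(num_ims - ims_needed_for_load, 0)

print(tolerable_im_failures(5, 0.80))  # -> 1 failure tolerated at 80% utilization
print(tolerable_im_failures(5, 0.60))  # -> 2 failures tolerated at 60% utilization
```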