Populating Cluster.conf

You configure your Enterprise Edition cluster for deployment using the Apcera cluster configuration file (cluster.conf). This file is a configuration template for your cluster in Dconf format. You populate this file, upload it to the Orchestrator host, and deploy your cluster. Each cluster.conf file is specific to your resources and is customized for your environment and architectural preferences.

Securing cluster.conf

The cluster.conf file stores cluster information in plain text, including some required credentials and SSL certs. Once you have generated your cluster.conf file, Apcera strongly recommends that you secure and version control it (as well as associated files such as cluster.conf.erb, main.tf, or *.tfvars). To do this we suggest you create a Git repository for your cluster installation files, and use git-crypt to encrypt these files. See encrypting cluster.conf for guidance on doing this.
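
For example, a minimal sketch of that workflow (the repository name, file patterns, and GPG user ID below are placeholders; see the encrypting cluster.conf guidance for the authoritative steps):

git init apcera-cluster-config && cd apcera-cluster-config
git-crypt init
# Encrypt the sensitive files; adjust the patterns to match your installation files.
cat >> .gitattributes <<'EOF'
cluster.conf filter=git-crypt diff=git-crypt
cluster.conf.erb filter=git-crypt diff=git-crypt
*.tfvars filter=git-crypt diff=git-crypt
EOF
git-crypt add-gpg-user YOUR_GPG_USER_ID
git add .gitattributes cluster.conf
git commit -m "Add encrypted cluster configuration"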

Cluster.conf configuration parameters

The cluster.conf file includes the following main sections, each containing parameter values that you populate:

Section Description
provisioner Specify the provisioner type. Only the generic (IP address) type is supported.
machines Define the machines using their IP addresses that will be used for the cluster hosts.
components Specify the desired number of components for each component type.
chef Various cluster configuration settings used by Chef, including all of the following.
base_domain and cluster_name Provide the cluster name and domain name.
router HTTP Router configuration, including HTTPS certificates, failover, and sticky sessions.
package_manager Configure the package store (local or remote (s3)) as well as the Staging Coordinator RAM and disk size.
ip-mappings Map public to private IPs for IP Manager and TCP Router components.
stagehand Configure stagehand settings, including service gateways.
subnets Optional subnet configuration.
mounts Volume mounts for Instance Managers (IMs) and other components.
nfs Use to specify the NFS version.
ntp Use to specify the NTP server.
health_manager Configure heartbeat intervals and health monitoring settings.
ssh Configure the SSH key for accessing cluster hosts.
tag_rules Job scheduling tags for instance_manager, router, and datacenter (virtual) components.
ipsec Enable cluster encryption.
auth Configure the identity provider and admin users.
apzabbix Configure cluster monitoring settings.
vxlan_enabled Enable or disable VXLAN for virtual networks.
splunk Integrate cluster and job logs with Splunk.

This documentation describes each primary area of configuration in the cluster.conf file. Refer to the actual installation instructions for your platform for specific details on how to populate the cluster.conf file for your type of deployment.

Provisioner

The provisioner section specifies the generic provisioner which uses IP addresses to identify cluster hosts.

provisioner {
  type: generic
}

As of the Apcera Platform version 2.2.0 ("Buffalo") release, all individual platform provisioners (aws, openstack, and vsphere) are deprecated in favor of the IP address-based generic provisioner. While your existing configurations will continue to work, you should migrate to the generic provisioner as soon as possible. Contact Apcera Support for assistance.

Machines

The machines section defines a set of base machine templates for different roles. This is where you specify which Apcera processes can run on which machines and how those machines should be created.

You enter one or more IP addresses in the hosts section. The suitable_tags section of each entry lists the components that machine is capable of running. The machine won't necessarily run all of them; the list only indicates what it is able to run. You can use this to control the co-location of different processes, or to separate a process onto its own set of machines. The Orchestrator only runs the required number of each process; when scaling a process up or down, it always chooses placement from the set of existing machines.

machines: {
  central: {
    hosts: [ "192.168.46.133", "192.168.46.134" ]
    suitable_tags: [
      "auditlog-database"
      "component-database"
      "api-server"
      "job-manager"
      "router"
      "cluster-monitor"
      "flex-auth-server"      // Such as ldap-auth-server, google-auth-server, etc.
      "health-manager"
      "metrics-manager"
      "nats-server"
      "package-manager"
      "events-server"
    ]
  }
  singleton: {
    hosts: [ "192.168.46.135" ]
    suitable_tags: [
      "ip-manager"
      "nfs-server"
      "tcp-router"
      "stagehand"
      "auth-server"           // Policy server
    ]
  }
  logs-metrics: {
    hosts: [ "192.168.46.135" ]
    suitable_tags: [
      "redis-server"
      "statsd-server"
      "graphite-server"
    ]
  }
  monitoring: {
    hosts: [ "192.168.46.136" ]
    suitable_tags: [
      "monitoring"
    ]
  }
  instance_manager: {
    hosts: [ "192.168.46.131", "192.168.46.132" ],
    suitable_tags: [
      "instance-manager"
    ]
  }
}

Components

The components section defines the number of each component that should be deployed within the cluster. Refer to the Sizing Guidelines for a list and description of cluster components, considerations, and recommendations.

The primary component that may need to be scaled within a cluster is the instance-manager, as this component defines the capacity for user workloads within the system. That is, the more workloads (jobs) you have, the more Instance Managers you may need to deploy. See scaling the cluster.

components: {
          monitoring: 1  | Monitoring host

  component-database: 2  | Central host (n-wise scalable)
          api-server: 2  
         job-manager: 2  
      health-manager: 2  
     metrics-manager: 2
     package-manager: 2  
         nats-server: 2   
     cluster-monitor: 2  
              router: 2
       events-server: 2  

    instance-manager: 2  | Instance Manager hosts (n-wise scalable)

         auth-server: 1  | Other components (single hosts)
     package-manager: 1
          tcp-router: 1  
          ip-manager: 1  
     graphite-server: 1  
       statsd-server: 1  
          nfs-server: 1  
        redis-server: 1  
           stagehand: 1  

}

Chef section

The chef section of the Orchestrator configuration includes a number of attributes that are applied to the Chef attributes made available to machines. The configuration contains the set of values most commonly specified on a per-cluster basis, and typically includes comments before each block describing their usage.

There are three main parts to the chef section. Although not common, you may have entries at the root level, such as the following:

chef: {
  "authorization": {
    "sudo": {
      "users": ["ubuntu"],
    }
  },
  "nginx": {
    "default_site_enabled": false
  },

Check with Apcera Support for all options here.

The chef.continuum section provides several fields and parameters as described below. The chef.apzabbix section provides monitoring configuration.

Cluster domain

You populate the cluster name and domain settings using the cluster_name and base_domain parameters in the chef.continuum section of cluster.conf. Note that you must configure DNS for the base_domain. Refer to the DNS instructions.

chef: {
  "continuum": {
    "cluster_name": "cluster-name",
    "base_domain": "cluster.domain",
    "cluster_platform": "aws",          // Optional
  }
}

For example:

chef: {
  "continuum": {
    "cluster_platform": "platform-vmware_desktop",

    # Cluster name and domain settings.
    "cluster_name": "tutorial",
    "base_domain": "tutorial.apcera-platform.io",
  }
}

Note the following regarding the cluster_name attribute:

  • You must set the cluster_name attribute in cluster.conf to get a persistent value of cluster-name, which is required for configuring monitoring alerts in Zabbix;
  • Starting with the Orchestrator 0.5.x release, the . (dot character) is allowed in the cluster_name parameter, but the installation will replace it with the - (dash character);
  • Once you deploy a cluster with the cluster_name attribute set, to change the cluster_name you must 1) edit cluster.conf and specify the new cluster_name, and 2) redeploy the cluster with the --update-name flag. For example:
orchestrator-cli deploy -c cluster.conf --update-name

Nameserver rules

The chef.continuum.nameserver_rules configuration field takes an array of rules that override a host's resolver configuration options. Each rule specifies a filter that selects the hosts to which the rule applies – either an IP address filter in CIDR notation (network_cidr) or a datacenter tag name (datacenter) – and one or more resolver configuration options to apply.

For example, the following defines two nameserver rules:

  • The first matches on the CIDR address range of "10.10.0.0/16" and defines two nameservers with IP addresses of "8.8.8.8" and "8.8.4.4".
  • The second rule matches on the datacenter tag name "onprem-4" and defines two nameserver IP addresses, and a domain to search for host name lookups ("internal-domain.local"):

chef: {
  "continuum": {
    "nameserver_rules": [
      { "network_cidr": "10.10.0.0/16", "nameservers": ["8.8.8.8", "8.8.4.4"] },
      { "datacenter": "onprem-4",
        "nameservers": ["10.0.1.1", "10.0.1.2"],
        "search": "internal-domain.local"
      }
    ]
  }
}

Each nameserver rule may contain the following configuration fields:

  • nameservers – An array of IP addresses for name servers that the resolver should query, e.g.:

    "nameservers": ["8.8.8.8",  "8.8.4.4"]
    
  • search – List of domains to search for host name lookup, separated by tabs or spaces, e.g.:

    "search": "example.com local.lan"
    
  • options – An array of resolver configuration options to apply, e.g.:

    "options": ["rotate",  "timeout:1"] 
    
  • interface-order – An array of shell glob patterns that specifies the order in which resolvconf nameserver information records for network interfaces are processed. If specified, the array's values override those in the host's /etc/resolvconf/interface-order configuration file, e.g.:

    "interface-order": ["lo*", "eth*"]
    

Package Manager store

The chef.continuum.package_manager parameters specify the Package Manager settings. You can use a local package datastore, in which case the package-manager is a singleton, or s3 mode, in which case you can run multiple package-manager components for HA. See the sizing guidelines.

chef: {
  "continuum": {
    "package_manager": {
      "package_store_type": "local",
      "local_store": {
        "cleanup_on_delete": true
      }
    },

See Package Manager LRU for details on using cleanup_on_delete.

chef: {
  "continuum": {
    "package_manager": {
      "package_store_type": "s3",
      "s3_store": {
        "access_key": "PackageBucketSettings",
        "secret_key": "PackageBucketSettings",
        "endpoint": "s3-us-west-2.amazonaws.com",
        "bucket": "cluster-stack-s3bucket-3f935ertytd"
      }
    }
  }
}

See Riak Package Store for details on using Riak as the package store for on-prem deploys where AWS S3 cannot be used.

If necessary, you can change the default RAM and disk size allocated to the Staging Coordinator, which is used to launch apps from source code or from capsules. The defaults are 256MB memory and 2GB disk. If you are deploying large legacy apps, you may need to increase both of these settings.

chef: {
  "continuum": {
    "package_manager": {
      "staging_coordinator": {
        "memory": 268435456,
        "disk": 2147483648
      }
    }
  }
}

The chef.continuum.package_manager.db.conn_max_life_time parameter lets you specify the maximum lifetime (in seconds) of idle database connections before they are cleaned up. A value of 0 means database connections live forever, which is the default behavior. For example, the following sets the maximum lifetime for idle database connections to 300 seconds:

chef: {
  "continuum": {
    "package_manager": {
      "db": {
        "conn_max_life_time":300
      }
    }
  }
}

HTTP Router

To enable HTTPS access to the cluster, you configure the chef.continuum.router section. You can also configure the timing around how the router handles job instance fail-overs and sticky sessions.

HTTPS certificates

chef: {
  "continuum": {
    "router": {
      "http_port": 8080,
      "https_port": 8181,
      "ssl": {
        "enable": false,
        "tlshosts": [
          {
            "server_names": [ "*.prod.example.tld", "prod.example.tld" ],
            "certificate_chain": (-----CERTIFICATE-----)
            "private_key": (-----PRIVATE KEY-----)
          },
          {
            "server_names": [ "*.clustername.example.tld", "clustername.example.tld" ],
            "certificate_chain": (-----CERTIFICATE-----)
            "private_key": (-----PRIVATE KEY-----)
          },
          {
            "server_names": [ "www.example.com", "example.com" ],
            "certificate_chain": (-----CERTIFICATE-----)
            "private_key": (-----PRIVATE KEY-----)
          },
          {
            "server_names": [ "www.example.net", "example.net" ],
            "certificate_chain": (-----CERTIFICATE-----)
            "private_key": (-----PRIVATE KEY-----)
          }
        ]    # tlshosts
      }      # ssl
    },       # router

Controlling job instance fail-over times

If a job request is routed to an Instance Manager (IM) that is unavailable for some reason, the request fails. By default, the router immediately retries failed client requests three times, each time against a different instance; each failed instance is temporarily suspended for 60 seconds.

The chef.continuum.router.continuum_max_fails and chef.continuum.router.continuum_fail_timeout fields let you configure the number of retries and the amount of time that failed instances are suspended, respectively. For instance, with the following configuration the router will retry failed client requests five times immediately and suspend failed instances temporarily for 30 seconds.

chef: {
  "continuum": {
    "router": {
        "continuum_max_fails" : 5,
        "continuum_fail_timeout" : 30,
        ...
    }
  }
}

JSESSIONID

The HTTP router supports JSESSIONID for basic sticky sessions with Java applications.

To use JSESSIONID, you must enable it for the router in the chef.continuum.router block of the cluster.conf file by setting the sticky_target parameter to true. Cookie encryption is required and is set using the chef.continuum.router.cookie_encryption_key parameter, whose value must be 32 characters in length. The chef.continuum.router.cookie_encryption_iv parameter is also required.

For example:

chef: {
  "continuum": {
    "router": {
      "sticky_target": true,
      "cookie_encryption_key": "01234567890123456789012345678901",
      "cookie_encryption_iv": "1234567890123456", 
    },
  },
}

IP mappings

The chef.continuum.ip_mappings section of the cluster.conf file is used to map private IP addresses to public addresses. Both the TCP Router and IP Manager use these mappings to be aware of external addresses (provided via NAT at a firewall, for example) and to accept routes/bindings to those addresses.

To configure the TCP Router and/or IP Manager to be aware of and accept Apcera routes for external addresses, add a block similar to one of the following to the cluster.conf file (mapping a private IP to a single public address, or to multiple addresses):

chef: {
  continuum: {
    "ip_mappings": {
      "10.10.0.200": "54.187.165.76"
    }
  }
}

chef: {
  continuum: {
    "ip_mappings": {
      "10.10.0.200": [ "192.168.2.204", "192.168.2.207" ]
    }
  }
}

Note that the ip_mappings block is required on any cluster not deployed on AWS to inform the Orchestrator about the public IP mappings for the TCP router and IP Manager components. For example, if you add a TCP route for an app, that app will not be able to access the routable address unless the chef.continuum.ip_mappings block in the cluster.conf file has a map to the external IP of the router.

Stagehand

Stagehand controls the deploy's configuration within the cluster, specifically the available default service gateways.

If the chef.continuum.stagehand.gateways value is empty, the cluster is configured with the default available service gateways:

chef: {
  "continuum": {
    "stagehand": {
      # "gateways": [...]
      "debug": {
       # "console": true
      }
    }
  }
}

If you need to manually add one or more service gateways, you can do so as follows:

chef: {
  "continuum": {
    "stagehand": {
      "gateways": ["generic", "gnatsd", "http", "ipm", "memcache", "mongodb", "mysql", "network", "nfs", "postgres", "rabbitmq", "redis", "s3"],
      "debug": {
        "console": true
      }
    }
  }
}

Subnets

Use the chef.continuum.cluster.subnets parameter to specify the cluster's subnets. This is primarily needed to distinguish "internal" from "external" addresses; most of the system determines its own IP using the default Ethernet device, which will be the public IP.

chef: {
  "continuum": {
    "cluster": {
      "subnets": ["192.168.2.205/32","192.168.2.204/32"]
    }
  }
}

Volume mounts

Use the chef.continuum.mounts parameter to specify mount settings. The Instance Manager uses LVM for container volumes. The following example sets the device for each mount to /dev/sda4.

chef: {
  "continuum": {
    "mounts": {
      "instance-manager": {
        "device": "/dev/sda4"
      }
      "package-manager": {
        "device": "/dev/sda4"
      }
      "gluster-brick0": {
        "device": "/dev/sda4"
      }
    },
  }
}

NFS server

If you are deploying an NFS server in your cluster, you must add the following parameters to your cluster.conf to enable NFSv4:

chef: {
  "continuum": {
    "nfs": {
      "version": "4",
    },
  }
}

If you don't add NFSv4 to your cluster.conf, the NFS server will default to version 3. See supported NFS protocols and using NFSv3 for more information.

See also HA NFS using Gluster.

NTP server

To configure a local NTP server, you can add the following block to your cluster.conf file:

chef: {
  # site specific NTP Servers
  "ntp": {
    # The default is 4 pool members of the public time pool.
    "servers": ["ntp1.company.com","ntp2.company.com"]
  }
}

See also the vSphere installation documentation.

Health management

Some fields are exposed that allow cluster administrators to configure health management settings.

The heartbeat_interval setting controls how often the Instance Manager reports instance liveness to the Health Manager, and how often the Health Manager expects to receive this information.

The max_missed_heartbeats setting is the maximum number of missed heartbeats the Health Manager allows before it treats an instance as failed and restarts it.

These settings are global to the cluster, and affect all jobs equally.

chef: {
  "continuum": {
    "cluster": {
      // If not set, the default value is 1 minute
      "heartbeat_interval": "1m30s"
    },
    // If not set, the default value is 5
    "health_manager": {
      "max_missed_heartbeats": 10
    }
  }
}

By default, the window within which the Health Manager responds to instance failure is 5 minutes. This comes from the product of the heartbeat_interval and the max_missed_heartbeats.
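
For example, with the values shown in the block above (heartbeat_interval of 1m30s and max_missed_heartbeats of 10), the window becomes 10 × 90 seconds = 15 minutes.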

You may lower the heartbeat interval to allow the Health Manager to notice host failure more quickly, or raise it to make the Health Manager react more slowly to these conditions.

Apcera recommends not setting these values too low, as doing so may cause the Health Manager to overreact to intermittent failure modes such as brief network partitions that recover on their own.

SSH key

To access cluster hosts using SSH, add your SSH key to cluster.conf as follows:

chef: {
  "continuum": {
    "ssh": {
      "custom_keys":[ "ssh-rsa SSHKEY" ]
    },
    ...
  }
}

Then, forward your SSH key to the Orchestrator host and log in using SSH.
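
For example, a minimal sketch using SSH agent forwarding (the user names and host addresses are placeholders for your environment):

ssh-add ~/.ssh/id_rsa                  # load the key that matches custom_keys
ssh -A <user>@<orchestrator-host-ip>   # -A forwards your SSH agent to the Orchestrator host
ssh <user>@<cluster-host-ip>           # from the Orchestrator host, log in to a cluster host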

PKIX

The pkix section gives you the ability to add internal certificates to the list of trusted SSL certificates for LDAP or other identity servers. Insert the PEM format of the trusted root CA certificate here.

# Optional additional CA certificates to install, for validating external 
# services (for example LDAP servers) using root CA 
"chef": {
  "continuum": {
    "pkix": {
      "system": {
      "additional_certs": [
      # YOUR cert name
        "-----BEGIN CERTIFICATE-----
        YOUR CERT HERE
        -----END CERTIFICATE-----
        ",
        ] # chef -> continuum -> pkix -> system -> additional_certs[]
      } # chef -> continuum -> pkix -> system
    } # chef -> continuum -> pkix
  } # chef -> continuum
} # chef 

Cluster encryption

Starting with the Apcera Platform Enterprise Edition 2.2.0 release, you can encrypt the runtime traffic in your cluster using Internet Protocol Security (IPsec). When IPsec is enabled, runtime traffic across cluster nodes is encrypted.
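
The exact parameters depend on your release; the following is a hypothetical sketch that assumes an ipsec block under chef.continuum with an enable-style flag. Confirm the key names with Apcera Support or the IPsec documentation before use.

chef: {
  "continuum": {
    # Hypothetical keys shown for illustration only.
    "ipsec": {
      "enabled": true
    }
  }
}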

Hybrid deployments

To enable deployment across multiple providers, you can tag certain components, including the Instance Manager, HTTP Router, and Package Manager. See Configuring hybrid deployments for details.

Auth Server

To configure access to the cluster, you configure the chef.continuum.auth_server section.

Apcera supports several identity providers.

To configure an identity provider, you populate the auth_server section of the cluster.conf file. The Auth Server settings section of the configuration file lists users who can log in to the cluster and use it. The users subsection is used to authenticate users of the cluster.

Only the settings that generate policy, such as the "users" lists for each auth provider and the "admins" list, are ignored by the Auth Server after the initial deploy. If you intend to use an identity provider, you must configure it for your cluster. If an identity provider is not configured in cluster.conf, a default is provided that allows access via APC only.

chef: {
  "continuum": {
	  "auth_server": {
      "identity": {
        "crowd": {
          "enabled": true,
          "user": "user-name",
          "password": "password",
          "url": "http://172.27.0.101:8095"
        },
        "ldap_basic": {
          "enabled": true,
	  "model_name": "basic",
          "search_dn": "cn=Directory Manager",
          "search_passwd": "password",
          "base_dn": "ou=People,dc=example,dc=com",
          "uri": "ldaps://172.27.0.157",
          "port": "1636",
        },
        "kerberos": {
          "enabled": true,
          "service_name": "HTTP/auth.cluster.acme.net@ACME.IO",
          "keytab_contents_b64": "BQIAAABZAAIACUFQU0FSQSTwAESFRUAAXYXV0aC52YWETC",
          "kdc_list": [
            "ad1.acme.io:88"
          ],
          "realm": "ACME.IO"
        },
        "google": {
          "users": [
            "kamisama.acme@gmail.com",
            "jjones@acme.com"
          ],
          "client_id": "690542023564-j45bi.apps.googleusercontent.com"
          "client_secret": "byS5RFQsKzsczoD"
          "web_client_id": "690542023564-j1n1e564vb.apps.googleusercontent.com"
          },
        },
      },
      # Staff who get admin access in this `chef.continuum.auth_server.admins` array.
      "admins": [
        "kamisama.acme@gmail.com",
        "jjones@acme.com"
      ],
    },
  },  # continuum

Replacing the example ldap_basic section above with the following configures Active Directory as an LDAP service:

        "ldap_basic": {
          "enabled": true,
	  "model_name": "AD",
	  "default_domain": "EXAMPLE.COM",
          "search_dn": "cn=ldaps,ou=ServiceAccounts,ou=Company,dc=example,dc=com",
          "search_passwd": "password",
          "base_dn": "dc=example,dc=com",
          "uri": "ldaps://ad1.example.com",
          "port": "636",
        },

With the AD model, the user logs in with their account name and the domain is taken from default_domain. If your environment supports multiple domains, use the ADMulti model as shown in the following example:

        "ldap_basic": {
          "enabled": true,
	  "model_name": "ADMulti",
          "search_dn": "cn=ldaps,ou=ServiceAccounts,ou=Company,dc=example,dc=com",
          "search_passwd": "password",
          "base_dn": "dc=example,dc=com",
          "uri": "ldaps://ad1.example.com",
          "port": "636",
        },

See Identity Providers for details on configuring each type of ID provider.

Cluster monitoring

To configure internal monitoring (Zabbix), you populate the chef.apzabbix section of the cluster.conf file.

See configuring monitoring for parameter details.

chef: {
  "apzabbix": {
    "db": {
      "hostport": "MonitoringPostgresEndpoint.us-west-2.rds.amazonaws.com:5432",
      "master_user": "acme_ops",
      "master_pass": "PaSsWoRd",
      "zdb_user": "zabbix",
      "zdb_pass": "PaSsWoRd"
    },
    "users": {
      "guest": { "user": "monitoring", "pass": "PaSsWoRd" },
      "admin": { "user": "admin", "pass": "PaSsWoRd", "delete": false }
    },
    "web_hostnames": ["clustermon.example.com"],
    "PF_pagerduty": {
      "key": "API-ACCESS-TOKEN-GOES-HERE"
    },
    "PF_email": {
      "smtp_server": "localhost",
      "smtp_helo": "localhost",
      "smtp_email": "zabbix@example.com"
    }
  }
}

NOTE: Zabbix monitoring requires a Postgres DB backend, which can be installed on the same host as the zabbix-server (internal) or on a remote database service such as Amazon RDS (external). Whether the zabbix-database is external or internal is determined by the Postgres connection string provided in cluster.conf:chef.apzabbix.db.hostport. If this value is set to localhost:5432, the zabbix-database is installed on the host. Otherwise, as shown in the above example, an external database server is used.
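
For an internal zabbix-database, only the hostport value differs from the example above (a sketch derived from the note; the credentials are placeholders):

chef: {
  "apzabbix": {
    "db": {
      # localhost:5432 installs the zabbix-database on the zabbix-server host
      "hostport": "localhost:5432",
      "master_user": "acme_ops",
      "master_pass": "PaSsWoRd",
      "zdb_user": "zabbix",
      "zdb_pass": "PaSsWoRd"
    }
  }
}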

Virtual network configuration

Apcera Platform release 2.4.0 improves virtual network connectivity by using Virtual Extensible LAN (VXLAN) as the default tunneling technology for new clusters. VXLAN is used to tunnel network packets over virtual networks.

Support for Generic Routing Encapsulation (GRE) is deprecated in 2.4, and will be removed entirely in the next major release.

Any new 2.4.0 cluster, Enterprise or Community, will use VXLAN by default. Existing clusters must migrate to VXLAN.

Enabling VXLAN

To enable VXLAN in a 2.4 installation of Apcera Platform Enterprise Edition, or to enable it when updating an existing cluster to 2.4, complete the following steps.

If you do not enable VXLAN, the system will use GRE. Note that GRE is deprecated in 2.4 and will be removed in the next major release.

Step 1. Add "vxlan_enabled": true to the chef.continuum section of your cluster.conf file, for example:

chef: {
  "continuum": {
      "vxlan_enabled": true,
      ...
  }
}

Step 2. In addition, for VXLAN you must ensure that UDP port 4789 is open for each Instance Manager host.

On AWS you can do this by updating the firewall rule for the corresponding Security Group.
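
For example, a sketch using the AWS CLI (the security group ID is a placeholder; adjust the source to match your Instance Manager hosts):

# Allow VXLAN traffic (UDP 4789) between Instance Manager hosts in the same security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol udp --port 4789 \
  --source-group sg-0123456789abcdef0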

Migrating to VXLAN

Apcera Platform release 2.4 includes important updates for virtual networking, including:

  • OVS is upgraded from 2.4 to 2.5.1
  • VXLAN is the default tunneling technology for new clusters
  • Support for Generic Routing Encapsulation (GRE) is deprecated and will be removed in the next LTS release

To update an existing cluster to use VXLAN:

  1. Enable VXLAN as described above.
  2. Redeploy the cluster (see the command below).
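
The redeploy uses the same Orchestrator command shown earlier for cluster_name changes, without the --update-name flag:

orchestrator-cli deploy -c cluster.conf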

Note that jobs in the virtual networks will be restarted. See the release notes for upgrade details.

Splunk

See configuring Splunk.