Auto-scaling Job Instances

Job auto-scaling is a platform feature that automatically scales the number of job instances up or down in response to application load. You specify a performance metric to monitor (for example, network request latency, CPU usage, or a custom metric), select and configure an auto-scaling method that responds to observed changes in the metric, and specify the maximum and minimum number of job instances you want to run.

The following diagram illustrates the basic flow of the auto-scaling system and its components. The system continually gathers raw performance metrics for each job instance over a time period you specify, called the "observation interval". At the end of this interval, the auto-scaling system calculates the performance metric's new value. Based on this value, and on the job's auto-scaling configuration, the auto-scaling method decides whether to scale the number of instances up, down, or make no change.

The system provides a "warm-up" period after an instance count change to allow new instances to start and old instances to be torn down. After this warm-up period has elapsed, the auto-scaling system resumes aggregating performance metrics from job instances.
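The observe/decide/warm-up cycle described above can be sketched as a simple control loop. This is an illustrative Python sketch with hypothetical callback names (`observe`, `decide`, `apply_change`), not the platform's implementation:

```python
import time

def autoscale_loop(observe, decide, apply_change,
                   observation_interval_secs=10, warmup_secs=10,
                   iterations=1, sleep=time.sleep):
    """Illustrative observe -> decide -> warm-up cycle (not platform code).

    `observe` returns the metric aggregated over the interval, `decide`
    maps that value to a signed instance delta, and `apply_change`
    performs the scaling action.
    """
    for _ in range(iterations):
        sleep(observation_interval_secs)   # gather raw metrics over the interval
        delta = decide(observe())
        if delta != 0:
            apply_change(delta)
            sleep(warmup_secs)             # pause aggregation while instances settle
```

The `iterations` and `sleep` parameters exist only to make the sketch easy to exercise; a real control loop would run indefinitely.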

Job auto-scaling configuration

The following parameters are common to all auto-scaling configurations:

  • The performance metric you want to monitor. You can monitor one of the built-in metrics, or specify a custom metric provided by an HTTP endpoint.
  • The auto-scaling method to use and its configuration values. Apcera provides two auto-scaling methods: one threshold-based and one based on PID (proportional-integral-derivative) control theory. Each method provides its own set of parameters you configure to control its behavior.
  • The maximum and minimum number of job instances you want running at any point. You can specify the initial job instance count as long as it falls within the maximum and minimum values you specify.
  • The observation interval (in seconds) over which the platform monitors job metrics before making an auto-scaling decision.
  • The warm-up period (in seconds) after a job instance count change before further auto-scaling actions are considered.

Choosing an auto-scaling method

The auto-scaling method you select identifies the internal logic used to determine if an auto-scaling action should be taken. Apcera provides two auto-scaling methods: a PID-based (proportional–integral–derivative) controller and a threshold-based controller.

One criterion for choosing an auto-scaling method is how long it takes your application to handle requests. For applications with long-lived requests, or requests with varying response times, the PID auto-scaling method will provide better results. If your application responds to requests quickly and uniformly then the threshold method is a better fit.

The PID auto-scaler, while not as intuitive to configure as the threshold auto-scaler, sets the number of instances to reach the desired setpoint in a controlled way that prevents overshooting or undershooting the number of instances needed for the job.

PID auto-scaling method

A proportional–integral–derivative (PID) controller is a control loop feedback mechanism that continuously calculates an error value as the difference between a desired metric setpoint and the measured value. It applies a correction based on proportional (P), integral (I), and derivative (D) terms.

  • The proportional term dictates the magnitude of the corrective action in proportion to the magnitude of the error. It determines how quickly and aggressively the auto-scaler reacts to changes in the value of the metric.
  • When the error is steady and small over a long period of time, the proportional term is not effective, as its value will be equally small. The integral term is an accumulation of the error, second by second; its value will eventually be significant enough to eliminate the steady small errors.
  • The derivative term measures the rate of change of the error. By adding it to the delta, the controller tries to anticipate the next value of the error and act accordingly.
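The three terms above can be sketched as a discrete-time controller. The following Python sketch is illustrative only (the class name and method signatures are hypothetical, not the platform's implementation); it uses the gain values from the manifest example below:

```python
class PIDScaler:
    """Illustrative discrete PID controller for scaling decisions."""

    def __init__(self, setpoint, kp, ki, kd):
        self.setpoint = setpoint
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0      # accumulated error (I term)
        self.prev_error = 0.0    # previous error, for the D term

    def delta(self, measured, dt=1.0):
        """Return a signed correction based on the measured metric value."""
        error = measured - self.setpoint
        self.integral += error * dt                   # accumulate steady error
        derivative = (error - self.prev_error) / dt   # rate of change of error
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)

scaler = PIDScaler(setpoint=200.0, kp=0.45, ki=0.0013, kd=0)
# Latency above the setpoint yields a positive correction (scale up):
print(scaler.delta(300.0) > 0)  # True
```

Note how a measurement above the setpoint produces a positive correction (add instances) and a measurement below it a negative one (remove instances), which is why the PID method can converge on the setpoint without oscillating.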

The PID method is identified in auto-scaling configuration settings by the string pid.

Example PID configuration using a multi-resource manifest

The following multi-resource manifest configures a job for auto-scaling using the PID auto-scaling method ("type": "pid"). The auto-scaler's setpoint parameter is configured to maintain a request latency of 200 milliseconds. The proportional gain term (kp) is set to 0.45, the integral term (ki) is set to 0.0013 and the derivative term (kd) is set to zero (0).

{
  "jobs": {
    "job::/sandbox/admin::myapp": {
      "packages": [
        { "fqn": "package::/sandbox/admin::myapp-pkg" }
      ],
      "routes": [
        {
          "type": "http",
          "endpoint": "auto",
          "config": {
            "/": [
              { "port": 0 }
            ]
          }
        }
      ],
      "instances": 1,
      "autoscaling": {
        "max_instances": 5,
        "min_instances": 1,
        "observation_interval_secs": 10,
        "rule": {
          "type": "pid",
          "metric": "request_latency",
          "config": {
            "setpoint": 200.0,
            "kp": 0.45,
            "ki": 0.0013,
            "kd": 0
          }
        }
      }
    }
  }
}

Threshold auto-scaling method

The threshold auto-scaling controller scales the number of job instances when the value of the observed performance metric breaches lower or upper thresholds that you specify. In addition to the metric to observe, this auto-scaling method takes the following parameters:

  • Upper and lower metric thresholds that should trigger an auto-scaling action.
  • The number of job instances to create or destroy during each scale action. This lets your app, for example, scale up its instance count more quickly than it scales down.
  • An optional time window during which the metric must remain outside the threshold boundaries before an auto-scaling action is taken. Once this time period is up, a final metric calculation is made; only if this newly observed metric value breaches a threshold is a scaling action taken.
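The decision logic for a single observation can be sketched as follows. This is an illustrative Python sketch whose parameter names mirror the manifest fields below; it is not the platform's implementation:

```python
def threshold_decision(metric, lower, upper, scale_up_delta, scale_down_delta):
    """Return the signed change in instance count for one observation."""
    if metric > upper:
        return scale_up_delta      # metric breached the upper bound: add instances
    if metric < lower:
        return -scale_down_delta   # metric fell below the lower bound: remove instances
    return 0                       # within bounds: no change

# With the example thresholds (latency in milliseconds):
print(threshold_decision(1200, lower=100, upper=1000,
                         scale_up_delta=1, scale_down_delta=1))  # 1
```

Separate up/down deltas let an application add capacity faster than it releases it, as described above.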

Example threshold configuration using a multi-resource manifest

The following manifest configures threshold auto-scaling on a job to maintain a network request latency of at most 1000 milliseconds, scaling up the job instance count as necessary; if the request latency falls below 100 milliseconds, the instance count is scaled down. The monitoring_window_secs parameter indicates that the auto-scaler makes a final metric calculation after that time period before making a final decision to auto-scale.

{
  "jobs": {
    "job::/sandbox/admin::myapp": {
      "packages": [
        { "fqn": "package::/sandbox/admin::myapp-pkg" }
      ],
      "routes": [
        {
          "type": "http",
          "endpoint": "auto",
          "config": {
            "/": [
              { "port": 0 }
            ]
          }
        }
      ],
      "instances": 1,
      "autoscaling": {
        "max_instances": 5,
        "min_instances": 1,
        "observation_interval_secs": 10,
        "warmup_secs": 10,
        "rule": {
          "type": "threshold",
          "metric": "request_latency",
          "config": {
            "monitoring_window_secs": 5,
            "upper_threshold": 1000,
            "lower_threshold": 100,
            "scale_up_delta": 1,
            "scale_down_delta": 1
          }
        }
      }
    }
  }
}

Performance Metrics

You can monitor one of the built-in performance metrics or use a custom metric provided by an arbitrary HTTP endpoint.

Built-in metrics

The following lists the available built-in performance metrics and how they are calculated.

  • cpu_per_second – Average CPU usage for all job instances during the observation period, in milliseconds of processing time per second (ms/s).
  • request_latency – Average HTTP request latency across a job's instances, weighted by the total number of requests each instance received during the observation period.
  • requests_per_second – Average number of HTTP requests per second handled by each instance of a job during the observation period.

Note: The request_latency and requests_per_second metrics are only available for jobs with a defined route.
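One plausible reading of the request_latency calculation is a request-weighted average across instances. The following Python sketch illustrates that interpretation; it is an assumption for illustration, not the platform's actual aggregation code:

```python
def request_latency(latencies, request_counts):
    """Request-weighted average latency across a job's instances.

    `latencies[i]` is instance i's average latency (ms) and
    `request_counts[i]` is its request count for the observation period.
    """
    total_requests = sum(request_counts)
    if total_requests == 0:
        return 0.0  # no traffic observed, nothing to average
    weighted = sum(l * n for l, n in zip(latencies, request_counts))
    return weighted / total_requests

# A busy fast instance pulls the average below the plain mean of 200 ms:
print(request_latency([100.0, 300.0], [30, 10]))  # 150.0
```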

Custom metrics

Instead of using a built-in metric to make auto-scaling decisions, you can monitor a custom HTTP endpoint that you specify. This endpoint is expected to return a numeric value (float) that is used to make an auto-scaling decision. For example, suppose you have an application that processes orders from an external billing system. The billing system maintains a queue of orders waiting to be processed and an HTTP endpoint that returns the size of the queue (http://monitoring.example.com/queue_size). The performance goal is to keep the order queue at 50 orders.
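A minimal metric endpoint of this kind can be sketched with Python's standard library. The hardcoded queue size here is a stand-in for a real query against the billing system; the handler simply returns the value as a bare number, as the auto-scaler expects:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

ORDER_QUEUE_SIZE = 72  # stand-in value; a real service would query the billing system

class QueueSizeHandler(BaseHTTPRequestHandler):
    """Serves the current order queue size as a bare numeric value."""

    def do_GET(self):
        body = str(float(ORDER_QUEUE_SIZE)).encode()  # response body is just "72.0"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet

# To serve on port 8080:
#     HTTPServer(("", 8080), QueueSizeHandler).serve_forever()
```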

To configure a custom metric you add a monitoring configuration block to a job definition in a multi-resource manifest. This block has properties that specify the custom metric name to monitor and an HTTP endpoint that returns the metric value. For example, the following manifest defines a new metric named queue_size and its monitoring endpoint (http://monitoring.example.com/queue). The autoscaling block is configured to monitor the queue_size metric.

{
  "jobs": {
    "job::/sandbox/admin::myapp": {
      ...,
      "monitoring": {
        "queue_size": {
          "type": "http_metric",
          "config": {
            "url": "http://monitoring.example.com/queue"
          }
        }
      },
      "autoscaling": {
        "max_instances": 20,
        "min_instances": 1,
        "observation_interval_secs": 10,
        "rule": {
          "type": "pid",
          "metric": "queue_size",
          "config": {
            "setpoint": 50,
            "KP": 0.45,
            "KI": 0.01,
            "delta": 1
          }
        }
      }
    }
  }
}

You can use custom metrics with either auto-scaling method (PID or threshold). When using the PID method, the value returned by the monitoring endpoint must be able to move both below and above the PID setpoint for auto-scaling to function properly. In other words, the PID auto-scaler must be able to apply corrective deltas in both directions (negative and positive).

Note that custom metrics can only be configured using a multi-resource manifest; there is currently no APC support for this feature.

Multi-resource manifest auto-scaling configuration

You can configure a job's auto-scaling parameters in a multi-resource manifest using an autoscaling configuration block. To configure a custom metric to monitor you add a monitoring configuration block to a job definition.

Auto-scaling configuration using APC

You use the apc app create command to configure auto-scaling when creating a new application. To update (or remove) the auto-scaling configuration on an existing application, you use the apc app autoscale command.

Below is an example APC command string that enables auto-scaling on a new application using the threshold method. The auto-scaler is configured to monitor requests per second: if the average exceeds 10 requests/second, the job instance count is scaled up by two new instances per scale action, up to a maximum of 20 instances; if the average falls below 5 requests/second, the instance count is scaled down by one job instance per scale action, to a minimum of 3 job instances.

apc app create my-app --autoscale \
--autoscale-max-instances 20 --autoscale-min-instances 3 \
--autoscale-metric requests_per_second \
--autoscale-threshold-lower 5 --autoscale-threshold-upper 10 \
--autoscale-threshold-scaleup-delta 2  --autoscale-threshold-scaledown-delta 1 \
--autoscale-threshold-window 10

To update auto-scaling for an existing application, use the apc app autoscale command. For example, the following updates the specified job's auto-scaling configuration to use the PID auto-scaling method to maintain a setpoint of 200 millisecond request latency:

apc app autoscale sample --max-instances 25 --min-instances 5 \
    --method pid --metric request_latency --setpoint 200
╭───────────────────────────────────────────────╮
│             Autoscaling Settings              │
├─────────────────┬─────────────────────────────┤
│            FQN: │ job::/sandbox/admin::sample │
│  Max Instances: │ 25                          │
│  Min Instances: │ 5                           │
│  Obs. Interval: │ 10                          │
│ Warmup Seconds: │ 0                           │
│         Metric: │ request_latency             │
│         Method: │ pid                         │
│             KP: │ 0.450                       │
│             KI: │ 0.001                       │
│             KD: │ 0.000                       │
╰─────────────────┴─────────────────────────────╯

Is this correct? [Y/n]:

You can also use APC's interactive mode to be prompted for all available auto-scaling parameters (as long as you don't include the --batch command-line parameter, which disables interactive mode). Press "Y" when asked if you want to enable auto-scaling and follow the subsequent prompts to specify the other auto-scaling values, for example:

apc app create app
Deploy path [/Users/timstatler/go/src/github.com/apcera/sample-apps/example-static]:
Instances [1]:
Memory [256MiB]:
Enable Autoscaling [y/N]: Y
Autoscaling Metric:
[0] AVG CPU/sec
[1] AVG requests/sec
[2] AVG request latency
Enter your selection [0]: 1
Autoscaling Min Instances [1]: 5
Autoscaling Max Instances [6]: 15
...
╭──────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                         Application Settings                                         │
├──────────────────────────────┬───────────────────────────────────────────────────────────────────────┤
│                         FQN: │ job::/sandbox/admin::app                                              │
│                   Directory: │ /Users/timstatler/go/src/github.com/apcera/sample-apps/example-static │
│                   Instances: │ 1                                                                     │
│                     Restart: │ always                                                                │
│            Staging Pipeline: │ (will be auto-detected)                                               │
│                         CPU: │ 0ms/s (uncapped)                                                      │
│                      Memory: │ 256MiB                                                                │
│                        Disk: │ 1GiB                                                                  │
│                      NetMin: │ 5Mbps                                                                 │
│                      Netmax: │ 0Mbps (uncapped)                                                      │
│                    Route(s): │ auto (HTTPS only: false)                                              │
│             Startup Timeout: │ 30 (seconds)                                                          │
│                Stop Timeout: │ 5 (seconds)                                                           │
│                              │                                                                       │
│                 Autoscaling: │ (enabled)                                                             │
│                      Metric: │ AVG requests/sec                                                      │
│               Min Instances: │ 5                                                                     │
│               Max Instances: │ 15                                                                    │
│ Observation Period Duration: │ 10 (seconds)                                                          │
│      Warmup Period Duration: │ 0 (seconds)                                                           │
│                      Method: │ Threshold                                                             │
│             Lower Threshold: │ 1 (avg req/s)                                                         │
│             Upper Threshold: │ 2 (avg req/s)                                                         │
│              Scale-Up Delta: │ 1                                                                     │
│            Scale-Down Delta: │ 1                                                                     │
│ Threshold Monitoring Window: │ 0 (seconds)                                                           │
╰──────────────────────────────┴───────────────────────────────────────────────────────────────────────╯

To remove auto-scaling from an app, pass the --disable flag to the apc app autoscale command:

apc app autoscale sample --disable

Specifying initial job instance count

When creating or updating a job that is configured for auto-scaling, you can manually specify how many instances you would like to create when the job is started. The only requirement is that the instance count you specify falls between the minimum and maximum number of instances specified in the job's auto-scaling configuration. If the requested instance count is outside those bounds, the platform corrects the instance count so it conforms.

This is useful, for example, if you are starting a new job that will immediately experience high load. You can set a high initial instance job count rather than wait for the auto-scaling system to respond to the load dynamically.

For example, suppose you have a job configured to auto-scale between one and five instances and you set the initial instance count to two instances. In this case, the job will start with two instances, as requested. However, if you specified an initial instance count of 10 (outside of the auto-scaling bounds), the job will start with five instances (the maximum instance count allowed by auto-scaling).
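The correction described above amounts to clamping the requested count to the configured bounds. A minimal sketch (the function name is hypothetical, not platform code):

```python
def initial_instances(requested, min_instances, max_instances):
    """Clamp a requested initial instance count to the auto-scaling bounds."""
    return max(min_instances, min(requested, max_instances))

print(initial_instances(2, 1, 5))   # 2: within bounds, honored as requested
print(initial_instances(10, 1, 5))  # 5: above the maximum, corrected down
```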