Using the Metrics API

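The Metrics API is a set of endpoints for querying an Apcera Platform cluster's metrics storage. It consists of the following endpoints:

  • GET /metrics/jobs – See Job Metric Queries.
  • GET /metrics/namespaces – See Job Namespace Metric Queries.
  • GET /metrics/instance_managers – See Instance Manager Metric Queries.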

Metric Query Request and Response Formats

Each Metrics API endpoint takes one or more metric query parameters, each of which is a comma-delimited list of values that provide the input for that query. The order of each value in the list is significant; it determines how the value is interpreted by the API Server.

For example, the following job metrics query contains a single metric query parameter:

GET /metrics/jobs?metric=job::/apcera::continuum-guide,cpu.mean:avg-1hr,-6hr,now HTTP/1.1

The query string value (job::/apcera::continuum-guide,cpu.mean:avg-1hr,-6hr,now) consists of the following parts:

  • job::/apcera::continuum-guide – FQN of the job to query.
  • cpu.mean:avg-1hr – Name of the performance metric to query (cpu.mean), followed by a colon (:) and the name of a down-sampling function combined with a down-sampling time period, separated by a hyphen (avg-1hr).
  • -6hr,now – The time period (from,until) to consider for the query, in this case the 6 hour period preceding the current time (now). You can also specify absolute times for the from and until values, and mix relative and absolute times. See Specifying Metric Query Time Periods for details.
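The ordering of the parts described above can be sketched in Python. The helper name build_metric_query and its parameter names are hypothetical and not part of the Metrics API; the function only joins the parts in the documented order, and it assumes that an until value cannot be given without a from value because position determines meaning.

```python
from typing import Optional


def build_metric_query(fqn: str, metric: str,
                       downsample: Optional[str] = None,
                       from_time: Optional[str] = None,
                       until_time: Optional[str] = None) -> str:
    """Join the parts of a metric query value in the documented order."""
    # The down-sampling function, if any, follows the metric name after a colon.
    metric_part = f"{metric}:{downsample}" if downsample else metric
    parts = [fqn, metric_part]
    # Assumption: "until" is positional, so it requires a preceding "from".
    if until_time is not None and from_time is None:
        raise ValueError("until_time requires from_time")
    if from_time is not None:
        parts.append(from_time)
    if until_time is not None:
        parts.append(until_time)
    return ",".join(parts)


query = build_metric_query("job::/apcera::continuum-guide", "cpu.mean",
                           downsample="avg-1hr", from_time="-6hr",
                           until_time="now")
print(query)
```

The printed value is the same query string value that is broken down in the list above.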

The response to a Metrics API call is a MetricResponse JSON object. Its top-level metrics object is a map of metric query string values (from the original query) to maps of MetricSeries objects, whose keys are the job FQNs, namespace FQNs, or Instance Manager names that were the targets of the query. The response to the example query discussed above is shown below:

GET /metrics/jobs?metric=job::/apcera::continuum-guide,cpu.mean:avg-1hr,-6hr,now HTTP/1.1
{
  "metrics": {
    "job::/apcera::continuum-guide,cpu.mean:avg-1hr,-6hr,now": {
      "job::/apcera::continuum-guide": {
        "times": [
          1507219700,
          1507223300,
          1507226900,
          1507230500,
          1507234100,
          1507237700
        ],
        "values": [
          18250.329166666666,
          18389.229166666668,
          18237.331944444446,
          18250.30277777778,
          18034.50138888889,
          18193.968055555557
        ]
      }
    }
  },
  "errors": {}
}
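As an illustration of the response layout, the following Python sketch walks a MetricResponse and pairs each timestamp with its value. The function name iter_points is hypothetical. The sketch assumes that the timestamps are Unix epoch seconds and that a non-empty errors object indicates a failed query; neither is stated explicitly in this document.

```python
import json
from datetime import datetime, timezone


def iter_points(response_text: str):
    """Yield (query, target, time, value) for every data point in a response."""
    response = json.loads(response_text)
    # Assumption: a non-empty "errors" object signals a failed query.
    if response.get("errors"):
        raise RuntimeError(f"metric query errors: {response['errors']}")
    for query, targets in response["metrics"].items():
        for target, series in targets.items():
            # "times" and "values" are parallel arrays.
            for ts, value in zip(series["times"], series["values"]):
                # Assumption: timestamps are Unix epoch seconds.
                when = datetime.fromtimestamp(ts, tz=timezone.utc)
                yield query, target, when, value


# A shortened copy of the response shown above (first two data points only).
sample = """
{
  "metrics": {
    "job::/apcera::continuum-guide,cpu.mean:avg-1hr,-6hr,now": {
      "job::/apcera::continuum-guide": {
        "times": [1507219700, 1507223300],
        "values": [18250.329166666666, 18389.229166666668]
      }
    }
  },
  "errors": {}
}
"""

for query, target, when, value in iter_points(sample):
    print(target, when.isoformat(), value)
```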

Specifying Metric Query Time Periods

The last two items in each metric query string specify the "from" and "until" dates/times to consider for the query, respectively. Each value can be an absolute or relative time value. A relative time value consists of a minus sign (-) followed by a number and a unit of time (-30min, for example). Valid units of relative time are listed below:

Abbreviation   Unit
s              seconds
min            minutes
h              hours
d              days
w              weeks
mon            month (30 days)
y              year (365 days)

If the "from" value is omitted from a query then it defaults to 24 hours ago; if the "until" value is omitted it defaults to the current time (now). You can also use the term now to indicate the current time.

Absolute time values can be expressed in the format HH:MM_YYMMDD, YYYYMMDD, MM/DD/YY, or any other time format compatible with the at(1) Unix command.
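To make the relative time format concrete, the following Python sketch converts a relative value into a number of seconds using the units listed above. The function name relative_to_seconds is hypothetical, and the sketch runs on the client side only; it does not describe how the API Server parses these values. The hr abbreviation is not in the list of units but appears in this document's example queries, so treating it as hours is an assumption.

```python
import re

# Seconds per unit, taken from the list of relative time units.
UNIT_SECONDS = {
    "s": 1,
    "min": 60,
    "h": 3600,
    "d": 86400,
    "w": 7 * 86400,
    "mon": 30 * 86400,   # a month is counted as 30 days
    "y": 365 * 86400,    # a year is counted as 365 days
    # Assumption: "hr" is used in example queries; treated here as hours.
    "hr": 3600,
}

_RELATIVE = re.compile(r"-(\d+)([a-z]+)")


def relative_to_seconds(value: str) -> int:
    """Return how many seconds before the current time a relative value is."""
    match = _RELATIVE.fullmatch(value)
    if match is None or match.group(2) not in UNIT_SECONDS:
        raise ValueError(f"not a recognized relative time: {value!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


print(relative_to_seconds("-30min"))
print(relative_to_seconds("-2w"))
```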

Metric Down-sampling Functions

By default, a successful query returns the complete time series for the specified time period. Your query can optionally specify a down-sampling function to summarize the data into interval buckets of a certain size. By default, buckets are calculated by rounding to the nearest interval. This works well for intervals smaller than a day. For example, if the summary interval is 1 hour, a value with the timestamp 22:32 ends up in the bucket 22:00-23:00. Any null values in the time series are transformed to zero (0).

To apply a down-sampling function to a metric query, add a colon (:) after the metric name, then the name of the down-sampling function to apply, followed by a hyphen (-) and the size of the interval bucket to down-sample into. For example, the following metric query uses mem_used.mean:avg-1hr to obtain the average memory used by the specified job, summarized into one hour buckets, over the previous 24 hours:

metric=job::/apcera::lucid,mem_used.mean:avg-1hr,-24hr,now

The following down-sampling functions are available:

  • avg – Returns the mean of each bucket.
  • last – Returns the last value in each bucket.
  • max – Returns the maximum value in each bucket.
  • min – Returns the minimum value in each bucket.
  • sum – Returns the sum of each bucket.
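The effect of these functions can be approximated locally with a short Python sketch. This is not the server's implementation: the function name downsample is hypothetical, and the sketch assumes each data point belongs to the bucket that starts at its timestamp rounded down to a multiple of the interval, which matches the 22:32 example above. Null values are replaced with zero, as described earlier.

```python
from statistics import mean

# One reducer per down-sampling function name.
FUNCTIONS = {
    "avg": mean,
    "last": lambda bucket: bucket[-1],  # assumes input is ordered by time
    "max": max,
    "min": min,
    "sum": sum,
}


def downsample(times, values, interval, function):
    """Group points into buckets of `interval` seconds and reduce each bucket."""
    buckets = {}
    for ts, value in zip(times, values):
        # Assumption: a point belongs to the bucket starting at the
        # timestamp rounded down to a multiple of the interval.
        start = ts - ts % interval
        buckets.setdefault(start, []).append(0 if value is None else value)
    reducer = FUNCTIONS[function]
    return {start: reducer(bucket) for start, bucket in sorted(buckets.items())}


times = [0, 10, 20, 300, 310]
values = [1.0, 2.0, 3.0, 4.0, None]
print(downsample(times, values, 300, "avg"))
print(downsample(times, values, 300, "max"))
```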

The following table lists valid down-sampling time abbreviations and units:

Abbreviation   Unit
s              seconds
min            minutes
h              hours
d              days
w              weeks
mon            month (30 days)
y              year (365 days)

To demonstrate down-sampling, consider the following query and response for a job's mean CPU values over the last 30 minutes. The original time series, truncated below for readability, contains 180 data points (6 data points per minute * 30 minutes) for the job target:

GET /metrics/jobs?metric=job::/apcera::lucid,cpu.mean,-30min HTTP/1.1

{
  "metrics": {
    "job::/apcera::lucid,cpu.mean,-30min": {
      "job::/apcera::lucid": {
        "times": [
          1507155550,
          1507155560,
          1507155570,
          ...,
          1507157320,
          1507157330,
          1507157340
        ],
        "values": [
          7327.75,
          5417.5,
          13067.5,
          ...,
          69326,
          5724.5,
          11050
        ]
      }
    }
  },
  "errors": {}
}

In comparison, the following query uses a down-sampling function to average the time series into 5 minute buckets over the 30 minute time period, resulting in a time series of six data points:

GET /metrics/jobs?metric=job::/apcera::lucid,cpu.mean:avg-5min,-30min HTTP/1.1

{
  "metrics": {
    "job::/apcera::lucid,cpu.mean:avg-5min,-30min": {
      "job::/apcera::lucid": {
        "times": [
          1507155780,
          1507156080,
          1507156380,
          1507156680,
          1507156980,
          1507157280
        ],
        "values": [
          17506.333333333332,
          17319.466666666667,
          17402.116666666665,
          17980.691666666666,
          16999.966666666667,
          18346.08620689655
        ]
      }
    }
  },
  "errors": {}
}
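One way to confirm the bucket size of a returned series is to look at the gaps between consecutive timestamps. The Python sketch below does this for the timestamps shown in the two responses above; the function name bucket_spacing is hypothetical, and the sketch assumes the timestamps are expressed in seconds.

```python
def bucket_spacing(times):
    """Return the set of gaps, in seconds, between consecutive timestamps."""
    return {later - earlier for earlier, later in zip(times, times[1:])}


# First three timestamps of the original (not down-sampled) series.
raw_times = [1507155550, 1507155560, 1507155570]

# All six timestamps of the series down-sampled with avg-5min.
downsampled_times = [1507155780, 1507156080, 1507156380,
                     1507156680, 1507156980, 1507157280]

print(bucket_spacing(raw_times))          # gaps in the raw series
print(bucket_spacing(downsampled_times))  # gaps in the down-sampled series
```

The raw points are 10 seconds apart (6 per minute), while the down-sampled points are 300 seconds apart, one per 5 minute bucket.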

Job Metric Queries

The GET /metrics/jobs endpoint lets you query a job's performance metrics for a specified time period. A job metric query string specifies the following:

  • FQN of the job or job namespace to query (required). If the FQN does not contain a local name (job::/apcera/apps, for example) then the query response is a list of MetricSeries objects, one for each job in the specified namespace; otherwise, the response contains a single MetricSeries object for the specified job. See Job Metric Query Examples.
  • Name of the metric to query (required), including any down-sampling function. Valid metric names are listed below, where the .sum, .mean, and .count fields contain the sum, average, and count of the metric for each time bucket in the response, respectively:
    • cpu.sum, cpu.mean, cpu.count – CPU usage (sum, mean, count)
    • mem_total.sum, mem_total.mean, mem_total.count – Memory total (sum, mean, count)
    • mem_used.sum, mem_used.mean, mem_used.count – Memory used (sum, mean, count)
    • disk_total.sum, disk_total.mean, disk_total.count – Disk space total (sum, mean, count)
    • disk_used.sum, disk_used.mean, disk_used.count – Disk space used (sum, mean, count)
    • bandwidth_total.sum, bandwidth_total.mean, bandwidth_total.count – Bandwidth total (sum, mean, count)
    • bandwidth_used.sum, bandwidth_used.mean, bandwidth_used.count – Bandwidth used (sum, mean, count)
    • network_rx.sum, network_rx.mean, network_rx.count – Network received (sum, mean, count)
    • network_tx.sum, network_tx.mean, network_tx.count – Network transferred (sum, mean, count)
  • From date/time (optional). See Specifying Metric Query Time Periods.
  • Until date/time (optional). See Specifying Metric Query Time Periods.
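Whether a job metric query returns one series or one series per job depends on whether the FQN carries a local name. The following Python sketch illustrates the distinction. The names FQN, parse_fqn, and expects_multiple_series are hypothetical, and the sketch assumes that the parts of an FQN are separated by a double colon (::), as in the examples in this document.

```python
from typing import NamedTuple, Optional


class FQN(NamedTuple):
    resource_type: str
    namespace: str
    local_name: Optional[str]


def parse_fqn(fqn: str) -> FQN:
    """Split an FQN such as job::/apcera::lucid into its parts."""
    parts = fqn.split("::")
    if len(parts) == 2:
        # No local name: the FQN refers to a namespace.
        return FQN(parts[0], parts[1], None)
    if len(parts) == 3:
        return FQN(parts[0], parts[1], parts[2])
    raise ValueError(f"unexpected FQN format: {fqn!r}")


def expects_multiple_series(fqn: str) -> bool:
    """True if a job metric query on this FQN targets a whole namespace."""
    return parse_fqn(fqn).local_name is None


print(parse_fqn("job::/apcera::lucid"))
print(expects_multiple_series("job::/apcera/apps"))
```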

Job Metric Query Examples

Example 1 – The following call contains two metric queries, one for mean CPU usage (cpu.mean) and one for mean memory usage (mem_used.mean) on the /apcera::lucid job, averaged into 5 minute buckets, over the last 30 minutes:

GET /metrics/jobs?metric=job::/apcera::lucid,cpu.mean:avg-5min,-30min&metric=job::/apcera::lucid,mem_used.mean:avg-5min,-30min HTTP/1.1

{
  "metrics": {
    "job::/apcera::lucid,cpu.mean:avg-5min,-30min": {
      "job::/apcera::lucid": {
        "times": [
          1507065940,
          1507066240,
          1507066540,
          1507066840,
          1507067140,
          1507067440
        ],
        "values": [
          8671.383333333333,
          8781.683333333332,
          8702.316666666668,
          8497.866666666667,
          8334.416666666666,
          8411.133333333333
        ]
      }
    },
    "job::/apcera::lucid,mem_used.mean:avg-5min,-30min": {
      "job::/apcera::lucid": {
        "times": [
          1507065940,
          1507066240,
          1507066540,
          1507066840,
          1507067140,
          1507067440
        ],
        "values": [
          1202858.6666666667,
          1210094.9333333333,
          1173572.2666666666,
          1174596.2666666666,
          1205111.4666666666,
          1214327.4666666666
        ]
      }
    }
  },
  "errors": {}
}
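A request path with several metric parameters, like the one in Example 1, can be assembled with Python's standard urllib.parse module. The helper name metrics_path is hypothetical. The sketch leaves the colon, slash, comma, and asterisk characters unescaped so that the result matches the request lines shown in this document.

```python
from urllib.parse import urlencode


def metrics_path(endpoint: str, queries) -> str:
    """Build a request path with one `metric` parameter per query value."""
    pairs = [("metric", query) for query in queries]
    # Keep ":", "/", "," and "*" readable instead of percent-encoding them.
    return f"{endpoint}?{urlencode(pairs, safe=':/,*')}"


path = metrics_path("/metrics/jobs", [
    "job::/apcera::lucid,cpu.mean:avg-5min,-30min",
    "job::/apcera::lucid,mem_used.mean:avg-5min,-30min",
])
print(path)
```

Each query value appears as a key in the metrics object of the response, as in the Example 1 response above.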

Example 2 – The following example queries all jobs in the /apps/dev namespace for the mean CPU usage, averaged into 30 minute buckets over the past 60 minutes.

GET /metrics/jobs?metric=job::/apps/dev,cpu.mean:avg-30min,-60min HTTP/1.1

{
  "metrics": {
    "job::/apps/dev,cpu.mean:avg-30min,-60min": {
      "job::/apps/dev::app": {
        "times": [
          1507565880,
          1507567680
        ],
        "values": [
          0,
          298761
        ]
      },
      "job::/apps/dev::app_2": {
        "times": [
          1507565880,
          1507567680
        ],
        "values": [
          0,
          366244
        ]
      }
    }
  },
  "errors": {}
}
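When a query targets a namespace, the response holds one MetricSeries per job. The following Python sketch picks the most recent value for each job from an already-decoded response. The function name latest_values is hypothetical, and the sample dictionary simply reproduces the response shown above.

```python
def latest_values(response: dict, query: str) -> dict:
    """Map each target in a query's result to its most recent value."""
    series_by_target = response["metrics"][query]
    return {
        target: series["values"][-1]
        for target, series in series_by_target.items()
        if series["values"]  # skip targets with an empty series
    }


query = "job::/apps/dev,cpu.mean:avg-30min,-60min"
response = {
    "metrics": {
        query: {
            "job::/apps/dev::app": {
                "times": [1507565880, 1507567680],
                "values": [0, 298761],
            },
            "job::/apps/dev::app_2": {
                "times": [1507565880, 1507567680],
                "values": [0, 366244],
            },
        }
    },
    "errors": {},
}

print(latest_values(response, query))
```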

Job Namespace Metric Queries

The GET /metrics/namespaces endpoint lets you query the aggregated metrics for all jobs in a given namespace. Namespace metrics are pre-aggregated up to four namespace levels deep (/teams/eng/apps/tests, for example). If a query specifies a namespace that is more than four levels deep, then the query behavior of GET /metrics/jobs is invoked instead.
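The four-level limit can be checked on the client side before sending a query. The following Python sketch counts the levels in a namespace FQN. The names namespace_depth and is_preaggregated are hypothetical, and the sketch assumes that each slash-separated segment of the namespace counts as one level, so that /teams/eng/apps/tests has four.

```python
MAX_PREAGGREGATED_DEPTH = 4  # limit stated in the paragraph above


def namespace_depth(fqn: str) -> int:
    """Count the namespace levels in an FQN such as job::/teams/eng."""
    parts = fqn.split("::")
    if len(parts) < 2:
        raise ValueError(f"unexpected FQN format: {fqn!r}")
    # Assumption: every non-empty, slash-separated segment is one level.
    return len([segment for segment in parts[1].split("/") if segment])


def is_preaggregated(fqn: str) -> bool:
    """True if the namespace is shallow enough to be pre-aggregated."""
    return namespace_depth(fqn) <= MAX_PREAGGREGATED_DEPTH


print(namespace_depth("job::/teams/eng/apps/tests"))
print(is_preaggregated("job::/teams/eng/apps/tests/unit"))
```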

A job namespace metric query contains the following values, in the listed order:

  • FQN of a job namespace (required), such as job::/apcera/apps. Only job resource types can be queried. The FQN must not include a local resource name.
  • Name of the metric to query (required), including any down-sampling function. Valid metric names are listed below, where the .sum, .mean, and .count fields contain the sum, average, and count of the metric for each time bucket in the response, respectively:
    • cpu.sum, cpu.mean, cpu.count – CPU usage (sum, mean, count)
    • mem_total.sum, mem_total.mean, mem_total.count – Memory total (sum, mean, count)
    • mem_used.sum, mem_used.mean, mem_used.count – Memory used (sum, mean, count)
    • disk_total.sum, disk_total.mean, disk_total.count – Disk space total (sum, mean, count)
    • disk_used.sum, disk_used.mean, disk_used.count – Disk space used (sum, mean, count)
    • bandwidth_total.sum, bandwidth_total.mean, bandwidth_total.count – Bandwidth total (sum, mean, count)
    • bandwidth_used.sum, bandwidth_used.mean, bandwidth_used.count – Bandwidth used (sum, mean, count)
    • network_rx.sum, network_rx.mean, network_rx.count – Network received (sum, mean, count)
    • network_tx.sum, network_tx.mean, network_tx.count – Network transferred (sum, mean, count)
  • From date/time (optional). See Specifying Metric Query Time Periods.
  • Until date/time (optional). See Specifying Metric Query Time Periods.

Examples

Example 1 – The following example queries for the aggregate mean CPU usage (cpu.mean) and mean memory usage (mem_used.mean) on all jobs in the /apcera namespace, averaged into 5 minute buckets, over the last 30 minutes:

GET /metrics/namespaces?metric=job::/apcera,cpu.mean:avg-5min,-30min&metric=job::/apcera,mem_used.mean:avg-5min,-30min HTTP/1.1

{
  "metrics": {
    "job::/apcera,cpu.mean:avg-5min,-30min": {
      "job::/apcera": {
        "times": [
          1507069860,
          1507070160,
          1507070460,
          1507070760,
          1507071060,
          1507071360
        ],
        "values": [
          11875.281481466667,
          12978.325925899997,
          12755.0759259,
          12830.333333399998,
          12809.60740736667,
          13481.055555433337
        ]
      }
    },
    "job::/apcera,mem_used.mean:avg-5min,-30min": {
      "job::/apcera": {
        "times": [
          1507069860,
          1507070160,
          1507070460,
          1507070760,
          1507071060,
          1507071360
        ],
        "values": [
          3541348.5037036333,
          3534893.5111111,
          3546256.1185184335,
          3550958.9333332004,
          3553515.140740534,
          3556230.6370369
        ]
      }
    }
  },
  "errors": {}
}

Example 2 – The following example queries for the sum of memory used (mem_used.sum) by all jobs in the /apcera namespace, averaged into 10 minute buckets (avg-10min), over the past 1 hour (-1hr).

GET /metrics/namespaces?metric=job::/apcera,mem_used.sum:avg-10min,-1hr HTTP/1.1

{
  "metrics": {
    "job::/apcera,mem_used.sum:avg-10min,-1hr": {
      "job::/apcera": {
        "times": [
          1507181240,
          1507181840,
          1507182440,
          1507183040,
          1507183640,
          1507184240
        ],
        "values": [
          99187097.6,
          100788838.4,
          110801237.33333333,
          99603660.8,
          99705378.13333334,
          99861174.23728813
        ]
      }
    }
  },
  "errors": {}
}

Instance Manager Metric Queries

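The GET /metrics/instance_managers endpoint lets you query performance metrics for your cluster's Instance Managers (IMs). Each metric query parameter value consists of a list of values in a specific order. In the example below, these are the Instance Manager to query (where * matches every IM in the cluster), the metric name with its down-sampling function, and the from time.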

Examples

Example 1 – The following example queries the maximum memory used by each IM, in one hour buckets, over the previous two hours.

GET /metrics/instance_managers?metric=*,mem_used:max-1hr,-2hr HTTP/1.1

{
  "metrics": {
    "*,mem_used:max-1hr,-2hr": {
      "mycluster-53fbd4ac": {
        "times": [
          1507135600,
          1507139200
        ],
        "values": [
          9546235904,
          9546235904
        ]
      },
      "mycluster-7c5e756c": {
        "times": [
          1507135600,
          1507139200
        ],
        "values": [
          0,
          0
        ]
      },
      "mycluster-b16c73b1": {
        "times": [
          1507135600,
          1507139200
        ],
        "values": [
          10099884032,
          10099884032
        ]
      }
    }
  },
  "errors": {}
}
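A response such as the one above can be reduced to a ranking of Instance Managers by their highest reported value. The following Python sketch shows one way to do this on an already-decoded response. The function name rank_by_peak is hypothetical, and the sample dictionary reproduces the response above.

```python
def rank_by_peak(response: dict, query: str):
    """Return (target, peak value) pairs, highest peak first."""
    peaks = {
        target: max(series["values"])
        for target, series in response["metrics"][query].items()
        if series["values"]  # skip targets with an empty series
    }
    return sorted(peaks.items(), key=lambda item: item[1], reverse=True)


query = "*,mem_used:max-1hr,-2hr"
response = {
    "metrics": {
        query: {
            "mycluster-53fbd4ac": {
                "times": [1507135600, 1507139200],
                "values": [9546235904, 9546235904],
            },
            "mycluster-7c5e756c": {
                "times": [1507135600, 1507139200],
                "values": [0, 0],
            },
            "mycluster-b16c73b1": {
                "times": [1507135600, 1507139200],
                "values": [10099884032, 10099884032],
            },
        }
    },
    "errors": {},
}

for name, peak in rank_by_peak(response, query):
    print(name, peak)
```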