Using the Metrics API
The Metrics API is a set of endpoints for querying an Apcera Platform cluster's metrics storage. It consists of the following endpoints:
- GET /metrics/jobs – Queries metrics for one or more jobs. See Job Metric Queries for examples.
- GET /metrics/namespaces – Queries metrics for all job resources in a namespace. See Job Namespace Metric Queries for examples.
- GET /metrics/instance_managers – Queries metrics for one or more Instance Managers. See Instance Manager Metric Queries for examples.
Metric Query Request and Response Formats
Each Metrics API endpoint takes one or more metric query parameters, each of which is a comma-delimited list of values that provide the input for that query. The order of the values in the list is significant; it determines how the API Server interprets each value.
For example, the following job metrics query contains a single metric query parameter:
GET /metrics/jobs?metric=job::/apcera::continuum-guide,cpu.mean:avg-1hr,-6hr,now HTTP/1.1
The query string value (job::/apcera::continuum-guide,cpu.mean:avg-1hr,-6hr,now) consists of the following parts:
- job::/apcera::continuum-guide – FQN of the job to query.
- cpu.mean:avg-1hr – Name of the performance metric to query (cpu.mean) followed by a colon (:) and the name of a down-sampling function combined with a down-sampling time period, separated by a hyphen (avg-1hr).
- -6hr,now – The time period (from,until) to consider for the query, in this case the 6-hour period preceding the current time (now). You can also specify absolute times for the from and until values, and mix relative and absolute times. See Specifying Metric Query Time Periods for details.
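As a rough illustration of the value ordering described above, a metric query string could be assembled as in the following Python sketch. The helper names build_metric_param and build_query_path are hypothetical and not part of the Metrics API. The rule that "until" cannot be supplied without "from" is an assumption that follows from the values being positional.

```python
from urllib.parse import urlencode


def build_metric_param(target, metric, downsample=None, from_time=None, until_time=None):
    """Join the parts of one metric query into a comma-delimited string."""
    # Metric name plus optional ":function-interval" suffix, e.g. "cpu.mean:avg-1hr".
    metric_part = f"{metric}:{downsample}" if downsample else metric
    parts = [target, metric_part]
    if until_time is not None and from_time is None:
        # Assumption: values are positional, so "until" needs a preceding "from".
        raise ValueError("until_time requires from_time")
    if from_time is not None:
        parts.append(from_time)
    if until_time is not None:
        parts.append(until_time)
    return ",".join(parts)


def build_query_path(endpoint, metric_params):
    """Build a request path with one `metric` parameter per query."""
    # Keep ':', '/', ',' and '*' unescaped so the path reads like the examples.
    query = urlencode([("metric", p) for p in metric_params], safe=":/,*")
    return f"{endpoint}?{query}"


param = build_metric_param(
    "job::/apcera::continuum-guide", "cpu.mean", "avg-1hr", "-6hr", "now"
)
print(build_query_path("/metrics/jobs", [param]))
```

Passing several strings in metric_params produces one metric parameter per query, joined with an ampersand.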
The response to a Metrics API call is a MetricResponse JSON object. Its top-level metrics object maps each metric query string value (from the original query) to a map of MetricSeries objects, keyed by the job or namespace FQN or the Instance Manager name that was the target of the query. The response to the example query discussed above is shown below:
GET /metrics/jobs?metric=job::/apcera::continuum-guide,cpu.mean:avg-1hr,-6hr,now HTTP/1.1
{
"metrics": {
"job::/apcera::continuum-guide,cpu.mean:avg-1hr,-6hr,now": {
"job::/apcera::continuum-guide": {
"times": [
1507219700,
1507223300,
1507226900,
1507230500,
1507234100,
1507237700
],
"values": [
18250.329166666666,
18389.229166666668,
18237.331944444446,
18250.30277777778,
18034.50138888889,
18193.968055555557
]
}
}
},
"errors": {}
}
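The nested shape of a MetricResponse can be walked with a few lines of Python. The sketch below uses a copy of the example response trimmed to two data points; iter_points is a hypothetical helper name, and reading the times values as Unix epoch seconds in UTC is an assumption based on the sample data.

```python
import json
from datetime import datetime, timezone

# Two data points copied from the example response; the rest are omitted here.
SAMPLE = """
{
  "metrics": {
    "job::/apcera::continuum-guide,cpu.mean:avg-1hr,-6hr,now": {
      "job::/apcera::continuum-guide": {
        "times": [1507219700, 1507223300],
        "values": [18250.329166666666, 18389.229166666668]
      }
    }
  },
  "errors": {}
}
"""


def iter_points(response):
    """Yield (query, target, time, value) for every data point in a response."""
    for query, targets in response.get("metrics", {}).items():
        for target, series in targets.items():
            # "times" and "values" are parallel arrays of equal length.
            for ts, value in zip(series["times"], series["values"]):
                # Assumption: timestamps are Unix epoch seconds (UTC).
                yield query, target, datetime.fromtimestamp(ts, tz=timezone.utc), value


points = list(iter_points(json.loads(SAMPLE)))
for query, target, when, value in points:
    print(target, when.isoformat(), value)
```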
Specifying Metric Query Time Periods
The last two items in each metric query string specify the "from" and "until" dates/times to consider for the query, respectively. Each value can be an absolute or relative time value. A relative time value consists of a minus sign (-) followed by a quantity and a unit of time (-30min, for example). Valid units of relative time are listed below:
Abbreviation | Unit |
---|---|
s | seconds |
min | minutes |
h | hours |
d | days |
w | weeks |
mon | month (30 days) |
y | year (365 days) |
If the "from" value is omitted from a query then it defaults to 24 hours ago; if the "until" value is omitted it defaults to the current time. You can also use the term now to indicate the current time explicitly.
Absolute time values can be expressed in the format HH:MM_YYMMDD, YYYYMMDD, MM/DD/YY, or any other time format compatible with the at(1) Unix command.
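The following Python sketch shows one way to produce relative and absolute time values in the formats listed above. The function names are hypothetical. The unit set comes from the table, with hr added only because it appears in this page's example queries.

```python
from datetime import datetime

# Units from the table above; "hr" is not in the table but is used in
# this page's example queries.
RELATIVE_UNITS = {"s", "min", "h", "hr", "d", "w", "mon", "y"}


def relative_time(amount, unit):
    """Format a relative time such as "-30min"."""
    if unit not in RELATIVE_UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    if amount <= 0:
        raise ValueError("amount must be a positive integer")
    return f"-{amount}{unit}"


def absolute_time(moment):
    """Format a datetime as HH:MM_YYMMDD."""
    return moment.strftime("%H:%M_%y%m%d")


def absolute_date(moment):
    """Format a datetime as YYYYMMDD."""
    return moment.strftime("%Y%m%d")


print(relative_time(30, "min"))
print(absolute_time(datetime(2017, 10, 5, 16, 8)))
print(absolute_date(datetime(2017, 10, 5, 16, 8)))
```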
Metric Down-sampling Functions
By default, a successful query returns the complete time series for the specified time period. Your query can optionally specify a down-sampling function to summarize the data into interval buckets of a certain size. Buckets are calculated by rounding to the nearest interval, which works well for intervals smaller than a day. For example, if the summary interval is 1 hour, a value with a timestamp of 22:32 ends up in the 22:00-23:00 bucket. Any null values in the time series are converted to zero (0).
To apply a down-sampling function to a metric query, add a colon (:) after the metric name, then the name of the down-sampling function, followed by a hyphen (-) and the size of the interval bucket. For example, the following metric query uses mem_used.mean:avg-1hr to obtain the average memory used by the specified job, summarized into one-hour buckets, over the previous 24 hours:
metric=job::/apcera::lucid,mem_used.mean:avg-1hr,-24hr,now
The following down-sampling functions are available:
- avg – Returns the mean of each bucket.
- last – Returns the last value in each bucket.
- max – Returns the maximum value in each bucket.
- min – Returns the minimum value in each bucket.
- sum – Returns the sum of each bucket.
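To make the five functions concrete, the following Python sketch down-samples a time series on the client side. It is an illustration of the idea, not the server's implementation. Assigning each point to the bucket that starts at or before its timestamp follows the 22:32 example above, and labeling each bucket by its start time is an assumption; the timestamps the API returns may be aligned differently.

```python
# Reducers matching the function names listed above.
FUNCTIONS = {
    "avg": lambda vs: sum(vs) / len(vs),
    "last": lambda vs: vs[-1],
    "max": max,
    "min": min,
    "sum": sum,
}


def downsample(times, values, function, interval_seconds):
    """Group points into fixed-size buckets and reduce each bucket to one value."""
    buckets = {}
    for ts, value in zip(times, values):
        # Assumption: a point belongs to the bucket starting at or before it.
        start = ts - ts % interval_seconds
        # Null values are treated as zero, as described above.
        buckets.setdefault(start, []).append(0 if value is None else value)
    reduce_bucket = FUNCTIONS[function]
    starts = sorted(buckets)
    return starts, [reduce_bucket(buckets[s]) for s in starts]


times = [0, 10, 20, 3600, 3610]
values = [1, None, 5, 10, 20]
print(downsample(times, values, "avg", 3600))
print(downsample(times, values, "max", 3600))
```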
The following table lists valid down-sampling time abbreviations and units:
Abbreviation | Unit |
---|---|
s | seconds |
min | minutes |
h | hours |
d | days |
w | weeks |
mon | month (30 days) |
y | year (365 days) |
To demonstrate down-sampling, consider the following query and response for a job's mean CPU values over the last 30 minutes. The original time series, truncated below for readability, contains 180 data points (6 data points per minute * 30 minutes) for the job target:
GET /metrics/jobs?metric=job::/apcera::lucid,cpu.mean,-30min
{
"metrics": {
"job::/apcera::lucid,cpu.mean,-30min": {
"job::/apcera::lucid": {
"times": [
1507155550,
1507155560,
1507155570,
...,
1507157320,
1507157330,
1507157340
],
"values": [
7327.75,
5417.5,
13067.5,
...,
69326,
5724.5,
11050
]
}
}
},
"errors": {}
}
In comparison, the following query uses a down-sampling function to average the time series into 5-minute buckets over the same 30-minute time period, resulting in a time series of six data points:
GET /metrics/jobs?metric=job::/apcera::lucid,cpu.mean:avg-5min,-30min HTTP/1.1
{
"metrics": {
"job::/apcera::lucid,cpu.mean:avg-5min,-30min": {
"job::/apcera::lucid": {
"times": [
1507155780,
1507156080,
1507156380,
1507156680,
1507156980,
1507157280
],
"values": [
17506.333333333332,
17319.466666666667,
17402.116666666665,
17980.691666666666,
16999.966666666667,
18346.08620689655
]
}
}
},
"errors": {}
}
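The data point counts in the two responses above follow from simple division, sketched below in Python. The 10-second raw sampling step is inferred from the spacing of the timestamps in the first response and is an assumption about the example cluster, not a documented constant.

```python
def expected_points(period_seconds, step_seconds):
    """Number of data points when a period is split into fixed-size steps."""
    return period_seconds // step_seconds


PERIOD = 30 * 60  # the -30min query window, in seconds

raw_points = expected_points(PERIOD, 10)           # raw series, 10-second step
bucketed_points = expected_points(PERIOD, 5 * 60)  # avg-5min buckets

print(raw_points, bucketed_points)
```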
Job Metric Queries
The GET /metrics/jobs endpoint lets you query a job's performance metrics for a specified time period. A job metric query string specifies the following:
- FQN of the job or job namespace to query (required). If the FQN does not contain a local name (job::/apcera/apps, for example), the query response is a list of MetricSeries objects, one for each job in the specified namespace; otherwise, the response contains a single MetricSeries object for the specified job. See Job Metric Query Examples.
- Name of the metric to query (required), including any functions. Valid metric names are listed below, where the .sum, .mean, and .count fields contain the sum, average, and count of the metric for each time bucket in the response, respectively:
  - cpu.sum, cpu.mean, cpu.count – CPU usage (sum, mean, and count)
  - mem_total.sum, mem_total.mean, mem_total.count – Memory total (sum, mean, and count)
  - mem_used.sum, mem_used.mean, mem_used.count – Memory used (sum, mean, and count)
  - disk_total.sum, disk_total.mean, disk_total.count – Disk space total (sum, mean, and count)
  - disk_used.sum, disk_used.mean, disk_used.count – Disk space used (sum, mean, and count)
  - bandwidth_total.sum, bandwidth_total.mean, bandwidth_total.count – Bandwidth total (sum, mean, and count)
  - bandwidth_used.sum, bandwidth_used.mean, bandwidth_used.count – Bandwidth used (sum, mean, and count)
  - network_rx.sum, network_rx.mean, network_rx.count – Network received (sum, mean, and count)
  - network_tx.sum, network_tx.mean, network_tx.count – Network transferred (sum, mean, and count)
- From date/time (optional). See Specifying Metric Query Time Periods.
- Until date/time (optional). See Specifying Metric Query Time Periods.
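A client might check a query's inputs before sending it. The Python sketch below builds the set of valid job metric names from the list above and tests whether a job FQN includes a local name, which determines whether the response holds one series or one per job. The names JOB_METRICS and targets_single_job are hypothetical, and the FQN parsing assumes the job::namespace::local-name layout seen in this page's examples.

```python
BASE_METRICS = (
    "cpu", "mem_total", "mem_used", "disk_total", "disk_used",
    "bandwidth_total", "bandwidth_used", "network_rx", "network_tx",
)
FIELDS = ("sum", "mean", "count")

# Every base metric combined with every field, e.g. "cpu.mean".
JOB_METRICS = {f"{base}.{field}" for base in BASE_METRICS for field in FIELDS}


def targets_single_job(fqn):
    """Return True if a job FQN includes a local name, False for a namespace."""
    resource_type, sep, rest = fqn.partition("::")
    if resource_type != "job" or not sep:
        raise ValueError(f"not a job FQN: {fqn!r}")
    # Assumption: the local name, when present, follows a second "::".
    _namespace, _sep, local_name = rest.partition("::")
    return bool(local_name)


print("cpu.mean" in JOB_METRICS)
print(targets_single_job("job::/apcera::lucid"))
print(targets_single_job("job::/apcera/apps"))
```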
Job Metric Query Examples
Example 1 – The following call contains two metric queries, one for mean CPU usage (cpu.mean) and one for mean memory usage (mem_used.mean) on the /apcera::lucid job, averaged into 5-minute buckets, over the last 30 minutes:
GET /metrics/jobs?metric=job::/apcera::lucid,cpu.mean:avg-5min,-30min&metric=job::/apcera::lucid,mem_used.mean:avg-5min,-30min HTTP/1.1
{
"metrics": {
"job::/apcera::lucid,cpu.mean:avg-5min,-30min": {
"job::/apcera::lucid": {
"times": [
1507065940,
1507066240,
1507066540,
1507066840,
1507067140,
1507067440
],
"values": [
8671.383333333333,
8781.683333333332,
8702.316666666668,
8497.866666666667,
8334.416666666666,
8411.133333333333
]
}
},
"job::/apcera::lucid,mem_used.mean:avg-5min,-30min": {
"job::/apcera::lucid": {
"times": [
1507065940,
1507066240,
1507066540,
1507066840,
1507067140,
1507067440
],
"values": [
1202858.6666666667,
1210094.9333333333,
1173572.2666666666,
1174596.2666666666,
1205111.4666666666,
1214327.4666666666
]
}
}
},
"errors": {}
}
Example 2 – The following example queries all jobs in the /apps/dev namespace for the mean CPU usage, averaged into 30-minute buckets over the past 60 minutes:
GET /metrics/jobs?metric=job::/apps/dev,cpu.mean:avg-30min,-60min HTTP/1.1
{
"metrics": {
"job::/apps/dev,cpu.mean:avg-30min,-60min": {
"job::/apps/dev::app": {
"times": [
1507565880,
1507567680
],
"values": [
0,
298761
]
},
"job::/apps/dev::app_2": {
"times": [
1507565880,
1507567680
],
"values": [
0,
366244
]
}
}
},
"errors": {}
}
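When a query targets a namespace, the response holds one MetricSeries per job. The Python sketch below picks the most recent value for each job from a response shaped like the one above. latest_values is a hypothetical helper, and the sample dictionary is copied from the example response.

```python
# Copied from the example response above.
RESPONSE = {
    "metrics": {
        "job::/apps/dev,cpu.mean:avg-30min,-60min": {
            "job::/apps/dev::app": {
                "times": [1507565880, 1507567680],
                "values": [0, 298761],
            },
            "job::/apps/dev::app_2": {
                "times": [1507565880, 1507567680],
                "values": [0, 366244],
            },
        }
    },
    "errors": {},
}


def latest_values(response, query):
    """Map each target in one query's result to the last value in its series."""
    series_by_target = response["metrics"][query]
    return {
        target: series["values"][-1]
        for target, series in series_by_target.items()
        if series["values"]  # skip targets with an empty series
    }


latest = latest_values(RESPONSE, "job::/apps/dev,cpu.mean:avg-30min,-60min")
print(latest)
```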
Job Namespace Metric Queries
The GET /metrics/namespaces endpoint lets you query the aggregated metrics for all jobs in a given namespace. Namespace metrics are pre-aggregated up to four namespace levels deep (/teams/eng/apps/tests, for example). If a query specifies a namespace that exceeds four levels, the GET /metrics/jobs query behavior is invoked instead.
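The four-level limit can be checked on the client before choosing an endpoint. In the Python sketch below, counting levels as the number of path segments in the namespace is an assumption; it matches the example, in which /teams/eng/apps/tests is four levels deep. The function names are hypothetical.

```python
MAX_PREAGGREGATED_DEPTH = 4


def namespace_depth(fqn):
    """Count the path segments in the namespace part of an FQN."""
    _resource_type, _sep, namespace = fqn.partition("::")
    return len([segment for segment in namespace.split("/") if segment])


def is_preaggregated(fqn):
    """Return True if the namespace is within the pre-aggregation limit."""
    return namespace_depth(fqn) <= MAX_PREAGGREGATED_DEPTH


print(namespace_depth("job::/teams/eng/apps/tests"))
print(is_preaggregated("job::/teams/eng/apps/tests/nightly"))
```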
A job namespace metric query contains the following values, in the listed order:
- FQN of a job namespace (job::/apcera/apps, for example) (required). Only job resource types can be queried. The FQN must not include a local resource name.
- Metric name to query (required), including any functions. Valid metric names are listed below, where the .sum, .mean, and .count fields contain the sum, average, and count of the metric for each time bucket in the response, respectively:
  - cpu.sum, cpu.mean, cpu.count – CPU usage (sum, mean, and count)
  - mem_total.sum, mem_total.mean, mem_total.count – Memory total (sum, mean, and count)
  - mem_used.sum, mem_used.mean, mem_used.count – Memory used (sum, mean, and count)
  - disk_total.sum, disk_total.mean, disk_total.count – Disk space total (sum, mean, and count)
  - disk_used.sum, disk_used.mean, disk_used.count – Disk space used (sum, mean, and count)
  - bandwidth_total.sum, bandwidth_total.mean, bandwidth_total.count – Bandwidth total (sum, mean, and count)
  - bandwidth_used.sum, bandwidth_used.mean, bandwidth_used.count – Bandwidth used (sum, mean, and count)
  - network_rx.sum, network_rx.mean, network_rx.count – Network received (sum, mean, and count)
  - network_tx.sum, network_tx.mean, network_tx.count – Network transferred (sum, mean, and count)
- From date/time (optional). See Specifying Metric Query Time Periods.
- Until date/time (optional). See Specifying Metric Query Time Periods.
Examples
Example 1 – The following example queries for the aggregate mean CPU usage (cpu.mean) and mean memory usage (mem_used.mean) on all jobs in the /apcera namespace, averaged into 5-minute buckets, over the last 30 minutes:
GET /metrics/namespaces?metric=job::/apcera,cpu.mean:avg-5min,-30min&metric=job::/apcera,mem_used.mean:avg-5min,-30min HTTP/1.1
{
"metrics": {
"job::/apcera,cpu.mean:avg-5min,-30min": {
"job::/apcera": {
"times": [
1507069860,
1507070160,
1507070460,
1507070760,
1507071060,
1507071360
],
"values": [
11875.281481466667,
12978.325925899997,
12755.0759259,
12830.333333399998,
12809.60740736667,
13481.055555433337
]
}
},
"job::/apcera,mem_used.mean:avg-5min,-30min": {
"job::/apcera": {
"times": [
1507069860,
1507070160,
1507070460,
1507070760,
1507071060,
1507071360
],
"values": [
3541348.5037036333,
3534893.5111111,
3546256.1185184335,
3550958.9333332004,
3553515.140740534,
3556230.6370369
]
}
}
},
"errors": {}
}
Example 2 – The following example queries for the sum of memory used (mem_used.sum) by all jobs in the /apcera namespace, averaged into 10-minute buckets (avg-10min), over the past hour (-1hr):
GET /metrics/namespaces?metric=job::/apcera,mem_used.sum:avg-10min,-1hr HTTP/1.1
{
"metrics": {
"job::/apcera,mem_used.sum:avg-10min,-1hr": {
"job::/apcera": {
"times": [
1507181240,
1507181840,
1507182440,
1507183040,
1507183640,
1507184240
],
"values": [
99187097.6,
100788838.4,
110801237.33333333,
99603660.8,
99705378.13333334,
99861174.23728813
]
}
}
},
"errors": {}
}
Instance Manager Metric Queries
The GET /metrics/instance_managers endpoint lets you query performance metrics for your cluster's Instance Managers (IMs). Each metric query parameter value consists of the following values, in the specified order:
- Host name of the IM to query, or * to query all IMs (required).
- Metric name, one of the following (required):
  - bandwidth_total
  - bandwidth_used
  - cpu
  - disk_total
  - disk_used
  - mem_total
  - mem_used
- From time/date (optional). See Specifying Metric Query Time Periods.
- Until time/date (optional). See Specifying Metric Query Time Periods.
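The ordering above can be sketched in Python as follows. build_im_metric_param is a hypothetical helper, not part of the Metrics API. Appending a down-sampling suffix such as max-1hr to the metric name mirrors the Instance Manager example query on this page.

```python
IM_METRICS = {
    "bandwidth_total", "bandwidth_used", "cpu",
    "disk_total", "disk_used", "mem_total", "mem_used",
}


def build_im_metric_param(host, metric, downsample=None, from_time=None, until_time=None):
    """Build one metric query string for the instance_managers endpoint."""
    if metric not in IM_METRICS:
        raise ValueError(f"unknown Instance Manager metric: {metric!r}")
    # Optional ":function-interval" suffix, e.g. "mem_used:max-1hr".
    metric_part = f"{metric}:{downsample}" if downsample else metric
    parts = [host, metric_part]  # host may be "*" to query all IMs
    if from_time is not None:
        parts.append(from_time)
        if until_time is not None:
            parts.append(until_time)
    return ",".join(parts)


print(build_im_metric_param("*", "mem_used", "max-1hr", "-2hr"))
```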
Examples
Example 1 – The following example queries the maximum memory used by each IM, in one-hour buckets, over the previous two hours:
GET /metrics/instance_managers?metric=*,mem_used:max-1hr,-2hr HTTP/1.1
{
"metrics": {
"*,mem_used:max-1hr,-2hr": {
"mycluster-53fbd4ac": {
"times": [
1507135600,
1507139200
],
"values": [
9546235904,
9546235904
]
},
"mycluster-7c5e756c": {
"times": [
1507135600,
1507139200
],
"values": [
0,
0
]
},
"mycluster-b16c73b1": {
"times": [
1507135600,
1507139200
],
"values": [
10099884032,
10099884032
]
}
}
},
"errors": {}
}