Monitoring Apcera Deployments

To monitor the jobs in your cluster (your code), you use the web console and the APC client, and perhaps third-party tools such as AppDynamics or New Relic to instrument your applications.

To monitor the system (our code), Apcera supports both internal and external monitoring using third-party tools. Internal monitoring covers the components of the Apcera system; external monitoring covers the hosts running key components of the system.

Internal monitoring

Apcera provides internal monitoring using Zabbix. Each Apcera component (job manager, package manager, instance manager, etc.) is embedded with a Zabbix agent that communicates with the Zabbix Server. The Zabbix Server runs on the monitoring host that is configured for your cluster. Refer to the sizing guidelines for a complete list of components.

Zabbix comes with predefined OS monitoring, such as CPU and disk usage, which we leverage out of the box without customization. Zabbix also provides templates for monitoring applications, which we customize specifically for monitoring Apcera components. The monitoring rules we employ for these purposes are described below.

To implement Zabbix monitoring for your cluster, you must specify the monitoring parameters in the cluster configuration file. See configuring monitoring for examples. If you want to receive monitoring alerts, we support PagerDuty and email; other alert mechanisms can be added to the configuration on request.

When you deploy a cluster, each configuration parameter supported in cluster.conf is applied, including all group memberships and alert definitions. For Zabbix, this means that entities you change in the Zabbix UI that are also referenced in the monitoring section of cluster.conf, such as email addresses and passwords, are reset to their cluster.conf values on the next deploy.

External monitoring

Zabbix does not monitor job URLs or endpoints. Apcera recommends that you monitor these components externally. For external monitoring, you can use a third-party tool such as Pingdom or Monitis to monitor the following cluster components:

  • Web Console

  • API Server

  • Auth Server

  • Monitoring Server (Zabbix)

If Apcera is managing your cluster, we will use one of the above external monitoring tools to monitor the listed endpoints. If you are managing the cluster, you will need to set up your own external monitoring facility. Unlike internal monitoring, external monitoring is not part of the cluster configuration.
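If you are setting up your own external checks, even a simple periodic HTTP probe of each endpoint catches outright outages. The following is a minimal sketch, assuming hypothetical endpoint hostnames and that curl is available on the probing host:

    #!/bin/sh
    # Probe each externally monitored endpoint; hostnames are placeholders.
    for url in \
        https://console.clustername.example.com/ \
        https://api.clustername.example.com/ \
        https://auth.clustername.example.com/ \
        https://monitoring.clustername.example.com/; do
      code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
      echo "$url -> HTTP $code"
    done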

Google Auth monitoring

You can use Zabbix to monitor Google Auth (the default identity provider). See this topic for details.

Configuring cluster monitoring

To configure internal monitoring (Zabbix), you populate the apzabbix section of the cluster.conf.erb (Terraform) or cluster.conf file. The following examples demonstrate both types of configuration.

Configure only one of these files: if you are using Terraform, configure monitoring in cluster.conf.erb; otherwise, configure monitoring in cluster.conf.

Terraform example

  "apzabbix": {
    "db": {
# TERRAFORM OUTPUT: monitoring-database-address
      "hostport": "<%= `terraform output monitoring-database-address`.chomp %>:5432",
      "master_user": "apcera_ops",
# TERRAFORM OUTPUT: monitoring-database-master-password
      "master_pass": "<%= `terraform output monitoring-database-master-password`.chomp %>",
      "zdb_user": "zabbix",
      "zdb_pass": "YOUR_PASSWORD_HERE"
    },
    "users": {
      "guest": { "user": "monitoring", "pass": "YOUR_PASSWORD_HERE" },
      "admin": { "user": "Admin", "pass": "YOUR_PASSWORD_HERE", "delete": false }
    },
    "web_hostnames": ["monitoring.clustername.example.com"]

    # To enable alerts via PagerDuty, create a service in PagerDuty of type 'Zabbix'
    # and insert the API key here
    "pagerduty": {
      "key": "API-ACCESS-TOKEN-GOES-HERE"
    },
    "email": {
      "sendto": "target@example.com",
      "smtp_server": "localhost",
      "smtp_helo": "localhost",
      "smtp_email": "from@example.com"
    }
  }
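Before deploying, you can confirm that the Terraform outputs referenced by the ERB template exist and are non-empty; the value shown here is illustrative:

    $ terraform output monitoring-database-address
    abcdefghijklm.nopqrstuvwzyx.us-west-2.rds.amazonaws.com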

Cluster.conf example

Note that if you are using Terraform, you do not edit this file; see the Terraform example above instead.

  "apzabbix": {
    "db": {
      "hostport": "abcdefghijklm.nopqrstuvwzyx.us-west-2.rds.amazonaws.com:5432",
      "master_user": "apcera_ops",
      "master_pass": "YOUR_PASSWORD_HERE",
      "zdb_user": "zabbix",
      "zdb_pass": "YOUR_PASSWORD_HERE"
    },
    "users": {
      "guest": { "user": "monitoring", "pass": "YOUR_PASSWORD_HERE" },
      "admin": { "user": "admin", "pass": "YOUR_PASSWORD_HERE", "delete": false }
    },
    "web_hostnames": ["clustermon.clustername.tld"], 
    "PF_pagerduty": {
      "key": "API-ACCESS-TOKEN-GOES-HERE"
    },
    "PF_email": {
      "sendto": "monitoring-events@example.com",
      "smtp_server": "localhost",
      "smtp_helo": "localhost",
      "smtp_email": "zabbix@example.com"
    }
  }

Monitoring configuration parameters

The following sections describe each of the monitoring configuration parameters that you can set.

db.hostport

Zabbix monitoring requires a Postgres database backend, which can be installed on the same host as the zabbix-server (internal) or hosted externally, for example on Amazon RDS. Whether the Zabbix database is internal or external is determined by the Postgres connection string provided in cluster.conf for the chef.apzabbix.db.hostport parameter: if the value is localhost:5432, the Zabbix database is internal; otherwise, as shown in the examples above, an external database such as RDS is used.

  • For Terraform, the hostport is generated for you.
  • For cluster.conf, populate this value with the host name and port of the monitoring database.
  • For AWS, get the value from the Outputs tab > MonitoringPostgresEndpoint resource.
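For example, the same parameter selects either mode (the RDS endpoint shown is the one from the cluster.conf example above):

    # Internal: Postgres runs on the monitoring host itself
    "hostport": "localhost:5432"

    # External: Postgres is hosted on an RDS instance
    "hostport": "abcdefghijklm.nopqrstuvwzyx.us-west-2.rds.amazonaws.com:5432"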

db.master_user

The DB administrator name for the Postgres monitoring database. The default is "apcera_ops."

db.master_pass

  • For Terraform, the master_pass is auto-populated from the terraform.tfvars file, which you edit.
  • For cluster.conf, the value is a password you set here.
  • For AWS, get the value from the base.json file for the PostgreSQL Monitoring DB.

For the password, you must use a string that does not require URL escaping (that is, one that does not include the @, /, or \ characters).
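One simple way to produce a URL-safe password is to generate a hex string, which never contains those characters. For example, with the openssl CLI (the output shown is illustrative):

    $ openssl rand -hex 16
    9f3c2b7a1d4e8f60a5c3e7d2b9f14a08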

db.zdb_user

The monitoring user name for the Postgres monitoring database. The default is "zabbix."

db.zdb_pass

The db.zdb_pass is a new password that you set here. Use a string that does not require URL escaping (see the note above).

users.guest

The users.guest takes a user name and URL-safe password that you enter here. The default user name is "monitoring."

users.admin

The users.admin parameters include a user name and URL-safe password that you enter here. The default user name is "admin."

web_hostnames

The web_hostnames values are used by nginx to define the virtual host server_name values. The URL you provide is the URL of the Zabbix monitoring console.
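As an illustrative sketch only (not the configuration Apcera actually generates), each web_hostnames entry becomes a server_name in an nginx virtual host:

    # Hypothetical nginx virtual host for the Zabbix console
    server {
        listen 80;
        server_name monitoring.clustername.example.com;
    }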

pagerduty.key

The pagerduty.key is output by PagerDuty when you create a PagerDuty service of type 'Zabbix'. This block is an optional monitoring configuration; without it, no alerts are sent via PagerDuty.

email

The email block is used to configure email alerts from Zabbix. Email alerts are optional and can be configured in lieu of or in addition to PagerDuty alerts. Macros are documented in the Zabbix manual. Customization is possible on request.

The email > "sendto" field supports a single email address. You can add more using the Terraform console.

The subject of the email alerts is {TRIGGER.STATUS}: <cluster-name> - {HOST.NAME1} - {TRIGGER.NAME}. The format of the email message is as follows:

    name:{TRIGGER.NAME}
    id:{TRIGGER.ID}
    status:{TRIGGER.STATUS}
    hostname:{HOSTNAME}
    ip:{IPADDRESS}
    value:{TRIGGER.VALUE}
    event_id:{EVENT.ID}
    severity:{TRIGGER.SEVERITY}
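For example, with a cluster named california, a process-down alert might render as follows. {TRIGGER.STATUS} resolves to PROBLEM or OK; the host and trigger names here are illustrative, and the body is abridged:

    PROBLEM: california - component-db-1 - Postgres process running

    name:Postgres process running
    status:PROBLEM
    hostname:component-db-1
    severity:High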

Logging in to Zabbix

To log in to Zabbix, use the following URL:

http://<subdomain>.clustermon.<cluster-name>.<tld>

For example, if your cluster name is california and the cluster domain is acme.com, the monitoring URL is:

http://california.clustermon.acme.com/

The login credentials are set in either the cluster.conf.erb or cluster.conf file. You can log in with the guest or admin account, depending on your role: users responsible for day-to-day monitoring typically log in as guests, while administrators who configure monitoring rules log in as admins.

Application monitoring rules

We define monitoring templates for each of the following Applications:

  • API Server
  • Auth Server
  • Component Database
  • Cluster Monitor
  • Graphite Server
  • Health Manager
  • IP Manager
  • Instance Manager
  • Job Manager
  • Metrics Manager
  • NATS Server
  • Package Manager
  • Redis Server
  • Riak Server
  • Router
  • Splunk Server
  • Splunk Forwarder
  • Stagehand
  • TCP Router

For each template:

  • The ItemName is the human-readable description that is reported in alerts.

  • The ItemKey maps to a Zabbix Item Key. Any standard item can be used, such as proc.num[processname] or net.tcp.listen[port]. Apcera custom item keys are given names with a prefix of cntm, such as cntm.component.api.healthz, and must be defined in the target system's Zabbix configuration, typically by a matching file in
    /etc/zabbix/zabbix_agentd.conf.d created by the continuum chef recipe for the host (see the sketch after this list). If the ItemKey is empty, no Item is added for this line.

  • The Interval is the number of seconds between checks of this item, typically 30. It can be omitted if the ItemKey is empty.

  • The DataType is the Zabbix type of the data returned by the Item Key: decimal, octal, hexadecimal, or boolean. It can be omitted if the ItemKey is empty.

  • The TriggerCondition is the Zabbix trigger expression, for example '{$TMPL:$ITEM.last(0)}<1'. In the trigger condition, $TMPL and $ITEM are replaced with the template name and Item key before insertion into the Zabbix configuration. If the TriggerCondition is empty, no triggers are added for this Item; instead the item is used as part of a composite trigger that pulls data from multiple items, configured on another line.

  • The TriggerPriority is the Zabbix priority level of the Trigger. It is optional and defaults to High.
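Custom cntm.* item keys are defined on the agent side via Zabbix UserParameter entries. The following is a hypothetical sketch of what such a definition could look like; the file name, health-check URL, and port are assumptions for illustration, not Apcera's actual recipe output:

    # /etc/zabbix/zabbix_agentd.conf.d/cntm_api.conf (hypothetical)
    # Return 1 if the API server health endpoint answers HTTP 200, otherwise 0.
    UserParameter=cntm.component.api.healthz,curl -s -o /dev/null -w '%{http_code}' --max-time 5 http://localhost:8080/healthz | grep -q '^200$' && echo 1 || echo 0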

Here is the template structure:

def continuum_items_list
  [
    # Each entry lists the fields described above:
    # [ Template/Application,
    #   Item Name,
    #   Item Key,
    #   Interval,
    #   DataType,
    #   TriggerCondition,
    #   TriggerPriority (optional, defaults to High) ],

API Server tests

Alert if zero processes exist.

    [ 'API Server',
      'API Server process running',
      'proc.num[api_server]',
      30,
      Chef::Zabbix::API::DataType.decimal,
      '{$TMPL:$ITEM.last(0)}=0'],

API must reply to health query.

    [ 'API Server',
      'API Server responds to health query',
      'cntm.component.api.healthz',
      30,
      Chef::Zabbix::API::DataType.boolean,
      '{$TMPL:$ITEM.last(0)}<1'],

Auth Server tests

Alert if zero processes exist.

    ['Auth Server',
     'Auth Server process running',
     'proc.num[auth_server]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}=0'],

Auth server must reply to health query.

    ['Auth Server',
     'Auth Server responds to health query',
     'cntm.component.auth.healthz',
     30,
     Chef::Zabbix::API::DataType.boolean,
     '{$TMPL:$ITEM.last(0)}<1'],

Auth server must be able to connect to Google OAuth2 servers.

    ['Auth Server',
     'Auth Server can refresh Google OAuth tokens',
     'cntm.component.auth.oauth2-refresh',
     60,
     Chef::Zabbix::API::DataType.boolean,
     '{$TMPL:$ITEM.last(0)}#0',
     Chef::Zabbix::API::TriggerPriority.information],

Cluster Monitor test

Alert if zero processes exist.

    ['Cluster Monitor',
     'Cluster Monitor process running',
     'proc.num[cluster_monitor]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}=0'],

Component Database tests

Postgres database master process must be running.

    ['Component Database',
     'Postgres process running',
     'proc.num[postgres,postgres,,bin/postgres]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}=0'],

Postgres database accepting connections.

    ['Component Database',
     'Postgres database accepting connections',
     'cntm.component.database.psql.running',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}#0'],

Postgres database connection counts of various states.

    ['Component Database',
     'Postgres database max connections',
     'cntm.component.database.psql.server_maxcon',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['Component Database',
     'Postgres database active connections',
     'cntm.component.database.psql.active_connections',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['Component Database',
     'Postgres database idle connections',
     'cntm.component.database.psql.idle_connections',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['Component Database',
     'Postgres database idle(in transaction) connections',
     'cntm.component.database.psql.idle_tx_connections',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['Component Database',
     'Postgres database locked connections',
     'cntm.component.database.psql.locks_waiting',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],

Postgres database connections nearing configured max.

    ['Component Database',
     'Postgres database at 80% of connection limit',
     '',
     '',
     '',
     '{$TMPL:cntm.component.database.psql.active_connections.avg(5m)}/{$TMPL:cntm.component.database.psql.server_maxcon.last(0)}>0.8'],

Continuum All tests (tests for all hosts)

Orchestrator-agent must be running.

    ['Continuum All',
     'Orchestrator Agent process running',
     'proc.num[orchestrator-agent]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}=0'],

Process sshd must be running.

    ['Continuum All',
     'SSH daemon running',
     'proc.num[sshd,root,,bin/sshd]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}=0'],

No disks should be mounted read-only.

    ['Continuum All',
     'Disks mounted read-only',
     'cntm.all-servers.disk.read-only',
     60,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}#0'],

Graphite Server tests

The graphite-web process must be running, and exactly one nginx master process must be running.

    ['Graphite Server',
     'Graphite-web process running',
     'proc.num[graphite-web]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}=0'],

    ['Graphite Server',
     'Graphite nginx process running',
     'proc.num[nginx,root,all,master process]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}#1'],

Nginx status collection.

    ['Graphite Server',
     'Graphite nginx current active connections',
     'cntm.component.graphite.active',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],

Connections being read from.

    ['Graphite Server',
     'Graphite nginx connections being read',
     'cntm.component.graphite.reading',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],

Connections being written to.

    ['Graphite Server',
     'Graphite nginx connections being written',
     'cntm.component.graphite.writing',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],

Connections sitting idle.

    ['Graphite Server',
     'Graphite nginx current idle connections',
     'cntm.component.graphite.waiting',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],

Total accepted connections.

    ['Graphite Server',
     'Graphite nginx total accepted connections',
     'cntm.component.graphite.accepted',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.diff(0)}=0'],

Total handled connections.

    ['Graphite Server',
     'Graphite nginx total handled connections',
     'cntm.component.graphite.handled',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],

Compare accepted to handled; after initial startup, alarm if they differ by more than 5%. (This trigger is currently commented out.)

    # ['Graphite Server',
    #  'Graphite nginx accepted vs handled connection ratio',
    #  '',
    #  30,
    #  Chef::Zabbix::API::DataType.decimal,
    #  '({$TMPL:cntm.component.graphite.accepted.last(0)} > 100) & ({$TMPL:cntm.component.graphite.handled.last(0)} / {$TMPL:cntm.component.graphite.accepted.last(0)} < 0.95)'],

Total client requests.

    ['Graphite Server',
     'Graphite nginx total client requests',
     'cntm.component.graphite.requests',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.diff(0)}=0'],

Health Manager tests

Alert if zero processes exist.

    ['Health Manager',
     'Health Manager process running',
     'proc.num[health_manager]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}=0'],

IP Manager tests

Alert if zero processes exist.

    ['IP Manager',
     'IP Manager process running',
     'proc.num[ip_manager]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}=0'],

Job Manager test

Alert if zero processes exist.

    ['Job Manager',
     'Job Manager process running',
     'proc.num[job_manager]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}=0'],

Instance Manager tests

Alert if no IM process exists, or if more than one distinct IM process exists for more than 5 minutes. (The IM is a singleton per box.)

    ['Instance Manager',
     'Instance Manager count of processes running',
     'cntm.component.instance.im-count',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '({$TMPL:$ITEM.last(0)}=0) | ({$TMPL:$ITEM.min(300)}>1)'],

IM must reply to health query.

    ['Instance Manager',
     'Instance Manager responds to health query',
     'cntm.component.instance.healthz',
     30,
     Chef::Zabbix::API::DataType.boolean,
     '{$TMPL:$ITEM.last(0)}<1'],

Metrics Manager test

Alert if zero processes exist.

    ['Metrics Manager',
     'Metrics Manager process running',
     'proc.num[metrics_manager]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}=0'],

NATS Server tests

Alert if zero processes exist.

    ['NATS Server',
     'NATS Server process running',
     'proc.num[gnatsd]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}=0'],

Issue PING command to NATS server.

    ['NATS Server',
     'NATS Server responds to PING',
     'cntm.component.nats.ping',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}#1'],

Collect NATS stats.

    ['NATS Server',
     'NATS Server active connections',
     'cntm.component.nats.connections',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['NATS Server',
     'NATS Server max connections',
     'cntm.component.nats.max_connections',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['NATS Server',
     'NATS Server messages received',
     'cntm.component.nats.in_msgs',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['NATS Server',
     'NATS Server messages sent',
     'cntm.component.nats.out_msgs',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['NATS Server',
     'NATS Server bytes received',
     'cntm.component.nats.in_bytes',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['NATS Server',
     'NATS Server bytes sent',
     'cntm.component.nats.out_bytes',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['NATS Server',
     'NATS Server memory used',
     'cntm.component.nats.mem',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],

NATS connections nearing configured max.

    ['NATS Server',
     'NATS Server at 80% of connection limit',
     '',
     '',
     '',
     '{$TMPL:cntm.component.nats.connections.avg(5m)}/{$TMPL:cntm.component.nats.max_connections.last(0)}>0.8'],

Package Manager tests

Alert if zero processes exist.

    ['Package Manager',
     'Package Manager process running',
     'proc.num[package_manager]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}=0'],

PM must reply to health query.

    ['Package Manager',
     'Package Manager responds to health query',
     'cntm.component.package.healthz',
     30,
     Chef::Zabbix::API::DataType.boolean,
     '{$TMPL:$ITEM.last(0)}<1'],

Riak Server tests

The Riak server must respond to ping.

    ['Riak Server',
     'Riak Server ping',
     'cntm.component.riak.ping',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)} # 0'],

Riak server read/write test.

    ['Riak Server',
     'Riak Server read/write test',
     'cntm.component.riak.test',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)} # 0'],

Riak server ring-status check.

    ['Riak Server',
     'Riak Server ring status',
     'cntm.component.riak.ring-status',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)} # 0'],

The Riak-CS server must respond to ping.

    ['Riak Server',
     'Riak-CS Server ping',
     'cntm.component.riak-cs.ping',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)} # 0'],

Redis Server tests

Must be exactly one redis-server process running.

    ['Redis Server',
     'Redis Server process running',
     'proc.num[redis-server]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}#1'],

Redis statistics and triggers.

    ['Redis Server',
     'Redis server uptime',
     'cntm.component.redis.uptime_in_seconds',
     300,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['Redis Server',
     'Redis connected clients',
     'cntm.component.redis.connected_clients',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)} > 500'],
    ['Redis Server',
     'Redis blocked clients',
     'cntm.component.redis.blocked_clients',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)} > 5'],
    ['Redis Server',
     'Redis used memory',
     'cntm.component.redis.used_memory',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['Redis Server',
     'Redis RDB changes since last save',
     'cntm.component.redis.rdb_changes_since_last_save',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['Redis Server',
     'Redis operation load',
     'cntm.component.redis.instantaneous_ops_per_sec',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['Redis Server',
     'Redis AOF rewrite status',
     'cntm.component.redis.aof_last_bgrewrite_status',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)} # 0'],

Redis per-keyspace stats.

    ['Redis Server',
     'Redis Keyspace db0 keys',
     'cntm.component.redis.db0[keys]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['Redis Server',
     'Redis Keyspace db0 expires',
     'cntm.component.redis.db0[expires]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],
    ['Redis Server',
     'Redis Keyspace db0 average TTL',
     'cntm.component.redis.db0[avg_ttl]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],

Router (HTTP) tests

Must be exactly one nginx process running with 'master process' in its command line.

    ['Router',
     'Router nginx master process running',
     'proc.num[nginx,root,all,master process]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}#1'],

Nginx status collection.

    ['Router',
     'Router nginx current active connections',
     'cntm.component.router.active',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],

Connections being read from.

    ['Router',
     'Router nginx connections being read',
     'cntm.component.router.reading',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],

Connections being written to.

    ['Router',
     'Router nginx connections being written',
     'cntm.component.router.writing',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],

Connections sitting idle.

    ['Router',
     'Router nginx current idle connections',
     'cntm.component.router.waiting',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],

Total accepted connections.

    ['Router',
     'Router nginx total accepted connections',
     'cntm.component.router.accepted',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.diff(0)}=0'],

Total handled connections.

    ['Router',
     'Router nginx total handled connections',
     'cntm.component.router.handled',
     30,
     Chef::Zabbix::API::DataType.decimal,
     ''],

Total client requests.

    ['Router',
     'Router nginx total client requests',
     'cntm.component.router.requests',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.diff(0)}=0'],

Splunk Forwarder test

The splunkd process must be running.

    ['Splunk Forwarder',
     'Splunk daemon status',
     'proc.num[splunkd]',
     300,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}=0'],

Stagehand tests

There are currently no tests for Stagehand, as there is no long-running process to monitor.

TCP Router test

Alert if zero processes exist.

    ['TCP Router',
     'TCP Router process running',
     'proc.num[tcp_router]',
     30,
     Chef::Zabbix::API::DataType.decimal,
     '{$TMPL:$ITEM.last(0)}=0']
  ]

Addendum

Items in this list are used to override the default Zabbix triggers (for example, those from the Template OS Linux template). Each override entry contains the following fields:

  • TriggerName - the name of the existing trigger to find; must be an exact match

  • OrigCondition - the original Zabbix trigger condition to override; must be an exact match

  • NewCondition - the new Zabbix trigger expression

  • Status - set to Chef::Zabbix::API::TriggerStatus::{active,disabled}

  • Priority - set to Chef::Zabbix::API::TriggerPriority::{information,warning,average,high,disaster}

  • Comment - the comment to insert on the new trigger; it should note that the trigger was overridden via the Chef configuration

def continuum_trigger_overrides
  [
    # [TriggerName,
    #  OrigCondition,
    #  NewCondition,
    #  Status,
    #  Priority,
    #  Comment]

    # Don't alert on swap space if no swap space is configured!
    # also don't alert on swap if there's plenty of system/cached/buffered memory left
    ['Lack of free swap space on {HOST.NAME}',
     '{Template OS Linux:system.swap.size[,pfree].last(0)}<50',
     '{Template OS Linux:system.swap.size[,total].last(0)}>0&{Template OS Linux:vm.memory.size[available].last(0)}<20M&{Template OS Linux:system.swap.size[,pfree].last(0)}<50',
     Chef::Zabbix::API::TriggerStatus.active,
     Chef::Zabbix::API::TriggerPriority.warning,
     'It probably means that the systems requires more physical memory. (Expression Overridden by Apcera via Chef)'],

    # Don't alert on swap space if no swap space is configured!
    # also don't alert on swap if there's plenty of system/cached/buffered memory left
    ['Lack of free swap space on {HOST.NAME}',
     '{Template OS Linux:system.swap.size[,total].last(0)}>0&{Template OS Linux:system.swap.size[,pfree].last(0)}<50',
     '{Template OS Linux:system.swap.size[,total].last(0)}>0&{Template OS Linux:vm.memory.size[available].last(0)}<20M&{Template OS Linux:system.swap.size[,pfree].last(0)}<50',
     Chef::Zabbix::API::TriggerStatus.active,
     Chef::Zabbix::API::TriggerPriority.warning,
     'It probably means that the systems requires more physical memory. (Expression Overridden by Apcera via Chef)'],

    # System reboot should be at severity warning not information,
    # as we're disabling pagerduty alerts below warning
    ['{HOST.NAME} has just been restarted',
     '{Template OS Linux:system.uptime.change(0)}<0',
     '{Template OS Linux:system.uptime.change(0)} < 0',
     Chef::Zabbix::API::TriggerStatus.active,
     Chef::Zabbix::API::TriggerPriority.warning,
     'System rebooted. (Severity Overridden by Apcera via Chef)'],

    # Override default per-system process limit of 300.
    # Alert at 1000 processes (minimum over 5 minutes), clear when avg processes drops below 900
    ['Too many processes on {HOST.NAME}',
     '{Template OS Linux:proc.num[].avg(5m)}>300',
     '({TRIGGER.VALUE}=0 & {Template OS Linux:proc.num[].min(5m)}>1000) | ({TRIGGER.VALUE}=1 & {Template OS Linux:proc.num[].avg(5m)}>900)',
     Chef::Zabbix::API::TriggerStatus.active,
     Chef::Zabbix::API::TriggerPriority.warning,
     'Too many processes. (Trigger condition overridden by Apcera via Chef)'],

    # Override default I/O overload threshold of 20% (average over 5 minutes).
    # Alert at 20% IOwait continuously for 5 minutes, clear when avg IOwait drops below 15%
    ['Disk I/O is overloaded on {HOST.NAME}',
     '{Template OS Linux:system.cpu.util[,iowait].avg(5m)}>20',
     '({TRIGGER.VALUE}=0 & {Template OS Linux:system.cpu.util[,iowait].min(5m)}>20) | ({TRIGGER.VALUE}=1 & {Template OS Linux:system.cpu.util[,iowait].avg(5m)}>15)',
     Chef::Zabbix::API::TriggerStatus.active,
     Chef::Zabbix::API::TriggerPriority.warning,
     'OS spends significant time waiting for I/O (input/output) operations. It could be indicator of performance issues with storage system. (Trigger condition overridden by Apcera via Chef)'],
  ]