Logging and Debugging

Apcera Platform provides a rich set of logging facilities to help you troubleshoot and debug your deployed apps, services, and cluster components. You can also audit all actions taken within the system.

This section describes the various log types the system provides and includes examples for using them, including:

Job logs

A job log is a continuous live stream of time-ordered events output from apps and system components. You use logs to debug your deployed apps and to audit the actions taken within the system.

The system captures all logs written by a job to stdout/stderr. These logs are called job logs.

Job logs are streamed to a ring buffer that holds 1500 lines per job. You can retrieve job logs using APC or the web console.

For long-term storage, you can send job logs to a syslog endpoint by configuring a job log drain. You can use your own syslog-compatible log collector or a commercial log archive tool and analysis service such as Splunk or Papertrail.

The following screenshots show example job log messages for stdout and stderr:

screenshot

screenshot

Using job logs

For your apps, anything written to standard out (stdout) or standard error (stderr) is captured in the job logs. This means that you can log from anywhere in your application code using a simple output statement. The system keeps the last 1,500 lines of job logs available for inspection using the apc app logs command. You can also access the job logs using the web console.

By default, apc app logs retrieves 20 log lines. You can fetch up to 1,500 lines using the --lines option:

apc app logs --lines 200

All jobs have logs which you can tail, including capsules, services, service gateways, semantic pipelines, and stagers.

As with apps, you can view logs emitted by these other kinds of jobs using APC:

  • apc job logs <job-name>
  • apc capsule logs <capsule-name>
  • apc gateway logs <gateway-name>
  • apc pipeline logs <pipeline-name>

Even if the job is stopped or has errored and failed, you can still view the logs of that job. In such a case, the system will simply retrieve and return the log rather than tail (stream) it. Check the Apcera Support Portal for more information about errored jobs.

To view the logs for any job, use command apc job list to get the job name. Then run apc job logs <job-name> to get the log. If the job is stopped, the system will retrieve and return the log.

For additional information on troubleshooting application staging errors, see Stager degugging.

Tailing job logs

Apcera Platform lets you tail the job logs and stream real time job information. Similar to the tail -f Unix program, the system provides a realtime log tail that displays recent logs and leaves the session open for realtime logs to stream in.

By viewing a live stream of logs from your app or a job, you can gain insight into its behavior to troubleshoot and debug problems.

You can tail job logs using APC and the web console.

Using APC

To tail job logs using APC:

apc app logs <app-name>

Using the web console

You can also tail logs from the web console by clicking Tail logs at the job details screen for a particular job.

screenshot

Configuring job log drains

For long-term job log storage, you can attach a log drain to a job. Adding a log drain endpoint to an app lets you to configure log sinks to store syslog-formatted logs externally. You must configure the service or log server to be able to receive syslog packets from the system, and then add the URL (which contains the host and port) as a syslog drain. The system support TCP log drain endpoints.

To add a job log drain, use the following command:

apc drain add <drain-url> --app <app-name>

Where the <drain-url> is a syslog URL in the form of syslog://<hostname>:<port>.

For example:

apc drain add syslog://logs2.papertrailapp.com:26902 --app my-app

You can discover all the drains for your application by running apc drain list --app <app-name>.

Log drain messages are formatted according to RFC5424.

Job log drain examples

You can configure your apps to send log messages to a syslog service such as Splunk, Papertrail, or Sumo Logic.

System event logs

In addition to job log events, the system also provides a continuous live stream of time-ordered events output from system components.

All system components publish system events. Sytem events are broadcast to the same stream as job logs. System events are tagged as comming from one of the following "system" sources:

  • system for standard system events
  • system-error for fatal system events

System event logs provide you with additional detail when things go wrong, and informational messages when things are working as expected.

The following screenshots show system and system error log messages captured in the job log.

screenshot

screenshot

Check the Apcera Support Portal for more information about flapping jobs.

Audit logs

Any changes to the system that are made through any user-accessible endpoint (APC, Web Console, or API call) are recorded to a persistent audit log. You can query a cluster's audit log using the apc audit command, the Web Console, or by calling the GET /v1/audit API endpoint.

Each audit log item contains the following fields:

Field Description
Date/Time Time when the action was initiated (server UTC time)
Event Type A string that identifies the of event that generated the audit log item. See Audit log event types.
Local Name The local name of the object that generated the audit log item.
Namespace The namespace of the object that generated the audit log item.
Payload An optional JSON payload that contains additional information about the logged item. Requires --verbose or --json flag in APC . See Viewing audit payloads.
Resource type The resource type of the object that generated the audit log item.
User Principal name of user on whose behalf the change was requested.

You can query audit log using APC, the Web Console, or the /v1/audit API endpoint. You can filter your queries by FQN (resource type, namespace, and local name) of the audited item, event type, and date range. You can also paginate through query results. For example, the following example uses APC to query for job.update audit log items generated on October 11, 2016 for the job::/sandbox/admin::mycap resource:

apc audit --event-type=job.update --fqn=job::/sandbox/admin::mycap \
--start="Oct 11 2016" \
--end="Oct 12 2016"
╭──────────────────────────┬────────────┬─────────────────┬───────────────┬────────────────┬────────────╮
│ Date/Time                │ Event Type │ User            │ Resource Type │ Namespace      │ Local Name │
├──────────────────────────┼────────────┼─────────────────┼───────────────┼────────────────┼────────────┤
│ Oct 11, 2016 10:37am PDT │ job.update │ admin@apcera.me │ job           │ /sandbox/admin │ mycap      │
│ Oct 11, 2016 10:33am PDT │ job.update │ admin@apcera.me │ job           │ /sandbox/admin │ mycap      │
│ Oct 11, 2016 10:33am PDT │ job.update │ admin@apcera.me │ job           │ /sandbox/admin │ mycap      │
│ Oct 11, 2016 10:33am PDT │ job.update │ admin@apcera.me │ job           │ /sandbox/admin │ mycap      │
│ ...                      │     ...    │     ...         │      ...      │      ...       │  ...       │
╰──────────────────────────┴────────────┴─────────────────┴───────────────┴────────────────┴────────────╯
Load more results? [Y/n]:

If more results are available you are prompted to load them or quit (Load more results? [Y/n]).

You can accomplish the same thing using filters in the Web Console, as shown below. See Viewing audit logs for more information.

Alt text

Audit log item payloads

Some audit log items include a payload that provides additional details about the logged item. There are two types of audit payloads: policy access payloads and JSON Patch payloads. Policy access payload provide details about a policy access or deny event. JSON Patch payloads describe changes made to a job or other resource's JSON representation.

By default, audit payloads are not displayed in the output displayed by the apc audit command or in the Web Console. To display payloads in APC output add the --verbose flag to the apc audit command.

apc audit --event-type=network.leave --verbose
╭───────────────┬───────────────────────────────────────────────────────╮
│ Time:         │ Oct 31, 2016 2:41pm PDT                               │
│ Event Type:   │ network.leave                                         │
│ On Behalf of: │ admin@apcera.me                                       │
│ ResourceType: │ job                                                   │
│ Namespace:    │ /myapps                                               │
│ Local Name:   │ myapp-4                                               │
│ Payload:      │ {                                                     │
│               │   "patches": [                                        │
│               │     {                                                 │
│               │       "op": "remove",                                 │
│               │       "path": "network_ref/uuid",                     │
│               │       "value": "e9a76724-d5c3-4a51-9fb0-8b5e70a82018" │
│               │     }                                                 │
│               │   ]                                                   │
│               │ }                                                     │
╰───────────────┴───────────────────────────────────────────────────────╯

To view audit log payloads in the Web Console, click the audit log item to open a details dialog that displays all standard audit log item fields, as well as the JSON payload.

Alt text

You can alo pass the --json flag to apc audit to have the output displayed as JSON, for example:

apc audit --event-type=job.create --json | python -m json.tool
[
    {
        "audit_payload": {},
        "event_type": "job.create",
        "event_type_int": 1,
        "fqn": "job::/sandbox/admin::nats-client",
        "localname": "nats-client",
        "namespace": "/sandbox/admin",
        "on_behalf_of": "admin@apcera.me",
        "principal_name": "api_server@apcera.me",
        "resource_type": "job",
        "timestamp": 1477945475,
        "uuid": "09e09271-9fa8-11e6-a7e2-06bc450cb24a"
    },
    {
        "audit_payload": {},
        "event_type": "job.create",
        "event_type_int": 1,
        "fqn": "job::/sandbox/admin::nats-server-0.8.1",
        "localname": "nats-server-0.8.1",
        "namespace": "/sandbox/admin",
        "on_behalf_of": "admin@apcera.me",
        "principal_name": "api_server@apcera.me",
        "resource_type": "job",
        "timestamp": 1477945474,
        "uuid": "09d4e36d-9fa8-11e6-a7e0-06bc450cb24a"
    }
] 

JSON Patch audit payloads

Some audit log item payloads contain an array of JSON Patch objects that describes the changes made to the resource's JSON representation. Every JSON Patch object specifies an operation, the path to the modified JSON field, the field's new value. The Apcera audit log system supports the following JSON Patch operations:

  • "add" – Indicates a new field at the specified path and with the specified value was added to the resource's JSON representation. For example, the following JSON Patch object represents the addition of a network_ref/uuid field a job resource (not shown) when the job was joined to a virtual network:

      {
        "op": "add",
        "path": "network_ref/uuid",
        "value": "e9a76724-d5c3-4a51-9fb0-8b5e70a82018
      }
    
  • "remove" – Indicates a field at the specified path and with the specified value was removed from the resource's JSON representation. For example, the following payload represents the removal of the job/labels/mylabel field from a job resource (not shown):

      {
        "op": "remove",
        "path": "job/labels/mylabel",
        "value": "foobar"
      }
    
  • "replace"– Indicates that the value of the field at the specified path was changed from value to oldvalue. For example, the following payload represents the change to the number of requested instances for a job resource (not shown):

     {
       "op": "replace",
       "path": "job/num_instances",
       "value": 2,
       "oldvalue": 1
     }
    

For example, the following APC output shows the audit log item for a network.leave event that's generated when a job is removed from a virtual network. The payload object for this event indicates that the corresponding job's newtork_ref/uuid field (whose value was e9a76724-d5c3-4a51-9fb0-8b5e70a82018) was removed:

apc audit --event-type=network.leave --verbose
╭───────────────┬───────────────────────────────────────────────────────╮
│ Time:         │ Oct 31, 2016 2:41pm PDT                               │
│ Event Type:   │ network.leave                                         │
│ On Behalf of: │ admin@apcera.me                                       │
│ ResourceType: │ job                                                   │
│ Namespace:    │ /myapps                                               │
│ Local Name:   │ myapp-4                                               │
│ Payload:      │ {                                                     │
│               │   "patches": [                                        │
│               │     {                                                 │
│               │       "op": "remove",                                 │
│               │       "path": "network_ref/uuid",                     │
│               │       "value": "e9a76724-d5c3-4a51-9fb0-8b5e70a82018" │
│               │     }                                                 │
│               │   ]                                                   │
│               │ }                                                     │
╰───────────────┴───────────────────────────────────────────────────────╯

As another example, the job.update audit log item below corresponds to removing SSH access from the corresponding app (by running apc app update nats-client --remove-ssh, for example). The payload shows that the job object's job/ports/number, job/ports/optional, and job/tags/ssh fields were removed from the job definition, while the job/updated_at and job/version_id job fields were updated with new values.

╭───────────────┬───────────────────────────────────────────────────╮
│ Time:         │ Oct 31, 2016 2:49pm PDT                           │
│ Event Type:   │ job.update                                        │
│ On Behalf of: │ admin@apcera.me                                   │
│ ResourceType: │ job                                               │
│ Namespace:    │ /sandbox/admin                                    │
│ Local Name:   │ nats-client                                       │
│ Payload:      │ {                                                 │
│               │   "patches": [                                    │
│               │     {                                             │
│               │       "op": "remove",                             │
│               │       "path": "job/ports/number",                 │
│               │       "value": 222                                │
│               │     },                                            │
│               │     {                                             │
│               │       "op": "remove",                             │
│               │       "path": "job/ports/optional",               │
│               │       "value": true                               │
│               │     },                                            │
│               │     {                                             │
│               │       "op": "remove",                             │
│               │       "path": "job/tags/ssh",                     │
│               │       "value": "true"                             │
│               │     },                                            │
│               │     {                                             │
│               │       "op": "replace",                            │
│               │       "path": "job/updated_at",                   │
│               │       "value": "2016-10-31T21:49:31.855140279Z",  │
│               │       "oldvalue": "2016-10-31T20:24:35.09082864Z" │
│               │     },                                            │
│               │     {                                             │
│               │       "op": "replace",                            │
│               │       "path": "job/version_id",                   │
│               │       "value": 3,                                 │
│               │       "oldvalue": 2                               │
│               │     }                                             │
│               │   ]                                               │
│               │ }                                                 │
╰───────────────┴───────────────────────────────────────────────────╯

Policy access audit payloads

For policy.access.allowed and policy.access.denied events the payload is a JSON object that specifies the policy claim type and value of the policy engine's "ask" that generated the event. For example, the following APC command displays audit log items whose event type is policy.access.denied. The output indicates that a request to map a route (myapp.example-cluster.apcera-platform.io) to a job was denied by policy.

apc audit --event-type=policy.access.denied --verbose
╭───────────────┬────────────────────────────────────────────────╮
│ Time:         │ Oct 25, 2016 3:54pm PDT                        │
│ Event Type:   │ policy.access.denied                           │
│ On Behalf of: │                                                │
│ ResourceType: │ route                                          │
│ Namespace:    │ /http/io/apcera-platform/example-cluster/myapp │
│ Local Name:   │                                                │
│ Payload:      │ {                                              │
│               │   "claim": {                                   │
│               │     "claimtype": "permit",                     │
│               │     "claimvalue": "map"                        │
│               │   }                                            │
│               │ }                                              │
╰───────────────┴────────────────────────────────────────────────╯

In the Web Console, this same log event is displayed as follows:

Alt text

Audit log event types

Each audit log item has an event type field that specifies the type of event that caused audit log item to be generated. For example, when you create a new job a log item is generated with a job.create event type. The following table lists the possible audit log event types. Each event type is a string composed of a resource type ("job", for example) and an action ("update") separated by a dot. For example, "job.update" is the event type generated when a job is updated.

As indicated below, some type of audit log items have a payload field that further describes the event. A payload is either an array of JSON Patch objects that describes changes to a resource, or a policy access payload that describes the policy claim and value.

Event type Description Audit payload contents (if any)
auth.allow Authorization to access the cluster was allowed.
auth.deny Authorization to access the cluster was denied.
auth.issue An authorization token was issued to an authenticated user.
drain.add A log drain was added to a job. The URL of the added log drain.
drain.delete A log drain was deleted from a job. The URL of the deleted log drain.
gateway.demote A job was demoted from a service gateway to a regular application.
gateway.promote The specified job was promoted to a service gateway.
job.addbinding A service binding was added to the specified job. Values of the newly created Binding object.
job.addroute A route was added to a port exposed the job. New values of the updated Route object.
job.connect An SSH connection (via APC) was established with the specified job.
job.create The specified job was created.
job.delete The specified job was deleted.
job.instancesnapshot A snapshot was created from the specified job.
job.instancestop A job instance was stopped.
job.removebinding A service binding was removed from the specified job. Fields removed from the Binding object and their values.
job.removeroute A route was removed from the specified job. Fields removed of the updated Route object and their values.
job.start The specified job was started.
job.stop The specified job was stopped.
job.update The specified job was updated. An array of JSON Patch objects that describe the changes made to the job.
network.create The specified network was created.
network.delete The specified network was deleted.
network.join The specified job was joined to a network. A JSON Patch object that specifies the UUID of the network the specified job joined.
network.leave The specified job was removed from a network. A JSON Patch object that specifies the UUID of the network from which the specified job was removed.
network.show The specified network's information was requested.
pkg.create The specified package was created.
pkg.delete The specified package was deleted.
pkg.update The specified package was updated.
policy.access.allowed A change to the system was allowed by policy. By default, access audit events are not logged, and the behavior is controlled by policy. See Controlling access audit logging. A policy access audit payload that specifies the claim type and value that allowed the action.
policy.access.denied A change to the system was denied by policy. By default, access audit events are not logged, and the behavior is controlled by policy. See Controlling access audit logging. A policy access audit payload that specifies the claim type and value that denied the action.
policy.add The specified policy document was added to the cluster. A JSON Patch object that lists the FQNs that were added to the new policy document and the version of the policy document (0 for new policy documents). Note that the payload does not contain the policy text additions.
policy.delete The specified policy document was deleted from the cluster. A JSON Patch object that lists the FQNs that the deleted policy document contained and the version of the policy document that was deleted.
policy.update The specified policy document was updated. A JSON Patch object that lists the FQNs that were removed or updated from the policy document and the version of the policy document that was changed. Note that the payload does not contain the policy text changes.
provider.create The specified provider was created.
provider.delete The specified provider was deleted.
service.bind The specified service binding was created. The event's fqn field is the FQN of the binding.
service.create The specified service was created.
service.delete The specified service was deleted.
service.unbind The specified service binding was deleted. The event's fqn field specifies the binding object's FQN.

Controlling access audit logging

Access audit logging is controlled by the log claim type you define on resources. This claim type can take as values none, allow (only log policy.access.allowed events), deny (only log policy.access.denied events), or allow, deny to log both types of events. For example, the following policy enables both policy.access.allowed and policy.access.denied audit logging for the job::/apcera::continuum-guide resource:

job::/apcera::continuum-guide {
  { log allow, deny }
}

If no log claims are found when a resource access request is performed, then policy.access.denied are logged. Note that if you specify log none and log allow, deny for the same resource in different policy documents, log none takes precedence. For example, suppose a policy document enables logging of policy.access.allowed and policy.access.denied on job::/dev::app:

job::/dev::app { 
    { log allow, deny }
}

If another policy document sets the log level to none on the same resource, then logging is disabled for that resource:

job::/dev::app { 
    { log done }
}

However, if log none is specified on a parent namespace, it does not take precedence over log levels specified on child resources in that namespace. For example, consider the the following policy:

job::/dev { 
    { log none }
} 

This will disable access audit logging on job::/dev::abc or job::/apcera::xyz, but access allow and deny items will continue to be logged for job::/dev::app, per the policy shown previously.

As mentioned previously, the default system behavior is to log access denied events for a resource, if a log level isn't specified. Policy installed with an Apcera cluster uses policy variables to set the log claim type to none for all resources. Specifically, the systemDataTableAccessAuditLogging policy document defines a policy variable named AccessAuditLogging that contains two fields: resource and accessEvents. The first field specifies an auditable resource, and the second is an array of log levels (none, allow, deny) for that resource. The default systemDataTableAccessAuditLogging policy document is shown below. Note that the accessEvents field is set to [none] for each resource.

Alt text

The following change to the AccessAuditLogging table enables access audit logging of policy.access.denied events for all job resource types, and both policy.access.denied and policy.access.allowed for all package resource:

Alt text

The audit access levels specified in the AccessAuditLogging policy variable data are applied to the system by the clusterPermissions policy document (see Referencing policy variables):

on all::/ {
  if (query->target fqnMatch PV->DefaultPermissions.resource && permit == PV->DefaultPermissions.from) {
     permit PV->DefaultPermissions.to
  }
  if (query->target fqnMatch PV->AccessAuditLogging.resource ) {
     log PV->AccessAuditLogging.accessEvents
  }
 }

Also see Granting Policy Permissions.

Component logs

For support purposes you may be directed to collect system component logs.

Logs for system components are written to their respective directories under /var/log. For example, to tail the logs for an Instance Manager component:

tail -F /var/log/instance_manager/current

The -F ensures you follow the file descriptor, so tailing will continue even as the log file is rotated.

To pull the component logs, see troubleshooting deployments.