Working with Rolling Restarts

To update a job with a new or updated package, or to change certain properties of the job, the Apcera platform by default stops all of the running instances of the job at once and then restarts them all at once with the changes. This causes downtime and loss of service. As of the 3.2.0 release, you can configure jobs so that updates are performed as a rolling restart, in which the Apcera platform ensures that a job or service remains operational while the updates are applied.

Overview

With the rolling restart feature, the overall goal is maximum availability and no loss of service during a rollout or configuration change. To achieve this, the rolling restart feature follows a make-before-break strategy: a new instance is started and reaches the RUNNING state before an old, outdated instance is stopped.

Note that stop-all-start-all remains the default restart mode, to ensure compatibility with previous versions of the Apcera platform; rolling restart mode must be enabled explicitly as part of the job configuration.

Rolling restart can be configured for either:

  • Restarts of a running job (without any changes)
  • Updates to a running job that require a restart, such as:
    • Replacing a Docker image
    • Replacing a package of a running job
    • Redeploying a running job
    • Changing settings of a running job

Rolling restart mode can be configured on all regular user jobs and service provider jobs.

State reconciliation

When a change is made to a job, the job details are updated by the Job Manager (JM) in the jobs database. The resulting new version of the job is then advertised to all Instance Managers (IMs), which propagate the new version across all instances. An instance may accept or reject the update: if it accepts, the change is applied to the container in place (a so-called dynamic update); if it rejects, the instance becomes out of date and must be replaced.

The process of bringing instances up to date is called state reconciliation.
State reconciliation follows a make-before-break behavior:

  • Once an instance rejects an update and becomes out of date, the Health Manager (HM) requests that a new instance be started; new instances always run the latest version of the job. This is the make instance.
  • Once the make instance reaches the RUNNING state (and thus has passed all startup health checks), the HM requests a stop of one of the out-of-date instances. This is the break instance.
  • The break instance must reach the STOPPING state for the reconciliation process to continue.
  • The process repeats until all of the instances of the job (the configured number of instances) are running the latest version.

The order in which out-of-date instances are stopped depends on the overall health of each instance: less healthy instances are picked first.

The reconciliation process does not need to finish replacing all of the instances before a new update can be made. If a new update is made while reconciliation of the immediately earlier version is in progress, the process picks up the new version and starts the reconciliation again.

In the absence of repeated failures, the reconciliation process is active in the HM at all times, monitoring the job for new versions and for instances that reject them.

Dynamic updates

Port addition and deletion are examples of dynamic updates. Examples of non-dynamic updates are network join and leave, and memory changes. For more information, see Updating app properties.
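
As an illustration, the same job can receive both kinds of updates. The port flags below appear in the examples later in this document; the memory flag is an assumption (check apc job update --help on your deployment):

    # Dynamic update: adding an optional port is applied in place on all
    # running instances; no restart is required.
    apc job update example-go -pa 8081 -o

    # Non-dynamic update: running instances reject a memory change, so
    # out-of-date instances are replaced (in rolling mode if enabled).
    # The --memory flag is assumed here.
    apc job update example-go --memory 512MB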

Failure

Start requests made, or instances started, by the reconciliation process may fail. The reconciliation process tolerates such failures; the degree of tolerance is configured by the job's rolling failure threshold (0 by default). If a failure occurs, the reconciliation process retries until the number of failures crosses the threshold.

The following failures can occur:

  • Start request failed: The start request was rejected by the JM, or it never reached the JM before a timeout.
  • Start request timed out: The JM accepted the start request, but it wasn't executed; the most common reasons are that the IMs don't have enough resources or that no IMs are available.
  • Aborted by the container runtime: The start request was accepted and executed, but the resulting instance failed during setup (for example, a binding could not be configured).
  • Exited abnormally: The start request was accepted and executed, the resulting instance was set up, but the start command exited with non-zero code before the instance reached the RUNNING state.

When the number of failures crosses the threshold, the reconciliation process pauses: new replacements do not take place. While paused, the job continues to run, but it may do so with instances running different versions. The process does not resume automatically: the user must submit a new update or restart request, or request the resumption explicitly. When that happens, the reconciliation process continues with a failure count of 0 and will not pause again unless the count crosses the threshold again.
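
For example, one way to reset the failure count and resume a paused reconciliation is to submit another update. This sketch assumes a job named my-job and reuses the threshold flag described under Configuration below:

    # Raising the threshold is a dynamic update: it resets the failure count
    # to 0 and lets the paused reconciliation continue with more headroom.
    apc job update my-job --rolling-failure-threshold 10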

Coexistence with other health functions

State reconciliation is a function of the HM and is designed to coexist with other health functions:

  • Over- and under-scheduling: If the job has fewer instances running than it is configured to run with, the HM starts the instance(s) needed. Similarly, the HM stops any excess instances. The reconciliation process yields to this health function and won't take action unless the job is running with exactly the configured number of instances.
  • Flapping detection: If the number of instances that fail and get restarted crosses a configured threshold (the flapping state threshold), and this number represents a certain percentage of the total number of instances, the job is put in a flapping state for the duration of a configurable window of time (an existing health function, not specific to rolling restarts). When the current make instance of the reconciliation process fails, the failure doesn't count towards the flapping state threshold; similarly, failures of instances other than the current make instance don't count towards the reconciliation failure threshold. When the job enters the flapping state, the reconciliation process refrains from taking action.
  • Autoscaling: If enabled, the autoscaler will yield to the reconciliation process. Only when all of the instances are running the latest version of the job will the autoscaler take action.

Concurrency and supersession

At any given moment, the reconciliation process reconciles the running state of the job with that of the database, which is always the latest version. If the rollout of a change is in progress and hasn't finished, it is superseded by that of a newer change: the reconciliation process ceases to roll out the old change and begins to roll out the new one.

Example: Consider a job with 10 instances. At time t1, a request to update the job's memory limit is made. The reconciliation process begins to replace old instances with instances that have the updated memory limit. Suppose the rollout is halfway done (5 instances updated, 5 instances to update) when, at time t2, a request is made to join the job to a virtual network. The job's database state now includes both the new memory limit (changed at t1) and the VN membership change (requested at t2). The reconciliation process stops reconciling the job towards the old state and starts reconciling towards the new state, for both the 5 instances that had not been replaced after t1 and the 5 that had been.
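
As APC commands, the two requests in this example might look like the following sketch; the memory flag and the network join syntax are assumptions for illustration and may differ in your APC version:

    # t1: request a memory limit update; rolling reconciliation begins
    apc job update my-job --memory 512MB

    # t2: while the rollout is in progress, join the job to a virtual
    # network; the in-progress rollout is superseded and reconciliation
    # restarts against the newest database state
    apc network join my-network --job my-job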

Configuration

This section describes how to configure rolling restart.

Rolling restart uses the following configuration settings:

  • Rolling Mode: a boolean flag; when true, the job's non-dynamic updates and restart requests are performed in rolling mode; when false, non-dynamic updates and restarts are performed by stopping the job completely and then starting it again. The latter is the default mode. Updating a job's Rolling Mode is performed dynamically.
  • Rolling Failure Threshold: an integer; the reconciliation failure threshold. The default value is 0 (i.e., pause at the first failure). Only available when Rolling Mode is true. Updating a job's Rolling Failure Threshold is performed dynamically and resets the current count of reconciliation failures (as does any other kind of update).

All regular user jobs and service provider jobs are eligible. Platform jobs (such as stagers, semantic pipelines, Docker layer downloaders and service gateways) are not eligible: rolling mode is not supported for these jobs and the platform enforces this condition. Furthermore, user jobs with Rolling Mode enabled cannot be promoted to service gateways.

Rolling Mode can be enabled, and the Rolling Failure Threshold configured, at creation time. Rolling Mode can be disabled, and the threshold changed, at any time after creation via update.

Configuration using APC

When creating a new job

Use the --rolling-mode-enable and --rolling-failure-threshold flags of apc app create, apc docker run, apc capsule create, and apc app from package commands.

If --rolling-mode-enable is left unspecified, the default value is false. --rolling-failure-threshold can only be set if --rolling-mode-enable is used; if unspecified, the default value is 0.

    apc app create mywebsite --rolling-mode-enable --rolling-failure-threshold 5
    Deploy path [/Users/admin/dev/mywebsite]:
    Instances [1]: 5
    Memory [256MB]:
    Enable Autoscaling [y/N]:
    ╭──────────────────────────────────────────────────────────────╮
    │                Application Settings                          │
    ├────────────────────────────┬─────────────────────────────────┤
    │                       FQN: │ job::/sandbox/admin::mywebsite  │
    │                 Directory: │ /Users/admin/dev/mywebsite      │
    │                 Instances: │ 5                               │
    │                   Restart: │ always                          │
    │              Rolling Mode: │ true                            │
    │ Rolling Failure Threshold: │ 5                               │
    │          Staging Pipeline: │ (will be auto-detected)         │
    │                       CPU: │ 0ms/s (uncapped)                │
    │                    Memory: │ 256MiB                          │
    │                      Disk: │ 1GiB                            │
    │                    NetMin: │ 5Mbps                           │
    │                    Netmax: │ 0Mbps (uncapped)                │
    │                  Route(s): │ auto                            │
    │           Startup Timeout: │ 30 (seconds)                    │
    │              Stop Timeout: │ 5 (seconds)                     │
    ╰────────────────────────────┴─────────────────────────────────╯

When updating an existing job

Use --rolling-mode-enable to enable, --rolling-mode-disable to disable, and --rolling-failure-threshold to change the threshold (when Rolling Mode is enabled).

    apc job update my-job --rolling-mode-disable
    ╭────────────────────────────────────╮
    │        Job Update Settings         │
    ├───────────────┬────────────────────┤
    │          FQN: │ job::/myns::my-job │
    │ Rolling Mode: │ disable            │
    ╰───────────────┴────────────────────╯

    apc job update my-job --rolling-mode-enable --rolling-failure-threshold 3
    ╭─────────────────────────────────────────────────╮
    │               Job Update Settings               │
    ├────────────────────────────┬────────────────────┤
    │                       FQN: │ job::/myns::my-job │
    │              Rolling Mode: │ enable             │
    │ Rolling Failure Threshold: │ 3                  │
    ╰────────────────────────────┴────────────────────╯

Configuration using MRM

When creating a Multi-Resource Manifest

The following manifest declares a job created with rolling restarts enabled and the rolling failure threshold set to 5.

    {
      "jobs": {
        "job::/dev::nats-server": {
          "docker": {
            "image": "nats:latest"
          },
          "state": "started",
          "instances": 5,
          "rollout": {
            "rolling_mode": true,
            "failure_threshold": 5
          }
        }
      }
    }

When updating a Multi-Resource Manifest

Change the values of rolling_mode and failure_threshold and redeploy the manifest.
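
For example, to raise the threshold of the job declared above, you might edit its rollout block and redeploy; the manifest file name is hypothetical, and because failure_threshold is a dynamic update, no instances are restarted:

    # In the manifest, change the rollout block of job::/dev::nats-server to:
    #   "rollout": {
    #     "rolling_mode": true,
    #     "failure_threshold": 10
    #   }
    # then redeploy:
    apc manifest deploy nats-server.json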

Configuration using Web console

When creating a new capsule:

  1. Click Show Advanced in the Create New Capsule form.
  2. Click the Enable Rolling Mode check box.
  3. The Rolling Failure Threshold field is available only when Enable Rolling Mode has been selected. Enter an applicable tolerance threshold (0 is the default).
  4. Click Submit to enable Rolling Mode Restart.


When creating a new Docker image:

  1. Click Show Advanced in the Create New Docker Image form.
  2. Click the Enable Rolling Mode check box.
  3. The Rolling Failure Threshold field is available only when Enable Rolling Mode has been selected. Enter an applicable tolerance threshold (0 is the default).
  4. Click Submit to enable Rolling Mode Restart.


When updating a job:

  1. Select the Summary tab.
  2. Click the Enable Rolling Mode check box.
  3. The Rolling Failure Threshold field is available only when Enable Rolling Mode has been selected. Enter an applicable tolerance threshold (0 is the default).


Progress and status display

Online reports

Invocations of deployments, restarts, and updates of a single job report the progress of the operation online.

MRM deployments via APC do not have online progress reports. When MRMs are deployed by CI/CD systems, no user observes the output of apc manifest deploy. To monitor the rolling-mode deployments of multiple jobs, use reconciliation events (see Reconciliation events) or the apc manifest status offline report (see Offline reports).
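
For example, a CI/CD pipeline can poll the offline report after deploying; the manifest path here is hypothetical:

    apc manifest status mrm.json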


Progress is reported until one of these conditions is fulfilled:

  • The number of instances that are up-to-date and available reaches the expected, configured number of instances of the job, and all of the old instances have been stopped.

      This operation requires restarting running instances.
      The restart will be performed in rolling mode.
      Instances up-to-date and running: 30/30. Stopped: 30/30. Errors: 0/0 (threshold).
      Success!
    
  • The number of reconciliation errors exceeds the job’s Rolling Failure Threshold, at which point the reconciliation process pauses. Reconciliation errors are displayed as they occur.

      This operation requires restarting running instances.
      The restart will be performed in rolling mode.
      Applying update...done
      Warning: instance 65997a3c: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 00:09:57.078735606 +0000 UTC)
      Warning: instance 2f6a86f4: aborted by the container runtime (reason: instance timed out during startup): check the IM logs (2018-03-12 00:09:58.742212265 +0000 UTC)
      Warning: instance 960580fa: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 00:10:06.532938463 +0000 UTC)
      Warning: instance 2bba2b7a: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 00:10:11.313037167 +0000 UTC)
      Warning: instance 6860f594: aborted by the container runtime (reason: process failed): check the IM logs (2018-03-12 00:10:12.778383918 +0000 UTC)
    
      ...
    
      Instances up-to-date and running: 17/30. Stopped: 17/30. Errors: 11/10 (threshold).
      Error: the state reconciliation process was paused: submit another restart or update request, or resume the reconciliation process explicitly
    
  • The rollout has been superseded by a newer one.

      This operation requires restarting running instances.
      The restart will be performed in rolling mode.
      Applying update...done
      Instances up-to-date and running: 8/30. Stopped: 8/30. Errors: 0/0 (threshold). 
      Error: superseded by a concurrent restart or update
    
  • The job enters the flapping state (see Coexistence with other health functions).

      This operation requires restarting running instances.
      The restart will be performed in rolling mode.
      Applying update...done
      Warning: instance f0234e51: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 00:43:27.302913836 +0000 UTC)
      Warning: instance cda7af56: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 00:43:28.129188522 +0000 UTC)
      Instances up-to-date and running: 0/15. Stopped: 0/15. Errors: 2/10 (threshold).
      Error: job is flapping
    

Interruptions

The APC command will exit with code 1 when the online progress report is interrupted by any of the error events listed above.

Interrupting the progress report has no consequence for the reconciliation process: unlike non-rolling-mode updates and restarts, a rolling-mode operation is not orchestrated by the client and takes place entirely behind the API. Be careful, though, not to interrupt the command before the progress report has begun.

Note that the reconciliation process itself cannot be interrupted by the user.

Dynamic updates

Dynamic updates are applied on all running instances at once and do not require replacements. Progress of dynamic updates is therefore not reported.

For example:

        apc route add
        Route Type (http/tcp/udp) [http]:
        HTTPS Only [y/N]:
        Endpoint [auto]:
        Job Name []: example-go
        Port Number (on job) [choose open port]:
        Weight (%) [0%]:
        ╭───────────────────────────────────────────╮
        │            Route Add Settings             │
        ├─────────────────┬─────────────────────────┤
        │     Route Type: │ http                    │
        │     HTTPS only: │ false                   │
        │ Route Endpoint: │ auto                    │
        │       Job Name: │ example-go              │
        │   Job TCP Port: │ (auto-detected)         │
        │   Route Weight: │ 0%                      │
        ╰─────────────────┴─────────────────────────╯

        Is this correct? [Y/n]:
        Success!

Mixed dynamic and non-dynamic updates

Some updates may, at the same time, be applied dynamically on some instances and require a restart for others. This is the case when adding a host affinity tag: instances that are already placed correctly accept the update dynamically, whereas those that aren't are replaced. The progress report shows the former under the label "w/o restart" (that is, without restart).

        apc app update target --hard-tags-add host-8ed2538f

        ╭────────────────────────────────────────────────────────╮
        │          Job Update Settings                           │
        ├──────────────────────────────┬─────────────────────────┤
        │                         FQN: │ job::/rr::target        │
        │ Hard Scheduling Tags to Add: │ host-8ed2538f           │
        ╰──────────────────────────────┴─────────────────────────╯

        Is this correct? [Y/n]: 
        This operation requires restarting running instances.
        The restart will be performed in rolling mode. Proceed? [Y/n]:
        Applying update...done
        Instances up-to-date and running: 15/15 (w/o restart: 4). Stopped: 11/11. Errors: 0/0 (threshold).
        Success!

Requests that include dynamic and non-dynamic updates require restart. In the following example, port addition is dynamic, but environment variable addition is not, so the update requires restart.

        apc job update example-go -pa 8080 -o --env-set HOLA=hola 

        ╭─────────────────────────────────────────────────╮
        │          Job Update Settings                    │
        ├───────────────────────┬─────────────────────────┤
        │                  FQN: │ job::/rr::example-go    │
        │      TCP Port to Add: │ 8080 (optional)         │
        │ Env Variables to Set: │ HOLA=hola               │
        ╰───────────────────────┴─────────────────────────╯

        Is this correct? [Y/n]: 
        Setting 'Hola="hola"'
        Exposed TCP port 8080 (optional)
        This operation requires restarting running instances.
        The restart will be performed in rolling mode. Proceed? [Y/n]:
        Applying update...done
        Instances up-to-date and running: 30/30. Stopped: 30/30. Errors: 0/10 (threshold).
        Success!

Web Console

In the Web console, you can see an online progress report for a single job. The report appears after you confirm the rolling restart or update, and shows the reconciliation progress, the number of errors, and specific error details.


Unlike APC, the Web Console displays the progress of an MRM deployment online. Note that it does so only for jobs configured for Rolling Mode.


Offline reports

The status of the reconciliation of a job can be monitored offline using the following reports:

Job detail

apc job show displays the rolling-mode configuration, as well as the status of the reconciliation: Rolling Mode Paused: {true,false}.

    $ apc job show example-go
    ╭────────────────────────────┬──────────────────────────────────────────────────╮
    │ Job:                       │ example-go                                       │
    ├────────────────────────────┼──────────────────────────────────────────────────┤
    │ FQN:                       │ job::/rr::example-go                             │
    │ UUID:                      │ 0b045788-5471-4423-987d-32e0ae47c0c8             │
    │                            │                                                  │
    │ State:                     │ started                                          │
    │ Status:                    │ running                                          │
    │ Running Instances:         │ 30/30 started                                    │
    │ Health Score:              │ 100%                                             │
    │                            │                                                  │
    │ Created by:                │ admin@apcera.me                                  │
    │ Created at:                │ 2018-03-11 21:22:30.793261116 +0000 UTC          │
    │ Updated by:                │ admin@apcera.me                                  │
    │ Updated at:                │ 2018-03-12 01:18:57.337685462 +0000 UTC          │
    │                            │                                                  │
    │ Restart:                   │ always                                           │
    │                            │                                                  │
    │ Rolling Mode:              │ true                                             │
    │ Rolling Mode Paused:       │ false                                            │
    │ Rolling Failure Threshold: │ 10                                               │
    ╰────────────────────────────┴──────────────────────────────────────────────────╯

The Web Console displays the status of the reconciliation at the top of the details page of a job.

APC JOB INSTANCES

This report shows:

  • The status of the reconciliation process: active or paused.
  • The current number of reconciliation errors against the threshold, followed by the list of such errors.
  • Up To Date and Available: The number of instances that are running the latest version of the job and that are available and fully operational, out of the expected, configured number.

The "Up To Date" column indicates whether the instance is running the latest version of the job; "Available", if the instance is fully functional (it passed the startup health check and is running); and “Will Restart”, whether the reconciliation process has yet to process and replace the instance.

During a rolling restart that doesn't experience non-reconciliation errors, the Total at the bottom of the "Available" column always displays either the expected, configured number of instances, or one extra (that is, the current, in-progress make). This number can be used to verify that the job remains at least 100% available during the entire operation.

The rows are ordered by “Up To Date” (latest version first), then by “Status” (in the following order: NEW, SETUP, STARTING_WAIT, STARTING, FIRST_RUNNING, RUNNING, UPDATING, STOPPING_WAIT, STOPPING, TEARDOWN, REMOVED), and then by “Will Restart” (No’s first).

When executed in a loop, the 1-by-1, make-before-break behavior can be observed.

    $ while true; do apc job instances example-go; sleep 2; done

    # First MAKE (14db1044). All instances went out of date; only the current, in-progress MAKE instance is up to date. 10 instances remain available.
    Looking up "example-go"... done
    Rolling Mode: true
    Reconciliation Status: active
    Reconciliation Errors: 0/10
    Up To Date and Available: 0/10
     ──────────┬─────────┬────────────┬───────────┬──────────────┬────────┬──────────────────
    │ UUID     │ Status  │ Up To Date │ Available │ Will Restart │ Uptime │ Host             │
    ├──────────┼─────────┼────────────┼───────────┼──────────────┼────────┼──────────────────┤
    │ 14db1044 │ SETUP   │ Yes        │ No        │ No           │ 0s     │ host-8ed2538f    │
    │ dcf371aa │ RUNNING │ No         │ Yes       │ Yes          │ 13m    │ host-87febd29    │
    │ 08f557f6 │ RUNNING │ No         │ Yes       │ Yes          │ 13m    │ host-d7bfa338    │
    │ 1585a8db │ RUNNING │ No         │ Yes       │ Yes          │ 13m    │ host-43e3d4c8    │
    │ 69db95ad │ RUNNING │ No         │ Yes       │ Yes          │ 13m    │ host-7d650b17    │   
    │ dc8a4e4d │ RUNNING │ No         │ Yes       │ Yes          │ 13m    │ host-7d650b17    │
    │ 06a53a97 │ RUNNING │ No         │ Yes       │ Yes          │ 13m    │ host-87febd29    │ 
    │ 6cb115f0 │ RUNNING │ No         │ Yes       │ Yes          │ 13m    │ host-87febd29    │
    │ f12851fe │ RUNNING │ No         │ Yes       │ Yes          │ 13m    │ host-8ed2538f    │
    │ 21cd1b39 │ RUNNING │ No         │ Yes       │ Yes          │ 13m    │ host-d7bfa338    │
    │ 3505c70b │ RUNNING │ No         │ Yes       │ Yes          │ 13m    │ host-43e3d4c8    │
    ├──────────┼─────────┼────────────┼───────────┼──────────────┼────────┼──────────────────┤
    │          │         │ Total: 1   │ Total: 10 │ Total: 10    │        │                  │
     ──────────┴─────────┴────────────┴───────────┴──────────────┴────────┴──────────────────

    # First MAKE (14db1044) succeeded. First BREAK (f12851fe) occurred. Second MAKE (cb071415) in progress. 2 instances are up to date. 10 instances remain available. 9 instances remain to be processed (1 of them is the 2nd BREAK).
    Looking up "example-go"... done
    Rolling Mode: true
    Reconciliation Status: active
    Reconciliation Errors: 0/10
    Up To Date and Available: 1/10
     ──────────┬──────────┬────────────┬───────────┬──────────────┬────────┬──────────────────
    │ UUID     │ Status   │ Up To Date │ Available │ Will Restart │ Uptime │ Host             │
    ├──────────┼──────────┼────────────┼───────────┼──────────────┼────────┼──────────────────┤
    │ cb071415 │ SETUP    │ Yes        │ No        │ No           │ 0s     │ host-8ed2538f    │
    │ 14db1044 │ RUNNING  │ Yes        │ Yes       │ No           │ 1s     │ host-8ed2538f    │
    │ 69db95ad │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-7d650b17    │
    │ 1585a8db │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-43e3d4c8    │
    │ 06a53a97 │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-87febd29    │
    │ 6cb115f0 │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-87febd29    │
    │ dc8a4e4d │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-7d650b17    │
    │ 3505c70b │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-43e3d4c8    │
    │ 21cd1b39 │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-d7bfa338    │
    │ 08f557f6 │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-d7bfa338    │
    │ dcf371aa │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-87febd29    │
    │ f12851fe │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-8ed2538f    │
    ├──────────┼──────────┼────────────┼───────────┼──────────────┼────────┼──────────────────┤
    │          │          │ Total: 2   │ Total: 10 │ Total: 9     │        │                  │
     ──────────┴──────────┴────────────┴───────────┴──────────────┴────────┴──────────────────

    # Second MAKE (cb071415) succeeded. Second BREAK (21cd1b39) occurred. Third MAKE (cb9de68d) in progress. 3 instances are up to date. 10 instances remain available. 8 instances remain to be processed (1 of them is the 3rd BREAK).
    Looking up "example-go"... done
    Rolling Mode: true
    Reconciliation Status: active
    Reconciliation Errors: 0/10
    Up To Date and Available: 2/10
     ──────────┬──────────┬────────────┬───────────┬──────────────┬────────┬──────────────────
    │ UUID     │ Status   │ Up To Date │ Available │ Will Restart │ Uptime │ Host             │
    ├──────────┼──────────┼────────────┼───────────┼──────────────┼────────┼──────────────────┤
    │ cb9de68d │ SETUP    │ Yes        │ No        │ No           │ 0s     │ host-d7bfa338    │
    │ 14db1044 │ RUNNING  │ Yes        │ Yes       │ No           │ 2s     │ host-8ed2538f    │
    │ cb071415 │ RUNNING  │ Yes        │ Yes       │ No           │ 1s     │ host-8ed2538f    │
    │ 06a53a97 │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-87febd29    │
    │ 6cb115f0 │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-87febd29    │
    │ 1585a8db │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-43e3d4c8    │
    │ 3505c70b │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-43e3d4c8    │
    │ dc8a4e4d │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-7d650b17    │
    │ dcf371aa │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-87febd29    │
    │ 08f557f6 │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-d7bfa338    │
    │ 69db95ad │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-7d650b17    │
    │ 21cd1b39 │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-d7bfa338    │
    ├──────────┼──────────┼────────────┼───────────┼──────────────┼────────┼──────────────────┤
    │          │          │ Total: 3   │ Total: 10 │ Total: 8     │        │                  │
     ──────────┴──────────┴────────────┴───────────┴──────────────┴────────┴──────────────────

    # Continues in this fashion.   
    …

    # 10th MAKE (de2526e6) in progress. 10 instances are up to date (10th MAKE included). 10 instances remain available. 1 instance remains to be processed: the 10th BREAK.
    Looking up "example-go"... done
    Rolling Mode: true
    Reconciliation Status: active
    Reconciliation Errors: 0/10
    Up To Date and Available: 9/10
     ──────────┬──────────┬────────────┬───────────┬──────────────┬────────┬──────────────────
    │ UUID     │ Status   │ Up To Date │ Available │ Will Restart │ Uptime │ Host             │
    ├──────────┼──────────┼────────────┼───────────┼──────────────┼────────┼──────────────────┤
    │ de2526e6 │ SETUP    │ Yes        │ No        │ No           │ 0s     │ host-8ed2538f    │
    │ 14db1044 │ RUNNING  │ Yes        │ Yes       │ No           │ 9s     │ host-8ed2538f    │
    │ 367254b6 │ RUNNING  │ Yes        │ Yes       │ No           │ 4s     │ host-d7bfa338    │
    │ 95d41713 │ RUNNING  │ Yes        │ Yes       │ No           │ 3s     │ host-7d650b17    │
    │ 96622bf5 │ RUNNING  │ Yes        │ Yes       │ No           │ 5s     │ host-43e3d4c8    │
    │ 97f0985f │ RUNNING  │ Yes        │ Yes       │ No           │ 6s     │ host-7d650b17    │
    │ cb071415 │ RUNNING  │ Yes        │ Yes       │ No           │ 8s     │ host-8ed2538f    │
    │ cb9de68d │ RUNNING  │ Yes        │ Yes       │ No           │ 7s     │ host-d7bfa338    │
    │ cdedc331 │ RUNNING  │ Yes        │ Yes       │ No           │ 2s     │ host-87febd29    │
    │ d990f3f2 │ RUNNING  │ Yes        │ Yes       │ No           │ 1s     │ host-87febd29    │
    │ 1585a8db │ RUNNING  │ No         │ Yes       │ Yes          │ 13m    │ host-43e3d4c8    │
    │ 3505c70b │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-43e3d4c8    │
    │ dc8a4e4d │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-7d650b17    │
    │ 06a53a97 │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-87febd29    │
    │ 6cb115f0 │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-87febd29    │
    │ 69db95ad │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-7d650b17    │
    │ 21cd1b39 │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-d7bfa338    │
    │ dcf371aa │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-87febd29    │
    │ 08f557f6 │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-d7bfa338    │
    ├──────────┼──────────┼────────────┼───────────┼──────────────┼────────┼──────────────────┤
    │          │          │ Total: 10  │ Total: 10 │ Total: 1     │        │                  │
     ──────────┴──────────┴────────────┴───────────┴──────────────┴────────┴──────────────────

    # 10th MAKE (de2526e6) succeeds. 10th BREAK (1585a8db) occurred. 10 instances are up to date. 10 instances are available. None remain to be processed.
    Looking up "example-go"... done
    Rolling Mode: true
    Reconciliation Status: active
    Reconciliation Errors: 0/10
    Up To Date and Available: 10/10
     ──────────┬──────────┬────────────┬───────────┬──────────────┬────────┬──────────────────
    │ UUID     │ Status   │ Up To Date │ Available │ Will Restart │ Uptime │ Host             │
    ├──────────┼──────────┼────────────┼───────────┼──────────────┼────────┼──────────────────┤
    │ 14db1044 │ RUNNING  │ Yes        │ Yes       │ No           │ 11s    │ host-8ed2538f    │
    │ 367254b6 │ RUNNING  │ Yes        │ Yes       │ No           │ 5s     │ host-d7bfa338    │
    │ 95d41713 │ RUNNING  │ Yes        │ Yes       │ No           │ 4s     │ host-7d650b17    │
    │ 96622bf5 │ RUNNING  │ Yes        │ Yes       │ No           │ 7s     │ host-43e3d4c8    │
    │ 97f0985f │ RUNNING  │ Yes        │ Yes       │ No           │ 8s     │ host-7d650b17    │
    │ cb071415 │ RUNNING  │ Yes        │ Yes       │ No           │ 10s    │ host-8ed2538f    │
    │ cb9de68d │ RUNNING  │ Yes        │ Yes       │ No           │ 9s     │ host-d7bfa338    │
    │ cdedc331 │ RUNNING  │ Yes        │ Yes       │ No           │ 3s     │ host-87febd29    │
    │ d990f3f2 │ RUNNING  │ Yes        │ Yes       │ No           │ 2s     │ host-87febd29    │
    │ de2526e6 │ RUNNING  │ Yes        │ Yes       │ No           │ 1s     │ host-8ed2538f    │
    │ 21cd1b39 │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-d7bfa338    │
    │ 06a53a97 │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-87febd29    │
    │ 1585a8db │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-43e3d4c8    │
    │ 69db95ad │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-7d650b17    │
    │ 3505c70b │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-43e3d4c8    │
    │ dc8a4e4d │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-7d650b17    │
    │ 08f557f6 │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-d7bfa338    │
    │ 6cb115f0 │ TEARDOWN │ No         │ No        │ -            │ 13m    │ host-87febd29    │
    ├──────────┼──────────┼────────────┼───────────┼──────────────┼────────┼──────────────────┤
    │          │          │ Total: 10  │ Total: 10 │ Total: 0     │        │                  │
     ──────────┴──────────┴────────────┴───────────┴──────────────┴────────┴──────────────────

Note the number of instances in the TEARDOWN state left behind: a BREAK is successful as soon as the instance transitions into the STOPPING_WAIT state, not when it has stopped completely and been removed. These instances will disappear from the report once the IM finishes tearing them down.

An instance transitions into STOPPING_WAIT before its stop command, if any, has been executed. The stop command may fail and the instance may exit abnormally; that does not count as a reconciliation error. The IM ensures that every stopped instance is eventually removed (they never linger forever), and instances cease using IM resources as soon as the transition takes place.

The report also displays the current number of reconciliation errors and lists them:

    Looking up "example-go"... done
    Rolling Mode: true
    Reconciliation Status: active
    Reconciliation Errors: 6/10
    Warning: instance 9ca0667d: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:30:55.593573196 +0000 UTC)
    Warning: instance 11a76405: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:30:57.145660932 +0000 UTC)
    Warning: instance af536407: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:31:01.719872075 +0000 UTC)
    Warning: instance 54bb6bc2: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:31:03.423936744 +0000 UTC)
    Warning: instance 4de91645: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:31:08.100499549 +0000 UTC)
    Warning: instance aa336a22: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:31:14.264948428 +0000 UTC)
    Up To Date and Available: 16/30
     ──────────┬──────────┬────────────┬───────────┬──────────────┬────────┬───────────────
    │ UUID     │ Status   │ Up To Date │ Available │ Will Restart │ Uptime │ Host          │
    ├──────────┼──────────┼────────────┼───────────┼──────────────┼────────┼───────────────┤
    │ 7eccad8f │ SETUP    │ Yes        │ No        │ No           │ 0s     │ host-43e3d4c8 │
    │ 05292844 │ RUNNING  │ Yes        │ Yes       │ No           │ 27s    │ host-8ed2538f │
    │ 09369774 │ RUNNING  │ Yes        │ Yes       │ No           │ 5s     │ host-7d650b17 │
    │ 2935abf1 │ RUNNING  │ Yes        │ Yes       │ No           │ 17s    │ host-7d650b17 │
    │ 2fbdafa7 │ RUNNING  │ Yes        │ Yes       │ No           │ 19s    │ host-7d650b17 │
    │ 588937f6 │ RUNNING  │ Yes        │ Yes       │ No           │ 33s    │ host-d7bfa338 │
    │ 73c89c73 │ RUNNING  │ Yes        │ Yes       │ No           │ 31s    │ host-d7bfa338 │
    │ 742f4fc9 │ RUNNING  │ Yes        │ Yes       │ No           │ 23s    │ host-8ed2538f │
    ...

When the number of errors exceeds the threshold and the reconciliation process is paused, the report indicates this:

    Looking up "example-go"... done
    Rolling Mode: true
    Reconciliation Status: paused
    Reconciliation Errors: 11/10
    Warning: instance 9ca0667d: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:30:55.593573196 +0000 UTC)
    Warning: instance 11a76405: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:30:57.145660932 +0000 UTC)
    Warning: instance af536407: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:31:01.719872075 +0000 UTC)
    Warning: instance 54bb6bc2: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:31:03.423936744 +0000 UTC)
    Warning: instance 4de91645: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:31:08.100499549 +0000 UTC)
    Warning: instance aa336a22: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:31:14.264948428 +0000 UTC)
    Warning: instance 1e7c88f2: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:31:19.000066944 +0000 UTC)
    Warning: instance 32ec055d: aborted by the container runtime (reason: process failed): check the IM logs (2018-03-12 02:31:22.180265126 +0000 UTC)
    Warning: instance 174ff9d0: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:31:23.501932291 +0000 UTC)
    Warning: instance 62c9795f: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:31:26.205546644 +0000 UTC)
    Warning: instance 7b611ee1: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 02:31:27.704941303 +0000 UTC)
    Up To Date and Available: 19/30
     ──────────┬─────────┬────────────┬───────────┬──────────────┬────────┬──────────────────
    │ UUID     │ Status  │ Up To Date │ Available │ Will Restart │ Uptime │ Host             │
    ├──────────┼─────────┼────────────┼───────────┼──────────────┼────────┼──────────────────┤
    │ 05292844 │ RUNNING │ Yes        │ Yes       │ No           │ 3m39s  │ host-8ed2538f    │
    │ 09369774 │ RUNNING │ Yes        │ Yes       │ No           │ 3m17s  │ host-7d650b17    │
    │ 2935abf1 │ RUNNING │ Yes        │ Yes       │ No           │ 3m29s  │ host-7d650b17    │
    │ 2fbdafa7 │ RUNNING │ Yes        │ Yes       │ No           │ 3m31s  │ host-7d650b17    │
    │ 310f9ecb │ RUNNING │ Yes        │ Yes       │ No           │ 3m5s   │ host-87febd29    │
    │ 588937f6 │ RUNNING │ Yes        │ Yes       │ No           │ 3m45s  │ host-d7bfa338    │
    
    ...

Web Console

The Web Console has an offline report equivalent to APC JOB INSTANCES, with the columns Up-to-date and Running? and Will Restart?. You can find it on the SCHEDULING tab; it refreshes automatically.

![Alt text](/assets/img/jobs/job_scheduling_table-57a37a32.png "Job Scheduling Table"){: style="max-width: 100%; border: 1px solid black;"}

APC MANIFEST STATUS

This report queries the current deployment status of the multiple jobs that may be included in an MRM. If the jobs are configured for Rolling Mode and are running, a summary that includes the status of the reconciliation process and the current number of reconciliation errors is provided.

    $ apc manifest status mrm.json

    ────────────────────────────────────────────┬─────────┬───────────┬────────────┬───────────┬──────────────┬────────────────┬────────
    │ Job                                        │ Status  │ Instances │ Up To Date │ Available │ Rolling Mode │ Reconciliation │ Errors │
    ├────────────────────────────────────────────┼─────────┼───────────┼────────────┼───────────┼──────────────┼────────────────┼────────┤
    │ job::/example::existing-job-no-rr-ready    │ ready   │ 5         │ 0          │ 0         │ false        │ -              │ -      │
    │ job::/example::existing-job-no-rr-started  │ started │ 5         │ 5          │ 5         │ false        │ -              │ -      │
    │ job::/example::existing-job-no-rr-stopped  │ stopped │ 5         │ 0          │ 0         │ false        │ -              │ -      │
    │ job::/example::existing-job-yes-rr-ready   │ ready   │ 5         │ 0          │ 0         │ true         │ -              │ -      │
    │ job::/example::existing-job-yes-rr-started │ started │ 20        │ 20         │ 20        │ true         │ active         │ 0/15   │
    │ job::/example::existing-job-yes-rr-errored │ started │ 20        │ 20         │ 20        │ true         │ paused         │ 6/5    │
    │ job::/example::existing-job-yes-rr-stopped │ stopped │ 5         │ 0          │ 0         │ true         │ -              │ -      │
    │ job::/example::new-job-no-rr-ready         │ ready   │ 5         │ 0          │ 0         │ false        │ -              │ -      │
    │ job::/example::new-job-no-rr-started       │ started │ 5         │ 5          │ 5         │ false        │ -              │ -      │
    │ job::/example::new-job-yes-rr-ready        │ ready   │ 5         │ 0          │ 0         │ true         │ -              │ -      │
    │ job::/example::new-job-yes-rr-started      │ started │ 5         │ 5          │ 5         │ true         │ active         │ 0/0    │
     ────────────────────────────────────────────┴─────────┴───────────┴────────────┴───────────┴──────────────┴────────────────┴────────
    Use APC JOB INSTANCES <job> to get more details.
    Success!

Redeployment, update, and restart

App redeployment, update, and restart all obey the job's Rolling Mode setting: if false, a non-rolling, STOP-OPERATION-START sequence orchestrated by the client takes place; if true, a 1-by-1, rolling, make-before-break behavior takes place.

Starting a stopped job and stopping a running job are not influenced by the job’s Rolling Mode: all instances are started and stopped at once.
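
For example, assuming the standard apc app start and apc app stop commands:

    # Stop and start affect all instances at once, regardless of Rolling Mode.
    apc app stop example-go
    apc app start example-go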

Redeployment

A running job can be redeployed with an updated package in rolling mode.

APC APP DEPLOY

This command redeploys a single job with a package built from the contents of the selected directory.

    $ apc app deploy example-go
    Deploying from: "..."
    Is this correct? [Y/n]: 
    Warning: By default previously deployed app package will be removed.
    Use --keep-previous flag to keep the old package.
    Continue? [Y/n]: 
    This operation requires restarting running instances.
    The restart will be performed in rolling mode. Proceed? [Y/n]: 
    Packaging... done
    Creating package "example-go-1520824889"... done
    Uploading package contents... 100.0% (1.4/1.4 KiB)
    [package] -- Uploading -- received all bytes from client, start upload to storage backend
    [package] -- Uploading -- uploaded resource "66a9b5e5-5c5b-4c2d-8a42-d0b34a19445f" to storage backend successfully in 0.000s
    [package] -- Uploading -- updated package "721219ec-a395-4430-8399-d334839f56e7" with new resource

    [staging] Subscribing to the staging process...
    [staging] Log forwarding initializing for job=job::/apcera::staging_coordinator/example-go-1520824889/721219ec/0600d3e9 path=/var/lib/continuum/instances/fb014366/data/aufs/logs/staging.log tag=staging.721219ec-a395-4430-8399-d334839f56e7
    [staging] Beginning staging with 'stagpipe::/apcera::go' pipeline, 1 stager(s) defined.
    [staging] Launching instance of stager 'go'...
    [staging] Log forwarding initializing for job=job::/rr::go/stager/example-go-1520824889/721219ec/96bc0b5b path=/var/lib/continuum/instances/8ef8d32d/data/aufs/logs/stdout.stagers.log tag=staging.721219ec-a395-4430-8399-d334839f56e7
    [staging] Log forwarding initializing for job=job::/rr::go/stager/example-go-1520824889/721219ec/96bc0b5b path=/var/lib/continuum/instances/8ef8d32d/data/aufs/logs/stderr.stagers.log tag=staging.721219ec-a395-4430-8399-d334839f56e7
    [staging] Checking build dependencies...
    [staging] Adding build dependencies... [go, build-essential]
    [staging] Stager needs relaunching
    [staging] Launching instance of stager 'go'...
    [staging] Log forwarding initializing for job=job::/rr::go/stager/example-go-1520824889/721219ec/6d7242eb path=/var/lib/continuum/instances/3f4afb84/data/aufs/logs/stdout.stagers.log tag=staging.721219ec-a395-4430-8399-d334839f56e7
    [staging] Log forwarding initializing for job=job::/rr::go/stager/example-go-1520824889/721219ec/6d7242eb path=/var/lib/continuum/instances/3f4afb84/data/aufs/logs/stderr.stagers.log tag=staging.721219ec-a395-4430-8399-d334839f56e7
    [staging] Checking build dependencies...
    [staging] Downloading packages for processing...
    [staging] Go src copied to: src/github.com/apcera/sample-apps/example-go
    [staging] Copying binaries into place.
    [staging] Removing build dependencies... [go, build-essential]
    [staging] Staging is complete.
    
    Updating package name from "example-go-1520824889" to "example-go"... done
    Updating "example-go"... done
    Start Command: ./example-go
    Instances up-to-date and running: 40/40. Stopped: 40/40. Errors: 0/0 (threshold).
    Success!
    Deleting old package "package::/rr::example-go-1520824900" [--keep-previous=false]
    Package "package::/rr::example-go-1520824900" deleted

APC PACKAGE REPLACE

This command replaces a package and redeploys all of the running jobs that use it with the new package, one by one. In the following example, 4 jobs use the package that is to be replaced; 3 of them are configured for rolling mode and 1 is not:

    $ apc package replace example-go example-go.tar.gz 
    Staging Pipeline []: stagpipe::/apcera::go
    Provides []: 
    Dependencies []: os.linux
    Environment []: GOPROJECT="src/github.com/apcera/sample-apps/example-go", START_COMMAND="./example-go", START_PATH="/app"
    ╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
    │                                                   Package Replace Settings                                                    │
    ├───────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────┤
    │     Package Name: │ package::/rr::example-go                                                                                  │
    │   File to Package │ example-go.tar.gz                                                                                         │
    │ Staging Pipeline: │ stagpipe::/apcera::go                                                                                     │
    │     Dependencies: │ os.linux                                                                                                  │
    │      Environment: │ GOPROJECT="src/github.com/apcera/sample-apps/example-go", START_COMMAND="./example-go", START_PATH="/app" │
    ╰───────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
    
    Replace package? (This action will stop or restart any dependent jobs) [Y/n]: 
    Checking for jobs to update... done
    Packaging... done
    Creating package "example-go"... done
    Uploading package contents... 100.0% (653/653 B)
    [package] -- Uploading -- received all bytes from client, start upload to storage backend
    [package] -- Uploading -- uploaded resource "9415c387-0b3a-46da-8702-9d1ec32d939e" to storage backend successfully in 0.000s
    [package] -- Uploading -- updated package "1ffe5ed7-01bb-4647-80a4-6ab77d66c0de" with new resource
    
    [staging] Subscribing to the staging process...
    [staging] Log forwarding initializing for job=job::/apcera::staging_coordinator/example-go/1ffe5ed7/b25f7a9b path=/var/lib/continuum/instances/be96fb81/data/aufs/logs/staging.log tag=staging.1ffe5ed7-01bb-4647-80a4-6ab77d66c0de
    [staging] Beginning staging with 'stagpipe::/apcera::go' pipeline, 1 stager(s) defined.
    [staging] Launching instance of stager 'go'...
    [staging] Log forwarding initializing for job=job::/rr::go/stager/example-go/1ffe5ed7/a158d691 path=/var/lib/continuum/instances/f02d54f9/data/aufs/logs/stdout.stagers.log tag=staging.1ffe5ed7-01bb-4647-80a4-6ab77d66c0de
    [staging] Log forwarding initializing for job=job::/rr::go/stager/example-go/1ffe5ed7/a158d691 path=/var/lib/continuum/instances/f02d54f9/data/aufs/logs/stderr.stagers.log tag=staging.1ffe5ed7-01bb-4647-80a4-6ab77d66c0de
    [staging] Checking build dependencies...
    [staging] Adding build dependencies... [go, build-essential]
    [staging] Stager needs relaunching
    [staging] Launching instance of stager 'go'...
    [staging] Log forwarding initializing for job=job::/rr::go/stager/example-go/1ffe5ed7/d71b6e5e path=/var/lib/continuum/instances/e4fc2229/data/aufs/logs/stdout.stagers.log tag=staging.1ffe5ed7-01bb-4647-80a4-6ab77d66c0de
    [staging] Log forwarding initializing for job=job::/rr::go/stager/example-go/1ffe5ed7/d71b6e5e path=/var/lib/continuum/instances/e4fc2229/data/aufs/logs/stderr.stagers.log tag=staging.1ffe5ed7-01bb-4647-80a4-6ab77d66c0de
    [staging] Checking build dependencies...
    [staging] Downloading packages for processing...
    [staging] Go src copied to: src/github.com/apcera/sample-apps/example-go
    [staging] Copying binaries into place.
    [staging] Removing build dependencies... [go, build-essential]
    [staging] Staging is complete.
    
    This operation requires restarting running instances.
    The restart will be performed in rolling mode.
    Updating job "example-go"... done
    Instances up-to-date and running: 40/40. Stopped: 40/40. Errors: 0/0 (threshold).
    Success!
    This operation requires restarting running instances.
    The restart will be performed in rolling mode.
    Updating job "example-go-2"... done
    Instances up-to-date and running: 5/5. Stopped: 5/5. Errors: 0/0 (threshold).   
    Success!
    This operation requires restarting running instances.
    The restart will be performed in rolling mode.
    Updating job "example-go-3"... done
    Instances up-to-date and running: 3/3. Stopped: 3/3. Errors: 0/0 (threshold).   
    Success!
    This operation requires restarting running instances.
    All instances will be stopped, then relaunched.
    Stopping job... done
    Updating job "example-go-no-rr"... done
    Starting job... done
    Waiting for the job to start...
    All instances started!                                                          
    Success!
    Deleting the existing package... done
    Success!

APC MANIFEST DEPLOY

The deployment of each job contained in an MRM is influenced only by that job’s Rolling Mode. There’s no MRM-level configuration that overrides it for all jobs on an invocation basis.

When the following example MRM is deployed, each job behaves as follows:

  • job::/example::new-job-yes-rr-started: Despite rolling_mode=true and being started, no reconciliation will take place, because the job is new.
  • job::/example::existing-job-no-rr-started: rolling_mode=false, so the deployment will stop the job completely, perform the database update, and start the job again.
  • job::/example::existing-job-yes-rr-ready: Despite rolling_mode=true and the fact that the job exists, no reconciliation will take place, because the job won’t be running.
  • job::/example::existing-job-yes-rr-started: Reconciliation will take place.
  • job::/example::existing-job-yes-rr-stopped: Despite rolling_mode=true and the fact that the job exists, no reconciliation will take place, because the job will be stopped.
{
   "jobs":{
      "job::/example::new-job-yes-rr-started":{
         "packages":[
            {
               "fqn":"package::/apcera/pkg/os::ubuntu-14.04-apc3"
            }
         ],
         "state":"started",
         "rollout":{
            "rolling_mode":true
         },
         "instances":5
      },
      "job::/example::existing-job-no-rr-started":{
         "packages":[
            {
               "fqn":"package::/apcera/pkg/os::ubuntu-14.04-apc3"
            }
         ],
         "state":"started",
         "rollout":{
            "rolling_mode":false
         }
      },
      "job::/example::existing-job-yes-rr-ready":{
         "packages":[
            {
               "fqn":"package::/apcera/pkg/os::ubuntu-14.04-apc3"
            }
         ],
         "state":"ready",
         "rollout":{
            "rolling_mode":true
         }
      },
      "job::/example::existing-job-yes-rr-started":{
         "packages":[
            {
               "fqn":"package::/apcera/pkg/os::ubuntu-14.04-apc3"
            }
         ],
         "state":"started",
         "rollout":{
            "rolling_mode":true,
            "failure_threshold":5
         }
      },
      "job::/example::existing-job-yes-rr-stopped":{
         "packages":[
            {
               "fqn":"package::/apcera/pkg/os::ubuntu-14.04-apc3"
            }
         ],
         "state":"started",
         "rollout":{
            "rolling_mode":true
         }
      }
   }
}

When rolling_mode=true and the job exists, reconciliation won’t take place in the following cases:

  • Only dynamic updates are being submitted (e.g. a new port, or rollout:failure_threshold is changed).
  • The job's declaration hasn't changed since the last time the MRM was deployed, and the job hasn't been changed via APC or the Web Console since the last MRM deployment in a way that makes the declaration differ from the job's state in the database.

MRMs are deployed in 2 stages:

  1. Preparation and database update: the MRM is validated, policy is checked, package dependencies are resolved, Docker images (if any) are downloaded, and the jobs and associated objects are updated in the database. An MRM deployment will fail if any of these steps fail for any of the jobs; APC and the Web Console will report the errors before exiting.
  2. Reconciliation: the reconciliation of the applicable jobs (i.e. rolling_mode=true, started, and with changes) starts as soon as the job is updated in the database. Note that after the database update, reconciliation takes place asynchronously with respect to the MRM deployment: it may complete before or after apc manifest deploy exits. This is not the case for non-rolling-mode jobs: the deployment of such jobs is guaranteed to take place in full before the command exits (i.e. all instances will run the newly deployed version of the job). The following example MRM and (truncated) deployment output illustrate the first stage:
{
   "jobs":{
      "job::/example::existing-job-no-rr-ready":{
         "docker":{
            "image":"nats"
         },
         "state":"ready",
         "instances":5,
         "rollout":{
            "rolling_mode":false
         }
      },
      "job::/example::existing-job-no-rr-started":{
         "docker":{
            "image":"nats"
         },
         "state":"started",
         "instances":5,
         "rollout":{
            "rolling_mode":false
         }
      },
      "job::/example::existing-job-no-rr-stopped":{
         "docker":{
            "image":"nats"
         },
         "state":"started",
         "instances":5,
         "rollout":{
            "rolling_mode":false
         }
      },
      "job::/example::existing-job-yes-rr-ready":{
         "docker":{
            "image":"nats"
         },
         "state":"ready",
         "instances":5,
         "rollout":{
            "rolling_mode":true
         }
      },
      "job::/example::existing-job-yes-rr-started":{
         "docker":{

            "image":"nats"
         },
         "state":"started",
         "instances":5,
         "rollout":{
            "rolling_mode":true,
            "failure_threshold":5
         }
      },
      "job::/example::existing-job-yes-rr-stopped":{
         "docker":{
            "image":"nats"
         },
         "state":"started",
         "instances":5,
         "rollout":{
            "rolling_mode":true
         }
      }
   }
}
    $ apc manifest deploy create_docker.json 
    Deploying manifest...
    [manifest] -- Deploy -- execution started
    [existing-job-no-rr-ready] -- Creating Docker package -- checking policy
    [existing-job-no-rr-ready] -- Creating Docker package -- checking if package FQN is taken
    [existing-job-no-rr-ready] -- Creating Docker package -- fetching image metadata
    [existing-job-no-rr-ready] -- Creating Docker package -- pulling image metadata from registry
    [existing-job-no-rr-ready] -- Creating Docker package -- creating package
    [existing-job-no-rr-ready] -- Creating Docker package -- all layers downloaded
    [existing-job-no-rr-started] -- Creating Docker package -- checking policy
    [existing-job-no-rr-started] -- Creating Docker package -- checking if package FQN is taken
    [existing-job-no-rr-started] -- Creating Docker package -- fetching image metadata
    [existing-job-no-rr-started] -- Creating Docker package -- using local image metadata
    [existing-job-no-rr-started] -- Creating Docker package -- creating package
    [existing-job-no-rr-started] -- Creating Docker package -- all layers downloaded
    [existing-job-no-rr-stopped] -- Creating Docker package -- checking policy
    [existing-job-no-rr-stopped] -- Creating Docker package -- checking if package FQN is taken
    [existing-job-no-rr-stopped] -- Creating Docker package -- fetching image metadata
    [existing-job-no-rr-stopped] -- Creating Docker package -- using local image metadata
    [existing-job-no-rr-stopped] -- Creating Docker package -- creating package
    [existing-job-no-rr-stopped] -- Creating Docker package -- all layers downloaded
    [existing-job-yes-rr-ready] -- Creating Docker package -- checking policy
    [existing-job-yes-rr-ready] -- Creating Docker package -- checking if package FQN is taken
    [existing-job-yes-rr-ready] -- Creating Docker package -- fetching image metadata
    [existing-job-yes-rr-ready] -- Creating Docker package -- using local image metadata
    [existing-job-yes-rr-ready] -- Creating Docker package -- creating package
    [existing-job-yes-rr-ready] -- Creating Docker package -- all layers downloaded
    [existing-job-yes-rr-started] -- Creating Docker package -- checking policy
    [existing-job-yes-rr-started] -- Creating Docker package -- checking if package FQN is taken
    [existing-job-yes-rr-started] -- Creating Docker package -- fetching image metadata
    [existing-job-yes-rr-started] -- Creating Docker package -- using local image metadata
    [existing-job-yes-rr-started] -- Creating Docker package -- creating package
    [existing-job-yes-rr-started] -- Creating Docker package -- all layers downloaded
    [existing-job-yes-rr-stopped] -- Creating Docker package -- checking policy
    [existing-job-yes-rr-stopped] -- Creating Docker package -- checking if package FQN is taken
    [existing-job-yes-rr-stopped] -- Creating Docker package -- fetching image metadata
    [existing-job-yes-rr-stopped] -- Creating Docker package -- using local image metadata
    [existing-job-yes-rr-stopped] -- Creating Docker package -- creating package
    [existing-job-yes-rr-stopped] -- Creating Docker package -- all layers downloaded
    [manifest] -- Deploy -- creating "job::/example::existing-job-yes-rr-started"
    [manifest] -- Deploy -- created "job::/example::existing-job-yes-rr-started"
    [manifest] -- Deploy -- creating "job::/example::existing-job-yes-rr-stopped"
    [manifest] -- Deploy -- created "job::/example::existing-job-yes-rr-stopped"
    [manifest] -- Deploy -- creating "job::/example::existing-job-no-rr-ready"
    [manifest] -- Deploy -- created "job::/example::existing-job-no-rr-ready"
    [manifest] -- Deploy -- creating "job::/example::existing-job-no-rr-started"
    [manifest] -- Deploy -- created "job::/example::existing-job-no-rr-started"
    [manifest] -- Deploy -- creating "job::/example::existing-job-no-rr-stopped"
    [manifest] -- Deploy -- created "job::/example::existing-job-no-rr-stopped"
    [manifest] -- Deploy -- creating "job::/example::existing-job-yes-rr-ready"
    [manifest] -- Deploy -- created "job::/example::existing-job-yes-rr-ready"
    [manifest] -- Finish -- execution was successful

    Get a summary of the current status of the deployment with APC MANIFEST STATUS.
    Monitor the status of the deployment of each job with APC JOB INSTANCES and/or APC EVENT SUBSCRIBE.

     ────────────────────────────────────────────┬─────────┬───────────┬────────────┬───────────┬──────────────┬────────────────┬────────
    │ Job                                        │ Status  │ Instances │ Up To Date │ Available │ Rolling Mode │ Reconciliation │ Errors │
    ├────────────────────────────────────────────┼─────────┼───────────┼────────────┼───────────┼──────────────┼────────────────┼────────┤
    │ job::/example::existing-job-no-rr-ready    │ ready   │ 5         │ 0          │ 0         │ false        │ -              │ -      │
    │ job::/example::existing-job-no-rr-started  │ started │ 5         │ 5          │ 0         │ false        │ -              │ -      │
    │ job::/example::existing-job-no-rr-stopped  │ started │ 5         │ 5          │ 0         │ false        │ -              │ -      │
    │ job::/example::existing-job-yes-rr-ready   │ ready   │ 5         │ 0          │ 0         │ true         │ -              │ -      │
    │ job::/example::existing-job-yes-rr-started │ started │ 5         │ 0          │ 5         │ true         │ active         │ 0/5    │
    │ job::/example::existing-job-yes-rr-stopped │ started │ 5         │ 5          │ 0         │ true         │ active         │ 0/0    │
     ────────────────────────────────────────────┴─────────┴───────────┴────────────┴───────────┴──────────────┴────────────────┴────────
    Use APC JOB INSTANCES <job> to get more details.
    Success!

At the end of its output, APC MANIFEST DEPLOY includes the same table that APC MANIFEST STATUS displays. Note that reconciliation might not have finished by the time this table is displayed.

To monitor the deployment of an MRM, use APC MANIFEST STATUS with the JSON file that was deployed, or monitor individual jobs using reconciliation events or the offline reports.
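
A minimal watch loop (a sketch that assumes only that apc manifest status accepts the deployed JSON file, as described above; the helper name is illustrative):

    import subprocess
    import time

    def watch_manifest(manifest_file, interval_s=5.0):
        # Re-run `apc manifest status <file>` until interrupted (Ctrl-C).
        try:
            while True:
                subprocess.run(["apc", "manifest", "status", manifest_file])
                time.sleep(interval_s)
        except KeyboardInterrupt:
            pass

    watch_manifest("create_docker.json")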

Update

All job operations and updates that require the job to be restarted will perform the restart in rolling mode if Rolling Mode is configured.

APC

    # While the update is under way.
    $ apc job update example-go --disk 512Mb
     ──────────────────────────────
    │     Job Update Settings      │
    ├───────┬──────────────────────┤
    │  FQN: │ job::/rr::example-go │
    │ Disk: │ 512Mb                │
     ───────┴──────────────────────

    Is this correct? [Y/n]: 
    This operation requires restarting running instances.
    The restart will be performed in rolling mode. Proceed? [Y/n]: 
    Applying update...done
    Instances up-to-date and running: 9/40. Stopped: 9/40. Errors: 0/0 (threshold). 

    # After the update has completed.
    $ apc job update example-go --disk 512Mb
     ──────────────────────────────
    │     Job Update Settings      │
    ├───────┬──────────────────────┤
    │  FQN: │ job::/rr::example-go │
    │ Disk: │ 512Mb                │
     ───────┴──────────────────────
    
    Is this correct? [Y/n]: 
    This operation requires restarting running instances.
    The restart will be performed in rolling mode. Proceed? [Y/n]: 
    Applying update...done
    Instances up-to-date and running: 40/40. Stopped: 40/40. Errors: 0/0 (threshold).
    Success!

Another example:

    $ apc network join net1 --job example-go  
    This operation requires restarting running instances.
    The restart will be performed in rolling mode.
    Joining "job::/rr::example-go" to "network::/rr::net1"... done
    Instances up-to-date and running: 38/40. Stopped: 38/40. Errors: 0/0 (threshold).
    
    $ apc network leave net1 --job example-go
    This operation requires restarting running instances.
    The restart will be performed in rolling mode.
    Removing "job::/rr::example-go" from "network::/rr::net1"... done
    Instances up-to-date and running: 4/40. Stopped: 4/40. Errors: 0/0 (threshold).

Web Console

![Alt text](/assets/img/jobs/job_rolling_update_notification.png "Rolling update notification"){: style="max-width: 100%; border: 1px solid black;"}

Restart

The reconciliation process also carries out restarts. Instances are replaced one by one with the same make-before-break behavior, even though there is no new state to reconcile the job with. Because the process is the same, a restart may incur reconciliation errors.

APC

    $ apc job restart example-go
    The restart will be performed in rolling mode.
    Restarting...
    Instances started: 6/35. Stopped: 6/35. Errors: 0/0 (threshold).

Web Console

![Alt text](/assets/img/jobs/job_rolling_restart_notification.png "Rolling restart notification"){: style="max-width: 100%; border: 1px solid black;"}
![Alt text](/assets/img/jobs/job_online_progress_single_job.png "Rolling restart progress report"){: style="max-width: 100%; border: 1px solid black;"}

Error handling

As explained in State reconciliation > Failure, the reconciliation process has a degree of failure tolerance that can be configured on the job (see Configuration).

Online reports display reconciliation errors as they happen; offline reports display the current list of errors.
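
Judging by the transcripts that follow, reconciliation pauses once the error count exceeds (not merely reaches) the job's Rolling Failure Threshold; compare "Errors: 1/10 (threshold)" (no pause) with "Errors: 6/5 (threshold)" (pause). As a one-line sketch (illustrative, not Apcera code):

    def should_pause(failures, failure_threshold=0):
        # Threshold defaults to 0: with no explicit configuration,
        # the first reconciliation error pauses the process.
        return failures > failure_threshold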

A reconciliation error is displayed in the following format, where square brackets denote optional parts:

    [instance UUID]: <error_message> [exit code | reason]: <debug_message>: <timestamp>

The debug message indicates which log file(s) may contain more information about the failure. A hypothetical parser for these lines is sketched below.
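
For scripted log analysis, a parser for these error lines (as they appear in the APC transcripts below) might look like the following; the regular expression and names are illustrative, not part of the platform:

    import re

    # Matches lines such as:
    #   instance deee857d: exited abnormally (exit code: 1):
    #   check the instance's logs (2018-03-12 07:14:41.086595147 +0000 UTC)
    ERROR_RE = re.compile(
        r"^(?:instance (?P<uuid>[0-9a-f-]+): )?"  # optional instance UUID
        r"(?P<message>[^(:]+?)"                   # error message
        r"(?: \((?P<detail>[^)]*)\))?: "          # optional exit code / reason
        r"(?P<debug>[^(]+) "                      # which log file(s) to check
        r"\((?P<timestamp>.+)\)$"                 # timestamp
    )

    line = ("instance deee857d: exited abnormally (exit code: 1): "
            "check the instance's logs (2018-03-12 07:14:41.086595147 +0000 UTC)")
    m = ERROR_RE.match(line)
    assert m is not None and m.group("uuid") == "deee857d"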

Errors

APC

In the following example, the job has a Rolling Failure Threshold of 10. During the restart it experienced 11 reconciliation errors, which caused the reconciliation process to pause.

Note that four different kinds of errors occurred:

  1. instance X: exited abnormally (exit code: 1): check the instance's logs: The instance's start command exited with a non-zero code almost immediately after it was executed. The instance's log file may provide context (access it with APC JOB LOGS and filter by instance UUID).
  2. instance X: aborted by the container runtime (reason: process failed): check the IM logs: The instance exited, but the IM couldn't retrieve the exit code. The IM log, around the timestamp of the error, will provide context.
  3. start request timed out: check the JM and IM logs: The reconciliation process did not observe an instance start in response to its last start request after waiting for 1 minute. The following reasons can explain such a failure:
    • all IMs were unavailable,
    • IMs lacked the resources to accept and schedule a new instance,
    • no IM was suitable to have the instance scheduled on it (affinity/scheduling tags were not satisfied).
      The JM log will explain what really happened; the IM logs will help support the explanation. Note that the lack of a more specific error message is a limitation of the HM; in particular, scheduling failures are detected more accurately and quickly in non-rolling mode (in less than 1 minute).
  4. instance X: aborted by the container runtime (reason: instance timed out during startup): check the IM logs: The job has a mandatory port and the instance did not respond to the startup health/readiness check. Note that periodic TCP liveness probes don't cause reconciliation errors, because they are executed after startup.
    Also note that UDP ports don't have startup health/readiness checks, so they cannot cause reconciliation errors.

These are examples of the types of errors described in State reconciliation > Failure. There can be others; for example, binding configuration failures raise aborted by the container runtime errors.

    $ apc job restart example-go
    The restart will be performed in rolling mode.
    Restarting...
    Warning: instance deee857d: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 07:14:41.086595147 +0000 UTC)
    Warning: instance e73e019d: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 07:14:42.664314609 +0000 UTC)
    Warning: instance d1d5ed90: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 07:14:54.698044278 +0000 UTC)
    Warning: instance 4e24f56c: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 07:14:56.197512144 +0000 UTC)
    Warning: instance ece6c9dc: aborted by the container runtime (reason: process failed): check the IM logs (2018-03-12 07:15:15.920149749 +0000 UTC)
    Warning: instance 38b15673: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 07:15:17.404951223 +0000 UTC)
    Warning: instance 74d5c6fb: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 07:15:24.834624693 +0000 UTC)
    Warning: start request timed out: check the JM and IM logs (2018-03-12 07:15:26.349856476 +0000 UTC)
    Warning: instance 6555739d: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 07:15:34.220272796 +0000 UTC)
    Warning: instance 0fe1fa06: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 07:15:37.327927821 +0000 UTC)
    Warning: instance 6a2d0086: aborted by the container runtime (reason: instance timed out during startup): check the IM logs (2018-03-12 07:15:38.73896946 +0000 UTC)
    
    Instances started: 28/30. Stopped: 28/30. Errors: 11/10 (threshold).            
    Error: the state reconciliation process was paused: submit another restart or update request, or resume the reconciliation process explicitly

In the following example, the job has a Rolling Failure Threshold of 0 (the default when not configured explicitly). Reconciliation paused at the first error.

    $ apc job restart example-go
    The restart will be performed in rolling mode.
    Restarting...
    Warning: instance e131de3c: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 07:18:38.710685101 +0000 UTC)
    Instances started: 3/30. Stopped: 3/30. Errors: 1/0 (threshold).                
    Error: the state reconciliation process was paused: submit another restart or update request, or resume the reconciliation process explicitly

In the following example, the restart succeeded in spite of the reconciliation error: the reconciliation process tolerated the failure because the Rolling Failure Threshold was high enough. Subsequent deployment, update, and restart requests start out with a count of 0 reconciliation errors.

    $ apc job restart example-go
    The restart will be performed in rolling mode.
    Restarting...
    Warning: instance 93d24d36: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:04:03.389872657 +0000 UTC)
    Instances started: 5/5. Stopped: 5/5. Errors: 1/10 (threshold).  
    $ apc job restart example-go
    The restart will be performed in rolling mode.
    Restarting...
    
    Instances started: 1/5. Stopped: 1/5. Errors: 0/10 (threshold).   

Web Console

The web console also displays reconciliation errors. The list of errors can be expanded (and collapsed).

Closing the dialog window has no impact on the reconciliation process, nor does losing the client connection to the cluster; however, the online progress report cannot be recovered in either case. Use offline reports to continue monitoring progress in such an event.

Handling

When the reconciliation process pauses during a deployment or update, the job will be left running with a mix of job versions (in most cases only two: the original version, and the version whose rollout caused the pause).

In this example, 6 reconciliation errors occurred when only 5 are tolerated. As the APC JOB INSTANCES report shows, only 9 instances are running with the latest change (the new start timeout); the other 6 run without it. Availability remains at 100%, though, because out-of-date instances are also fully functional.

    $ apc job update example-go --timeout 15
     ───────────────────────────────────────
    │          Job Update Settings          │
    ├────────────────┬──────────────────────┤
    │           FQN: │ job::/rr::example-go │
    │ Start Timeout: │ 15 (seconds)         │
     ────────────────┴──────────────────────
    
    Is this correct? [Y/n]: 
    This operation requires restarting running instances.
    The restart will be performed in rolling mode. Proceed? [Y/n]: 
    Applying update...done
    Warning: instance 9f77ff21: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:12:05.656570628 +0000 UTC)
    Warning: instance 3d4bf549: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:12:08.656747481 +0000 UTC)
    Warning: instance 2495e163: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:12:15.998577869 +0000 UTC)
    Warning: instance de6bbce8: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:12:20.722587332 +0000 UTC)
    Warning: instance 0128bdd0: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:12:22.177632616 +0000 UTC)
    Warning: instance 2f7f37d7: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:12:25.188550009 +0000 UTC)
    Instances up-to-date and running: 9/15. Stopped: 9/15. Errors: 6/5 (threshold). 
    Error: the state reconciliation process was paused: submit another restart or update request, or resume the reconciliation process explicitly
    
    $ apc job instances example-go
    Looking up "example-go"... done
    Rolling Mode: true
    Reconciliation Status: paused
    Reconciliation Errors: 6/5
    Warning: instance 9f77ff21: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:12:05.656570628 +0000 UTC)
    Warning: instance 3d4bf549: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:12:08.656747481 +0000 UTC)
    Warning: instance 2495e163: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:12:15.998577869 +0000 UTC)
    Warning: instance de6bbce8: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:12:20.722587332 +0000 UTC)
    Warning: instance 0128bdd0: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:12:22.177632616 +0000 UTC)
    Warning: instance 2f7f37d7: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:12:25.188550009 +0000 UTC)
    Up To Date and Available: 9/15
     ──────────┬─────────┬────────────┬───────────┬──────────────┬────────┬──────────────────
    │ UUID     │ Status  │ Up To Date │ Available │ Will Restart │ Uptime │ Host             │
    ├──────────┼─────────┼────────────┼───────────┼──────────────┼────────┼──────────────────┤        
    │ 07341547 │ RUNNING │ Yes        │ Yes       │ No           │ 29s    │ host-8ed2538f    │
    │ 0ddd8576 │ RUNNING │ Yes        │ Yes       │ No           │ 17s    │ host-43e3d4c8    │
    │ 1e6b3c6a │ RUNNING │ Yes        │ Yes       │ No           │ 24s    │ host-8ed2538f    │
    │ 4a77b236 │ RUNNING │ Yes        │ Yes       │ No           │ 25s    │ host-43e3d4c8    │
    │ 7cb2bd51 │ RUNNING │ Yes        │ Yes       │ No           │ 26s    │ host-8ed2538f    │
    │ 800df3fb │ RUNNING │ Yes        │ Yes       │ No           │ 13s    │ host-43e3d4c8    │
    │ 8e047141 │ RUNNING │ Yes        │ Yes       │ No           │ 32s    │ host-8ed2538f    │
    │ f3e40229 │ RUNNING │ Yes        │ Yes       │ No           │ 22s    │ host-43e3d4c8    │
    │ f77c0226 │ RUNNING │ Yes        │ Yes       │ No           │ 19s    │ host-8ed2538f    │
    │ 9dc2153c │ RUNNING │ No         │ Yes       │ Yes          │ 8m19s  │ host-87febd29    │
    │ f1b77eb1 │ RUNNING │ No         │ Yes       │ Yes          │ 1m45s  │ host-43e3d4c8    │
    │ 12fde6f3 │ RUNNING │ No         │ Yes       │ Yes          │ 1m45s  │ host-43e3d4c8    │
    │ 8884847c │ RUNNING │ No         │ Yes       │ Yes          │ 1m45s  │ host-43e3d4c8    │
    │ 1106691a │ RUNNING │ No         │ Yes       │ Yes          │ 8m14s  │ host-87febd29    │
    │ 8d6f0155 │ RUNNING │ No         │ Yes       │ Yes          │ 1m45s  │ host-d7bfa338    │
    ├──────────┼─────────┼────────────┼───────────┼──────────────┼────────┼──────────────────┤
    │          │         │ Total: 9   │ Total: 15 │ Total: 6     │        │                  │
     ──────────┴─────────┴────────────┴───────────┴──────────────┴────────┴──────────────────

The job will remain in this state, with its reconciliation process paused, until one of the following events takes place:

  • Instances begin to fail at random: if the job’s restart mode is {always, on failure}, HM will begin to replenish the job by starting new instances to replace the failed ones. This is particularly dangerous, because the new instances will be started with the latest version of the job, the one that caused the reconciliation to pause (i.e. a bad change). To contain the impact of such actions, HM will set the job in the flapping state (see docs.apcera.com > Monitoring and Managing Job Instance Health > Instance states and flags > flapping). The flapping state will be cleared on the next job update (or after the flapping window has elapsed).

  • A new update is submitted by the user: any operation that updates the job will resume the process of reconciliation implicitly. During the pause, the user is advised to conduct an investigation and determine why the change was bad (reconciliation error messages are designed to help in the investigation). Then she can submit the update that she thinks might resolve the problem (or redeploy the application with an updated package). The count of reconciliation errors will be cleared and the process will begin the reconciliation anew, now against the version that presumably contains the fix.

  • A restart request is submitted by the user: if the pause was caused by a bad change, it is unlikely that a restart will solve the problem. But if the problem that caused the pause was transient, a restart might help (e.g. network connectivity within the cluster was flaky only when the pause occurred, but not when the subsequent restart took place).

  • The reconciliation process is resumed explicitly: the APC JOB UPDATE --rolling-mode-resume command, and the Resume button in the Web Console, resume a paused reconciliation process explicitly. In this case, the reconciliation process continues right where it left off. Again, if the pause was caused by a bad change, an explicit resumption is unlikely to solve the problem, but it might help when the problem is transient.

In the next example, an update caused the reconciliation to pause at the third error. Observe in the output of APC JOB INSTANCES, taken about 30 minutes later, that the job still runs a mix of versions but has remained available the entire time.

    This operation requires restarting running instances.
    The restart will be performed in rolling mode. Proceed? [Y/n]: 
    Applying update...done
    Warning: instance 6785e006: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:55:12.849550036 +0000 UTC)
    Warning: instance 6b7b454b: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:55:15.841972932 +0000 UTC)
    Warning: instance 81745b16: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:55:22.200466941 +0000 UTC)
    Instances up-to-date and running: 8/20. Stopped: 6/20. Errors: 3/2 (threshold). 
    Error: the state reconciliation process was paused: submit another restart or update request, or resume the reconciliation process explicitly
    
    $ apc job instances example-go
    Looking up "example-go"... done
    Rolling Mode: true
    Reconciliation Status: paused
    Reconciliation Errors: 3/2
    Warning: instance 6785e006: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:55:12.849550036 +0000 UTC)
    Warning: instance 6b7b454b: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:55:15.841972932 +0000 UTC)
    Warning: instance 81745b16: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 08:55:22.200466941 +0000 UTC)
    Up To Date and Available: 8/20
     ──────────┬──────────┬────────────┬───────────┬──────────────┬────────┬───────────────
    │ UUID     │ Status   │ Up To Date │ Available │ Will Restart │ Uptime │ Host          │
    ├──────────┼──────────┼────────────┼───────────┼──────────────┼────────┼───────────────┤
    │ 0ca027d0 │ RUNNING  │ Yes        │ Yes       │ No           │ 30m    │ host-8ed2538f │
    │ 11416803 │ RUNNING  │ Yes        │ Yes       │ No           │ 29m    │ host-8ed2538f │
    │ 1314ff78 │ RUNNING  │ Yes        │ Yes       │ No           │ 29m    │ host-87febd29 │
    │ 1fa9aa7e │ RUNNING  │ Yes        │ Yes       │ No           │ 28m    │ host-87febd29 │
    │ 35afab94 │ RUNNING  │ Yes        │ Yes       │ No           │ 28m    │ host-43e3d4c8 │
    │ 975a709c │ RUNNING  │ Yes        │ Yes       │ No           │ 28m    │ host-87febd29 │
    │ afde59ed │ RUNNING  │ Yes        │ Yes       │ No           │ 28m    │ host-43e3d4c8 │
    │ d22f5a98 │ RUNNING  │ Yes        │ Yes       │ No           │ 27m    │ host-43e3d4c8 │
    │ 445797bf │ RUNNING  │ No         │ Yes       │ Yes          │ 50m    │ host-8ed2538f │
    │ 4ea79ad6 │ RUNNING  │ No         │ Yes       │ Yes          │ 53m    │ host-87febd29 │
    │ 7d108bb5 │ RUNNING  │ No         │ Yes       │ Yes          │ 54m    │ host-8ed2538f │
    │ 777d64cc │ RUNNING  │ No         │ Yes       │ Yes          │ 53m    │ host-8ed2538f │
    │ 199c4ef8 │ RUNNING  │ No         │ Yes       │ Yes          │ 52m    │ host-8ed2538f │
    │ 7bf4802e │ RUNNING  │ No         │ Yes       │ Yes          │ 50m    │ host-43e3d4c8 │
    │ e96db003 │ RUNNING  │ No         │ Yes       │ Yes          │ 54m    │ host-87febd29 │
    │ 28212950 │ RUNNING  │ No         │ Yes       │ Yes          │ 52m    │ host-8ed2538f │
    │ 159dc3c0 │ RUNNING  │ No         │ Yes       │ Yes          │ 49m    │ host-87febd29 │
    │ 4f5a1b9f │ RUNNING  │ No         │ Yes       │ Yes          │ 54m    │ host-87febd29 │
    │ 0ad70668 │ RUNNING  │ No         │ Yes       │ Yes          │ 53m    │ host-8ed2538f │
    │ b7d42fcb │ RUNNING  │ No         │ Yes       │ Yes          │ 52m    │ host-87febd29 │
    ├──────────┼──────────┼────────────┼───────────┼──────────────┼────────┼───────────────┤
    │          │          │ Total: 8   │ Total: 20 │ Total: 12    │        │               │
     ──────────┴──────────┴────────────┴───────────┴──────────────┴────────┴───────────────

Assume that the bug was found and that the fix is a change to the start command. The job update that fixes the start command resumes the reconciliation process, starting out with 0 errors, and runs through to completion. The 12 instances that ran the original version and the 8 that ran the bad change (successfully, in their case) are all replaced by instances running the fix.

    This operation requires restarting running instances.  
    The restart will be performed in rolling mode. Proceed? [Y/n]: 
    Applying update...done
    
    Instances up-to-date and running: 20/20. Stopped: 20/20. Errors: 0/2 (threshold). 
    Success!
    
    $ apc job instances example-go
    Looking up "example-go"... done
    Rolling Mode: true
    Reconciliation Status: active
    Reconciliation Errors: 0/2
    Up To Date and Available: 20/20
     ──────────┬─────────┬────────────┬───────────┬──────────────┬────────┬───────────────
    │ UUID     │ Status  │ Up To Date │ Available │ Will Restart │ Uptime │ Host          │
    ├──────────┼─────────┼────────────┼───────────┼──────────────┼────────┼───────────────┤
    │ 096d1691 │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-43e3d4c8 │
    │ 1cfdddab │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-8ed2538f │
    │ 24f76e9c │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-8ed2538f │
    │ 277f0cca │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-43e3d4c8 │
    │ 31a8ebb6 │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-43e3d4c8 │
    │ 38d89eaf │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-8ed2538f │
    │ 4df7c207 │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-8ed2538f │
    │ 599a7423 │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-8ed2538f │
    │ 5afc0508 │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-8ed2538f │
    │ 679f6541 │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-87febd29 │
    │ 7af61c81 │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-43e3d4c8 │
    │ 9ffdcb4d │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-87febd29 │
    │ a869f124 │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-8ed2538f │
    │ c2885290 │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-43e3d4c8 │
    │ c873aabd │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-87febd29 │
    │ d3911878 │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-43e3d4c8 │
    │ d8941534 │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-87febd29 │
    │ e8d8c5cf │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-87febd29 │
    │ f2580c53 │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-87febd29 │
    │ f9f6dff2 │ RUNNING │ Yes        │ Yes       │ No           │ 6s     │ host-87febd29 │
    ├──────────┼─────────┼────────────┼───────────┼──────────────┼────────┼───────────────┤
    │          │         │ Total: 20  │ Total: 20 │ Total: 0     │        │               │
     ──────────┴─────────┴────────────┴───────────┴──────────────┴────────┴───────────────

The reconciliation process may also be paused explicitly by the user. Doing so is helpful when the reconciliation takes a long time (e.g. the number of instances is large, or they take a long time to start) and the user begins to notice problems that don't manifest as reconciliation errors. It's also helpful when the user decides that the update or redeployment should not have taken place: she can pause the rollout and then roll out a change that undoes the previous one.

Reconciliation events

The reconciliation process emits events on the job's event stream. To retrieve them, subscribe with APC EVENT SUBSCRIBE <job>.

A reconciliation event is identified by the job_rolling_mode_action event_id and contains the following information:

  • action: one of:
    • enabled: Rolling Mode was enabled on the job.
    • disabled: Rolling Mode was disabled on the job.
    • job_update: An update was made to the job. Reconciliation will follow.
    • reconciliation_started: The reconciliation started (triggered by redeployment, update, or restart).
    • starting_replacement_instance: Attempt to MAKE.
    • replacement_instance_running: MAKE succeeded.
    • stopping_replaced_instance: BREAK.
    • instance_failed_to_start: Reconciliation error. The action_detail specifies the exact error.
    • reconciliation_complete: Reconciliation finished successfully.
    • paused: Reconciliation paused (due to errors or an explicit pause).
    • waiting_on_affinity_peer/done_waiting_on_affinity_peer: The reconciliation yielded because a job that this job depends on is not ready.
    • waiting_on_unhealthy_job/done_waiting_on_unhealthy_job: The job doesn't have the expected number of instances (too many or too few), or is flapping, and the reconciliation yielded to other health functions (see Coexistence with other health functions).
  • failures: the number of reconciliation errors at the time the event was emitted.
  • max_failures: the job's Rolling Failure Threshold.
  • paused: whether the reconciliation process was paused when the event was emitted.
  • waiting_on_affinity_peer: whether the reconciliation process is waiting for the jobs that this job depends on to become ready.
  • waiting_on_unhealthy_job: whether the reconciliation process yielded to other health functions (over- and under-scheduling, and flapping; see Coexistence with other health functions).
  • time: the timestamp of the event.

Monitoring of reconciliation can easily be automated or scripted by parsing the stream of reconciliation events, as sketched below.
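
As a sketch (assuming, as the transcripts below suggest, that APC EVENT SUBSCRIBE prints one JSON event per line), the following script follows a job's event stream and reports reconciliation progress; the function name and output format are illustrative:

    import json
    import subprocess
    import sys

    def watch_reconciliation(job_fqn):
        # Follow the job's event stream via `apc event subscribe <job>`.
        proc = subprocess.Popen(
            ["apc", "event", "subscribe", job_fqn],
            stdout=subprocess.PIPE,
            text=True,
        )
        for raw in proc.stdout:
            raw = raw.strip()
            if not raw.startswith("{"):
                continue  # skip blank lines and any non-JSON output
            event = json.loads(raw)
            if event.get("event_id") != "job_rolling_mode_action":
                continue
            payload = event["payload"]
            action = payload["action"]
            print("%s: errors %s/%s"
                  % (action, payload["failures"], payload["max_failures"]))
            if action == "reconciliation_complete":
                break  # reconciliation finished successfully
            if payload.get("paused"):
                sys.exit("reconciliation paused: intervention required")

    watch_reconciliation("job::/rr::example-go")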

Example: success:

    $ apc job update example-go -m 256
     ────────────────────────────────
    │      Job Update Settings       │
    ├─────────┬──────────────────────┤
    │    FQN: │ job::/rr::example-go │
    │ Memory: │ 256                  │
     ─────────┴──────────────────────
    
    Is this correct? [Y/n]: 
    This operation requires restarting running instances.
    The restart will be performed in rolling mode. Proceed? [Y/n]: 
    Applying update...done
    Instances up-to-date and running: 5/5. Stopped: 5/5. Errors: 0/0 (threshold). 

    $ apc event subscribe job::/rr::example-go
    
    # Job update. Reconciliation will follow.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"job_update","failures":0,"max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846671142982400,"type":0}
    
    
    # Reconciliation starts.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"reconciliation_started","failures":0,"max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846671148311552,"type":0}


    # 1st MAKE-THEN-BREAK.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"starting_replacement_instance","failures":0,"max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846671148412160,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"replacement_instance_running","failures":0,"instance_uuid":"81179a3e-128c-4133-8fcd-2b30f5ca3eeb","max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846672113527296,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"stopping_replaced_instance","failures":0,"instance_uuid":"57a1ff83-bba9-4709-90ee-0b94f022388c","max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846672118870528,"type":0}


    # 2nd MAKE-THEN-BREAK.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"starting_replacement_instance","failures":0,"max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846672123834368,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"replacement_instance_running","failures":0,"instance_uuid":"6c184c82-eed0-422d-9710-4950c735bf87","max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846673147912960,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"stopping_replaced_instance","failures":0,"instance_uuid":"f239be1a-9618-4d2d-9fba-c3c93c0951ee","max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846673151539200,"type":0}


    # 3rd MAKE-THEN-BREAK.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"starting_replacement_instance","failures":0,"max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846673161676544,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"replacement_instance_running","failures":0,"instance_uuid":"e5495be3-00b0-44fc-9846-c3ed188214b9","max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846674119842048,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"stopping_replaced_instance","failures":0,"instance_uuid":"9adec88a-a246-4cce-9867-415fd3ed306b","max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846674125035520,"type":0}


    # 4th MAKE-THEN-BREAK.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"starting_replacement_instance","failures":0,"max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846674134782976,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"replacement_instance_running","failures":0,"instance_uuid":"8402fc9b-8e7c-4682-934d-ee00bb71b83c","max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846675142730240,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"stopping_replaced_instance","failures":0,"instance_uuid":"6c84a153-8a45-4c98-9fab-f0fdf262429f","max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846675146165760,"type":0}


    # 5th MAKE-THEN-BREAK.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"starting_replacement_instance","failures":0,"max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846675152953088,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"replacement_instance_running","failures":0,"instance_uuid":"f0adfb40-21b2-4bbe-928d-bae9b780c9a8","max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846675991412736,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"stopping_replaced_instance","failures":0,"instance_uuid":"9a8db0a0-1ea5-4bb9-99b2-ae28d041f784","max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846675994513920,"type":0}


    # Reconciliation finishes.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"reconciliation_complete","failures":0,"max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520846675996534784,"type":0}

Example: Rolling Mode disabled and enabled:

    $ apc job update example-go --rolling-mode-disable
    ╭──────────────────────────────────────╮
    │         Job Update Settings          │
    ├───────────────┬──────────────────────┤
    │          FQN: │ job::/rr::example-go │
    │ Rolling Mode: │ disable              │
    ╰───────────────┴──────────────────────╯

    Is this correct? [Y/n]: y
    Applying update...done
    Success!

    $ apc event subscribe job::/rr::example-go
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"disabled","failures":0,"max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520847097503272960,"type":0}

    $ apc job update example-go --rolling-mode-enable
    ╭──────────────────────────────────────╮
    │         Job Update Settings          │
    ├───────────────┬──────────────────────┤
    │          FQN: │ job::/rr::example-go │
    │ Rolling Mode: │ enable               │
    ╰───────────────┴──────────────────────╯

    Is this correct? [Y/n]: y
    Applying update...done
    Success!

    $ apc event subscribe job::/rr::example-go
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"enabled","failures":0,"max_failures":0,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::example-go","time":1520847103550436608,"type":0}

Example: reconciliation errors and pause:

    This operation requires restarting running instances.
    The restart will be performed in rolling mode. Proceed? [Y/n]: 
    Applying update...done
    Warning: instance 1d5ea2db: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 09:45:39.864848938 +0000 UTC)
    Warning: instance 5f5e876b: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 09:45:41.420974538 +0000 UTC)
    Warning: instance a8b317de: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 09:45:42.905475949 +0000 UTC)
    Warning: instance dafba7e5: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 09:45:44.32857961 +0000 UTC)
    Warning: instance 9f315e14: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 09:45:47.275960171 +0000 UTC)
    Warning: instance 129d86f7: exited abnormally (exit code: 1): check the instance's logs (2018-03-12 09:45:48.632553462 +0000 UTC)
    Instances up-to-date and running: 1/5. Stopped: 1/5. Errors: 6/5 (threshold).   
    Error: the state reconciliation process was paused: submit another restart or update request, or resume the reconciliation process explicitly


    $ apc event subscribe job::/rr::example-go

    # Job update. Reconciliation will follow.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"job_update","failures":0,"max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847938480890112,"type":0}


    # Reconciliation starts.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"reconciliation_started","failures":0,"max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847938492458240,"type":0}


    # 1st MAKE attempt. 1st reconciliation error.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"starting_replacement_instance","failures":0,"max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847938492549120,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"instance_failed_to_start","action_detail":"process failed","failures":1,"instance_uuid":"1d5ea2db-828b-476d-9a3c-9923a02a2820","max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847939873433856,"type":0}


    # 2nd MAKE attempt. 2nd reconciliation error.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"starting_replacement_instance","failures":1,"max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847939878374912,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"instance_failed_to_start","action_detail":"process failed","failures":2,"instance_uuid":"5f5e876b-e44c-4888-9476-79a95e94b5f8","max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847941433603840,"type":0}


    # 3rd MAKE attempt. 3rd reconciliation error.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"starting_replacement_instance","failures":2,"max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847941439667200,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"instance_failed_to_start","action_detail":"process failed","failures":3,"instance_uuid":"a8b317de-47c2-4762-9646-c0806a64b04a","max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847942916052736,"type":0}


    # 4th MAKE attempt. 4th reconciliation error.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"starting_replacement_instance","failures":3,"max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847942919392768,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"instance_failed_to_start","action_detail":"process failed","failures":4,"instance_uuid":"dafba7e5-b75d-44a4-8d16-f2e7d931b9db","max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847944336540672,"type":0}


    # 1st successful MAKE-THEN-BREAK.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"starting_replacement_instance","failures":4,"max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847944343323904,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"replacement_instance_running","failures":4,"instance_uuid":"d9e9509b-2719-4ce9-85a3-0778844c0a30","max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847945869794816,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"stopping_replaced_instance","failures":4,"instance_uuid":"26901a4a-4f51-42cc-856e-c9b9ef78721d","max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847945875367936,"type":0}


    # 5th MAKE attempt. 5th reconciliation error.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"starting_replacement_instance","failures":4,"max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847945883064064,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"instance_failed_to_start","action_detail":"process failed","failures":5,"instance_uuid":"9f315e14-fd0d-42ed-95bd-ff0e2fdf267a","max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847947288875776,"type":0}


    # 6th MAKE attempt. 6th reconciliation error.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"starting_replacement_instance","failures":5,"max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847947294780672,"type":0}
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"instance_failed_to_start","action_detail":"process failed","failures":6,"instance_uuid":"129d86f7-4d52-49b3-93c9-c3ea4bc69654","max_failures":5,"paused":false,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847948643163392,"type":0}


    # Pause.
    {"event_id":"job_rolling_mode_action","event_source":"","payload":{"action":"paused","action_detail":"failure count exceeded threshold","failures":6,"max_failures":5,"paused":true,"waiting_on_affinity_peer":false,"waiting_on_unhealthy_job":false},"resource":"job::/rr::rando","time":1520847948651228416,"type":0}

Policy

There is no policy specific to rolling restarts. Any common read/write policy that applies to jobs also applies to jobs that initiate a rolling restart or update.