Migrating to Store3 (Consul)

This section describes how to migrate from component store2 (PostgreSQL DB) to component store3 (Consul).

Overview

To improve cluster availability, Apcera Platform release 3.0 changes the component datastore from PostgreSQL DB to the Consul key value store, also called store3. Newly installed clusters use store3 by default. Existing clusters that are upgraded to release 3.0 will continue to use store2 (Postgres). Once your cluster has been upgraded to release 3.0, you can optionally migrate the cluster's component store from store2 to store3 by following the procedure described below.
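
If you are not sure which store an existing cluster is configured for, check the continuum.component_database section of your cluster.conf before planning the migration. A grep along the following lines can help (a sketch only; run it from wherever you keep your cluster.conf, and note that the section may not be present at all on some clusters):

     grep -A 3 '"component_database"' cluster.conf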

Migration process

The Job Manager (JM), Package Manager (PM), and Auth (Security) Server (AS) communicate with the component store. Each component performs the migration in the same way, but not at the same time. Migration is attempted if the cluster.conf file contains the continuum.component_database.migrate_store2_to_store3 flag set to true; the flag defaults to false.

You can optionally perform a dry run migration on each component (JM, PM and AS) to catch any potential issues before performing the actual migration.

The migration process uses Consul locking to ensure that only one component is allowed to perform the migration. This means that after the migration succeeds once per component type (JM/PM/AS), the presence of the flag does not change cluster behavior. A component that acquires the lock and starts the migration first deletes all store3 keys in the destination key prefix before migrating any store2 data; this cleans up the results of previously failed partial migrations. If the migration finishes successfully, the component sets a special key in Consul to prevent the migration from triggering again, which ensures that store3 data created by a successful migration is never deleted by a later attempt.
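
If you want to inspect that "migration complete" key directly, you can query Consul from a Central host. The command below is a sketch: it assumes the consul binary is available locally and that your environment does not require extra ACL or TLS options, and it uses the Job Manager key path shown in the log example later in this section (the Package Manager and Auth Server use their own prefixes):

     consul kv get apcera/store3/jmdb/migration/.done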

Migration requirements

Note the following requirements and caveats about the migration process:

  • Do not migrate a cluster that has never been used with store2. Release 3.0 clusters still use Postgres for the Audit Log component, and secret storage (Vault) might have valid Postgres credentials, so the cluster may try to use an existing Postgres DB to migrate (non-existent) store2 data.
  • Reverse migration is not supported. Once you successfully migrate to using store3 you cannot go back to store2.
  • If something goes wrong during the migration, the migration is not marked as successful and the deploy fails. If the problem is permanent, such as a bad record that cannot be inserted into Consul, components will flap indefinitely trying to migrate. At that point it is recommended to change continuum.component_database.kind back to store2, clear the migrate_store2_to_store3 attribute, and re-deploy to restore store2 (see the sketch after this list).
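
For that rollback, the relevant cluster.conf fragment is roughly the following sketch, using the same chef block format shown in the migration steps below, with the migrate_store2_to_store3 attribute removed:

     chef: {
       "continuum": {
         "component_database": {
           "kind": "store2"
         }
       }
     }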

Migration steps

To migrate from store2 to store3 and start using store3 in all components, complete the following steps.

To migrate from store2 to store3:

  1. (optional) To trigger a dry run of the migration on each component (JM, PM, and AS), SSH into each Central host in your Apcera cluster and run each component with the dry_run_store_migration flag, e.g.:

     /opt/apcera/continuum/bin/job_manager  -f /opt/apcera/continuum/conf/job_manager.conf \
     -db.kind store3 \
     -db.migrate_store2_to_store3 \
     -dry_run_store_migration
    

    The component will exit after the dry run completes. Repeat this step, as desired, with the Package Manager (/opt/apcera/continuum/bin/package_manager) and Auth Server (/opt/apcera/continuum/bin/auth_server) components, and corresponding configuration files (/opt/apcera/continuum/conf/package_manager.conf and /opt/apcera/continuum/conf/auth_server.conf). Multiple dry run processes can safely be run simultaneously.
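
    Assuming the Package Manager and Auth Server take the same flags as the Job Manager invocation above (as described in the previous paragraph), the corresponding dry runs look like this:

     /opt/apcera/continuum/bin/package_manager -f /opt/apcera/continuum/conf/package_manager.conf \
     -db.kind store3 \
     -db.migrate_store2_to_store3 \
     -dry_run_store_migration

     /opt/apcera/continuum/bin/auth_server -f /opt/apcera/continuum/conf/auth_server.conf \
     -db.kind store3 \
     -db.migrate_store2_to_store3 \
     -dry_run_store_migration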

    When you are satisfied with the results of the dry run migration, continue to the next step.

  2. Stop component processes. This step is optional but recommended: it ensures there are no new writes to any component database by not-yet-updated components during and after the migration. Stopping the Health Manager is recommended as well, to avoid any race conditions on the Job Manager. To stop the component processes, SSH into each Central host and run sv stop on the Job Manager (continuum-job-manager), Package Manager (continuum-package-manager), and Auth (Policy) Server (continuum-auth-server) services, as well as the Health Manager (continuum-health-manager), for example:
    sv stop continuum-job-manager
    ok: down: continuum-job-manager: 1s, normally up
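
    Repeating this for all four services on a Central host looks like the following:

    sv stop continuum-job-manager
    sv stop continuum-package-manager
    sv stop continuum-auth-server
    sv stop continuum-health-manager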
    
  3. Edit your cluster.conf file by setting continuum.component_database.kind to store3 and continuum.component_database.migrate_store2_to_store3 to true. The cluster will start using store3 immediately after migration.

     chef: {
       "continuum": {
         "component_database": {
           "migrate_store2_to_store3": true,
           "kind": "store3"
         }
       }
     }
    

    (Note that setting continuum.component_database.kind to store2 and continuum.component_database.migrate_store2_to_store3 to true is invalid and will result in components failing to start.)

  4. Re-deploy your cluster with the updated cluster.conf file.
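
    How you re-deploy depends on how your cluster is managed. As an illustration only, a cluster managed with Apcera's orchestrator-cli is typically re-deployed with something along these lines, run from the Orchestrator host against the updated cluster.conf (treat the exact command and flags as an assumption and use whatever deploy procedure you normally follow):

     orchestrator-cli deploy -c cluster.conf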

    The Job Manager, Package Manager, and Auth Server will be able to migrate their store2 data to store3. Assuming continuum.component_database.kind is set to store3, the JM, PM, and AS components will start using the migrated data immediately. Since version numbers do not change during migration, the rest of the system (Health Manager, API Server, etc.) should not notice any changes in the underlying storage.

  5. Verify migration.

    The actual migration happens in the background, not during the deploy. There are entries in each component log showing the migration occurred. For example, if you check the JM, PM, or Auth (Policy) Server logs, you should see messages similar to the following indicating that the migration began and completed successfully.

     Starting store2 -> store3 migration...
    
     2017-07-28 19:52:44.716518056 +0000 UTC [INFO  pid=10182 requestid='' source='migrator.go:236'] Migrated store2 -> store3 (took: 8.819773973s, records migrated: 281, records failed: 0)
    

    One semi-random component will perform the migration, most likely whichever host ran Chef first. If you are running multiple Centrals, check the logs for each component on each host until you locate the log message indicating a successful migration. To search for the successful migration message, run the following commands:

     cat /var/log/continuum-job-manager/current | grep "Migrated store2 -> store3"
     cat /var/log/continuum-package-manager/current | grep "Migrated store2 -> store3"
     cat /var/log/continuum-auth-server/current | grep "Migrated store2 -> store3"
    

    Or, equivalently, run the following commands:

     grep migrator.go /var/log/continuum-job-manager/*
     grep migrator.go /var/log/continuum-package-manager/*
     grep migrator.go /var/log/continuum-auth-server/*
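
    If you are checking several Central hosts, a small loop like this covers all three components at once (a sketch; adjust the log directory names if your installation differs):

     for c in job-manager package-manager auth-server; do
       grep "Migrated store2 -> store3" /var/log/continuum-$c/current
     done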
    

    If you see a message similar to the following, it means another component successfully performed the migration:

     grep migrator.go /var/log/continuum-job-manager/*
     /var/log/continuum-job-manager/current:2017-07-28 18:14:53.110658758 +0000 UTC [INFO  pid=26008 requestid='' source='migrator.go:57'] Starting store2 -> store3 migration...
     /var/log/continuum-job-manager/current:2017-07-28 18:14:53.113883027 +0000 UTC [INFO  pid=26008 requestid='' source='migrator.go:78'] Migration has already been performed (key "apcera/store3/jmdb/migration/.done" exists)
    
  6. (optional) Once you've validated the migration, you can remove the continuum.component_database.migrate_store2_to_store3 entry from cluster.conf or leave it in place; the migration will not run again unless the "migration complete" key is cleared in Consul.
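
    If you do remove the entry, the component_database block in cluster.conf reduces to something like this sketch:

     chef: {
       "continuum": {
         "component_database": {
           "kind": "store3"
         }
       }
     }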