DMTN-191: Schema Migration for Butler Registry Database

  • Andy Salnikov

Latest Revision: 2021-05-27

1   Introduction

It is clear that Registry database schema will continue to improve and evolve over time and there is a need for a well-defined migration process and tools to support that process.

This technical note describes a proposed design of the system to support schema evolution of the Registry database. The system is based on Alembic which is a tool developed to support schema migration for the database applications built with SQLAlchemy. This is the summary of work performed on DM-29593 ticket, many more details of the process are available on that ticket.

2   Schema Versions in Registry

Overall Registry database and schema is split into several domains, today there are six domains defined: - attributes - dimensions - collections - datasets - opaque - datastores

Each domain is controlled by a manager which is an instance of a Python class. Schema of the corresponding domain is determined by the manager class and also by its version number. All versions have a “semantic” numbering format (MAJOR.MINOR.PATCH) and compatibility rules.

For a particular domain there may be more than one class implementing manager for that domain, specific class to use is determined by Butler configuration when REgistry database is created. The version of the class is defined by the class itself and is exposed via member method. When a Registry database is created all manager classes and their corresponding versions are recorded as a database metadata in a special table with the name butller_attributes. Here is an example of the manager class names stored in that table:

                 name                 |                                                value
--------------------------------------+-----------------------------------------------------------------------------------------------------
 config:registry.managers.attributes  | lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager
 config:registry.managers.collections | lsst.daf.butler.registry.collections.synthIntKey.SynthIntKeyCollectionManager
 config:registry.managers.datasets    | lsst.daf.butler.registry.datasets.byDimensions._manager.ByDimensionsDatasetRecordStorageManagerUUID
 config:registry.managers.datastores  | lsst.daf.butler.registry.bridge.monolithic.MonolithicDatastoreRegistryBridgeManager
 config:registry.managers.dimensions  | lsst.daf.butler.registry.dimensions.static.StaticDimensionRecordStorageManager
 config:registry.managers.opaque      | lsst.daf.butler.registry.opaque.ByNameOpaqueTableStorageManager

And this is an example of how version numbers are stored in the same table (some names are truncated to make it fit on a page):

                                                    name                   | value
-----------------------------------------------------------------------------------
 version:lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager | 1.0.0
 version:<...>.bridge.monolithic.MonolithicDatastoreRegistryBridgeManager  | 0.2.0
 version:<...>.collections.synthIntKey.SynthIntKeyCollectionManager        | 2.0.0
 version:<...>.datasets.<...>.ByDimensionsDatasetRecordStorageManagerUUID  | 1.0.0
 version:<...>.dimensions.static.StaticDimensionRecordStorageManager       | 6.0.0
 version:lsst.daf.butler.registry.opaque.ByNameOpaqueTableStorageManager   | 0.2.0

In addition to version numbers we also store a schema digest for each manager, this allows us to detect inadvertent schema changes. Digest is calculated from the definition of all tables controlled by a manager, digest value is stable for a particular back-end but can differ between back-ends.

Obviously schema definitions are not completely independent, e.g. column types used for foreign keys in one manager need to be the same as types of primary keys of referenced table, and the referenced table can belong to a different manager. Thus the schema definition is a function of version number not only of its own manager but also another manager. To simplify our model we ignore that extra dependency and assume that managers are independent, while at the same time understanding that change in a version of one manager can trigger changes in the tables controlled by other managers. Better way to put it may be that management boundaries are not at table level but at individual column level, but that can become complicated too.

3   Schema Changes and Migrations

Every new database created by Registry code uses the latest/current versions of all configured managers defined in the release. When new release is deployed with the existing database the code in the release could have newer version of one or few managers. These new version may be incompatible with the database schema, in that case database schema needs a schema upgrade to match schema version in a new release.

Usually for any incremental schema version change there is a corresponding schema upgrade process called a migration which is a single step bringing database schema from one well-defined state/version into a new state. In some cases release upgrade can result in several migration steps that need to be performed sequentially, e.g. 1.0.0 → 1.0.1 → 1.1.0.

Migrations can include any combination of DDL and DML statements that need to be executed on a database, few simple examples: - creating new table or dropping existing one - adding or dropping columns from a table - inserting/replacing values into columns based on values of existing columns

Some trivial migrations can be performed while users are still using the database, but more complex migrations usually require a scheduled downtime to avoid conflicts.

Schema versions and their corresponding migrations are typically ordered, migrations have to be applied in a specific order. In case of multiple unrelated sub-schemas their respective order may not matter, though in some cases there may be implied dependencies between versions of those sub-schemas.

Registry schema management is complicated by the fact that there may be more than one implementation of a particular manager type. In this situation schema migration may include two distinct types of changes:

  • Regular incremental changes in version of a particular implementation class, e.g. switching from version 1.0.0 of DefaultButlerAttributeManager class to version 1.0.1 of the same class.
  • Switching to different implementation class of the same manager.

For latter case there is no implied ordering between the versions of different manager implementations. If for example we have two manager classes ManagerA and ManagerB, each with its own versions, then these migrations may be necessary to support:

  • from ManagerA 1.0.0 to ManagerB 1.0.0
  • from ManagerA 1.2.0 to ManagerB 1.0.1
  • from ManagerB 1.0.1 to ManagerA 1.0.0
  • from ManagerB 1.0.1 to ManagerA 2.0.0

Using git analogy regular migrations (for the same manager class) can be represented as a linear history on a single revision branch. Latter switching between managers can be thought of as a “checkout” of some state on a different branch, with a complication that data in the database needs to be “ported” to that new branch.

4   Migration Tools

Probably most commonly used tools for schema upgrades is just a collection of SQL scripts that are executed manually by DBAs. In more complicated cases when complex migrations are needed it may also involve dynamic applications which can generate new data or mange dynamic parts of schema. One common complication for such tools is management of multiple variants of migrations and controlling their execution. More advanced migration tools can derive current schema version used by a database from database contents and can automate application of all necessary migrations to bring database up to date.

One of such advanced tools is Alembic. It is designed to work optimally with SQLAlchemy which makes it attractive for our use case as our Registry’s SQL back-end build on top of SQLAlchemy. Alembic is used extensively by many open source software projects and there is significant experience available for support.

Summary of the main Alembic features:

  • Alembic is built on ideas of revisions and migrations, revision is a named version of a database schema (or part of the schema), migration is a script which is ran by alembic to upgrade (or downgrade) schema from one revision to another.
  • Revision names have a syntax of identifier (few characters are reserved and cannot be used in a revision name). Revision names generated by Alembic itself are 12-character hex strings, but client can provide alternative name.
  • Migration scripts are regular Python scripts which contain some metadata, most important is its parent revision(s) and final revision. Migrations are identified by the revisions, normally there cannot be more than one migration script for a revision.
  • Alembic has some support for branching and merging of revisions, somewhat similar to git branches (but not exactly). It also supports collection of completely independent revision branches (a.k.a. forest).
  • Alembic needs to store database current revision(s) in a special table in a database (named alembic_version), there is one record for each active revision branch.
  • Whole revision history is derived from the contents of migration scripts, the database does not store its history.
  • Alembic has flexible configuration, allowing to separate migration scripts into multiple directories and including/excluding those directories.
  • Alembic has CLI with large command set to manage and apply migrations, and Python API which allows execution of the same tasks from any Python script.

This extensive set of features should allow us to design the process that satisfies our requirements and build a tool that wraps those features into a set of command with specific knowledge of Registry design.

5   Versions and Revisions

Alembic history and database current state is expressed in terms of Alembic revisions. Registry identifies its schema state using manager class names and their version numbers. It would be nice to avoid unnecessary duplication and translation between two sets of identifiers that represent essentially the same state.

We already store Registry versions in butler_attributes table, one obvious idea is to avoid using alembic_version table to keep alembic revisions and derive those revisions from Registry versions instead. Unfortunately Alembic doe not provide enough encapsulation for its alembic_version table to try to replace it cleanly with some other mechanism so it is hard to avoid using that table. Still, this idea may be attractive and it may be possible to send a request to Alembic developers to add support for that feature, or make a pull request adding that feature.

Alembic revisions names can be almost arbitrary strings with some limitations:

  • they cannot contain -, + or @ characters
  • migration script name usually includes revision name so it should avoid using characters that are potentially problematic in file names (slashes, colons, semicolons, etc.)

We could use Registry version identifier in some encoded format to represent Alembic revisions. One obvious way could be to combine manager class name and its version, but unfortunately manager class names are unwieldy long which makes it hard for humans to work with. An example could be name like ByDimensionsDatasetRecordStorageManager_0.1.0, would be hard to parse, especially if there are few revisions appearing on the same line (as it is common for history formatting).

Instead of these long revision names and to avoid random-generated Alembic revisions it is suggested to user shorter names that are derived from a deterministic hash of the Registry version, including manager type, manager class name (without module) and manager version name.

6   Revision Forest

As we have multiple (almost) independent domains it makes sense to use independent revision history for each domain. Each domain will have separate revision tree with these properties:

  • Root of the tree represents an empty schema, it is only used for identification and it is assigned a branch label with the name of the manager type (e.g. “datasets”). Its alembic revision name is derived from the same manager type name. Its corresponding migration script will be empty (except for metadata) an it is not is supposed to even run (we do not plan to use alembic to re-create schemas from scratch).
  • Root will have one or more branches originating from it, one branch per manager class. Branches will be labeled after manager class name (e.g. “datasets-ByDimensionsDatasetRecordStorageManager”).
  • Each revision on a branch corresponds to a version of the manager, its revision name is derived from the version number.

Here is an example of a revision tree for “datasets” manager which includes revisions for versions know today:

_images/revision-tree-datasets.png

Note that in our case the branches are mutually exclusive, a database can only be on one branch at a time. Alembic branching model is more flexible, in general a database can contain multiple branches which can be merger at later time.

7   Branch Switching

Complications with the above model arise when there is a need to switch from one branch to another, e.g. “upgrade” datasets manager from ByDimensionsDatasetRecordStorageManager 1.0.0 to ByDimensionsDatasetRecordStorageManagerUUID 1.0.0. This is a simple “checkout” of a revision from a different branch but preserving and/or migration the data. The diagram for this may look like, with new migration arrow in red:

_images/revision-tree-switch-not.png

Unfortunately this sort of tree is not supported by alembic (branch merging is supported in general but not in a configuration like this). One possible workaround for this limitation is to generate separate branch which will have the same registry versions on it but different alembic revisions. This workaround is not very attractive as it will result in many more revisions and branches which is going to be a management issue.

Another workaround for this is to abuse alembic configuration flexibility and temporarily hide regular tree from alembic and replace it with a tree that enables the migration that was not allowed. To illustrate this approach this diagram shows this sort of “one-shot” revision tree with “hidden” parts of a tree grayed out:

_images/revision-tree-switch-hide.png

First migration on this tree (from root to «int-1.0.0») does not need to do anything as migration starting point will be «int-1.0.0» revision. Migration from «int-1.0.0» to «uuid-1.0.0» will have to do actual schema upgrade. After the one-shot schema upgrade the regular tree can be used again to continue migrations along linear branch history. Clearly this mechanism requires extreme care and these one-shot trees (there may be many of those) are normally hidden from alembic configuration. Special switch should be used to select exactly one one-shot tree.

This combination of regular trees with linear history and special one-shot trees should cover all necessary migration options.

8   Migration Directories

Hiding regular revision trees and replacing them with special one-shot trees can be implemented by using multi-directory feature of alembic. In its simplest setup alembic keeps all migration scripts in a single directory, but it can also be configured to search more than one directory for the scripts. This allows almost arbitrary sets of the scripts to be combined together to form complete revision history. The limitation is that current revisions that are specified in the database’s alembic_version table have to be found in the resulting tree(s).

Because our migration history are spit into independent per-manager type trees it is natural to split the scripts between per-manager directories. One-shot trees will be stored in a special location, with a folder per-one-shot migration. Here is a possible directory hierarchy for such setup (migration scripts are named after their corresponding revision name):

├── _alembic             // special folder for alembic use
│   ├── alembic.ini
│   ├── env.py
│   ├── script.py.mako
├── attributes           // folder with scripts for 'attributes' manager
│   ├── f0a3531f97ca.py
│   ├── f22a777cf382.py
├── collections          // folder with scripts for 'collections' manager
│   ├── 079a1bc77f25.py
│   ├── 1a93ca89bc27.py
│   ├── 3ce2d3adf1f5.py
├── datasets             // folder with scripts for 'datasets' manager
│   ├── 059cc7b7ef13.py
│   ├── 2101fbf51ad3.py
│   ├── 38a9414ea7a2.py
│   ├── 576045cb7831.py
│   ├── 635083325f20.py
│   └── eb5a3cc76666.py
├── _oneshot             // folder with scripts for one-shot migrations
│   └── datasets
│       └── int_1.0.0_to_uuid_1.0.0    // a specific one-shot migration
│           ├── 059cc7b7ef13.py
│           ├── 2101fbf51ad3.py
│           └── 635083325f20.py

For regular operations the configuration will include all per-manager directories (attributes, collections, datasets in the tree above). When a special one-shot migration needs to be performed the datasets folder will be excluded for configuration and instead _oneshot/datasets/int_1.0.0_to_uuid_1.0.0 will be used in its place.

9   Command Line Tool

Using alembic CLI for managing of this sort of setup is not very convenient, one would need to modify its configuration filer frequently to reflect location of the migration sets and database connection string. It would be straightforward to implement our own set of commands that know how to manipulate alembic configuration, extract database connection string from butler configuration, and interpret contents of butler_attributes table.

Initial version of such interface was implemented in a separate package daf_butler_migrate on top of the butler extensible CLI. Below is the brief description of butler migrate sub-command and all its available sub-commands. There are two major groups of commands, those that manage migration history as a set of scripts and folders on local file system, and those that work with the database performing checks or schema upgrades.

By default location of migration scripts is in $DAF_BUTLER_MIGRATE/migrations folder, it can be changed with --mig-path=PATH option.

9.1   Make revision tree

Usage:

butler migrate add-tree [--one-shot] TREE_NAME

Creates new revision tree, TREE_NAME is a manager type name (e.g. “datasets”) for regular trees or a manager name and special migration name (e.g. “datasets/int_1.0.0_to_uuid_1.0.0”) for one-shot migration trees.

9.2   List revision trees

Usage:

butler migrate show-trees [--one-shot] [-v|--verbose]

Prints a list of regular tree names, or special one-shot trees.

9.3   Make new revision

Usage:

butler migrate add-revision [--one-shot] TREE_NAME MANAGER_CLASS VERSION

Generates new migration script for migrating from most recent version of a manager class defined in history to a new version. MANAGER_CLASS is a manager class name with out module name.

9.4   Show revision history

Usage:

butler migrate show-history [--one-shot] [-v|--verbose] [TREE_NAME]

Print revision history, possibly limiting it to a single tree.

9.5   Stamp alembic_version table

Usage:

butler migrate stamp [--purge] REPO

Fills alembic_version table with the revision names derived from current versions defined in butler_attributes table.

9.6   Display current database revisions

Usage:

butler migrate show-current [--butler] [-v|--verbose] REPO

Prints revisions defined currently in alembic_version table or in butler_attributes table if --butler option is specified.

9.7   Upgrade database schema

Usage:

butler migrate upgrade [--one-shot-tree=TREE_NAME] [--sql] REPO REVISION

Upgrades database schema to a given revision using regular migration trees or a special one-shot tree if --one-shot-tree is given.

9.8   Downgrade database schema

Usage:

butler migrate downgrade [--one-shot-tree=TREE_NAME] [--sql] REPO REVISION

Similar to migrate upgrade but works in opposite direction.