Continuous delivery for database schemas in a distributed monolith

Abraham Polishchuk
Published in Flock · 7 min read · Aug 22, 2019

Introduction

Flock’s mission is to make the world a safer, smarter place. We do this by providing data-driven risk insights in real time that support customers when they buy insurance.

In order to power this mission, amongst many other tools and techniques, we deliver our database migrations continuously. Continuous delivery in the context of selling insurance globally via an app is pivotal to everything we do here from a technology perspective. This article will walk you through how we arrived at this somewhat unusual setup, why it has been valuable to Flock, and the implementation details.

This article is for people who have an interest in relational database deployment methodologies and testing strategies. We assume a level of familiarity with the SQL programming language.

Background

Here at Flock, we run a service-based architecture which allows us to:

  • Isolate deployment of individual parts of our system
  • Separate development efforts across both teams and individuals
  • Promote codebase simplicity

In a traditional microservice-based architecture this is accomplished by having completely decoupled resources. Each service ships with its own database, caches, and instances of other resources.


What sets us apart

We operate in a regulated financial environment with various regulatory conditions both domestically and overseas. As a result, we must prioritize things that many other businesses of a similar size and stage can afford not to.

For example, transactional consistency is indispensable when selling insurance. A discrepancy within our data could tangibly affect the price we produce algorithmically for a customer. This could result in an unfair charge or, in the worst case, implicate us in a regulatory nightmare. In either case, an event of this nature would constitute a clear audit failure.

Theoretical background

Brewer’s Theorem

Brewer’s Theorem (colloquially known as the CAP theorem) states that a networked shared-data system can have at most two of the following three guarantees:

  • Consistency (at any given point, every node sees the same data, and the data is up to date)
  • Availability (every request receives a response from a non-failing node)
  • Partition tolerance (the system continues to function irrespective of network partitioning)

This theorem is further used to classify databases into three families, CA, CP, and AP, to represent database systems that fulfil two of these criteria.

  • CA systems are impossible in the wild, as the introduction of even a single networked client renders the system partitioned.
  • CP systems will, in the presence of a partition, choose to serve only consistent data. If data consistency can not be ensured, they will refuse to service the request.
  • AP systems will, in the presence of a partition, continue responding, but will make no guarantees on the consistency of the response.

ACID Transactions

ACID stands for:

  • Atomic (everything in the transaction is part of a single ‘atom’; everything either succeeds or fails together)
  • Consistent (both the starting and finishing state of the transaction are valid and do not violate any database rules)
  • Isolated (transactions are isolated from each other and cannot interfere with each other)
  • Durable (all completed transactions persist)

What this means for Flock

As should be evident from the above description, we at Flock value transactional database consistency: we want CP databases with ACID-compliant transactions. What this means in practice is that the compromise typically made when building a service-based architecture is unacceptable to us, because we require transactional consistency to be enforced across the whole system. When deploying multiple databases, this is a notoriously difficult problem, known as the distributed transaction problem. As such, we deploy a single database, straddled by multiple servers and partitioned by schema.

This looks something like this:

Flock’s rough service infrastructure

Each service writes to its own schema, but enforces consistency guarantees across schemas using foreign key relationships, joins, triggers, and constraints. We call this a distributed monolith.
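
As a toy illustration (the schema and table names here are invented for this post, not taken from our real data model), a cross-schema foreign key looks something like this:

```sql
-- Each service owns its own schema...
CREATE SCHEMA IF NOT EXISTS policy;
CREATE SCHEMA IF NOT EXISTS payment;

-- ...the policies service writes to policy.*
CREATE TABLE policy.policies (
  id BIGINT PRIMARY KEY
);

-- ...the payments service writes to payment.*, yet the database still
-- guarantees that every payment points at a policy that really exists.
CREATE TABLE payment.payments (
  id        BIGINT PRIMARY KEY,
  policy_id BIGINT NOT NULL REFERENCES policy.policies (id)
);
```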

This architecture introduces an interesting question — how do we track our migrations?

In a monorepo, the migrations live with the code, in the only repo. In a microservices system, the migrations for each database are stored with the service responsible for that database.

Neither one of those approaches can work with our infrastructure.

A first attempt

Initially, we thought we would have each service be responsible for its own migrations. We used the wonderful db-migrate npm package to add and track migrations. The resulting migrations table was stored per schema.

This worked well for a while. The problems started as our schema evolved. For a simple example, let's look at the following three migrations:

  • Policy Migration 1
  • Payments Migration 1
  • Update Migration

We create a policies table with a BIGINT primary key. We vendor this migration with the policies service.

We create a payments table. Its foreign key references the id column of the policies table. We vendor this migration with the payments service.

We change the type of the id column on the policies table to INT. We also modify the associated foreign key column on the payments table.
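
The first two migrations are essentially the cross-schema CREATE TABLE statements sketched earlier, one per schema. The third, whichever service it lands in, would look roughly like this (expressed as plain SQL for brevity; db-migrate migrations are typically JavaScript files, and the constraint name assumes Postgres defaults):

```sql
-- Update Migration: shrink the policies key, and the foreign key column with it
ALTER TABLE payment.payments DROP CONSTRAINT payments_policy_id_fkey;
ALTER TABLE policy.policies  ALTER COLUMN id TYPE INT;
ALTER TABLE payment.payments ALTER COLUMN policy_id TYPE INT;
ALTER TABLE payment.payments
  ADD CONSTRAINT payments_policy_id_fkey
    FOREIGN KEY (policy_id) REFERENCES policy.policies (id);
```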

The problem is in the implicit ordering. The payments table references the policies table via a foreign key from its very first migration, so the policy service's migrations have to run first.

This means that as soon as the third migration is added to either service, the entirety of the infrastructure becomes undeployable. Whichever service the update migration is vendored with, deployment from a clean slate becomes impossible: either the policy service expects a payments table that can be altered, or the payments service expects a policies table that can be altered. As one service has to be deployed first, the table in question will not yet exist, so the migration is guaranteed to fail.

If the third migration is split across both services, the initial migration of the payments service will fail to run, because the policies id column will already have changed type relative to the foreign key column by the time the payments service's migrations kick in.

So what did we do?

We decided to vendor our migrations as a common dependency. We shared this through an internal npm package. All migrations are then tracked sequentially on a shared migrations table in the public schema. This presented us with a new challenge — how do we ensure that dozens of migrations across multiple servers, when coalesced to a single migration package, create the same state as we had running at that point in time? What assertions can we put into our CI/CD pipelines so that this rewrite succeeds?

Our solution

After some research, we discovered a tool called pgTap. Installation, both in CI and for local development, is straightforward:

pgtap installation
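
Something along these lines does the job, assuming a Debian-based image, Postgres 11, and an illustrative database name; package names vary by platform and Postgres version:

```bash
# Install the pgTap extension and the pg_prove test runner.
sudo apt-get install -y postgresql-11-pgtap
sudo cpan TAP::Parser::SourceHandler::pgTAP   # provides pg_prove

# Enable the extension in the database you are going to test against.
psql -d flock_dev -c 'CREATE EXTENSION IF NOT EXISTS pgtap;'
```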

You can then add the following to your migration package's package.json:

bin script for installation
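
For example, with the package and script names below standing in for whatever you actually call them:

```json
{
  "name": "@your-org/migrations",
  "version": "1.0.0",
  "bin": {
    "install-pgtap": "./bin/install-pgtap.sh"
  }
}
```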

This will expose your new bin script to the installing node server, allowing you to run this as a script during your CI/CD pipeline.

You can now start writing tests such as:

an example pgtap test
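
For instance, a few structural assertions against the toy policy and payment schemas from earlier (the schema, table, and column names are illustrative):

```sql
BEGIN;
SELECT plan(4);

SELECT has_table('policy', 'policies', 'policies table exists');
SELECT has_table('payment', 'payments', 'payments table exists');
SELECT col_is_pk('policy', 'policies', 'id', 'policies.id is the primary key');
SELECT fk_ok('payment', 'payments', 'policy_id',
             'policy',  'policies', 'id',
             'payments.policy_id references policies.id');

SELECT * FROM finish();
ROLLBACK;
```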

You can run them with npm test, provided you add something like the following to your package.json:

package.json test script for pgtap
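
Something like this, where the database name and test file layout are illustrative:

```json
{
  "scripts": {
    "test": "pg_prove -d flock_dev -U postgres test/*.sql"
  }
}
```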

An in-depth discussion of pgTap’s full functionality is outside the scope of this blog post, but the framework’s documentation is fantastic!

You may wonder how to figure out the entirety of your schema so that you know what tests need to be written. If you log in to your Postgres instance using the psql command, you can type \d table_name to see the details of any given table.

Once we had written a comprehensive test suite for our schema, we knew that any changes we introduced in the rewrite would be immediately picked up on.

More advanced consistency guarantees

As alluded to above, here at Flock data consistency is a first-class concern. As such, we use some more advanced features: check constraints and update/insert triggers.

Let’s provide an example schema.
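
A self-contained sketch along these lines; the table names, allowed values, and amount range are purely illustrative:

```sql
-- Check constraints pin down exactly which values the data layer will accept.
CREATE TABLE policy.policies (
  id          BIGSERIAL PRIMARY KEY,
  policy_type TEXT NOT NULL
    CONSTRAINT policies_type_check CHECK (policy_type IN ('commercial', 'recreational'))
);

CREATE TABLE payment.payments (
  id        BIGSERIAL PRIMARY KEY,
  policy_id BIGINT NOT NULL REFERENCES policy.policies (id),
  status    TEXT NOT NULL
    CONSTRAINT payments_status_check CHECK (status IN ('pending', 'succeeded', 'failed')),
  amount    NUMERIC(10, 2) NOT NULL
    CONSTRAINT payments_amount_check CHECK (amount > 0 AND amount <= 10000)
);

-- Trigger: only successful payments may be stored against a policy.
CREATE FUNCTION payment.assert_payment_succeeded() RETURNS trigger AS $$
BEGIN
  IF NEW.status <> 'succeeded' THEN
    RAISE EXCEPTION 'only successful payments can be stored on a policy';
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER payments_must_succeed
  BEFORE INSERT OR UPDATE ON payment.payments
  FOR EACH ROW EXECUTE PROCEDURE payment.assert_payment_succeeded();
```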

So now we have, at the data level, enforced specific values for payment status and policy type, enforced our allowed payment amount range, and ensured that only successful payments can be stored on a policy. That last trigger is, for obvious reasons, unwise (in a real scenario we would actually want to know which policy a failed payment was attempting to bind to), but it is illustrative of the kind of power you can bake into your data model, which is valuable in a regulated market.

This introduces an interesting conundrum. While we've gained value, we've lost test coverage: there is now logic at the database tier. Not only is this logic expensive to exercise through traditional application-level unit tests, but the application-level test suite will now overestimate our real level of test coverage.

Luckily, pgTap provides us with the ability to write tests for these cases. We will not provide an exhaustive test suite for all the logic here, but rather illustrate the possibilities; as with all testing, it is advisable to aim for 100% branch coverage in practice.
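
For example, exercising the amount check and the status trigger from the sketch above ('23514' is Postgres's check_violation error code, 'P0001' the default code for RAISE EXCEPTION):

```sql
BEGIN;
SELECT plan(2);

-- Seed a policy to attach payments to (rolled back at the end).
INSERT INTO policy.policies (policy_type) VALUES ('commercial');

-- The amount range check should reject a negative payment.
SELECT throws_ok(
  $$ INSERT INTO payment.payments (policy_id, status, amount)
     SELECT id, 'succeeded', -5.00 FROM policy.policies LIMIT 1 $$,
  '23514',
  NULL,
  'negative payment amounts are rejected'
);

-- The trigger should reject anything other than a successful payment.
SELECT throws_ok(
  $$ INSERT INTO payment.payments (policy_id, status, amount)
     SELECT id, 'failed', 10.00 FROM policy.policies LIMIT 1 $$,
  'P0001',
  'only successful payments can be stored on a policy',
  'failed payments are rejected'
);

SELECT * FROM finish();
ROLLBACK;
```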

To assert that a particular query would pass, lives_ok can be used in place of throws_ok.

Running the migrations

So now we’ve got a centralized repository of database migrations, with a robust test suite ensuring functionality. How do we allow the upstream services that install this package to run the database migrations, given that they no longer install db-migrate directly?

The answer is npm explore. First, we add a small runner script to the bin directory of the migrations package.
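
A sketch of such a script, assuming it simply wraps the db-migrate dependency that ships with the migrations package (the filename and environment handling are illustrative):

```bash
#!/usr/bin/env bash
# bin/migrate.sh: run every pending migration shipped in this package.
set -euo pipefail

# npx resolves db-migrate from node_modules/.bin, wherever npm installed it.
npx db-migrate up --env "${NODE_ENV:-dev}"
```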

Then we provide a corresponding script in the package.json of any upstream service.
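
For example, with @your-org/migrations standing in for the real package name and bin/migrate.sh for whatever the runner script above is called:

```json
{
  "scripts": {
    "migrate": "npm explore @your-org/migrations -- ./bin/migrate.sh"
  }
}
```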

Voilà! Migrations can now run from the service. Every time you publish a new version of your migrations package, you will need to bump the dependency version in your upstream services, but that's more or less all there is to it.

Where to next?

In our unceasing drive towards innovation, we’ve since taken this a step further, by dockerizing the migration service and deploying it as a standalone Docker image that runs migrations through Kubernetes. But that is a topic for another blog post.


Tech Lead at Flock. Focused on robust software wrapped in beautiful user experience. Firm believer that software engineering is a science, not an art.