Modelling Tree Like Finite State Machines At The Persistence Layer

David OConnor
Flock
Published in
8 min readJan 4, 2022

--

This is part 4 in a series of articles, the first article can be found here.

There are many options for modelling finite state machines at the code level when writing an application. In the Javascript ecosystem, XState instantly springs to mind as a strong contender. Here at Flock, we’ve been questioning how we can safely and accurately model finite state machines for long term storage within the persistence level. In particular, we will look at storage in a relational database and we will be using PostgreSQL syntax throughout.

In this article, we’ll look at our latest technique for ensuring type-safe storage of a tree-like finite state machine. Our tree-like structure is a directed graph where any two nodes have a single vertex connection and there are no cycles (loops) within the graph. We will explain our technique through the use of a sequence of practical examples.

Problem Definition I

To help frame our proposed technique, we are going to examine it through an example scenario, again utilising the idea of a fictitious e-commerce site. Within this example, we will be looking at the orders that customers can place on-site and how the state of an order can evolve over time. What makes this problem interesting is that at each state additional information, pertinent to the particular state, will be required.

Let’s assume that the initial state is when a customer places an order, we will refer to this as the placed state. Suppose the order details can be described by a small number of fields:
- A unique identifier for the order,
- A reference to an identifier for the product ordered,
- A reference to the customer making the order.

At some point in the future, after an order has been placed, an invoice will be issued. This creates the second state invoiced, for which we need only know when the invoice has been sent.

After receiving the invoice customers will be able to pay for their order via one of two payment methods, card or direct debit. We will keep track of when payment was taken and what payment method was used. This creates our third state, paid.

A Flawed Solution

Given the three states described above, it should be clear that we have a well defined finite state machine with an order progressing between each state in a linear fashion.

Linear Order State Flow Diagram

With this in mind, we can create an enumerated type to keep track of the state within our database.

CREATE TYPE order_state AS ENUM ('placed', 'invoiced', 'paid');

To store the orders themselves, we can create a single table with a field for each of the pieces of data described in the problem definition. Assuming that we have existing tables for products and customers we can create our orders table.

CREATE TABLE orders (
id UUID NOT NULL PRIMARY KEY,
created_at TIMESTAMP WITH TIME ZONE NOT NULL,
product_id UUID NOT NULL REFERENCES products (id) NOT NULL,
customer_id UUID NOT NULL REFERENCES customers (id) NOT NULL,
invoiced_at TIMESTAMP WITH TIME ZONE,
paid_at TIMESTAMP WITH TIME ZONE
payment_source TEXT,
state order_state NOT NULL
)

Although this schema will allow us to store all three states of an order, there is a lack of safety enforced upon us. With the schema as is, we could insert inaccurately shaped data into our storage layer. That is data that doesn’t represent a real state within our domain or doesn’t provide all of the required information for a given state. For example:

INSERT INTO orders (
‘8369cae3–6afc-442a-a1e2–37307926ad53’,
‘2021–01–01 10:00:00.00+00’,
‘bb7f1883–467f-4f98-b4f8–16c82929c77e’,
‘88ba3fd0–8742–452e-b1c2–8f54b0af8bbd’,
NULL,
‘2021–01–02 10:00:00.00+00’,
NULL,
‘paid’
)

There are two inaccuracies being displayed here. Firstly that the order does not have an invoiced_at entry despite having passed through the invoiced state. Secondly that despite being in the paid state no payment source has been provided.

At Flock, our old approach to solving these types of problems was to use constraints. Altering the table above we can add constraints to give us a version of type safety within our storage layer.

ALTER TABLE orders 
ADD CONSTRAINT state_constraint CHECK (
(
state = 'placed' AND
invoiced_at IS NULL AND
paid_at IS NULL AND
payment_source IS NULL
) OR (
state = 'invoiced' AND
invoiced_at IS NOT NULL AND
paid_at IS NULL AND
payment_source IS NULL
) OR (
state = 'paid' AND
invoiced_at IS NOT NULL AND
paid_at IS NOT NULL AND
payment_source IS NOT NULL
)
);

We can see that with this constraint that all of the required information must be provided for the relevant state. Thus when an order is invoiced we must set invoiced_at and similarly for the paid state.

Our New Approach

At this stage, you may be wondering what the rest of this article is about. We appear to have a schema that can model each of our states correctly and prevent us from inserting inaccurate data. Although this technique has been working for us for a number of years it starts to break down as the underlying model becomes more complex, as well as becoming harder to reason about.

Our new technique, in short, utilises multiple tables to represent each state. This has the benefits of being easier to understand and harder to get wrong whilst providing the type of safety that protects from mistakes.

CREATE TABLE orders (
id UUID NOT NULL PRIMARY KEY,
created_at TIMESTAMP WITH TIME ZONE NOT NULL,
product_id UUID NOT NULL REFERENCES products (id) NOT NULL,
customer_id UUID NOT NULL REFERENCES customers (id) NOT NULL
);CREATE_TABLE invoiced_orders (
id UUID NOT NULL FOREIGN KEY REFERENCES orders(id),
invoiced_at TIMESTAMP WITH TIME ZONE NOT NULL
);CREATE TABLE paid_orders (
id UUID NOT NULL FOREIGN KEY REFERENCES invoiced_orders(id),
paid_at TIMESTAMP WITH TIME ZONE NOT NULL,
payment_source TEXT NOT NULL
);

We can see that the above is instantly clearer and easier to reason about than our old approach. An order contains a strict set of required fields. An invoiced order follows from an existing order and must have the field invoiced_at provided. Additionally, a paid order can only come from an invoiced order and must have its context provided. All of this was achieved whilst reducing the amount of SQL required to create our schema.

By having multiple tables we have also ensured that each table should only be written once. There should be no need to update an existing record. This allows us to keep a perfect record of the state at all times.

This new approach is also more open to modifications. Adding new information to a state or adding new states is a relatively easy operation. Suppose that invoicing is not an automatic process and we wish to record who issued the invoice. Adding a migration to the invoiced_orders table is a much easier process with multiple tables as opposed to a single constraint.

Updating the multi-table format:

ALTER TABLE invoiced_orders 
ADD COLUMN issued_by TEXT NOT NULL;

Updating a constraint:

ALTER TABLE orders
DROP CONSTRAINT state_constraint,
ADD CONSTRAINT state_constraint CHECK (
(
state = ‘placed’ AND
invoiced_at IS NULL AND
issued_by IS NULL AND
paid_at IS NULL AND
payment_source IS NULL
) OR (
state = ‘invoiced’ AND
invoiced_at IS NOT NULL AND
issued_by IS NULL NOT AND
paid_at IS NULL AND
payment_source IS NULL
) OR (
state = ‘paid’ AND
invoiced_at IS NOT NULL AND
issued_by IS NOT NULL AND
paid_at IS NOT NULL AND
payment_source IS NOT NULL
)
)

There is a far greater opportunity to make a mistake when dealing with constraints, a problem that only grows as the model expands in size and complexity.

Problem Definition II

Although our new approach is a large improvement over what we’ve used before, looking at linear state flows is quite a restrictive domain space. To expand upon the above we consider the possibility that once an order has been invoiced it may go unpaid and that the order will need to be cancelled. This introduces a bifurcation of states after invoicing.

Bifurcating Order State Flow Diagram

The added complexity comes from the fact that an order can be either cancelled, paid or neither, but never both. We want to ensure that this restriction on states is enforced within our schema.

Extending Our New Approach

As the states of placed and invoiced have not changed we can keep the schema for these two tables the same as above.

We will again use independent tables for the two other states, paid and cancelled. To account for the fact that an order can only occupy one of these states we will consider them as a group of states representing a `settled` order and add an additional table to keep track of settled orders.

CREATE TABLE settled_orders (
id UUID NOT NULL FOREIGN KEY REFERENCES invoiced_orders(id),
paid_order_id UUID FOREIGN KEY REFERENCES paid_orders(id),
cancelled_order_id UUID FOREIGN KEY REFERENCES cancelled_orders(id)
CONSTRAINT single_state CHECK (
(paid_order_id IS NOT NULL AND cancelled_order_id IS NULL) OR
(paid_order_id IS NULL AND cancelled_order_id IS NOT NULL)
)
);CREATE TABLE paid_orders (
id UUID NOT NULL FOREIGN KEY REFERENCES settled_orders(id),
paid_at TIMESTAMP WITH TIME ZONE NOT NULL,
payment_source TEXT NOT NULL
);CREATE TABLE cancelled_orders (
id UUID NOT NULL FOREIGN KEY REFERENCES settled_orders(id),
cancelled_at TIMESTAMP WITH TIME ZONE NOT NULL
);

The addition of the settled_orders table represents the additional context surrounding paid and cancelled orders. The presence of foreign keys on the settled_orders table may at first seem strange but this is what enforces a one-to-one relationship between a settled invoice and its state.

We have expressed the check constraint on the table settled_invoices in this fashion to ease understanding. If we were in a situation where our bifurcation produced many states, then expressing the constraint in this manner would be overly expressive. We could simplify the constraint as such:

CONSTRAINT single_state CHECK (
CASE WHEN paid_order_id IS NULL THEN 0 ELSE 1 +
CASE WHEN cancelled_order_id IS NULL THEN 0 ELSE 1 = 1
)

With the constraint expressed in this fashion adding additional states is as easy as adding an additional `CASE` statement to the sum.

Summary

We’ve looked at modelling tree-like finite state machines within our storage layer in a type-safe manner that is easy to understand, difficult to get wrong and simple to extend. By composing the techniques discussed we can model any size of tree with an arbitrary number of bifurcations and with generic data models within each state.

--

--

A mathematician by training and a software developer by trade, you’ll usually find me on the rugby pitch somewhere.