How to keep track of your data evolution? A journey through time and (memory) space

Léon Stefani
Brigad Engineering
Jul 1, 2019


In any project of any size, there always comes a time when it is used by real users and you stumble upon a certain question: what happened to my data for it to reach its current state? Whether it is a state that should never have been reachable or a mistake made by a user, it is always in these cases that we wish we had tools to track what happened. One solution is to keep all historical data through time.

Unfortunately (as Murphy’s law would have it), it’s always when we reach the point where we need that historical data that we realize we don’t have it yet, and that we should have found a way to keep it sooner. But how do we do it? And which methods best suit our needs? That’s what we will try to cover in this post, using our journey through the different methods we use at Brigad as a guiding thread.

I — Immutable state machines to keep track of actions

Suitable for some needs, especially when dealing with quantifiable data such as money or stock levels, immutable state machines keep track of data by design, because their entries can never be edited at all. Each entry represents a single action associated with an amount, so we can compute the current state of a resource by summing the values of all actions, instead of editing a global result, and we know when each action took place and in which order. This way, it always remains possible to know what happened for the data to reach its current state.

For example, take a database table where you track your inventory: each time there is an order, you add an entry with a negative amount for the product, and each time there is a delivery, you add an entry with a positive amount. The current state of your stock is then simply the sum of all amounts, and you also keep track of when the total changed, by how much, and possibly other data about the transaction.
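As a minimal sketch, such an append-only ledger could look like this (the table and column names are illustrative, not an actual Brigad schema):

```sql
-- An append-only ledger of stock movements: rows are inserted, never updated.
CREATE TABLE stock_movements (
  id          serial PRIMARY KEY,
  product_id  integer NOT NULL,
  amount      integer NOT NULL,  -- positive for deliveries, negative for orders
  created_at  timestamptz NOT NULL DEFAULT now()
);

-- A delivery of 10 units, then an order of 3 units.
INSERT INTO stock_movements (product_id, amount) VALUES (1, 10);
INSERT INTO stock_movements (product_id, amount) VALUES (1, -3);

-- The current stock of each product is the sum of all its movements.
SELECT product_id, SUM(amount) AS current_stock
FROM stock_movements
GROUP BY product_id;
```

Because rows are never updated, the `created_at` column alone gives you the full timeline of how the stock reached its current value.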

This kind of state machine is really simple to set up and memory-efficient, since you store the data once without ever duplicating it, but it is hard to apply to tables with non-quantifiable values, for example the names of the products you want to sell. It also only tracks the quantities themselves: if you have a status on the order in the same row as the amount, changes to that status won’t be tracked.

This is the first solution we put in place at Brigad, in order to keep track of payments, because it is a subject that needs rock-solid data and architecture (we’re talking about real money here), and thus we needed a complete history there. But we later had other data-history needs that couldn’t be solved this way, which led us to another tool to achieve data history.

II — Immutable and versioned rows

While keeping the advantage of immutable rows that prevent data loss, another option to achieve a clean data history is to keep all versions of a row inside the same table. This way, it’s always possible to know what change happened and when. The main advantage is that it is very flexible: it can be applied to any row with any kind of data inside, and it doesn’t require a lot of code to put in place (just replacing updates with insertions).

In order to retrieve the current state of the data, we need a query that selects only the latest inserted version of each row. This way, we can run normal database queries, either on the current data or on the data that was correct at a certain point in time. But we need to first select the state we want, and then execute the query on this subset, because we don’t want the filters of the main query to be applied to historical data.

Another way to use this method is, instead of relying on the date of insertion in the database, to add a version column and always query against the subset with the latest version. This makes the data clearer and allows grouping several changes within the same version.
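A minimal sketch of this pattern, with a hypothetical items table carrying an explicit version column (names are illustrative):

```sql
-- Each change inserts a new row with an incremented version; nothing is updated.
CREATE TABLE items (
  id         integer NOT NULL,
  version    integer NOT NULL,
  name       text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (id, version)
);

-- First restrict to the latest version of each item, then apply the
-- actual filters on that subset only.
SELECT i.*
FROM items i
JOIN (
  SELECT id, MAX(version) AS version
  FROM items
  GROUP BY id
) latest ON latest.id = i.id AND latest.version = i.version
WHERE i.name LIKE 'Chair%';
```

The inner `GROUP BY` subquery is what isolates the current state; without it, the `WHERE` clause would also match historical versions of the rows.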

However, this method has the drawback of increasing the storage needed for the data, since each modification inserts a new row into the table, making it less suitable for heavily updated tables, or tables whose rows contain a huge amount of data.


Another drawback is that it complicates all queries that retrieve data, as we first have to get the subset with the current data and then execute the actual query against this subset, which can be pretty tricky when using ORMs.

At Brigad, we started using it to keep the history of some date ranges, but when we did some research into generalizing it to other resources, we ran into the query-complexity drawback and started to look for alternative methods. That’s when we found a PostgreSQL extension, temporal_tables, which suited our needs since PostgreSQL is the database engine we currently use.

III — Temporal tables: historical data for Postgres, the easy way

Available as a PostgreSQL extension, temporal_tables allows creating ‘history’ tables whose role is to keep track of all the versions of the data in the main table.

This extension is based on the concept of a temporal database, which is a higher-level database concept and can be implemented in engines other than Postgres. While this concept is part of the SQL:2011 standard, most engines do not implement it natively, which explains the need for extensions or scripts.


This is done at the database level (so no need to worry about concurrent access or transaction rollbacks), by inserting into the history table the data in the state it was in before it was updated or deleted. The only thing we have to add is a sys_period column of type tstzrange on the main table, because it is this column that indicates the time range during which the data was valid.

Once we have the history table and the main table, we just need to add the temporal_tables extension’s trigger, and it becomes impossible to lose data: every database action modifying a row in the main table will also insert the previous data into the history table, with a validity range spanning from the last update to the time of the current update.

You can add a sys_period column this way:

ALTER TABLE products
ADD COLUMN sys_period tstzrange NOT NULL DEFAULT tstzrange(current_timestamp, null);

Then create the history table and add the trigger like this:

CREATE TABLE products_history (LIKE products);

CREATE TRIGGER versioning_trigger
BEFORE INSERT OR UPDATE OR DELETE ON products
FOR EACH ROW EXECUTE PROCEDURE versioning(
  'sys_period', 'products_history', true
);

For example, if I were to modify a product’s name in the database to correct a typo, an entry would be created in the history table, stating that between the product’s creation and now, the data with the typo was valid, and that from now on, the correct data is the one without the typo.
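Assuming the products table and trigger set up above, the typo fix might look like this (the row contents are illustrative):

```sql
-- Correcting the typo on the main table...
UPDATE products SET name = 'Chair' WHERE name = 'Chiar';

-- ...makes the trigger copy the previous version into the history table,
-- with its sys_period range closed at the time of the update:
SELECT name, sys_period FROM products_history WHERE name = 'Chiar';
-- e.g. 'Chiar' | ["2019-06-01 10:00:00+00","2019-07-01 09:30:00+00")
```

Meanwhile, the row in the main table keeps the corrected name, with a sys_period starting at the time of the update and open-ended on the right.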

In order to retrieve data from the database, at the current time or at any previous time, the query is fairly simple:

SELECT {columns} FROM {table}
WHERE {where clauses} AND sys_period @>
{time we want or now() for latest}::timestamptz
UNION ALL
SELECT {columns} FROM {history_table}
WHERE {where clauses} AND sys_period @>
{time we want or now() for latest}::timestamptz;

Just by wrapping all of our WHERE clauses in a union with the history table, we can query the table in the state it was at a certain time: a true Wayback Machine!
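For instance, instantiating the template above to see products as they were on a given date might look like this (a sketch, assuming the products table and history table from the earlier examples):

```sql
-- State of products named 'Ch...' as of June 15th, 2019: the first branch
-- catches rows still current at that date, the second catches rows that
-- have been updated or deleted since.
SELECT name, sys_period FROM products
WHERE name LIKE 'Ch%' AND sys_period @> '2019-06-15'::timestamptz
UNION ALL
SELECT name, sys_period FROM products_history
WHERE name LIKE 'Ch%' AND sys_period @> '2019-06-15'::timestamptz;
```

The `@>` operator is Postgres range containment: it keeps only the versions whose validity range contains the requested timestamp.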

It shares some of the drawbacks of the versioning method, such as the storage growth on each update, but the querying is a lot simpler (and faster), and the history keeping happens at the database level, not the application level.

We started using it on multiple business-critical tables that would greatly benefit from easy-to-implement history, and so far it seems to fill our needs perfectly: it allows for a generic way to query data through history, and the implementation was really easy thanks to a few reusable utilities.

In order to show how to implement temporal tables in Postgres, I’ve made a simple demo project, using Node.js, Express, and Sequelize, for an API backed by a database with temporal tables, available here: https://github.com/Brigad/temporal-tables-demo


I hope this post will help some of you avoid struggling as much as we did with data history, and that you found it as pleasant to read as it was for me to write! Should you have any concerns or questions, you can reach me on Twitter (@shrodingers42)!


Back-end developer at Brigad. Enthusiastic about discovering new skills and exploring new fields!