Replication of Enterprise Digital State — or why you need REDS for Data Mesh

Steve Jones
Jan 17, 2022

Ok, I’ve said that you don’t need Data Warehouse thinking in a Digital world, and some people rightly said “ok smart arse, so what do you need?”. So this will be one of a series of posts that first talks concepts, and then brings them together, hopefully, in a coherent architectural approach.

Now Data Mesh has got a lot of traction and I think it’s got some good ideas in it, particularly its reuse of Domain Driven Design, something I am particularly keen on as it’s exactly how I view business service architecture. This idea of business-aligned data product management is fundamental, and it’s exactly why I think that data governance is an operational problem: it’s part of the business domain.

In Data Mesh Zhamak makes clear that operational data can’t just be left in the source system, because thrashing queries against that system isn’t going to be popular, even for ‘linear’ queries let alone analytical ones. So how to solve that? Well for that, I’d like to introduce you to REDS — Replication of Enterprise Digital State, a pattern for creating a low-latency operational replica of enterprise state.

Pattern: REDS — Replication of Enterprise Digital State

Intent: A pattern to create a low-latency (from point of write) data store that enables eventually consistent transactional patterns, like Digital Looking Glass, as well as analytical use of operational data.

Motivation: Lots of enterprise systems are already at max capacity; hitting them with digital requests for data can cause them issues, and their databases are not intended for analytical use.

Applicability: This is intended to be a standard pattern for replicating transactional system state into a digitally focused data store.

Structure:

REDS structure

Participants: The source system, a CDC feed from that system, an event streaming platform, a reconciliation engine, and a data store as a target.

Collaboration:

The pattern requires the following:

  1. A source system
  2. A CDC feed from that source system
  3. An eventing infrastructure to receive the CDC feed
  4. An environment to append the history of change
  5. A mechanism to calculate the current state
  6. A mechanism to onward propagate change

The difference here from a traditional database-to-database replication is twofold:

  1. We want the history of change not just the current state
  2. We want to onward propagate changes

The latter point is important: we are using data and CDC here not just as a way to replicate enterprise state, but to provide a mechanism via which onward transformation into data products, as well as AI, can be automatically triggered.
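The six elements above can be sketched as a single flow. This is a minimal illustration with hypothetical names, not a real library: a CDC event arrives from the eventing infrastructure, is appended to an immutable history, updates the calculated current state, and is then propagated onward to anything that wants to react to it.

```python
# Hypothetical sketch of the REDS flow; every name here is illustrative.
history = []          # step 4: append-only history of change
current_state = {}    # step 5: calculated current state
subscribers = []      # step 6: onward propagation targets (data products, AI)

def on_cdc_event(event):
    """Receive one CDC event from the eventing infrastructure (steps 2-3)."""
    history.append(event)                         # never update in place
    key = event["key"]
    current = current_state.get(key)
    # last-write-wins for brevity here; the real timestamp rules come later
    if current is None or event["system_ts"] >= current["system_ts"]:
        current_state[key] = event
    for notify in subscribers:                    # trigger onward processing
        notify(event)

subscribers.append(lambda e: print("propagated", e["key"]))
on_cdc_event({"key": "order-1", "system_ts": 1, "data": {"status": "NEW"}})
```

The point of the subscriber list is exactly the onward-propagation argument above: replication and triggering are the same event.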

Example:

SAP’s data is considered hugely important to the organization, but we don’t want people using the operational database either as an analytical store or for building transformations within that store. Therefore what we do is implement a replication of the data from SAP into a store that supports onward transformation, AI eventing and business operational reporting. We also archive off the history into a large data store which is directly accessible from the operational view.

Well, we don’t have to imagine this; it’s basically how SAP works today.

Consequences:

The impact of this is made pretty clear in the Data Mesh paper: you can’t have a data mesh if you don’t have this operationally accessible data store, and you don’t have data products if they aren’t constructed quickly, at operational speeds, from the source information.

By implementing REDS you can establish the pipelines needed to create data products, as well as the operational governance and reporting within a domain that ensures it can be governed and then published accurately to the rest of the business.

Implementation:

The source system exports a Change Data Capture feed for its data, which means it sends the following:

If you have a primary key and if you don’t (which gets more fun):

Note: if you don’t have a primary key and there are multiple instances of “Old Data”, then your table will be consistent, but that doesn’t mean the row’s history will be completely correct, so try to avoid this.
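To make the no-primary-key caveat concrete, here is a sketch (all structures hypothetical) of applying a CDC update by matching on the full old-row image. When two rows are identical, the replacement necessarily picks one arbitrarily: the table ends up consistent, but which row’s history the change belongs to is a guess.

```python
# Applying a CDC update when the table has no primary key: match on the
# entire old-row image. Structures here are illustrative only.
table = [
    {"name": "widget", "qty": 5},
    {"name": "widget", "qty": 5},   # duplicate row: history becomes ambiguous
]

def apply_update(table, old_row, new_row):
    """Replace the first row that matches old_row exactly."""
    for i, row in enumerate(table):
        if row == old_row:
            table[i] = new_row      # table is consistent, but which row's
            return True             # history this belongs to is a guess
    return False

apply_update(table, {"name": "widget", "qty": 5}, {"name": "widget", "qty": 7})
```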

On reception of the record I need to add the following:

  1. A System Identifier, so I know where it came from
  2. A universal timestamp, something that is consistent no matter what the channel is
  3. A GUID — something that makes utterly sure the record is unique.
  4. Body of Work identifier

The basic rule, though, is that if I touch a table I take everything, blank fields included; there is absolutely nothing I don’t want to take if I’m taking something of value from the table. I’m not interested in wasting time thinking about what I need; I’m going to be mechanistic and take everything.
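Putting the four additions and the take-everything rule together, the enrichment step might look like this. Field names are my own; only the four additions (system identifier, universal timestamp, GUID, body-of-work identifier) come from the list above.

```python
# Hedged sketch of enriching a raw CDC record on receipt; names are illustrative.
import uuid
from datetime import datetime, timezone

def enrich(record, system_id, body_of_work_id):
    """Wrap a raw CDC record with the four identifiers REDS adds on receipt."""
    return {
        "system_id": system_id,                                  # where it came from
        "universal_ts": datetime.now(timezone.utc).isoformat(),  # channel-independent time
        "guid": str(uuid.uuid4()),                               # guaranteed-unique id
        "body_of_work": body_of_work_id,                         # transactional grouping
        "payload": record,                                       # everything, blanks included
    }

event = enrich({"order_id": 42, "status": "NEW", "notes": ""}, "SAP-PRD", "bow-001")
```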

Body of Work

A quick shout out to Body of Work: when doing CDC (or batch), certain records can be within a transactional boundary, so the state is only valid once all pieces are committed. For instance, with Order + Order Line, the state of the system is only valid after all of the elements are committed. As you get into transformations you may need to create your own bodies, and sub-bodies, of work. This is not trivial; I’ll write something separate on it and why it is important, particularly when looking at cross-system transformation updates (e.g. MDM alignment).
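A minimal sketch of the idea, with hypothetical names: a body of work only becomes complete once every record inside the transactional boundary has arrived, like an order that is only valid with all of its order lines.

```python
# Illustrative only: tracking completeness of a transactional boundary.
class BodyOfWork:
    def __init__(self, bow_id, expected_records):
        self.bow_id = bow_id
        self.expected = expected_records
        self.received = []

    def add(self, record):
        self.received.append(record)

    @property
    def complete(self):
        # state is only usable downstream once everything has arrived
        return len(self.received) >= self.expected

bow = BodyOfWork("bow-001", expected_records=16)   # an order plus 15 lines
bow.add({"type": "order"})
for n in range(10):                                # only 10 of the 15 lines so far
    bow.add({"type": "line", "n": n})
# bow.complete is still False, so this state is not yet valid downstream
```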

History First

So the first piece is that we just append to the history for that system. Append only for this: we want the entire history of change for the system.

Current State or Onward history

The next question is whether you want to propagate history onwards or just the current state. I personally prefer to propagate everything so everyone can also choose to have full history, including fixes for drift that don’t apply to current state (see below).

Calculating Current State

So one of the great things about CDC is its ability to handle drift and reconciliation by ‘patching’ data. Basically, if I missed an update I can force it back into history, and I can also correct a historical record. This means calculating current state isn’t just “last record in wins”, but it is still a really simple calculation:

Latest record is based on latest system timestamp THEN universal timestamp

In other words, for each key you are selecting the records with the latest system timestamp; if there are multiple records with that timestamp, you use the universal timestamp. This enables me to send a correct record after the fact but have it appear as if it was in the past. Now, there is something else I should be doing here which relates back to the body of work; it doesn’t change the timestamp part, but it adds an if:

Latest record is based on latest system timestamp THEN universal timestamp if the body of work is marked as complete

So if my body of work is an Order + 15 Order lines, but I’ve only received 10 of them, then that Order isn’t yet valid so I don’t use it.
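The full rule can be sketched in a few lines. Record shapes here are hypothetical; the logic is the rule above: for each key, pick the record with the latest system timestamp, break ties on universal timestamp, and consider only records whose body of work is complete.

```python
# Sketch of the current-state calculation; record shapes are illustrative.
def current_state(history, complete_bows):
    state = {}
    for rec in history:
        if rec["body_of_work"] not in complete_bows:
            continue                          # incomplete body: not yet valid
        key = rec["key"]
        best = state.get(key)
        # latest system_ts wins; ties broken by universal_ts
        if best is None or (rec["system_ts"], rec["universal_ts"]) > (
            best["system_ts"], best["universal_ts"]
        ):
            state[key] = rec
    return state

history = [
    {"key": "o1", "system_ts": 1, "universal_ts": 10, "body_of_work": "a", "v": "old"},
    {"key": "o1", "system_ts": 2, "universal_ts": 11, "body_of_work": "a", "v": "new"},
    # a late correction: same system_ts, later universal_ts wins the tie,
    # so it appears as if it was always in the past
    {"key": "o1", "system_ts": 2, "universal_ts": 99, "body_of_work": "a", "v": "fixed"},
    # record in an incomplete body of work: ignored entirely
    {"key": "o1", "system_ts": 3, "universal_ts": 50, "body_of_work": "b", "v": "partial"},
]
state = current_state(history, complete_bows={"a"})
```

Note there is no message-ordering requirement here: events can arrive in any order and the calculation still converges, which is exactly the eventual-consistency point made next.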

An advantage of this approach is that it accepts that the store will be eventually consistent, and doesn’t require the sort of complex and costly message-ordering machinery that can sometimes lead to an input pipe becoming completely blocked because an out-of-order message is delayed or missed.

Reconciliation

One thing that I have to do here is perform a periodic reconciliation back to source. There are a number of patterns for this that I’ll document in future, but to start with you can reference this piece on CDC, Drift and Reconciliation.

Uses:

REDS, together with the frameworks that need to go underneath it for Audit, Balance, Control and Reconciliation (ABCR), is the foundation of an active data infrastructure. Many operational use cases can run from a REDS data store; the Digital Looking Glass pattern and Alice and the Digital Looking Glass are just two examples of how that can be done. It also provides the foundation for a Data Mesh, giving the ability for operational data governance to be tightly aligned with business operations.

My job is to make exciting technology dull, because dull means it works. All opinions my own.