Data-First Architecture

I recently had a light bulb moment when I saw a tweet from Evan Todd. It helped bring together some ideas I have had for a while on software architecture.

Data characteristics, not software functionality, should dictate the system architecture.

The shape, size and rate of change of the data are the most important factors when starting to architect a system. The first thing to do is estimate these characteristics in average and extreme cases.

Functional programming encourages this mindset since the data and functions are kept separate. F# has particular strengths in data-oriented programming.

I am going to make the case with an example. I will argue most asset management systems store and use the wrong data. This limits functionality and increases system complexity.

Traditional Approach

Most asset management systems consider positions, profit and returns to be their primary data. You can see this in the overnight batch processes they normally run to generate and save positions for the next day.

This produces an enormous amount of duplicate data. Databases are large and grow rapidly. What is being saved is essentially a chosen set of calculation results.

Worse, other processes such as adjustments, lockdown and fund aggregation are then built on top of this position data.

This architecture comes from not investigating the characteristics of the data first and jumping straight to thinking about system entities and functionality.

Data-First Approach

The primary data for asset management is asset terms, price timeseries and trades. Positions, profit and returns are just calculations based on these. We can ignore the derived data for now and consider caching of calculations at a later stage.
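
As a minimal sketch, assuming illustrative field names rather than any particular schema, this primary data could be modelled in F# with a few small record types:

    // Illustrative primary data model - field names are assumptions, not a fixed schema.
    type AssetTerms = {
        AssetId  : int
        Name     : string
        Currency : string
    }

    // One point in a price timeseries for an asset.
    type PricePoint = {
        AssetId : int
        Date    : System.DateTime
        Price   : decimal
    }

    // A single trade in a fund: positive Quantity for a buy, negative for a sell.
    type Trade = {
        FundId   : int
        AssetId  : int
        Date     : System.DateTime
        Quantity : decimal
        Price    : decimal
    }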

We can use the iShares fund range as an extreme example. They have many funds and trade far more often than most asset managers.

Downloading these funds over a period and focusing on the trade data gives us some useful statistics:

Now that we have a good feel for the data, we can start to make some decisions about the architecture.

Given the sizes, we can decide to load and cache data by whole fund history. This simplifies the code, especially in the data access layer, and increases the number of profit and return measures that can be offered. Most of these calculations are best performed as a single pass through the ordered trades held in a sensible structure. It turns out that, with the data in memory, this takes negligible processing time and can simply be done as the screen refreshes.
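
As an illustration, here is a rough sketch of such a single pass, reusing the illustrative Trade record above and assuming a simple average-cost measure of realised profit (one of many measures that could be computed in the same traversal):

    // Single pass over date-ordered trades, accumulating quantity, cost and realised profit.
    let positionAndProfit (trades: Trade list) =
        ((0m, 0m, 0m), trades)
        ||> List.fold (fun (qty, cost, profit) t ->
            if t.Quantity >= 0m then
                // Buy: add to the position at the trade price.
                (qty + t.Quantity, cost + t.Quantity * t.Price, profit)
            else
                // Sell: realise profit against the average cost of the position.
                let avgCost = if qty = 0m then 0m else cost / qty
                let sold = -t.Quantity
                (qty - sold, cost - sold * avgCost, profit + sold * (t.Price - avgCost)))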

More advanced functionality can be offered, such as looking at a hierarchy of funds and performing calculations at the parent level, with various degrees of filtering and aggregation. As the data is bitemporal, we can easily ask questions such as "what did this report look like previously?" or even "what was responsible for a change in a calculation result?". Since the data is append only, we can update for just the latest additions and save on cloud costs.
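
As a rough sketch of the bitemporal idea, each trade can also carry the time it was recorded, and any past view can be rebuilt by filtering to what was known at a given timestamp. The RecordedAt field and asAt function are assumptions for illustration, building on the Trade record above:

    // Bitemporal sketch: Date is the business date, RecordedAt is when the trade was entered.
    type RecordedTrade = {
        Trade      : Trade
        RecordedAt : System.DateTime
    }

    // Rebuild the trades as they were known at a past timestamp.
    let asAt (timestamp: System.DateTime) (trades: RecordedTrade list) =
        trades
        |> List.filter (fun t -> t.RecordedAt <= timestamp)
        |> List.map (fun t -> t.Trade)
        |> List.sortBy (fun t -> t.Date)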

Conclusion

By first understanding the data, we can build a system that is simpler, faster, more flexible and cheaper to host.

Software developers cannot always answer questions about the size and characteristics of their system's data. It has been abstracted away from them. People are often surprised that a full fund history can be held in memory and queried.

We are not Google. Our extreme cases will be easier to estimate. Making everything infinitely scalable by default leads to complexity and poor performance.

With cloud computing, where architectural costs are obvious, right sizing is essential.

Most of the references I could find come from the games industry. I would be interested to hear about any other examples or counterexamples.

References

The One Weird Trick: data first, not code first - Evan Todd
Data first, not code first - Hacker News
Practical Examples in Data Oriented Design - Niklas Frykholm
Data-Oriented Design - Noel Llopis
Queues and their lack of mechanical sympathy - Martin Fowler

FAQ - some questions I've been asked

  1. How do you deal with previously reported values and make sure they will be the same in the future?

    The data model is bitemporal, so we can request any reporting data as at any prior time. The lockdown process design becomes simply storing a timestamp for each reporting period. Reporting can make use of lockdown timestamps to produce a complete view of prior-period adjustments with full details. Without a bitemporal data model this often becomes a reconciliation process, leading to further manual steps.
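
    As a minimal sketch, building on the illustrative asAt and positionAndProfit functions above, lockdown reduces to storing a timestamp and reusing the normal calculation over the as-at view:

        // Lockdown sketch: one stored timestamp per reporting period (illustrative fields).
        type Lockdown = {
            Period   : string
            LockedAt : System.DateTime
        }

        // A locked report is just the usual calculation over the trades as known at lockdown.
        let lockedReport (lockdown: Lockdown) (trades: RecordedTrade list) =
            trades |> asAt lockdown.LockedAt |> positionAndProfit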

  2. What about reported values changing due to code changes?

    Reporting data can be saved when key reports are generated and then used in regression testing. Regression testing of all reports against both the old and new code can also be automated. This is good practice for high-quality systems and is not difficult to implement.
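
    Here is a minimal sketch of such a regression check, assuming report output can be serialised to a string and a baseline file has been saved previously (both assumptions for illustration):

        // Compare a freshly generated report with a previously saved baseline file.
        let reportMatchesBaseline (serialise: 'Report -> string) (baselinePath: string) (report: 'Report) =
            let current  = serialise report
            let baseline = System.IO.File.ReadAllText baselinePath
            current = baseline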