Data Denormalisation in Health

Hmmm - first time I’ve come across that concept!

My one line definition:

The deliberate duplication of data in order to make specific application processes simpler and faster.

This is the best explainer I could find:

In short, we’ve been doing this like crazy in health for decades, with each application typically having its own database with a local copy of data (such as patient demographics, etc). Great for making an application fly quickly in a local context. Terrible for the system as a whole!

Would love to have the thoughts of someone who actually understands data!

Without delving into the weeds of database design, I assumed that @lara was adopting a “folk” usage. I read her statement as saying that we can all agree that having multiple overlapping, semi-duplicated datasets with non-homogeneous data dictionaries is a bad idea. Which it is, regardless of one’s stance or perspective.

While happy to generally geek out on what denormalisation means more formally or the processes that seemingly inevitably drive it in health, I will avoid either boring the majority or mansplaining to the minority.


A few thoughts on (de)normalization

Since this important topic has come up, I thought I’d give my 2c worth…

Back in the 1970s, while working at IBM, British scientist Edgar F. Codd thought about certain problems that arise when we store data. He saw issues (‘anomalies’) with the insertion, updating and deletion of data in context—three problems that compromised data integrity.

His fix was the ‘relational model’. The central idea is that, to prevent the above anomalies, data need to be structured so that each fact is stored once and related to other facts explicitly. The process of getting there is called ‘normalization’. Here is part of the abstract of Codd’s original paper. It’s worth a careful read:

“Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation). A prompting service which supplies such information is not a satisfactory solution. Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed. Changes in data representation will often be needed as a result of changes in query, update, and report traffic and natural growth in the types of stored information. Existing noninferential, formatted data systems provide users with tree-structured files or slightly more general network models of the data. In Section 1, inadequacies of these models are discussed.” [my emphasis]

I find it fascinating that, when Codd presented his paper to IBM, his ideas were not taken on board. Only when customers understood the benefits of Codd’s model and pressured IBM did they come up with the first version of SEQUEL, later restyled ‘SQL’ (Codd was unhappy). Now, again, we see a large number of companies espousing tree structures and ‘slightly more general network models of data’!

What's the problem, then?

Let’s begin with the problems, and the benefits of Codd’s model. At the very end of this post we’ll discover just how broad the implications of this model are; but we’ll start simply.

Whichever way you arrange hierarchical or network models, problems arise. These come about because we are also modelling relationships. These relationships can be represented mathematically as groupings or ‘tuples’. It turns out that some simple rules allow us to group things in ways that prevent the three classes of anomalies I described above. The rules describe the process of ‘normalization’.
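To make those anomalies concrete, here’s a tiny sketch in Python. The employee and department values are invented purely for illustration (they’re not from Codd’s paper); it shows an update anomaly, where the same fact stored in several rows drifts out of agreement.

    # Illustrative sketch only: an update anomaly in an un-normalised table.
    # One flat table, so the department's location is repeated on every row.
    employees = [
        {"emp_id": 1, "name": "Ada",   "dept": "Pathology", "dept_location": "Block A"},
        {"emp_id": 2, "name": "Grace", "dept": "Pathology", "dept_location": "Block A"},
    ]

    # Pathology moves to Block C, but only one row gets updated...
    employees[0]["dept_location"] = "Block C"

    # ...and the data now contradicts itself: two "truths" for one fact.
    print({row["dept_location"] for row in employees if row["dept"] == "Pathology"})
    # -> {'Block A', 'Block C'}

    # The normalised arrangement stores the fact once and refers to it,
    # so there is nothing to drift out of agreement.
    departments = {"Pathology": {"location": "Block C"}}
    employees_normalised = [
        {"emp_id": 1, "name": "Ada",   "dept": "Pathology"},
        {"emp_id": 2, "name": "Grace", "dept": "Pathology"},
    ]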

But before we move on to the rules, we need to establish some terminology that will help us. The first idea is that of a ‘primary key’ (PK). This is an identifier or combination of identifiers that uniquely identifies a tuple. In other words, using the primary key, you can immediately find a specific row in a database table. It also turns out that, if you build things properly, you need to link tuples to other tuples. Thinking in terms of tables: a second table refers to the primary key of the first by holding a copy of that key in one of its own columns, called a ‘foreign key’.
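As a minimal sketch of how this looks in practice, here is an example using Python’s built-in sqlite3 module. The table and column names are invented for illustration and aren’t drawn from any particular system.

    # Minimal primary-key / foreign-key sketch using Python's sqlite3 module.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

    con.execute("""
        CREATE TABLE employee (
            emp_id INTEGER PRIMARY KEY,   -- the primary key: uniquely identifies a row
            name   TEXT NOT NULL
        )
    """)
    con.execute("""
        CREATE TABLE child (
            child_id INTEGER PRIMARY KEY,
            emp_id   INTEGER NOT NULL REFERENCES employee(emp_id),  -- foreign key
            name     TEXT NOT NULL
        )
    """)

    con.execute("INSERT INTO employee VALUES (1, 'Ada')")
    con.execute("INSERT INTO child VALUES (10, 1, 'Sam')")    # fine: employee 1 exists

    # This fails, because there is no employee 99 for the foreign key to refer to.
    try:
        con.execute("INSERT INTO child VALUES (11, 99, 'Alex')")
    except sqlite3.IntegrityError as e:
        print("Rejected:", e)   # FOREIGN KEY constraint failed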

An example

Here’s Codd’s original example:

[Figure: Codd’s unnormalised ‘employee’ record, with nested jobhistory, salaryhistory and children groups]

It’s from: https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

Now we can talk normalization. There are a number of levels of normalization, but we generally find three sufficient. To understand these more easily, look at the above figure. How might you turn this into data tables?

One naïve way is to have an EMPLOYEES table with the specified entries. But how do we (for example) represent multiple children of one employee? We’re compelled either to pack the children into a single field—imposing internal structure on that field—or create multiple columns, one per child. A much smarter way is to have a CHILDREN table that has a row per child, and relates each child back to their parent. A similar approach turns the nested jobhistory and salaryhistory groups into JOBHISTORY and SALARYHISTORY tables of their own. We apply three levels of normalization:

First normal form (1NF) removes multiple values packed into one column, and duplicated columns—as with the ‘children’ in the example above, and also the ‘jobhistory’ group. Second normal form (2NF) is a bit more tricky: if a table is already in 1NF and its primary key is a single field, it’s automatically in 2NF; otherwise there’s the possibility that some fields depend on only part of a composite key, which needs to be fixed. Finally (for us) we have third normal form (3NF), which removes ‘transitive dependencies’—cases where one non-key field depends on another non-key field, and not just on the PK. If this has left you thoroughly confused, the Wikipedia page at https://en.wikipedia.org/wiki/Database_normalization has some simple examples.
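Pulling this together, here is a hedged sketch of how the employee record above might land in (roughly) third normal form. The table and column names are my own guesses for illustration, not Codd’s exact schema, written as SQLite DDL via Python.

    # A hedged sketch of the employee example in (roughly) third normal form.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE employee (
            emp_id    INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            birthdate TEXT
        );

        -- 1NF: children are a repeating group, so they get a row-per-child table
        -- instead of being packed into one employee field.
        CREATE TABLE child (
            emp_id    INTEGER NOT NULL REFERENCES employee(emp_id),
            name      TEXT NOT NULL,
            birthyear INTEGER,
            PRIMARY KEY (emp_id, name)
        );

        -- 3NF: a job title depends on the job code, not on the key (emp_id, jobdate).
        -- That is a transitive dependency, so titles live in their own table.
        CREATE TABLE job (
            jobcode  TEXT PRIMARY KEY,
            jobtitle TEXT NOT NULL
        );
        CREATE TABLE jobhistory (
            emp_id  INTEGER NOT NULL REFERENCES employee(emp_id),
            jobdate TEXT NOT NULL,
            jobcode TEXT NOT NULL REFERENCES job(jobcode),
            PRIMARY KEY (emp_id, jobdate)
        );

        -- 2NF: salary varies with salarydate, so it belongs here. A job title, by
        -- contrast, would depend only on (emp_id, jobdate), i.e. part of this key,
        -- so keeping it out of this table keeps the table in second normal form.
        CREATE TABLE salaryhistory (
            emp_id     INTEGER NOT NULL,
            jobdate    TEXT NOT NULL,
            salarydate TEXT NOT NULL,
            salary     REAL NOT NULL,
            PRIMARY KEY (emp_id, jobdate, salarydate),
            FOREIGN KEY (emp_id, jobdate) REFERENCES jobhistory(emp_id, jobdate)
        );
    """)

    print("tables:", [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")])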

The bottom line, though, is that this is not quite that simple—but if you structure the data correctly, you also remove the anomalies already mentioned.

Finally, denormalization

You can see that denormalization can happen in two ways:

  1. A designer can fail to understand normalization. Basically, design badly.
  2. Someone can deliberately introduce denormalization.

As so many databases are built badly—in ways that are inimical to normalization—it might surprise you that someone who understands normalization would ever do this deliberately. There are, however, two reasons to denormalize:

  • To make things more efficient.
  • To build in checks and balances.

The first is often either a lie (post hoc justification) or a decision that will have data-damaging long-term consequences. A case can be made for denormalization precisely where it doesn’t matter: the data structures are so trivial that large volume transactions are the only thing that really counts.

But let’s consider one case where a careful denormalization adds both efficiency and a useful check. A clinical example; a patient-centric example. Imagine a properly normalized database where there’s a long path of joins from the patient identity to some distant, large table T. If you add a patient-identifying field to T, this allows you to extract rows directly from the table—without all of those joins. A side-effect is that you can also partition (shard) the table to make data extraction even more efficient. But here’s the thing. It’s not unknown for data insertion mechanisms to go awry—and the last thing you want is information attached to the wrong patient. Here we have a check. Redundant, but reassuring.
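Here is a sketch of that kind of careful denormalisation. The schema, names and values are invented for illustration (not taken from any real clinical system), and the join path is shortened to a single hop to keep it readable.

    # Sketch of a deliberate, redundant patient_id column on a distant table,
    # plus the consistency check that the redundancy makes possible.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE patient   (patient_id   INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE encounter (encounter_id INTEGER PRIMARY KEY,
                                patient_id   INTEGER REFERENCES patient(patient_id));

        -- The distant, large table T: normally you would reach the patient only
        -- via the encounter join. The extra patient_id column is the deliberate,
        -- redundant denormalisation.
        CREATE TABLE observation (
            obs_id       INTEGER PRIMARY KEY,
            encounter_id INTEGER REFERENCES encounter(encounter_id),
            patient_id   INTEGER REFERENCES patient(patient_id),  -- redundant copy
            value        TEXT
        );

        INSERT INTO patient   VALUES (1, 'A'), (2, 'B');
        INSERT INTO encounter VALUES (100, 1);
        INSERT INTO observation VALUES (5000, 100, 1, 'Hb 132');   -- consistent
        INSERT INTO observation VALUES (5001, 100, 2, 'Na 140');   -- gone awry!
    """)

    # Benefit 1: extract a patient's rows from T directly, with no joins.
    print(list(con.execute(
        "SELECT obs_id, value FROM observation WHERE patient_id = 1")))

    # Benefit 2: the redundancy doubles as a check. Any row returned here has
    # been attached to the wrong patient somewhere along the way.
    print(list(con.execute("""
        SELECT o.obs_id
        FROM observation o
        JOIN encounter  e ON e.encounter_id = o.encounter_id
        WHERE o.patient_id <> e.patient_id
    """)))   # -> [(5001,)]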

But who cares, now?

Three arguments will be advanced that suggest all of the above is irrelevant. The first is that modern machines are so fast, and memory so cheap, that such careful consideration no longer matters. This is silly—because the relationships are still there. All you’re doing is making more voluminous mistakes more rapidly.

The second argument seems more subtle. It goes: (a) We’re not worried about deletion anomalies because we don’t delete anything; (b) As for update anomalies, we make new documents, we don’t update old ones; and finally (c) insertion effectively means just updating the document description, so no worries, right?

Here’s the thing. The relationships are still there. Hiding them in documents and obfuscating things using metasyntactic legerdemain serves only to confuse. At some point you will still need to reconcile things—and that’s where the pain begins.

The third counter-argument concerns “AI” or machine learning. It runs like this: “We can amalgamate unstructured data in vast volumes—and then all we need do is apply some AI and all the relationships will materialise.” No problem.

Why is this argument unwise? (It is.) Because ML works at the level of association: it has no internal model of causation, or indeed of counterfactuals. So if you don’t encode these things, they won’t magically arise post hoc. (There are other issues, but I won’t cover them here.)

So what?

All of the above might seem irrelevant to clinical medicine. But consider some of the fundamental problems we currently have.

  • Clinicians are overloaded. A lot of this is digital overload.
  • Medical records—even digital ones—are often complex and poorly organised.
  • Computerisation often does not address defective medical workflow. At worst, we make the same errors more quickly.
  • Good clinical reasoning is built around causality. We don’t treat symptoms: we treat causes. We build complex, causal models and test these against reality. We explore counterfactuals.

A lot of our overload is repeatedly doing the same work—and re-working things we should not need to re-work. A lot of the disorder is because we haven’t imposed structure on the data we have. A lot of the workflow problems are because the right data aren’t in the right place at the right time; or because clinician A thought ‘this’ and clinician B thought ‘that’ and these thoughts were never reconciled. Taking relationships and clinical hypotheses and flattening them into ‘pure data’ often destroys the tuples we should be preserving.

The solution is often just—remarkably—normalization. Some smart people have realised that this is not confined to databases. In Estonia about 20 years ago, the authorities decided that their nascent computerisation of health care needed to be done right. They have indeed done remarkably well. How did they achieve this? They established a principle: that government agencies can legally ask for a given datum just once. This in turn forced government agencies to co-operate in at least three meaningful ways: robust communication, proper data association, and not allowing data to get out of synch.

In a word: normalization. Burnt into the system.

Dr Jo.


Many thanks, @DrJo

That’s a really good summary of normalisation, so thanks for sharing. To add a couple of minor points: you have nicely demonstrated how the complexity of database design scales faster than the size of the database. Choices must be made, and denormalisation as a design choice prioritises table reads over table writes, which usually translates into a faster system for most end users.

This also illustrates what Digital & Data brings to the clinical informatics table—the knowledge to sustainably deliver a performant computing experience at scale. Decisions about database design dramatically affect the end-user experience, and that is the work of the data architect team. To @lara’s point, this is where the value of teamwork lies: clinicians have clinically relevant ideas but can’t scale them, while the Data & Digital team can scale but doesn’t have the clinical insights. The wheels fall off when each does the other’s job because it is “easy”, and all you need to do is [insert Dunning-Kruger level statement]. It is also where Clinical Informaticians are critical, bridging two worlds that speak different languages.

Thanks again for taking the time to outline denormalisation.
