I understand that SQL relational databases are the most common way of storing health data. I wonder if graph-based databases might be better suited to the highly connected nature of health data. Any thoughts/experience around this?
A handy reference from @NathanK:
The biggest issue with graph database usage structure will likely be the fundamental shift in analytics education that needs to apply to the workforce to consider multivariable relationships and attributes, with a baysian probablistic manner of the truth.
Health already suffers from lack of understanding in the richness of the data and a graph database, only if architected well, will likely benefit from this fact. Misdiagnositics and error correction is a major factor in health data so how that is managed will benefit on graph databases if one is able to use it.
In short, itâs not storage that is a problem but do we have enough of an understanding of the benefits and outputs (with caveats) that stops us from using graph databases. Graph databases is highly useful but so far only against certain clean production or relationship process data.
I look forward to reading this in detail. At first glance, I wonder if a better comparison would be with schemas prevalent in other reporting databases, such as data warehouses. I think that itâs well established, in IT as a whole, that the requirements of transactional and reporting databases are different. I also agree with the points that (I think) @SamuelWong makes that retaining context is key in the healthcare domain and the data quality challenges.
are you saying that a well-implemented graph database could give richer data insights and improve error correction/misdiagnostics?
Technically yes, but you must first account for the dirty data in the system, and mitigate them where possible. But the point I was making is that it really comes down to how information is collected, validated and stored in the correct context before architecture can have a view on how itâs managed in a database. We currently donât understand our data well enough to do that due to the fact that most of the time, itâs running off financial or contractual sourced datasets, and thatâs not true health datasets.
And this is precisely why Iâm such a fan of OpenEHR, which is a true health data structure (co-designed from the ground up with clinicians) which applications can be built on. It addresses the three key issues with the proprietary EHRs available (Epic, Cerner, etc):
-
Structured (persistent) true health data
-
Interoperability
-
Vendor lock-in
OpenEHR is technically sound but has limited business model to make it scale. If we donât have incentives for scaling, and a supportive ecosystem beyond informed clinicians, it makes it very hard to be embraced.
Hi Melanie
Iâd recommend caution here, for several reasons. Iâll list them, and then explain.
- Thereâs a widespread belief in healthcare that we are âspecialââand that we therefore have âspecial needsâ

- Iâd also suggest that even if you confine your attentions to people who actually work with and create databases, about 98% donât really get the term ânormalizationâ (or even ârelational integrityâ)
- All the fancy stuff has been tried before. Every couple of decades, someone comes up with the stupendous idea that âNoSQLâ or some variant is going to fix things. 'Cos #2 above.
Letâs look at this. First, a cute summary of where we (nominally) are. Then Iâll focus on 1â3, before ending with an explanation of what we really need, and why itâs not going to happen in my lifetime.

I love this picture.
What's the problem?
In 1970, Codd pointed out that if your thought processes and thus your database construction aren't properly joined up, anomalies arise. These lead to a host of problems, that increase in severity as the complexity of your data model increases.So the demonstrably correct way to build a schema is to normalize it. Iâm not going to go into the details here. But suffice it to say that nobody at IBM initially listened because they didnât understand what Codd was on about. Some of their smarter customers basically bullied them into building SQL.
Since then, any number of people have tried to finesse things because they still donât understand. So itâs easy. Anyone who cannot build a complex database in 3NF should not be allowed anywhere near anything more than the simplest database task.
But practically
Most people start with Excel, and then seem to graduate to 'flat file' databases. Then, when they build things in SQL without a properly normalised schema, things become ever more complex. So rather than going back to 1970 and reading Codd, they come up with other brilliant ideas that initially seem to impress, but come with a host of long-term liabilities. As we found out with NoSQL in the ten years after 2000.The solution is to use SQL, but do it right. Sadly, most people think the problem is with SQL, and not with them.
"But Medicine is special
A large problem with medicine is indeed that it's complex. The solution is twofold:- Get the basic processes right.
- Simplify appropriately.
The solution is not some widget or gadget or tool or indeed âthrow some AI at itâ.
The solution is to realise that the only thing really special about Medicine is that we think weâre special. It turns out that Deming (and earlier, Walter Shewhart) worked out what we need to do, respectively in 1950 and 1930. But we donât listen. If youâre interested in reading further, I cover this in a lot of detail in this overview.
How do we 'simplify appropriately'?
This isn't easy, or it would have been done, despite the warring factions both on the clinical side and on the information science side. But it's easy to see what we've got wrong---which gives a few pointers as to what we need to do right. Very briefly:- If you havenât got your head right (good processes, standard operating procedures, ways of checking whether things are going right or need to be fixed) then all sorts of dysfunctional behaviour arises. The commonest thing here is to set âtargetsâ (if people havenât heard of Goodhartâs law), or to shoot the messenger, or to encourage people to work harder (if theyâve ignored Deming).
- Medicine runs on causal thinking. This too is tricky. If you canât even record âwhat someone was thinkingââwhy they gave a particular drug or did some action, or what they thought the problem was, and why, then you have a huge problem. You also need a bit of subtlety hereâthe ability to record how sure you are (and change your mind). This is generally unsupported, so people fall prey to availability bias.
- People mix up the information layer and the functional layer. This means that programs are a mix of information and ways of looking at and managing that information. This never ends well.
- People hide their work. Many of the older, fiendishly expensive databases get a lot of the above wrong, so they invent proprietary reasons why you canât criticise them, and must sign NDAs.
- Everyone does things their own way (especially, mixing up the information and functional layers) so interoperability is impossible. If there are âcommon standardsâ, then they are often so flexible as to be useless, or everyone then goes out and implements their own variant.
- We donât even attach the same meanings to the same concepts. We therefore spend a lot of time arguing about minutiae while we continue to kill patients unnecessarily (often not even noticing this because our data are so bad).
So what do we do?
The temptation is just to cry a lot.The right thing to do is first to (a) Read and understand Deming; (b) Read and understand Codd; (c) Read and understand Judea Pearl, who explains the causality thing rather well. But this takes time. People would rather re-invent solutions that donât work.
The next thing to do is sit down with like-minded people and actually get the basics right. A common agreement on terms (data dictionary). An agreement to keep things simple and listen to everyoneâespecially the patient/health care consumer and people doing the hard yards on the front line. A commitment to making their lives easier using simple measures that can be built on because the foundation is right. Ruthless elimination of needless complexity and the resulting waste. Simple stuff that works.
To do this you need to make simple information models that embody what you now understand. These must be loosely coupled to the functional layer. They must be open-source, easily available, consistent, well-documented, and representationally adequate so you donât need to go and re-invent things every time something new comes up.
In simple terms, a new thing is turned into entries in the right tables, not a new table (a consequence of good normalisation). If you have more than about 200 tables, then youâre doing things wrong. Itâs worth noting at this point that Cerner has about 5â10,000 tables in its data suite; Epic has the equivalent of about 16,000. But with Epic, theyâre not even tables. So every night, they turn their structures into something that looks like SQL.
Epic, by the way, uses a b-tree based form of NoSQL because M(umps), the language itâs written in was invented before 1970.
So effectively, youâre asking for Epic.
Even if you have a few billion to spend, Iâd suggest thatâs less than wise.
But thatâs just my 2c, and heck, I could be wrong.
Dr Jo.
Love your detailed review.
This backs up your assertions:
After nearly 40 years as an IT Professional, I still believe that RDBMS & SQL are the best technologies for transactional databases. Perhaps, if openEHR had chosen them as their persistence and querying technologies, it would have been far more widely adopted.
However, IMHO, the woes surrounding digital healthcare applications have little to do with database technology. The overarching problem is designing applications that attempt to automate inefficient manual processes and serve the needs of dysfunctional healthcare systems.
The large US systems cited here have been primarily built to meet the exceedingly complex reimbursement requirements of a payer-centric healthcare system. In most other developed countries, the major vendor systems are provider-centric, and predicated on management & administration functions - such as scheduling - with little emphasis given to providing patient data that has uses beyond those for which it was initially collected.
Sadly, I donât see any of these things changing in the next decade, or so. The one hope is that the supply/demand crises that is crippling healthcare systems worldwide might lead to the development of a new breed of applications that would enable engaged consumers to collect & manage their own healthcare and wellness information without it being tethered to practice & hospital management systems.
Thanks for the engaging discussion. Sounds like the effort Iâve put into learning about SQL and normalisation was worth it. Interesting example of Epic.
This has been a very interesting discussion, and as Iâve been pondering these same questions over the past few months. What will our next PAS database be?
I can appreciate the value of visual abstraction for things we canât see easily in our existing SQL databases, like inferred relationships between patients, or episode and diagnosis patterns that might imply something important is being missed.
Recently we were casually informed by our PAS vendor that they would be desupporting the (Oracle SQL) product that weâve spent the last 24 years productively customising and investing functionality in. The viable alternatives? A handful of expensive, ancient North American do-it-all systems where the patient ID field is labelled âPayorâ, and none of the NZ-specific quirks and functions have been considered.
Three of these supposed alternatives have been evaluated recently for NZ public health, and all three use NoSQL back ends (albeit not graph databases), because that probably seemed like a good decision many decades ago when they were designed. This gives them good performance in a high transaction environment, but makes database development - and reporting/analysis - by customers tricky if not impossible (i.e. stuck with whatever the vendor lets them see/do).
As a system administrator and having seen how cryptic NoSQL databases are in the health context, I canât say Iâm keen on the idea of handing the keys over to the vendor and putting on the blindfold that they bring with them.
Hi Hamish,
It sounds like a challenging situation that youâre in. I can see that there are a lot of details to consider in terms of available products being fit for purpose.
I think this was the kind of concept I had in mind, with only my university-level discrete maths/SQL understanding to go on. It seems to me that FHIR itself is somewhat âgraphyâ and I understand that some vendors have been adapting FHIR to use for data storage as well. Any FHIR experts keen to elaborate/correct me here? ![]()
I guess as an end-user too, it seemed like so much of the information I wanted to see on a PMS was this relationship stuff. It felt like (more of a suspicion/intuition) I was always pushing the software to be querying things that felt like they should be core information from a clinical perspective. Does that make sense?
Melanie
The FHIR Specification contains a Graph Definition Resource, but itâs at a low level of maturity and Iâm not sure to what extent itâs used by production systems. There is also a lot of chat among the community about using GraphQL as a language to query FHIR resources.
As for using FHIR as a persistence format, the specification is agnostic on this and I have mixed feelings on this topic. The largest FHIR Store in NZ is probably the Medicines Data Repository (FWIW, I wrote the component that converts the NZePS CDA Documents to FHIR), but I think that itâs fair to say that some users are struggling to use the FHIR API for complex queries. The FHIR resources are transformed to a Snowflake Schema when the data is passed to the National Data Platform.
Within the FHIR Community, there is a highly active âSQL on FHIRâ group which has produced a specification for mapping FHIR Resources to a Relational Format.
Ultimately, this is all depends on requirements. The major advantage of persisting FHIR Resources is that the data is in an open standards-based format based on a transparent information model. Whether one favors that particular model (and the major cloud platform suppliers all do) is another discussion topic.
That does make sense, and some districts have implemented GUI-based analysis tools like Qlik Sense to make it easier for staff to slice and dice the data as they wish. However, this has required a team of data specialists to transform the relational (SQL) data into a new format that Qlik can work with.
I think many people find the GUI-based data presentation much easier to get stuck into, but we canât escape the fact that the systems which capture this data are necessariliy relational in nature, and IMHO work best and are more easily administered and developed when they have a relational back end.
Recently I helped implement a national FHIR-based system which has a CosmosDB (NoSQL) back end, and it has been very difficult indeed to glean any information from it other than what is presented on the GUI. Trying to ask questions of the data, like âhow many clinicians are registered in NZâ is a surprisingly difficult task involving the vendor and often requiring bulk data updates and data duplication before the query can even be run. Once the query is set up however, an end user can click a few buttons on the GUI to see this information; very convenient.
To answer the same question in our SQL database containing the exact same information would be a 2 minute task involving a SQL query joining three tables together. This scenario requires the user to have database access and enough knowledge of SQL and the relevant schema to write the query, so less accessible in some ways.
FHIR is one of the implementations that may help, but needs to account for the stakeholders who need to interrogate the datasets and the end-users. Having implemented transactional, relational, and NoSQL solutions in the past, technologists fail to take account of the business stakeholders that as @DrJo has pointed out, Medicine (or healthcare) IS special. Traditionally most service providers are B2B (Business to Business) or B2C (Business to consumer), or B2G (Business to government). However, in reality, any EHR solutions have the following business models to contend with B2G2F/I/R/2cP2scP2SP2C to make sense of the benefits of EHR/openEHR etc.
So what does that squiggle of a code B2X mean?
B= A business (Digital Supplier/altruistic organisation/startup/developer) has to go through;
G = Government policies/procurement/directives;
F/I/R = Funder/Insurer/Regulatory who define (often inconsistently and changes every contract cycle) what they want measured/incentivised/recorded for claims/activities/funded measures, and defines the guidelines of the day that is considered quality;
cP= Contracted provider - the entity that receives the contract for services with specified measures to be delivered (otherwise known as commissioning)
scP= Subcontracted provider - the one who runs the facility, employs the staff and manages the local relationships as a brand
SP = Service provider , otherwise known as the clinician, but also the clinical admin staff and anyone involved in recording the activity at the facility
C = consumer of the service = who has received a recording of a record that has to account for everything the G, F/I/R, cP, scP and SP has determined is important for them to have costs reduced to a minimal out-of-pocket expense, and then recorded towards activities the consumer and Service provider can act on.
Try interpreting the context in a graph database!
Hi Hamish,
Hmm yes complex. I suspect my end-user experience is more of a UI issue than a database one.
If I understand right: youâre saying that when it comes to health databases there are a lot of interested parties who would all like to develop the database but have limited access?
Hi Samuel,
Thanks for this summary of the business challenges in this arena!
Quite a few openEHR CDR solutions actually use an RDMS with SQL. Having built a system myself I strongly believe the physical persistence is a matter of choice and almost irrelevant to the users based on factors like licensing fees, performance and scaling etc. Iâll respond in more detail later. Good discussion.