OCR / data analysis

Does anyone here know of a tool/service that can:

  • recognise data in tables in PDF format
  • transfer it to tables in an evaluable format
  • analyse the data as required

I have hard copies of clinical trial CSRs (clinical study reports) scanned into PDFs. I would like to pool data from several CSRs and analyse the combined data against various research questions.

Thanks

That is tricky - you actually need some pretty solid Optical Character Recognition (OCR) to achieve that. And for complex information like that it will take some serious horsepower.

There are several ad-supported online tools that can do this, but their accuracy varies, and so does how far they can be trusted with your data. Apparently OneNote can do OCR, so you might be able to do this within your organisational environment; you really don’t want to put private / sensitive info like that somewhere unsafe.

Any premium generative AI (i.e. paid account) should be able to help with assembling stuff like that once you’ve got the text extracted - or maybe even extract the text as well.

Bumped into McCrae Tech at the AI meeting. I believe they are close to finalising a solution.

I recently used a couple of online generative AI tools to do just this with non-patient-related information (PDFs of protocol tables). With the right prompt, Gemini in particular did a stellar job and created some very clean JSON.

But… if any of the content is sensitive you would need to use a privately hosted model, and validating the output could take a lot of time…

That’s very insightful, thank you. Yes, it will need to be a privately hosted approach, even though any personally identifying information will be redacted. I am anxious about the validation though; I wonder whether random sampling will suffice or whether some other approach is needed.
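
On the random sampling idea, one possible starting point is to draw a reproducible random sample of extracted cells and have a person check each one against the scanned page. Below is a minimal Python sketch, assuming the extracted data has already been flattened into (CSR, table, row, column, value) records. The record layout, the function names and the sample size are all illustrative assumptions, not part of any tool mentioned in this thread, and whether simple random sampling is sufficient for a given study is a question for a statistician.

```python
import random

def draw_audit_sample(cells, sample_size, seed=0):
    """Return a reproducible simple random sample of extracted cells."""
    rng = random.Random(seed)          # fixed seed so the audit can be re-run
    k = min(sample_size, len(cells))   # never ask for more cells than exist
    return rng.sample(cells, k)

def observed_error_rate(checked):
    """checked: list of booleans, True where the extracted value matched the scan."""
    if not checked:
        raise ValueError("no cells were checked")
    return checked.count(False) / len(checked)

# Hypothetical extracted cells: (csr_id, table, row, column, value)
cells = [("CSR-%02d" % d, "T1", r, "n", str(10 * d + r))
         for d in range(1, 4) for r in range(1, 11)]

# Cells to hand to a human checker, who records True/False for each one
sample = draw_audit_sample(cells, sample_size=5, seed=42)
```

The checker's True/False results can then be passed to `observed_error_rate` to get the error rate seen in the sample.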

Still, the process is likely to be less onerous than a person sitting in a windowless room transcribing the hard-copy data onto an evaluable digital platform :nerd_face:

Hi, there are tools already in use in HNZ that should be able to do this. The Waikato team are using TotalAgility software that should have this capability and there are other regions where the software is also in use. Let me know if you want me to connect you to the Waikato team :slight_smile:

Gemini Flash is the cheapest/fastest/best value in my experience.

There are OCR/cost leaderboards like https://idp-leaderboard.org/. Look out for the open-source models if you want to run locally.

I suggest trying each and benchmarking for yourself.
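
A rough way to benchmark for yourself is to hand-key one or two tables as a reference, then score each tool's output cell by cell. Below is a minimal Python sketch, assuming every tool's output has already been converted to a list of rows of strings. The tool names and table values are made up for illustration, and exact string matching is only one possible scoring rule.

```python
def cell_accuracy(reference, extracted):
    """Fraction of reference cells reproduced exactly (after trimming whitespace)."""
    total = 0
    correct = 0
    for r, ref_row in enumerate(reference):
        ext_row = extracted[r] if r < len(extracted) else []
        for c, ref_cell in enumerate(ref_row):
            total += 1
            ext_cell = ext_row[c] if c < len(ext_row) else None
            if ext_cell is not None and ext_cell.strip() == ref_cell.strip():
                correct += 1
    if total == 0:
        raise ValueError("reference table is empty")
    return correct / total

# Hand-keyed reference table (hypothetical values)
reference = [["Arm", "N", "Mean"],
             ["Placebo", "50", "12.4"],
             ["Active", "48", "10.1"]]

# Hypothetical outputs from two tools for the same scanned table
outputs = {
    "tool_a": [["Arm", "N", "Mean"], ["Placebo", "50", "12.4"], ["Active", "48", "10.1"]],
    "tool_b": [["Arm", "N", "Mean"], ["Placebo", "5O", "12.4"], ["Active", "48"]],
}

scores = {name: cell_accuracy(reference, table) for name, table in outputs.items()}
```

Note that this scoring only counts missing or wrong cells; it does not penalise extra rows or columns a tool invents, so a fuller benchmark would check table shape as well.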

Can we access Gemini models through some kind of enterprise arrangement with Health NZ?

Thanks, Nuala. I may well take you up on that offer of introduction. Discussing with a Kiwi outfit right now… let’s see how it goes.

I don’t think so - it is Copilot (i.e. OpenAI) all the way in Health NZ unless there is a special arrangement (such as the one alluded to by @NualaFitz above).

My advice is, if you are looking at a solution, make sure you understand not just how accurate it is but, perhaps more importantly, what the user interface is like for managing the ‘exceptions’. In my experience this is critical to making the solution as efficient as possible to use. Reach out for a chat if you think it would help; I’m happy to talk you through my learnings and experience.
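
To make the ‘exceptions’ idea concrete: one simple pattern is to run plausibility rules over the extracted rows and send anything that fails to a review queue, so a person only looks at the doubtful cases. Below is a minimal Python sketch, assuming each row is a dict with an event count `n` and an arm size `N` as strings. The field names and the two rules are illustrative assumptions, not how any particular product works.

```python
def find_exceptions(rows):
    """Split rows into (clean, exceptions); each exception carries a reason."""
    clean, exceptions = [], []
    for row in rows:
        reason = None
        try:
            n = int(row["n"])
            total = int(row["N"])
        except (KeyError, TypeError, ValueError):
            reason = "count missing or not a whole number"
        else:
            if n < 0 or total < 0 or n > total:
                reason = "n outside the range 0..N"
        if reason is None:
            clean.append(row)
        else:
            # Keep the original fields and attach why the row needs review
            exceptions.append({**row, "reason": reason})
    return clean, exceptions

# Hypothetical extracted adverse-event rows
rows = [
    {"csr": "CSR-01", "event": "Headache", "n": "12", "N": "50"},
    {"csr": "CSR-01", "event": "Nausea", "n": "l2", "N": "50"},   # OCR read 1 as l
    {"csr": "CSR-02", "event": "Fatigue", "n": "61", "N": "48"},  # impossible count
]

clean, exceptions = find_exceptions(rows)
```

Here only the Headache row passes; the other two land in `exceptions` with a reason attached, which is the list a reviewer would work through.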
