Hyperscience Data Redaction Demo
This demo highlights how Hyperscience can automate the extraction and redaction of data from unstructured, handwritten documents, routing it through bespoke workflows to apply validation, enrichment, case collation, keyword search, or more advanced Natural Language Processing. See how Hyperscience can create an accurate, text-searchable PDF from an unstructured source, such as an image containing a combination of typed and handwritten text; redact information to automatically hide PII in the source document; and search for keywords in a handwritten letter to help identify and classify the type of document.
So we’re looking at the Hyperscience Platform, and we’ll get into the demo, but first a quick introduction: turning documents into data. And not just any data. This is the key here: turning it into efficient, meaningful data as fast as possible. It’s really about accuracy.
When we look at what is being done today, about 80% of all documents are in a human-readable format. This is typically tackled with OCR: building our own OCR solutions or using OCR technology to do character-by-character recognition in order to turn those documents into machine-readable data for processing.
What we get there is speed and efficiency. The negatives are typically that the machine lacks context, is often inaccurate, and is difficult to set up and maintain. With OCR technologies, every time you see a new layout you typically have to make an adjustment and add it in manually. And accuracy is bound by the character-by-character recognition: you’re only as good as the character recognition. If you have three characters correct out of a four-character word, the whole word is incorrect. There is no understanding of the context around the characters.
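The point about word accuracy collapsing under character-by-character recognition can be made concrete with a little arithmetic. This is an illustrative sketch only, assuming independent character errors (a simplification): with per-character accuracy p, a word of n characters is fully correct with probability p**n.

```python
# Illustrative sketch: why character-by-character OCR accuracy understates
# word-level errors. Assumes independent character errors (a simplification
# for illustration, not a claim about any particular OCR engine).

def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character in a word is recognised correctly."""
    return char_accuracy ** word_length

# 98% per-character accuracy sounds high, but for a 10-character word
# roughly 1 in 5 words comes out wrong:
print(round(word_accuracy(0.98, 10), 3))  # 0.817
```

This is why field-level recognition, which scores the whole word or field, behaves so differently from per-character scoring.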
The cases we can typically solve for involve printed text. Very rarely are we able to cater for handwriting and poor-quality documents in the context of extraction. The things we fail at, we throw to a human for manual intervention, which is great because we now have context and decisions can be made by the human. It is highly accurate, typically around 98%. But now we lose speed: we’re slow, expensive, and lack scale.
So we really are at an impasse when we’re trying to work with all cases and turn documents into data as fast as possible, with accuracy. The flow-on effect downstream involves risk: operational risk, reputational risk, as well as financial risk around the inaccuracies of what you’re producing downstream, which may have to be redone.
Hyperscience puts the human right at the center. The human in the loop is well and truly part of the process from an end-to-end perspective. With Hyperscience, we can dial in the accuracy target we’re looking to achieve, typically around 99-point-something percent. We then let the machine automate classification to hit that target based on its confidence. Every prediction has a confidence score, and because Hyperscience uses machine learning technology, we can build models out so the machine operates at a known level of confidence. Anything it isn’t confident enough about to hit our accuracy target flows down to a human in the loop. Even from day one, with a human in the loop, we can still lower average handling times purely because of the process improvements in the technology and the user interface.
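The dial-in-a-target idea boils down to thresholding on model confidence. A minimal sketch, with hypothetical names and thresholds (the transcript does not show Hyperscience's actual APIs): predictions above the threshold are automated, everything else is queued for a human-in-the-loop supervision task.

```python
# Minimal sketch of confidence-based routing. All names and the 0.99
# threshold are illustrative assumptions, not Hyperscience internals.
from dataclasses import dataclass

@dataclass
class Prediction:
    field: str
    value: str
    confidence: float  # model confidence, 0.0 - 1.0

def route(predictions, threshold=0.99):
    """Split predictions into automated results and human-review tasks."""
    automated, human_review = [], []
    for p in predictions:
        (automated if p.confidence >= threshold else human_review).append(p)
    return automated, human_review

preds = [
    Prediction("name", "Jane Citizen", 0.997),     # confident: automate
    Prediction("date_of_birth", "12/03/1984", 0.95),  # not confident: human
]
auto, review = route(preds)
print(len(auto), len(review))  # 1 1
```

Raising the threshold trades automation rate for accuracy, which is exactly the dial described above.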
First of all, we have classification, then identification of fields. Hyperscience works at the field level. Think of payslips or invoices, where we’re looking for specific information that may appear in different formats. A machine learning model will typically be able to find those fields even on unseen documents: new invoices and new pay stubs it has never seen before. It will identify those fields very accurately, and the ones it can’t, it will send down to a human in the loop.
Same with transcription: the machine will try to hit that 99% accuracy target. Hyperscience has built-in data-type models that have been pre-trained with hundreds of millions of names, dates, and addresses. So we’re looking at the whole name, the whole address, as a complete field. And again, anything below the bar flows down to the human in the loop to meet that 99% accuracy target.
At this point, we have 99% accurate data to do something with. The Hyperscience Platform also has the ability to do validation, so it can make API calls or hit databases. For instance, if a passport has been scanned into the submission along with an application form and we want to check that the passport number is valid against what’s been written on the application form, we can do that validation within the submission. We can go out to an API and validate things. And once again, we can add a human in the loop to that validation, so that by the time we get down here, we have highly accurate, complete, machine-readable data that has passed through validation points that humans would typically have handled.
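The passport example above amounts to a cross-document consistency check with escalation. A hypothetical sketch, with illustrative function and field names (not Hyperscience APIs): compare the number extracted from the application form with the number read from the scanned passport, and escalate any mismatch or missing value to a human.

```python
# Hypothetical cross-document validation step. Field names and the
# "valid" / "human_review" outcomes are illustrative assumptions.

def validate_passport_match(form_fields: dict, passport_fields: dict) -> str:
    """Compare the passport number on the form with the one on the scan."""
    form_no = form_fields.get("passport_number", "").replace(" ", "").upper()
    scan_no = passport_fields.get("passport_number", "").replace(" ", "").upper()
    if not form_no or not scan_no:
        return "human_review"  # missing data: escalate rather than guess
    return "valid" if form_no == scan_no else "human_review"

print(validate_passport_match(
    {"passport_number": "PA 1234567"},   # from the application form
    {"passport_number": "PA1234567"},    # from the scanned passport
))  # valid
```

The same pattern generalises to database or API lookups: any check that fails or is inconclusive routes to the human-in-the-loop task instead of passing silently.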
Not only that, Hyperscience has the notion of QA, an asynchronous task that allows the answers put through by the humans to be QA’d and then compounded to build a stronger model. In other words: was the machine correct? If so, the next time it sees something similar it can be more confident in transcribing or identifying it. It’s this notion of continuous improvement that separates what you get on day one from what you get on day 10 or day 20.
This is the Hyperscience Platform. Before I jump into the demonstration, we can see what we get as output. We combine that with out-of-the-box integrations with many different systems. We have things like case collation and reporting, and we can also add custom code blocks using the flow studio. That means we can mix and match and add steps within the flow to complete the work.
Once again, with Hyperscience, in general you’ve got three types of documents: structured (forms with fixed fields), semi-structured (invoices, bank statements, or pay slips, where we know the fields we’re looking for but layouts can differ), and unstructured. The third is what we’ll be looking at in the demo today: totally unstructured, plain-text or handwritten documents with no fixed structure, where we may want to do something like a full-page transcription and then do certain things with the data we’re transcribing accurately.
This is the Hyperscience system. It’s web-based and can be deployed on-premises or in a private cloud on AWS or Azure; we have a SaaS model as well. Immediately you can see the visible tasks. These are the different supervision tasks we just ran through: classification, identification, and transcription. Here we can restrict certain users to see, do, and validate only what we want them to. We also have the QA task we spoke about. We can have specific supervisors or data keyers go through and do the QA to make sure the answers are accurate, so the machine can compound those results into a stronger ML model when we release it.
If we have a look at what we can do with flows: we can create as many flows as we like for specific use cases, tie specific documents and releases to those flows, and allow very complex outcomes from an end-to-end perspective. Looking at what that looks like within Hyperscience, we can have many different inputs, anything from API ingestion and messaging queues to folder sweeps and email ingestion, and then run through the building blocks we have. As we showed before, we have machine classification and layout validation, and then we can run through this complex flow, one that can be designed end to end and tailored to that specific use case.
This one is quite complex: we’re using things like database lookups, validating against an internal database. Does a customer exist? Does a policy number exist? Yes; let’s validate and pull down the name and address from the system. It’s a complex workflow, but again, we’re able to specify the target accuracy we’re trying to achieve. Conversely, we can do things that are much simpler. If we have a look at our redaction flow, Hyperscience actually has a redaction feature within the system, so we have a redaction flow, and this one is a lot simpler. You can be as complex or as simple as you need to be.
We’ll look at redaction as the demonstration here. Here in Australia we have a tax file number (TFN) issued by the Australian Taxation Office (ATO), and what we’re doing here is validating that tax file number, making sure it is a valid TFN based on the ATO’s check-digit algorithm.
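For reference, a sketch of the commonly documented TFN check-digit algorithm: each digit is multiplied by a fixed weight (1, 4, 3, 7, 5, 8, 6, 9, 10) and the weighted sum must be divisible by 11. Note that the transcript describes this as a "mod 10" algorithm, so treat the exact weights and modulus here as an assumption drawn from public references rather than from the demo itself.

```python
# Sketch of the publicly documented 9-digit TFN weighted-checksum test.
# Weights and the mod-11 rule are assumptions from public references;
# the transcript itself refers to a "mod 10" algorithm.

TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)

def is_valid_tfn(tfn: str) -> bool:
    """Return True if the string contains a 9-digit TFN with a valid checksum."""
    digits = [c for c in tfn if c.isdigit()]
    if len(digits) != len(TFN_WEIGHTS):
        return False
    total = sum(int(d) * w for d, w in zip(digits, TFN_WEIGHTS))
    return total % 11 == 0

print(is_valid_tfn("123 456 782"))  # True  (a commonly cited test value)
print(is_valid_tfn("123 456 781"))  # False (one digit different)
```

A checksum like this is what lets the flow redact only genuine TFNs rather than every nine-digit number on the page.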
Looking at a sample document, this is what we mean by outlier use cases: the kinds of varied cases you’re limited on when you start trying to do things with OCR technology. Hyperscience has really great segmentation to identify the words and entire fields and then do something with them. Here we have ‘here is my tax file number.’ This one, however, is not a valid tax file number; there’s a one-digit difference. And down below, we also have another valid tax file number. So when we look at what Hyperscience achieves, we have a very accurate transcription of the information, and we’re now able to do something meaningful with the extracted data. In this instance, if we look down, Hyperscience has all that information available in a JSON output file by default.
All the information is there, including, if you want them, snippets of the image segments that have been transcribed. And right down the bottom, if we click on the link, we have the ability to see the redacted image. In this simple flow we’re doing a full-page transcription, identifying the numbers that are there, and validating them not only against, say, a regular expression, but against the ATO’s check-digit algorithm, redacting only what’s necessary and not over-redacting what we shouldn’t.
That’s one simple use case. There’s another one I’ll bring up. This is a typical scenario where you’ll have a scanned document with writing in and around the edges and down the bottom. Once again, if we have a look at how Hyperscience transcribes it, it is very good at picking up everything on the page, including the handwriting. This is another outlier use case of what Hyperscience can achieve over and above a lot of other solutions out there. We can now say we have the handwritten information transcribed throughout the document. This is useful when we’re producing a full-page, text-searchable PDF as the output and we want to send that to another system, or have it made text-searchable and put back into a repository.
The last use case we’ll look at here: again, look at the writing and the style. If we were to apply OCR to this character by character, you tell me what you see here, and what the outcome would potentially be looking at just the characters themselves, let alone that it’s handwriting. Because we’re running whole words, and when we’re segmenting we’re able to get a confidence score on the words themselves, what we may be looking at doing here is searching for the words ‘complaint’ and ‘legal’ in order to identify and classify the type of document we’re looking at. So even in a totally unstructured sense, we’re able to get accurate data, an accurate transcription of the information, and then do something meaningful with it: in this instance, looking for keyword hits on ‘complaint’ and ‘legal’ and acting on them downstream.
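The keyword-driven classification step can be sketched as a simple pass over a word-level transcription with per-word confidences. Everything here is illustrative (the keyword-to-class mapping, the confidence threshold, and the input shape are assumptions, not the platform's actual output format):

```python
# Illustrative sketch: classify a full-page transcription by keyword hits.
# Input shape (word, confidence) and all names/thresholds are assumptions.
import re

KEYWORDS = {"complaint": "complaint_letter", "legal": "legal_notice"}

def classify(words, min_confidence=0.9):
    """words: list of (text, confidence) pairs from the transcription."""
    hits = set()
    for text, conf in words:
        token = re.sub(r"\W", "", text).lower()  # strip punctuation
        if conf >= min_confidence and token in KEYWORDS:
            hits.add(KEYWORDS[token])
    return sorted(hits) or ["unclassified"]

transcription = [("Formal", 0.98), ("complaint:", 0.93), ("I", 0.99),
                 ("will", 0.97), ("seek", 0.95), ("legal", 0.96),
                 ("advice.", 0.94)]
print(classify(transcription))  # ['complaint_letter', 'legal_notice']
```

Because the matching runs on whole, confidence-scored words rather than raw characters, a single misread character does not silently turn ‘legal’ into a miss; low-confidence words can instead be routed to review.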
These are some of the use cases we have available. This opens up what we call the art of the possible within Hyperscience: the ability to work on very complex flows and simplified flows, and to work with printed documents, semi-structured documents, handwriting, and poor quality, covering the majority of the outlier use cases out there and producing some really great results.