Top Trends Shaping Automation and IDP in 2025 and Beyond, Featuring Forrester
Join Hyperscience for a forward-looking webinar on the top trends shaping automation and intelligent document processing (IDP) in 2025 and beyond, featuring guest speaker Boris Evelson, Vice President & Principal Analyst at Forrester.
Discover how organizations are shifting toward human-centered automation in the back-office, prioritizing AI compliance for trust and transparency, leveraging data-centric approaches for smarter business decisions, and unlocking the untapped potential of unstructured data in the age of GenAI. With data-driven insights, practical use cases, and actionable strategies, this session is a must-watch for those looking to stay ahead in an AI-driven economy.
Gain a strategic edge and discover how Hyperscience is redefining how enterprises unlock the full potential of their data to modernize back office processes.
Brian Weiss: Hello everyone, and thank you for joining. Welcome. Let me introduce myself: I’m Brian Weiss, CTO of Hyperscience. I have Boris Evelson here joining me from Forrester. We are gonna spend some time today talking about the top trends shaping automation and IDP going forward. It’s gonna be a little bit of looking back and also trends going forward. I’m really thrilled to work with Boris on this. I particularly admire Forrester’s technology-oriented approach to this market. Boris has covered everything from deep text analytics to document mining itself, from a technology perspective, for many years. Boris, let me hand it off to you to talk a little bit about how Forrester does research, some of the work you’ve already done historically, and what you’re working on going forward.
Boris Evelson: Thank you for having me. I have been covering this market for many, many years. I’ve been covering the market of structured data and analytics for much longer because that has been feasible for over four decades. But in terms of doing unstructured data analytics and document analytics at scale with reasonable ROI, that’s probably been possible for about 10 years. Let me set the stage by explaining to everyone how we cover this very interesting market. First of all, there are some differences between mining data and then extracting data and analyzing data from a completely free form text type data source. We call that text mining and analytics versus documents that are semi-structured that have sections, forms, tables, maybe even images. We call that document mining and analytics market subsegment.
Boris Evelson: We always research both of these markets in parallel. There are a lot of overlapping capabilities, but still, when you look at text mining and analytics use cases, the use cases are characterized by very high volume. For example, if you’re trying to mine data from social media, you are talking about billions of posts that come at a very high velocity, but they’re typically relatively not complex. But when we look at document mining and analytics use cases, we do see lower volume—thousands or even tens of thousands of documents like purchase orders, invoices per day—but now you have to deal with a lot more complexity.
Boris Evelson: Some of the use cases definitely overlap, for example, knowledge management, which is basically the next iteration of search. Market intelligence, investigative intelligence, machine learning or cognitive search capabilities are covered by both market segments. Something I know you and I are gonna be talking about in depth is intelligent document extraction and processing. When you are processing complex documents with lots of sections and tables, sometimes you do that to extract data to populate downstream systems. Sometimes you do that for information governance and data protection; you do that just to categorize and classify the documents so that you can treat those documents appropriately. Some documents need to be secured and encrypted. Some documents may remain in an open domain. Obviously all of the above is tied together by intelligent automation.
Boris Evelson: Starting to do deeper dives into specific use cases, I would describe intelligent document extraction and processing use cases in terms of two types of documents. There are what we call transactional documents: things like claim forms, invoices, purchase orders, shipping labels, anything that’s coming in at a relatively high volume and has a lot of structure like tables and check boxes. You are primarily interested in extracting information from these complex forms. Another very interesting use case here is when you have very long documents, like a legal claim or some kind of a contract, hundreds of pages, where in addition to extracting information from each page, each table, each cell, you’re also worried about what paragraph 23 on page 99 says that contradicts paragraph two on page one. So you need to understand the in-document dependencies.
Boris Evelson: Before we start talking about some of the trends, I do want to mention that whenever we talk about AI, because obviously AI is everything these days, and especially generative AI, we do not use the term AI necessarily synonymously with machine learning. Even before machine learning became scalable enough, we’d been mining text for information for at least a couple of decades. We’ve been using a combination of linguistic rules and ontologies. An ontology is a collection of terms that you map your text against. For example, you’re trying to extract a particular topic. What’s the topic of this contact center conversation? If I’ve got a list of typical topics and I can map what I extract from the contact center conversation to one of these topics, well, I’ve got that topic. Obviously that doesn’t scale as highly as machine learning scales, but for certain use cases, knowledge-based AI is transparent, whereas machine learning models are much more opaque. I have a lot more control over linguistic rules and a topic list; I can just go in and change them, as opposed to needing to retrain the machine learning models. Typically, we see all modern vendors, including yourself, use a combination of both techniques. We call that hybrid AI.
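The knowledge-based, ontology-driven technique Boris describes can be sketched in a few lines of Python; the topic list, indicator phrases, and sample conversation below are hypothetical illustrations, not any vendor’s actual rule set:

```python
# Minimal sketch of knowledge-based topic extraction, assuming a hand-built
# ontology: each topic maps to a list of indicator phrases (all hypothetical).
ONTOLOGY = {
    "billing_dispute": ["overcharged", "refund", "wrong amount", "billing error"],
    "cancellation": ["cancel my account", "close my account", "unsubscribe"],
    "technical_support": ["not working", "error message", "can't log in"],
}

def extract_topic(conversation: str) -> str:
    """Map a contact-center conversation to the best-matching topic.

    Unlike an opaque ML model, the rule set is fully transparent: to change
    behavior, edit the keyword lists instead of retraining a model.
    """
    text = conversation.lower()
    scores = {
        topic: sum(phrase in text for phrase in phrases)
        for topic, phrases in ONTOLOGY.items()
    }
    best_topic, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_topic if best_score > 0 else "unknown"

print(extract_topic("I was overcharged last month and I want a refund"))
# -> billing_dispute
```

A hybrid AI system would combine rules like these with ML models, falling back from one to the other depending on the use case.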
Brian Weiss: My comment here is I appreciate working with you, Boris, because now we’re getting the idea that AI solves everything, but it really is an ensemble of approaches to trying to make machines do what people do at the end of the day. Whether you’re looking at a page or whether you’re understanding a paragraph of digital text, or whether you’re automating. There are very different ways to skin the cat to make that happen. Some of them cost way more than others, and some of them are more accurate, and some of ’em are less transparent. As we dig into this, certainly Hyperscience, our approach here really is to use the best tool for the job. Not just for getting the right answer, but also controlling for errors and be able to have transparency into what we decided and how, and can you make it better over time.
Boris Evelson: Well, we all know that garbage in, garbage out, the good old axiom doesn’t go away. If anything, it becomes by several orders of magnitude more critical and more complex to solve and address. The right approach, the right technique, the right model for each use case using clean and highly governed data is absolutely the mantra of the day.
Brian Weiss: I couldn’t agree more. Sometimes good enough is good enough. And what about when it isn’t? That’s when it becomes orders of magnitude more important to think about garbage in, garbage out.
Boris Evelson: Exactly. Up until about a decade ago, everybody talked about unstructured data. Everybody said, yeah, the majority of our data is unstructured and it’s a huge black box. We don’t really know what’s in it. Now it is absolutely back in the spotlight for multiple reasons. Let me just set the stage with some interesting numbers. Over half of all of you out there listening to this webinar store and process over a hundred terabytes of data, and about 20 something percent of you are processing more than one petabyte. That’s a lot. About 64% of that is unstructured. There is an urban myth out there that about 80 or 90% of all enterprise data is unstructured. But whenever we ask our clients, it comes out to about 64%. That’s still a lot, a few petabytes.
Boris Evelson: There is absolutely no question that in addition to much more scalable computing platforms that we have today that are a lot more efficient that allow us to scale unstructured data mining at reasonable cost, there is no question that generative AI is bringing in a whole new set of opportunities of unstructured data. Because if you take a look at what is it that generative AI does, it’s mostly about unstructured data: extracting data from forms, from text, classifying documents, summarizing documents, translating images to text or audio to text. It’s all about unstructured data. So in addition to more scalable and budget-wise palatable computing power, we now have a new set of models that allow us to process and mine and analyze unstructured data at scale.
Boris Evelson: We are seeing very much a steady increase in interest from our clients in unstructured document processing. About a quarter of the clients responding to this particular survey said that yes, we are processing unstructured text or video or document data. And about 23% of them are mentioning that they are using GenAI in all sorts of low code, no code and digital process automation solutions. So there is no question that generative AI, plus all of the other cost reductions, is bringing the interest in managing and deriving valuable business insights from unstructured data, and in automating processes that have some kind of unstructured data somewhere at the beginning, middle or end, to the forefront of everyone’s attention.
Brian Weiss: You and I have been doing this for almost 15 years through enterprise search and looking at text analytics. I vacillate on whether people are excited because it’s finally cost effective to do things we’ve always wanted to do but the tech wasn’t cheap enough for. Are we seeing incremental change, or do you think this is a quantum leap? I see a little bit of both. I see companies succeeding when they’re trying to do a little bit of both. Like, I can now do what I always wanted to do, cost effectively, with ROI.
Brian Weiss: I think I saw a quote the other day that, look, these LLMs are just probability calculators for words. They’re very good at analyzing the way people generate content. But at a certain point, they’ve now read everything that can be read, at least the frontier models. And so what really becomes interesting is how do you mine the data that’s unstructured and inside the enterprise, and maybe isn’t even digital. Like, you still have computer vision to work with. You’ve got layers of problems: it’s not just that I don’t understand how to read the paragraph or what it means, it’s that it’s on a piece of paper that someone spilled a bunch of coffee on and scribbled on. You have a whole problem with computer vision converting it before you can even do any of the deep text mining to figure out what the intent of the person is. So you’ve got this kind of two-step process when you’re trying to understand documents that makes it even more complex.
Boris Evelson: I do think it’s incremental at this point. I think at some point we’re going to reach a critical mass where we’ll be able to say, yeah, it’s a quantum leap. I really have to caveat this and use the lens of the company that I work for: we work with large enterprises mostly. Most of them come from regulated or even highly regulated industries. And in that segment of businesses out there, there is no option for two plus two maybe equaling 4.1 or maybe 3.9. Even though you are using these new probabilistic techniques, you’ve gotta wrap them with all sorts of controls to make sure that two plus two still comes out as four.
Boris Evelson: So I think that’s one of the reasons everyone is taking a relatively cautious approach, because it has to be governed, it has to be a single trusted source of data and information. A lot of people are automating mission critical processes, and every single error is potentially not just losing money, but potential litigation risk. So we’re still at a point where we’re balancing the barrier that we already overcame in structured data analysis, but here we are still kind of experimenting. We’re still seeing a lot of risks, but we’re seeing a lot of absolutely awesome, amazing opportunities. All of a sudden, I’m not just mining information from 36% of my data, but from a hundred percent of my data.
Boris Evelson: Nothing speaks better to being cautious and pragmatic in large enterprise use cases than the fact that human in the loop is still the mantra of the day. I’m not really sure personally where you could completely give something to a probabilistic black box model without any human intervention. What we are showing today is our rendition of an enterprise automation fabric. As you can see, human in the loop, in terms of interacting with models and with processes and facilitating processes, and then human knowledge via data catalogs and repositories and knowledge graphs, is still front and center of this process.
Boris Evelson: Whenever I look at these platforms and these solutions, one of the key questions that I always ask is how intuitive and how comprehensive the user experience is for this human in the loop. Along those lines, within the last year we did some very interesting research on knowledge graphs, which are basically repositories of human knowledge addressing the highly connected nature of enterprise data, and on how GenAI is not replacing them, and how neither can really stand on its own. Knowledge graphs are great, but they don’t really scale, and they’re great for highly predictable use cases. GenAI, by contrast, can be scaled almost instantly at this point; it’s really just a matter of how many GPUs you can throw at it. But it is probabilistic and it does not really have human knowledge. It may be trained on the world of Wikipedia, but it’s definitely not trained on my enterprise knowledge, my personal knowledge. So you create applications where you take advantage of the best of both worlds: in use case number one, it is a knowledge graph that trains a large language model. In use case number two, it’s the other way around, and you use a large language model to augment or enrich a knowledge graph. And the nirvana here is when the two are working together in synergy, learning from each other and improving each other.
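The two directions Boris describes can be sketched with a toy graph and a stubbed model call; every name, fact, and function below is a hypothetical placeholder, not a real system:

```python
# Sketch of the two knowledge-graph/LLM directions, with hypothetical data.
# Direction 1: the graph grounds the LLM prompt with enterprise facts.
# Direction 2: LLM-extracted facts enrich the graph.
knowledge_graph = {
    ("AcmeCorp", "supplier_of"): "WidgetParts Inc",
    ("WidgetParts Inc", "located_in"): "Ohio",
}

def ground_prompt(question: str, subject: str) -> str:
    """Direction 1: inject graph facts the model was never trained on
    into the prompt before calling an LLM."""
    facts = [
        f"{s} {rel.replace('_', ' ')} {obj}"
        for (s, rel), obj in knowledge_graph.items()
        if s == subject
    ]
    return f"Known facts: {'; '.join(facts)}.\nQuestion: {question}"

def enrich_graph(llm_triple: tuple) -> None:
    """Direction 2: add an LLM-extracted (subject, relation, object)
    triple to the graph, ideally after human-in-the-loop review."""
    s, rel, obj = llm_triple
    knowledge_graph[(s, rel)] = obj

prompt = ground_prompt("Who supplies AcmeCorp?", "AcmeCorp")
enrich_graph(("AcmeCorp", "headquartered_in", "Boston"))
```

In a real deployment the enrichment step would sit behind the same human review queue as any other low-confidence output.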
Boris Evelson: Whenever I talk to clients and they ask us about best practices and lessons learned when deploying document automation, document mining, unstructured data based processes, we talk to them about a lot of things. But the main thing is that you have to plan for human in the loop. You have to plan for it in terms of staffing, in terms of processes, in terms of budgets. Because no matter how accurate the systems are, or they tell you what the accuracy percentage rate is, you still need to check. You need to understand what kind of rules are being triggered or not, what kind of ML models are producing low confidence scores or high confidence scores. And most importantly, no system is a hundred percent accurate out of the box. You can get it to 99.999% accuracy with lots of tender care, with lots of iterations, and constant periodic accuracy audits. Going in with some kind of a golden set of documents that you know exactly what you’re gonna get out of it and checking that against the latest version of your application and comparing the two. So absolutely one of the top best practices is making sure that the platform you are deploying has a comprehensive, intuitive human in the loop user interface and user experience.
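A golden-set accuracy audit of the kind Boris recommends might look like the following sketch; the document IDs, field names, and stand-in extraction function are all hypothetical:

```python
# Sketch of a periodic accuracy audit against a "golden set": documents
# whose correct field values are known in advance. Run it against each new
# release of the pipeline and compare the scores over time.
golden_set = [
    {"doc_id": "inv-001", "expected": {"total": "1250.00", "po_number": "PO-88"}},
    {"doc_id": "inv-002", "expected": {"total": "310.50", "po_number": "PO-91"}},
]

def audit(extract, documents) -> float:
    """Run the extraction pipeline over the golden set and return
    field-level accuracy."""
    correct = total = 0
    for doc in documents:
        predicted = extract(doc["doc_id"])
        for field, expected_value in doc["expected"].items():
            total += 1
            correct += predicted.get(field) == expected_value
    return correct / total

# A stand-in for the deployed pipeline: gets one field wrong on inv-002.
def fake_extract(doc_id: str) -> dict:
    answers = {
        "inv-001": {"total": "1250.00", "po_number": "PO-88"},
        "inv-002": {"total": "310.50", "po_number": "PO-90"},
    }
    return answers[doc_id]

print(audit(fake_extract, golden_set))  # -> 0.75
```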
Brian Weiss: Boris, look, our position on this is we’re very passionate about that human in the loop and machine combination being as elegant as possible. When correctness matters and good enough isn’t, you’re absolutely gonna have to do that. Every single day I encounter somebody who says, well, can’t I just use this one model? And that’ll be fine. Well, what are you gonna do when it’s wrong? You’re gonna call ’em up and tell ’em they need to train it? So you end up having to build all of this architecture around a model. There are tools to try and either understand data or get information. But the overall framework that you’re describing here is what we do in data science. You have to annotate, you have to train, you have to watch. So you want a platform that makes it simple and easy to create, train, and manage models that maybe do a very specific task. They don’t have to boil the ocean; maybe they’re narrow and not large, but they’re very tailored to what you’re doing.
Brian Weiss: For Hyperscience, our customers get a huge advantage from our approach to human in the loop because, look, we all know the way the whole IDP industry was built: you have a black box and it does its thing, and then you have to go figure out if it’s wrong. And the 40% of things it gets wrong, you send to a BPO and get the cheapest people possible to go fill out that form by themselves. What we are seeing and pioneering is this idea that, instead of having a machine that does it right sometimes and wrong sometimes and having people fix it, you take a fraction of the money you spend on those people and put it right next to the model. You’re really thinking more of a digital worker concept: as it’s confused, it asks you for help and says, what is this, an A or a U? So you can drive accuracy with that human intervention very efficiently. And then the best part of this platform approach is that for every penny you spend on a person who sits next to the model, it gets better over time. It improves continuously.
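The digital-worker idea Brian describes, routing only low-confidence fields to a person and keeping the correction for retraining, can be sketched as follows; the threshold value, field names, and reviewer callback are hypothetical:

```python
# Sketch of confidence-based routing: high-confidence fields pass straight
# through, low-confidence fields go to a human for one targeted fix
# (e.g. "is this an A or a U?"), and the correction is kept for retraining.
CONFIDENCE_THRESHOLD = 0.95  # hypothetical cutoff
training_queue = []  # human corrections fed back to improve the model

def route_field(field_name, value, confidence, ask_human):
    if confidence >= CONFIDENCE_THRESHOLD:
        return value  # straight-through processing, no human touch
    corrected = ask_human(field_name, value)  # one field, not the whole doc
    training_queue.append((field_name, value, corrected))
    return corrected

# A stand-in reviewer that fixes a misread character.
fixed = route_field("last_name", "Wiess", 0.62, lambda f, v: "Weiss")
print(fixed, len(training_queue))  # -> Weiss 1
```

The key design point is that the human keys one disputed value, not the whole document, and every keystroke becomes training signal.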
Brian Weiss: We’ve really focused a lot of our attention and product development on how to make those processes continuously improve, with very efficient human in the loop. You don’t have to key a whole document; you just fix one thing and it moves on. And it gets smarter in that process. And that to me is a change in this architectural view: a model-based approach where you think less about how cheap it is to fix what’s wrong and more about whether the time I spent is gonna make it better. Can I get there? Is it transparent? Those things are something that we think deeply about, no matter what kind of model we’re using, whether it’s a third party model or a sovereign model trained on customer data. That pipeline of ensured accuracy with the right amount of human intervention. And for us, we’ve worked hard to ensure that that’s not a data science thing. I mean, you and I can look at this and say, great, let’s go hire a bunch of data scientists and get all the tools. And that’s the old world. The new world is how do I make that turnkey? How do I make a platform that can use these tools, but ensure accuracy with the least amount of human intervention, and also get better over time?
Boris Evelson: Before I even look at any kind of solution or a new platform from a vendor, the very first thing that I ask is not about accuracy. It’s not whether they’re using GenAI or what kind of models they use. I say, I am a business process owner. I come to the office in the morning, I turn on the lights, what do I see? In the last 24 hours we processed a hundred thousand documents. What does that mean? How many of them were processed with high confidence scores, so I can probably let them through? How many were processed with low confidence scores, so I need to start ringing some alarm bells and calling some people? That’s just very basic business process 101, GenAI or not.
Brian Weiss: When people say, well, how often are you right?, I would ask a different question: what does your platform do with wrong? How do you make it better? And that’s more about what wrong costs you. Where are you spending money on a BPO for people? Okay, my machine is wrong. Let’s start there and look at what went wrong. And does your system take accountability for the completeness of correctness? Or are you having to figure out what to do with wrong? I really appreciate the fact that I think the market is pivoting that way, that it’s not gonna be some magical model. You really have to ask: how do you handle exceptions? How does it get better over time? And frankly, do you have control over it? Because sometimes you peel back the covers quickly and it turns out they’re renting a third party model from somebody, and you really can’t ask that vendor to make it better. You can’t call ’em up and send ’em samples and say, please retrain because you got it wrong. You’re gonna have to work with the exceptions.
Boris Evelson: One of the reasons I really caution clients whenever they say they’re going to build something versus buying it from a vendor who’s already gone through all of these challenges is: how are you going to build all this? Controlling the output of an LLM is probably an even more daunting task. No matter how much you’ve engineered and controlled the prompt to an LLM, and no matter how much you’ve finagled it with retrieval augmented generation, and even if you’ve tuned the model, it is going to spill out whatever it’s going to spill out. So how do you moderate that content? How do you block some of the potentially toxic content? One of the interesting examples that I run into in my other world, the structured data world, is that when you try to translate natural language into an SQL query, what if the generated SQL says DELETE *? Are you still gonna let that SQL through? Well, hopefully not.
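A deterministic guardrail in front of generated SQL, of the kind Boris is describing, might be sketched like this; the blocklist is illustrative and deliberately minimal, not a production-grade moderation layer:

```python
import re

# Sketch of a deterministic guardrail in front of LLM-generated SQL:
# regardless of how the query was produced, destructive statements never
# reach the database. A real moderation layer would parse the SQL rather
# than pattern-match the first keyword.
BLOCKED = re.compile(r"^\s*(delete|drop|truncate|update|alter)\b", re.IGNORECASE)

def moderate_sql(generated_sql: str) -> str:
    if BLOCKED.search(generated_sql):
        raise PermissionError(f"Blocked destructive statement: {generated_sql!r}")
    return generated_sql

moderate_sql("SELECT * FROM invoices WHERE total > 100")  # allowed through
# moderate_sql("DELETE FROM invoices")  # would raise PermissionError
```

The point is that the check is deterministic and sits outside the model: no amount of prompt engineering is trusted to enforce it.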
Boris Evelson: And once it does come up with something, it needs to be explained. And part of the explainability is lineage: how did I come up with that answer? Where did it come from? And then you need to deal with the fact that some ML models are more transparent than others. So if they are transparent, you need to show the data stewards, the owners of the process, what’s behind them. And if they are opaque, you’ve gotta interpret, at the very least, where the data came from and what the pattern of the output is, so that you can understand how it’s working. But in some cases, where two plus two absolutely has to equal four, you actually have to create some deterministic rules on the backend to make sure that if an LLM comes up with an answer that’s not really acceptable in this particular use case, you block it. So this is not easy. And that’s why I say, please work with vendors who have done this before and know how to build this.
Brian Weiss: I think we’re into the world of enterprise AI versus what might have been coming in back in search days. It was public search against public data. But when you’re talking about enterprise data, it’s a very different world. Hyperscience grew up as an on-prem software which needed to be deployed in air gap environments. So there wasn’t any question of hitting a third party or training ’em. The transparency is actually core to the operation of it. We allow customers to build sovereign models, plural, that run on a pipeline, but it’s trained on their data. So yes, there’s IP in there, but those models are very transparent.
Brian Weiss: The critical point is that at any given time, I can see why the decision is being made. I can influence it by training it more. And there’s accuracy harnesses built around that for validating the outputs. So we sort of grew up in the world of highly correct and accurate data, transparent models. And so if I bring, say, for example, a third party model onto the pipeline to maybe answer a question I need to know about a paragraph that I’ve found inside a document, I’m not gonna throw the whole document at that LLM. What I’m gonna do is I’m gonna take that paragraph, understand what it is, put it in a RAG. But I can accuracy harness now the third party model in a much more contained way. So we’re really investing heavily in being able to wrap that same level of governance and transparency that you have with a native Hyperscience model you’ve built on your own data when you bring in third party models to assist in the problem.
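Scoping a third-party model to a single extracted paragraph, rather than the whole document, can be sketched as follows; the paragraph splitter, prompt shape, and stub model below are hypothetical stand-ins, not Hyperscience’s actual pipeline:

```python
# Sketch of containing a third-party LLM call to one extracted paragraph
# so the call can be validated (accuracy-harnessed) in a narrow way,
# instead of throwing the whole document at the model.
def extract_paragraph(document: str, keyword: str) -> str:
    """Pull out only the paragraph that mentions the keyword."""
    for para in document.split("\n\n"):
        if keyword.lower() in para.lower():
            return para
    return ""

def ask_contained(llm, document: str, keyword: str, question: str) -> str:
    context = extract_paragraph(document, keyword)
    if not context:
        return "NOT_FOUND"  # never call the model with an empty context
    prompt = f"Context: {context}\nQuestion: {question}"
    return llm(prompt)

doc = "Intro text.\n\nThe termination clause allows 30 days notice.\n\nFooter."
stub_llm = lambda prompt: "30 days" if "termination" in prompt else "?"
print(ask_contained(stub_llm, doc, "termination", "What is the notice period?"))
# -> 30 days
```

Because the model only ever sees one vetted paragraph, its answers can be checked against that paragraph alone, which is what makes the wrapping governance tractable.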
Boris Evelson: The value of truly accuracy-harnessed data, in creating a data estate that will drive AI use cases, is now readily apparent. It’s really fascinating that the back office, even the mail room, is in some ways the eye of the hurricane in all of this, because everything you’ve got in Iron Mountain is actually a gold mine now. If you could train a computer to understand all of that, answer questions about it in natural language, and maybe help you make decisions based on its understanding of what it’s read, that’s a really compelling end state.
Brian Weiss: But I find so many people are so focused on that, that they forget the first thing you have to do is get the data out in a very, very clean and accurate way. You gotta be able to read the stuff that’s in handwriting, and you can’t get it wrong. It’s garbage in, garbage out at scale: you could train a model on garbage and it’ll hallucinate faster than I can spell my middle name. It’s not like, oh, we’re good enough; we’re not writing poetry or an Abraham Lincoln term paper here. It’s a process. So we are heavily engaged with enterprises who have realized that, wow, I actually already have this data, and all I need to do is ensure that I’m getting it clean and accurate in an accessible and transparent way. And then I’m three quarters of the way to the data estate I need to be able to create an agentic approach to decisioning and all of that. It’s interesting that the back office, and understanding the information in pictures of things, is now somehow critical to unlocking AI use cases.
Brian Weiss: Don’t forget what Todd mentioned about our deployment options: we are an on-premise deployment option. So everything he talked about runs within your security controls, built into your existing infrastructure. If you harden your data, we are hardened. The other thing to understand is that we also deploy on AWS GovCloud as an on-premise solution, so all of that is built into AWS GovCloud as far as controls can be built into Hyperscience. And this allows you to get your ATO, your Authority to Operate, quicker than you would if you were trying to do it in your own environment.
Brian Weiss: If anyone is interested in hearing more about Hyperscience and our approach to this, it’s a multi-model platform for enterprise data. Feel free to give us a call. I also wanna thank Boris; it’s always a pleasure. I appreciate the research Forrester does, and the depth and the technical lens with which you are able to break some of these things apart. And for everyone who’s spent the last hour listening to Boris and myself talk about these interesting trends, thank you very much for joining us today.