For many organizations, the ideal scenario is to receive one discrete document per file, ready for processing. However, the reality of business operations is often much messier. Whether it’s multiple patient visits sent as a single medical history, a batch of invoices from a single vendor, or a packet of mortgage application documents (W-2s, bank statements, and multiple paystubs) arriving as one file, organizations frequently receive multiple distinct documents bundled together.
As a result, the journey of transforming large submissions into actionable data often begins with a critical first step: accurately dividing large, multi-document files into their individual components. While this may appear to be a simple task, our customers have faced several challenges when trying to tackle this problem with automation:
- Lack of clear document boundaries: If a submission contains several documents of the same type from the same source, they might look virtually identical. Consider a batch of invoices that arrive from a vendor you purchase goods from. Each invoice represents a distinct payment that needs to be made to your vendor and mapped within your ERP, but because all the invoices look very similar, it can be difficult to create automated solutions to split the documents.
- Diverse document types: Enterprises handle a vast array of document types (invoices, contracts, medical records, forms, correspondence, etc.), each with its own layout, structure, and content. A document processing flow needs to identify boundaries across all the different documents that a business process requires.
- Added complexity to document pipelines: To solve this problem, customers often add third-party solutions or custom scripting logic to their document ingestion pipelines. Both options increase the complexity of document automation projects, leading to increased costs, delayed project timelines, and long-term maintenance pains.
Introducing Auto-Splitting
Hyperscience understands these pain points deeply, and that’s precisely why we introduced a powerful tool to our platform: Auto-Splitting. This feature is integrated into our document classification flow, so the platform can both classify each document by its layout and, when a submission contains consecutive documents of the same layout, split them into distinct documents for processing. Users can define layout-specific grouping logic that helps our classification flow determine where a document boundary falls. The available logic ranges from simple rules (such as a fixed number of pages, or always processing the pages as a single document) to complex regular expression logic that can be configured by:
- First Page – Easily identify document beginnings by searching for unique text strings that appear only on the first page of the document type. Think “Page 1 of X,” “Document Title,” or a specific company header.
- Last Page – Pinpoint document endings by looking for text unique to the final page. Examples include “Invoice Total,” “Signature,” or “End of Report.”
- Some / All Pages – This new feature can compare text patterns that appear on some or all pages and define document boundaries when they change. For instance, you can define a document boundary when an invoice number changes, automatically splitting a multi-invoice PDF into individual files. This is ideal for batches of similar documents where a key identifier shifts from one document to the next.
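To make the grouping rules above concrete, here is a minimal Python sketch of two of them: splitting on a first-page marker and splitting when a key identifier changes. It is an illustration only, not the Hyperscience implementation or API. It assumes each page has already been reduced to a plain text string, and the function names, sample pages, and regular expressions are all hypothetical.

```python
import re
from typing import List, Optional


def split_on_first_page(pages: List[str], first_page_pattern: str) -> List[List[int]]:
    """Start a new document at every page whose text matches the pattern.

    Returns a list of documents, each a list of page indexes.
    Hypothetical helper for illustration only.
    """
    marker = re.compile(first_page_pattern)
    documents: List[List[int]] = []
    for index, text in enumerate(pages):
        # The very first page always opens a document, even without a match.
        if marker.search(text) or not documents:
            documents.append([])
        documents[-1].append(index)
    return documents


def split_on_key_change(pages: List[str], key_pattern: str) -> List[List[int]]:
    """Start a new document whenever the captured key differs from the last one seen.

    key_pattern must contain one capture group (for example, an invoice number).
    Pages where the key is not found stay with the current document.
    Hypothetical helper for illustration only.
    """
    key_re = re.compile(key_pattern)
    documents: List[List[int]] = []
    current_key: Optional[str] = None
    for index, text in enumerate(pages):
        match = key_re.search(text)
        key = match.group(1) if match else None
        key_changed = key is not None and current_key is not None and key != current_key
        if not documents or key_changed:
            documents.append([])
        if key is not None:
            current_key = key
        documents[-1].append(index)
    return documents


# Made-up page text standing in for a multi-invoice submission.
pages = [
    "Invoice No: 1001\nPage 1 of 2",
    "Invoice No: 1001\nPage 2 of 2",
    "Invoice No: 1002\nPage 1 of 1",
    "Invoice No: 1003\nPage 1 of 2",
    "continued line items",
]

print(split_on_first_page(pages, r"Page 1 of \d+"))       # [[0, 1], [2], [3, 4]]
print(split_on_key_change(pages, r"Invoice No:\s*(\d+)"))  # [[0, 1], [2], [3, 4]]
```

In this sketch, a page with no detectable key stays with the document before it. That is one reasonable choice for continuation pages, but a production rule might treat such pages differently.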
The benefits of Auto-Splitting are already evident at Hirschbach Motor Lines, an industry-leading carrier delivering state-of-the-art transportation solutions for more than 80 years, where it’s accelerating a project to automate data extraction from bills of lading for driver payments. These documents are notoriously tricky: they’re often bundled, have inconsistent page counts depending on shipment size, and can arrive in completely different formats. But with Auto-Splitting, Hirschbach quickly set up logic in their layout definition to look for the bill of lading number on each page. Whenever the number changed, the file was automatically split. As a result, Hirschbach dramatically accelerated their project and is now seamlessly processing bills of lading through the Hyperscience platform, unlocking significant value by slashing the manual time spent on paying their drivers.
Unlock the Full Potential of Your Automation
The introduction of Auto-Splitting marks a leap forward in our commitment to providing a truly comprehensive and intelligent document processing platform. We believe that by tackling this problem head-on, we’re removing a major bottleneck that has historically plagued many document automation projects.
We encourage you to reach out to your Hyperscience representative or our support team if you have any questions or would like to see Auto-Splitting in action!