What Benchmarks Can Miss
Benchmarks are a valuable way to establish a common standard for comparing solutions, tracking progress over time, and driving innovation. However, as the quote above highlights, organizations can end up stacking the deck in their favor by optimizing for the known, tested scenario rather than for a diverse set of real-world settings.
One way organizations skew benchmark results is by using public test sets that can be seen and studied ahead of time. This allows developers, whether knowingly or inadvertently, to overfit their models to perform well on those specific examples.
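As a rough illustration of how this happens, the toy sketch below picks the "best" of many equally capable candidate configurations based on their score on a known public test set; the winning score looks inflated, while a private holdout that never influenced the choice reports the true rate. This is purely illustrative Python, not our evaluation code, and the names and numbers are made up for the example.

```python
# Toy illustration of overfitting to a known public test set via selection bias.
# Every candidate below has the same true accuracy (~70%); picking the one that
# scores highest on the public set still inflates its reported number.
import random

N_EXAMPLES, N_CANDIDATES = 200, 50

def evaluate(candidate: int, dataset_id: int) -> float:
    """Pretend accuracy of a candidate on a dataset: pure chance around 70%."""
    rng = random.Random(candidate * 1_000 + dataset_id)
    return sum(rng.random() < 0.7 for _ in range(N_EXAMPLES)) / N_EXAMPLES

# Choose the candidate that looks best on the *public* test set (dataset_id=1)...
best = max(range(N_CANDIDATES), key=lambda c: evaluate(c, dataset_id=1))

# ...then compare its public score with a private holdout it never saw (dataset_id=2).
print(f"Public test set score:  {evaluate(best, dataset_id=1):.1%}")  # optimistically high
print(f"Private holdout score:  {evaluate(best, dataset_id=2):.1%}")  # close to the true ~70%
```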
The Hyperscience Approach to Benchmarking
At Hyperscience, our global machine learning (ML) team brings years of experience in building and testing models for a wide range of enterprise scenarios. A key element that helps our research team continually improve our platform is running intelligent document processing (IDP) experiments and benchmark tests on proprietary datasets that are unfamiliar to LLMs and VLMs. For LLMs that are only available as public services, where we cannot submit proprietary data, we use public datasets. This dual approach delivers more objective results and better simulates real-world performance and scenarios.
Our benchmarks cut through industry buzz and punchy headlines to deliver substance and real-world tested results. The principles below guide how we design and run tests to ensure they generate meaningful insights for our R&D teams while giving customers, partners, and the market up-to-date, trustworthy data on the state of document processing ML models.
- No generic comparisons, always document-specific use cases: We perform in-house experiments to compare all of our machine learning models against industry-specific document sets (Bills of Lading, Invoices, Receipts, Government IDs, etc.). If you’ve heard of a model, we’ve likely benchmarked it against a real-world document use case.
- Open-source & Closed-source comparisons: As new models come out, we are constantly looking at how they perform on our specific document processing use cases, particularly classification, transcription, and extraction of the most common fields and formats in government, insurance, healthcare, transportation, banking, and other industries.
- Why isn’t [Model X] in the Benchmark?: Hyperscience constantly evaluates the performance of existing and emerging models. For the purposes of broader benchmarking, we focus on the top performing and most commonly used models. Our machine learning team spends countless hours analyzing, evaluating, and debating the latest ML models and trends. We even have an ML research reading club! If you have a question about the performance of a specific model, reach out – our team would love to hear from you!
Comparative Document Processing Accuracy: Hyperscience vs. Leading Models
The ML benchmarking team recently tested Hyperscience models against the most well-known LLMs, as well as against models from traditional document processing vendors. In both categories, Hyperscience clearly outperformed the alternative models, delivering industry-leading document processing accuracy rates.
Hyperscience vs. LLMs in Document-Specific Use Cases
The table below shows the accuracy percentages of Hyperscience models compared with some of the most common LLMs and VLMs.
The data indicates that Hyperscience models almost always achieve higher document processing accuracy across the tested document types than the other LLMs and VLMs listed.
Specifically, Hyperscience models show superior performance on Bills of Lading, Invoices, and Government IDs. While some LLMs, like Gemini 2.5 Pro and Claude 3.7 Sonnet, occasionally show competitive accuracy on specific document types such as Receipts, Hyperscience consistently maintains a strong lead in overall accuracy across the benchmarked documents.
| Model | Bills of Lading | Invoices | Receipts | Government IDs |
|---|---|---|---|---|
| Hyperscience Specialized GPU (ORCA) | 98 | 94 | 93 | 100 |
| Hyperscience Specialized CPU (OICR) | 93 | 93 | 77 | 98 |
| Claude 3.7 Sonnet | 75 | 74 | 82 | 98 |
| Claude 3.5 Sonnet v2 | 77 | 71 | 49 | 90 |
| Gemini 2.5 Pro | 66 | 74 | 86 | 94 |
| InternVL3-8B | 68 | 67 | 73 | 90 |
| NVIDIA Llama Nemotron Nano VL 8B | 49 | 46 | 80 | 83 |
| OpenAI GPT-4o* | * | * | 76 | * |
*Hyperscience cannot submit private datasets to OpenAI GPT-4o for Bills of Lading, Invoices, or Government IDs due to Terms & Conditions.
Hyperscience vs. Enterprise AI Platforms: Measuring Accuracy on Printed & Handwritten Text
In the charts below, we’ve removed the LLM startups and compared Hyperscience to the models most often encountered in enterprise AI extraction. Hyperscalers like Amazon, Microsoft, and Google offer many models; in this benchmark, we compare the ones most commonly used to power their document-focused solutions.
When compared to both open-source and closed-source models, Hyperscience consistently outperforms, with our CPU and GPU models delivering the highest “exact match” accuracy.
When it comes to documents, we know of no model today that is as accurate as Hyperscience on printed and handwritten text. When you factor in additional layers of Human-in-the-Loop review, orchestration, and accuracy harnessing, Hyperscience’s accuracy percentages only increase compared to other options.

Note: To be considered an exact match, the model must extract the entire word or phrase, from the first character to the last. For example, if “Cat” is presented, the model only gets credit for returning “Cat” or “cat”, both of which preserve the meaning. Responses like “dogcat”, “CotAT”, or “ca” are counted as incorrect.
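For readers who want the rule in code, here is a minimal sketch of this exact-match check and the accuracy it produces. The helper names and the whitespace trimming are illustrative assumptions rather than our production evaluation code; the case-insensitivity follows the “Cat”/“cat” example above.

```python
# Minimal sketch of the exact-match scoring rule described in the note above.
# Function names and whitespace trimming are illustrative assumptions.

def exact_match(predicted: str, expected: str) -> bool:
    """True only if the prediction reproduces the full ground-truth string.

    Case differences are tolerated ("cat" for "Cat"), but partial or padded
    answers ("ca", "dogcat") are not.
    """
    return predicted.strip().lower() == expected.strip().lower()

def exact_match_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (predicted, expected) pairs that are exact matches."""
    return sum(exact_match(p, e) for p, e in pairs) / len(pairs) if pairs else 0.0

# Example: only the first response below would earn credit.
samples = [("cat", "Cat"), ("dogcat", "Cat"), ("CotAT", "Cat"), ("ca", "Cat")]
print(f"Exact-match accuracy: {exact_match_accuracy(samples):.0%}")  # 25%
```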
What About Multi-Language Accuracy?
Hyperscience also provides multi-language document support for 200+ languages.
For any multinational business, the ability to process multilingual documents is of paramount importance. Errors and inaccurate processing lead to costly delays, poor customer experiences, and damage to reputation and brand. This is why the world’s leading enterprises turn to Hyperscience for transformational business process automation in whatever languages they operate. Below are the benchmark results when run with Spanish-language models; these results are representative of the performance we achieve with models for other languages as well.

What’s Coming Next
Our team runs an ongoing machine learning benchmarking program that is constantly testing new models. Up next, we’ll be looking at the recently released GPT-5.
What do you think of these results? What would you like to see us test next? We’d love to hear from you! Please reach out to learn more.