Kicking Off with OOTB VLMs
Vision Language Models (VLMs) are incredibly powerful, combining computer vision with large language models to extract information from images and videos. Instruct-tuned VLMs such as Claude 3.5 can be used out-of-the-box (OOTB) for document processing tasks like field extraction from invoices, forms, or receipts. Simply provide a text prompt and an image, and the model reasons through the content to extract the relevant information, delivering immediate value without extensive setup or training.
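For instance, a single OOTB extraction call takes only a prompt and an API request. Here is a minimal sketch using the Anthropic Python SDK; the file name and field list are illustrative:

```python
import base64
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# Load the document image and base64-encode it for the API.
with open("invoice.png", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text",
             "text": "Extract the following fields from this invoice and answer "
                     "in JSON: date, tax amount, total amount."},
        ],
    }],
)
print(response.content[0].text)  # the model's JSON answer
```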
A variety of OOTB VLMs are available on the market, including robust closed-source models like Claude, Gemini, and ChatGPT, as well as highly capable open-source options such as Qwen, DeepSeek, and InternVL. We tested leading VLMs on our internal field extraction evaluation datasets, and while the results were promising, they were not quite ready to replace human workers or Hyperscience Field ID models. Below are the benchmark results for Claude 3.5 Sonnet on the internal invoice evaluation dataset.

This is understandable: OOTB VLMs are typically pretrained on a broad range of tasks to ensure versatility and the ability to handle diverse challenges, so they may not excel at highly specific tasks, especially when the documents are complex. Does this mean VLMs are just a hyped-up technology, far from production-ready for document processing?
Few-Shot Prompting: The Fast-Track Upgrade to Squeeze Out More Performance
Few-shot prompting supercharges a VLM by pairing the task instructions with a small set of curated demonstrations. A sufficiently capable model can leverage this extra context for enhanced performance.
How it works
For the sake of illustration, let’s envision a field-extraction use case over invoices, where we are trying to extract “date”, “tax amount” and “total amount” from the document images. In that scenario, a few-shot prompt would look like the following.
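Conceptually, each demonstration is a document image paired with the instruction, followed by the expected JSON answer; the query document then comes last, formatted exactly like the demonstrations. A minimal, provider-agnostic sketch (image_block and text_block are hypothetical helpers that wrap content in whatever format your VLM API expects):

```python
INSTRUCTION = (
    "Extract the following fields from the invoice and answer in JSON: "
    '{"date": value, "tax_amount": value, "total_amount": value}'
)

def build_few_shot_messages(demos, query_image):
    """demos: list of (image, ground_truth_json) pairs drawn from annotated invoices."""
    messages = []
    for demo_image, demo_answer in demos:
        # Each demonstration is a full question/answer round trip.
        messages.append({"role": "user",
                         "content": [image_block(demo_image), text_block(INSTRUCTION)]})
        messages.append({"role": "assistant", "content": [text_block(demo_answer)]})
    # The actual query comes last, formatted exactly like the demonstrations.
    messages.append({"role": "user",
                     "content": [image_block(query_image), text_block(INSTRUCTION)]})
    return messages
```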

This approach is quite appealing: we don’t have to spend any time training the model, and we can provide demonstrations as soon as we have a couple of samples annotated.
Suppose now that we have a pool of annotated invoices, typically issued by a finite set of vendors. Within a given vendor group, the main differences between documents are the specific values we need to extract; the rest remains largely invariant. Because of this low internal variability, once the VLM has been shown how to extract the fields from one sample in a cluster, it can easily extend that capability to the other samples, since the page layout, field positions, and textual context remain consistent.

At inference time, the goal is to retrieve annotated documents that closely resemble the query document. We therefore compute a visual embedding (a numeric vector that captures each page’s appearance) so that visually similar pages cluster together in the vector space, also known as the embedding space. For a new document, we simply select the annotated pages nearest to its embedding and use them as demonstrations when prompting the VLM.
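A minimal sketch of this retrieval step, assuming a CLIP-style encoder from sentence-transformers (annotated_pages stands in for your own pool of annotated page images):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# CLIP-style image encoder; any visual embedding model would work similarly.
embedder = SentenceTransformer("clip-ViT-B-32")

# Embed the annotated pool once, up front (annotated_pages: list of PIL images).
pool_embeddings = embedder.encode(annotated_pages, normalize_embeddings=True)

def nearest_demonstrations(query_page, k=4):
    """Return indices of the k annotated pages most visually similar to the query."""
    query_emb = embedder.encode([query_page], normalize_embeddings=True)[0]
    scores = pool_embeddings @ query_emb  # cosine similarity (embeddings are normalized)
    return np.argsort(scores)[::-1][:k]
```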

Performance Insights

The bar chart above highlights the positive correlation between field extraction accuracy and the number of demonstrations provided in the prompt. However, more examples are not always better: there is a clear performance plateau for “Bill of Lading” and “Invoices”, and each added demonstration also significantly increases the model’s memory usage, since a much larger context must be processed.
Another interesting point is how performance behaves as we increase the size of the pool from which we draw our few-shot examples. With more annotated examples at hand, we are more likely to find ones that closely resemble the target, which yields higher accuracy, as the plot below demonstrates on the “Receipts” dataset.

Supercharging VLMs with Fine-Tuning
OOTB VLMs offer a strong starting point for document processing tasks like field extraction from invoices, but their general-purpose pre-training may fall short when handling complex, domain-specific documents. Fine-tuning is the key to unlocking their full potential. This process involves further training a pre-trained VLM on a targeted dataset, such as domain-specific documents like invoices or checks, to improve its accuracy in extracting relevant fields.
Unlike pre-training, which requires vast datasets and significant computational resources to build a versatile model from scratch, fine-tuning is efficient and cost-effective. It leverages the VLM’s existing knowledge from pre-training, and refines it with specialized data. For example, feeding the model a curated set of invoices allows it to better recognize patterns, layouts, and terminology specific to an organization’s documents. This targeted approach bridges the performance gap, making VLMs more reliable and production-ready for niche tasks without reinventing the wheel.
The Major Steps to Obtain an Improved Model
To transform an OOTB VLM into a high-performing tool for document field extraction, supervised fine-tuning is a key technique. Using tools like HuggingFace’s transformers package, you can fine-tune open-source VLMs efficiently. Below are the essential steps to achieve significant performance improvements:
- Prepare Your Dataset:
Fine-tuning requires a curated dataset of documents with ground truth annotations for the fields you want to extract (e.g., invoice numbers, dates, or totals). Unlike pre-training, which requires massive datasets, fine-tuning can yield significant results with just a few hundred high-quality documents. For each document, prepare the image, a text prompt describing the task, and the ground truth field transcriptions. We chose JSON for the ground truth fields because its structured format ensures consistent data representation and simplifies parsing. It also supports complex data, such as multiple field occurrences, and aligns with the formats used in most VLM pre-training and instruction fine-tuning.
| Document Image | Text Prompt | Field Extraction Ground Truth |
| --- | --- | --- |
| (invoice image) | You are an expert annotator in charge of annotating documents. Extract the following fields from the invoice: name, address, items. The output should be in json format `{"name": value, "address": value, "items": value}` | `{"name": "Jessica", "address": "4729 Maple Grove Lane, Boise, ID 83702", "items": ["Cloud Hosting Service", "Custom API Integration", "Annual Software License"]}` |
| (invoice image) | (same prompt as above) | `{"name": null, "address": "8217 Cedarstone Drive, Chattanooga, TN 37421", "items": ["Light Bulb", "Water Bottle", "Earbuds", "Digital Clock"]}` |
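In practice, each image/prompt/ground-truth triple can be serialized as one JSONL training record. A minimal sketch (the image path is hypothetical):

```python
import json

# One training record per document: the image, the task prompt, and the
# ground-truth fields serialized as JSON, with field order matching the prompt.
record = {
    "image": "invoices/0001.png",  # hypothetical path to the document image
    "prompt": (
        "You are an expert annotator in charge of annotating documents. "
        "Extract the following fields from the invoice: name, address, items. "
        'The output should be in json format {"name": value, "address": value, "items": value}'
    ),
    "answer": json.dumps({
        "name": "Jessica",
        "address": "4729 Maple Grove Lane, Boise, ID 83702",
        "items": ["Cloud Hosting Service", "Custom API Integration", "Annual Software License"],
    }),
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```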
Pro Tip: Ensure the ground-truth field names in your dataset match the order specified in the text prompt. Mismatched orders can confuse the model, resulting in incorrect penalties during training.
- Set Up the Fine-Tuning Pipeline:
The transformers package’s Trainer class simplifies the fine-tuning process for open-source VLMs hosted on HuggingFace. This high-level API streamlines training configuration and supports a variety of models. Alternative tools like ms-swift or unsloth are also worth exploring for specialized needs.
Pro Tip: Fine-tuning an entire VLM can be memory-intensive due to the storage of gradients, activations, and other components. If GPU resources are limited, consider Low-Rank Adaptation (LoRA) or Quantized LoRA (QLoRA). These methods add a compact adapter (as small as 0.3% of the base model’s size) to the pre-trained model, freezing the base parameters and training only the adapter. This approach significantly reduces GPU memory usage while maintaining near-equivalent performance (image reference: https://arxiv.org/pdf/2106.09685).
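Putting the pieces together, a LoRA fine-tuning setup with transformers and peft might look like the sketch below (the checkpoint is just one example of an open-source VLM; train_dataset and collate_fn stand in for your own data pipeline):

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # example open-source VLM
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# LoRA: freeze the base model and train a small adapter on the attention projections.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model

args = TrainingArguments(
    output_dir="vlm-field-extraction",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    num_train_epochs=3,
    bf16=True,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # your annotated documents (see previous step)
    data_collator=collate_fn,     # builds pixel_values / input_ids / labels batches
)
trainer.train()
```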
- Select the Right Checkpoint:
Fine-tuning duration varies from hours to days, depending on dataset size, GPU hardware, and training settings. To choose the best model checkpoint, rely on task-specific metrics like field accuracy or token accuracy, rather than generic loss values, to ensure alignment with your document extraction goals.
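For example, a bare-bones field-accuracy metric for checkpoint selection might look like the following sketch (exact string matching; a production harness would typically add normalization and fuzzy matching):

```python
import json

def field_accuracy(pred_texts, gold_records):
    """Fraction of fields whose predicted value exactly matches the ground truth."""
    correct = total = 0
    for pred_text, gold in zip(pred_texts, gold_records):
        try:
            pred = json.loads(pred_text)  # model output is expected to be JSON
        except json.JSONDecodeError:
            pred = {}  # malformed output counts every field as wrong
        for field, gold_value in gold.items():
            total += 1
            correct += int(pred.get(field) == gold_value)
    return correct / max(total, 1)
```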
Pro Tip: When using LoRA/QLoRA, the fine-tuning process typically saves only the adapter’s weights. By default, transformers keeps the adapter separate from the base model during inference, which can slow down predictions. Merge the adapter back into the base model post-training to eliminate this overhead and streamline inference.
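A minimal sketch of that merge step with peft (the adapter path is hypothetical):

```python
from transformers import AutoModelForVision2Seq
from peft import PeftModel

base = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
model = PeftModel.from_pretrained(base, "vlm-field-extraction/checkpoint-final")  # adapter dir
model = model.merge_and_unload()  # fold the adapter weights into the base layers
model.save_pretrained("vlm-field-extraction-merged")  # ready for adapter-free inference
```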
Performance Insights
Fine-tuning a VLM significantly enhances its performance across a wide range of document field extraction tasks, moving beyond the limitations of OOTB models. Fine-tuning is also essential for embedding use case-specific logic, which an OOTB model cannot capture due to its generalized design. When we evaluated our fine-tuned model on multiple internal test datasets, including invoices, government IDs, driver’s licenses, bills of lading, and more, the results were consistently impressive: accuracy improved by 10-30% across these document types compared to the OOTB model. This demonstrates the power of fine-tuning to adapt VLMs to varied and complex document layouts:

By integrating fine-tuned VLMs with Hyperscience’s human-in-the-loop workflows, we achieved a remarkable 99% accuracy, matching or even surpassing human-level performance. This combination allows the model to efficiently handle the majority of extraction tasks, with human oversight ensuring precision for edge cases, delivering a robust and scalable solution for real-world document processing.
Wrapping up
Although generalist VLMs can offer strong base performance, using them out-of-the-box usually means leaving accuracy on the table. Both few-shot prompting and fine-tuning are promising avenues for leveraging annotated data from the target task to enhance performance.
Let’s review the key advantages of both of these approaches:
Few-shot prompting:
- Lower time-to-value: no need for training; works as soon as you have a couple of samples annotated
- More resilient to data drift: new demonstrations from the drifting distribution can be leveraged as soon as they are added to the pool
- One model for all tasks: you use the same weights everywhere, so you don’t need to store and switch between task-specific weights for each use case
Fine-tuning:
- Higher peak accuracy: tuning the weights generally outperforms tweaking the prompt on complex tasks
- Lower latency: if you merge the LoRA adapter with the base model, you will be processing fewer input tokens than few-shot prompting with the same architecture
- Lower memory usage: unlike few-shot prompting, there are no long prompts to hold in memory
- No PII living in the prompt: sensitive values are absorbed into weights rather than sitting in plaintext prompts (no need for redaction)
At Hyperscience, our accuracy-harness toolbox relies on fine-tuning. This choice maximizes accuracy, handles longer documents, and keeps latency low. Modern VLMs already offer strong priors, so just a few dozen labeled samples are often enough to reach the accuracy levels you need to automate your predictions.