How to Capture Data Accurately From Poor Quality Documents?

(Last Updated On: May 11, 2022)

By now, you probably know that extracting data accurately from receipts, invoices and other types of documents can be a real challenge, even more so when there are tons of documents whose data has to be entered into the system manually. For your organization, this process is highly time-consuming and error-prone, because every value is keyed in by hand.

So is standard OCR-based software the solution?

The answer is no. Standard OCR cannot solve this issue accurately either. It extracts text as it appears, without understanding the context in which it is written, so data may be fed into the wrong fields. This increases the burden of sorting out and correcting data fields again and again.

So when the system is processing an invoice, how will it know who the supplier is and who the customer is? How will the data extraction software differentiate between the two, and between data that needs to be extracted and data that does not?

AI-enabled data extraction software can identify such data easily. In this blog, we will walk you through the techniques and methods that KlearStack AI uses to extract data intelligently.

Let’s consider the invoice in Figure 1 as an example of a bad quality document. I am sure your organization receives many such invoices and receipts on a daily basis.

The invoice has printed text that standard OCR can extract. However, standard OCR may also extract handwritten text and stamp information that is not required or not relevant for data entry purposes.

In this invoice, you can see that the top right corner is torn off, which has left crucial information such as the supplier name missing or partial. There are also stamps and handwritten text overlapping parts of the printed text on the invoice.

Such an invoice can be considered a bad quality document: the handwritten text and stamps overlap the printed content, and important information is missing because the page is torn.

Extracting data accurately from such documents becomes a massive challenge for two key reasons:

  1. Crucial information cannot be extracted from the document into the system accurately.
  2. False or unrequired text is extracted from the document, leading to more errors even when technology is used.

KlearStack AI: Solution for Unclear Documents

So, we have seen the challenges with unclear documents and how standard OCR only increases the workload for your organization. This is why KlearStack AI is recommended: it uses three main techniques to ensure that, no matter how unclean the documents are, all crucial data, and only the information that is required, is captured and stored in the system.

In Figure 2, you can see that the supplier name has been extracted accurately even from a torn invoice. This is possible because the supplier name appears elsewhere in the invoice (refer to Figure 1, bottom right), and KlearStack AI was able to capture it intelligently and identify it as the supplier name thanks to its trained AI model.

In Figure 3, you can see that other information such as the Purchase Order No., Date and so on is extracted accurately and filled into the respective fields in the KlearStack AI software automatically. This reduces clerical data entry work for your organization.

Let us take a look at the techniques that make this possible in KlearStack AI.

Computer Vision

Computer Vision is a field of AI that enables meaningful data to be extracted from the visual or textual cues in a document. KlearStack AI uses computer vision to ensure that every piece of data extracted from a document adds value to your document processes and that only the required data is captured.

For your organization, computer vision enables the required data to be captured, stored and processed accurately, even when the invoices being processed are not very clear.

In the case of the above invoice, computer vision can help identify where the names of the supplier and the customer are placed. In the top right corner, the supplier name is only partially displayed, so computer vision can help scan for this information elsewhere on the document.

NLP/Deep Learning

Meaningful information has now been captured through computer vision. But how will the system be able to tell that a stamp or a piece of handwritten text is not required?

While computer vision allows printed text to be captured accurately, deep learning and natural language processing scan the document through multi-layered processes to identify which text is to be retained and which is not. Using various neural networks, the data is scanned thoroughly before the system decides what to retain and what to discard.

In the case of the above invoice, the crucial information in the top right corner of the document was missing. However, through deep learning, KlearStack AI was able to identify that the same information appears at the bottom of the invoice and could extract it from there.

Also, through deep learning, it identifies that the stamp and handwritten text should not be extracted or retained, as there are no designated fields for them.
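To make the two ideas above concrete, here is a minimal Python sketch. It is an illustration only and not KlearStack AI's implementation: it assumes an upstream model has already labelled each text region with a zone and a kind, and the names `TextRegion`, `printed_only`, `resolve_supplier`, the zone names and the vendor list are all hypothetical. Simple rules stand in for what a trained model would learn.

```python
from dataclasses import dataclass


@dataclass
class TextRegion:
    text: str
    zone: str  # hypothetical zone names, e.g. "top_right", "bottom_right"
    kind: str  # hypothetical labels: "printed", "handwritten" or "stamp"


def printed_only(regions):
    """Drop regions that the (assumed) upstream classifier marked as handwriting or stamps."""
    return [r for r in regions if r.kind == "printed"]


def resolve_supplier(regions, vendor_names, zone_order=("top_right", "bottom_right")):
    """Check zones in order and return (vendor, zone) for the first vendor name found whole."""
    for zone in zone_order:
        for region in regions:
            if region.zone != zone:
                continue
            for vendor in vendor_names:
                if vendor.lower() in region.text.lower():
                    return vendor, zone
    return None, None


regions = [
    TextRegion("ABC Corpo", "top_right", "printed"),  # torn corner, name cut off
    TextRegion("RECEIVED 12 MAY", "top_right", "stamp"),
    TextRegion("pay by Friday", "bottom_left", "handwritten"),
    TextRegion("For ABC Corporation", "bottom_right", "printed"),
]

clean = printed_only(regions)
supplier, zone = resolve_supplier(clean, ["ABC Corporation"])
print(supplier, zone)
```

In this sketch the truncated name in the torn corner does not match any known vendor, so the search falls through to the bottom right zone, while the stamp and the handwritten note never reach the matching step at all.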


Heuristics

To ensure that your organization consistently extracts and retains meaningful data from unclear documents, KlearStack AI puts a heuristics method in place. Learning from past datasets, and as more documents are processed, KlearStack AI learns and evolves on its own, without any manually fed custom conditions for data extraction.

Say an invoice issued by ABC Corporation has the supplier's name in the top left corner of the document, and the customer name, XYZ Ltd., is placed below it. The next time an invoice from ABC Corporation is processed, KlearStack AI can automatically detect that ABC Corporation is the supplier and XYZ Ltd. is the customer. This is possible through the heuristics technique adopted via machine learning.
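The ABC Corporation example can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual KlearStack AI mechanism: the `LayoutMemory` class, its methods and the zone names are hypothetical, and a plain dictionary stands in for whatever the machine learning model retains from past documents.

```python
class LayoutMemory:
    """Remembers, per supplier, which zone of the page held which field."""

    def __init__(self):
        self._layouts = {}

    def remember(self, supplier, field_zones):
        # Store a copy of the field -> zone mapping seen on a processed invoice.
        self._layouts[supplier] = dict(field_zones)

    def extract(self, supplier, zone_texts):
        # Reuse the remembered layout to read fields from a new invoice.
        layout = self._layouts.get(supplier)
        if layout is None:
            return {}  # supplier not seen before, nothing to reuse
        return {
            field: zone_texts[zone]
            for field, zone in layout.items()
            if zone in zone_texts
        }


memory = LayoutMemory()

# First invoice from ABC Corporation: record where each field was found.
memory.remember("ABC Corporation", {"supplier": "top_left", "customer": "below_supplier"})

# Next invoice from the same supplier: the remembered layout is applied directly.
next_invoice = {"top_left": "ABC Corporation", "below_supplier": "XYZ Ltd."}
fields = memory.extract("ABC Corporation", next_invoice)
print(fields)
```

A supplier that has never been processed returns an empty result here, which is the point at which a real system would fall back to its general model.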

How is KlearStack Different from Other Solutions?

Contextual AI

KlearStack AI does not simply extract data accurately; it also understands context. Say your organization has received two different sets of invoices, one where the description field is labelled “Description of Item” and another where it is labelled “Item Descriptions”. KlearStack AI will understand that, in the context of both invoices, that column is the “Description” field and will therefore standardize the field name to “Description”, as shown in Figure 4.

Similarly, say in one invoice the field name is “Qty.” and in another invoice, it is “Quantity”. KlearStack AI will read the context and understand that the data in this column is about the “Quantity” of items and standardize it accordingly.
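The effect of this standardization can be illustrated with a small Python sketch. It is only an approximation of the idea: a hand-written alias table (`CANONICAL_FIELDS`, a hypothetical name) stands in for the contextual understanding that a trained model would provide, and `normalize_header` is an illustrative helper, not part of any KlearStack API.

```python
import re

# Hypothetical alias table; a contextual model would learn these mappings
# from data rather than rely on a fixed list.
CANONICAL_FIELDS = {
    "Description": {"description", "description of item", "item description", "item descriptions"},
    "Quantity": {"qty", "quantity"},
}


def normalize_header(raw):
    """Map a column header as printed on the invoice to a standard field name."""
    key = re.sub(r"[^a-z ]", "", raw.lower())  # drop punctuation such as the dot in "Qty."
    key = " ".join(key.split())                # collapse repeated spaces
    for canonical, aliases in CANONICAL_FIELDS.items():
        if key in aliases:
            return canonical
    return raw.strip()  # unknown headers are passed through unchanged


for header in ["Description of Item", "Item Descriptions", "Qty.", "Quantity", "Unit Price"]:
    print(header, "->", normalize_header(header))
```

Headers that are not in the table, such as "Unit Price" above, are left as they are rather than guessed at.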

No False Positive Data Extraction

Handwritten notes and text on stamps can all be captured and extracted. However, KlearStack AI does not do that if the document is unclear and already has printed text, because extracting them can create a huge problem and lead to False Data Extraction by the software. By False Data Extraction, we mean that data that was not intended to be extracted is extracted as well.

KlearStack AI understands that invoices sometimes carry notes and stamps that do not need to be captured, and its AI models are trained to recognize this and not capture such data. This brings more accuracy to data extraction and is a huge differentiator compared to any other data extraction software.

The KlearStack Edge

Through AI, deep learning and machine learning, and by keeping the software updated with the latest features and technology, KlearStack AI is an advanced data extraction software that ensures only the intended data is extracted and documents are processed with minimal to zero human intervention. If your organization is struggling to process an enormous number of documents, you can consult our experts to learn how our solution can be a huge boon for your organization.

Ashutosh Saitwal

Ashutosh is the founder and director of the award-winning KlearStack AI platform. You can catch him speaking at NASSCOM events around the world, where he is an evangelist for RPA, AI, Machine Learning and Intelligent Document Processing.