In this blog, we explore document classification through AI and ML. It covers the core parts, benefits, methods, challenges, and future paths.
Document classification is a crucial step in managing modern data. It is fundamental for every organization to organize, process, and retrieve its data on a large scale. Algorithms have replaced manual classification. They make the task scale better and cost less. The early process of document classification involved humans labeling documents.
This process was time-consuming and error-prone. As computers became more common, rule-based systems became important. They used manually defined keywords. However, their scaling problems persisted.
Document classification has experienced a significant breakthrough, all thanks to the emergence of AI and ML technologies. These technologies enable computers to learn from data. They identify patterns and generalize from what they have been trained on, thus generating faster, more accurate, and more scalable classification processes.
Document classification assigns a document to one or more pre-defined categories based on its content. This is crucial to identifying the right category for a document based on the features of its text, image, or whatever else it might contain.
This way of organizing information helps us find information. It does this by aiding storage, search, and document management.
Take a look at some of the benefits:
Document classification is the process of automatically assigning documents to pre-determined categories. It is important in many fields. Here, we explore two prominent types of document classification:
Text classification assigns one or more predefined category names to each document in the input text collection. It is used in email filtering, sentiment analysis, news categorization, and spam filtering. Some examples of text classification techniques are:
Image classification assigns an image a label or category based on its visual content. It has a wide range of applications, including medicinal imaging, autonomous driving vehicles, satellite monitoring, and object identification in surveillance systems.
Major characteristics of image classification include:
Automated document classification uses AI and ML algorithms. These systems analyse documents and group them into predefined categories. With enough labeled data, these systems can learn patterns and predict what categories new documents should be placed in.
Automating the document classification process eliminates human errors and makes processing many documents more efficient.
Machine learning is at the core of Automated Document Classification. It allows systems to learn from examples or labeled data. Document Classification Machine Learning use this learning to predict labels for new documents that have never been classified.
Some important contributions of Document Classification Machine Learning are:
Manual document classification involves a human operator categorizing each document after close inspection. This process is error-prone, labor-intensive, and time-consuming, especially when handling large and numerous documents.
Automatic document categorisation involves using machine learning to analyse and categorize documents.
This kind of sorting saves time. It scales well and is usually more accurate once well-trained. It also allows for the reduction of people needed for the big tasks of data mining and sorting. The demands for processing and sorting documents are high.
This is true in legal, healthcare, and administrative sectors. They come in a short time. Automated document sorting has shown its edge and usefulness in some areas.
Automated Document Classification provides approaches based on different machine-learning strategies for categorising documents. Some Document Classification Machine Learning approaches include but are not limited to the following:
Here, supervised learning means that a model is trained on labelled data. The class label for each document is known in advance. The process is usually composed of the following steps:
Unsupervised learning methods are used when you do not have labelled data at hand. It does not use predefined categories. Instead, it finds documents that are similar. They are similar in their content. Steps:
Semi-supervised learning takes a middle ground. It sits between supervised and unsupervised learning. It uses a small amount of labelled data and a much larger pool of unlabelled data.
For instance, when labelled data is very costly or time-consuming to obtain, this approach could be effective. Here’s how it works.
In this section, we’ll go over some code and break it down to understand how you can make your auto-classification algorithm with Python.
First, we start with importing the following libraries:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer
from nltk.text import wordnet
from nltk import SnowballStemmer
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, precision_recall_curve
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
NUM_OF_THREADS = 10
Our data, let’s say, consists of comments from different users that we are going to try to label as toxic or not toxic. For this particular example, we’ll work on the data provided in Kaggle at the Toxic Comment Classification challenge given as under:
We have some utility, models, visualizations, evaluations, and other NLP libraries that have been imported.
We’ll now import our data into the notebook: (this format helps to import data into a Kaggle Notebook)
data = pd.read_csv("../input/traindata/train.csv")
data.dropna(inplace=True)
data.reset_index(inplace=True, drop=True)
Text = "comment_text"
On stack overflow, one can easily find the commands to remove certain regex pattern matching.
def regex(text):
text = text.apply(lambda x: re.sub("\S*\d\S*"," ", x))
text = text.apply(lambda x: re.sub("\S*@\S*\s?"," ", x))
text = text.apply(lambda x: re.sub("\S*#\S*\s?"," ", x))
text = text.apply(lambda x: re.sub(r'http\S+', ' ', x))
text = text.apply(lambda x: re.sub(r'[^a-zA-Z0-9 ]', ' ',x))
text = text.apply(lambda x: x.replace(u'\ufffd', '8'))
text = text.apply(lambda x: re.sub(' +', ' ', x))
return text
Data[Text] = apply_regex(data[Text])
Essentially this now removes text such as numbers and words that are concatenated with numbers, emails, hashtags, URLs, multiple spaces, etc. All of these are not necessary for our model to verify if a comment is toxic or not.
Preprocess the data using an NLP library - SpaCy to remove the stop words that are predefined in the library:
preprocess = spacy.load("en_core_web_sm")
stop_words = preprocess.Defaults.stop_words
Apply stemming using the following code:
stemmer = SnowballStemmer(language="english")
def applyStemming(listOfTokens):
return [stemmer.stem(token) for token in listOfTokens]
data['stemmed'] = data['tokenized'].apply(applyStemming)
Check the sample of the data using the code here:
data.sample(10)
Some visualizations that you can make to check the non-toxic words are given under.
(Note: the visuals for toxic text have a lot of unprofessional words so it's best if you don’t execute it, this code is here only for demonstration and is picked off from one of the other competition participants at Kaggle. Best to stick to the non-toxic word cloud, thanks!)
wordcloud_pos = WordCloud(collocations=False,
width=150
height=600,
max_font_size=150
).generate(pos_text)
plt.figure(figsize=(15, 10))
plt.imshow(interpolation = “bilinear”)
plt.axis("off")
plt.title(f"Most common words associated with non-toxic comment", size=20)
plt.show()
Splitting the train test data:
X_train, X_test, y_train, y_test = train_test_split(data[“stemmed"], data["label"])
Now converting the data into the TF-IDF embeddings is a basic frequency-based approach.
Term Frequency–Inverse Document Frequency (TF-IDF) is a very common method that is used to compute word importance across documents in data. The assumption is that the more times a word appears in a document, the more important that word is for that document compared to the rest.
The TF-IDF assigns, accordingly, a weight to each word based on its order of occurrence frequency. In the end, the words assigned a lower weight are words that occur in all documents in general.
A bag of word representation is used to show an array containing the scores of each word and the word order is lost and context is not considered this way to improve the computing speed on a local machine.
Use the following code to fit and transform the data into an array as required:
tfid = TfidfVectorizer(lowercase=False, max_features=500)
train_vectors_tfidf = tfid.fit_transform(X_train).toarray()
test_vectors_tfidf = tfid.transform(X_test).toarray()
Use the following code to normalize the TF-IDF vectors:
norm_TFIDF = Normalizer(copy=False)
norm_train_tfidf = norm_TFIDF.fit_transform(train_vectors_tfidf)
norm_test_tfidf = norm_TFIDF.transform(test_vectors_tfidf)
In terms of the algorithm being used, we’ll use a Naive Bayes classifier as the algorithm on our model.
model = MultinomialNB()
A custom function to receive back a dataframe with all our evaluation metrics:
def classifier(y_test, predictions, modelName):
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
prec=precision_score(y_test, predictions)
rec=recall_score(y_test, predictions)
f1=f1_score(y_test, predictions)
acc=accuracy_score(y_test, predictions)
# specificity
spec=tn/(tn+fp)
score =
df_score = pd.DataFrame(data=score)
return df_score
To train the data and test it:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model.fit(X_train, y_train)
preds = model.predict(X_test)
scores = classifier(y_test, preds, name)
To check the model scores:
scores
Starting with cleaning the data and performing a common NLP pipeline, the embedding methods are used to form a frequency-based basic TF-IDF approach. A different baseline model would give a different outcome and that might also change with other hyperparameters that work with the best performing model.
The end is to just depict how accurately you can take a document classifier or a comment classifier in this case and make a working model.
Note: This is only an example demonstration and of course, to put the classifier to production use cases, a much more advanced algorithm will be required which might not be possible for everyone to develop. This is where alternative ways of using a document classifier come into the picture which leads to optimization of the process as a whole.
OCR is a computer technology, that lets different types of documents, like scanned paper, PDFs, or digital camera images, become searchable and editable. It does this by recognizing text characters in an image.
OCR technology typically involves several steps:
Combining OCR with document classification leads to more precise and reliable results. It does this by turning images of documents into searchable text.
With document classification, machine learning models can now work directly on documents' textual contents, shifting them away from dependence on metadata or external data sources.
Some benefits of Document Classification OCR include:
AI-based document classification is an emerging AI technique. It assigns documents to categories. It helps manage documents by their content. Document Classification uses machine learning (ML) and natural language processing (NLP). AI improves document classification in several ways:
Document Classification NLP techniques that play a crucial role:
AI revolutionizes document management strategies and decision-making processes:
In an IDP workflow, irrespective of supervised or unsupervised learning technique adopted, document classification works on 3 levels:-
Since IDP solutions deal with multiple document formats, the first step is to determine whether the file is a jpeg/png/pdf/tiff or any other format. Whether the file is scanned or non-scanned pdf is determined at this level.
Based on the structure, documents come in 3 categories:-
These documents have fixed templates, layouts, key-value pairs, and tables. Tax return forms and mortgage applications are the best examples of structured documents.
These documents may have a fixed set of key-value pairs and tables but they vary in terms of layouts and templates. They may often have similar information at different places in the different documents. Invoices are the best example of semi structured documents.
These documents have no structure at all. There are no key-value pairs, formatting, or tables. Documents are textual in nature and carry information embedded in paragraphs. Contracts are best examples of unstructured documents.
Documents are classified into respective categories at this level. This process has certain steps:
In some IDP workflows, this step comes before identifying the document structure. The aim of this step is to identify/distinguish the text from background. Certain techniques such as binarization, deskewing, and noise reduction are used to improve the quality of the document to be processed.
The quality of the tagged dataset is the most important component of a statistical Natural Language Processing (NLP) classifier. The dataset needs to be large enough and must be of a high-quality so that the model has sufficient information of clear delineation for a document type from others.
Classification methods are of two types:
i) Visual Approach
In this approach, computer vision analyzes the visual structure of the document without reading its text. This approach works well for structured documents, and in some cases for semi-structured documents as well.
It works on the idea that different document types have information laid out in a document at definite places and patterns. If the model is able to identify those patterns and distinguish them from the patterns on other document types, it classifies the document accordingly.
The advantage of this approach is that it happens during the scanning phase thus saves a lot of time.
ii) Text classification approach
In this approach, OCR reads the text from the documents, classifies the text, and moves on to classifying the document based on the information derived. With text classification, text can be analyzed at different levels:
1. Document level - All the text in a document is read.
2. Paragraph level - Text in a particular paragraph is read.
3. Sentence level - Reads text from a particular sentence.
4. Sub-sentence level - Specific phrases are read.
Let’s discuss both the visual and textual recognition techniques in detail.
Document classification algorithms function based on different recognition methods. The recognition techniques work based on text classification or visual classification.
The types of recognitions involved in classifying documents in the aforementioned types of learnings are given as under:
At times, documents in question are so different from each other that there is no need to read their text to classify them - they can be classified by just looking at their structure and style.
For example - an invoice and a tax form are so different from each other that you don’t have to read and analyze their entire text to classify them. They can be classified solely based on their structure.
With the capabilities of Computer vision, a document is broken down into pixels to learn about its structure, style, and layout. The pixels are analyzed to make up an image and are then identified as objects when together, and subsequently classified.
Computer Vision has grown out as a branch of computer science today where computers are being taught to make sense of an image. From Self Driving Cars to AI recognition in your smartphone, it all involves computer vision. The possibilities with computer vision are only growing with the years and it has a wide range of applications such as facial recognition, character recognition, pattern recognition, etc.
The backend CV algorithm is complex and depends on the use case to work precisely. It requires a lot of data and sometimes can be trained to even recognize hand gestures etc. in the simplest CV algorithms. The more modern approaches of self-driving cars etc. use Deep Learning models that involve CNNs, LSTM, Transformers, etc.
In computer vision and image processing, a feature is simply a piece of information about the image being processed. This information is used to classify different building blocks in documents. Based on the format of a particular document type, different blocks of information are recognized by the CV algorithm eventually using this information to classify documents.
Textual recognition works on the idea to recognize text with a definitive context associated with it. This is then used with lexical processing to understand the underlying genre, theme, and emotion of the sentence to lead the organization to pick off the class that the document might belong to (to a certain level of accuracy).
There are 3 ways in which textual recognition works:
In a simple OCR scanner that is added hardware to a system, light and dark areas are identified. The dark areas are then processed to be classified as alphabetical characters or numbers, and then it takes one character or word at a time and is recognized.
Taking this to the next level using computer vision in an algorithm with a back-end language, pattern recognition is used to feed examples of text in different forms for a system to recognize out of a scanned document or an image.
Feature detection then applies the rules of OCR for identifying features of a document. The features can be added to identify the number of lines, curves, crosses, etc. in a particular character, and then the character is identified and stored as the ASCII code within the system to handle any further manipulations to the elements.
OCR program can do this on a series of blocks, texts, tables, images, formats, pages, etc. to break down the documents to their character levels and then make sense of them together to create a program that will present you with the recognized text and its classification, as required.
Essentially, optical character recognition (OCR) technology makes data entry effortless and classification simpler by creating effortless text classifications that would otherwise take hours to do manually.
One step beyond the OCR is to perform Document Classification with an NLP algorithm using a programming language like Python.
Rule-based text recognition recognizes words in a document in different ways such as isolation, explicit word segmentation, simultaneous recognition, etc. It can also be based on searching certain terms in a document to understand where they might belong.
‘Rules’ in a rule-based text recognition system guide the system to identify semantically relevant elements of a text to classify it into relevant categories based on its content. Each rule consists of an antecedent i.e. a pattern with a category or classification.
For example, if you want to classify topics into two groups: Food and Careers. First, define words that fall into each category (for example - Dark chocolate, Lettuce, Fries, etc. fall into Food, and Engineers, doctors, accountants, etc. fall into careers)
Counting the instances of these words in an incoming text based on the trained algorithm will simply see which type of words occur more than another and then classify the text accordingly.
For example - a sentence that says, “Careers in the industry of engineers and doctors are seeing a massive trend of eating more dark chocolate.” - the classifier will classify this document with the text as one that falls into the ‘careers’ category.
Rule-based systems are not black box algorithms and can be developed easily. These algorithms have certain disadvantages as they require domain knowledge and are time-consuming.
Since generating rules for a complex system can be quite challenging, it needs a lot of data too. Rule-based systems are also difficult because, in their upkeep, they require a lot of new rules which don’t scale well with existing rules at times.
Natural Language Processing algorithms differentiate between documents by using different lexical and semantic processes that can be combined with techniques like a bag of words, tokenization, and word stemming and using stopword removal processes to arrive at an algorithm that can differentiate between different classes of documents based on the words in the document.
It is easy to find a platform to conduct document auto-classification to skip the entire hassle of having to code the feature recognition engine or a textual classifier with NLP using a coding language like Python.
Document classification involves putting documents into predefined classes. The classes are based on their content. Document classification, while beneficial, poses several challenges that organizations must address for successful implementation:
To address Document Classification challenges, organizations can employ several strategies and tools:
Clean data to protect it from unnecessary characters, standardize its formats, eliminate errors, and maintain consistency. The use of Optical Character Recognition (OCR) ensures better data quality when images are transformed into text documents.
Use natural language processing (NLP) techniques such as TF-IDF or word embeddings to extract features from text data that improve the accuracy of classification models.
Option 1: Train the machine learning models (Naive Bayes, SVM, different type of deep learning models etc.) on training data consisting of documents with predefined categories, then apply the model on the test collection documents and assess the results in terms of precision and recall.
Option 2: Semi-supervised learning and unsupervised learning approaches. This mainly depends on availability of pre-labelled documents or computational resources for unsupervised learning.
Automate workflows and integrate document classification with existing enterprise systems (eg ECM, CRM) to help streamline processes and ensure consistency.
Observe and assess the performance of the classifier with metrics like accuracy, precision, recall, and F1-score. Adjust models based upon feedback from the observed results and the changing patterns in the data.
To ensure successful implementation and maintenance of document classification systems, follow these best practices:
Classifying documents helps organisations in many industries. It improves their workflow, productivity, and compliance. Here are some real-life examples, which show how the industry uses document classification.
Legal proceedings involve many legal documents (case files, contracts, legal filings, and more). So, legal firms must be able to classify documents. Automated classification lets courts add documents to the right folders and index them.
This indexing helps for future reference. It also improves case-management efficiency. It does so by adding extra fields of information to the indexing metadata. These fields can support the legal process.
Also, automated classification is key for compliance. The software can automatically find and tag sensitive information. This includes client or crime data, and confidential legal trade deals.
Banks use document steganalysis to speed up tasks. These include invoice filings and loan approvals in Financial Services. Document classification helps to organize invoices, statements, and loan applications.
It lets banks get data from these documents faster, check their accuracy, and obey regulations. This speeds up decision-making and enhances the customer experience through shorter response times.
Sorting documents can handle medical files, patient history, and insurance claims. Automating this task lets us group documents by patient, disease, or remedy. This lets us prepare them quickly and find them easily. It speeds up registering insurance claims and helps clinicians make decisions to offer better care.
Categorizing documents is very useful during employee onboarding. They are overwhelmed with many papers. Automated pre-coding can sort out hiring, training, managing, evaluating, promoting, and salary-classifying documents.
It can keep up with HR policies and smooth annual audits. This helps as hundreds of employees of all ranks occur.
Government agencies rely on document classification. They use it to publish public and legislative documents, and to organize regulatory filings. Automating this classification helps agencies. It makes transparency and governance better.
It makes government operations more accessible to the public. Automated classification supports decision-making, policy-making and regulatory compliance.
Retail and eCommerce rely on document classification. It speeds up order processing, customer service, and inventory management. For example, sorting sales orders into the right categories helps. Sorting customer queries and supplier agreements does too.
It lets them streamline their processes and cut customer service wait times. It also helps them keep accurate and up-to-date inventories. This improves customer service and lowers operating costs.
Educational enterprises use document tagging to organize student records. They also use it for theses, dissertations, and research findings. Automatic categorisation helps academic planning. It also aids research, publication, and admin operations.
These include admissions and student services. It aids in compliance with education regulations and accreditation standards.
Document classification helps a real estate firm. It assists with managing property documents, transaction records, and legal contracts. Automated categorisation simplifies property transactions, lease agreements, and compliance.
It cuts the administrative overhead and improves accuracy with documents. This helps accelerating deal closures, increasing client satisfactions.
You can automatically detect the type of document, including invoices, passports, Aadhar cards, and more. This can be done using Docsumo for document auto-classification.
Step 1: Click ‘API and Services’ First, go to the section in Docsumo’s user interface that says ‘API and Services’ . This is where you can set up document classification settings and configurations.
Step 2: Create Classifications for Document Types. Go to the ‘API and Services’ menu and click the ‘Actions’ menu (on the right side of the page). Use the 'Actions' menu to create classifications for the document prototypes you want to classify. Enabling a document type gives Docsumo instructions to identify and classify the documents according to its type. You can identify the status or ability of a document type to be classified by the colour of its label – if it’s green, Docsumo is ready to classify documents of the type. Select the document type if the label is grey and click ‘Enable’ from the drop-down menu.
Step 3: Click ‘Auto-classification’ and Enable ‘Auto-classification’ after ticking the required document types. Each document type you selected in step 2 must be trained with at least 20 sample documents so that Docsumo’s machine-learning models can perform their tasks successfully based on features and patterns observed from those samples.
Step 4: Upload Your Documents Go back to the ‘Document Types’ section and upload the documents that you want to classify. Uploading documents in bulk into the auto-classification zone will process them according to the enabled document types and classification rules.
Step 5: Get Secret Document Classification After all OCR processing is done, Docsumo provides the intelligently classified outputs. The output will be classified under different document types and displayed under the ‘Types’ tab of the Docsumo dashboard. All documents can be organised based on various categories. Direct navigation to a category will allow storing and retrieving documents related to that category.
Auto-Assign for Team MembersIf there are different document types that need to be evaluated and validated by different team members, on Docsumo, you can use the ‘Auto-Assign’ option shown below:
1. Hardcoding an algorithm can cost your organization a huge sum to set up a server, get the developers, work on preparing the data for the algorithm to work on, and can be time taking while none of these costs are incurred when using a service like Docsumo instead.
2. It is almost impossible to consider manually entering millions of rows of data from millions of documents and hard coding an algorithm to do that is expected to end up in some errors. With Docusmo you can always go back and review the outcome that the algorithm delivers to ensure that accuracy is not compromised.
3. You can not define a predefined function to directly verify things like the difference between Gross Income and tax to be the Net income. You can identify them in the document to be at a specific location but not double-check the values using custom settings because it requires very complicated semantic analyses on the back-end and you can add multiple such checks in Docsumo.
At Docsumo, we take data protection and security very seriously. Docsumo is a GDPR compliant and SOC-2 certified company. All requests get transferred over HTTPS only, and data transfer gets encrypted with AES 256. All the stored data on S3 & Mongo dB also gets encrypted.
You remain in power by choosing to delete the data from our servers promptly or periodically after you have completed document processing. You can monitor individuals with access to different data types in your organization via advanced user management.
We realize that no platform exists in a vacuum, so we have built our solutions to integrate with other software and solutions. By employing plug-in APIs and out-of-the-box input and output connectors, our platform can conveniently get integrated into any workflow.
Document classification software automates the work, streamlines workflows, and reduces manual effort and human error.
It ensures deadlines and compliance, saves time and money processing documents, and generally makes information readily available for your in-house staff, who need to quickly get back to needing customers.
Organizations must embrace AI and ML-driven technologies to streamline the document life cycle, increase productivity, better aid decision-making in complex situations, and ensure compliance—in other words, to survive and thrive in this new data-driven landscape.