Pre-processing Techniques in OCR

Verified

Added on  2020/10/01

|4
|3201
|127
AI Summary
The provided assignment is on pre-processing techniques in Optical Character Recognition (OCR). It includes a list of references [1-15] to research papers and articles that discuss different methods used to enhance OCR accuracy. The references cover topics such as Chinese handwriting databases, confusion sets, radical decomposition, error correction, and deep learning-aided OCR techniques. This document is likely intended for students studying computer science or related fields and provides valuable information on pre-processing in OCR.
tabler-icon-diamond-filled.svg

Contribute Materials

Your contribution can guide someone’s learning journey. Share your documents today.
Document Page
Research Proposal
An Integrated Model for Offline Handwritten Chinese Char
Recognition Based on Convolutional Neural Networks
Kate Nguyen
ktnguyen17@earlham.edu
Department of Computer Science
Earlham College
Richmond, Indiana
ABSTRACT
Optical Character Recognition (OCR) is an important technology
in computer vision and pattern recognition that recognizes text
embedded in images. Although the OCR achieved high accuracy for
languages with alphabet-based writing systems, its performance
on handwritten Chinese text is poor due to the complexity of the
Chinese writing system.In order to improve the accuracy rate,
this paper proposes an integrated OCR model for Chinese hand-
writing that combines existing methods in pre-processing phase,
recognition phase, and post-processing phase.
KEYWORDS
Chinese handwritten recognition, convolutional neural networks
(CNN), feature extraction, statistical text models, image pre-processing
.
1 INTRODUCTION
Optical Character Recognition (OCR) is a technology that enables
text recognition within digital images. OCR has become one of the
most critical techniques in computer vision and pattern recognition
for its broad applicability to different problems in these fields [9].
However, applying OCR to recognize handwritten Chinese char-
acters is challenging due to the complexity of the Chinese writing
system, which is not alphabet-based.
This research project aims to improve the accuracy rate of the
offline OCR for handwritten Chinese characters.An integrated
OCR model is proposed for Chinese handwriting based on exist-
ing methods used in different phases. It is a combination of image
pre-processing techniques, a feature extraction and classification
architecture based on CNN,and statistical text models for error
correction in the post-processing phase. This research will also pro-
pose a novel statistical model and compare the new combination
with the combinations consisting of existing methods.
The remaining of this paper is structured as follows: Section 2 pro-
vides an overview of the related works.Section 3 describes the
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
CS388, May 2020, Earlham College
© 2020 Association for Computing Machinery.
designs of different components in this study. Section 4 discusse
the research budget. Section 5 provides a timeline for this projec
2 RELATED WORK
There has been much research on different methods in compute
vision and natural language processing that enables the researcher
to enhance the performance of the OCR. This section will provide
an overview of the two most common approaches adopted by re
searchers to solve the performance problem.These approaches
include the one that focuses on building the appropriate CNN archi-
tecture for feature extraction and classification and the approac
applying statistical text models for error correction after the recog-
nition phase.
2.1 Feature Extraction with Convolutional
Neural Networks
One traditional approach to Chinese text recognition is feature ex
traction. This technique extracts features from character images fo
the neural network classifiers to learn. Although many researchers
apply feature extraction, there are various ways they can approach
this problem. Some studies focus on choosing the proper neural
network models for Chinese handwritten text. For instance, Yin et
al. conducted experiments on four different neural networks, includ
ing convolutional neural network (CNN), Visual Geometry Group
neural network (VGG), capsule network (CapsNet), and residual
network (ResNet), on a database of Chinese uppercase character
[14]. The ResNet achieved the highest accuracy, which was 99.38%
Zheng et al. also focused on developing a suitable neural network
model for handwritten Chinese character recognition system [15].
In their research paper, they proposed a back propagation algorithm
of artificial neural networks that extracts the features of Chinese
characters. The authors mentioned that the model achieved a co
rect rate of over 98% when tested with printed Chinese character
However, the recognition accuracy for the handwritten Chinese
character dataset was not reported.
Another method dealing with feature extraction is finding the group
of features that facilitates the highest performance of the OCR. Kim
et al. examined two groups of features to find out the better grou
of features for the recognition process and proposed synthetic fe
tures extraction and classification [2]. They concluded that the
group of features that includes ratio, Hu-Moments, massive cen
tral coordinates,x-projection,y-projection,shape numbers,and
perimeter length,is more useful to the recognition system.The
classic classifier using these features achieved an accuracy of 98
tabler-icon-diamond-filled.svg

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.
Document Page
CS388, May 2020, Earlham College Kate Nguyen
and the synthetic classifier,which is twice as fast as the former,
obtained an accuracy of 92.4% over the dataset of printed Chinese
characters.
While some methods achieved significantly higher rates of accu-
racy, it is worth noting that these studies used different types of
datasets. The method that Yin et al. obtained an accuracy of 99.38%.
However, the proposed targeted Chinese uppercase characters ex-
clusively, which is a particularly small set of Chinese characters
[14]. Similarly,the methods that Zheng et al. [15] and Kim et al.
[2] proposed, which achieved the accuracy rates of 98% and 92.4%,
respectively, both tested the models on printed Chinese characters,
which are clearer and less complicated to be recognized compared
to handwritten characters. Hence, the rates of accuracy achieved by
the methods mentioned in this section might not reflect the state-of-
the-art performance of handwritten Chinese character recognition.
2.2 Post-processing with Statistical Text
Models
A different approach to improve the accuracy of the OCR is working
on error correction in the post-processing phase instead of directly
improving the performance of the recognition models. Most studies
in this category applied statistical text models to correct errors of
the OCR after the recognition phase.
Tseng proposed an error correction technique for OCR that iden-
tifies keywords and confusable terms [11]. The algorithm first ex-
tracts maximally repeated patterns from each document; then, it
will collect terms with three or over three characters. If two terms
only differ in one character, the two characters that differentiate
these two terms will be reported as a confusing pair. The author
claimed that this post-processing method achieved an accuracy
of 84% and,therefore,can be used to improve the performance
of the OCR.Liu and Cao also demonstrated a similar approach
called a seed-based method [5]. This method consists of two stages:
forming a seed of confusion set for each character and creating a
graph of confusion sets, then developing a self- extension method
to extend the graph. The proposed method resulted in an increase
of confusing pairs by 36.4%. Although the impact of this study on
the performance of the OCR was unknown as this work had not
been directly applied to the OCR model, a significant increase of
confusing pairs might lead to a substantial improvement in the
accuracy of the OCR.
Wang and Liu proposed a method that helps develop a more accu-
rate OCR system for Chinese text in rare books [13]. The proposed
method consists of N-gram, long short-term memory (LSTM), and
ard and forward N-gram (BF N-gram) models. Through their exper-
iments, they concluded that N-gram, LSTM, or BF N-gram are not
sufficient to reduce the errors of OCR. Adding rules of OCR Top 5
and the probability of Top 1, where Top 5 are the candidate words
with the highest classification probability, and Top 1 is the result of
it, provides more information to make choices.
Adopting the same approach, Li et al. presented a contextual post-
processing method that combines character-based bigram post-
processing and word-based bigram post-processing for Chinese
characters recognition [3]. They first executed character-based bi-
gram post-processing using a forward-backward search on a large
candidate set in order to improve the accuracy and efficiency of the
set. The word-based bigram post-processing was then executed o
a small candidate set to improve accuracy further. They concluded
that their new post-processing method both improves accuracy and
boosts the processing speed, which was 100 times faster compare
to word-based bigram post-processing.
3 DESIGN
This research project will study different combinations of the exist-
ing methods that aim to solve the same problems, which include
methods in the recognition phase and the post-processing phas
and add an image pre-processing phase that happens before impl
menting each combination. A novel statistical text model that is
applied in the post-processing stage is proposed. The new comb
nation that consists of the novel model will be compared with the
combinations of existing methods in terms of recognition accuracy.
3.1 Framework
The handwritten Chinese characters dataset will be arbitrarily di-
vided into two sets: train dataset and test dataset. The images i
both datasets will be treated using image pre-processing technique
including binarization,skew correction,and denoise in order to
improve the quality of the images. This facilitates the recognition
model to achieve a better result. The treated train dataset will be f
to the CNN to extract the features and train the classification mode
After the training stage, the treated test dataset will be fed to th
trained CNN model to classify the characters. In the post-processin
phase, the errors existing in the results from the recognition phas
will be corrected to increase the accuracy rate of the OCR. The pos
processing process is based on statistical text models. The mode
that will be applied in this research include Bayesian, backward
and forward n-gram [13], and a novel model. The framework of
this project is shown in Figure 1.
Figure 1: Framework of the project
3.2 Pre-processing Techniques for OCR
3.2.1 Binarization.Binarization is the process that converts col-
ored image into an image that consists of only black and white
pixels [1]. Pixel values in an image will be separated in to two
Document Page
Research Proposal
An Integrated Model for Offline Handwritten Chinese Character Recognition Based on Convolutional Neural NetworksCS388, May 2020, Earlham College
groups (black as foreground and white as background) depending
on the selected threshold.
3.2.2 Skew Correction.The text in scanned documents might be
oriented at a certain angle with horizontal. Skew correction is the
process in which the skew angle is calculated and corrected it by
rotating the image by an angle that is equal to the skew angle in
the opposite direction [8].
3.2.3 Denoise.Denoise is an image processing technique that re-
moves small dots that have high intensity than the rest of the image
to make it smoother [12].
3.3 Convolutional Neural Networks
A CNN architecture consists of two parts: feature extraction and
classification [7]. In the feature extraction part, different filters will
be applied to input images to generate feature maps. An activation
function, in this case, the rectified linear unit (ReLU) function, will
be applied to increase non-linearity.The next step is applying a
pooling layer to reduce the spatial size of the feature maps, which
helps decrease the computational power. The above process might
be repeated multiple times. The pooled images will be flattened into
a column vector, which will be fed to a fully connected layer. The
fully connected layer will classify the images into different classes.
The number of neurons in this layer is the same as the number of
classes.
3.4 Statistical Text Models
In the post-processing phase,statisticaltext models,including
Bayesian, backward and forward N-gram [13], and a novel model,
will be applied to correct errors from the previous phase. The post-
processing model will apply probability to find the non-existing
characters (non-word errors) or characters that exist but do not
fit the context (real-word errors), which indicates that the model
predicted the wrong characters, and eliminate these errors [10].
3.5 Dataset
The training and testing datasets are from CASIA Online and Of-
fline Chinese Handwriting Databases [6][4]. The Offline Database
consist of separated datasets for training and testing purposes. The
isolated character dataset (pre-segmented) will be use to train the
character recognition model. The text line dataset, which includes
lines of characters, will be used in the OCR accuracy testing step
and the post-processing phase.
3.6 Experiment Design and Evaluation
The project will be evaluated based on the rate of classification
accuracy, which is the percentage of correctly classified characters.
The trained CNN model will be tested using the text line dataset.
The errors in the results from the recognition phase will then be
fed to the statistical text models in the post-processing phase. After
the statistical models correct the errors, the pipeline will be tested
using the text line dataset. The accuracy rates after post-processing
phase with different statistical text models will be compared to find
out the most appropriate model.
4 BUDGET
This project uses Python and TensorFlow or PyTorch, which are
open source software and are available on some of the servers
the department. Hence, this project does not require purchasing
software or hardware.
5 TIMELINE
First month
First half: Implement image processing techniques and do
experiments
Second half:Do experiments on different type of CNN
architectures,choose the most appropriate CNN model
(number of layers,layers to include,type of activation
function)
Second month
First half: Continue to work on CNN model
Second half: Implement and test some existing statistical
text models
Third month
First half:Try to come up with a novelstatisticaltext
models; Implement and test it
Second half: Do experiments on different combinations of
methods in three phases
Fourth month
First half:Continue with the previous step;Debug and
make sure the pipeline works
Second half: Finish up; Write paper
6 ACKNOWLEDGEMENT
I would like to thank Dr. David Barbella and Dr. Xunfei Jiang for
their support and feedback on my project idea and research pro
posal.
REFERENCES
[1] Rawia I. O. Ahmed and Mohamed E. M. Musa. 2016.Preprocessing Phase for
Offline Arabic Handwritten Character Recognition.International Journal of
Computer Applications Technology and Research 5,12 (2016),760–763.https:
//doi.org/10.7753/ijcatr0512.1005
[2] Chul Kim, Jang Su Kim, and U. Ju Kim. 2019.A study on features for improving
performance of chinese OCR by machine learning.ACM International Conference
Proceeding Series (2019), 51–55.https://doi.org/10.1145/3341069.3342991
[3] Yuanxiang Li, Chew Lim Tan, and Xiaoqing Ding. 2002.Combining Character-
Based Bigrams with Word-Based Bigrams in Contextual Postprocessing for Chi-
nese Script Recognition.ACM Transactions on Asian Language Information
Processing 1, 4 (2002), 297–309.https://doi.org/10.1145/795458.795461
[4] Cheng Lin Liu, Fei Yin, Da Han Wang, and Qiu Feng Wang. 2011.CASIA online
and offline Chinese handwriting databases.Proceedings of the International
Conference on Document Analysis and Recognition, ICDAR (2011), 37–41.https:
//doi.org/10.1109/ICDAR.2011.17
[5] Liangliang Liu and Cungen Cao.2016.A seed-based method for generating
Chinese confusion sets.ACM Transactions on Asian and Low-Resource Language
Information Processing 16, 1 (2016).https://doi.org/10.1145/2933396
[6] Institute of Automation of Chinese Academy of Sciences. 2011.CASIA Online
and Offline Chinese Handwriting Databases.http://www.nlpr.ia.ac.cn/databases/
handwriting/Home.html
[7] Prabhu.2018. Understanding ofConvolutionalNeuralNetwork (CNN)
Deep Learning. https://medium.com/@RaghavPrabhu/understanding-of-
convolutional-neural-network-cnn-deep-learning-99760835f148
[8] Susmith Reddy.2019. Pre-Processing in OCR. https://medium.com/
@susmithreddyvedere/pre-processing-in-ocr-fc231c6035a7
[9] Daming Shi, Robert I. Damper, and Steve R. Gunn. 2003.Offline handwritten
Chinese character recognition by radical decomposition.ACM Transactions on
Asian Language Information Processing 2,1 (2003),27–48. https://doi.org/10.
1145/964161.964163
Document Page
CS388, May 2020, Earlham College Kate Nguyen
[10] Xiang Tong and David A. Evans. 1996.A Statistical Approach to Automatic OCR
Error Correction in Context.Proceedings of the Fourth Workshop on Very Large
Corpora (WVLC-4 (1996).
[11] Yuen Hsien Tseng. 2002.Error correction in a Chinese OCR test collection.SIGIR
Forum (ACM Special Interest Group on Information Retrieval) (2002),429–430.
https://doi.org/10.1145/564437.564478
[12] Neerugatti Varipally Vishwanath and K Navya Durga Pavani. 2018.THE DIF-
FERENT PRE-PROCESSING TECHNIQUES USED IN HANDWRITTEN TELUGU
CHARACTER RECOGNITION.2 (2018), 17–21.
[13] Hsiang An Wang and Pin Ting Liu. 2019.Towards a higher accuracy of optical
character recognition of Chinese rare books in making use of text model.ACM
International Conference Proceeding Series (2019), 15–18.https://doi.org/10.1145/
3322905.3322922
[14] Yue Yin,Wei Zhang,Sheng Hong,Jie Yang,Jian Xiong,and Guan Gui.2019.
Deep Learning-Aided OCR Techniques for Chinese Uppercase Characters in the
Application of Internet of Things.IEEE Access 7 (2019),47043–47049.https:
//doi.org/10.1109/ACCESS.2019.2909401
[15] Liangbin Zheng,RuqiChen,and Xiaojin Cheng.2012.Research on Offline
Handwritten Chinese Character Recognition Based on BP Neural Networks.2019
9th International Conference on Information Science and Technology (ICIST) 51
Iccsit 2011 (2012), 323–329.https://doi.org/10.7763/IPCSIT.2012.V51.55
chevron_up_icon
1 out of 4
circle_padding
hide_on_mobile
zoom_out_icon
[object Object]

Your All-in-One AI-Powered Toolkit for Academic Success.

Available 24*7 on WhatsApp / Email

[object Object]