Abstract

The performance of any machine learning model largely depends on the input data provided. The higher the volume and variety of the data, the better the machine learning models get trained, thereby producing more accurate results. However, in some cases it is challenging to obtain a high volume of data with enough variety; handwritten character recognition for the Odia language is one of them. NITROHCS v1.0 for handwritten Odia characters and the ISI image database for handwritten Odia numerals are the standard Odia language datasets available to the research community. This paper reports the performance of five different machine learning models that use convolutional neural networks to identify handwritten characters when the handwritten datasets are manipulated and expanded using several augmentation techniques to create variation and increase the volume of the data. With the augmentation techniques discussed in the paper, these models achieve a further increase in accuracy of approximately 1% across the models. The claims are supported by the results of experiments conducted with the proposed convolutional neural network models on the standard available Odia character and numeral datasets.

1. Introduction

By using their eyes and brains, humans can see and visually perceive the world around them. Making computers capable of perceiving and processing images in the same way that humans do is the goal of computer vision. The domain of computer vision has produced a number of techniques for image recognition. From a given sensory input, hierarchical layers of representation are learned by a deep neural network (DNN) to perform pattern recognition [13]. These deep architectures have recently shown remarkable outcomes, often on par with human performance [4, 5]. However, despite more than five decades of intensive research, the computer’s reading ability is still far below that of humans. Most optical character recognition (OCR) technologies are still unable to read deteriorated documents or handwritten notes.

In the past, handwriting recognition algorithms relied heavily on handcrafted features and required extensive prior knowledge. Under these constraints, it is difficult to train an optical character recognition (OCR) system, and the resulting classification accuracy is comparatively low. Deep learning methods are now at the forefront of handwriting recognition research and have produced some outstanding achievements in recent years. The growing amount of handwritten data, combined with the availability of massive computational power, has resulted in increased recognition accuracy, inspiring researchers to continue their work in the field of character recognition using convolutional neural networks (CNNs).

CNNs are particularly effective at extracting the various features of handwritten characters and recognizing their structure automatically. However, there are some limitations, such as the fact that CNN models frequently require massive amounts of data for training. Data augmentation techniques generate different replicas of the same data, introducing variants and artificially increasing the volume of an existing dataset available for training. A deep learning model trained on augmented images along with the original images outperforms one trained only on the original images. In general, augmentation also reduces the cost of collecting data when data are scarce and enhances the generalization ability of the models.

If we look at the state of the art in HCR for the Odia language, there are far fewer contributions in this area of research compared to other Indian languages. The roundish shape of Odia characters, the presence of a large number of modified and compound characters, and the similarity between different characters make it very hard to build a satisfactory classifier for this language. So, in our proposed CNN models, we aim to achieve human-like accuracy for Odia HCR.

The proposed work has two objectives:
O1: to attain comparable accuracy for handwritten Odia digit and character recognition using a regularized CNN architecture.
O2: to investigate various augmentation methods and how they affect the proposed CNN architecture’s performance.

So, this work’s main contributions are as follows:
C1: a thorough evaluation of five different baseline models, obtained by varying the number of features in the convolutional layers and the number of units in the dense layer from one architecture to the next.
C2: L2 regularization and spatial dropout added to the models to avoid overfitting and enhance accuracy, with the performance of the baseline and regularized models analyzed.
C3: different augmentation techniques applied to the databases used in our experimentation to create variation and increase the volume of the data; a set of the best data augmentation techniques is proposed and supported by the experimental results.

The rest of the paper is organized as follows: the related work is detailed in Section 2, and Section 3 presents the methodology, which includes the datasets used for the research and the five distinct CNN architectures used for handwritten character recognition. Techniques for image augmentation are covered in Section 4. The findings of the experiments are the subject of Section 5, and the conclusion is given in Section 6.

2. Related Work

Odia (previously Oriya) is a popular language in India, recognized by the constitution and the official language of the state of Odisha. Handwritten character recognition (HCR), online or offline, postal-address interpretation, writer recognition, signature verification, real-time handwriting recognition, bank cheque processing, and note preparation are only a few of the ongoing study fields where deep learning produces better accuracy. Several studies have been conducted in the domain of optical character recognition for several languages [6, 7], but progress in the Odia language has been limited.

The authors of [8] analyze various approaches for handwritten character recognition using a standard handwritten digit recognition test, and convolutional neural networks (CNNs) have been found to outperform all other techniques when dealing with the variability of 2-D shapes. The authors of [9] classified printed Odia characters from the ISI Kolkata dataset and obtained an accuracy of 96.3%. The preprocessing techniques used by the authors were skew detection and correction, followed by line, word, and character segmentation. Stroke- and run-number-based features, along with features obtained from the concept of a water reservoir, were used, and a decision tree classifier was used for the classification task. In [10], binarization, skeletonization by chain coding, noise removal, and segmentation were the preprocessing techniques used on the Odia Digit Database, NIT Rourkela, and the authors obtained an accuracy of 96.08% using a finite automata classifier, whereas in [11] binary external symmetry axis constellation (BESAC) features were used on the IITBBS Odia character database of 7800 data samples; the random forest classifier achieved an accuracy of 89.92%, the SVM classifier 93.77%, and the k-nearest-neighbor classifier 95.01%. An ensemble approach to feature selection and classification of Odia characters was proposed in [12]. Husnain et al. [13–17] contributed work on the identification of Odia alphabets and digits. Researchers have used neural networks and other deep learning approaches to contribute to the field of character classification, as documented in [13, 18–20]. In [21], the authors contributed work on image augmentation based on generative adversarial networks (GANs) using an ISI Kolkata handwritten dataset of Latin, Bangla, Devanagari, and Oriya languages. A GAN is a method for generating artificial sample images for a database that does not require prior knowledge of the probable differences between samples; this approach obtained an accuracy of 97.31% on the Oriya (Odia) character set.

Similar to Odia HCR, if we investigate HCR for other regional Indian languages such as Bengali, Devanagari, or Telugu, most of the works involve a machine learning approach with handcrafted feature extraction followed by classification [22]. In [22], the authors propose a feature extraction technique to classify Bangla compound characters: a feature vector of length 180 is constructed from the longest run feature (LRF), the histogram of oriented gradients (HOG) feature, and the diagonal feature. The extracted features were used to train an SVM classifier, which achieved 88.73% accuracy. The authors of [23] proposed a method for digit recognition called “Celled Projection” that partitions the image and computes the projection of each section; the k-NN classifier achieved an accuracy of 94.1%.
For automatic feature extraction as well as human-like accuracy, researchers are now inclined toward neural network architectures [24, 25]. The authors of [26] expanded the image samples of the BanglaLekha-Isolated character dataset and tested their work on a CNN model, achieving 91.81% accuracy on the alphabets of the base dataset and 95.25% accuracy after expanding the dataset to 200,000 images using data augmentation techniques such as rotation, zoom, shear, position shifting, etc.

2.1. Applications of Handwritten Character Recognition

Handwritten character recognition is one of the major applications of visual document analysis: sorting or reading PIN/ZIP codes from postal letters, reading bank check amounts, extracting data from application forms, OCR for blind people, playing a vital role in digital libraries by converting the textual information present in an image into digital formats, helping to preserve historical documents, and many more. The list below includes some real character recognition system models [27–31].

Google’s neural machine translation (NMT) is an end-to-end learning approach for automated translation. NMT systems are well known for requiring a high computational cost for both training and translation inference, and a number of authors have noted that they are not robust enough, especially when input phrases contain rare words. In comparison to Google’s phrase-based production system, Google’s neural machine translation (GNMT) system reduces translation errors by an average of 60%. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive, state-of-the-art results, and its accuracy outperforms all previously published results when measured using a human side-by-side comparison. The system uses a deep LSTM network with 8 encoder and 8 decoder layers, with attention connections from the decoder network to the encoder as well as residual connections. To better handle rare words, the words are divided into a small set of common subword units (known as “wordpieces”) for both input and output [27].

An open-source OCR engine called Tesseract was developed at Hewlett-Packard between 1984 and 1994. Tesseract was perhaps the first OCR engine able to handle white-on-black text so easily. Tesseract assumes that its input is a binary image with clearly defined polygonal text regions. At this stage, blobs are created simply by nesting outlines together. Text blobs are then examined for proportional or fixed-pitch text. Fixed-pitch text is chopped immediately into character cells, while words in proportional text are separated using both definite and fuzzy spaces. Each suitable word is passed to an adaptive classifier as training data [29].

In a real-world deployment, Deutsche Post AG employed a method for sorting letters by recognizing handwritten zip codes. A time-delay neural network (TDNN) classifier was used to identify hand-printed digits after the machine had read the destination address. A different classifier extracted the structure of each digit and compared it to a range of digits [30].

An OCR in Braille for blind people: this work discusses the fundamentals of an optical character recognizer (OCR) for the Braille code, the writing system used by blind people. The system was created with funding from the National Organization of Spanish Blind People. Even with an A4 scanner, the OCR can handle sheets larger than the typical A4 size [31].

3. Methodology

In this section, the proposed methodology for handwritten character and numeral recognition is provided. In particular, to make it simple and easy to clarify, this section is divided into two subsections: CNN architecture and datasets for Odia language.

3.1. CNN Architectures

The CNN algorithm is the most well-known and widely used in the field of deep learning. CNN has a distinct advantage over its predecessors in that it discovers important features without the requirement for human intervention. Computer vision, audio processing, and facial identification are just a few of the applications that CNNs have been used for. Similar to a traditional neural network, the structure of CNNs is also inspired by neurons in human and animal brains. This typical CNN, similar to a multilayer perceptron (MLP), includes numerous convolution layers preceding subsampling (pooling) layers, followed by fully connected (FC) layers.

For handwritten character recognition of the Odia language, we have implemented five different CNN models. The architecture of a deep learning model can be thought of as its layers, and different types of layers can be employed in the models; each layer has its own significance based on its characteristics. All of the CNN architectures we have implemented here have two convolutional layers followed by one hidden dense layer and an output layer. Feature extraction from images is done by the convolutional layers, the first layers of the CNN architecture. Because pixels are related mainly to their neighbouring pixels, convolution preserves the relationship between distinct regions of an image; it processes the image by sliding a smaller filter (kernel) over it, producing a reduced feature map.

In CNNs, pooling layers are frequently added after each convolution layer, reducing the spatial size of the feature maps; this is also a method for reducing overfitting. Pooling is performed by taking the maximum, average, or sum of the values within each pooling window. Max pooling is one of the most commonly used pooling operations, and we employ it after each convolution step in our work.

The neurons of a dense layer are all connected to the neurons in the layer before it. Dense layers are employed in handwritten character recognition to classify images based on the output of the convolutional layers. Each neuron computes a weighted sum of its inputs and passes it through a nonlinear function, an important part of a neural network’s architecture called an activation function. Commonly used activation functions include the sigmoid, tanh, step function, linear function, exponential linear unit, ReLU, and leaky ReLU. The rectified linear unit (ReLU) activation function produces the same output as the input if the input is positive and outputs zero otherwise, i.e., f(x) = max(0, x), as shown in equation (1). We picked ReLU as the default activation function in all five of our CNN models since it is easier to train and more often provides better results.

Batch normalization is used after each layer as it makes the architecture faster and more stable through normalization of the layers’ inputs by recentering and rescaling.

The architectural specifications of the five proposed CNN models applied to the Odia character dataset [32] are presented in Table 1. The number of features in the convolutional layers and the number of units in the dense layer change from one architecture to the next. For all the models, the shape of the input layer is 28 × 28 × 1, and the final layer is the output layer. Bayesian optimization [33] is used to find the optimal values for the number of features, the number of units, and the learning rate. Categorical cross-entropy has been chosen as the loss function, and the Adam optimizer is used in all the models. Categorical cross-entropy is used as a loss function in multiclass classification tasks; it applies when multiple categories are present and the model has to choose exactly one of them. Adam is a widely used optimization technique that iteratively adjusts network weights based on training data; it is very efficient and consumes little memory when dealing with models that have many parameters or large amounts of data. Layers 1–6 are mainly used for extracting features from the input image, layer 7 flattens the feature maps, and layers 8, 9, and the output layer classify the input image based on the features extracted by the previous layers.
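As an illustration, a minimal Keras sketch of one such architecture is given below. The filter counts, dense-layer units, dropout rate, and L2 strength shown here are placeholder assumptions, not the values of Table 1, which were obtained via Bayesian optimization for each of the models M1–M5; num_classes is 47 for the NITROHCS character set and 10 for the ISI numeral set.

```python
# Minimal sketch of a regularized two-convolution-layer model (placeholder
# filter/unit counts and regularization strengths; not the exact Table 1 values).
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_model(num_classes=47, l2=1e-4):
    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, 3, padding="same", activation="relu",
                      kernel_regularizer=regularizers.l2(l2)),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.SpatialDropout2D(0.2),
        layers.Conv2D(64, 3, padding="same", activation="relu",
                      kernel_regularizer=regularizers.l2(l2)),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.SpatialDropout2D(0.2),
        layers.Flatten(),                       # layer 7: flatten
        layers.Dense(128, activation="relu",    # hidden dense layer
                     kernel_regularizer=regularizers.l2(l2)),
        layers.BatchNormalization(),
        layers.Dense(num_classes, activation="softmax"),  # output layer
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```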

3.2. Datasets of Odia Language

Odia, an Indian language, is mostly spoken in the state of Odisha in India (formerly known as Orissa). Native speakers make up about 82% of the population of Odisha, while Odia is also used in parts of Indian states such as Chhattisgarh, Jharkhand, and West Bengal. Due to the roundish structure of Odia letters and the fact that handwriting styles differ from person to person, it is a challenging task for researchers to achieve human-like classification accuracy. To design a machine learning model, a standard dataset is needed to validate the algorithm. For our research work, we have used the Odia character dataset [32], prepared at NIT Rourkela (NITROHCS v1.0), and the Odia numeral database [34], prepared at ISI, Kolkata. These databases are popular and mostly used as benchmark databases by the research community interested in handwritten digit or character recognition experiments for the Odia language.

3.2.1. NITROHCS v1.0 Database of Handwritten Oriya Characters

This database contains 47 classes of handwritten characters with 320 images in each class, i.e., 15,040 samples in total. The samples were collected from a total of 160 people from different age groups, and each person contributed samples twice, at different times. Sample characters from the character database are shown in Figure 1.

3.2.2. ISI Image Database of Handwritten Oriya Numerals

This database of handwritten Odia numerals consists of 10 classes and contains 5,970 sample images collected from 356 people. A total of 105 mail pieces and 166 job application forms were used to create the database. The dataset is already divided into a training set with 4,970 samples and a test set with 1,000 samples. Sample characters from the numeral dataset are shown in Figure 2.

4. Data Augmentation

Data augmentation refers to techniques for increasing the quantity of available data by including additional, minimally modified copies of existing data or by generating new artificial data from existing data. Deep learning models, which can learn characteristics with multiple layers of abstraction from data, have recently changed the state of the art in many fields. Training high-dimensional deep learning models like CNNs requires large amounts of data [35]; because so many parameters must be learned, these models are prone to overfitting. Larger datasets act as regularizers and yield stronger models, but collecting and manually labelling handwritten images can be a time-consuming and expensive process. As a result, users frequently need to use artificial data augmentation when working with datasets containing fewer images. In this work, we employ convolutional neural networks that perform at the state of the art and investigate the advantages of adding augmented image samples, produced by nonlinearly transforming handwritten images, to the training set. The data augmentation method applies random transformations such as rotation and translation to the initial training data in order to create new observations from the existing ones. Image augmentation is a common practice in medical imaging procedures, including the processing of magnetic resonance images (MRI), X-ray computed tomography (CT), and positron emission tomography (PET) [36–38].

All the samples in both datasets have been augmented for each of the augmentation techniques, and the same split has been used each time as it is used in the case of a normal dataset. The enhanced database size after applying various augmentation strategies is shown in Table 2.

4.1. Affine Transformations

Applying mathematical computations to each point, line, and plane of an object to create a new one is known as an “affine transformation”; as a result, collinearity between points is preserved [39]. The set of operations providing such linear transformations includes translation, rotation, and scaling, and these affine transformations can be applied to an image to expand the dataset. We consider a 2-D image and a point (x, y); the affine-transformed point (x′, y′) is then given by x′ = a1·x + a2·y + a3 and y′ = a4·x + a5·y + a6, where a1, …, a6 are scalar values.

4.1.1. Translation

Translation moves the image along either the X or Y direction (or both) without changing its shape or angle. (x′, y′) is the transformed point of (x, y), with x′ = x + tx and y′ = y + ty, as given in equations (2) and (3). Figure 3 shows the translation process and some sample translated images from the ISI image database. The parameter values tx and ty decide the direction of translation.

We assume that the images have a white background beyond their boundary, so they can be translated without artifacts. Such a technique is quite helpful, as most objects can be located anywhere in the image, and it ensures that the convolutional neural network looks everywhere in the image. We have restricted translation to small values because larger translations remove significant portions of the characters from the image, which proves detrimental to the performance of the CNN architectures. For a parameter t, each image in the datasets is translated by −t, 0, and t pixels in the X direction and by −t, 0, and t pixels in the Y direction, which increases the size of the dataset by a factor of nine. The translation process and sample translated images are shown in Figures 3(a) and 3(b). From Table 3, it is clear that the models achieve better performance on the translated dataset than on the original dataset.
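As a concrete sketch of this ninefold expansion (assuming a SciPy-based implementation and a white, 255-valued background; not the authors’ exact code):

```python
# Translate each image by -t, 0 and t pixels along X and Y, filling the
# exposed border with white (255); the dataset grows by a factor of nine.
import numpy as np
from scipy.ndimage import shift

def translate_augment(images, t=2):
    """images: array of shape (N, 28, 28); returns a 9x larger array."""
    out = []
    for img in images:
        for dy in (-t, 0, t):
            for dx in (-t, 0, t):
                out.append(shift(img, (dy, dx), cval=255, order=0))
    return np.stack(out)
```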

4.1.2. Rotation

Rotation involves turning an image about its centre in a clockwise or anticlockwise direction by some randomized number of degrees. (x′, y′) is the transformed point of (x, y) after rotation by an angle θ, and the values of x′ and y′ are given by the following equations: x′ = x·cos θ − y·sin θ and y′ = x·sin θ + y·cos θ.

Naturally, when a character is located and extracted from a whole image, the result may be slightly rotated. To make the CNNs robust to such changes, we rotate the images in the dataset by small angles. For a parameter r, all the images are rotated by −r, 0, and r degrees, which increases the size of the dataset by a factor of three; sample images are shown in Figures 4(a) and 4(b). We rotated the images for r ranging from 1 to 10 degrees, and for r = 2, 5, and 9 the maximum validation accuracy is achieved, as displayed in Table 4.
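A minimal sketch of this threefold rotation expansion (assumed parameter names and SciPy usage, not the authors’ exact code):

```python
# Rotate each image by -r, 0 and r degrees about its centre, keeping the
# original size and filling exposed corners with white (255).
import numpy as np
from scipy.ndimage import rotate

def rotate_augment(images, r=5):
    out = []
    for img in images:
        for angle in (-r, 0, r):
            out.append(rotate(img, angle, reshape=False, cval=255, order=1))
    return np.stack(out)
```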

4.1.3. Scaling

Scaling involves stretching, compressing, or resizing the original image. The scaled point (x′, y′) of the original point (x, y) is given by x′ = sx·x and y′ = sy·y, where sx and sy are the scaling factors. The scaling process and sample images after scaling are shown in Figure 5.

Here, we scale down the original image but add extra white pixels around it to keep the dimensions of the resultant image unchanged. For a parameter s, which describes the amount of reduction, we reduce the number of rows and the number of columns by s, as shown in Figures 5(a) and 5(b). The performance of the scaling operation is shown in Table 5.
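A rough sketch of this shrink-and-pad scaling (the Pillow-based resize and the default value of s are illustrative assumptions):

```python
# Shrink the character by s pixels per side and pad with white so the
# output keeps the original 28x28 dimensions.
import numpy as np
from PIL import Image

def scale_augment(img, s=4):
    """img: 28x28 uint8 array with a white background."""
    h, w = img.shape
    small = np.array(Image.fromarray(img).resize((w - s, h - s)))
    out = np.full((h, w), 255, dtype=np.uint8)   # white canvas
    top, left = s // 2, s // 2
    out[top:top + h - s, left:left + w - s] = small
    return out
```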

4.2. Elastic Deformation

Elastic transformation was first introduced in [3]. That work puts forward that the data distribution is invariant not only with respect to affine transformations but also with respect to elastic deformations, which result from the uncontrolled oscillations of the hand dampened by inertia, and it showed that elastic transformation improved the performance of CNNs on the MNIST dataset. We postulate that the same is true for the NITROHCS Odia character dataset and the ISI Kolkata Odia numeral dataset, and Figure 6 shows some example images of elastic deformation. From Table 6, it is clear that elastic deformation gives a considerable improvement in performance.
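A rough sketch of such a deformation in the style of Simard et al. [3] is given below; the displacement strength alpha and smoothing sigma are illustrative assumptions, not the values used in our experiments.

```python
# Elastic deformation: random displacement fields are smoothed with a
# Gaussian filter and used to remap the pixel coordinates.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(img, alpha=8, sigma=3, rng=np.random.default_rng()):
    h, w = img.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([y + dy, x + dx])
    return map_coordinates(img, coords, order=1, mode="constant", cval=255)
```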

4.3. Gaussian Noise

Gaussian noise is statistical noise whose probability density function (PDF) is that of the normal distribution. The generated noise is added to the image, which disturbs the gray values present in the digital image. The PDF, or normalized histogram, of a Gaussian random gray variable x is p(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)), where σ is the standard deviation and μ is the mean. This increases the size of the dataset by a factor of 2, and sample images after applying Gaussian noise are shown in Figure 7. From Table 7, it is observed that this transformation gives a slight improvement in performance when the sigma value is varied.
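A minimal sketch of this augmentation (the default sigma value is only an illustrative assumption):

```python
# Add zero-mean Gaussian noise to each pixel and clip the result to [0, 255].
import numpy as np

def add_gaussian_noise(img, sigma=10, rng=np.random.default_rng()):
    noisy = img.astype(np.float32) + rng.normal(0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```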

4.4. Color Inversion

Color inversion inverts the color of each pixel. For example, a black character on a white background changes into a white character on a black background. This doubles the size of the dataset, and sample images are shown in Figure 8. From Table 8, it is clear that color inversion gives a considerable improvement in performance.
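For 8-bit grayscale images, the inversion is simply:

```python
# Invert an 8-bit grayscale image: black-on-white becomes white-on-black.
def invert(img):
    return 255 - img
```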

5. Results and Discussion

This section contains a number of simulation results that characterize the performance of the proposed character and numeral recognition algorithms in various benchmark datasets. We first compare the baseline model with the regularized CNN model. Furthermore, we compare the effects of different augmentation techniques on the performance of the proposed five different CNN models for character and numeral recognition, and we compare the proposed method with the state-of-the-art recognition methods.

5.1. Baseline vs. Regularized Model

All experiments were carried out on Google Colab in a GPU environment, and the experimental results reported in this section are for the five designed baseline models M1, M2, M3, M4, and M5. The NITROHCS v1.0 character database does not have separate training and testing examples; hence, a 70%–30% split was used to obtain training and testing examples, whereas the ISI image numeral database is already split into training and testing sets, and that split has been used without any changes. It is observed from Figure 9 that the baseline model achieves a training accuracy of 100% after 10 epochs, but its validation accuracy stagnates around 97%; this applies to all five models on both the character and numeral datasets and is a sign that the models are overfitting the data. To avoid overfitting, L2 regularization and spatial dropout were added to the models [40]. It can be observed that in the regularized models, the gap between training accuracy and validation accuracy is reduced in all the models. From Table 9, it is clear that the maximum validation accuracy achieved by the models has increased after the application of regularization.

5.2. Effect of Data Augmentation on Performance

In our experiment, we applied different data augmentation techniques such as translation, rotation, and scaling. The maximum validation accuracy on the two character and numeral datasets is compared for the five different models.

Table 2 shows the enhanced database size after applying various augmentation strategies. Tables 10 and 11 compare the performance of various available handwritten character recognition techniques for the Odia language on the NITROHCS v1.0 character and ISI image numeral datasets.

Data augmentation techniques to expand a dataset are popular in daily-life applications such as face, speech, or text recognition and classification for different languages, and they also play an important role in the medical imaging field. Unfortunately, very few contributions applying data augmentation techniques to Odia handwritten character recognition were found. The performance comparison of data augmentation techniques on different datasets is shown in Table 12.

A comparison among the recognition or classification accuracy of handwritten characters and numerals belonging to different Indian languages is shown in Table 13.

6. Conclusions

In this work, five variants of 2-layer CNNs are used for handwritten character recognition of Odia characters and numerals. The accuracy of the five different baseline as well as regularized models is reported. After testing the effectiveness of various data augmentation techniques on the Odia characters using the standard character and numeral datasets and providing the augmented datasets as input to the five CNN architectures, we conclude that when the original dataset is color inverted or Gaussian noise is applied to it, the models produce better accuracy, i.e., 98.91%, than on the normal dataset. Other techniques, such as translation and rotation, also showed slight improvements in accuracy.

Data Availability

The OHCSv1.0 data used to support the findings of this study have been deposited in the NIT, Rourkela, India, repository (DOI: 10.1109/NCVPRIPG.2015.7490020). The ISI image Odia numerals data used to support the findings of this study have been deposited in the ISI Kolkata, India repository (DOI: 10.1109/ICDAR.2005.84).

Conflicts of Interest

The authors declare that they have no conflicts of interest.