Because handwritten text is heterogeneous, it is much more difficult to recognise automatically than printed text (two orders of magnitude of difference in error rate, see Table 3 vs Table 5 in [1]). Recent successes in handwriting recognition can be attributed to developments in deep neural networks. However, due to large computational costs, these systems are usually limited to recognising characters, words, and lines. We propose a full page offline handwriting recognition framework that is less computationally expensive than existing frameworks.
A. Text localisation
Text localisation is an essential component of document layout analysis, and accurate text localisation is crucial for handwriting recognition [2]. Traditional techniques are dominated by handcrafted features that utilise blob detection, clustering, edge detection, and histogram projections. More recently, data-driven techniques have become more prominent with the growth of neural networks. Such techniques can be categorised by how the position of the text is defined: as lines [3], bounding boxes [4], or areas containing “text pixels” [2].
In this paper, we predict bounding boxes around the text using deep learning techniques of object detection. Given an image that contains multiple objects, object detection identifies bounding boxes that encompass the objects along with the confidence of the class of the object. In this work, the Single Shot MultiBox Detector (SSD) [5] framework was applied to text localisation.
B. Text recognition
Handwritten digit recognition with the MNIST dataset was among the first work in deep learning [6]. However, the learning problem was limited to images of single digits. Significant advances in handwritten text recognition were realised with the multidimensional recurrent neural network (MD-RNN) described by Graves et al. [7] and the Connectionist Temporal Classification (CTC) loss. A number of advances based on the MD-RNN have been reported, including attribute embeddings [8], dropout [9], and Tucker decomposition [10]. Recent work by Puigcerver [11] suggests that the multidimensional aspects of the MD-RNN can be replaced by feeding image features (from a CNN) into a one-dimensional LSTM, significantly reducing the memory requirements of the system. The methods described above are limited to either single words or single lines of handwritten text. Bluche et al. [12], [13] described an end-to-end system that uses an MD-RNN along with an LSTM to encode multiple lines of text. Although this system shows promise for automatically recognising multiple lines, it may not be a practical solution as it requires a large amount of computational power [11]. Wigington et al. [14] utilised a region proposal network to find the starting positions of text lines, and a line follower network was trained to trace each line of text. A CNN-LSTM approach was then used to recognise the characters.
C. Approach overview
Previous works showed that the MD-RNN requires a large amount of computational power when it is used to recognise multiple lines of handwritten text. A less computationally expensive alternative could be realised if recognition were not performed directly on multiple lines of handwriting. To achieve this, our framework comprises two major components: text localisation and text recognition. Text localisation identifies the positions of handwritten text given an image of the full page. Once a passage of handwritten text is identified, segmentation is conducted to locate each line of text. Text recognition refers to converting an image of a line of handwritten text into a string with the corresponding characters and denoising the string with a language model. By limiting handwriting recognition to single lines, the computational costs associated with this framework can be dramatically reduced compared to previous works that utilise the MD-RNN.
Fig. 1. Overview of the system described.
Rather than designing an end-to-end network, we took a modular approach consistent with the components described in the literature. This allows components of the framework to be easily replaced and tested against alternatives. An overview of our system is provided in Figure 1.
A. Text localisation
The purpose of text localisation is to identify bounding boxes of each line of text given an image containing both printed and handwritten text. The text localisation procedure consists of two stages: passage identification and line segmentation.
1) Passage identification: The goal of passage identification is to predict the location of the handwritten passage (a bounding box given by its x, y coordinates, width, and height, expressed as percentages of the page size). To simplify this step, we assume that there is one passage of printed text and one passage of handwritten text (as in the IAM dataset [15]; see Section IV-A for more details). Image features were extracted with a truncated 34-layer residual network (ResNet34) [16] pre-trained on ImageNet. In the ResNet34, the weights of the first convolutional layer were averaged into one channel to accommodate greyscale images. The features were then fed into three fully connected layers: two layers with 64 units and a ReLU activation, and one layer with 4 units and a sigmoid activation. The four sigmoid units correspond to the x, y coordinates, width, and height of the bounding box in percentages. The network was trained to minimise the mean squared error.
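The output convention of the passage identification head can be illustrated with a minimal sketch. It is not the trained network: the raw output values, the page size, and the function names below are hypothetical, and only the sigmoid squashing, the percentage-based box, and the mean squared error follow the description above.

```python
import math


def sigmoid(value):
    """Squash a raw network output into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def to_pixel_box(box_fraction, page_width, page_height):
    """Convert an (x, y, w, h) box given as fractions of the page into pixels."""
    x, y, w, h = box_fraction
    return (x * page_width, y * page_height, w * page_width, h * page_height)


def mean_squared_error(predicted, target):
    """Mean squared error between two equal-length coordinate tuples."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Hypothetical raw outputs of the last fully connected layer (before sigmoid).
raw_outputs = (-2.0, -1.0, 1.5, 0.0)
predicted_box = tuple(sigmoid(v) for v in raw_outputs)

# Hypothetical ground-truth box, also expressed as fractions of the page size.
target_box = (0.10, 0.25, 0.80, 0.55)

loss = mean_squared_error(predicted_box, target_box)
pixel_box = to_pixel_box(predicted_box, page_width=2000, page_height=3000)
```

Expressing the box as fractions keeps the regression target independent of the scan resolution; the conversion to pixels is only needed when cropping the passage.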
2) Line segmentation: Given an image containing only handwritten text, this component predicts bounding boxes surrounding each line of text. We modelled this as an object detection problem to detect words, followed by a clustering algorithm to combine words into lines. A two-stage approach was taken because early experiments showed that the network was prone to missing objects when identifying handwritten text. By detecting individual words, the network was less likely to miss an entire line.
In our implementation, the SSD architecture [5] was used to predict bounding boxes relative to anchor points and the probability that each bounding box encompasses a word. The downsampler consists of two convolutional layers, a batch normalisation layer, and a ReLU activation function. The class and bounding box predictor consists of a single convolutional layer with 6 output channels (4 positional + 2 for classes). To adapt the SSD to our requirements, image features were extracted with a network similar to the one described in Section III-A1 (ResNet34). Furthermore, the anchor boxes were adapted to resemble words (only squares and rectangles with width > height). The SSD was trained to minimise the cross-entropy loss for the class (handwriting or not handwriting) and the L1 loss for the bounding box. Non-maximum suppression was performed to filter out overlapping bounding boxes.
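The non-maximum suppression step can be sketched in a few lines. This is a generic greedy formulation, not the code used in our experiments; the corner-based box format (x_min, y_min, x_max, y_max) and the IoU threshold of 0.5 are assumptions made for illustration.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    overlap_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    overlap_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = overlap_w * overlap_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest score first.

    A box is kept only if it does not overlap an already kept,
    higher-scoring box by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for index in order:
        if all(intersection_over_union(boxes[index], boxes[k]) <= iou_threshold
               for k in kept):
            kept.append(index)
    return kept


# Two heavily overlapping detections of the same word and one separate word.
word_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 0, 30, 10)]
word_scores = [0.9, 0.8, 0.7]
kept_indices = non_maximum_suppression(word_boxes, word_scores)
```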
After the bounding boxes of words were detected, a greedy algorithm was used to cluster the words into line proposals based on their overlap in the y-direction (see Algorithm 1).
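As an illustration of this kind of clustering, the sketch below greedily merges word boxes into line boxes whenever they overlap sufficiently in the y-direction. It should not be read as a reproduction of Algorithm 1: the overlap measure (overlap divided by the smaller height), the threshold `min_overlap`, and the (x, y, w, h) box format are assumptions.

```python
def vertical_overlap_ratio(box_a, box_b):
    """Overlap of two (x, y, w, h) boxes in y, relative to the smaller height."""
    top = max(box_a[1], box_b[1])
    bottom = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    overlap = max(0.0, bottom - top)
    return overlap / min(box_a[3], box_b[3])


def merge_boxes(box_a, box_b):
    """Smallest (x, y, w, h) box that encloses both input boxes."""
    x_min = min(box_a[0], box_b[0])
    y_min = min(box_a[1], box_b[1])
    x_max = max(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y_max = max(box_a[1] + box_a[3], box_b[1] + box_b[3])
    return (x_min, y_min, x_max - x_min, y_max - y_min)


def cluster_words_into_lines(word_boxes, min_overlap=0.5):
    """Greedily assign each word to the first line it overlaps in y."""
    lines = []
    # Visit words from top to bottom, then left to right.
    for word in sorted(word_boxes, key=lambda b: (b[1], b[0])):
        for index, line in enumerate(lines):
            if vertical_overlap_ratio(line, word) >= min_overlap:
                lines[index] = merge_boxes(line, word)
                break
        else:
            # No existing line overlaps enough: start a new line proposal.
            lines.append(word)
    return lines


# Four hypothetical word boxes forming two lines of two words each.
words = [(10, 10, 40, 20), (60, 12, 50, 20), (10, 50, 30, 20), (50, 48, 40, 22)]
line_proposals = cluster_words_into_lines(words)
```

Because each line box grows as words are merged into it, a word that is missed by the detector only shortens its line rather than removing it, which is the motivation for detecting words first.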
The following heuristics were used to evaluate the line proposals:
• Lines must have a minimum area
• Lines that exceed boundaries of the page are removed
• Lines (excluding the last line) that are substantially shorter than the median width of the lines are removed
• Lines that are much taller than the median height are split into 2 lines (accounts for double lines)
• Lines with starting positions that significantly deviate from other lines are removed
• Lines that greatly overlap with other lines are removed
Lines that are not eliminated by the heuristics algorithm are used as the output of the text localisation stage.
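Three of the heuristics listed above (minimum area, page boundaries, and short lines relative to the median width) can be sketched as a simple filter. The thresholds `min_area` and `min_width_ratio` are hypothetical values chosen for illustration, and boxes are assumed to be (x, y, w, h) fractions of the page, ordered from top to bottom.

```python
from statistics import median


def filter_line_proposals(lines, min_area=0.002, min_width_ratio=0.3):
    """Apply three illustrative heuristics to line proposals.

    lines: list of (x, y, w, h) boxes as fractions of the page size.
    """
    # Heuristic: lines must have a minimum area.
    lines = [box for box in lines if box[2] * box[3] >= min_area]

    # Heuristic: lines that exceed the boundaries of the page are removed.
    lines = [box for box in lines
             if box[0] >= 0 and box[1] >= 0
             and box[0] + box[2] <= 1 and box[1] + box[3] <= 1]
    if not lines:
        return []

    # Heuristic: lines (excluding the last line) that are substantially
    # shorter than the median width are removed.
    median_width = median(box[2] for box in lines)
    last_index = len(lines) - 1
    return [box for index, box in enumerate(lines)
            if index == last_index or box[2] >= min_width_ratio * median_width]


proposals = [
    (0.1, 0.10, 0.80, 0.05),
    (0.1, 0.17, 0.78, 0.05),
    (0.1, 0.24, 0.10, 0.05),  # much shorter than the median width
    (0.1, 0.31, 0.82, 0.05),
    (0.5, 0.38, 0.60, 0.05),  # extends past the right edge of the page
    (0.1, 0.45, 0.01, 0.01),  # area too small
    (0.1, 0.52, 0.20, 0.05),  # short, but kept because it is the last line
]
accepted_lines = filter_line_proposals(proposals)
```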
B. Text recognition
Text recognition takes images containing single lines of handwritten text and recognises the corresponding characters. Our approach performs handwriting recognition and then denoises the output with a language model.
1) Handwriting recognition: Following a similar scheme to [11], we implemented a CNN-biLSTM network. It makes use of a multi-scale CNN for image feature extraction, and the features are then fed into a bidirectional LSTM. The network was trained to optimise the CTC loss (shown in Figure 2). Intuitively, the CNN generates image features that are spatially aligned to the input image. The image features are then sliced along the direction of the text to generate a fixed number of “timesteps” and sequentially fed into an LSTM.
The CNN used to generate image features was identical to the residual network described in Section III-A1 (ResNet34, Figure 2-a). In order to account for varying sizes of the input image (e.g., lines that contain only one word compared to lines that contain seven words), multiple downsamples of the image features are provided (Figure 2-b, identical to the downsampler in the SSD used in Section III-A2). The image features and downsampled image features were each fed into separate biLSTMs. The outputs of the biLSTMs were concatenated along the time dimension and decoded into an N × M array, where N is the maximum length of the sequence and M is the number of unique characters (Figure 2-c). This array is fed into the language model denoiser.
2) Language model denoiser: The N × M output of the CNN-biLSTM needs to be transformed into the output string. As the output contains probabilities corresponding to each character of the sequence, a naive (greedy) solution is to take the maximum probability (argmax) of each of the N slices and collapse the characters using the CTC collapsing function. However, the greedy solution considers only a single decoding path. Inspired by [7], a beam search approach can alleviate this issue by combining multiple decoding paths to generate candidate strings. A language model can be included in the beam search decoding to weigh the proposals based on their likelihood. The beam search approach required substantially more computational power: our early experiments revealed an increase in computational time compared to the greedy solution.
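The greedy solution can be made concrete with a short sketch: take the argmax of each of the N slices, merge consecutive repeats, and drop the blank symbol. The three-symbol alphabet, the position of the blank at index 0, and the probability values are assumptions made for this example only.

```python
def greedy_ctc_decode(probabilities, alphabet, blank_index=0):
    """Greedy CTC decoding of an N x M array of per-timestep probabilities.

    probabilities: list of N rows, each holding M class probabilities.
    alphabet: list of M symbols; the entry at blank_index is never emitted.
    """
    # Step 1: argmax over the M classes for each of the N timesteps.
    best_path = [max(range(len(row)), key=row.__getitem__)
                 for row in probabilities]

    # Step 2: CTC collapsing, i.e. merge repeated indices, then drop blanks.
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank_index:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Hypothetical output with N = 7 timesteps and M = 3 classes (blank, a, b).
example_alphabet = ["", "a", "b"]
example_probabilities = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, merged)
    [0.90, 0.05, 0.05],  # blank (separates the two a's)
    [0.10, 0.60, 0.30],  # a
    [0.10, 0.20, 0.70],  # b
    [0.20, 0.10, 0.70],  # b (repeat, merged)
    [0.80, 0.10, 0.10],  # blank
]
decoded_text = greedy_ctc_decode(example_probabilities, example_alphabet)
```

Note that the blank between the first two groups of 'a' is what allows a doubled character to survive the collapsing step.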
Fig. 2. Handwriting recognition CNN-biLSTM.
In this paper, a language denoiser network was developed. Given a noisy input string, the network denoises the string in a sequence-to-sequence configuration. A previous approach [17] encodes the noisy input at the character level and decodes the clean output at the word level to ensure that the output is only composed of in-vocabulary words. This has proved relatively effective but fails for out-of-vocabulary words such as names and places. To circumvent this issue, a character-to-character encoding/decoding scheme based on the Transformer architecture [18] was used. The denoiser was trained on sentences from an external database of public domain novels [19]. Characters are randomly inserted into and deleted from the sentences (with a uniform distribution). In addition, characters are replaced with visually similar counterparts (e.g., ‘d’ can be replaced with ‘c’ and ‘l’) in an attempt to model the real noise distribution of the handwriting recognition model. Each generated noisy sentence is used to predict its original counterpart. During inference, the output of the trained denoiser is fed into a beam search algorithm to generate candidate strings. We make use of the following heuristics to rank them:
1) Pick the candidate strings with the highest proportion of
2) Pick the candidate strings with the lowest Levenshtein distance.
3) Pick the candidate string with the lowest perplexity score using an off-the-shelf pre-trained language model.
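Returning to the training data of the denoiser, the corruption of clean sentences (random insertions, random deletions, and substitutions with visually similar characters) can be sketched as follows. The probabilities are hypothetical, and the substitution table is illustrative: only the entry for ‘d’ is taken from the description above, and the other entries are assumptions.

```python
import random
import string

# Only the 'd' entry comes from the text; the other entries are hypothetical.
VISUALLY_SIMILAR = {
    "d": ["c", "l"],
    "n": ["r"],
    "b": ["d"],
}


def add_noise(sentence, rng, insert_prob=0.05, delete_prob=0.05,
              substitute_prob=0.05, alphabet=string.ascii_lowercase):
    """Return a corrupted copy of sentence for training a denoiser."""
    noisy = []
    for character in sentence:
        roll = rng.random()
        if roll < delete_prob:
            # Deletion: skip this character entirely.
            continue
        if (roll < delete_prob + substitute_prob
                and character in VISUALLY_SIMILAR):
            # Substitution with a visually similar character.
            noisy.append(rng.choice(VISUALLY_SIMILAR[character]))
        else:
            noisy.append(character)
        if rng.random() < insert_prob:
            # Insertion of a uniformly sampled character after this one.
            noisy.append(rng.choice(alphabet))
    return "".join(noisy)


# A seeded generator makes the corruption reproducible.
clean_sentence = "he declared that the house was deserted"
noisy_sentence = add_noise(clean_sentence, random.Random(0))
training_pair = (noisy_sentence, clean_sentence)  # (input, target)
```

Each training pair uses the noisy string as the input and the clean sentence as the target, matching the sequence-to-sequence configuration described above.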
A. Evaluation
The system was evaluated with the IAM dataset [15]. The IAM dataset contains 1539 pages of scanned documents. Each scanned document contains printed text and 657 writers were asked to write the contents of the printed text in the space provided. The dataset was split into train and test data, where the test dataset includes validation 1, validation 2, and test data designated by the authors of the dataset.
We evaluated the system both qualitatively, by visually inspecting the transcription of examples, and quantitatively, by computing the character error rate (CER). Furthermore, we conducted a comparative analysis of memory and timing.
The CER was calculated with SCLITE [20] and the effects of the following components were evaluated: 1) no line heuristics, 2) no language model (argmax algorithm), 3) with beam search [7], and 4) with the denoiser described in Section III-B2. The predicted and actual text were aligned and the average CER was calculated line-by-line. Our method was compared to similar methods presented in [12]–[14].
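For readers unfamiliar with the metric, the line-by-line average CER can be sketched as the character-level edit distance divided by the reference length, averaged over lines. This is an illustrative stand-in, not the SCLITE implementation; reporting the CER as a percentage is an assumption of this sketch.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of insertions, deletions and substitutions."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # substitution/match
            ))
        previous_row = current_row
    return previous_row[-1]


def character_error_rate(reference, hypothesis):
    """CER of one line, in percent of the reference length."""
    if not reference:
        raise ValueError("reference line must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


def mean_line_cer(references, hypotheses):
    """Average the CER line-by-line over aligned reference/predicted lines."""
    if len(references) != len(hypotheses):
        raise ValueError("lines must be aligned one-to-one")
    rates = [character_error_rate(ref, hyp)
             for ref, hyp in zip(references, hypotheses)]
    return sum(rates) / len(rates)


average_cer = mean_line_cer(["abcd", "abcd"], ["abcd", "abce"])
```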
B. Training details
The networks were developed with Apache’s MXNet deep learning framework [21]. The networks for each component (passage identification, line segmentation, handwriting recognition, and language model denoising) were trained separately, and the Adam optimiser [22] was used for all the networks. Data augmentation including random translation, shearing, and occlusions was performed. However, many typical data augmentation techniques are not applicable to this application (e.g., flipping and random cropping). To circumvent this issue in the word/line object recognition component, lines or words were randomly blanked out. Details of the implementation can be seen here (https://github.com/awslabs/handwritten-text-recognition-for-apache-mxnet).
Figure 3 shows actual results of paragraph segmentation and word to line segmentation. We can observe that the paragraph segmentation algorithm mostly predicts the bounding boxes of the handwriting component successfully; however, the third column presents a failure case where the last line is not encompassed by the predicted bounding box. Given an image containing only handwritten text, the word detection algorithm can detect tight bounding boxes for each word. However, as mentioned in Section III-A2, there are several short words (typically words with fewer than 3 characters) that are not detected. Despite the missing words, we can observe in Figure 3-c that all the lines were successfully detected.
Figure 4 presents selected examples to show differences between the language model components of the described system. First, we can observe that the greedy algorithm ([AM]) performs reasonably well and the beam search ([BS]) algorithm does not dramatically improve the results. In a), we can see that the word “noused” was converted into “roused”, which may be based on the preceding word “head” and the visual similarity of ‘n’ and ‘r’. In b), the handwriting looks like “beclared” but the denoiser replaced ‘b’ with ‘d’ based on the learnt language model. In c), the ‘t’ in “desterted” was deleted, also based on language modelling. In d), none of the algorithms succeeded in correcting the sentence, and the denoiser worsened the CER.
The CER presented in Table I suggests that line heuristics dramatically improved handwriting recognition. Qualitative evaluation of the results suggests that the line heuristics algorithm eliminated incorrectly identified lines that caused large disparities when aligning the predicted and correct text. The denoiser achieved a 1.4 CER decrease compared to the greedy argmax algorithm and beam search algorithm. When compared to previous works on recognising cropped images (i.e., feeding a cropped image containing only the handwritten portion rather than the full page with printed and handwritten text, as indicated by Seg. in Table I), our method outperforms Bluche [13]. However, the methods described in Bluche [12] and Wigington [14] had a lower CER than our method.
TABLE I CER RESULTS
TABLE II MEMORY AND TIMING REQUIREMENTS
Table II presents the memory and timing requirements for our method compared to existing methods. When comparing the mean time taken to process an image, our method requires less time than [14] and [13]. Our method also utilises substantially less memory than [14] (unfortunately, the memory requirements for [12] and [13] could not be obtained). Since our memory usage is substantially smaller, it is possible to process multiple images at the same time, effectively reducing the time required by a third.
Fig. 3. Qualitative results: full page to line images. a) paragraph segmentation, b) word segmentation, c) word to line conversion. The pipeline of the images goes from top to bottom within each column.
Fig. 4. Handwriting recognition and language modelling. Four line images are displayed, and under each line image are the predicted strings ([AM]: greedy argmax algorithm with no language modelling, [BS]: beam search [7], [D]: our denoiser, Section III-B2).
In this paper, we presented a full page offline handwritten text recognition framework. This framework consists of a pipeline in which the handwritten text is localised (text localisation) and images of text lines are then converted into strings (text recognition). Our method achieved a CER of 8.50. The main advantage of the framework is its reduced computational cost compared to existing methods. In exchange for a higher CER than [14], the throughput could be effectively increased when using a similar amount of memory.
In conclusion, the framework that we presented is a computationally cheap alternative for performing full page offline handwritten text recognition. The results in this paper demonstrate the potential of this framework, and future work can investigate different components of the pipeline for improved results.
We thank Simon Corston-Oliver, Vishaal Kapoor, Sergey Sokolov, Soji Adeshina, Martin Klissarov, and Thom Lane for their helpful feedback on this project.
[1] M. Yousef, K. F. Hussain, and U. S. Mohammed, “Accurate, data-efficient, unconstrained text recognition with convolutional neural networks,” 2018.
[2] G. Renton, Y. Soullard, C. Chatelain, S. Adam, C. Kermorvant, and T. Paquet, “Fully convolutional network with dilated convolutions for handwritten text line segmentation,” International Journal on Document Analysis and Recognition (IJDAR), pp. 1–10, 2018.
[3] T. Grüning, R. Labahn, M. Diem, F. Kleber, and S. Fiel, “Read-bad: A new dataset and evaluation scheme for baseline detection in archival documents,” in 2018 13th IAPR International Workshop on Document Analysis Systems (DAS). IEEE, 2018, pp. 351–356.
[4] B. Moysset, C. Kermorvant, C. Wolf, and J. Louradour, “Paragraph text segmentation into lines with recurrent neural networks,” in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015, pp. 456–460.
[5] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
[6] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a backpropagation network,” in Advances in neural information processing systems, 1990, pp. 396–404.
[7] A. Graves and J. Schmidhuber, “Offline handwriting recognition with multidimensional recurrent neural networks,” in Advances in neural information processing systems, 2009, pp. 545–552.
[8] J. I. Toledo, S. Dey, A. Fornés, and J. Lladós, “Handwriting recognition by attribute embedding and recurrent neural networks,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 1038–1043.
[9] V. Pham, T. Bluche, C. Kermorvant, and J. Louradour, “Dropout improves recurrent neural networks for handwriting recognition,” in Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014, pp. 285–290.
[10] H. Ding, K. Chen, Y. Yuan, M. Cai, L. Sun, S. Liang, and Q. Huo, “A compact cnn-dblstm based character model for offline handwriting recognition with tucker decomposition,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 507–512.
[11] J. Puigcerver, “Are multidimensional recurrent layers really necessary for handwritten text recognition?” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 67–72.
[12] T. Bluche, “Joint line segmentation and transcription for end-to-end handwritten paragraph recognition,” in Advances in Neural Information Processing Systems, 2016, pp. 838–846.
[13] T. Bluche, J. Louradour, and R. Messina, “Scan, attend and read: End-to-end handwritten paragraph recognition with mdlstm attention,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 1050–1055.
[14] C. Wigington, C. Tensmeyer, B. Davis, W. Barrett, B. Price, and S. Cohen, “Start, follow, read: End-to-end full-page handwriting recognition,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 367–383.
[15] U.-V. Marti and H. Bunke, “The iam-database: an english sentence database for offline handwriting recognition,” International Journal on Document Analysis and Recognition, vol. 5, no. 1, pp. 39–46, 2002.
[16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[17] S. Ghosh and P. O. Kristensson, “Neural networks for text correction and completion in keyboard decoding,” arXiv preprint arXiv:1709.06429, 2017.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[19] “Public domain novels http://www.textfiles.com/etext/fiction/.”
[20] J. Fiscus, “Sclite scoring package version 1.5,” US National Institute of Standard Technology (NIST), URL http://www.itl.nist.gov/iaui/894.01/tools, 1998.
[21] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.
[22] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.