Multistage Curvilinear Coordinate Transform Based Document Image Dewarping using a Novel Quality Estimator

2020·arXiv

Abstract

1 INTRODUCTION

Capturing document images with handheld devices, such as cameras and mobile phones, often introduces warping in the acquired images, along with a plethora of other issues. Even if all the preprocessing steps, such as contrast enhancement and binarization, yield perfect results, the warped text in such document images still poses a severe challenge to standard OCR techniques, resulting in poor overall digitization of the documents.

Various algorithms exist for correcting a camera-captured document image containing perspective defects. But the results often introduce unnecessary skews in the text characters. Standard OCR techniques often rely heavily upon the assumption that the baselines of the adjacent text-lines are parallel as far as practicable and that the text characters have minimal skews in them. This necessitates the use of dewarping algorithms that can take care of individual character-level warps after performing an overall page-level dewarping.

There are several fast and effective techniques available for correctly dewarping linearly warped document images [1], [2], [3]. However, these methods are not suitable for dewarping non-linearly warped document images [4]. To address the inherent complexity of non-linearly warped document images, a global optimization based dewarping technique was proposed by Ezaki et al. [5]. There, nonlinear warps were corrected by minimizing an objective function that strives to convert the warped lines to parallel ones. With a similar objective, curled text-line detection based techniques were demonstrated in [6], [7]. These methods are, however, less robust than the boundary based model-fitting techniques which were introduced in [8], [9], [10], [11], [12], [13]. Gatos et al. [14] implemented a generalized approach that involved segmentation of the text-lines to determine the warps in the document images. This method performs better than the older ones in many use-cases. Generalizations that represent the warped text-lines as a texture delimited by two smooth curves at the top and bottom were described in [15], [16], [17], [18]. These methods involve a coarse but fast dewarping of the whole document image based on these two curves, followed by finer corrections based on word detection. However, this technique relies heavily on the assumption that the processed image mostly contains text and not diagrams or pictures. Bukhari et al. [19] demonstrated an active contour based warp correction technique for detection of curled text-lines. Although it produced better results in many cases, being a heavily iterative process, this methodology is considerably slower than many of the previous ones. A generalization of such model-based techniques was developed in [20], which uses a generalized cylindrical surface model to describe non-linearly warped pages. Depth information (if available) can also be used to dewarp a document image [21].
Such methodologies work for particular use cases and yield unsatisfactory results when applied to arbitrarily warped document images. Slightly more robust algorithms based on text-line detection [22], [23] were introduced in [24], [25] for handling complex page layouts. These methods iteratively check the alignment of the text-lines and try to improve it by dewarping them. Although such techniques can outperform the older ones, they are inherently slow and reliant upon the availability of bounding boxes and discernible page-boundaries or page limits. Koo et al. [26] introduced a technique that can estimate and rectify warps by capturing the document image from multiple viewpoints. This was further improved in [27] by considering a ridge-aware 3D model for describing the warped documents. A similar, but slightly more constrained technique was earlier developed in [28]. Similar techniques based on stereo-vision were developed earlier in [29], [30].

A recent trend is the application of convolutional neural networks for dewarping specifically warped documents. For example, [31], [32], [33] depict very effective ways of dewarping images of previously folded documents. Most deep-learning based methods in this domain, however, suffer from the fact that the number of available sample datasets is quite small [4], [34]. There is, in fact, a strong need for research towards developing synthetically warped document images [35], [36], [37].

It is quite evident from the above discussion that most methods are designed with specific goals in mind. Some of them produce better accuracy at the cost of speed. However, it is often found that camera captured document images contain parts of other pages that are supposed to be beyond the region of interest (ROI). Moreover, most of the techniques rely heavily on assumptions like structural uniformity of the warps, inherent parallelism of the text-lines, etc. These assumptions might not always be true. Also, many dewarping techniques introduce skews in the characters as a side-effect of the dewarping process.

The present technique demonstrates a multistage approach for dewarping non-linearly warped document images. First, a proper estimation of the warps in the document images is made through a piecewise linear approximation based generalization of the law of anharmonic ratio under homography. The underlying assumption is that either the page boundaries [38] or at least two lines of printed text are clearly visible. If the page boundaries are unambiguously discernible, then they are used to make an overall estimate for the curvilinear homography. Otherwise, the underlying homography is estimated by analyzing text-lines that are supposed to be parallel in the original unwarped printed document. The estimation is then extrapolated to assess the page level warps. Once such an estimate is available, the whole document image is divided into small sections by generating a perspective transformation grid. This is followed by a page-level dewarping by generating optimum inverse projections for each block. The quality of this process is then assessed by calculating five metrics related to the characteristics of the text-lines and rectilinear objects for measuring the parallelism, orthogonality, etc. in the dewarped images. This is done irrespective of the availability of any ground truth. Based on the calculated metrics, if the result of page-level dewarping after a single iteration is found to be unsatisfactory, the process is repeated with finer adjustments. If it is realized that further page-level dewarping would only produce diminishing improvements, then a line-level dewarping akin to [39], [40], [41] is applied that dewarps individual text-lines. Such considerations make the present methodology faster than the previously available techniques it is compared with (Section 3.4). The whole process is depicted in fig. 1.
The programs were tested on the CBDAR 2007 / IUPR 2011 document image dewarping dataset [4], [34], and the performance of the methodology is evaluated on the basis of the performance of an OCR engine and the dewarping evaluation measure [42] on the end result. After establishing the effectiveness of the presented methodology, the technique has also been tested on the DocUNet 2018 [31] dataset, which comprises 130 synthetically warped document images with various kinds of non-uniform warps. The results are found to be quite promising.

The following are the specific contributions of the present work.

1) A multistage dewarping algorithm is designed to perform coarse and fine adjustments as needed.

2) A quick and accurate methodology is devised to easily extract the regions of interest (ROI) from a document image. This saves a lot of unnecessary work for the dewarping algorithms.

3) A mechanism has been developed for assessing the quality of the dewarped images without the need of ground-truth images. This has been used to shape the internal decision-making process.

4) The decision-making process is designed in a way that quickly determines the correct courses of action – eliminating any step that might provide negligible improvements with large time-penalty.

5) The results are analyzed thoroughly with the CBDAR 2007 / IUPR 2011 dataset to provide a detailed insight on the correlation between the different document layouts and types of warps with the performance of the methodology.

6) The effectiveness of the algorithms is further validated by testing them on the DocUNet 2018 dataset. This dataset is especially challenging, since many of the assumptions related to inherent smoothness of the warps do not apply here. The dewarping results are found to be reasonably good on this dataset as well.

The rest of the paper is organized as follows. Section 2 provides a detailed description of the present methodology. ROI selection, performance metrics and experimental results are discussed in Section 3. Section 4 concludes the discussion with remarks on future possibilities.

2 PRESENT METHODOLOGY

The dewarping process developed in this work is implemented in four stages. First, the law of anharmonic ratio under linear projections is analyzed. Then, a suitable generalization of the methodology for non-linear perspective transformation is developed, which is used on the document images to generate a nonlinear perspective grid depicting the curvilinear warps in the distorted images. A suitable optimization technique is then employed to quickly estimate the optimum inverse projections. This is done based on the calculation of certain parameters that can represent the distortions due to curvilinear warps. Based on the estimated inverse projections, page-level dewarping is applied on the images. Finer adjustments are then made by performing line-level dewarping that minimizes the skews in the dewarped text-lines.

Fig. 1. Representative block diagram of the dewarping process. The warped image is first dewarped at the page-level. Then the quality of the result is assessed and, if deemed necessary, the page-level dewarping is repeated. Otherwise, if required, line-level dewarping is applied on the processed image.

Fig. 2. A rectangular page and its perspective transformation

2.1 Law of anharmonic ratio under homography for images of linearly warped rectangular pages

The proposed process of page-level dewarping is based on a generalization of linear homography for curvilinear warps. A linear homography would simply be a perspective transformation. For example, a rectangular page and an instance of it undergoing perspective transformation is depicted in fig. 2. In the real world, without any kind of perspective transformation, the long edges are denoted by and . Under a perspective transformation the edges become and . Let be a point on the edge such that divides the line segment in a ratio. Thus, . In the projected plane,

becomes . Now, consider that the extended versions of the lines and in the projected plane meet at the vanishing point V . Now, using the theorem of anharmonic ratio, it can be written that,

Thus, the point on the line segment that actually divides in a way such that , is mapped to

a point in the projected plane such that,

This is true for any projected line on the page that passes through the vanishing point. Also, the apparent symmetry in the projection as depicted in fig. 2 is just for the purpose of keeping the diagram simple. It is not at all necessary for the theorem.

If the vanishing point is closer to the edge instead, then the equivalent ratio can be obtained by swapping and in (1). Thus,

Let us assume that there is a line segment on the projected page with endpoints and that vanishes at V . The set of points for , that actually split the line segment into n number of equal parts in the real world, can be determined as,

for where,
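
The equal-division step of this section can be illustrated with a minimal sketch. It assumes the standard one-dimensional projective parameterisation along the line (the real-world parameter s in [0, 1] maps to an image position t, with t tending to the vanishing point as s grows), which preserves the anharmonic ratio; the function name `equal_division_points` and its arguments are hypothetical and are not the notation used above.

```python
import math


def equal_division_points(a, b, v, n):
    """Return n+1 image points that split segment a-b into n parts that are
    equal in the real world, given the vanishing point v of the line.

    a, b, v are (x, y) tuples lying on one straight image line. Pass v=None
    when the vanishing point is at infinity (no foreshortening).
    """
    ax, ay = a
    bx, by = b
    length = math.hypot(bx - ax, by - ay)
    if length == 0:
        raise ValueError("a and b must be distinct points")
    ux, uy = (bx - ax) / length, (by - ay) / length  # unit direction a -> b

    if v is None:
        ts = [length * i / n for i in range(n + 1)]
    else:
        # Signed position of the vanishing point along the line, from a.
        tv = (v[0] - ax) * ux + (v[1] - ay) * uy
        if 0.0 <= tv <= length:
            raise ValueError("vanishing point must lie outside the segment")
        # 1-D homography t(s) = tv*k*s / (k*s + 1):
        # t(0) = 0, t(1) = length, and t(s) -> tv as s -> infinity.
        k = length / (tv - length)
        ts = [tv * k * (i / n) / (k * (i / n) + 1.0) for i in range(n + 1)]

    return [(ax + t * ux, ay + t * uy) for t in ts]
```

For a segment from (0, 0) to (1, 0) with the vanishing point at (2, 0), the real-world midpoint lands at two thirds of the projected segment rather than at one half, which is the foreshortening effect the theorem captures.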

2.2 Generalized homography under curvilinear warps

In situations where the page is nonlinearly warped, homographies can be generalized through piecewise linear approximations. Under this scenario, instead of considering continuous linear edges to determine the projection, instantaneous slopes of the curved edges are taken into account. Under such nonlinear projections, the projection of is assumed to be a spatial function, modeled as,

This curve can be linearly approximated at a point as

Similar approximations can then be made on the projection of the opposite edge of the paper given by . For brevity, and are written as and respectively. As such, the approximate version of the edge at a point can be expressed as

Simplifying (7) yields

Just like (7), the approximation of the opposite edge of the projected image of the paper can be written as

and it can be simplified to

The instantaneous vanishing point () for the i-th division is defined as the point of intersection of (8) and (10).

If ∂

, the corresponding segments are parallel (or almost parallel) and thus, the corresponding vanishing point is assumed to be at infinity. Otherwise, the coordinates for on the projected plane can be calculated as,

where,

Thus, if two opposite edges in an image of a paper under curvilinear projection are visible, a grid can be formed on the image that corresponds to equal areas on the unprojected plane. Fig. 3 shows the formation of gridlines on one such image. The grid is designed in a manner such that, under an inverse perspective transformation, the blocks become squares of approximately equal size. However, there can often be situations where the page boundaries are either unavailable or indiscernible. These might result from unintentional crops or premature preprocessing / binarization. Under these circumstances, the perspective transformation grid can be formed by recognizing two or more parallel printed lines on the warped document image and extrapolating from there.

Fig. 3. Depiction of grid generation on a document image under curvilinear homography. Notice that the further the blocks are in the projected surface, the smaller they become.
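
The instantaneous vanishing point described in this section can be sketched as the intersection of the tangents to the two projected edges. This is a minimal illustration under stated assumptions: both edges are given as functions y = f(x) evaluated at the same abscissa, the slopes are taken by central finite differences, and the names `tangent_line`, `instantaneous_vanishing_point` and the tolerance `eps` are hypothetical.

```python
def tangent_line(curve, x0, h=1e-3):
    """Slope and intercept of the tangent to y = curve(x) at x0, using a
    central finite difference for the slope."""
    slope = (curve(x0 + h) - curve(x0 - h)) / (2.0 * h)
    intercept = curve(x0) - slope * x0
    return slope, intercept


def instantaneous_vanishing_point(top_edge, bottom_edge, x0, eps=1e-6):
    """Intersect the linear approximations of two opposite edges at x0.

    Returns (x, y), or None when the tangents are (almost) parallel, in which
    case the vanishing point is treated as lying at infinity.
    """
    m1, c1 = tangent_line(top_edge, x0)
    m2, c2 = tangent_line(bottom_edge, x0)
    if abs(m1 - m2) < eps:
        return None
    x = (c2 - c1) / (m1 - m2)  # solve m1*x + c1 = m2*x + c2
    y = m1 * x + c1
    return x, y
```

Repeating this for each division along the edges yields one vanishing point per grid column, which is what allows the grid to follow a curved page instead of a single global perspective.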

2.3 Calculation of optimum inverse projection

After the generation of the grid on the image based on curvilinear homography, the points of intersection of the gridlines are calculated. The objective is then to create a normalization operator for vectors v and w in . Here, v and w are members of the vector space representing the points in homogeneous coordinates on the projected grid and the expected grid after inverse perspective transformation over the field R. Let the points and the centroid of the i-th block on the projected grid be denoted by and respectively. Let the normals on the gridlines in the i-th block be denoted by . Also, assume that represent the set of vectors directed from to . Let and represent the projected vectors in the direction of the major and minor directional flows (of the foreground objects, such as text and images), respectively. Then, the average projection (), major flow vector () and the minor flow vector () for the i-th block can be calculated as

Here, the normalization operator can simply be assumed to be

Under the present scenario, can determine the average direction and magnitude of the projection vector of the i-th block within the gridlines formed on the projected image.

The optimum inverse perspective projection for the i-th block is then estimated by performing a constrained optimization for . It should also be noted that the optimizations performed on adjacent blocks must satisfy a smoothness criterion. This ensures that there are no abrupt changes in the inverse perspective transformations performed on adjacent blocks.

Let us assume that the average value of the major directional flow in the i-th block is . Also, let denote the forward finite differences operator, for vector-valued functions. Now consider the following five constraints for the i-th block:

To calculate the optimum value of denoted by , the weighted sum () of the constraint functions is minimized. Thus, the objective is:

Through many trials on a standard dataset (CBDAR 2007 / IUPR 2011), it was found that the Truncated Generalized Lanczos algorithm (Krylov) with a predefined trust region yielded the best compromise between the fastest and most accurate optimization for the objective (13).

with fixed pre-defined values for and random values for within a certain bound. The optimization algorithm first approximates as a quadratic function at the initial point as

where, is the Hessian. If it is positive definite at , then, the local minimum of can be calculated by setting . This yields,

The optimum is calculated iteratively by using (15). First a maximum step size is set and then the optimal step s is calculated inside a given trust-region. For the k-th step, the problem boils down to the following sub-problem:

subject to: .

The solution is updated in the next step as . The trust region is then updated as per the desired accuracy of the solution.
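
The iteration described above (quadratic model, step restricted to a trust region, update of the solution and of the region) can be sketched in a few lines. This is a deliberately simplified two-variable illustration, not the truncated generalized Lanczos solver: the sub-problem is solved with a clipped Newton step, falling back to the Cauchy point when the Hessian is not positive definite. All names, the acceptance thresholds (0.1, 0.25, 0.75) and the default radii are assumptions chosen for the example.

```python
import math


def solve_subproblem(g, H, radius):
    """Approximate minimiser of g.s + 0.5*s.H.s subject to |s| <= radius (2-D)."""
    (a, b), (c, d) = H
    det = a * d - b * c
    if a > 0 and det > 0:  # Hessian positive definite: try the Newton step
        sx = -(d * g[0] - b * g[1]) / det
        sy = -(-c * g[0] + a * g[1]) / det
        norm = math.hypot(sx, sy)
        if norm > radius:  # clip the step to the trust-region boundary
            sx, sy = sx * radius / norm, sy * radius / norm
        return sx, sy
    # Otherwise use the Cauchy point along the steepest-descent direction.
    gnorm = math.hypot(g[0], g[1])
    gHg = g[0] * (a * g[0] + b * g[1]) + g[1] * (c * g[0] + d * g[1])
    tau = 1.0 if gHg <= 0 else min(1.0, gnorm ** 3 / (radius * gHg))
    return -tau * radius * g[0] / gnorm, -tau * radius * g[1] / gnorm


def trust_region_minimize(f, grad, hess, x0, radius=1.0, max_radius=10.0,
                          tol=1e-6, max_iter=200):
    x = (float(x0[0]), float(x0[1]))
    for _ in range(max_iter):
        g = grad(x)
        if math.hypot(g[0], g[1]) < tol:
            break
        H = hess(x)
        s = solve_subproblem(g, H, radius)
        Hs = (H[0][0] * s[0] + H[0][1] * s[1], H[1][0] * s[0] + H[1][1] * s[1])
        # Reduction predicted by the quadratic model versus the actual one.
        predicted = -(g[0] * s[0] + g[1] * s[1]
                      + 0.5 * (s[0] * Hs[0] + s[1] * Hs[1]))
        trial = (x[0] + s[0], x[1] + s[1])
        actual = f(x) - f(trial)
        rho = actual / predicted if predicted > 0 else 0.0
        if rho < 0.25:
            radius *= 0.25  # model was poor: shrink the region
        elif rho > 0.75 and math.hypot(s[0], s[1]) >= 0.99 * radius:
            radius = min(2.0 * radius, max_radius)  # model was good: expand
        if rho > 0.1:
            x = trial  # accept the step
    return x
```

In practice a library routine would normally be preferred; the sketch is only meant to make the roles of the step s, the trust-region radius and the accept/reject test concrete.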

Fig. 4. Comparison of different optimization techniques in the context of the present dewarping application. It was seen that the truncated generalized Lanczos algorithm (Krylov) performs the best in the present scenario. On average, it provided the best optimization in the smallest run-time.

Once the optimum normals () for all the blocks are calculated in this manner, the mapping function for the inverse perspective transformation can be generated as a linear transformation that maps to for the i-th block. A comparison of the performance of a few other optimization techniques which could have been used in this scenario is depicted in fig. 4. The optimization algorithms that are compared with the present one are: Nelder-Mead (NM), Broyden-Fletcher-Goldfarb-Shanno (BFGS), Newton-Conjugate-Gradient (NCG), Trust-Region Newton-Conjugate-Gradient (TR-NCG), and Sequential Least Squares Programming (SLSQP). The accuracy is measured in terms of the OCR accuracy after one run of the page-level dewarping technique with the respective optimization algorithm.

The effectiveness of the described methodology is first tested on the 100 samples of the CBDAR 2007 / IUPR 2011 document image dewarping dataset. A sample set of results is depicted in fig. 5. Notice that the page boundaries for these test images are not clearly visible. Thus, the homographic grid is constructed by correlating text-lines that are supposed to be parallel straight lines after the inverse perspective transformation. Embedded images, however, are a different challenge. In many documents, the embedded images are often “framed”. Such rectangular boundaries around images provide clear correlatable baselines for determining the homography. In other cases where such image boundaries are not available, the images are disregarded while estimating the homography. Once the homography for the text-lines is calculated, the grid is then extrapolated to the portions covered by the embedded images based on a presumed smoothness criterion.

While forming the perspective transformation grid on a warped document image, the average height of the text elements is estimated first. This is used as an initializer to determine a suitable step-size in the piecewise linear approximation of the curvilinear grid. It can be proved that the time complexity of the presented methodology is , where n is the number of sub-divisions, and as such, creating an unnecessarily dense grid would quickly result in a very large run-time. This approximation, however, sometimes leads to a dewarped image where the text-lines are not exactly parallel. This usually happens when the document image contains a large number of unframed pictures and other unusual layouts. In any case, after dewarping, an estimate is made for the loss of parallelism and orthogonality based on the methodology described earlier. If the loss is too high, the whole dewarping process is repeated with a denser grid. If the loss falls within an acceptable margin, individual text-lines are corrected with a combined baseline / skew correction procedure.

Fig. 5. Sample results of coordinate transform based dewarping of warped document images with optimum inverse projection. The tests are performed on the CBDAR 2007 / IUPR 2011 document image dewarping dataset. The samples shown here are chosen specifically to demonstrate the effectiveness of the procedure for various document layouts. These include: single column, double column, single column with images, double column with images and mathematical formulas.

2.4 Correction of nonlinear baselines and skews

If a baseline / skew correction is deemed necessary, the image is first converted to a binary image where the foreground objects (text, images) are presented with white pixels and the background is presented in black. Subsequently, the image is denoised followed by morphological opening with a small or a square structuring element. Then, the resulting images are morphologically thinned and pruned.

Let A be the pruned binary image. The skeletons produced through pruning would roughly represent the locations of the centroid of the text-lines. The objective is to generate smooth 8-connected paths representing the text-line without any intermediate breakage. For this, first a set of points on the pruned lines is selected at regular intervals satisfying a closeness criterion . Then a cubic B-spline interpolation is generated as a curve x(s), where the parameter s changes linearly within the points. The curves are then represented by , where is a cubic basis function and are the spline coefficients. This is repeated for each of the detected text-lines on the image and splines representing each of the lines are generated. Regions of a small width orthogonal to each spline are created iteratively to make an estimate for the width of each text-line. The selected width is gradually increased until the selected regions completely engulf the text-lines under consideration. Each of the splines is then re-sampled at regular intervals to generate baselines for each individual text-line. If and are two such consecutive points on a text-line, the instantaneous principal slope angle of the text-line is determined as

The midpoint is represented by . The points and on the upper and lower boundaries, respectively, of the orthogonal neighborhood of the part of the text-line between and are then represented by

The simplified version of the piecewise affine transform for dewarping the text-lines is given by where U and V are affine spaces and f is of the form , where G is a linear transformation over U and . The transformation can be rewritten as

This can be further simplified to

The transformation from a set of points to a projected set of points is calculated as

Putting (20) in (18) and setting coefficients and , the affine transformation matrix G becomes,

Fig. 6. An example of suboptimal result obtained when the document layout is too complicated

In general, the affine transformation from (x, y) to is given by

where, Sc, Ro, Sh and Tr denote scale, rotation, shear and translation, respectively. Fig. 6 shows an example where the first pass of the page level dewarping yields an unsatisfactory result. This is mainly due to the fact that the layout of the document is just too complicated for the presented technique. To combat that, the suboptimally dewarped image is then passed through the line level dewarping process that corrects the nonlinear baselines and skews in individual text-lines. The process of line-level dewarping on one such line is depicted in fig. 7.
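
The geometric bookkeeping of the line-level step (slope angle between consecutive baseline samples, their midpoint, and the two boundary points of the orthogonal neighborhood) can be sketched as follows. This is a minimal illustration under assumptions: the slope angle is taken as the arctangent of the finite difference between the two samples, the boundary points are placed at a hypothetical `half_height` along the unit normal, and image coordinates with the y-axis pointing down are assumed, so "upper" means smaller y.

```python
import math


def baseline_segment_geometry(p, q, half_height):
    """For consecutive baseline samples p and q, given as (x, y), return
    (theta, midpoint, upper, lower).

    theta is the slope angle of the segment p -> q; upper and lower are the
    midpoint shifted by half_height along the normal to the segment.
    """
    theta = math.atan2(q[1] - p[1], q[0] - p[0])
    mx, my = (p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0
    nx, ny = -math.sin(theta), math.cos(theta)  # unit normal to the segment
    upper = (mx - half_height * nx, my - half_height * ny)  # smaller y (y down)
    lower = (mx + half_height * nx, my + half_height * ny)
    return theta, (mx, my), upper, lower
```

Collecting the upper and lower points for every pair of consecutive samples gives the source quadrilaterals from which a piecewise affine map to an axis-aligned strip can be estimated.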

3 EXPERIMENTAL RESULTS

The effectiveness of the present methodology is first assessed by performing tests on the CBDAR 2007 / IUPR 2011 document image dewarping dataset. This dataset is considered to be the de-facto standard for testing document image dewarping techniques, since it was designed with a multitude of standard use-cases in mind. The results are compared with various other well-established and state-of-the-art techniques in the domain. Table 1 provides a brief overview of the selected techniques. However, before dewarping the images, a pre-processing step can often be employed to discard regions outside the expected regions of interest (ROI). In the IUPR 2011 dataset, both grayscale and binarized image samples are available. In such situations, page boundaries and the corresponding ROIs can be calculated using a combination of watershed-based segmentation, connectivity-based component analysis and clustering.

TABLE 1 List of techniques with which the present methodology is compared

Fig. 7. Depiction of line level dewarping through correction of nonlinear baselines and skews. (a) A sample text-line after suboptimal document level dewarping, and (b) dewarped through nonlinear baseline and skew correction. The red dots indicate the equally spaced points on the estimated baseline. The blue dots are the calculated points for generating the inverse affine transformation map. (c)–(d) Another example where the considered line is a mathematical equation.

3.1 Page segmentation and ROI extraction from grayscale images

After performing a contrast enhancement [43] on the collected image, the image is re-scaled to a smaller size and converted to a grayscale one. The resulting image is then segmented using the classical watershed technique. This produces labels for different regions in the image, providing a correlation between different pixel intensity levels in the image. The set of connected components is then obtained and small clusters are discarded. Thus, the ROI turns out to be in one of the larger clusters. A hull (which is not necessarily convex) is generated over this cluster through a morphological hole-filling algorithm. This hull is then used as a mask to extract the ROI from the image. The whole process is depicted in algorithm 1 and an example is shown in fig. 8.
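
The component-selection step of this pipeline can be illustrated with a small standard-library sketch. It covers only the connected-component analysis and the discarding of smaller clusters; the contrast enhancement, watershed segmentation and hole-filling steps are omitted. The function name `largest_component_mask`, the nested-list image representation and the choice of 4-connectivity are assumptions made for the example.

```python
from collections import deque


def largest_component_mask(binary):
    """binary: list of rows of 0/1 values. Returns a same-sized 0/1 mask that
    keeps only the largest 4-connected foreground component."""
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    sizes = {}
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] and not labels[r][c]:
                # Breadth-first flood fill of a new component.
                next_label += 1
                labels[r][c] = next_label
                queue = deque([(r, c)])
                count = 0
                while queue:
                    y, x = queue.popleft()
                    count += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
                sizes[next_label] = count
    if not sizes:
        return [[0] * cols for _ in range(rows)]
    keep = max(sizes, key=sizes.get)  # label of the largest cluster
    return [[1 if labels[r][c] == keep else 0 for c in range(cols)]
            for r in range(rows)]
```

The returned mask plays the role of the cluster from which the hull is generated; multiplying the image by the (hole-filled) mask would then extract the ROI.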

3.2 Dewarping

Following a successful ROI extraction, the image is binarized and dewarped. As mentioned in the previous sections, the dewarping process takes place in multiple steps. After an initial page-level dewarp, the metrics describing the parallelism, orthogonality, geodesic property, text height, as described in the previous section, are calculated. This is depicted in Table 2. Based on an overall quality metric given by

a decision is made on what to do next. It is worth noting that in the case of a perfectly straight document image where every single text-line is of the same height and all text-lines are perfectly parallel, the following would be the ideal normalized average values of the parameters:

1) orthogonality of the projection ; 2) parallelism of the text-lines: ; 3) geodesic property of the lines: ; 4) orthogonality of the text strokes and line directions:

Based on these estimates, if it is found that the resulting image after the initial stage of page-level dewarping requires a good amount of overall improvement, the page-level dewarping process is repeated once more. The metrics are calculated again. If, after one or two iterations of page-level dewarping, the quality metrics indicate that only line-level corrections are needed, then line-level dewarping is applied on the processed image. Table 2 is sorted by the OCR accuracy after the initial step. At each step, the quality is assessed and decisions are made in accordance with the desired quality set by the user. In the present settings, if the overall quality metric happens to be greater than or equal to 0.95, then nothing is done after the initial application of page-level dewarping. This is indicated by “—”. If is found to be within [0.90, 0.95), only line-level corrections are deemed necessary. This is indicated with “L”. Similarly, “R+L” indicates that after the initial application of page-level dewarping, the process is repeated once more and then line-level corrections are applied. This is done if the value of after the first step was found to be below 0.90.
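
The decision rule stated above reduces to two thresholds on the overall quality metric. A minimal sketch follows; the function name `next_action` and its parameter names are hypothetical, the default thresholds are the 0.95 and 0.90 values quoted in the text, and the string "none" stands for the dash entry of Table 2.

```python
def next_action(quality, accept=0.95, line_only=0.90):
    """Map the overall quality metric after the first page-level pass to the
    follow-up action: "none", "L" (line-level correction only) or "R+L"
    (repeat page-level dewarping, then apply line-level correction)."""
    if quality >= accept:
        return "none"
    if quality >= line_only:
        return "L"
    return "R+L"
```

Keeping the thresholds as parameters allows the same rule to be reused with relaxed values on harder datasets.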

This is presented here just as a depiction of the correlation between the quality metrics and the OCR accuracy. In real applications, however, ASCII ground truths would obviously be unavailable and calculating OCR accuracy would be impossible. Thus, in real-world applications, only these quality metrics can provide an estimate of the quality of the whole process in the absence of a ground truth.

3.3 Dewarping evaluation measure

TABLE 2 Depiction of the quality assessment and decision making process after one application of page-level dewarping. After a single pass of page-level dewarping, further decisions are made based on the obtained value of the average value of the quality metric

Since the ground-truth images for the present dataset are available, the dewarping evaluation measure () is also calculated as per [42] after the completion of a full dewarping stack. Scale invariant keypoints are first marked on the text-lines of both the warped and dewarped images. Let and denote smooth cubic polynomials representing the keypoints on the j-th line on a warped and dewarped image. The warp measures for the warped and the dewarped images, respectively, are calculated as,

The dewarping evaluation measure for the j-th line is then defined as,

is assigned when the dewarped text-line is even worse than the warped one. Often it is deemed sufficient to calculate for a few important and critical text-lines only. The average measure, represented as over N number of lines, is defined as,

TABLE 3 Average dewarping evaluation measure () and weighted average dewarping evaluation measure () for various algorithms on the CBDAR 2007 / IUPR 2011 document image dewarping dataset. Higher values indicate better results.

Also, since different text-lines would have different amounts of warps in them, a weighted average dewarping evaluation measure can be calculated by modifying (24) as,

where, , for .

The average dewarping evaluation measure () and weighted average dewarping evaluation measure () for various algorithms on the CBDAR 2007 / IUPR 2011 document image dewarping dataset are provided in Table 3. The keypoints are chosen at the beginning of each word on the text-lines to standardize the process of evaluation across multiple algorithms. The table shows the average measure for 6 critical text-lines per image, the weighted average measure for 6 critical text-lines per image and the average measure for all text-lines.

3.4 Implementation specific details

The algorithms were implemented in Python 3.7.4. The libraries used include NumPy 1.16, SciPy 1.3, Scikit-Image 0.15, Scikit-Learn 0.21, OpenCV 4.1 and Cython 0.29. The programs were executed on a laptop with an Intel i5 6200U CPU (2.8 GHz) and 8 GB DDR3 (1600 MHz) RAM, running Ubuntu GNU/Linux 18.04. It is seen that the present methodology yields quite satisfactory results in terms of speed and the quality of the OCR after dewarping. The average OCR accuracy is calculated to be 98.25%, which is higher than the previous state-of-the-art (97.54%) [25]. The OCR accuracy is calculated using Tesseract 4.0.0 and comparing the results with the ASCII ground truth provided with the dataset. Table 4 depicts the total run-time and OCR accuracy for each individual image in the dataset. Table 5 indicates the average recognition accuracy of the present algorithm in comparison to the previous ones. It is worth noting that the present methodology is reasonably faster than the previously available algorithms. The minimum, maximum, average and standard deviation of the run-time for the present implementation are 1.57 s, 9.56 s, 5.97 s and 1.75 s, respectively. The minimum and maximum run-times for the previous best implementation [25] were 6.23 s and 15.83 s, respectively.

TABLE 4 Total run-time and final OCR accuracy for each sample in the CBDAR 2007 / IUPR 2011 document image dewarping dataset

TABLE 5 Average accuracy of various algorithms on the CBDAR 2007 / IUPR 2011 document image dewarping dataset

Fig. 9. Sample results of ROI extraction on the DocUNet [31] dataset.

Fig. 10. Sample results of dewarping on the images in the DocUNet [31] dataset. The images are used after removing their backgrounds and converting them to grayscale.

3.5 Tests on the DocUNet dataset

As a further validation of the effectiveness of the present algorithms, additional tests are carried out on the DocUNet [31] dataset. It contains a set of 130 synthetically warped document images representative of a multitude of (extreme) real-world scenarios. This dataset is extremely challenging for any conventional algorithm because of the inherent nonuniformity of the warps in the images. Since the present methodology partially relies on the absence of arbitrary and confusing backgrounds, the ROI detection process is of utmost importance. However, as can be seen from the sample results shown in Fig. 9, the pre-processing algorithm for ROI selection does an excellent job of quickly removing the backgrounds from the images.

Since this dataset contains warps that cannot be represented properly with only the knowledge of the page-boundaries, the present methodology is slightly tweaked. The modifications include the following assumptions:

1) knowledge of the page-boundaries is mostly useless,

2) homographies cannot be extrapolated based on the detection of just two parallel straight lines, and

3) no smoothness criteria can be assumed in the nature of the warps.

The quality of each step in the dewarping process is again measured by calculating the orthogonality and parallelism measures. The criteria are, however, slightly relaxed in view of the inherent complexity of the images. If the overall measure, as given by (22), fell under 0.8, the dewarping process was repeated with finer adjustments. For intermediate values of the measure, only line-level corrections were made. Achieving a sufficiently high value was considered a success, and no further processing was deemed necessary. Sample results of the application of the present technique on the DocUNet 2018 dataset are depicted in Fig. 10. The average dewarping evaluation measure and the weighted average dewarping evaluation measure for various algorithms on the DocUNet 2018 dataset have also been calculated and are provided in Table 6. However, for this dataset a comparison of OCR accuracy is omitted, since the majority of the images in the DocUNet 2018 dataset are covered with figures instead of text.

TABLE 6 Average dewarping evaluation measure and weighted average dewarping evaluation measure for various algorithms on the DocUNet 2018 [31] dataset

Thus, as another structural measure of dewarping accuracy, the multi-scale structural similarity (MS-SSIM) [44] has been calculated for the dewarped images. This provides a metric for the structural similarity between the dewarped images and the ground truths. Also, scale invariant feature transform flow (SIFT flow) [45] has been used to measure the local distortions (LD) [27]. To illustrate the strong correlation between the dewarp estimate and the MS-SSIM (M), their values are calculated for every sample in the DocUNet 2018 dataset after one iteration of page-level dewarping. The results are shown in Table 7, and the correlation is further depicted in Fig. 11, together with the ordinary least squares regression line.
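The relationship between the dewarp estimate and the MS-SSIM can be examined with a simple least squares fit. The following is a minimal sketch, assuming the per-image values are available as two equal-length lists. The function name `ols_line` and the numbers in the usage example are hypothetical and are not taken from Table 7.

```python
import math
from typing import Sequence, Tuple


def ols_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns (slope, intercept, r), where r is the Pearson correlation
    coefficient between x and y.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Centered sums of squares and cross-products.
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


if __name__ == "__main__":
    # Made-up illustrative values, NOT the measurements reported in the paper.
    estimate = [0.2, 0.4, 0.6, 0.8]
    ms_ssim = [0.25, 0.35, 0.45, 0.55]
    slope, intercept, r = ols_line(estimate, ms_ssim)
    print(f"M = {slope:.3f} * estimate + {intercept:.3f}, r = {r:.3f}")
```

A value of r close to 1 would indicate that the ground-truth-free estimate tracks the ground-truth-based MS-SSIM closely, which is the property the comparison above is meant to illustrate.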

The final average MS-SSIM and LD are provided in Table 8. Compared to the benchmark results [31], the average MS-SSIM improved from 0.41 to 0.48 and the average LD improved from 14.08 to 12.28. This particularly demonstrates the robustness of the present technique. The run-time statistics are provided in Table 9.

4 CONCLUSIVE REMARKS

As can be seen, the proposed quality measure based on orthogonality and parallelism does an excellent job of estimating the quality of the dewarping process. Most importantly, this measure does not rely on the existence of a ground truth of any sort. Even though the methodology proposed in this paper was designed by carefully analyzing the CBDAR 2007 / IUPR 2011 dataset, the inherent robustness of the technique allows it to be applied to a completely different dataset (DocUNet 2018) with a vastly different set of warps, where it still obtains comparable results. This is despite the fact that the benchmark implementation [31] clearly assumes that conventional handcrafted algorithms that do not employ deep learning cannot produce better results.

Fig. 11. Plot of MS-SSIM vs dewarp estimate for each sample in the DocUNet 2018 dataset after one iteration of page-level dewarp. The regression line is depicted in orange.

With that said, there are still substantial developmental possibilities in this domain through the application of deep learning. However, one of the foremost challenges in this regard is the lack of suitable training datasets. One approach would be to capture a few thousand real-world samples of warped document images under different conditions and use those to generate a large number of synthetically warped document images through a generative adversarial network.

REFERENCES

[1] C. Wu and G. Agam, “Document image de-warping for text/graphics recognition,” in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 2002, pp. 348–357.

[2] S. Lu and C. L. Tan, “Document flattening through grid modeling and regularization,” in 18th International Conference on Pattern Recognition (ICPR’06), vol. 1. IEEE, 2006, pp. 971–974.

[3] A. F. Mollah, S. Basu, N. Das, R. Sarkar, M. Nasipuri, and M. Kundu, “A fast skew correction technique for camera captured business card images,” in 2009 Annual IEEE India Conference, Dec 2009, pp. 1–4.

[4] F. Shafait and T. M. Breuel, “Document image dewarping contest,” in 2nd Int. Workshop on Camera-Based Document Analysis and Recognition, Curitiba, Brazil, 2007, pp. 181–188.

[5] H. Ezaki, S. Uchida, A. Asano, and H. Sakoe, “Dewarping of document image by global optimization,” in Eighth International Conference on Document Analysis and Recognition (ICDAR’05). IEEE, 2005, pp. 302–306.

[6] A. Ulges, C. H. Lampert, and T. M. Breuel, “Document image dewarping using robust estimation of curled text lines,” in Eighth International Conference on Document Analysis and Recognition (ICDAR’05). IEEE, 2005, pp. 1001–1005.

[7] P. Kakumanu, N. Bourbakis, J. Black, and S. Panchanathan, “Document image dewarping based on line estimation for visually impaired,” in 2006 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’06). IEEE, 2006, pp. 625–631.

[8] M. Wu, R. Li, B. Fu, W. Li, and Z. Xu, “A model based book dewarping method to handle 2d images captured by a digital camera,” in Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 1, Sep. 2007, pp. 158–162.

TABLE 7 Depiction of the strong correlation between the quality metric designed in this paper and the MS-SSIM (M) for the images in the DocUNet 2018 [31] dataset. The depicted values are the calculated metrics after a single iteration of page-level dewarping. Note that calculating the MS-SSIM requires the ground truth images, while the proposed quality metric does not.

TABLE 8 Multi-scale structural similarity (MS-SSIM) and local distortion (LD) of various dewarping techniques on the DocUNet 2018 [31] dataset. Higher MS-SSIM indicates higher similarity with the ground truth. Lower LD indicates lower distortion compared to the ground truth.

TABLE 9 Run-time statistics for the DocUNet 2018 [31] dataset

[9] M. Wu, R. Li, W. Li, E. P. Heaney Jr, K. Chan, and K. A. Rapelje, “Model-based dewarping method and apparatus,” Feb. 12 2008, US Patent 7,330,604.

[10] A. Masalovitch and L. Mestetskiy, “Usage of continuous skeletal image representation for document images de-warping,” in Proceedings of International Workshop on Camera-Based Document Analysis and Recognition, Curitiba, 2007, pp. 45–53.

[11] Y. He, P. Pan, S. Xie, J. Sun, and S. Naoi, “A book dewarping system by boundary-based 3d surface reconstruction,” in 2013 12th International Conference on Document Analysis and Recognition, Aug 2013, pp. 403–407.

[12] A. Huggett and G. Kirsch, “Method and apparatus providing perspective correction and/or image dewarping,” Apr. 2 2013, US Patent 8,411,998.

[13] F. Bolelli, “Indexing of historical document images: Ad hoc dewarping technique for handwritten text,” in Italian Research Conference on Digital Libraries. Springer, 2017, pp. 45–55.

[14] B. Gatos, I. Pratikakis, and K. Ntirogiannis, “Segmentation based recovery of arbitrarily warped document images,” in Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, Sep. 2007, pp. 989–993.

[15] H. Chethan and G. H. Kumar, “Image dewarping and text extraction from mobile captured distinct documents,” Procedia Computer Science, vol. 2, pp. 330–337, 2010, Proceedings of the International Conference and Exhibition on Biometrics Technology.

[16] N. Stamatopoulos, B. Gatos, I. Pratikakis, and S. J. Perantonis, “A two-step dewarping of camera document images,” in 2008 The Eighth IAPR International Workshop on Document Analysis Systems, Sep. 2008, pp. 209–216.

[17] N. Stamatopoulos, B. Gatos, I. Pratikakis, and S. J. Perantonis, “Goal-oriented rectification of camera-based document images,” IEEE Transactions on Image Processing, vol. 20, no. 4, pp. 910–920, April 2011.

[18] M.-S. Kwon, N.-I. Cho, S.-H. Kim, B.-S. Kim, and W.-k. Seo, “Method, apparatus, and computer-readable recording medium for converting document image captured by using camera to dewarped document image,” Apr. 5 2016, US Patent 9,305,211.

[19] S. S. Bukhari, F. Shafait, and T. M. Breuel, “Coupled snakelet model for curled textline segmentation of camera-captured document images,” in 2009 10th International Conference on Document Analysis and Recognition, July 2009, pp. 61–65.

[20] G. Meng, C. Pan, S. Xiang, J. Duan, and N. Zheng, “Metric rectification of curved document images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 707–722, April 2012.

[21] C. Wu, K. R. Bengtson, and J. P. Allebach, “Captured open book image de-warping using depth information,” in 2015 IEEE International Conference on Image Processing (ICIP), Sep. 2015, pp. 197–201.

[22] V. Frinken, A. Fischer, R. Manmatha, and H. Bunke, “A novel word spotting method based on recurrent neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 211–224, Feb 2012.

[23] H. I. Koo, “Text-line detection in camera-captured document images using the state estimation of connected components,” IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5358–5368, Nov 2016.

[24] B. S. Kim, H. I. Koo, and N. I. Cho, “Document dewarping via text-line based optimization,” Pattern Recognition, vol. 48, no. 11, pp. 3600–3614, 2015.

[25] T. Kil, W. Seo, H. I. Koo, and N. I. Cho, “Robust document image dewarping method using text-lines and line segments,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, Nov 2017, pp. 865–870.

[26] H. I. Koo, J. Kim, and N. I. Cho, “Composition of a dewarped and enhanced document image from two view images,” IEEE Transactions on Image Processing, vol. 18, no. 7, pp. 1551–1562, 2009.

[27] S. You, Y. Matsushita, S. Sinha, Y. Bou, and K. Ikeuchi, “Multiview rectification of folded documents,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 2, pp. 505–511, Feb 2018.

[28] S. S. Bukhari, F. Shafait, and T. M. Breuel, “Ridges based curled textline region detection from grayscale camera-captured document images,” in International Conference on Computer Analysis of Images and Patterns. Springer, 2009, pp. 173–180.

[29] A. Ulges, C. H. Lampert, and T. Breuel, “Document capture using stereo vision,” in Proceedings of the 2004 ACM symposium on Document engineering. ACM, 2004, pp. 198–200.

[30] M. P. Cutter and P. Chiu, “Capture and dewarping of page spreads with a handheld compact 3d camera,” in 2012 10th IAPR International Workshop on Document Analysis Systems. IEEE, 2012, pp. 205–209.

[31] K. Ma, Z. Shu, X. Bai, J. Wang, and D. Samaras, “Docunet: Document image unwarping via a stacked u-net,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2018, pp. 4700–4709.

[32] S. Das, K. Ma, Z. Shu, D. Samaras, and R. Shilkrot, “Dewarpnet: Single-image document unwarping with stacked 3d and 2d regression networks,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.

[33] X. Li, B. Zhang, J. Liao, and P. V. Sander, “Document Rectification and Illumination Correction using a Patch-based CNN,” arXiv e-prints, p. arXiv:1909.09470, Sep 2019.

[34] S. S. Bukhari, F. Shafait, and T. Breuel, “The iupr dataset of camera-captured document images,” in 4th International Workshop on Camera-Based Document Analysis and Recognition (CBDAR-11), September 22, Beijing, China, ser. Lecture Notes in Computer Science (LNCS). Springer, Sep. 2011.

[35] V. C. Kieu, N. Journet, M. Visani, R. Mullot, and J. P. Domenger, “Semi-synthetic document image generation using texture mapping on scanned 3d document shapes,” in 2013 12th International Conference on Document Analysis and Recognition, Aug 2013, pp. 489–493.

[36] V. C. Kieu, M. Visani, N. Journet, R. Mullot, and J. P. Domenger, “An efficient parametrization of character degradation model for semi-synthetic image generation,” in Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, ser. HIP ’13. New York, NY, USA: ACM, 2013, pp. 29–35. [Online]. Available: http://doi.acm.org/10.1145/2501115.2501127

[37] A. Garai, S. Biswas, S. Mandal, and B. B. Chaudhuri, “A Method to Generate Synthetically Warped Document Image,” arXiv e-prints, p. arXiv:1910.06621, Oct 2019.

[38] F. Guo, Y. Li, and P. Liu, “A fast page outline detection and dewarping method based on iterative cut and adaptive coordinate transform,” in 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), vol. 4, Sep. 2019, pp. 1–6.

[39] S. Roy, G. Adhikari, T. Dasgupta, and T. Pradhan, “An adaptive warp correction algorithm for handwritten text images with non-linear baselines,” in 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), July 2018, pp. 1–7.

[40] G. Adhikari, S. Roy, T. Dasgupta, and T. Pradhan, “A novel technique for unwarping curved handwritten texts using mathematical morphology and piecewise affine transformation,” in 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), July 2018, pp. 1–7.

[41] A. Garai, S. Biswas, S. Mandal, and B. B. Chaudhuri, “Automatic dewarping of camera captured born-digital bangla document images,” in 2017 Ninth International Conference on Advances in Pattern Recognition (ICAPR), Dec 2017, pp. 1–6.

[42] N. Stamatopoulos, B. Gatos, and I. Pratikakis, “Performance evaluation methodology for document image dewarping techniques,” IET Image Processing, vol. 6, no. 6, pp. 738–745, August 2012.

[43] G. Adhikari, R. Mukherjee, and T. Dasgupta, “A local adaptive region-wise histogram correction and thresholding technique for very poorly illuminated images,” in 2018 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), March 2018, pp. 1–5.

[44] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems Computers, 2003, vol. 2, Nov 2003, pp. 1398–1402.

[45] C. Liu, J. Yuen, and A. Torralba, “Sift flow: Dense correspondence across scenes and its applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 978–994, May 2011.

Tanmoy Dasgupta (S’16, M’20) received his B.E. and M.E. in electrical engineering from Jadavpur University in 2012 and 2014, respectively. He was the recipient of university gold medal in 2014 for securing first position in M.E. He is currently working as an assistant professor at Techno India University, West Bengal, India. He is presently enrolled as a doctoral student in the Department of Computer Science and Engineering, Jadavpur University. His research interests include signal and image processing.

Nibaran Das (M’07) received his B.Tech degree in computer science and technology from Kalyani Govt. Engineering College under Kalyani University, in 2003. He received his M.C.S.E and Ph.D. (Engg.) degree from Jadavpur University, in 2005 and 2012 respectively. He joined Jadavpur University as a faculty member in 2006 where he is currently serving as an associate professor in the department of computer science and engineering. He has published more than 150 peer-reviewed research articles in different international journals and conferences. His areas of current research interest are OCR of handwritten text, optimization techniques, deep learning and image processing.

Mita Nasipuri (M’88, SM’92) received her B.E.Tel.E., M.E.Tel.E., and Ph.D. (Engg.) degrees from Jadavpur University, in 1979, 1981 and 1990, respectively. Prof. Nasipuri has been a faculty member of Jadavpur University since 1987. She has supervised 18 doctoral students and has published 5 books and more than 400 peer-reviewed research articles in different international journals and conferences. Her current research interest includes image processing, pattern recognition, and multimedia systems. She is a senior member of the IEEE, fellow of IE (India) and WBAST, Kolkata, India.
