Jinyuan Zhao The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation Chinese Academy of Sciences, Beijing 100190, China University of Chinese Academy of Sciences, China Tel.: +86-10-82544488 Fax: +86-10-62650820 E-mail: zhaojinyuan2016@ia.ac.cn
Text detection in natural scenes has attracted increasing attention in the field of computer vision due to its wide application in natural scene understanding tasks such as scene location, automatic driving, and text analysis.
Fig. 1: Some scene text image examples taken from public datasets.
In recent years, many scene text detection techniques have emerged and achieved good performance in various competitions and on public datasets. However, many challenges remain in scene text detection, such as varying fonts and languages, complex lighting and background conditions, and confusion with similar patterns and logos. Figure 1 shows sample images from scene text detection tasks.
Existing scene text detection frameworks are mainly inspired by general object detection methods and semantic segmentation methods. Methods based on general object detection usually consist of two stages: a region proposal network (RPN) extracts candidate text regions, and a classification network ranks the features from the RPN and produces the final text position. Semantic segmentation based methods usually treat text as a special segmentation instance and aim to distinguish it directly from the background in the segmentation results. These methods are called one-stage methods. Compared with two-stage methods, one-stage methods are more intuitive and concise, but they still have the following problems:
Imprecise segmentation labels: Traditional one-stage methods often train the networks to get a binary text score map. However, due to the diversity of text distribution in scene text images, many annotated text boxes will contain some background pixels. When text pixels are used as a target to conduct pixel-level instance segmentation, these background pixels may cause the problem of learning confusion and reduce the effect of training.
Multitask learning problem: Some classic one-stage methods, such as EAST [35], obtain the text score map and the features required by the regression task from the same convolutional network. However, regression information, as a distance measure, does not share the features extracted by the CNN well with a text score map based on image features, and the resulting performance is slightly weaker than that of two-stage detectors.
In this paper, we propose the discriminator guided scene text detector (DGST) to address the above problems and improve the performance of one-stage text detectors. We introduce the framework of conditional generative adversarial networks, which has recently become popular in image generation tasks, and transform text detection into a segmentation-image generation task. A discriminator is used to automatically adjust the losses during training so that the generator produces a satisfactory text score map. At the same time, we design the soft-text-score map to strengthen the center position of text boxes and weaken the influence of edge pixels on the detection results, which eliminates the interference of background pixels and avoids learning confusion. The final detection results are obtained by combining the soft-text-score maps of different shrink factors. We evaluate our method on the ICDAR2013 [8], ICDAR2015 [9], ICDAR2017 [21] and MSRA-TD500 [31] datasets. The F-measure of our method reaches 87% on ICDAR2015 [9] and 74.3% on ICDAR2017 [21].
Our pipeline is shown in Fig.2. The main contributions of this paper are three-fold:
• We introduce the framework of generative adversarial networks into the task of scene text detection and design a suitable structure for it.
• We redefine the representation of text area and non-text area in the framework of semantic segmentation, and solve the learning confusion caused by background pixels.
• Extensive experiments demonstrate the state-of-the-art performance of the proposed method on several benchmark datasets.
With the development of computer technology and the popularization of deep-learning methods, detectors based on neural network framework have shown excellent performance in scene text detection tasks, which makes text detection enter a new era of deep-learning methods.
Scene text detection has been studied extensively in recent years. Existing methods can be divided into two branches. One branch is based on general object detection methods such as SSD [15], YOLO [22], and Faster RCNN [23]. TextBoxes++ [13] modifies the anchors and kernels of SSD [15] to enable the detector to process text with large aspect ratios in scene images. RRPN [20] changes the aspect ratio of the anchors in Faster RCNN [23] and adds rotated anchors to support scene text detection with arbitrary orientation. CTPN [27] further analyses the characteristics of text and optimizes the RPN in Faster RCNN [23] to extract candidate boxes and merge many small candidate boxes into the final text prediction box, which addresses the detection of text lines of arbitrary length. These text detectors treat words or text lines as a special object and add subsequent classifiers to filter text areas in the convolutional features. Usually, these methods need NMS to obtain the final text location.
Another branch is based on semantic segmentation, which regards scene text detection as a special semantic segmentation task. Zhang et al. [34] use an FCN to estimate text blocks and MSER to extract candidate characters. EAST [35] adopts the idea of FCN and predicts the location, scale, and orientation of text with a single model and multiple loss functions (multi-task training). PSENet [29] uses semantic segmentation to classify text at the pixel level, which simplifies the modeling of curved text, and uses kernels to separate close text blocks. CRAFT [1] treats the characters themselves and the affinity between characters as different target instances to generate score maps, and detects text at the character level. These methods aim to obtain a binary text score map and extract the text in the image as segmentation instances. The final text position is obtained by analyzing the text score map. Compared with the two-stage methods, these methods are more intuitive and have a simpler network structure.
The methods above have achieved excellent performance on standard benchmarks. However, as illustrated in Fig. 3(a), the problem of imprecise segmentation labels has not been well solved. For segmentation-based detectors in particular, the background pixels inside the annotation boxes affect the classification results, which leads to deviations in the final results. Meanwhile, many methods need to learn multiple tasks at the same time, such as classification, regression, and text score map generation, which makes the network structure and inference more complex.
Some segmentation-based detectors have explored the text representation and improved the previous score map labeling methods. PixelLink [2] first transforms text detection into a pure segmentation problem by linking pixels within the same instance in eight directions, and then extracts the text bounding box directly from the segmentation without location regression. PSENet [29] finds text kernels of different scales and proposes a progressive scale expansion algorithm to accurately separate cohesive text instances. TextField [30] uses a direction field that encodes both the binary text mask and direction information, which facilitates the subsequent text grouping process.
With the emergence of deep-learning techniques, the research on the direction of generative image modeling has made significant progress [12, 24, 28]. [26] uses the conditional GANs to translate a rendering image to a real image. An unsupervised image-to-image translation framework based on shared latent space is proposed in [14]. More recently, CycleGAN [36] and its variants [33, 10] have achieved impressive image translation by using cycle-consistency loss. [6] proposes a cycle-consistent adversarial model that is applicable at both pixel and feature levels.
Inspired by the above methods, in this paper we use the generative adversarial network framework and design a more reasonable soft-text-score map to obtain more accurate semantic segmentation results, and we use connected components analysis to replace the traditional NMS process. This not only avoids the learning confusion caused by imprecise labels but also turns the whole network training process into a single-task learning process, which is more concise and intuitive.
Fig.2 shows the flowchart of the proposed method for scene text detection, which is a one-stage detector. In the training process, the generator and discriminator learn alternately, so that the generator finally converts the input scene image into the corresponding soft-text-score map. This eliminates intermediate steps such as candidate proposal, thresholding, and NMS on predicted geometric shapes. The post-processing steps only include connected components analyses of the text score map. The detector is named as DGST since it is a Discriminator Guided Scene Text detector.
3.1 Label Generation
Some classical one-stage detectors, such as EAST [35], PSENet [29] and PixelLink [2], usually generate a binary text score map. However, this labeling method has the drawbacks mentioned in Section 1. When text feature extraction is regarded as a semantic segmentation task that classifies the input image at the pixel level, the background pixels in the ground-truth boxes interfere with the learning of text features. Some of these methods shrink the annotation boxes more tightly to reduce the background pixels, as shown in Fig.3 (a). However, such a rigid shrinkage cannot accurately adjust the labeling of each box, and the text edges and background pixels cannot be well distinguished, which makes the final text box position deviate from the desired result. CRAFT [1] divides the text line annotation into single-character annotations and measures the Gaussian distance on each character to obtain the text score map, which further weakens the influence of background noise on text feature extraction, but the conversion from word-level annotation to character-level annotation introduces additional complex work.
In this paper, inspired by the above methods, we propose a method to generate text score maps based on distance pairs between the pixels in the annotation box and the corresponding boundaries. We compare the distance between the pixels in the annotation box and the corresponding boundary in horizontal and vertical directions, highlighting the central position of the text line, and weakening the weight of the pixels on the edge, which are easily confused with the background. For a point (x, y) in the input image, its intensity value P in soft-text-score map can be calculated by the following formula:
Here T denotes the set of all annotated text boxes, and the remaining symbols denote the width and height of the i-th text box and the distances from the point (x, y) to each of its edges. We use the average of the horizontal and vertical terms to calculate the gray value P, which decreases from the center line in the horizontal and vertical direction to the edge points in every text box. An intuitive display is shown in Fig.3 (b).
Fig. 2: Overview of the proposed DGST.
Fig. 3: Diagrams of different text score map annotation methods. (a) The labeling method used by EAST [35]. (b) The labeling method proposed by this paper.
The values of all the pixels are between [0,1]. In order to solve the problem that it is difficult to deal with cohesive text blocks in post-processing, we generate two different levels of score maps for the same input image. The pixel values in the two score maps are calculated in exactly the same way. The difference is that the text box in score map (2) is contracted in the way shown in Fig.3 (a) so that there is a greater gap between the text boxes (as shown in the dotted line box in Fig.3 (b)). In our experiment, the contraction factor is 0.2.
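The labeling idea above can be illustrated with a small sketch. It is not the paper's exact formula: it assumes axis-aligned boxes, an assumed score of the form "one minus the normalized difference between the distances to opposite edges", averaged over the horizontal and vertical directions, and a shrink that moves each side inward by half the shrink factor times the box size. The function names `shrink_box`, `soft_score` and `soft_score_map` are hypothetical.

```python
def shrink_box(box, factor):
    """Shrink an axis-aligned box (x0, y0, x1, y1) toward its center.

    Assumption: each side moves inward by factor * size / 2, so a factor
    of 0.2 keeps 80% of the original width and height.
    """
    x0, y0, x1, y1 = box
    dx = factor * (x1 - x0) / 2.0
    dy = factor * (y1 - y0) / 2.0
    return (x0 + dx, y0 + dy, x1 - dx, y1 - dy)


def soft_score(x, y, box):
    """Assumed soft score: 1 at the box center, falling linearly toward the edges.

    Boxes are assumed to have non-zero width and height.
    """
    x0, y0, x1, y1 = box
    if not (x0 <= x <= x1 and y0 <= y <= y1):
        return 0.0
    w, h = x1 - x0, y1 - y0
    # Each term is 1 when the distances to opposite edges are equal
    # and 0 when the point lies on one of those edges.
    horiz = 1.0 - abs((x - x0) - (x1 - x)) / w
    vert = 1.0 - abs((y - y0) - (y1 - y)) / h
    return 0.5 * (horiz + vert)


def soft_score_map(width, height, boxes, factor=0.0):
    """Render a score map as a list of rows; overlapping boxes keep the maximum."""
    shrunk = [shrink_box(b, factor) for b in boxes]
    return [
        [max((soft_score(x, y, b) for b in shrunk), default=0.0)
         for x in range(width)]
        for y in range(height)
    ]


# Two maps for the same annotation: unshrunk, and shrunk with factor 0.2.
boxes = [(2, 2, 10, 6)]
score_map_1 = soft_score_map(13, 9, boxes, factor=0.0)
score_map_2 = soft_score_map(13, 9, boxes, factor=0.2)
```

Under these assumptions every value lies in [0, 1], the box center scores highest, and the shrunk map leaves a wider zero-valued gap between neighboring boxes.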
3.2 Network Design
3.2.1 Generator and discriminator
We use a U-shaped network structure to fuse the features from down-sampling and up-sampling step by step. This strategy has been validated in many previous scene text detection methods such as [1, 35] and [2]. We use ResNet-50 [3] as the backbone of DGST, and the feature maps of {Conv2_x, Conv3_x, Conv4_x, Conv5_x} are combined by up-sampling.
From an input image, five levels of the feature maps are combined to generate the final feature maps. With the help of the discriminator, our generator outputs a two-channel feature map with the same scale as the input image, representing the soft text score maps under different shrink factors respectively. Therefore, the feature extraction task of traditional text detection is transformed into a feature image generation task.
The original picture is combined with the corresponding text score maps of different shrink factors as the input of the discriminator, and the discriminator determines whether the input text score map is a labeled ground truth image or an imitation produced by the generator.
A more detailed network structure is shown in Fig.4. We use bilinear interpolation instead of deconvolution to avoid the chessboard effect. The green and blue tables in the figure are the network structure of the generator’s feature extraction and fusion phase respectively, and the orange table is the network structure of our discriminator.
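To make the fusion order concrete, the following sketch does only shape bookkeeping. It assumes the standard ResNet-50 stage strides and channel counts, a top-down path that upsamples by a factor of 2 and adds the next shallower map, and a discriminator input formed by channel-wise concatenation of the RGB image and the two score maps. The concatenation and all function names are assumptions, and the channel alignment needed before each addition is omitted.

```python
# Standard ResNet-50 stage outputs: (name, stride relative to the input, channels).
RESNET50_STAGES = [
    ("conv2_x", 4, 256),
    ("conv3_x", 8, 512),
    ("conv4_x", 16, 1024),
    ("conv5_x", 32, 2048),
]


def stage_shapes(height, width):
    """Return {stage: (channels, h, w)} for an input whose sides are multiples of 32."""
    return {name: (ch, height // s, width // s) for name, s, ch in RESNET50_STAGES}


def fusion_schedule(height, width):
    """List the top-down steps: upsample the deeper map by 2, then add the skip map."""
    shapes = stage_shapes(height, width)
    names = [name for name, _, _ in RESNET50_STAGES]
    steps = []
    for deeper, skip in zip(reversed(names), reversed(names[:-1])):
        _, h, w = shapes[deeper]
        _, sh, sw = shapes[skip]
        # A x2 bilinear upsampling makes the spatial sizes match before the add.
        assert (h * 2, w * 2) == (sh, sw)
        steps.append((deeper, skip, (sh, sw)))
    return steps


def discriminator_input_channels(image_channels=3, score_maps=2):
    """Image and score maps stacked along the channel axis (assumed layout)."""
    return image_channels + score_maps


for deeper, skip, size in fusion_schedule(640, 640):
    print(f"upsample {deeper} x2, add {skip}, result size {size}")
```

For a 640×640 crop the fused map reaches one quarter of the input resolution after the last addition, so further upsampling is still needed to produce an output at the input scale.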
Fig. 4: Network structure of the proposed method. The upsampling operation is done through bilinear interpolation directly. Feature maps from different stages are fused through a cascade of upsampling and add operations. Each table entry denotes a convolution layer by its number of convolution kernels X and its kernel size.
3.2.2 Loss function
Traditional GANs are trained through an alternating game between the generator and the discriminator. Their loss functions are as follows:
In order to obtain a more accurate score map, we use the following two measures to further strengthen the generator on the basis of the traditional GAN structure:
1. cGAN is used instead of traditional GAN structure. Input pictures are added as a restriction, so that the output of the generator is restricted by input pictures, and more reasonable result images can be obtained. The loss function is as follows:
2. On the basis of GAN loss, the traditional loss function L2-loss is introduced to optimize the predicted text score map, which makes the generated text score map not only deceive the discriminator but also perform better in the sense of traditional loss.
The final loss function is as follows:
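For orientation, the objectives referred to in this subsection can be written in the notation commonly used in the GAN literature. This is a sketch under stated assumptions rather than the exact form used for DGST: x is the input image, y its ground-truth score map, z a noise vector, and the weight \lambda on the L2 term is an assumed hyperparameter.

```latex
% Standard GAN objective: y is a real sample, z a noise vector.
\min_G \max_D \; \mathcal{L}_{GAN}(G, D) =
  \mathbb{E}_{y}\left[\log D(y)\right] + \mathbb{E}_{z}\left[\log\left(1 - D(G(z))\right)\right]

% Conditional GAN objective: both networks are conditioned on the input image x.
\mathcal{L}_{cGAN}(G, D) =
  \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right]

% L2 term between the ground-truth and the generated score map.
\mathcal{L}_{L2}(G) = \mathbb{E}_{x,y}\left[\lVert y - G(x) \rVert_2^2\right]

% Combined objective; \lambda balances the adversarial and L2 terms (assumed).
G^{*} = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L2}(G)
```

In this form the discriminator pushes the generated score map toward the distribution of labeled maps, while the L2 term ties each prediction to its own ground truth.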
Fig.5 shows the text score map (1) generated by our DGST at different epochs. As the number of iterations increases, the text score map produced by the generator increasingly approximates the given ground truth and filters out more of the noise in the background.
3.3 Text boxes extraction
Fig.6 shows the overall flow of our post-processing method. Two text score maps with different shrink factors are obtained from the generator, and the corresponding text boxes in Fig.6 (c) and Fig.6 (d) can be obtained by directly analyzing the connected components of score maps in Fig.6 (b). It can be seen that there is a cohesion problem in non-shrinking score map, and the shrinking score map can better extract text box spacing information, but it will lose some text information.
Fig. 5: Text score maps generated in different epochs (contraction factor is 0).
Fig. 6: An illustration of extracting text location information from score maps. (a) Original input image. (b) Score maps of different contraction factors generated by DGST. (c) (d) The connected component analysis results of images in (b). (e) The binary result obtained by fusing the two maps in (b). (f) The final result of text detection.
Therefore, we combine the two score maps from the generator to get a more complete image as shown in Fig.6 (e), and expand the text boxes from Fig.6 (e) under the constraint of the text boxes in Fig.6 (c), so that the edge can surround the whole text area completely. The final text box position is shown in Fig.6 (f). More specific processes are shown in Algorithm 1:
In Algorithm 1, one pair of symbols denotes the two score maps with different shrink factors, and a second pair denotes their respective binary images. Here we introduce a parameter t to threshold the score maps; we choose t = 0.25 in our experiments. Relevant operations such as thresholding and connected components analysis can be implemented with the corresponding functions provided by OpenCV.
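The post-processing described above can be sketched as follows. This is a simplified, dependency-free illustration and not Algorithm 1 itself: it uses axis-aligned bounding boxes, a plain breadth-first search as a stand-in for the OpenCV connected components routine, and an assumed expansion rule that undoes the 0.2 shrink and clips the result to the enclosing box from the unshrunk map. All function names are hypothetical.

```python
from collections import deque


def threshold(score_map, t=0.25):
    """Binarize a score map (list of rows) at threshold t."""
    return [[1 if v > t else 0 for v in row] for row in score_map]


def component_boxes(binary):
    """Bounding boxes (x0, y0, x1, y1) of 4-connected foreground components."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not binary[sy][sx] or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue = deque([(sx, sy)])
            x0 = x1 = sx
            y0 = y1 = sy
            while queue:
                x, y = queue.popleft()
                x0, x1, y0, y1 = min(x0, x), max(x1, x), min(y0, y), max(y1, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if 0 <= nx < w and 0 <= ny < h and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def contains(outer, inner):
    """True if box `inner` lies inside box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def expand_within(inner, outer, factor=0.2):
    """Undo a shrink of `factor` on `inner`, clipped to the enclosing box `outer`."""
    x0, y0, x1, y1 = inner
    # A shrunk side is (1 - factor) of the original, so grow by this margin per side.
    dx = (x1 - x0 + 1) * factor / (1.0 - factor) / 2.0
    dy = (y1 - y0 + 1) * factor / (1.0 - factor) / 2.0
    return (max(outer[0], round(x0 - dx)), max(outer[1], round(y0 - dy)),
            min(outer[2], round(x1 + dx)), min(outer[3], round(y1 + dy)))


def extract_text_boxes(score_full, score_shrunk, t=0.25, factor=0.2):
    """Separate touching regions with the shrunk map, then grow each box back."""
    full_boxes = component_boxes(threshold(score_full, t))
    results = []
    for inner in component_boxes(threshold(score_shrunk, t)):
        outer = next((b for b in full_boxes if contains(b, inner)), inner)
        results.append(expand_within(inner, outer, factor))
    return results
```

The shrunk map decides how many instances there are, and the unshrunk map bounds how far each instance may grow, which mirrors the roles the two maps play in Fig.6.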
To verify the effectiveness of the proposed method for scene text detection, we compare the performance of DGST with existing methods on several standard benchmarks: ICDAR2013, ICDAR2015, ICDAR2017 and MSRA-TD500. The experimental results show that our method achieves results better than or comparable to state-of-the-art methods.
4.1 Datasets
ICDAR2013 (IC13) [8] was released during the ICDAR 2013 Robust Reading Competition for focused scene text detection. The ICDAR2013 dataset is a subset of the ICDAR2011 dataset. It contains 462 images, comprising 229 images for the training set and 233 images for the test set. This dataset only contains text in English. The annotations are at the word level using rectangular boxes.
ICDAR2015 (IC15) [9] was introduced in the ICDAR 2015 Robust Reading Competition for incidental scene text detection. 1,500 of the images have been made publicly available, split between a training set of 1,000 images and a test set of 500, both with text in English. The annotations are at the word level using quadrilateral boxes.
ICDAR2017 (IC17) [21] was introduced in the ICDAR 2017 robust reading challenge on multi-lingual scene text detection, consisting of 9000 training images and 9000 testing images. The dataset is composed of widely variable scene images which contain text of one or more of 9 languages representing 6 different scripts. The number of images per script is equal. The text regions in IC17 are annotated by the 4 vertices of quadrilaterals, as in ICDAR2015.
MSRA-TD500 (TD500) [31] contains 500 natural images, which are split into 300 training images and 200 testing images, collected both indoors and outdoors using a pocket camera. The images contain English and Chinese scripts. Text regions are annotated by rotated rectangles.
4.2 Evaluation protocol
We use the standard evaluation protocol to measure the performance of detectors in terms of precision, recall, and F-measure. They are defined as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 × Precision × Recall / (Precision + Recall)
where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. A detected text instance T is considered a correct detection if its IoU with a ground-truth text instance is greater than a given threshold (usually set to 0.5). Because of the trade-off between recall and precision, the F-measure is commonly used as a single compromise measure for performance evaluation.
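As a minimal illustration of this protocol, the sketch below computes IoU and the three measures for axis-aligned boxes with a greedy one-to-one matching. The official benchmark scripts differ in detail (for example, quadrilateral IoU and their own matching rules), so the function names and the matching strategy here are simplifying assumptions.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def evaluate(detections, ground_truths, iou_threshold=0.5):
    """Greedy one-to-one matching; returns (precision, recall, f_measure)."""
    unmatched = list(ground_truths)
    tp = 0
    for det in detections:
        # Pick the still-unmatched ground truth that overlaps this detection most.
        best = max(unmatched, key=lambda gt: iou(det, gt), default=None)
        if best is not None and iou(det, best) > iou_threshold:
            unmatched.remove(best)
            tp += 1
    fp = len(detections) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


ground_truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0, 0, 10, 9), (50, 50, 60, 60), (21, 20, 30, 30)]
precision, recall, f_measure = evaluate(detections, ground_truths)
```

In this toy case two of the three detections match a ground-truth box, so precision is 2/3, recall is 1, and the F-measure is 0.8.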
4.3 Implementation details
DGST is implemented in the PyTorch framework and run on a server with a 2.10GHz CPU, a GTX 1080Ti GPU, and a 64-bit Ubuntu OS. The layers of our generator are initialized with the backbone model (ResNet-50) pretrained on ImageNet [25]. We choose minibatch SGD and apply the Adam solver [11] with learning rate 0.0002.
When experimenting on a specific dataset, the training set is augmented from the existing training samples in the following ways: (1) Each image is randomly scaled so that its length or width lies between 640 and 2560, and the original aspect ratio is maintained. (2) Each training image is randomly rotated by one of four angles [0, 90, 180, 270]. (3) 640×640 regions are randomly cropped from the scaled image (pure-background crops do not exceed 30% of the total number of samples). For the other methods in Tables 2, 3, 4 and 5, we directly use the experimental results reported in the original papers to compare with our results.
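The three augmentation steps can be sketched as a parameter sampler. The sketch only draws the scale, rotation and crop window; it assumes that the longer side is the one resized into the 640 to 2560 range, and it leaves out both the actual image operations and the 30% limit on pure-background crops. The function name `sample_augmentation` is hypothetical.

```python
import random


def sample_augmentation(width, height, rng, crop=640, lo=640, hi=2560):
    """Sample scale, rotation and crop parameters for one training image.

    Assumption: the longer side is resized to a random length in [lo, hi].
    """
    target = rng.uniform(lo, hi)
    scale = target / max(width, height)
    # The same scale is applied to both sides, so the aspect ratio is kept.
    new_w, new_h = round(width * scale), round(height * scale)
    angle = rng.choice([0, 90, 180, 270])
    if angle in (90, 270):
        # A quarter turn swaps width and height.
        new_w, new_h = new_h, new_w
    # Sides shorter than the crop size would need padding, which is omitted here.
    cw, ch = min(crop, new_w), min(crop, new_h)
    x0 = rng.randint(0, new_w - cw)
    y0 = rng.randint(0, new_h - ch)
    return {"size": (new_w, new_h), "angle": angle,
            "crop": (x0, y0, x0 + cw, y0 + ch)}


rng = random.Random(0)
params = sample_augmentation(1280, 720, rng)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which helps when comparing training runs.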
4.4 Ablation Experiments
We use the evaluation indicators in Section 4.2 and compare different network structures on the ICDAR15 test set. Table 1 summarizes the experimental results.
Our baseline is a U-net structure with ResNet50 as the backbone network, and uses cross-entropy loss to train a binary text score map. On this basis, we compare the effects of soft text representation and the discriminator training strategy on detector performance.
Table 1: Results on the ICDAR15 test set under different model configurations and training strategies.
In our ablation experiment, except for the differences mentioned in the first column of Table 1, the model structure and training strategy of each experiment are exactly the same as the baseline. Among them, DGST is our final detector structure, which combines the two strategies of soft text score map and GAN loss on top of the baseline.
From Table 1, we can see that using the soft text score map proposed in Section 3 instead of the traditional binary text score map significantly improves the detection results. For the pixel-level segmentation task, richer classification information helps distinguish text pixels from non-text pixels inside the annotation box, which improves the pixel classification accuracy and therefore yields more accurate detection results. In the meantime, similar to many semantic segmentation tasks, we use the conditional generative adversarial training strategy instead of the traditional cross-entropy loss to train the generator, so that the classification results increasingly approximate the designed ground truth images, which also improves the final pixel classification accuracy. Our final detector, DGST, combines the advantages of these two improvements and achieves the best result on the test set.
4.5 Comparison with Other Methods
Table 2: Comparison with other results on ICDAR 2013.
Table 3: Comparison with other results on ICDAR 2015.
Table 4: Comparison with other results on ICDAR 2017.
Table 5: Comparison with other results on MSRA-TD500.
In order to evaluate the effectiveness of the proposed method, we conducted experiments on the datasets mentioned in subsection 4.1. The proposed method is compared with other state-of-the-art detection algorithms in Recall, Precision, and F-score. Tables 2, 3, 4 and 5 show the experimental results on the IC13, IC15, IC17, and MSRA-TD500 datasets respectively. From the results in the tables, we can see that our method achieves the state-of-the-art level on the four datasets and performs well in each evaluation index.
ICDAR2017: IC17 contains a large number of scene text images in different languages. We use the training set and validation set to finetune the model pre-trained on ImageNet, and train for 200 epochs to get the final detector. When testing the model, we resize the longer side of the images in the test set to 2560 and reach an F-measure of 74.8%. The specific results are shown in Table 4.
Fig. 7: Some failure cases of the proposed method.
ICDAR2015: The images in IC15 and IC17 are similar and contain many small text line instances. Therefore, we use the training set of IC15 to finetune the model from IC17 for 80 epochs, so as to achieve better detection results. For testing, we resize the image to 2240 on the long side for a single-scale test, and the final F-measure is 87.1%. The specific results are shown in Table 3.
ICDAR2013: Similar to IC15, for IC13 we also finetune the model from IC17 to get a better detector. Because the text regions in these images are large, in the testing process we resize the image to 960 on the long side for a single-scale test and get the state-of-the-art result (F-measure is 87.1% as shown in Table 2).
MSRA-TD500: TD500 contains both Chinese and English text, and the annotation boxes are line-level annotations. The blank areas between words are often included in the text boxes. So instead of finetuning the IC17 pre-trained model, we train on TD500 separately, which enables the generator to generate text score maps in line form. When testing, the long side of each test image is resized to 1600 for a single-scale test. The results are shown in Table 5.
Among the datasets above, IC13 and IC15 contain only English text, while the IC17 and TD500 datasets contain text in multiple languages. Experimental results show that our algorithm detects text well across multiple languages, rotation angles, lengths, and text arrangements.
Compared with two-stage detectors, the semantic segmentation based detectors do not train additional classifiers to precisely filter the obtained text areas, so some noise is introduced into the detection results. Our detection results may contain some noise because we try to retain smaller characters. Fig.7 shows some failure cases.
Fig.8 shows some detection results of the proposed DGST. It can be seen that the proposed method achieves promising detection results for text detection tasks in different scenarios. It is robust to changes in illumination, background and scale, and can detect Chinese and English words effectively. At the same time, because our detector is based on pixel-level classification, it is resistant to interference from tilted and deformed text. This is also illustrated in Fig.5.
In this paper, we propose a novel scene text detector, DGST, which is based on the strategy of generative adversarial networks. Considering scene text detection as a special image transformation task, we introduce the idea of game theory, regard text feature extraction network as a text score image generator, and design a discriminator to identify the generated image, so that the generator can approach the labeled image step by step. In the meantime, we optimize the design of the text score image, weaken the influence of edge pixels and avoid the learning confusion problem caused by background pixels in the annotated text boxes. The experimental results on four public datasets show that our method is effective and robust.
Possible directions for future work include: (1) Explore whether the post-processing part can be replaced by a learnable network structure to reduce the use of empirical parameters. (2) Design an end-to-end text recognition system by combining our DGST detector and a robust text recognition system.
This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 71621002 and the Key Programs of the Chinese Academy of Sciences under Grant No. ZDBS-SSW-JSC003, No. ZDBS-SSW-JSC004 and No. ZDBS-SSW-JSC005.
1. Baek Y, Lee B, Han D, Yun S, Lee H (2019) Character region awareness for text detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 9365–9374
2. Deng D, Liu H, Li X, Cai D (2018) Pixellink: Detecting scene text via instance segmentation. In: Thirty-Second AAAI Conference on Artificial Intelligence
3. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Fig. 8: Qualitative results of the proposed algorithm. (a) ICDAR 2017. (b) ICDAR 2015. (c) ICDAR 2013. (d) MSRA-TD500.
4. He P, Huang W, He T, Zhu Q, Qiao Y, Li X (2017) Single shot text detector with regional attention. In: Proceedings of the IEEE International Conference on Computer Vision, pp 3047–3055
5. He W, Zhang XY, Yin F, Liu CL (2017) Deep direct regression for multi-oriented scene text detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp 745–753
6. Hoffman J, Tzeng E, Park T, Zhu JY, Isola P, Saenko K, Efros A, Darrell T (2018) Cycada: Cycle-consistent adversarial domain adaptation. In: Proceedings of the 35th International Conference on Machine Learning
7. Jiang Y, Zhu X, Wang X, Yang S, Li W, Wang H, Fu P, Luo Z (2017) R2cnn: Rotational region cnn for orientation robust scene text detection. arXiv preprint arXiv:1706.09579
8. Karatzas D, Shafait F, Uchida S, Iwamura M, i Bigorda LG, Mestre SR, Mas J, Mota DF, Almazan JA, De Las Heras LP (2013) Icdar 2013 robust reading competition. In: 2013 12th International Conference on Document Analysis and Recognition, IEEE, pp 1484–1493
9. Karatzas D, Gomez-Bigorda L, Nicolaou A, Ghosh S, Bagdanov A, Iwamura M, Matas J, Neumann L, Chandrasekhar VR, Lu S, et al. (2015) Icdar 2015 competition on robust reading. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), IEEE, pp 1156–1160
10. Kim T, Cha M, Kim H, Lee JK, Kim J (2017) Learning to discover cross-domain relations with generative adversarial networks. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR. org, pp 1857–1865
11. Kingma D, Ba J (2014) Adam: A method for stochastic optimization. Computer Science
12. Kingma DP, Welling M (2014) Auto-encoding variational bayes. stat 1050:10
13. Liao M, Shi B, Bai X (2018) Textboxes++: A single-shot oriented scene text detector. IEEE transactions on image processing 27(8):3676–3690
14. Liu MY, Breuel T, Kautz J (2017) Unsupervised image-to-image translation networks. In: Advances in neural information processing systems, pp 700–708
15. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: Single shot multibox detector. In: European conference on computer vision, Springer, pp 21–37
16. Liu X, Liang D, Yan S, Chen D, Qiao Y, Yan J (2018) Fots: Fast oriented text spotting with a unified network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5676–5685
17. Long S, Ruan J, Zhang W, He X, Wu W, Yao C (2018) Textsnake: A flexible representation for detecting text of arbitrary shapes. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 20–36
18. Lyu P, Liao M, Yao C, Wu W, Bai X (2018) Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 67–83
19. Lyu P, Yao C, Wu W, Yan S, Bai X (2018) Multi-oriented scene text detection via corner localization and region segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7553–7563
20. Ma J, Shao W, Ye H, Wang L, Wang H, Zheng Y, Xue X (2018) Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia 20(11):3111–3122
21. Nayef N, Yin F, Bizid I, Choi H, Feng Y, Karatzas D, Luo Z, Pal U, Rigaud C, Chazalon J, et al. (2017) Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), IEEE, vol 1, pp 1454–1459
22. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788
23. Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 91–99
24. Rezende DJ, Mohamed S, Wierstra D (2014) Stochastic backpropagation and approximate inference in deep generative models. In: International Conference on Machine Learning, pp 1278–1286
25. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
26. Taigman Y, Polyak A, Wolf L (2016) Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200
27. Tian Z, Huang W, He T, He P, Qiao Y (2016) Detecting text in natural image with connectionist text proposal network. In: European conference on computer vision, Springer, pp 56–72
28. Van Oord A, Kalchbrenner N, Kavukcuoglu K (2016) Pixel recurrent neural networks. In: International Conference on Machine Learning, pp 1747–1756
29. Wang W, Xie E, Li X, Hou W, Lu T, Yu G, Shao S (2019) Shape robust text detection with progressive scale expansion network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 9336–9345
30. Xu Y, Wang Y, Zhou W, Wang Y, Yang Z, Bai X (2019) Textfield: Learning a deep direction field for irregular scene text detection. IEEE Transactions on Image Processing
31. Yao C, Bai X, Liu W, Ma Y, Tu Z (2012) Detecting texts of arbitrary orientations in natural images. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp 1083–1090
32. Yao C, Bai X, Sang N, Zhou X, Zhou S, Cao Z (2016) Scene text detection via holistic, multichannel prediction. arXiv preprint arXiv:1606.09002
33. Yi Z, Zhang H, Tan P, Gong M (2017) Dualgan: Unsupervised dual learning for image-to-image translation. In: Proceedings of the IEEE international conference on computer vision, pp 2849–2857
34. Zhang Z, Zhang C, Shen W, Yao C, Liu W, Bai X (2016) Multi-oriented text detection with fully convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4159–4167
35. Zhou X, Yao C, Wen H, Wang Y, Zhou S, He W, Liang J (2017) East: an efficient and accurate scene text detector. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp 5551–5560
36. Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp 2223–2232