Style Transfer and Extraction for the Handwritten Letters Using Deep Learning

2018·Arxiv

ABSTRACT

How can we learn, transfer and extract handwriting styles using deep neural networks? This paper explores these questions using a deep conditioned autoencoder on the IRON-OFF handwriting dataset. We perform three experiments that systematically explore the quality of our style extraction procedure. First, we compare our model to handwriting benchmarks using multidimensional performance metrics. Second, we explore the quality of style transfer, i.e. how the model performs on new, unseen writers. In both experiments, we improve on the metrics of state-of-the-art methods by a large margin. Lastly, we analyze the latent space of our model, and we see that it consistently separates writing styles.

I. INTRODUCTION

One aspect of a successful human-machine interface (e.g. human-robot interaction, chatbots, speech, handwriting ...) is the ability to have a personalized interaction. This affects the overall human experience, and allows for a more fluent interaction. At the moment, a lot of work uses machine learning to model such interactions. However, most of these models do not address the issue of personalized behavior: they try to average over the different examples from different people in the training set. Identifying the human styles during training and inference opens the possibility of biasing the model’s output to take the human preference into account. In this paper, we focus on the problem of styles in the context of handwriting.

However, defining and extracting handwriting styles is a challenging problem, since there is no formal definition for these styles (i.e. it is an ill-posed problem). A style is both social (it depends on the writer’s training, especially at middle school) and idiosyncratic (it depends on the writer’s shaping, such as letter roundness, sharpness, size and slope, and on the force distribution across time). To add to the problem, until recently there were no metrics to assess the quality of handwriting generation.

There are two questions: what is the task itself, and what is the style used to achieve this task? In handwriting, the task space is well defined (i.e. which letter we want to write), which allows us to focus on the second question: extracting the styles used to achieve this task.

In this paper, we address the problem of style extraction by using a conditioned temporal deep autoencoder model. The conditioning is on the letter identity. We use an autoencoder because we know of no explicit way to evaluate the quality of handwriting styles other than using them to generate handwriting and evaluating this generation. [1] introduced benchmarks and evaluation metrics in order to assess the quality of generated handwritten letters. In comparison to those benchmarks and metrics, we achieve higher performance, while extracting a meaningful latent space. We also hypothesize that the latent space of styles is generic, i.e. that it will generalize over unseen writers, thus achieving a “transfer of style”. To test this hypothesis, we assess our model on 30 new writers. We compare the tracings generated by this model to a benchmark model already proposed for online handwriting generation. In addition, we explore the latent space of our model for each letter separately. This reveals that there is a limited number of ’unique’ styles per letter, categorical as well as continuous. We report our analysis for some of the letters, since a full analysis is out of the scope of this paper. Thus, our contributions in this paper are the following:

• We test and compare our deep conditioned autoencoder with the state-of-the-art benchmarks. We show that this model greatly improves the generation performance over a state-of-the-art benchmark model.

• We experiment with style transfer on new writers using this model, and we show that it achieves much better results than the benchmark model.

• Finally, and maybe most interestingly, we further analyze the latent space extracted from our model to show that there is a limited number of styles for each letter and that the style manifold is not a continuous space.

II. RELATED WORK

A. Generative models

Recent advances in deep learning [2] architectures and optimization methods led to remarkable results in the area of generative models. For static data, like images, the mainstream research builds on the advances in Variational Autoencoders [3] and Generative Adversarial Networks [4].

For generating sequences, the problem is more difficult: the model generates one frame at a time, and the final result must be coherent over long sequences. Recent recurrent neural network architectures, like Long Short-Term Memory (LSTM) [5] and Gated Recurrent Units (GRU) [6], [7], achieve unprecedented performance in handling long sequences.

These architectures have been used in many applications, like learning language models [8], [9], image captioning [10], [11], music generation [12] and speech synthesis [13].

Attention has also been dedicated to using these powerful tools in order to extract a meaningful latent space. One such work that inspired the investigation in this paper is [14]. In their work, they investigated the problem of sketch drawing [15] using a Variational Autoencoder. The latent space that emerged encoded meaningful semantic information about these drawings. In our work, we use a similar architecture, without the variational part, and show similar behaviour.

B. Data Representation

For handwriting, a continuous coordinate representation (e.g. continuous X, Y) seems the natural option. However, generating continuous data is not straightforward. Traditionally, in neural networks, when we want to output a continuous value, a simple linear or Tanh activation function is used in the output layer of the neural network.

However, Bishop [16] studied the limitations of these functions and showed that they cannot model rich distributions. In particular, when one input can have multiple valid outputs (one-to-many), these functions will average over all the outputs. He proposed the use of a Gaussian Mixture Model (GMM) as the final activation function of a neural network. The combination of neural networks and GMMs is called a Mixture Density Network (MDN). Training consists of optimizing the GMM parameters (means, covariances). Inference is done by sampling from the GMM distribution.

To simplify the process, and to focus our study on the investigation of styles, we extract two features from the tracings: directions and speed (explained in section III-B), and we quantize these features. Thus, we can model each point in the letter tracings as a categorical distribution, and use a simple SoftMax function as the output of the network, which is much simpler than an MDN. This was inspired by the studies done in [13], [17], which report impressive results on originally continuous data, using a suitable quantization policy. A categorical distribution is more flexible and generic than continuous ones.

C. Evaluation metrics

The objective evaluation of a generative model is a challenging task, since there is no consensus on objective evaluation metrics. In many cases, a subjective evaluation is performed to overcome this problem. For handwriting of Chinese letters, [18] proposed two metrics:

• Content accuracy: They train an evaluator model on the ground truth data, and use it to recognize the letters produced by their generator. This approach, however, faces important problems. The evaluator is trained on ground truth data, which results in a classification error, Eeval. We call the error of the generator Egen. When the evaluator is exposed to data coming from the generator, it faces a new source/distribution of errors that it has never been exposed to before, which changes the evaluator’s error behavior (a new error, distinct from Eeval). Thus, there is no guarantee that the result of the evaluator is faithful in this case. It is also not possible to deduce Egen from just knowing Eeval, since the model performance in this case is unknown.

• Style discrepancy: In [19], the authors performed image style transfer: take an image, and transform it to the style of an artist. In order to evaluate the quality of the transfer, they measured the correlation between different filter activations (in a convolutional neural network) at one layer, which serves as the style representation. While this metric is interesting to explore, it is not directly applicable to our case, since it assumes the use of a convolutional neural network.

[1] also addressed the problem of evaluating handwriting generation. They used the BLEU score [20] (a metric widely used in text translation and image captioning) and the End of Sequence (EoS) analysis. They showed that these metrics correlate with the quality of the generated letter. The BLEU score is global: all frames of the generated sequence contribute to the final score. It is used to compare segments of generated traces with the ground truth. Depending on the number of grams chosen, the BLEU score can compare larger segments, thus giving us different levels of granularity to assess the quality of the generated samples. The EoS is a simple yet important style feature. Some letters take longer to write (e.g. those written using many strokes, like H or E) than other letters (e.g. those written with one stroke, like O or C). It is also an idiosyncratic feature of the writer: writers have different writing speeds, depending on age, education or cognitive/peripheral disorders.

III. DATASET AND PRE-PROCESSING

A. Dataset

In this study, we use the IRON-OFF Cursive Handwriting Dataset [21]. This dataset provides us with isolated letters, thus allowing us to focus on the problem of styles with a limited number of strokes per item, unlike other handwriting datasets such as IAM Handwriting Database [22]. To summarize this dataset:

• Around 700 writers in total. We use the 412 writers who have written isolated letters.

• 10,685 isolated lower case letters, 10,679 isolated upper case letters, 4,086 isolated digits and 410 euro signs.

• The gender, handedness (left or right handed), age and nationality of the writers.

• For each example (letter, digit, euro sign), we have that example’s image (with a size of around 167x214 pixels and a resolution of 300 dpi) and the timed pen movement sequence, comprising continuous X, Y and pen pressure, and also the discrete pen state. This data is sampled at 100 points per second on a Wacom UltraPad A4.

We focused on the uppercase letters only, and we did not use the pen state or the pen pressure. The idea was to limit the number of possible style factors, so that we can better study them. One challenging issue with this dataset, however, is that we have only one example for each writer-letter combination. This makes the task more difficult, because it is hard to extract a writer’s style using very few items (the 26 letters per writer in this case).

B. Pre-processing

The letter tracings have been cleaned by removing points related to false starts or corrections, as well as extra strokes. Tracings longer than 1 second, i.e. with more than 99 time steps, have also been removed. Such tracings are quite rare, and keeping them would significantly degrade the performance of our model.

We represent each letter tracing by two features: directions and speed. Each feature is quantized into 16 levels and represented as a one-hot encoded vector.
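As a rough illustration of this quantization step, the sketch below bins a speed value into 16 levels and one-hot encodes the resulting level. The uniform bin edges, the `max_value` cap and the function names are assumptions made for illustration; the quantization policy actually used for the speed feature is not specified here.

```python
def quantize(value, max_value, levels=16):
    """Map a value in [0, max_value] to an integer level in [0, levels - 1].

    Uniform binning is an assumption; values outside the range are clipped.
    """
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    clipped = min(max(value, 0.0), max_value)
    level = int(clipped / max_value * levels)
    return min(level, levels - 1)  # value == max_value falls in the last bin


def one_hot(level, levels=16):
    """Return a one-hot list of length `levels` with a 1 at index `level`."""
    if not 0 <= level < levels:
        raise ValueError("level out of range")
    return [1 if i == level else 0 for i in range(levels)]


# Hypothetical speed of 3.2 units per time step, capped at 8.0 units.
speed_level = quantize(3.2, max_value=8.0)
speed_vector = one_hot(speed_level)
```

With two features quantized this way, each time step can be represented by a pair of 16-dimensional one-hot vectors.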

Freeman codes [23] are used in order to encode the direction feature. They belong to a family of compression algorithms called chain codes. These algorithms have proved useful for encoding an image with connected components. They can reduce a sparse matrix to a sequence of codes that takes just a small fraction of the size of the image. Thus, they are used as compression algorithms as well.

Freeman codes can be N-directional (where N is the number of directions), depending on the needed resolution. The scheme is quite simple, as it encodes each direction with a unique number from 0 to N-1. A direction is defined as the directed vector connecting two neighbouring pixels on the contour of a connected component in the image.

We compute the change of directions between three consecutive points. Then, we map this change to its corresponding Freeman code number, as shown in figure 1. Last, we transform the direction number into a one-hot encoding, and use this as input to our network. We also quantize the speed of each displacement.

Fig. 1: Example of the Freeman code representation for 8 directions. Each direction is given a unique number.
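The direction encoding can be sketched as follows. This is a minimal illustration, not the preprocessing code used for the experiments: it assigns a chain code to each displacement between consecutive points, it assumes that code 0 points along +x and that codes increase counter-clockwise, and the function names are hypothetical. How the three-point change of direction is derived from these displacements is not detailed in the text, so it is left out.

```python
import math


def freeman_code(dx, dy, n_directions=8):
    """Return the chain code (0 .. n_directions - 1) nearest to vector (dx, dy).

    Convention assumed here: code 0 points along +x and codes increase
    counter-clockwise, with y pointing up.
    """
    if dx == 0 and dy == 0:
        raise ValueError("zero displacement has no direction")
    sector = 2 * math.pi / n_directions
    return round(math.atan2(dy, dx) / sector) % n_directions


def encode_trace(points, n_directions=16):
    """Encode each displacement between consecutive points of a tracing."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if (x0, y0) != (x1, y1):  # skip repeated points (pen did not move)
            codes.append(freeman_code(x1 - x0, y1 - y0, n_directions))
    return codes


# A small square traced counter-clockwise, using 8 directions.
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
codes = encode_trace(square, n_directions=8)  # [0, 2, 4, 6]
```

On a tablet whose y axis points down, the same code would number the directions clockwise, so the convention has to be fixed once for the whole dataset.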

IV. MODEL ARCHITECTURE

The model architecture is illustrated in figure 3. The input/output frames of the model are detailed in figure 2. The trace of the letter is first fed to the encoder module. The final hidden state of that module summarizes the letter. In order to allow this module to focus on learning the style embedding, we complement this last hidden state with the one-hot encoding of the letter identity, and use a projection of the two as the bias input to the generator. Thus, we decouple the task space (the letter) from the style space: the encoder is freed from the need to learn the letter identity, and can focus on learning additional information that enables the generator to better approximate the ground truth tracings. In the decoder, we follow the framework proposed by [11] in order to bias the model: we create an extra time step at the beginning, which carries the information we want to bias the model with. In this case, this time step is the projection of the encoder’s last hidden state and the letter encoding. This projection has a much lower dimension than the encoder hidden state (the hyperparameters are discussed in section IV-A). This further encourages the model to learn only the necessary style information, as suggested in [24].

Fig. 2: Input sequence to our model. The first time step contains the information necessary to condition/bias our model. In case of the encoder, this first time step (the bias) is not included.

A. Hyper-parameter tuning

We ran a random hyper-parameter search over a wide range of parameters (learning rate, size and number of layers for the encoder and the decoder, dropout percentage, etc.). GRU layers [6], [7] are used in this model. We use the Adam [25] optimizer in this work, with a fixed learning rate. The implementation is done using the PyTorch framework [26].

In order to allow for faster exploration of different hyperparameters, we use early stopping with a patience of 20 epochs (training stops if no improvement happens during these epochs). To summarize, the current model specifications:

Fig. 3: Schematic diagram of the model we used. During training (3a), the input to the model is always the ground truth. During inference (3b), however, the input to the decoder (generator) part at each time step is its own prediction from the previous time step.

B. Training

The encoder and the decoder are trained to model the next time step in the sequence, $x_t$, given the previous time steps, or in other words $P(x_t \mid x_1, \dots, x_{t-1})$, where $x_t$ is the tracing point at time $t$, and $T$ is the length of the input sequence (see figure 2). To achieve this, the model is given the ground truth points $x_1, \dots, x_{T-1}$ as input and is asked to output the sequence $x_2, \dots, x_T$.

The model is trained to minimize the negative log likelihood loss of the correct point at each time step. For each feature (speed and Freeman codes), it is calculated as in equation 1. The final loss is the average loss of the two features, as in equation 2.

In these equations, $x^{g}_{t}$ is the next time step generated/predicted by the model, $x_t$ is the ground truth input at the current time step $t$, and $h_t$ is the hidden state of the GRU at the current time step. Thus, during training, the model is exposed only to the ground truth data as input.
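For reference, a standard way to write these two losses is sketched below. The averaging over time steps, the feature index $f$ and the loss names $\mathcal{L}_f$ and $\mathcal{L}$ are assumptions made for illustration, not a transcription of equations 1 and 2. In the model, the conditioning on the history $x_1, \dots, x_t$ is carried by the current input and the GRU hidden state $h_t$.

```latex
% Per-feature negative log likelihood (f is either speed or Freeman code)
\mathcal{L}_{f} = -\frac{1}{T-1} \sum_{t=1}^{T-1}
    \log P\left(x^{f}_{t+1} \mid x_{1}, \dots, x_{t}\right)

% Final loss: average over the two features
\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{\text{speed}} + \mathcal{L}_{\text{freeman}}\right)
```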

C. Inference

To sample from the model, we used the softmax sampling strategy: the output of the network is turned into two multinomial distributions (one for the Freeman codes and the other for the speed). We then sample the next time step from these two distributions. We can control the level of randomness of the sampling using a temperature parameter for the softmax function. We tried different temperatures, and we found that a value of 0.5 achieves the best results. The generation continues until Nmax time steps, which is 100 time steps in our case.
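The sampling step can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: the logits are made-up values standing in for the network outputs, and the function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=0.5):
    """Turn raw outputs into probabilities, sharpened when temperature < 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_step(freeman_logits, speed_logits, temperature=0.5, rng=random):
    """Sample one (freeman_code, speed_level) pair from the two distributions."""
    sampled = []
    for logits in (freeman_logits, speed_logits):
        probs = softmax_with_temperature(logits, temperature)
        sampled.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return tuple(sampled)


# Hypothetical logits for 16 Freeman codes and 16 speed levels.
rng = random.Random(0)
freeman_logits = [0.1 * i for i in range(16)]
speed_logits = [1.0 if i == 3 else 0.0 for i in range(16)]
code, speed = sample_step(freeman_logits, speed_logits, temperature=0.5, rng=rng)
```

A temperature below 1 concentrates probability mass on the most likely codes, while a temperature of 1 samples from the unmodified softmax output.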

V. EVALUATION METRICS

Evaluation is a challenging problem when using generative models. We want metrics to capture the distance between the generated and the ground truth distributions. Following the work done in [1], we use the same two evaluation metrics in our model:

• BLEU score [20]: It is a well known metric for evaluating text generation applications, like image captioning [10], [11] and machine translation [9]. Since we discretized the letter drawings, it fits nicely within our work. The general intuition is the following: if we take a segment from the generated letter, did this segment occur in the ground truth letter? We keep doing this for segments of increasing length (the length of the segment here is the number of grams used in the BLEU score). For our work, we report the results on segments from 1 to 3 time steps. Each part of the letter has two parallel segments, Freeman codes and speed; thus, we report the BLEU score for both of them. The equation to compute the BLEU score is the following:

where: G is the set of all generated sequences, N is the total number of N-grams we want to consider, CountClipped is the clipped N-gram count (if the number of N-grams in the generated sequence is larger than in the reference sequence, the count is limited to the number in the reference sequence only), LR is the length of the reference sequence, and LG is the length of the generated sequence. The min term is added in order to penalize short generated sequences (shorter than the reference sequence), which would otherwise deceptively achieve high scores.

• End of Sequence: The length of the letter is another aspect of the style. The distribution of lengths in the generated examples should follow that of the ground truth examples. In order to perform this analysis, we compute the Pearson correlation coefficient between the generated examples and the ground truth data.
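A BLEU-style score over discretized tracings can be sketched as below. This is an illustration of the general technique rather than the exact formula used for the reported results: the geometric mean over n-gram orders and the length penalty written as min(1, LG/LR) are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(sequence, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def bleu(generated, references, max_n=3):
    """BLEU-style score over paired lists of generated and reference sequences."""
    log_precisions = []
    for n in range(1, max_n + 1):
        clipped, total = 0, 0
        for gen, ref in zip(generated, references):
            gen_counts = Counter(ngrams(gen, n))
            ref_counts = Counter(ngrams(ref, n))
            # Clip each n-gram count to its count in the reference.
            clipped += sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
            total += sum(gen_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    len_gen = sum(len(g) for g in generated)
    len_ref = sum(len(r) for r in references)
    brevity = min(1.0, len_gen / len_ref)  # assumed form of the length penalty
    return brevity * math.exp(sum(log_precisions) / max_n)


# Hypothetical Freeman code sequences for one letter.
reference = [0, 0, 2, 2, 4, 4]
too_short = [0, 0, 2]
score_same = bleu([reference], [reference])
score_short = bleu([too_short], [reference])
```

In this example every n-gram of the short sequence also occurs in the reference, so only the length penalty lowers its score, which is the behaviour the min term is meant to enforce.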

VI. EXPERIMENTS AND RESULTS

A. Letter generation with style preservation

The objective here is to compare the quality of the generated letters to the state-of-the-art benchmarks. As mentioned earlier, we compare using the BLEU score metric and the EoS analysis. The BLEU score results can be seen in table I, and the results of the EoS analysis are in table III. Our model achieves a BLEU-3 score of 32.3% on the Speed feature and 38.7% on the Freeman feature, compared to 25.1% and 28.3% respectively for the benchmark model.

The same goes for the EoS analysis. Comparing the Pearson coefficient, our model achieves a score of 0.99, compared to 0.55 for the benchmark model (the highest possible score is 1.0). This supports the claim that our model captures the style of handwriting better than the benchmark.
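One way to compute such a coefficient is sketched below. It assumes that the correlation is taken between the histograms of sequence lengths of the ground truth and generated sets; this reading of the EoS analysis, the sample lengths and the function names are assumptions made for illustration.

```python
import math


def length_histogram(lengths, max_len=100):
    """Count how many sequences end at each time step from 1 to max_len."""
    hist = [0] * max_len
    for length in lengths:
        if not 1 <= length <= max_len:
            raise ValueError("length outside 1..max_len")
        hist[length - 1] += 1
    return hist


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sequence lengths (in time steps) for a handful of letters.
truth_lengths = [20, 20, 35, 35, 35, 60]
generated_lengths = [20, 35, 35, 35, 60, 60]
score = pearson(length_histogram(truth_lengths),
                length_histogram(generated_lengths))
```

A coefficient close to 1.0 means the generated letters end at about the same time steps, and about as often, as the ground truth letters.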

Examples for the generated letters can be found in figure 13.

B. Style transfer

One of the hypotheses we want to test is whether a limited number of styles is enough to generalize over new writers. For this to hold, the learned representation of styles should capture generic information about the styles.

In order to test this hypothesis, we expose our model to 30 writers that it has not seen before. We compare our model’s performance on these writers with that of a model that is biased by the writer and letter identities (the benchmark model). The latter model was not prevented from seeing those writers (thus, the reported results of the comparison overestimate the actual performance of that model).

The BLEU scores can be seen in table II. Our model achieves BLEU-3 scores of 32.2% and 42.1% on the Speed and Freeman code features, compared to 25.3% and 27.7% respectively for the benchmark model.

The EoS analysis can be seen in table IV. Our model achieves a coefficient value of 0.99, compared to 0.5 for the benchmark. Thus, the new model clearly outperforms the current benchmark on the transfer task, on both the BLEU score and the EoS analysis.

C. Styles per letters

One of the nice consequences of using our model is that we can have a better look at the styles. We explore the latent space for multiple letters, and see that we can uncover interesting writing styles. A full scale analysis is beyond the scope of this paper. We project the latent space using Principal Components Analysis (PCA) [27] and t-SNE [28].

As a start, we take a look at letter X. Beforehand, we identified a style feature in letter X: some writers draw X clockwise, and some draw it anti-clockwise. We manually annotated the whole dataset for this feature; the result can be seen in figure 4. Almost half of the writers draw the letter X clockwise, and the other half draw it anti-clockwise. If our assumption is correct, our model should be able to capture this feature. We project the latent space of the model using PCA for all instances of the letter X, which can be seen in figure 5. The model latent space clusters almost perfectly based on rotation. Examples of letters from both clusters are in figure 6.

Encouraged by the results on letter X, we explored more letters. For letter C, the latent space projection is shown in figure 7. It can be seen that there are at least two main clusters. Examples from the cluster in the red ellipse are in figure 9. The indicated cluster represents the Edwardian handwriting style. The rest of the writers (in the big cluster) have a very similar style (this is expected, since the drawing of the letter C is quite simple).

For letter A, our model’s latent space creates two main clusters (figure 8). We give examples from those two clusters in figure 10, where we can see a clear difference in style. Some people start drawing the letter from the bottom left; other writers start from the top of letter A, move down, then continue drawing the letter.

Another example is the bottleneck (latent space) for letter S, shown in figure 11. There are three resulting clusters, which we investigated. The indicated cluster (in red) is clearly different from the other two clusters (not indicated). Examples can be seen in figure 12. The indicated cluster is again for people with the Edwardian handwriting style. We did not find a clear difference between the other two clusters, but this is an expected outcome of using t-SNE (since it does not have the explicit objective of clustering styles).

These examples show that we can use our model to extract verbose style information.

TABLE I: BLEU scores for different models for known writers.

TABLE II: BLEU scores for different models for style extraction for 30 new writers (style transfer).

TABLE III: Pearson correlation coefficients for the End-Of-Sequence (EoS) distributions for the different models on the normal generation scenario.

TABLE IV: Pearson correlation coefficients for the End-Of-Sequence (EoS) distributions for the different models on 30 new writers (style transfer).

Fig. 4: Results of the manual annotation for the rotation of letter X drawings over the whole dataset. Almost half the writers drew X clockwise, the other half anti-clockwise. Drawings marked as undefined could not be clearly assigned to either rotation.

VII. CONCLUSIONS

In this paper, we explored the concept of handwriting styles, using a deep neural network paradigm. We have approached the problem systematically. First, we compared our generation results to the benchmark reported in the state of the art on this problem, and we showed that our model outperforms the benchmark. Second, we explored the ability to perform style transfer, by testing the model’s performance on 30 new writers. We hypothesize that there is a limited number of style components that describe handwriting, and a good style extraction model should generalize well to new writers. Last, we analyzed the latent space of our model for multiple letters, and showed that the model separates the different styles into different clusters.

Fig. 5: Projection of the latent space for letter X using PCA. The colors show the ground truth of the X rotation: blue is counter clockwise, orange is clockwise, and the few red points are undefined.

VIII. FUTURE WORK

Based on the results of the latent space analysis, our next objective is to build a latent space structure and an objective function that disentangle the style manifold. So far, we used multiple projection techniques in order to explore the style information in the latent space. We would like this structure to emerge on its own in the latent space. This step is usually known as Knowledge Restructuring, and it enables us to address several interesting questions, like:

• What are all the different styles available for different letters?

• Can we use the styles from those different letters to build a footprint for each writer (i.e. style embedding for the writer)? If so, how good is this embedding in learning to generate letters using it as a prior knowledge only?

• If we have a discrete number of styles for each letter, can we predict a writer’s style on one letter given the other letters, and what is the contribution of the other letters in identifying the style of the writer?

Also, in this study, we focused only on the upper case letters. We intend to expand our evaluation to include the rest of the dataset (lowercase and digits).

Fig. 6: Examples of writing of letter X. The starting point is marked with the blue mark. Each row is randomly sampled from one cluster in the bottleneck. The clusters show that almost half the writers draw the letter clockwise (first row, first cluster), and the other half draw it anti-clockwise (second row, second cluster).

Fig. 7: Projection of the latent space for letter C using t-SNE. The cluster surrounded by the red circle has a clear interpretation, where writers have a cursive style.

Fig. 8: Projection of the latent space for letter A using PCA.

Fig. 9: Examples of writing of letter C from the selected cluster (first row) versus the rest of the letter drawings (second row). The starting point is marked with the blue mark. The drawings from the selected cluster show people with the Edwardian style of handwriting.

Fig. 10: Examples of writing of letter A from the selected clusters. The starting point is marked with the blue mark. Each row is from one cluster. The first row shows people who start drawing the letter from the top, go down, and then continue drawing the letter. The second row shows people who start drawing directly from the bottom.

Fig. 11: Projection of the latent space for letter S using t-SNE. We manage to interpret the indicated cluster as the Edwardian style of drawing. The other two clusters (not indicated) did not show a clear difference in style, but this is expected behavior when using the t-SNE algorithm, since it does not try to cluster styles as an objective.

Fig. 12: Examples of writing of letter S from the selected cluster (first row) versus the other two clusters (second row). The starting point is marked with the blue mark. The drawings from the selected cluster are always in the Edwardian style.

REFERENCES

[1] O. Mohammed, G. Bailly, and D. Pellier, “Handwriting styles: benchmarks and evaluation metrics,” in First International Workshop on Deep and Transfer Learning - Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), (Valencia, Spain), IEEE, 2018.

[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[3] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.

[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, pp. 2672–2680, 2014.

[5] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[6] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.

[7] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.

[8] I. Sutskever, J. Martens, and G. E. Hinton, “Generating text with recurrent neural networks,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1017–1024, 2011.

[9] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, (Cambridge, MA, USA), pp. 3104–3112, MIT Press, 2014.

[10] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137, 2015.

[11] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pp. 3156–3164, IEEE, 2015.

[12] J.-P. Briot and F. Pachet, “Music generation by deep learning-challenges and directions,” arXiv preprint arXiv:1712.04371, 2017.

[13] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.

[14] D. Ha and D. Eck, “A neural representation of sketch drawings,” CoRR, vol. abs/1704.03477, 2017.

[15] Google, “The quick, draw! dataset,” 2017.

[16] C. M. Bishop, “Mixture density networks,” 1994.

[17] A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, pp. 1747–1756, JMLR.org, 2016.

[18] B. Chang, Q. Zhang, S. Pan, and L. Meng, “Generating handwritten chinese characters using cyclegan,” CoRR, vol. abs/1801.08624, 2018.

[19] L. A. Gatys, A. S. Ecker, and M. Bethge, “A neural algorithm of artistic style,” CoRR, vol. abs/1508.06576, 2015.

[20] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318, Association for Computational Linguistics, 2002.

[21] C. Viard-Gaudin, P. M. Lallican, S. Knerr, and P. Binter, “The ireste on/off (ironoff) dual handwriting database,” in Document Analysis and Recognition, 1999. ICDAR ’99. Proceedings of the Fifth International Conference on, pp. 455–458, Sep 1999.

[22] U.-V. Marti and H. Bunke, “A full english sentence database for offline handwriting recognition,” in Document Analysis and Recognition, 1999. ICDAR’99. Proceedings of the Fifth International Conference on, pp. 705–708, IEEE, 1999.

[23] H. Freeman, “On the encoding of arbitrary geometric configurations,” IRE Transactions on Electronic Computers, vol. 2, pp. 260–268, 1961.

[24] R. J. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, “Towards end-to-end prosody transfer for expressive speech synthesis with tacotron,” CoRR, vol. abs/1803.09047, 2018.

[25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[26] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS-W, 2017.

[27] I. Jolliffe, “Principal component analysis,” in International encyclopedia of statistical science, pp. 1094–1096, Springer, 2011.

[28] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.

Fig. 13: Examples of generated letters. The blue mark is the starting point. The traces in green are the ground truth, and the traces in red are those generated by our model.
