Sketch representation and interpretation remain an open challenge, particularly for complex and casually constructed drawings. Yet, the ability to classify, search, and manipulate sketched content remains attractive as gesture and touch interfaces reach ubiquity. Advances in recurrent network architectures within language processing have recently inspired sequence modeling approaches to sketch (e.g. SketchRNN [12]) that encode a sketch as a variable length sequence of strokes, rather than in a rasterized or ‘pixel’ form. In particular, long short-term memory (LSTM) networks have shown significant promise in learning search embeddings [32, 5] due to their ability to model higher-level structure and temporal order versus convolutional neural networks (CNNs) on rasterized sketches [3, 18, 6, 22]. Yet, the limited temporal extent of LSTM restricts the structural complexity of sketches that may be accommodated in sequence embeddings. In the language modeling domain, this shortcoming has been addressed through the emergence of Transformer networks [8, 7, 28] in which slot masking enhances the ability to learn longer term temporal structure in the stroke sequence.
This paper proposes Sketchformer, the first Transformer based network for learning a deep representation for free-hand sketches. We build on the language modeling Transformer architecture of Vaswani et al. [28] to develop several variants of Sketchformer that process sketch sequences in continuous and tokenized forms. We evaluate the efficacy of each learned sketch embedding for common sketch interpretation tasks. We make three core technical contributions: 1) Sketch Classification. We show that Sketchformer driven by a dictionary learning tokenization scheme outperforms state of the art sequence embeddings for sketched object recognition over QuickDraw! [19], the largest and most diverse public corpus of sketched objects. 2) Generative Sketch Model. We show that for more complex, detailed sketches comprising lengthy stroke sequences, Sketchformer improves generative modeling of sketch, demonstrated by higher fidelity reconstruction of sketches from the learned embedding. We also show that for sketches of all complexities, interpolation in the Sketchformer embedding is stable, generating more plausible intermediate sketches for both inter- and intra-class blends. 3) Sketch based Image Retrieval (SBIR). We show that Sketchformer can be unified with a raster embedding to produce a search embedding for SBIR (after [5] for LSTM) to deliver improved precision over a large photo corpus (Stock10M).
These enhancements to sketched object understanding, generative modeling and matching demonstrated for a diverse and complex sketch dataset suggest Transformer as a promising direction for stroke sequence modeling.
Representation learning for sketch has received extensive attention within the domain of visual search. Classical sketch based image retrieval (SBIR) techniques explored spectral, edge-let based, and sparse gradient features, the latter building upon the success of dictionary learning based models (e.g. bag of words) [26, 1, 23]. With the advent of deep learning, convolutional neural networks (CNNs) were rapidly adopted to learn search embeddings [35]. Triplet loss models are commonly used for visual search in the photographic domain [29, 20, 11], and have been extended to SBIR. Sangkloy et al. [22] used a three-branch CNN with triplet loss to learn a general cross-domain embedding for SBIR. Fine-grained (within-class) SBIR was similarly explored by Yu et al. [34]. Qi et al. [18] instead use contrastive loss to learn correspondence between sketches and pre-extracted edge maps. Bui et al. [2, 3] perform cross-category retrieval using a triplet model and combined their technique with a learned model of visual aesthetics [31] to constrain SBIR using aesthetic cues in [6]. A quadruplet loss was proposed by [25] for fine-grained SBIR. The generalization of sketch embeddings beyond training classes has also been studied [4, 15], and parameterized for zero-shot learning [9]. Such concepts were later applied in sketch-based shape retrieval tasks [33]. Variants of CycleGAN [36] have also been shown to be useful as generative models for sketch [13]. Sketch-A-Net was a seminal work for sketch classification that employed a CNN with large convolutional kernels to accommodate the sparsity of stroke pixels [34]. Recognition of partial sketches has also been explored by [24]. Wang et al. [30] proposed sketch classification by sampling unordered points of a sketch image to learn a canonical order.
All the above works operate over rasterized sketches, i.e. converting the captured vector representation of sketch (as a sequence of strokes) to pixel form, discarding temporal order of strokes, and requiring the network to recover higher level spatial structure. Recent SBIR work has begun to directly input vector (stroke sequence) representations for sketches [21], notably SketchRNN, an LSTM based sequence to sequence (seq2seq) variational autoencoder proposed by Eck et al. [12] and trained on the largest public sketch corpus, QuickDraw! [19]. The SketchRNN embedding was incorporated in a triplet network by Xu et al. [32] to search for sketches using sketches. A variation using cascaded attention networks was proposed by [14] to improve vector sketch classification over Sketch-A-Net. Later, LiveSketch [5] extended SketchRNN to a triplet network to perform SBIR over tens of millions of images, harnessing the sketch embedding to suggest query improvements and guide the user via relevance feedback. The limited temporal scope of LSTM based seq2seq models can prevent such representations from modeling long, complex sketches, a problem mitigated by our Transformer based model which builds upon the success shown by such architectures for language modeling [28, 7, 8]. Transformers encode long term temporal dependencies by modeling direct connections between data units. The temporal range of such dependencies was increased via the Transformer-XL [7] and BERT [8], which recently set new state-of-the-art performance on sentence classification and sentence-pair regression tasks using a cross-encoder. Recent work explores Transformers beyond sequence modeling, for 2D images [17]. Our work is the first to apply these insights to the problem of sketch modeling, incorporating the Transformer architecture of Vaswani et al. [28] to deliver a multi-purpose embedding that exceeds the state of the art for several common sketch representation tasks.
We propose Sketchformer, a multi-purpose sketch representation learned from stroke sequence input. In this section we discuss the pre-processing steps, the adaptations made to the core architecture proposed by Vaswani et al. [28] and the three application tasks.
3.1. Pre-processing and Tokenization
Following Eck et al. [12] we simplify all sketches using the RDP algorithm [10] and normalize stroke length. Sketches for all our experiments are drawn from QuickDraw50M [19] (see Sec. 4 for dataset partitions).

1) Continuous. QuickDraw50M sketches are released in the ‘stroke-3’ format, where each point $(\delta x, \delta y, p)$ stores its relative position to the previous point together with its binary pen state. To also include the ‘end of sketch’ state, the stroke-5 format is often employed: $(\delta x, \delta y, p_1, p_2, p_3)$, where the pen states $p_1$ (draw), $p_2$ (lift) and $p_3$ (end) are mutually exclusive [12]. Our experiments with continuous sketch modeling use the ‘stroke-5’ format.

2) Dictionary learning. We build a dictionary of K code words (K = 1000) to model the relative pen motion, i.e. $(\delta x, \delta y)$. We randomly sample 100k sketched pen movements in the training set for clustering via K-means. We allocate 20% of sketch points for sampling inter-stroke transitions, i.e. relative transitions when the pen is lifted, to balance with the more common within-stroke transitions. Each transition point $(\delta x, \delta y)$ is then assigned to the nearest code word, resulting in a sequence of discrete tokens. We also include 4 special tokens: a Start of Sketch (SOS) token at the beginning of every sketch, an End of Sketch (EOS) token at the end, a Stroke End Point (SEP) token inserted between strokes (indicating pen lifting) and a padding (PAD) token to pad the sketch to a fixed length.

3) Spatial Grid. The sketch canvas is first quantized into $n \times n$ square cells, and each cell is represented by a token in our dictionary. Given the absolute (x, y) sketch points, we determine which cell contains each point and assign the cell’s token to the point. The same four special tokens above are used to complete the sketch sequence.
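The relationship between the two continuous formats can be illustrated with a minimal sketch. The function name `stroke3_to_stroke5` and the `pad_to` argument are hypothetical, and the convention of appending an explicit end-of-sketch row and padding with further end rows is an assumption modeled on common practice for this format, not a description of our released code.

```python
def stroke3_to_stroke5(stroke3, pad_to=None):
    """Convert stroke-3 rows (dx, dy, pen_lifted) to stroke-5 rows (dx, dy, p1, p2, p3).

    p1 = draw, p2 = lift, p3 = end; exactly one of the three is set per row.
    """
    rows = []
    for dx, dy, lifted in stroke3:
        p2 = 1 if lifted else 0
        rows.append((dx, dy, 1 - p2, p2, 0))
    # Assumed convention: close the sequence with an explicit end-of-sketch row.
    end_row = (0, 0, 0, 0, 1)
    rows.append(end_row)
    if pad_to is not None:
        if len(rows) > pad_to:
            raise ValueError("sketch is longer than pad_to")
        # Assumed convention: pad to a fixed length with further end rows.
        rows.extend([end_row] * (pad_to - len(rows)))
    return rows
```

For instance, a three-point sketch padded to length 5 yields three draw/lift rows followed by two end rows, and every row has exactly one active pen state.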
Figure 1. Schematic of the Transformer architecture used for Sketchformer, which utilizes the original architecture of Vaswani et al. [28] but modifies it with an alternate mechanism for formulating the bottleneck (sketch embedding) layer using a self-attention block, as well as configuration changes e.g. MHA head count (see Sec. 3.2).

Fig. 2 visualizes sketch reconstruction under the tokenized methods to explore sensitivity to the quantization parameters. Compared with the stroke-5 format (continuous), the tokenization methods are more compact. Dictionary learned tokenization (Tok-Dict) can have a small dictionary size and is invariant to translation since it is derived from stroke-3. On the other hand, quantization error could accumulate over longer sketches if the dictionary size is too low, shifting the position of strokes closer to the sequence’s end. The spatial grid based tokenization method (Tok-Grid), on the other hand, does not accumulate error but is sensitive to translation and yields a larger vocabulary ($n^2$ tokens).
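The two tokenization schemes can be sketched as follows. This is an illustrative sketch only: the token ids chosen for the special tokens, the unit-square canvas, the row-major cell numbering and all function names are assumptions, and the code book would in practice come from K-means rather than being supplied by hand.

```python
# Hypothetical ids for the four special tokens; real ids may differ.
PAD, SOS, EOS, SEP = 0, 1, 2, 3
NUM_SPECIAL = 4

def grid_token(x, y, n):
    """Tok-Grid: token of the cell containing absolute point (x, y) on a unit canvas of n x n cells."""
    col = min(int(x * n), n - 1)  # clamp points lying on the far edge
    row = min(int(y * n), n - 1)
    return NUM_SPECIAL + row * n + col

def dict_token(dx, dy, codebook):
    """Tok-Dict: token of the code word nearest (squared Euclidean) to the relative motion (dx, dy)."""
    nearest = min(
        range(len(codebook)),
        key=lambda k: (codebook[k][0] - dx) ** 2 + (codebook[k][1] - dy) ** 2,
    )
    return NUM_SPECIAL + nearest

def tokenize_strokes(strokes, point_to_token, max_len):
    """Wrap per-point tokens with SOS/EOS, place SEP between strokes and pad with PAD."""
    tokens = [SOS]
    for i, stroke in enumerate(strokes):
        if i > 0:
            tokens.append(SEP)
        tokens.extend(point_to_token(p) for p in stroke)
    tokens.append(EOS)
    if len(tokens) > max_len:
        raise ValueError("token sequence is longer than max_len")
    return tokens + [PAD] * (max_len - len(tokens))
```

Under these assumptions a grid of n cells per side needs n * n cell tokens plus the special tokens, whereas the dictionary variant needs only K plus the special tokens.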
3.2. Transformer Architecture for Sketch
Sketchformer uses the Transformer network of Vaswani et al. [28]. We add stages (e.g. self-attention and modified bottleneck) and adapt parameters in their design to learn a multi-purpose representation for stroke sequences, rather than language. A transformer network consists of encoder and decoder blocks, each comprising several layers of multihead attention followed by a feed forward network. Fig. 1 illustrates the architecture with dotted lines indicating re-use of architecture stages from [28]. In Fig. 4 we show how our learned embedding is used across multiple applications. Compared to [28] we use 4 MHA blocks versus 6 and a feed-forward dimension of 512 instead of 2048. Unlike traditional sequence modeling methods (RNN/LSTM), which learn the temporal order of current time steps from previous steps (or future steps in bidirectional encoding), the attention mechanism in transformers allows the network to decide which time steps to focus on to improve the task at hand. Each multihead attention (MHA) layer is formulated as follows:

$$\mathrm{SHA}(k, q, v) = \mathrm{softmax}(\alpha\, q k^{T})\, v$$

$$\mathrm{MHA}(k, q, v) = [\mathrm{SHA}_0(kW^k_0, qW^q_0, vW^v_0), \ldots, \mathrm{SHA}_m(kW^k_m, qW^q_m, vW^v_m)]\, W^0$$

where k, q and v are the respective Key, Query and Value inputs to the single head attention (SHA) module. This module computes the similarity between pairs of Query and Key features, normalizes those scores and finally uses them as a projection matrix for the Value features. The multihead attention (MHA) module concatenates the output of multiple single heads and projects the result to a lower dimension. $\alpha$ is a scaling constant and $W^{k,q,v}_{0 \ldots m}$ and $W^0$ are learnable weight matrices.
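The attention computation described above can be written out in a few lines. The following is a minimal, unoptimized sketch using plain Python lists; the helper names (`matmul`, `sha`, `mha`) are hypothetical, and the default scaling of one over the square root of the key dimension is an assumption taken from Vaswani et al. [28].

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sha(k, q, v, alpha=None):
    """Single head attention: softmax(alpha * q k^T) v."""
    if alpha is None:
        alpha = 1.0 / math.sqrt(len(k[0]))  # assumed scaling constant
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # Query/Key similarity
    weights = [softmax([alpha * s for s in row]) for row in scores]
    return matmul(weights, v)  # weights project the Value features

def mha(k, q, v, heads, w_out):
    """Multihead attention: concatenate per-head outputs, then project with w_out.

    heads is a list of (Wk, Wq, Wv) weight triples, one per head.
    """
    outs = [sha(matmul(k, wk), matmul(q, wq), matmul(v, wv)) for wk, wq, wv in heads]
    concat = [sum((o[t] for o in outs), []) for t in range(len(q))]
    return matmul(concat, w_out)
```

Each row of attention weights sums to one, so every output time step is a convex combination of the Value rows before the final projection.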
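The MHA output is fed to a positional feed forward network (FFN), which consists of two fully connected layers with ReLU activation. The MHA-FFN (F(.)) blocks are the basis of the encoder side of our network (E(.)):

$$\mathrm{FFN}(x) = \max(0,\; xW^f_1 + b^f_1)\, W^f_2 + b^f_2$$

$$F(x) = \overline{\mathrm{FFN}(\overline{\mathrm{MHA}(x, x, x)})}$$

$$E(x) = F_N(\ldots(F_1(x)))$$

where $\overline{X}$ indicates layer normalization over $X$ and $N$ is the number of MHA-FFN units F(.).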
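The two ingredients of an MHA-FFN unit beyond attention, the feed forward network and layer normalization, can be sketched per time step as below. This is a simplified illustration: the function names are hypothetical, the layer normalization omits the learnable gain and bias that a practical implementation would normally include, and `compose` merely shows how N units would be chained.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one feature vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ffn(x, w1, b1, w2, b2):
    """Two fully connected layers with a ReLU in between, applied to one time step."""
    hidden = [
        max(0.0, sum(xi * w1[i][j] for i, xi in enumerate(x)) + b1[j])
        for j in range(len(b1))
    ]
    return [
        sum(hj * w2[j][k] for j, hj in enumerate(hidden)) + b2[k]
        for k in range(len(b2))
    ]

def compose(blocks, x):
    """Apply a list of blocks in order, i.e. blocks[-1](...blocks[0](x))."""
    for block in blocks:
        x = block(x)
    return x
```

Because the FFN acts on each time step independently, the sequence length is unchanged by every unit in the stack.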
The decoder takes as inputs the encoder output and the target sequence in an auto-regressive fashion. In our case we are learning a transformer autoencoder, so the target sequence is also the input sequence shifted forward by 1:

$$G(h, x^*) = \overline{\mathrm{FFN}(\overline{\mathrm{MHA}_2(h, \overline{\mathrm{MHA}_1(x^*, x^*, x^*)}, h)})}$$

$$D(h, x^*) = G_N(h, G_{N-1}(h, \ldots G_1(h, x^*)))$$

where $h$ is the encoder output and $x^*$ is the shifted auto-regressive version of the input sequence $x$.

Figure 2. Visualizing the impact of quantization on the reconstruction of short, median and long sequence length sketches. Grid sizes of n = [10, 100] (Tok-Grid) and dictionary sizes of K = [500, 1000] (Tok-Dict). Sketches are generated from the tokenized sketch representations, independent of transformer.

Figure 3. t-SNE visualization of the learned embedding space from Sketchformer’s three variants, compared to LiveSketch (left) and computed over the QD-862k test set; 10 categories and 1000 samples were randomly selected.
The conventional transformer is designed for language translation and thus does not provide a feature embedding as required in Sketchformer (the output of E is also a sequence of vectors of the same length as x). To learn a compact representation for sketch we propose to apply self-attention on the encoder output, inspired by [27]:

$$s = \mathrm{softmax}(\tanh(hK + b)\, v), \qquad z = \sum_i s_i h_i$$

which is similar to SHA, however the Key matrix K, Value vector v and bias b are now trainable parameters. This self-attention layer learns a weight vector s describing the importance of each time step in sequence h, which is then accumulated to derive the compact embedding z. On the decoder side, z is passed through a FFN to resume the original shape of h. These are the key novel modifications to the original Transformer architecture of Vaswani et al. (beyond the above-mentioned parameter changes).
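The bottleneck can be illustrated with a small sketch of attention pooling over the encoder output. The additive tanh scoring form used here is an assumption made for illustration, as are the function and argument names; only the overall behavior (one weight per time step, then a weighted sum) is taken from the description above.

```python
import math

def attention_pool(h, K, b, v):
    """Pool a T x d sequence h into one d-dimensional embedding.

    K is a d x a matrix, b and v are length-a vectors (all trainable in practice).
    Returns (s, z): per-time-step weights and the pooled embedding.
    """
    scores = []
    for h_t in h:
        hidden = [
            math.tanh(sum(x * K[i][j] for i, x in enumerate(h_t)) + b[j])
            for j in range(len(b))
        ]
        scores.append(sum(u * w for u, w in zip(hidden, v)))
    # Softmax over time steps gives the importance weights s.
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    s = [e / total for e in exps]
    # Weighted sum of the encoder outputs gives the compact embedding z.
    z = [sum(s_t * h_t[i] for s_t, h_t in zip(s, h)) for i in range(len(h[0]))]
    return s, z
```

When v is all zeros every time step receives the same weight and z reduces to the mean of h, which is a convenient sanity check for an implementation.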
We also had to change how masking worked on the decoder. The Transformer uses a padding mask to stop attention blocks from giving weight to out-of-sequence points. Since we want a meaningful embedding for reconstruction and interpolation, we removed this mask from the decoder, forcing our transformer to learn reconstruction without previously knowing the sequence length and using only the embedding representation.
3.2.1 Training Losses
We employ two losses in training Sketchformer. A classification (softmax) loss is connected to the sketch embedding z to preserve semantic information, while a reconstruction loss ensures the decoder can reconstruct the input sequence from its embedding. If the input sequence is continuous (i.e. stroke-5), the reconstruction loss consists of a loss term modeling relative transitions $(\delta x, \delta y)$ and a 3-way classification term modeling the pen states. Otherwise the reconstruction loss uses softmax to regularize a dictionary of sketch tokens as per a language model. We found these losses simple yet effective in learning a robust sketch embedding. Fig. 3 visualizes the learned embedding for each of the three pre-processing variants, alongside that of a state of the art sketch encoding model using stroke sequences [5].
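For the continuous case, the two-part reconstruction loss can be sketched as below. The choice of a squared error for the transition term, the equal weighting of the two terms and the averaging over time steps are all assumptions made for illustration; the function name and argument layout are hypothetical.

```python
import math

def continuous_reconstruction_loss(pred_xy, pred_pen_logits, target):
    """Transition regression plus 3-way pen-state cross-entropy, averaged over time steps.

    pred_xy: list of predicted (dx, dy); pred_pen_logits: list of 3 logits per step;
    target: list of stroke-5 rows (dx, dy, p1, p2, p3).
    """
    total = 0.0
    for (px, py), logits, (dx, dy, p1, p2, p3) in zip(pred_xy, pred_pen_logits, target):
        # Assumed squared-error term on the relative transition.
        total += (px - dx) ** 2 + (py - dy) ** 2
        # Softmax cross-entropy on the mutually exclusive pen states.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        true_index = (p1, p2, p3).index(1)
        total += log_norm - logits[true_index]
    return total / len(target)
```

With perfect transitions and uninformative pen logits the loss equals log 3, the entropy of a uniform 3-way guess.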
3.3. Cross-modal Search Embedding
To use our learned embedding for SBIR, we follow the joint embedding approach first presented in [5] and train an auxiliary network that unifies the vector (sketch) and raster (image corpus) representations into a common subspace.
This auxiliary network is composed of four fully connected layers (see Fig. 4) with ReLU activations. These are trained within a triplet framework and have input from three pre-trained branches: an anchor branch that models vector representations (our Sketchformer), plus positive and negative branches extracting representations from raster space.
The first two fully connected layers are domain-specific and we call each set $F_V(.)$ and $F_R(.)$, referring to vector-specific and raster-specific. The final two layers are shared between domains; we refer to this set as $F_S(.)$. Thus the end-to-end mapping from vector sketch and raster sketch/image to the joint embedding is:

$$u_v = F_S(F_V(E(x_v))), \qquad u_r = F_S(F_R(P(x_r)))$$

where $x_v$ and $x_r$ are the input vector sketches and raster images respectively, and $u_v$ and $u_r$ their corresponding representations in the common embedding. E(.) is the network that models vector representations and P(.) is the one for raster images. In the original LiveSketch [5], E(.) is a SketchRNN [12]-based model, while we employ our multitask Sketchformer encoder instead. For P(.) we use the same off-the-shelf GoogLeNet-based network, pre-trained on a joint embedding task (from [4]).
The training is performed using triplet loss regularized with the help of a classification task. Training requires an aligned sketch and image dataset, i.e. a sketch set and image set that share the same category list. This is not the case for QuickDraw, which is a sketch-only dataset without a corresponding image set. Again following [5], we use the raster sketch as a medium to bridge vector sketch with raster image. The off-the-shelf P(.) (from [4]) was trained to produce a joint embedding model unifying raster sketch and raster image. This allowed the authors to train the fully connected layer sets using vector and raster versions of sketch only. By following the same procedure, we eliminate the need for an aligned image set for QuickDraw, as our network never sees an image feature during training.

Figure 4. Schematic showing how the learned sketch embedding is leveraged for sketch synthesis (reconstruction/interpolation), classification and cross-modal retrieval experiments (see encoder/embedding inset, refer to Fig. 1 for detail). Classification appends fully-connected (fc) and softmax layers to the embedding space. Retrieval tasks require unification with a raster (CNN) embedding for images [4] via several fc layers trained via triplet loss.
The training is implemented in two phases. In phase one, the anchor and positive samples are vector and raster forms of random sketches in the same category, while the raster input of the negative branch is sampled from a different category. In phase two, we sample hard negatives from the same category as the anchor vector sketch and choose the raster form of the exact instance of the anchor sketch for the positive branch. The triplet loss maintains a margin between the anchor-positive and anchor-negative distances:

$$\mathcal{L}(a, p, n) = \max\big(0,\; m + d(a, p) - d(a, n)\big)$$

where $a$, $p$ and $n$ are the anchor, positive and negative samples, $d(\cdot, \cdot)$ is the distance between two samples in the joint embedding, and margin m = 0.2 in phase one, m = 0.05 in phase two.
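A minimal sketch of the margin-based triplet objective follows. The use of Euclidean distance between embedding vectors is an assumption, and the function names are hypothetical; the margins are the two values quoted above.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors (assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

PHASE_ONE_MARGIN = 0.2   # category-level negatives
PHASE_TWO_MARGIN = 0.05  # hard negatives from the anchor's own category
```

The smaller phase-two margin reflects that hard negatives come from the same category as the anchor and are therefore expected to sit closer to it.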
We evaluate the performance of the proposed transformer embeddings for three common tasks: sketch classification, sketch reconstruction and interpolation, and sketch based image retrieval (SBIR). We compare against two baseline sketch embeddings for encoding stroke sequences: SketchRNN [12] (also used for search in [32]) and LiveSketch [5]. We evaluate using sketches from QuickDraw50M [19], and a large corpus of photos (Stock10M).
QuickDraw50M [19] comprises over 50M sketches of 345 object categories, crowd-sourced within a gamified context that encouraged casual sketches drawn at speed. Sketches are often messy and complex in their structure, consistent with tasks such as SBIR. QuickDraw50M captures sketches as stroke sequences, in contrast to earlier raster-based and less category-diverse datasets such as TUBerlin/Sketchy. We sample 2.5M sketches randomly with even class distribution from the public QuickDraw50M training partition to create a training set (QD-2.5M), and use the public test partition of QuickDraw50M (QD-862k), comprising 2.5k sketches per class (2.5k × 345 ≈ 862k sketches), to evaluate our trained models. For SBIR and interpolation experiments we sort QD-862k by sequence length, and sample three datasets (QD345-S, QD345-M, QD345-L) at centiles 10, 50 and 90 respectively to create a set of short, medium and long stroke sequences. Each of these three datasets samples one sketch per class at random from the centile, yielding three evaluation sets of 345 sketches. We sampled an additional query set QD345-Q for use in sketch search experiments, using the same 345 sketches as LiveSketch [5]. The median stroke lengths of QD345-S, QD345-M, QD345-L are 30, 47 and 75 strokes respectively (after simplification via RDP [10]).
Stock67M is a diverse, unannotated corpus of photos used in prior SBIR work [5] to evaluate large-scale SBIR retrieval performance. We sample 10M of these images at random for our search corpus (Stock10M).
4.1. Evaluating Sketch Classification
We evaluate the class discrimination of the proposed sketch embedding by attaching dense and softmax layers to the transformer encoder stage, and training a 345-way classifier on QD-2.5M. Table 1 reports the classification performance over QD-862k for each of the three proposed transformer embeddings, alongside two LSTM baselines – the SketchRNN [12] and LiveSketch [5] variational autoencoder networks. Whilst all transformers outperform the baselines, the tokenized variant of the transformer based on dictionary learning (TForm-Tok-Dict) yields the highest accuracy. We explore this further by shuffling the order of the sketch strokes and retraining the transformer models from scratch. We were surprised to see comparable performance, suggesting this gain is due to spatial continuity rather than temporal information.

Table 1. Sketch classification results over QuickDraw! [19] for three variants of the proposed transformer embedding, contrasting each to models learned from randomly permuted stroke order. Comparing to two recent LSTM based approaches for sketch sequence encoding [5, 12].

Figure 5. Visualization of sketches reconstructed from mean embedding for 3 object categories. We add Gaussian noise with standard deviation $\sigma$ to the mean embedding of three example categories on the QuickDraw test set. The reconstructed sketches of TForm-Tok-Dict retain salient features even with high noise perturbation.

Table 2. User study quantifying accuracy of sketch reconstruction. Preference is expressed by 5 independent workers, and results with > 50% agreement are included. Experiment repeated for short, medium and longer stroke sequences. For longer sketches, the proposed transformer method TForm-Tok-Dict is preferred.
4.2. Reconstruction and Interpolation
We explore the generative power of the proposed embedding by measuring the degree of fidelity with which: 1) encoded sketches can be reconstructed via the decoder to resemble the input; 2) a pair of sketches may be interpolated within, and synthesized from, the embedding. The experiments are repeated for short (QD345-S), medium (QD345-M) and long (QD345-L) sketch complexities. We assess the fidelity of sketch reconstruction and the visual plausibility of interpolations via Amazon Mechanical Turk (MTurk). MTurk workers are presented with a set of reconstructions or interpolations and asked to make a 6-way preference choice: 5 methods and a ‘cannot determine’ option. Each task is presented to five unique workers, and we only include results for which there is > 50% (i.e. more than 2 workers) consensus on the choice.
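The consensus rule used to filter the crowd-sourced preferences can be sketched as follows; the function name and the use of `None` for discarded tasks are illustrative choices, not part of the study protocol.

```python
from collections import Counter

def consensus_choice(votes):
    """Return the option chosen by strictly more than half of the workers, else None."""
    option, count = Counter(votes).most_common(1)[0]
    return option if count > len(votes) / 2 else None
```

With five workers per task, a result is kept only when at least three of them agree on the same option.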
Reconstruction results are shown in Table 2 and favor the LiveSketch [5] embedding for short or medium length strokes, with the proposed tokenized transformer (TForm-Tok-Dict) producing better results for more complex sketches, aided by the improved representational power of the transformer for longer stroke sequences. Fig. 6 provides representative visual examples for each sketch complexity.

Figure 6. Representative sketch reconstructions from each of the five embeddings evaluated in Table 2. (a) Original, (b) SketchRNN, (c) LiveSketch, (d) TForm-Cont, (e) TForm-Tok-Grid and (f) TForm-Tok-Dict. The last row represents a hard-to-reconstruct sketch.
We explore interpolation in Table 3, blending between pairs of sketches within (intra-) class and between (inter-) class. In all cases we encode sketches separately to the embedding, interpolate via slerp (after [12, 5] in which slerp was shown to offer best performance), and decode the interpolated point to generate the output sketch. Fig. 7 provides visual examples of inter- and intra-class interpolation for each method evaluated. In all cases the proposed tokenized transformer (TForm-Tok-Dict) outperforms other transformer variants and baselines, although the performance separation is narrower for shorter strokes, echoing results of the reconstruction experiment. The stability of our representation is further demonstrated via local sampling within the embedding in Fig. 5.
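Spherical linear interpolation (slerp) between two embedding vectors can be sketched as below. The fallback to linear interpolation when the two vectors are (nearly) parallel is a common safeguard and an assumption here; the function name and tolerance are illustrative.

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between vectors a and b for t in [0, 1]."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding outside [-1, 1]
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-8:
        # Vectors are (anti-)parallel: fall back to linear interpolation.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1.0 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]
```

Decoding `slerp(z_a, z_b, t)` for a range of t values between 0 and 1 would then produce the sequence of intermediate sketches, where `z_a` and `z_b` stand for the embeddings of the two input sketches.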
4.3. Cross-modal Matching
We evaluate the performance of Sketchformer for sketch based retrieval of sketches (S-S) and images (S-I).
Sketch2Sketch (S-S) Matching. We quantify the accuracy of retrieving sketches in one modality (raster) given a sketched query in another (vector, i.e. stroke sequence) – and vice-versa. This evaluates the performance of Sketchformer in discriminating between sketched visual structures invariant to their input modality. Sketchformer is trained on QD-2.5M and we query the test corpus QD-862k using QD345-Q as the query set. We measure overall mean average precision (mAP) for both coarse grain (i.e. class-specific) and fine-grain (i.e. instance-level) similarity, computed as the mean of the average precision over all queries. As per [5], for the former we consider a retrieved record a match if it matches the sketched object class. For the latter, exactly the same single sketch must match (in its different modality). To run raster variants, a rasterized version of QD-862k (for vector-to-raster, V-R) and of QD345-Q (for raster-to-vector, R-V) is produced by rendering strokes to a pixel canvas using the CairoSVG Python library. Table 4 shows that for both class and instance level retrieval, the R-V configuration outperforms V-R, indicating a performance gain due to encoding this large search index using the vector representation. In contrast to other experiments reported, the continuous variant of Sketchformer appears slightly preferred, matching higher for early ranked results for the S-S case – see Fig. 9a for category-level precision-recall curve. Although Transformer outperforms RNN baselines by 1-3% in the V-R case the gain is more limited and indeed the performance over baselines is equivocal in the S-S where the search index is formed of rasterized sketches.

Figure 7. Representative sketch interpolations from each of the five embeddings evaluated in Table 3. For each embedding: (first row) inter-class interpolation from ‘birthday cake’ to ‘ice-cream’ and (second row) intra-class interpolation between two ‘birthday cakes’.

Figure 8. Representative visual search results over Stock10M indexed by our proposed embedding (TForm-Tok-Dict) for a vector sketch query. The two bottom rows (‘animal migration’ and ‘tree’) are failure cases.
Sketch2Image (S-I) Matching. We evaluate sketch based image retrieval (SBIR) over the Stock10M dataset of diverse photos and artworks, as such data is commonly indexed for large-scale SBIR evaluation [6, 5]. We compare against the state of the art SBIR algorithms accepting vector (LiveSketch [5]) and raster (Bui et al. [4]) sketched queries. Since no ground-truth annotation is possible for this size of corpus, we crowd-source per-query annotation via Mechanical Turk (MTurk) for the top-k (k=15) results and compute both mAP% and a precision@k curve averaged across all QD345-Q query sketches. Table 5 compares performance of our tokenized variants to these baselines, alongside associated Precision@k curves in Fig. 9b. The proposed dictionary learned transformer embedding (TForm-Tok-Dict) delivers the best performance (visual results in Fig. 8).
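The retrieval metrics can be computed from per-query relevance judgements as sketched below. Each query is represented by a ranked list of 0/1 labels (for SBIR, the annotated top-15 results). Normalizing average precision by the number of relevant items found within the ranked list is an assumption made because the full set of relevant corpus items is unknown; the function names are illustrative.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k ranked results judged relevant."""
    return sum(relevance[:k]) / k

def average_precision(relevance):
    """Mean of precision@rank taken at each rank where a relevant result appears."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(all_relevance):
    """Mean of the per-query average precision over all queries."""
    return sum(average_precision(r) for r in all_relevance) / len(all_relevance)
```

For a ranked list judged `[1, 0, 1, 0]`, relevant results occur at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2.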
Table 3. User study quantifying interpolation quality for a pair of sketches of same (intra-) or between (inter-) classes. Preference is expressed by 5 independent workers, and results with > 50% agreement are included. Experiment repeated for short, medium and longer stroke sequences.
Figure 9. Quantifying search accuracy. a) Sketch2Sketch via precision-recall (P-R) curves for Vector-2-Raster and Raster-2-Vector category-level retrieval. b) Sketch2Image (SBIR) accuracy via precision @ k=[1,15] curve over Stock10M.
Table 4. Quantifying the performance of Sketch2Sketch retrieval under two RNN baselines and three proposed variants. We report category- and instance-level retrieval (mAP%).
Table 5. Quantifying accuracy of Sketchformer for Sketch2Image search (SBIR). Mean average precision (mAP) computed to rank 15 over Stock10M for the QD345-Q query set.
We presented Sketchformer; a learned representation for sketches based on the Transformer architecture [28]. Several variants were explored using continuous and tokenized input; a dictionary learning based tokenization scheme delivers performance gains of 6% on previous LSTM autoencoder models (SketchRNN and derivatives). We showed interpolation within the embedding yields plausible blending of sketches within and between classes, and that reconstruction (auto-encoding) of sketches is also improved for complex sketches. Sketchformer was also shown effective as a basis for indexing sketch and image collections for sketch based visual search. Future work could further explore our continuous representation variant, or other variants with more symmetric encoder-decoder structure. We have demonstrated the potential for Transformer networks to learn a multi-purpose representation for sketch, but believe many further applications of Sketchformer exist beyond the three tasks studied here. For example, fusion with additional modalities might enable sketch driven photo generation [16] using complex sketches, or with a language embedding for novel sketch synthesis applications.
[1] Tu Bui and John Collomosse. Scalable sketch-based image retrieval using color gradient features. In Proc. ICCV Workshops, pages 1–8, 2015.
[2] T. Bui, L. Ribeiro, M. Ponti, and J. Collomosse. Generalisation and sharing in triplet convnets for sketch based visual search. CoRR Abs, arXiv:1611.05301, 2016.
[3] T. Bui, L. Ribeiro, M. Ponti, and J. Collomosse. Compact descriptors for sketch-based image retrieval using a triplet loss convolutional neural network. Computer Vision and Image Understanding (CVIU), 2017.
[4] T. Bui, L. Ribeiro, M. Ponti, and J. Collomosse. Sketching out the details: Sketch-based image retrieval using convolutional neural networks with multi-stage regression. Elsevier Computers & Graphics, 2018.
[5] J. Collomosse, T. Bui, and H. Jin. Livesketch: Query perturbations for guided sketch-based visual search. In Proc. CVPR, pages 1–9, 2019.
[6] J. Collomosse, T. Bui, M. Wilber, C. Fang, and H. Jin. Sketching with style: Visual search with sketches and aesthetic context. In Proc. ICCV, 2017.
[7] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, v.1, pages 4171–4186. Association for Computational Linguistics, 2019.
[9] S. Dey, P. Riba, A. Dutta, J. Llados, and Y. Song. Doodle to search: Practical zero-shot sketch-based image retrieval. In Proc. CVPR, 2019.
[10] David H Douglas and Thomas K Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: the international journal for geographic information and geovisualization, 10(2):112–122, 1973.
[11] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. Deep image retrieval: Learning global representations for image search. In Proc. ECCV, pages 241–257, 2016.
[12] D. Ha and D. Eck. A neural representation of sketch drawings. In Proc. ICLR. IEEE, 2018.
[13] J. Song, K. Pang, Y. Song, T. Xiang, and T. Hospedales. Learning to sketch with shortcut cycle consistency. In Proc. CVPR, 2018.
[14] Lei Li, Changqing Zou, Youyi Zheng, Qingkun Su, Hongbo Fu, and Chiew-Lan Tai. Sketch-r2cnn: An attentive network for vector sketch recognition. arXiv preprint arXiv:1811.08170, 2018.
[15] K. Pang, K. Li, Y. Yang, H. Zhang, T. Hospedales, T. Xiang, and Y. Song. Generalising fine-grained sketch-based image retrieval. In Proc. CVPR, 2019.
[16] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Gaugan: semantic image synthesis with spatially adaptive normalization. In ACM SIGGRAPH 2019 Real-Time Live!, page 2. ACM, 2019.
[17] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran. Image transformer. In Proc. NeurIPS, 2019.
[18] Yonggang Qi, Yi-Zhe Song, Honggang Zhang, and Jun Liu. Sketch-based image retrieval via siamese convolutional neural network. In Proc. ICIP, pages 2460–2464. IEEE, 2016.
[19] The Quick, Draw! Dataset. https://github.com/googlecreativelab/quickdraw-dataset. Accessed: 2018-10-11.
[20] Filip Radenović, Giorgos Tolias, and Ondřej Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In Proc. ECCV, pages 3–20, 2016.
[21] Umar Riaz Muhammad, Yongxin Yang, Yi-Zhe Song, Tao Xiang, and Timothy M Hospedales. Learning deep sketch abstraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8014–8023, 2018.
[22] Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The sketchy database: Learning to retrieve badly drawn bunnies. In Proc. ACM SIGGRAPH, 2016.
[23] Rosália G Schneider and Tinne Tuytelaars. Sketch classification and classification-driven analysis using fisher vectors. ACM Transactions on Graphics (TOG), 33(6):174, 2014.
[24] Omar Seddati, Stephane Dupont, and Saïd Mahmoudi. Deepsketch 2: Deep convolutional neural networks for partial sketch recognition. In 2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI), pages 1–6. IEEE, 2016.
[25] O. Seddati, S. Dupont, and S. Mahmoudi. Quadruplet networks for sketch-based image retrieval. In Proc. ICMR, 2017.
[26] J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in videos. In Proc. ICCV, 2003.
[27] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448, 2015.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Proc. NeurIPS. IEEE, 2017.
[29] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In Proc. CVPR, pages 1386–1393, 2014.
[30] Xiangxiang Wang, Xuejin Chen, and Zhengjun Zha. Sketchpointnet: A compact network for robust sketch recognition. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 2994–2998. IEEE, 2018.
[31] M. Wilber, C. Fang, H. Jin, A. Hertzmann, J. Collomosse, and S. Belongie. Bam! the behance artistic media dataset for recognition beyond photography. In Proc. ICCV, 2017.
[32] P. Xu, Y. Huang, T. Yuan, K. Pang, Y-Z. Song, T. Xiang, and T. Hospedales. Sketchmate: Deep hashing for million-scale human sketch retrieval. In Proc. CVPR, 2018.
[33] Yongzhe Xu, Jiangchuan Hu, Kun Zeng, and Yongyi Gong. Sketch-based shape retrieval via multi-view attention and generalized similarity. In 2018 7th International Conference on Digital Home (ICDH), pages 311–317. IEEE, 2018.
[34] Qian Yu, Feng Liu, Yi-Zhe Song, Tao Xiang, Timothy M Hospedales, and Chen-Change Loy. Sketch me that shoe. In Proc. CVPR, pages 799–807, 2016.
[35] Hua Zhang, Si Liu, Changqing Zhang, Wenqi Ren, Rui Wang, and Xiaochun Cao. Sketchnet: Sketch classification with web images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1105–1113, 2016.
[36] J. Zhu, T. Park, P. Isola, and A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. ICCV, 2017.