2.1 Zero-shot activity recognition
Zero-shot object classification has been approached through cross-modal information transfer between images and language. For example, [14] projected images onto a semantic space, namely word embeddings, and [15] used stacked autoencoders to ground the attribute features of objects in visual inputs. Zero-shot activity recognition, however, is a more recent and less studied task.
Zellers et al. classified verbs in images, instead of objects, using linguistic cues and hand-defined attributes [21]. They were able to predict unseen actions from images with up to 42.17 top-5 accuracy. Youtube2Text mined S/V/O triplets from video captions and tried to predict the triplets from the videos in a hierarchical manner by minimizing syntactic tree distances [5].
Xu et al. performed unseen human activity classification on videos. They use HOG and SIFT features together with language in the form of word embeddings to create a manifold through a transductive approach [19, 20]. However, their setting has access to test labels during training. They test their approach on human activity datasets.
Piergiovanni et al. use sequential frames and sentences and encode them into a common representational space through the temporal attention layers defined in their previous work [13]. Our work is most similar to [13]. However, instead of a temporal attention module, we propose a simpler model in order to test the effectiveness of the auto-encoder network alone, which was not done in previous research. We then examine the effects of different loss functions. [13] uses sentences that describe the events, whereas we focus on the verbs and use only the relevant word vectors. Since we focus on atomic inputs instead of long sentences, we omit the temporal attention. By starting from small inputs and a minimal set of cues, we frame our problem as a distributional mapping problem in order to explore the cross-relations between different modalities. Our inquiry is into finding a mapping that generalizes between the modalities and thereby allows zero-shot class prediction.
2.2 Action recognition
State-of-the-art video recognition approaches apply 3D convolutional networks to sequential frames, given either as RGB images or as optical-flow outputs. Recent works are generally evaluated on common human activity datasets.
Carreira et al. introduced the Two-Stream Inflated 3D ConvNet (I3D), which combines 3D features from the RGB frames with 3D features from the optical flow for activity classification [1]. Their model can be seen as an extension of an ImageNet-pretrained 2D image classification network (built from Inception modules) whose filters are inflated to 3D so that it can recognize spatio-temporal features. They pre-train their model on the Kinetics dataset [8] and reach action recognition accuracies of up to 80.9% on HMDB-51 and 98.0% on UCF-101.
Temporal Segment Networks try to find the important segments of a video in order to predict the action [16]. They make use of the arrow of time to capture the differences between actions and classify them. Kong and Fu provide an in-depth literature review and a summary of different approaches along with their evaluations [9].
These models try to capture the visual semantics and classify the actions with the help of temporal dependencies and cues. However, they are not trained for the zero-shot problem and cannot identify an unseen action class. Unlike those, our work aims to support the action classification task by building on existing video recognition models, and to examine the possibilities of grounding language in video.
2.3 Video event detection and captioning
Gao et al. try to retrieve specific events from a video given a language query [4]. The query comes in the form of a sentence, and the mapping is obtained through Long Short-Term Memory modules. Likewise, Hendricks et al. try to find sequences of events with the help of extra temporal words such as "after" or "during" that are used in the queries [6].
The solutions to these tasks are tailored to how each work defines the problem and are hardly generalizable. However, they give rise to further questions on the relationship between video and language. Moreover, they share similarities with our work in trying to understand the atomic parts of a video through recognition models and to align them with linguistic cues. A multimodal joint space would be helpful for detecting particular events in a video through motion words (verbs), and for figuring out the sequential dependencies as well.
The dense video captioning problem focuses on atomic spatio-temporal features of videos to generate meaningful textual explanations. [11, 18, 10] make use of video recognition models along with language models and attention modules. However, (a) they require gold datasets (perfectly explanatory captions for the video samples), (b) they rely on attention modules to align particular words and parts of speech with the pixels, and (c) they are not optimized for zero-shot and few-shot settings. In this case, our joint space might lead to an automatic alignment of some verbs to the temporal instances in videos, and become a starting point for alignments when there are no gold datasets. This can be done by sliding the auto-encoder over the video. If the resulting vector has a high similarity to any verb, the latent vector can be used as the initial input of an LSTM description generation model.
In this work, we aim to incorporate the temporally rich visual cues and the relevant linguistic cues into a fixed sized embedding space. The videos provide the temporal information for the action verbs better than the static images, and a video understanding module can be used to extract these visio-temporal features. For the linguistic cues, the verbs from the distributional language models are used since they represent the words atomically in a large set of dimensions and capture the features of verbs on a similarity basis. We first describe how these cues are extracted, then explain the joint multimodal representation architecture.
3.1 Video Understanding
Temporal dependencies can be captured best through sequences of frames (videos). For that, the state-of-the-art action classification networks make use of 3D convolutional networks on RGB features and optical flows and combine the predictions from both signals in a weighted manner. Here, we can use only one type of feature since our task is not about improving the visual classification. The aim of our study is to understand whether using such a network will be meaningful for our multimodal space. Figure 1 demonstrates the network architecture.
Figure 1: Inflated3D Action classification architecture by Carreira et al. [1], where Inc. labeled modules represent the Inception Module. On the last average pooling layer, there are 1024 features for t frames. t is smaller than the original size of T due to multiple convolutional layers and pooling operations. The predictions are calculated by pooling the last t frames and mapping to C classes.
On the penultimate layer of the network, the original number of time steps T is reduced to a much smaller size t, where each time step has 1024 features. On the last layer, these t time steps are averaged into a single vector and the classification is applied through softmax. Hence, we have a feature vector of size 1024 which represents the visual activity throughout the video. The original action recognition model uses these features to classify over the possible activity classes. This means that this single vector is capable of reflecting visio-temporal differences between actions. In this work, we use this vector as the input of the visual modality because it carries rich, high-dimensional information about the visual content.
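The temporal averaging described above can be illustrated with a small sketch. It is not the I3D implementation; the function name `temporal_average_pool` and the toy sizes (3 time steps, 4 features instead of 1024) are assumptions made for illustration only.

```python
from typing import List


def temporal_average_pool(frame_features: List[List[float]]) -> List[float]:
    """Average t per-time-step feature vectors into one clip-level vector."""
    if not frame_features:
        raise ValueError("need at least one time step")
    t = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[d] for frame in frame_features) / t for d in range(dim)]


# Toy clip: t = 3 time steps with 4 features each (the paper's setting has 1024).
clip = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
video_vector = temporal_average_pool(clip)
print(video_vector)  # [2.0, 1.0, 2.0, 2.0]
```

The output has one value per feature regardless of t, which is why clips of different lengths can share the same fixed-size visual input.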
3.2 Textual Representation
Distributional word embeddings have proven to be successful for natural language processing tasks. Such embeddings include GloVe and Word2Vec, which follow a similar distributional logic but are trained in different ways. In this work, we use pre-trained GloVe embeddings. There are different sets of embeddings trained on different datasets and for different purposes. Each of them represents the co-occurrence relationships of the words in the dataset it was trained on. We choose the most common and generalizable embeddings with the highest number of dimensions. Beyond these word embeddings, the state-of-the-art is shifting towards attentive models such as ELMo or the Transformer [3]. In this project, we only use a simple representative model because, again, our question is whether our multimodal space is meaningful or not. Better word models might nevertheless improve the results further.
3.3 Joint Embedding Space
We have two sources for multimodal understanding: (a) a video vector with C features that represent the most important spatio-temporal features of the activity, and (b) a word vector with D-dimensional embeddings. The video vector is extracted by passing the sequential frames through the 3D CNN layers of the action recognition network, whereas the word vector comes from a pre-trained distributional language model. There are several approaches to find the relationship between them.
One method is to find a direct mapping from the video input to a word embedding. There, each video is reduced to the number of dimensions of the word embeddings through either a fixed linear map or a neural network. However, as Collell et al. showed, such mappings preserve the semantics of the input vectors rather than learning the features shared with the paired targets [2]. In other words, they are biased towards the input vectors and cannot capture a symmetric relationship between different modalities.
One way to overcome the inherent limitations of a direct mapping between multimodal vectors might be to build an embedding space in between the videos and the words, and to incorporate a more sophisticated loss function to train the model. A neural network in this direction was proposed by Piergiovanni et al. and is illustrated in Fig. 2.
Figure 2: The multimodal embedding model used in [13]. The video is of size CxT where C is the number of features and T is the number of time steps. Likewise, the caption sentences are embedded by GloVe vectors and their size is CxL where C is the number of GloVe features and L is the number of words. In the encoder/decoder layer, they use the temporal attention mechanism to reduce the time dimension T to N and there are feed forward layers after the encoders.
They introduced autoencoders to create an embedding space. For a video input $x_v$ and a text input $x_t$, the components are described as:
$$
\begin{aligned}
\text{Video Encoder: } & E_v : x_v \mapsto z_v, \qquad & \text{Video Decoder: } & G_v : z_v \mapsto \hat{x}_v, \\
\text{Text Encoder: } & E_t : x_t \mapsto z_t, \qquad & \text{Text Decoder: } & G_t : z_t \mapsto \hat{x}_t.
\end{aligned}
$$
The $G$ stands for "Generator", whereas $E$ stands for "Encoder". Their model is trained with a mixture of these loss functions:
$$
\begin{aligned}
L_{recons} &= \lVert G_v(E_v(x_v)) - x_v \rVert_2 + \lVert G_t(E_t(x_t)) - x_t \rVert_2, \\
L_{joint} &= \lVert E_v(x_v) - E_t(x_t) \rVert_2, \\
L_{cross} &= \lVert G_t(E_v(x_v)) - x_t \rVert_2 + \lVert G_v(E_t(x_t)) - x_v \rVert_2.
\end{aligned}
$$
Here, the video encoder learns to reconstruct its own input, and the text encoder learns to reconstruct the textual input, through the $L_{recons}$ loss, whereas $L_{joint}$ enforces the embeddings constructed from the different modalities to be as close as possible. The aim is to match the representative vectors coming from both modalities. In addition, $L_{cross}$ connects the video encoder to the text decoder, and the text encoder to the video decoder, since we wish a text-visual sample to be retrieved from each other. This loss is commonly used in computer vision tasks to transfer the artistic style of an image to another image.
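To make the roles of the reconstruction, joint and cross losses concrete, the following minimal sketch computes them for one video/text pair. It assumes an L2 distance for every term and uses hand-written toy encoders and decoders (`enc_v`, `dec_v`, `enc_t`, `dec_t`) in place of trained networks; all names, sizes and loss weights here are hypothetical and are not taken from [13].

```python
import math
from typing import Callable, List

Vector = List[float]
Mapping = Callable[[Vector], Vector]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reconstruction_loss(v, t, enc_v, dec_v, enc_t, dec_t) -> float:
    # Each modality is encoded and decoded back to itself.
    return l2_distance(dec_v(enc_v(v)), v) + l2_distance(dec_t(enc_t(t)), t)


def joint_loss(v, t, enc_v, enc_t) -> float:
    # Paired inputs should land on nearby points in the shared space.
    return l2_distance(enc_v(v), enc_t(t))


def cross_loss(v, t, enc_v, dec_v, enc_t, dec_t) -> float:
    # Video encoder feeds the text decoder, and text encoder the video decoder.
    return l2_distance(dec_t(enc_v(v)), t) + l2_distance(dec_v(enc_t(t)), v)


# Hypothetical toy mappings: 4-d video, 2-d text, 2-d joint space.
enc_v: Mapping = lambda v: [v[0] + v[1], v[2] + v[3]]
dec_v: Mapping = lambda z: [z[0] / 2, z[0] / 2, z[1] / 2, z[1] / 2]
enc_t: Mapping = lambda t: list(t)
dec_t: Mapping = lambda z: list(z)

v = [1.0, 1.0, 2.0, 2.0]
t_paired = [2.0, 4.0]
t_other = [2.0, 1.0]

# Illustrative weights for mixing the three terms (assumed, not from the paper).
weights = {"recons": 1.0, "joint": 1.0, "cross": 1.0}
for name, t in (("paired", t_paired), ("other", t_other)):
    total = (
        weights["recons"] * reconstruction_loss(v, t, enc_v, dec_v, enc_t, dec_t)
        + weights["joint"] * joint_loss(v, t, enc_v, enc_t)
        + weights["cross"] * cross_loss(v, t, enc_v, dec_v, enc_t, dec_t)
    )
    print(name, round(total, 4))
```

With these toy mappings the paired text gives a total of zero, while the mismatched text is penalized by the joint and cross terms only, since each modality still reconstructs itself perfectly.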
Different from [13], in order to check the limits and constraints of such a two-way auto-encoder model for two input modalities, we introduce a set of feed-forward layers and nonlinearities instead of the attention. In our model, the neural layers form a bottleneck between the two modalities, which represents the joint semantic space. Moreover, in [13], the attention module summarizes the video features over the time dimension and, likewise, summarizes the word features of a paragraph. In this work, we use atomic inputs, so that a single 1-dimensional vector represents the video features instead of a matrix, and a single word vector represents the corresponding action word instead of a caption or a paragraph. Hence, there are no temporal attention networks, because we assume that the temporal features of an action are already included in its feature vector.
Our auto-encoder is trained and tested with the loss functions described above (the reconstruction, joint and cross losses). In addition to these losses, the model can learn better with negative samples. We wish a non-related class vector to be far from the true class vector in the shared space. The authors of [13] introduced negative learning through additional discriminators for each modal space and adversarial loss functions. In this work, we have chosen a simpler approach and used a margin ranking loss:
$$
L_{rank} = \max\big(0,\; m - s(E_v(x_v), E_t(x_t)) + s(E_v(x_v), E_t(x_t'))\big)
$$
Here, $E_v$ and $E_t$ are the video and text encoders, and $s(E_v(x_v), E_t(x_t))$ is the similarity between the constructed vectors from a paired text and video, whereas $s(E_v(x_v), E_t(x_t'))$ is the similarity between the unpaired text and video. We wish the former to be higher than the latter. The margin ranking loss enforces this with a predefined margin $m$. A similar function is used by Zellers et al. [21] for the textual inputs only.
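A margin ranking loss of this kind can be sketched in a few lines. The version below assumes cosine similarity as the similarity function and a hinge of the form max(0, margin - s_pos + s_neg); the margin value and the toy embeddings are hypothetical and only illustrate the behaviour.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def margin_ranking_loss(sim_paired: float, sim_unpaired: float, margin: float) -> float:
    """Zero once the paired similarity exceeds the unpaired one by `margin`."""
    return max(0.0, margin - sim_paired + sim_unpaired)


# Hypothetical joint-space embeddings of a video, its own class word,
# and an unrelated class word.
z_video = [1.0, 0.0]
z_text_paired = [1.0, 0.0]
z_text_unpaired = [0.0, 1.0]

s_pos = cosine_similarity(z_video, z_text_paired)    # 1.0
s_neg = cosine_similarity(z_video, z_text_unpaired)  # 0.0

print(margin_ranking_loss(s_pos, s_neg, margin=0.5))  # 0.0
print(margin_ranking_loss(s_neg, s_pos, margin=0.5))  # 1.5
```

The first call shows a well-separated pair that contributes no loss; the second swaps the roles and shows the penalty when the unrelated class is closer than the true one.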
4.1 Datasets
The Kinetics Human Action Video Dataset consists of 400 different human activity classes with video clips of about 10 seconds each [8]. These activities cover a broad range of movements, such as sports (playing basketball, snowboarding), basic body motions (jumping, clapping), eating or cooking related activities, hobbies, communicative motions, etc. We used this dataset as the video input and extracted the feature vectors with the officially released pre-trained I3D model. Each extracted video vector has 1024 features.
As the textual input, the GloVe word embedding of each activity class is used. Some activity names consist of several words, such as "playing piano". Here, we followed the approach of Iyyer et al. [7] and averaged the GloVe vectors over these words. The pre-trained GloVe embeddings are publicly available; the 300-dimensional vectors trained on Wikipedia 2014 are used.
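Averaging word vectors for a multi-word class name can be sketched as follows. The 3-dimensional toy embeddings stand in for the real pre-trained vectors, and skipping words that have no embedding is an assumption of this sketch rather than something stated in the paper.

```python
from typing import Dict, List

Vector = List[float]


def average_word_vectors(class_name: str, embeddings: Dict[str, Vector]) -> Vector:
    """Represent a (possibly multi-word) class name by the mean of its word vectors."""
    # Assumption: words without an embedding are skipped.
    vectors = [embeddings[w] for w in class_name.lower().split() if w in embeddings]
    if not vectors:
        raise KeyError(f"no embedding found for any word in {class_name!r}")
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


# Hypothetical toy embeddings (the real vectors would be much higher-dimensional).
toy_embeddings = {
    "playing": [1.0, 0.0, 2.0],
    "piano": [3.0, 2.0, 0.0],
    "jumping": [0.0, 4.0, 4.0],
}

print(average_word_vectors("playing piano", toy_embeddings))  # [2.0, 1.0, 1.0]
print(average_word_vectors("jumping", toy_embeddings))        # [0.0, 4.0, 4.0]
```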
In addition to the video and word embeddings, we tested the model on the IAPR TC-12 dataset, which consists of 20K images and their captions. The features of these images are extracted with the VGG-128 image recognition network by taking the last layer, which gives a 128-dimensional feature space. The textual features are extracted with a bidirectional gated recurrent unit (bi-GRU) and have 64 dimensions. We have used the pre-trained feature vectors from . Even though the main purpose of our model is to learn from temporal visual inputs, such static inputs help to test and improve the model further.
The IAPR TC-12 dataset is split into 16K train, 2K validation and 2K test samples. The Kinetics dataset has at least 400 videos for each of its 400 action classes. In this work, a smaller subset is generated by randomly selecting 300 classes and 40 videos for each class. In total, there are 10K train, 1K validation and 1K test samples.
4.2 Model and Implementation Details
Each of the autoencoders consists of 3 feed-forward (FF) layers of decreasing sizes, with a ReLU non-linearity and drop-out between the FF layers. The joint space has a smaller number of dimensions than the inputs: for the Kinetics dataset, with 1024-d vision and 300-d text inputs, the bottleneck in between is tested with sizes of 300-d, 200-d and 150-d. Likewise, for the IAPR TC-12 dataset, with 128-d vision and 64-d text inputs, the joint space is tested with sizes of 64-d and 48-d.
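As a rough illustration of such an encoder, the sketch below stacks three feed-forward layers of decreasing size with ReLU and drop-out between them. The layer sizes (8, 6, 4, 3), the random initialization, the omission of biases and the choice to leave the last layer linear are all assumptions for this toy example; the paper does not specify these details.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def make_layer(n_in: int, n_out: int, rng: random.Random) -> Matrix:
    """Random weight matrix with n_out rows and n_in columns (biases omitted)."""
    scale = 1.0 / n_in ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


def linear(weights: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(x: Vector) -> Vector:
    return [max(0.0, xi) for xi in x]


def dropout(x: Vector, rate: float, rng: random.Random, training: bool) -> Vector:
    """Inverted drop-out: active only in training, identity at test time."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


def encode(x: Vector, layers: List[Matrix], rate: float,
           rng: random.Random, training: bool = False) -> Vector:
    """Apply the FF layers, with ReLU and drop-out between consecutive layers."""
    h = x
    for i, weights in enumerate(layers):
        h = linear(weights, h)
        if i < len(layers) - 1:  # assumption: the bottleneck output stays linear
            h = dropout(relu(h), rate, rng, training)
    return h


rng = random.Random(0)
sizes = [8, 6, 4, 3]  # toy stand-in for e.g. 1024-d input down to the joint space
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
x = [rng.random() for _ in range(sizes[0])]
z = encode(x, layers, rate=0.5, rng=rng, training=False)
print(len(z))  # 3
```

A decoder would mirror this structure with increasing sizes, and a second encoder/decoder pair with its own input size would handle the other modality.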
The autoencoders are trained with the Adam optimizer, starting from a learning rate of 1e-3 with a weight decay of 1e-5. A higher starting learning rate was also tried, but the model ended up in a local optimum. A drop-out rate of 0.5 is used, though different rates had no major effect. The best models during training are selected by the lowest validation error.
4.3 Evaluation and Results
First, we measured the action class prediction accuracy in order to understand whether our model was successful at representing the multimodal information from the two inputs. For that, we decoded the test video vectors into textual vectors and retrieved the N most similar word vectors according to the cosine similarity metric. If the actual class is in the set of most similar vectors, it is counted as a hit and contributes to the top-N accuracy. In our case, the ground truth class might consist of several words ("playing basketball", "making sandwich", "walking with horse", ...) and is counted as a hit if any one of the words is matched. It is important to note that the tests are not in the "Generalized Zero-Shot Learning (GZSL)" setting, and that the similarity search is applied over the global word embedding space, not restricted to the class names. Therefore, our evaluation setting may not directly conform to the current trends in zero-shot evaluation.
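The top-N evaluation described above can be sketched as follows. The 2-dimensional toy vocabulary and the decoded vectors are hypothetical; the sketch only shows the retrieval by cosine similarity over the whole vocabulary and the rule that a multi-word class counts as a hit when any of its words is retrieved.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n_words(decoded: Vector, vocab: Dict[str, Vector], n: int) -> List[str]:
    """The n vocabulary words most similar to the decoded text vector."""
    ranked = sorted(vocab, key=lambda w: cosine(decoded, vocab[w]), reverse=True)
    return ranked[:n]


def is_hit(decoded: Vector, vocab: Dict[str, Vector], class_name: str, n: int) -> bool:
    """A hit if any word of the ground-truth class name is among the top n."""
    retrieved = set(top_n_words(decoded, vocab, n))
    return any(word in retrieved for word in class_name.split())


def top_n_accuracy(samples: Sequence[Tuple[Vector, str]],
                   vocab: Dict[str, Vector], n: int) -> float:
    hits = sum(is_hit(decoded, vocab, name, n) for decoded, name in samples)
    return hits / len(samples)


# Hypothetical global vocabulary and decoded vectors for two test videos.
vocab = {
    "playing": [1.0, 0.0],
    "basketball": [0.8, 0.6],
    "piano": [0.0, 1.0],
    "cooking": [-1.0, 0.0],
}
samples = [([0.9, 0.1], "playing basketball"), ([0.9, 0.1], "piano")]

print(top_n_accuracy(samples, vocab, n=1))  # 0.5
print(top_n_accuracy(samples, vocab, n=3))  # 1.0
```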
Second, the zero-shot action recognition is measured to evaluate the generalization capability of the model over the unseen classes. In this case, the test set consists of randomly selected 10 unseen classes with 40 instances each. We did not include the seen class instances in the test set in order to evaluate the zero-shot accuracy explicitly. Again, we calculated the top-N accuracies.
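A class-level split of this kind can be sketched as below. The helper name `zero_shot_split`, the fixed seed and the placeholder class names are assumptions for illustration; the point is only that the unseen classes are held out entirely, so no video of those classes appears in training.

```python
import random
from typing import List, Tuple


def zero_shot_split(class_names: List[str], n_unseen: int,
                    seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly hold out n_unseen classes; the remaining classes are 'seen'."""
    rng = random.Random(seed)
    unseen = set(rng.sample(sorted(class_names), n_unseen))
    seen = [c for c in class_names if c not in unseen]
    return seen, sorted(unseen)


# Hypothetical placeholder names standing in for the selected action classes.
all_classes = [f"class_{i:03d}" for i in range(310)]
seen_classes, unseen_classes = zero_shot_split(all_classes, n_unseen=10)

print(len(seen_classes), len(unseen_classes))  # 300 10
```

Only videos whose label is in `unseen_classes` would then be used for the zero-shot test set.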
In Table 1, the prediction accuracies for seen and unseen classes are reported. The hyper-parameters of each loss function are adjusted to examine the effects of different combinations. Each model is trained for 300 epochs. The first five rows show the results for the test setting with 300 classes and 40 video instances each. In addition, we have extended the tests with two more settings, whose results are shown in the bottom two rows of Table 1. The first setting includes 100 training classes with 100 videos for each class, and the number of unseen classes is increased to 30. In the second setting, the number of training classes is increased to 200 with 100 instances each, again with 30 unseen classes.
Table 1: Results on the action class prediction averaged over the test sets (in %). The top-5, top-10 and top-30 accuracies are given for the seen class prediction on the test dataset, whereas the top-5 and top-10 accuracies are given for the unseen class prediction task.
As can be seen from the table, the model with only the reconstruction loss gave the highest top-N accuracies on the seen action prediction task. Adding only the ranking loss did not improve the results, and putting a higher weight on this loss decreased the retrieval performance. The combination of all losses, with a higher influence of the cross and ranking losses, worked best among the multi-loss models. However, in the zero-shot prediction case, the equally weighted combination of losses showed the highest top-5 and top-10 accuracy.
In the first additional test, where there are fewer classes but more data per class, the accuracies for seen class prediction were as high as in the initial experiments, and the zero-shot results for both top-5 and top-10 were the best. In the second setting, where there are more classes with more data, the model did not perform better than in the initial experiments. It probably learned a representation with a fixed set of weights that is optimal for a high score (a local optimum) but does not vary much over different inputs. This would mean that it did not generalize, and that this could not be fixed with the validation split or with drop-out.
One of the unexpected observations was the effect of the joint loss. During training with this loss, the model stopped learning after several iterations. Moreover, it gave a single fixed output vector for any test sample, independent of the input class. Hence, we believe that such a loss, used in combination with the others, finds a fixed solution and loses the ability to generalize and vary over different inputs.
Since there is no baseline approach, it is not possible to compare these results directly with another model. However, the fully linguistic, attribute-based zero-shot model of [21] reported a top-1 accuracy of 18.15 and a top-5 accuracy of 40.17. They extract the linguistic properties of a verb, such as being a motion, having a social aspect, or its expected duration, and map them to static image features. Our results might indicate that such a simple auto-encoder model is not as successful at predicting the classes of unseen actions based only on the visual inputs and the word embeddings.
Nearest Neighbor Overlap
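Furthermore, in order to compare the effect of each input modality on the representation space, we have used the mean nearest neighbor overlap measure (mNNO) [2]. The mNNO is defined as:
$$
\mathrm{mNNO}^K(V, Z) = \frac{1}{KN} \sum_{i=1}^{N} \mathrm{NNO}^K(v_i, z_i)
$$
where $V = \{v_i\}_{i=1}^{N}$ and $Z = \{z_i\}_{i=1}^{N}$ are two sets of N paired vectors. $\mathrm{NNO}^K(v_i, z_i)$ indicates the number of common vectors in the K nearest neighborhoods of $v_i$ and $z_i$ in their respective spaces. For instance, let the nearest 3 neighbors of $v_i$ and of $z_i$ share only the items tiger and lion. The intersection of their neighborhoods is {tiger, lion} for K=3, and the mNNO score is 2/3.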
In our setting, when we decode a textual vector from a visual vector, the number of neighbor overlaps for these two vectors, and the neighbor overlaps between the ground truth word vector and the decoded word vector should be ideally close to each other. This will mean that the model can learn equal amount of information from different modalities instead of reflecting the topology of only one or the other.
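The mean nearest neighbor overlap between two paired sets of vectors can be computed with the short sketch below. It assumes cosine similarity for the neighborhoods and normalizes the summed overlap by K times the number of pairs; the two toy sets are hypothetical and chosen so that their neighborhood structures disagree completely.

```python
import math
from typing import List, Set

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def k_nearest(index: int, vectors: List[Vector], k: int) -> Set[int]:
    """Indices of the k vectors most similar to vectors[index], excluding itself."""
    sims = [(cosine(vectors[index], v), j) for j, v in enumerate(vectors) if j != index]
    sims.sort(key=lambda pair: (-pair[0], pair[1]))
    return {j for _, j in sims[:k]}


def mnno(set_a: List[Vector], set_b: List[Vector], k: int) -> float:
    """Mean nearest neighbor overlap of two sets of paired vectors."""
    n = len(set_a)
    overlap = sum(len(k_nearest(i, set_a, k) & k_nearest(i, set_b, k)) for i in range(n))
    return overlap / (k * n)


# Item i in space_a is paired with item i in space_b.
space_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
space_b = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

print(mnno(space_a, space_a, k=1))  # 1.0
print(mnno(space_a, space_b, k=1))  # 0.0
```

In the paper's setting, one such score would compare the mapped vectors with the input vectors and another would compare them with the ground-truth vectors of the target modality.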
We compared the mNNO scores from text-to-video and from video-to-text, with both the auto-encoder and the linear mapping settings. We have first measured the scores for IAPR TC-12 dataset for clear explanations. Then, we calculated for the Kinetics actions dataset. The results are reported in Table 2.
Table 2: Mean nearest neighbor overlap scores for video-to-text and text-to-video transfers on the test dataset. ff indicates the neural feed-forward layer, whereas AE is our two-way auto-encoder. X represents the input, f(X) the mapped output, and Y the ground truth.
It can be seen that for the IAPR TC-12 dataset, the auto-encoder produces outputs that are similar to the output modality regardless of whether the input is an image or a text, whereas the neural mapper retains the features of the input modality. For the Kinetics dataset, regardless of whether a linear mapping or an auto-encoder is used, the transferred features are more similar to the video modality. The reason behind this might be the large difference between the numbers of visual and textual dimensions (1024 versus 300).
4.4 Discussion
A joint space that allows transfer between the modalities in either direction, without requiring extra manually defined attributes, might be useful in video and language related tasks. However, our experiments on the prediction tasks showed that a model built with the approach we described was not capable of achieving high accuracies.
Model limitations
We used two stacked auto-encoders consisting of 3 neural layers each, with ReLU for nonlinear learning. Each auto-encoder learns the latent features of its inputs while learning to reconstruct them. The latent feature space acts as a semantic space that learns the distributional features shared among the modalities. The effects of the model features were:
1. The joint loss forced the networks to have a fixed embedding without generalizability.
2. The cross loss taught the model to retrieve the related item across the encoders. The effect of this has been tested in the preliminary experiments. Without the cross loss, the model cannot bridge the modalities.
3. The ranking loss improved the ability to distinguish between unrelated items and added an extra cue for "sharedness".
4. Drop-out made the model generalize better and relieved the effect of hubness to some extent. Without drop-out, the classification results had little variety and pointed to the same set of words.
5. Non-linearity increased the accuracy in the preliminary experiments, which might be related to an increase in the learning capacity of the model.
Zero-shot object classification has been shown to reach higher accuracy with compatibility learning models, which learn the mapping between the distributions, than with attribute classifiers [17]. In zero-shot activity detection, however, state-of-the-art accuracies are achieved with the help of manually defined linguistic attributes of verbs [21]. Compatibility learning models map the different modalities to each other with the help of a joint loss and a ranking loss. In this work, we hypothesized that a two-way cross mapper would improve the zero-shot accuracy, include an equal level of information from both modalities, and provide a holistic solution to other cross-modal tasks.
Zero-shot recognition
The question of whether it is possible to map classification-based visual distributional models to neighborhood-based textual distributional models without any extra prior knowledge still remains open. In other words, we cast the problem as follows:
Here, the main difference between the spaces is that the visual space does not consider the similarity across different actions; it only contains the class-wise differences that are helpful for classification. The textual space, in contrast, has a rich neighborhood structure that shows the similarity between words. Hence, the textual modality should provide a cue for similarity relationships to the multimodal representation space. The visual classes would then gain semantic information and zero-shot recognition would become possible (or vice versa). Another difference between the spaces is the type of information they include. The textual embeddings cover all types of words: not only verbs but also nouns, adjectives, pronouns, adverbs, and so on. The visual space, on the other hand, contains information on geometric shapes, edges, textures and depth. Would it be really meaningful to try to map these different types of information to each other? Can the zero-shot classification problem be solved by such a mapper? This question is directed both at our work and at the current research direction in zero-shot learning.
Possible improvements
There might be extra options to improve our model:
• It can be extended with a discriminative loss where there is an additional model that learns to separate the real versus fake visual input and the encoder competes in order to trick the discriminator. This way, the encoder will learn the underlying structure of the visual data better and will have a higher capacity to predict either seen or unseen classes.
• The number of words in the textual space can be restricted to only verb vectors for practicality.
• A different language model, trained on a more related corpus that contains more sports or activity related words, might improve the accuracies.
• The current action recognition model makes use of all of the visual input frames, where the background and other objects are included. A better approach would be to have segmented frames or frames with bounding boxes obtained through an object tracker. At the moment, such a dataset does not exist for a large set of activity classes.
• The neural networks may not be helpful to solve the problem, hence, different probabilistic cues could be introduced.
In this work, we have proposed an auto-encoder based neural model that aims to connect multimodal representations over a joint space. We focused on short activities, their video representations and the distributional word features. Such a joint space, which would allow transfer between the modalities in either direction, could be useful in video and language related tasks. The action class prediction experiments showed that our model was not capable of achieving results high enough to successfully assist in different tasks. There are possible ways to improve this model, such as extending it with a discriminative loss, or using a different language model or video recognition model. Overall, however, our model used an approach which did not exist in the literature on zero-shot learning and activity classification, and the accuracies indicate that research in this direction might be promising.
[1] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733, July 2017.
[2] G. Collell and M.-F. Moens. Do neural network cross-modal mappings really bridge modalities? In Association for Computational Linguistics(ACL), 2018.
[3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[4] J. Gao, C. Sun, Z. Yang, and R. Nevatia. Tall: Temporal activity localization via language query. 2017 IEEE International Conference on Computer Vision (ICCV), pages 5277–5285, 2017.
[5] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the 14th International Conference on Computer Vision (ICCV-2013), pages 2712–2719, Sydney, Australia, December 2013.
[6] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell. Localizing moments in video with temporal language. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1380–1390. Association for Computational Linguistics, 2018.
[7] M. Iyyer, V. Manjunatha, J. L. Boyd-Graber, and H. Daumé III. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1681–1691, 2015.
[8] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
[9] Y. Kong and Y. Fu. Human action recognition and prediction: A survey. CoRR, abs/1806.11230, 2018.
[10] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles. Dense-captioning events in videos. In International Conference on Computer Vision (ICCV), 2017.
[11] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei. Jointly localizing and describing events for dense video captioning. In CVPR, 2018.
[12] H. E. Matheson and L. W. Barsalou. Embodiment and Grounding in Cognitive Neuroscience, pages 1–27. American Cancer Society, 2018.
[13] A. Piergiovanni and M. S. Ryoo. Unseen Action Recognition with Multimodal Learning. ArXiv e-prints, June 2018.
[14] R. Socher, M. Ganjoo, C. D. Manning, and A. Y. Ng. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems 26, 2013.
[15] C. Silberer and M. Lapata. Learning grounded meaning representations with autoencoders. In Proceedings of ACL 2014, 2014.
[16] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[17] Y. Xian, B. Schiele, and Z. Akata. Zero-shot learning - the good, the bad and the ugly. In IEEE Computer Vision and Pattern Recognition (CVPR), 2017.
[18] H. Xu, B. Li, V. Ramanishka, L. Sigal, and K. Saenko. Joint Event Detection and Description in Continuous Video Streams. ArXiv e-prints, Feb. 2018.
[19] X. Xu, T. Hospedales, and S. Gong. Semantic embedding space for zero-shot action recognition. In 2015 IEEE International Conference on Image Processing (ICIP), pages 63–67, Sept 2015.
[20] X. Xu, T. Hospedales, and S. Gong. Transductive zero-shot action recognition by word-vector embedding. International Journal of Computer Vision, 123(3):309–333, July 2017.
[21] R. Zellers and Y. Choi. Zero-shot activity recognition with verb attribute induction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 946–958. Association for Computational Linguistics, 2017.