For most NLP tasks, researchers rely on pre-trained word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) as a key component of their models. These representations map each word of a sequence (w1, ..., wT) to a real-valued vector of dimension d. A drawback of such externally learned features is that they are (i) fixed, i.e. they cannot be adapted to the specific domain they are used in, and (ii) context independent, i.e. there is only one embedding per word, by which it is represented in any context.
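The context independence described above can be illustrated with a minimal sketch. The lookup table, its values and the helper `embed` are made up for illustration and do not correspond to any real pre-trained embedding.

```python
# Toy lookup table standing in for pre-trained static embeddings (values are made up).
EMBEDDINGS = {
    "the": [0.1, 0.0, 0.2],
    "bank": [0.7, -0.3, 0.5],
    "river": [0.2, 0.9, -0.1],
    "raised": [-0.4, 0.1, 0.3],
    "rates": [0.6, -0.2, 0.0],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for out-of-vocabulary words


def embed(sentence):
    """Map each word w_1, ..., w_T of a sentence to its fixed d-dimensional vector."""
    return [EMBEDDINGS.get(word, UNKNOWN) for word in sentence.lower().split()]


financial = embed("The bank raised rates")
geographic = embed("The river bank")

# "bank" receives the identical vector in both sentences, regardless of its sense.
same_vector = financial[1] == geographic[2]
```

Because the table is a plain lookup, nothing in the surrounding sentence can influence the vector returned for "bank", which is exactly drawback (ii).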
More recently, transfer learning approaches, such as the use of convolutional neural networks (CNNs) pre-trained on ImageNet (Krizhevsky et al., 2012) in computer vision, have entered the discussion. Transfer learning in the NLP context means pre-training a network with a self-supervised objective on large amounts of plain text and subsequently fine-tuning its weights on a task-specific, labelled data set. For a comprehensive overview of the current state of transfer learning in NLP, we recommend the excellent tutorial and blog post by Ruder et al. (2019).
With ULMFiT (Universal Language Model Fine Tuning), Howard and Ruder (2018) proposed an LSTM-based (Hochreiter and Schmidhuber, 1997) approach for transfer learning in NLP using AWD-LSTMs (Merity et al., 2017). After pre-training on a large unlabelled corpus, a task-specific layer is added to the network and the whole network is fine-tuned using labelled data. This model can be characterised as unidirectional contextual, while a bidirectionally contextual LSTM-based model was presented in ELMo (Embeddings from Language Models) by Peters et al. (2018).
Table 1: Summarization of the basic facts of the evaluated model architectures. Despite not being a central part of this evaluation, Word2Vec and FastText are added as baseline comparisons. With Transfer learning integration, we try to specify to which degree the model is capable of transfer learning. We distinguish between embedding models and end-to-end trainable transfer learning models.
The bidirectionality in ELMo is achieved by using biLSTMs instead of AWD-LSTMs. On the other hand, ULMFiT uses a more "pure" transfer learning approach compared to ELMo, as the ELMo-embeddings are extracted from the pre-training model and are not fine-tuned in conjunction with the weights of the task-specific architecture.
The OpenAI GPT (Generative Pre-Training, Radford et al. (2018)) is a model which resembles ULMFiT in two crucial points. It is a unidirectional language model and it allows stacking task-specific layers on top after pre-training, i.e. it is fully end-to-end trainable. The major difference between these two models is the architecture inside the LM, where OpenAI GPT uses the Transformer architecture (Vaswani et al., 2017).
Instead of processing one input token at a time, as recurrent architectures (LSTMs, GRUs) do, the Transformer takes in the whole sequence at once. This is possible because it utilizes a variant of the Attention mechanism (Bahdanau et al., 2014), which allows dependencies to be modelled without having to feed the data to the model sequentially. At the same time, the OpenAI GPT can be characterised as a unidirectional model, as it only takes the left side of the context into account. Its successor OpenAI GPT2 (Radford et al., 2019) has (despite some smaller architectural changes) essentially the same model architecture and can thus also be termed unidirectional contextual.
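The difference between attending to the whole sequence and attending only to the left context can be sketched as follows. This is a simplified illustration, assuming scaled dot-product self-attention with identity projections (no learned query, key or value matrices and a single head); the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, causal=False):
    """Scaled dot-product self-attention with identity projections.

    x is a list of T vectors of dimension d. With causal=True, position t
    may only attend to positions 0..t (the left context).
    """
    d = len(x[0])
    output = []
    for t, query in enumerate(x):
        visible = x[: t + 1] if causal else x
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in visible]
        weights = softmax(scores)
        output.append([sum(w * v[i] for w, v in zip(weights, visible)) for i in range(d)])
    return output


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full = self_attention(tokens)                # every position sees the whole sequence
left_only = self_attention(tokens, causal=True)  # unidirectional, GPT-style
```

In the causal variant the first output equals the first input, since that position can attend only to itself, and changing later tokens leaves earlier outputs untouched.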
The original BERT (Bidirectional Encoder Representations from Transformers, Devlin et al. (2018)) and, consequently, the other two BERT-based approaches discussed here (Liu et al., 2019; Lan et al., 2019) differ from the GPT models in that they are bidirectional Transformer models. Devlin et al. (2018) developed Masked Language Modelling (MLM) as a special training objective which allows the use of a bidirectional Transformer without compromising the language modelling objective. XLNet (Yang et al., 2019), in contrast, relies on an objective which the authors call Permutation Language Modelling (PLM) and thus also manages to model a bidirectional context despite being an auto-regressive model. A brief overview of the characteristics of the explained models can be found in table 1.
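The idea behind the MLM objective can be sketched in a few lines. This is a simplified illustration, not the procedure of Devlin et al. (2018): the masking probability of 0.15, the `[MASK]` symbol and the rule of always substituting the mask symbol are assumptions made here for brevity.

```python
import random

MASK = "[MASK]"  # assumed mask symbol


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens by MASK and return (inputs, labels).

    labels holds the original token at masked positions and None elsewhere,
    so a loss would only be computed where the model has to reconstruct a token.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels


sentence = "the model sees context on both sides of a masked token".split()
inputs, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
```

Since the target token is hidden from the input, the model may use tokens to the left and to the right of it without trivially seeing the answer.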
In their stimulating paper, Raffel et al. (2019) take several steps in a similar direction by trying to ensure comparability among different transformer-based models. They perform various experiments with respect to the transfer learning ability of a transformer encoder-decoder architecture by varying the pre-training objective (different variants of denoising vs. language modelling), the pre-training resources (their newly introduced C4 corpus vs. variants thereof) and the parameter size (from 200M up to 11B). In particular, their idea of introducing a new corpus and creating subsets resembling previously used corpora like RealNews (Zellers et al., 2019) or OpenWebText (Gokaslan and Cohen, 2019) is a promising way to ensure comparability.
However, their experiments do not cover an important point we are trying to address in our paper:
Focussing on only one specific architecture does not answer the question of which components explain the performance differences between two models whose overall architecture differs as well (e.g. attention-based vs. LSTM-based). Yang et al. (2019) also address model comparability to some extent by performing an ablation study to compare their XLNet explicitly to BERT (Devlin et al., 2018). In this ablation study, they train six different XLNet-based models in which they modify different parts of the model in order to quantify how these design choices influence performance. At the same time, they restrict themselves to an architecture of the same size as BERT-BASE and use the same lexical resources for pre-training. Liu et al. (2019) vary their RoBERTa model with respect to model size and use of pre-training resources in order to perform an ablation study aiming at comparability to BERT. Lan et al. (2019) go even one step further with their ALBERT model by also comparing their model to BERT with regard to run time and width/depth of the model.
Although all these experiments are highly valuable steps in the direction of better comparability, there are still no clear guidelines on which comparisons to perform in order to ensure a maximum degree of model comparability with respect to potentially influential factors.
First, we will present the different available corpora which were utilised for pre-training the models and compare them with respect to their size, the domain they are from and their accessibility. Subsequently, we will briefly introduce common benchmark data sets which the models are fine-tuned and evaluated on.
While the conceptual differences between the evaluated models have already been addressed in the introduction, the models will now be described in more detail. This is driven by the intention to emphasise differences beyond the obvious, conceptual ones.
3.1 Training corpora
We will start this chapter by briefly introducing the commonly used pre-training resources. While some corpora are used by most of the models, others are used by only one model, in conjunction with one of the more popular ones. An overview can be found in table 2.
English Wikipedia Devlin et al. (2018) state that they used data from the English Wikipedia and provide a manual for crawling it, but no actual data set. Their data encompassed around 2.5B words. Wikipedia data sets are available in the TensorFlow Datasets module.
CommonCrawl Among other resources, Yang et al. (2019) used data from CommonCrawl. Besides stating that they filtered out short or low-quality content, no further information is given. Since CommonCrawl is a dynamic database, which is updated on a monthly basis, and the extracted amount of data always depends on the user, we cannot provide a word count for this source in table 2.
ClueWeb (Callan et al., 2009), Giga5 (Parker et al., 2011) The information about the use of ClueWeb and Giga5 is similarly sparse as for CommonCrawl (all three were used for pre-training XLNet). ClueWeb was obtained by crawling 2.8M web pages in 2012, Giga5 was crawled between 01/2009 and 12/2010.
1B Word Benchmark (Chelba et al., 2013) This corpus, originally introduced as a benchmark data set by Chelba et al. (2013), combines multiple data sets from the EMNLP 2011 workshop on Statistical Machine Translation (WMT11). The authors normalised and tokenized the corpus and performed further pre-processing steps, dropping duplicate sentences and discarding words with a count below three. Additionally, they randomised the ordering of the sentences in the corpus. This yields a corpus with a vocabulary of 793,471 words and a total word count of 829,250,940 words.
Table 2: Pre-training resources used by the language models (sorted by release date). Concerning the Accessibility, the category Crawling Manual can be ranked between the two other categories. In this case, the authors did not provide the data, but at least a (more or less detailed) manual for crawling the data (or similar data) oneself. The dollar signs in brackets signify the necessity of a payment in order to get access to the corpus. There is no information on RealNews (Zellers et al., 2019) and C4 (Raffel et al., 2019) as these corpora were not used by the evaluated models.
† We report the word count as given in the respective articles proposing the corpora. Note that the number of tokens depends on the tokenization scheme used by a specific model.
‡ Stated by one of the authors on Twitter: https://twitter.com/thtrieu_/status/1096672446864748545
BooksCorpus (Zhu et al., 2015) With their work from 2015, Zhu et al. introduced two corpora: the MovieBook Dataset and the BooksCorpus, with the latter being heavily used for pre-training language models (cf. table 2). In their work, they used the BooksCorpus in order to train a model for retrieving sentence similarity.
Overall, the corpus comprises 984,846,357 words in 74,004,228 sentences obtained from analysing 11,038 books. The vocabulary consists of 1,316,420 unique words, making the corpus lexically more diverse than the 1B Word Benchmark: its vocabulary is 66% larger, while its word count is only 19% higher. Unfortunately, it is no longer available for public download; the authors only provide a link to the ebook store from which they scraped the corpus.
Wikitext-103 (Merity et al., 2016) Merity et al. (2016) emphasised the necessity for a new large-scale language modelling data set by stressing the shortcomings of other corpora. They explicitly highlight the occurrence of complete articles, which allow the models to learn long-range dependencies, as one of the main benefits of their corpus. According to Merity et al. (2016), the 1B Word Benchmark lacks this property because its sentence ordering is randomised. With a count of 103,227,021 tokens and a vocabulary size of 267,735, it is about one eighth of the 1B Word Benchmark’s size in terms of token count and about one third in terms of vocabulary size. Note that there is also a smaller corpus available, which is a subset of about 2% of the size of Wikitext-103.
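The size comparisons stated for the BooksCorpus and Wikitext-103 can be checked with a few lines of arithmetic. The counts below are the ones quoted in the text; the dictionary layout and the helper `ratio` are only illustrative.

```python
# Word/token counts and vocabulary sizes as quoted in the corpus descriptions.
CORPORA = {
    "1B Word Benchmark": {"tokens": 829_250_940, "vocab": 793_471},
    "BooksCorpus": {"tokens": 984_846_357, "vocab": 1_316_420},
    "Wikitext-103": {"tokens": 103_227_021, "vocab": 267_735},
}


def ratio(corpus, reference, key):
    """Size of `corpus` relative to `reference` for the given key."""
    return CORPORA[corpus][key] / CORPORA[reference][key]


books_vocab = ratio("BooksCorpus", "1B Word Benchmark", "vocab")     # about 1.66
books_tokens = ratio("BooksCorpus", "1B Word Benchmark", "tokens")   # about 1.19
wiki_tokens = ratio("Wikitext-103", "1B Word Benchmark", "tokens")   # about 0.12
wiki_vocab = ratio("Wikitext-103", "1B Word Benchmark", "vocab")     # about 0.34
```

The first two ratios correspond to the 66% larger vocabulary and 19% higher word count of the BooksCorpus, the last two to the "one eighth" and "one third" statements for Wikitext-103.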
CC-News (Nagel, 2016) The CC-News corpus was presented and used in Liu et al. (2019). They used a web crawler proposed by Hamborg et al. (2017) to extract data from the CommonCrawl News data set (Nagel, 2016) and obtained a data set similar to the RealNews data set (Zellers et al., 2019).
Stories (Trinh and Le, 2018) This data set is also a specific subset of the CommonCrawl data. The authors built the data based on questions in common sense reasoning tasks. They extracted nearly 1M documents, most of which are taken from longer, coherent stories (hence the name of the corpus). One of the authors stated on Twitter that the corpus contains approximately 7B words.
WebText (Radford et al., 2019) The data set GPT2 was pre-trained on is not publicly available and was obtained by creating "a new web scrape which emphasised document quality" (Radford et al., 2019).
OpenWebText (Gokaslan and Cohen, 2019) As a reaction to Radford et al. (2019) not releasing their pre-training corpus, Gokaslan and Cohen (2019) started an initiative to emulate an open-source version of the WebText corpus.
It becomes obvious that the corpora are very heterogeneous with respect to their availability and to whether their size is clearly specified as a word count. Some corpora specify their size in gigabytes, but do not provide a token count or a vocabulary size. Thus, we can state that there is some lack of transparency when it comes to the lexical resources used for pre-training. In particular, the missing availability of the BooksCorpus is problematic, as this corpus is heavily used for pre-training.
3.2 Benchmark data sets for fine-tuning
Besides describing pre-training resources, it is also important to have a look at the data sets which are commonly used for benchmarking fine-tuned language models and thus determine new SOTA results.
GLUE (Wang et al., 2018) The General Language Understanding Evaluation (GLUE) benchmark is a freely available collection of nine data sets which models can be evaluated on. It also provides a fixed train-dev-test split with held out labels for the test set, as well as a leader board which displays the top submissions and the current SOTA. The relevant metric for the SOTA is an aggregate measure of the nine single task metrics.
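As a rough illustration of such an aggregate measure, the sketch below computes an unweighted average over per-task scores. That the aggregate is a plain mean is an assumption made here, and the scores are hypothetical values, not leaderboard results.

```python
def aggregate_score(task_scores):
    """Unweighted (macro) average over per-task scores (assumed aggregation rule)."""
    return sum(task_scores.values()) / len(task_scores)


# Hypothetical per-task scores on a 0-100 scale; not taken from any leaderboard.
example_scores = {
    "CoLA": 60.0, "SST-2": 94.0, "MRPC": 88.0, "STS-B": 87.0, "QQP": 72.0,
    "MNLI": 86.0, "QNLI": 92.0, "RTE": 70.0, "WNLI": 65.0,
}
overall = aggregate_score(example_scores)
```

Under this rule every task contributes equally, so a small data set such as WNLI has the same influence on the overall score as a large one such as MNLI.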
Table 3 provides the basic information on the data sets included in GLUE. The benchmark includes two binary classification tasks with single-sentence inputs (CoLA [Warstadt et al., 2018] and SST-2 [Socher et al., 2013]) and five binary classification tasks with inputs that consist of sentence-pairs (MRPC [Dolan and Brockett, 2005], QQP, QNLI [Wang et al., 2018], RTE [Wang et al., 2018] and WNLI [Wang et al., 2018]). The remaining two tasks also take sentence-pairs as input but have a multi-class classification objective with either three (MNLI [Williams et al., 2017]) or five classes (STS-B [Cer et al., 2017]).
Table 3: A brief summarization of the different data sets which together form the GLUE benchmark. This table is basically a rearrangement of table 1 from Wang et al. (2018) with slightly reduced information, as it is only meant to be an overview of the different tasks and data set sizes.
SuperGLUE (Wang et al., 2019) As a reaction to human baselines being surpassed by the top ranked models, Wang et al. (2019) proposed a set of benchmark data sets similar to, but, according to the authors, more difficult than GLUE. On average, the size of the provided training data is smaller than in GLUE and, as in GLUE, the data is split into ’train’, ’dev’ and ’test’. As of the writing of this paper, there is a large difference between the use of GLUE and SuperGLUE concerning the number of models evaluated on the respective benchmark.
Table 4: A brief summarization of the different data sets which together form the SuperGLUE benchmark. This table is basically a rearrangement of table 1 from Wang et al. (2019) with slightly reduced information, as it is only meant to be an overview of the different tasks and data set sizes.
It is considered to be more difficult than GLUE as it contains more complex tasks than just single-sentence or sentence-pair classification. SuperGLUE also features coreference resolution and question answering tasks. Unfortunately, it did not make sense to include it as a part of our model comparison, as (at the time of writing) only two of the discussed models were evaluated on SuperGLUE.
SQuAD (Rajpurkar et al., 2016, 2018) In its first version, the Stanford Question Answering Dataset (SQuAD) 1.1 (Rajpurkar et al., 2016) consists of 100,000+ questions explicitly designed to be answerable by reading segments of Wikipedia articles. The task is to correctly locate the segment in the text which contains the answer. A shortcoming of this task is the omission of situations where the question is not answerable by reading the provided article. Rajpurkar et al. (2018) address this problem in SQuAD 2.0 by adding 50,000 handcrafted unanswerable questions to the SQuAD 1.1 data set. On their homepage, the authors provide a train and development set as well as an official leader board. The test set is completely held out. Instead, the participants are required to upload their models to CodaLab. The SQuAD 1.1 data is, in an augmented form (termed QNLI), also part of the GLUE benchmark.
RACE (Lai et al., 2017) The Large-scale ReAding Comprehension Dataset From Examinations (RACE) contains (English) exam questions for Chinese students (middle and high school). In most articles in which a model is evaluated on RACE, it is described as especially challenging due to (i) the length of the passages, (ii) the inclusion of reasoning questions and (iii) the intentionally tricky design of the questions in order to test a human’s ability in reading comprehension. The data set can be subdivided into RACE-M (middle school examination) and RACE-H (high school examination) and comprises a total of 97,687 questions on 27,933 passages of text.
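Span-based question answering of this kind is commonly scored with a token-overlap F1 between the predicted and the reference answer. The sketch below is a simplified version that assumes lower-casing and whitespace tokenization only; the handling of empty answers (relevant for unanswerable questions) is a convention chosen here, and the function name is hypothetical.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer span."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Convention assumed here: two empty answers match, otherwise score 0.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


partial = token_f1("in Paris", "Paris")   # one of two predicted tokens is correct
abstain = token_f1("", "")                # correctly predicting "no answer"
```

A prediction that contains the reference plus extra tokens is penalised through precision, while a prediction that misses part of the reference is penalised through recall.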
3.3 Evaluated Models
ULMFiT (Howard and Ruder, 2018) The first "pure" transfer learning approach applied in NLP was ULMFiT in the beginning of 2018. The core of the model builds on the work of Merity et al. (2017), as it uses AWD-LSTMs, an LSTM variant that makes use of DropConnect (Wan et al., 2013) for better regularisation and applies averaged stochastic gradient descent (ASGD) for optimization (Polyak and Juditsky, 1992). This model consists of a 400-dimensional embedding layer followed by three LSTM layers, each of which encompasses 1150 hidden units. Howard and Ruder (2018) stack a softmax classifier with a hidden layer size of 50 on top of this architecture for pre-training the model. This final layer is complemented by a task-specific final layer during fine-tuning. The vocabulary size is limited to 30k words as in Johnson and Zhang (2017).
In contrast to the other models discussed in this paper, ULMFiT was not evaluated on the GLUE benchmark but on several other data sets (IMDb [Maas et al., 2011], TREC-6 [Voorhees and Tice, 1999], Yelp-bi, Yelp-full, AG’s news, DBpedia [all Zhang et al., 2015]).
Table 5: An overview of the data sets which ULMFiT was fine-tuned and evaluated on. It is an extension of table 1 (Howard and Ruder, 2018), adding information on the size of the test set and the domain. All six tasks are classification tasks, where the target variables have between 2 and 14 classes.
ELMo (Peters et al., 2018) As already stated in section 1, ELMo differs from ULMFiT with respect to its usability for transfer learning. The pre-trained ELMo embeddings are plugged in at the lowest layer of an arbitrary NLP model in order to use them for a downstream task. In the case of ELMo this means the following: As ELMo consists of multiple biLSTM layers, one can extract multiple intermediate-layer representations from the model. These representations are used for computing a (task-specific) weighted combination, which is concatenated with static context-independent word embeddings. The model weights of ELMo are thus not updated during the training of the downstream model; only the weights learned for combining the intermediate-layer representations from ELMo are. Peters et al. (2018) evaluate an ELMo-based model on SQuAD and other tasks, but when it comes to GLUE there are multiple ELMo-based architectures available on the leaderboard. Thus, here we will concentrate on the best-performing ELMo-based model, a BiLSTM model with Attention (Wang et al., 2018).
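The task-specific weighted combination of intermediate-layer representations can be sketched as follows. The softmax normalisation of the layer weights and the scaling factor `gamma` follow our reading of Peters et al. (2018) and should be treated as assumptions here; all names and values are hypothetical.

```python
import math


def combine_layers(layer_outputs, raw_weights, gamma=1.0):
    """Weighted combination of per-layer vectors for one token.

    layer_outputs: list of L vectors (one per layer), all of equal dimension.
    raw_weights:   L unnormalised, task-specific scalars (learned downstream).
    gamma:         task-specific scaling factor.
    """
    peak = max(raw_weights)
    exps = [math.exp(w - peak) for w in raw_weights]
    norm = sum(exps)
    weights = [e / norm for e in exps]  # softmax-normalised layer weights
    dim = len(layer_outputs[0])
    return [
        gamma * sum(w * layer[i] for w, layer in zip(weights, layer_outputs))
        for i in range(dim)
    ]


# Frozen representations of one token from three layers (made-up values).
frozen_layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
combined = combine_layers(frozen_layers, raw_weights=[0.0, 0.0, 0.0])

# The combination is concatenated with a static embedding of the same token.
static_embedding = [0.5, -0.5, 0.25]
features = static_embedding + combined
```

Only `raw_weights` and `gamma` would be trained with the downstream model, while the layer outputs themselves stay fixed, which mirrors the distinction drawn above.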
OpenAI GPT (Radford et al., 2018) The OpenAI GPT is a pure attention-based architecture that does not make use of any recurrent layers. Pre-training is performed by combining Byte-Pair encoded (Sennrich et al., 2015) token embeddings with learned position embeddings, feeding them into a multi-layer transformer decoder architecture with a standard language modelling objective. By using a decoder architecture, the model only has access to the preceding tokens in the sequence at each step. Thus, the GPT model is a unidirectional attention-based architecture. Fine-tuning was, amongst others, performed on the nine tasks that together form the GLUE benchmark.
BERT (Devlin et al., 2018) This model can be seen as a reference point for everything that came thereafter. Similar to GPT, it uses subword tokenization (WordPiece rather than Byte-Pair Encoding) with a vocabulary size of 30k. By introducing the MLM training objective, the authors were able to combine deep bidirectionality with the self-attention mechanism for the first time. In addition to the MLM objective, it also utilizes a next-sentence prediction (NSP) objective, the usefulness of which has been debated in other research papers (Liu et al., 2019). The BERT-BASE model consists of 12 bidirectional transformer-encoder blocks (24 for BERT-LARGE) as described in Vaswani et al. (2017) with 12 (16 respectively) attention heads per block and an embedding size of 768 (1024 respectively). The need to better understand the behaviour of these huge networks has even given rise to a new field of research called BERTology, which aims at explaining the inner workings of BERT-based models.
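To relate the stated numbers of blocks and embedding sizes to model size, the following sketch gives a rough parameter count for a BERT-style encoder. Several inputs are assumptions not stated in the text: a feed-forward width of four times the embedding size, 512 position embeddings, two segment embeddings, a final dense "pooler" layer and a vocabulary of exactly 30,000 entries.

```python
def transformer_encoder_params(layers, hidden, vocab=30_000,
                               max_positions=512, segments=2, ff_mult=4):
    """Rough parameter count of a BERT-style encoder (weights and biases)."""
    layer_norm = 2 * hidden                       # scale and shift
    embeddings = (vocab + max_positions + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)    # query, key, value, output projections
    ff = hidden * ff_mult * hidden
    feed_forward = (ff + ff_mult * hidden) + (ff + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden             # assumed final dense layer
    return embeddings + layers * per_layer + pooler


base = transformer_encoder_params(layers=12, hidden=768)
large = transformer_encoder_params(layers=24, hidden=1024)
size_factor = large / base
```

Under these assumptions the estimate is roughly 109M parameters for the BASE and 335M for the LARGE configuration, i.e. a size factor of about 3. The number of attention heads does not enter the count, because the heads merely partition the projection matrices.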
OpenAI GPT2 (Radford et al., 2019) With GPT2, the OpenAI team published a scaled-up version of GPT in 2019. Compared to its predecessor, it contains some smaller changes concerning the placement of layer normalisation and residual connections. Overall, there are four different versions of GPT2 with the smallest one being equal to GPT, the medium one being of similar size as BERT-LARGE and the xlarge one being released as the actual GPT2 model with 1.5B parameters.
XLNet (Yang et al., 2019) In order to overcome (what they call) the pretrain-finetune discrepancy, which is a consequence of BERT’s masking approach, and to simultaneously include bidirectional contexts, Yang et al. (2019) propose the PLM objective for their XLNet. They use two-stream self-attention for preserving the position information of the token to be predicted, which would otherwise be lost due to the permutation of the sequence. While the first of the two streams (content stream attention) resembles the standard self-attention from a transformer decoder, the other stream (query stream attention) does not allow the token to see itself but only the preceding tokens of the permuted sequence.
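The core idea of the PLM objective, predicting each token from the tokens that precede it in a sampled factorisation order, can be sketched as follows. This is an illustration only: it samples a single order and ignores the two-stream attention entirely; the function name is hypothetical.

```python
import random


def permutation_lm_targets(tokens, seed=0):
    """For one sampled factorisation order, list (position, target, context) triples.

    Each target is predicted from the tokens that precede it in the sampled
    order; positions are kept so that the original word order is not lost.
    """
    order = list(range(len(tokens)))
    random.Random(seed).shuffle(order)
    steps = []
    for t, position in enumerate(order):
        context = sorted(order[:t])  # positions already "seen" in this order
        steps.append((position, tokens[position], [(p, tokens[p]) for p in context]))
    return order, steps


tokens = "New York is a city".split()
order, steps = permutation_lm_targets(tokens)
```

Because the sampled order is independent of the original word order, the context of a target may contain tokens from both its left and its right, even though every prediction step remains auto-regressive.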
RoBERTa (Liu et al., 2019) With RoBERTa (short for Robustly optimized BERT approach), Liu et al. (2019) introduce an exact (architectural) replica of BERT with tuned hyperparameters and a larger corpus used for pre-training. The masking strategy for pre-training is changed from static (masking once during pre-processing) to dynamic (masking every sequence just before feeding it to the model), the additional NSP objective is removed, the BPE-level vocabulary is adjusted and increased to 50k, and RoBERTa is trained on larger batches than BERT. All of these adjustments improve the performance of the model and make it competitive with the previous SOTA results of XLNet.
ALBERT (Lan et al., 2019) By addressing the steady increase in model size as a potential problem, ALBERT (short for A Lite BERT) takes a different direction from most post-BERT architectures. Lan et al. (2019) apply parameter-reduction techniques in order to train faster models with lower memory demands that, at the same time, yield a performance comparable to SOTA models. In our work we will always refer to ALBERT-XXLARGE, which is the best-performing ALBERT model. Note that the much smaller ALBERT models also yielded results comparable to or even better than BERT.
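The difference between static and dynamic masking can be made concrete with a small sketch. The masking probability of 0.15 and all function names are assumptions for illustration; only the positions to be masked are drawn here, not the replacement tokens.

```python
import random


def mask_positions(length, mask_prob, rng):
    """Draw the set of positions to mask for one pass over a sequence."""
    return {i for i in range(length) if rng.random() < mask_prob}


def static_masks(length, epochs, mask_prob=0.15, seed=0):
    """Static masking: one pattern drawn during pre-processing, reused every epoch."""
    pattern = mask_positions(length, mask_prob, random.Random(seed))
    return [pattern for _ in range(epochs)]


def dynamic_masks(length, epochs, mask_prob=0.15, seed=0):
    """Dynamic masking: a fresh pattern each time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [mask_positions(length, mask_prob, rng) for _ in range(epochs)]


static = static_masks(length=200, epochs=4)
dynamic = dynamic_masks(length=200, epochs=4)
```

With static masking the model sees the same masked positions for a sequence in every epoch, whereas dynamic masking exposes it to different prediction targets for the same text over the course of training.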
The two tables below will try to give a comprehensive overview on the differences of the previously discussed model architectures. While table 6 will only attempt to give an overview on the amount of computation that was needed to train a given architecture on a given corpus, we will directly try to relate model architecture and size as well as usage of lexical resources to model performance in table 7.
Table 6: Summarization of the basic facts of the evaluated transfer learning model architectures. Word2Vec, FastText and ELMo are not included as these are not end-to-end trainable models, meaning that the model size also depends on the model used after obtaining the embeddings. The parameter size of ULMFiT is assumed to be the larger value from Merity et al. (2017), since Howard and Ruder (2018) use plain AWD-LSTMs with a vocabulary size of 30k tokens like Johnson and Zhang (2016, 2017). Values for GPT2-XLARGE are taken from Strubell et al. (2019).
† Estimation according to the formula proposed on https://openai.com/blog/ai-and-compute/: pfs-days = number of units × PFLOPS/unit × days trained × utilization, with an assumed utilization of one third. Information on PFLOPS/unit for TPUs from https://cloud.google.com/tpu/.
‡ We provide two numbers here, as Devlin et al. (2018) do not specify whether they use v2 or v3 TPUs. The first number assumes the use of v2 TPUs, the one in square brackets assumes use of v3 TPUs.
One thing that we can learn from table 6 is the unfortunate lack of details when it comes to reporting the computational resources used for training the models. While Howard and Ruder (2018) do not provide any information at all on the computational resources utilised for pre-training ULMFiT, the other articles are also not over-informative when it comes to reporting them. Unfortunately, there are no clear guidelines on how to appraise resource consumption when it comes to evaluating and comparing models. This may be partly attributed to the rapidly growing hardware possibilities due to modern cloud computing architectures, but in our opinion it should nevertheless be accounted for, since it may pose environmental issues (Strubell et al., 2019) and also limits portability to smaller devices.
The second thing is that it is also important to consider the differences displayed in tables 6 and 7 when comparing model performances. When comparing two models of approximately the same size (e.g. BERT-BASE versus GPT), it seems obvious that the superior performance of BERT-BASE originates purely from its more elaborate model architecture (cf. table 1), given the similar size. But one should also be aware of the larger pre-training resources (BERT-BASE uses at least twice as much data for pre-training) as well as the unknown differences in the usage of computing power. We estimated the amount of compute used by a model in pfs-days, resulting in an estimate for BERT-BASE that is no lower than the one for GPT.
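An estimate of compute in pfs-days can be sketched as follows, assuming the estimate is the product of the number of hardware units, the peak PFLOPS per unit, the training time in days and an assumed utilization of one third. The hardware figures in the example are placeholders and do not describe the training setup of any particular model.

```python
def pfs_days(units, pflops_per_unit, days_trained, utilization=1 / 3):
    """Petaflop/s-days: units x peak PFLOPS per unit x training days x utilization."""
    return units * pflops_per_unit * days_trained * utilization


# Hypothetical setup: 4 accelerator units trained for 4 days.
# The peak PFLOPS values are placeholders for an older and a newer hardware generation.
older_generation = pfs_days(units=4, pflops_per_unit=0.18, days_trained=4)
newer_generation = pfs_days(units=4, pflops_per_unit=0.42, days_trained=4)
```

The example shows why an unspecified hardware generation matters: with identical unit counts and training time, the estimate changes by more than a factor of two.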
Another aspect which should not be ignored when evaluating performance is the use of ensemble models. As can be seen in the first column of table 7, the three ensemble models seem to outperform both of the BERT models by a large margin. Only parts of these differences may be attributed to the model architecture, as the ensembling as well as the larger pre-training resources might also give an advantage to these models. As there are unfortunately no single model performance values available for XLNet, RoBERTa and ALBERT on the official GLUE leaderboard, we also compare the single model performances from Lan et al. (2019) obtained on the dev sets (WNLI excluded). From this comparison we can get a good impression of how high the contribution of model ensembling might be: The difference between BERT-LARGE and the XLNet ensemble in the official scores (7.9 percentage points) is more than twice as high as the difference on the dev score (3.4 percentage points).
In order to address the differences in size of the pre-training resources, Yang et al. (2019) make the extremely insightful effort to compare a BASE variant of XLNet to BERT-BASE (same size and same pre-training resources). While the F1 score on the SQuAD v2.0 dev set is still remarkably higher than for BERT-BASE (almost comparable to BERT-LARGE) it does not show a large improvement on the RACE test set anymore (which might have been expected due to the large improvement of XLNet-LARGE over both BERT models).
Table 7: Performance of different models on GLUE, SQuAD and RACE as well as model size and resource usage compared to BERT-BASE (except for GLUE dev set performance, where BERT-LARGE is the reference). Performance differences on the benchmark data sets are given in percentage points, while the differences in size/resources are given as factors, e.g. BERT-LARGE has 3.1 times the size of BERT-BASE and performs 2.2 percentage points better on GLUE. We omit SuperGLUE in this table as of the time of writing only BERT and RoBERTa were evaluated on it. ULMFiT and OpenAI GPT2 are also omitted as there are no performance values on these data sets publicly available. Highest improvements over the reference model in bold. For ELMo we do not provide a model size, since the performance values are from two different models (cf. section 3.3).
Displayed performance measures are Matthews Correlation (GLUE), F1 score (SQuAD) and Accuracy (RACE).
Ensemble performance; No single model performance available
Own calculations based on Lan et al. (2019) table 13; WNLI is omitted
Result for BERT-BASE on SQuAD v2.0 is taken from Yang et al. (2019) table 6
Result for BERT-BASE on RACE is taken from Zhang et al. (2019) table 2 † Liu et al. (2019) and Lan et al. (2019) specify the BooksCorpus + English Wikipedia as 16GB ‡ This variant of RoBERTa uses only BooksCorpus + English Wikipedia for pre-training
The comparability of RoBERTa from the GLUE leaderboard (model ensemble and larger pre-training resources) to BERT-LARGE is again limited, but the authors performed several experiments in order to show the usefulness of their model optimisations. When pre-training BERT-LARGE and a single RoBERTa model on comparable lexical resources (BooksCorpus + English Wikipedia; 13GB for BERT vs. 16GB for RoBERTa), the RoBERTa model still shows a significant improvement over BERT-LARGE, even if the improvement is somewhat smaller than the difference between BERT-LARGE and the ensemble model. In another ablation study, Liu et al. (2019) train a BASE variant of RoBERTa on their larger pre-training resources. Even though it comprises only about one third of the size of BERT-LARGE, the larger pre-training corpus in conjunction with the optimised training leads to a slightly better performance on the GLUE dev set (without WNLI). Unfortunately, we cannot compare RoBERTa-BASE to BERT-BASE, as we have neither the "official" leaderboard score for RoBERTa-BASE nor the "unofficial" dev set score for BERT-BASE.
Table 8: Performance of BERT-LARGE and XLNet-LARGE on the benchmark data sets used by Howard and Ruder (2018) as well as model size and resource usage compared to ULMFiT. Specification of the differences are displayed as in table 7, highest improvements over the reference model in bold. Note that we report accuracies here, as opposed to Howard and Ruder (2018) and Yang et al. (2019), in order to provide a more similar interpretation of these values compared to the values in table 7. Displayed performance measures are Accuracy for all tasks.
In order to also put the results of ULMFiT into context, we present the results published by Yang et al. (2019) alongside the information on model size and use of lexical resources in table 8. Despite being much larger and utilising pre-training corpora that are some orders of magnitude larger, neither BERT-LARGE nor XLNet-LARGE exhibits large improvements over the performance of ULMFiT. This might partly originate from the simplicity (compared to GLUE & co.) of the tasks, but partly also from the already high performance level, where no extremely large improvements are possible anymore.
This chapter reflects the main takeaways from the above comparisons and tries to raise some issues for future research practices. We do not claim to have a solution to these potentially problematic aspects but think that these points are highly debatable.
Why no benchmark corpus for pre-training? It is good and well-established practice to use benchmark data sets like GLUE, SuperGLUE (not yet used that often), SQuAD and RACE for comparing the performance of pre-trained language models on different types of NLP/NLU tasks. Many recently published articles (Liu et al., 2019; Yang et al., 2019; Lan et al., 2019) perform (partly extensive) ablation studies controlling for pre-training resources in order to make (versions of) their models comparable to BERT as "benchmark model", which is really important as it helps the reader to get an intuition for the impact of pre-training resources. Nevertheless, this practice is unfortunately not perfect due to two critical issues: (i) BERT (and consequently all the other models as well) makes use of the BooksCorpus (Zhu et al., 2015), which is not publicly available, and (ii) it only leads to model comparisons in a low pre-training resource environment (compared to more recent models) and yields no insight into the behaviour of the reference model (e.g. BERT) in a high(er) pre-training resource context. So we view statements of the type "Model architecture A is superior to model architecture B on performing task X." somewhat critically and would propose to phrase them in a way comparable to the following statement: "Model architecture A is superior to model architecture B on performing task X, when pre-trained on a small/large corpus of low/high quality data from domain Y for time Z."
Why no standardised description of (computational) resources? When writing this article, it sometimes turned out to be difficult to obtain a single measure of how much compute was used to pre-train the model described in an article. In our opinion, this is not due to carelessness on the part of the authors but rather to the lack of a clear reporting standard. We found ourselves confronted with the following situations:
While situation a) is clearly unsatisfactory and should be avoided, scenarios b) and c) basically provide (almost) all of the necessary information but stop short of the final step to scenario d), where the reporting would reach universal comparability across different articles. A quite nice and intuitive way of estimating the GPU time needed for model training was also proposed on the OpenAI-blog22. This is of course not as exact as a computation based on the counts of operations in a model, but on the other hand it requires no deep insight into the model architecture and is thus applicable to a wide range of architectures without much effort.
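To illustrate what such a hardware-based estimate could look like, the following is a minimal sketch. It assumes that the estimate takes the form number of devices × peak throughput per device × utilisation × training time, expressed in petaflop/s-days. The function name, the default utilisation and all numbers in the usage note are our own placeholders and are not taken from any of the evaluated articles.

```python
def petaflop_s_days(n_devices: int,
                    peak_tflops_per_device: float,
                    days: float,
                    utilisation: float = 0.33) -> float:
    """Rough training compute in petaflop/s-days.

    ASSUMPTION: compute ~ devices x peak throughput x utilisation x time.
    The default utilisation is an illustrative placeholder; the true value
    depends on hardware, model and implementation.
    """
    if n_devices <= 0 or peak_tflops_per_device <= 0 or days < 0:
        raise ValueError("device count and throughput must be positive, days non-negative")
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must lie in (0, 1]")
    # Convert the aggregate peak throughput from teraflop/s to petaflop/s.
    peak_pflops = n_devices * peak_tflops_per_device / 1000.0
    # Scale by the fraction of peak that is actually achieved and by the duration.
    return peak_pflops * utilisation * days
```

For a hypothetical run on 8 devices with 100 teraflop/s peak each, trained for 10 days at 50% utilisation, one would call `petaflop_s_days(8, 100.0, 10, utilisation=0.5)`. The only inputs are quantities that are commonly stated in articles (hardware, number of devices, training time), which is what makes this kind of estimate applicable without knowledge of the model internals.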
Shouldn’t performance be evaluated in relation to size and resource consumption? As larger models have a higher capacity for learning good representations and using larger pre-training resources should also improve their quality, varying these two components simultaneously with the model architecture might lead to interference between the individual influences on model performance. So the intent of this aspect has a slight overlap with the question posed above, but while the above is more or less about introducing some kind of reference, this is about carefully varying and evaluating the effects of different parts of the model.
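The idea of varying one component at a time can be sketched as follows. Given a set of reported runs, the snippet keeps only those pairs that differ in exactly one of the three factors discussed above (architecture, model size, pre-training corpus size), since only such pairs allow a score difference to be attributed to a single factor. The record layout, the factor names and the example values are purely hypothetical and serve as an illustration, not as results from any article.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Run:
    architecture: str        # hypothetical label, e.g. "A" or "B"
    n_params_millions: int   # model size
    corpus_gb: int           # size of the pre-training corpus
    score: float             # benchmark performance


# The factors whose influence we want to separate.
FACTORS = ("architecture", "n_params_millions", "corpus_gb")


def differing_factors(a: Run, b: Run) -> list:
    """Names of the factors in which two runs differ."""
    return [f for f in FACTORS if getattr(a, f) != getattr(b, f)]


def controlled_pairs(runs: list) -> list:
    """All (factor, run_a, run_b) triples where exactly one factor differs."""
    pairs = []
    for a, b in combinations(runs, 2):
        diff = differing_factors(a, b)
        if len(diff) == 1:
            pairs.append((diff[0], a, b))
    return pairs


# Placeholder runs: only the first two differ in a single factor.
example_runs = [
    Run("A", 100, 16, 0.80),
    Run("B", 100, 16, 0.82),
    Run("B", 300, 160, 0.88),
]
example_pairs = controlled_pairs(example_runs)
```

In this toy setting, the comparison between the first and the third run is discarded because architecture, size and corpus all change at once, which is exactly the kind of interference described above.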
As can be seen from the above analysis, there is a clear lack of a concise guideline for fair comparisons of large pre-trained language models. It is not sufficient to just rank models by their performance on the common benchmark data sets as this does not take into account all the other factors mentioned in this analysis.
Table 9: Proposal of starting points when thinking about reporting standards for pre-trained LMs. We categorise the reporting of the experimental time and the benchmark performance of the un-tuned model as not easily feasible, as one has to be aware of these standards in order to track the time of all experiments. Also, defining what is an "un-tuned" version is not always that simple. With "un-tuned" we mean not further tuned during pre-training.
A further aspect (which is not explicitly addressed here) is the reporting of resources (time and compute) spent on model development, including all experimental runs and trials, and hyperparameter tuning during pre-training. In our opinion, this is important with respect to two facets: On the one hand, it is important to take into account energy and environmental considerations when training deep learning models (Strubell et al., 2019); on the other hand, it is also a signal to the reader/user of how difficult it is to train (and to fine-tune) the model. This might have implications for the usage of a model as a transfer learning model for diverse downstream tasks. Models that have already been tuned to a high degree during pre-training to reach a certain level of performance may, in the long run, have less potential for further improvements than models which reach it without much hyperparameter tuning.
Taking all these considerations into account, we want to tentatively propose starting points (cf. table 9) for defining reporting standards which are globally accepted and applied when it comes to comparing pre-trained language models. We carefully try to categorise the different facets according to feasibility (How much effort does it take to report this?), current realisation (How many research papers are reporting this?) and their relevance for reproducible research (How crucial is this for performing reproducible research?). All these categorisations are of more or less subjective nature due to the fact that they cannot be quantified and are based on just a handful of the most influential research papers.
We are aware that it might take a large collective effort to establish some set of standards, but we think that it is an absolutely crucial step to describe all the aspects we mentioned as transparently as possible in order to foster replicability and reproducibility.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
Callan, J., Hoy, M., Yoo, C., and Zhao, L. (2009). Clueweb09 data set.
Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., and Specia, L. (2017). Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.
Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., and Robinson, T. (2013). One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dolan, W. B. and Brockett, C. (2005). Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
Gokaslan, A. and Cohen, V. (2019). Openwebtext corpus.
Hamborg, F., Meuschke, N., Breitinger, C., and Gipp, B. (2017). News-please: a generic news crawler and extractor. In 15th International Symposium of Information Science (ISI 2017), pages 218–223.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.
Howard, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
Johnson, R. and Zhang, T. (2016). Convolutional neural networks for text categorization: Shallow word-level vs. deep character-level. arXiv preprint arXiv:1609.00718.
Johnson, R. and Zhang, T. (2017). Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 562–570.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.
Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. (2017). Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2019). Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1, pages 142–150. Association for Computational Linguistics.
Merity, S., Keskar, N. S., and Socher, R. (2017). Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182.
Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2016). Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Nagel, S. (2016). Cc-news. https://commoncrawl.org/2016/10/news-dataset-available/.
Parker, R., Graff, D., Kong, J., Chen, K., and Maeda, K. (2011). English gigaword fifth edition, june. Linguistic Data Consortium, LDC2011T07, 12.
Pennington, J., Socher, R., and Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
Ruder, S., Peters, M. E., Swayamdipta, S., and Wolf, T. (2019). Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18, Minneapolis, Minnesota. Association for Computational Linguistics.
Sennrich, R., Haddow, B., and Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
Strubell, E., Ganesh, A., and McCallum, A. (2019). Energy and policy considerations for deep learning in nlp. arXiv preprint arXiv:1906.02243.
Trinh, T. H. and Le, Q. V. (2018). A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
Voorhees, E. M. and Tice, D. M. (1999). The trec-8 question answering track evaluation. In TREC, volume 1999, page 82. Citeseer.
Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., and Fergus, R. (2013). Regularization of neural networks using dropconnect. In International conference on machine learning, pages 1058–1066.
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2019). Superglue: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018). Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
Warstadt, A., Singh, A., and Bowman, S. R. (2018). Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.
Williams, A., Nangia, N., and Bowman, S. R. (2017). A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, Q. V. (2019). Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., and Choi, Y. (2019). Defending against neural fake news. arXiv preprint arXiv:1905.12616.
Zhang, S., Zhao, H., Wu, Y., Zhang, Z., Zhou, X., and Zhou, X. (2019). Dual co-matching network for multi-choice reading comprehension. arXiv preprint arXiv:1901.09381.
Zhang, X., Zhao, J., and LeCun, Y. (2015). Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657.
Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. (2015). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.
AWD Averaged stochastic gradient descent weight-dropped
biLSTM bi-directional Long short-term memory
BPE Byte-Pair Encoding
CNN Convolutional neural network
FCNN Fully connected neural network
GRU Gated recurrent unit
LSTM Long short-term memory
MLM Masked Language Modelling
NLP Natural Language Processing
NLU Natural Language Understanding
PLM Permutation Language Modelling
SOTA State-of-the-art