Apart from early statistical methods, computational approaches to authorship attribution have conventionally been divided into classification-based and similarity-based (Stamatatos, 2009). In the classification-based paradigm, candidate authors are identified, and texts known to be from them are used to construct a training set; a standard supervised machine learning approach is then typically applied to construct a classifier, with methods now including deep learning (Ruder et al., 2016). This is the more common approach to authorship identification, which has successfully tackled basic versions of the problem, mostly with small numbers of authors, where these authors form a closed set and where they are known in advance. Koppel et al. (2011) describe this as the “vanilla” version of the problem that is not the typical case in the real world; rather, in real world problems there may be perhaps thousands of candidate authors, the author may not in fact be one of the candidates, and the known text from authors may be limited.
Koppel et al. (2011) argued that similarity-based approaches are better suited to large numbers of candidate authors. In the similarity-based paradigm, some metric is used to measure the distance between two texts, and an unknown text is attributed to the author of the known one(s) it is closest to, making this one of a class of nearest neighbour approaches (Hastie et al., 2009). For authorship attribution, there have been fewer of these approaches than of the classification-based ones, with noteworthy ones including the Writeprints method of Abbasi and Chen (2008) and the method of Koppel et al. (2011). The empirical success of this latter method has been demonstrated in a set of authorship shared tasks over a number of years, organised as part of the PAN framework of shared tasks on digital forensics and stylometry: for example, it formed the core of two of the winners of PAN authorship shared tasks (Seidman, 2013; Khonji and Iraqi, 2014), and has been a standard inference attacker for the PAN shared task on authorship obfuscation, where the goal is to conceal authorship from methods that attempt to detect it. Koppel et al. (2011) also note that reducing authorship attribution to instances of the binary authorship verification problem (determining if a given document is by a particular author or not) permits authorship attribution in cases where the author is not one of the known candidates, and is more naturally suited to similarity-based models. However, most existing methods have used only a static notion of similarity over fixed features, rather than a learned one.
Subsequent to much of this earlier work, deep learning has led to major changes in NLP, both in learning more accurate models for a range of tasks and in blurring the distinction between classification-based and similarity-based approaches: the learned representations can be applied to predicting classes or to determining similarity of data items. A useful distinction to draw is between closed set and open set recognition (Geng et al., 2021): whether models apply to knowledge of the world at training time, as in the conventional classification-based authorship framework, or beyond that, as in the advocacy of Koppel et al. (2011) for approaching the task via similarity.
There are tasks within NLP that have successfully used deep learning for applying a learned notion of similarity, but these have been applied to learning similarity between semantic aspects of texts, rather than looking at stylistic similarity. For example, tasks like QA and image captioning have been tackled using Deep Semantic Similarity Models (DSSM), e.g. in the work of Yih et al. (2014) and Fang et al. (2015), respectively, in mapping between the semantics of image and corresponding text; other tasks like duplicate question detection (Rodrigues et al., 2017) — identifying questions with the same semantics — and semantic composition (Cheng and Kartsaklis, 2015) have similarly been approached with semantic similarity models. Approaches like these might also apply to stylistic similarity, with the right feature space, but it is an open question as to their application beyond semantics (Gerz, 2020).
A task that has parallels to our own comes from image processing: building on the original use of Siamese networks for signature verification by Bromley et al. (1994), Koch et al. (2015) use deep Siamese networks to learn a notion of similarity between images, where the generality of this notion is evaluated via one-shot recognition. This style of Siamese network has been adapted for a range of semantics-based NLP tasks, such as for sentence similarity by Mueller and Thyagarajan (2016) and for tasks like paraphrase identification by Yin et al. (2016). A proposal to use them for the stylistic task of author identification came from Dwyer (2017), but while the idea was appealing, results in that work were not positive, and performed in some cases worse than the baseline. More recently Boenninghoff et al. (2019) applied Siamese networks to short social media texts on a relatively small PAN dataset, producing some positive results. Contemporaneously with the present work, Araujo-Pino et al. (2020) and Boenninghoff et al. (2020) also applied Siamese networks in a closed-set PAN context.
We define a range of Siamese-based architectures for authorship attribution on the sorts of texts standardly used in this task and across large numbers of authors, and evaluate them in both known-author (closed set) and one-shot learning contexts (open set); as part of this, we examine the effect of choice of energy function, sub-network structure and text representation. We show that they can outperform both a strong classification-based baseline and, in one-shot contexts, the key conventional similarity-based method of Koppel et al. (2011), on datasets with large numbers of authors. We also find clear preferences for choice of sub-network type and text representation.
2.1 Authorship Identification
Stamatatos (2009) surveyed approaches up until 2009: we noted in §1 the division into classification-based and similarity-based approaches, the latter of which is better suited to large numbers of authors. A key work following the survey was the similarity-based approach of Koppel et al. (2011). The method represents texts by vectors of space-free character 4-grams, and then repeatedly samples features from these vectors and takes the cosine similarity between the vectors consisting of these sampled features; Koppel and Winter (2014) later found that the Ruzicka metric produced better results. Like the majority of work in authorship identification, these are applied to longer texts; in this specific instance, to blog posts taken from 10,000 authors.
Given the successful application to a very large number of authors, we use this as a baseline method in this paper.
Much work on authorship identification since then has appeared in PAN shared tasks: the years with attribution setups like this paper were 2011, 2012 and 2018, while the years 2013–2015 considered a verification setup instead. The attribution tasks have required choosing among small numbers of authors, e.g. 3 for 2012 (Juola, 2012) up to 20 for 2018 (Kestemont et al., 2018). For the most part systems in these tasks use conventional machine learning (i.e. not deep learning): the 2018 winner used an ensemble classifier (Custódio and Paraboni, 2018) and the runner-up a linear SVM (Murauer et al., 2018). As noted above, earlier winners in verification setups (Seidman, 2013; Khonji and Iraqi, 2014) were based on the similarity approach of Koppel et al. (2011), which we use as a baseline. Another exception to conventional machine learning was the 2015 winner, Bagnall (2015), using an RNN-based classifier with shared state but a different softmax layer for each author: the architecture is not generally applicable.
The PAN author attribution task in 2019 changed focus from the previous setups and the setup of the present paper, to focus on cross-domain texts; the task overview (Kestemont et al., 2019) noted that no deep learning approaches were used there (because of poor performance in the 2018 task, likely due to the small data setup), and participating systems were typically standard ensembles of conventional classifiers. After this digression, PAN in 2020 returned to closed-set author verification, with plans for open-set verification in 2021 and a ‘surprise task’ in 2022. PAN 2020 differed from previous years with the use of a much larger dataset to support more data-hungry approaches like deep learning (Kestemont et al., 2020).
Systems participating in the PAN 2020 verification task were developed contemporaneously with the present work. Approaches included neural networks (Araujo-Pino et al., 2020; Boenninghoff et al., 2020; Ordoñez et al., 2020), statistical and regression models (Kipnis, 2020; Weerasinghe and Greenstadt, 2020), and comparisons on the basis of specific predefined or extracted features and thresholds (Halvani et al., 2020; Gagala, 2020; Ikae, 2020). We discuss the relationship of the two closest of these to our work below.
Figure 1: Siamese network architecture.
Outside the PAN framework, some work is specific to certain authorship contexts and not purely stylistic: e.g. Chen and Sun (2017) and Zhang et al. (2018) on scientific authorship, incorporating publication content and references. Other work uses additional features that are restricted to specific contexts, such as the work by Hou and Huang (2020) that incorporates tone in stylometric analysis for Mandarin Chinese. Notable work on purely stylistic authorship identification with standardly used features, as in this paper, included the use of LDA by Seroussi et al. (2011), both within an SVM and using Hellinger distance, to handle large numbers of authors; this was extended in Seroussi et al. (2014). Mohsen et al. (2016) used feature extraction via a stacked denoising autoencoder and then classification via SVM. Ruder et al. (2016) proposed a CNN classification model which outperformed Seroussi et al. (2011) and various other conventional machine learning approaches on up to 50 authors across a range of datasets; given the set of comparators and the relatively large number of authors used for a classification approach, we use it in this paper as another baseline.
2.2 Siamese Architectures
Siamese networks were first used for verifying signatures, by framing it as an image matching problem (Bromley et al., 1994). The key features of the Siamese network were that it consisted of twin sub-networks, linked together by an energy function (Fig 1). The weights on the sub-networks are tied, so that the sub-networks are always identical: inputs are then mapped into the same space, and the energy function represents some notion of distance between them. Siamese networks were updated for deep learning by Koch et al. (2015) for the task of general image recognition. The sub-networks were convolutional neural networks (CNNs); the weighted $L_1$ distance between the outputs of the final layers of these CNNs was calculated and a sigmoid activation applied, and a cross-entropy objective was then used in training.
This idea of deep learning-based Siamese networks has been adapted from Koch et al. (2015) for a number of tasks in NLP. In most cases, these Siamese networks have been applied to tasks at the level of sentences or below. Mueller and Thyagarajan (2016) applied a Siamese network structured like that of Koch et al. (2015) to the problem of sentence similarity using the SICK dataset, where pairs of sentences have been assigned similarity scores derived from human judgements. Their architecture similarly used $L_1$ (Manhattan) distance, but the sub-networks were LSTMs. At around the same time, Yin et al. (2016) defined an attention-based model motivated by the Siamese architecture of Bromley et al. (1994), and applied it to the tasks of answer selection, paraphrase identification and textual entailment. Much subsequent work has been similar in terms of applications: to answer selection or question answering (Das et al., 2016; Tay et al., 2018; Hu, 2018; Lai et al., 2018), sentence similarity (Reimers and Gurevych, 2019), job title normalisation (Neculoiu et al., 2016), matching e-commerce items (Shah et al., 2018), learning argumentation (Joshi et al., 2018; Gleize et al., 2019), and detecting funnier tweets (Baziotis et al., 2017). In some cases Siamese networks have been applied to word-based rather than sentence-based problems, such as identifying cognates (Rama, 2016) or antonyms (Etcheverry and Wonsever, 2019). In other cases the application of Siamese networks is secondary to the main task, such as relation extraction (Rossiello et al., 2019) or supervised topic modelling (Huang et al., 2018). The most typical configuration is to use some kind of RNN as sub-network and cosine similarity for the energy function; but CNNs are also used in sub-networks, and other distance measures are also used as energy functions, along with some less common alternatives such as a hyperbolic distance function (Tay et al., 2018) or one based on LSTM-based importance weighting (Hu, 2018).
There have been two attempts before this work to use Siamese networks for author identification, and two contemporaneous with it. The first was Dwyer (2017), who observed that within NLP they have been used only for short texts rather than the longer sort standardly used for author identification, which is supported by the above summary of applications of Siamese networks to NLP. For sub-networks, Dwyer (2017) used fully connected networks, and as the energy function. Experimentally, on data from the PAN 2014 and 2015 tasks, he found that results were fairly poor, in some cases worse than a random baseline. More recently Boenninghoff et al. (2019) applied Siamese networks to short social media texts, using an architecture with LSTMs as sub-networks and an energy function based on Euclidean distance. This was applied to the relatively small PAN 2016 dataset, and produced some positive results. In the author verification context of PAN 2020, the winning system of Boenninghoff et al. (2020) used a Siamese network with LSTMs as sub-networks and a Probabilistic Linear Discriminant Analysis layer and Linguistic Embedding Vectors (LEV) to perform Bayes factor scoring for the verification task. The other Siamese network, of Araujo-Pino et al. (2020), consisted of residual sub-networks with densely connected components, and an energy function.
Our architecture follows the basic structure of Koch et al. (2015), and of Mueller and Thyagarajan (2016) for sentence similarity; these both used $L_1$ distance for the energy function, although Koch et al. (2015) used CNNs for the sub-networks while Mueller and Thyagarajan (2016) used LSTMs. The goal of our network is to produce similarity scores for text pairs such that pairs by the same author have high scores and those by different authors have lower scores.
Below we define the components of our primary models. We also note alternative model choices for the sub-networks, which we examine after the main results.
3.1 Sub-networks
Like Koch et al. (2015), we used CNNs here, in line with the observation of Kim (2014) that CNNs are good at text classification. Our primary sub-network architecture is similar to that of Ruder et al. (2016), a high-performing CNN classification approach to authorship attribution. The input for our primary model is character-level: Ruder et al. (2016) found that character-level input almost always worked best, and the representation is also character-level in Koppel et al. (2011), in line with the observations of Kešelj et al. (2003) about stylistic authorship classification. Each sub-network consists of an embedding layer, four convolutional layers, and a dense layer. The activation functions are tanh for convolutional layers and sigmoid for dense layers.
We do also examine the effect of different choices here: choosing word-level input instead of character-level input, and LSTMs instead of CNNs. For the LSTM alternatives, we look at both unidirectional (left-to-right) and bidirectional.
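To make the sub-network structure concrete, the following NumPy sketch runs a forward pass through an embedding layer, four convolutional layers with tanh activations, and a dense layer with a sigmoid activation. It is only an illustration under stated assumptions: the layer sizes, the filter width, the stacking of the convolutions, the global max pooling step and all function names are hypothetical choices made for the example, not the settings used in our experiments. Both texts are encoded with the same parameter set, which is what tying the weights of the twin sub-networks amounts to.

```python
import numpy as np

def init_subnetwork(vocab_size, emb_dim=16, n_filters=8, width=3,
                    n_conv=4, out_dim=10, seed=0):
    """Create ONE set of parameters; both twins reuse it (tied weights).
    All sizes are illustrative assumptions, not the paper's settings."""
    rng = np.random.default_rng(seed)
    params = {"emb": rng.normal(0.0, 0.1, (vocab_size, emb_dim)), "conv": []}
    in_channels = emb_dim
    for _ in range(n_conv):
        kernel = rng.normal(0.0, 0.1, (width, in_channels, n_filters))
        params["conv"].append((kernel, np.zeros(n_filters)))
        in_channels = n_filters
    params["dense"] = (rng.normal(0.0, 0.1, (n_filters, out_dim)),
                       np.zeros(out_dim))
    return params

def conv1d_tanh(x, kernel, bias):
    """'Valid' 1-D convolution along the sequence axis, followed by tanh."""
    width = kernel.shape[0]
    steps = x.shape[0] - width + 1
    out = np.empty((steps, kernel.shape[2]))
    for t in range(steps):
        # window is (width, in_channels); kernel is (width, in_channels, filters)
        out[t] = np.tensordot(x[t:t + width], kernel,
                              axes=([0, 1], [0, 1])) + bias
    return np.tanh(out)

def subnetwork(char_ids, params):
    """Embedding -> four tanh convolutions -> max pooling -> sigmoid dense."""
    x = params["emb"][char_ids]               # (seq_len, emb_dim)
    for kernel, bias in params["conv"]:
        x = conv1d_tanh(x, kernel, bias)
    pooled = x.max(axis=0)                    # pooling step is an assumption
    weights, bias = params["dense"]
    return 1.0 / (1.0 + np.exp(-(pooled @ weights + bias)))

# Character-level input: both texts go through the SAME parameters.
text_a = "The quick brown fox jumps over the lazy dog."
text_b = "Pack my box with five dozen liquor jugs!"
vocab = {ch: i for i, ch in enumerate(sorted(set(text_a + text_b)))}
params = init_subnetwork(len(vocab))
h1 = subnetwork(np.array([vocab[ch] for ch in text_a]), params)
h2 = subnetwork(np.array([vocab[ch] for ch in text_b]), params)
```

The vectors `h1` and `h2` are what an energy function would then compare; a word-level or LSTM alternative would replace only the `subnetwork` function.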
3.2 Energy functions
Koch et al. (2015) considered more than one distance between the outputs of the final layers of their sub-networks (the two output vectors in our Fig 1), and found that the $L_1$ distance worked better for their image matching task. Adapting their notation, we use this same function for our distance calculation:
$$p = \sigma\Big(\sum_j \alpha_j \,\big|h_1^{(j)} - h_2^{(j)}\big|\Big)$$
where $h_i$ is the output of the final layer of sub-network $i$ (in our case, the dense layer after the convolutional layers) and $h_i^{(j)}$ the $j$th element of it; $\alpha_j$ the additional parameters that are learned by the model during training, weighting the importance of the component-wise distance; and $\sigma(\cdot)$ the sigmoid activation function. This defines the final fully-connected layer for the network which joins the two Siamese components. When applied to our CNN sub-networks described above, this gives the weighted-distance variant of our architecture.
We also observe, however, that in text-related tasks cosine similarity is commonly used: this is the measure used in Koppel et al. (2011) and many text-based Siamese or DSSM models, as discussed in §2. We therefore introduce a variant of this architecture where the distance calculation is the complement of the cosine similarity between the two sub-network outputs $h_1$ and $h_2$, similar to Rodrigues et al. (2017). As this is a scalar quantity, there is no final dense layer. The energy function is then:
$$E_{\cos}(h_1, h_2) = 1 - \frac{h_1 \cdot h_2}{\lVert h_1 \rVert \, \lVert h_2 \rVert}$$
We refer to this as Siamcos.
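The two energy functions discussed in this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than our implementation: it assumes the weighted distance is the component-wise absolute difference used by Koch et al. (2015), and the function names and the small `eps` guard against division by zero are hypothetical additions for the example.

```python
import numpy as np

def weighted_l1_energy(h1, h2, alpha):
    """sigmoid(sum_j alpha_j * |h1_j - h2_j|); alpha is learned in training.
    Assumes the component-wise distance is the absolute difference."""
    z = np.dot(alpha, np.abs(h1 - h2))
    return 1.0 / (1.0 + np.exp(-z))

def cosine_energy(h1, h2, eps=1e-12):
    """Complement of the cosine similarity: a scalar, so no dense layer."""
    cos = np.dot(h1, h2) / (np.linalg.norm(h1) * np.linalg.norm(h2) + eps)
    return 1.0 - cos

h1 = np.array([0.2, 0.9, 0.4])
h2 = np.array([0.1, 0.8, 0.7])
alpha = np.array([-1.5, 0.3, -2.0])   # stand-in for learned weights

score_l1 = weighted_l1_energy(h1, h2, alpha)
score_cos = cosine_energy(h1, h2)
```

Because the weights in the first function are learned, the network can decide which components of the sub-network output matter for authorship; the cosine variant has no such parameters and relies entirely on the learned representation.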
4.1 Evaluation Framework
4.1.1 Known Author vs One-Shot
We consider two types of evaluation. The first is the one-shot evaluation of Koch et al. (2015). Here the set of authors in the test set is disjoint with respect to the authors in the training set. Closed-set classification approaches do not apply here, as there is no way to build a model of a previously unseen author. Open-set approaches will only work to the extent that they embody general notions of stylistic similarity between authors. We refer to this as the OneShot setup. In OneShot, the training set consists of 2/3 of the authors, as described below.
The second type of evaluation is common in authorship attribution: while the texts in the training and test sets are different, the same set of authors is represented in both. We refer to this as the KnownAuth setup. Classification approaches are applicable here, as well as similarity; for the similarity approaches, what they embody could involve both properties of specific authors and general models of authorial similarity. In KnownAuth, the training set consists of 3/4 of the texts written by each author.
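The difference between the two setups is easiest to see as two ways of splitting the same data. The sketch below is illustrative only: the function names, the author-to-texts dictionary and the use of rounding and a fixed seed are assumptions made for the example.

```python
import random

def one_shot_split(texts_by_author, train_frac=2 / 3, seed=0):
    """OneShot: split by AUTHOR, so test authors are unseen in training."""
    authors = sorted(texts_by_author)
    random.Random(seed).shuffle(authors)
    cut = round(len(authors) * train_frac)
    train = {a: texts_by_author[a] for a in authors[:cut]}
    test = {a: texts_by_author[a] for a in authors[cut:]}
    return train, test

def known_author_split(texts_by_author, train_frac=3 / 4, seed=0):
    """KnownAuth: split each author's TEXTS, so authors appear in both sets."""
    rng = random.Random(seed)
    train, test = {}, {}
    for author in sorted(texts_by_author):
        texts = list(texts_by_author[author])
        rng.shuffle(texts)
        cut = round(len(texts) * train_frac)
        train[author], test[author] = texts[:cut], texts[cut:]
    return train, test

# Toy corpus: 12 authors with 8 texts each.
corpus = {f"author{i}": [f"text{i}_{k}" for k in range(8)] for i in range(12)}
os_train, os_test = one_shot_split(corpus)
ka_train, ka_test = known_author_split(corpus)
```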
4.1.2 Verification vs N-way
As in Koch et al. (2015), we begin with the task of verification: Are two texts by the same author? We use this solely to investigate how our Siamese models perform on their fundamental task of scoring similar authors high and different authors low.
The main task, also framed as in Koch et al. (2015), is N-way evaluation: Given a text T by author A, select the text out of N candidates that is also by A; there will be exactly one by A among the N.
The N-way evaluation applies to both KnownAuth and OneShot frameworks. The similarity approaches choose the candidate from among the N that has the highest similarity score to T. For the classification approach (§4.3), the candidate that is chosen is the author with the highest network prediction among the N.
4.2 Data
4.2.1 Datasets
There are several datasets previously employed for author identification, including various PAN datasets. While the PAN datasets released up to 2019 have been used by a number of authors, they are small for our many-author setup (the largest has 180 authors) and too small to train a deep learning model. For benchmarking the verification task on small data, we do use the PAN 2015 data, which consists of 4 languages (English, Dutch, Spanish and Greek). Each language includes 100 instances for which one unknown piece and one to five known pieces are given (Stamatatos et al., 2015). The number of authors is unknown, and the genre of text varies. We will refer to this dataset as pan15.
Some other large datasets are the Enron emails corpus; a set of IMDB reviews; and the Blog Authorship Corpus (Schler et al., 2006), a large sample of personal blogs. We use the last of these as it includes a sufficiently large number of authors for our purposes: we extracted a subset of 1950 authors that contains all blogs with at least 1500 words, and retain as the text the first 1000 words. The average number of samples per author is 2.83, and the average vocabulary size under character-level tokenization is 270. We refer to this dataset as bl-2K.
In addition, we use a more recent dataset put together by Fernandes et al. (2019). Like the PAN 2018 attribution task, it consists of fanfiction; we choose this dataset as it has more authors. It was collected from fanfiction.net from the five most popular fandoms (“Harry Potter”, “Hunger Games”, “Lord of the Rings”, “Percy Jackson and the Olympians” and “Twilight”). We observe that having authors writing on similar topics (within a small number of “fandoms”) means that methods cannot rely on topic cues. From this we have put together 4 subsets of varying numbers of randomly chosen authors (100, 1K, 5K and 10K). Each text consists of 2000 words. The average number of samples per author is 2.1, and the average vocabulary size under character-level tokenization is 365. We will refer to these datasets as ff-n, where n is the number of authors.
We also use the dataset from the author verification task of PAN 2020. There are two versions of the dataset, large and small, where the latter is a subset of the former. Like many of the PAN participants, we train our network on the smaller version, although we also use the larger one as described below in §4.2.2. The small training data for author verification includes over 50K pairs extracted from 1600 fandoms written by 6400 authors in positive pairs (i.e. pairs written by the same author) and 48500 authors in negative pairs (i.e. pairs written by different authors). Each text piece has an approximate length of 21000 characters and an average word count of 4875. We use 90% of the data for training and 10% for validation, and will refer to this dataset as pan20. (The testset used in the PAN 2020 task is not publicly available.) As this dataset is for the verification task, we need to transform it to our N-way setup as described below.
For all datasets, we did not employ any specific pre-processing such as lemmatization or lower-casing, nor did we replace digits, letters or punctuation, as these can be indicators of authorship. PAN 2020, however, applied white space and punctuation normalization when preparing the data (Kestemont et al., 2019).
Figure 2: Schematic view of (a) how pieces are randomly paired to create similar and different entries for train and test; (b) test-set data for binary classification (author verification); (c) one set of the 5-way OneShot task randomly selected from the testset. Each piece is labelled as the jth piece written by the ith author.
4.2.2 Training Data
To produce a reasonable number of samples, we divide each text into 8 pieces: if there are N authors and M documents per author, this gives 8 × N × M pieces. In order to generate same/different author pairs for training the Siamese networks, the pieces are divided into 4 chunks, which are then paired as follows. Pieces included in the first two chunks for an author A (colored blue in Figure 2(a)) are randomly paired together to create same-author pairs. For different-author pairs, pieces in the third chunk (light gray in Figure 2(a)) for author A are paired with pieces in the fourth chunk (dark gray in Figure 2(a)) for some other author B; both selections (of author B and of sample piece) are randomized. In this way, we make sure none of the samples forming similar pairs are used more than once. Figure 2(b) illustrates a schematic train/test set, in which each piece is labelled as the jth piece written by the ith author. Table 1 shows the number of pairs making up these datasets. The final train and test sets are balanced in the number of similar and different pairs.
Table 1: Number of pairs included in the train and test sets for the Siamese network. Table notes: (i) extracted from PAN 2020 large, only containing authors that exist in the training set; (ii) extracted from PAN 2020 large, authors are disjoint from the training set.
We keep 10% of the training set for validation data.
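The pairing procedure for the Siamese training data can be sketched as follows. This is an illustration of the chunk-based scheme rather than our actual data pipeline: the function names are hypothetical, and the sketch assumes that each author's pieces are chunked in their original order and that the number of pieces per author is divisible by 4.

```python
import random

def split_into_pieces(words, n_pieces=8):
    """Cut one tokenised text into n_pieces equal-sized pieces."""
    size = len(words) // n_pieces
    return [words[k * size:(k + 1) * size] for k in range(n_pieces)]

def make_pairs(pieces_by_author, seed=0):
    """Return (same_author_pairs, different_author_pairs) with labels 1 / 0."""
    rng = random.Random(seed)
    chunks = {}
    for author, pieces in pieces_by_author.items():
        quarter = len(pieces) // 4
        chunks[author] = [pieces[k * quarter:(k + 1) * quarter]
                          for k in range(4)]
    authors = sorted(chunks)
    same, different = [], []
    for a in authors:
        first, second, third, _ = chunks[a]
        partners = list(second)
        rng.shuffle(partners)
        # Chunks 1 and 2 of the same author: each piece is used exactly once.
        same.extend((p, q, 1) for p, q in zip(first, partners))
        # Chunk 3 of author a against chunk 4 of a randomly chosen author b.
        for p in third:
            b = rng.choice([x for x in authors if x != a])
            different.append((p, rng.choice(chunks[b][3]), 0))
    return same, different

# Toy data: three authors with eight pieces each, named by author and index.
toy = {a: [f"{a}{k}" for k in range(8)] for a in "ABC"}
same_pairs, diff_pairs = make_pairs(toy)
```

With two pieces per chunk, each author contributes two same-author and two different-author pairs, so the resulting set is balanced by construction.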
PAN Datasets The text chunks in the PAN 2015 dataset do not come in pairs. We employed four different strategies for pairing, which enable us to study the effect of text size as well as the order of chunks:
(A) Every unknown entry is paired with each of the elements in the known set. The whole text from both unknown and known samples is used as one piece.
(B) Every unknown entry is paired with each of the elements in the known set. Only the beginning part of both unknown and known samples is used, with its size equal to the size of the shortest sample in the language cases.
(C) Each sample in the language cases is divided into three equal pieces. First, second and third pieces from the unknown sample are paired with first, second and third pieces in the known samples respectively.
(D) Each sample in the language is divided into three equal pieces. First, second and third pieces from the unknown sample are randomly paired with first, second and third pieces in each known sample.
The pan20 dataset provided positive and negative pairs set up for the verification task. To construct data for our experiments:
• For the OneShot setup, we extracted 12,000 pairs from the PAN 2020 large training subset, such that no authors were in our training set. This testset includes 13,000 authors, making it our largest. We constructed N-way sets from this.
• For the KnownAuth setup, we extracted 12,000 pairs from the PAN 2020 large training subset, and similarly constructed N-way sets from this. This testset includes 12,689 authors, all of which are included in the pan20 training set.
4.2.3 Test Data and Evaluation Metric
For N-way evaluation, we randomly create 500 sets of N-way authors from the appropriate test set (KnownAuth or OneShot), and we calculate the accuracy in predicting the correct author. Final results are based on the average of three runs of different sets of 500. For verification, we report results on all elements of the test set.
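A minimal sketch of this N-way evaluation loop is given below. It is not our evaluation code: the function names, the author-to-texts dictionary and the toy similarity function are hypothetical, and `similarity` stands in for any scoring model (a trained Siamese network or a conventional similarity method).

```python
import random

def n_way_accuracy(texts_by_author, similarity, n=5, n_sets=500, seed=0):
    """Fraction of N-way sets where the highest-scoring candidate is by
    the same author as the query text."""
    rng = random.Random(seed)
    authors = sorted(texts_by_author)
    eligible = [a for a in authors if len(texts_by_author[a]) >= 2]
    correct = 0
    for _ in range(n_sets):
        target = rng.choice(eligible)
        query, match = rng.sample(texts_by_author[target], 2)
        others = rng.sample([a for a in authors if a != target], n - 1)
        # Exactly one candidate is by the target author.
        candidates = [(match, True)]
        candidates += [(rng.choice(texts_by_author[a]), False) for a in others]
        rng.shuffle(candidates)
        best = max(candidates, key=lambda c: similarity(query, c[0]))
        correct += best[1]
    return correct / n_sets

def mean_over_runs(texts_by_author, similarity, n=5, runs=3):
    """Average accuracy over several runs, each with different random sets."""
    scores = [n_way_accuracy(texts_by_author, similarity, n, seed=s)
              for s in range(runs)]
    return sum(scores) / runs

# Toy check with an oracle similarity that reads the author id from the text.
toy = {f"a{i}": [f"{i}-x", f"{i}-y", f"{i}-z"] for i in range(8)}
oracle = lambda q, c: float(q.split("-")[0] == c.split("-")[0])
accuracy = mean_over_runs(toy, oracle, n=5)
```

For a model with no information, the expected accuracy is 1/N, which is the natural chance level against which N-way results can be read.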
Additional PAN 2020 Evaluation In addition to our core evaluation above, for pan20 we calculate the evaluation metrics used in the PAN 2020 verification task, using the provided code. The four metrics are:
• AUC (sometimes referred to as ROC-AUC), which calculates the area under curve score.
• F1, the conventional F1 score calculated using precision and recall.
• F0.5u, a new measure emphasising correct same-author predictions.
• C@1, which is very similar to the conventional F1 score. C@1 rewards the model if hard problems are not answered (i.e. a score of 0.5 is considered to indicate “I don’t know”).
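To illustrate how leaving problems unanswered is rewarded, the sketch below computes C@1 under the commonly used definition (n_c + n_u · n_c / n) / n, where n is the number of problems, n_c the number answered correctly and n_u the number left unanswered. The function name is hypothetical and this is an assumption about the metric's form; the official PAN evaluation code remains the authoritative implementation.

```python
def c_at_1(true_labels, scores, threshold=0.5):
    """C@1 under the assumed definition (n_c + n_u * n_c / n) / n.
    A score exactly equal to the threshold counts as 'unanswered'."""
    n = len(true_labels)
    n_unanswered = sum(1 for s in scores if s == threshold)
    n_correct = sum(1 for y, s in zip(true_labels, scores)
                    if s != threshold and (s > threshold) == bool(y))
    return (n_correct + n_unanswered * n_correct / n) / n

labels = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.5, 0.7]   # third problem unanswered, fourth is wrong
value = c_at_1(labels, scores)  # (2 + 1 * 2 / 4) / 4 = 0.625
```

Here two of the four problems are answered correctly, so a system that guessed wrongly on the third problem would score 0.5 in accuracy, whereas abstaining lifts the C@1 value to 0.625.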
4.3 Baselines
4.3.1 Similarity
As noted in §1, the most prominent conventional similarity-based authorship method is by Koppel et al. (2011). To our knowledge, this is the only available method that can be used as is in our one-shot experimental setup.
We used as a starting point code from a reproducibility study (Potthast et al., 2016); we reimplemented it to improve performance. We refer to this as Koppel.
4.3.2 Classification
As noted in §3, the sub-networks in our Siamese architecture are similar to the high-performing method of Ruder et al. (2016). We use an individual sub-network as our classification architecture. We refer to this as cnn.
Table 2: Network parameters
As another baseline, we consider the type of approach based on language model pre-training that has recently come to dominate performance in many NLP tasks. In these, pretrained language representations can be used either as additional features in a task-specific architecture (e.g. ELMo: Peters et al. (2018)) or via transfer learning and the fine-tuning of parameters for a specific task (e.g. GPT: Radford et al. (2018)). BERT (Devlin et al., 2019) is an approach that, when it was introduced, produced state-of-the-art performance on a range of NLP tasks set up as the GLUE benchmark (Wang et al., 2018): it gave the best performance on all tasks in this suite, including sentiment classification, prediction of grammatical acceptability, textual similarity, paraphrase, and natural language inference; improvements on many of the tasks were quite large with respect to previous state of the art. Later analysis (Tenney et al., 2019) has shown that BERT can perform across levels of linguistic analysis, from low (e.g. part-of-speech tagging) to high (e.g. semantic roles).
We therefore use BERT fine-tuned for our classification task as our second baseline. We do this by feeding the output of BERT to a dense layer, and carrying out a small amount of extra training.
4.4 Implementation Details
4.4.1 Siamese Networks
In terms of structure, each sub-network consists of an embedding layer, four convolutional layers, and a dense layer (resp. Emb, Convn, D in Table 2).
The Siamese networks are trained on the verification task, for at most 25 epochs. All initializations are random, and training is restarted if after the 10th epoch the verification accuracy is smaller than 0.55.
The epoch we select for the final result is the one with best verification accuracy on the validation set.
In terms of hyper-parameters, we use a learning rate of 0.0005, Adam as optimizer, and batch size of 25.
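The training protocol above (a cap of 25 epochs, a restart when accuracy is still below 0.55 after the 10th epoch, and selection of the best epoch on validation data) can be sketched as a generic loop. The callbacks `make_model`, `train_epoch` and `validate`, the `max_restarts` limit and the choice to check validation accuracy at exactly epoch 10 are assumptions made for the illustration; in practice the model weights of the best epoch would also be checkpointed.

```python
def train_with_restarts(make_model, train_epoch, validate, max_epochs=25,
                        check_epoch=10, min_accuracy=0.55, max_restarts=5):
    """Train with random re-initialisation if a run fails to get going.
    Returns (model, best_epoch, best_accuracy) for the accepted run."""
    for _ in range(max_restarts + 1):
        model = make_model()                      # fresh random initialisation
        best_accuracy, best_epoch = -1.0, None
        abandoned = False
        for epoch in range(1, max_epochs + 1):
            train_epoch(model)
            accuracy = validate(model)
            if accuracy > best_accuracy:
                best_accuracy, best_epoch = accuracy, epoch
            if epoch == check_epoch and accuracy < min_accuracy:
                abandoned = True                  # stuck near chance: restart
                break
        if not abandoned:
            return model, best_epoch, best_accuracy
    raise RuntimeError("no run passed the accuracy check")

# Toy callbacks: the first initialisation never learns, the second one does.
attempts = []

def make_model():
    attempts.append(1)
    return {"epoch": 0, "learns": len(attempts) > 1}

def train_epoch(model):
    model["epoch"] += 1

def validate(model):
    if not model["learns"]:
        return 0.5
    return min(90, 50 + 5 * model["epoch"]) / 100

model, best_epoch, best_accuracy = train_with_restarts(
    make_model, train_epoch, validate)
```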
As noted above, for the LSTM alternatives to the main models, we look at both unidirectional (left-to-right) and bidirectional. In these, we use a hidden layer with 200 nodes. For the LSTM variants, hyperparameters have the same settings.
4.4.2 Koppel
Koppel has few parameters. The maximum number of character 4-grams is set to 20,000 as in the replication code; the actual number of character 4-grams in our data is always lower than this. The replication code samples 50% of the features, and repeats this 100 times, which Koppel et al. (2011) found to produce good results. The replication code also by default uses the Ruzicka metric rather than cosine similarity (which we also found to perform better). There is an additional parameter, a threshold for a ‘don’t know’ option; we always make a choice, and so set this threshold to be 0.
Table 3: Verification: accuracy on ff, bl-2K and PAN datasets.
We reimplemented the replication code to be more efficient, in order to run on larger numbers of authors: the replication code did not, for example, have efficient implementations of vector arithmetic. We verified that the replication code and our reimplementation performed the same on the PAN 2011 and 2012 and ff-100 datasets. Results in the paper are all from our reimplementation.
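For orientation, the core of this style of method (repeated random feature sampling over character 4-gram counts, with the Ruzicka min-max similarity) can be sketched as below. This is a simplified illustration, not the replication code or our reimplementation: the function names are hypothetical, "space-free" is approximated by discarding any 4-gram that contains whitespace, and raw counts are used without any weighting.

```python
import random
from collections import Counter

def char_ngrams(text, n=4):
    """Counts of character n-grams containing no whitespace (simplification)."""
    grams = (text[i:i + n] for i in range(len(text) - n + 1))
    return Counter(g for g in grams if not any(ch.isspace() for ch in g))

def ruzicka(a, b, features):
    """Min-max similarity restricted to the sampled features."""
    num = sum(min(a[f], b[f]) for f in features)
    den = sum(max(a[f], b[f]) for f in features)
    return num / den if den else 0.0

def attribute(unknown, candidates, n_iter=100, frac=0.5,
              max_features=20000, seed=0):
    """Return (author, share of iterations in which that author was closest).
    With a 'don't know' threshold of 0, the top author is always returned."""
    rng = random.Random(seed)
    u = char_ngrams(unknown)
    cand = {name: char_ngrams(text) for name, text in candidates.items()}
    totals = Counter(u)
    for counts in cand.values():
        totals.update(counts)
    features = [f for f, _ in totals.most_common(max_features)]
    k = max(1, int(len(features) * frac))
    wins = Counter()
    for _ in range(n_iter):
        subset = rng.sample(features, k)          # random half of the features
        best = max(cand, key=lambda name: ruzicka(u, cand[name], subset))
        wins[best] += 1
    author, count = wins.most_common(1)[0]
    return author, count / n_iter

known = {
    "a": "the quick brown fox jumps over the lazy dog again and again",
    "b": "zzzz yyyy xxxx wwww vvvv uuuu tttt",
}
author, share = attribute("the quick brown fox likes the lazy dog", known)
```

The nested Python loops make this sketch slow for thousands of authors; an efficient version would express the min and max sums as vectorised array operations over a shared feature matrix.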
4.4.3 CNN Classification
The CNN classification model is trained for at most 150 epochs, and the epoch with the best validation accuracy across all classes is selected.
4.4.4 BERT Classification
To fine-tune BERT for authorship attribution, we trained for 3 epochs, as did Devlin et al. (2019) for all GLUE benchmark tasks. BERT takes as input sentences, so we segmented our input at the period character. (Other segmentations produced similar results, although they declined more steeply for larger N.)
5.1 Verification
Core Results Table 3 gives the results for author verification for our two Siamese variants. (For PAN 2015, this is the best result for the English subset of the data.) As expected, accuracy improves with more training data: it starts very low when there are only 100 authors to learn a notion of similarity from, increasing rapidly when there are 1000 authors to 0.980 for the weighted-distance variant and 0.978 for Siamcos; there is no improvement for 10000 authors. The scores on bl-2K are very close to what we have for the similarly sized ff datasets. For the PAN datasets, again the larger has higher scores, although with PAN 2020 being the largest dataset and with the longest texts, it may seem surprising that the results are not higher. To handle the larger text sizes (text chunks in pan20 are approximately 8 times larger than for ff), the network needs many more epochs to converge. The task is nevertheless still more complex, even with additional training, as there are more fandoms in pan20 compared to ff (1600 vs 5, respectively). This allows a stricter setup, making the problem more complex: in pan20, no positive pairs exist where the texts are from the same fandom.
In terms of our energy function, for all of the results, there is no clear preference regarding
Additional Results: PAN 2015 As noted in 4.2.2, the format of pan15 allowed us to try different pairing strategies to see which worked best as training data for a Siamese network for verification. Results of all variants are in Table 4. There are clear differences: in the PAN 2015 setup where there are many known author texts for each unknown author snippet, pairing this snippet up multiple times provides the highest results.
Table 4: Best Results on Author Verification for English subset of pan15.
Results for the other languages in the dataset, using the setting under which we got the highest accuracy for the English dataset (A), are shown in Table 5. Here the datasets are small enough that results are sometimes not much over random chance (0.5 in our verification setup), reinforcing that larger datasets are necessary for these deep Siamese techniques.
Table 5: Author verification accuracy for the Dutch, Spanish and Greek datasets of pan15.
Additional Results: PAN 2020 As noted in 4.2, the official test set for PAN 2020 is not available. Results for pan20 are provided in Table 6, under our setup where we use 90% of pan20 for training and the rest for validation. In addition to the usual accuracy, we also calculated the four PAN 2020 metrics described in
While this cannot be directly compared with results in PAN 2020, if our verification set were to have the same distributional properties and other characteristics as the official test set, Siamcos would rank second among competition systems, according to the data provided in Kestemont et al. (2020).
Table 6: Author verification accuracy, AUC, F1, C@1, and F0.5u scores on pan20 using the two variants of the Siamese model.
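As a rough illustration of the two less familiar metrics in Table 6, the following sketch implements C@1 and F0.5u as they are commonly defined for the PAN verification tasks. The function names, the 0/1 label encoding, and the convention that a score of exactly 0.5 counts as a non-answer are assumptions of this sketch; it is not our evaluation code.

```python
from typing import Sequence


def c_at_1(truth: Sequence[int], scores: Sequence[float]) -> float:
    """C@1: accuracy that rewards leaving hard cases unanswered.

    truth holds 1 (same author) or 0 (different author); a score of
    exactly 0.5 is treated as a non-answer (assumed convention).
    """
    n = len(scores)
    n_correct = 0
    n_unanswered = 0
    for y, s in zip(truth, scores):
        if s == 0.5:
            n_unanswered += 1
        elif (s > 0.5) == (y == 1):
            n_correct += 1
    return (n_correct + n_unanswered * n_correct / n) / n


def f_05_u(truth: Sequence[int], scores: Sequence[float]) -> float:
    """F0.5u: precision-weighted F-measure; non-answers count as false negatives."""
    tp = fp = fn = unanswered = 0
    for y, s in zip(truth, scores):
        if s == 0.5:
            unanswered += 1
        elif s > 0.5 and y == 1:
            tp += 1
        elif s > 0.5 and y == 0:
            fp += 1
        elif s < 0.5 and y == 1:
            fn += 1
    denominator = 1.25 * tp + 0.25 * (fn + unanswered) + fp
    return 1.25 * tp / denominator if denominator else 0.0
```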
As observed in text chunks are up to 8 times larger than the chunks used for training the Siamese model on the ff datasets. This provides us with an opportunity to analyse the effect of text size on accuracy on both Siamese variants.
Figure 3 shows that, as expected, the larger the text chunk, the more authorial features there are to rely on and the higher the accuracy. Siamcos is consistently above all text chunk sizes on this dataset. Interestingly, there is a steady linear relation between Siamcos’s accuracy and the text chunk size, while
shows a jump when moving from size 6000 to 7000 characters.
Figure 3: Effect of text length on the accuracy of
5.2 N-Way One-Shot
5.2.1 Core Model
The verification results above indicate that 100 authors do not provide enough data for the Siamese networks to train to a high level of performance, and that results for 10000 authors are roughly the same as for 5000 authors. For the N-way one-shot scenario, then, Table 7 presents results for ff-1K, ff-5K, bl-2K and pan20.
Table 7: Results under the OneShot scenario on ff-1K, ff-5K, bl-2K and pan20: N-way classification accuracy.
Figure 4: Results on ff-5K under the OneShot scenario: accuracy under N-way classification.
We make the following observations:
• All results are much higher than chance (= 1/N), and naturally degrade as N increases.
• Siamcos has the best results for all datasets and for all N over those datasets. while not as good as Siamcos, is almost always better than Koppel; the exception is pan20.
• Performance on the datasets for is generally similar on each dataset for corresponding values of N. pan20 results are a little lower for all N (e.g. for Siamcos, for N = 2: 0.930 vs 0.983–0.990; for N = 100: 0.480 vs 0.753–0.930). There are two factors here acting in opposite directions: making the task more difficult than on other datasets is the very large number of authors, but making it easier is the larger text size. We explore the effect of text size in
• Koppel performs relatively better on bl-2K (which was its original test corpus) than on the ff corpora, that is, comparing like Ns against the performance of Koppel on the other corpora; for example, it scores 0.617 accuracy on bl-2K for 5-way comparison on this dataset of 2000 authors, versus 0.533 for 5-way comparison on ff-1K. (However, its performance is still lower than similarly performs relatively better on pan20, where it also beats
. In the case of bl-2K, this may be because of the observation in Koppel et al. (2011) that the method could rely on topic clues, which they see as reflective of real-world use, as blog posts by a given author commonly share topics that other authors may not discuss; in the ff datasets, the topics (fandoms) are shared across many authors. This is not likely to be the explanation for pan20, where the dataset pairs do not share a topic (fandom); rather, it may be because the texts are larger. Siamcos seems relatively invariant to either of these effects.
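The N-way evaluation behind these observations can be sketched as follows: each trial pairs one unknown text with N candidate texts, exactly one of which is by the same author, and the prediction is the candidate with the highest similarity. The names `n_way_predict`, `n_way_accuracy` and `chance_level` are hypothetical, and `similarity` is any scoring function over two representations (for instance the learned Siamese similarity); this is an illustration under those assumptions, not our evaluation code.

```python
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]
Similarity = Callable[[Vector, Vector], float]
# One trial: (query, N candidates, index of the candidate by the true author).
Trial = Tuple[Vector, Sequence[Vector], int]


def n_way_predict(query: Vector, candidates: Sequence[Vector],
                  similarity: Similarity) -> int:
    """Return the index of the candidate most similar to the query."""
    return max(range(len(candidates)),
               key=lambda i: similarity(query, candidates[i]))


def n_way_accuracy(trials: Sequence[Trial], similarity: Similarity) -> float:
    """Fraction of trials in which the true author's text is ranked first."""
    correct = sum(
        n_way_predict(query, candidates, similarity) == target
        for query, candidates, target in trials
    )
    return correct / len(trials)


def chance_level(n: int) -> float:
    """Expected accuracy of uniform random guessing among n candidates."""
    return 1.0 / n
```

For N = 100, for example, `chance_level(100)` is 0.01, which is the baseline against which the accuracies in Table 7 should be read.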
Table 8: N-way accuracy under word- and character-level inputs on ff-1K under the
Table 9: N-way accuracy using LSTMs vs CNNs on ff-1K under the OneShot scenario
5.2.2 Model and Training Alternatives
Character versus Word In addition to the architectural choices for our primary model as described in 3, we also tried word-level inputs, and these as expected performed consistently worse, indicating stylistic features can be better identified through characters. Table 8 shows a comparison between word- and character-level inputs on ff-1K under Siamcos and
It is apparent that the difference is large and gets dramatically larger as N increases. In the pre-deep-learning era, Kešelj et al. (2003) argued that character-level representations better capture stylistic characteristics for authorship; this is supported by these results, and indicates that it continues to be true for deep learning representations.
CNN versus LSTM Another alternative discussed in 3 was to use LSTMs for sub-networks; results, again on ff-1K, are in Table 9. Here also the results of the LSTMs are substantially worse than the core Siamcos model using a CNN, and only the biLSTM for the largest N is better than the core
model. In addition, in terms of practical use for these larger N, the LSTM-based networks are much slower to train, taking orders of magnitude longer: for the 100-way setup, one epoch took around a day. They are clearly infeasible for this kind of authorship identification setup.
We also considered both as a variant of
and the Ruzicka or minmax metric as a variant of Siamcos, as this latter has been found to be an improvement of Koppel et al. (2011) by Koppel and Winter (2014). Again, results were consistently poorer and we do not present them.
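For reference, the Ruzicka (minmax) similarity between two non-negative feature vectors is the sum of their element-wise minima divided by the sum of their element-wise maxima. A minimal sketch follows; the function name and the convention of returning 1.0 for two all-zero vectors are assumptions of this illustration.

```python
from typing import Sequence


def ruzicka_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Ruzicka (minmax) similarity for non-negative vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    # Two all-zero vectors are treated as identical (assumed convention).
    return numerator / denominator if denominator > 0 else 1.0
```

The value is 1.0 for identical vectors and 0.0 for vectors with no overlapping non-zero features.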
Table 10: N-way classification accuracy under the OneShot scenario on chunks of at most 2700 characters in pan20 pairs, compared to scores for full-length texts from Table 7, using
Training Pair Size As noted in 4.2, the individual texts of pan20 are much larger than for the other datasets. As in
5.1, we therefore use this dataset to examine the effect of text size, here under the N-way setup. In Table 10 we compare the results of training the core model on texts that have the same average size as the ff-1K dataset — 2700 characters — with the results from Table 7 on the full texts. For both
, results are quite dramatically lower relative to the full-sized texts, and also relative to corresponding values of N for the other datasets. As discussed in
5.2.1, the very large number of authors produces this much lower result.
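One simple way to obtain texts of bounded size for such a comparison is to cut each document into consecutive character chunks. The sketch below does this with a default limit of 2700 characters; the function name and the choice to cut at fixed character offsets (rather than, say, at sentence boundaries) are assumptions, and our actual chunking may differ.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 2700) -> List[str]:
    """Split `text` into consecutive chunks of at most `max_chars` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```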
5.3 N-Way Known Author
Table 11 shows the results for ff under the KnownAuth scenario: we chose the smallest of the three datasets from Table 7, ff-1K, so that the classification approach would be competitive.
• For the smallest case, of N = 2, CNN classification does better than the traditional Koppel similarity, although it degrades much more quickly as N grows: this conforms to the general belief that conventional similarity methods work better for large numbers of authors.
• BERT follows a similar pattern. It starts slightly lower than CNN — as it uses word-level (or at least word piece-level) representations, this is not unexpected, in spite of its strong performance on other tasks — but degrades more slowly.
• Our new methods, , behave similarly to the OneShot scenario. Siamcos starts as the highest at N = 2, and stays the best until N = 500, at which point Koppel is slightly higher.
• Comparing Table 7 and Table 11, it can be seen that the Siamese scores are uniformly lower for equivalent N. This is because the network receives as input only 3/4 of the data per author (with 1/4 held out for KnownAuth testing). We would expect that, with quantities of training text per author that are similar to the OneShot scenario, we would see the same higher levels of accuracy for the Siamese methods.
Table 11: Results under the KnownAuth scenario on ff-1K: N-way classification accuracy.
Table 12 gives results for pan20. For this dataset we only consider Siamcos, given the very large number of authors. As elsewhere, Siamcos is the best; and, as observed above for ff-1K, results on KnownAuth are lower than under OneShot (Table 7) for corresponding values of N, supporting our suggestion that the cause is the smaller amount of training data per pair in the KnownAuth setup.
Table 12: N-way classification accuracy on pan20 under KnownAuth scenario.
In this work we have presented an investigation of the application of a Siamese network architecture to large-scale stylistic author attribution. Our system learns a general notion of authorship, strongly outperforming the key similarity-based method in one-shot N-way evaluation, and also performing well in a known-author context. While there is no clear difference between the metric and cosine similarity in the verification task, the latter is substantially better in a task of choosing among N authors, both in open-set and closed-set contexts. We also find that CNNs in general perform well in terms of the architectural structure of the Siamese subnetworks, and that for large numbers of authors LSTMs take infeasibly long to train.
There are two key directions we are exploring for future work. One is with respect to the type of metric. While we focussed on the most commonly used, notably and cosine similarity, hyperbolic distance has been shown to be useful in recent work on preserving privacy in the face of authorship attribution attacks (Feyisetan et al., 2019); using such a metric in authorship attribution could similarly be useful. The other future direction is in terms of exploring other architectures used for one-shot tasks in image processing, such as the matching networks of Vinyals et al. (2016): that work, for instance, incorporates other ideas from metric learning into deep learning, and aims to improve over a regular Siamese architecture by a better alignment of training objective to the N-way classification task, and by incorporating various deep learning mechanisms such as attention. Given the success of attention-based models across a wide range of NLP tasks (Vaswani et al., 2017), this could be a promising next step.
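To give a sense of what a hyperbolic metric might look like in this setting, the sketch below computes the distance between two points in the Poincaré ball, one common model of hyperbolic space. Whether this particular model would be the right choice as a Siamese energy function is an open question; the function name is hypothetical and the code is illustrative only.

```python
import math
from typing import Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Hyperbolic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))
```

Distances grow rapidly as points approach the boundary of the ball, which is the property that makes such spaces attractive for hierarchical structure.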
Ahmed Abbasi and Hsinchun Chen. Writeprints: A Stylometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace. ACM Transactions on Information Systems, 26(2), 2008.
Emir Araujo-Pino, Helena Gómez-Adorno, and Gibran Fuentes Pineda. Siamese network applied to authorship verification. In Linda Cappellato, Carsten Eickhoff, Nicola Ferro, and Aurélie Névéol, editors, Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings. CEUR-WS.org, 2020. URL http://ceur-ws.org/Vol-2696/paper_222.pdf.
Douglas Bagnall. Author identification using multi-headed recurrent neural networks. arXiv preprint arXiv:1506.04891, 2015.
Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. DataStories at SemEval-2017 Task 6: Siamese LSTM with Attention for Humorous Text Comparison. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 390–395, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/S17-2065. URL https://www.aclweb.org/anthology/S17-2065.
Benedikt Boenninghoff, Robert M. Nickel, Steffen Zeiler, and Dorothea Kolossa. Similarity Learning for Authorship Verification in Social Media. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
Benedikt T. Boenninghoff, Julian Rupp, Robert M. Nickel, and Dorothea Kolossa. Deep bayes factor scoring for authorship verification. In Linda Cappellato, Carsten Eickhoff, Nicola Ferro, and Aurélie Névéol, editors, Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings. CEUR-WS.org, 2020. URL http://ceur-ws.org/Vol-2696/paper_151.pdf.
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a “siamese” time delay neural network. In Advances in neural information processing systems, pages 737–744, 1994.
Ting Chen and Yizhou Sun. Task-guided and path-augmented heterogeneous network embedding for author identification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 295–304. ACM, 2017.
Jianpeng Cheng and Dimitri Kartsaklis. Syntax-Aware Multi-Sense Word Embeddings for Deep Compositional Models of Meaning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1531–1542, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1177. URL https://www.aclweb.org/anthology/D15-1177.
José Eleandro Custódio and Ivandré Paraboni. Each-usp ensemble cross-domain authorship attribution. Working Notes Papers of the CLEF, 2018.
Arpita Das, Harish Yenala, Manoj Chinnakotla, and Manish Shrivastava. Together we stand: Siamese Networks for Similar Question Retrieval. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 378–387, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1036. URL https://www.aclweb.org/anthology/P16-1036.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.
Gareth Dwyer. Novel Approaches to Authorship Attribution. Master’s thesis, University of Groningen and Saarland University, 2017.
Mathias Etcheverry and Dina Wonsever. Unraveling Antonym’s Word Vectors through a Siamese-like Network. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3297–3307, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1319. URL https://www.aclweb.org/anthology/P19-1319.
Hao Fang, Saurabh Gupta, Forrest N. Iandola, Rupesh Kumar Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. From captions to visual concepts and back. In CVPR, pages 1473–1482. IEEE Computer Society, 2015. ISBN 978-1-4673-6964-0. URL http://dblp.uni-trier.de/db/conf/cvpr/cvpr2015.html#FangGISDDGHMPZZ15.
Natasha Fernandes, Mark Dras, and Annabelle McIver. Generalised Differential Privacy for Text Document Processing. In Proceedings of Principles of Security and Trust (POST), volume 11426 of LNCS, pages 123–148, 2019.
Oluwaseyi Feyisetan, Tom Diethe, and Thomas Drake. Leveraging Hierarchical Representations for Preserving Privacy and Utility in Text. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2019.
Lukasz Gagala. Authorship verification with prediction by partial matching and context-free grammar. In Linda Cappellato, Carsten Eickhoff, Nicola Ferro, and Aurélie Névéol, editors, Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings. CEUR-WS.org, 2020. URL http://ceur-ws.org/Vol-2696/paper_240.pdf.
Chuanxing Geng, Sheng-Jun Huang, and Songcan Chen. Recent Advances in Open Set Recognition: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. doi: 10.1109/TPAMI.2020.2981604.
Daniela Gerz. Representation Learning beyond Semantic Similarity: Character-aware and Function-specific Approaches. PhD thesis, Cambridge University, 2020.
Martin Gleize, Eyal Shnarch, Leshem Choshen, Lena Dankin, Guy Moshkowich, Ranit Aharonov, and Noam Slonim. Are You Convinced? Choosing the More Convincing Evidence with a Siamese Network. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 967–976, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1093. URL https://www.aclweb.org/anthology/P19-1093.
Oren Halvani, Lukas Graner, and Roey Regev. Cross-domain authorship verification based on topic agnostic features. In Linda Cappellato, Carsten Eickhoff, Nicola Ferro, and Aurélie Névéol, editors, Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings. CEUR-WS.org, 2020. URL http://ceur-ws.org/Vol-2696/paper_114.pdf.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2nd edition, 2009.
Renkui Hou and Chu-Ren Huang. Robust stylometric analysis and author attribution based on tones and rimes. Natural Language Engineering, 26(1):49–71, 2020. doi: 10.1017/S135132491900010X.
Shengli Hu. Somm: Into the Model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1153–1159, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1146. URL https://www.aclweb.org/anthology/D18-1146.
Minghui Huang, Yanghui Rao, Yuwei Liu, Haoran Xie, and Fu Lee Wang. Siamese network-based supervised topic modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4652–4662, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1494. URL https://www.aclweb.org/anthology/D18-1494.
Catherine Ikae. Unine at PAN-CLEF 2020: Author verification. In Linda Cappellato, Carsten Eickhoff, Nicola Ferro, and Aurélie Névéol, editors, Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings. CEUR-WS.org, 2020. URL http://ceur-ws.org/Vol-2696/paper_110.pdf.
Anirudh Joshi, Tim Baldwin, Richard O. Sinnott, and Cecile Paris. UniMelb at SemEval-2018 Task 12: Generative Implication using LSTMs, Siamese Networks and Semantic Representations with Synonym Fuzzing. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 1124–1128, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/S18-1190. URL https://www.aclweb.org/anthology/S18-1190.
Patrick Juola. An overview of the traditional authorship attribution subtask. In CLEF (Online Working Notes/Labs/Workshop), 2012.
M. Kestemont, Enrique Manjavacas, I. Markov, Janek Bevendorff, Matti Wiegmann, E. Stamatatos, Martin Potthast, and B. Stein. Overview of the cross-domain authorship verification task at PAN 2020. In CLEF, 2020.
Mike Kestemont, Michael Tschuggnall, Efstathios Stamatatos, Walter Daelemans, Günther Specht, Benno Stein, and Martin Potthast. Overview of the Author Identification Task at PAN-2018 Cross-domain Authorship Attribution and Style Change Detection. In CLEF 2018 Evaluation Labs and Workshop, 2018. URL http://ceur-ws.org/Vol-2125/.
Mike Kestemont, Efstathios Stamatatos, Enrique Manjavacas, Walter Daelemans, Martin Potthast, and Benno Stein. Overview of the cross-domain authorship attribution task at PAN 2019. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, pages 1–15, 2019.
Vlado Kešelj, Fuchun Peng, Nick Cercone, and Calvin Thomas. N-Gram-Based Author Profiles for Authorship Attribution. In Proceedings of the Pacific Association for Computational Linguistics (PACLING), pages 255–264, 2003.
Mahmoud Khonji and Youssef Iraqi. A Slightly-modified GI-based Author-verifier with Lots of Features (ASGALF). In Working Notes for CLEF 2014 Conference, 2014. URL http://ceur-ws.org/Vol-1180/CLEF2014wn-Pan-KonijEt2014.pdf.
Yoon Kim. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. Association for Computational Linguistics, 2014. doi: 10.3115/v1/D14-1181. URL http://aclweb.org/anthology/D14-1181.
Alon Kipnis. Higher criticism as an unsupervised authorship discriminator. In Linda Cappellato, Carsten Eickhoff, Nicola Ferro, and Aurélie Névéol, editors, Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings. CEUR-WS.org, 2020. URL http://ceur-ws.org/Vol-2696/paper_228.pdf.
Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
Moshe Koppel and Yaron Winter. Determining if two documents are written by the same author. JASIST, 65(1):178–187, 2014. doi: 10.1002/asi.22954. URL http://dx.doi.org/10.1002/asi.22954.
Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94, 2011.
Tuan Manh Lai, Trung Bui, and Sheng Li. A Review on Deep Learning Techniques Applied to Answer Selection. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2132–2144, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/C18-1181.
Ahmed M Mohsen, Nagwa M El-Makky, and Nagia Ghanem. Author identification using deep learning. In 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 898–903. IEEE, 2016.
Jonas Mueller and Aditya Thyagarajan. Siamese Recurrent Architectures for Learning Sentence Similarity. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), pages 2786–2792. AAAI Press, 2016.
Benjamin Murauer, Michael Tschuggnall, and Günther Specht. Dynamic parameter search for cross-domain authorship attribution. Working Notes of CLEF, 2018.
Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. Learning Text Similarity with Siamese Recurrent Networks. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 148–157, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/W16-1617. URL https://www.aclweb.org/anthology/W16-1617.
Juanita Ordoñez, Rafael Rivera Soto, and Barry Chen. Will longformers PAN out for authorship verification? Notebook for PAN at CLEF 2020. In Linda Cappellato, Carsten Eickhoff, Nicola Ferro, and Aurélie Névéol, editors, Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings. CEUR-WS.org, 2020. URL http://ceur-ws.org/Vol-2696/paper_220.pdf.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202.
Martin Potthast, Sarah Braun, Tolga Buz, Fabian Duffhauss, Florian Friedrich, Jörg Marvin Gülzow, Jakob Köhler, Winfried Lötzsch, Fabian Müller, Maike Elisa Müller, Robert Paßmann, Bernhard Reinke, Lucas Rettenmeier, Thomas Rometsch, Timo Sommer, Michael Träger, Sebastian Wilhelm, Benno Stein, Efstathios Stamatatos, and Matthias Hagen. Who Wrote the Web? Revisiting Influential Author Identification Research Applicable to Information Retrieval. In Nicola Ferro, Fabio Crestani, Marie-Francine Moens, Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Claudia Hauff, and Gianmaria Silvello, editors, Advances in Information Retrieval. 38th European Conference on IR Research (ECIR 16), volume 9626 of Lecture Notes in Computer Science, pages 393–407, Berlin Heidelberg New York, March 2016. Springer.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018.
Taraka Rama. Siamese Convolutional Networks for Cognate Identification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1018–1027, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee. URL https://www.aclweb.org/anthology/C16-1097.
Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3980–3990, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1410. URL https://www.aclweb.org/anthology/D19-1410.
João António Rodrigues, Chakaveh Saedi, Vladislav Maraev, Joao Silva, and António Branco. Ways of asking and replying in duplicate question detection. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 262–270, 2017.
Gaetano Rossiello, Alfio Gliozzo, Robert Farrell, Nicolas Fauceglia, and Michael Glass. Learning Relational Representations by Analogy using Hierarchical Siamese Networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3235–3245, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1327. URL https://www.aclweb.org/anthology/N19-1327.
Sebastian Ruder, Parsa Ghaffari, and John G Breslin. Character-level and multi-channel convolutional neural networks for large-scale authorship attribution. arXiv preprint arXiv:1609.06686, 2016.
Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James Pennebaker. Effects of age and gender on blogging. In Proceedings of the AAAI Spring Symposia on Computational Approaches to Analyzing Weblogs, 2006.
Shachar Seidman. Authorship Verification Using the Imposters Method. In Working Notes for CLEF 2013 Conference, 2013. URL http://ceur-ws.org/Vol-1179/CLEF2013wn-PAN-Seidman2013.pdf.
Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert. Authorship attribution with latent dirichlet allocation. In Proceedings of the fifteenth conference on computational natural language learning, pages 181–189. Association for Computational Linguistics, 2011.
Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert. Authorship Attribution with Topic Models. Computational Linguistics, 40(2):269–310, June 2014. ISSN 0891-2017. doi: 10.1162/COLI_a_00173. URL http://dx.doi.org/10.1162/COLI_a_00173.
Kashif Shah, Selcuk Kopru, and Jean-David Ruvini. Neural Network based Extreme Classification and Similarity Models for Product Matching. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 8–15, New Orleans - Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-3002. URL https://www.aclweb.org/anthology/N18-3002.
Efstathios Stamatatos. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556, 2009.
Efstathios Stamatatos, Walter Daelemans, Ben Verhoeven, Patrick Juola, Aurelio López-López, Martin Potthast, and Benno Stein. Overview of the Author Identification Task at PAN 2015. In Linda Cappellato, Nicola Ferro, Gareth Jones, and Eric San Juan, editors, CLEF 2015 Evaluation Labs and Workshop – Working Notes Papers, 8-11 September, Toulouse, France. CEUR-WS.org, September 2015. URL http://ceur-ws.org/Vol-1391.
Yi Tay, Luuanh Tuan, and Siucheung Hui. Hyperbolic Representation Learning for Fast and Efficient Neural Question Answering. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM), pages 583–591, 2018.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy, July 2019. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P19-1452.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching Networks for One Shot Learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 3637–3645, USA, 2016. Curran Associates Inc. ISBN 978-1-5108-3881-9. URL http://dl.acm.org/citation.cfm?id=3157382.3157504.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. URL https://www.aclweb.org/anthology/W18-5446.
Janith Weerasinghe and Rachel Greenstadt. Feature vector difference based neural network and logistic regression models for authorship verification. In Linda Cappellato, Carsten Eickhoff, Nicola Ferro, and Aurélie Névéol, editors, Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings. CEUR-WS.org, 2020. URL http://ceur-ws.org/Vol-2696/paper_125.pdf.
Wen-tau Yih, Xiaodong He, and Christopher Meek. Semantic Parsing for Single-Relation Question Answering. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 643–648. Association for Computational Linguistics, 2014. doi: 10.3115/v1/P14-2105. URL http://aclweb.org/anthology/P14-2105.
Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. Transactions of the Association for Computational Linguistics, 4:259–272, 2016. doi: 10.1162/tacl_a_00097. URL https://www.aclweb.org/anthology/Q16-1019.
Chuxu Zhang, Chao Huang, Lu Yu, Xiangliang Zhang, and Nitesh V Chawla. Camel: Content-aware and meta-path augmented metric learning for author identification. In Proceedings of the 2018 World Wide Web Conference, pages 709–718. International World Wide Web Conferences Steering Committee, 2018.