Summary and Distance between Sets of Texts based on Topological Data Analysis
2019·arXiv
Abstract

In this paper, we use topological data analysis (TDA) tools such as persistent homology, persistent entropy and bottleneck distance to provide a TDA-based summary of any given set of texts and a general method for computing a distance between any two literary styles, authors or periods. To this aim, deep-learning word-embedding techniques are combined with these tools in order to study topological properties of texts embedded in a metric space. As a case study, we use the written texts of three poets of the Spanish Golden Age: Francisco de Quevedo, Luis de Góngora and Lope de Vega. As far as we know, this is the first time that word embedding, bottleneck distance, persistent homology and persistent entropy are used together to characterize texts and to compare different literary styles.

Keywords: Topological data analysis, Word embedding, persistent entropy, bottleneck distance, literary styles, Deep Learning

Topology is the branch of mathematics which deals with proximity relations and continuous deformations of abstract spaces. Recently, many researchers have paid attention to it due to the increasing amount of data available and the need for in-depth analysis of these datasets to extract useful properties of the space they sample. The application of topological tools to the study of data, known as Topological Data Analysis (TDA), has achieved a long list of successes in recent years (see, e.g., [1], [2] or [3], among many others). In this paper, we focus our attention on applying TDA techniques to study and effectively compute a kind of proximity among literary styles.

Until now, most of the methods used in comparative studies in philology are essentially qualitative. The comparison among writers, periods or, in general, literary styles is often based on stylistic analysis that cannot be quantified. Several quantitative methods in linguistics were applied in the past (see [4]) but their use is still controversial [5].

Our aim is to provide quantitative methods to classify texts, authors and literary styles, in general. But, instead of only using statistical methods, our procedure is based on the analysis of the spatial shape of the data after embedding it in a high-dimensional metric space. Broadly speaking, we start by representing a text as a cloud of points by using the so-called word embedding technique. The second key point of our method is the use of some TDA techniques, such as the persistent entropy and the bottleneck distance, to measure the proximity between different point clouds representing different texts, authors, literary styles, etc. The reason why we use persistent entropy is that it is a summary tool easy to compute and stable under small changes in the input data [6].

Let us recall that word embedding techniques try to find a representation of a set of words on a given alphabet as a high-dimensional point cloud in such a way that semantic proximity is preserved. Among the most popular word embedding systems are word2vec [7], GloVe [8] and FastText [9]. Throughout this paper, word2vec with its skipgram variation is used to obtain this multidimensional representation of the texts.

Regarding the study of proximity between the word embeddings of different texts, there are many ways in computer science to study dissimilarities and to measure the distance between two point clouds [10], but most of them are merely based on some kind of statistical summary of the point cloud and not on its shape. TDA techniques can capture the structure of the data distribution, as shown in [11], and hence they can be considered a powerful tool, in combination with machine learning, for different areas of data analysis (see, for example, [12] and [13]). In spite of the undoubted interest in quantifying and measuring the proximity between literary styles, as far as we know, very few researchers have explored the dissimilarity and proximity between them by combining machine learning and TDA techniques (see, for example, [14–16]).

Our contribution is twofold. Firstly, to the best of our knowledge, this is the first time that persistent entropy is applied to language processing problems. Secondly, the concept of a TDA-based summary of a set of texts is introduced. Specifically, the shape of a point cloud representing a text is captured by using a TDA technique known as persistence diagrams, which is based on deep and well-known concepts of algebraic topology such as simplicial complexes, homology groups and filtrations. To summarize a persistence diagram, we compute its persistent entropy. Persistent entropy is based on the Shannon entropy and has been successfully applied to many real-world problems, such as characterizing images of epithelial tissues [17] or measuring heart-rate variability for sleep-wake classification [18]. Persistent entropy will be applied to provide a TDA-based summary of a set of texts and to characterize the literary works of an author. A distance between persistence diagrams, namely the bottleneck distance, provides a way to quantify the proximity between two different persistence diagrams and, hence, a way to quantify the proximity between two different literary styles.

In order to illustrate the potential of the proposed technique, we provide a comparison of the literary works of two poets, Luis de Góngora and Francisco de Quevedo, who are representatives of the two main Spanish Golden Age literary styles called Culteranismo and Conceptismo, respectively. We also consider a third poet, Lope de Vega. Literary experts agree that the styles of Lope de Vega and Francisco de Quevedo are close (they both belong to Conceptismo), but both are far from the style of Luis de Góngora, which belongs to Culteranismo [19]. The application of TDA techniques made in this paper for measuring the proximity between such literary styles quantitatively confirms that the styles of Lope de Vega and Francisco de Quevedo are close to each other and yet both are far from the style of Luis de Góngora.

The paper is organized as follows. In Section 2, some preliminary notions about word embedding and TDA techniques are provided. The procedure applied to compute a TDA-based summary of a set of texts and to compare two different literary styles is described in Section 3. In Section 4, the specific computation of the TDA-based summary of the literary works of each of the three poets mentioned above, and a comparison between their literary styles, are thoroughly described. Finally, in Section 5, conclusions and future work are given.

In this section we recall some basics related to the techniques used throughout the paper. Firstly, the word embedding methodology is briefly introduced. Then, the relevant tools from TDA used in our approach are described.

2.1 Word Embedding

Word embedding is the collective name of a set of methods for representing words from natural languages as points (or vectors) in a real-valued multidimensional space. The common feature of such methods is that words with similar meanings have close representations. Such representation methods underlie some of the big successes of deep learning applied to natural language processing (see, for example, [20] or [21]). Next, we recall some basic definitions related to this methodology.

Fig. 1 The skipgram neural network architecture. The input layer has as many neurons as the length of the one-hot vector that encodes the words of the corpus, i.e., the number of words that compose the vocabulary of the corpus, N in this case. The size of the projection layer is equal to the dimension in which we want to embed the corpus, M. Finally, the output layer has N · S neurons, where S is the size of the window, i.e., the number of surrounding words that the model tries to predict. This image is inspired by the image of the skipgram model in [22].

Definition 1 (corpus) Given a set of words on a given alphabet, a corpus is a finite collection of writings composed with these words, denoted by $C$. The vocabulary, $V$, of a corpus $C$ is the set of all the words that appear in $C$. Finally, given $d \geq 1$, a word embedding is a function $E : V \to \mathbb{R}^d$.

The word embedding process used throughout this paper is word2vec, specifically its variant called skipgram [25]. It is based on a neural network architecture with one hidden layer, where the input is a corpus and the output is a probability distribution. We train it with a corpus to detect similarities between words based on their relative distance in a writing. This distance is the basis of their representation in a multidimensional space.

Roughly speaking, the skipgram neural network is trained by using a corpus, where the context of a word is considered as a window around a target word. In this way, in the skipgram model, each word of the input is processed by a log-linear classifier with a continuous projection layer, trying to predict the previous and the following words in a sentence. In this kind of neural network architecture, the input is a one-hot vector representing a word of the corpus and the output is a prediction of the surrounding words. More specifically, the neural network follows the architecture shown and explained in Figure 1.
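As an illustration of the windowing idea described above, the following minimal sketch builds the (target, context) training pairs and the one-hot encoding that a skipgram model consumes. It is not the word2vec implementation itself; the function names `skipgram_pairs` and `one_hot` and the toy token list are assumptions introduced only for this example.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs using a symmetric window around each word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def one_hot(word, vocabulary):
    """Encode a word as a one-hot vector of length N = len(vocabulary)."""
    return [1 if w == word else 0 for w in vocabulary]


# Toy "verse" (hypothetical input, already stripped of stop-words).
tokens = ["fuego", "lazo", "hielo", "flecha"]
vocabulary = sorted(set(tokens))
pairs = skipgram_pairs(tokens, window=1)
```

With a window of 1, each word is paired only with its immediate neighbours; a larger window yields more pairs per target word and therefore a broader notion of context.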

2.2 Topological Data Analysis

The fields of computational topology and topological data analysis were born as a combination of topics in geometry, topology, and algorithmics. In this section, some of their basic concepts are recalled. For a detailed presentation of these fields, [26] and [27] are recommended.

We first recall homology and then persistent homology as fundamental tools of TDA. The information obtained when computing persistent homology is usually encapsulated in a persistence diagram. Next, the concept of persistent entropy is introduced as a summary tool for persistence diagrams. Finally, the bottleneck distance is presented as the main distance to compare persistence diagrams.

The spaces on which we define homology groups are the underlying spaces of simplicial complexes, which are combinatorial structures built from points, line segments, triangles, and so on in higher dimensions. These components are called simplices.

Definition 2 (n-simplex) Let $\{v_0, \ldots, v_n\}$ be a set of geometrically independent points in $\mathbb{R}^d$. The n-simplex $\sigma$ spanned by $v_0, \ldots, v_n$ is defined as the set of all points

$$x = \sum_{i=0}^{n} t_i v_i,$$

where $t_i \in \mathbb{R}$ with $t_i \geq 0$ for $0 \leq i \leq n$, and $\sum_{i=0}^{n} t_i = 1$. Besides, $v_0, \ldots, v_n$ are called the vertices of $\sigma$, the number $n$ is called the dimension of $\sigma$, and any simplex spanned by a subset of $\{v_0, \ldots, v_n\}$ is called a face of $\sigma$.

When a set of n-simplices is glued, a simplicial complex is formed.

Definition 3 (simplicial complex) A simplicial complex $K$ in $\mathbb{R}^d$ is a collection of simplices in $\mathbb{R}^d$ such that:

1. every face of a simplex of $K$ is in $K$;
2. the intersection of any two simplices of $K$ is either empty or a face of each of them.
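The first condition of Definition 3 can be checked mechanically. The sketch below works in the abstract (combinatorial) setting, where a simplex is just a set of vertex labels and the second condition holds automatically once the first one does; the helper names `faces`, `is_simplicial_complex` and `closure` are assumptions made for this example.

```python
from itertools import combinations


def faces(simplex):
    """All non-empty proper faces of a simplex given as a set of vertices."""
    verts = sorted(simplex)
    return {frozenset(c)
            for k in range(1, len(verts))
            for c in combinations(verts, k)}


def is_simplicial_complex(simplices):
    """Check that the collection is closed under taking faces (condition 1)."""
    K = {frozenset(s) for s in simplices}
    return all(f in K for s in K for f in faces(s))


def closure(simplices):
    """Smallest simplicial complex containing the given simplices."""
    K = set()
    for s in simplices:
        s = frozenset(s)
        K.add(s)
        K |= faces(s)
    return K


# A triangle alone is not a complex; its closure (3 vertices, 3 edges, 1 triangle) is.
triangle_closure = closure([{0, 1, 2}])
```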


Next, the definition of n-chains and their boundaries is recalled. It is a key tool to formalize the idea of holes in a multidimensional space.

Definition 4 (chain complexes) Let $K$ be a simplicial complex and $n$ a dimension. An n-chain is a formal sum $c = \sum_{i=1}^{m} a_i \sigma_i$, where the $\sigma_i$ are n-simplices of $K$ and the $a_i \in \mathbb{Z}_2$ are coefficients. The sum of n-chains is defined componentwise, i.e., if $c' = \sum_{i=1}^{m} b_i \sigma_i$ is another n-chain, then $c + c' = \sum_{i=1}^{m} (a_i + b_i) \sigma_i$. The n-chains together with the addition form an abelian group denoted by $C_n$. To relate these groups across dimensions, the boundary of an n-simplex $\sigma = \{v_0, \ldots, v_n\}$ is defined as the sum of its $(n-1)$-dimensional faces, that is,

$$\partial_n \sigma = \sum_{j=0}^{n} \{v_0, \ldots, \hat{v}_j, \ldots, v_n\},$$

where the hat on $v_j$ indicates that $v_j$ is omitted. The boundary of an n-chain is the sum of the boundaries of its simplices. Hence, the boundary operation $\partial_n$ is a homomorphism that maps an n-chain to an $(n-1)$-chain. Then, a chain complex is the sequence of chain groups connected by boundary homomorphisms,

$$\cdots \xrightarrow{\partial_{n+2}} C_{n+1} \xrightarrow{\partial_{n+1}} C_n \xrightarrow{\partial_n} C_{n-1} \xrightarrow{\partial_{n-1}} \cdots$$

Next, the chains with empty boundary are considered. From an algebraic point of view, they have a group structure.

Definition 5 (n-cycles and n-boundaries) The group of n-cycles, denoted by $Z_n$, is the subgroup of the group of n-chains composed of those n-chains $c$ with empty boundary, that is, $\partial_n c = 0$. The group of n-boundaries, denoted by $B_n$, is the subgroup of the group of n-chains composed of those chains that are in the image of the $(n+1)$-st boundary homomorphism, that is, $B_n = \operatorname{im} \partial_{n+1}$.

Let us observe that, since $\partial_n \circ \partial_{n+1} = 0$, $B_n$ is a subgroup of $Z_n$. Therefore, we can now recall the definition of the homology groups, which determine the holes in the underlying space of a simplicial complex.

Definition 6 (homology groups) The n-th homology group is the quotient of the n-cycles by the n-boundaries, that is, $H_n = Z_n / B_n$. The elements of $H_n$ are called n-homology classes.
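Since the coefficients are in $\mathbb{Z}_2$, the dimension of $H_n$ (the n-th Betti number) can be obtained from ranks of boundary matrices: $\dim H_n = \dim C_n - \operatorname{rank} \partial_n - \operatorname{rank} \partial_{n+1}$. The following is a minimal sketch of that computation, assuming the complex is given as vertex tuples and is closed under faces; the function names `rank_gf2` and `betti_numbers` are chosen for this example and do not come from any library.

```python
from itertools import combinations


def rank_gf2(rows):
    """Rank over Z_2 of a matrix whose rows are ints used as bit vectors."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit of the pivot row
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank


def betti_numbers(simplices):
    """Betti numbers over Z_2 of a complex that is closed under faces."""
    by_dim = {}
    for s in simplices:
        by_dim.setdefault(len(s) - 1, []).append(tuple(sorted(s)))
    top = max(by_dim)
    ranks = {0: 0, top + 1: 0}  # the boundary of a vertex is zero
    for n in range(1, top + 1):
        index = {s: i for i, s in enumerate(by_dim.get(n - 1, []))}
        rows = []
        for s in by_dim.get(n, []):
            row = 0
            for face in combinations(s, n):  # drop one vertex at a time
                row ^= 1 << index[face]
            rows.append(row)
        ranks[n] = rank_gf2(rows)
    return [len(by_dim.get(n, [])) - ranks[n] - ranks[n + 1]
            for n in range(top + 1)]


hollow = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
filled = hollow + [(0, 1, 2)]
```

The hollow triangle has one connected component and one 1-dimensional hole; adding the 2-simplex fills the hole.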

Next, we recall how to build a nested sequence of simplicial complexes in order to track the evolution of the homology groups throughout the sequence.

Definition 7 (sublevel set filtration) Given a simplicial complex $K$ and a continuous increasing function $f : K \to \mathbb{R}$, called the filter function, the sublevel set filtration of $K$ is the nested sequence of subcomplexes $K(a)$ of $K$ defined as

$$K(a) = f^{-1}\big((-\infty, a]\big), \quad a \in \mathbb{R}.$$

Let us observe that $K(a) \subseteq K(b)$ when $a \leq b$, since $f$ is increasing. The sublevel set filtration that we will use in this paper is the so-called Vietoris-Rips filtration, which is a filtration usually applied to point clouds. The Vietoris-Rips filter function enlarges balls centered at each point of the point cloud. Then, when two of these balls intersect, a 1-simplex is built. The process is extrapolated to higher dimensions, that is, if three balls intersect pairwise, a 2-simplex is built, and so on.
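The construction just described can be sketched for a single scale value as follows. This is a brute-force illustration, not an efficient implementation: the function name `vietoris_rips` is an assumption, the Euclidean metric is used, and `scale` is taken to be the maximum allowed distance between two vertices (that is, twice the radius of the balls).

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, scale, max_dim=2):
    """Simplices (as index tuples) whose vertices are pairwise within `scale`."""
    n = len(points)
    close = {(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= scale}
    simplices = [(i,) for i in range(n)]
    for k in range(2, max_dim + 2):  # k vertices span a (k-1)-simplex
        for verts in combinations(range(n), k):
            if all(pair in close for pair in combinations(verts, 2)):
                simplices.append(verts)
    return simplices


# Corners of the unit square.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
small = vietoris_rips(square, scale=1.0)   # only the four sides appear as edges
large = vietoris_rips(square, scale=1.5)   # the diagonals and all triangles appear
```

Increasing `scale` only ever adds simplices, which is what makes the family of complexes a filtration.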

As previously mentioned, for every $a \leq b$ there is an inclusion map from $K(a)$ to $K(b)$. Therefore, we have an induced homomorphism $f_n^{a,b}$ from $H_n(K(a))$ to $H_n(K(b))$.

Definition 8 (persistent homology) The sequence of n-th homology groups connected by homomorphisms obtained from a filtration K is called the n-th persistent homology of K.

Now, using the homology homomorphisms induced by the inclusion maps, we say that an n-homology class $\alpha$ is born at $H_n(K(a))$ if it is not the image of any n-homology class from an earlier step, that is, there is no $\alpha' \in H_n(K(a'))$ with $a' < a$ and $f_n^{a',a}(\alpha') = \alpha$. It dies entering $H_n(K(b))$, with $a \leq b$, if its image $f_n^{a,b}(\alpha)$ is also the image of another class born earlier than $\alpha$. If $b - a$ is "close" to 0, then $\alpha$ is considered to be noise.

Definition 9 (persistence diagrams) Let $\mu_n^{a,b}$ denote the number of n-homology classes born at $H_n(K(a))$ and dying entering $H_n(K(b))$. Then, the n-th persistence diagram of a filtration $K$, denoted by $\mathrm{Dgm}_n(K)$, is the multiset of points $(a, b)$ with multiplicity $\mu_n^{a,b}$ (together with the points of the diagonal with infinite multiplicity by convention).

Let us now describe a toy example as an illustration of the concept of persistence diagrams. Let us consider the three different datasets shown in Figure 2. The first one samples a circumference, the second one samples a noisy version of a circumference, and the last one samples two circumferences. The Vietoris-Rips filtration using the Euclidean metric was computed to obtain the persistence diagrams shown in Figure 3. The blue and orange points of the persistence diagrams correspond, respectively, to the 0- and 1-homology classes, with birth and death values as coordinates. Looking at the persistence diagram shown in Figure 3 on the left, we can observe that there is just one significant 1-homology class, which corresponds to the hole of the circumference. In the persistence diagram shown in Figure 3 in the center, the points that appear close to the diagonal can be considered noise. Finally, the two orange points of the persistence diagram shown in Figure 3 on the right correspond to the two holes, one for each circumference sampled by the dataset displayed in Figure 2 on the right.
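For 0-homology classes, the persistence diagram of a Vietoris-Rips filtration can be computed without any algebra: every point is born at 0, and a connected component dies when an edge first merges it into another one. The sketch below does this with a union-find structure. The function name `persistence_diagram_h0` is an assumption for this example, and deaths are recorded at the edge length (conventions that use the ball radius would halve these values).

```python
from itertools import combinations
from math import dist, inf


def persistence_diagram_h0(points, metric=dist):
    """0-th persistence diagram of the Vietoris-Rips filtration of a point cloud."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((metric(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    diagram = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:               # this edge merges two components: one dies
            parent[ri] = rj
            diagram.append((0.0, length))
    diagram.append((0.0, inf))     # the component that never dies
    return diagram


diagram = persistence_diagram_h0([(0, 0), (1, 0), (3, 0)])
```

The `metric` argument can be replaced by any distance function on the points, so the same sketch applies to other choices of distance.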

To summarize the information of a persistence diagram, we will make use of the persistent entropy concept [28, 29] which has been proven to be stable under small perturbations in the input data (see [30]).


Fig. 2 From left to right: A 2-dimensional point cloud sampling a circumference, a 2- dimensional point-cloud sampling a noisy circumference, and a 2-dimensional point-cloud sampling two circumferences.


Fig. 3 Persistence diagrams of the Vietoris-Rips filtrations obtained from the three datasets of Figure 2 (points sampling a circumference, a noisy circumference, and two circumferences, respectively), showing the 0- and 1-homology classes. The blue points represent the (birth, death) of the 0-homology classes, and the orange points are the (birth, death) of the 1-homology classes. Observe that in the third persistence diagram there are two orange points that are far from the diagonal, corresponding to the holes of the two circumferences.

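Definition 10 (persistent entropy) Given a filtration $K$ and the corresponding persistence diagram $\mathrm{Dgm}_n(K) = \{(x_i, y_i) \mid 1 \leq i \leq n\}$ (seen as a finite set of points), the n-th persistent entropy of $K$ is defined as

$$E(K) = -\sum_{i=1}^{n} p_i \log_2(p_i), \quad \text{where } p_i = \frac{\ell_i}{L}, \quad \ell_i = y_i - x_i, \quad L = \sum_{i=1}^{n} \ell_i.$$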

Let us remark that those homology classes that do not die (blue points on the dotted horizontal line in Figure 3) are not considered in the persistent entropy computation. For example, the persistent entropy values of the 0-th persistence diagrams plotted in Figure 3 are, from left to right, 4.58, 4.49, and 4.
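Persistent entropy is the Shannon entropy of the normalised bar lengths of a diagram, so it takes only a few lines to compute. The sketch below assumes a base-2 logarithm and, as remarked above, ignores classes that never die; the function name `persistent_entropy` is chosen for this example.

```python
from math import inf, isfinite, log2


def persistent_entropy(diagram):
    """Shannon entropy (base 2) of the normalised bar lengths of a diagram."""
    # Keep only classes that die, with strictly positive lifetime.
    lengths = [death - birth for birth, death in diagram
               if isfinite(death) and death > birth]
    total = sum(lengths)
    return -sum((l / total) * log2(l / total) for l in lengths)


# Three bars of lengths 1, 1 and 2 plus one class that never dies.
value = persistent_entropy([(0, 1), (0, 1), (0, 2), (0, inf)])
```

When all bars have the same length the entropy is maximal and equals the base-2 logarithm of the number of bars; the more the lengths differ, the lower the value.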

Finally, two persistence diagrams can be compared using a distance, the bottleneck distance being considered the most common and the one that will be used in the next sections.


Fig. 4 The set of arrows represents the optimum bijection between the black and white points that belong, respectively, to two different persistence diagrams, which are shown overlaid here.

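Definition 11 (bottleneck distance) The bottleneck distance between two persistence diagrams $A$ and $B$ is:

$$d_b(A, B) = \inf_{\phi : A \to B} \; \sup_{\alpha \in A} \|\alpha - \phi(\alpha)\|_\infty,$$

where $\phi$ ranges over all bijections between $A$ and $B$.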

A graphical description of the bottleneck distance is shown in Figure 4.
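For very small diagrams, the bottleneck distance can be computed by brute force, which makes the role of the diagonal explicit. The sketch below is only an illustration (its cost grows factorially with the number of points, so it is not how the distance would be computed in practice), and the function name `bottleneck_distance` is an assumption. Each diagram is augmented with the diagonal projections of the points of the other one, so that a bijection always exists, and matching two diagonal points costs nothing.

```python
from itertools import permutations


def bottleneck_distance(A, B):
    """Brute-force bottleneck distance between two small finite diagrams."""
    def diag(p):
        m = (p[0] + p[1]) / 2
        return (m, m)  # closest diagonal point in the L-infinity norm

    # The boolean flag marks points that lie on the diagonal.
    left = [(p, False) for p in A] + [(diag(q), True) for q in B]
    right = [(q, False) for q in B] + [(diag(p), True) for p in A]

    def cost(x, y):
        (p, p_on_diag), (q, q_on_diag) = x, y
        if p_on_diag and q_on_diag:
            return 0.0
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    best = float("inf")
    for perm in permutations(range(len(right))):
        worst = max((cost(left[i], right[j]) for i, j in enumerate(perm)),
                    default=0.0)
        best = min(best, worst)
    return best


d = bottleneck_distance([(0, 4), (1, 1.5)], [(0, 5)])
```

Here the long bars are matched to each other at cost 1, while the short bar (1, 1.5) is matched to the diagonal at cost 0.25, so the largest cost in the best matching is 1.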

Next, we describe the methodology based on TDA techniques designed to compute a TDA-based summary feature for any given set of texts and to establish a distance between different sets of texts. In the next section, we will see that the TDA-based summary characterizes the literary works of an author and that the proposed distance can establish which authors' styles are "closer". This way, our results will support the qualitative philological studies previously made.

Broadly speaking, given a corpus composed of texts belonging to different categories (e.g., authors, styles), a stemming process (which we call stem) is applied to each text, in which the non-informative words (also called stop-words) are deleted. Then, the skipgram word embedding E (described in Section 2.1) is applied to the vocabulary of the corpus, obtaining a high-dimensional representation of the words as a point cloud. The point cloud is divided into (possibly overlapping) subsets of points (words) belonging to the same category (e.g., authors, styles). For each of these point clouds, the Vietoris-Rips filtration is computed to obtain the corresponding persistence diagrams and the persistent entropy, which constitutes the TDA-based summary of the category. The pseudocode of the methodology explained above, which computes a TDA-based summary of a set of texts and a quantitative comparison between them, is shown in Algorithm 1.


In this section, we illustrate the methodology presented above and thoroughly describe the experimentation process carried out on the literary works of three well-known Spanish Golden Age poets. In order to determine whether there exist significant differences between the TDA-based summaries of the literary works of the three poets, we apply a non-parametric statistical test to the resulting persistent entropy values. In the following subsections, we describe each of the steps of the experiment in detail.

4.1 The Context: Spanish Golden Age Literature

The Spanish Golden Age literature is a complex framework still alive in the sense that it remains an appealing subject for the literary experts. In this section, we will provide a justification borrowed from the literary experts that supports our experimental results, and recall the preliminary literary notions needed to understand them.

We are interested in studies related to what we consider the inner "stylistic configurations" of the sentences, in order to capture them with the word2vec embedding. Following the study developed by Dámaso Alonso [31], poets draw on different stylistic configurations for their verses. The first one we would like to comment on can be exemplified by the following two verses of a sonnet by Cervantes [32]:

    Afuera el fuego, el lazo, el hielo y flecha
    de amor que abrasa, aprieta, enfría y hiere

We can see that the main concepts of the first verse correspond member by member to the ones of the second verse, summarizing the following four sentences in the two verses: Afuera el fuego de amor que abrasa; afuera el lazo de amor que aprieta; afuera el hielo de amor que enfría; afuera la flecha de amor que hiere. It can be described by the following formula:

$$\alpha(A_1, A_2, \ldots, A_n)\; \beta(B_1, B_2, \ldots, B_n)$$

that summarizes the sentences $\alpha(A_i)\beta(B_i)$ for $i = 1, \ldots, n$. In this example, $\alpha$ is afuera and $\beta$ is de amor. Another kind of resource is the reiterative correlation plurality described in depth in [33]. These techniques illustrate the wide range of methods that concern the configurations of the verses, but many others could be cited.

Our aim with the word2vec algorithm is to encapsulate this kind of configuration. In spite of its intrinsic difficulty, our work explores the possibility of finding similarities between words and their use, taking their context into consideration. It seems natural, in a first approach, to study whether word2vec with its skipgram variation can imitate, or be used as a complement to, the qualitative methods in order to distinguish different literary styles. Besides, considering the mathematical formulation introduced by Dámaso Alonso to study the architecture of the sonnets, and his comment that to apply such deep studies "it would be a labour of a truly team of workers", in this paper we take the chance to do the heavy work that Dámaso Alonso mentioned, with recent mathematical tools, in an efficient and effective way.

Luis de Góngora and Lope de Vega are both summits of the Spanish Golden Age. Traditionally, it is said that Luis de Góngora started the literary style Culteranismo and that Lope de Vega is related to an opposite trend called Conceptismo, which had its major representative in Francisco de Quevedo [19, 34]. See also [35], where it is claimed that both literary styles are related but with elements that distinguish them. However, there exist discrepancies between the literary experts. For example, in [31], Dámaso Alonso did a thorough study of Lope de Vega, and he even developed a comparison of this author with Góngora. He stated that there existed a discontinuous influence of Góngora's work on Lope de Vega's work. So, it might not be possible (and it is natural for it not to be so) to establish rigid differences between such literary styles. In fact, poets evolve through their entire productive life, and the different literary styles can be inspired or fed by others. We also recommend [36] as a study of the context of these three poets.


Hence, it is important to highlight that the conclusions of our proposed technique can only be applied to the chosen sets of texts and they can not be generalized to all the production of an author.

4.2 The Corpus and the Preprocessing Step

The corpus we used is a large dataset composed of the sonnets of the Spanish Golden Age poets [37]. It provides some metrical annotations according to stressed syllables, type of rhyme, etc. In our case we used the sonnets of the three poets we are interested in: Lope de Vega, Quevedo, and Góngora.

Since, in the database, there are only 115 sonnets by Góngora, we kept 115 sonnets of each poet (345 sonnets in total) in order to avoid an unbalanced dataset. We chose just the first 115 sonnets of each poet in the cited dataset, without taking into consideration any possible classification that the literary experts could consider.

Then, each sonnet was pruned as a result of a stemming process. There exist some words that have no value in terms of meaning or that do not provide structure to the sentence, such as prepositions and articles: de, el, la, ... As they can be considered noise for our purposes, we erased them from the sonnets. Besides, some words are shortened to their root in order to prevent the word2vec algorithm from treating different verb tenses, or words with different grammatical gender, as different words. The procedure we applied to delete these non-informative words (also called stop-words) is implemented in the nltk library.

4.3 Application of the word2vec Algorithm

This step consists of the application of the skipgram variation of the word2vec algorithm. Specifically, we used the implementation provided by the nltk Python library. We then obtained a high-dimensional embedding of the words of the 345 sonnets. Specifically, the sonnets were embedded in a 150-dimensional space after a 250-iteration training using a window of 10 words. We used a window of 10 words because we wanted to catch patterns using the verses in their full extension, and 10 words is an upper bound on the number of words of a verse in a sonnet.

4.4 Persistent Entropy and Bottleneck Distance Computation

Having the high-dimensional representation of the words that compose the different sonnets of the dataset, we compute the Vietoris-Rips filtration. The metric used to compute the Vietoris-Rips filtration is the cosine distance because it measures similarity between words by the angle of their vectors, and it is the common distance applied in the word2vec algorithm (see [24]).
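The cosine distance mentioned above depends only on the angle between two embedding vectors, not on their lengths. A minimal sketch, assuming the usual definition as one minus the cosine similarity and non-zero vectors (the function names `cosine_distance` and `distance_matrix` are chosen for this example):

```python
from math import sqrt


def cosine_distance(u, v):
    """1 - cos(angle between u and v): 0 for parallel, 1 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Pairwise cosine distances, the input needed to build a Vietoris-Rips filtration."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]


# Hypothetical 2-dimensional "embeddings" used only to show the behaviour.
D = distance_matrix([(1, 0), (2, 0), (0, 1)])
```

The first two vectors point in the same direction, so their distance is 0 even though their lengths differ; both are at distance 1 from the third, orthogonal vector.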


Table 1 Tests applied to see if there is a significant difference between the persistent entropy values obtained for the literary works of the 3 poets considered.


As a result, we obtained 3 different 0-th persistence diagrams, one for each poet. Then, the persistent entropy for each persistence diagram and the bottleneck distance between any two persistence diagrams were computed.

4.5 Results

The methodology shown in Algorithm 1 with the specific procedures and parameters described in Subsection 4.2, Subsection 4.3, and Subsection 4.4, was applied and repeated 100 times.

The persistent entropy values obtained after the 100 repetitions were compared using non-parametric statistical tests. The results of the statistical tests are shown in Table 1, supporting that there exists a significant difference between the three sets of sonnets and, hence, between the authors.

The bottleneck distances obtained after the 100 repetitions are shown in Figure 5 using a box-plot representation. Let us recall that, in a box-plot, the upper horizontal line corresponds to the maximum value and the lower horizontal line to the minimum value. The horizontal line in the middle of the box corresponds to the median, the top of the box is the third quartile, and the bottom of the box is the first quartile. Finally, the circles correspond to outliers. We can see that the experiment points to a significant difference between the bottleneck distances: the persistence diagrams associated with the point clouds representing the literary works of Lope de Vega and Quevedo are the closest to each other.

Finally, in order to decide whether the differences between the computed bottleneck distances are significant, a repeated measures ANOVA was applied. Sphericity is an assumption in repeated measures ANOVA designs. When $\epsilon$ does not reach 1, the F-statistic can be inflated and different corrections can be applied. The values of $\epsilon$ for the Greenhouse-Geisser and Huynh-Feldt corrections, which are used in case the sphericity assumption is violated, are $\epsilon = 0.563$ and $\epsilon = 0.565$, respectively. Table 2 displays the values obtained by the repeated measures ANOVA under the sphericity assumption and under both corrections. A p-value lower than 0.001 and an F-value of 51.42 were reached. Therefore, we can infer that there exists a significant difference between the 3 groups of bottleneck distances, as we expected by visualizing Figure 5.

Fig. 5 Box-plot showing the bottleneck distance results obtained from the sonnets of the three poets. (1) is the box-plot of the bottleneck distance obtained from the comparison between the sonnets of Quevedo and Lope de Vega, (2) is the box-plot of the bottleneck distance obtained from the comparison between the sonnets of Quevedo and Góngora, and (3) is the box-plot of the bottleneck distance obtained from the comparison between the sonnets of Lope de Vega and Góngora.

Table 2 The repeated measures ANOVA applied to infer whether there exists a significant difference between the bottleneck distances. Spher means sphericity assumed, G-G means Greenhouse-Geisser correction and H-F means Huynh-Feldt correction.

Finally, to specifically determine which of the groups is the different one, a pairwise comparison was computed (Table 3). As shown there, the p-value is lower than 0.001 when we compare with C. Therefore, the sample of the bottleneck distances between the literary works of Quevedo and Lope de Vega is significantly different from the other two. The p-values and the confidence intervals were Bonferroni corrected.

We conclude that our method shows that the distance (on the studied sonnets) between the literary works of Quevedo and Lope de Vega is significantly shorter than their distances to the literary work of Góngora. Hence, we have found a quantitative justification that supports the philologists' theory that Quevedo and Lope de Vega belong to the same literary style (Conceptismo) and that both of them are stylistically far from Luis de Góngora (whose style belongs to Culteranismo).

Table 3 Pairwise comparison between bottleneck distances. A, B, and C correspond to the sample of the bottleneck distances between Lope de Vega and Góngora, Quevedo and Góngora, and Quevedo and Lope de Vega, respectively.

Extracting knowledge from complex datasets is hard work which requires the help of techniques coming from other fields. In this sense, representing the data as points in a metric space builds a bridge between research fields which are seemingly far apart. The use of TDA techniques is a new research area which provides tools for comparing properties of point clouds in high-dimensional spaces and, therefore, for comparing the datasets represented by such point clouds.

In this paper, we propose the use of such TDA techniques in order to compare different literary styles. In this approach, the bottleneck distance between the persistence diagrams of the Vietoris-Rips filtrations obtained from the point clouds representing sets of texts from different writers encodes the differences between their literary styles and quantifies the proximity between them.

This novel approach opens a door for the interaction of TDA and philological research. TDA techniques can be applied in order to give a topological description of a work, a writer or an age and go deeper into their belonging to a greater trend.

The work was partly supported by the Agencia Estatal de Investigación/10.13039/501100011033 under grant PID2019-107339GB-100 and the Agencia Andaluza del Conocimiento under grant PY20-01145.

[1] Liu, S., Wang, D., Maljovec, D., Anirudh, R., Thiagarajan, J.J., Jacobs, S.A., Van Essen, B.C., Hysom, D., Yeom, J.-S., Gaffney, J., Peterson, L., Robinson, P.B., Bhatia, H., Pascucci, V., Spears, B.K., Bremer, P.-T.: Scalable topological data analysis and visualization for evaluating data-driven models in scientific applications. IEEE Transactions on Visualization and Computer Graphics 26(1), 291–300 (2020). https://doi.org/10.1109/TVCG.2019.2934594

[2] Riihimäki, H., Chacholski, W., Theorell, J., Hillert, J., Ramanujam, R.: A topological data analysis based classification method for multiple measurements. BMC Bioinformatics 21:336 (2020). https://doi.org/10.1186/s12859-020-03659-3

[3] Ramamurthy, K.N., Varshney, K.R., Mody, K.: Topological data analysis of decision boundaries with application to model selection. In: Proceedings of the 36th Int. Conf. on Machine Learning, ICML 2019, pp. 5351–5360 (2019). http://proceedings.mlr.press/v97/ramamurthy19a.html

[4] Johnson, K.: Quantitative Methods in Linguistics. Blackwell Pub., USA (2008)

[5] Rahman, M.S.: The advantages and disadvantages of using qualitative and quantitative approaches and methods in language “testing and assessment” research: A literature review. Journal of Education and Learning 6(1), 102–112 (2017). https://doi.org/10.5539/jel.v6n1p102

[6] Atienza, N., Gonzalez-Diaz, R., Soriano-Trigueros, M.: On the stability of persistent entropy and new summary functions for topological data analysis. Pattern Recognition 107, 107509 (2020). https://doi.org/10.1016/j.patcog.2020.107509

[7] Mikolov, T., Le, Q.V., Sutskever, I.: Exploiting similarities among languages for machine translation. CoRR abs/1309.4168 (2013). https://arxiv.org/abs/1309.4168

[8] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conf. on Empirical Methods in Natural Language Processing, EMNLP 2014, pp. 1532–1543 (2014). https://www.aclweb.org/anthology/D14-1162/

[9] Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017). https://doi.org/10.1162/tacl_a_00051

[10] Deza, M.M., Deza, E.: Encyclopedia of Distances. Encyclopedia of Distances. Springer, Berlin Heidelberg (2009)

[11] Zhang, J., Xie, Z., Li, S.Z.: Prime discriminant simplicial complex. IEEE Transactions on Neural Networks and Learning Systems 24(1), 133–144 (2013). https://doi.org/10.1109/TNNLS.2012.2223825

[12] Zielinski, B., Lipinski, M., Juda, M., Zeppelzauer, M., Dlotko, P.: Persistence codebooks for topological data analysis. Artif. Intell. Rev. 54(3), 1969–2009 (2021). https://doi.org/10.1007/s10462-020-09897-4

[13] Dey, A., Das, S.: Topo sampler: A topology constrained noise sampling for GANs. In: NeurIPS 2020 Workshop on Topological Data Analysis and Beyond (2020). https://openreview.net/forum?id=OTxZfmVFlTO

[14] Gholizadeh, S., Seyeditabari, A., Zadrozny, W.: Topological signature of 19th century novelists: Persistent homology in text mining. Big Data and Cognitive Computing 2(33), 1–10 (2018). https://doi.org/10.3390/bdcc2040033

[15] Temčinas, T.: Local Homology of Word Embeddings (2018). https://arxiv.org/abs/1810.10136

[16] Wright, M., Zheng, X.: Topological Data Analysis on Simple English Wikipedia Articles (2020). https://arxiv.org/abs/2007.00063

[17] Atienza, N., Escudero, L.M., Jimenez, M.J., Soriano-Trigueros, M.: Characterising epithelial tissues using persistent entropy. In: Computational Topology in Image Context, CTIC2019, pp. 179–190. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-10828-1_14

[18] Chung, Y.-M., Hu, C.-S., Lo, Y.-L., Wu, H.-T.: A persistent homology approach to heart rate variability analysis with an application to sleep-wake classification. Frontiers in Physiology 12, 202 (2021). https://doi.org/10.3389/fphys.2021.637684

[19] Rutherford, J.: The Spanish Golden Age Sonnet. Iberian and Latin American Studies. University of Wales Press, UK (2016)

[20] Yin, Z., Shen, Y.: On the dimensionality of word embedding. In: Bengio, S., Wallach, H.M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS2018, pp. 895–906 (2018). http://papers.nips.cc/paper/7368-on-the-dimensionality-of-word-embedding

[21] Almeida, F., Xexéo, G.: Word Embeddings: A Survey (2019). http://arxiv.org/abs/1901.09069

[22] Hu, J., Li, S., Yao, Y., Yu, L., Guanci, Y., Hu, J.: Patent keyword extraction algorithm based on distributed representation for patent classification. Entropy 20(104), 1–19 (2018). https://doi.org/10.3390/e20020104

[23] Mikolov, T., Chen, K., Corrado, G.S., Dean, J.: Efficient Estimation of Word Representations in Vector Space (2013). https://arxiv.org/abs/1309.4168

[24] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th Int. Conf. on Neural Information Processing Systems. NIPS’13, vol. 2, pp. 3111–3119. Curran Associates Inc., USA (2013). https://dl.acm.org/doi/10.5555/2999792.2999959

[25] Guthrie, D., Allison, B., Liu, W., Guthrie, L., Wilks, Y.: A closer look at skip-gram modelling. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC 2006, pp. 1222–1225 (2006). http://www.lrec-conf.org/proceedings/lrec2006/summaries/357.html

[26] Edelsbrunner, H., Harer, J.L.: Computational Topology, An Introduction. American Mathematical Society, USA (2010)

[27] Carlsson, G., Vejdemo-Johansson, M.: Topological Data Analysis with Applications. Cambridge University Press, USA (2021)

[28] Chintakunta, H., Gentimis, T., Gonzalez-Diaz, R., Jimenez, M.-J., Krim, H.: An entropy-based persistence barcode. Pattern Recognition 48(2), 391–401 (2015). https://doi.org/10.1016/j.patcog.2014.06.023

[29] Merelli, E., Rucco, M., Sloot, P., Tesei, L.: Topological characterization of complex systems: Using persistent entropy. Entropy 17(10), 6872–6892 (2015). https://doi.org/10.3390/e17106872

[30] Atienza, N., Gonzalez-Diaz, R., Soriano-Trigueros, M.: On the stability of persistent entropy and new summary functions for topological data analysis. Pattern Recognition 107, 107509 (2020). https://doi.org/10.1016/j.patcog.2020.107509

[31] Alonso, D.: Poesía Española: Ensayo de Métodos Y Límites Estilísticos: Garcilaso, Fray Luis de León, San Juan de la Cruz, Góngora, Lope de Vega, Quevedo. Biblioteca románica hispánica: Estudios y ensayos. Editorial Gredos, Spain (1966)

[32] Cervantes Saavedra, M.d.: La Galatea. Alicante: Biblioteca Virtual Miguel de Cervantes, 2001, Spain (2001). http://www.cervantesvirtual.com/nd/ark:/59851/bmcn29t1

[33] Alonso, D.: Versos Plurimembres Y Poemas Correlativos: Capítulo Para la Estilística del Siglo de Oro vol. 49, p. 191. Sección de Cultura e Información Artes Gráficas Municipales, Spain (1944)

[34] Chamorro, D.C.: Sobre los orígenes del conceptismo andaluz: Alonso de Bonilla. Boletín del Instituto de Estudios Giennenses 130, 59–84 (1987)

[35] Molfulleda, S.: Sobre la oposición entre culteranismo y conceptismo. Universitas Tarraconensis. Revista de Filologia 6, 55–62 (2018)

[36] Rozas, J.M.: Góngora, Lope, Quevedo. Poesía de la Edad de Oro, II. Alicante: Biblioteca Virtual Miguel de Cervantes, Spain (2002). http://www.cervantesvirtual.com/nd/ark:/59851/bmc47499

[37] Navarro, B., Ribes Lafoz, M., Sánchez, N.: Metrical annotation of a large corpus of Spanish sonnets: Representation, scansion and evaluation. In: Proceedings of the 10th Int. Conference on Language Resources and Evaluation (LREC'16), pp. 4360–4364 (2016). https://www.aclweb.org/anthology/L16-1691

