Incremental Monoidal Grammars

2020·Arxiv

Abstract

Abstract

In this work we define formal grammars in terms of free monoidal categories, along with a functor from the category of formal grammars to the category of automata. Generalising from the Booleans to arbitrary semirings, we extend our construction to weighted formal grammars and weighted automata. This allows us to link the categorical viewpoint on natural language to the standard machine learning notion of probabilistic language model.

Recurrent neural networks and probabilistic language models [BDVJ03] have become standard tools in the natural language processing (NLP) community. These networks are inherently incremental, scanning through the sequence of words they are given and updating their prediction for what comes next. Despite their practical success on hard language tasks such as translation and question answering, the structure underlying these machine learning models is yet poorly understood, and they are generally only used as black boxes.

On the other hand, the categorical compositional distributional (DisCoCat) models of Coecke et al. [CCS08, CCS10, CGS13] use grammatical structure, explicitly encoded as string diagrams in a free monoidal category, to compute natural language semantics. DisCoCat models have received experimental support on small-scale tasks [GS11, KSP12, KSP13], but the extra mathematical structure makes them hard to scale to the billions of words used in the training of modern NLP models.

In this work, we aim to bridge this gap by constructing a functor which sends formal grammars (encoded as monoidal categories) to a state automaton that parses the grammar in an incremental way, reading one word at a time and updating the set of possible parsings. In section 1 we introduce some core definitions and preliminaries. In section 2 we introduce monoidal grammars and S-monoidal grammars for an arbitrary semiring S. In section 3 we define the functor I, from the category of S-monoidal grammars to coalgebras of the weighted automata functor section 4, we describe an algorithm for learning -monoidal grammars from a probabilistic language model. This paper represents a work in progress, and we conclude with a discussion of future work.

1 Preliminaries

1.1 Formal Grammars

Given a pair of sets X and Y , we write for their Cartesian product and disjoint union respectively. We use the Kleene star to denote the free monoid, with concatenation as product and the empty sequence Similarly, we write for the free semigroup (i.e. non-empty sequences).

If we fix a finite set of words V called the vocabulary, then a the language of a formal grammar over this set of words is a subset : the set of sequences of words that form grammatical sentences accoridng to the rules of the grammar. A wellknown class of formal grammars is context-free grammars, introduced by Chomsky as the first level of his complexity-theoretic hierarchy [Cho56, Cho57, Cho65].

Definition 1.1. A context-free grammar is given by a 4-tuple (V, X, R, s) where:

• V is a finite set called the vocabulary (or terminal symbols),

• X is a finite set of syntactic variables (or nonterminal symbols),

• production rules,

• is a syntactic variable called the start symbol.

We define the language of a context-free grammar G = (V, X, R, s) as follows: We first construct a binary relation (for any pair of strings if and only if there is some production rule (and a pair t = uvw. We then compute its reflexive transitive closure . Note that this corresponds precisely to the preordered monoid generated by the production rules R, i.e. with generators V + X and relations for each (

Definition 1.2. A preordered monoid is a monoid X equipped with a reflexive transitive relation

If we generalise this to arbitrary preordered monoids, i.e. to grammars with production rules that have arbitrary sources R (and not just the ones with an R with restrictions as defined above), we obtain abstract rewriting systems, also known as semi-Thue systems [Thu14, Pow13]. If we forbid rules with empty source or target we obtain the word-problem for semigroups, shown to be undecidable independently by Markov [Mar47] and Post [Pos47]. Both are equivalent in computational power to the last level of Chomsky’s hierarchy: unrestricted grammars (which are thus undecidable).

Definition 1.3. An unrestricted grammar is a tuple G = (V, X, R, s) where V and X are finite sets called the terminals and non-terminals with and the language of G, where the order is given by the preordered monoid generated by R.

Empirically, the complexity of human languages has been shown to lie somewhere in between that of context-free and unrestricted grammars [Shi87], see the line of work on midly context-sensitive grammars [Jos85, Wei88, VW94].

We cannot infer the underlying structure of human languages from complexity-theoretic arguments: indeed, the same class of languages can be generated by nonisomorphic grammar formalisms. As our main example, Lambek’s pregroup grammars [Lam99, Lam01, Lam08] have the same expressive power as context-free grammars [BM08], i.e. there are back-and-forth translations which preserve the generated languages. However, pregroup grammars and context-free grammars are only weakly equivalent: these translations do not give an isomorphism.

is a preordered monoid P equipped with a pair of functions (

Definition 1.5. A pregroup grammar is a tuple G = (V, B, D, s) where V, B and are finite sets and s is an element of B, for B a set of basic grammatical types and the free pregroup generated by B.

In the above, V is the Vocabulary, B a set of basic grammatical types such as {n, s}, for n the type of a noun phrase and s that of a sentence, and D a relation known as the type dictionary or the lexicon. The language L(G) of a pregroup grammar G = (V, B, D, s) is the set of all strings each (

1.2 Monoidal Categories

We assume familiarity with monoidal categories, see e.g. [Awo06] for an introduction. In order to fix some notation, we define monoidal signatures and presentations.

Definition 1.6. A monoidal signature is a tuple Σ = (Σand Σare sets of generating objects and arrows respectively, and domare pairs of functions called domain and codomain.

A homomorphism of monoidal signatures is given by a pair of functions such that the following diagram commutes:

We write MonSig for the category of monoidal signatures and their homomorphisms. The free monoidal category generated by a monoidal signature Σ is the image of the left adjoint to the forgetful functor from MonCat to MonSig. Explicitly, the arrows are given by (planar progressive) string diagrams with arrows in Σas nodes and as input/output, see [JS88, JS91, Sel10]. In order to impose equations on diagrams, we need some extra data: a presentation.

Definition 1.7. A presentation for a monoidal category is given by a monoidal signature Σ and a set of relations between parallel arrows of the free monoidal category.

The monoidal category generated by (Σ, R) is the quotient , see [Lan98][II.8].

1.3 Coalgebras

Coalgebras are a useful framework for categorically modeling dynamical systems. In functional programming terminology, some functor F defines a type signature for a mapping on Set, and an F-coalgebra is a particular implementation of this signature.

Definition 1.8. For an endofunctor where X is a set and is a function that defines structure over X.

Example 1.8.1. For X a set of states and F the identity functor, an F-coalgebra (X, g) maps states to states, computing a function over states such as permutations.

Example 1.8.2. The powerset functor F(X) = P(X) maps a state to a set of states, computing the accessibility relation of a Kripke structure or the transitions of an automata.

Example 1.8.3. The functor maps states in X to the product of {false, true} and the set of functions from some set V back to X. This signature is the curried form of a deterministic automata transition function: each state in X is tagged with a boolean value indicating whether or not it is an accept state, and each state defines a function from a vocabulary V to the next state.

2 Monoidal Grammars

Our goal in this section will be to formally describe a characterization of formal grammars in terms of monoidal categories. We will begin with the following definition:

Definition 2.1. A monoidal grammar is a tuple presentation of a monoidal category

Given two monoidal grammars over a common vocabulary V , we will define the morphisms between them to be the monoidal functors that the following diagram commutes, where are the inclusion maps from into the objects of respectively:

Monoidal grammars and their morphisms trivially form a subcategory of (V + , which we will denote Grammar.

We can characterize the formal grammars that we described earlier as monoidal grammars. For example:

Proposition 2.1. A pregroup grammar G = (V, B, D, s) is a monoidal grammar with Σand the relations R given by the snake equations, see [Sel10][4.1].

Proposition 2.2. A context-free grammar is a monoidal grammar with Σ

We can now express the languages of these constructions in a unified manner:

we call an arrow parse state of . Whenever in a parse state we have r = s, we call that parse state a parsing or a derivation.

Definition 2.3. The language of a monoidal grammar is the set of all strings that have a parsing.

We can think of the string diagrams of parse states and parsings as akin to partially constructed and fully constructed syntax trees respectively.

Example 2.3.1. Let G = (V, B, D, s) be a pregroup grammar with V = { Alice, loves, Bob }, . Then the following string diagram represents the arrow in

Example 2.3.2. Let G = (V, X, R, s) be a context-free grammar with V and X defined as:

for s the non-terminal representing a sentence, np the non-terminal representing a noun phrase, similarly, adj an adjective, itv an intransitive verb, tv a transitive verb, and vp a verb phrase. Where the production rules are:

Then the following string diagrams represent the arrows in

2.1 S-Monoidal Grammars

We can generalize monoidal grammars to associate weights (such as likelihoods or probabilities) with each parse state.

Definition 2.4. For some semiring S, an S-monoidal signature is a monoidal signature Σ = (Σequipped with a mapping

A homomorphism of S-monoidal signatures ) is given by a monoidal signature homomorphism such that the following diagram commutes:

Definition 2.5. An S-monoidal grammar (G, r) is a tuple (Σis an S-monoidal signature and is a monoidal grammar.

We can extend r into a function from by defining and ). If we do this, then r defines a functor from trivial monoidal category where arrows are the elements of S and both composition and tensor product are semiring multiplication.

We will refer to the category of S-monoidal grammars over a common vocabulary as . The morphisms in this category are monoidal functors such that the following diagram commutes:

Let us note that we can characterize any monoidal grammar G as the B-monoidal grammar (G, r) where B is the semiring of Booleans and r is the constant functor that maps all arrows in

3 Incremental Monoidal Grammars

In an S-monoidal grammar, we utilize the categorical composition to model the sequential nature of the parsing process and we use the tensor-product to model both the adjacency of words/types in a sentence and the parallel application of processing steps to different parts of a sentence. In this section we will explore how we can study the behavior of the tensor-product as a dynamic construct.

We can define an action of the free monoid on the category of endofunctors of to model the process of adding a “new” word to our system. For any word the endofunctor maps the object and the arrow . In order to model the interpretation of this new word in context, we look at the interaction of this endofunctor with the arrows in by defining a mapping as follows.

Intuitively, ) is the generalized “likelihood” of the parse state given ). Naturally, parse states that cannot be expressed as likelihood of 0 given

If we define a “listener” to be an entity that assigns the parse state a to the string of words defines how a new word may update a listener’s parse state. We can represent this process with an S-weighted automaton. The states in the machine are parse states , and the transition function is S-weighted automaton are also equipped with a function that maps states to S. For example, in the picks out the “accept states” of the automaton. For

some S-monoidal grammar () we will define this function as follows:

In order to formalize this relationship, we will use the theory of F-coalgebras on Set [Rut96]. An S-weighted automaton is a coalgebra of the functor W(X) = and by [SBBR13] a given transition function will uniquely define coalgebras of W up to bisimulation, so we simply need to define a mapping and coalgebras of W.

Definition 3.1. The Incremental Functor I is a map on the objects and arrows of that acts as follows:

• For some object (to be the coalgebra (

• For some morphism I(F) to be the map ’s mapping on arrows.

Lemma 3.1. The map I is a functor from to the category of coalgebras of W.

Proof. Since I(F) uses the functor mapping on arrows, I trivially preserves identity and composition. All that remains is to prove that is indeed a coalgebra homomorphism. To do this we need to show that the following diagram commutes:

When we apply the function we get the following function:

Here commute in . When we use the fact that preserves composition and tensor product we can rewrite

This makes it clear that:

Therefore, we can conclude that:

So is indeed a coalgebra homomorphism.

The incremental functor reveals an interesting duality. Given a monoidal category that describes how laterally composed (tensor product) objects are processed vertically (by composition of arrows), we get for free an automaton that defines a laterally evolving process upon vertically composed arrows. This automaton models how the process that the monoidal category describes will behave in the face of new information, and it reveals the monoidal category’s underlying incremental structure. This suggests the following corollary:

Corollary 3.1.1. If the S-monoidal grammars (have functors between them, then the coalgebras are bisimulatable.

) have functors between them, then have coalgebra homomorphisms between them. By [Lev], this implies that and ) are bisimulatable.

4 R≥0-Monoidal Grammars from Language Models

For some -monoidal grammar (G, r), we can think of r as assigning “confidence values” or “likelihoods” to the arrows in G. This allows us to model the relative probability of different interpretations of the same string of words and incorporate probabilistic and distributional language models into the parsing process. Let us explore what this means.

4.1 Maximum Parse State Likelihood

Definition 4.1. A language model over the monoidal grammar G is the following family of probability distributions:

• For any prefix , a probability distribution with finite support over the possible completions such that there is a parsing . We will denote this as

• For any sentence with at least one parsing probability distribution with finite support over the set of all such parsings. We will denote this as

We would like to develop a mapping from language models over a monoidal grammars -monoidal grammars (G, r) such that r(p) is indicative of the probability that a particular parse state is “optimal” according to that language model. We can formalize this notion with the following concepts:

Definition 4.2. For some prefix equipped with a parse state and sentence equipped with a parsing , we say that a is p-compliant if there exist some arrows

Definition 4.3. Given some prefix and sentence such that there exists a parsing , we will define a maximal parse state for to be a parse state that a is p-compliant and there exists no is not isomorphic to p and )-compliant.

Intuitively, the parsing a is p-compliant if the parse state p factors a, and p is a maximal parse state for a if p cannot be developed any farther while still maintaining the possibility to extend it into a. We can use these constructs to define the following quantity:

maximal parse state likelihood

Where is the set of all parsings a such that p is a maximal parse state for on a.

Intuitively, given a prefix , a parse state language model over the possible fully parsed sentences, ) is the probability that p will be the maximal parse state for the completed sentence.

Now say we want to construct an -monoidal grammar (G, r) such that for any parse state, r(p) is as close as possible to ). By the definition of S-monoidal grammars, r is defined by the values of r(g) for all generating arrows example, for any parse state p in G, we have that:

Here is any bag of generating arrows in Σwhen combined by composition and tensor product.

Therefore, our objective is to find r to minimize the following expression, where is the set of all parse states in G:

Because log is a monotonic function, we can rewrite this as the following, where is the number of times that g appears in

This is a standard linear approximation of the coefficients we can use the normal equations or gradient descent to learn the values of r(g). If not, we can use stochastic gradient descent [Bis06].

5 Future Work

We will close with some remarks on future directions.

• In section 4 we describe a simple algorithm for constructing an grammar from a language model over a monoidal grammar. However, since we do not describe a method for constructing such language model from a dataset and choice of monoidal grammar, it is not immediately clear how we can implement this algorithm in practice. Such a method would enable us to efficiently implement our algorithm on top of any linguistic parsing library for CFGs, pregroups, or other monoidal grammars.

• In section 3 we introduce a functor to the coalgebras of the functor . A natural question is whether there are functors in the reverse direction as well. In particular, we are curious as to whether I has left/right adjoints. If such functors exist, this would be further evidence of a fundamental connection between dynamical systems and monoidal categories. Such a discovery could help us unite these fields of study, as well as enable us to reduce questions about automata into questions about monoidal categories.

• Our framework is fully syntax-based and is currently agnostic to semantic meaning. In [CSC10], the authors define a functorial framework for affiliating syntactic structure with vector space semantics, and in [SPHK18] the authors project vector space semantics onto dynamic syntax trees in order to model incrementality. This raises the question of how we can best incorporate semantic information into our framework. For example, if we extend the functor from [CSC10] to operate over monoidal grammars in general, it would be interesting to see how the interaction between this functor and I compares to the construction in [SPHK18].

References

[Awo06] Steve Awodey. Category Theory. Ebsco Publishing, May 2006.

[BDVJ03] Yoshua Bengio, R´ejean Ducharme, Pascal Vincent, and Christian Janvin. A Neural Prob- abilistic Language Model. J. Mach. Learn. Res., 3:1137–1155, 2003.

[Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.

[BM08] Wojciech Buszkowski and Katarzyna Moroz. Pregroup Grammars and Context-free Grammars. Computational Algebraic Approaches to Natural Language, Polimetrica, 121, 2008.

[CCS08] Stephen Clark, Bob Coecke, and Mehrnoosh Sadrzadeh. A Compositional Distributional Model of Meaning. In Proceedings of the Second Symposium on Quantum Interaction (QI-2008), pages 133–140, 2008.

[CCS10] Stephen Clark, Bob Coecke, and Mehrnoosh Sadrzadeh. Mathematical foundations for a compositional distributional model of meaning. In J. van Benthem, M. Moortgat, and W. Buszkowski, editors, A Festschrift for Jim Lambek, volume 36 of Linguistic Analysis, pages 345–384. 2010.

[CGS13] B. Coecke, E. Grefenstette, and M. Sadrzadeh. Lambek vs. Lambek: Functorial Vector Space Semantics and String Diagrams for Lambek Calculus. ArXiv e-prints, 2013.

[Cho56] Noam Chomsky. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124, September 1956.

[Cho57] Noam Chomsky. Syntactic Structures. Mouton and Co., The Hague, 1957.

[Cho65] Noam Chomsky. Aspects of the Theory of Syntax. MIT Press, 1965.

[CSC10] Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. Mathematical foundations for a compositional distributional model of meaning. CoRR, abs/1003.4394, 2010.

[GS11] Edward Grefenstette and Mehrnoosh Sadrzadeh. Experimental Support for a Categorical Compositional Distributional Model of Meaning. In The 2014 Conference on Empirical Methods on Natural Language Processing., pages 1394–1404, 2011.

[Jos85] Aravind K. Joshi. How much context sensitivity is required to provide a reasonable structural description. 1985.

[JS88] Andr´e Joyal and Ross Street. Planar diagrams and tensor algebra. Unpublished

[JS91] Andr´e Joyal and Ross Street. The geometry of tensor calculus, I. Advances in Mathematics, 88(1):55–112, July 1991.

[KSP12] Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, and Stephen G. Pulman. A Unified Sentence Space for Categorical Distributional-Compositional Semantics: Theory and Experiments. In COLING, 2012.

[KSP13] Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, and Stephen Pulman. Separating Disambiguation from Composition in Distributional Semantics. page 10, 2013.

[Lam99] Joachim Lambek. Type Grammar Revisited. In Alain Lecomte, Fran¸cois Lamarche, and Guy Perrier, editors, Logical Aspects of Computational Linguistics, pages 1–27, Berlin, Heidelberg, 1999. Springer Berlin Heidelberg.

[Lam01] Joachim Lambek. Type Grammars as Pregroups. Grammars, 4:21–39, 2001.

[Lam08] Joachim Lambek. From Word to Sentence: A Computational Algebraic Approach to Grammar. Open Access Publications. Polimetrica, 2008.

[Lan98] S.M. Lane. Categories for the Working Mathematician. Graduate Texts in Mathematics. Springer New York, 1998.

[Lev] P. Levy. Coalgebra: Basic concepts.

[Mar47] A Markov. On certain insoluble problems concerning matrices. In Doklady Akad. Nauk SSSR, volume 57, pages 539–542, 1947.

[Pos47] Emil L. Post. Recursive Unsolvability of a problem of Thue. Journal of Symbolic Logic, 12(1):1–11, March 1947.

[Pow13] James F. Power. Thue’s 1914 paper: A translation. arXiv:1308.5858 [cs], August 2013.

[Rut96] J. J.M.M. Rutten. Universal coalgebra: A theory of systems. Technical report, CWI (Centre for Mathematics and Computer Science), Amsterdam, The Netherlands, The Netherlands, 1996.

[SBBR13] Alexandra Silva, Filippo Bonchi, Marcello Bonsangue, and Jan Rutten. Generalizing de- terminization from automata to coalgebras. Logical Methods in Computer Science, 9(1):9, March 2013.

[Sel10] P. Selinger. A Survey of Graphical Languages for Monoidal Categories. New Structures for Physics, pages 289–355, 2010.

[Shi87] Stuart M. Shieber. Evidence Against the Context-Freeness of Natural Language. In Walter J. Savitch, Emmon Bach, William Marsh, and Gila Safran-Naveh, editors, The Formal Complexity of Natural Language, Studies in Linguistics and Philosophy, pages 320–334. Springer Netherlands, Dordrecht, 1987.

[SPHK18] Mehrnoosh Sadrzadeh, Matthew Purver, Julian Hough, and Ruth Kempson. Exploring Semantic Incrementality with Dynamic Syntax and Vector Space Semantics. arXiv:1811.00614 [cs], November 2018.

Uber Ver¨anderungen von Zeichenreihen Nach Gegebenen Regeln. na, 1914.

[VW94] K. Vijay-Shanker and D. J. Weir. The equivalence of four extensions of context-free grammars. Mathematical Systems Theory, 27(6):511–546, November 1994.

[Wei88] David Jeremy Weir. Characterizing Mildly Context-Sensitive Grammar Formalisms. PhD thesis, University of Pennsylvania, Philadelphia, PA, USA, 1988. AAI8908403.