b

DiscoverSearch
About
My stuff
A Transfer-Learnable Natural Language Interface for Databases
2018·arXiv
Abstract
Abstract

Relational database management systems (RDBMSs) are powerful because they are able to optimize and answer queries against any relational database. A natural language interface (NLI) for a database, on the other hand, is tailored to support that specific database. In this work, we introduce a general purpose transfer-learnable NLI with the goal of learning one model that can be used as NLI for any relational database. We adopt the data management principle of separating data and its schema, but with the additional support for the idiosyncrasy and complexity of natural languages. Specifically, we introduce an automatic annotation mechanism that separates the schema and the data, where the schema also covers knowledge about natural language. Furthermore, we propose a customized sequence model that translates annotated natural language queries to SQL statements. We show in experiments that our approach outperforms previous NLI methods on the WikiSQL dataset and the model we learned can be applied to another benchmark dataset OVERNIGHT without retraining.

Introduction The majority of business data is relational data. Many applications are built on relational databases, including customer relations management Ngai, Xiu, and Chau (2009), financial fraud detection Ngai et al. (2011), and knowledge discovery in medicine Esfandiari et al. (2014), etc. However, a minimal understanding of SQL is required to operate on these data.

image

This gives rise to the study of natural language interfaces to database Androutsopoulos, Ritchie, and Thanisch (1995), or NLIDB, with the goal of making databases more accessible to the general public.

An NLIDB translates a natural language question to a structured query (in our case an SQL query) that can be executed by a database engine. Figure 1 shows two examples. In Example (a), we see that the natural language question mentions columns (a.k.a columns) of the database but the mentions and the column names are not exactly the same (e.g., “director” vs. “directed by”, “actor” vs. “star in”, etc.). In Example (b), we show that such differences could be quite significant, for example, “population” vs. “how many people live in”. Furthermore, the question mentions “Mayo” but does not mention the column name “county” where “Mayo” appears in. Still, the NLI needs to derive the column name “county” from “Mayo” as it is needed in the translated SQL query.

The challenge of NLIDB thus lies in the idiosyncrasy and complexity of a natural language: Column names in the schema and their mentions in the questions can have very different forms, and in some cases the mentions in the questions are missing, and need to be inferred from other parts of the question. As a result, an NLI developed for one particular database usually does work for another database.

In this paper, we explore the latent semantic structure of natural language queries against relational databases. As we

image

Figure 1: Natural language questions and their corresponding SQLs against two different databases. Note that the annotated SQLs of the two different questions are the same. This figure is better viewed on media with color support.

image

Figure 2: Framework overview.

can see in Figure 1, after we annotate the natural language queries (identifying mentions of database columns and values in the queries), the final annotated SQL queries for the two different questions against the two different databases are exactly the same. This means if we can separate out the idiosyncrasy of natural language, and focus on the underlying the semantics of relational queries, then it is possible to build a general purpose or transfer-learnable NLIDB. Experiments show that our method achieves 82% query execution accuracy on WikiSQL Zhong, Xiong, and Socher (2017), as well as zero-shot transfer-ability on OVERNIGHT Wang, Berant, and Liang (2015).

Overview of our Approach Our goal is to separate out data specific components and focus on the latent semantic structure in a natural language question. The said data specific components include the schema of the data and the usage of natural language spe-cific to the data schema, which we will describe in detail in the next section. Figure 2 shows the framework of our approach that consists of three stages:

1. Convert a natural language question q to its annotated form  qa;

2. Use a seq2seq model to translate  qato an annotated SQL sa;

3. Convert the annotated SQL  sato a normal SQL s.

Figure 1 showed two examples of  (q, qa, sa, s). In the annotation, we use placeholder  cito denote the i-th column of a database table and  vito denote a value that is likely to belong to the i-th column. For example, the term “directed by” in Figure 1(a) is annotated as  c2since it is a mention of the 2nd column of the database table, and “Jerzy Antczak” is annotated as  v2since it is a value of the second column. Note that a value that appears in a query may not appear in the database, thus, the annotation process must be able to annotate a term as a possible value of a database column. Also note that the process of converting an annotated SQL saback to a normal SQL s is deterministic. Thus, in the rest of the paper, we focus on step (1) and (2).

Our framework aims at separating natural language complexities out of relational query semantics. This allows us to focus on the remaining data agnostic and semantic part of the question with a seq2seq model. This simple idea is powerful because it reveals, for example, that the two different questions in Figure 1 may have exactly the same structure  SELECT c1 WHERE c2 = v2 AND c3 = v3). We consider the two natural language questions have the same “latent semantic.” In doing so, our method mitigates the obstacles in transfer learning, as a model trained on the first example could be applied to the second one due to shared annotated SQL query.

image

The task of annotating a natural language question is to detect mentions of database columns as well as database values in the question.

Challenges

Although some mentions of database columns and values can be detected exactly as they appear in a natural language question, in many cases, mention detection relies heavily on the context, as illustrated in the following three examples of challenges:

of year 2011” mentions the database column “best actor 2011”. Clearly, we cannot rely on exact string matching to detect mentions.

star in? (Example (a) in Figure 1). Here, “Piotr Adamczyk” is an actor, but the question does not mention the database column “actor” explicitly. We need to infer the column from the question, the database schema, and the database statistics.

“2” and “3” could refer to either “rebounds” or “points.” The correct resolution depends on the syntax, or the context of the question.

Next, we describe meta knowledge of a database, which is the minimal knowledge about the underlying database we need in order to understand a natural language question against the database. Then we describe annotation as a two-stage process consisting of mention detection and mention resolution. The first stage detects a set of many possible mentions of column and values, in which some mentions may be inconsistent with others since certain possibilities are mutually exclusive. The second stage finds a maximum subset of these mentions that is consistent, which constitutes the output annotation.

Meta Knowledge of a Database

We rely on the meta knowledge of a database to understand questions posed against the database. The meta knowledge contains the following:

1. Database schema C. The database schema is the definition of the columns of a database table. For example (a) in Figure 1, C = {Nomination, Actor, Film Name, Director, Country}.

2. Database statistics  Vcfor each column  c ∈ C. Intuitively, a human database operator knows which database columns need to be involved for answering a particular question even if the name of the column is not explicitly mentioned in the question. For example, in Figure 1 (a), we should consult the “Actor” column for “Piotr Adamczyk”. To approximate this mental process, we construct and leverage database statistics  Vcthat enables us to measure how likely a phrase is related to database column c for all c ∈ C. For example, we may create a language model for a column. Specifically, we may use pre-trained wordembeddings to decide if a particular term belongs to a particular column (the idea is that if a term is related to a column, its embedding should be close to the word-embedding space of values in the column).

3. Natural language expressions specific to a database, a database column, and values of the column  Pc. We need to know how people talk about things embodied by the database. Ideally, if we have a general purpose ontology that tells us everything about how language is used to describe any entity and its features, we might not need this particular handcrafted component in the schema. But since such ontology does not exist, we consider this component requires minimal effort to enable us to support cross domain NLI. Specifically, for column c, we collect a set of phrases  Pcthat mention c. For example, for c = population in Figure 1 (b), we have  Pc = {population of ⟨city⟩, size of  ⟨city⟩, how many people live in  ⟨city⟩, ...}. Our approach provides a direct way to inject this minimal knowledge to our model.

Mention Detection

In Mention Detection, we derive candidate mentions (of database columns and values) for terms in a question. Formally, we define a term to be a continuous span of words. A term may mention a database column or value. For example, for the question Which film directed by Jerzy Antczak did Piotr Adamczyk star in? in Figure 1 (a), “directed” could be a mention of column “Director” and “Piotr Adamczyk” could be a mention of a value in either the “Director” column or the “Actor” column. Note that a term may have multiple candidate mentions, and the entire set of candidate mentions may not be consistent, and we rely on the next step, mention resolution, to find a consistent set of mentions.

We treat mentions of database values and mentions of database columns differently. For mentions of columns, we have a set of known columns, and each of them can be mentioned in many different ways. For example, the term “directed” may be used to mention the “Director” column, and “how many people live in ...” may be used to mention the “Population” column. Our method, described below, combines syntax and semantics to deal with this scenario. For mentions of database values, however, we do not have a set of ground-truth values. One may ask “When was Joe Biden elected U.S. president?” against a database of U.S. presidents, but clearly, “Joe Biden” is not in the database (at least not yet). Thus, for mentions of values v, we evaluate whether a term, in its exact literal form in the question (denoted as wv), is likely to be a value of a column of the database.

We now describe mention detection for columns. For a term, defined as a span of words  w = [wa, wa+1, . . . , wb]in a question, we want to know if it is a mention of a column c. We do the following:

We consider w a possible mention of column c only if the w covers c as a string effectively and efficiently. Formally, w covers c if (1) There does not exist a larger span  w′that contains w (e.g.  w′ = [wa′, wa′+1, . . . , wb′]s.t.  a′ ≤a, b ≤ b′, |a−a′|+|b′−b| ≥ 1) and covers more words of c than w; (2) There does not exist a smaller span  w′that is contained within w(e.g.  w′ = [wa′, wa′+1, . . . , wb′]s.t.  a ≤ a′, b′ ≤ b, |a − a′| + |b′ − b| ≥ 1) and covers the same number of words of c as w. The computation of number of words covered is done by counting the number of pairs (x, y) where  x ∈ wand  y ∈ csuch that x and y are close. The measure of closeness could be either of the following:

Edit Distance. We denote  Fed(x, y) = D(x, y) /max (|x|, |y|) where D is character-level edit distance, and define x and y are close if  Fed(x, y) < τed.

Word-Embedding Space Distance. We denote Fsem(x, y) = 0.5 · (1 − Cosine (W(x), W(y)), and define x and y are close if  Fsem(x, y) < τsim.

We use meta database knowledge  Pcdirectly for mention detection. For example, in Figure 1(b), the phrase “how many people live in” is detected as a mention for column “Population”.

Taking question q = “ Who is the best actress of year 2011?” and column name c = “best actor 2011” as an example. We detect w = “best actress of year 2011” as a mention of c, as “best”, “actress” and “2011” in w have words covered in c, namely “best”, “actress”, “2011”. Note that the closeness of “actress’ and “actor” is assessed by  Fsem. Moreover, w is neither “the best actress of year 2011?” (violating (1) above) nor “best actress of year” (violating (2) above).

image

Figure 3: Use lowest common ancestor (LCA) in constituency tree to find correct pairing of column and value. Here “rebounds” is paired with “2” and “points” with “3”.

Mention Resolution

There could be many candidate mentions of columns and values for many terms. The goal of mention resolution is to figure out globally, what is the most likely subset of mentions that are consistent.

First, we use syntax or context to reduce the number of candidate mentions. For example, in question “For which player his rebounds is 2 and points is 3?” the values “2” and “3” could be mentions of either “rebounds” or “points”. Both are valid unless we take into consideration the syntax and context of the question. We observe that in the question’s constituency tree, paired columns and values are structurally close to each other. Therefore, we use the depth of Lowest Common Ancestor (LCA) in the question’s constituency tree as a measure of structural closeness. More specifically, for value v and column c, and their mentions  wvand  wcin the question respectively, the structure closeness is measured by  C(v, c) = maxxv∈wv,xc∈wc LCAdepth(xv, xc)where xv, xcare words in  wv, wcrespectively. We only consider matches (v, f) that have best structure closeness, e.g. when C(v, c) = maxc∈C C(v, c).

Second, we use a global metric to further reduce inconsistency. Note that the first step does not solve all the problems. For example, it is possible that multiple column candidates have the same maximal LCA to a given value in the question. We formalize this task as a graph matching problem: Consider a bipartite graph G = (V, C, E) where vertices V represent all mentions of values, vertices C all mentions of columns, and edges E contain edges between two vertices  (v ∈ V, c ∈ C)if (v, c) is considered as a possible match as described above. We then find a Maximum Bipartite Matching (MBM) Kuhn (1955) in G as a proxy to the mention resolution, since finding MBM is equivalent to finding a maximum subset of matchings that are compatible, where each edge (v, c) in MBM constitutes a matching of value and column in the output annotation. After finding MBM, we add all values and columns whose corresponding vertex is without any edge in G to the annotation, since adding them would not disturb any existing matching. Finally, all columns and values are deterministically given an index (c1, c2, . . . v1, v2, . . .) where (1) matching column and value share the same index, and (2) indices are ordered according to the earliest mention of indices’ corresponding column or value in the question.

We denote an input natural language question as q = (q1, q2, ..., qm), annotated question as  qa = (q1a, q2a, ..., qma ), and the corresponding annotated SQL query as  sa= (s1a, s2a, ..., sna). The result of the aforementioned annotation process on a natural language question q is represented as its annotated form  qa, which is the input to a seq2seq model that translates  qato an annotated SQL  sa. In this section we discuss the representation of the annotated input, followed by a Sequence Translation Model that produces  safor a given  qa.

Representation of Annotated Input

There are many options in representation. For example, in Figure 1(a), “directed by” is annotated as the mention of c2. We can either substitute “directed by” by  c2or insert c2after “directed by” in the question. We explore difference annotation encoding methods and propose our unique annotation encoding to separate schema from natural language questions without loss of schema information. We will show in the Experiment Section that the representation affects the schema separation performance dramatically.

Question What position did the player LeBron James play? Annotated Question: Symbol Substitution What c1 did the c2 v2Annotated Question: Symbol Appending What c1 [position] did the

image

Figure 4: An example with two format of annotated form.

Include annotations as additional symbols Intuitively, the annotation should enable schema separation that strips off schema-specific information from natural language questions or substitutes schema-specific information with symbols. However, replacing the mentions of columns with uni-fied symbols like  cior  vidiscards the diverse semantic meaning of the column texts. Therefore we propose to append the symbols into stacks rather than substituting them to leverage the semantics of column texts. This proposed approach is referred to as “column stack” in Experiment Section. Figure 4 shows the difference of the two approaches and highlights that the proposed approach provides more information to the downstream sequence model.

Annotated Question: When  v1[Piotr Adamczy] was nominated as c1[Best Actor in a Leading Role] ? Annotated SQL SELECT Nomination Date WHERE  c1 = v1

Annotated Question with table header encoding: When  v1 [PiotrAdamczy] was nominated as  g1[Best Actor in a Leading Role] ? | g1[Nomination]  g2 [Actor] g3[Film Name]  g4[Director]  g5[Nomination Date] Annotated SQL with table header encoding SELECT g5 WHERE c1 = v1

Figure 5: An example showing approaches with and without table header encoding.

Table Header When a column in SQL is not mentioned in the question, we can only rely on the sequence model to infer the column. For the example in Figure 5 , “Nomination Date” is not explicitly mentioned and needs to be inferred by the model. However, most of the columns (e.g., “Nomination Date”) consist of multiple tokens, which are hard to be inferred correctly by a sequence model. To encourage the correct inference of multi-token columns, we append all headers  g ∈ Cin the end of the annotated question, so that even if a column is not mentioned in and cannot be inferred only from the question, it could still be referred to as  giby the sequence model.

image

g5 [Nomination Date]” is appended to the annotated question. and thus simplifies the annotated SQL to the form of “SELECT g5 WHERE c1 = v1”. Besides referring to unmentioned headers, our strategy also contributes to a much smaller output vocabulary and thus makes it easier to train a sequence model.

Sequence Translation Model We train a seq2seq model to estimate  p(sa|qa), which captures the conditional probability of p(sa|qa)= �nj=1 p(sja|qa, s1:j−1a )with encoder and decoder.

Encoder is implemented as stacked bi-directional multilayer recurrent neural network (RNN) with using Gated Recurrent Unit (GRU) Cho et al. (2014). To keep the dimension consistent, we add an affine transformation before each layer of RNN, which is defined as  yi(l) = W0(l)x(l)i + b0(l), where  xi(l)is the input of the l-th layer in the i-th position. W0(l)and  b0(l)are model parameters. The hidden state of the forward RNN and backward RNN are computed as:

image

We concatenate forward state vector and backward state vector as  hi(l) = [−→hi(l), ←−hi(l)], i = 1, 2, ..., m. The input of each layer is computed as  xi(1) = φ(qia) (φis the word embedding lookup function)  xi(l+1) = hi(l).

Decoder is an one-layer attentive RNN with copy mechanism. We use Bahdanau’s attention (Bahdanau, Cho, and Bengio, 2014) as follows: At each time step i in the decoder, the decoding step is defined as:

image

where W1, W2, W3, v, Uare model parameters, (d1, d2, ..., dn)the hidden states of the decoder RNN and j the index enumerating all positions in the encoder. In NLIDB, annotations in the output SQL often correspond directly from the input natural language question, To encourage the model choose tokens that appears in the input. We introduce copy mechanism that samples output token  staas

image

Note that this is different from the vanilla copy mechanism where the output is sampled through softmax over entire word vocabulary as  p(sia|qa, s1:i−1a ) ∝ exp(U[di, βi]).

Natural Language Interface to Database (NLIDB) Natural Language Interface to Database aims at providing an interactive bridge between humans and machines, where human users can issue a question in natural language text, which would then be translated to a structured query that can be executed by a database engine. Androutsopoulos, Ritchie, and Thanisch (1995) first explores this task with concrete examples defining this problem and highlights the separation between linguistic and database-derived information. Later Popescu, Etzioni, and Kautz (2003) proposes to identify questions whose answers are tractable solely from the database, and Giordani and Moschitti (2012) incorporates tree kernels in ranking candidate queries. Many recent advances can be categorized into two groups. The first group uses semantic parsing Wang, Berant, and Liang (2015); Pa- supat and Liang (2015); Jia and Liang (2016) as well as some extensions that support cross-domain semantic parsing Herzig and Berant (2017); Su and Yan (2017). However, due to the idiosyncrasy and complexity of natural languages, most works in this group are confined in narrow domains. The other group relies on neural based methods where sequence to sequence models are leveraged to translate input questions to SQL queries, optional combined with the help of user feedback Iyer et al. (2017) and reinforcement learning Zhong, Xiong, and Socher (2017).

Slot Filling in Dialogue System Dialogue system aims at communicating with a user in a session with multiple turns of dialogs, where state, or conceptually what the session is talking about, needs to be tracked for dialogue system to archive good performance Young et al. (2010). Commonly the dialogue system identifies and tracks entities that appear across turns as slots in a process called slot filling. These slots and entities that can fill in these slots are usually specific to the domain that the dialogue system is focusing on. For example, slots can be food, airport or city names, and therefore are from a pre-defined, externally crafted list of possible values. Recently, neural based approaches have been proposed for tracking state: Henderson, Thomson, and Young (2014) proposes a simple recurrent network that predicts the probability of each word in dialog being one of pre-defined slots, which is extended by Mrkši´c et al. (2015), a hierarchical model that can handle cases where entities can be from one of multiple domains. To specifically provide better tracking for ranking slot values in dialog, Belief Tracker Mrkši´c et al. (2017) sums up separating representations of system output, user feedback and candidate slot values. To further improve the performance, Wen et al. (2017) considers a policy network that arbitrates the outputs from multiple models, including the aforementioned belief tracker, a sequence model that encodes user input, and a generation network that produces system output.

Closest to our proposed work is Xu, Liu, and Song (2017), which employs a sketch-based approach that represents an SQL as a template with slots, and the model predicts values from a limited candidate set to be filled in each slot. This is different from our work that focuses on annotation and does not restrict SQL to a particular template-based form. Another close work is TypeSQL Yu et al. (2018) that enriches the inference of columns and values using domain-specific knowledge-based model that searches five types of entities on Freebase, an extra large database, which is contrast to our work which does not rely on extra database knowledge.

image

Table 1: Comparison of models. lf, qm, ex represent logical forms, exact query match, and query execution accuracy respectively. Performance on the first block are copied from the corresponding papers. “–” and “+” mean removing or adding one component from our best approach respectively for ablation.

* We report performance without using extra knowledge base for a fair comparison

Sequence-to-sequence Generation model Sequence to Sequence (referred to as seq2seq in the rest of the paper) learning Sutskever, Vinyals, and Le (2014) has led to many advances in neural semantic parsing models. Notable advances in sequence learning include attention mechanism Bahdanau, Cho, and Bengio (2014) and pointer network Vinyals, Fortunato, and Jaitly (2015) that boost performance for sequence learning and enable it to handle long sequence and rare words. They have seen successful applications on language model Merity et al. (2016), text summarization Gu et al. (2016), text understanding Wen et al. (2017), and neural computing Graves et al. (2016). Our model also benefits from these techniques since our model needs to see both information packed in long sequence and rare words that only appear in few tables.

We conduct experiments1 on two scenarios: (1) in-domain scenario of NLIDB trained and evaluated on WikiSQL dataset Zhong, Xiong, and Socher (2017), and (2) cross-domain scenario where we evaluate our model (trained on WikiSQL)’s transferring accuracy on OVERNIGHT dataset (Wang and Yang, 2015) without extra training in the target domain. We use three metrics for evaluating the query synthesis accuracy: (1) Logical-form accuracy. We compare the synthesized SQL query against the ground truth for whether they agree token-by-token in their logical form, as proposed in Zhong, Xiong, and Socher (2017). (2) Query-match accuracy. Like logical-form accuracy, except that we convert both synthesized SQL query and the ground truth into canonical representations before comparison. This metric can eliminate false negatives cases, such as two semantically identical SQL queries being different only in the literal ordering of condition clauses. (3) Execution accuracy. We execute both the synthesized query and the ground truth query and compare whether the results agree, as proposed in Zhong, Xiong, and Socher (2017).

WikiSQL

We train and evaluate our NLIDB model on WikiSQL, which contains 87673 records of natural language questions, SQL queries, and 26521 database tables. Since tables are not shared among the training / validating / testing splitting provided in the dataset, models evaluated on it are supposed to generalize to new questions and database schema.

Training Details The annotation process uses  τed = 0.5and  τsim = 0.15. For the sequence model, we use two layers of stacked GRU with hidden size 200 for encoder, and 400 for decoder. The input and output layers of both encoder and decoder share the tied embedding weights. We initialize the embedding weights with pre-trained Glove embedding with dimension D = 300, and embeddings for tokens not covered by GloVe with random vector. Symbols introduced by annotation (such as  c1) are also treated as tokens, each of them being represented by the concatenation of the embedding of annotation type (c or v) and index. Also, the embedding of annotation type and index are randomly initialized with D′ = 150so the concatenations has the same dimension as D = 300. In training we use gradient clipping with threshold 5.0, and in inference we use beam search with width 5.

Evaluation We compare our method with previous methods through three aforementioned metrics: accuracies in terms of logical form exact match, exact query match, and the results of query execution. As shown in Table 1, our result outperforms these previous methods, including the state-of-the-art Coarse2Fin Dong and Lapata (2018) by 0.3% for exact query match, and 3.7% for query execution. This demonstrates that our method enables high transfer-ability to unseen tables, since in WikiSQL database tables are not shared among training and testing splittings.

We note that TypeSQL achieves high accuracy in the "content-sensitive" setting where it queries Freebase when handling natural language questions in training as well as in inferencing. TypeSQL is thus limited because it cannot easily generalize to content that is not covered by Freebase. Since our focus is automated process with generalization, we did not take this setting into consideration.

image

Table 2: OVERNIGHT transfer accuracy. “–” means removing one component from our best approach respectively for ablation.

Ablation In Table 1, we conduct ablation analysis to demonstrate the contribution to performance from different components of our model. Removing each component of our method leads to a decreases in performance: The removal of (1) a layer in encoder (e.g. using only one layer of GRU), (2) copy mechanism, (3) column stack (e.g. using column substitution ), and (4) encoding of table header, each respectively decrease performance on testing set.

Since in our framework the annotation and sequence modeling are separate processes, we also test our annotation method combined with the transformer model 2, an alternative and state-of-the-art architecture for sequence modeling such as machine translation. With the same annotation, the transformer model shows worse performance. We hypothesis that this is due to NLIDB task being different from translation tasks: NLIDB has huge difference between vocabulary sizes in source space and target space, and only outputs the symbols while learning original text implicitly.

Cross-domain transfer-ability

For cross-domain evaluation, we evaluate the transfer-ability of our model that is trained on one domain (WikiSQL in our case) and tested on other domains to assess its transfer-ability. This task is challenging since the model is required to model domains not seen before.

Dataset We use OVERNIGHT dataset (Wang and Yang, 2015), consisting of pairs of natural language query and dataset-specific logical form, as target domains. In particular, the query and logical form pairs are originally categorized into eight sub-domains, of which we use five of them (BASKETBALL, CALENDAR, HOUSING, RECIPES, and RESTAURANTS) where the logical form can be feasibly converted to SQL for evaluation. We then annotate these five sub-domains of OVERNIGHT dataset.

Evaluation We evaluate the transfer-ability of our model that is trained on WikiSQL on five sub-domains of OVERNIGHT dataset. Since OVERNIGHT SQL sketch is highly variant and different from WikiSQL sketch. We make a reasonable assumption that only SQL sketch compatible with WikiSQL are considered in the transfer-ability evaluation. Table 2 presents our transfer-ability performance,

2We use pre-trained GloVe embedding, transformer implementation from tensor2tensor https://github.com/

image

where Pruning Rate represents the percentage of sketch noncompatible records in each category, and number of records represents the number of sketch compatible records in each category, which is the size of our evaluation set. Transfer accuracy is calculated over sketch compatible records and based on recovered SQL. It is shown that our model performs high transfer-ability with zero-shot learning on OVERNIGHT dataset.

Ablation We conduct ablation analysis to demonstrate the transfer-ability contributed by our annotation strategies as well as different components of our model as shown in Table 2. When we remove column stack, the model not only performs worse for WikiSQL (see Table 1), but also transfers poorly to OVERNIGHT. The overall query match accuracy is reduced by 22%. In contrast to the results shown in Table 1, encoding table headers hurts transfer ability. The reason is that, in OVERNIGHT dataset, only five sub-domains are considered, so we use pre-collected paraphrase set to annotate columns. Most of the records are fully annotated, and encoding table headers does not enhance annotation coverage. On the contrary, it feeds the model redundant information which hurts the overall accuracy.

We also conduct ablation analysis to demonstrate the transfer-ability contributed by different component of our model. As shown in Table 2, the removal of 1 GRU layer results in 1% overall accuracy decrease. The model using only one layer of GRU is not capable of capturing all the conditional probabilities, thus causing lower transfer-ability.

Counter-intuitively, the model without copy mechanism has 1.5% higher transfer accuracy than the original model. Copy mechanism contributes to the test accuracy on WikiSQL but hurts the transfer-ability on OVERNIGHT. We hypothesis that transfer-ability is not fully correlated with the accuracy of the original model, it also depends on schema extraction and how the model generalizes to another domain.

In this work, we propose an NLIDB as a general paradigm to convert natural language queries to SQL queries for any database. The main contribution of our work is to separate meta data from the data itself and learn, transfer, and accumulate knowledge of natural language and domain-specific knowledge separately. Our extensive experimental analysis demonstrates the advantage of our approach over state-of-the-art approaches in standard datasets.

Androutsopoulos, I.; Ritchie, G. D.; and Thanisch, P. 1995. Nat- ural language interfaces to databases–an introduction. Natural language engineering 1(01):29–81.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Dong, L., and Lapata, M. 2018. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Esfandiari, N.; Babavalian, M. R.; Moghadam, A.-M. E.; and Tabar, V. K. 2014. Knowledge discovery in medicine: Current issue and future trend. Expert Systems with Applications 41(9):4434–4463.

Giordani, A., and Moschitti, A. 2012. Translating questions to sql queries with generative parsers discriminatively reranked. Proceedings of COLING 2012: Posters 401–410.

Graves, A.; Wayne, G.; Reynolds, M.; Harley, T.; Danihelka, I.; Grabska-Barwi´nska, A.; Colmenarejo, S. G.; Grefenstette, E.; Ramalho, T.; Agapiou, J.; et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature 538(7626):471.

Gu, J.; Lu, Z.; Li, H.; and Li, V. O. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 1631–1640.

Henderson, M.; Thomson, B.; and Young, S. 2014. Word-based di- alog state tracking with recurrent neural networks. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 292–299.

Herzig, J., and Berant, J. 2017. Neural semantic parsing over mul- tiple knowledge-bases. arXiv preprint arXiv:1702.01569.

Huang, P.-S.; Wang, C.; Singh, R.; Yih, W.-t.; and He, X. 2018. Natural language to structured query generation via metalearning. In Proceedings of NAACL HLT 2018.

Iyer, S.; Konstas, I.; Cheung, A.; Krishnamurthy, J.; and Zettle- moyer, L. 2017. Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 963–973.

Jia, R., and Liang, P. 2016. Data recombination for neural semantic parsing.

Kuhn, H. W. 1955. The hungarian method for the assignment problem. Naval research logistics quarterly 2(1-2):83–97.

Merity, S.; Xiong, C.; Bradbury, J.; and Socher, R. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Mrkši´c, N.; Séaghdha, D. Ó.; Thomson, B.; Gasic, M.; Su, P.-H.; Vandyke, D.; Wen, T.-H.; and Young, S. 2015. Multi-domain dialog state tracking using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, 794–799.

Mrkši´c, N.; Séaghdha, D. Ó.; Wen, T.-H.; Thomson, B.; and Young, S. 2017. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 1777–1788.

Ngai, E.; Hu, Y.; Wong, Y.; Chen, Y.; and Sun, X. 2011. The appli- cation of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50(3):559–569.

Ngai, E. W.; Xiu, L.; and Chau, D. C. 2009. Application of data mining techniques in customer relationship management: A literature review and classification. Expert systems with applications 36(2):2592–2602.

Pasupat, P., and Liang, P. 2015. Compositional semantic parsing on semi-structured tables. In ACL.

Popescu, A.-M.; Etzioni, O.; and Kautz, H. 2003. Towards a theory of natural language interfaces to databases. In Proceedings of the 8th international conference on Intelligent user interfaces, 149–157. ACM.

Su, Y., and Yan, X. 2017. Cross-domain semantic parsing via paraphrasing. arXiv preprint arXiv:1704.05974.

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to se- quence learning with neural networks. In Advances in neural information processing systems, 3104–3112.

Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. In Advances in Neural Information Processing Systems, 2692– 2700.

Wang, W. Y., and Yang, D. 2015. That’s so annoying!!!: A lexi- cal and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2557– 2563. Lisbon, Portugal: Association for Computational Linguistics.

Wang, Y.; Berant, J.; and Liang, P. 2015. Building a semantic parser overnight. In ACL.

Wen, T.-H.; Vandyke, D.; Mrkši´c, N.; Gasic, M.; Barahona, L. M. R.; Su, P.-H.; Ultes, S.; and Young, S. 2017. A networkbased end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, volume 1, 438–449.

Xu, X.; Liu, C.; and Song, D. 2017. Sqlnet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436.

Young, S.; Gaši´c, M.; Keizer, S.; Mairesse, F.; Schatzmann, J.; Thomson, B.; and Yu, K. 2010. The hidden information state model: A practical framework for pomdp-based spoken dialogue management. Computer Speech & Language 24(2):150–174.

Yu, T.; Li, Z.; Zhang, Z.; Zhang, R.; and Radev, D. 2018. Type- sql: Knowledge-based type-aware neural text-to-sql generation. arXiv preprint arXiv:1804.09769.

Zhong, V.; Xiong, C.; and Socher, R. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.


Designed for Accessibility and to further Open Science