We can define a Personal Data Entity (PDE) as any information about a person. Such information can be present both in the public domain and in personal data.
Figure 1: Personal Data Entities in unstructured text.
The above sentences are from the publicly available Wikipedia page of an elected official. These sentences by themselves cannot be considered personal data. But they contain Personal Data Entities (PDEs), i.e. entities which describe a person. A news article may also contain such mentions about an elected official.
For a number of applications in Data Protection, fraud prevention and business intelligence, there is a need to extract Personal Data Entities, classify them at a fine grained level, and identify relationships between people. Manually created Person Ontologies are used for this purpose in many enterprises cutting across domains. The challenges however are in populating an Ontology Graph based on such an Ontology.
Figure 2: Attributes of Person Entity
The first challenge in Ontology population is in identifying attributes at a fine grained level. In Figure 1, Brigham Young University could be classified coarsely as ORGANISATION by a Named Entity Recognizer. In recent years, a number of Neural Fine Grained Entity Classification (NFGEC) models have been proposed, which assign fine grained labels to entities based on context. They might type Brigham Young University as /org/education. However, the focus of such systems has not been on PDEs. They do not treat the problem of identifying PDEs any differently from that of identifying other entities. For the purpose of Ontology population, it might be desirable to assign the below labels.
In typical relation extraction tasks, a person and their place of birth could be considered a relation. However, in a Person Ontology, we might want to have only people as a first class concept. Hence we want to extract relations between people, while other entities like place of birth are treated as attributes of the Person entity. In our example, we might be satisfied with the following relations, though more specific relations such as classmate or ex-spouse would be more accurate.
Figure 3: Relations among Person Entities
We summarize our contributions in this work as follows:
• We introduce a new dataset annotated with 36 Personal Data Entity Types (PDET) and 9 Personal Data Entity Relations (PDER).
• We propose an approach to improve state of the art models for fine-grained entity classification, using only light weight features.
• We share our results on running a semantic relation model on sentences rather than triples, by incorporating sentence embeddings. These results, however, have not been encouraging.
• Finally, we implement a personal data ontology population pipeline by using graph neural networks to augment the relations from the relation extraction model.
Entity Classification
Entity classification is a well known research problem in Natural Language Processing (NLP). (Ling and Weld 2012) proposed the FIGER system for fine grained entity recognition. In recent years, (Yogatama, Gillick, and Lazic 2015), (Shimaoka et al. 2017), (Choi et al. 2018) have proposed different neural models for context dependent fine grained entity classification. (Abhishek, Anand, and Awekar 2017), (Xu and Barbosa 2018) proposed improvements to such models using better loss functions. (Yogatama, Gillick, and Lazic 2015) showed the relevance of hand-crafted features for entity classification. (Shimaoka et al. 2017) further showed that entity classification performance varies significantly based on the input dataset (more than usually expected in other NLP tasks).
Relation Extraction
Models making use of dependency parses of the input sentences, or dependency-based models, have proven to be very effective in relation extraction, as they can easily capture long-range syntactic relations. (Zhang, Qi, and Manning 2018) proposed an extension of graph convolutional networks that is tailored for relation extraction. Their model encodes the dependency structure over the input sentence with efficient graph convolution operations, then extracts entity-centric representations to make robust relation predictions. Hierarchical relation embedding (HRE) focuses on the latent hierarchical structure in the data. (Chen et al. 2018) introduced neighbourhood constraints in node-proximity-based or translational methods.
Datasets
(Ling and Weld 2012) introduced the Wiki dataset that consists of 1.5M sentences sampled from Wikipedia articles. The OntoNotes dataset by (Weischedel et al. 2013) consists of 13,109 news documents, of which 77 test documents are manually annotated (Gillick et al. 2014). The BBN dataset by (Weischedel and Brunstein 2005) consists of 2,311 Wall Street Journal articles which are manually annotated using 93 types. (Murty et al. 2017) have proposed a much larger label set based on Freebase.
Figure 4: Personal Data Entity Types (PDET)
Figure 5: Personal Data Entity Relations (PDER)
(Ling and Weld 2012) proposed the FIGER entity type hierarchy with 112 types. (Gillick et al. 2014) proposed the Google Fine Type (GFT) hierarchy and annotated 12,017 entity mentions with a total of 89 types from their label set. These two hierarchies are general purpose label sets covering a wide variety of domains. (Dasgupta et al. 2018) proposed a larger set of Personal Data Entity Types with 134 entity types. From these, we have selected the 36 personal data entity types that were found in our input corpus, as shown in Figure 4.
For the relation extraction label set, YAGO (?) contained 17 relations, TACRED (Zhang et al. 2017) proposed 41 relations and the UDBMS (DBPedia Person) dataset (Lu, Chen, and Zhang 2016) proposed 9 relations. We have used the 9 Personal Data Entity Relations (PDER) shown in Figure 5.
Personal Data Annotators
Any system that assigns a label to a span of text can be called an annotator. In our case, these annotators assign an entity type to every entity mention. We have experimented with an enterprise (rule/pattern based) annotation system called SystemT introduced by (Chiticariu et al. 2010).
SystemT provides about 25 labels, which are predominantly coarse grained labels.
Figure 6: Personal Data Annotators
We use these personal data annotators in 3 ways:
• To annotate the dataset with entities for the entity classification task.
• As part of the Personal Data Classification pipeline, where for some of the classes, the output of these PDAs is directly used as the entity type. These are types like email addresses, zip codes and numbers, where rule-based systems provide coarse labels at high precision.
• To create a series of labeling functions that annotate relations between entities. These relations are used to bootstrap link prediction models, which in turn populate the Ontology Graph.
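To make the notion of a rule-based annotator concrete, the sketch below assigns coarse labels to spans using regular expressions. It is not SystemT or its rule language; the two patterns, the label names `EMAIL_ADDRESS` and `ZIP_CODE`, and the `annotate` function are assumptions made for this illustration.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    start: int   # character offset where the span begins
    end: int     # character offset one past the end of the span
    label: str   # coarse entity type assigned to the span
    text: str    # surface form of the span


# Hypothetical patterns; a production annotator would use far more careful rules.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ZIP_CODE": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def annotate(sentence: str) -> List[Annotation]:
    """Return every pattern match in the sentence, ordered by start offset."""
    found = [
        Annotation(m.start(), m.end(), label, m.group())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(sentence)
    ]
    return sorted(found)


if __name__ == "__main__":
    for ann in annotate("Write to jane.doe@example.org, Provo, UT 84602."):
        print(ann)
```

Rules of this kind ignore context, which is why they suit types such as email addresses and zip codes, whose surface form alone is usually enough to decide the label.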
While neural networks have recently improved the performance of entity classification on general entity mentions, pattern matching and dictionary based systems continue to be used for identifying personal data entities in the industry.
We believe our proposed approach, consisting of modifications to state-of-the-art neural networks, will work on personal datasets for two reasons. (Yogatama, Gillick, and Lazic 2015) showed that hand-crafted features help, and (Shimaoka et al. 2017) have shown that performance varies based on training data domain. We have incorporated these observations into our model, by using coarse types from rule-based annotators as side information.
We used our Personal Data Annotators to create a number of labeling functions like those shown below to create a set of relations between the entities.
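A labeling function of this kind can be sketched as a small rule that looks at a pair of person mentions in a sentence and either votes for a relation or abstains. The cue words, the relation names `spouse` and `classmate`, and the helper names below are hypothetical and chosen only for illustration; they are not the rules used to build the dataset.

```python
import re
from typing import Callable, List, Optional, Tuple

# A candidate pair: the sentence and the two person mentions found in it.
Candidate = Tuple[str, str, str]
ABSTAIN = None  # returned when a rule does not fire


def text_between(sentence: str, first: str, second: str) -> str:
    """Lower-cased text between the first occurrences of the two mentions."""
    i, j = sentence.find(first), sentence.find(second)
    if i < 0 or j < 0:
        return ""
    if i > j:  # make (i, first) describe the leftmost mention
        i, j, first = j, i, second
    return sentence[i + len(first):j].lower()


def lf_spouse(c: Candidate) -> Optional[str]:
    sentence, p1, p2 = c
    cue = re.search(r"\b(married|wife|husband)\b", text_between(sentence, p1, p2))
    return "spouse" if cue else ABSTAIN


def lf_classmate(c: Candidate) -> Optional[str]:
    sentence, p1, p2 = c
    cue = re.search(r"\b(classmate|studied with)\b", text_between(sentence, p1, p2))
    return "classmate" if cue else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[Candidate], Optional[str]]] = [lf_spouse, lf_classmate]


def label(c: Candidate) -> List[str]:
    """Collect the votes of all labeling functions that fired on the candidate."""
    votes = [lf(c) for lf in LABELING_FUNCTIONS]
    return [v for v in votes if v is not ABSTAIN]


if __name__ == "__main__":
    print(label(("Jane Roe married John Doe in 1990.", "Jane Roe", "John Doe")))
```

Because each function only inspects surface cues, the resulting relations are noisy; they serve as a seed set rather than as gold labels.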
We have created this dataset from the Wikipedia pages of members of the US House of Representatives and of the European Parliament. We obtained the names of 1196 elected representatives from the listings of these legislatures. These listings provide the names of the elected representatives and other details like contact information. However, this semi-structured data by itself cannot be used for training a neural model on unstructured data.
Hence, we first obtained the Wikipedia pages of elected representatives. We then used Stanford OpenNLP to split the text into sentences and tokenize the sentences. We ran the Personal Data Annotators on these sentences, providing the bulk of the annotations that are reported in Table 1.
We then manually annotated about 300 entity mentions which require fine grained types like /profession. The semi-structured data obtained from the legislatures had name, date of birth, and other entity mentions. We needed a method to find these entity mentions in the Wikipedia text, and assign their column names or manual labels as PDEs.
We used the method described in (Chiticariu et al. 2010) to identify the spans of the above entity mentions in Wikipedia pages. This method requires the creation of dictionaries, each named after an entity type and populated with entity mentions. This approach does not take the context of the entity mentions into account while assigning labels, and hence the data is somewhat noisy. However, labels for name, email address, location and website do not suffer much from the lack of context, and hence we went ahead and annotated them.
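The dictionary-based matching described above can be sketched in a few lines. This is a simplified stand-in for the SystemT dictionaries, not their actual implementation; the function name `dictionary_annotate`, the dictionary contents, and the placeholder names in the example are assumptions.

```python
import re
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity type)


def dictionary_annotate(text: str, dictionaries: Dict[str, Set[str]]) -> List[Span]:
    """Label every occurrence of a dictionary entry with the dictionary's name.

    Assumes each entry starts and ends with a word character, so that the
    word-boundary anchors below behave as intended.
    """
    spans = []
    for entity_type, mentions in dictionaries.items():
        for mention in mentions:
            # Word boundaries avoid matching inside longer tokens.
            pattern = r"\b" + re.escape(mention) + r"\b"
            for m in re.finditer(pattern, text):
                spans.append((m.start(), m.end(), entity_type))
    return sorted(spans)


if __name__ == "__main__":
    dictionaries = {"name": {"Jane Roe"}, "location": {"Provo"}}
    print(dictionary_annotate("Jane Roe was born in Provo.", dictionaries))
```

The matcher never looks at the surrounding words, so a string that appears in two dictionaries receives both labels wherever it occurs. This is the source of the noise mentioned above.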
Table 1: Statistics on personal data annotations
We use the architecture described in (Dasgupta et al. 2018), which in turn was based on (Shimaoka et al. 2017). It consists of an encoder for the left and right contexts of the entity mention, another encoder for the entity mention itself, and a logistic regression classifier working on the features from the aforementioned encoders. An illustration of the model is shown in Figure 7a. The major drawback of (Shimaoka et al. 2017) was the use of custom hand-crafted features, tailored for the specific task, which makes generalization and transferability to other datasets and similar tasks difficult. Building on these ideas, we have attempted to augment neural network based models with low-level linguistic features, which are obtained cheaply, to push overall performance. Below, we elaborate on some of the architectural tweaks we attempt on the base model.
Figure 7: Neural Models for Entity Classification and Relation Extraction
Similar to (Shimaoka et al. 2017), we use separate encoders for the entity mention and for the left and right contexts. For the entity mention, we use the average of the word embeddings of its words. For the left and right contexts, we employ the three different encoders mentioned in (Shimaoka et al. 2017), viz.
• The averaging encoder, which, like the mention encoder, uses the average of the word embeddings as the context representation.
• The RNN encoder, which runs an RNN over the context and takes the final state as the representation of the context.
• The attentive encoder, which runs a bidirectional RNN over the context, and employs self-attention to obtain scores for each word, which are in turn used to get a weighted sum of the states to use as the representation.
Details of the different encoders can be found in (Shimaoka et al. 2017), and we omit them here for brevity.
The features from the mention encoder and the left and right context encoders are concatenated and passed to a logistic regression classifier. If we consider $v_l \in \mathbb{R}^{D}$ to be the representation of the left context, $v_r \in \mathbb{R}^{D}$ to be the representation of the right context, and $v_m \in \mathbb{R}^{D}$ to be the representation of the entity mention, each being $D$ dimensional, then these features are concatenated to form $v = [v_l; v_r; v_m] \in \mathbb{R}^{3D}$, which is passed to the logistic regression classifier, which in turn computes the function:
$$y = \frac{1}{1 + \exp(-W v)}$$
where $W \in \mathbb{R}^{K \times 3D}$ is the set of weights that project the features from a $3D$ dimensional feature space to a $K$ dimensional output, where $K$ is the number of labels, and $y \in \mathbb{R}^{K}$. Since the output is a binary vector, we employ a binary cross entropy loss during training. Given the predictions $y$ and the ground truth $t$ for a sample, the loss is defined as:
$$L(y, t) = -\sum_{k=1}^{K} \left[ t_k \log(y_k) + (1 - t_k) \log(1 - y_k) \right]$$
We employ stochastic mini-batch gradient descent to optimize the above loss function, and the details are specified later in the experimental results section.
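The forward pass and loss described above can be illustrated with a dependency-free sketch. It uses the averaging encoder for all three inputs, writes `v_l`, `v_r` and `v_m` for the left context, right context and mention representations (names chosen for this sketch), and uses toy values with D = 2 and K = 2. It is not the trained model; the weights are set to zero only so that the result is easy to check by hand.

```python
import math
from typing import List

Vector = List[float]


def average(vectors: List[Vector]) -> Vector:
    """Averaging encoder: element-wise mean of the word embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def predict(W: List[Vector], v: Vector) -> Vector:
    """Logistic regression: y_k = 1 / (1 + exp(-W_k . v)) for each label k."""
    return [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, v)))) for row in W]


def bce_loss(y: Vector, t: Vector) -> float:
    """Binary cross entropy summed over the K labels."""
    return -sum(tk * math.log(yk) + (1 - tk) * math.log(1 - yk) for yk, tk in zip(y, t))


if __name__ == "__main__":
    # Toy word embeddings with D = 2 (illustrative values only).
    v_l = average([[1.0, 0.0], [0.0, 1.0]])   # left context
    v_r = average([[0.5, 0.5]])               # right context
    v_m = average([[1.0, 1.0], [3.0, 1.0]])   # entity mention
    v = v_l + v_r + v_m                       # list concatenation, length 3D = 6
    W = [[0.0] * len(v) for _ in range(2)]    # K = 2 labels, untrained weights
    y = predict(W, v)
    print(y, bce_loss(y, [1.0, 0.0]))
```

With all-zero weights every label receives probability 0.5, so the loss for any binary target over two labels equals 2 ln 2. This makes a convenient sanity check before training.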
Table 2: Statistics of the datasets used in our experiments.
The results on the Elected Reps dataset, as can be seen in Table 3, clearly show the same trend, i.e. adding token level features improves performance across the board, for all metrics and for any choice of encoder. The important thing to note is that these token level features can be obtained cheaply, using off-the-shelf NLP tools to deliver linguistic features such as POS tags, or using existing rule based systems to deliver task or domain specific type tags. This is in contrast to previous work such as (Ling and Weld 2012), (Yogatama, Gillick, and Lazic 2015) and others, who resort to carefully hand-crafted features.
Table 3: Entity Classification performance with and without light weight features.
Extracting meaningful information from text requires models capable of extracting semantic relations, which are more comprehensive in terms of relational properties. A practical solution is to use translation-based graph embedding methods along with sentence embeddings, extracting relations from the vector representations of the graph and the sentence together: the graph embedding provides a relation-specific projection, and the sentence embedding provides contextual information.
The majority of translation-based knowledge graph embedding methods project source and target entities into a k-dimensional vector space. They typically focus only on simple relations, and less on comprehensive relations. In this work, we evaluate On2Vec (Chen et al. 2018) on the task of relation extraction. On2Vec proposed a two-component model: the Component Specific Model (CSM), which encodes concepts and relations into a low-dimensional embedding space without the loss of relational properties such as symmetry and transitivity, and the Hierarchical Model (HM), for better learning of hierarchical relations.
Figure 8: Person Ontology population pipeline
For generating sentence embeddings, we use the Universal Sentence Encoder (USE) (Cer et al. 2018), which makes use of a Deep Averaging Network (DAN), whereby input embeddings for words and bi-grams are averaged and passed to a deep neural network to produce sentence embeddings. We tried a modification of the On2Vec model, where we pass triples and sentence embeddings generated using USE to On2Vec's CSM (Figure 7b). The sentence embedding is added to the energy function of the CSM model to provide the textual context.
We evaluate the On2Vec model on the YAGO60K, YAGO15K, TACRED and UDBMS (DBPedia Person) datasets. The YAGO and UDBMS datasets contain triples, whereas TACRED contains sentences. We have mapped the relations in the TACRED and YAGO datasets as transitive, hierarchical and symmetric. The results of our experiments are shown in Table 4. We observe that incorporating the sentence embedding does not help the model.
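The idea of adding a sentence embedding to a translational energy function can be illustrated as follows. The sketch uses a plain TransE-style energy, the L2 norm of h + r - t, rather than On2Vec's CSM, which additionally applies relation-specific projections to the source and target. It also assumes the sentence vector `s` has already been projected to the same dimension as the entity embeddings. How the sentence embedding enters the energy function in the actual experiments is not specified here, so this additive form is an assumption.

```python
import math
from typing import List, Optional

Vector = List[float]


def energy(h: Vector, r: Vector, t: Vector, s: Optional[Vector] = None) -> float:
    """L2 norm of h + r (+ s) - t; lower energy means a more plausible triple.

    h, r, t: source, relation and target embeddings of equal dimension.
    s: optional sentence embedding, assumed to have the same dimension.
    """
    s = s if s is not None else [0.0] * len(h)
    return math.sqrt(sum((hi + ri + si - ti) ** 2 for hi, ri, si, ti in zip(h, r, s, t)))


if __name__ == "__main__":
    h, r, t = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
    print(energy(h, r, t))               # triple only
    print(energy(h, r, t, [0.3, 0.4]))   # triple plus sentence context
```

In this form the sentence vector simply shifts the translation, so a sentence embedding that carries little relation-specific signal can move a well-fitting triple away from its target.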
Table 4: Performance of Relation Extraction models
We have implemented a pipeline for Personal Ontology population as shown in Figure 8. This pipeline consists of existing personal data annotators and the Stanford Named Entity Recognizer, which provide rule based entity and relation extraction. We have then improved two state of the art models for entity classification and relation extraction, as described in the previous sections. Finally, we use a graph neural network for Link Prediction to infer more relationships between people mentioned in the corpus.
The inputs to our pipeline are text sentences. The outputs are person entities, their personal data as attributes, and semantically rich relations between person entities. These can be used to populate a graph data structure like the one provided by networkx (Hagberg, Swart, and S Chult 2008).
We present the results from training two graph neural networks on the Personal Data Entity (PDE) data extracted using our method and on similar DBPedia Person data, which has been annotated by Wikipedia users.
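The shape of the pipeline output can be sketched with a minimal in-memory property graph in which people are the only nodes, personal data are node attributes, and relations are labeled edges. Plain dictionaries stand in for a graph library here so that the example has no dependencies; the class `PersonGraph`, its methods, and the placeholder values are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class PersonGraph:
    """People are nodes; PDEs are node attributes; relations are labeled edges."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Dict[str, str]] = defaultdict(dict)
        self.relations: List[Tuple[str, str, str]] = []

    def add_attribute(self, person: str, entity_type: str, value: str) -> None:
        self.attributes[person][entity_type] = value

    def add_relation(self, source: str, relation: str, target: str) -> None:
        # Touch both endpoints so that they exist as nodes even without attributes.
        self.attributes[source]
        self.attributes[target]
        self.relations.append((source, relation, target))

    def related(self, person: str) -> List[Tuple[str, str]]:
        """Outgoing (relation, target) pairs for a person."""
        return [(rel, dst) for src, rel, dst in self.relations if src == person]


if __name__ == "__main__":
    g = PersonGraph()
    g.add_attribute("Person A", "/profession", "lawyer")
    g.add_relation("Person A", "spouse", "Person B")
    print(dict(g.attributes), g.related("Person A"))
```

The same structure maps directly onto `add_node` and `add_edge` calls on a networkx `MultiDiGraph`, with the entity types as node attribute keys and the relation name as an edge attribute.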
Table 5: Comparison of Link Prediction on the UDBMS and Elected Representatives datasets
Figure 9: Person Ontology Graph
As shown in Table 5, the Position Aware Graph Neural Network (You, Ying, and Leskovec 2019) performs much better than Graph Convolutional Networks (Schlichtkrull et al. 2018) on both the UDBMS and Elected Representatives datasets.
The Ontology Graph populated by us, parts of which are shown in Figure 9, can be used to improve search, natural language based question answering, and reasoning systems. Further, the graph data can be exported in RDF and PPI formats, and used as a dataset for Link Prediction experiments.
We introduced a personal data ontology with 36 Entity Types (PDET) and 9 relations (PDER), and annotated unstructured documents from Wikipedia using the rule based annotators of SystemT. We then showed improvements to state of the art models for Entity Classification and Relation Extraction. Finally, we showed the implementation of a personal data ontology graph population pipeline, incorporating these two neural models along with a Graph Neural Network for Link Prediction.
[Abhishek, Anand, and Awekar 2017] Abhishek, A.; Anand, A.; and Awekar, A. 2017. Fine-grained entity type classification by jointly learning representations and label embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers.
[Cer et al. 2018] Cer, D.; Yang, Y.; Kong, S.-y.; Hua, N.; Limtiaco, N.; John, R. S.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.
[Chen et al. 2018] Chen, M.; Tian, Y.; Chen, X.; Xue, Z.; and Zaniolo, C. 2018. On2vec: Embedding-based relation prediction for ontology population. In Proceedings of the 2018 SIAM International Conference on Data Mining, 315–323. SIAM.
[Chiticariu et al. 2010] Chiticariu, L.; Krishnamurthy, R.; Li, Y.; Raghavan, S.; Reiss, F. R.; and Vaithyanathan, S. 2010. SystemT: an algebraic approach to declarative information extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
[Choi et al. 2018] Choi, E.; Levy, O.; Choi, Y.; and Zettlemoyer, L. 2018. Ultra-fine entity typing. arXiv preprint arXiv:1807.04905.
[Dasgupta et al. 2018] Dasgupta, R.; Ganesan, B.; Kannan, A.; Reinwald, B.; and Kumar, A. 2018. Fine grained classification of personal data entities. arXiv preprint arXiv:1811.09368.
[Gillick et al. 2014] Gillick, D.; Lazic, N.; Ganchev, K.; Kirchner, J.; and Huynh, D. 2014. Context-dependent fine-grained entity type tagging. arXiv preprint arXiv:1412.1820.
[Hagberg, Swart, and S Chult 2008] Hagberg, A.; Swart, P.; and S Chult, D. 2008. Exploring network structure, dynamics, and function using networkx. Technical report, Los Alamos National Lab.(LANL), Los Alamos, NM (United States).
[Ling and Weld 2012] Ling, X., and Weld, D. S. 2012. Fine-grained entity recognition. In AAAI.
[Lu, Chen, and Zhang 2016] Lu, J.; Chen, J.; and Zhang, C. 2016. Helsinki Multi-Model Data Repository. http://udbms.cs.helsinki.fi/?dataset.
[Murty et al. 2017] Murty, S.; Verga, P.; Vilnis, L.; and McCallum, A. 2017. Finer grained entity typing with typenet. arXiv preprint arXiv:1711.05795.
[Schlichtkrull et al. 2018] Schlichtkrull, M.; Kipf, T. N.; Bloem, P.; Van Den Berg, R.; Titov, I.; and Welling, M. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, 593–607. Springer.
[Shimaoka et al. 2017] Shimaoka, S.; Stenetorp, P.; Inui, K.; and Riedel, S. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers.
[Weischedel and Brunstein 2005] Weischedel, R., and Brunstein, A. 2005. BBN pronoun coreference and entity type corpus. Linguistic Data Consortium, Philadelphia.
[Weischedel et al. 2013] Weischedel, R.; Palmer, M.; Marcus, M.; Hovy, E.; Pradhan, S.; Ramshaw, L.; Xue, N.; Taylor, A.; Kaufman, J.; Franchini, M.; et al. 2013. OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA.
[Xu and Barbosa 2018] Xu, P., and Barbosa, D. 2018. Neural fine-grained entity type classification with hierarchy-aware loss. arXiv preprint arXiv:1803.03378.
[Yogatama, Gillick, and Lazic 2015] Yogatama, D.; Gillick, D.; and Lazic, N. 2015. Embedding methods for fine grained entity type classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers).
[You, Ying, and Leskovec 2019] You, J.; Ying, R.; and Leskovec, J. 2019. Position-aware graph neural networks. arXiv preprint arXiv:1906.04817.
[Zhang et al. 2017] Zhang, Y.; Zhong, V.; Chen, D.; Angeli, G.; and Manning, C. D. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 35–45.
[Zhang, Qi, and Manning 2018] Zhang, Y.; Qi, P.; and Manning, C. D. 2018. Graph convolution over pruned dependency trees improves relation extraction. arXiv preprint arXiv:1809.10185.