Recruiters in information technology domain have met the problem finding appropriate candidates by their skills. In the resume, the candidate may describe one skill in different ways or skills could be replaced by others. The recruiters may not have the domain knowledge to know if one’s skills are fit or not, so they can only find ones with matched skills.
In order to cope with the problem, one should try to find the relatedness of skills. There are some approaches: building a dictionary manually, ontology approach, natural language processing methods, etc. In this study, we apply a word embedding method Word2Vec, using skills from online job post descriptions. We treat skills as terms, job posts as documents and find the relatedness of these skills.
To find relatedness of skills, Simon Hughes [4] from Dice proposed an approach using Latent Semantic Analysis with an assumption that skills are related to skills which occur in the same context, and here contexts are job posts. This approach will build a term-document matrix and use Singular Value Decomposition to reduce the dimensionality. The cons of this approach is that when we have new data coming, we can not update the old term-document matrix, this leads to difficulties in maintaining the model, as the change of trend in this domain is high.
Google’s Data Scientists also face the same problems in Cloud Jobs API [7]. Their solution is to build a skill ontology defining around 50,000 skills in 30 job domain with relationships such as is a, related to, etc. This approach can represent complicate relationships between skills and jobs, but building such an ontology costs so much time and effort.
To overcome the problem of relevant term, [3] present a new, effective log-based approach to relevant term extraction and term suggestion.
The goal of [2] is to develop an automated system that discovers additional names for an entity given just one of its names, using Latent semantic analysis (LSA) [1]. In the example of authors, the city in India referred to as Bombay in some documents may be referred to as Mumbai in others because its name officially changed from the former to the latter in 1995.
[5] is the introduction of an ontology-based skills management system at Swiss Life and the lessons learned from the utilisation of the methodology, present a methodology for application-driven development of ontologies that is instantiated by a case study.
Word2Vec is a group of models proposed by Mikolov et al in 2013 [6]. It consists of 2 models: continuous bag-of-words and continuous Skip-gram, both are shallow neural networks that try to learn distributed representations of words with the target is to maximize the accuracy while minimizing the computational complexity. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. On the other hand, Skip-gram model try to predict surrounding context words based on the current word. In this work, we focus on Skipgram model as it is known to be better with infrequent words and it also give slightly better result in our experiment.
The model consists of three layers: input layer, one hidden layer and output layer. The input layer take a word encoded using 1-of-V encoding (also known as one-hot encoding), where V is size of the dictionary. The word is then fed through the linear hidden layer to the output layer, which is a softmax classifier. The output layer will try to predict words within window size before and after current word. Using stochastic gradient descent and back propagation, we train the model until it converges.
This model takes vector dimensionality and window size as parameters. The author found that increasing the window size improves the quality of the word vector, and yet it increases the computational complexity.
A. Data collecting and processing
Choosing a universal data set for the model is extremely important, the data should be large enough and should have balanced distributions over words (i.e. skills).
There are two dataset we need to concern, one (1) is the standard skills dictionary for the parser and another (2) is skills for training model; follow the figure 1.
Career websites (1) Standard skillscollect cleaning
Fig. 1. Pipeline of data collecting and processing
First, we collected and prepared a large dictionary of skills. With this dictionary, we can extract a set of skills from raw job descriptions. Skillss need to be cleaned into unique skills because there are many way to present a skill in job description (i.e. OOP or Object-oriented programming). Figure 2 briefly depicts the concept of the cleaning process.
Fig. 2. Pipeline cleaning skills
After that, we had the dictionary of skills ready for parsing. We collected a huge number of job descriptions from Dice.com - one of the most popular career website about Tech jobs in USA. From these job descriptions, we extract skills for each one by using our skills dictionary (1). Now, the dataset is presented by a list of collections of skills based on job descriptions. After crawling, we got a total of 5GB with more than 1,400,000 job descriptions. From these data, we extracted skills and performed as a list of skills in the same context, the context here includes skills in the same job description. The dataset is published at https://github.com/duyetdev/skill2vec-dataset
The data structure is shown in table I.
TABLE I DATA STRUCTURE
B. Skill2vec architecture
In this paper, for training the dataset, we used a neural network inspired by Word2Vec model as mentioned above. Here we treated our skills as words in Word2Vec model. In this study, with the documents contain only the skills, we chose the maximum window size, implied that every skills in the same job description are related to each other. For the vector dimensions, after some point, adding more dimensions provides diminishing improvements, so we chose this parameter empirically. To honour the work of Word2Vec model as it holds a big part in our study, we name our model Skill2Vec. Figure 3 briefly describes our Skill2Vec model.
Fig. 3. Skill2vec architecture
To evaluate our method, we have an expert team assesses the result following these steps:
1) Pick 200 skills randomly from our dictionary. 2) Our system will return top 5 ”nearest” skills for each. 3) The expert team will check if these top 5 ”nearest” skills are relevant or not. The experiment showed that 78% of skills returned by our model is truly relevant to the input skill. We present the experimental results in table II
TABLE II QUERY TOP 5 RELEVANT SKILLS
In this paper, we developed a relationship network between skills in recruitment domain by using the neural net inspired by Word2vec model. We observed that it is possible to train high quality word vectors using very simple model architectures due to lower cost of computation. Moreover, it is possible to compute very accurate high dimensional word vectors from a much larger dataset. Using Skip-gram architecture and an advanced technique for preprocessing data, the result seems to be impressive. The result of our work can contribute to building the matching system between candidates and job post. In the other hand, candidates can find the gap between the job post requirements and their ability, so they can find the suitable trainings.
A direction we can follow in the future: adding domain in training model, for example: Between Python, Java, and R, in Data Science domain, Python and R are more relevant than Java, however in Back End domain, Python and Java are more relevant than R.
[1] Michael W Berry and Ricardo D Fierro. “Low-rank orthogonal decompositions for information retrieval applications”. In: Numerical linear algebra with applications 3.4 (1996), pp. 301–327.
[2] Vinay Bhat et al. “Finding aliases on the web using latent semantic analysis”. In: Data & Knowledge Engineering 49.2 (2004), pp. 129–143.
[3] Chien-Kang Huang, Lee-Feng Chien, and Yen-Jen Oyang. “Relevant term suggestion in interactive web search based on contextual information in query session logs”. In: Journal of the Association for Information Science and Technology 54.7 (2003), pp. 638–649.
[4] Simon Hughes. How We Data-Mine Related Tech Skills. 2015. URL: http : / / insights . dice . com / 2015/03/16/how-we-data-mine-related- tech-skills/ (visited on 09/12/2017).
[5] Thorsten Lau and York Sure. “Introducing ontology-based skills management at a large insurance company”. In: Proceedings of the Modellierung. 2002, pp. 123–134.
[6] Tomas Mikolov et al. “Efficient Estimation of Word Representations in Vector Space”. In: CoRR abs/1301.3781 (2013). URL: http://dblp.uni- trier.de/db/journals/corr/corr1301. html#abs-1301-3781.
[7] Christian Posse. Cloud Jobs API: machine learning goes to work on job search and discovery. 2016. URL: https: // cloud. google. com/ blog/big - data/2016/11/cloud-jobs-api-machine- learning-goes-to-work-on-job-search- and-discovery (visited on 03/03/2018).