With the rapid growth of e-Commerce, online product search has emerged as a popular and effective paradigm for customers to find desired products and engage in online shopping. However, there is still a big gap between the products that customers really desire to purchase and relevance of products that are suggested in response to a query from the customer. In this paper, we propose a robust way of predicting relevance scores given a search query and a product, using techniques involving machine learning, natural language processing and information retrieval. We compare conventional information retrieval models such as BM25 and Indri with deep learning models such as word2vec, sentence2vec and paragraph2vec. We share some of our insights and findings from our experiments.
E-Commerce has seen an unprecedented growth in recent years, resulting in a vast number of customers engaging in online shopping to find desired products. With such an increase in adoption, it becomes pertinent for online retailers to be able to suggest products that are very relevant to a customer’s search query. We propose a model to be able to predict search relevance scores given a search query and a product. Our dataset which is described in the following section, is taken from Kaggle, and focuses on search relevance prediction, specific to an e-Commerce organization, Home Depot. The input to our system will be pairs of product IDs and search queries; the output will be search relevance score, on a scale of 1-3, for each pair. The evaluation criteria for the performance of our model is going to be the root mean-squared error (RMSE) and the correlation coefficient of the predicted relevance scores. Since we do not have multiple products for each query, our focus is therefore on the relevance alone and not on the rankings, which is why we do not perform a ranking evaluation.
1.1 Related Works
There is plenty of published literature about techniques for free-text querying for information retrieval. Most of the popular techniques that have been proposed for search relevance have been aimed at web search. Some of the older works on information retrieval that are relevant to our goal have revolved around exploring different ways for modeling language (vector space models, query likelihood models, variations of popular systems like Lucene, Okapi BM25, etc.) 
A few popular methods (such as Indri ) have been inspired by Bayesian networks. While many of these information retrieval models are inspired by very different underlying theory, most of them in practice utilize a few common features such as using tf-idf , bag of words approaches / unigram features, treating documents and queries as vectors and using similarity metrics such as cosine similarity  and KL divergence, using simple maximum likelihood estimation approaches as well as maximum a-priori estimates using query-independent evidence, etc. These features have over the last few years become a central part of information retrieval. There have been other works on structured contextual clustering  and optimizing web search using web click-through data . Recently, several approaches involving natural language processing [7–13], machine learning [14–17], deep learning [18, 19] and numerical optimizations [20–24] have also been used in the visual and language domains.
The motivation and goal behind our paper is to make better use of machine learning for improving product-search relevance. Compared to existing works, we are much more focused on exploiting the small amount of product data that is available to us (consisting solely of the natural language product details) by trying out new representations of the data and new learning techniques to automatically learn these representations. The evaluation criteria here is also different since we are only interested in predicting absolute relevance scores (which are more important in the context of an e-commerce environment) rather than the direct ranking of results.
Recent developments in the field of deep learning have popularized the use the of word embeddings for various NLP tasks which aim to capture the contextual meaning of natural language. We aim to use such methods in predicting product relevance, see how they fare in comparison to more conventional approaches, and whether they can be combined in a meaningful way.
The training data is available to us in a relational form and is divided into the following tables:
1. Training data : This includes the product uid (primary key / number), product title, search query and the relevance score (which is also the label. This is a floating point number between 1.0 and 3.0). This table consists of 74067 instances.
2. Test data: This consists of 166693 instances.
3. Product description: This consists of the product uid(foreign key to the training data table) and the product description.
4. Product attributes: This consists of product uid (foreign key), attribute name and the attribute value.
Each product that is a part of the train and test data has exactly one corresponding description. A product may have any number of attributes (or none at all). Most products have different sets of attributes that are specific to the product being considered.
The process of data collection was quite straight-forward since the complete data for training was available to us through the kaggle website in the form of comma separated values.
We concatenated various text features from different tables to create a matrix of feature vectors and performed some basic pre-processing on the textual features: case-conversion to lowercase, stop-word removal (using a list of stop-words from NLTK) and stemming (using a porter stemmer) and tokenization (creating n-gram features out of the remaining tokens). Furthermore, on the basis of error analysis on initial experiments, we also employed a basic spelling correction algorithm based on the Levenshtein distance between new unseen tokens and dictionary terms. We also converted numerical characters and number into a canonical form by a simple regular expression mechanism.
Table 1: Some statistics of the corpus
Table 2: Parameters used for the SVM
2.3 Corpus Exploration
We have performed an in-depth analysis of the corpus and some of the key statistics computed have been summarized in Table 1
3.1.1 N-gram models and Boolean Retrieval
For our baseline experiments we used two methods. First we used a unigram model i.e. extracted unigram features for the search term, product title and product description. Secondly, we computed OR and AND scores of the search term with the product title and the product description. These operators are explained in detail below. On both these methods of extracting feature sets, we used the Sequential Minimal Optimization (SMO)  Regression algorithm for training the SVM. The SMO is an effective way for solving the quadratic programming (QP) problem that arises during the training of support vector machines and is very commonly used in information retrieval systems. We chose to work with RBF (Radial Basis Function) kernel for the SVM. The different parameter values that were chosen for the method are listed in Table 2. We have used 10-fold cross validation and generated relevance scores for the test data between 1.0 to 3.0. We chose to work with SVM because:
1. We get to choose from multiple Kernels
2. We can engineer the kernel (in our case find the optimal Gamma value) so that the model gains more knowledge about the features.
3. SVM has a regularization parameter through which we can control over-fitting.
4. SMO in particular is very successful in solving the convex optimization problem by which the SVM is defined.
Authors in  have thoroughly benchmarked the SVM classifiers.
We have tuned the parameter in the RBF kernel. We considered values , and we found that the best performing parameter value is 0.01.
3.1.2 OR, AND and NEAR operators
If we have a search term of n words () and document then we calculate the OR score
score() = MAX(score(
We calculate the count of each word in the search term in the product title, product attributes and product description individually. We take the maximum count of the words in search term. eg: If my search query is ‘angle bracket’ and the product title is ‘Simpson Strong-Tie 12-Gauge Angle’ then the OR score is 1.
score() = MIN(score(
3.2 Baseline results
We tried various representations of data as part of baseline experiments. For n-gram models (with the Markov assumption), we extracted unigram, bigram and trigram features from the data. For the learning algorithm, we used SVM to train a regression model (Model 1). SVM and n-gram models gave reasonably good performance with an RMSE in the range of 0.2880 to 0.3057. However, since we found that bigram and trigram models were quite computationally expensive to train on the entire training data, we performed most experiments on a small subset of the data using just unigrams. This is the reason we have not reported the results for bigrams/trigrams. We found that increasing the amount of training data from 10,000 to 74067 instances resulted in a marginal improvement, but at a very high computational cost.
In contrast to the n-gram features, we find that features inspired by common information retrieval models such as Boolean retrieval (OR, AND) to be much more useful in improving the model. We again trained a regression model (Model II) using just 6 of these features (described earlier) alone and got a much better RMSE of 0.2872.
Finally, we tried to took the predictions from Model 1 and use those as an additional feature for Model 2 to train another model (Model III). This actually resulted in a statistically significant improvement. We found the third model to be better than any of the other models. It gives a very good RMSE performance, takes into account the most useful n-gram features (by using the relevance predictions of Model 1) and at the same time has a very small feature space; so is quite fast to train.
3.3 More advanced information retrieval models : Okapi BM25 and Indri
We calculate the BM25 and Indri scores for each instance. Indri and BM25 are two popular ranking functions based on a probabilistic retrieval framework and a Bayesian Network respectively . The Indri and BM25 scores assign a relevance score to a <query, product> pair by computing a function of the query term frequency in a product description, the ratio of the product’s description length to the average product description length, and the number of products that the term appears in (tf-idf). The idea is that the relative length of the product description (that contains some matching keyword) acts as a prior since shorter descriptions with keyword matches are more likely to be more relevant to our query. The product frequency (no. of products that a query term is matched with) plays a simiar role. The Indri model is shown in the Figure 1.
We used two levels of smoothing in our calculation of Indri scores to avoid zero entries for the query terms: linear interpolation as the first level and Bayesian smoothing with Dirichlet priors as the second. Assuming q is the search query in consideration and are the n search terms in the query, the smoothing used is given below:
where d is the document in consideration (the product in our case), is the term-frequency of the search term in document , the collection term frequency of , is the frequency of across all the documents, c is the length of the collection of all the documents, is the parameter in the linear interpolation model and is the parameter in the Bayesian smoothing.
Figure 1: The figure shows the Bayesian network that the Indri model is based on. We create 3 language models for different kinds of product data, and get the frequency counts for each of them (representation nodes). The first layer of the query network (, x) shows the raw scores calculated for each query term. The second layer(, x) shows how these scores are combined.
3.4 Word and Sentence Embeddings
One major problem in our baseline model which was evident from some cursory error analysis was that our baseline system relied a little too heavily on exact matched terms as being the strongest features for most instances. It failed to take into account the context of words. In order to overcome this problem, we decided to use word and sentence embeddings to represent our data – namely the textual features of product title,descriptions and attributes. These embeddings capture the context of the search term and product descriptions.
Since most of the entries in our dataset have single words as search terms, we expect word vectors to be much more useful than sentence vectors. Nevertheless, we also decided to do experiments using sentence vectors which we felt could be helpful for products with lengthy descriptions and search terms. We have used the sentence to vector model proposed in . The Sentence to Vector model uses Recurrent Neural Network using Gated Recurrent Units for training their model. For our experiments we have used the word to vector model described in . It uses the skipgram model to generate word embeddings. We trained the google word to vector model on our dataset to generate vectors which capture the context of our dataset.
4.1 Evaluation Metrics
In our experiments, we have used two different metrics to evaluate our performance: the root mean-squared error (RMSE) and the correlation coefficient of the predicted and gold standard relevance scores.
4.2 Baseline Experiments
We have performed experiments with varying amount of training samples. For each experiment, we use SMO model to predict scores and calculate the correlation and RMSE values using 10-fold cross validation. These are listed below in Table 3:
The entire dataset of 74067 samples has 54042 unigram features.
Table 3: Different Features Extracted: Feature Space Design
Figure 2: Flowcharts for the models Used
4.3 Sentence to Vector Model
We first generate the embedding of the product description and the search term using the skip-thought vector model . We then train a logistic regression with sigmoid activation unit. The input feature for the model is the dot product of the embeddings of the product description and the search term. The Y labels for the model for training phase were the relevance score. During the testing phase, we get the features using the dot product and then we generate the Y label using the model. This is done by taking the dot product of the prediction probability of the class (In our case, we have 3 classes since the score is between 1-3) and the no of classes.
4.4 Word to vector Model
We have trained the Google word to vector model  on our dataset. We generate the word embeddings for each word in the product description. We then take an average of the word vectors for it to capture the information of the product in one vector. We calculate the word embedding of each term of the search term and then average them as well. Then, we calculate the dot product between the averaged vector of product description and the search term to yield a score between -1 and 1. This is then made to fit in the range of 1 to 3, and is outputted as the relevance score. Figure 2 shows the architecture and flow for our experiments.
Figure 3: Comparison of Correlation Coefficient and RMSE
4.5 Paragraph to vector
We have used the gensim library2, which takes a document of text as input and generates its embedded vector. This was motivated from the model described in . The product descriptions and the search terms were fed to it and in the process the embedded vectors were obtained. Once again we get the cosine similarity between this vector and the search term vector to predict the relevance score.
Figure 3a shows the comparison of the correlation coefficients between the Unigram type of features and the OR and AND features. As we can see, the correlation coefficient is much higher (better) using only the 4 features obtained from the OR and AND operator for all of the experiments. Figure 3b shows the comparison of the RMSE value between the Unigram type of features and the OR and AND features. Here, too we see that the RMSE value is lower (better) for the 4 features extracted from the OR and AND operator for all of the experiments. Figure 3c shows the comparison of correlation coefficients across the different models considered, where we see that there is a slight improvement in performance of Word2Vec over our baselines. The result values are shown in the Table 4. From the results, we see that the features calculated using BM25 and Indri models are much better in representing our data than the methods used in the baseline.
In Table 6 we have shown the RMSE and correlation coefficient value results for the deep network models and BM25 and Indri model. The table also shows the dimension of the vector / the number of features used by each model. We have calculated the Pearson correlation coefficient and RMSE value as described in [14,15]. During the training phase of the word to vector model we had a vocabulary size of 226705.
Table 4: Baseline results obtained for various features
Table 5: Error Analysis - some typical examples of correctly and incorrectly classified examples.
Furthermore, we tried to combine the results of our best models using two different approaches (i.e. BM25/Indri/scores and the word2vec model similarity score) to train another SVM model. However, this did not result in any statistically significant improvement.
In Table 5, we have shown how different models fail or succeed in different cases. The example 1. shows that when the search term is just a word like ‘ptfe’ which does not mean anything by itself the sentence to vector model will not be able to capture the context of this word at all. Hence, the predicted score is much lower than the actual relevance score. Whereas in example 2., the search
Table 6: Results obtained for various models
term is ‘ 4 foot christmas tree colored lights’. Hence, the model already has some context about lawn mowers and covers. Hence, it captures the context and predicts a score very close to the actual relevance score. In example 3. we have applied word to vec model on example 1. and this model works better on this example because during training phase of the model, it captured information of ‘ptfe’. In example 4. we get baseline score for the example 3. The baseline performs worse than the word to vec model here because it just count the number of times ‘ptfe’ occurs in the product description. Hence, the count is OR and AND score is 2. Hence it predicts a value lower than the actual relevance score. In example 5. we have applied word to vec model on example 2. Here sentence to vec model performs better because the query term is long and in the form of a sentence and the sentence embedding captures the context of the query.
Through the baselines experiments we observed that simply using an SVM with features from different information retrieval models helps improve on the relevance scores predicted by some of the models alone. To address the problem of the context of words being ignored by the baseline model, we tried various methods that use deep networks. We found that the the word2vec model resulted in a significant improvement with an RMSE of 0.2482. In most cases, it did a good job in capturing the relevance of ambiguous queries correctly. On the other hand, we found that the sentence2vec / skip-thought vectors actually caused a slight decline in performance. It appears that the most likely reason for this is that many of the descriptions do not have very well-formed English sentences and the text is in more of a bullet point format consisting of many unrelated product characteristics written together. So this explains why using sentence vectors for such relevance tasks may not be a good idea. In contrast, word vectors still do a good job here.
However, we could not achieve any significant improvement when we tried to train a single SVM model using features from both our best models (i.e. BM25/Indri scores as well as similarity scores from the word2vec model). It is hard to explain this result but it’s most probable that there are some spurious correlations among the features learned by these two approaches which confuses the SVM.
Furthermore, since these models are computationally expensive to train, we compared the performance of our deep learning models trained on different sizes of training sets in order to get an estimate of how many training example might be sufficient to get a good enough performance. We found that the performance improves considerably while increasing the training data and our results indicate that we can expect a much more drastic improvement if we are able to get hold of more labeled data in order to train a better model.
 Joaquín Pérez-Iglesias, José R Pérez-Agüera, Víctor Fresno, and Yuval Z Feinstein. Integrating the probabilistic models bm25/bm25f into lucene. arXiv preprint arXiv:0911.5046, 2009.
 Trevor Strohman, Donald Metzler, Howard Turtle, and W Bruce Croft. Indri: A language model-based search engine for complex queries. In Proceedings of the International Conference on Intelligent Analysis, volume 2, pages 2–6. Citeseer, 2005.
 Juan Ramos et al. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, volume 242, pages 133–142. Piscataway, NJ, 2003.
 Bin Tan, Xuehua Shen, and ChengXiang Zhai. Mining long-term search history to improve search accuracy. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 718–723. ACM, 2006.
 Bruno Roustant, Pierre-Yves Chevalier, and Yves Mahe. Structured contextual clustering method and system in a federated search engine, September 13 2005. US Patent 6,944,612.
 Gui-Rong Xue, Hua-Jun Zeng, Zheng Chen, Yong Yu, Wei-Ying Ma, WenSi Xi, and WeiGuo Fan. Optimizing web search using web click-through data. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 118–126. ACM, 2004.
 Rahul Radhakrishnan Iyer, Ronghuo Zheng, Yuezhang Li, and Katia Sycara. Event outcome prediction using sentiment analysis and crowd wisdom in microblog feeds. arXiv preprint arXiv:1912.05066, 2019.
 Rahul Radhakrishnan Iyer and Katia Sycara. An unsupervised domain-independent framework for automated detection of persuasion tactics in text. arXiv preprint arXiv:1912.06745, 2019.
 Rahul Radhakrishnan Iyer, Jing Chen, Haonan Sun, and Keyang Xu. A heterogeneous graphical model to understand user-level sentiments in social media. arXiv preprint arXiv:1912.07911, 2019.
 Rahul R Iyer, Katia P Sycara, and Yuezhang Li. Detecting type of persuasion: Is there structure in persuasion tactics? In CMNA@ ICAIL, pages 54–64, 2017.
 Rahul Radhakrishnan Iyer and Carolyn Penstein Rose. A machine learning framework for authorship identification from texts. arXiv preprint arXiv:1912.10204, 2019.
 Rahul Iyer, Rohit Mandrekar, Atishay Aggarwal, Pranav Chaphekar, and Gresha Bhatia. Recomob: Opinion mining for product enhancement. In 2017 International Conference on Computer Communication and Informatics (ICCCI), pages 1–5. IEEE, 2017.
 Rahul Radhakrishnan Iyer, Yulong Pei, and Katia Sycara. Simultaneous identification of tweet purpose and position. arXiv preprint arXiv:2001.00051, 2019.
 Yuezhang Li, Ronghuo Zheng, Tian Tian, Zhiting Hu, Rahul Iyer, and Katia Sycara. Joint embedding of hierarchical categories and entities for concept categorization and dataless classification. arXiv preprint arXiv:1607.07956, 2016.
 Rahul Radhakrishnan Iyer, Sanjeel Parekh, Vikas Mohandoss, Anush Ramsurat, Bhiksha Raj, and Rita Singh. Content-based video indexing and retrieval using corr-lda. arXiv preprint arXiv:1602.08581, 2016.
 Michael Honke, Rahul Iyer, and Dishant Mittal. Photorealistic style transfer for videos. arXiv preprint arXiv:1807.00273, 2018.
 Rahul Radhakrishnan Iyer, Manish Sharma, and Vijaya Saradhi. A correspondence analysis framework for author-conference recommendations. arXiv preprint arXiv:2001.02669, 2020.
 Rahul Iyer, Yuezhang Li, Huao Li, Michael Lewis, Ramitha Sundar, and Katia Sycara. Transparency and explanation in deep reinforcement learning neural networks. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 144–150. ACM, 2018.
 Yuezhang Li, Katia Sycara, and Rahul Iyer. Object-sensitive deep reinforcement learning. arXiv preprint arXiv:1809.06064, 2018.
 Rahul Radhakrishnan, Abhinoy Kumar Singh, Shovan Bhaumik, and Nutan Kumar Tomar. Multiple sparse-grid gauss–hermite filtering. Applied Mathematical Modelling, 40(7-8):4441– 4450, 2016.
 Rahul Iyer and Ahmed H Tewfik. Optimal ordering of observations for fast sequential detection. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pages 126–130. IEEE, 2012.
 Hai Qian, Shengwen Yang, Rahul Iyer, Xixuan Feng, Mark Wellons, and Caleb Welton. Parallel time series modeling-a case study of in-database big data analytics. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 417–428. Springer, 2014.
 Hari Prabhat Gupta, T Venkatesh, Seela Veerabhadreswara Rao, Tanima Dutta, and Rahul Radhakrishnan Iyer. Analysis of coverage under border effects in three-dimensional mobile sensor networks. IEEE Transactions on Mobile Computing, 16(9):2436–2449, 2016.
 Rahul Radhakrishnan, Ajay Yadav, Paresh Date, and Shovan Bhaumik. A new method for generating sigma points and weights for nonlinear filtering. IEEE Control Systems Letters, 2(3): 519–524, 2018.
 John Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. 1998.
 Tony Van Gestel, Johan AK Suykens, Bart Baesens, Stijn Viaene, Jan Vanthienen, Guido Dedene, Bart De Moor, and Joos Vandewalle. Benchmarking least squares support vector machine classifiers. Machine learning, 54(1):5–32, 2004.
 Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. Okapi at trec-3. Nist Special Publication Sp, 109:109, 1995.
 Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302, 2015.
 Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196, 2014.
 Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.