
PRODUCT DATA SEARCH: WAYS OF IMPROVEMENT

Authors:
City:
Novocherkassk
University:
Date:
15 April 2018

In today’s world the majority of commercial companies have large amounts of data, which are often unstructured and inconsistent. Because of this, businesses have shown a growing interest in information retrieval instruments in recent years, considering them a tool to improve their search workflow: in particular, to make it more effective for customers and, therefore, raise sales. The ERIKS company, a service provider offering a wide range of engineering components, commissioned research on how to improve its product data search. In this research we investigate the influence of different preprocessing and indexing techniques on the quality of the search. The paper discusses the impact of multi-field indexing, word decompounding and adding n-grams, reports the progress achieved in improving the search and demonstrates the statistical significance of the results.

1    Dataset and methods

1.1    Problem and instruments

The problem is posed by the ERIKS company, which needs an effective search engine for its extensive product descriptions. In this paper we investigate possible approaches to improving product data search, such as smart preprocessing, multi-field indexing, word decompounding and adding n-grams. For storing and indexing the data we use Elasticsearch, a distributed, RESTful search and analytics engine based on Lucene. It provides scalable, near real-time search and has an official Python client.
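The setup can be reproduced with the official Python client. The snippet below is a minimal sketch rather than the paper's actual code; it assumes a locally running Elasticsearch 8.x instance, and the index name, document id and field name are illustrative.

```python
# Minimal sketch: indexing and searching one catalogue description through the
# official Python client (elasticsearch 8.x assumed; names are illustrative).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # local single-node instance (assumption)

# Index one catalogue description under a hypothetical product id.
es.index(index="products", id="12345",
         document={"txt": "kogellager 6204 2RS 20x47x14 mm"})
es.indices.refresh(index="products")

# Near real-time search over the raw description field.
hits = es.search(index="products", query={"match": {"txt": "kogellager 6204"}})
print(hits["hits"]["total"])
```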

1.2    Data

The data is provided by the ERIKS company, which has a large catalogue of engineering parts. The materials include descriptions of the products for sale, example customer queries with matched products, and some example matches that give an idea of the problems which can appear during the research. All the data is shared under a non-disclosure agreement.

The main dataset used for creating the search index consists of 1,185,633 catalogue product descriptions with product identifiers. For model evaluation the company provides 500 customer queries matched with product ids from the catalogue.

The preliminary data analysis shows that there are multiple problems in mapping customer descriptions to catalogue data, as neither of them is standardised:

1.    Case and word order variability

2.    Inconsistent measurement format

3.    Erratic punctuation

4.    Abbreviations and word-part replacements

The described issues suggest that both catalogue descriptions and queries should be carefully preprocessed in order to teach the search engine to match them.
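To illustrate the kind of normalisation these issues call for, the following is a hypothetical Python sketch; the rules and the example string are illustrative assumptions, not the exact preprocessing used in the paper.

```python
import re

def normalize(text: str) -> str:
    """Illustrative normalisation of the issues listed above (hypothetical rules)."""
    text = text.lower()                                        # 1. case variability
    text = re.sub(r"(\d)\s*[x*]\s*(\d)", r"\1x\2", text)       # 2. unify dimension separators
    text = re.sub(r"(\d)\s*(mm|kg|gr|mt)\b", r"\1 \2", text)   # 2. unify number-unit spacing
    text = re.sub(r"[;,/]+", " ", text)                        # 3. erratic punctuation
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Kogellager 6204-2RS, 20 X 47x14MM"))
# -> "kogellager 6204-2rs 20x47x14 mm"
```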

1.3    Evaluation measures

As the primary metric to evaluate the model we choose Mean Reciprocal Rank (MRR) [1, p. 1703], as it is well-suited for tasks with only one relevant result per query. To the best of our knowledge this is the case in the ERIKS product search: a customer looks for a particular item to order.

The Reciprocal Rank is the reciprocal of the rank at which the first relevant document is retrieved. The Mean Reciprocal Rank is the average of the reciprocal ranks over a sample of queries Q:

MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i},

where |Q| is the size of the sample and rank_i is the rank of the relevant document for the i-th query.

Another interesting metric is Mean Success At 10 (MS@10), which reflects the probability of finding the relevant product among the first ten results, since it is important for customers to see relevant descriptions at the top [2, p. 148]. People expect to find the products they search for on the first page of results, and this metric shows whether that expectation is fulfilled.
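As an illustration, both metrics can be computed with a few lines of Python. The function below is a sketch with hypothetical input structures (a mapping from each query to the ranked list of returned product ids, and a mapping to the single relevant id); it is not the evaluation code used in the study.

```python
def evaluate(results: dict[str, list[str]], relevant: dict[str, str]) -> tuple[float, float]:
    """Compute (MRR, MS@10) for queries with exactly one relevant product each."""
    rr, s10 = [], []
    for q, ranked in results.items():
        try:
            rank = ranked.index(relevant[q]) + 1   # 1-based rank of the relevant product
        except ValueError:
            rank = None                            # relevant product not retrieved at all
        rr.append(1.0 / rank if rank else 0.0)     # reciprocal rank (0 if missing)
        s10.append(1.0 if rank and rank <= 10 else 0.0)  # success within the top 10
    n = len(results)
    return sum(rr) / n, sum(s10) / n

mrr, ms10 = evaluate({"q1": ["p7", "p3", "p1"]}, {"q1": "p3"})
print(mrr, ms10)  # 0.5 1.0
```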

1.4    Proposed system

To achieve the best performance within limited time, we focus on two main steps: proper model selection and multi-field indexing with separate preprocessing per field. The following paragraphs discuss these aspects in detail.

Model selection

In the proposed system we use the Jelinek-Mercer language model, one of the realizations of the query likelihood model, which uses linear interpolation to smooth probabilities [2, p. 226]. To estimate the likelihood of a token given a document, it mixes document-specific statistics with information from the entire collection. In this way the algorithm attempts to capture important patterns in the text while leaving out noise. This similarity model has a single parameter λ, which is usually chosen empirically and is set to 0.3 here.
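In Elasticsearch this similarity can be selected per field through the index settings; Elasticsearch's LMJelinekMercer similarity exposes the smoothing weight as the lambda parameter. The snippet below is a sketch assuming the 8.x Python client, with illustrative index and field names rather than the paper's exact configuration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # local instance (assumption)

es.indices.create(
    index="products_lm",                       # illustrative index name
    settings={
        "similarity": {
            "lm_similarity": {
                "type": "LMJelinekMercer",     # query likelihood model with linear smoothing
                "lambda": 0.3,                 # mixing weight; 0.3 as chosen in the paper
            }
        }
    },
    mappings={
        "properties": {
            "txt": {"type": "text", "similarity": "lm_similarity"}
        }
    },
)
```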

Multi-field indexing

To gain better relevance, the proposed model uses a five-field index with multiple analyzers that differ per field. A description is included in the results list if it matches the broad-matching main field. If it also matches the signal fields, it receives an extra score and is pushed up the results list. Let us briefly discuss each field with its corresponding analyzer; a sketch of such an index is given after the list.

• Txt. This field stores the raw description tokenized with a standard Dutch analyzer and a lowercase filter. The question of tokenization is addressed at length in [2, p. 21], as is the question of case-folding [2, p. 28]; these techniques have proved very useful in modern IR systems. The Dutch analyzer is chosen because Dutch predominates in the catalogue descriptions.

• Words. This field contains all the words and product codes with punctuation removed. To achieve this, we first apply a whitespace tokenizer, which splits text into tokens whenever it encounters a whitespace character. We then lowercase the obtained terms and apply a filter which replaces all possible dimension words (namely mm, mt, kg, gr, x) with a hyphen. This is done to prepare the tokens for the next filter, which splits them into subwords and performs optional transformations on subword groups. In the resulting token set we preserve both the words and the subwords to increase relevance.

• Numbers. This field contains all the numbers, obtained with a pattern tokenizer using the regular expression "\d+".

• N-grams. This field comprises all possible n-grams of length 3 to 8 whose start is anchored to the beginning of the token. After applying a standard tokenizer, which provides grammar-based tokenization, and a lowercase filter, we generate the n-grams with an edge n-gram filter.

• Decompounded words. Word segmentation can be a very useful technique for languages with many compound words [2, p. 24], and Dutch is among such languages. To build this field we use SECOS, a compound splitter that uses information from a distributional thesaurus [3] and has a pretrained model for Dutch. We collect all the words from the catalogue descriptions using a simple regular expression "\b[^\d\W]+\b" and then, with the help of SECOS, create a dictionary of subwords corresponding to the dataset. Finally, we use this dictionary as a filter to split words into their parts.
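Put together, the five fields above can be expressed as Elasticsearch analyzers and mappings. The sketch below assumes the 8.x Python client; all names are illustrative, the decompounding word list is a tiny placeholder for the SECOS-derived dictionary, and the exact filter choices (a word_delimiter filter for subword splitting, an edge_ngram filter for the n-gram field) are assumptions rather than the paper's verbatim configuration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # local instance (assumption)

settings = {
    "analysis": {
        "tokenizer": {
            "numbers_tokenizer": {             # extract every run of digits as a token
                "type": "pattern", "pattern": "\\d+", "group": 0,
            }
        },
        "filter": {
            "dims_to_hyphen": {                # replace dimension words before subword splitting
                "type": "pattern_replace", "pattern": "mm|mt|kg|gr|x", "replacement": "-",
            },
            "split_subwords": {                # split tokens into subwords, keep the originals
                "type": "word_delimiter", "preserve_original": True,
            },
            "edge_3_8": {                      # n-grams of length 3..8 anchored to the token start
                "type": "edge_ngram", "min_gram": 3, "max_gram": 8,
            },
            "decompound_nl": {                 # dictionary built offline with SECOS (placeholder entries)
                "type": "dictionary_decompounder",
                "word_list": ["kogel", "lager", "moer", "ring"],
            },
        },
        "analyzer": {
            "words_analyzer": {"type": "custom", "tokenizer": "whitespace",
                               "filter": ["lowercase", "dims_to_hyphen", "split_subwords"]},
            "numbers_analyzer": {"type": "custom", "tokenizer": "numbers_tokenizer"},
            "ngram_analyzer": {"type": "custom", "tokenizer": "standard",
                               "filter": ["lowercase", "edge_3_8"]},
            "decompound_analyzer": {"type": "custom", "tokenizer": "standard",
                                    "filter": ["lowercase", "decompound_nl"]},
        },
    }
}

mappings = {
    "properties": {
        "txt": {"type": "text", "analyzer": "dutch"},                         # broad-matching main field
        "words": {"type": "text", "analyzer": "words_analyzer"},              # signal field
        "numbers": {"type": "text", "analyzer": "numbers_analyzer"},          # signal field
        "ngrams": {"type": "text", "analyzer": "ngram_analyzer"},             # signal field
        "decompounded": {"type": "text", "analyzer": "decompound_analyzer"},  # signal field
    }
}

es.indices.create(index="products_multi", settings=settings, mappings=mappings)

# A matching query in the same spirit: the main field must match, while the
# signal fields only boost the score of documents that also match them.
query = {
    "bool": {
        "must": [{"match": {"txt": "kogellager 6204 2rs"}}],
        "should": [
            {"match": {"words": "kogellager 6204 2rs"}},
            {"match": {"numbers": "6204"}},
            {"match": {"ngrams": "kogellager 6204 2rs"}},
            {"match": {"decompounded": "kogellager 6204 2rs"}},
        ],
    }
}
es.search(index="products_multi", query=query)
```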

2    Results

As a baseline we take a system with the default BM25 similarity model and a two-field index, which contains the raw text and a "numbers" field preprocessed in the same manner as described in the previous section. This system is quite simple, as it has only two fields with almost no preprocessing or feature extraction involved.
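For comparison, such a baseline index can be created as follows; this is a minimal sketch with illustrative names, assuming the same 8.x Python client and relying on Elasticsearch's default BM25 similarity.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # local instance (assumption)

es.indices.create(
    index="products_baseline",                # illustrative index name
    settings={
        "analysis": {
            "tokenizer": {
                "numbers_tokenizer": {"type": "pattern", "pattern": "\\d+", "group": 0}
            },
            "analyzer": {
                "numbers_analyzer": {"type": "custom", "tokenizer": "numbers_tokenizer"}
            },
        }
    },
    mappings={
        "properties": {
            "txt": {"type": "text"},          # default standard analyzer, default BM25 similarity
            "numbers": {"type": "text", "analyzer": "numbers_analyzer"},
        }
    },
)
```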

The evaluation results for both the baseline and the proposed system are shown in the table below:

Model            MS@10    MRR
Baseline         0.376    0.3441
Proposed model   0.6      0.4267

One can observe that the proposed model improves MRR by about 8 percentage points (from 0.344 to 0.427) and raises the probability of finding the relevant catalogue document among the first 10 results from 0.376 to 0.6.

3    Conclusion

It has been observed during the experiments that there are many ways to improve the search engine beyond the level reached by the baseline model. Multi-field indexing proved to be one of the most powerful tools for raising search performance, as it helps the search engine understand which parts of the query are the most important. During the research we introduced the index fields that seemed the most promising: word extraction, word decompounding and edge n-grams. Word extraction gave the largest gain in scores; however, adding edge n-grams and word decompounding made an important contribution as well and, more importantly, they can be further extended and used in future investigations and experiments.

It should be noted that an important drawback of adding new fields to the index is the rapidly increasing search time, which can be critical for many systems. Customers may lose interest in buying products from a company after waiting too long for a response from its search engine. Although discussing possible ways to reduce search time is beyond the scope of this paper, the problem should be taken into account when deploying the system.

 

References

 

[1] Ling Liu and M. Tamer Özsu, editors. Encyclopedia of Database Systems. Springer, Berlin, Heidelberg, Germany, 2009.

[2] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

[3] Martin Riedl and Chris Biemann. Unsupervised compound splitting with distributional semantics rivals supervised methods. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 617–622, San Diego, CA, USA, 2016.