! This post is also available in the following language. Japanese

【Internship Report】 Implementation of the IU Web search completion function

Hello. My name is Ryoto Suzuki. I am a first-year master’s student at the Graduate School of the University of Tokyo.

I participated in the LINE internship of the engineer/working-type course for six weeks from August 23rd.

During this internship, I belonged to the IU Dev Team of the Data Platform Department and was engaged in the improvement of the search function of the data catalog “IU Web,” which was used in LINE. In this blog, I will introduce the details.

Background

About IU Web

Reference: “Democratization of Data Utilization” across Services Realized by LINE

To use data company-wide, LINE uses the data platform called “IU” (Information Universe), in which the data of almost all services of LINE are accumulated. As of now, the number of servers is 2,000 or more and the storage utilization is about 300 PB.

“IU Web” provides the following as a data catalog to manage the data of IU:

  • Clarification of what kind of data exists
  • Who has the permission for data
  • Meta information of data
  • Data search function

In-house users use IU Web to access the target data from the tremendous amount of data in IU.

Issue

Many in-house users use the search functions of IU Web to access the target data. The problem here is that it is difficult to reach the target data without entering the exact search keyword because many pieces of data are stored in IU (for example, there are about 40,000 pieces of data of the TABLE category searched frequently) and many of them have similar names. This hinders IU’s purpose of “company-wide data utilization.”

Therefore, I decided to solve this problem by implementing an auto-complete function which presents a list of candidates if part of a data name is entered in the search column even when the user does not remember the exact data name.

For data search, the search engine Elasticsearch is used. The metadata of each piece of data is stored in the JSON format in Elasticsearch, and you can specify search conditions and the priorities of search results by the settings configured at the time of storage and the settings of search queries. In this internship, my goal was to realize an auto-complete function using Elasticsearch.

What is desirable completion

When developing the auto-complete function, the first thing I considered was what search results should be returned to what entries. For example, what patterns of entries are expected when a user is looking for the table “app_log_beta.” I consulted with my mentor, Mr. Okuyama, and assumed the following entry patterns:

  • ap (the head of the first token)
  • lo (the head of the second token)
  • app_log (the first and second tokens connected with an underscore)
  • log_beta (the second and third tokens connected with an underscore)
  • app log (the first and second tokens connected with a space)
  • log beta (the second and third tokens connected with a space)
  • ap lo (two tokens partially entered)
  • eta (one token partially entered)

On the other hand, there was a possible problem that supporting many entry patterns might result in an inclusion of irrelevant data (noise) in search results. Therefore, we decided to prioritize search results based on the following indicators to show data intended by the user at the top of search results:

  • Whether the search result is hit in a prefix search (in addition to a partial match search)
  • Relevance to the user
  • Disk utilization

Research and comparison of methods to implement an auto-complete function

When the style of completion to be achieved was decided, I researched the methods to realize it by Elasticsearch.

As a result, the following four methods were found:

  • Completion Suggester
  • Edge NGram Tokenizer
  • Edge NGram Tokenizer + Shingle Token Filter
  • Wildcard Query

Each of them had pros and cons, so I set the following evaluation standards and followed them for selection.

  • Search quality
  • Speed 
    • Auto-complete had already been implemented in the user search page of IU and its required time was about 200 to 300ms, to which I set the guideline
  • Implementation cost
  • The complexity of settings and behaviors
    • The settings and behaviors of Elasticsearch shouldn’t be so complex, because if they are complex, the subsequent tests, debugging, tuning, etc. will be troublesome

Implementation of a test tool

To try and compare these methods, I implemented a Python tool to try auto-completing by directly executing queries of Elasticsearch from the command line and calling the API. The tool can execute a search query every time a character is entered and confirm the following:

  • Execution time
  • Number of hits
  • List of search results.

This tool was very useful in the subsequent development process such as the measurement of the execution time, debugging, etc. after implementation in addition to the purpose of comparison.

Research of Each Method

I researched each method and organized the pros and cons.

Completion Suggester

Overview

Completion Suggester | Elasticsearch Guide [6.8] | Elastic

Elasticsearch has a function for completion called “Completion Suggester,” which enables:

  • Prefix search
  • Ambiguous search (Suggest words within the specified edit range)

This seems to be optimized for auto-complete and very fast.

Evaluation
  • Speed
    • Search is fast thanks to optimization for auto-complete of prefix searches
  • Search quality
    • Few noise
    • The partial match search is unavailable
    • It is necessary to specify the priority when storing the index of a document, so flexible ranking at the time of search is difficult (maybe impossible)
  • Implementation cost
    • There seems to be no obstacle in the implementation
  • The complexity of settings and behaviors
    • Simple because this is dedicated to the prefix search

Edge NGram Tokenizer

Overview

Edge NGram Tokenizer | Elasticsearch Guide [6.8] | Elastic

This method divides the target data of search into words (tokens) and stores an n-gram starting with the first characters of each word.

For example, if this method is applied to “2 Quick Foxes,”

["2", "Q", "Qu", "Qui", "Quic", "Quick", "F", "Fo", "Fox", "Foxe", "Foxes"]

is stored and it is hit such as when “Fox” is entered as the search string.

On the other hand, there is a problem that it is hit even if the words are entered in a different order from the original text like “Fox Qui.” This should be avoided as much as possible because responding to unexpected entry patterns may increase noise. In the following execution example, “app_log” and “api_log” are hit when “log ap” is entered.

Evaluation
  • Speed
    • Probably fast because this is a normal match query
  • Search quality
    • If “event_log” is entered, “log_event” is also suggested
    • The partial match search is available
    • If multiple words are entered, the word order isn’t retained
  • Implementation cost
    • There seems to be no obstacle in the implementation
  • The complexity of settings and behaviors
    • The behavior is not so complex: dividing a text into words with spaces and underscores and creating “n-gram fixing the start position to the first character” (Edge n-gram) for each word

Edge NGram Tokenizer + Shingle Token Filter

Overview

Shingle Token Filter | Elasticsearch Guide [6.8] | Elastic

It turned out that Edge NGram Tokenizer’s problem of “If multiple words are entered, the word order are not retained” could be solved by using Shingle Token Filter as well.

If Shingle Token Filter is used, connected continuous tokens are handled as a token. In other words, if this method is applied to “2 Quick Foxes.”,

["2", "2 Quick", "Quick", "Quick Foxes", "Foxes"]

are handled as tokens. If Edge NGram is constructed based on this token,

["2 Quic", "Quick Fox"]

are stored and hit when “Quick Fox” is entered as the search string but not hit when “Fox Quick” is entered.

Evaluation
  • Speed
    • Though the index size is large, the speed is probably fast because the search is based on match queries
  • Search quality
    • The partial match search is available
    • Few noise
  • Implementation cost
    • There are many filters and tokenizers to configure
  • The complexity of settings and behaviors
    • Analyzer settings are complex

Wild Card Query

Overview

Wildcard Query | Elasticsearch Guide [6.8] | Elastic

It is also possible to realize the partial match search using a wild card. This method is slower than the other methods mentioned above. However, as shown in the following execution example, the result is well above the target of 200 to 300ms.

Evaluation
  • Speed
    • Though this method is slower than the other methods, the result is well above the target of 200 to 300ms
  • Search quality
    • The partial match search is available
    • Much noise
  • Implementation cost
    • It isn’t necessary to change the index
  • The complexity of settings and behaviors
    • Intuitive

Search results for assumed entry patterns

The following table shows the range of the assumed entry patterns which each method can cover.

Method\Entryap(the head of the first token)log(the head of the second token)app_log(the first and second tokens connected with an underscore)log_beta(the second and third tokens connected with an underscore)app log(the first and second tokens connected with a space)log beta(the second and third tokens connected with a space)ap lo(two tokens partially entered)eta(one token partially entered)
Completion SuggesterYesNoYesNoNoNoNoNo
Edge NGramYesYesYesYesYesYesYesNo
Edge NGram + Shingle Token FilterYesYesYesYesYesYesNoNo
Wildcard Query and Regexp QueryYesYesYesYesYesYesYesYes

This shows that Edge NGram and Wildcard Query support a wide range of search words.

Conclusion

The following table shows the comparison result of each evaluation standard.

MethodSpeedSearch qualityImplementation costThe complexity of settings and behaviors
Completion SuggesterExcellentFast because this is dedicated to the prefix searchNot good or badDedicated to the prefix searchThe flexible ranking is difficultGoodThere seems to be no obstacleGoodSimple because this is dedicated to the prefix search
Edge NGramGoodThough the index size is large, the search speed is fast enoughGoodIf multiple words are entered, results with a different word order may be suggestedGoodSame as aboveGoodThe behavior is not so complex: dividing a text with spaces and underscores and creating Edge n-gram for each word
Edge NGram + Shingle Token FilterGoodSame as aboveGoodIf multiple words are entered, the word order is retainedNot good or badThere are many Filter・Tokenizers to be set (setting at the time of storing a document in Elasticsearch)Not good or badThe settings and behaviors are relatively complex
Wildcard QueryNot good or badSlower than the other methods, but as a result of the experiment, it turned out to be fast enough for practical use.GoodEven if a word is entered omitting the first and/or some of the following characters, suggestions may be providedExcellentIt is not necessary to change the indexExcellentEasy and intuitive to use because there are few changes

In terms of speed, Wildcard Query is relatively slow, but all of these methods seem to achieve results well above the target of 200 to 300ms.

Moreover, the number of pieces of data in the TABLE and DB categories, which are subject to the auto-complete function, are as follows:

TABLE: 39,564

DB: 1,149

However, the execution time was almost the same when I executed a search by Wildcard Query (which was the slowest) for the data of the COLUMN category, which had 869,349 pieces of data, so there will be no problem with the operation even if the number of pieces of data increases tenfold in the future.

As for the search quality, all methods except Completion Suggester satisfy the standard judging from Search Results for Assumed Entry Patterns.

In terms of the implementation cost and the complexity of the settings and behaviors, Edge NGram + Shingle Token Filter seems to be difficult.

Considering these, I concluded that Wildcard Query was the best.

Before starting implementation, I shared the following research and comparison results with the team members for approval.

Implementation of Suggest API

The API was implemented with Java+Spring Boot and the CRUD operation of Elasticsearch was executed with Java High-Level REST Client.

For the operation check in the implementation process, I used the CLI tool and Swagger prepared beforehand.

Moreover, I implemented processing to sort documents hit in a partial match search based on the following standards:

  • Disk utilization
  • Relevance to the user
  • Whether they are also hit in a prefix search

Test

After implementation, I conducted various tests to confirm:

  • How it behaves to frequent entries
  • Whether any problem occurs for a specific job
  • Whether the operation can be continued when API requests are concentrated

Test of latency

Test for frequently used words

To confirm the behavior for frequently entered words, I excerpted words searched frequently based on search logs in IU Web and called the Suggest API 100 times for each work to measure latency.

The following figure shows the result. (The numbers of tables hit in the search are shown in the parentheses of the labels on the horizontal axis)

It turned out that the execution time was 200ms or shorter for any words.

Test for combinations of two alphabetical characters

Then, I conducted a comprehensive test of calling the Suggest API 10 times for all combinations of two alphabetical characters to confirm whether any problem occurs for specific words. From this result, it turned out that:

  • There is almost no correlation between the number of hits in a search and the latency
  • There is a strong correlation between the size of the response and the latency
  • The latency is very large (400ms or more) for some words

Therefore, I confirmed the responses to the search words with large latencies found that the sizes of the fields not necessary for the auto-complete function were very large, so I excluded such fields from responses. As a result, as shown in the following figure, the sizes of responses, in general, were reduced, which significantly reduced the latency.

This problem was critical because the latency exceeded 2,000ms depending on the entered word, so I needed to find it before release. From this experience, I understood the importance of comprehensive tests.

Load test

After release, I conducted a load test to confirm that the system would operate without any problem when API requests are concentrated. I used Vegeta as the test tool and monitored the following indicators by Grafana during the load test:

  • CPU utilization
  • Memory utilization
  • GC occurrence

As a result of the load test, the latency was about 100ms even when 60 requests were sent every second. There was no problem with the CPU utilization, the memory utilization, and the number of times and duration of GC.

Result

How the system operates

Since the successful release of the system, I confirmed its actual operation in IU Web. Search candidates are suggested as expected and the operation speed was fast enough.

Summary

In this internship, I implemented a search completion function for IU Web starting with the specification design of API and the selection of the auto-complete method. When determining the implementation method of auto-complete, I organized the pros and cons of each method researched beforehand, shared, and discusses them with the team members. After the implementation of the completion function, I conducted a latency test and a load test to confirm that the system could continue operation without any problem even if requests are concentrated.

Impression

At first, I had no background knowledge on Spring Boot and Elasticsearch, which are the main technologies used in this internship, but the period for catch-up was long enough and my mentor, Mr. Okuyama, supported me earnestly, so I could resolve the issue.

In this internship, I learned not only various technologies I had never experienced but also a wide range of knowledge about how to proceed with team development such as the selection of the development policy, code review, and the process up to release to the production environment. This is a valuable experience that I would never have acquired through developments without collaboration with team members.

This time, I participated in a full remote style, but my mentor, Mr. Okuyama, took time out of his busy schedule, so I could frequently communicate with him using Zoom and did not feel any inconveniences that may come with remote work. In addition, I interacted with various people through abundant opportunities including the team Zoom lunch meeting held twice a week and the intern get-together session.

I would like to thank LINE for this great 6-week internship.