Hello. My name is Ryoto Suzuki. I am a first-year master's student at the Graduate School of the University of Tokyo.
I participated in the six-week LINE internship (engineering work-experience course) starting on August 23rd.
During this internship, I joined the IU Dev Team of the Data Platform Department and worked on improving the search function of "IU Web," the data catalog used within LINE. In this blog post, I will describe that work in detail.
Background
About IU Web
Reference: "Democratization of Data Utilization" across Services Realized by LINE
To make data usable company-wide, LINE operates a data platform called "IU" (Information Universe), which accumulates data from almost all of LINE's services. It currently runs on more than 2,000 servers and stores about 300 PB of data.
"IU Web" provides the following as a data catalog to manage the data of IU:
- Visibility into what kind of data exists
- Who has permission to access each piece of data
- Metadata for each piece of data
- A data search function
In-house users rely on IU Web to find the data they need among the enormous amount of data in IU.
Issue
Many in-house users rely on IU Web's search function to reach the data they need. The problem is that, because IU stores so much data (for example, there are about 40,000 entries in the frequently searched TABLE category) and many entries have similar names, it is difficult to reach the target data without entering the exact search keyword. This gets in the way of IU's purpose of company-wide data utilization.
Therefore, I decided to solve this problem by implementing an auto-complete function that presents a list of candidates when part of a data name is entered in the search box, even if the user does not remember the exact name.
Elasticsearch is used as the search engine for data search. The metadata of each piece of data is stored in Elasticsearch as JSON, and search conditions and result priorities can be controlled both by the settings configured when the data is stored (indexed) and by the search queries themselves. In this internship, my goal was to realize an auto-complete function on top of Elasticsearch.
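As a rough illustration of this setup, the sketch below stores one metadata document and runs a simple search with the Python Elasticsearch client. The index name iu_table and all field names are assumptions for illustration, not IU's actual schema.

```python
# Minimal sketch of indexing metadata and searching it (ES 6.x-style client
# calls; index, type, and field names are assumptions, not IU's schema).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed host

# Each piece of data is stored as a JSON document of metadata.
es.index(index="iu_table", doc_type="doc", id=1, body={
    "name": "app_log_beta",
    "owner": "data-platform",
    "disk_usage_bytes": 123456789,
})

# Search conditions and result priorities are expressed in the query DSL.
result = es.search(index="iu_table", body={
    "query": {"match": {"name": "app_log"}}
})
for hit in result["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["name"])
```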
What is desirable completion
When developing the auto-complete function, the first thing I considered was which search results should be returned for which inputs. For example, what input patterns can be expected when a user is looking for the table "app_log_beta"? I consulted with my mentor, Mr. Okuyama, and assumed the following input patterns:
- ap (the head of the first token)
- lo (the head of the second token)
- app_log (the first and second tokens connected with an underscore)
- log_beta (the second and third tokens connected with an underscore)
- app log (the first and second tokens connected with a space)
- log beta (the second and third tokens connected with a space)
- ap lo (two tokens partially entered)
- eta (a fragment from the middle of one token)
On the other hand, there was a concern that supporting too many input patterns would let irrelevant data (noise) into the search results. We therefore decided to prioritize search results based on the following indicators, so that the data the user intends appears at the top (a query sketch follows the list):
- Whether the result is also hit by a prefix search (in addition to the partial match search)
- Relevance to the user
- Disk utilization
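As a minimal sketch of the first indicator, a bool query can use the partial match to decide which documents are hit, while a boosted prefix clause lifts prefix hits to the top. The field names here are assumptions, not IU Web's actual mapping.

```python
# Minimal sketch of the prioritization idea (not the actual production query).
# Index field names "name" / "name.ngram" are assumptions.
partial = "app_lo"

query_body = {
    "query": {
        "bool": {
            # The partial match clause decides WHICH documents are hit.
            "must": {
                "match": {"name.ngram": partial}
            },
            # The prefix clause only adds score, pushing prefix hits upward.
            "should": {
                "prefix": {"name": {"value": partial, "boost": 10}}
            }
        }
    }
}
```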
Research and comparison of methods to implement an auto-complete function
Once the desired completion behavior was decided, I researched ways to realize it with Elasticsearch.
As a result, the following four methods were found:
- Completion Suggester
- Edge NGram Tokenizer
- Edge NGram Tokenizer + Shingle Token Filter
- Wildcard Query
Each had pros and cons, so I set the following evaluation criteria for the selection:
- Search quality
  - Covers the entry patterns assumed in "What is desirable completion"
  - Allows flexible ranking
- Speed
  - Auto-complete had already been implemented on IU Web's user search page with a response time of about 200 to 300 ms, so I used that as the guideline
- Implementation cost
- The complexity of settings and behaviors
  - The Elasticsearch settings and behaviors should not be too complex; otherwise, subsequent testing, debugging, and tuning become troublesome
Implementation of a test tool
To try out and compare these methods, I implemented a Python tool that performs auto-completion from the command line, either by executing Elasticsearch queries directly or by calling the API. The tool executes a search query each time a character is entered and reports the following:
- Execution time
- Number of hits
- List of search results
Beyond its original purpose of comparing methods, this tool proved very useful in the subsequent development process, for example for measuring execution time and for debugging after implementation.
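The tool itself is internal, but a minimal sketch of such a tool might look like the following, assuming a local Elasticsearch and an index named iu_table.

```python
# Minimal sketch of an interactive comparison tool (assumptions: a local
# Elasticsearch at localhost:9200 and an index named "iu_table").
import time

import requests

ES_URL = "http://localhost:9200/iu_table/_search"  # assumed endpoint

def suggest(partial: str) -> None:
    body = {"query": {"match_phrase_prefix": {"name": partial}}, "size": 10}
    start = time.perf_counter()
    resp = requests.get(ES_URL, json=body)
    elapsed_ms = (time.perf_counter() - start) * 1000
    hits = resp.json()["hits"]
    print(f"{elapsed_ms:.1f} ms, {hits['total']} hits")  # ES 6.x: total is an int
    for hit in hits["hits"]:
        print(" ", hit["_source"].get("name"))

if __name__ == "__main__":
    buffer = ""
    # Simulate typing one character at a time; a real tool could read raw key
    # presses, but line input keeps this sketch dependency-free.
    while True:
        buffer += input(f"type next character (current: '{buffer}'): ")
        suggest(buffer)
```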
Research of Each Method
I researched each method and organized the pros and cons.
Completion Suggester
Overview
Completion Suggester | Elasticsearch Guide [6.8] | Elastic
Elasticsearch has a function for completion called "Completion Suggester," which enables:
- Prefix search
- Fuzzy search (suggests words within a specified edit distance)
This seems to be optimized for auto-complete and very fast.
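For illustration, a minimal Completion Suggester setup could look like the sketch below (index, type, and field names are assumptions). Note that the weight, which determines ranking, must be supplied when the document is indexed rather than at search time.

```python
# Minimal sketch of the Completion Suggester in ES 6.x (names are assumptions).
mapping = {
    "mappings": {
        "doc": {  # ES 6.x still uses a mapping type
            "properties": {
                "name_suggest": {"type": "completion"}
            }
        }
    }
}

# The ranking weight is fixed at index time, not at search time.
document = {
    "name_suggest": {"input": ["app_log_beta"], "weight": 10}
}

# Suggest query: prefix-only matching against the completion field.
suggest_query = {
    "suggest": {
        "table-suggest": {
            "prefix": "app_lo",
            "completion": {"field": "name_suggest"}
        }
    }
}
```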
Evaluation
- Speed
- Search is fast because the suggester is optimized for prefix-based auto-completion
- Search quality
- Matching from the middle of a word requires workarounds (reportedly possible, but not natively supported)
- Little noise
- Partial match search is unavailable
- The priority must be specified when a document is indexed, so flexible ranking at search time is difficult (perhaps impossible)
- Implementation cost
- There seems to be no obstacle in the implementation
- The complexity of settings and behaviors
- Simple because this is dedicated to the prefix search
Edge NGram Tokenizer
Overview
Edge NGram Tokenizer | Elasticsearch Guide [6.8] | Elastic
This method divides the searchable text into words (tokens) and stores, for each word, n-grams that start from the word's first character.
For example, if this method is applied to "2 Quick Foxes,"
["2", "Q", "Qu", "Qui", "Quic", "Quick", "F", "Fo", "Fox", "Foxe", "Foxes"]
are stored, so the document is hit when, for example, "Fox" is entered as the search string.
On the other hand, there is a problem: a document is hit even when the words are entered in an order different from the original text, such as "Fox Qui." This should be avoided as much as possible, because responding to unexpected input patterns can increase noise. For example, in my tests, both "app_log" and "api_log" were hit when "log ap" was entered.
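A minimal sketch of index settings for this approach is shown below; the analyzer and field names are assumptions. The token_chars setting makes both spaces and underscores act as word boundaries, and the search_analyzer keeps queries from being n-grammed themselves.

```python
# Minimal sketch of ES 6.x index settings for the Edge NGram approach
# (analyzer/field names are assumptions, not the production settings).
settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "edge_ngram_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20,
                    # Keep only letters and digits inside tokens, so both
                    # spaces and underscores act as word boundaries.
                    "token_chars": ["letter", "digit"]
                }
            },
            "analyzer": {
                "edge_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "edge_ngram_tokenizer"
                }
            }
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "edge_ngram_analyzer",
                    # Analyze queries as plain words, not as n-grams.
                    "search_analyzer": "standard"
                }
            }
        }
    }
}
```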
Evaluation
- Speed
- Probably fast because this is a normal match query
- Search quality
- If "event_log" is entered, "log_event" is also suggested
- The partial match search is available
- If multiple words are entered, the word order isn't retained
- Implementation cost
- There seems to be no obstacle in the implementation
- The complexity of settings and behaviors
- The behavior is not very complex: the text is split into words at spaces and underscores, and for each word, n-grams whose start position is fixed at the first character (edge n-grams) are created
Edge NGram Tokenizer + Shingle Token Filter
Overview
Shingle Token Filter | Elasticsearch Guide [6.8] | Elastic
It turned out that the Edge NGram Tokenizer's problem that the word order is not retained for multi-word input could be solved by combining it with the Shingle Token Filter.
With the Shingle Token Filter, sequences of consecutive tokens are concatenated and handled as single tokens. For example, when this is applied to "2 Quick Foxes,"
["2", "2 Quick", "Quick", "Quick Foxes", "Foxes"]
are handled as tokens. If edge n-grams are then built from these tokens, terms such as
["2 Quic", "Quick Fox"]
are stored, so the document is hit when "Quick Fox" is entered as the search string but not when "Fox Quick" is entered.
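A minimal sketch of an analyzer chain for this approach follows; all names are assumptions. The order of the filters is the important part: shingles are built first, then edge n-grams are taken over each shingle, which is what preserves word order.

```python
# Minimal sketch of an analyzer combining shingles with edge n-grams
# (names are assumptions; the ordering of the filters is the key point).
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "shingle_filter": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3
                },
                "edge_ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "shingle_edge_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # First build shingles ("Quick" + "Foxes" -> "Quick Foxes"),
                    # then take edge n-grams of each shingle, so the word order
                    # inside a shingle is preserved.
                    "filter": ["shingle_filter", "edge_ngram_filter"]
                }
            }
        }
    }
}
```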
Evaluation
- Speed
- Though the index size is large, the speed is probably fast because the search is based on match queries
- Search quality
- The partial match search is available
- Little noise
- Implementation cost
- There are many filters and tokenizers to configure
- The complexity of settings and behaviors
- Analyzer settings are complex
Wildcard Query
Overview
Wildcard Query | Elasticsearch Guide [6.8] | Elastic
It is also possible to realize partial match search using a wildcard. This method is slower than the other methods mentioned above. However, execution tests showed that the results stay well within the target of 200 to 300 ms.
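A minimal sketch of such a query follows (the field name "name" is an assumption). The leading "*" is what enables matching from the middle of a token, and it is also the main reason this approach is slow.

```python
# Minimal sketch of a wildcard query (field name "name" is an assumption).
partial = "eta"

query_body = {
    "query": {
        "wildcard": {
            # Leading and trailing wildcards give partial (substring) matching;
            # the leading "*" is the expensive part.
            "name": {"value": f"*{partial}*"}
        }
    },
    "size": 10
}
```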
Evaluation
- Speed
- Slower than the other methods, but the results are well within the target of 200 to 300 ms
- Search quality
- The partial match search is available
- A lot of noise
- Implementation cost
- It isn't necessary to change the index
- The complexity of settings and behaviors
- Intuitive
Search results for assumed entry patterns
The following table shows the range of the assumed entry patterns which each method can cover.
| Method \ Entry | ap (head of the first token) | lo (head of the second token) | app_log (first and second tokens joined with an underscore) | log_beta (second and third tokens joined with an underscore) | app log (first and second tokens joined with a space) | log beta (second and third tokens joined with a space) | ap lo (two tokens partially entered) | eta (middle of a token) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Completion Suggester | Yes | No | Yes | No | No | No | No | No |
| Edge NGram | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No |
| Edge NGram + Shingle Token Filter | Yes | Yes | Yes | Yes | Yes | Yes | No | No |
| Wildcard Query / Regexp Query | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
This shows that Edge NGram and Wildcard Query support a wide range of search words.
Conclusion
The following table shows the comparison result of each evaluation standard.
| Method | Speed | Search quality | Implementation cost | Complexity of settings and behaviors |
| --- | --- | --- | --- | --- |
| Completion Suggester | Excellent: fast because it is dedicated to prefix search | Fair: dedicated to prefix search; flexible ranking is difficult | Good: no apparent obstacles | Good: simple because it is dedicated to prefix search |
| Edge NGram | Good: the index size is large, but the search speed is fast enough | Good: for multi-word input, results in a different word order may be suggested | Good: same as above | Good: not very complex; split text at spaces and underscores and create edge n-grams for each word |
| Edge NGram + Shingle Token Filter | Good: same as above | Good: word order is retained for multi-word input | Fair: many filters and tokenizers to configure (when documents are stored in Elasticsearch) | Fair: the settings and behaviors are relatively complex |
| Wildcard Query | Fair: slower than the other methods, but experiments showed it is fast enough for practical use | Good: suggestions can be provided even when the first and/or some subsequent characters of a word are omitted | Excellent: no index changes needed | Excellent: easy and intuitive, since few changes are needed |
In terms of speed, Wildcard Query is relatively slow, but all of these methods achieve results well within the target of 200 to 300 ms.
Moreover, the numbers of entries in the TABLE and DB categories, which are the targets of the auto-complete function, are as follows:
TABLE: 39,564
DB: 1,149
However, when I ran a search with Wildcard Query (the slowest method) against the COLUMN category, which has 869,349 entries, the execution time was almost unchanged, so operation should remain unproblematic even if the amount of data grows tenfold in the future.
As for search quality, judging from the table in "Search results for assumed entry patterns," all methods except Completion Suggester satisfy the criteria.
In terms of implementation cost and the complexity of settings and behaviors, Edge NGram + Shingle Token Filter seems the most burdensome.
Considering these, I concluded that Wildcard Query was the best.
Before starting implementation, I shared these research and comparison results with the team members and obtained their approval.
Implementation of Suggest API
The API was implemented with Java and Spring Boot, and CRUD operations against Elasticsearch were performed with the Java High Level REST Client.
To check the behavior during implementation, I used the CLI tool prepared earlier, together with Swagger.
Moreover, I implemented processing that sorts the documents hit by a partial match search based on the following criteria (a sketch follows the list):
- Disk utilization
- Relevance to the user
- Whether they are also hit in a prefix search
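The production scoring logic lives in the API, but as a rough illustration, sorting hits by a composite key could look like the sketch below; all field names and the toy data are assumptions.

```python
# Rough illustration of the sort order (not the production logic).
# Each hit is assumed to carry these illustrative fields.
from typing import Dict, List

def sort_candidates(hits: List[Dict], partial: str) -> List[Dict]:
    def key(hit: Dict):
        prefix_hit = hit["name"].startswith(partial)   # prefix matches first
        relevance = hit.get("user_relevance", 0.0)     # e.g. user's access history
        disk_usage = hit.get("disk_usage_bytes", 0)    # bigger tables rank higher
        # Python sorts ascending, so negate "bigger is better" values.
        return (not prefix_hit, -relevance, -disk_usage)
    return sorted(hits, key=key)

# Example usage with toy data:
hits = [
    {"name": "service_app_log", "user_relevance": 0.9, "disk_usage_bytes": 10},
    {"name": "app_log_beta", "user_relevance": 0.1, "disk_usage_bytes": 5},
]
print([h["name"] for h in sort_candidates(hits, "app_lo")])
# -> ['app_log_beta', 'service_app_log']
```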
Test
After implementation, I conducted various tests to confirm:
- How it behaves for frequently entered words
- Whether any problems occur for specific inputs
- Whether operation can continue when API requests are concentrated
Test of latency
Test for frequently used words
To confirm the behavior for frequently entered words, I extracted frequently searched words from IU Web's search logs and called the Suggest API 100 times for each word to measure the latency.
The following figure shows the result. (The number of tables hit by each search is shown in parentheses in the horizontal-axis labels.)
It turned out that the execution time was 200ms or shorter for any words.
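A minimal sketch of this kind of measurement is shown below; the endpoint, query parameter, and word list are assumptions.

```python
# Minimal sketch of the latency measurement (endpoint and words are assumptions).
import statistics
import time

import requests

SUGGEST_URL = "http://localhost:8080/api/suggest"  # assumed endpoint

def measure(word: str, n: int = 100) -> float:
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        requests.get(SUGGEST_URL, params={"q": word})  # assumed parameter name
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

for word in ["app", "log", "user"]:  # illustrative frequent words
    print(word, f"{measure(word):.1f} ms")
```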
Test for combinations of two alphabetical characters
Then, I conducted a comprehensive test of calling the Suggest API 10 times for all combinations of two alphabetical characters to confirm whether any problem occurs for specific words. From this result, it turned out that:
- There is almost no correlation between the number of hits in a search and the latency
- There is a strong correlation between the size of the response and the latency
- The latency is very large (400ms or more) for some words
I then inspected the responses for the search words with large latencies and found that fields unnecessary for the auto-complete function were very large, so I excluded those fields from the responses. As a result, as shown in the following figure, response sizes were reduced overall, which significantly reduced the latency.
This problem was critical because the latency exceeded 2,000 ms for some words, so it had to be found before release. From this experience, I learned the importance of comprehensive tests.
Load test
After release, I conducted a load test to confirm that the system operates without problems when API requests are concentrated. I used Vegeta as the load-testing tool and monitored the following indicators with Grafana during the test:
- CPU utilization
- Memory utilization
- GC occurrence
As a result of the load test, the latency was about 100 ms even at 60 requests per second, and there were no problems with CPU utilization, memory utilization, or the frequency and duration of GC.
Result
How the system operates
The system was released successfully, and I confirmed its actual operation in IU Web: search candidates are suggested as expected, and it responds quickly enough.
Summary
In this internship, I implemented a search completion function for IU Web, starting from the API specification design and the selection of the auto-complete method. When deciding on the implementation method, I organized the pros and cons of each method I had researched, then shared and discussed them with the team members. After implementing the completion function, I conducted latency tests and a load test to confirm that the system could keep operating without problems even under concentrated requests.
Impression
At first, I had no background in Spring Boot or Elasticsearch, the main technologies used in this internship, but I was given enough time to catch up, and my mentor, Mr. Okuyama, supported me earnestly, so I was able to solve the problem.
In this internship, I learned not only technologies I had never used before but also a great deal about how to carry out team development, such as choosing a development policy, code review, and the process leading up to release to the production environment. This was a valuable experience I could never have gained by developing without collaborating with team members.
I participated fully remotely, but my mentor, Mr. Okuyama, took time out of his busy schedule so that we could communicate frequently over Zoom, and I did not feel any of the inconveniences that can come with remote work. In addition, I interacted with various people through many opportunities, including the team Zoom lunch held twice a week and the intern get-together session.
I would like to thank LINE for this great 6-week internship.