Big data and data analytics have long been buzzwords in the IT world. Nowadays, they are no longer a choice but a necessity. Everyone is piling up and digging through data to find meaning in big data. To succeed in this “fact-finding” process, we need an appropriate analytics environment. Today I’d like to share what LINE has built for game analytics, and how.
LINE Games: Analytics environment
In a narrow sense, an analytics environment means tools such as Tableau, Excel, and R; in a wide sense, it expands to designing, collecting, processing, and analyzing data. In this post, I’m going to use the wider definition. When you build an analytics environment, you have a choice: go with commercial services from providers such as Google or Amazon, or use open source software. Needs and requirements vary depending on the scale and purpose of data analysis, but the Hadoop ecosystem, NoSQL, Kafka, the Elastic Stack, and R are usually the main building blocks of an analytics environment.
The following is a high-level diagram of the analytics environment at LINE.
LINE Games built its analytics environment with open source software. Data sent from the Growthy SDK are collected by the log collector, stored in the Hadoop clusters via the log pipeline, extracted by Hive and Impala, and finally presented as statistics and analytics using R, Tableau, and an in-house BI (business intelligence) tool. I’m going to use seven key words (Catalog, Quality, Open, Deep, Security, Wizard, Index) to explain the differentiating characteristics of the game analytics environment at LINE Games.
The first key word is “Data Catalog”. If you are not familiar with the term, you can think of a data catalog as a data specification or data design. You might wonder what role a data catalog plays in an analytics environment. In a wide sense, all data analysis starts from data collection, which requires a design stage in which you define the data. The data catalog in the analytics environment of LINE Games includes the following information.
- Log design: When and which data to send
- DW (Data warehouse) design: How collected data are processed and stored
The data catalog is critical information used everywhere, from data logging and collection to processing, storage, and analytics. Since data logging is not one of the core functions of a game, catalogs tend to be managed relatively loosely, with more focus on how data should be written. At LINE Games, the analytics team directly maintains all data catalogs because the team provides an analytics service on top of the data collection and storage environment. Data catalogs are created as templates, maintained at the service level or sometimes at the data type level. For instance, you can create an item template for obtaining and using items in a game and customize this template for different games based on the type of items and their usage. Using a template is a big plus for accumulating data and analyzing aggregated data, as the same data will be saved in the same column regardless of the game.
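For illustration only, a minimal item-log template from such a catalog might look like the following. The field names and format here are hypothetical, not LINE Games’ actual catalog schema:

```json
{
  "log_type": "item",
  "when": "sent whenever a player obtains or uses an item",
  "fields": [
    { "name": "user_id",  "type": "string", "desc": "player identifier" },
    { "name": "item_id",  "type": "string", "desc": "item identifier" },
    { "name": "action",   "type": "string", "desc": "OBTAIN or USE" },
    { "name": "quantity", "type": "int",    "desc": "number of items" },
    { "name": "ts",       "type": "long",   "desc": "event timestamp (epoch ms)" }
  ]
}
```

Individual games could then extend a shared template like this with game-specific fields while keeping the common columns identical across titles.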
The second key word is “Data Quality”. Everyone talks about how important it is to maintain the quality of data. Data sets with outliers can make analysis results go haywire, and it is usually difficult to identify the root cause. So, who should keep an eye on data quality, and when?
There is a relatively easy answer to “when”. It is inconvenient and burdensome to filter out low-quality data during the analysis stage. Therefore, data quality should be controlled at the time of data collection, more precisely during the development stage when data collection is implemented.
Now, who should be in charge of data quality control? As explained above, data quality should be controlled during the development stage, so it is in the hands of developers to collect accurate data based on the respective data catalogs. In reality, however, it is almost impossible to sift through data during the development and QA stages. Developers and game QA testers are busy with the work on their plates, let alone sifting through data for accuracy. Moreover, game QA testers might not have enough background knowledge to determine data accuracy.
Against this backdrop, LINE Games has set up a data quality management organization to conduct QA tests for data.
Data QA testers at LINE Games engage from the start of the process to the end, from the development stage onward, so they understand the history of changes in the data. In addition, during the game QA period, data QA testers collaborate with game QA testers to verify the data. After a game launches, data QA testers continue to monitor data quality and flag inaccurate data as issues. When new data are added with a content update, they undergo the same process as at the initial launch. This consistent effort at data quality control allows us to provide analysis results with a high level of confidence.
For efficient data QA, logs generated automatically by the SDK are verified and reported on automatically by the data QA system. Data customized for each game are systematically managed, with about 1,000 test cases maintained by the Data QA Team. Our next goal is to extend the data QA system to automatically run QA tests for customized data as well.
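As a rough sketch of what catalog-driven verification can look like, the snippet below checks one log record against a catalog entry. The catalog format, log type, and field names are hypothetical illustrations, not LINE Games’ actual QA system:

```python
# Hedged sketch: verifying a log record against a data catalog entry.
# The catalog structure and field names are hypothetical.

CATALOG = {
    "item_use": {
        "required": {"user_id": str, "item_id": str, "quantity": int, "ts": int},
    },
}

def verify_log(log_type, record, catalog=CATALOG):
    """Return a list of QA issues for one record; an empty list means it passes."""
    issues = []
    spec = catalog.get(log_type)
    if spec is None:
        return [f"unknown log type: {log_type}"]
    for field, ftype in spec["required"].items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            issues.append(f"bad type for {field}: expected {ftype.__name__}")
    return issues
```

A QA system built this way can run the same checks automatically for every record the SDK emits, which is what makes catalog maintenance pay off at the verification stage.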
The third key word is “Open Data”, and ultimately achieving an open data analysis environment. As a game publisher, LINE Games provides analytics on game data to internal business teams and external developers. We don’t just passively respond to data requests; we open up our data analysis environment to external users so that they can directly handle data and conduct their own analysis. It’s always wise to “teach a man to fish”. For this to happen, we needed a separate cluster dedicated to the open data analysis environment. I’ll briefly explain the issues we faced in setting it up and how we resolved them.
As an initial setup, we used Fluentd to store data collected by the log collector in a cluster of more than 100 nodes, for the analytics service and for internal analytics. We couldn’t open these clusters directly to external developers for many reasons, including security. Moreover, data in the clusters were stored not per game but per log type. So we decided to create a half-sized cluster that holds data per game for the open data analysis environment. There were still two issues to resolve.
- Where would we break down data for each game?
- How do we handle personal data included in data sets?
We used Fluentd, an open source data collector, to do the job. Fluentd is a data pipeline that collects and filters data from data sources and outputs them to destinations. It is similar to Logstash by Elastic.
Breaking down and storing data for each game
We used Fluentd to filter out data for each game and send them to the new open cluster. During this process, however, we faced another problem: delayed transmission from Fluentd. The cause was Fluentd’s configuration. We had set the same data transmission frequency for the new open cluster as for the existing clusters, 10 seconds. The open cluster needs to manage buffers for more than 100 games, so the number of output threads is huge, even for a single data type. Given the total number of data types, Fluentd cannot process hundreds of threads within 10 seconds, which led to delayed transmission.
We came up with two solutions. First, we tuned Fluentd’s configuration to reduce the number of threads flushing simultaneously and increase the interval between flushes. After the final tuning, the timekey was set to 3 minutes, which was plenty, as the open cluster is not used in real time. Second, we moved from 8 physical Fluentd servers to 18 virtual servers, lowering server specifications but increasing parallelism. With these adjustments, we were able to send data to both clusters in a stable manner.
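For reference, this kind of tuning maps onto standard parameters in a Fluentd v1 output configuration. The tag pattern, destination host, and paths below are hypothetical, only the buffering parameters reflect the tuning described above:

```
# Hedged sketch of the output buffering described above (names hypothetical)
<match game.**>
  @type forward
  <server>
    host open-cluster.example.com   # hypothetical open-cluster endpoint
    port 24224
  </server>
  <buffer tag,time>
    @type file
    path /var/log/fluentd/buffer/open
    timekey 3m            # flush in 3-minute chunks instead of 10 seconds
    timekey_wait 1m       # grace period before a chunk is flushed
    flush_thread_count 4  # cap the number of simultaneous flush threads
  </buffer>
</match>
```

With per-game tags as chunk keys, lengthening `timekey` directly reduces how many chunks (and therefore flush threads) are in flight at once.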
Resolving privacy issue
Among the data sent to the open cluster, IDs must not be provided directly, for privacy reasons, but at the same time they are an essential piece of information for analytics. Thus, IDs must be encrypted before they are provided to external users. If we were to encrypt IDs at the time of sending them to the open cluster, we would need to develop a plug-in to add encryption logic to Fluentd. It would also be inefficient to read back data already saved in the open cluster just to encrypt them. A better choice was to encrypt IDs at the time of collection, in the log collector. As encrypted IDs are unnecessary in the existing cluster, we set things up so that they are not saved there but are sent to the open cluster via Fluentd.
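The post doesn’t detail the encryption scheme itself, but a common way to mask IDs while keeping them usable for analysis is a keyed hash: the same player always maps to the same token, so joins across tables still work, while the raw ID is never exposed. A minimal Python sketch under that assumption (the key is a hypothetical placeholder):

```python
import hashlib
import hmac

# Hedged sketch of ID pseudonymization at collection time.
# The key and function name are illustrative, not LINE Games' actual scheme.
SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical secret

def pseudonymize(user_id: str) -> str:
    """Deterministically mask an ID so it stays joinable across tables."""
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because the mapping is deterministic per key, external analysts can still count distinct users or join logs on the masked ID without ever seeing the original.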
The fourth key word is “Deep”, as we enable users to use raw data for analysis. LINE’s analytics environment allows you to read raw data from “3 minutes old” to “1 year old” using SQL. We chose SQL because it has a low entry barrier for everyone, including planners, analysts, and developers. In addition, as the data are stored in Hadoop clusters, you can make good use of Hive and Impala, both SQL-on-Hadoop engines. The Cloudera package offers the Apache Hue service, which supports Hive and Impala, but LINE Games instead developed its own web-based service for the open data analysis environment. There were two reasons we didn’t use Hue: 1) Hue has too many features beyond the scope we required; and 2) Hue’s LDAP (Lightweight Directory Access Protocol) support would have required more operational resources for account management than necessary.
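For example, a developer could compute daily active users directly from raw logs with a Hive query along these lines. The database, table, and column names are hypothetical:

```sql
-- Daily active users over the last 7 days (hypothetical schema)
SELECT to_date(event_time)      AS dt,
       COUNT(DISTINCT user_id)  AS dau
FROM   game_a.login_log
WHERE  to_date(event_time) >= date_sub(current_date(), 7)
GROUP  BY to_date(event_time)
ORDER  BY dt;
```

The same statement can run on Impala when faster, interactive response is needed, which is one reason both engines are exposed.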
The following is the initial page of the public web service. You can select a DB and table to check the schema and run a Hive query to see the results. You can also download the data as a CSV file.
Still, there were four issues to resolve around this key word.
As we store data per game in the open cluster, we had to selectively show each developer only their own game’s data, hiding other developers’ games. I’ll explain this in detail later, under the key word “Security”.
Next was resource management, so that each developer could run Hive queries stably. With a single resource pool for the cluster, one developer’s tasks could use up all available resources. As our service is an open analysis environment, we needed a fair resource allocation approach for all developers.
LINE Games uses the same authentication for this web-based open analysis service as for the external analytics service, and we can identify the company code during authentication. We then assign this company code as the pool name when a query runs, so resources are statically assigned to each company. There is a downside to this approach: as more companies use the analysis service, the resource pool becomes segmented further, and active users feel that resources are limited while some resources sit idle with inactive users. We plan to improve this with dynamic resource allocation in the future.
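In practice, static per-company pool assignment can be expressed through the engines’ standard session settings, set by the service before executing the user’s query. The pool name below is hypothetical:

```sql
-- Impala: route this session's queries to the company's resource pool
SET REQUEST_POOL=root.company_a;

-- Hive on YARN with the Fair Scheduler: the equivalent queue assignment
SET mapreduce.job.queuename=root.company_a;
```

Issuing these per session keeps one company’s heavy query from starving another company’s pool.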
Not only SELECT but also CREATE
Those of you with experience writing complex queries will know that it is more efficient to create a table of intermediate results when you need to perform multiple operations. In the initial stage of the open analysis web service, we only allowed SELECT on views, so nested queries had to be used to generate intermediate results, leading to repeated runs of heavy queries. We resolved this by providing a private DB for each user, which lets users create a table from the result set of a SELECT statement in their private DB and reuse it for analysis. We set a 30-day retention period for these tables to ensure unused data are not just sitting around.
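The pattern looks like the following in Hive or Impala. The private database, table, and schema names are hypothetical:

```sql
-- Materialize an intermediate result once in the user's private DB
CREATE TABLE company_a_private.active_7d AS
SELECT user_id
FROM   game_a.login_log
WHERE  to_date(event_time) >= date_sub(current_date(), 30)
GROUP  BY user_id
HAVING COUNT(DISTINCT to_date(event_time)) >= 7;

-- Reuse it in follow-up analysis instead of re-running the heavy subquery
SELECT COUNT(*) AS retained_users
FROM   company_a_private.active_7d;
```

Each follow-up query now scans the small intermediate table instead of the full raw log.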
Slow, too slow
Slow query results are more of a fundamental, structural limitation of Hive. We were able to speed things up by additionally providing Impala.
The fifth key word is “Security”, with a focus on cluster security and game data segregation. Kerberos, LDAP, and Apache Sentry are adopted in the open analysis environment of LINE Games. For instance, when user A runs a Hive query, A goes through Kerberos authentication and then LDAP account identification before the query is executed. We adopted Sentry because authorization is also required to control access rights to Hive and Impala databases and tables.
Sentry is a role- and group-based authorization module for objects, linked with LDAP groups. Briefly, Sentry is provided as a plugin for Hive and Impala, and it validates whether the user’s LDAP group has access privileges to the objects in the query. The following diagram depicts how Kerberos, LDAP, and Sentry work together.
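The Sentry side of such a setup boils down to standard role and grant statements run through Hive or Impala. The role, group, and database names below are hypothetical:

```sql
-- Hedged sketch of per-game access control with Sentry (names hypothetical)
CREATE ROLE game_a_analyst;
GRANT ROLE game_a_analyst TO GROUP company_a;        -- company_a is an LDAP group
GRANT SELECT ON DATABASE game_a TO ROLE game_a_analyst;
```

Because each game lives in its own database in the open cluster, a per-database grant like this is enough to show developers only their own game’s data.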
It might sound complicated, but it is set up to automatically proceed with a few clicks on the administration page.
The sixth key word is “Wizard”. After a new game is launched, data extraction requests come in for operational and analytics purposes. Developers can’t do anything until they receive the requested data, and the data team receives more and more requests as the number of games increases. To resolve this bottleneck, we added the Wizard to the open analysis environment so that developers can extract data on their own. We researched the data extraction and analysis requests from the past year to define the requirements for the Wizard and, as a result, created SQL templates covering frequently made requests and conditions.
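A Wizard template of this kind can be imagined as a parameterized query like the one below. The placeholder syntax and schema are hypothetical, not the Wizard’s actual format:

```sql
-- Hedged sketch of a template: ${...} placeholders are filled in from the form
SELECT to_date(event_time)      AS dt,
       COUNT(DISTINCT user_id)  AS purchasers,
       SUM(amount)              AS revenue
FROM   ${game_db}.purchase_log
WHERE  event_time BETWEEN '${start_date}' AND '${end_date}'
  AND  country IN (${countries})
GROUP  BY to_date(event_time);
```

The user only picks the game, date range, and conditions; the template supplies the SQL.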
The most powerful feature of the Wizard is letting users manage their own templates. You can modify a template when you need to change a preset condition or add a new one, and you can easily create new templates. It has been 6 months since we launched the Wizard (as of April 2019). As I write this article, a total of 108 users have run 3,412 tasks on the Wizard, which translates to about 570 hours of work, assuming 10 minutes per task. This indicates that the Wizard has significantly reduced the load on both ends of the process, the developers and the data analytics team.
The seventh key word is “Index”, which refers to the business intelligence (BI) tool that serves analytics built on the know-how of LINE Games. LINE Games proudly offers pivot features in its in-house BI tool. Pivots enable users to conduct more in-depth analysis, as they can shape the analytics however they want.
Prepared vs on-demand
Data structure and delivery are also important for pivot features. There are generally two approaches: prepared and on-demand. The prepared approach usually delivers results faster, given a small or fixed number of dimensions. The on-demand approach requires no data preparation in advance, and queries can use any combination of filters. The analytics for LINE Games have 7 to 10 dimensions, and pre-aggregating every combination of that many dimensions is practically impossible, so we chose the on-demand approach.
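Conceptually, an on-demand pivot is just a GROUP BY over whichever dimensions the user selects at query time, run against the raw or lightly processed data. A hedged sketch with hypothetical table and column names:

```sql
-- The selected dimensions become the GROUP BY keys at query time
SELECT country,
       os,
       game_version,
       SUM(amount) AS revenue
FROM   game_a.purchase_log
GROUP  BY country, os, game_version;
```

Swapping the dimension list changes the pivot without any precomputed cubes, which is why the number of dimensions stops mattering.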
Exact vs approximate
Another consideration for pivot features is whether to provide exact values or estimates. For instance, when the SUM function is used in a pivot table, the numbers must add up exactly regardless of how the dimensions are defined. If an estimate is provided just for faster results, accuracy suffers. Therefore, LINE Games’ BI tool uses Impala as the query engine for accurate calculation and massive data processing with the on-demand approach.
A well-built data warehouse is more important than the features we provide with the analytics service. If data quality is poor in the first place, the results from the BI tool cannot be accurate. Building a data warehouse with high data quality is a prerequisite for developing a sophisticated BI tool.
With that, we have explored all seven key words of the LINE game analytics environment. In a nutshell, LINE Games’ analytics environment is built on directly managed data catalogs, maintained with data QA for high-quality data, and offers an open analysis environment where external users can directly access raw data for analytics. Security is as tight as it can be. We offer the Wizard to help users extract and analyze data, and an in-house BI tool with pivot features to turn big data into smart analytics.
There is one last key word underway, “Integration.” To be more precise, it is integration of data catalog.
Once the data catalog is integrated, the registered data design information can be linked to enable the following automation:
- Collector: Making decision on data collection
- Data QA: Automatically verifying custom data from users
- Ingestion: Automatic processing of data based on the design in data catalog as data are being stored into the cluster
- DW: Automatic building of DW
- Index: Automatic generation of analytics based on data catalog and DW information
As soon as we complete this last key word, Integration, I’m confident there will be no game analytics environment quite like the one LINE Games offers. Our target is to complete this last stage by the first half of this year, and we welcome anyone who wishes to join the ride. Please check out more information in the following link!