This post is the result of the combined efforts and work of many. We’d like to thank all of you who were involved in preparation for New Year’s traffic.
LINE shows a traffic pattern typical of online messaging apps. At midnight on December 31st, users start sharing New Year’s greetings via messaging apps, pushing the traffic volume far above the usual average. The growth in traffic volume follows different patterns depending on the time zone and culture of each country. LINE makes the necessary preparations to handle such a sharp spike in traffic seamlessly. We call this “preparations for the New Year.” In this post, we’d like to share the work we did to prepare for the New Year of 2020 and the results.
How LINE messaging servers prepare for New Year’s traffic
When the clock hits midnight and ushers in the New Year in each country, many LINE users send New Year’s greetings. This multiplies the traffic within a very short period of time. Users in some countries mostly send text messages, while users in others send images or videos. To handle this unique situation, LINE starts preparing as early as June. The LINE app consists of various components, and the teams responsible for each component work closely together during our preparations for the New Year.
Instant traffic spike at midnight in each country
Hourly trend of message traffic at the New Year with different patterns, depending on the preference of media messages
LINE has been accumulating statistics on the traffic patterns and peculiarities observed during every New Year since its launch. We build our forecasts on these statistics and make the necessary improvements. It is equally important to assess server availability accurately. Below, we go step by step through how we assess the current availability of application servers and databases and make improvements.
Preparations for the New Year
New functionality is added and various improvements are made to LINE messaging servers every year. We have a daily release cycle, which means the code evolves every day. That’s why it is important to estimate and prepare in advance so that we can handle New Year’s traffic seamlessly. To ensure that the current servers are ready for the New Year, it is critical to simulate the forecast traffic accurately. However, it is not at all easy to reproduce the exact read/write traffic pattern of actual users. A single overlooked factor can make simulation results diverge from actual traffic and sometimes leads to unexpected issues in production. Against this backdrop, we conduct a benchmark test (BMT) using actual traffic data. The first step in the BMT is to prepare a series of dashboards that give an accurate view of the status of the servers.
Various dashboards to give a snapshot of various indicators
Once the dashboards are ready, it’s time to start testing. The test consists of closely monitoring the dashboards while gradually increasing the inbound traffic to specific hosts. When an abnormality shows up on a dashboard, we immediately reduce the traffic routed to that host. Even if the host fails because a temporary spike exceeds its capacity, the system is designed not to lose messages, thanks to multiple retries via various routes. Consequently, we can raise the accuracy of the test without impacting services.
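The ramp-up procedure described above can be sketched as a small control loop. This is a minimal illustration, not LINE’s tooling: `ramp_traffic`, the weight steps, and the `is_healthy` callback (which stands in for an operator watching the dashboards) are all hypothetical names and values.

```python
from typing import Callable, List


def ramp_traffic(
    steps: List[float],
    is_healthy: Callable[[float], bool],
    baseline: float = 1.0,
) -> float:
    """Raise a host's traffic weight step by step and roll back on the first anomaly.

    Returns the highest weight the host sustained without an anomaly.
    """
    sustained = baseline
    for weight in steps:
        # In a real test this would change a load balancer weight and then
        # wait while the dashboards are observed (assumption).
        if not is_healthy(weight):
            # Anomaly detected: fall back to the last known good level.
            return sustained
        sustained = weight
    return sustained


# Hypothetical host that starts to misbehave above 2.5x its normal traffic.
max_weight = ramp_traffic([1.5, 2.0, 2.5, 3.0], lambda w: w <= 2.5)
print(max_weight)
```

The returned value is the per-host capacity estimate that the rest of the planning can build on.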
Make a plan for the New Year based on the existing traffic pattern
As for application servers, the traffic is distributed to each server so the BMT can be conducted with the method described above, such as increasing inbound traffic to a specific server. However, it’s a different story for databases as multiple servers share resources. That’s why we use accumulated traffic data to make a plan for the New Year.
LINE has compiled statistics of the New Year traffic every year since the launch of the service. It is a big help as we have details on traffic patterns and preparations. For each year, everything from participating teams and components is put together in a single document. It provides a comprehensive understanding of data from other teams or components, traffic patterns, issues and improvements.
LINE messaging servers use Redis and HBase databases. As for Redis, we classify each cluster based on the traffic pattern from the past data. Some clusters tend to experience more burden than others during the New Year.
We refer to IOPS (Input/Output Operations Per Second) and UsedMemory to understand traffic patterns.
IOPS is the number of requests to Redis per second, which goes up along with user requests during busy periods such as the New Year. For services that receive a lot of GET requests, we check whether the current capacity of the Redis clusters is large enough to handle them.
If not, we can either expand the clusters to distribute the load further or reduce the burden by optimizing Redis requests at the application level. We usually discuss the pros, cons and efficiency of both methods before deciding which one to use.
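The capacity check can be written down as simple arithmetic. The sketch below assumes a forecast peak IOPS and a measured per-node limit; the function name, the 30% headroom, and all numbers are made up for illustration and do not come from LINE’s data.

```python
import math


def nodes_required(peak_iops: float, per_node_iops: float, headroom: float = 0.3) -> int:
    """Number of nodes needed so that each node stays below its limit minus headroom."""
    usable_per_node = per_node_iops * (1.0 - headroom)
    return math.ceil(peak_iops / usable_per_node)


current_nodes = 20  # hypothetical cluster size
needed = nodes_required(peak_iops=1_200_000, per_node_iops=80_000)

if needed > current_nodes:
    # Either add nodes or reduce requests at the application level.
    print(f"short by {needed - current_nodes} nodes")
else:
    print("current capacity is sufficient")
```

If the cluster falls short, the gap in nodes gives a rough cost for the “expand” option that can be weighed against the effort of optimizing requests in the application.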
Trends of I/O requests to Redis clusters per second during the New Year of 2020
UsedMemory is the size of data stored in Redis. Given the characteristics of Redis clusters, when there are many PUT requests during the New Year, you need to closely watch UsedMemory numbers. It could be critical to services, depending on data characteristics, if Redis clusters exceed the maximum available memory due to a sharp increase in requests during the New Year. To avoid such a situation, we estimate the size of required memory from the past data and increase the memory, if necessary.
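One simple way to estimate the required memory from past data is to apply last year’s peak-to-baseline ratio to this year’s baseline. This is only an assumed estimation method with hypothetical numbers; the actual model LINE uses is not described in this post.

```python
from typing import Tuple


def memory_plan(
    baseline_now_gib: float,
    last_year_baseline_gib: float,
    last_year_peak_gib: float,
    max_memory_gib: float,
    safety_ratio: float = 0.8,
) -> Tuple[float, bool]:
    """Project peak UsedMemory and report whether the cluster should be expanded."""
    growth = last_year_peak_gib / last_year_baseline_gib
    projected_peak = baseline_now_gib * growth
    # Expand when the projection exceeds the share of memory we are willing to fill.
    return projected_peak, projected_peak > max_memory_gib * safety_ratio


projected_peak, must_expand = memory_plan(
    baseline_now_gib=40.0,
    last_year_baseline_gib=30.0,
    last_year_peak_gib=39.0,
    max_memory_gib=64.0,
)
print(round(projected_peak, 1), must_expand)
```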
Trend of UsedMemory of Redis clusters during the New Year of 2020
As mentioned earlier, LINE messaging servers have a daily release cycle, so they are continuously changing and evolving. When new features are added and improvements are made, we sometimes face unexpected side effects. These side effects can be insignificant under average traffic, but they can escalate into a crisis if the traffic suddenly increases many times over. That’s why we take extra care to assess, estimate and monitor New Year’s traffic.
When we were preparing for the first day of 2020, we focused on push notifications. LINE was in the process of phasing in Armeria, a high-performance asynchronous server/client library released by LINE. In 2018, Armeria was introduced to replace the push notification component and the HTTP protocol. We conducted the same BMT afterwards and verified that there was no issue. However, when we were ushering in 2019, we faced an unexpected problem. The traffic was close to the range we had verified through the BMT, but responses from the push component slowed down significantly from midnight. The push queue in the messaging servers filled up, and pushes were lost. We quickly restored the servers by removing some of the piled-up pushes and immediately started searching for the cause of the delays.
The analysis of related indicators and logs revealed that a huge number of connections were simultaneously made to the push component from 0 o’clock, leading to full GC (garbage collection) due to a sharp increase in the heap memory of the push component. This appeared to be the cause of slower performance.
We had tried to conduct the BMT by simulating traffic as close to the real-life pattern as possible. Yet there was still a variable that we did not foresee. During the BMT, we monitored closely and increased the traffic step by step up to the target TPS (transactions per second). With this method, we couldn’t test a situation where the traffic instantly multiplies.
Since we were using the Armeria client, which is an asynchronous HTTP client, for push notifications, it could have been configured to handle a high TPS with only a limited number of connections. Because we hadn’t configured Armeria this way, a flood of requests created an excessive number of connections. We held a retrospective meeting based on the analysis results and decided to carry out the following tasks for improvement.
- Improve performance
  - Create and reuse a limited number of connections and confirm that the target TPS can be reached
- Prepare safety nets
  - Segregate notification queues by role to minimize the impact of failures arising from a specific component
  - Introduce a circuit breaker to automatically block requests and isolate failures when a specific component becomes unresponsive, and take the necessary actions so that the downed service recovers quickly
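To make the circuit breaker idea concrete, here is a minimal, generic sketch. It is not Armeria’s circuit breaker and not LINE’s implementation; the class, its thresholds, and the injectable `clock` are assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Blocks calls to a component after repeated failures, then retries after a timeout."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a request may be sent to the component."""
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the timeout passes, then let trial requests through.
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        # A successful call closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open (or re-open) the circuit to isolate the failing component.
            self._opened_at = self._clock()


# Usage with a fake clock so the behavior is easy to follow.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
blocked = not breaker.allow()   # circuit is open right after the second failure
now[0] = 10.0
trial_allowed = breaker.allow()  # timeout elapsed, a trial request is allowed
print(blocked, trial_allowed)
```

While the circuit is open, the caller can drop or reroute pushes for that component instead of letting them pile up in a shared queue.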
After implementing these improvements, we carried out a new version of the BMT. We started with the original BMT, gradually increasing traffic up to the target TPS, and confirmed that it worked. Then we brought the TPS back down to the usual average and raised it to the target TPS all at once.
Testing by gradually increasing TPS and then raising TPS up to the target TPS at once
We concluded the improvement work after confirming that push notifications were sent stably over a limited number of connections even when the traffic was raised to the target TPS all at once.
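The effect of capping connections during an instant burst can be illustrated with a small `asyncio` simulation. It is a sketch only: a semaphore slot stands in for one reused connection, the sleep stands in for a network round trip, and none of this reflects Armeria’s actual connection handling.

```python
import asyncio


async def send_burst(num_requests: int, max_connections: int) -> int:
    """Send a burst of simulated pushes and return the peak number of connections in use."""
    limiter = asyncio.Semaphore(max_connections)
    in_use = 0
    peak = 0

    async def send_one() -> None:
        nonlocal in_use, peak
        async with limiter:  # wait for a free connection instead of opening a new one
            in_use += 1
            peak = max(peak, in_use)
            await asyncio.sleep(0.001)  # stand-in for the network round trip
            in_use -= 1

    # All requests arrive at once, like the jump to the target TPS.
    await asyncio.gather(*(send_one() for _ in range(num_requests)))
    return peak


peak_connections = asyncio.run(send_burst(num_requests=500, max_connections=8))
print(peak_connections)
```

However large the burst, the number of connections in use never exceeds the configured cap, and the remaining requests wait for a free connection rather than opening new ones.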
When faced with an unexpected problem, LINE focuses on preventing recurrence of the same problem by sharing the issue transparently and implementing necessary actions instead of pointing fingers. Such a forward-looking culture allows developers to fiercely seek innovations and keep their focus on developing better and stronger systems.
We also noticed that the Apache HTTP client used by LINE messaging servers underperformed after an upgrade. As reported in the HTTPCLIENT-1809 issue, the Apache HTTP client performed poorly beyond a certain throughput and couldn’t increase throughput even though the servers had remaining capacity. Luckily, since we were already phasing Armeria into LINE messaging servers, we could resolve this issue by replacing the Apache HTTP client with Armeria.
As explained earlier, we prepare Redis clusters by comparing their current capacity with the capacity required at the New Year. LINE messaging servers use Redis with “client side sharding” (reference), which is designed to distribute the load evenly across nodes. When the current capacity of a cluster clearly falls short of the required capacity, there are three ways to resolve the issue.
- If the Redis cluster is responsible for multiple roles, split up the cluster and set up an independent cluster
- Look for a more efficient way to use the cluster by improving the code at the application level
- Add nodes to the Redis cluster to reduce the traffic handled by each node
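As a rough illustration of client-side sharding and of the third option, the sketch below maps keys to nodes with a hash and compares the per-node load before and after adding nodes. The hash-modulo scheme and the key names are assumptions; this post does not describe the sharding function LINE actually uses.

```python
import hashlib
from collections import Counter
from typing import Iterable


def shard_for(key: str, num_nodes: int) -> int:
    """Pick a node index for a key on the client side using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def load_per_node(keys: Iterable[str], num_nodes: int) -> Counter:
    """Count how many keys land on each node."""
    return Counter(shard_for(key, num_nodes) for key in keys)


keys = [f"user:{i}" for i in range(10_000)]  # hypothetical key space
before = load_per_node(keys, num_nodes=4)
after = load_per_node(keys, num_nodes=8)
print(max(before.values()), max(after.values()))
```

Note that plain modulo hashing remaps most keys when the node count changes, so a production setup would typically pair node additions with a migration plan or a scheme such as consistent hashing.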
We’ll briefly introduce three cases of how we improved clusters for the preparation of the New Year this year.
First, we split up Redis clusters that handled many different types of data. LINE’s traffic volume is constantly on the increase, and we had to pick a specific cluster to handle New Year’s traffic. We classified data on this cluster and migrated selected data to a new cluster. You can see the IOPS before and after migration below.
IOPS before data migration
IOPS of the existing cluster after data migration
IOPS of the new cluster after data migration
After adding a new cluster with data migration, the number of requests to the existing cluster was reduced to 1/4 and the new cluster handled 3/4 of the requests. As we had to create a new cluster and migrate data without interrupting the service, we conducted this job over 4 to 5 days with close monitoring.
Second, to stabilize the IOPS of the cluster, we added functionality to detect malicious users who make abnormal calls to the service. We already have a system that automatically blocks malicious user requests based on their behavioral patterns, and it stops most malicious attempts. Unfortunately, malicious users constantly evolve and still trigger huge Redis IOPS in some clusters.
IOPS spikes from time to time at the New Year of last year, making traffic pattern almost meaningless
This year, we added new functionality to block malicious users faster and more accurately. The resulting, more stable traffic of the Redis cluster is shown below.
Stabilized cluster with IOPS climbing up along with requests during the New Year of this year
Third, we changed the Eviction Policy of a certain Redis cluster. This cluster showed a constant increase in UsedMemory. Setting a TTL (Time to Live) would not affect the data it stored, so we set a TTL on the data and changed the Eviction Policy to “volatile-ttl”. We also expanded the cluster so that it can retain data for the given TTL under the traffic expected during the New Year, which we estimated from past traffic patterns.
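Under “volatile-ttl”, Redis evicts only keys that have an expiry set, preferring those with the shortest remaining TTL. The sketch below models that selection rule exactly, whereas real Redis approximates it by sampling keys; the key names and TTL values are hypothetical.

```python
from typing import Dict, List, Optional


def evict_volatile_ttl(ttl_by_key: Dict[str, Optional[float]], keys_to_free: int) -> List[str]:
    """Choose eviction victims: only keys with a TTL, shortest remaining TTL first."""
    volatile = [(ttl, key) for key, ttl in ttl_by_key.items() if ttl is not None]
    volatile.sort()  # shortest remaining TTL comes first
    return [key for _, key in volatile[:keys_to_free]]


# Remaining TTL in seconds; None means the key has no expiry and is never evicted.
ttls = {"session:a": 30.0, "profile:b": None, "session:c": 5.0, "session:d": 120.0}
victims = evict_volatile_ttl(ttls, keys_to_free=2)
print(victims)
```

This is also why the TTL had to be set on the data first: with no expiring keys there is nothing for this policy to evict, and Redis then rejects writes once it hits the memory limit.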
Retrospective on preparations for the New Year
The last step in our preparations for the New Year is a retrospective. We document all relevant information, covering not only messaging servers but also the traffic patterns and issues of LINE’s other server components. After New Year’s Day, everyone meets and holds a retrospective meeting based on this documentation. Since the launch of LINE, we have never missed a year of preparing this documentation, and the teams responsible for each component produce the relevant documents. These documents serve as important references for the next year’s preparations, so we try to include as much detail as possible. We record traffic patterns with accurate numbers and graphs and keep snapshots of dashboards at specific times.
When all the documents are ready, each component team meets up to hold a retrospective meeting. Some development teams are stationed overseas so we utilize video conferences as well. The key aspect of every year’s retrospective meeting is finding improvements for each component, identified during the New Year. If there is even a tiny anomaly in data observations, we share it transparently as well as its improvement plan.
Again this year, we were able to handle New Year’s traffic successfully in each country thanks to the hard work of all participating members. There was no problem per se, but we noticed something unusual. When messaging servers were calling other components, there were a few cases of longer-than-average latency. The logs of the component indicated that it took longer to create a new connection than to process the actual requests. When traffic many times larger than usual is processed in an instant, you can come across cases that do not emerge on average days. After detecting this issue, we made an improvement plan to use connections efficiently with HTTP/2 when communicating with the component. This work is still underway. For your information, Armeria, which is widely adopted at LINE, supports HTTP/2, so it was easy to convert to HTTP/2.
At LINE, we will prepare for the upcoming New Year just as we explained in this post. We make unswerving efforts every day to build a better system by transparently sharing even small anomalies and going through the improvement and retrospective processes. If you are interested in and want to be a part of LINE’s ever-evolving server component development, please don’t hesitate to apply!
- Messenger Server Platform development / LINE platform
- Distributed Storage Reliability Engineer / LINE platform