On December 10th, 2017, we celebrated the second anniversary of LINE LIVE, our live streaming service. The service has been up and running ever since its launch with no trouble, thanks to the efforts of countless engineers at LINE. In this blog post, as a member of ITSC, the team running LINE's global infrastructure, I'd like to share some of the decisions we had to make in the initial phase of designing and implementing the system. Back then, I had absolutely no experience in building or running a live media service.
LINE LIVE is a service that lets anyone broadcast from anywhere. We had to expect an unpredictable number of users accessing the service, broadcasting simultaneously and viewing streams flawlessly. To meet these expectations, we defined the following minimum requirements.
Scaling out the system shall be easy and quick when we are running low on resources
We assumed that predicting the exact number of broadcasters and viewers would be practically impossible. Our service had to be capable of serving as many users, and handling as many simultaneous processes, as possible.
Stability shall be guaranteed for broadcasters and viewers using the service simultaneously
LINE LIVE allows LINE OAs (Official Accounts) to broadcast too. When a LINE OA starts broadcasting, the account sends out a link to its friends to access the broadcast. This means that a great number of simultaneous connections can be made, resulting in traffic spikes.
The system shall be designed to be flexible to accommodate changes
We anticipated having to change our infrastructure as the service grew. To keep such changes simple, the initial design had to be flexible and reduce coupling as much as possible. We figured it was only a matter of time before we would scale out with more media servers to support an increasing number of broadcasters, so we tried to keep the level of coupling in our architecture low.
A structure that satisfies all of these requirements resembles MSA (microservice architecture), an architecture commonly used these days. To cut a long story short, here is the conclusion: to minimize coupling, we decided to separate transcoding and recording from the rest of the features.
Structure of LINE LIVE encoder layer
The overall process inside the LINE LIVE encoder layer is illustrated below. You will get a better understanding as we go into the details of each component of the layer.
Generally, media servers are in charge of handling the following features:
- Converting RTMP signals sent by broadcasters to a viewable format
- Verifying RTMP signals
- Recording broadcasted streams
On top of providing these basic features, we had to take the following factors into consideration:
LINE LIVE supports various resolutions and bitrates so that users can access the service from different types of network and device environments. Supporting such variety requires transcoding. Video transcoding is an expensive task that consumes a lot of server resources, which meant we would need a large number of servers.
An increasing number of broadcasters
The more broadcasters there are, the more media servers we need to bring in. However, adding media servers must have as little impact on the rest of the system as possible.
LINE LIVE uses the Wowza media server. The authentication technique employed by Wowza relies on its servers and their local files, which doesn't help much in managing LINE LIVE users. One option was to build our own authentication system and integrate it with our service. Wowza provides an SDK for integrating third-party modules with its system, but we decided not to use it due to stability concerns. The other side of the story was that, even if we were to build our own module, we had neither the resources to develop it nor the time to verify its stability. Moreover, a customized module would have to be re-verified every time Wowza was updated, which would be burdensome. Long story short, we decided to take the authentication module out of our media servers.
Having decided to take the authentication module out of the media servers, we searched for an alternative. The module had to handle the following two tasks:
- Upon receiving RTMP signals, accept or drop the signals after verifying with the LINE LIVE authentication system
- Relay RTMP signals to a media server if authenticated successfully
What we needed was a proxy: an application that would control RTMP signals. Our research showed that common load-balancing setups use HAProxy. We could relay RTMP signals with HAProxy, but whenever a media server was added or removed we would have to change every proxy server's configuration, and above all, authentication remained unsolved. So we moved on to combining HAProxy with a Lua script for authentication. While we were at it, we came across an article on the FIFA 2014 World Cup live stream architecture, and it gave us an idea: we could make good use of nginx-rtmp-module (an RTMP module for NGINX). To prove feasibility, we ran a PoC (proof of concept) using nginx-rtmp-module for authentication and for relaying RTMP signals. The result was satisfying. We then ran a series of load tests and long-running tests to check stability, and decided to adopt nginx-rtmp-module as the proxy for our service. With the proxy servers in place, the media servers were relieved of external access, allowing us to simplify the endpoint for external connections.
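As a rough sketch of how nginx-rtmp-module can cover both tasks (this is not our actual configuration, and the host names are made up):

```nginx
rtmp {
    server {
        listen 1935;

        application live {
            live on;

            # Task 1: verify each incoming stream against an external
            # authentication endpoint. nginx-rtmp-module sends an HTTP
            # request when a client starts publishing; any non-2xx
            # response makes it drop the RTMP connection.
            on_publish http://auth.example.internal/verify;

            # Task 2: relay the verified RTMP stream to a media server.
            push rtmp://media01.example.internal/live;
        }
    }
}
```

The `on_publish` and `push` directives are part of nginx-rtmp-module; the authentication endpoint and media server address shown here are placeholders.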
To maintain service stability, the manager server takes care of various tasks such as the following:
- Verifying broadcast stream keys
- Monitoring media servers
- Handling load balancing for media servers
- Preventing redundant streaming of broadcasts with identical stream keys
- Handling subsequent tasks after finishing a broadcast, such as generating a VOD
- Managing idle connections
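To make two of these duties concrete, here is a minimal Python sketch of preventing duplicate stream keys and load-balancing broadcasts across media servers. The class, its in-memory state, and the server names are hypothetical simplifications, not our actual implementation.

```python
class ManagerServer:
    """Toy model of the manager server's allocation bookkeeping."""

    def __init__(self, media_servers):
        # media server name -> number of active broadcasts
        self.load = {name: 0 for name in media_servers}
        # stream key -> media server currently encoding it
        self.active = {}

    def allocate(self, stream_key):
        # Prevent redundant streaming with an identical stream key.
        if stream_key in self.active:
            raise ValueError("stream key is already broadcasting")
        # Load balancing: pick the server with the fewest broadcasts.
        server = min(self.load, key=self.load.get)
        self.load[server] += 1
        self.active[stream_key] = server
        return server

    def finish(self, stream_key):
        # Free capacity when a broadcast ends; subsequent tasks such
        # as VOD generation would be triggered here.
        server = self.active.pop(stream_key)
        self.load[server] -= 1
```

A real manager server would also fold in monitoring data when choosing a server and clean up idle connections; this sketch only shows the allocation logic.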
CDN and origin server
To secure stable viewing experience, two problems had to be solved:
- Internal routing for CDN requests
- Handling traffic spikes
The number of friends of most LINE OAs ranges from hundreds of thousands to over ten million. When a LINE OA starts broadcasting, a link is sent to its friends so they can start viewing the broadcast. Often, a lot of users attempt to view the program simultaneously within a short period of time, which could cause the following two problems.
Firstly, normal requests may be taken for a DDoS attack. Broadcasts are streamed to LINE users through a CDN. If multiple users send the same request to the CDN in a short period of time, the CDN could encounter cache misses, letting user requests through all the way to the origin server. If too many identical requests come through, the origin server may end up treating normal requests as a DDoS attack.
Secondly, the CDN does not know which media server a broadcast is served from. The manager server allocates broadcasts to media servers, and that allocation can change dynamically. For example, when an error occurs while a broadcast is being encoded on a media server, the system retries and, if needed, the manager server may assign the broadcast to a different media server. The CDN cannot respond instantly to such sudden changes because it is unaware of which media server is allocated to which broadcast.
To solve these issues, we needed an origin server to take care of CDN requests. An origin server can be set up in either of the following ways:
- Option A: Store broadcast information in a shared storage and set the shared storage in the origin server
- Option B: Set up a proxy server to link CDN requests to media servers
One of our biggest concerns was whether and how we could minimize the scope of an unexpected outage. To cut to the chase, we chose option B, because shared storage was a problem: an error in the shared storage could affect the whole service. Also, the more broadcasts, the more disk I/O the storage would have to handle, eventually forcing us to expand it. The absence of protective measures, such as a cache for handling traffic spikes, was a problem too.
Here is why option B works. To set up an origin server with a proxy server, we needed to add logic that routes CDN requests to the right media server. Written well, this routing can be handled entirely in memory, requiring no disk I/O. With option B, the causes of an outage are limited to the code written by our engineers, and because an outage is bound to each machine, we can respond to it quickly. As a protective measure, we added a caching feature to our proxy server for when identical requests are received in a short period of time: send only a minimal number of requests to the media server, queue the identical requests, cache the response, and serve the queued requests with cache hits. As a result, we reduce the load on media servers and keep the load balanced even when many users send identical requests to the origin server simultaneously.
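The queue-and-cache logic described above is essentially request coalescing: the first request for a key fetches from the media server, while identical requests arriving in the meantime wait and are served from the cache. Below is a minimal single-process illustration, assuming a hypothetical `fetch` function standing in for the call to the media server; it is not our production code.

```python
import threading

class CoalescingCache:
    """Serve identical requests with one upstream fetch."""

    def __init__(self, fetch):
        self.fetch = fetch        # asks the media server for a key
        self.cache = {}           # key -> cached response
        self.inflight = {}        # key -> Event set once fetched
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:
            if key in self.cache:
                return self.cache[key]        # cache hit
            event = self.inflight.get(key)
            if event is None:
                # First requester becomes the leader and will fetch.
                event = self.inflight[key] = threading.Event()
                leader = True
            else:
                leader = False
        if leader:
            value = self.fetch(key)           # single upstream request
            with self.lock:
                self.cache[key] = value
                del self.inflight[key]
            event.set()                       # release the queue
            return value
        event.wait()                          # queued requesters wait
        return self.cache[key]                # served with a cache hit
```

A production proxy would also expire cache entries and bound the wait time, but the core idea is the same: many identical requests, one trip to the media server.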
Since LINE LIVE can also be used on PCs, we are planning to support high-definition streams sent from PC clients, such as game broadcasts. We are continuously seeking to provide the best user experience to serve the various needs arising from our expanding service.
It’s been a while since my last post, and I am very pleased to share this side of the story of the LINE LIVE service. Below are some materials on LINE LIVE and a few articles that were helpful in building our live broadcasting system. You can see from the article on Facebook Live, a service launched around the same time as LINE LIVE, that they too had to work out problems similar to ours. For those of you going through the same thing, I hope the following articles will be of help.
- Presentations on LINE LIVE
- References for setting up a live broadcasting service