The Team & Project series takes a look at different departments within the development organization at LINE Corp. (“LINE”) and introduces their roles, team structure, tech stacks, challenges that lie ahead, roadmaps, and more. In this edition, we sat down to chat with the team that develops the LINE app’s messaging server and develops and operates the Apache Kafka platform. Our guests today are Wonpill Seo, Masahiro Ide, Javier Luca de Tena and Yuto Kawamura of the Z Part Team.
Could you start off by telling us about yourselves?
Seo: I’m the manager of the Z Part Team. We develop and operate the LINE app’s messaging server and the Apache Kafka platform.
Ide: As a member of Z Part, I mainly operate Redis clusters used on the messaging platform and develop services that use Redis. I also integrate and support LINE’s services with Armeria, an open source Java RPC framework developed by LINE.
Javier: I work as the tech lead of the HBase Unit under Z Part. We are mainly responsible for developing and operating our HBase clusters, as well as the business logic in our messaging backend that uses them. I like solving scalability, transactional, and reliability issues.
Kawamura: Since joining LINE as a new graduate in 2015, I’ve been a member of the Z Part Team, which is responsible for developing the LINE app’s servers. I am now also a tech lead of the project that develops and operates the Apache Kafka platform, which is provided across LINE’s services. After joining LINE, I was in charge of HBase cluster operations and automation for the first few months. After that, I started the project to improve the LINE app’s server architecture using Apache Kafka. I am still involved in the project to this day. I like performance engineering and reliability engineering.
Can you tell us why you decided to join LINE? And what you find rewarding in your work?
Seo: My colleague from my previous job who started working for LINE introduced me to this job.
Ide: I joined LINE five years ago as a new graduate in engineering. I almost never used the app before joining the company, but I heard about it from my family and friends who used it. At LINE, we can clearly see how we impact users in the real world. The biggest reasons why I joined the company were the large userbase, complexity of its services, and how development is focused on things like performance, availability and reliability of the services.
Javier: At LINE, it’s fascinating to work on large-scale projects with millions of users and billions of requests a day. We face big challenges in the area of performance, availability, reliability, massive data transaction, and abuse detection. All of them really matter, and not many companies in Japan work on challenges on this scale. Even just the fact that the LINE app is widely used is a great source of motivation for development. Another great thing about working for LINE is that you are given a lot of freedom on how you work and who you work with.
Kawamura: I worked part-time in LINE’s service development when I was a student (livedoor Co., Ltd. at the time, which was integrated into NHN Japan soon after I started). In college, my research focused on topics like low-level container technology, but I was able to work with a wide variety of web-based technologies in my part-time job. After working for a few years in service development, I became interested in working as a backend developer. Luckily, I was invited by the members in charge of the LINE app’s backend development to join Z Part after graduation. LINE has development centers all over the world, and it’s exciting to be able to work with such a diverse group of talented engineers. The biggest reasons why I decided to join LINE were that there are a lot of opportunities to speak English and that, in a good way, it doesn’t have the typical Japanese work culture.
Seo: The work that we do affects the daily lives of many people. It is very rewarding to know that my job has a great impact and influence on people across the world.
Ide: We’ve been very particular about using Redis to improve the performance of our services. However, as it’s an in-memory database, data stored in memory is lost when the server process ends. There are many challenges in securing that data, so we try new things, like migrating to persistent storage such as HBase, to operate in a safer and more stable manner. Today, I find it rewarding to be able to take on these challenges.
Javier: We have a big responsibility for LINE’s messaging platform. We provide the primary storage for a large part of the current and future features of LINE’s messaging backend. Our storage systems handle large amounts of traffic every day, so we need to ensure data consistency while improving reliability, availability, and performance, as well as keeping system abusers away. We’re also deeply involved in the business logic of the messaging platform, so we support and develop new features while improving the core logic. I think the type of work we do is very rare in Japan, and very interesting.
Kawamura: I agree. It’s difficult to find many companies in Japan where engineers can engage in this sort of work we do at LINE. What I find rewarding about my job is not only to do with the scale of LINE’s operations, but more importantly, the fact that it’s a social infrastructure. Social gaming was the trending and growing market when I joined LINE, but I wanted to work on services that more directly enriched the lives of many people. In this regard, LINE significantly influences the daily lives of our users and the society. We have the big responsibility of maintaining high reliability as many of our users use LINE as a daily communication infrastructure. It’s rewarding for me to be involved in various technical challenges that come from high scalability and reliability, while feeling a good amount of pressure that comes with the work.
Could you tell us about the team’s structure and role?
Seo: LINE’s messaging platform, born with the LINE app, is currently developed by engineers in Japan and South Korea. Since the app’s release in 2011, dozens of types of user data, such as messages and social graphs, have grown dramatically with the rapid growth of our services. As a result, the number of servers increased by the thousands in the storage layer alone, and the challenges weren’t limited to operating costs: no common solution could solve our issues. We faced a rapid increase in uncommon requirements related to scalability, reliability, and complexity.
Thus, the Z Part Team was created to tackle these issues at the root. Our team is further divided into groups of five to six, and each of these groups develops and operates one of three technological domains that are central to the messaging platform’s storage: Redis, HBase, and the Apache Kafka platform. The Redis and HBase groups not only operate the OSS servers, as their names suggest, but are also in charge of all the work required to use each OSS, including coding in the application layer. I’ve been involved in development since the planning stage, and we cooperate with other development teams to understand user scenarios while accounting for massive amounts of user traffic. Then, we find and implement an appropriate, optimal solution. Furthermore, every team member, as an expert in various OSS projects, takes on new challenges every day and proactively shares experiences and best practices for solving them across the team. They also contribute patches to various OSS projects and release our company’s solutions as open source.
LINE isn’t just a private messaging service anymore––it is evolving into a huge platform that quickly provides users with information on products, finance and disasters, just to name a few, by linking it together with services of our own and other companies. To keep up with these changes, the Z Part Team focuses on protecting and improving the user experience with advanced technology.
Kawamura: Our team is in charge of developing and operating the Apache Kafka platform. Initially, we started out as a project to improve LINE’s server architecture using Apache Kafka. The model we constructed for the project worked so well that it was adopted by other services and projects. As we were already operating highly reliable Kafka clusters, we began receiving requests to use the infrastructure, and our clusters naturally evolved into a platform. It’s only been a few years since LINE started using Apache Kafka, but with more than 80 different services using it now, it’s already one of the most popular pieces of middleware at LINE. We spend a lot of time on reliability engineering to maintain high levels of service while dealing with enormous amounts of traffic. We work hard to ensure that all services can use the Apache Kafka platform in the ideal way, including client-side settings, by focusing mainly on SLO visualization, automation, architecture proposals, troubleshooting, and the preventive engineering that follows.
Can you tell us about your team members?
Seo: At Z Part, we have a diverse group of engineers from various countries. For example, we have an engineer who often presents at conferences and writes magazine articles as an OpenJDK contributor and a JVM specialist. Kawamura, who leads the development and operations of the Apache Kafka platform, has presented at the Kafka Summit and the Japan Java User Group conference about LINE’s Apache Kafka clusters and how some issues were solved. Ide, who is in charge of Redis, actively showcases LINE’s technologies at tech conferences like LINE DEVELOPER DAY.
Tell us about the technologies and development environment that the team is using.
Seo: We use Java as our main programming language, but we also use Python, C, Scala, or Lua where technically suitable. We use everything from libraries like Spring, Thrift, and Protocol Buffers to frameworks like Akka and Central Dogma. Basically, we have a core set of technologies that are strictly selected and managed, but the rest is chosen relatively freely based on technical suitability and maintainability.
What challenges are the team now facing and how are you addressing them?
Ide: Migrating to persistent storage and managing operational costs are our two major challenges. Being in charge of Redis, we believe it’s essential to migrate to persistent storage (HBase), because most of the data for the messaging service we’ve developed is currently stored in Redis. As I said before, Redis is an in-memory database, which means data in memory will be lost if the server process ends. So, we must migrate to HBase to keep the service fault tolerant, and we need to make sure that performance isn’t sacrificed during the migration.

We have about 50 Redis clusters and more than 10,000 instances, operated by only five to six people. Most of the operation is done manually, so we need to reduce downtime and provide higher-quality Redis services in a stable manner. Since 2020, the ratio of people mainly working in development to those working in operation has been 3:2. Our roles are not completely separate though, and we may take on different roles depending on what needs to be done at the time. Each team member is essentially responsible for different clusters, but we maintain flexibility in our roles as we also work with engineers in Korea.

Moving on to operational costs, we want to increase automation so that we can reduce the cost of manual operations and focus more on development issues. Specifically, we’re currently trying to automate cluster management using our own cluster management tool together with Docker and Ansible. Part of this system is already in operation, but it’s still in development.
Redis at LINE (Japanese only)
Javier: At the HBase Unit, we’re engaged not only in maintaining and operating our Apache HBase clusters, but also everything around them, working to improve the core business logic and performance of the LINE messaging backend. Our focus is divided into several challenges:
- Making HBase the primary storage for most of our features in the LINE messaging backend.
- Building a new service that will store, handle, and synchronize user settings in a generic way.
- Migrating to new versions of HBase.
- Helping other teams develop new features while also improving core features.
- Improving clusters while reducing costs.
- Improving the disaster recovery plan.
Firstly, we’re trying to make HBase the primary storage for most of the core features of LINE’s messaging service, which is what the Redis Unit has been working on as well. There are several challenges in making HBase the primary storage, including maintaining data consistency and handling transactions between Redis and HBase under the large amount of traffic that we have. Since this is very challenging, we work with other teams to make architectural and logical changes, like turning Redis into a cache and changing the way we access HBase. Secondly, we’re building a new service to store, handle, and synchronize user settings in a generic way, enabling the addition of new settings without additional development costs. Thirdly, we’re migrating to new versions of HBase. We’re doing this not only for new features, performance, and community support, but also to solve issues with the old versions of HBase, like single points of failure. While the migration is ongoing, we’ve designed a system to control and adjust multiple storages, reducing the number of potential inconsistencies.
Furthermore, we’re contributing to the open source community by evaluating new versions of HBase. Our fourth focus is to help other teams develop new features while improving the core ones. This includes database schema design, cluster building and tuning, monitoring, and being involved in the business logic of the application. As this requires a lot of coordination with other teams, we get the job done by communicating with them effectively. Our fifth focus is to keep improving the reliability, availability, and performance of our clusters while reducing costs. We’re currently focusing on improving recovery time after a machine failure and the performance of disaster recovery clusters, as well as improving DevOps and automation tools to reduce costs. Lastly, our sixth focus is to improve our disaster recovery plan and embrace a more multi-datacenter-oriented architecture for the future. We’re optimizing the system so that we can switch to a different datacenter as fast as possible, and we’re trying to make better use of our datacenters by optimizing our clusters and letting other teams use the resources. To avoid having to switch all traffic between datacenters, we’ve started thinking about a more active disaster recovery plan and multi-datacenter-oriented approaches. We’re also planning to look into new types of databases.
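The "Redis as a cache" change Javier mentions can be pictured as a read-through cache in front of a persistent store. The sketch below is an illustration of that pattern only, not LINE's actual code; the class name is hypothetical and plain dictionaries stand in for the real Redis and HBase clients.

```python
class CacheAsideStore:
    """Read-through cache: a Redis-like cache in front of an
    HBase-like primary storage. Dicts stand in for real clients."""

    def __init__(self):
        self.cache = {}    # stands in for a Redis cluster
        self.storage = {}  # stands in for an HBase table (primary storage)

    def put(self, key, value):
        # Write to the primary storage first, then invalidate the cache
        # so a stale entry is never served after a successful write.
        self.storage[key] = value
        self.cache.pop(key, None)

    def get(self, key):
        # Serve from the cache when possible; on a miss, fall back to
        # the primary storage and populate the cache for the next read.
        if key in self.cache:
            return self.cache[key]
        value = self.storage.get(key)
        if value is not None:
            self.cache[key] = value
        return value
```

The key property is that HBase remains the source of truth: losing the cache costs only read latency, not data, which is exactly what makes an in-memory store safe to keep in the read path.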
Kawamura: The big challenge we face, as developers and operators of the Apache Kafka platform, is to keep the platform running under optimal conditions for many services by improving its stability and reliability. As I mentioned before, we started out as a single project, and as more services started using the platform, the number of streaming records and users grew with it. Thus, the first challenge we faced was the stability and reliability of the platform itself. We had a major system failure about two years ago that affected various services relying on the platform, so every effort was made to improve reliability last year. You can hear me talk more about this in the presentation I gave at LINE DEVELOPER DAY 2019. In terms of cluster availability per year, we achieved 99.999%, or five nines, a level equal to or higher than that of major cloud service vendors. This means that cumulative downtime per year is around five minutes. Furthermore, our Kafka cluster’s most important API response times are held to a strict target of under 40 ms at the 99th percentile. Building on last year’s achievements, our next challenge is roughly divided into two parts.
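As a sanity check on those numbers, an availability target translates directly into a yearly downtime budget; five nines leaves only about five minutes:

```python
def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget implied by an availability target."""
    minutes_per_year = 365.25 * 24 * 60  # average year, incl. leap days
    return minutes_per_year * (1 - availability)

# Five nines leaves roughly 5.3 minutes of downtime per year.
print(round(downtime_minutes_per_year(0.99999), 2))  # → 5.26
```

By comparison, "four nines" (99.99%) would allow nearly an hour of downtime per year, which puts the difficulty of the five-nines target in perspective.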
The first part is to decrease operational costs by implementing automation aggressively. Due to the efforts we’ve made so far, we’ve achieved high reliability. However, that came with a very heavy strain on human resources which continues to this day. Assuming that our userbase will continue to grow, we predict that the cost of human resources will inevitably stay the same or increase. Therefore, we need to automate various tasks and transform Kafka into an autonomous platform. A specific example is to create a mechanism by which the platform automatically disconnects a failed machine from service and restores lost data using another machine as a replica, without any manual operation. Currently, we’re developing Kafka into a completely autonomous platform so that the manual operations we’ve been doing to maintain reliability can be performed automatically by machines.
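The failover behavior Kawamura describes (disconnecting a failed machine and serving data from a surviving replica, with no manual operation) can be modeled in miniature. This is a toy simulation of the idea, not Kafka's actual controller logic; the function name and broker IDs are made up for illustration.

```python
def reassign_on_failure(assignment, failed_broker):
    """Given partition -> [leader, *followers], drop a failed broker
    and promote the first surviving replica to leader. A toy model of
    automated failover, not Kafka's real leader-election code."""
    new_assignment = {}
    for partition, replicas in assignment.items():
        survivors = [b for b in replicas if b != failed_broker]
        if not survivors:
            # With no surviving replica, data would be lost; a human
            # (or a higher-level recovery process) must step in.
            raise RuntimeError(f"partition {partition} lost all replicas")
        new_assignment[partition] = survivors  # survivors[0] is new leader
    return new_assignment
```

The point of automating this step is that it runs in seconds at any hour, whereas paging an on-call engineer to do the same reassignment by hand directly eats into the five-minute yearly downtime budget.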
Apache Kafka is now widely used within LINE, and as a highly reliable piece of infrastructure, it’s reducing development costs and improving the stability of various services. However, the second part of the challenge we face is that some services aren’t using Apache Kafka in an optimal way, as their usage is still in the early stages. Supporting each individual case would drive up personnel costs, so we’re developing tools such as SDKs for internal use, client libraries, and Decaton. Released in April 2020, Decaton is a Kafka consumer framework designed to enable concurrent processing of records consumed from one partition, which is something most other Kafka consumer frameworks can’t do. It can deliver high throughput, especially with I/O-intensive workloads. Many services already use Decaton, including the LINE messaging app, which processes over one million tasks per second, and Smart Channel, introduced in the blog post "Usage examples of Decaton, a job queue library that uses Kafka". We’ll continue to develop more Decaton features so that it can be applied to more services and process tasks with high efficiency.
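The idea behind Decaton, processing records from a single partition concurrently instead of one at a time, can be sketched with a plain thread pool. This is a simplified illustration of the concept only, not Decaton's actual API (which is Java); the function names are hypothetical, and real Decaton also preserves per-key ordering, which this sketch omits.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def process(record):
    # Stand-in for an I/O-bound task handler (e.g. an external API call).
    time.sleep(0.05)
    return record.upper()

def consume_serially(records):
    # Baseline: one record at a time, as a plain consumer loop would.
    return [process(r) for r in records]

def consume_concurrently(records, workers=8):
    # Decaton-style idea: fan records from one partition out to a pool
    # of workers so that I/O waits overlap instead of adding up.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, records))
```

With 16 records and a 50 ms handler, the serial loop needs about 0.8 s while the concurrent version finishes in roughly two waves (about 0.1 s with 8 workers), which is why the approach shines on I/O-intensive workloads.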
Lastly, please give a message to the readers interested in joining the Z Part Team.
Seo: We respect individuals and believe in their abilities and strengths. We welcome those who are tired of easy, obvious and unchallenging problems.
Ide: In our team, you’ll be able to develop large-scale distributed systems that are used by hundreds of millions of users. Please join us if you’re interested in taking on challenges like improving system availability, reliability, and performance, while working alongside other teams and members from many different locations.
Javier: If you want to work on large-scale projects with lots of users and requests, work with distributed systems, and take on difficult tasks to overcome problems like scalability, availability, reliability, and transactions, then please join our team. We consider all aspects of the technology involved as we work, including making contributions to the open source community. We also have an international culture. We don’t expect people to have a lot of experience with the technologies or distributed systems that we use, but we need people who love to learn new things and improve their skills day by day.
Kawamura: Unlike typical web service development, you’ll be able to take on difficult technical challenges that come from high traffic and high service level demands, as well as the big responsibility that comes with it. You don’t have to worry about needing experience in distributed middleware or reliability engineering, because the culture here encourages people to spend time learning at work. This workplace is recommended for people who like to keep improving their skills, and you’ll learn a lot from working with some of the most talented engineers in Japan.