
【Team & Project】Meet the Team Developing the Verda Platform Using OpenStack and Kubernetes

LINE Engineering, 2020-06-03

LINE Engineering Blog official account

「Team & Project」 takes a look at different departments within LINE's development organization and introduces their roles and team structure, tech stacks, challenges that lie ahead, roadmaps, and more.
In this edition, we sat down to chat with members of the Verda Platform Development Team. The team is part of the Verda Department, which in turn sits under the IT Service Center (the organization that supervises all infrastructure), and it works on development for the Verda Platform.

Verda Platform Development Team at the zoom meeting

Based on OpenStack and Kubernetes, Verda is used throughout the company as LINE's private cloud. LINE's rigorous security and privacy policies meant that LINE had few options among existing Cloud Service Providers. So, after considering security, scale, platform maturity, and cost factors, LINE decided to build its own private cloud.
It currently serves over 2,500 in-house developers, while around 60 infrastructure engineers work on additional development for Verda itself.
Our guests today hail from within this development organization: Yuki Nishiwaki, manager of the Verda Platform Development Team, and Masahito Muroi, the team's tech lead for everything IaaS-related.

-- Could you start off by telling us about yourselves?

Nishiwaki: I'm the manager of the Verda Platform Development Team. We develop OpenStack, Bare-metal components, and more for our IaaS platform. We also work on operations and development for container platforms built on IaaS.

Muroi: I joined the Verda Platform Development Team last summer and am currently the tech lead for the IaaS team. So even though I'm quite new, I do have some knowledge of OpenStack since I've worked with it for many years.

-- Can you tell us why you decided to join LINE? And what you find rewarding in your work?

Muroi: For me, I was really interested in both the fact that the team's working language was English and that the platform's development structure was geared towards end users. This organizational approach really sealed the deal. The platform's end users aren't developers, so the focus was on how the platform could provide value to these users.

Nishiwaki: Well, I was really drawn to LINE's global-level services and development centers—it felt like an environment where engineers could be acknowledged for their performance and grow at the same time. If you raise your hand, there's no end to what you can do here. Also, I thought it was really interesting how one organization provided all [the company's] infrastructure. Plus, how the private cloud could also offer all the different developers that infrastructure along with a development environment and framework for streamlining development.

Muroi: At LINE, the company approach is to, where possible, "have everyone be a specialist." This is backed up by the organizational culture. There's such a diversity of employees, so I also find it very rewarding to take part in an internal culture and design/development that keeps global offices in mind.

For our team especially, it's really fascinating how we're able to work together with internal users on forward-looking tasks and feature development, while at the same time being part of the back office side as the department that develops the company's private cloud.

Nishiwaki: I agree. Not only are we able to work "globally," but we tackle so many different challenges that are technologically intriguing or have a big impact. By finding solutions, we gain diverse experiences. And from that, the number of things we're able to do only increases. This has been the most rewarding part of the job for me.

From left to right: Nishiwaki, manager of the Verda Platform Development Team, and Muroi, who leads the IaaS area

-- Could you tell us about the team's structure and role?

Nishiwaki: Well, as of April 2020, the Verda Platform Development Team has 16 team members. Two of them are located at LINE's Kyoto office. The team is broadly divided into two groups: the Computing Abstraction Group and the IaaS Group.
In the three years since a private cloud was introduced at LINE, the lead time for procuring infrastructure resources like servers has reduced, and developers now have more flexibility in creating and deleting infrastructure resources. Also, many developers have been creating infrastructure resources for all sorts of internal systems on this private cloud.

From the day the private cloud was set up and up until now, our team has focused on the ability to easily create and delete resources, and reducing the lead time for providing resources. But the ways in which the private cloud was being used began to expand along with the increased scale and growing number of supported services and developers. This meant that it wasn't always used in ways we expected. The amount of extra infrastructure resources being prepared increased as more users used the private cloud GUI/API to assemble resources in line with their workloads. Overall, we began to see that the amount of surplus resources was something we wouldn't be able to ignore any longer. The increased scale and variation in services has also caused scaling and operational cost issues for the private cloud as well.
To tackle these challenges, our team currently shares several team-wide missions and is divided into the Computing Abstraction Group and IaaS Group.

The Computing Abstraction Group's mission is to primarily provide a private cloud that runs on an abstraction layer like Kubernetes—instead of a virtual or physical machine—and ensure users aren't directly aware of the infrastructure resources on the cloud. Keeping this in mind, the Group works on operations and development of a managed Kubernetes service and is also in charge of the following:

Through the abstraction layer, reducing risks from unexpected usages by lowering—in a good way—the level of freedom that users have in using infrastructure.
Through the abstraction layer, determining what systems are running and, taking that into account, improving usage efficiency for the underlying infrastructure resources of the abstraction layer under the leadership of the Infrastructure Team.
Providing a company-wide CI/CD system and reducing costs of non-development activities where possible, based on the premise that an abstraction layer is introduced in the execution environment.

On the other hand, the IaaS Group's mission is to support a scale that is large enough for IaaS resources to be used as the abstraction layer's underlying resources, and to provide an interface that allows the abstraction layer to closely control, for example, scheduling and resource limits. Based on this mission, the Group is currently working on operations and development of OpenStack and in-house Bare-metal components.

Also, our team's overarching mission is to build an operating environment for the private cloud's control plane that would let a small number of people operate several components at a large scale.

Ultimately, we see the OpenStack and in-house components under our control as examples of complex microservices, and are aiming to have the Computing Abstraction and IaaS Groups help in creating a mechanism that can bring down the associated operating costs to as close to zero as possible.

-- Can you tell us about your team members?

Nishiwaki: The Verda Platform Development Team's members come from a number of different countries such as Russia, China, Taiwan, America, France, and India. That's why our team uses English as our main language of communication.

Everyone also comes from a variety of backgrounds. For example, one person was the project team lead for a component within an OpenStack community, another was an OSS core reviewer, some were very active in OSS communities, and others have given talks before at overseas conferences.

-- Tell us about the technologies and development environment that the team is using.

Nishiwaki: Our team is using a variety of technologies and software to cover a broad range of responsibilities, including the IaaS and PaaS layers (managed Kubernetes service). LINE uses OpenStack for the IaaS layer, so OpenStack components and its related technologies are the technology stack that we are using.

Technical and development environment of Verda Platform Development Team

For example, we use Designate, which provides DNS as a Service, and we use Kubernetes as middleware to run OpenStack and in-house components. Kubernetes is the core infrastructure for running control plane processes, such as OpenStack, on the private cloud.
We chose Kubernetes because it offers a solution that satisfies our unique needs associated with the large "scale" of our services, such as enabling a small team to operate control plane processing for services that we've developed as microservices and managing diverse software with one single interface.
OpenStack and the IaaS layer's in-house components leverage multiple infrastructure-related OSS projects to enable many features. To connect physical and virtual machines (VMs) to the network, we have adopted diverse technologies and software. For example, CLOS networks and multi-tenancy are implemented with BGP and SRv6.

As Rancher is deployed to manage multiple clusters of Kubernetes in the PaaS layer, technologies for using and constituting Kubernetes and peripheral technologies of Rancher, especially related to cluster management, also fall within our area of responsibility.
By the way, we don't run Rancher just as it is. We customize its management features to fit our specific needs for scale and a private cloud.
Or maybe I should describe it as us creating a Kubernetes custom controller to manage clusters, with Rancher forming the basis of it.
Similar to OpenStack, Kubernetes manages containers by adopting OSS and out-of-the-box (OOTB) features in Linux. So, we also need to know Netfilter and conntrack, which enable containers to talk to each other in virtual IP mode, as well as Linux namespaces, which isolate containers from one another to avoid interference.

-- What challenges is the team now facing?

Muroi: The IaaS team is facing two main challenges: one is the increasing cost of the process that runs from developing new IaaS features through to delivering them into a real environment; the other is the difficulty of identifying the root causes of errors.
To describe the first challenge: as IaaS has scaled in size and in the number of features, developing new features has come with more lengthy side tasks. The IaaS layer, which our team develops, not only directly manages the VMs where user-facing services operate, but also serves Kubernetes (the PaaS layer) and other services like Managed Kafka and Managed MySQL, which are provided to LINE developers as the Verda cloud. As you can imagine, any changes we make to the IaaS layer will impact the entire Verda cloud. This means that even when a newly developed feature can be deployed readily, our job often extends beyond the IaaS layer. For example, we conduct impact analyses on related features and mitigate the identified impacts prior to deployment, then run additional QA testing based on the results of the impact analysis, and coordinate the deployment schedule, to name just a few. When we recently added an audit log feature for tracking Verda cloud usage, we naturally had to do end-to-end QA testing that covered almost the entire service from both API and GUI perspectives.

The second challenge, the difficulty of identifying the root causes of errors, comes from the architecture of Verda, specifically from the IaaS layer being the backbone of the PaaS layer. Development and operation of the entire Verda cloud, including IaaS, relies on microservices. Each service looks independent of the others, but the IaaS layer serves as the hub for every service, so to speak, which means almost all operations are carried out by way of the IaaS layer. To give you a few examples, creating a Managed MySQL instance requires creating a VM, and each service uses OpenStack Keystone for authentication. So whenever any type of error occurs, the error analysis inevitably needs support from the IaaS layer.
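To make the "hub" picture concrete, here is a minimal sketch of how a dependency map can answer the question "which services could a change to this component affect?". Only the Managed MySQL to VM edge and the "every service uses Keystone" edges come from the interview; the other edges, the service names, and the `impacted_by` helper are illustrative assumptions, not a description of Verda's actual tooling.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it calls directly.
# Only "managed-mysql -> vm" and "everything -> keystone" are stated in the
# interview; the remaining edges are assumptions for illustration.
DEPENDS_ON = {
    "managed-mysql": ["vm", "keystone"],
    "managed-kafka": ["vm", "keystone"],
    "managed-kubernetes": ["vm", "keystone"],
    "vm": ["keystone"],
    "keystone": [],
}


def impacted_by(component, depends_on):
    """Return every service that directly or transitively depends on component."""
    # Invert the edges so we can walk from a component to its consumers.
    consumers = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            consumers.setdefault(dep, []).append(service)

    # Breadth-first walk over the inverted graph.
    seen = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return sorted(seen)
```

In this toy map, a change to `vm` reaches all three managed services and a change to `keystone` reaches everything else, which mirrors why both impact analysis and error analysis tend to lead back to the IaaS layer.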

-- What, specifically, is your team doing to address these challenges?

Muroi: We don't think these two challenges can be resolved overnight just by adopting a single new technology. We see them as a common challenge that typically arises with the growing scale of a cloud (e.g. the IaaS and PaaS layers) and the size of the organizations involved. So how do we mitigate that? We make cross-organizational efforts to adapt our organizational culture to the evolving environment, and we also improve our technical acumen and develop proprietary technologies.
To curb the increasing delivery costs, the IaaS team uses Kubernetes's declarative model for deployment and rollback, and implements "operations as code" by developing custom resources and operators. But these measures can only optimize individual parts of a new feature's delivery cycle, which spans from deployment to the development and staging environments, through QA testing and production release, to post-release QA testing. That is why we also take additional steps to accelerate the end-to-end cycle from the development stage through to QA testing in the prod environment. These steps aim not just to automate tests but also to build a pipeline that bridges production release and QA testing, and to work with the QA team to automate QA testing for business logic, which is otherwise not well suited to automated testing, by turning it into a service.
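The declarative model mentioned here can be illustrated with a small, library-free sketch: an operator compares the desired state declared in a resource with the state it observes and applies only the difference. The `Spec` and `Observed` types, their fields, and the action strings are hypothetical and chosen for illustration; a real operator would watch custom resources through the Kubernetes API rather than mutate an in-memory object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Desired state, as it might be declared in a custom resource (hypothetical fields)."""
    image: str
    replicas: int


@dataclass
class Observed:
    """State currently observed in the environment (hypothetical fields)."""
    image: str
    replicas: int


def plan(spec, observed):
    """Compare desired and observed state and list the actions needed to converge."""
    actions = []
    if observed.image != spec.image:
        actions.append(f"roll image {observed.image} -> {spec.image}")
    if observed.replicas != spec.replicas:
        actions.append(f"scale {observed.replicas} -> {spec.replicas}")
    return actions


def reconcile(spec, observed):
    """Apply the planned actions; on an already converged state this is a no-op."""
    actions = plan(spec, observed)
    observed.image = spec.image
    observed.replicas = spec.replicas
    return actions


live = Observed(image="api:1.0", replicas=3)
print(reconcile(Spec("api:1.1", 3), live))  # deploy: one rolling image update
print(reconcile(Spec("api:1.1", 3), live))  # already converged: nothing to do
print(reconcile(Spec("api:1.0", 3), live))  # rollback: re-declare the old spec
```

Under this model a rollback is not a separate procedure: declaring the previous spec again produces the reverse action, which is one reason the approach suits "operations as code".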
In dealing with the difficulty of identifying root causes, we are visualizing the entirety of OpenStack and the Verda cloud, as they make up most of the IaaS layer. By visualization, I don't mean monitoring systems or fetching metrics, as we have already been doing that. We look more deeply into the internal behaviour of OpenStack, for example by fetching more detailed metrics and by developing a feature that traces individual requests across the entire Verda cloud using OpenStack's request-id.
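As a rough illustration of request-id based tracing, the sketch below groups log lines from several services by the request ID they mention. OpenStack request IDs conventionally take the form `req-` followed by a UUID; everything else here, including the simplified `<service> <message>` log format, the sample lines, and the `group_by_request` helper, is an assumption made for illustration. It also assumes the same ID is visible in every service's logs, which in practice depends on how the ID is propagated between services.

```python
import re
from collections import defaultdict

# OpenStack request IDs conventionally look like "req-" followed by a UUID.
REQUEST_ID = re.compile(r"req-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def group_by_request(lines):
    """Map each request ID to the (service, message) pairs that mention it, in input order."""
    traces = defaultdict(list)
    for line in lines:
        match = REQUEST_ID.search(line)
        if match is None:
            continue  # this line is not tied to any request
        service, _, message = line.partition(" ")
        traces[match.group(0)].append((service, message))
    return dict(traces)


# Made-up log lines in a simplified "<service> <message>" format.
SAMPLE = [
    "managed-mysql [req-11111111-2222-3333-4444-555555555555] create instance",
    "keystone [req-11111111-2222-3333-4444-555555555555] token validated",
    "vm [req-11111111-2222-3333-4444-555555555555] boot failed: no valid host",
    "vm [req-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee] boot ok",
    "vm heartbeat",
]

traces = group_by_request(SAMPLE)
for service, message in traces["req-11111111-2222-3333-4444-555555555555"]:
    print(service, message)
```

Reading the grouped lines in order shows the path of a single request through the services, so a failure reported by a higher-level service can be followed down to the step where it actually occurred.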
Through these visualization efforts, we are also contributing to the upstream OpenStack community. It's fair to say that the scale of our Verda cloud is immense, probably one of the largest private clouds in the world. So technical issues related to performance and scale are inevitable. When we encounter these technical challenges, we try to solve them not just for internal enhancement but also for the OSS community as a whole. That's how we contribute to the OpenStack community.

Muroi talks about the team's challenges and what the team is doing to address them

-- Could you tell us about the roadmap for your team?

Muroi: The IaaS Group is aiming to establish the best IaaS, one on which new features can be developed and operated at a scale ten times larger than today's. When you hear "scale," it's usually about the number of hypervisors and VMs; for us, it's more than that. We are aiming to establish an infrastructure and culture that make releasing and operating new IaaS features easier than, or at least just as easy as, it is today, even with a tenfold increase in the number of users and services of the PaaS layer running on the IaaS. It is generally believed that the larger the scale, the harder it is to make changes to infrastructure. We will challenge this idea. The Verda cloud is now three years old, and we have been expanding the area of our IaaS thus far. As the scale of the cloud has become one of the largest in the world, we have also started to face the challenges mentioned earlier. Although we are still halfway through the initiative, we are trying to implement the following solutions one by one.

Switch from "doing operation" to "automating (developing) operation" by using recent technology like the Kubernetes Operator pattern.
Expand the scope of this switch in "operation concept" beyond OpenStack.
Offer the development delivery features handled by IaaS to the wider internal community (something like Software Delivery as a Service).

More challenges are emerging on the IaaS user side as we provide a variety of resources as a service. For example, tasks that used to be done manually or by script in the conventional infrastructure resources can be complex and complicated. Also, applications specific to the conventional infrastructure resources, such as IP ACL configuration for databases, have low affinity with cloud native infrastructures. Collaborating with the Computing Abstraction Group, we are planning on taking some R&D-like measures against these problems.

Nishiwaki: The Computing Abstraction Group has just taken its first step toward our goal of providing a cloud system in which users don't have to worry about infrastructure resources: the launch of a managed Kubernetes service in the production environment in October 2019. This is to help internal developers first become familiar with developing and operating applications without being conscious of servers. There is still much work to be done, so we have set up short- and long-term strategies for development.

Short Term Project

We are planning to undertake the following initiatives to enhance the convenience of the Kubernetes clusters currently provided for each service development team and increase the number of use cases.

Support both VM and PM.
Support enhancement for integration with other private cloud components, such as load balancers and persistent volume.
Support multiple regions (multiple regions for Kubernetes worker, troubleshooting for Kubernetes C-plane).
Identify items to be considered to support large-scale cluster (over 1,000 nodes) and implement it accordingly.
Standardize the development process on Kubernetes (monitoring, CI/CD, auditing).

Long Term Project

For the following short-term problems, we're fundamentally refining our approach to the architecture at the DC level.

Wasting resources by providing one service with one cluster.
Having to deploy Kubernetes C-plane for each cluster (about 400 clusters are created as of April 2020).
Design issue of not being able to utilize resources across different clusters.
High costs for cluster management and long lead time for cluster preparation.
Long build time for preparing a cluster before use although it can be created with one single API.
User having to manage capacity of a cluster and individually deploy resources on cluster.

In response to these challenges, we concurrently have a long-term project to redefine the architecture, based on approaches with the following consideration points and underlying technology validation.

Underlying technology validation that will cover multiple services with one existing cluster, in lieu of building a cluster per service.
(Note: One cluster does not necessarily have to consist of one Kubernetes C-plane. So, when multiple clusters are managed under one interface, we recognize it as one cluster.)
- Deployment design when building Kubernetes with over 35,000 physical servers (for scalability) in multiple regions and locations (for adding failure domain and reducing latency).
- Consideration of a mechanism that orchestrates multiple clusters.
- Measures to run multiple services with one cluster: optimal isolation from the network perspective, security isolation, and QoS control to prevent unexpected resource interference.
Streamlining of CI/CD environment to develop, test, and release applications on one cluster (interface) to expand the coverage for services.

(Note: We’re currently driving these efforts on the assumption that we keep using Kubernetes. However, we may decide to use Kubernetes only partially and not use all its features as provided, depending on the result of this project.)

Our long-term project may sound abstract, but it signifies that we’re not planning to only rely on existing solutions that can meet immediate needs and demand. Our development approach is also to go back to the basics without being caught up on existing solutions and discern what technologies we really need. This is our long-run strategy: to start from validating underlying technologies and ultimately introduce new solutions.

-- Lastly, please give a message to the readers interested in the Verda Platform Development Team.

Muroi: Our team members have diverse technical backgrounds. We have people knowledgeable about OpenStack or networks, and some are excellent at coding, to give some examples. People with many different strengths are working together every day. Since the IaaS layer encompasses a variety of technical domains, we have opportunities to challenge ourselves in many fields. We are looking for someone who is willing to take up challenges with confidence in what they’re good at, rather than feeling shy about working in a field that is not necessarily their specialty.

Nishiwaki: The Verda Platform Development Team has a wide range of responsibilities, from the IaaS layer to the PaaS layer. LINE's large number of servers, services, development offices, and developers poses many different challenges. It is one of our unique strengths that we can explore and implement fundamental solutions from the distinct perspective of a team that handles various technical fields.

It is extremely difficult to find an existing solution best suited for this particular situation and scale at LINE. If you are eager to explore and establish a fundamental solution with your own hands by discerning the underlying needs from multifaceted layers, including operation, virtualization, network, and application CI/CD, you are reading about the exact team you're looking for!

"Verda Platform Development Team is hiring!"

Software Engineer / OpenStack / Private Cloud Platform
Software Engineer / Kubernetes / Private Cloud Platform