Working as a New Graduate Engineer - A December to Remember -

Hi there! I’m Yuki Taguchi, and I joined LINE as a new graduate in April 2019. I currently work as a member of the Network Development Team at our Verda Dept. Verda is a large-scale private cloud platform for LINE services. Of the many tasks involving Verda, I’m in charge of developing its network components.

At LINE, in addition to network management based on specialized hardware equipment, we’re also focusing on software-based approaches such as XDP and DPDK. The Verda Network Development Team works to improve the company’s infrastructure mainly through this software-based approach.

The aim of this article is to give you a glimpse of what working as a new graduate infrastructure engineer at LINE looks like. In it, you’ll read about our efforts to support our infrastructure platform through performance measurement and quality maintenance of our network software.

Maintaining the performance of our software-based network platform

First of all, to give you some background on my job, I’d like to touch on the situation and issues surrounding Verda.

Since its release in 2017 as a private cloud platform, Verda has enjoyed rapid growth. In the past year, for example, its number of VMs has more than doubled, and a wide variety of services, including our own messaging service, have been deployed on it. This led to calls for a fast and flexible platform that could keep pace with the speed at which these services are developed.

To address these needs, Verda’s network proactively adopts a software-based approach.

For instance, Verda provides an LBaaS (Load Balancer as a Service) feature. For this feature, the Network Development Team customizes the Layer 4 load balancer’s entire data plane. In implementing the data plane, we use XDP, a Linux kernel feature, for acceleration.
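
To give a concrete feel for what XDP means here, below is a minimal sketch (not Verda’s load balancer code) of attaching a trivial XDP program to a NIC with the bcc Python bindings. The interface name is a placeholder, and where a real load balancer would parse and rewrite packets, this skeleton simply passes everything through.

```python
# Minimal sketch only: attach a do-nothing XDP program with bcc.
# "eth0" is a placeholder interface, not an actual Verda host setting.
from bcc import BPF

prog = r"""
#include <uapi/linux/bpf.h>

int xdp_prog(struct xdp_md *ctx) {
    /* A real load balancer would parse headers here and rewrite or
       redirect packets; this skeleton just passes everything through. */
    return XDP_PASS;
}
"""

b = BPF(text=prog)
fn = b.load_func("xdp_prog", BPF.XDP)
b.attach_xdp("eth0", fn, 0)   # runs in the NIC driver's receive path

try:
    input("XDP program attached to eth0; press Enter to detach\n")
finally:
    b.remove_xdp("eth0", 0)
```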

LINE provides a diverse set of services, which makes multitenancy a must. To provide this multitenant environment, we use a novel routing technology called SRv6. SRv6 is so new that there were few implementation cases at other companies, and we could barely find any hardware products that supported it. This scarcity led us to use the SRv6 implementation in the Linux kernel, running on general-purpose servers. This is just one example of how we work to improve management flexibility by implementing the important components of Verda in software.
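
As a rough illustration of what “SRv6 on the Linux kernel” involves (this is not our actual configuration), the kernel’s seg6 support lets you steer a prefix into an SRv6 encapsulation with a single iproute2 route. The prefix, segment list, and device below are all placeholders.

```python
# Illustrative only: configure a Linux seg6 encap route via iproute2.
# Prefix, SIDs, and device are placeholders, not Verda settings.
import subprocess

DEST_PREFIX = "2001:db8:100::/64"          # tenant traffic to encapsulate (example)
SEGMENTS = "2001:db8:1::1,2001:db8:2::2"   # SRv6 segment list (example SIDs)
DEVICE = "eth0"                            # egress interface (placeholder)

subprocess.run(
    ["ip", "-6", "route", "add", DEST_PREFIX,
     "encap", "seg6", "mode", "encap", "segs", SEGMENTS, "dev", DEVICE],
    check=True,
)
```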

Verda is a platform that mediates the massive traffic generated by LINE services, so its software’s forwarding performance is crucial and must be maintained at all times. To address this requirement, regular benchmarks that give us a consistent understanding of actual software performance are essential.

This is actually easier said than done. Manually building a test environment for every benchmark comes with its own challenges, such as misconfigurations and discrepancies between environments, depending on which resources are available at that particular time. Results of benchmarks run under such inconsistent conditions cannot be accurately compared, which means that performance deterioration could sneak its way into the software in the course of development and go unnoticed.

When I joined the team, there wasn’t yet a consolidated benchmark system, and these issues were far from being properly addressed. We eventually arrived at the conclusion that we needed to run declarative benchmarks in a common environment using CI (Continuous Integration) and set out to create an automation system for this purpose.

Automated testing and benchmarking of load balancers

The first task I was assigned was to automate unit tests (function tests) and benchmarks (performance tests) for load balancers, and to present the results so that they could be easily understood by load balancer developers in my team. Verda’s load balancers are implemented using software, so their unit tests, like those for typical server applications, could be easily automated using a CI tool called Drone. With performance benchmarks, however, things were not so simple.

Verda’s load balancers are based on XDP, and their packet processing is executed in a physical NIC (network interface card) driver context. Because of this, the load balancers run on bare metal servers, not in a virtual environment. This means that they can’t be benchmarked as virtual machines or containers, so we needed to prepare a dedicated physical environment.

There were other things to consider as well, for instance:

  • How to generate high-rate traffic
  • How to automatically configure test environments
  • What kinds of benchmark scenarios to test

To resolve these issues, I took the following approaches, respectively:

  • Used a software-based, high-speed traffic generator
  • Automated configuration using provisioning tools such as Ansible
  • Created benchmark scenarios considering the load balancer’s characteristics as well as baseline measurements

These approaches were baked into the design of our automation system. In this system, a CI tool (Drone CI) triggers the provisioning of the dedicated testbed using Ansible, and a high-speed traffic generator called TRex is used to measure the performance.
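
To make that flow more concrete, here is a rough sketch of how a CI job can drive TRex through its Python automation API. The server address, ports, rate, and duration are placeholders rather than our real benchmark parameters, and the import path and exact stats keys can vary by TRex version.

```python
# Sketch of a TRex-driven measurement step (placeholder parameters throughout).
from trex_stl_lib.api import STLClient, STLStream, STLPktBuilder, STLTXCont
from scapy.all import Ether, IP, UDP

client = STLClient(server="trex-server.example")   # placeholder TRex server
client.connect()
try:
    client.reset(ports=[0, 1])                     # port 0 transmits, port 1 receives

    # A simple continuous UDP stream; real scenarios mimic production traffic patterns.
    pkt = STLPktBuilder(pkt=Ether()/IP(src="16.0.0.1", dst="48.0.0.1")/UDP(dport=80)/("x" * 64))
    client.add_streams(STLStream(packet=pkt, mode=STLTXCont(pps=1_000_000)), ports=[0])

    client.start(ports=[0], duration=10)
    client.wait_on_traffic(ports=[0])

    stats = client.get_stats()
    # Per-port counters live under the port index; exact keys depend on the TRex version.
    print("tx packets:", stats[0]["opackets"], "rx packets:", stats[1]["ipackets"])
finally:
    client.disconnect()
```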

Here’s an overall look at the completed system:

When I added a benchmark scenario, I did it in close consultation with the load balancer developers. There was, at the time, concern that a new feature of the load balancer might cause performance to decline under certain traffic patterns. We added a new traffic generation program that mimicked these conditions and passed the test results on to the load balancer developers. The task of actually adding the benchmark scenario was quite easy. Because the benchmark environment had already been completed, all we really had to do was create a new traffic generation program and add it to the tester. All the other redundant tasks (e.g., provisioning of switches and servers) had already been automated!
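
For context, a new “traffic generation program” in this setup is essentially a TRex stream profile. The sketch below is purely hypothetical, not the scenario we actually added, but it shows the shape such an addition takes: a class that returns streams, plus the register() hook that TRex’s profile loader expects.

```python
# Hypothetical stream profile; the traffic pattern and rates are illustrative only.
from trex_stl_lib.api import STLStream, STLPktBuilder, STLTXCont
from scapy.all import Ether, IP, UDP


class ManySmallFlowsProfile:
    """Spreads load over several distinct 5-tuples to mimic many short flows."""

    def get_streams(self, direction=0, **kwargs):
        streams = []
        for i in range(4):   # a handful of distinct flows (illustrative)
            pkt = STLPktBuilder(
                pkt=Ether() / IP(src=f"16.0.{i}.1", dst="48.0.0.1")
                    / UDP(sport=1024 + i, dport=80) / ("x" * 64)
            )
            streams.append(STLStream(packet=pkt, mode=STLTXCont(pps=250_000)))
        return streams


def register():
    # TRex's stateless profile loader looks for a module-level register().
    return ManySmallFlowsProfile()
```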

In designing the benchmark, we came across measurement difficulties unique to load balancers. The load balancers used in Verda, by specification, use IPIP tunneling, in which ingress and egress packets have different structures. This means that the actual structure of the packets collected by the receiving side of the traffic generator differs from the structure it expects, which can be problematic. To address this issue, we tweaked the traffic generator program so that only intended packets were counted by the NIC hardware, with the rest counted by software.

Now, when a new pull request is created on the load balancer’s GitHub repository, the benchmark runs automatically in its dedicated testbed, and the results are posted to the pull request page.
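
The reporting step itself can be as simple as a script at the end of the CI pipeline posting a comment through the GitHub REST API. This is only a sketch of the idea, not our actual reporting code; the repository, pull request number, and summary file are placeholders.

```python
# Sketch only: post a benchmark summary as a pull request comment.
import os
import requests

REPO = "example-org/loadbalancer"                # placeholder repository
PR_NUMBER = 123                                  # placeholder pull request number
SUMMARY = open("benchmark_summary.txt").read()   # assumed output of the benchmark job

resp = requests.post(
    f"https://api.github.com/repos/{REPO}/issues/{PR_NUMBER}/comments",
    headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
    json={"body": "Benchmark results:\n\n" + SUMMARY},
)
resp.raise_for_status()
```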

This system provides load balancer developers with an objective understanding of performance even during the development phase. There’s still room for improvement, however. With regard to visibility, we’re considering presenting the data as charts and making it easier to compare past and present performance. We’re also thinking of adding features to supplement performance analysis (e.g., showing how much CPU time each process consumed). Through further enhancements, I hope to enable a faster and more accurate development flow for our Network Development Team.

A path toward agile network development

Through my experience, I learned that automating benchmarks comes with a host of advantages. One of them is that declarative benchmarks automated with Ansible are highly reproducible, allowing anyone to get the same results regardless of who runs the tests. Another advantage is that automated benchmarks make it easy to try out parameter changes on demand.

In addition, as shown in the LBaaS example, good use of CI tools allows network software to naturally incorporate benchmarks into the development flow, just like any typical application. This helps prevent unintended performance deterioration.

In a more recent project, I automated a benchmark for a new SRv6 implementation using XDP.

This new SRv6 implementation was developed by our then intern, Ryoga Saito, as part of the team’s ongoing efforts to accelerate our multitenant environment. (You can read more about these efforts in a separate blog post, available in Japanese only.)

This automated benchmark system can benchmark new and existing SRv6 implementations in the same environment, and even generate a performance comparison chart. It enables us to objectively compare the SRv6 data plane performance, making it easier to determine whether heavier workloads can be applied to Verda. The tests can now be run on demand, providing continuous feedback to developers of network functions so that they can promptly make improvements.
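
Generating the comparison chart is straightforward once the measurements are collected. Below is a minimal sketch, assuming a hypothetical results.json keyed by implementation name with packet sizes and measured throughput; it is not our actual plotting code.

```python
# Sketch: plot existing vs. new SRv6 data plane throughput from collected results.
import json
import matplotlib.pyplot as plt

# Hypothetical format: {"kernel seg6": {"packet_size": [...], "mpps": [...]}, ...}
with open("results.json") as f:
    results = json.load(f)

for name, series in results.items():
    plt.plot(series["packet_size"], series["mpps"], marker="o", label=name)

plt.xlabel("packet size (bytes)")
plt.ylabel("throughput (Mpps)")
plt.legend()
plt.savefig("srv6_comparison.png")
```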

In the course of developing this system, we again encountered challenges unique to network software. For example, measurement results would be different for every test we ran. There are many possible causes for these discrepancies—such as inconsistent cache status and CPU core allocation problems—so we had to investigate each one every time, and reconsider our test settings. We continuously updated the benchmark itself in this way to improve the reliability of our measurement data.
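
One simple mitigation is repetition: run each scenario several times and look at the spread before trusting a number. Below is a minimal sketch of that idea (not our exact code); the 5% threshold is an arbitrary example.

```python
# Sketch: flag benchmark runs whose run-to-run variance looks suspicious.
import statistics


def summarize(run_benchmark, repetitions=5):
    """run_benchmark is assumed to be a callable returning one throughput sample (Mpps)."""
    samples = [run_benchmark() for _ in range(repetitions)]
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return {
        "samples": samples,
        "mean": mean,
        "stdev": stdev,
        "unstable": stdev > 0.05 * mean,   # arbitrary 5% threshold for investigation
    }
```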

Working at the Verda Dept.

So far, this article has been about projects that I have been personally involved in. I’d now like to move on to tell you about my surroundings. The Verda Dept. is made up of many teams in addition to mine, including one that manages cloud platforms such as OpenStack and Kubernetes, the storage team, and the UI team. We have locations in Tokyo, Fukuoka, and Kyoto, as well as in South Korea. Some teams have members across different locations, often with varying native languages, and so mostly communicate in English.

I myself try my best to write tech docs and communicate on GitHub in English. As for live English conversation, it’s something I still need to work on and definitely plan to, because the ability to communicate directly in English often lets you interact smoothly with those outside the team.

Verda office members also actively participate in technical conferences and open source communities. Network technologies used at LINE, like SRv6, are state-of-the-art, which means that features to support them have only just been implemented by equipment vendors or are still in their testing phase. This makes the knowledge gleaned from our day-to-day work all the more valuable, so our presentations attract a lot of attention.

The Verda office makes information about the technologies and architectures we adopt as public as possible. In fact, the data plane/control plane and LBaaS technologies used in the Verda network have already been introduced at major conferences, including JANOG. I feel very lucky to be able to work in an environment where I can stay inspired, not only by my fellow teammates but by those outside the team as well.

Wrapping up

Because the automated test/benchmark systems described in this article are flexibly designed to accommodate new test subjects, I’m confident that they will continue to be used to test the performance of new technologies and greatly reduce the workload of network developers.

It’s been approximately six months since I joined the Network Development Team. Tasked with the important job of continuously maintaining the quality of our cloud platform, I find my work extremely rewarding.

As a final note, I hope the efforts at our Verda office and the projects mentioned in this article pique your interest in Verda network development. For more information on our other projects, please check out the many tech conference documents LINE has made available to the public (I’ve provided a list of some of our more recent presentations below).

Thank you for taking the time to read this post!