An intern's tale of designing LINE's firewall secretary

Valeri Haralanov, 2020-09-28

Valeri is an engineer on the Infra Protection 1 team

Greetings! Today I'm going to write about my experience as an intern working on LFMS - the LINE Firewall Management Service - and more specifically, how I designed a part of it called Service Risk Scoring. But first, let me quickly introduce myself.

My name is Valeri Haralanov; in English-speaking contexts I go by Val. I am currently a third-year student at the University of Saitama, where I study Information Engineering, a.k.a. Computer Science, and I spend most of my time either working on hobby projects or (recently) playing World of Warcraft Classic (feel free to hit me up for a raid!). This year, 2020, I was accepted for an internship at LINE as a member of the Infra Protection 1 team, and I spent 6 weeks from mid-August to mid-September working on the project I am going to tell you about. Apart from this, I am also interested in pure mathematics (I recently started learning about Geometric Group Theory) and, for better or worse, I enjoy modeling things in everyday life as close to math as possible.

Well then, now that we have that out of the way, let's talk firewalls!

What is LFMS?

LFMS stands for LINE Firewall Management Service and, in a nutshell, it is a web interface for, well, managing LINE's firewalls. LINE uses a number of hardware firewalls. A subset of those, made by Palo Alto, are tied to a specific service or project such as LINE Pay or LINK. There are also devices that manage traffic for multiple services.

But what exactly is a hardware firewall? I found out the hard way that this is a surprisingly non-trivial question. Most people are likely familiar with the concept of a firewall on their computer, a service that blocks or allows traffic from/to the internet, but when it comes to managing whole networks, it becomes a little more complicated than that. Namely, a hardware firewall can, and often does, perform functions similar to those of a router, switch, bridge, or all of those combined. However, the important part is that, after deciding where a packet is coming from and where it is trying to go, a firewall looks at the contents of the packet and decides whether to pass or drop it. It does that by comparing the packet to an Access Control List, or ACL. An ACL contains multiple rules, each of which is a set of criteria along with an action: a packet that matches the criteria is either allowed to reach its destination or denied. Denial can mean a number of different things, but suffice it to say that a denied packet will not reach its destination. All of this can be configured on the respective device via a web interface or API.
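To make the matching step concrete, here is a minimal Python sketch of checking a packet against an ACL. The `Rule` and `Packet` types, the first-match-wins order, and the deny-by-default fallback are illustrative assumptions, not a description of how Palo Alto devices work internally.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    """One ACL entry: matching criteria plus an action (hypothetical shape)."""
    source: str       # CIDR block the packet's source must fall in
    destination: str  # CIDR block the packet's destination must fall in
    port: int         # destination port the rule applies to
    action: str       # "allow" or "deny"


@dataclass(frozen=True)
class Packet:
    source: str
    destination: str
    port: int


def matches(rule: Rule, packet: Packet) -> bool:
    """Return True when every criterion of the rule matches the packet."""
    return (
        ip_address(packet.source) in ip_network(rule.source)
        and ip_address(packet.destination) in ip_network(rule.destination)
        and packet.port == rule.port
    )


def decide(acl: list[Rule], packet: Packet) -> str:
    """Apply the first matching rule; deny when nothing matches (assumed default)."""
    for rule in acl:
        if matches(rule, packet):
            return rule.action
    return "deny"


acl = [
    Rule("0.0.0.0/0", "10.0.0.5/32", 443, "allow"),
    Rule("0.0.0.0/0", "10.0.0.0/24", 22, "deny"),
]
```

With this ACL, `decide(acl, Packet("203.0.113.7", "10.0.0.5", 443))` is allowed by the first rule, while a packet to port 80 on the same host matches no rule and falls through to the default deny.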

So how does LFMS play into all of this? The answer is straightforward: LINE uses many firewall devices that each need to be configured separately, but managing ACLs for each individual device is at best inefficient and at worst dangerous. So LFMS's primary goals are the following:

Allow viewing and editing firewall policies based on projects.
Allow users, people who cannot modify the firewall configuration, to safely view ACLs and make requests to modify the configuration.
Allow administrators, people that manage security policies, to change the firewall configuration on a higher abstraction level, so that each modification is correct and traceable.
Aggregate security policy data and integrate information from other services in order to find and diagnose potential security risks.

The first 3 of these points are basically a fancy way of describing proper CRUD operations and were already implemented before I started working on LFMS. The final point — aggregating data and diagnosing risks — is what I focused my work on during the past weeks.

Service Risk - why firewall rules need meta information

Firewalls often offer functionality that allows a person to structure their configuration in a way that is consistent with the business logic. However, while a firewall can make configuration easier, it cannot guarantee anything about correctness. That is to say, there is nothing stopping you from creating a rule that opens your database to the outside world, thus destroying your business. Granted, the firewall administrator is usually not malicious, but security holes can be introduced by mistakes or even by unnoticed changes. For example, if an allow rule becomes obsolete because a service is shut down, but an address in that rule is later reused for something else, the traffic that the firewall still permits could be used for malicious purposes.

So, the problem now becomes:

How can the administrator check and evaluate whether the current firewall configuration is consistent with the current business infrastructure?

The answer consists of two parts.

First, the administrator needs meta information — what an address is assigned to, what it is used for, whether the device behind it is alive, whether its ports are open, and so on. Fortunately in LINE's case, almost all of this is available via internal tooling.

Second, this information has to be integrated into the user interface. That is, the administrator should be able to view and search all of the information relevant to a given rule. They should be able to tell at a glance whether a rule is legitimate and safe or dangerous. This is what LFMS's Service Risk Score feature accomplishes.

During the past weeks, the team decided to integrate a number of different services into LFMS and define the following risk indicators for every rule that allows outside traffic to the internal network.

Does the rule have an owner? LINE's firewall is changed based on ACL tickets that are submitted to an internal database (let's call it the LINE Database, or LDB for short). Whenever a ticket is approved, the ticket number is included in the rule's description field, which allows us to trace the ticket back from the rule. However, before tickets were introduced, rules were changed based on oral communication, and those rules are now untraceable. Therefore, if a rule does not have a ticket number (and thus an owner, in other words the person who requested it), it is considered a security risk.
Does the rule allow a port range? Usually services require only discrete ports such as 80 for HTTP, 443 for HTTPS, and so on. Allowing a range of ports has its uses, but is not the norm and therefore is considered a security risk.
Does the rule use a destination IP range that is too large? Similar to the above, an IP range that covers more than 256 addresses is rarely needed and could easily introduce a security hole, therefore such rules are considered a security risk.
Have all the destination addresses passed internal risk assessment? LINE conducts risk assessment on all services before opening them to the internet. Naturally, a rule that allows traffic to a service that has not been through this process is a security risk.
Are all the destination addresses registered in LDB? LDB also provides information about every internal server along with its bound IP addresses, and so an address that is not registered in LDB is a natural security risk.
Are all the destination addresses reachable on the ports allowed in the rule? A port that is allowed in the firewall but closed on the server could be an indicator of misconfiguration on either side, so it is considered a security risk.
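Three of these indicators (owner, port range, destination size) can be computed from the rule alone, without calling any external service. The sketch below shows one way to do that in Python. The `SimpleRule` shape and the `REQ-<number>` ticket pattern are assumptions made for illustration; the actual ticket format and LFMS data model are not described here.

```python
import re
from dataclasses import dataclass
from ipaddress import ip_network

# Hypothetical ticket format; the real pattern used in rule descriptions is not given.
TICKET_PATTERN = re.compile(r"REQ-\d+")
MAX_DESTINATION_ADDRESSES = 256  # threshold stated in the text


@dataclass(frozen=True)
class SimpleRule:
    """A rule with one destination and one port or port range (assumed shape)."""
    description: str
    destination: str  # single address or CIDR subnet
    ports: str        # "443" or "8000-8100"


def risk_indicators(rule: SimpleRule) -> dict[str, bool]:
    """Return one flag per indicator; True means the indicator signals a risk."""
    destination = ip_network(rule.destination, strict=False)
    return {
        "no_owner": TICKET_PATTERN.search(rule.description) is None,
        "port_range": "-" in rule.ports,
        "large_destination": destination.num_addresses > MAX_DESTINATION_ADDRESSES,
    }
```

Note that a /24 subnet covers exactly 256 addresses, so under the "more than 256" threshold it is not flagged, while a /23 is. The remaining three indicators depend on data from other services and cannot be derived from the rule itself.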

Currently these indicators are simply checked individually, but the long-term plan is to use them to construct a so-called risk score: a numeric value that ranks rules based on their overall security risk level. Unfortunately I did not have the expertise required to decide how this number should be computed, but I managed to implement both gathering the above information and joining it to each rule, which should be most of the heavy work.

So, let's talk about the actual software (finally!).

Integrating with LINE's internal tools

In theory, integrating with other tools sounds easy - just hit the API with a query and fetch the result, right?

Well, no. But in order to explain why, I need to go into some details. For the purposes of this explanation, I will refer to LINE's internal tools as external, since from LFMS's point of view they are third-party services.

The reason why "just hitting the API" does not work is mostly about processing time and ease of implementation. In order to list all the rules that have security risks, along with the status of their indicators, we have to get information from the following external sources.

Palo Alto Device
LDB
Risk Assessment DB
External Port Scanner

Let us consider what happens when we load the Service Risk Score page. First, all the security rules are fetched from Palo Alto, along with information about all addresses, services, and so on. This alone takes a few hundred milliseconds on a development device. Next, we have to decompose the rules into simpler ones, each of which has only a single IP address or subnet in each of the source and destination fields, and opens only a single port or port range. Thankfully, most rules that we consider do not include many nested address/service groups, so this does not take a significant amount of time.
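The decomposition step amounts to taking the Cartesian product of a rule's sources, destinations, and ports. Here is a small Python sketch of that idea; the `FirewallRule` type is a made-up stand-in, and it assumes nested address and service groups have already been flattened into plain tuples.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class FirewallRule:
    """A rule as fetched from the device, with groups already flattened (assumed shape)."""
    name: str
    sources: tuple[str, ...]
    destinations: tuple[str, ...]
    ports: tuple[str, ...]  # each entry is one port or one port range


def decompose(rule: FirewallRule) -> list[tuple[str, str, str, str]]:
    """Expand a rule into (name, source, destination, port) rows, one per combination."""
    return [
        (rule.name, source, destination, port)
        for source, destination, port in product(
            rule.sources, rule.destinations, rule.ports
        )
    ]


rule = FirewallRule(
    name="TestRiskScoreRule",
    sources=("0.0.0.0/0",),
    destinations=("10.0.0.5", "10.0.0.6"),
    ports=("80", "443"),
)
rows = decompose(rule)  # 1 source x 2 destinations x 2 ports = 4 simple rules
```

Every row keeps the name of the rule it came from, which is why one rule on the device can appear as several entries with the same name.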

And now comes the hard part. Each decomposed rule contains either a single IP address or a single IP subnet. However, LDB entries, risk assessment and port status are all information specific to a host, which means that they have to be queried for each IP in the subnet. Since we decided that subnets covering more than 256 addresses are considered unsafe, we can skip querying those (especially considering the results cannot be displayed easily). However, gathering information for even a safe subnet can take a significant amount of time.
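As a rough illustration of why the per-host lookups add up, the following Python sketch lists the addresses that would need to be queried for one destination. The function name and the choice to return an empty list for oversized subnets are assumptions for this example.

```python
from ipaddress import ip_network

MAX_SUBNET_SIZE = 256  # subnets larger than this are flagged instead of queried


def addresses_to_query(destination: str) -> list[str]:
    """List every address in a destination, or nothing when the subnet is too large."""
    network = ip_network(destination, strict=False)
    if network.num_addresses > MAX_SUBNET_SIZE:
        return []  # already marked as a risk, so per-host lookups are skipped
    return [str(address) for address in network]
```

A single address yields one lookup, but a /24 destination yields 256 addresses, each of which needs an LDB lookup, a risk assessment lookup, and a port check. That is far too much work to do inside one page load.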

This is why all of this needs to happen asynchronously. That is, gathering all the information from external services should be done independently of the HTTP request to the API, for example in a batch job. Luckily, the LFMS API already uses a task queue, so my job was to utilize that and design the way data will be queried, stored and displayed. Which brings us to The Problem.

The Problem: Where and how much data to store?

Fetching data asynchronously means that we have to store it somewhere and serve it from there when a request comes in. LFMS is built on Django, so at first we were using the built-in Django cache on top of Redis. However, it quickly became apparent that a memory cache in its usual definition is not the right tool for this, because:

It does not (fully) support relations.
It is prone to data loss.
Filling the cache on the first request after an entry expires still takes long enough to trigger the front-end timeout.

So I decided to take the obvious route and model everything in the relational database. This was probably the longest part of the development, since I had to decide how faithfully to replicate the data relations and how much of the data to store. The latter is especially important because if you think about it hard enough, you start asking yourself: "why am I reverse engineering the original database?" This made me reach a more general conclusion:

You cannot accomplish everything with only one source of data. Even if you have one giant database with all the data you will ever need, if processing that data takes more time than needed, you need to replicate and/or preprocess it so that you can get what you need quickly.

This applies to LFMS's case as well, since we can close our eyes and consider hitting the API to be the same as doing a very intensive DB query. The only downside is that in an actual DB, one has access to very powerful tools whereas in LFMS's case I had to manually replicate the database tables.

Thankfully, after I managed to create proper models for the data, the rest of the work was pretty much hitting the APIs and converting the JSON response to DB rows. This is done every few minutes/hours/etc. and the Service Risk rules along with their indicators are updated after that. Sure, it took some time to understand how each external API is used and how LINE's network is structured, but I think that is the natural way of progress — after all, I have never developed software that manages whole networks, nor do I have any experience with companies on the scale of LINE.
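The "JSON response to DB rows" step can be sketched as follows. LFMS uses Django models for this; to keep the example self-contained it uses Python's built-in `sqlite3` module instead, and the payload fields (`ip`, `service`, `admin`) and the `host` table are invented for illustration, since the real LDB response format is not shown here.

```python
import json
import sqlite3

# Hypothetical payload; the real LDB response format is not described in this post.
SAMPLE_RESPONSE = json.dumps([
    {"ip": "10.0.0.5", "service": "example-service", "admin": "example-admin"},
    {"ip": "10.0.0.6", "service": "example-service", "admin": "example-admin"},
])


def create_schema(connection: sqlite3.Connection) -> None:
    """Create the local replica table keyed by IP address."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS host ("
        "ip TEXT PRIMARY KEY, service TEXT NOT NULL, admin TEXT NOT NULL)"
    )


def store_hosts(connection: sqlite3.Connection, response_body: str) -> int:
    """Insert or refresh one row per host in the response; return the host count."""
    hosts = json.loads(response_body)
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            "INSERT OR REPLACE INTO host (ip, service, admin) "
            "VALUES (:ip, :service, :admin)",
            hosts,
        )
    return len(hosts)


def lookup_host(connection: sqlite3.Connection, ip: str):
    """Read from the local replica only; this is what a page request would call."""
    return connection.execute(
        "SELECT service, admin FROM host WHERE ip = ?", (ip,)
    ).fetchone()
```

The periodic job calls `store_hosts` with each fresh response, and because the write is an insert-or-replace keyed on the IP, running it repeatedly refreshes rows rather than duplicating them. Page requests then only ever call `lookup_host`, so they never wait on an external API.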

That being said, there is one more thing that I had to implement in order to finish my task and that is the port scanner.

The Port Scanner

The truth is, there was no port scanner (at least to my team's knowledge) that we could use to get the port status of LINE's services. So I had to make my own. It was an interesting experience to learn how programs such as Nmap and Masscan accomplish what they do, the differences between them, and the constraints that a task like port scanning imposes. That being said, the port scanner I made is just a glorified front-end to Masscan, so I do not take any particular pride in it. I designed this as a one-off solution that can kind of handle general scanning requests because I needed to focus on the Service Risk Score component, but I think that implementing a proper port scanning service could very well be worth the effort, especially if it's open-sourced as a general fire-and-forget solution.
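For readers curious what a thin front-end to Masscan might look like, here is a rough Python sketch. It is not the scanner described above. It relies on Masscan's `-p`, `--rate`, and `-oJ` options, and it assumes the JSON report is a list of records shaped like `{"ip": ..., "ports": [{"port": ..., "status": ...}]}`, which may differ between Masscan versions. All function names are made up for this example.

```python
import json
import subprocess


def build_masscan_command(targets, ports, output_path, rate=1000):
    """Build the argument list for one Masscan run writing a JSON report."""
    port_list = ",".join(str(port) for port in ports)
    return ["masscan", *targets, f"-p{port_list}", f"--rate={rate}", "-oJ", output_path]


def parse_open_ports(report_text):
    """Map each IP to its set of open ports (report shape is an assumption)."""
    open_ports = {}
    for record in json.loads(report_text):
        for entry in record.get("ports", []):
            if entry.get("status") == "open":
                open_ports.setdefault(record["ip"], set()).add(entry["port"])
    return open_ports


def scan(targets, ports, output_path="scan.json"):
    """Run Masscan and return the open ports per host.

    Needs Masscan installed and usually elevated privileges.
    """
    subprocess.run(build_masscan_command(targets, ports, output_path), check=True)
    with open(output_path, encoding="utf-8") as report:
        return parse_open_ports(report.read())
```

Comparing the result of `scan` with the ports a rule allows gives the reachability indicator: a port that the firewall allows but that does not show up as open on the host is worth a closer look.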

The Final Product

Well then, I have written enough preamble, so now let me show you the end result!

This is an example of the information one can see for a given rule. The yellow boxes are simply masks for sensitive information. The LFMS Ticket section is for tickets created within LFMS, whereas the REQ section is for tickets submitted to LDB. After that we have IP information for every host — what service it is used for, the name of the administrator, the server's location, port status and Risk Assessment status. Finally, we have some information about the security rule object in Palo Alto that enables this rule.

One thing to note here is that even though there are multiple rules with the name TestRiskScoreRule, this is actually a single rule in Palo Alto, which is defined on multiple addresses and ports.

Closing Thoughts

During the first week of my internship, I got familiar with the project and worked on some very minor issues. To be completely frank, I was slightly disappointed in the simplicity of the work I was doing at first, but fortunately my team, and more specifically my mentor, Duckki Leem, realized this and gave me the task of working on the Service Risk Score component. Even so, I had zero previous experience with networks and had a lot to learn about how they are constructed and protected, how LINE's network is structured and managed, and what LFMS's purpose is. It took me quite a bit of time to realize the importance and potential of what LFMS could be. And now I am glad that I worked on this project. I had a wonderful experience working with my colleagues and I sincerely hope that I brought at least as much value to them as they brought to me.
