We are IG and JN, and we work in the mobile VoIP development department at LINE.
In this post, we would like to talk about what we have done to improve the call quality on the newly launched Popcorn Buzz.
Popcorn Buzz is a group call service that lets up to 200 users join the same call over the internet. Users gather in a virtual space like the one you see below, where they can talk to each other or even hold large-scale conference calls.
LINE already has a one-on-one VoIP service called “LINE Free Call.” As with any VoIP service, sound quality is one of the top priorities for Popcorn Buzz. While Popcorn Buzz incorporates technology that can also improve sound quality on LINE, it needs extra care because a group call involves factors that are not present in a one-on-one call.
One of the most important factors that determine call quality is the environment the callers are in, because the amount of noise depends on where a call is made. For example, users may call from home, where it is quiet, or from a cafe, where it is noisier. The recipient of a call made from a noisy place like a cafe will find it difficult to understand what the other person is saying. The impact of this noise is even larger on Popcorn Buzz, because it is designed for more than two people per call. If even one participant is in a noisy environment, the call quality suffers for everyone else.
Another factor that may affect call quality is the wide variety of devices that could be used in a call. Some devices may cause echoing, which occurs when sound from the speaker travels back into the microphone and is heard as an “echo.” Echoing is more common on devices that have sensitive microphones, loud speakers, and a short distance between the speaker and microphone. It happens mostly when calling on the handset or in speakerphone mode, as the sound coming from the speaker is much louder than usual.
Some recent mobile devices have built-in functions to reduce echoing, but their effectiveness varies among devices and manufacturers. In some cases, it may even be better to disable the device’s echo reduction altogether, as it can be unreliable. The truth of the matter is that there is no surefire way to guarantee perfect echo reduction. If a participant owns a device that is susceptible to echoing, it will inevitably affect the call quality for everyone else involved.
Another thing that could negatively affect the call quality of Popcorn Buzz is the characteristics of the microphone on each device. Some devices have sensitive microphones that pick up the user’s voice loud and clear, while others do not. Popcorn Buzz normalizes user voices so that every participant of a call has the same loudness, reducing any negative effects caused by the volume gap.
To deal with the problems mentioned above, Popcorn Buzz implements a software module called Voice Quality Enhancement (VQE).
Implementing VQE to Improve Call Quality
Call environment modeling
The call environment of Popcorn Buzz described in the introduction is depicted in < Figure 2 > below. The signals that enter your own device’s microphone include your own voice (Near-End Signal), the voice of the other party (Far-End Signal) played back through the speaker, and background noise. In other words, echoes and noise enter the device together with your voice.
All the signals captured by the microphone will look like < Figure 3 > when they are combined.
However, the pure signals (your own voice, echo, and noise) cannot be observed individually in an actual environment. Only the signal output by the speaker and the signal captured by the microphone are known. This is why the amount of echo and noise in the captured signal must be estimated before it can be removed.
In order to effectively estimate the amount of echo and noise, Popcorn Buzz classifies signals based on how similar the signal to be output from the speaker is to the signal captured by the microphone. The signals are divided into 4 categories as you can see in < Table 1 > below.
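One common way to write this situation down is shown below. The symbols are illustrative assumptions, not notation from the Popcorn Buzz implementation: $x(n)$ is the signal sent to the speaker, $s(n)$ the near-end voice, $v(n)$ the background noise, and $h$ the unknown echo path from speaker to microphone, assumed here to be linear with $K$ taps. The captured signal $y(n)$ is then:

```latex
y(n) = s(n) + \underbrace{\sum_{k=0}^{K-1} h(k)\, x(n-k)}_{\text{echo } e(n)} + v(n)
```

Only $x(n)$ and $y(n)$ can be measured, so the echo path $h$ and the level of $v(n)$ have to be estimated from them.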
<table>
<tr> <th style="text-align:left"> Category </th> <th> Similarity </th> <th> Audio </th> <th> Description </th> </tr>
<tr> <td style="text-align:left"> 1 </td> <td> Low </td> <td> O </td> <td> Near-End Only talk section (Sound of your own voice only) </td> </tr>
<tr> <td style="text-align:left"> 2 </td> <td> High </td> <td> O </td> <td> Far-End Only talk section (Sound of another user, echo) </td> </tr>
<tr> <td style="text-align:left"> 3 </td> <td> Moderate </td> <td> O </td> <td> Double talk section (Sound of your own voice with another user talking simultaneously) </td> </tr>
<tr> <td style="text-align:left"> 4 </td> <td> Low </td> <td> X </td> <td> Silence section (Section where no one was speaking, noise) </td> </tr>
</table>
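The classification in < Table 1 > could be sketched roughly as follows. This is a simplified illustration, not the actual Popcorn Buzz logic: it assumes similarity is measured as a zero-lag normalized correlation over one frame (a real echo is delayed, so a practical detector would have to account for that), and the thresholds `audio_floor`, `low`, and `high` are made-up values.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def similarity(far, mic):
    """Absolute normalized correlation of two equal-length frames, in 0..1."""
    num = sum(f * m for f, m in zip(far, mic))
    den = math.sqrt(sum(f * f for f in far) * sum(m * m for m in mic))
    return abs(num) / den if den > 0 else 0.0

def classify(far, mic, audio_floor=0.01, low=0.3, high=0.8):
    """Return the Table 1 category (1 to 4) for one frame.

    All thresholds are hypothetical values chosen for illustration.
    """
    sim = similarity(far, mic)
    if sim >= high:
        return 2  # high similarity: far-end only (echo)
    if sim >= low:
        return 3  # moderate similarity: double talk
    # low similarity: near-end only if there is audio, otherwise silence
    return 1 if rms(mic) >= audio_floor else 4
```

The later modules can then use the returned category to decide when to update their estimates.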
VQE has modules that remove elements that lower call quality, such as noise and echoes, and it also has modules that actively improve call quality. Among them are the Acoustic Echo Canceller (AEC) for echo reduction, the Noise Suppressor (NS) for noise suppression, and the Automatic Gain Controller (AGC) for audio normalization.
< Figure 4 > below depicts the steps VQE takes to process audio signals. The sound coming from the microphone on the bottom left goes through the AEC/NS/AGC modules in order.
Each module treats the captured signal in different ways depending on which of the 4 categories of < Table 1 > the captured signal falls in.
Acoustic Echo Cancellation (AEC)
As seen in < Figure 3 >, the signal captured by the microphone contains echo and noise. AEC is designed to eliminate the echo and leave only the user’s voice and noise. In order to remove echo, the amount of echo must be estimated. As you can see in < Figure 5 >, AEC determines the amount of echo by estimating how the signal output from the speaker ends up being captured by the microphone. The result of modeling this process is called the Transfer Function.
In order to estimate the Transfer Function, AEC classifies the signals by how similar the signal coming from the speaker is to the signal captured by the microphone, using the categories seen in < Table 1 >. The Transfer Function is refreshed only on category 2 signals, where the captured signal is mostly echo, and not on category 1, 3, and 4 signals. The amount of echo is estimated by multiplying this Transfer Function with the signal to be output by the speaker. Once the estimated echo is subtracted from the signal coming through the microphone, only the user’s voice and noise remain, as you can see in < Figure 5 >.
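A minimal sketch of this idea is shown below. It assumes a time-domain NLMS (normalized least mean squares) adaptive filter, which is a common textbook approach and not necessarily what Popcorn Buzz uses. The class name, `taps`, and `mu` are hypothetical. The filter taps play the role of the Transfer Function, and they are only updated when the caller passes `adapt=True`, which corresponds to category 2 input.

```python
import random

class EchoCanceller:
    """Toy NLMS echo canceller; `taps` and `mu` are illustrative assumptions."""

    def __init__(self, taps=8, mu=0.5, eps=1e-8):
        self.w = [0.0] * taps  # estimated transfer function (FIR taps)
        self.x = [0.0] * taps  # recent far-end samples, newest first
        self.mu = mu
        self.eps = eps

    def process(self, far_sample, mic_sample, adapt):
        """Return the mic sample with the estimated echo subtracted.

        `adapt` should be True only for far-end-only (category 2) input.
        """
        self.x = [far_sample] + self.x[:-1]
        echo_est = sum(w * x for w, x in zip(self.w, self.x))
        err = mic_sample - echo_est
        if adapt:
            norm = sum(x * x for x in self.x) + self.eps
            step = self.mu * err / norm
            self.w = [w + step * x for w, x in zip(self.w, self.x)]
        return err

def demo(n=3000, path=(0.0, 0.6, 0.3)):
    """Feed synthetic far-end noise through a made-up echo path and adapt."""
    rng = random.Random(0)
    aec = EchoCanceller()
    hist = [0.0] * len(path)
    residual = 0.0
    for _ in range(n):
        far = rng.uniform(-1.0, 1.0)
        hist = [far] + hist[:-1]
        mic = sum(h * x for h, x in zip(path, hist))  # echo only, no near-end voice
        residual = aec.process(far, mic, adapt=True)
    return aec.w, residual
```

In the demo the microphone hears nothing but echo, so the taps should converge toward the synthetic echo path and the residual should shrink toward zero. Freezing adaptation during double talk keeps the near-end voice from corrupting the estimate.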
Noise Suppression (NS)
Even after AEC removes echo, the user’s voice is still mixed with noise. NS uses technology that can remove the noise coming from the user’s environment to improve the clarity of speech. In order to remove the noise, the amount of noise must be estimated as well. NS will estimate the amount of noise from category 4 signals seen in < Table 1 >. Once the estimated amount of noise is removed from the captured signal, only the user’s voice will remain as you can see in < Figure 6 >.
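The estimate-then-remove step could look roughly like the sketch below. It is deliberately simplified: it tracks a single full-band noise power and applies one gain per frame, whereas a practical suppressor would typically work per frequency band. The class name and the `smoothing` and `floor` parameters are assumptions for illustration, not Popcorn Buzz values.

```python
class NoiseSuppressor:
    """Toy full-band noise suppressor; parameters are illustrative assumptions."""

    def __init__(self, smoothing=0.9, floor=0.1):
        self.noise_power = 0.0      # running estimate of the noise power
        self.smoothing = smoothing  # closer to 1.0 means a slower-moving estimate
        self.floor = floor          # minimum gain, limits over-suppression

    @staticmethod
    def power(frame):
        """Mean power of one frame of samples."""
        return sum(x * x for x in frame) / len(frame)

    def process(self, frame, is_silence):
        """Return the frame scaled by a Wiener-style gain.

        The noise estimate is refreshed only on silence (category 4) frames.
        """
        p = self.power(frame)
        if is_silence:
            a = self.smoothing
            self.noise_power = a * self.noise_power + (1.0 - a) * p
        if p <= 0.0:
            return list(frame)
        gain = max(self.floor, 1.0 - self.noise_power / p)
        return [gain * x for x in frame]
```

Frames that are mostly noise are attenuated down to the floor, while frames whose power is well above the noise estimate pass through almost unchanged.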
Automatic Gain Controller (AGC)
The level of the audio captured by the microphone differs depending on the user’s current state (environment, emotion, distance from the microphone, etc.). AGC normalizes these varying volumes by automatically changing the amplification. In other words, AGC makes quiet sounds louder and loud sounds quieter, as you can see in < Figure 7 >. Without AGC, the user would have to manually adjust their device’s volume frequently.
While only one AGC is depicted in < Figure 4 > for simplicity, there is actually one more. This extra AGC compensates for the speaker output on each device by normalizing the incoming audio (RX audio stream).
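As a rough sketch of how such gain control can work, the example below steers each frame toward a target RMS level and smooths the gain so it does not jump between frames. The class name and the `target_rms`, `max_gain`, and `rate` values are hypothetical, not the values used in Popcorn Buzz.

```python
import math

class AutoGain:
    """Toy automatic gain controller; all parameters are illustrative assumptions."""

    def __init__(self, target_rms=0.1, max_gain=10.0, rate=0.2):
        self.gain = 1.0
        self.target_rms = target_rms
        self.max_gain = max_gain  # cap so silence is not amplified without limit
        self.rate = rate          # fraction of the gap closed per frame

    def process(self, frame):
        """Return the frame scaled toward the target level, clipped to [-1, 1]."""
        level = math.sqrt(sum(x * x for x in frame) / len(frame))
        if level > 1e-6:  # leave the gain untouched on near-silent frames
            desired = min(self.max_gain, self.target_rms / level)
            self.gain += self.rate * (desired - self.gain)
        return [max(-1.0, min(1.0, self.gain * x)) for x in frame]
```

The same kind of controller could be applied to either the captured (TX) stream or the received (RX) stream.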
VQE Tuning On Popcorn Buzz
On services like Popcorn Buzz, one participant can affect the call quality of the entire room. With sound from many users mixed into one stream, one participant’s echoes and noise can negatively impact the overall quality of the call. That is why VQE is much more important on Popcorn Buzz than it is on one-on-one VoIP services. VQE will be further tuned for stronger AEC and NS. This can cause side effects, however, as there is a trade-off between the suppression strength of AEC and NS and the distortion of near-end signals. In other words, strong suppression can remove echoes and noise, but it can also distort the near-end signal.
To compensate for the distortion, Popcorn Buzz applies suppression based on which category the captured audio belongs to. For example, category 1 signals from < Table 1 > contain the most important information, so only noise is removed in order to reduce distortion as much as possible. For category 2 and 4 signals, suppression is stronger, as these are mostly echo and noise. Category 3 signals contain important information along with echo and noise, so echo and noise are suppressed as much as possible while keeping distortion to a minimum.
Providing a consistent user experience is very important, especially on a service like Popcorn Buzz, where more than two users are talking at once. With the numerous devices available on the market, there are many obstacles to providing a consistent experience, because countless audio characteristics (microphone and speaker specifications) must be accounted for. While Popcorn Buzz is already available for download, there are still parts we can improve and more features we can add. Our goal is to give all of our users a more pleasant experience with our app.
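The category-dependent tuning can be pictured as a small lookup table, as in the sketch below. The numbers are invented for illustration, since the real tuning values are not published, and `attenuate` is only a toy stand-in for the actual AEC and NS processing.

```python
from typing import NamedTuple

class Strength(NamedTuple):
    echo: float   # 0.0 = no echo suppression, 1.0 = maximum
    noise: float  # 0.0 = no noise suppression, 1.0 = maximum

# Hypothetical strengths per Table 1 category.
STRENGTH_BY_CATEGORY = {
    1: Strength(echo=0.0, noise=0.4),  # near-end only: remove noise, avoid distortion
    2: Strength(echo=1.0, noise=1.0),  # far-end only: mostly echo, suppress hard
    3: Strength(echo=0.6, noise=0.6),  # double talk: balance suppression and distortion
    4: Strength(echo=1.0, noise=1.0),  # silence: mostly noise, suppress hard
}

def attenuate(frame, strength, max_cut=0.9):
    """Scale a frame down according to the stronger of the two strengths."""
    gain = 1.0 - max_cut * max(strength.echo, strength.noise)
    return [gain * x for x in frame]
```

With these made-up values, a category 1 frame keeps most of its level while a category 2 or 4 frame is cut much more heavily.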