Have you ever faced a situation where you need to compute the same metrics, using the same or similar formulas, over and over again? How do you usually handle those repetitive tasks? As many engineers in our field know, dplyr is one of the most elegant and efficient data wrangling grammars, offering tangible solutions to most of our day-to-day data manipulation challenges. In this post we would like to introduce mmetrics, a package developed by my colleagues and me.
Let's start by installing the package. You can install the stable release of mmetrics from CRAN with the following command:
install.packages("mmetrics")
Or you can install development versions from GitHub by typing the following command:
# install.packages("remotes")
remotes::install_github("y-bar/mmetrics")
Now, let's jump into some real world examples. First, let's load dummy data from the mmetrics package for this example.
library("dplyr")
library("mmetrics")
df <- mmetrics::dummy_data
df
#> gender age cost impression click conversion
#> 1 M 10 51 101 0 0
#> 2 F 20 52 102 3 1
#> 3 M 30 53 103 6 2
#> 4 F 40 54 104 9 3
#> 5 M 50 55 105 12 4
#> 6 F 60 56 106 15 5
#> 7 M 70 57 107 18 6
#> 8 F 80 58 108 21 7
#> 9 M 90 59 109 24 8
#> 10 F 100 60 110 27 9
Most data scientists are reminded daily that computing the same metrics over different groupings is an unavoidable task. For example, below is a computation of metrics grouped by gender.
## Summarize by gender
df_summarized_gender <- df %>%
  group_by(gender) %>%
  summarize(
    cost       = sum(cost),
    impression = sum(impression),
    click      = sum(click),
    conversion = sum(conversion),
    ctr        = sum(click) / sum(impression),
    cvr        = sum(conversion) / sum(click),
    ctvr       = sum(conversion) / sum(impression),
    cpa        = sum(cost) / sum(conversion),
    cpc        = sum(cost) / sum(click),
    ecpm       = sum(cost) / sum(impression) * 1000
  )
df_summarized_gender
#> # A tibble: 2 x 11
#> gender cost impression click conversion ctr cvr ctvr cpa cpc ecpm
#> <fct> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 F 280 530 75 25 0.142 0.333 0.0472 11.2 3.73 528.
#> 2 M 275 525 60 20 0.114 0.333 0.0381 13.8 4.58 524.
And the same metrics grouped by age.
## Summarize by age
df_summarized_age <- df %>%
  group_by(age) %>%
  summarize(
    cost       = sum(cost),
    impression = sum(impression),
    click      = sum(click),
    conversion = sum(conversion),
    ctr        = sum(click) / sum(impression),
    cvr        = sum(conversion) / sum(click),
    ctvr       = sum(conversion) / sum(impression),
    cpa        = sum(cost) / sum(conversion),
    cpc        = sum(cost) / sum(click),
    ecpm       = sum(cost) / sum(impression) * 1000
  )
df_summarized_age
#> # A tibble: 10 x 11
#> age cost impression click conversion ctr cvr ctvr cpa cpc ecpm
#> <dbl> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10 51 101 0 0 0 NaN 0 Inf Inf 505.
#> 2 20 52 102 3 1 0.0294 0.333 0.00980 52 17.3 510.
#> 3 30 53 103 6 2 0.0583 0.333 0.0194 26.5 8.83 515.
#> 4 40 54 104 9 3 0.0865 0.333 0.0288 18 6 519.
#> 5 50 55 105 12 4 0.114 0.333 0.0381 13.8 4.58 524.
#> 6 60 56 106 15 5 0.142 0.333 0.0472 11.2 3.73 528.
#> 7 70 57 107 18 6 0.168 0.333 0.0561 9.5 3.17 533.
#> 8 80 58 108 21 7 0.194 0.333 0.0648 8.29 2.76 537.
#> 9 90 59 109 24 8 0.220 0.333 0.0734 7.38 2.46 541.
#> 10 100 60 110 27 9 0.245 0.333 0.0818 6.67 2.22 545.
# Add more analysis axes ... plot ... continue
# ...
Run Aggregating and Row-wise Computations Without Repetition
To play around with data in the above fashion, we have to write nearly identical chunks of code multiple times in a row. The repeated part stays repeated only because we group the data by a different variable each time to compute the same metrics. In this case, even dplyr can't save us from the repetition.
This repeated code is not only hard to read, it also lacks elegance, and my colleagues and I would often grumble about how bothersome it is to maintain. Instead of bothering the "R God" Hadley Wickham, who surely has a full plate of tasks already, we decided to develop our own package to solve this lack-of-beauty-and-efficiency problem. That's how the mmetrics package came to be, and it lets us follow our "less is more" minimalist aesthetic.
Now, let's see how mmetrics fits into a data scientist's daily tasks. First, let's define the metrics in the following style.
# Define metrics
metrics <- mmetrics::define(
  cost       = sum(cost),
  impression = sum(impression),
  click      = sum(click),
  conversion = sum(conversion),
  ctr        = sum(click) / sum(impression),
  cvr        = sum(conversion) / sum(click),
  ctvr       = sum(conversion) / sum(impression),
  cpa        = sum(cost) / sum(conversion),
  cpc        = sum(cost) / sum(click),
  ecpm       = sum(cost) / sum(impression) * 1000
)
Then, let's apply the metric functions to the raw data set we loaded in the previous section.
## Tabulation yeah!
df_summarized_gender <- mmetrics::add(df, gender, metrics = metrics)
df_summarized_age <- mmetrics::add(df, age, metrics = metrics)
Ta-dah! The results are identical to the output computed the traditional way, but with significantly less code, and a much better aesthetic to boot!
df_summarized_gender
#> # A tibble: 2 x 11
#> gender cost impression click conversion ctr cvr ctvr cpa cpc ecpm
#> <fct> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 F 280 530 75 25 0.142 0.333 0.0472 11.2 3.73 528.
#> 2 M 275 525 60 20 0.114 0.333 0.0381 13.8 4.58 524.
df_summarized_age
#> # A tibble: 10 x 11
#> age cost impression click conversion ctr cvr ctvr cpa cpc ecpm
#> <dbl> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10 51 101 0 0 0 NaN 0 Inf Inf 505.
#> 2 20 52 102 3 1 0.0294 0.333 0.00980 52 17.3 510.
#> 3 30 53 103 6 2 0.0583 0.333 0.0194 26.5 8.83 515.
#> 4 40 54 104 9 3 0.0865 0.333 0.0288 18 6 519.
#> 5 50 55 105 12 4 0.114 0.333 0.0381 13.8 4.58 524.
#> 6 60 56 106 15 5 0.142 0.333 0.0472 11.2 3.73 528.
#> 7 70 57 107 18 6 0.168 0.333 0.0561 9.5 3.17 533.
#> 8 80 58 108 21 7 0.194 0.333 0.0648 8.29 2.76 537.
#> 9 90 59 109 24 8 0.220 0.333 0.0734 7.38 2.46 541.
#> 10 100 60 110 27 9 0.245 0.333 0.0818 6.67 2.22 545.
The basic concept is that we are forging a universal key that opens many doors: define the metrics once, and you can apply them under various scenarios with different ways of wrangling and grouping the data. Isn't this revised code easier on the eyes?
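For example, the same metrics object can drive a grouping over two keys at once. Below is a minimal sketch, assuming mmetrics::add() accepts multiple grouping keys the same way it accepted a single one above (here with a subset of the metrics for brevity):

```r
library("dplyr")
library("mmetrics")

df <- mmetrics::dummy_data

# Define the metrics once (a subset of the full definition above)
metrics <- mmetrics::define(
  ctr = sum(click) / sum(impression),
  cvr = sum(conversion) / sum(click)
)

# Reuse the single definition for a two-key grouping
df_summarized_gender_age <- mmetrics::add(df, gender, age, metrics = metrics)
df_summarized_gender_age
```

The metrics definition stays untouched; only the grouping keys change between calls.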
Apply One-time Defined R Functions to Various Cases
If you want to shift between aggregating and row-wise computations in dplyr, you have to switch back and forth between dplyr::summarize() and dplyr::mutate().
## Case of dplyr::summarize()
df_summarized_age <- df %>%
  group_by(age) %>%
  summarize(
    ctr = sum(click) / sum(impression),
    cvr = sum(conversion) / sum(click)
  )
## Case of dplyr::mutate()
df_mutate <- df %>%
  mutate(
    ctr  = click / impression,
    cvr  = conversion / click,
    ctvr = conversion / impression,
    cpa  = cost / conversion,
    cpc  = cost / click,
    ecpm = cost / impression * 1000
  )
df_mutate
#> # A tibble: 10 x 12
#> gender age cost impression click conversion ctr cvr ctvr cpa cpc ecpm
#> <fct> <dbl> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 M 10 51 101 0 0 0 NaN 0 Inf Inf 505.
#> 2 F 20 52 102 3 1 0.0294 0.333 0.00980 52 17.3 510.
#> 3 M 30 53 103 6 2 0.0583 0.333 0.0194 26.5 8.83 515.
#> 4 F 40 54 104 9 3 0.0865 0.333 0.0288 18 6 519.
#> 5 M 50 55 105 12 4 0.114 0.333 0.0381 13.8 4.58 524.
#> 6 F 60 56 106 15 5 0.142 0.333 0.0472 11.2 3.73 528.
#> 7 M 70 57 107 18 6 0.168 0.333 0.0561 9.5 3.17 533.
#> 8 F 80 58 108 21 7 0.194 0.333 0.0648 8.29 2.76 537.
#> 9 M 90 59 109 24 8 0.220 0.333 0.0734 7.38 2.46 541.
#> 10 F 100 60 110 27 9 0.245 0.333 0.0818 6.67 2.22 545.
One advantage of mmetrics over plain dplyr is that when your work mixes aggregating and row-wise computations, you do not have to write the same formula multiple times just for the sake of grouping. The package provides mmetrics::disaggregate(), which lets you apply your one-time-defined metrics to both cases: it strips the outermost aggregation function from each metric definition and returns row-wise (disaggregated) metrics. Here is an example of how it works.
# Original metrics. sum() is used for this metrics
metrics
#> <list_of<quosure>>
#>
#> $cost
#> <quosure>
#> expr: ^sum(cost)
#> env: global
#>
#> $ctr
#> <quosure>
#> expr: ^sum(click) / sum(impression)
#> env: global
#> ...
Now, let's remove the aggregation.
# Disaggregate metrics!
metrics_disaggregated <- mmetrics::disaggregate(metrics)
# Woo! sum() are removed!
metrics_disaggregated
#> $cost
#> <quosure>
#> expr: ^cost
#> env: global
#>
#> $ctr
#> <quosure>
#> expr: ^click / impression
#> env: global
#> ...
As you can see, this is far easier on the eyes than multiple repetitive code chunks. Now you can apply the disaggregated metrics to the data; the output matches df_mutate.
df_disaggregated <- mmetrics::add(df, metrics = metrics_disaggregated, summarize = FALSE)
df_disaggregated
#> # A tibble: 10 x 12
#> gender age cost impression click conversion ctr cvr ctvr cpa cpc ecpm
#> <fct> <dbl> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 M 10 51 101 0 0 0 NaN 0 Inf Inf 505.
#> 2 F 20 52 102 3 1 0.0294 0.333 0.00980 52 17.3 510.
#> 3 M 30 53 103 6 2 0.0583 0.333 0.0194 26.5 8.83 515.
#> 4 F 40 54 104 9 3 0.0865 0.333 0.0288 18 6 519.
#> 5 M 50 55 105 12 4 0.114 0.333 0.0381 13.8 4.58 524.
#> 6 F 60 56 106 15 5 0.142 0.333 0.0472 11.2 3.73 528.
#> 7 M 70 57 107 18 6 0.168 0.333 0.0561 9.5 3.17 533.
#> 8 F 80 58 108 21 7 0.194 0.333 0.0648 8.29 2.76 537.
#> 9 M 90 59 109 24 8 0.220 0.333 0.0734 7.38 2.46 541.
#> 10 F 100 60 110 27 9 0.245 0.333 0.0818 6.67 2.22 545.
Conduct Row-wise Metrics Computation
Another useful feature of the library is that you can use these metrics with dplyr::mutate() for row-wise metrics computation.
dplyr::mutate(df, !!!metrics_disaggregated)
#> gender age cost impression click conversion ctr
#> 1 M 10 51 101 0 0 0.00000000
#> 2 F 20 52 102 3 1 0.02941176
#> 3 M 30 53 103 6 2 0.05825243
#> 4 F 40 54 104 9 3 0.08653846
#> 5 M 50 55 105 12 4 0.11428571
#> 6 F 60 56 106 15 5 0.14150943
#> 7 M 70 57 107 18 6 0.16822430
#> 8 F 80 58 108 21 7 0.19444444
#> 9 M 90 59 109 24 8 0.22018349
#> 10 F 100 60 110 27 9 0.24545455
You can do the same computation with mmetrics::gmutate(), defined in our package. In this case, you do not need to write the !!! (bang-bang-bang) splicing operator explicitly.
mmetrics::gmutate(df, metrics = metrics_disaggregated)
#> # A tibble: 10 x 7
#> gender age cost impression click conversion ctr
#> <fct> <dbl> <int> <int> <dbl> <int> <dbl>
#> 1 M 10 51 101 0 0 0
#> 2 F 20 52 102 3 1 0.0294
#> 3 M 30 53 103 6 2 0.0583
#> 4 F 40 54 104 9 3 0.0865
#> 5 M 50 55 105 12 4 0.114
#> 6 F 60 56 106 15 5 0.142
#> 7 M 70 57 107 18 6 0.168
#> 8 F 80 58 108 21 7 0.194
#> 9 M 90 59 109 24 8 0.220
#> 10 F 100 60 110 27 9 0.245
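Conversely, because mmetrics::define() returns an ordinary list of quosures (as the printed metrics object above shows), the aggregating definition also splices into a plain dplyr pipeline with !!!, so a one-time definition keeps working alongside any other dplyr verbs. A small sketch using a subset of the metrics defined earlier:

```r
library("dplyr")
library("mmetrics")

df <- mmetrics::dummy_data

# Aggregating metrics, defined once
metrics <- mmetrics::define(
  ctr = sum(click) / sum(impression),
  cvr = sum(conversion) / sum(click)
)

# Splice the quosures into an ordinary grouped summarize
df %>%
  group_by(gender) %>%
  summarize(!!!metrics)
```

This gives the same ctr and cvr columns as the df_summarized_gender example in the first section, without re-typing the formulas.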
Closing Words
The mmetrics package lifts a burden from data scientists' shoulders during tabulation work! If you want to get started with mmetrics, we recommend reading the vignettes first.
Please enjoy the package and have a wonderful day! We look forward to the amazing things you'll do with mmetrics!
Credits
mmetrics is brought to you by team y-bar:
- Shinichi Takayanagi: He is interested in using data science and machine learning to contribute to business value. Now, as a B2B Data Engineering Group Manager, he is bustling about trying to achieve that goal.
- Takahiro Yoshinaga: Through research on elementary particle theory, he acquired a strong mathematical and analytical background. Now, he is developing analytics tools that are easy to use even for people who are not data scientists.
- Yanjin Li: Having lived in cities all over the world, from mainland China to Seattle, Rome, NYC, and now Tokyo, she has developed an international perspective that she applies to business problems, leveraging data science to deliver tangible solutions.
- Chaerim Yeo: His current interest is creating value through big data processing and machine learning, which is why he works across a wide range of fields, from data analysis to development.
