Have you ever faced a situation where you need to compute the same metrics, using the same or similar formulas, over and over again? How do you usually handle those repetitive tasks? As many engineers in our field know, dplyr is one of the most elegant and efficient data wrangling grammars, offering tangible solutions to most of our day-to-day data manipulation challenges. In this post we would like to introduce mmetrics, a package developed by my colleagues and me.
Let's start by installing the package. You can install the stable release of mmetrics from CRAN with the following command:
install.packages("mmetrics")
Or you can install development versions from GitHub by typing the following command:
# install.packages("remotes")
remotes::install_github("y-bar/mmetrics")
Now, let's jump into some real world examples. First, let's load dummy data from the mmetrics package for this example.
library("dplyr")
library("mmetrics")
df <- mmetrics::dummy_data
df
#> gender age cost impression click conversion
#> 1 M 10 51 101 0 0
#> 2 F 20 52 102 3 1
#> 3 M 30 53 103 6 2
#> 4 F 40 54 104 9 3
#> 5 M 50 55 105 12 4
#> 6 F 60 56 106 15 5
#> 7 M 70 57 107 18 6
#> 8 F 80 58 108 21 7
#> 9 M 90 59 109 24 8
#> 10 F 100 60 110 27 9
Most data scientists are reminded daily that computing the same metrics over different groupings is an unavoidable task. For example, below is a computation of metrics grouped by gender.
## Summarize by gender
df_summarized_gender <- df %>%
  group_by(gender) %>%
  summarize(
    cost       = sum(cost),
    impression = sum(impression),
    click      = sum(click),
    conversion = sum(conversion),
    ctr        = sum(click) / sum(impression),
    cvr        = sum(conversion) / sum(click),
    ctvr       = sum(conversion) / sum(impression),
    cpa        = sum(cost) / sum(conversion),
    cpc        = sum(cost) / sum(click),
    ecpm       = sum(cost) / sum(impression) * 1000
  )
df_summarized_gender
#> # A tibble: 2 x 11
#> gender cost impression click conversion ctr cvr ctvr cpa cpc ecpm
#> <fct> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 F 280 530 75 25 0.142 0.333 0.0472 11.2 3.73 528.
#> 2 M 275 525 60 20 0.114 0.333 0.0381 13.8 4.58 524.
And the same metrics grouped by age.
## Summarize by age
df_summarized_age <- df %>%
  group_by(age) %>%
  summarize(
    cost       = sum(cost),
    impression = sum(impression),
    click      = sum(click),
    conversion = sum(conversion),
    ctr        = sum(click) / sum(impression),
    cvr        = sum(conversion) / sum(click),
    ctvr       = sum(conversion) / sum(impression),
    cpa        = sum(cost) / sum(conversion),
    cpc        = sum(cost) / sum(click),
    ecpm       = sum(cost) / sum(impression) * 1000
  )
df_summarized_age
#> # A tibble: 10 x 11
#> age cost impression click conversion ctr cvr ctvr cpa cpc ecpm
#> <dbl> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10 51 101 0 0 0 NaN 0 Inf Inf 505.
#> 2 20 52 102 3 1 0.0294 0.333 0.00980 52 17.3 510.
#> 3 30 53 103 6 2 0.0583 0.333 0.0194 26.5 8.83 515.
#> 4 40 54 104 9 3 0.0865 0.333 0.0288 18 6 519.
#> 5 50 55 105 12 4 0.114 0.333 0.0381 13.8 4.58 524.
#> 6 60 56 106 15 5 0.142 0.333 0.0472 11.2 3.73 528.
#> 7 70 57 107 18 6 0.168 0.333 0.0561 9.5 3.17 533.
#> 8 80 58 108 21 7 0.194 0.333 0.0648 8.29 2.76 537.
#> 9 90 59 109 24 8 0.220 0.333 0.0734 7.38 2.46 541.
#> 10 100 60 110 27 9 0.245 0.333 0.0818 6.67 2.22 545.
# Add more analysis axes ... plot ... continue
# ...
Run Aggregating and Row-wise Computations Without Repetition
To play around with data in the above fashion, we have to write nearly identical chunks of code multiple times in a row. The repeated part stays repeated only because we group the data by a different variable each time to compute the same metrics. In this case, even dplyr can't save us from the repetition.
This repeated code is not only hard to read, it also lacks elegance, and my colleagues and I would often grumble about how bothersome it is to maintain. Instead of bothering the "R God" Hadley Wickham, who surely has a full plate of tasks already, we decided to develop our own package to solve this lack-of-beauty-and-efficiency problem. That's how the mmetrics package came to be, and it lets us follow our "less is more" minimalist aesthetic.
Now, let's see how mmetrics fits into a data scientist's daily tasks. First, let's define the metrics in the following style.
# Define metrics
metrics <- mmetrics::define(
  cost       = sum(cost),
  impression = sum(impression),
  click      = sum(click),
  conversion = sum(conversion),
  ctr        = sum(click) / sum(impression),
  cvr        = sum(conversion) / sum(click),
  ctvr       = sum(conversion) / sum(impression),
  cpa        = sum(cost) / sum(conversion),
  cpc        = sum(cost) / sum(click),
  ecpm       = sum(cost) / sum(impression) * 1000
)
Then, let's apply the metric functions to the raw data set we loaded in the previous section.
## Tabulation yeah!
df_summarized_gender <- mmetrics::add(df, gender, metrics = metrics)
df_summarized_age <- mmetrics::add(df, age, metrics = metrics)
Ta-dah! The results are identical to the output computed the traditional way, but with significantly less code, and a much better aesthetic to boot!
df_summarized_gender
#> # A tibble: 2 x 11
#> gender cost impression click conversion ctr cvr ctvr cpa cpc ecpm
#> <fct> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 F 280 530 75 25 0.142 0.333 0.0472 11.2 3.73 528.
#> 2 M 275 525 60 20 0.114 0.333 0.0381 13.8 4.58 524.
df_summarized_age
#> # A tibble: 10 x 11
#> age cost impression click conversion ctr cvr ctvr cpa cpc ecpm
#> <dbl> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10 51 101 0 0 0 NaN 0 Inf Inf 505.
#> 2 20 52 102 3 1 0.0294 0.333 0.00980 52 17.3 510.
#> 3 30 53 103 6 2 0.0583 0.333 0.0194 26.5 8.83 515.
#> 4 40 54 104 9 3 0.0865 0.333 0.0288 18 6 519.
#> 5 50 55 105 12 4 0.114 0.333 0.0381 13.8 4.58 524.
#> 6 60 56 106 15 5 0.142 0.333 0.0472 11.2 3.73 528.
#> 7 70 57 107 18 6 0.168 0.333 0.0561 9.5 3.17 533.
#> 8 80 58 108 21 7 0.194 0.333 0.0648 8.29 2.76 537.
#> 9 90 59 109 24 8 0.220 0.333 0.0734 7.38 2.46 541.
#> 10 100 60 110 27 9 0.245 0.333 0.0818 6.67 2.22 545.
The basic concept is that we are forging a universal key that opens many doors: define the metrics once, and you can apply them under various scenarios with different ways of wrangling and grouping the data. Isn't this revised code easier on the eyes?
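For example, the same metrics object can drive a grouping over two keys at once. Below is a minimal sketch, assuming mmetrics::add() accepts multiple grouping keys the same way it accepted a single one above (here with a subset of the metrics for brevity):

```r
library("dplyr")
library("mmetrics")

df <- mmetrics::dummy_data

# Define the metrics once (a subset of the full definition above)
metrics <- mmetrics::define(
  ctr = sum(click) / sum(impression),
  cvr = sum(conversion) / sum(click)
)

# Reuse the single definition for a two-key grouping
df_summarized_gender_age <- mmetrics::add(df, gender, age, metrics = metrics)
df_summarized_gender_age
```

The metrics definition stays untouched; only the grouping keys change between calls.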
Apply One-time Defined R Functions to Various Cases
If you want to shift between aggregating and row-wise computations in dplyr, you have to switch back and forth between dplyr::summarize() and dplyr::mutate().
## Case of dplyr::summarize()
df_summarized_age <- df %>%
  group_by(age) %>%
  summarize(
    ctr = sum(click) / sum(impression),
    cvr = sum(conversion) / sum(click)
  )
## Case of dplyr::mutate()
df_mutate <- df %>%
  mutate(
    ctr  = click / impression,
    cvr  = conversion / click,
    ctvr = conversion / impression,
    cpa  = cost / conversion,
    cpc  = cost / click,
    ecpm = cost / impression * 1000
  )
df_mutate
#> # A tibble: 10 x 12
#> gender age cost impression click conversion ctr cvr ctvr cpa cpc ecpm
#> <fct> <dbl> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 M 10 51 101 0 0 0 NaN 0 Inf Inf 505.
#> 2 F 20 52 102 3 1 0.0294 0.333 0.00980 52 17.3 510.
#> 3 M 30 53 103 6 2 0.0583 0.333 0.0194 26.5 8.83 515.
#> 4 F 40 54 104 9 3 0.0865 0.333 0.0288 18 6 519.
#> 5 M 50 55 105 12 4 0.114 0.333 0.0381 13.8 4.58 524.
#> 6 F 60 56 106 15 5 0.142 0.333 0.0472 11.2 3.73 528.
#> 7 M 70 57 107 18 6 0.168 0.333 0.0561 9.5 3.17 533.
#> 8 F 80 58 108 21 7 0.194 0.333 0.0648 8.29 2.76 537.
#> 9 M 90 59 109 24 8 0.220 0.333 0.0734 7.38 2.46 541.
#> 10 F 100 60 110 27 9 0.245 0.333 0.0818 6.67 2.22 545.
One advantage of mmetrics over plain dplyr is that when your work mixes aggregating and row-wise computations, you do not have to write the same formula multiple times just for the sake of grouping. The package provides mmetrics::disaggregate(), which lets you apply your one-time-defined metrics to both cases: it strips the outermost aggregation function from each metric definition and returns row-wise (disaggregated) metrics. Here is an example of how it works.
# Original metrics. sum() is used for this metrics
metrics
#> <list_of<quosure>>
#>
#> $cost
#> <quosure>
#> expr: ^sum(cost)
#> env: global
#>
#> $ctr
#> <quosure>
#> expr: ^sum(click) / sum(impression)
#> env: global
#> ...
Now, let's remove the aggregation.
# Disaggregate metrics!
metrics_disaggregated <- mmetrics::disaggregate(metrics)
# Woo! sum() are removed!
metrics_disaggregated
#> $cost
#> <quosure>
#> expr: ^cost
#> env: global
#>
#> $ctr
#> <quosure>
#> expr: ^click / impression
#> env: global
#> ...
As you can see, this is far easier on the eyes than multiple repetitive code chunks. Now you can apply the disaggregated metrics to the data; the output matches df_mutate.
df_disaggregated <- mmetrics::add(df, metrics = metrics_disaggregated, summarize = FALSE)
df_disaggregated
#> # A tibble: 10 x 12
#> gender age cost impression click conversion ctr cvr ctvr cpa cpc ecpm
#> <fct> <dbl> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 M 10 51 101 0 0 0 NaN 0 Inf Inf 505.
#> 2 F 20 52 102 3 1 0.0294 0.333 0.00980 52 17.3 510.
#> 3 M 30 53 103 6 2 0.0583 0.333 0.0194 26.5 8.83 515.
#> 4 F 40 54 104 9 3 0.0865 0.333 0.0288 18 6 519.
#> 5 M 50 55 105 12 4 0.114 0.333 0.0381 13.8 4.58 524.
#> 6 F 60 56 106 15 5 0.142 0.333 0.0472 11.2 3.73 528.
#> 7 M 70 57 107 18 6 0.168 0.333 0.0561 9.5 3.17 533.
#> 8 F 80 58 108 21 7 0.194 0.333 0.0648 8.29 2.76 537.
#> 9 M 90 59 109 24 8 0.220 0.333 0.0734 7.38 2.46 541.
#> 10 F 100 60 110 27 9 0.245 0.333 0.0818 6.67 2.22 545.
Conduct Row-wise Metrics Computation
Another useful feature of the library is that you can use these metrics with dplyr::mutate() for row-wise metrics computation.
dplyr::mutate(df, !!!metrics_disaggregated)
#> gender age cost impression click conversion ctr
#> 1 M 10 51 101 0 0 0.00000000
#> 2 F 20 52 102 3 1 0.02941176
#> 3 M 30 53 103 6 2 0.05825243
#> 4 F 40 54 104 9 3 0.08653846
#> 5 M 50 55 105 12 4 0.11428571
#> 6 F 60 56 106 15 5 0.14150943
#> 7 M 70 57 107 18 6 0.16822430
#> 8 F 80 58 108 21 7 0.19444444
#> 9 M 90 59 109 24 8 0.22018349
#> 10 F 100 60 110 27 9 0.24545455
You can do the same computation with mmetrics::gmutate(), defined in our package. In this case, you do not need to write the !!! (bang-bang-bang) splicing operator explicitly.
mmetrics::gmutate(df, metrics = metrics_disaggregated)
#> # A tibble: 10 x 7
#> gender age cost impression click conversion ctr
#> <fct> <dbl> <int> <int> <dbl> <int> <dbl>
#> 1 M 10 51 101 0 0 0
#> 2 F 20 52 102 3 1 0.0294
#> 3 M 30 53 103 6 2 0.0583
#> 4 F 40 54 104 9 3 0.0865
#> 5 M 50 55 105 12 4 0.114
#> 6 F 60 56 106 15 5 0.142
#> 7 M 70 57 107 18 6 0.168
#> 8 F 80 58 108 21 7 0.194
#> 9 M 90 59 109 24 8 0.220
#> 10 F 100 60 110 27 9 0.245
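Conversely, because mmetrics::define() returns an ordinary list of quosures (as the printed metrics object above shows), the aggregating definition also splices into a plain dplyr pipeline with !!!, so a one-time definition keeps working alongside any other dplyr verbs. A small sketch using a subset of the metrics defined earlier:

```r
library("dplyr")
library("mmetrics")

df <- mmetrics::dummy_data

# Aggregating metrics, defined once
metrics <- mmetrics::define(
  ctr = sum(click) / sum(impression),
  cvr = sum(conversion) / sum(click)
)

# Splice the quosures into an ordinary grouped summarize
df %>%
  group_by(gender) %>%
  summarize(!!!metrics)
```

This gives the same ctr and cvr columns as the df_summarized_gender example in the first section, without re-typing the formulas.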
Closing Words
The mmetrics package lifts a burden from data scientists' shoulders during tabulation work! If you want to get started with mmetrics, we recommend reading the vignettes first.
Please enjoy the package and have a wonderful day! We look forward to the amazing things you'll do with mmetrics!
Credits
mmetrics is brought to you by team y-bar:
- Shinichi Takayanagi: He is interested in using data science and machine learning to contribute to business value. Now, as a B2B Data Engineering Group Manager, he is bustling about trying to achieve that goal.
- Takahiro Yoshinaga: Through research on elementary particle theory, he acquired a strong mathematical and analytical background. Now, he is developing analytics tools that are easy to use even for people who are not data scientists.
- Yanjin Li: Having lived in cities all over the world, from mainland China to Seattle, Rome, NYC, and now Tokyo, she has developed an international perspective that she applies to business problems, leveraging data science to deliver tangible solutions.
- Chaerim Yeo: His current interest is creating value through big data processing and machine learning, which is why he works across a wide range of fields, from data analysis to development.
