statsTable

This document walks through how you might use statsTable on a data project. The function produces summary statistics for a “data rectangle” (a data.frame or data.table) that contains a variable to summarize and, optionally, a variable to group by for in-group summary calculations.

Sample data

For testing, we use the iris dataset that ships with R. It looks like this:

library(dplyr)  # provides %>%, glimpse(), and mutate()

iris %>% glimpse()
#> Rows: 150
#> Columns: 5
#> $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
#> $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
#> $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
#> $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
#> $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…

We tweak it by adding a random bin to each row and merging in mock weights for each categorical variable (Species and rand_bin). In general, double-check that you apply weights appropriately, and do not merge weights into a dataset that already contains them. If in doubt, discuss with your PM or other analysts on your project team.

set.seed(1234)

rand_wt <- data.frame(rand_bin = 1:3,
                      rand_wt = rnorm(3, mean = 1, sd = 0.333))

species_wt <- data.frame(Species = unique(iris$Species),
                         species_wt = rnorm(length(unique(iris$Species)), 
                                            mean = 1, sd = 0.333))

test_data <- iris %>% 
  mutate(rand_bin = sample(1:3, nrow(iris), replace = TRUE)) %>% 
  merge(species_wt,
        by = "Species") %>% 
  merge(rand_wt,
        by = "rand_bin")

# data.table copy of the same data, used to exercise the data.table method below
test_dt <- data.table::data.table(test_data)

test_data %>% glimpse()
#> Rows: 150
#> Columns: 8
#> $ rand_bin     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ Species      <fct> setosa, virginica, versicolor, versicolor, setosa, setosa…
#> $ Sepal.Length <dbl> 5.4, 6.7, 6.3, 6.1, 4.5, 5.2, 5.3, 5.6, 5.1, 5.1, 4.9, 4.…
#> $ Sepal.Width  <dbl> 3.9, 2.5, 3.3, 2.8, 2.3, 4.1, 3.7, 3.0, 2.5, 3.3, 3.6, 3.…
#> $ Petal.Length <dbl> 1.3, 5.8, 4.7, 4.0, 1.3, 1.5, 1.5, 4.1, 3.0, 1.7, 1.4, 1.…
#> $ Petal.Width  <dbl> 0.4, 1.8, 1.6, 1.3, 0.3, 0.1, 0.2, 1.3, 1.1, 0.5, 0.1, 0.…
#> $ species_wt   <dbl> 0.2188827, 1.1685166, 1.1428985, 1.1428985, 0.2188827, 0.…
#> $ rand_wt      <dbl> 0.5980471, 0.5980471, 0.5980471, 0.5980471, 0.5980471, 0.…
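
The caution about not merging weights twice can be enforced mechanically. Below is a minimal sketch in base R, assuming a hypothetical helper merge_weights_once (not part of statsTable or any package) and a made-up species_wt_demo table. It refuses to merge when the weight column is already present and checks that the merge kept every row.

```r
# Hypothetical helper, for illustration only: merge a weight table into a
# dataset, but refuse if the weight column(s) are already there.
merge_weights_once <- function(data, weights, by) {
  new_cols <- setdiff(names(weights), by)
  clash <- intersect(new_cols, names(data))
  if (length(clash) > 0) {
    stop("Weight column(s) already present: ",
         paste(clash, collapse = ", "), call. = FALSE)
  }
  merged <- merge(data, weights, by = by)
  # merge() does an inner join by default, so rows whose key has no weight
  # would be dropped silently; fail loudly instead.
  if (nrow(merged) != nrow(data)) {
    stop("Row count changed from ", nrow(data), " to ", nrow(merged),
         call. = FALSE)
  }
  merged
}

# Made-up weights, one per species (assumption, not the document's values).
species_wt_demo <- data.frame(Species = unique(iris$Species),
                              species_wt = c(0.5, 1, 1.5))

once <- merge_weights_once(iris, species_wt_demo, by = "Species")

# A second merge of the same weights is rejected; capture the error message.
twice <- tryCatch(merge_weights_once(once, species_wt_demo, by = "Species"),
                  error = function(e) conditionMessage(e))
```

Without a guard like this, a second merge() would quietly produce species_wt.x and species_wt.y columns, which makes it easy to pick up the wrong weights downstream.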

Sample table summary

Next, we apply statsTable to the test data to see how Sepal.Length varies by Species. The summary statistics are chosen with the stats argument, which takes a character vector of function names. For now, other than weighted.mean, the stats argument only accepts functions that produce a single output value.

statsTable(data = test_data,
           summVar = "Sepal.Length",
           groupVar = "Species",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd", "iqr", "cv"),
           totWeightVar = "species_wt", accuracy = 0.01,
           drop0trailing = TRUE) %>% 
    knitr::kable()
#> Warning in statsTable.data.frame(data = test_data, summVar = "Sepal.Length", : Using placeholder weights of 1 for in-group analysis
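| stat          | setosa | versicolor | virginica | Total |
|:--------------|:-------|:-----------|:----------|:------|
| n             | 50     | 50         | 50        | 150   |
| min           | 4.30   | 4.90       | 4.90      | 4.30  |
| max           | 5.80   | 7.00       | 7.90      | 7.90  |
| mean          | 5.01   | 5.94       | 6.59      | 5.84  |
| weighted.mean | 5.01   | 5.94       | 6.59      | 6.16  |
| median        | 5.00   | 5.90       | 6.50      | 5.80  |
| sd            | 0.35   | 0.52       | 0.64      | 0.83  |
| iqr           | 0.40   | 0.70       | 0.67      | 1.30  |
| cv            | 0.07   | 0.09       | 0.10      | 0.14  |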
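
The weighted.mean row can be checked by hand in base R. The sketch below hardcodes the three species_wt values shown in the glimpse() output earlier and assumes that the Total weighted.mean is an ordinary weighted.mean over all rows. The in-group values match the plain means because, per the warning, placeholder weights of 1 are used within groups.

```r
# Species weights copied from the glimpse() output (7 decimals as printed).
species_wt <- c(setosa = 0.2188827,
                versicolor = 1.1428985,
                virginica = 1.1685166)

# One weight per row, looked up by species name.
row_wt <- species_wt[as.character(iris$Species)]

# In-group: weights of 1 reduce weighted.mean to the plain mean.
in_group <- tapply(iris$Sepal.Length, iris$Species, mean)

# Total: weighted by species_wt, which pulls the result toward the
# more heavily weighted species (versicolor and virginica).
total_wm <- weighted.mean(iris$Sepal.Length, row_wt)

round(in_group, 2)
round(total_wm, 2)
```

Rounded to two decimals these come to 5.01, 5.94 and 6.59 for the groups and 6.16 for the total, in line with the table.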

statsTable(data = test_dt,
           summVar = "Sepal.Length",
           groupVar = "rand_bin",
           stats = c("n", "min", "max", "mean", "weighted.mean", "sd", "cv", "median", "mad"),
           totWeightVar = "rand_wt", accuracy = 0.01,
           drop0trailing = TRUE) %>% 
    knitr::kable()
#> Warning in statsTable.data.table(data = test_dt, summVar = "Sepal.Length", : Using placeholder weights of 1 for in-group analysis
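| stat          | 1    | 2    | 3    | Total |
|:--------------|:-----|:-----|:-----|:------|
| n             | 43   | 55   | 52   | 150   |
| min           | 4.40 | 4.40 | 4.30 | 4.30  |
| max           | 7.60 | 7.70 | 7.90 | 7.90  |
| mean          | 5.89 | 5.77 | 5.88 | 5.84  |
| weighted.mean | 5.89 | 5.77 | 5.88 | 5.84  |
| sd            | 0.74 | 0.82 | 0.91 | 0.83  |
| cv            | 0.12 | 0.14 | 0.16 | 0.14  |
| median        | 5.80 | 5.70 | 5.95 | 5.80  |
| mad           | 0.89 | 0.89 | 1.11 | 1.04  |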
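
For readers unfamiliar with the less common stats names, the Total column is consistent with the following base R definitions. This is a sketch under the assumption that "cv" means sd divided by mean, "iqr" means stats::IQR, and "mad" means stats::mad with its default scaling constant; the internals of statsTable are not shown here, so treat the mapping as a guess that happens to agree with the output.

```r
# The Total column ignores grouping, so the unmodified iris data suffices.
x <- iris$Sepal.Length

cv_total  <- sd(x) / mean(x)  # coefficient of variation (assumed definition)
iqr_total <- IQR(x)           # 75th percentile minus 25th percentile
mad_total <- mad(x)           # median absolute deviation, scaled by 1.4826

round(c(cv = cv_total, iqr = iqr_total, mad = mad_total), 2)
```

These round to 0.14, 1.3 and 1.04, matching the cv, iqr and mad entries in the Total columns of the two tables.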

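If we supply no grouping variable, only the Total column is returned. The calls below repeat this for both weight variables, with and without inGroupWeightVar; omitting inGroupWeightVar triggers the placeholder-weight warning but leaves these totals unchanged.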

statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "species_wt", inGroupWeightVar = "species_wt", accuracy = 0.01) %>% 
    knitr::kable()
| stat          | Total |
|:--------------|:------|
| n             | 150   |
| min           | 4.30  |
| max           | 7.90  |
| mean          | 5.84  |
| weighted.mean | 6.16  |
| median        | 5.80  |
| sd            | 0.83  |

statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "species_wt", accuracy = 0.01) %>% 
    knitr::kable()
#> Warning in statsTable.data.frame(data = test_data, summVar = "Sepal.Length", : Using placeholder weights of 1 for in-group analysis
| stat          | Total |
|:--------------|:------|
| n             | 150   |
| min           | 4.30  |
| max           | 7.90  |
| mean          | 5.84  |
| weighted.mean | 6.16  |
| median        | 5.80  |
| sd            | 0.83  |

statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "rand_wt", inGroupWeightVar = "rand_wt", accuracy = 0.01) %>% 
    knitr::kable()
| stat          | Total |
|:--------------|:------|
| n             | 150   |
| min           | 4.30  |
| max           | 7.90  |
| mean          | 5.84  |
| weighted.mean | 5.84  |
| median        | 5.80  |
| sd            | 0.83  |

statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "rand_wt", accuracy = 0.01) %>% 
    knitr::kable()
#> Warning in statsTable.data.frame(data = test_data, summVar = "Sepal.Length", : Using placeholder weights of 1 for in-group analysis
| stat          | Total |
|:--------------|:------|
| n             | 150   |
| min           | 4.30  |
| max           | 7.90  |
| mean          | 5.84  |
| weighted.mean | 5.84  |
| median        | 5.80  |
| sd            | 0.83  |

Further reading

If you are looking to generate a lot of summary statistics at once, or do similar exploratory data work, there are other functions and packages in the R universe you may want to check out. We’ll apply each to the same iris dataset as above.