This document walks through how you might use
statsTable on a data project. The function
provides summary statistics for a “data rectangle” (a data frame or
data.table) that contains a variable to summarize and, optionally, a
variable to group by (for in-group summary calculations).
For testing, we use the iris dataset that comes stock
with R. It looks something like this:
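library(dplyr)       # provides %>%, glimpse(), and mutate()
library(data.table)  # provides data.table(), used further below
iris %>% glimpse()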
#> Rows: 150
#> Columns: 5
#> $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
#> $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
#> $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
#> $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
#> $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
We modify it by assigning each row a random bin and merging in mock
weights for each categorical variable (Species and
rand_bin). In general, double-check that you apply
weights appropriately, and don’t merge weights into a dataset that
already contains them. If in doubt, discuss with your PM and/or other
analysts on your project team.
set.seed(1234)
# Mock weights: one per random bin and one per species
rand_wt <- data.frame(rand_bin = 1:3,
                      rand_wt = rnorm(3, mean = 1, sd = 0.333))
species_wt <- data.frame(Species = unique(iris$Species),
                         species_wt = rnorm(length(unique(iris$Species)),
                                            mean = 1, sd = 0.333))
# Assign each row a random bin, then merge in both sets of weights
test_data <- iris %>%
  mutate(rand_bin = sample(1:3, nrow(iris), replace = TRUE)) %>%
  merge(species_wt, by = "Species") %>%
  merge(rand_wt, by = "rand_bin")
# data.table copy, used to exercise the data.table method
test_dt <- data.table(test_data)
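The caution about not merging weights into a dataset that already contains them can be enforced mechanically. The sketch below is one possible guard and is not part of statsTable: `merge_weights_once` and `demo_wt` are hypothetical names introduced here for illustration. The helper refuses to merge when any non-key column of the weight table is already present in the data.

```r
# Hypothetical helper (not part of statsTable): merge a weight table into
# `data` only if none of its non-key columns are already present.
merge_weights_once <- function(data, weights, by) {
  new_cols <- setdiff(names(weights), by)
  clash <- intersect(new_cols, names(data))
  if (length(clash) > 0) {
    stop("Weight column(s) already present: ", paste(clash, collapse = ", "))
  }
  merge(data, weights, by = by)
}

# Mock weights for illustration only (the values are arbitrary).
demo_wt <- data.frame(Species = unique(iris$Species),
                      demo_wt = c(0.5, 1, 1.5))

# First merge succeeds and adds a single demo_wt column.
once <- merge_weights_once(iris, demo_wt, by = "Species")

# A second merge of the same weights is rejected; a plain merge() would
# instead return duplicated columns demo_wt.x and demo_wt.y.
twice <- tryCatch(merge_weights_once(once, demo_wt, by = "Species"),
                  error = function(e) conditionMessage(e))
```

Failing loudly here is a design choice: a silent double merge is easy to miss until the weighted totals look wrong.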
test_data %>% glimpse()
#> Rows: 150
#> Columns: 8
#> $ rand_bin <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ Species <fct> setosa, virginica, versicolor, versicolor, setosa, setosa…
#> $ Sepal.Length <dbl> 5.4, 6.7, 6.3, 6.1, 4.5, 5.2, 5.3, 5.6, 5.1, 5.1, 4.9, 4.…
#> $ Sepal.Width <dbl> 3.9, 2.5, 3.3, 2.8, 2.3, 4.1, 3.7, 3.0, 2.5, 3.3, 3.6, 3.…
#> $ Petal.Length <dbl> 1.3, 5.8, 4.7, 4.0, 1.3, 1.5, 1.5, 4.1, 3.0, 1.7, 1.4, 1.…
#> $ Petal.Width <dbl> 0.4, 1.8, 1.6, 1.3, 0.3, 0.1, 0.2, 1.3, 1.1, 0.5, 0.1, 0.…
#> $ species_wt <dbl> 0.2188827, 1.1685166, 1.1428985, 1.1428985, 0.2188827, 0.…
#> $ rand_wt <dbl> 0.5980471, 0.5980471, 0.5980471, 0.5980471, 0.5980471, 0.…
Next, we’ll apply the statsTable function to it to look
at how Sepal.Length varies by Species. The
summary stats are given in the stats argument, which takes
a character vector of statistic names. For now, other than
weighted.mean, the stats argument only accepts
functions that produce a single output value. When inGroupWeightVar
is not supplied, as in the calls below, statsTable warns that it is
using placeholder weights of 1 for the in-group calculations.
statsTable(data = test_data,
           summVar = "Sepal.Length",
           groupVar = "Species",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd", "iqr", "cv"),
           totWeightVar = "species_wt", accuracy = 0.01,
           drop0trailing = TRUE) %>%
  knitr::kable()
#> Warning in statsTable.data.frame(data = test_data, summVar = "Sepal.Length", : Using placeholder weights of 1 for in-group analysis
| stat | setosa | versicolor | virginica | Total |
|---|---|---|---|---|
| n | 50 | 50 | 50 | 150 |
| min | 4.30 | 4.90 | 4.90 | 4.30 |
| max | 5.80 | 7.00 | 7.90 | 7.90 |
| mean | 5.01 | 5.94 | 6.59 | 5.84 |
| weighted.mean | 5.01 | 5.94 | 6.59 | 6.16 |
| median | 5.00 | 5.90 | 6.50 | 5.80 |
| sd | 0.35 | 0.52 | 0.64 | 0.83 |
| iqr | 0.40 | 0.70 | 0.67 | 1.30 |
| cv | 0.07 | 0.09 | 0.10 | 0.14 |
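In this table the weights only show up in the Total column: within each species the weighted.mean equals the plain mean, while the Total weighted.mean (6.16) differs from the Total mean (5.84). As a sanity check, the Total value can be reproduced with base R's weighted.mean. This is a sketch, not a description of statsTable's internals. The name `species_weights` is introduced here, and its values are copied (rounded) from the glimpse output of test_data above.

```r
# Species weights as shown (rounded) in the glimpse output of test_data.
species_weights <- c(setosa = 0.2188827,
                     versicolor = 1.1428985,
                     virginica = 1.1685166)

# Look up each row's weight by its species name.
w <- species_weights[as.character(iris$Species)]

# Weighted mean over all 150 rows; rounds to 6.16, as in the Total column.
total_wm <- weighted.mean(iris$Sepal.Length, w)
round(total_wm, 2)

# Within a single species every row carries the same weight, so the
# weighted mean collapses to the ordinary mean.
setosa <- iris$Species == "setosa"
same_within_group <- all.equal(
  weighted.mean(iris$Sepal.Length[setosa], w[setosa]),
  mean(iris$Sepal.Length[setosa])
)
```

The second check shows why group-constant weights, like placeholder weights of 1, leave the in-group means unchanged.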
statsTable(data = test_dt,
           summVar = "Sepal.Length",
           groupVar = "rand_bin",
           stats = c("n", "min", "max", "mean", "weighted.mean", "sd", "cv", "median", "mad"),
           totWeightVar = "rand_wt", accuracy = 0.01,
           drop0trailing = TRUE) %>%
  knitr::kable()
#> Warning in statsTable.data.table(data = test_dt, summVar = "Sepal.Length", : Using placeholder weights of 1 for in-group analysis
| stat | 1 | 2 | 3 | Total |
|---|---|---|---|---|
| n | 43 | 55 | 52 | 150 |
| min | 4.40 | 4.40 | 4.30 | 4.30 |
| max | 7.60 | 7.70 | 7.90 | 7.90 |
| mean | 5.89 | 5.77 | 5.88 | 5.84 |
| weighted.mean | 5.89 | 5.77 | 5.88 | 5.84 |
| sd | 0.74 | 0.82 | 0.91 | 0.83 |
| cv | 0.12 | 0.14 | 0.16 | 0.14 |
| median | 5.80 | 5.70 | 5.95 | 5.80 |
| mad | 0.89 | 0.89 | 1.11 | 1.04 |
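The less common statistics in these tables can be cross-checked against base R. The Total values are consistent with cv being sd divided by mean, iqr being stats::IQR(), and mad being stats::mad() with its default scaling constant of 1.4826. That correspondence is inferred from the numbers shown here, not from statsTable's source, so treat it as an assumption.

```r
x <- iris$Sepal.Length

cv_total  <- sd(x) / mean(x)  # coefficient of variation
iqr_total <- IQR(x)           # interquartile range (default quantile type 7)
mad_total <- mad(x)           # median absolute deviation, scaled by 1.4826

# Rounded to two decimals these give 0.14, 1.3 and 1.04, matching the
# Total column entries for cv, iqr and mad in the tables above.
round(c(cv = cv_total, iqr = iqr_total, mad = mad_total), 2)
```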
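And if we supply no grouping variable? The output then has only a
Total column. The calls below also show that the placeholder-weight
warning is not raised when inGroupWeightVar is supplied.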
statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "species_wt", inGroupWeightVar = "species_wt", accuracy = 0.01) %>%
  knitr::kable()
| stat | Total |
|---|---|
| n | 150 |
| min | 4.30 |
| max | 7.90 |
| mean | 5.84 |
| weighted.mean | 6.16 |
| median | 5.80 |
| sd | 0.83 |
statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "species_wt", accuracy = 0.01) %>%
  knitr::kable()
#> Warning in statsTable.data.frame(data = test_data, summVar = "Sepal.Length", : Using placeholder weights of 1 for in-group analysis
| stat | Total |
|---|---|
| n | 150 |
| min | 4.30 |
| max | 7.90 |
| mean | 5.84 |
| weighted.mean | 6.16 |
| median | 5.80 |
| sd | 0.83 |
statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "rand_wt", inGroupWeightVar = "rand_wt", accuracy = 0.01) %>%
  knitr::kable()
| stat | Total |
|---|---|
| n | 150 |
| min | 4.30 |
| max | 7.90 |
| mean | 5.84 |
| weighted.mean | 5.84 |
| median | 5.80 |
| sd | 0.83 |
statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "rand_wt", accuracy = 0.01) %>%
  knitr::kable()
#> Warning in statsTable.data.frame(data = test_data, summVar = "Sepal.Length", : Using placeholder weights of 1 for in-group analysis
| stat | Total |
|---|---|
| n | 150 |
| min | 4.30 |
| max | 7.90 |
| mean | 5.84 |
| weighted.mean | 5.84 |
| median | 5.80 |
| sd | 0.83 |
If you are looking to generate a lot of summary statistics at once,
or do similar exploratory data work, there are other functions and
packages in the R universe you may want to check out. We’ll apply each
to the same iris dataset as above.
- distributions in the xray package, which provides table output for
  numerical variables only, plus optional chart output for both
  numerical and categorical variables.
- descr, freq, and ctable in the summarytools package, for numeric,
  categorical, and cross-tab summaries. Use the tb function to produce
  data frame output (as opposed to output formatted for console
  printing).
- freq is also the name of a function in the descr package, for
  standardized frequency tables of a single variable (keeps NAs). It
  also has optional chart output.
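As a starting point, the summarytools functions listed above might be applied to iris as follows. This is a sketch under the assumption that summarytools is installed. The calls are guarded so the block still runs without it, and the object names `num_summary` and `species_freq` are illustrative only.

```r
# Guarded so the sketch still runs when summarytools is not installed.
if (requireNamespace("summarytools", quietly = TRUE)) {
  # Numeric summary as a data frame; descr() skips non-numeric columns
  # such as Species.
  num_summary <- summarytools::tb(summarytools::descr(iris))
  # Frequency table for the one categorical variable, as a data frame.
  species_freq <- summarytools::tb(summarytools::freq(iris$Species))
} else {
  num_summary <- NULL
  species_freq <- NULL
}
```

Converting with tb keeps the results as ordinary data frames, so they can be passed to knitr::kable in the same way as the statsTable output.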