This document runs through how you might use statsTable on a data project.
The function is meant to provide summary statistics for a “data rectangle”
which contains a variable to summarize and a variable to group by (for
in-group summary calculations).
For testing, we use the iris dataset that comes stock with R. It looks
something like this:
library(dplyr)  # provides %>%, glimpse(), and mutate()
iris %>% glimpse()
#> Rows: 150
#> Columns: 5
#> $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
#> $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
#> $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
#> $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
#> $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
We tweak it by assigning each row a random bin (rand_bin) and merging in mock
weights for each categorical variable (Species and rand_bin). In general,
double-check that you apply weights appropriately, and don’t merge weights
into a dataset that already contains them. If in doubt, discuss with your PM
and/or other analysts on your project team.
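library(data.table)  # provides data.table()

set.seed(1234)
# Mock weights: one per random bin and one per species
rand_wt <- data.frame(rand_bin = 1:3,
                      rand_wt = rnorm(3, mean = 1, sd = 0.333))
species_wt <- data.frame(Species = unique(iris$Species),
                         species_wt = rnorm(length(unique(iris$Species)),
                                            mean = 1, sd = 0.333))
# Assign each row a random bin, then merge in both sets of weights
test_data <- iris %>%
  mutate(rand_bin = sample(1:3, nrow(iris), replace = TRUE)) %>%
  merge(species_wt,
        by = "Species") %>%
  merge(rand_wt,
        by = "rand_bin")
# data.table copy of the same data, used to exercise the data.table method
test_dt <- data.table(test_data)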
test_data %>% glimpse()
#> Rows: 150
#> Columns: 8
#> $ rand_bin <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ Species <fct> setosa, virginica, versicolor, versicolor, setosa, setosa…
#> $ Sepal.Length <dbl> 5.4, 6.7, 6.3, 6.1, 4.5, 5.2, 5.3, 5.6, 5.1, 5.1, 4.9, 4.…
#> $ Sepal.Width <dbl> 3.9, 2.5, 3.3, 2.8, 2.3, 4.1, 3.7, 3.0, 2.5, 3.3, 3.6, 3.…
#> $ Petal.Length <dbl> 1.3, 5.8, 4.7, 4.0, 1.3, 1.5, 1.5, 4.1, 3.0, 1.7, 1.4, 1.…
#> $ Petal.Width <dbl> 0.4, 1.8, 1.6, 1.3, 0.3, 0.1, 0.2, 1.3, 1.1, 0.5, 0.1, 0.…
#> $ species_wt <dbl> 0.2188827, 1.1685166, 1.1428985, 1.1428985, 0.2188827, 0.…
#> $ rand_wt <dbl> 0.5980471, 0.5980471, 0.5980471, 0.5980471, 0.5980471, 0.…
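The advice above about not merging weights twice can be turned into a quick guard. This is a minimal sketch and not part of statsTable: the helper name merge_weights_once is hypothetical, the mock weight values are arbitrary, and it assumes weights live in a lookup data frame with one key column and one weight column.

```r
# Hypothetical helper (not part of statsTable): merge a weight lookup table
# into a dataset, refusing to do so if the weight column is already present.
merge_weights_once <- function(data, weights, by, wt_col) {
  if (wt_col %in% names(data)) {
    stop("Column '", wt_col, "' already exists; weights were probably merged already.")
  }
  if (anyDuplicated(weights[[by]]) > 0) {
    stop("Weight table has duplicate keys in '", by, "'; the merge would duplicate rows.")
  }
  merged <- merge(data, weights, by = by)
  # merge() does an inner join by default, so unmatched keys drop rows
  if (nrow(merged) != nrow(data)) {
    stop("Row count changed from ", nrow(data), " to ", nrow(merged),
         "; check for unmatched keys.")
  }
  merged
}

# Mock weights, one per species (arbitrary values, for illustration only)
mock_wt <- data.frame(Species = unique(iris$Species),
                      species_wt = c(0.5, 1.0, 1.5))

once <- merge_weights_once(iris, mock_wt, by = "Species", wt_col = "species_wt")

# A second attempt is rejected instead of producing species_wt.x / species_wt.y
twice <- tryCatch(
  merge_weights_once(once, mock_wt, by = "Species", wt_col = "species_wt"),
  error = function(e) conditionMessage(e)
)
```

Without a check like this, a repeated merge() on the same key would return both species_wt.x and species_wt.y, and it is easy to pass the wrong one on as a weight variable.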
Next, we’ll apply the statsTable function to it to look at how Sepal.Length
varies by Species. The summary stats are given in the stats argument, which,
as in the calls below, takes the names of the statistics as a character
vector. For now, other than weighted.mean, the stats argument only accepts
functions that produce a single output value.
statsTable(data = test_data,
           summVar = "Sepal.Length",
           groupVar = "Species",
           stats = c("n", "min", "max", "mean", "weighted.mean",
                     "median", "sd", "iqr", "cv"),
           totWeightVar = "species_wt", accuracy = 0.01,
           drop0trailing = TRUE) %>%
  knitr::kable()
#> Warning in statsTable.data.frame(data = test_data, summVar = "Sepal.Length", : Using placeholder weights of 1 for in-group analysis
stat | setosa | versicolor | virginica | Total |
---|---|---|---|---|
n | 50 | 50 | 50 | 150 |
min | 4.30 | 4.90 | 4.90 | 4.30 |
max | 5.80 | 7.00 | 7.90 | 7.90 |
mean | 5.01 | 5.94 | 6.59 | 5.84 |
weighted.mean | 5.01 | 5.94 | 6.59 | 6.16 |
median | 5.00 | 5.90 | 6.50 | 5.80 |
sd | 0.35 | 0.52 | 0.64 | 0.83 |
iqr | 0.40 | 0.70 | 0.67 | 1.30 |
cv | 0.07 | 0.09 | 0.10 | 0.14 |
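The Total column is where the weights matter. As a rough cross-check, the figures can be reproduced from plain iris. This assumes that the weighted.mean row corresponds to base R's weighted.mean() applied with the totWeightVar column, which the output is consistent with but which this document does not confirm. The species weights below are copied from the glimpse() output above, so they are rounded to seven decimals.

```r
# Species weights as printed (rounded) in the glimpse() output of test_data
species_wt <- c(setosa = 0.2188827, versicolor = 1.1428985, virginica = 1.1685166)

x <- iris$Sepal.Length
w <- unname(species_wt[as.character(iris$Species)])  # one weight per row

total_mean  <- mean(x)
total_wmean <- weighted.mean(x, w)

# Within one species every row has the same weight,
# so the weighted mean collapses to the plain mean
setosa <- iris$Species == "setosa"
setosa_wmean <- weighted.mean(x[setosa], w[setosa])

round(c(mean = total_mean, weighted.mean = total_wmean, setosa = setosa_wmean), 2)
```

Rounded to two decimals this gives 5.84, 6.16, and 5.01, matching the mean and weighted.mean entries in the table. Because every row of a species carries the same weight, the in-group weighted means equal the plain means whether or not placeholder weights of 1 are used.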
The same call works on the data.table copy, this time grouping by rand_bin:
statsTable(data = test_dt,
           summVar = "Sepal.Length",
           groupVar = "rand_bin",
           stats = c("n", "min", "max", "mean", "weighted.mean",
                     "sd", "cv", "median", "mad"),
           totWeightVar = "rand_wt", accuracy = 0.01,
           drop0trailing = TRUE) %>%
  knitr::kable()
#> Warning in statsTable.data.table(data = test_dt, summVar = "Sepal.Length", : Using placeholder weights of 1 for in-group analysis
stat | 1 | 2 | 3 | Total |
---|---|---|---|---|
n | 43 | 55 | 52 | 150 |
min | 4.40 | 4.40 | 4.30 | 4.30 |
max | 7.60 | 7.70 | 7.90 | 7.90 |
mean | 5.89 | 5.77 | 5.88 | 5.84 |
weighted.mean | 5.89 | 5.77 | 5.88 | 5.84 |
sd | 0.74 | 0.82 | 0.91 | 0.83 |
cv | 0.12 | 0.14 | 0.16 | 0.14 |
median | 5.80 | 5.70 | 5.95 | 5.80 |
mad | 0.89 | 0.89 | 1.11 | 1.04 |
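The less common rows (cv, iqr, mad) can be sanity-checked against base R in the same way. The definitions below are assumptions, since statsTable's own implementations are not shown here: cv is taken as sd divided by mean, iqr as IQR() with its default quantile type, and mad as mad() with its default scaling constant of 1.4826. The *_guess function names are made up for this sketch.

```r
x <- iris$Sepal.Length

# Assumed definitions; statsTable's own implementations may differ
cv_guess  <- function(v) sd(v) / mean(v)  # coefficient of variation
iqr_guess <- function(v) IQR(v)           # interquartile range (quantile type 7)
mad_guess <- function(v) mad(v)           # median absolute deviation, scaled by 1.4826

total <- c(cv = cv_guess(x), iqr = iqr_guess(x), mad = mad_guess(x))
round(total, 2)

# Each of these returns one number per input vector, whereas range() returns two
out_lengths <- lengths(list(cv = cv_guess(x), range = range(x)))
out_lengths
```

Rounded to two decimals, the three values are 0.14, 1.30, and 1.04, which agree with the Total column entries for cv, iqr, and mad in the tables above. The length check illustrates the single-output-value requirement: a function like range(), which returns two values, would not fit the stats argument as described.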
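And if we supply no grouping variable? Only the Total column is returned. The
four calls below try each weight variable with and without inGroupWeightVar;
omitting inGroupWeightVar triggers the placeholder-weight warning, but the
results are the same here.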
statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "species_wt", inGroupWeightVar = "species_wt",
           accuracy = 0.01) %>%
  knitr::kable()
stat | Total |
---|---|
n | 150 |
min | 4.30 |
max | 7.90 |
mean | 5.84 |
weighted.mean | 6.16 |
median | 5.80 |
sd | 0.83 |
statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "species_wt", accuracy = 0.01) %>%
  knitr::kable()
#> Warning in statsTable.data.frame(data = test_data, summVar = "Sepal.Length", : Using placeholder weights of 1 for in-group analysis
stat | Total |
---|---|
n | 150 |
min | 4.30 |
max | 7.90 |
mean | 5.84 |
weighted.mean | 6.16 |
median | 5.80 |
sd | 0.83 |
statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "rand_wt", inGroupWeightVar = "rand_wt",
           accuracy = 0.01) %>%
  knitr::kable()
stat | Total |
---|---|
n | 150 |
min | 4.30 |
max | 7.90 |
mean | 5.84 |
weighted.mean | 5.84 |
median | 5.80 |
sd | 0.83 |
statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "rand_wt", accuracy = 0.01) %>%
  knitr::kable()
#> Warning in statsTable.data.frame(data = test_data, summVar = "Sepal.Length", : Using placeholder weights of 1 for in-group analysis
stat | Total |
---|---|
n | 150 |
min | 4.30 |
max | 7.90 |
mean | 5.84 |
weighted.mean | 5.84 |
median | 5.80 |
sd | 0.83 |
If you are looking to generate a lot of summary statistics at once, or do
similar exploratory data work, there are other functions and packages in the
R universe you may want to check out. We’ll apply each to the same iris
dataset as above.
- distributions in the xray package, which only provides table output for numerical variables. It also has optional chart output for both numerical and categorical variables.
- descr, freq, and ctable in the summarytools package, for numeric, categorical, and cross-tab summaries. Use the tb function to produce data frame output (as opposed to output formatted for console printing).
- freq is also the name of a function in the descr package, for standardized frequency tables of a single variable (keeps NAs). It also has optional chart output.