---
title: "statsTable"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{statsTable}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(coreStatsNMR)
library(dplyr)
library(tidyr)
library(data.table)
set.seed(1234)
```

This document walks through how you might use `statsTable` on a data project. The function provides summary statistics for a "data rectangle" that contains a variable to summarize and, optionally, a variable to group by (for in-group summary calculations).

### Sample data

For testing, we use the `iris` dataset that ships with R. It looks something like this:

```{r iris}
iris %>% glimpse()
```

We tweak it by adding a random bin for each datum and merging in mock weights for each categorical variable (`Species` and `rand_bin`). In general, **double-check that you apply weights appropriately, and don't merge weights into a dataset that already contains them**. If in doubt, discuss with your PM and/or other analysts on your project team.

```{r test_data}
set.seed(1234)
rand_wt <- data.frame(rand_bin = 1:3,
                      rand_wt = rnorm(3, mean = 1, sd = 0.333))
species_wt <- data.frame(Species = unique(iris$Species),
                         species_wt = rnorm(length(unique(iris$Species)), mean = 1, sd = 0.333))
test_data <- iris %>%
  mutate(rand_bin = sample(1:3, nrow(iris), replace = TRUE)) %>%
  merge(species_wt, by = "Species") %>%
  merge(rand_wt, by = "rand_bin")
test_dt <- data.table(test_data)

test_data %>% glimpse()
```

### Sample table summary

Next, we'll apply the `statsTable` function to the test data to look at how `Sepal.Length` varies by `Species`. The summary statistics are chosen with the `stats` argument, which takes a character vector naming the functions to apply. For now, other than `weighted.mean`, the `stats` argument only accepts functions that produce a single output value.

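To make the grouped statistics concrete, here is a minimal base-R sketch that computes several of them by hand for `Sepal.Length` within each `Species`. It is not how `statsTable` is implemented. The per-row weights `wts` are made up for illustration, and `cv` is assumed here to mean the coefficient of variation (`sd / mean`), which the package may define differently.

```r
# Hand-rolled per-group summary in base R; NOT statsTable's implementation.
set.seed(1234)
wts <- runif(nrow(iris), min = 0.5, max = 1.5)  # hypothetical per-row weights

# Row indices for each level of the grouping variable
groups <- split(seq_len(nrow(iris)), iris$Species)

by_hand <- do.call(rbind, lapply(names(groups), function(g) {
  idx <- groups[[g]]
  x <- iris$Sepal.Length[idx]
  data.frame(
    Species = g,
    n = length(x),
    min = min(x),
    max = max(x),
    mean = mean(x),
    weighted.mean = weighted.mean(x, w = wts[idx]),
    median = median(x),
    sd = sd(x),
    iqr = IQR(x),
    cv = sd(x) / mean(x)  # assumed definition of "cv"
  )
}))

by_hand
```

Every statistic here returns a single value per group, which matches the constraint on `stats` described above. The `statsTable` calls that follow request the same kinds of statistics by name.
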
```{r statsTable_species}
statsTable(data = test_data,
           summVar = "Sepal.Length",
           groupVar = "Species",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd", "iqr", "cv"),
           totWeightVar = "species_wt",
           accuracy = 0.01,
           drop0trailing = TRUE) %>%
  knitr::kable()

statsTable(data = test_dt,
           summVar = "Sepal.Length",
           groupVar = "rand_bin",
           stats = c("n", "min", "max", "mean", "weighted.mean", "sd", "cv", "median", "mad"),
           totWeightVar = "rand_wt",
           accuracy = 0.01,
           drop0trailing = TRUE) %>%
  knitr::kable()
```

And if we supply no grouping variable?

```{r statsTable_noGroup}
statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "species_wt",
           inGroupWeightVar = "species_wt",
           accuracy = 0.01) %>%
  knitr::kable()

statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "species_wt",
           accuracy = 0.01) %>%
  knitr::kable()

statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "rand_wt",
           inGroupWeightVar = "rand_wt",
           accuracy = 0.01) %>%
  knitr::kable()

statsTable(data = test_data,
           summVar = "Sepal.Length",
           stats = c("n", "min", "max", "mean", "weighted.mean", "median", "sd"),
           totWeightVar = "rand_wt",
           accuracy = 0.01) %>%
  knitr::kable()
```

### Further reading

If you are looking to generate a lot of summary statistics at once, or do similar exploratory data work, there are other functions and packages in the R universe you may want to check out. We'll apply each to the same `iris` dataset as above.

- `distributions` in the [`xray`](https://github.com/sicarul/xray/) package, which only provides table output for numerical variables.
  Also has optional chart output for both numerical and categorical variables.
- `descr`, `freq`, and `ctable` in the [`summarytools`](https://cran.r-project.org/web/packages/summarytools/vignettes/Introduction.html) package, for numeric, categorical, and cross-tab summaries. Use the `tb` function to produce data frame output (as opposed to output formatted for console printing).
- `freq` is also the name of a function in the [`descr`](https://cran.r-project.org/web/packages/descr/index.html) package, for standardized frequency tables of a single variable (keeps NAs). Also has optional chart output.
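
As a rough illustration of the functions listed above, the sketch below calls each one on `iris`. It assumes the packages are installed and guards every call with `requireNamespace`, so it still runs when one is missing. The `results` list, the `long_sepal` helper factor, and its 5.8 cutoff are hypothetical names and values chosen only for this example.

```r
# Sketch only: each package is optional, so every call is guarded.
results <- list()

if (requireNamespace("xray", quietly = TRUE)) {
  # Table output for numeric variables (charts are drawn by default)
  results$xray <- xray::distributions(iris)
}

if (requireNamespace("summarytools", quietly = TRUE)) {
  # tb() converts summarytools output to a data frame
  results$descr <- summarytools::tb(summarytools::descr(iris[, 1:4]))
  results$freq <- summarytools::tb(summarytools::freq(iris$Species))

  # Hypothetical two-level factor, used only to give ctable a second variable
  long_sepal <- factor(ifelse(iris$Sepal.Length > 5.8, "long", "short"))
  results$ctable <- summarytools::ctable(iris$Species, long_sepal)
}

if (requireNamespace("descr", quietly = TRUE)) {
  # Frequency table for a single variable, with the chart switched off
  results$descr_freq <- descr::freq(iris$Species, plot = FALSE)
}

names(results)
```

Namespaced calls (`summarytools::freq`, `descr::freq`) avoid the name clash between the two `freq` functions mentioned above.
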