---
title: "stratRandSample"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{stratRandSample}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(coreStatsNMR)
library(dplyr)
library(data.table)
set.seed(1234)
```

`stratRandSample` is a function for generating a random stratified sample from a dataset. It takes up to 6 input values. The first three are required: a data frame (or data.table), a variable listing strata, and another specifying the number (or proportion) of units per stratum. The remaining three are optional: `select` allows for subsetting strata in line with the sampling, `replace` toggles sampling with replacement, and `bothSets` toggles whether the function returns both the sampled and non-sampled portions of the dataset.

### Sample data

For testing, we use the `mtcars` dataset that comes stock with R. It contains data from the 1974 Motor Trend US magazine, with fuel economy and other statistics for a set of 32 cars. It looks something like this:

```{r mtcars}
glimpse(mtcars)
```

We add random weights to the data for testing below, and force a few columns into factors for the sake of our examples.

```{r test_data, echo=TRUE}
set.seed(1234)

# One random weight per bin (bins 1 to 3)
rand_wt <- data.frame(rand_bin = 1:3,
                      rand_wt = rnorm(3, mean = 1, sd = 0.333))

# Assign each car to a random bin, attach the bin's weight,
# and convert the grouping columns to factors
test_data <- mtcars %>%
  mutate(index = row_number(),
         rand_bin = sample(1:3, nrow(mtcars), replace = TRUE)) %>%
  merge(rand_wt, by = "rand_bin") %>%
  mutate(across(c(cyl, am, vs, gear, carb), as.factor))

# data.table copy of the same data
test_dt <- data.table(test_data)
```

The data has `r nrow(test_data)` rows.

```{r}
nrow(test_data)
```

#### Random 80% of dataset

```{r simple_rand}
set.seed(1234)
stratRandSample(test_data, size = 0.8) %>% nrow()
stratRandSample(test_dt, size = 0.8) %>% nrow()
```

#### Random 30% of each group (# of cylinders)

```{r strat_sampleA}
stratRandSample(test_data, "cyl", 0.3) %>% count(cyl)
stratRandSample(test_dt, "cyl", 0.3) %>% count(cyl)
```

#### Random sample, different % per stratum

- 50% of 4-cylinder cars
- 30% of 6-cylinder cars
- 10% of 8-cylinder cars

```{r strat_sampleB}
stratRandSample(test_data, "cyl", size = c("4" = 0.5, "6" = 0.3, "8" = 0.1)) %>%
  count(cyl)
stratRandSample(test_dt, "cyl", size = c("4" = 0.5, "6" = 0.3, "8" = 0.1)) %>%
  count(cyl)
```

#### Misc. things

The function automatically rounds each stratum's sample to an integer number of rows (e.g. `0.081 * 11` produces 1 row).

```{r truncA}
stratRandSample(test_data, "cyl", size = c("4" = 0.081, "6" = 0.407, "8" = 0.513)) %>%
  count(cyl)
```

```{r truncB}
stratRandSample(test_dt, "cyl", size = c("4" = 0.09, "6" = 0.49, "8" = 0.59)) %>%
  count(cyl)
```
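
To make the per-stratum arithmetic concrete, here is a minimal base R sketch of proportional stratified sampling. It is not the package's implementation: `sketch_strat_sample` is a hypothetical helper written for this illustration, and nearest-integer rounding with `round()` is an assumption, chosen only because it agrees with the `0.081 * 11` example. `stratRandSample` itself may round differently.

```r
# Hypothetical helper for illustration only; not part of coreStatsNMR
sketch_strat_sample <- function(df, strata, size) {
  # One data frame per stratum, named by stratum value
  groups <- split(df, df[[strata]])

  sampled <- lapply(names(groups), function(g) {
    grp <- groups[[g]]
    # ASSUMPTION: nearest-integer rounding of proportion * stratum size
    n_take <- round(size[[g]] * nrow(grp))
    # Draw n_take row positions without replacement
    grp[sample(nrow(grp), n_take), , drop = FALSE]
  })

  do.call(rbind, sampled)
}

set.seed(1234)
out <- sketch_strat_sample(
  mtcars, "cyl",
  size = c("4" = 0.081, "6" = 0.407, "8" = 0.513)
)

# mtcars has 11, 7 and 14 cars with 4, 6 and 8 cylinders, so the
# fractional targets are 0.891, 2.849 and 7.182 rows
table(out$cyl)
```

Under the rounding assumption, the three strata contribute 1, 3 and 7 rows. Which rows are drawn depends on the seed, but the counts do not.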