stratRandSample is a function for generating a random stratified sample
from a dataset. It takes up to six inputs. The first three are required:
a data frame (or data.table), a variable listing the strata, and a value
specifying the number (or proportion) of units to draw per stratum. The
remaining three are optional: select allows for subsetting strata in line
with the sampling, replace toggles sampling with replacement, and bothSets
toggles whether the function returns both the sampled and non-sampled
portions of the dataset.
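To make the idea of drawing a fixed number (or proportion) of units per stratum concrete, here is a minimal base R sketch. It is not the implementation of stratRandSample: the helper name strat_sample_sketch and its arguments are hypothetical, and reading a size below 1 as a proportion (rounded to a whole number of rows) is an assumption made only for this illustration.

```r
# Hypothetical helper for illustration only; not part of the package.
strat_sample_sketch <- function(df, strata_col, size, replace = FALSE) {
  # Group row positions by the value of the strata column
  idx_by_stratum <- split(seq_len(nrow(df)), df[[strata_col]])
  picked <- lapply(idx_by_stratum, function(idx) {
    # Assumption: a size below 1 is read as a proportion of the stratum
    n <- if (size < 1) round(size * length(idx)) else size
    # Sample positions within idx so one-row strata are handled correctly
    idx[sample.int(length(idx), size = n, replace = replace)]
  })
  # Return the selected rows, keeping the original columns
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(42)
by_count <- strat_sample_sketch(mtcars, "cyl", size = 3)   # 3 cars per cyl value
by_prop  <- strat_sample_sketch(mtcars, "cyl", size = 0.5) # about half of each stratum
table(by_count$cyl)
```

With size = 3, each of the three cyl values contributes exactly three rows. Sampling without replacement fails if a stratum holds fewer rows than requested, which is one reason a replace option is useful.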
For testing, we use the mtcars dataset that ships with R. It contains
data from the 1974 Motor Trend US magazine, with fuel economy and other
statistics for a set of 32 cars. It looks something like this:
glimpse(mtcars)
#> Rows: 32
#> Columns: 11
#> $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
#> $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
#> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
#> $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
#> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
#> $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
#> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
#> $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
#> $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
#> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
We add random weights to the data for the tests below and convert a few columns to factors for the sake of our examples.
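# Assumes dplyr and data.table are already attached
set.seed(1234)
# Lookup table: one random weight per bin (1 to 3)
rand_wt <- data.frame(rand_bin = 1:3,
                      rand_wt = rnorm(3, mean = 1, sd = 0.333))
# Tag each car with a row index and a random bin, attach the bin's weight,
# then convert the grouping columns to factors
test_data <- mtcars %>%
  mutate(index = row_number(),
         rand_bin = sample(1:3, nrow(mtcars), replace = TRUE)) %>%
  merge(rand_wt,
        by = "rand_bin") %>%
  mutate(across(c(cyl, am, vs, gear, carb), as.factor))
# data.table copy of the same data
test_dt <- data.table(test_data)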
The data has 32 rows.
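Before sampling, it can help to check how many rows each stratum holds, since a draw without replacement cannot take more rows than a stratum contains. The sketch below uses base R on mtcars directly so that it runs on its own; the variable names are illustrative.

```r
# Stratum sizes for candidate stratification variables in mtcars
cyl_sizes  <- table(mtcars$cyl)
gear_sizes <- table(mtcars$gear)

# Smallest stratum under each variable: the largest per-stratum
# sample size that still works without replacement
max_n_cyl  <- min(cyl_sizes)
max_n_gear <- min(gear_sizes)

cyl_sizes
gear_sizes
```

The smallest cyl stratum has 7 cars and the smallest gear stratum has 5, so larger per-stratum draws on those variables would need sampling with replacement.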