stratRandSample

stratRandSample generates a stratified random sample from a dataset. It takes up to six arguments. The first three are the main inputs: a data frame (or data.table), a variable identifying the strata, and the number (or proportion) of units to draw per stratum. The strata variable can be left out to sample from the whole dataset, as in the 80% example below. The remaining three are optional: select allows subsetting strata as part of the sampling, replace toggles sampling with replacement, and bothSets toggles whether the function returns both the sampled and non-sampled portions of the dataset.
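To make the idea concrete, the general technique can be sketched in base R. This is an illustration only, not the implementation of stratRandSample: the helper name strat_sample_sketch, its argument names, and the rule that values below 1 are proportions while larger values are row counts are all assumptions made for this sketch.

```r
# Hypothetical helper for illustration; NOT the code behind stratRandSample.
strat_sample_sketch <- function(df, strata, size, replace = FALSE) {
  # Group row indices by the value of the strata column
  idx_by_stratum <- split(seq_len(nrow(df)), df[[strata]], drop = TRUE)

  picked <- lapply(names(idx_by_stratum), function(s) {
    idx <- idx_by_stratum[[s]]
    # One size applies to every stratum; a named vector is looked up by stratum
    p <- if (length(size) == 1) size else size[[s]]
    # Assumption: values below 1 are proportions, anything else is a row count
    n <- if (p < 1) round(p * length(idx)) else p
    # Sample positions within idx so a one-row stratum is handled correctly
    idx[sample.int(length(idx), n, replace = replace)]
  })

  sampled <- unlist(picked, use.names = FALSE)
  rest    <- setdiff(seq_len(nrow(df)), sampled)
  # Return both portions, similar in spirit to what bothSets is described as doing
  list(sampled = df[sampled, , drop = FALSE],
       rest    = df[rest, , drop = FALSE])
}

set.seed(1234)
out <- strat_sample_sketch(mtcars, "cyl", size = 0.3)
table(out$sampled$cyl)  # 3, 2 and 4 rows for 4, 6 and 8 cylinders
nrow(out$rest)          # 23 rows were not sampled
```

The sketch uses R's built-in mtcars data, so it runs without any extra packages.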

Sample data

For testing, we use the mtcars dataset that ships with R. It contains data from the 1974 Motor Trend US magazine, with fuel economy and other statistics for a set of 32 cars. It looks something like this:

library(dplyr)       # glimpse(), mutate(), count(), %>%
library(data.table)  # data.table()

glimpse(mtcars)
#> Rows: 32
#> Columns: 11
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
#> $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
#> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
#> $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
#> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
#> $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
#> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
#> $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
#> $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
#> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

We add random weights to the data for testing below, and force a few columns into factors for the sake of our examples.

set.seed(1234)

rand_wt <- data.frame(rand_bin = 1:3,
                      rand_wt = rnorm(3, mean = 1, sd = 0.333))

test_data <- mtcars %>% 
  mutate(index = row_number(),
         rand_bin = sample(1:3, nrow(mtcars), replace = TRUE)) %>% 
  merge(rand_wt,
        by = "rand_bin") %>% 
  mutate(across(c(cyl, am, vs, gear, carb), as.factor))

test_dt <- data.table(test_data)

The data has 32 rows.

nrow(test_data)
#> [1] 32
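Since the examples that follow stratify by number of cylinders, it helps to know how many cars fall into each stratum. A quick base R check on the original mtcars data (the cyl values and their counts are the same in test_data; the variable name strata_n is just for this example):

```r
# Number of cars per cylinder count in the built-in mtcars data
strata_n <- table(mtcars$cyl)
strata_n
# 4 cylinders: 11 cars, 6 cylinders: 7 cars, 8 cylinders: 14 cars
```

These stratum sizes (11, 7 and 14) are what the per-stratum proportions below are applied to.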

Random 80% of dataset

set.seed(1234)
stratRandSample(test_data, size = 0.8) %>% 
  nrow()
#> [1] 26

stratRandSample(test_dt, size = 0.8) %>% 
  nrow()
#> [1] 26
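The count of 26 follows from the proportion: 0.8 * 32 = 25.6, which becomes 26 whole rows. The same arithmetic and an unstratified draw can be sketched in base R. Rounding to the nearest integer is an inference from the outputs shown here, not a confirmed detail of stratRandSample, and the variable names are illustrative.

```r
n_total <- nrow(mtcars)           # 32 rows
n_keep  <- round(0.8 * n_total)   # 25.6 rounds to 26

set.seed(1234)
rows <- sample.int(n_total, n_keep)   # draw 26 distinct row numbers
simple_sample <- mtcars[rows, ]
nrow(simple_sample)
#> [1] 26
```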

Random 30% of each group (# of cylinders)

stratRandSample(test_data, "cyl", 0.3) %>% 
  count(cyl)
#>   cyl n
#> 1   4 3
#> 2   6 2
#> 3   8 4

stratRandSample(test_dt, "cyl", 0.3) %>%
  count(cyl)
#>       cyl     n
#>    <fctr> <int>
#> 1:      4     3
#> 2:      6     2
#> 3:      8     4

Random sample, different % per stratum

  • 50% of 4-cylinder cars
  • 30% of 6-cylinder cars
  • 10% of 8-cylinder cars

stratRandSample(test_data, "cyl",
                size = c("4" = 0.5, "6" = 0.3, "8" = 0.1)) %>% 
  count(cyl)
#>   cyl n
#> 1   4 6
#> 2   6 2
#> 3   8 1

stratRandSample(test_dt, "cyl",
                size = c("4" = 0.5, "6" = 0.3, "8" = 0.1)) %>% 
  count(cyl)
#>       cyl     n
#>    <fctr> <int>
#> 1:      4     6
#> 2:      6     2
#> 3:      8     1
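The counts of 6, 2 and 1 can be reproduced by matching each named proportion to its stratum size. The following base R sketch assumes proportions are matched by name and rounded to the nearest whole row; that is consistent with the output above but is not taken from the function's source.

```r
sizes    <- c("4" = 0.5, "6" = 0.3, "8" = 0.1)
strata_n <- table(mtcars$cyl)   # 11, 7 and 14 cars

# Look up each stratum's proportion by name, then convert to whole rows
expected <- round(sizes[names(strata_n)] * as.vector(strata_n))
expected
# 4 cylinders: 6 (from 5.5), 6 cylinders: 2 (from 2.1), 8 cylinders: 1 (from 1.4)
```

Note that base R's round() sends exact halves to the even digit, so 5.5 becomes 6 here but 4.5 would become 4. Whether stratRandSample behaves the same way on exact halves is not shown by these examples.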

Misc. things

Fractional sizes are automatically rounded to a whole number of rows (e.g. 0.081 * 11 = 0.891 produces 1 row).

stratRandSample(test_data, "cyl",
                size = c("4" = 0.081, "6" = 0.407, "8" = 0.513)) %>% 
  count(cyl)
#>   cyl n
#> 1   4 1
#> 2   6 3
#> 3   8 7

stratRandSample(test_dt, "cyl",
                size = c("4" = 0.09, "6" = 0.49, "8" = 0.59)) %>% 
  count(cyl)
#>       cyl     n
#>    <fctr> <int>
#> 1:      4     1
#> 2:      6     3
#> 3:      8     8
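One way to see which rounding rule fits these results is to compare candidate rules against the counts reported for the data frame example above. This is a check on the published output only; it does not inspect the function itself, and the variable names are illustrative.

```r
strata_n <- c("4" = 11, "6" = 7, "8" = 14)             # cars per cylinder count in mtcars
sizes    <- c("4" = 0.081, "6" = 0.407, "8" = 0.513)   # proportions from the example
raw      <- sizes * strata_n                           # 0.891, 2.849, 7.182
observed <- c("4" = 1, "6" = 3, "8" = 7)               # counts reported above

# Test each candidate rule against the observed counts
rules   <- list(floor = floor, round = round, ceiling = ceiling)
matches <- sapply(rules, function(f) all(f(raw) == observed))
matches
# floor: FALSE, round: TRUE, ceiling: FALSE
```

Only rounding to the nearest integer reproduces all three counts: floor would give 0 rows for 4-cylinder cars, and ceiling would give 8 rows for 8-cylinder cars.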