hierarch.resampling.Bootstrapper
- class hierarch.resampling.Bootstrapper(random_state=None, kind='weights')
Bases: object
This transformer performs a nested bootstrap on the target data. Behavior is undefined if the target data is not lexicographically sorted.
- Parameters:
- random_state : int or numpy.random.Generator instance, optional
Seeds the Bootstrapper for reproducibility, by default None
- kind : {“weights”, “bayesian”, “indexes”}
Specifies the bootstrapping algorithm.
“weights” generates a set of new integer weights for each datapoint.
“bayesian” generates a set of new real weights for each datapoint.
“indexes” generates a set of new indexes for the dataset. Mathematically, this is equivalent to demanding integer weights.
Notes
These approaches have different outputs. “weights” and “bayesian” output arrays the same size as the original array, but with every y-value multiplied by its generated weight. “indexes” outputs an array that is not necessarily the same size as the original array, but the weight of each y-value is 1, so certain metrics are easier to compute. Assuming “weights” and “indexes” generated the same sample in terms of reweighting, the arrays will be equivalent after the groupby and aggregate step.
“bayesian” has no reindexing equivalent.
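The equivalence described in these notes can be checked by hand on a toy example. The sketch below uses plain NumPy rather than hierarch itself; the group labels, y-values, and integer weights are made up for illustration, and the aggregation is assumed to be a per-group sum.

```python
import numpy as np

# Toy data: one grouping column and one dependent variable (y).
groups = np.array([0, 0, 0, 1, 1, 1])
y = np.array([2.0, 5.0, 7.0, 1.0, 3.0, 4.0])

# Hypothetical integer resampling weights (each group's weights sum to 3).
weights = np.array([2, 0, 1, 0, 3, 0])

# "weights"-style output: same shape as the input, y scaled by its weight.
y_weighted = y * weights

# "indexes"-style output: rows repeated according to their weight, y untouched.
idx = np.repeat(np.arange(len(y)), weights)
groups_indexed, y_indexed = groups[idx], y[idx]

# Aggregate by group (sum of y per group) for both representations.
sums_weighted = np.bincount(groups, weights=y_weighted)
sums_indexed = np.bincount(groups_indexed, weights=y_indexed)

print(sums_weighted, sums_indexed)
```

Both representations give the same per-group sums here, even though the reindexed array drops the zero-weighted rows and repeats the others.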
Examples
Generate a simple design matrix with dependent variable always equal to 1.
>>> from hierarch.power import DataSimulator
>>> paramlist = [[1]*2, [0]*6, [0]*18]
>>> hierarchy = [2, 3, 3]
>>> datagen = DataSimulator(paramlist)
>>> datagen.fit(hierarchy)
>>> data = datagen.generate()
>>> data
array([[1., 1., 1., 1.],
       [1., 1., 2., 1.],
       [1., 1., 3., 1.],
       [1., 2., 1., 1.],
       [1., 2., 2., 1.],
       [1., 2., 3., 1.],
       [1., 3., 1., 1.],
       [1., 3., 2., 1.],
       [1., 3., 3., 1.],
       [2., 1., 1., 1.],
       [2., 1., 2., 1.],
       [2., 1., 3., 1.],
       [2., 2., 1., 1.],
       [2., 2., 2., 1.],
       [2., 2., 3., 1.],
       [2., 3., 1., 1.],
       [2., 3., 2., 1.],
       [2., 3., 3., 1.]])
Generate a bootstrapped sample by resampling column 1, then column 2. The “weights” algorithm multiplies all of the dependent variable values by the resampled weights. Starting at column 1 means that some column 2 clusters might be zero-weighted.
>>> from hierarch.resampling import Bootstrapper
>>> boot = Bootstrapper(random_state=1, kind="weights")
>>> boot.fit(data, skip=None)
>>> boot.transform(data, start=1)
array([[1., 1., 1., 3.],
       [1., 1., 2., 0.],
       [1., 1., 3., 3.],
       [1., 2., 1., 0.],
       [1., 2., 2., 0.],
       [1., 2., 3., 0.],
       [1., 3., 1., 1.],
       [1., 3., 2., 1.],
       [1., 3., 3., 1.],
       [2., 1., 1., 0.],
       [2., 1., 2., 0.],
       [2., 1., 3., 0.],
       [2., 2., 1., 1.],
       [2., 2., 2., 1.],
       [2., 2., 3., 1.],
       [2., 3., 1., 2.],
       [2., 3., 2., 3.],
       [2., 3., 3., 1.]])
Starting at column 2 means that every column 1 cluster has equal weight.
>>> boot = Bootstrapper(random_state=1, kind="weights")
>>> boot.fit(data, skip=None)
>>> boot.transform(data, start=2)
array([[1., 1., 1., 2.],
       [1., 1., 2., 0.],
       [1., 1., 3., 1.],
       [1., 2., 1., 0.],
       [1., 2., 2., 1.],
       [1., 2., 3., 2.],
       [1., 3., 1., 2.],
       [1., 3., 2., 0.],
       [1., 3., 3., 1.],
       [2., 1., 1., 1.],
       [2., 1., 2., 1.],
       [2., 1., 3., 1.],
       [2., 2., 1., 1.],
       [2., 2., 2., 0.],
       [2., 2., 3., 2.],
       [2., 3., 1., 1.],
       [2., 3., 2., 1.],
       [2., 3., 3., 1.]])
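To see why starting at column 1 can zero out whole column 2 clusters, it may help to look at a two-level resampling sketch in plain NumPy. This is an illustration of the general nested-weights idea under assumed dimensions (2 groups, 3 clusters per group, 3 observations per cluster), not hierarch's actual implementation; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_clusters, n_obs = 2, 3, 3

# Level 1: resample the clusters inside each group. Each row holds integer
# cluster weights that sum to n_clusters.
cluster_w = rng.multinomial(
    n_clusters, np.full(n_clusters, 1 / n_clusters), size=n_groups
)

# Level 2: each cluster spreads (cluster weight * n_obs) draws over its own
# observations, so a cluster with weight 0 leaves every observation at 0.
obs_w = np.zeros((n_groups, n_clusters, n_obs), dtype=int)
for g in range(n_groups):
    for c in range(n_clusters):
        obs_w[g, c] = rng.multinomial(
            cluster_w[g, c] * n_obs, np.full(n_obs, 1 / n_obs)
        )

print(cluster_w)
print(obs_w.reshape(-1))  # one weight per row of the design matrix
```

In this sketch each group's observation weights always sum to 9, the same per-group total seen in the outputs above. Skipping level 1 (every cluster weight fixed at 1) corresponds to the equal-weight situation described for start=2.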
Skipping column 2 results in only column 1 clusters being resampled.
>>> boot = Bootstrapper(random_state=1, kind="weights")
>>> boot.fit(data, skip=[2])
>>> boot.transform(data, start=1)
array([[1., 1., 1., 2.],
       [1., 1., 2., 2.],
       [1., 1., 3., 2.],
       [1., 2., 1., 0.],
       [1., 2., 2., 0.],
       [1., 2., 3., 0.],
       [1., 3., 1., 1.],
       [1., 3., 2., 1.],
       [1., 3., 3., 1.],
       [2., 1., 1., 0.],
       [2., 1., 2., 0.],
       [2., 1., 3., 0.],
       [2., 2., 1., 1.],
       [2., 2., 2., 1.],
       [2., 2., 3., 1.],
       [2., 3., 1., 2.],
       [2., 3., 2., 2.],
       [2., 3., 3., 2.]])
Changing the algorithm to “indexes” gives a more familiar result.
>>> boot = Bootstrapper(random_state=1, kind="indexes")
>>> boot.fit(data, skip=None)
>>> boot.transform(data, start=1)
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 3., 1.],
       [1., 1., 3., 1.],
       [1., 1., 3., 1.],
       [1., 3., 1., 1.],
       [1., 3., 2., 1.],
       [1., 3., 3., 1.],
       [2., 2., 1., 1.],
       [2., 2., 2., 1.],
       [2., 2., 3., 1.],
       [2., 3., 1., 1.],
       [2., 3., 1., 1.],
       [2., 3., 2., 1.],
       [2., 3., 2., 1.],
       [2., 3., 2., 1.],
       [2., 3., 3., 1.]])
The Bayesian bootstrap is the same as the Efron bootstrap, but allows the resampled weights to take any real value up to the sum of the original weights in that cluster.
>>> boot = Bootstrapper(random_state=2, kind="bayesian")
>>> boot.fit(data, skip=None)
>>> boot.transform(data, start=1)
array([[1.        , 1.        , 1.        , 0.92438197],
       [1.        , 1.        , 2.        , 1.65820553],
       [1.        , 1.        , 3.        , 1.31019207],
       [1.        , 2.        , 1.        , 3.68556477],
       [1.        , 2.        , 2.        , 0.782951  ],
       [1.        , 2.        , 3.        , 0.01428243],
       [1.        , 3.        , 1.        , 0.03969449],
       [1.        , 3.        , 2.        , 0.04616013],
       [1.        , 3.        , 3.        , 0.53856761],
       [2.        , 1.        , 1.        , 4.4725425 ],
       [2.        , 1.        , 2.        , 1.83458204],
       [2.        , 1.        , 3.        , 0.16269176],
       [2.        , 2.        , 1.        , 0.53223701],
       [2.        , 2.        , 2.        , 0.37478853],
       [2.        , 2.        , 3.        , 0.07456895],
       [2.        , 3.        , 1.        , 0.27616575],
       [2.        , 3.        , 2.        , 0.11271856],
       [2.        , 3.        , 3.        , 1.15970489]])
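For intuition about where real-valued weights like these can come from, the sketch below draws Bayesian bootstrap weights for a single cluster with plain NumPy. It assumes the textbook formulation (a flat Dirichlet draw rescaled to the cluster's total weight), which may differ in detail from what hierarch does internally; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3  # observations in one cluster, each with original weight 1

# A Dirichlet(1, ..., 1) draw is non-negative and sums to 1. Scaling by n
# keeps the cluster's total weight equal to the original total, so each
# individual weight lies between 0 and n.
weights = rng.dirichlet(np.ones(n)) * n

print(weights, weights.sum())
```

Unlike the integer case, there is no way to express such weights by repeating or dropping rows, which is why “bayesian” has no reindexing counterpart.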
Methods
fit(data[, skip, y]): Fit the bootstrapper to the target data.
transform(data, start): Generate a bootstrapped sample from target data.
- fit(data: ndarray, skip=None, y=-1) → None
Fit the bootstrapper to the target data.
- Parameters:
- data : 2D array
Target data. Must be lexicographically sorted.
- sort : bool
Set to False if data is already sorted by row, by default True.
- skip : list of integers, optional
Columns to skip in the bootstrap. Skip columns that were sampled without replacement from the prior column, by default None.
- y : int, optional
Column index of the dependent variable, by default -1.
- Raises:
- ValueError
Raised if the input data is not a numeric numpy array.
- AttributeError
Raised if the input data is not a numpy array.
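Because fit expects lexicographically sorted data, unsorted arrays need to be ordered first. One possible way to do this with plain NumPy is sketched below; it assumes the dependent variable sits in the last column (matching the default y=-1), and the toy values are made up.

```python
import numpy as np

# Unsorted toy data: columns are (group, cluster, observation, y).
data = np.array([
    [2.0, 1.0, 2.0, 0.7],
    [1.0, 2.0, 1.0, 0.3],
    [1.0, 1.0, 2.0, 0.9],
    [2.0, 1.0, 1.0, 0.1],
    [1.0, 1.0, 1.0, 0.5],
])

# np.lexsort treats its LAST key as the primary one, so the design columns
# (everything except y) are passed in reverse order.
design = data[:, :-1]
order = np.lexsort(design.T[::-1])
sorted_data = data[order]

print(sorted_data)
```

The rows are now ordered by column 0, then column 1, then column 2, with each y-value still attached to its original row.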
- transform(data: ndarray, start: int) → ndarray
Generate a bootstrapped sample from target data.
- Parameters:
- data : 2D array
Target data. Must be sorted by row.
- start : int
Column index of the first column to be bootstrapped.
- Returns:
- 2D array
Array matching target data, but resampled with replacement according to the “kind” argument.