hierarch.resampling.Bootstrapper

class hierarch.resampling.Bootstrapper(random_state=None, kind='weights')

Bases: object

This transformer performs a nested bootstrap on the target data. Behavior is undefined if the target data is not lexicographically sorted.

Parameters:

random_state : int or numpy.random.Generator instance, optional

Seeds the Bootstrapper for reproducibility, by default None.

kind : {“weights”, “bayesian”, “indexes”}, by default “weights”

Specifies the bootstrapping algorithm.

“weights” generates a set of new integer weights for each datapoint.

“bayesian” generates a set of new real weights for each datapoint.

“indexes” generates a set of new indexes for the dataset. Mathematically, this is equivalent to requiring integer weights.

Notes

These approaches produce different outputs. “weights” and “bayesian” return arrays the same size as the original array, but with every y-value multiplied by its generated weight. “indexes” returns an array that is not necessarily the same size as the original array, but every y-value has a weight of 1, so certain metrics are easier to compute. Assuming “weights” and “indexes” generated the “same” sample in terms of reweighting, the two arrays will be equivalent after the groupby and aggregate step.

“bayesian” has no reindexing equivalent.
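The equivalence claimed above can be checked with a small NumPy-only sketch. It does not call hierarch: the toy array, the `weights` vector, and the `group_sums` helper are illustrative assumptions, with the first column standing in for a cluster label and the last column for y.

```python
import numpy as np

# Toy data: column 0 is a cluster label, column -1 is y (assumed layout).
data = np.array([[1., 1.], [1., 2.], [1., 3.],
                 [2., 4.], [2., 5.], [2., 6.]])
# Hypothetical integer weights, one per row.
weights = np.array([2, 0, 1, 0, 3, 0])

# "weights"-style output: same shape, y multiplied by its weight.
weighted = data.copy()
weighted[:, -1] *= weights

# "indexes"-style output: each row repeated `weight` times, y untouched.
reindexed = np.repeat(data, weights, axis=0)

def group_sums(arr):
    """Sum y within each cluster label (a minimal groupby-aggregate)."""
    labels, inverse = np.unique(arr[:, 0], return_inverse=True)
    return labels, np.bincount(inverse, weights=arr[:, -1])

print(group_sums(weighted)[1])
print(group_sums(reindexed)[1])
```

Both calls print the same per-cluster sums, even though `reindexed` has `weights.sum()` rows and drops the zero-weighted rows entirely.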

Examples

Generate a simple design matrix with dependent variable always equal to 1.

>>> from hierarch.power import DataSimulator
>>> paramlist = [[1]*2, [0]*6, [0]*18]
>>> hierarchy = [2, 3, 3]
>>> datagen = DataSimulator(paramlist)
>>> datagen.fit(hierarchy)
>>> data = datagen.generate()
>>> data
array([[1., 1., 1., 1.],
       [1., 1., 2., 1.],
       [1., 1., 3., 1.],
       [1., 2., 1., 1.],
       [1., 2., 2., 1.],
       [1., 2., 3., 1.],
       [1., 3., 1., 1.],
       [1., 3., 2., 1.],
       [1., 3., 3., 1.],
       [2., 1., 1., 1.],
       [2., 1., 2., 1.],
       [2., 1., 3., 1.],
       [2., 2., 1., 1.],
       [2., 2., 2., 1.],
       [2., 2., 3., 1.],
       [2., 3., 1., 1.],
       [2., 3., 2., 1.],
       [2., 3., 3., 1.]])

Generate a bootstrapped sample by resampling column 1, then column 2. The “weights” algorithm multiplies all of the dependent variable values by the resampled weights. Starting at column 1 means that some column 2 clusters might be zero-weighted.

>>> from hierarch.resampling import Bootstrapper
>>> boot = Bootstrapper(random_state=1, kind="weights")
>>> boot.fit(data, skip=None)
>>> boot.transform(data, start=1)
array([[1., 1., 1., 3.],
       [1., 1., 2., 0.],
       [1., 1., 3., 3.],
       [1., 2., 1., 0.],
       [1., 2., 2., 0.],
       [1., 2., 3., 0.],
       [1., 3., 1., 1.],
       [1., 3., 2., 1.],
       [1., 3., 3., 1.],
       [2., 1., 1., 0.],
       [2., 1., 2., 0.],
       [2., 1., 3., 0.],
       [2., 2., 1., 1.],
       [2., 2., 2., 1.],
       [2., 2., 3., 1.],
       [2., 3., 1., 2.],
       [2., 3., 2., 3.],
       [2., 3., 3., 1.]])
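The pattern in this output (cluster totals of 6, 0, and 3 in the first group) is consistent with a two-stage multinomial draw. The following is a minimal NumPy sketch of that idea for a single column 0 group, offered as one plausible reading of the output and not as hierarch's actual implementation; all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_clusters, n_per_cluster = 3, 3

# Stage 1: integer weights for the column 1 clusters, summing to n_clusters.
cluster_w = rng.multinomial(n_clusters, np.full(n_clusters, 1 / n_clusters))

# Stage 2: inside each cluster, redistribute (cluster weight * cluster size)
# across its rows. A zero-weighted cluster yields all-zero rows.
row_w = np.concatenate([
    rng.multinomial(w * n_per_cluster, np.full(n_per_cluster, 1 / n_per_cluster))
    for w in cluster_w
])

print(cluster_w)
print(row_w.reshape(n_clusters, n_per_cluster))
```

Under this scheme the total weight of the group is always preserved, and a cluster that draws weight 0 in stage 1 zeroes out every row beneath it.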

Starting at column 2 means that column 1 clusters are not resampled, so every column 1 cluster keeps the same total weight (3 in the output below).

>>> boot = Bootstrapper(random_state=1, kind="weights")
>>> boot.fit(data, skip=None)
>>> boot.transform(data, start=2)
array([[1., 1., 1., 2.],
       [1., 1., 2., 0.],
       [1., 1., 3., 1.],
       [1., 2., 1., 0.],
       [1., 2., 2., 1.],
       [1., 2., 3., 2.],
       [1., 3., 1., 2.],
       [1., 3., 2., 0.],
       [1., 3., 3., 1.],
       [2., 1., 1., 1.],
       [2., 1., 2., 1.],
       [2., 1., 3., 1.],
       [2., 2., 1., 1.],
       [2., 2., 2., 0.],
       [2., 2., 3., 2.],
       [2., 3., 1., 1.],
       [2., 3., 2., 1.],
       [2., 3., 3., 1.]])

Skipping column 2 results in only column 1 clusters being resampled.

>>> boot = Bootstrapper(random_state=1, kind="weights")
>>> boot.fit(data, skip=[2])
>>> boot.transform(data, start=1)
array([[1., 1., 1., 2.],
       [1., 1., 2., 2.],
       [1., 1., 3., 2.],
       [1., 2., 1., 0.],
       [1., 2., 2., 0.],
       [1., 2., 3., 0.],
       [1., 3., 1., 1.],
       [1., 3., 2., 1.],
       [1., 3., 3., 1.],
       [2., 1., 1., 0.],
       [2., 1., 2., 0.],
       [2., 1., 3., 0.],
       [2., 2., 1., 1.],
       [2., 2., 2., 1.],
       [2., 2., 3., 1.],
       [2., 3., 1., 2.],
       [2., 3., 2., 2.],
       [2., 3., 3., 2.]])
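In the output above, every row simply carries the weight of its column 1 cluster. A minimal NumPy sketch of that behavior for one column 0 group follows; it is an illustration under assumed names and does not reproduce hierarch's internals.

```python
import numpy as np

rng = np.random.default_rng(1)
n_clusters, n_per_cluster = 3, 3

# Resample only the column 1 clusters.
cluster_w = rng.multinomial(n_clusters, np.full(n_clusters, 1 / n_clusters))

# Column 2 is skipped, so each row inherits its cluster's weight unchanged.
row_w = np.repeat(cluster_w, n_per_cluster)

print(row_w.reshape(n_clusters, n_per_cluster))
```

This matches the case described in the `skip` parameter: when the rows within a cluster were sampled without replacement, resampling them again would add variance that the design does not contain.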

Changing the algorithm to “indexes” gives a more familiar result: rows are repeated or dropped instead of being reweighted.

>>> boot = Bootstrapper(random_state=1, kind="indexes")
>>> boot.fit(data, skip=None)
>>> boot.transform(data, start=1)
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 3., 1.],
       [1., 1., 3., 1.],
       [1., 1., 3., 1.],
       [1., 3., 1., 1.],
       [1., 3., 2., 1.],
       [1., 3., 3., 1.],
       [2., 2., 1., 1.],
       [2., 2., 2., 1.],
       [2., 2., 3., 1.],
       [2., 3., 1., 1.],
       [2., 3., 1., 1.],
       [2., 3., 2., 1.],
       [2., 3., 2., 1.],
       [2., 3., 2., 1.],
       [2., 3., 3., 1.]])
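The link between drawn indexes and integer weights can be made concrete with a short NumPy sketch. It is independent of hierarch, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3  # number of units in one cluster

# Draw n indexes with replacement (the "indexes" view).
drawn = rng.integers(0, n, size=n)

# Count how often each unit was drawn (the "weights" view).
counts = np.bincount(drawn, minlength=n)

# Expanding the counts recovers the drawn sample, up to ordering.
expanded = np.repeat(np.arange(n), counts)
print(np.sort(drawn), counts, expanded)
```

The counts always sum to `n`, which is why a reindexed sample corresponds to a set of non-negative integer weights with the same total as the original data.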

The Bayesian bootstrap follows the same nested scheme as the Efron bootstrap, but allows the resampled weights to take any non-negative real value up to the sum of the original weights in that cluster.

>>> boot = Bootstrapper(random_state=2, kind="bayesian")
>>> boot.fit(data, skip=None)
>>> boot.transform(data, start=1)
array([[1.        , 1.        , 1.        , 0.92438197],
       [1.        , 1.        , 2.        , 1.65820553],
       [1.        , 1.        , 3.        , 1.31019207],
       [1.        , 2.        , 1.        , 3.68556477],
       [1.        , 2.        , 2.        , 0.782951  ],
       [1.        , 2.        , 3.        , 0.01428243],
       [1.        , 3.        , 1.        , 0.03969449],
       [1.        , 3.        , 2.        , 0.04616013],
       [1.        , 3.        , 3.        , 0.53856761],
       [2.        , 1.        , 1.        , 4.4725425 ],
       [2.        , 1.        , 2.        , 1.83458204],
       [2.        , 1.        , 3.        , 0.16269176],
       [2.        , 2.        , 1.        , 0.53223701],
       [2.        , 2.        , 2.        , 0.37478853],
       [2.        , 2.        , 3.        , 0.07456895],
       [2.        , 3.        , 1.        , 0.27616575],
       [2.        , 3.        , 2.        , 0.11271856],
       [2.        , 3.        , 3.        , 1.15970489]])
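In the output above, the weights within each column 0 group sum to 9, the number of rows in that group. The standard way to draw such weights is from a flat Dirichlet distribution scaled by the sample size; the sketch below shows that for a single, non-nested sample. Whether hierarch composes its nested draws in exactly this way is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 9  # number of observations being reweighted

# Flat Dirichlet draw: non-negative values that sum to 1.
proportions = rng.dirichlet(np.ones(n))

# Scale so the total weight matches the original sample size.
weights = proportions * n

print(weights.round(3), weights.sum())
```

Because the weights are continuous, no row is ever dropped outright, which is why this variant has no reindexing equivalent.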

Methods

`fit`(data[, skip, y])	Fit the bootstrapper to the target data.
`transform`(data, start)	Generate a bootstrapped sample from target data.

fit(data: ndarray, skip=None, y=-1) → None

Fit the bootstrapper to the target data.

Parameters:

data (2D array): Target data. Must be lexicographically sorted.
sort (bool): Set to False if data is already sorted by row, by default True.
skip (list of integers, optional): Columns to skip in the bootstrap. Skip columns that were sampled without replacement from the prior column, by default None (no columns are skipped).
y (int, optional): Column index of the dependent variable, by default -1.

Raises:

ValueError: Raised if the input data is not a numeric numpy array.
AttributeError: Raised if the input data is not a numpy array.
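Since `fit` expects lexicographically sorted data, it can help to sort explicitly before fitting. A minimal sketch using `numpy.lexsort` is shown below; it assumes the dependent variable sits in the last column (the default `y=-1`) and that every other column is a design column. The toy array is illustrative.

```python
import numpy as np

# Unsorted toy data: three design columns followed by y.
data = np.array([[2., 1., 1., 0.5],
                 [1., 2., 1., 0.1],
                 [1., 1., 2., 0.7],
                 [1., 1., 1., 0.3]])

# np.lexsort treats the LAST key as the primary key, so the design
# columns are passed in reverse order to sort by column 0 first.
design = data[:, :-1]
order = np.lexsort(design.T[::-1])
sorted_data = data[order]

print(sorted_data)
```

The y column is excluded from the sort keys, so rows that share identical design values keep their original relative order.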

transform(data: ndarray, start: int) → ndarray

Generate a bootstrapped sample from target data.

Parameters:

data (2D array): Target data. Must be sorted by row.
start (int): Column index of the first column to be bootstrapped.

Returns:

2D array: Array matching the target data, but resampled with replacement according to the “kind” argument.