hierarch.stats.two_sample_test

hierarch.stats.two_sample_test(data_array, treatment_col: int, compare='means', skip=None, bootstraps=100, permutations=1000, kind='weights', return_null=False, random_state=None)

Two-tailed two-sample hierarchical permutation test.

Parameters
data_array : 2D numpy array or pandas DataFrame

Array-like containing both the independent and dependent variables to be analyzed. It’s assumed that the final (rightmost) column contains the dependent variable values.

treatment_col : int

The index of the column containing the labels of the two samples to be compared. Indexing starts at 0.

compare : str, optional

The test statistic used for the hypothesis test. “means” uses the Welch t-statistic for a difference-of-means test. By default “means”.

skip : list of ints, optional

Columns to skip in the bootstrap. Skip columns that were sampled without replacement from the prior column. By default None.

bootstraps : int, optional

Number of bootstraps to perform, by default 100. Can be set to 1 for a permutation test without any bootstrapping.

permutations : int or “all”, optional

Number of permutations to perform per bootstrap sample. Use “all” for an exact test. By default 1000.

kind : str, optional

Bootstrap algorithm; see the Bootstrapper class. By default “weights”.

return_null : bool, optional

Whether to return the null distribution as well as the p-value. By default False.

random_state : int or numpy random Generator, optional

Seed or Generator used for reproducibility. By default None.

Returns
float64

p-value for the hypothesis test

list

Empirical null distribution used to calculate the p-value. Returned only if return_null is True.

Raises
TypeError

Raised if input data is not ndarray or DataFrame.

ValueError

Raised if treatment_col has more than two different labels in it.

KeyError

Raised if compare is a string that is not in the TEST_STATISTICS dictionary.

AttributeError

Raised if compare is a custom statistic that is not a function.
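
Notes

For orientation, the Welch t-statistic that the “means” option refers to can be written out directly. This is a minimal sketch using only the standard library. welch_t is a hypothetical helper, not hierarch's implementation, and the sign convention (first sample minus second) is an assumption that does not matter for a two-tailed test.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    # variance() is the sample variance (n - 1 in the denominator).
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    # Assumed sign convention: first sample minus second sample.
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Made-up measurements, for illustration only.
control = [1.0, 2.0, 3.0, 4.0]
treated = [2.0, 4.0, 6.0, 8.0]

t_stat = welch_t(control, treated)
print(round(t_stat, 4))  # -1.7321
```

Unlike the pooled-variance t-statistic, this form does not assume that the two samples share a common variance.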

Examples

Specify the parameters of a dataset with a difference of means of 2.

>>> from hierarch.power import DataSimulator
>>> from hierarch.stats import two_sample_test
>>> import scipy.stats as stats
>>> paramlist = [[0, 2], [stats.norm], [stats.norm]]
>>> hierarchy = [2, 4, 3]
>>> datagen = DataSimulator(paramlist, random_state=123)
>>> datagen.fit(hierarchy)
>>> data = datagen.generate()
>>> print(data.shape)
(24, 4)
>>> two_sample_test(data, treatment_col=0,
...                 bootstraps=1000, permutations='all',
...                 random_state=1)
0.03402857142857143

Instead of an exact test, a number of random permutations can be specified. In this case there are 70 possible permutations.

>>> two_sample_test(data, treatment_col=0,
...                 bootstraps=1000, permutations=70,
...                 random_state=1)
0.03357142857142857
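
The figure of 70 can be checked by hand. The following is a minimal sketch, assuming the exact test enumerates the ways of assigning the 8 clusters directly below the treatment column (2 treatment levels with 4 clusters each, from hierarchy = [2, 4, 3]) to two groups of 4. The variable names are illustrative and not part of hierarch.

```python
from math import comb

# hierarchy = [2, 4, 3]: 2 treatment levels, 4 clusters nested under each.
treatment_levels = 2
clusters_per_level = 4
n_clusters = treatment_levels * clusters_per_level  # 8 clusters in total

# Assumption: a permutation is a choice of which 4 of the 8 clusters
# receive the first treatment label.
n_permutations = comb(n_clusters, clusters_per_level)
print(n_permutations)  # 70

# permutations is applied per bootstrap sample, so bootstraps=1000 with
# 70 permutations each would evaluate 70,000 resampled statistics.
bootstraps = 1000
total_resamples = bootstraps * n_permutations
print(total_resamples)  # 70000
```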

The treatment column does not have to be the outermost column.

>>> paramlist = [[stats.norm], [0, 1]*3, [stats.norm], [stats.norm]]
>>> hierarchy = [3, 2, 4, 3]
>>> datagen = DataSimulator(paramlist, random_state=123)
>>> datagen.fit(hierarchy)
>>> data = datagen.generate()
>>> print(data.shape)
(72, 5)

Because of the larger number of possible permutations, it is usually better to reduce the number of bootstraps and increase the number of permutations.

>>> two_sample_test(data, treatment_col=0,
...                 bootstraps=100, permutations=1000,
...                 random_state=1)
Traceback (most recent call last):
    ...
ValueError: Needs 2 samples.

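Make sure that treatment_col is set to the right column index. In this dataset column 0 has three labels, which is why the call above raises ValueError; the two treatment labels are in column 1.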

>>> two_sample_test(data, treatment_col=1,
...                 bootstraps=100, permutations=1000,
...                 random_state=1)
0.00276
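
With return_null=True the function also returns the empirical null distribution. How a two-tailed p-value relates to such a distribution can be sketched in plain Python. This is a generic illustration, assuming the p-value is the fraction of null statistics at least as extreme in absolute value as the observed one. two_tailed_p_value is a hypothetical helper, and hierarch's own calculation may differ in details such as tie handling.

```python
def two_tailed_p_value(observed, null_distribution):
    """Fraction of null statistics at least as extreme as the observed one."""
    n_extreme = sum(
        1 for stat in null_distribution if abs(stat) >= abs(observed)
    )
    return n_extreme / len(null_distribution)


# Made-up null statistics, for illustration only; a real null distribution
# would hold one value per permutation of each bootstrap sample.
null = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]

p_value = two_tailed_p_value(2.0, null)
print(p_value)  # 0.375
```

Comparing absolute values counts extreme statistics in both tails, which is what makes the test two-tailed.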