Paired Sample Hypothesis Testing via Bootstrapping with Polars
Hypothesis Testing on Paired Sample data when the Frequentist Approach is not a suitable option
Introduction
Suppose you are an environmental scientist tasked with measuring the Zinc concentration at the surface and at the bottom of 10 lakes. Your goal is to determine whether there is a significant difference between the concentration of Zinc at the bottom vs the surface of a lake.
After collecting your samples, your data looks like this:
import polars as pl
zinc = pl.read_csv("zinc.csv")
print(zinc)
shape: (10, 3)
┌────────┬────────┬─────────┐
│ loc_id ┆ bottom ┆ surface │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞════════╪════════╪═════════╡
│ 1 ┆ 0.43 ┆ 0.415 │
│ 2 ┆ 0.266 ┆ 0.238 │
│ 3 ┆ 0.567 ┆ 0.39 │
│ 4 ┆ 0.531 ┆ 0.41 │
│ 5 ┆ 0.707 ┆ 0.605 │
│ 6 ┆ 0.716 ┆ 0.609 │
│ 7 ┆ 0.651 ┆ 0.632 │
│ 8 ┆ 0.589 ┆ 0.523 │
│ 9 ┆ 0.469 ┆ 0.411 │
│ 10 ┆ 0.723 ┆ 0.612 │
└────────┴────────┴─────────┘
NOTE: If you would like to follow along, you can import the above data directly from the printed output with pl.from_repr():
zinc = pl.from_repr("""
┌────────┬────────┬─────────┐
│ loc_id ┆ bottom ┆ surface │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞════════╪════════╪═════════╡
│ 1 ┆ 0.43 ┆ 0.415 │
│ 2 ┆ 0.266 ┆ 0.238 │
│ 3 ┆ 0.567 ┆ 0.39 │
│ 4 ┆ 0.531 ┆ 0.41 │
│ 5 ┆ 0.707 ┆ 0.605 │
│ 6 ┆ 0.716 ┆ 0.609 │
│ 7 ┆ 0.651 ┆ 0.632 │
│ 8 ┆ 0.589 ┆ 0.523 │
│ 9 ┆ 0.469 ┆ 0.411 │
│ 10 ┆ 0.723 ┆ 0.612 │
└────────┴────────┴─────────┘
""")
The Frequentist Dilemma
The approach here is to use hypothesis testing to determine whether the difference between the bottom and surface Zinc concentrations is statistically significant at an agreed-upon significance level (e.g. alpha = 0.05, alpha = 0.01, etc.). However, the issue is that your sample of n = 10 lakes is not large enough to justify the canonical frequentist approach.
In other words, you do not have enough samples to lean on the normality assumption behind the paired t-test that would normally be used to assess significance.
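(For context, here is a minimal sketch of what that canonical route would look like; it assumes scipy is installed, and it is shown only as the test we are choosing not to rely on.)
## For illustration only -- the canonical paired t-test we are NOT relying on here,
## because with n = 10 we cannot justify its normality assumption:
from scipy import stats

t_test = stats.ttest_rel(zinc["surface"].to_list(), zinc["bottom"].to_list())
print(t_test.statistic, t_test.pvalue)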
Bootstrap To The Rescue
Instead of assuming a normal distribution, we can create our own! Specifically, we can build a Bootstrap Distribution that will naturally take on an approximately normal shape.
In order to do this, let's first define our two Hypotheses.
Our Null Hypothesis is that there is no difference (i.e., a "null difference") between the concentration of Zinc at the bottom of a lake vs the surface of a lake. We can represent this null hypothesis mathematically as follows:
Ho: avg(surface) = avg(bottom)
Or in other words, the average surface concentration minus the average bottom concentration is equal to zero:
Ho: avg(surface) - avg(bottom) = 0
Once the null hypothesis is defined, our Alternative Hypothesis is simply the antithesis of the null hypothesis - the average surface concentration minus the average bottom concentration is NOT equal to zero:
Ha: avg(surface) - avg(bottom) =/= 0
The Setup
The first step is to calculate the Observed Difference of Means between the Zinc concentration at the surface vs the bottom across the 10 lakes in our sample:
obs_diff_means = (
    zinc
    .select(
        pl.col("surface").mean().alias("mean_surface"),
        pl.col("bottom").mean().alias("mean_bottom")
    )
    .select(
        pl.col("mean_surface") - pl.col("mean_bottom")
    )
    .to_series().to_list()[0]
)
obs_diff_means
-0.0804
So, the question now becomes: is -0.0804 significantly different than 0?
The Bootstrap Distribution
When working with paired sample data, I like to first reshape my dataset to make the grouped shuffling in the subsequent steps easier. In Polars this is done with the .unpivot() method:
zinc = (
    zinc
    .unpivot(index = "loc_id")
    .sort(pl.col("loc_id"), pl.col("variable"))
)
print(zinc)
shape: (20, 3)
┌────────┬──────────┬───────┐
│ loc_id ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 │
╞════════╪══════════╪═══════╡
│ 1 ┆ bottom ┆ 0.43 │
│ 1 ┆ surface ┆ 0.415 │
│ 2 ┆ bottom ┆ 0.266 │
│ 2 ┆ surface ┆ 0.238 │
│ 3 ┆ bottom ┆ 0.567 │
│ … ┆ … ┆ … │
│ 8 ┆ surface ┆ 0.523 │
│ 9 ┆ bottom ┆ 0.469 │
│ 9 ┆ surface ┆ 0.411 │
│ 10 ┆ bottom ┆ 0.723 │
│ 10 ┆ surface ┆ 0.612 │
└────────┴──────────┴───────┘
The bootstrap logic is fairly simple. Our Null Hypothesis is that there is no difference between the Zinc concentration of the surface vs the bottom of each lake. Consequently, we assume that it would make no difference if we were to shuffle the surface values with the bottom values at each lake. So that is exactly what we will do.
A single shuffle round consists of the following steps:
• Randomly shuffle concentrations at each of the 10 lakes
• Calculate the difference of means for that shuffle
Let's perform one boot shuffle:
boot_diff_means = (
    zinc
    ## Group by location ID (lake id) and shuffle the surface / bottom conc values:
    .group_by("loc_id")
    .agg(
        pl.col("value").sample(fraction = 1, shuffle = True)
    )
    ## Unnest the shuffled values into their own columns:
    .with_columns(
        pl.col("value").list.to_struct()
    )
    .unnest("value")
    ## Calculate a Boot Difference of Means (after the shuffle, which value lands
    ## in field_0 vs field_1 is random, so the surface / bottom aliases are just labels):
    .select(
        pl.col("field_0").mean().alias("mean_surface"),
        pl.col("field_1").mean().alias("mean_bottom")
    )
    .select(
        pl.col("mean_surface") - pl.col("mean_bottom")
    )
    .to_series().to_list()[0]
)
boot_diff_means
0.0544
We can now scale that one shuffle round to 5,000 rounds to create a Null Distribution - i.e., a distribution built assuming the Null Hypothesis. Those 5,000 rounds are more than enough for the null distribution to take on an approximately normal shape.
This is the VIP concept of this exercise: once we have a distribution built from logic derived from our Null Hypothesis, we can compare the Observed Difference of Means directly against it.
boot_list = []
for _ in range(5000):
    boot_diff_means = (
        zinc
        ## Group by location ID (lake id) and shuffle the surface / bottom conc values:
        .group_by("loc_id")
        .agg(
            pl.col("value").sample(fraction = 1, shuffle = True)
        )
        ## Unnest the shuffled values into their own columns:
        .with_columns(
            pl.col("value").list.to_struct()
        )
        .unnest("value")
        ## Calculate a Boot Difference of Means:
        .select(
            pl.col("field_0").mean().alias("mean_surface"),
            pl.col("field_1").mean().alias("mean_bottom")
        )
        .select(
            pl.col("mean_surface") - pl.col("mean_bottom")
        )
        .to_series().to_list()[0]
    )
    ## Populate the boot list:
    boot_list.append(boot_diff_means)
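As an aside, the Python loop above is easy to read, but it re-runs the whole Polars pipeline 5,000 times. Because swapping the two values within a lake either keeps or flips the sign of that lake's difference, the same null distribution can be built with one random sign flip per lake per round. Below is a minimal, loop-free sketch of that idea (numpy is assumed to be available, and boot_list_alt is just an illustrative name):
import numpy as np

## Per-lake differences (surface - bottom), recovered from the long format:
diffs = (
    zinc
    .pivot(on = "variable", index = "loc_id", values = "value")
    .select(pl.col("surface") - pl.col("bottom"))
    .to_series()
    .to_numpy()
)

## Swapping the pair within a lake is equivalent to flipping the sign of its difference:
rng = np.random.default_rng()
signs = rng.choice([-1.0, 1.0], size = (5000, diffs.shape[0]))
boot_list_alt = (signs * diffs).mean(axis = 1).tolist()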
We can confirm that our 5,000 rounds produced an approximately normal Null Distribution by plotting a histogram:
boot_df = pl.DataFrame({"n": boot_list})
(
    boot_df
    .select(pl.col("n"))
    .to_series()
    .plot.hist()
)
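If you prefer numbers to a plot, a quick summary of the boot column is another optional sanity check: the null distribution should be roughly centered on zero (the exact values will vary from run to run).
## Optional numeric sanity check on the null distribution:
print(
    boot_df.select(
        pl.col("n").mean().alias("boot_mean"),
        pl.col("n").std().alias("boot_std")
    )
)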
Calculating the P-Value
The final step is to actually test our Null Hypothesis. We can do this by calculating the p-value and comparing it to our predetermined significance level. For this exercise we will use a strict alpha of 0.001:
alpha_sig = 0.001
The p-value calculation is straightforward now that we have a Null Distribution: it is the proportion of the null distribution that is at least as extreme as our observed value - here, because our observed difference is negative, that is the proportion at or below -0.0804.
In other words:
pval = boot_df.filter(pl.col("n") <= obs_diff_means).shape[0] / boot_df.shape[0]
pval
0.0006
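For what it's worth, the same proportion can also be computed in a single Polars expression, since the mean of a boolean column is the share of True values:
## Equivalent expression-based version of the p-value calculation:
pval = boot_df.select((pl.col("n") <= obs_diff_means).mean()).item()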
Finally, we test our Null Hypothesis:
if pval < alpha_sig:
    print(f"Because our p-value of {pval} is LESS THAN the alpha significance ({alpha_sig}) we CAN reject the Null Hypothesis (in favor of the Alt Hypothesis)")
Because our p-value of 0.0006 is LESS THAN the alpha significance (0.001) we CAN reject the Null Hypothesis (in favor of the Alt Hypothesis)
Summary
So there we have it. Our p-value falls below the alpha of 0.001, which tells us that we can reject our Null Hypothesis, and therefore there is sufficient evidence that the Zinc concentration at the surface of the lakes is significantly different from the Zinc concentration at the bottom of the lakes.
Thanks for reading!