July 20, 2018

Linear Models

  • Linear models have several specialized variants
    • Two-Sample T-test
    • Simple Regression
    • ANOVA - (Essentially a 2-sample t-test with more than two categories)
    • Multiple Regression - Several explanatory covariates (either categorical or continuous)
    • Mixed Effects Models
    • Generalized Additive Models (GAMs)

T-tests

  • In order for a t-test with equal variances to be appropriate, one of the following must be true:
    • The observed data could reasonably be modeled as \[Y_{ij} = \mu_i + \epsilon_{ij} \;\;\; \textrm{ where } \epsilon_{ij} \stackrel{iid}{\sim} N(0, \sigma)\]
    • The sample size is sufficiently large for the Central Limit Theorem to apply, so that \(\bar{Y}_i \stackrel{.}{\sim} N\left( \mu_i, \frac{\sigma}{\sqrt{n}} \right)\)
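As a concrete sketch, the equal-variance two-sample t-test can be run directly in R. The data here are simulated for illustration, not from any real study:

```r
# Simulated data for illustration only: two groups with different means
set.seed(1)
y1 <- rnorm(10, mean = 5, sd = 1)
y2 <- rnorm(10, mean = 6, sd = 1)

# var.equal = TRUE gives the pooled (equal-variance) t-test described above
res <- t.test(y1, y2, var.equal = TRUE)
res
```

With \(n_1 = n_2 = 10\), the pooled test has \(n_1 + n_2 - 2 = 18\) degrees of freedom.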

Linear Models in general

  • Have several model assumptions
    • Independence of error terms
    • Constant variance of error terms
    • Normally distributed error terms
  • "All models are wrong, but some are useful" - George Box
  • Model assumptions must be approximately met for the results to be useful.

Linear Models are robust (to some violations of assumptions)

  • If the model assumptions aren't (approximately) met:
    • Non-Independence - BIG PROBLEM!
      • Addressed by modeling how the observations are related
      • Should be addressed in the study design, not post-hoc.
    • Non-Constant Variance - Problem!
      • Addressed by transforming the response variable
    • Non-Normality of Errors - Problem if insufficient sample size.
      • Addressed by transforming the response variable

An Example

The ratio of DDE (related to DDT) to PCB concentrations in bird eggs has been shown to have a number of biological implications. The ratio is used as an indication of the movement of contamination through the food chain.

Data - Dotplots

Data - Boxplots

Data

  • Every single Aquatic observation is lower than the smallest Terrestrial observation.
  • Obviously there is a significant difference between Aquatic and Terrestrial, but how should I quantify that difference statistically?
    • The sample size is pretty small, so maybe it is just random chance…
    • But the data, particularly the Terrestrial data, are certainly not Normally distributed.

Typical fix for model assumptions

  • Non-Constant Variance or Non-Normality?
    • Transform the response variable \(Y\)
      • \(\log(Y)\)
      • \(\sqrt{Y}\)
      • rank transformation
      • sign transformation
    • These transformations are very commonly done!
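Each of these is a one-liner in R; the response values below are made up purely for illustration:

```r
# Illustrative positive responses (made-up values)
y <- c(0.02, 0.18, 3.51, 76.5)

log(y)    # log transform (natural log)
sqrt(y)   # square-root transform
rank(y)   # rank transform: 1 = smallest
```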

Log Transform the Data?

Sign Transformations

  • While the log-transformation is better, it still isn't great.
  • But these are clearly different distributions.
  • Let's just use the sign transformation, where we pivot the Ratio values about \(y=1\). \[Y_i^* = \left\{ \begin{aligned} -1 & \textrm{ if } Y_i < 1 \\ +1 & \textrm{ if } Y_i \ge 1 \\ \end{aligned} \right.\] \[Y_i^* = \textrm{sign}(Y_i - 1)\]
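In R the transformation is one line. Note that `sign()` alone returns 0 at exactly \(Y_i = 1\), so `ifelse()` matches the "\(\ge 1\) maps to \(+1\)" convention exactly (the ratio values here are a few taken from the data table later in the notes):

```r
ratio <- c(76.5, 6.03, 3.51, 0.16, 0.18, 0.02)

# sign(ratio - 1) is -1 below 1 and +1 above, but returns 0 at ratio == 1;
# ifelse() implements the definition exactly
y.star <- ifelse(ratio >= 1, 1, -1)
y.star
```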

This feels sort-of like a Chi-Squared Test!

            -1 +1
Aquatic     11  0
Terrestrial  0 11

What would happen if the distributions were the same?

  • We are building up the idea of the sampling distribution, assuming \(H_0\): no difference in distribution.
  • If there were no difference between groups, then the -1 and +1 values would be equally spread between the two groups.

I might have seen the following

            -1 +1
Aquatic      5  6
Terrestrial  7  4

Give the groups numerical labels as well!

    -1 +1
-1  11  0
+1   0 11
  • Test statistic?
  • Define \(G_i\) as the group number (-1 or +1)
  • \[X = \sum_{i=1}^n G_i Y_i^* = 22\]
  • This is as large as the test statistic could possibly be.

What does the sampling distribution look like?

  • We need to figure out the distribution of the test statistic under the null hypothesis of no difference
  • Under \(H_0\), the 11 Aquatic observations should be roughly evenly split between the above-1 and below-1 classes.
  • Likewise, the 11 Terrestrial observations should be roughly evenly split between those classes.
  • We could use probability theory to figure this out… but we are lazy.

Data we start with

   Ratio        Type Group Sign
1  76.50 Terrestrial     1    1
2   6.03 Terrestrial     1    1
3   3.51 Terrestrial     1    1
20  0.16     Aquatic    -1   -1
21  0.18     Aquatic    -1   -1
22  0.02     Aquatic    -1   -1

Shuffle the Sign Values

PollutionRatios %>%
  mutate( Sign.Sim = mosaic::shuffle(Sign) ) %>%
  FSA::headtail() 
##    Ratio        Type Group Sign Sign.Sim
## 1  76.50 Terrestrial     1    1        1
## 2   6.03 Terrestrial     1    1       -1
## 3   3.51 Terrestrial     1    1       -1
## 20  0.16     Aquatic    -1   -1        1
## 21  0.18     Aquatic    -1   -1        1
## 22  0.02     Aquatic    -1   -1       -1

Multiply Group * Sign.Sim

PollutionRatios %>%
  mutate( Sign.Sim = mosaic::shuffle(Sign) ) %>%
  mutate( output = Group * Sign.Sim ) %>%
  FSA::headtail() 
##    Ratio        Type Group Sign Sign.Sim output
## 1  76.50 Terrestrial     1    1        1      1
## 2   6.03 Terrestrial     1    1       -1     -1
## 3   3.51 Terrestrial     1    1        1      1
## 20  0.16     Aquatic    -1   -1        1     -1
## 21  0.18     Aquatic    -1   -1       -1      1
## 22  0.02     Aquatic    -1   -1       -1      1

Then sum the output column

PollutionRatios %>%
  mutate( Sign.Sim = mosaic::shuffle(Sign) ) %>%
  mutate( output = Group * Sign.Sim ) %>%
  summarise( X = sum( output ) ) 
##   X
## 1 2

Do this MANY MANY times

SamplingDist <- mosaic::do(10000) * 
  PollutionRatios %>%
    mutate( Sign.Sim = mosaic::shuffle(Sign) ) %>%
    mutate( output = Group * Sign.Sim ) %>%
    summarise( X = sum( output ) ) 

The output distribution

##        X
## 1      2
## 2     -2
## 3      2
## 9998  -6
## 9999  -6
## 10000  2

The output distribution

The p-value

  • Recall a p-value is "the probability of seeing your data, or something more extreme, given the null hypothesis is true."
  • Because NONE of the 10,000 simulations produced a statistic as or more extreme than the observed \(X = 22\), we can approximate the p-value as less than 1 in 10,000.
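The whole permutation test can also be sketched in a few lines of base R, without the mosaic helpers. The group sizes and the perfect separation of the signs are taken from the tables above:

```r
# 11 Aquatic (-1) and 11 Terrestrial (+1) observations
group  <- rep(c(-1, 1), each = 11)
# Observed signs: perfectly separated, matching the 2x2 table above
sign.y <- rep(c(-1, 1), each = 11)

X.obs <- sum(group * sign.y)   # observed test statistic: 22

# Shuffle the signs many times to build the null sampling distribution
set.seed(1)
X.sim <- replicate(10000, sum(group * sample(sign.y)))

# Estimated two-sided p-value
mean(abs(X.sim) >= X.obs)
```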

Sign Transformation Summary

  • The sign transformation takes a quantitative variable and turns it into a categorical variable with just two levels.
  • If the resulting data is amenable, we could then do a traditional categorical analysis, but often simulating the sampling distribution is just as easy.
  • But in reducing a quantitative variable to just two classes, we've lost a huge amount of information.
  • Is there an intermediate transformation?

Rank Transformation

  • Sort all of the data, smallest-to-largest, and call the order number the rank.
  • Smallest value has rank 1, second smallest has rank 2, etc, until the largest value has rank \(n\).
  • If there are ties, give an average rank.
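R's `rank()` function uses exactly this average-rank convention for ties by default. With the distances from the tent example:

```r
distance <- c(10, 12, 14, 16, 16, 18, 20, 20)

# The two 16s share ranks 4 and 5, so each gets 4.5;
# the two 20s share ranks 7 and 8, so each gets 7.5
rank(distance)
```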

Example: Camouflage Tent Material

The military was testing whether a green patterned camouflage material for tents was more effective than a solid plain green material. The data below are the average distance an observer was from a particular tent when it was first observed.

Example: Camouflage Tent Material

Distance  Type   Rank
      10  Camo    1
      12  Camo    2
      14  Camo    3
      16  Camo    4.5
      16  Plain   4.5
      18  Camo    6
      20  Camo    7.5
      20  Camo    7.5

Now what?

We have two options:

  • Treat the ranks as data and do a standard t-test.
    • This is ok.
    • If \(H_0\) is true, then the ranks would be uniformly distributed, and with only moderate sample sizes the Central Limit Theorem ensures this is ok.
    • But this is based on the mean rank.
  • Come up with a test statistic that targets the median.
    • The median is an appropriate measure of center in skewed distributions.
    • What we'll use will detect a difference in distribution, not just difference in medians.
    • This should be a more powerful test.

Standard T-test on the ranks

## 
##  Welch Two Sample t-test
## 
## data:  Rank by Type
## t = -3.7247, df = 17.849, p-value = 0.00157
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.88938  -3.31062
## sample estimates:
##  mean in group Camo mean in group Plain 
##                 6.7                14.3

Wilcoxon Rank Sum

  • Let
    • \(n_i\) be the number of observations in group \(i\)
    • \(R_{ij}\) be the rank of the \(j\)th observation in group \(i\)
    • \(n = \sum n_i\)
  • For each group, calculate the sum of the ranks \[R_i = \sum_{j=1}^{n_i} R_{ij}\]

Wilcoxon Rank Sum

  • Note, for two groups: \(R_1 + R_2 = n(n+1)/2\)

  • Under the null hypothesis, \[R_1 \approx R_2 \approx \frac{n(n+1)}{4}\]

  • Let \(W = R_1 - R_2\)
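A quick sanity check of the identity using the rank sums from the tent data (\(R_1 = 67\), \(R_2 = 143\), \(n = 20\)):

```r
n  <- 20                     # total observations in the tent data
R1 <- 67                     # Camo rank sum
R2 <- 143                    # Plain rank sum

R1 + R2 == n * (n + 1) / 2   # TRUE: the rank sums must total 210
abs(R1 - R2)                 # |W| = 76
```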

Wilcoxon Rank Sum

  Tents %>%
    group_by(Type) %>%
    summarise( R = sum( Rank ) ) 
## # A tibble: 2 x 2
##   Type      R
##   <fct> <dbl>
## 1 Camo     67
## 2 Plain   143

Wilcoxon Rank Sum

  Tents %>%
    group_by(Type) %>%
    summarise( R = sum( Rank ) ) %>%
    summarise( W=diff(R) )
## # A tibble: 1 x 1
##       W
##   <dbl>
## 1    76

Simulating the Sampling Distribution under \(H_0\)

SamplingDist <- mosaic::do(10000) * 
  Tents %>%
    mutate( Rank.Sim = mosaic::shuffle(Rank) ) %>%
    group_by(Type) %>%
    summarise( R = sum( Rank.Sim ) ) %>%
    summarise( W=diff(R) )

Simulating the Sampling Distribution under \(H_0\)

Calculating an estimated p-value

SamplingDist %>%
  summarize( p.value = mosaic::prop( abs(W) >= 76 ) )
##   p.value
## 1  0.0037

Mann-Whitney U-Statistic

Another, traditionally used, choice of test statistic is: \[U_i = R_i - \frac{n_i(n_i + 1)}{2}\] \[U = \min (U_1, U_2)\]

  • \(U_1\) is the number of pairs \((j,k)\) such that \(y_{1,j} > y_{2,k}\).
  • Before we could easily simulate the sampling distribution, the U-statistic had a more convenient set of tables to use and could deal with ties better.
  • It is equivalent to the Wilcoxon Rank Sum test.
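R's built-in `wilcox.test()` computes this statistic directly (its output labels the Mann-Whitney \(U\) as `W`, a common source of confusion). The distances below are made up for illustration, not the actual tent data; because they contain ties, R warns that it cannot compute an exact p-value:

```r
# Made-up distances for illustration (not the actual tent data)
camo  <- c(10, 12, 14, 16, 18, 20)
plain <- c(16, 20, 25, 30, 35, 40)

# suppressWarnings() because ties prevent an exact p-value
res <- suppressWarnings(wilcox.test(camo, plain))
res$statistic   # the Mann-Whitney U for the first group
res$p.value
```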

Considerations

  • Nonparametric tests are more widely applicable than a standard t-test.
  • By using rank- or sign-transformed responses, we cannot make confidence intervals on a scale that is scientifically useful.
  • If the usual requirements are met for a t-test, then the standard approach is more powerful and should preferentially be used.

Simulation Study: Goals to Ascertain

  • Question 1: If \(H_0\) is true, does an \(\alpha\)-level test reject the null hypothesis at rate \(\alpha\)?

  • Question 2: If \(H_0\) is false, which test procedure rejects \(H_0\) more frequently? The probability of rejecting \(H_0\) is referred to as power.

Distributions to sample from

What if the data was Normally distributed?

Normal Data - Type I Error Rate

  • If \(H_0\) is true, we should reject \(H_0\) at some specified rate. Usually 0.05.
  • Generate 10000 datasets where \(H_0\) is true, and perform each analysis.
  • Observe the proportion of datasets that result in a p-value less than 0.05.
  • Give a 95% Confidence Interval for the Type I Error rate, too.
  • If Type I Error Rate is too big, that is BAD.
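A minimal version of this simulation for the plain t-test. The per-group sample size here is my assumption, not the original study's setting:

```r
set.seed(42)
n <- 20                      # observations per group (assumed)

# H0 is true: both groups come from the same N(0,1) distribution
reject <- replicate(10000, {
  t.test(rnorm(n), rnorm(n))$p.value < 0.05
})

mean(reject)                             # estimated Type I error rate, near 0.05
binom.test(sum(reject), 10000)$conf.int  # 95% CI for the error rate
```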

Normal Data - Type I Error Rate

method                     Est   lwr   upr
t-test                  0.0469 0.043 0.051
Rank Transformed t-test 0.0495 0.045 0.054
Mann-Whitney            0.0409 0.037 0.045

Normal Data - Power

  • Generate many datasets where the null hypothesis is false and the difference in means is not zero.
  • The percentage of datasets that result in a statistically significant difference is the power.
  • As the means of the two populations get farther apart (\(\delta\)), the power, probability of rejecting \(H_0\), should get larger.
  • The larger the probability of rejection, the better!
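The same skeleton estimates power; just generate data with \(H_0\) false. The shift \(\delta = 1\) and \(n = 20\) per group are illustrative choices, not the original study's settings:

```r
set.seed(7)
n <- 20; delta <- 1          # assumed sample size and mean difference

# H0 is false: the second group's mean is shifted by delta
reject <- replicate(10000, {
  t.test(rnorm(n), rnorm(n, mean = delta))$p.value < 0.05
})

mean(reject)   # estimated power at this delta
```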

Normal Data - Power

Gamma distributed data - Type I Error Rate

method                    Est   lwr   upr
t-test                  0.046 0.042 0.051
Rank Transformed t-test 0.053 0.049 0.057
Mann-Whitney            0.043 0.039 0.047

Gamma Distribution - Power

LogNormal distributed data - Type I Error Rates

method                    Est   lwr   upr
t-test                  0.018 0.011 0.029
Rank Transformed t-test 0.051 0.039 0.067
Mann-Whitney            0.043 0.032 0.058

LogNormal distributed data - power