July 20, 2018

Linear Models

  • Linear models have several specialized variants
    • Two-Sample T-test
    • Simple Regression
    • ANOVA - (Essentially a 2-sample t-test with more than two categories)
    • Multiple Regression - Several explanatory covariates (either categorical or continuous)
    • Mixed Effects Models
    • Generalized Additive Models (GAMs)

T-tests

  • In order for a t-test with equal variances to be appropriate, one of the following must be true:
    • The observed data could reasonably be modeled as \[Y_{ij} = \mu_i + \epsilon_{ij} \;\;\; \textrm{ where } \epsilon_{ij} \stackrel{iid}{\sim} N(0, \sigma)\]
    • The sample size is sufficiently large for the Central Limit Theorem to apply, so that \(\bar{Y}_i \stackrel{.}{\sim} N\left( \mu_i, \frac{\sigma}{\sqrt{n}} \right)\)
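As a concrete sketch, the equal-variance two-sample t-test can be run directly in R. The data here are simulated for illustration, not from any real study:

```r
# Simulated data for illustration only: two groups with different means
set.seed(1)
y1 <- rnorm(10, mean = 5, sd = 1)
y2 <- rnorm(10, mean = 6, sd = 1)

# var.equal = TRUE gives the pooled (equal-variance) t-test described above
res <- t.test(y1, y2, var.equal = TRUE)
res
```

With \(n_1 = n_2 = 10\), the pooled test has \(n_1 + n_2 - 2 = 18\) degrees of freedom.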

Linear Models in general

  • Have several model assumptions
    • Independence of error terms
    • Constant variance of error terms
    • Normally distributed error terms
  • "All models are wrong, but some are useful" - George Box
  • Model assumptions must be approximately met for the results to be useful.

Linear Models are robust (to some violations of assumptions)

  • If the model assumptions aren't (approximately) met:
    • Non-Independence - BIG PROBLEM!
      • Addressed by modeling how the observations are related
      • Should be addressed in the study design, not post-hoc.
    • Non-Constant Variance - Problem!
      • Addressed by transforming the response variable
    • Non-Normality of Errors - Problem if insufficient sample size.
      • Addressed by transforming the response variable

An Example

The ratio of DDE (related to DDT) to PCB concentrations in bird eggs has been shown to have a number of biological implications. The ratio is used as an indication of the movement of contamination through the food chain.

Data - Dotplots

Data - Boxplots

Data

  • Every single Aquatic observation is lower than the smallest Terrestrial observation.
  • Obviously there is a significant difference between Aquatic and Terrestrial, but how should I quantify that difference statistically?
    • The sample size is pretty small, so maybe it is just random chance…
    • But the data, particularly the Terrestrial data, are certainly not Normally distributed.

Typical fix for model assumptions

  • Non-Constant Variance or Non-Normality?
    • Transform the response variable \(Y\)
      • \(\log(Y)\)
      • \(\sqrt{Y}\)
      • rank transformation
      • sign transformation
    • These transformations are very commonly done!
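Each of these is a one-liner in R; the response values below are made up purely for illustration:

```r
# Illustrative positive responses (made-up values)
y <- c(0.02, 0.18, 3.51, 76.5)

log(y)    # log transform (natural log)
sqrt(y)   # square-root transform
rank(y)   # rank transform: 1 = smallest
```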

Log Transform the Data?

Sign Transformations

  • While the log-transformation is better, it still isn't great.
  • But these are clearly different distributions.
  • Let's just use the sign transformation, where we pivot the Ratio values about \(y=1\). \[Y_i^* = \left\{ \begin{aligned} -1 & \textrm{ if } Y_i < 1 \\ +1 & \textrm{ if } Y_i \ge 1 \\ \end{aligned} \right.\] \[Y_i^* = \textrm{sign}(Y_i - 1)\]
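In R the transformation is one line. Note that `sign()` alone returns 0 at exactly \(Y_i = 1\), so `ifelse()` matches the "\(\ge 1\) maps to \(+1\)" convention exactly (the ratio values here are a few taken from the data table later in the notes):

```r
ratio <- c(76.5, 6.03, 3.51, 0.16, 0.18, 0.02)

# sign(ratio - 1) is -1 below 1 and +1 above, but returns 0 at ratio == 1;
# ifelse() implements the definition exactly
y.star <- ifelse(ratio >= 1, 1, -1)
y.star
```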

This feels sort-of like a Chi-Squared Test!

            -1 +1
Aquatic     11  0
Terrestrial  0 11

What would happen if the distributions were the same?

  • We are building up the idea of the sampling distribution, assuming \(H_0\): no difference in distribution.
  • If there were no difference between groups, then the -1 and +1 values would be equally spread between the two groups.

I might have seen the following

            -1 +1
Aquatic      5  6
Terrestrial  7  4

Give the groups numerical labels as well!

    -1 +1
-1  11  0
+1   0 11
  • Test statistic?
  • Define \(G_i\) as the group number (-1 or +1)
  • \[X = \sum_{i=1}^n G_i Y_i^* = 22\]
  • This is as large as the test statistic could possibly be.

What does the sampling distribution look like?

  • We need to figure out the distribution of the test statistic under the null hypothesis of no difference
  • Under \(H_0\), the 11 Aquatic observations should be roughly evenly split between the above-1 and below-1 classes.
  • Likewise, the 11 Terrestrial observations should be roughly evenly split between those classes.
  • We could use probability theory to figure this out… but we are lazy.

Data we start with

   Ratio        Type Group Sign
1  76.50 Terrestrial     1    1
2   6.03 Terrestrial     1    1
3   3.51 Terrestrial     1    1
20  0.16     Aquatic    -1   -1
21  0.18     Aquatic    -1   -1
22  0.02     Aquatic    -1   -1

Shuffle the Sign Values

PollutionRatios %>%
  mutate( Sign.Sim = mosaic::shuffle(Sign) ) %>%
  FSA::headtail() 
##    Ratio        Type Group Sign Sign.Sim
## 1  76.50 Terrestrial     1    1        1
## 2   6.03 Terrestrial     1    1       -1
## 3   3.51 Terrestrial     1    1       -1
## 20  0.16     Aquatic    -1   -1        1
## 21  0.18     Aquatic    -1   -1        1
## 22  0.02     Aquatic    -1   -1       -1

Multiply Group * Sign.Sim

PollutionRatios %>%
  mutate( Sign.Sim = mosaic::shuffle(Sign) ) %>%
  mutate( output = Group * Sign.Sim ) %>%
  FSA::headtail() 
##    Ratio        Type Group Sign Sign.Sim output
## 1  76.50 Terrestrial     1    1        1      1
## 2   6.03 Terrestrial     1    1       -1     -1
## 3   3.51 Terrestrial     1    1        1      1
## 20  0.16     Aquatic    -1   -1        1     -1
## 21  0.18     Aquatic    -1   -1       -1      1
## 22  0.02     Aquatic    -1   -1       -1      1

Then sum the output column

PollutionRatios %>%
  mutate( Sign.Sim = mosaic::shuffle(Sign) ) %>%
  mutate( output = Group * Sign.Sim ) %>%
  summarise( X = sum( output ) ) 
##   X
## 1 2

Do this MANY MANY times

SamplingDist <- mosaic::do(10000) * 
  PollutionRatios %>%
    mutate( Sign.Sim = mosaic::shuffle(Sign) ) %>%
    mutate( output = Group * Sign.Sim ) %>%
    summarise( X = sum( output ) ) 

The output distribution

##        X
## 1      2
## 2     -2
## 3      2
## 9998  -6
## 9999  -6
## 10000  2

The output distribution

The p-value

  • Recall a p-value is "the probability of seeing your data, or something more extreme, given the null hypothesis is true."
  • Because NONE of the 10,000 simulations produced a statistic as or more extreme than the observed \(X = 22\), we can approximate the p-value as less than 1 in 10,000.
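The whole permutation test can also be sketched in a few lines of base R, without the mosaic helpers. The group sizes and the perfect separation of the signs are taken from the tables above:

```r
# 11 Aquatic (-1) and 11 Terrestrial (+1) observations
group  <- rep(c(-1, 1), each = 11)
# Observed signs: perfectly separated, matching the 2x2 table above
sign.y <- rep(c(-1, 1), each = 11)

X.obs <- sum(group * sign.y)   # observed test statistic: 22

# Shuffle the signs many times to build the null sampling distribution
set.seed(1)
X.sim <- replicate(10000, sum(group * sample(sign.y)))

# Estimated two-sided p-value
mean(abs(X.sim) >= X.obs)
```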

Sign Transformation Summary

  • The sign transformation takes a quantitative variable and turns it into a categorical variable with just two levels.
  • If the resulting data is amenable, we could then do a traditional categorical analysis, but often simulating the sampling distribution is just as easy.
  • But in reducing a quantitative variable to just two classes, we've lost a huge amount of information.
  • Is there an intermediate transformation?

Rank Transformation

  • Sort all of the data, smallest-to-largest, and call the order number the rank.
  • Smallest value has rank 1, second smallest has rank 2, etc, until the largest value has rank \(n\).
  • If there are ties, give an average rank.
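R's `rank()` function uses exactly this average-rank convention for ties by default. With the distances from the tent example:

```r
distance <- c(10, 12, 14, 16, 16, 18, 20, 20)

# The two 16s share ranks 4 and 5, so each gets 4.5;
# the two 20s share ranks 7 and 8, so each gets 7.5
rank(distance)
```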

Example: Camouflage Tent Material

The military was testing whether a green patterned camouflage material for tents was more effective than a solid plain green material. The data below are the average distance an observer was from a particular tent when it was first observed.

Example: Camouflage Tent Material

Distance  Type   Rank
      10  Camo    1
      12  Camo    2
      14  Camo    3
      16  Camo    4.5
      16  Plain   4.5
      18  Camo    6
      20  Camo    7.5
      20  Camo    7.5

Now what?

We have two options:

  • Treat the ranks as data and do a standard t-test.
    • This is ok.
    • If \(H_0\) is true, then the ranks would be uniformly distributed, and with only moderate sample sizes the Central Limit Theorem ensures this is ok.
    • But this is based on the mean rank.
  • Come up with a test statistic that targets the median.
    • The median is an appropriate measure of center in skewed distributions.
    • What we'll use will detect a difference in distribution, not just difference in medians.
    • This should be a more powerful test.

Standard T-test on the ranks

## 
##  Welch Two Sample t-test
## 
## data:  Rank by Type
## t = -3.7247, df = 17.849, p-value = 0.00157
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.88938  -3.31062
## sample estimates:
##  mean in group Camo mean in group Plain 
##                 6.7                14.3

Wilcoxon Rank Sum

  • Let
    • \(n_i\) be the number of observations in group \(i\)
    • \(R_{ij}\) be the rank of the \(j\)th observation in group \(i\)
    • \(n = \sum n_i\)
  • For each group, calculate the sum of the ranks \[R_i = \sum_{j=1}^{n_i} R_{ij}\]

Wilcoxon Rank Sum

  • Note, for two groups: \(R_1 + R_2 = n(n+1)/2\)

  • Under the null hypothesis, \[R_1 \approx R_2 \approx \frac{n(n+1)}{4}\]

  • Let \(W = R_1 - R_2\)
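A quick sanity check of the identity using the rank sums from the tent data (\(R_1 = 67\), \(R_2 = 143\), \(n = 20\)):

```r
n  <- 20                     # total observations in the tent data
R1 <- 67                     # Camo rank sum
R2 <- 143                    # Plain rank sum

R1 + R2 == n * (n + 1) / 2   # TRUE: the rank sums must total 210
abs(R1 - R2)                 # |W| = 76
```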

Wilcoxon Rank Sum

  Tents %>%
    group_by(Type) %>%
    summarise( R = sum( Rank ) ) 
## # A tibble: 2 x 2
##   Type      R
##   <fct> <dbl>
## 1 Camo     67
## 2 Plain   143

Wilcoxon Rank Sum

  Tents %>%
    group_by(Type) %>%
    summarise( R = sum( Rank ) ) %>%
    summarise( W=diff(R) )
## # A tibble: 1 x 1
##       W
##   <dbl>
## 1    76

Simulating the Sampling Distribution under \(H_0\)

SamplingDist <- mosaic::do(10000) * 
  Tents %>%
    mutate( Rank.Sim = mosaic::shuffle(Rank) ) %>%
    group_by(Type) %>%
    summarise( R = sum( Rank.Sim ) ) %>%
    summarise( W=diff(R) )

Simulating the Sampling Distribution under \(H_0\)

Calculating an estimated p-value

SamplingDist %>%
  summarize( p.value = mosaic::prop( abs(W) >= 76 ) )
##   p.value
## 1  0.0037

Mann-Whitney U-Statistic

Another, traditionally used, choice of test statistic is: \[U_i = R_i - \frac{n_i(n_i + 1)}{2}\] \[U = \min (U_1, U_2)\]

  • \(U_1\) is the number of pairs \((j,k)\) such that \(y_{1,j} > y_{2,k}\).
  • Before we could easily simulate the sampling distribution, the U-statistic had a more convenient set of tables to use and could deal with ties better.
  • It is equivalent to the Wilcoxon Rank Sum test.
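R's built-in `wilcox.test()` computes this statistic directly (its output labels the Mann-Whitney \(U\) as `W`, a common source of confusion). The distances below are made up for illustration, not the actual tent data; because they contain ties, R warns that it cannot compute an exact p-value:

```r
# Made-up distances for illustration (not the actual tent data)
camo  <- c(10, 12, 14, 16, 18, 20)
plain <- c(16, 20, 25, 30, 35, 40)

# suppressWarnings() because ties prevent an exact p-value
res <- suppressWarnings(wilcox.test(camo, plain))
res$statistic   # the Mann-Whitney U for the first group
res$p.value
```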

Considerations

  • Nonparametric tests are more widely applicable than a standard t-test.
  • By using rank- or sign-transformed responses, we cannot make confidence intervals on a scale that is scientifically useful.
  • If the usual requirements are met for a t-test, then the standard approach is more powerful and should preferentially be used.

Simulation Study: Goals to Ascertain

  • Question 1: If \(H_0\) is true, does an \(\alpha\)-level test reject the null hypothesis at rate \(\alpha\)?

  • Question 2: If \(H_0\) is false, which test procedure rejects \(H_0\) more frequently? The probability of rejecting \(H_0\) is referred to as power.

Distributions to sample from

What if the data was Normally distributed?

Normal Data - Type I Error Rate

  • If \(H_0\) is true, we should reject \(H_0\) at some specified rate. Usually 0.05.
  • Generate 10000 datasets where \(H_0\) is true, and perform each analysis.
  • Observe the proportion of datasets that result in a p-value less than 0.05.
  • Give a 95% Confidence Interval for the Type I Error rate, too.
  • If Type I Error Rate is too big, that is BAD.
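A minimal version of this simulation for the plain t-test. The per-group sample size here is my assumption, not the original study's setting:

```r
set.seed(42)
n <- 20                      # observations per group (assumed)

# H0 is true: both groups come from the same N(0,1) distribution
reject <- replicate(10000, {
  t.test(rnorm(n), rnorm(n))$p.value < 0.05
})

mean(reject)                             # estimated Type I error rate, near 0.05
binom.test(sum(reject), 10000)$conf.int  # 95% CI for the error rate
```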

Normal Data - Type I Error Rate

method                     Est   lwr   upr
t-test                  0.0469 0.043 0.051
Rank Transformed t-test 0.0495 0.045 0.054
Mann-Whitney            0.0409 0.037 0.045

Normal Data - Power

  • Generate many datasets where the null hypothesis is false and the difference in means is not zero.
  • The percentage of datasets that result in a statistically significant difference is the power.
  • As the means of the two populations get farther apart (\(\delta\)), the power, probability of rejecting \(H_0\), should get larger.
  • The larger the probability of rejection, the better!
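The same skeleton estimates power; just generate data with \(H_0\) false. The shift \(\delta = 1\) and \(n = 20\) per group are illustrative choices, not the original study's settings:

```r
set.seed(7)
n <- 20; delta <- 1          # assumed sample size and mean difference

# H0 is false: the second group's mean is shifted by delta
reject <- replicate(10000, {
  t.test(rnorm(n), rnorm(n, mean = delta))$p.value < 0.05
})

mean(reject)   # estimated power at this delta
```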

Normal Data - Power

Gamma distributed data - Type I Error Rate

method                    Est   lwr   upr
t-test                  0.046 0.042 0.051
Rank Transformed t-test 0.053 0.049 0.057
Mann-Whitney            0.043 0.039 0.047

Gamma Distribution - Power

LogNormal distributed data - Type I Error Rates

method                    Est   lwr   upr
t-test                  0.018 0.011 0.029
Rank Transformed t-test 0.051 0.039 0.067
Mann-Whitney            0.043 0.032 0.058

LogNormal distributed data - power