Chapter 1 Data Manipulation

Most of the time, our data is in the form of a data frame and we are interested in exploring the relationships between its columns. This chapter explores methods for manipulating data frames.

1.1 Classic R functions for summarizing rows and columns

1.1.1 summary()

The first method is to calculate some basic summary statistics (minimum, 25th, 50th, and 75th percentiles, maximum, and mean) of each column. If a column is categorical, the summary function will instead return the number of observations in each category.

# use the iris data set which has both numerical and categorical variables
data( iris )
str(iris)    # recall what columns we have 
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# display the summary for each column
summary( iris )
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

1.1.2 apply()

The summary function is convenient, but we want the ability to pick another function to apply to each column and possibly to each row. To demonstrate this, suppose we have a data frame that contains students' grades over the semester.

# make up some data
grades <- data.frame(
  l.name = c('Cox', 'Dorian', 'Kelso', 'Turk'),
  Exam1 = c(93, 89, 80, 70),
  Exam2 = c(98, 70, 82, 85),
  Final = c(96, 85, 81, 92) )

The apply() function will apply an arbitrary function to each row (or column) of a matrix or a data frame and then aggregate the results into a vector.

# Because I can't take the mean of the last names column,
# remove the name column
scores <- grades[,-1]
scores
##   Exam1 Exam2 Final
## 1    93    98    96
## 2    89    70    85
## 3    80    82    81
## 4    70    85    92
# Summarize each column by calculating the mean.
apply( scores,      # what object do I want to apply the function to
       MARGIN=2,    # rows = 1, columns = 2  (same order as [rows, cols])
       FUN=mean     # what function do we want to apply     
     )
## Exam1 Exam2 Final 
## 83.00 83.75 88.50

To apply a function to the rows, we just change which margin we want. We might want to calculate the average exam score for each person.

apply( scores,      # what object do I want to apply the function to
       MARGIN=1,    # rows = 1, columns = 2  (same order as [rows, cols])
       FUN=mean     # what function do we want to apply
     )
## [1] 95.66667 81.33333 81.00000 82.33333

This is useful, but it would be more useful to concatenate this as a new column in my grades data frame.

average <- apply( 
  scores,      # what object do I want to apply the function to
  MARGIN=1,    # rows = 1, columns = 2  (same order as [rows, cols])
  FUN=mean     # what function do we want to apply
)
grades <- cbind( grades, average ) # squish together 
grades
##   l.name Exam1 Exam2 Final  average
## 1    Cox    93    98    96 95.66667
## 2 Dorian    89    70    85 81.33333
## 3  Kelso    80    82    81 81.00000
## 4   Turk    70    85    92 82.33333

There are several variants of the apply() function, and the variant I use most often is the function sapply(), which will apply a function to each element of a list or vector and return a corresponding list or vector of results.
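
For example, here is a minimal sketch applying sqrt() to each element of a numeric vector with sapply():

sapply( c(1, 4, 9, 16), sqrt )   # apply sqrt() to each element; returns a vector
## [1] 1 2 3 4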

1.2 Package dplyr

library(dplyr)   # load the dplyr package!

Many of the tools to manipulate data frames in R were written without a consistent syntax and are difficult to use together. To remedy this, Hadley Wickham (the author of ggplot2) introduced a package called plyr which was quite useful. As with many projects, his first version was good but not great, and he introduced an improved version called dplyr, which works exclusively with data.frames and which we will investigate. The package dplyr strives to provide a convenient and consistent set of functions to handle the most common data frame manipulations, as well as a mechanism for chaining these operations together to perform complex tasks.

Dr. Wickham has put together a very nice introduction to the package that explains in more detail how the various pieces work, and I encourage you to read it at some point. [http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html]

One of the aspects of the data.frame object is that R does some simplification for you, but it does not do it in a consistent manner. Somewhat obnoxiously, character strings are always converted to factors, and subsetting might return a data.frame, a vector, or a scalar. This is fine at the command line, but can be problematic when programming. Furthermore, many operations are pretty slow using data.frame. To get around this, Dr. Wickham introduced a modified version of the data.frame called a tibble. A tibble is a data.frame with a few extra bits. For now we can ignore the differences.
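
As a small sketch (assuming a recent enough dplyr/tibble installation, which re-exports as_tibble()), converting an existing data.frame to a tibble is a one-liner:

iris.tbl <- as_tibble( iris )   # convert the iris data.frame to a tibble
iris.tbl                        # a tibble prints only its first 10 rows plus column types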

The pipe command %>% allows for very readable code. The idea is that the %>% operator works by translating the command a %>% f(b) into the expression f(a,b). This operator works on any function, and it was introduced in the magrittr package. The beauty of this comes when you have a suite of functions that take input arguments of the same type as their output.

For example if we wanted to start with x, and first apply function f(), then g(), and then h(), the usual R command would be h(g(f(x))) which is hard to read because you have to start reading at the innermost set of parentheses. Using the pipe command %>%, this sequence of operations becomes x %>% f() %>% g() %>% h().
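
As a concrete sketch using base functions:

x <- c(-1, -4, 9)
x %>% abs() %>% sqrt() %>% sum()   # same as sum( sqrt( abs(x) ) )
## [1] 6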

Dr Wickham gave the following example of readability:

bopping(
  scooping_up(
    hopping_through(foo_foo, forest),
    field_mice),
  head)

is more readably written:

foo_foo %>% 
  hopping_through(forest) %>%
  scooping_up( field_mice) %>%
  bopping( head )

In dplyr, all the functions below take a data set as their first argument and output an appropriately modified data set. This will allow me to chain together commands in a readable fashion.

1.2.1 Verbs

The foundational operations to perform on a data set are:

  • Subsetting - Returns a data set with only particular columns or rows

    select - Selecting a subset of columns by name or column number.

    filter - Selecting a subset of rows from a data frame based on logical expressions.

    slice - Selecting a subset of rows by row number.

  • arrange - Re-ordering the rows of a data frame.

  • mutate - Add a new column that is some function of other columns.

  • summarise - Calculate some summary statistic of a column of data. This collapses a set of rows into a single row.

Each of these operations is a function in the package dplyr. These functions all have a similar calling syntax: the first argument is a data set, subsequent arguments describe what to do with the input data frame, and you can refer to the columns without using the df$column notation. All of these functions will return a data set.

1.2.1.1 Subsetting with select, filter, and slice

These functions allow you to select certain columns and rows of a data frame.

1.2.1.1.1 select()

Often you only want to work with a small number of columns of a data frame. It is relatively easy to do this using the standard [,col.name] notation, but it is often pretty tedious.

# recall what the grades are
grades
##   l.name Exam1 Exam2 Final  average
## 1    Cox    93    98    96 95.66667
## 2 Dorian    89    70    85 81.33333
## 3  Kelso    80    82    81 81.00000
## 4   Turk    70    85    92 82.33333

I could select the Exam columns by hand, or by using an extension of the : operator

grades %>% select( Exam1, Exam2 )   # Exam1 and Exam2
##   Exam1 Exam2
## 1    93    98
## 2    89    70
## 3    80    82
## 4    70    85
grades %>% select( Exam1:Final )    # Columns Exam1 through Final
##   Exam1 Exam2 Final
## 1    93    98    96
## 2    89    70    85
## 3    80    82    81
## 4    70    85    92
grades %>% select( -Exam1 )         # Negative indexing by name works
##   l.name Exam2 Final  average
## 1    Cox    98    96 95.66667
## 2 Dorian    70    85 81.33333
## 3  Kelso    82    81 81.00000
## 4   Turk    85    92 82.33333
grades %>% select( 1:2 )            # Can select column by column position
##   l.name Exam1
## 1    Cox    93
## 2 Dorian    89
## 3  Kelso    80
## 4   Turk    70

The select() command has a few other tricks. There are helper functions that describe the columns you wish to select by taking advantage of pattern matching. I generally can get by with starts_with(), ends_with(), and contains(), but there is a final helper, matches(), that takes a regular expression.

grades %>% select( starts_with('Exam') )   # Exam1 and Exam2
##   Exam1 Exam2
## 1    93    98
## 2    89    70
## 3    80    82
## 4    70    85
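
As a sketch of matches(), the regular expression below selects any column whose name ends in a digit:

grades %>% select( matches('[0-9]$') )   # regex: column names ending in a digit
##   Exam1 Exam2
## 1    93    98
## 2    89    70
## 3    80    82
## 4    70    85
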
1.2.1.1.2 filter()

It is common to want to select particular rows based on some logical expression.

# select students with Final grades greater than 90
grades %>% filter(Final > 90)
##   l.name Exam1 Exam2 Final  average
## 1    Cox    93    98    96 95.66667
## 2   Turk    70    85    92 82.33333

You can give multiple logical expressions to select rows, and by default they are combined with 'and' so that only rows satisfying all of the conditions are selected. You can also join logicals together explicitly using the & (and) and | (or) operators. For example, a factor column named type might be used to select rows where type is either one or two via the following: type==1 | type==2.

# select students with Final grades above 90 and
# average score also above 90
grades %>% filter(Final > 90, average > 90)
##   l.name Exam1 Exam2 Final  average
## 1    Cox    93    98    96 95.66667
# we could also use an "and" condition
grades %>% filter(Final > 90 & average > 90)
##   l.name Exam1 Exam2 Final  average
## 1    Cox    93    98    96 95.66667
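
An 'or' condition is a small sketch away:

# select students who scored above 90 on Exam1 or on the Final
grades %>% filter( Exam1 > 90 | Final > 90 )
##   l.name Exam1 Exam2 Final  average
## 1    Cox    93    98    96 95.66667
## 2   Turk    70    85    92 82.33333
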
1.2.1.1.3 slice()

When you want to filter rows based on row number, this is called slicing.

# grab the first 2 rows
grades %>% slice(1:2)
## # A tibble: 2 x 5
##   l.name Exam1 Exam2 Final  average
##   <fctr> <dbl> <dbl> <dbl>    <dbl>
## 1    Cox    93    98    96 95.66667
## 2 Dorian    89    70    85 81.33333

1.2.1.2 arrange()

We often need to re-order the rows of a data frame. For example, we might wish to take our grade book and sort the rows by the average score, or perhaps alphabetically. The arrange() function does exactly that. The first argument is the data frame to re-order, and the subsequent arguments are the columns to sort on. The order of the sorting columns determines the precedence: the first sorting column is used first, and the second sorting column is only used to break ties.

grades %>% arrange(l.name)
##   l.name Exam1 Exam2 Final  average
## 1    Cox    93    98    96 95.66667
## 2 Dorian    89    70    85 81.33333
## 3  Kelso    80    82    81 81.00000
## 4   Turk    70    85    92 82.33333

The default sorting is in ascending order, so to sort the grades with the highest scoring person in the first row, we must tell arrange to do it in descending order using desc(column.name).

grades %>% arrange(desc(Final))
##   l.name Exam1 Exam2 Final  average
## 1    Cox    93    98    96 95.66667
## 2   Turk    70    85    92 82.33333
## 3 Dorian    89    70    85 81.33333
## 4  Kelso    80    82    81 81.00000

In a more complicated example, consider the following data, which we want to order first by Treatment Level and secondarily by the y-value. I want the Treatment level in the default ascending order (Low, Medium, High), but the y variable in descending order.

# make some data
dd <- data.frame(
  Trt = factor(c("High", "Med", "High", "Low"),        
               levels = c("Low", "Med", "High")),
  y = c(8, 3, 9, 9),      
  z = c(1, 1, 1, 2)) 
dd
##    Trt y z
## 1 High 8 1
## 2  Med 3 1
## 3 High 9 1
## 4  Low 9 2
# arrange the rows first by treatment, and then by y (y in descending order)
dd %>% arrange(Trt, desc(y))
##    Trt y z
## 1  Low 9 2
## 2  Med 3 1
## 3 High 9 1
## 4 High 8 1

1.2.1.3 mutate()

I often need to create a new column that is some function of the old columns. In base R this is often cumbersome. Consider the code to calculate the average grade in my grade book example.

grades$average <- (grades$Exam1 + grades$Exam2 + grades$Final) / 3

Instead, we could use the mutate() function and avoid all the grades$ nonsense.

grades %>% mutate( average = (Exam1 + Exam2 + Final)/3 )
##   l.name Exam1 Exam2 Final  average
## 1    Cox    93    98    96 95.66667
## 2 Dorian    89    70    85 81.33333
## 3  Kelso    80    82    81 81.00000
## 4   Turk    70    85    92 82.33333

You can do multiple calculations within the same mutate() command, and you can even refer to columns that were created in the same mutate() command.

grades %>% mutate( 
  average = (Exam1 + Exam2 + Final)/3,
  grade = cut(average, c(0, 60, 70, 80, 90, 100),  # cut takes numeric variable
                       c( 'F','D','C','B','A')) )  # and makes a factor
##   l.name Exam1 Exam2 Final  average grade
## 1    Cox    93    98    96 95.66667     A
## 2 Dorian    89    70    85 81.33333     B
## 3  Kelso    80    82    81 81.00000     B
## 4   Turk    70    85    92 82.33333     B

1.2.1.4 summarise()

By itself, this function is quite boring, but it will become useful later on. Its purpose is to calculate summary statistics using any or all of the data columns. Notice that we get to choose the name of the new column. The way to think about this is that we are collapsing information stored in multiple rows into a single row of values.

# calculate the mean of exam 1
grades %>% summarise( mean.E1=mean(Exam1))
##   mean.E1
## 1      83

We could calculate multiple summary statistics if we like.

# calculate the mean of Exam1 and the standard deviation of Exam2
grades %>% summarise( mean.E1=mean(Exam1), stddev.E2=sd(Exam2) )
##   mean.E1 stddev.E2
## 1      83      11.5

If we want to apply the same statistic to each column, we can use the summarise_each() command (newer versions of dplyr deprecate it in favor of summarise_all(), summarise_at(), and summarise_if(), as the output below notes). We have to be a little careful here because the function we use has to work on every column that isn't part of the grouping structure (see group_by()).

# calculate the mean and stddev of each column - Cannot do this to Names!
grades %>% 
  select( Exam1:Final ) %>%
  summarise_each( funs(mean, sd) )
## `summarise_each()` is deprecated.
## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
## To map `funs` over all variables, use `summarise_all()`
##   Exam1_mean Exam2_mean Final_mean Exam1_sd Exam2_sd Final_sd
## 1         83      83.75       88.5 10.23067     11.5 6.757712
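
The same calculation using the newer summarise_all() interface looks like the following sketch (assuming a dplyr version where funs() is still available):

grades %>% 
  select( Exam1:Final ) %>%
  summarise_all( funs(mean, sd) )
##   Exam1_mean Exam2_mean Final_mean Exam1_sd Exam2_sd Final_sd
## 1         83      83.75       88.5 10.23067     11.5 6.757712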

1.2.1.5 Miscellaneous functions

There are some more functions that are useful but aren't as commonly used. For sampling, the functions sample_n() and sample_frac() will take a subsample of either n rows or a fraction of the data set. The function n() returns the number of rows in the data set. Finally, rename() will rename a selected column.
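
A quick sketch of two of these (the name LastName is purely illustrative):

grades %>% sample_n( 2 )                  # randomly select 2 rows
grades %>% rename( LastName = l.name )    # rename the l.name column to LastName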

1.2.2 Split, apply, combine

Aside from unifying the syntax behind the common operations, the major strength of the dplyr package is the ability to split a data frame into a bunch of sub-dataframes, apply a sequence of one or more of the operations we just described, and then combine the results back together. We'll consider data from an experiment on spinning wool into yarn. This experiment considered two different types of wool (A or B) and three different levels of tension on the thread. The response variable is the number of breaks in the resulting yarn. There are 9 replicate observations for each of the 6 wool-by-tension combinations.

data(warpbreaks)
str(warpbreaks)
## 'data.frame':    54 obs. of  3 variables:
##  $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
##  $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...

The first thing we must do is create a data frame with additional information about how to break the data into sub-dataframes. In this case, I want to break the data up into the 6 wool-by-tension combinations. Initially we will just figure out how many rows are in each wool-by-tension combination.

# group_by:  what variable(s) shall we group on
# n() is a function that returns how many rows are in the 
#   currently selected sub-dataframe
warpbreaks %>% 
  group_by( wool, tension) %>%    # grouping
  summarise(n = n() )             # how many in each group
## # A tibble: 6 x 3
## # Groups:   wool [?]
##     wool tension     n
##   <fctr>  <fctr> <int>
## 1      A       L     9
## 2      A       M     9
## 3      A       H     9
## 4      B       L     9
## 5      B       M     9
## 6      B       H     9

The group_by function takes a data.frame and returns the same data.frame, but with some extra information so that any subsequent function acts on each unique combination defined in the group_by. If you wish to remove this behavior, use ungroup() (or group_by() with no arguments) to reset the grouping to have no grouping variables.
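
A sketch of resetting the grouping:

warpbreaks %>% 
  group_by( wool, tension ) %>%
  ungroup() %>%              # drop the grouping information
  summarise( n = n() )       # now returns a single row with n = 54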

Using the same summarise function, we could calculate the group mean and standard deviation for each wool-by-tension group.

warpbreaks %>% 
  group_by(wool, tension) %>%
  summarise( n           = n(),             # I added some formatting to show the
             mean.breaks = mean(breaks),    # reader I am calculating several
             sd.breaks   = sd(breaks))      # statistics.
## # A tibble: 6 x 5
## # Groups:   wool [?]
##     wool tension     n mean.breaks sd.breaks
##   <fctr>  <fctr> <int>       <dbl>     <dbl>
## 1      A       L     9    44.55556 18.097729
## 2      A       M     9    24.00000  8.660254
## 3      A       H     9    24.55556 10.272671
## 4      B       L     9    28.22222  9.858724
## 5      B       M     9    28.77778  9.431036
## 6      B       H     9    18.77778  4.893306

Instead of summarizing each split, we might want to do some calculation where the output has the same number of rows as the input data frame. In this case I'll tell dplyr that we are mutating the data frame instead of summarizing it. For example, suppose that I want to calculate the residual value \[e_{ijk}=y_{ijk}-\bar{y}_{ij\cdot}\] where \(\bar{y}_{ij\cdot}\) is the mean of each wool:tension combination.

warpbreaks %>% 
   group_by(wool, tension) %>%                 # group by wool:tension
   mutate(resid = breaks - mean(breaks)) %>%   # mean(breaks) of the group!
   head(  )                                    # show the first couple of rows
## # A tibble: 6 x 4
## # Groups:   wool, tension [1]
##   breaks   wool tension      resid
##    <dbl> <fctr>  <fctr>      <dbl>
## 1     26      A       L -18.555556
## 2     30      A       L -14.555556
## 3     54      A       L   9.444444
## 4     25      A       L -19.555556
## 5     70      A       L  25.444444
## 6     52      A       L   7.444444

1.2.3 Chaining commands together

In the previous examples we have used the %>% operator to make the code more readable, but to really appreciate this, we should examine the alternative.

Suppose we have the results of a small 5K race. The data given to us is in the order that the runners signed up, but we want to calculate the results for each gender, calculate the placings, and then sort the data frame by gender and then place. We can think of this process as having three steps:

  1. Splitting
  2. Ranking
  3. Re-arranging.

# input the initial data
race.results <- data.frame(
  name=c('Bob', 'Jeff', 'Rachel', 'Bonnie', 'Derek', 'April','Elise','David'),
  time=c(21.23, 19.51, 19.82, 23.45, 20.23, 24.22, 28.83, 15.73),
  gender=c('M','M','F','F','M','F','F','M'))

We could run all the commands together using the following code:

arrange( 
  mutate(
    group_by( 
      race.results,         # using race.results
      gender),              # group by gender     
    place = rank( time )),  # mutate to calculate the place column
  gender, place)            # arrange the result by gender and place
## # A tibble: 8 x 4
## # Groups:   gender [2]
##     name  time gender place
##   <fctr> <dbl> <fctr> <dbl>
## 1 Rachel 19.82      F     1
## 2 Bonnie 23.45      F     2
## 3  April 24.22      F     3
## 4  Elise 28.83      F     4
## 5  David 15.73      M     1
## 6   Jeff 19.51      M     2
## 7  Derek 20.23      M     3
## 8    Bob 21.23      M     4

This is very difficult to read because you have to read the code from the inside out.

Another (and slightly more readable) way to complete our task is to save each intermediate step of our process and then use that in the next step:

temp.df0 <- race.results %>% group_by( gender)
temp.df1 <- temp.df0 %>% mutate( place = rank(time) )
temp.df2 <- temp.df1 %>% arrange( gender, place )

It would be nice if I didn't have to save all these intermediate results, because keeping track of temp.df1 and temp.df2 gets pretty annoying if I keep changing the order of how things are calculated or add/subtract steps. This is exactly what %>% does for me.

race.results %>%
  group_by( gender ) %>%
  mutate( place = rank(time)) %>%
  arrange( gender, place )
## # A tibble: 8 x 4
## # Groups:   gender [2]
##     name  time gender place
##   <fctr> <dbl> <fctr> <dbl>
## 1 Rachel 19.82      F     1
## 2 Bonnie 23.45      F     2
## 3  April 24.22      F     3
## 4  Elise 28.83      F     4
## 5  David 15.73      M     1
## 6   Jeff 19.51      M     2
## 7  Derek 20.23      M     3
## 8    Bob 21.23      M     4

1.3 Reshaping data

library(tidyr)   # for the gather/spread commands
library(dplyr)   # for the join stuff

Most of the time, our data is in the form of a data frame and we are interested in exploring the relationships. However, most procedures in R expect the data to show up in a 'long' format where each row is an observation and each column is a covariate. In practice, the data is often not stored like that, and it comes to us with repeated observations included on a single row. This is often done as a memory-saving technique or because there is some structure in the data that makes the 'wide' format attractive. As a result, we need a way to convert data from 'wide' to 'long' and vice versa.

1.3.1 tidyr

There is a common issue with obtaining data with many columns that you wish were organized as rows. For example, I might have data in a grade book that has several homework scores and I’d like to produce a nice graph that has assignment number on the x-axis and score on the y-axis. Unfortunately this is incredibly hard to do when the data is arranged in the following way:

grade.book <- rbind(
  data.frame(name='Alison',  HW.1=8, HW.2=5, HW.3=8),
  data.frame(name='Brandon', HW.1=5, HW.2=3, HW.3=6),
  data.frame(name='Charles', HW.1=9, HW.2=7, HW.3=9))
grade.book
##      name HW.1 HW.2 HW.3
## 1  Alison    8    5    8
## 2 Brandon    5    3    6
## 3 Charles    9    7    9

What we want to do is turn this data frame from a wide data frame into a long data frame. In MS Excel this is called pivoting. Essentially I'd like to create a data frame with three columns: name, assignment, and score. That is to say that each homework datum really has three pieces of information: who it came from, which homework it was, and what the score was. It doesn't conceptually matter if I store it as 3 columns or 3 rows so long as there is a way to identify how a student scored on a particular homework. So we want to reshape the HW.1 to HW.3 columns into two columns (assignment and score).

This package was built by the same people that created dplyr and ggplot2, and there is a nice introduction at: [http://blog.rstudio.org/2014/07/22/introducing-tidyr/]

1.3.1.1 Verbs

As with the dplyr package, there are two main verbs to remember:

  1. gather - Gather multiple related columns into two columns: one containing the original column names and one containing the values. For example, the columns HW.1, HW.2, and HW.3 would be gathered into two columns, HomeworkNumber and Score. In this case, we refer to HomeworkNumber as the key column and Score as the value column, and any key:value pair tells you everything you need to know.

  2. spread - This is the opposite of gather. This takes a key column (or columns) and a results column and forms a new column for each level of the key column(s).

# first we gather the score columns into columns we'll name Assessement and Score
tidy.scores <- grade.book %>% 
  gather( key=Assessement,  # What should I call the key column
          value=Score,      # What should I call the values column
          HW.1:HW.3         # which columns to apply this to
          )
tidy.scores
##      name Assessement Score
## 1  Alison        HW.1     8
## 2 Brandon        HW.1     5
## 3 Charles        HW.1     9
## 4  Alison        HW.2     5
## 5 Brandon        HW.2     3
## 6 Charles        HW.2     7
## 7  Alison        HW.3     8
## 8 Brandon        HW.3     6
## 9 Charles        HW.3     9

To spread the key:value pairs back out into a wide data frame, we use the spread() command.

# Turn the Assessment/Score pair of columns into one column per factor level of Assessment
tidy.scores %>% spread( key=Assessement, value=Score )
##      name HW.1 HW.2 HW.3
## 1  Alison    8    5    8
## 2 Brandon    5    3    6
## 3 Charles    9    7    9

One way to keep straight which column is the key is to remember that the key is the category, while the value is the numerical value or response.

1.4 Storing Data in Multiple Tables

In many datasets it is common to store data across multiple tables, usually with the goal of minimizing memory used as well as providing minimal duplication of information so any change that must be made is only made in a single place.

To see the rationale for why we might do this, consider building a data set of blood donations by a variety of donors across several years. For each blood donation, we will perform some assay and measure certain qualities about the blood and the patient's health at the time of donation.

##   Donor Hemoglobin Systolic Diastolic
## 1 Derek       17.4      121        80
## 2  Jeff       16.9      145       101

But now we have to ask, what happens when we have a donor that has given blood multiple times? In this case we should just have multiple rows per person along with a date column to uniquely identify a particular donation.

donations
##   Donor       Date Hemoglobin Systolic Diastolic
## 1 Derek 2017-04-14       17.4      120        79
## 2 Derek 2017-06-20       16.5      121        80
## 3  Jeff 2017-08-14       16.9      145       101

I would like to include additional information about the donor that doesn't change over time. For example, we might want the donor's birthdate, sex, and blood type. However, I don't want that information repeated on every single donation line; otherwise if I mistype a birthday and have to correct it, I would have to correct it everywhere. Information about the donor should live in a donors table, while information about a particular donation should live in the donations table.

Furthermore, there are many Jeffs and Dereks in the world, and to maintain a unique identifier (without using Social Security numbers) I will just create a Donor_ID code that will uniquely identify a person. Similarly, I will create a Donation_ID that will uniquely identify a donation.

donors
##   Donor_ID F_Name L_Name B_Type      Birth       Street      City State
## 1  Donor_1  Derek    Lee     O+ 1976-09-17 7392 Willard Flagstaff    AZ
## 2  Donor_2   Jeff  Smith      A 1974-06-23     873 Vine   Bozeman    MT
donations
##   Donation_ID Donor_ID       Date Hemoglobin Systolic Diastolic
## 1  Donation_1  Donor_1 2017-04-14       17.4      120        79
## 2  Donation_2  Donor_1 2017-06-20       16.5      121        80
## 3  Donation_3  Donor_2 2017-08-14       16.9      145       101

If we have a new donor walk in and give blood, then we’ll have to create a new entry in the donors table as well as a new entry in the donations table. If an experienced donor gives again, we just have to create a new entry in the donations table.

donors
##   Donor_ID F_Name L_Name B_Type      Birth       Street      City State
## 1  Donor_1  Derek    Lee     O+ 1976-09-17 7392 Willard Flagstaff    AZ
## 2  Donor_2   Jeff  Smith      A 1974-06-23     873 Vine   Bozeman    MT
## 3  Donor_3 Aubrey    Lee     O+ 1980-12-15 7392 Willard Flagstaff    AZ
donations
##   Donation_ID Donor_ID       Date Hemoglobin Systolic Diastolic
## 1  Donation_1  Donor_1 2017-04-14       17.4      120        79
## 2  Donation_2  Donor_1 2017-06-20       16.5      121        80
## 3  Donation_3  Donor_2 2017-08-14       16.9      145       101
## 4  Donation_4  Donor_1 2017-08-26       17.6      120        79
## 5  Donation_5  Donor_3 2017-08-26       16.1      137        90

This data storage set-up might be flexible enough for us. However, what happens if somebody moves? If we don't want to keep the historical information, then we could just change the person's Street_Address, City, and State values. If we do want to keep it, then we could create a donor_addresses table that contains Start_Date and End_Date columns denoting the period of time that the address was valid.

donor_addresses
##   Donor_ID       Street      City State Start_Date   End_Date
## 1  Donor_1 346 Treeline   Pullman    WA 2015-01-26 2016-06-27
## 2  Donor_1     645 Main Flagstaff    AZ 2016-06-28 2017-07-02
## 3  Donor_1 7392 Willard Flagstaff    AZ 2017-07-03       <NA>
## 4  Donor_2     873 Vine   Bozeman    MT 2015-03-17       <NA>
## 5  Donor_3 7392 Willard Flagstaff    AZ 2017-06-01       <NA>

Given this data structure, we can now easily create new donations as well as store donor information. In the event that we need to change something about a donor, there is only one place to make that change.

However, having data spread across multiple tables is challenging because I often want that information squished back together. For example, the blood donation services might want to find all 'O' or 'O+' donors in Flagstaff along with their current mailing addresses and send them some notification about blood supplies being low. So we need some way to join the donors and donor_addresses tables together in a sensible manner.

1.4.1 Table Joins

Often we need to squish together two data frames that do not have the same number of rows. Consider the case where we have a data frame of observations of fish and a separate data frame that contains information about each lake (perhaps surface area, max depth, pH, etc.). I want to store them as two separate tables so that when I have to record a lake-level observation, I only input it in one place. This decreases the chance that I make a copy/paste error.

To illustrate the different types of table joins, we’ll consider two different tables.

# tibbles are just data.frames that print a bit nicer and don't automatically
# convert character columns into factors.  They behave a bit more consistently
# in a wide variety of situations compared to data.frames.
Fish.Data <- tibble(
  Lake_ID = c('A','A','B','B','C','C'), 
  Fish.Weight=rnorm(6, mean=260, sd=25) ) # make up some data
Lake.Data <- tibble(
  Lake_ID = c(    'B','C','D'),   
  Lake_Name = c('Lake Elaine', 'Mormon Lake', 'Lake Mary'),   
  pH=c(6.5, 6.3, 6.1),
  area = c(40, 210, 240),
  avg_depth = c(8, 10, 38))
Fish.Data
## # A tibble: 6 x 2
##   Lake_ID Fish.Weight
##     <chr>       <dbl>
## 1       A    268.4052
## 2       A    237.3458
## 3       B    227.0830
## 4       B    261.5135
## 5       C    262.9933
## 6       C    235.9487
Lake.Data
## # A tibble: 3 x 5
##   Lake_ID   Lake_Name    pH  area avg_depth
##     <chr>       <chr> <dbl> <dbl>     <dbl>
## 1       B Lake Elaine   6.5    40         8
## 2       C Mormon Lake   6.3   210        10
## 3       D   Lake Mary   6.1   240        38

Notice that each of these tables has a column labeled Lake_ID. When we join these two tables, the row that describes lake B should be duplicated for each row in Fish.Data that corresponds with fish caught from lake B.

full_join(Fish.Data, Lake.Data)
## Joining, by = "Lake_ID"
## # A tibble: 7 x 6
##   Lake_ID Fish.Weight   Lake_Name    pH  area avg_depth
##     <chr>       <dbl>       <chr> <dbl> <dbl>     <dbl>
## 1       A    268.4052        <NA>    NA    NA        NA
## 2       A    237.3458        <NA>    NA    NA        NA
## 3       B    227.0830 Lake Elaine   6.5    40         8
## 4       B    261.5135 Lake Elaine   6.5    40         8
## 5       C    262.9933 Mormon Lake   6.3   210        10
## 6       C    235.9487 Mormon Lake   6.3   210        10
## 7       D          NA   Lake Mary   6.1   240        38

Notice that because we didn’t have any fish caught in lake D and we don’t have any Lake information about lake A, when we join these two tables, we end up introducing missing observations into the resulting data frame.

The other types of joins govern the behavior of this missing data.

left_join(A, B) For each row in A, match with a row in B, but don’t create any more rows than what was already in A.

inner_join(A,B) Only match row values where both data frames have a value.

left_join(Fish.Data, Lake.Data)
## Joining, by = "Lake_ID"
## # A tibble: 6 x 6
##   Lake_ID Fish.Weight   Lake_Name    pH  area avg_depth
##     <chr>       <dbl>       <chr> <dbl> <dbl>     <dbl>
## 1       A    268.4052        <NA>    NA    NA        NA
## 2       A    237.3458        <NA>    NA    NA        NA
## 3       B    227.0830 Lake Elaine   6.5    40         8
## 4       B    261.5135 Lake Elaine   6.5    40         8
## 5       C    262.9933 Mormon Lake   6.3   210        10
## 6       C    235.9487 Mormon Lake   6.3   210        10
inner_join(Fish.Data, Lake.Data)
## Joining, by = "Lake_ID"
## # A tibble: 4 x 6
##   Lake_ID Fish.Weight   Lake_Name    pH  area avg_depth
##     <chr>       <dbl>       <chr> <dbl> <dbl>     <dbl>
## 1       B    227.0830 Lake Elaine   6.5    40         8
## 2       B    261.5135 Lake Elaine   6.5    40         8
## 3       C    262.9933 Mormon Lake   6.3   210        10
## 4       C    235.9487 Mormon Lake   6.3   210        10

The above examples assumed that the column used to join the two tables was named the same in both tables. This is good practice to try to do, but sometimes you have to work with data where that isn’t the case. In that situation you can use the by=c("ColName.A"="ColName.B") syntax where ColName.A represents the name of the column in the first data frame and ColName.B is the equivalent column in the second data frame.
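
As a sketch, suppose the lake table had instead named its identifier column LakeID; Lake.Data2 below is a hypothetical variant of Lake.Data built just for illustration:

# hypothetical: Lake.Data2 is Lake.Data with its ID column renamed to LakeID
Lake.Data2 <- Lake.Data %>% rename( LakeID = Lake_ID )
left_join( Fish.Data, Lake.Data2, by=c('Lake_ID'='LakeID') )   # same result as before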

1.5 Exercises

  1. The dataset ChickWeight tracks the weights of 48 baby chickens (chicks) fed four different diets.
    1. Load the dataset using

      data(ChickWeight)
    2. Look at the help files for the description of the columns.
    3. Remove all the observations except for the weights on day 10 and day 20.
    4. Calculate the mean and standard deviation for each diet group on days 10 and 20.
  2. The OpenIntro textbook on statistics includes a data set on body dimensions.
    1. Load the file using

      Body <- read.csv('http://www.openintro.org/stat/data/bdims.csv')
    2. The column sex is coded as a 1 if the individual is male and 0 if female. This is a non-intuitive labeling system. Create a new column sex.MF that uses labels Male and Female.
    3. The columns wgt and hgt measure weight and height in kilograms and centimeters (respectively). Use these to calculate the Body Mass Index (BMI) for each individual where \[BMI=\frac{Weight\,(kg)}{\left[Height\,(m)\right]^{2}}\]

    4. Double check that your calculated BMI column is correct by examining the summary statistics of the column. BMI values should be between 18 and 40 or so. Did you make an error in your calculation?

    5. The function cut takes a vector of continuous numerical data and creates a factor based on the cut-points you give it.

      # Define a continuous vector to convert to a factor
      x <- 1:10
      
      # divide range of x into three groups of equal length
      cut(x, breaks=3)
      
      # divide x into four groups, where I specify all 5 break points 
      cut(x, breaks = c(0, 2.5, 5.0, 7.5, 10))
      # (0,2.5] (2.5,5] means 2.5 is included in first group
      # right=FALSE changes this to make 2.5 included in the second  
      
      # divide x into 3 groups, but give them a nicer
      # set of group names
      cut(x, breaks=3, labels=c('Low','Medium','High'))

      Create a new column in the data frame that divides the age into decades (10-19, 20-29, 30-39, etc). Notice the oldest person in the study is 67.

      Body <- Body %>%
        mutate( Age.Grp = cut(age,
                              breaks=c(10,20,30,40,50,60,70),
                              right=FALSE))
    6. Find the average BMI for each sex.MF and Age.Grp combination.

  3. Suppose we are given information about the maximum daily temperature from a weather station in Flagstaff, AZ. The file is available at the GitHub site that this book is hosted on.

    FlagTemp <-  read.csv(
      'https://github.com/dereksonderegger/570L/raw/master/data-raw/FlagMaxTemp.csv',
       header=TRUE, sep=',')

    This file is in a wide format, where each row represents a month and the columns X1, X2, …, X31 represent the day of the month the observation was made.

    1. Convert the data set to a long format where the data has only four columns: Year, Month, Day, Tmax.

    2. Calculate the average monthly maximum temperature for each Month in the dataset (so there will be 365 mean maximum temperatures). You'll probably have some issues taking the mean because a number of values are missing, and by default R refuses to take means and sums when there is missing data. The argument na.rm=TRUE to mean() allows you to force R to remove the missing observations before calculating the mean.

    3. Convert the average monthly maximums back to a wide data format where each line represents a year and there are 12 columns of temperature data (one for each month) along with a column for the year. There will be a couple of months that still have missing data because the weather station was out of commission for those months and there was NO data for the entire month.

  4. A common task is to take a set of data that has multiple categorical variables and create a table of the number of cases for each combination. An introductory statistics textbook contains a dataset summarizing student surveys from several sections of an intro class. The two variables of interest for us are Gender and Year, which are the student's gender and year in college.
    1. Download the dataset and correctly order the Year variable using the following:

      Survey <- read.csv('http://www.lock5stat.com/datasets/StudentSurvey.csv', na.strings=c('',' ')) %>%
         mutate(Year = factor(Year, levels=c('FirstYear','Sophomore','Junior','Senior')))
    2. Using some combination of dplyr functions, produce a data set with eight rows that contains the number of responses for each gender:year combination. Notice there are two females that neglected to give their Year, and you should remove them first. The function is.na(Year) will return logical values indicating whether the Year value is missing, and you can flip those values using the negation operator !. So you might consider using !is.na(Year) as the argument to a filter() command. Alternatively, you could sort on Year and remove the first two rows using slice(-2:-1). Next you'll want to summarize each Year/Gender group using the n() function, which gives the number of rows in a data set.
    3. Using tidyr commands, produce a table of the number of responses in the following form:

      Gender First Year Sophomore Junior Senior
      Female
      Male
  5. The package nycflights13 contains information about all the flights that arrived in or left from New York City in 2013. This package contains five data tables, but there are three data tables we will work with. The data table flights gives information about a particular flight, airports gives information about a particular airport, and airlines gives information about each airline. Create a table of all the flights on February 14th by Virgin America that has columns for the carrier, destination, departure time, and flight duration. Join this table with the airports information for the destination. Notice that because the column for the destination airport code doesn't match up between flights and airports, you'll have to use the by=c("TableA.Col"="TableB.Col") argument where you insert the correct names for TableA.Col and TableB.Col.