Chapter 8 Data Reshaping
# library(tidyr) # for the gather/spread commands
# library(dplyr) # for the join stuff
library(tidyverse) # dplyr, tidyr, ggplot2, etc.
Most of the time, our data is in the form of a data frame and we are interested in exploring the relationships. However most procedures in R expect the data to show up in a ‘long’ format where each row is an observation and each column is a covariate. In practice, the data is often not stored like that and the data comes to us with repeated observations included on a single row. This is often done as a memory saving technique or because there is some structure in the data that makes the ‘wide’ format attractive. As a result, we need a way to convert data from ‘wide’ to ‘long’ and vice-versa.
Next we need a way to squish two data frames together. It is often advantagous to store data that would be be repeated seperately in a different table so that a particular piece of information lives in only one location. This makes the data easier to modify, and more likely to maintain consistence. However, this practice requires that, when necessary, we can add information to a table, that might involve a lot of duplicated rows.
8.1 tidyr
There is a common issue with obtaining data with many columns that you wish were organized as rows. For example, I might have data in a grade book that has several homework scores and I’d like to produce a nice graph that has assignment number on the x-axis and score on the y-axis. Unfortunately this is incredibly hard to do when the data is arranged in the following way:
grade.book <- rbind(
data.frame(name='Alison', HW.1=8, HW.2=5, HW.3=8, HW.4=4),
data.frame(name='Brandon', HW.1=5, HW.2=3, HW.3=6, HW.4=9),
data.frame(name='Charles', HW.1=9, HW.2=7, HW.3=9, HW.4=10))
grade.book
## name HW.1 HW.2 HW.3 HW.4
## 1 Alison 8 5 8 4
## 2 Brandon 5 3 6 9
## 3 Charles 9 7 9 10
What we want to do is turn this data frame from a wide data frame into a long data frame. In MS Excel this is called pivoting. Essentially I’d like to create a data frame with three columns: name
, assignment
, and score
. That is to say that each homework datum really has three pieces of information: who it came from, which homework it was, and what the score was. It doesn’t conceptually matter if I store it as 3 rows of 4 columns or 12 rows so long as there is a way to identify how a student scored on a particular homework. So we want to reshape the HW1 to HW4 columns into two columns (assignment and score).
This package was built by the same people that created dplyr and ggplot2 and there is a nice introduction at: [http://blog.rstudio.org/2014/07/22/introducing-tidyr/]
8.1.1 Verbs
As with the dplyr package, there are two main verbs to remember:
gather
- Gather multiple columns that are related into two columns that contain the original column name and the value. For example for columns HW1, HW2, HW3 we would gather them into two column HomeworkNumber and Score. In this case, we refer to HomeworkNumber as the key column and Score as the value column. So for any key:value pair you know everything you need.spread
- This is the opposite of gather. This takes a key column (or columns) and a results column and forms a new column for each level of the key column(s).
# first we gather the score columns into columns we'll name Assesment and Score
tidy.scores <- grade.book %>%
gather( key=Homework, # What should I call the key column
value=Score, # What should I call the values column
HW.1:HW.4 # which columns to apply this to
)
tidy.scores
## name Homework Score
## 1 Alison HW.1 8
## 2 Brandon HW.1 5
## 3 Charles HW.1 9
## 4 Alison HW.2 5
## 5 Brandon HW.2 3
## 6 Charles HW.2 7
## 7 Alison HW.3 8
## 8 Brandon HW.3 6
## 9 Charles HW.3 9
## 10 Alison HW.4 4
## 11 Brandon HW.4 9
## 12 Charles HW.4 10
To spread the key:value pairs out into a matrix, we use the spread()
command.
# Turn the Assessment/Score pair of columns into one column per factor level of Assessment
tidy.scores %>% spread( key=Homework, value=Score )
## name HW.1 HW.2 HW.3 HW.4
## 1 Alison 8 5 8 4
## 2 Brandon 5 3 6 9
## 3 Charles 9 7 9 10
One way to keep straight which is the key
column is that the key is the category, while value
is the numerical value or response.
8.2 Storing Data in Multiple Tables
In many datasets it is common to store data across multiple tables, usually with the goal of minimizing memory used as well as providing minimal duplication of information so any change that must be made is only made in a single place.
To see the rational why we might do this, consider building a data set of blood donations by a variety of donors across several years. For each blood donation, we will perform some assay and measure certain qualities about the blood and the patients health at the donation.
## Donor Hemoglobin Systolic Diastolic
## 1 Derek 17.4 121 80
## 2 Jeff 16.9 145 101
But now we have to ask, what happens when we have a donor that has given blood multiple times? In this case we should just have multiple rows per person along with a date column to uniquely identify a particular donation.
donations
## Donor Date Hemoglobin Systolic Diastolic
## 1 Derek 2017-04-14 17.4 120 79
## 2 Derek 2017-06-20 16.5 121 80
## 3 Jeff 2017-08-14 16.9 145 101
I would like to include additional information about the donor where that infomation doesn’t change overtime. For example we might want to have information about the donar’s birthdate, sex, blood type. However, I don’t want that information in every single donation line. Otherwise if I mistype a birthday and have to correct it, I would have to correct it everywhere. For information about the donor, should live in a donors
table, while information about a particular donation should live in the donations
table.
Furthermore, there are many Jeffs and Dereks in the world and to maintain a unique identifier (without using Social Security numbers) I will just create a Donor_ID
code that will uniquely identify a person. Similarly I will create a Donation_ID
that will uniquely identify a dontation.
donors
## Donor_ID F_Name L_Name B_Type Birth Street City State
## 1 Donor_1 Derek Lee O+ 1976-09-17 7392 Willard Flagstaff AZ
## 2 Donor_2 Jeff Smith A 1974-06-23 873 Vine Bozeman MT
donations
## Donation_ID Donor_ID Date Hemoglobin Systolic Diastolic
## 1 Donation_1 Donor_1 2017-04-14 17.4 120 79
## 2 Donation_2 Donor_1 2017-06-20 16.5 121 80
## 3 Donation_3 Donor_2 2017-08-14 16.9 145 101
If we have a new donor walk in and give blood, then we’ll have to create a new entry in the donors
table as well as a new entry in the donations
table. If an experienced donor gives again, we just have to create a new entry in the donations table.
donors
## Donor_ID F_Name L_Name B_Type Birth Street City State
## 1 Donor_1 Derek Lee O+ 1976-09-17 7392 Willard Flagstaff AZ
## 2 Donor_2 Jeff Smith A 1974-06-23 873 Vine Bozeman MT
## 3 Donor_3 Aubrey Lee O+ 1980-12-15 7392 Willard Flagstaff AZ
donations
## Donation_ID Donor_ID Date Hemoglobin Systolic Diastolic
## 1 Donation_1 Donor_1 2017-04-14 17.4 120 79
## 2 Donation_2 Donor_1 2017-06-20 16.5 121 80
## 3 Donation_3 Donor_2 2017-08-14 16.9 145 101
## 4 Donation_4 Donor_1 2017-08-26 17.6 120 79
## 5 Donation_5 Donor_3 2017-08-26 16.1 137 90
This data storage set-up might be flexible enough for us. However what happens if somebody moves? If we don’t want to keep the historical information, then we could just change the person’s Street_Address
, City
, and State
values. If we do want to keep that, then we could create donor_addresses
table that contains a Start_Date
and End_Date
that denotes the period of time that the address was valid.
donor_addresses
## Donor_ID Street City State Start_Date End_Date
## 1 Donor_1 346 Treeline Pullman WA 2015-01-26 2016-06-27
## 2 Donor_1 645 Main Flagstsff AZ 2016-06-28 2017-07-02
## 3 Donor_1 7392 Willard Flagstaff AZ 2017-07-03 <NA>
## 4 Donor_2 873 Vine Bozeman MT 2015-03-17 <NA>
## 5 Donor_3 7392 Willard Flagstaff AZ 2017-06-01 <NA>
Given this data structure, we can now easily create new donations as well as store donor information. In the event that we need to change something about a donor, there is only one place to make that change.
However, having data spread across multiple tables is challenging because I often want that information squished back together. For example, the blood donations services might want to find all ‘O’ or ‘O+’ donors in Flagstaff and their current mailing address and send them some notification about blood supplies being low. So we need someway to join the donors
and donor_addresses
tables together in a sensible manner.
8.3 Table Joins
Often we need to squish together two data frames but they do not have the same number of rows. Consider the case where we have a data frame of observations of fish and a separate data frame that contains information about lake (perhaps surface area, max depth, pH, etc). I want to store them as two separate tables so that when I have to record a lake level observation, I only input it one place. This decreases the chance that I make a copy/paste error.
To illustrate the different types of table joins, we’ll consider two different tables.
# tibbles are just data.frames that print a bit nicer and don't automatically
# convert character columns into factors. They behave a bit more consistently
# in a wide variety of situations compared to data.frames.
Fish.Data <- tibble(
Lake_ID = c('A','A','B','B','C','C'),
Fish.Weight=rnorm(6, mean=260, sd=25) ) # make up some data
Lake.Data <- tibble(
Lake_ID = c( 'B','C','D'),
Lake_Name = c('Lake Elaine', 'Mormon Lake', 'Lake Mary'),
pH=c(6.5, 6.3, 6.1),
area = c(40, 210, 240),
avg_depth = c(8, 10, 38))
Fish.Data
## # A tibble: 6 x 2
## Lake_ID Fish.Weight
## <chr> <dbl>
## 1 A 263.
## 2 A 276.
## 3 B 260.
## 4 B 273.
## 5 C 252.
## 6 C 216.
Lake.Data
## # A tibble: 3 x 5
## Lake_ID Lake_Name pH area avg_depth
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 B Lake Elaine 6.50 40. 8.
## 2 C Mormon Lake 6.30 210. 10.
## 3 D Lake Mary 6.10 240. 38.
Notice that each of these tables has a column labled Lake_ID
. When we join these two tables, the row that describes lake A
should be duplicated for each row in the Fish.Data
that corresponds with fish caught from lake A
.
full_join(Fish.Data, Lake.Data)
## Joining, by = "Lake_ID"
## # A tibble: 7 x 6
## Lake_ID Fish.Weight Lake_Name pH area avg_depth
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 A 263. <NA> NA NA NA
## 2 A 276. <NA> NA NA NA
## 3 B 260. Lake Elaine 6.50 40. 8.
## 4 B 273. Lake Elaine 6.50 40. 8.
## 5 C 252. Mormon Lake 6.30 210. 10.
## 6 C 216. Mormon Lake 6.30 210. 10.
## 7 D NA Lake Mary 6.10 240. 38.
Notice that because we didn’t have any fish caught in lake D
and we don’t have any Lake information about lake A
, when we join these two tables, we end up introducing missing observations into the resulting data frame.
The other types of joins govern the behavor or these missing data.
left_join(A, B)
For each row in A, match with a row in B, but don’t create any more rows than what was already in A.
inner_join(A,B)
Only match row values where both data frames have a value.
left_join(Fish.Data, Lake.Data)
## Joining, by = "Lake_ID"
## # A tibble: 6 x 6
## Lake_ID Fish.Weight Lake_Name pH area avg_depth
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 A 263. <NA> NA NA NA
## 2 A 276. <NA> NA NA NA
## 3 B 260. Lake Elaine 6.50 40. 8.
## 4 B 273. Lake Elaine 6.50 40. 8.
## 5 C 252. Mormon Lake 6.30 210. 10.
## 6 C 216. Mormon Lake 6.30 210. 10.
inner_join(Fish.Data, Lake.Data)
## Joining, by = "Lake_ID"
## # A tibble: 4 x 6
## Lake_ID Fish.Weight Lake_Name pH area avg_depth
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 B 260. Lake Elaine 6.50 40. 8.
## 2 B 273. Lake Elaine 6.50 40. 8.
## 3 C 252. Mormon Lake 6.30 210. 10.
## 4 C 216. Mormon Lake 6.30 210. 10.
The above examples assumed that the column used to join the two tables was named the same in both tables. This is good practice to try to do, but sometimes you have to work with data where that isn’t the case. In that situation you can use the by=c("ColName.A"="ColName.B")
syntax where ColName.A
represents the name of the column in the first data frame and ColName.B
is the equivalent column in the second data frame.
Finally, the combination of gather
and join
allows me to do some very complex calculations across many columns of a data set. For example, I might gather up a set of columns, calculate some summary statistics, and then join the result back to original data set.
grade.book %>%
group_by(name) %>%
gather( key=Homework, value=Score, HW.1:HW.4 ) %>%
summarise( HW.avg = mean(Score) ) %>%
left_join( grade.book, . )
## Joining, by = "name"
## name HW.1 HW.2 HW.3 HW.4 HW.avg
## 1 Alison 8 5 8 4 6.25
## 2 Brandon 5 3 6 9 5.75
## 3 Charles 9 7 9 10 8.75
8.4 Exercises
Suppose we are given information about the maximum daily temperature from a weather station in Flagstaff, AZ. The file is available at the GitHub site that this book is hosted on.
FlagTemp <- read.csv( 'https://github.com/dereksonderegger/570L/raw/master/data-raw/FlagMaxTemp.csv', header=TRUE, sep=',')
This file is in a wide format, where each row represents a month and the columns X1, X2, …, X31 represent the day of the month the observation was made.
Convert data set to the long format where the data has only four columns:
Year
,Month
,Day
,Tmax
.Calculate the average monthly maximum temperature for each Month in the dataset (So there will be 365 mean maximum temperatures). You’ll probably have some issues taking the mean because there are a number of values that are missing and by default R refuses to take means and sums when there is missing data. The argument
na.rm=TRUE
tomean()
allows you to force R to remove the missing observations before calculating the mean.Convert the average month maximums back to a wide data format where each line represents a year and there are 12 columns of temperature data (one for each month) along with a column for the year. There will be a couple of months that still have missing data because the weather station was out of commision for those months and there was NO data for the entire month.
- A common task is to take a set of data that has multiple categorical variables and create a table of the number of cases for each combination. An introductory statistics textbook contains a dataset summarizing student surveys from several sections of an intro class. The two variables of interest for us are
Gender
andYear
which are the students gender and year in college.Download the dataset and correctly order the
Year
variable using the following:Survey <- read.csv('http://www.lock5stat.com/datasets/StudentSurvey.csv', na.strings=c('',' ')) %>% mutate(Year = factor(Year, levels=c('FirstYear','Sophomore','Junior','Senior')))
- Using some combination of
dplyr
functions, produce a data set with eight rows that contains the number of responses for each gender:year combination. Notice there are two females that neglected to give their Year and you should remove them first. The functionis.na(Year)
will return logical values indicating if the Year value was missing and you can flip those values using the negation operator!
. So you might consider using!is.na(Year)
as the argument to afilter()
command. Alternatively you could sort onYear
and remove the first two rows usingslice(-2:-1)
. Next you’ll want to summarize each Year/Gender group using then()
function which gives the number of rows in a data set. Using
tidyr
commands, produce a table of the number of responses in the following form:Gender First Year Sophmore Junior Senior Female Male
The package
nycflights13
contains information about all the flights that arrived in or left from New York City in 2013. This package contains five data tables, but there are three data tables we will work with. The data tableflights
gives information about a particular flight,airports
gives information about a particular airport, andairlines
gives information about each airline. Create a table of all the flights on February 14th by Virgin America that has columns for the carrier, destination, departure time, and flight duration. Join this table with the airports information for the destination. Notice that because the column for the destination airport code doesn’t match up betweenflights
andairports
, you’ll have to use theby=c("TableA.Col"="TableB.Col")
argument where you insert the correct names forTableA.Col
andTableB.Col
.