Chapter 13 String Manipulation

Strings make up a very important class of data. Data being read into R often come in the form of character strings where different parts might mean different things. For example a sample ID of “R1_P2_C1_2012_05_28” might represent data from Region 1, Park 2, Camera 1, taken on May 28, 2012. It is important that we have a set of utilities that allow us to split and combine character strings in a easy and consistent fashion.

Unfortunately, the utilities included in the base version of R are somewhat inconsistent and were not designed to work nicely together. Hadley Wickham, the developer of ggplot2 and dplyr has this to say:

“R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R.” – Hadley Wickham

For this chapter we will introduce the most commonly used functions from the base version of R that you might use or see in other people’s code. Second, we introduce Dr Wickham’s stringr package that provides many useful functions that operate in a consistent manner.

13.1 Base function

1.1.1 paste()

The most basic thing we will want to do is to combine two strings or to combine a string with a numerical value. The paste() command will take one or more R objects and converts them to character strings and then pastes them together to form one or more character strings. It has the form:

paste( ..., sep = ' ', collapse = NULL )

The ... piece means that we can pass any number of objects to be pasted together. The sep argument gives the string that separates the strings to be joined and the collapse argument that specifies if a simplification should be performed after being pasting together.

Suppose we want to combine the strings “Peanut butter” and “Jelly” then we could execute:

paste( "PeanutButter", "Jelly" )
## [1] "PeanutButter Jelly"

Notice that without specifying the separator character, R chose to put a space between the two strings. We could specify whatever we wanted:

paste( "Hello", "World", sep='_' )
## [1] "Hello_World"

Also we can combine strings with numerical values

paste( "Pi is equal to", pi )
## [1] "Pi is equal to 3.14159265358979"

We can combine vectors of similar or different lengths as well. By default R assumes that you want to produce a vector of character strings as output.

paste( "n =", c(5,25,100) )
## [1] "n = 5"   "n = 25"  "n = 100"
first.names <- c('Robb','Stannis','Daenerys')
last.names <- c('Stark','Baratheon','Targaryen')
paste( first.names, last.names)
## [1] "Robb Stark"         "Stannis Baratheon"  "Daenerys Targaryen"

If we want paste() produce just a single string of output, use the collapse= argument to paste together each of the output vectors (separated by the collapse character).

paste( "n =", c(5,25,100) )  # Produces 3 strings
## [1] "n = 5"   "n = 25"  "n = 100"
paste( "n =", c(5,25,100), collapse=':' ) # collapses output into one string
## [1] "n = 5:n = 25:n = 100"
paste(first.names, last.names, sep='.', collapse=' : ')
## [1] "Robb.Stark : Stannis.Baratheon : Daenerys.Targaryen"

Notice we could use the paste() command with the collapse option to combine a vector of character strings together.

paste(first.names, collapse=':')
## [1] "Robb:Stannis:Daenerys"

13.2 Package stringr: basic operations

The goal of stringr is to make a consistent user interface to a suite of functions to manipulate strings. “(stringr) is a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA’s and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions.” - Hadley Wickham

We’ll investigate the most commonly used function but there are many we will ignore.

Function Description

str_c()

string concatenation, similar to paste

str_length()

number of characters in the string

str_sub()

extract a substring

str_trim()

remove leading and trailing whitespace

str_pad()

pad a string with empty space to make it a certain length

13.2.1 Concatenating with str_c() or str_join()

The first thing we do is to concatenate two strings or two vectors of strings similarly to the paste() command. The str_c() and str_join() functions are a synonym for the exact same function, but str_join() might be a more natural verb to use and remember. The syntax is:

str_c( ..., sep='', collapse=NULL)

You can think of the inputs building a matrix of strings, with each input creating a column of the matrix. For each row, str_c() first joins all the columns (using the separator character given in sep) into a single column of strings. If the collapse argument is non-NULL, the function takes the vector and joins each element together using collapse as the separator character.

# load the stringr library
library(stringr)

# envisioning the matrix of strings
cbind(first.names, last.names)
##      first.names last.names 
## [1,] "Robb"      "Stark"    
## [2,] "Stannis"   "Baratheon"
## [3,] "Daenerys"  "Targaryen"
# join the columns together
full.names <- str_c( first.names, last.names, sep='.')
cbind( first.names, last.names, full.names)
##      first.names last.names  full.names          
## [1,] "Robb"      "Stark"     "Robb.Stark"        
## [2,] "Stannis"   "Baratheon" "Stannis.Baratheon" 
## [3,] "Daenerys"  "Targaryen" "Daenerys.Targaryen"
# Join each of the rows together separated by collapse
str_c( first.names, last.names, sep='.', collapse=' : ')
## [1] "Robb.Stark : Stannis.Baratheon : Daenerys.Targaryen"

13.2.2 Calculating string length with str_length()

The str_length() function calculates the length of each string in the vector of strings passed to it.

text <- c('WordTesting', 'With a space', NA, 'Night')
str_length( text )
## [1] 11 12 NA  5

Notice that str_length() correctly interprets the missing data as missing and that the length ought to also be missing.

13.2.3 Extracting substrings with str_sub()

If we know we want to extract the \(3^{rd}\) through \(6^{th}\) letters in a string, this function will grab them.

str_sub(text, start=3, end=6)
## [1] "rdTe" "th a" NA     "ght"

If a given string isn’t long enough to contain all the necessary indices, str_sub() returns only the letters that where there (as in the above case for “Night”

13.2.4 Pad a string with str_pad()

Sometimes we to make every string in a vector the same length to facilitate display or in the creation of a uniform system of assigning ID numbers. The str_pad() function will add spaces at either the beginning or end of the of every string appropriately.

str_pad(first.names, width=8)
## [1] "    Robb" " Stannis" "Daenerys"
str_pad(first.names, width=8, side='right', pad='*')
## [1] "Robb****" "Stannis*" "Daenerys"

13.2.5 Trim a string with str_trim()

This removes any leading or trailing whitespace where whitespace is defined as spaces ‘’, tabs \t or returns \n.

text <- ' Some text. \n  '
print(text)
## [1] " Some text. \n  "
str_trim(text)
## [1] "Some text."

13.3 Package stringr: Pattern Matching

The previous commands are all quite useful but the most powerful string operation is take a string and match some pattern within it. The following commands are available within stringr.

Function Description

str_detect()

Detect if a pattern occurs in input string

str_locate() str_locate_all()

Locates the first (or all) positions of a pattern.

str_extract() str_extract_all()

Extracts the first (or all) substrings corresponding to a pattern

str_replace() str_replace_all()

Replaces the matched substring(s) with a new pattern

str_split() str_split_fixed()

Splits the input string based on the inputed pattern

We will first examine these functions using a very simple pattern matching algorithm where we are matching a specific pattern. For most people, this is as complex as we need.

Suppose that we have a vector of strings that contain a date in the form “2012-May-27” and we want to manipulate them to extract certain information.

test.vector <- c('2008-Feb-10', '2010-Sept-18', '2013-Jan-11', '2016-Jan-2')

13.3.1 Detecting a pattern using str_detect()

Suppose we want to know which dates are in September. We want to detect if the pattern “Sept” occurs in the strings. It is important that I used fixed(“Sept”) in this code to “turn off” the complicated regular expression matching rules and just look for exactly what I specified.

str_detect( test.vector, pattern=fixed('Sept') )
## [1] FALSE  TRUE FALSE FALSE

Here we see that the second string in the test vector included the substring “Sept” but none of the others did.

13.3.2 Locating a pattern using str_locate()

To figure out where the “-” characters are, we can use the str_locate() function.

str_locate(test.vector, pattern=fixed('-') )
##      start end
## [1,]     5   5
## [2,]     5   5
## [3,]     5   5
## [4,]     5   5

which shows that the first dash occurs as the \(5^{th}\) character in each string. If we wanted all the dashes in the string the following works.

str_locate_all(test.vector, pattern=fixed('-') )
## [[1]]
##      start end
## [1,]     5   5
## [2,]     9   9
## 
## [[2]]
##      start end
## [1,]     5   5
## [2,]    10  10
## 
## [[3]]
##      start end
## [1,]     5   5
## [2,]     9   9
## 
## [[4]]
##      start end
## [1,]     5   5
## [2,]     9   9

The output of str_locate_all() is a list of matrices that gives the start and end of each matrix. Using this information, we could grab the Year/Month/Day information out of each of the dates. We won’t do that here because it will be easier to do this using str_split().

13.3.3 Replacing substrings using str_replace()

Suppose we didn’t like using “-” to separate the Year/Month/Day but preferred a space, or an underscore, or something else. This can be done by replacing all of the “-” with the desired character. The str_replace() function only replaces the first match, but str_replace_all() replaces all matches.

str_replace(test.vector, pattern=fixed('-'), replacement=fixed(':') )
## [1] "2008:Feb-10"  "2010:Sept-18" "2013:Jan-11"  "2016:Jan-2"
str_replace_all(test.vector, pattern=fixed('-'), replacement=fixed(':') )
## [1] "2008:Feb:10"  "2010:Sept:18" "2013:Jan:11"  "2016:Jan:2"

13.3.4 Splitting into substrings using str_split()

We can split each of the dates into three smaller substrings using the str_split() command, which returns a list where each element of the list is a vector containing pieces of the original string (excluding the pattern we matched on).

If we know that all the strings will be split into a known number of substrings (we have to specify how many substrings to match with the n= argument), we can use str_split_fixed() to get a matrix of substrings instead of list of substrings. It is somewhat unfortunate that the _fixed modifier to the function name is the same as what we use to specify to use simple pattern matching.

str_split_fixed(test.vector, pattern=fixed('-'), n=3)
##      [,1]   [,2]   [,3]
## [1,] "2008" "Feb"  "10"
## [2,] "2010" "Sept" "18"
## [3,] "2013" "Jan"  "11"
## [4,] "2016" "Jan"  "2"

13.4 Regular Expressions

The next section will introduce using regular expressions. Regular expressions are a way to specify very complicated patterns. Go look at https://xkcd.com/208/ to gain insight into just how geeky regular expressions are.

Regular expressions are a way of precisely writing out patterns that are very complicated. The stringr package pattern arguments can be given using standard regular expressions (not perl-style!) instead of using fixed strings.

Regular expressions are extremely powerful for sifting through large amounts of text. For example, we might want to extract all of the 4 digit substrings (the years) out of our dates vector, or I might want to find all cases in a paragraph of text of words that begin with a capital letter and are at least 5 letters long. In another, somewhat nefarious example, spammers might have downloaded a bunch of text from webpages and want to be able to look for email addresses. So as a first pass, they want to match a pattern: \[\underset{\textrm{1 or more letters}}{\underbrace{\texttt{Username}}}\texttt{@}\;\;\underset{\textrm{1 or more letter}}{\underbrace{\texttt{OrganizationName}}}\;\texttt{.\;}\begin{cases} \texttt{com}\\ \texttt{org}\\ \texttt{edu} \end{cases}\] where the Username and OrganizationName can be pretty much anything, but a valid email address looks like this. We might get even more creative and recognize that my list of possible endings could include country codes as well.

For most people, I don’t recommend opening the regular expression can-of-worms, but it is good to know that these pattern matching utilities are available within R and you don’t need to export your pattern matching problems to Perl or Python.

13.5 Exercises

  1. The following file names were used in a camera trap study. The S number represents the site, P is the plot within a site, C is the camera number within the plot, the first string of numbers is the YearMonthDay and the second string of numbers is the HourMinuteSecond.
file.names <- c( 'S123.P2.C10_20120621_213422.jpg',
                 'S10.P1.C1_20120622_050148.jpg',
                 'S187.P2.C2_20120702_023501.jpg')

Use a combination of str_sub() and str_split() to produce a data frame with columns corresponding to the site, plot, camera, year, month, day, hour, minute, and second for these three file names. So we want to produce code that will create the data frame:

 Site Plot Camera Year Month Day Hour Minute Second
 S123   P2    C10 2012    06  21   21     34     22
  S10   P1     C1 2012    06  22   05     01     48
 S187   P2     C2 2012    07  02   02     35     01

Hint: Convert all the dashes to periods and then split on the dots. After that you’ll have to further tear apart the date and time columns using str_sub().