Data types

Session 1

1 / 40

Conf - Data types

2 / 40

Go ahead and open the milestone labeled "Conf - Data types" in the campsite. We'll be using this for a few exercises as we go along.

Data types

Strings, factors and dates

3 / 40

Strings

4 / 40

Meet stringr

stringr hex sticker

https://stringr.tidyverse.org/

5 / 40

The stringr package is your tidyverse companion to all things strings.

stringr provides a cohesive set of functions that make working with character data in R as easy as possible. Most any task you can think of that involves character data can be accomplished with stringr.

It's part of the core tidyverse, along with packages like dplyr and ggplot2, so stringr functions play really nicely with dplyr functions like filter() and mutate().

Let's look at a concrete example.

Breed Traits data set

breed_traits

## # A tibble: 195 × 15
##   breed          affection shedding drooling openness playfulness protectiveness
##   <chr>              <dbl>    <dbl>    <dbl>    <dbl>       <dbl>          <dbl>
## 1 Retrievers (L…         5        4        2        5           5              3
## 2 French Bulldo…         5        3        3        5           5              3
## 3 German Shephe…         5        4        2        3           4              5
## 4 Retrievers (G…         5        4        2        5           4              3
## 5 Bulldogs               4        3        3        4           4              3
## 6 Poodles                5        1        1        5           5              5
## # ℹ 189 more rows
## # ℹ 8 more variables: adaptability <dbl>, trainability <dbl>, energy <dbl>,
## #   barking <dbl>, stimulation_needs <dbl>, good_w_children <dbl>,
## #   good_w_other_dogs <dbl>, grooming_freq <dbl>

6 / 40

The breed_traits dataset contains information on 195 dog breeds, with scores (on a 1-5 scale) for 14 traits (e.g. how affectionate the breed is, how much it sheds, how playful it is, etc). This data comes courtesy of the American Kennel Club.

In our analysis, we want to compare traits across terrier breeds only, of which there are many types.

Cartoon of 18 dog breeds

7 / 40

To make this very clear, we have 195 dog breeds (18 very good boys and girls are pictured here as an example)...

Cartoon of 18 dog breeds with only four terrier breeds highlighted

8 / 40

...and we want to subset the data so that we can continue our analysis on terrier breeds. Note that I don't know how many of the 195 breeds in the dataset have "terrier" as part of their name, but I want to keep them all.

(The four highlighted breeds, from top to bottom, left to right, are Scottish, Bull, Boston, and Russell terriers.)

Sniffing out terrier breeds

breed_traits %>% 
  filter(breed == "Yorkshire Terriers")

## # A tibble: 1 × 15
##   breed          affection shedding drooling openness playfulness protectiveness
##   <chr>              <dbl>    <dbl>    <dbl>    <dbl>       <dbl>          <dbl>
## 1 Yorkshire Ter…         5        1        1        5           4              5
## # ℹ 8 more variables: adaptability <dbl>, trainability <dbl>, energy <dbl>,
## #   barking <dbl>, stimulation_needs <dbl>, good_w_children <dbl>,
## #   good_w_other_dogs <dbl>, grooming_freq <dbl>

9 / 40

When I say "subset", alarm bells are probably going off in your head that we'll be using the filter() function.

Using what we already know how to do, we can print the breed_traits table and scan through the paginated results in RMarkdown to find our first match: Yorkshire terriers.

We'll use the == operator to match the string, and get one row in the output.

Sniffing out terrier breeds

breed_traits %>% 
  filter(breed %in% c("Yorkshire Terriers", "Boston Terriers"))

## # A tibble: 2 × 15
##   breed          affection shedding drooling openness playfulness protectiveness
##   <chr>              <dbl>    <dbl>    <dbl>    <dbl>       <dbl>          <dbl>
## 1 Yorkshire Ter…         5        1        1        5           4              5
## 2 Boston Terrie…         5        2        1        5           5              3
## # ℹ 8 more variables: adaptability <dbl>, trainability <dbl>, energy <dbl>,
## #   barking <dbl>, stimulation_needs <dbl>, good_w_children <dbl>,
## #   good_w_other_dogs <dbl>, grooming_freq <dbl>

10 / 40

And then our second match — Boston terriers.

This time, we'll use the %in% operator to match a vector of strings, and get two rows in the output.

You can see where this is going...

If you think about extending this process to all 200 or so rows, you'll realize that filtering with explicit strings isn't really a scalable solution. Even in this relatively small and tidy dataset, we can see that it becomes tedious and error-prone very quickly.

Sniffing out terrier breeds ... with pattern matching!

11 / 40

And you'd be right to intuit that there's a simpler way. All we, the humans, are doing is looking for the pattern "Terrier" in the breed column. This is exactly the kind of simple but highly repetitive task that's well-suited to outsource to our computers.

That's where stringr comes in.

Filtering with `str_detect()`

breed_traits %>% 
  filter(str_detect(breed, pattern = "Terrier"))

## # A tibble: 36 × 15
##    breed         affection shedding drooling openness playfulness protectiveness
##    <chr>             <dbl>    <dbl>    <dbl>    <dbl>       <dbl>          <dbl>
##  1 Yorkshire Te…         5        1        1        5           4              5
##  2 Boston Terri…         5        2        1        5           5              3
##  3 West Highlan…         5        3        1        4           5              5
##  4 Scottish Ter…         5        2        2        3           4              5
##  5 Soft Coated …         5        1        2        3           3              3
##  6 Airedale Ter…         3        1        1        3           3              5
##  7 Bull Terriers         4        3        1        4           4              3
##  8 Russell Terr…         5        3        1        5           5              4
##  9 Cairn Terrie…         4        2        1        3           4              4
## 10 Staffordshir…         5        2        3        4           4              5
## # ℹ 26 more rows
## # ℹ 8 more variables: adaptability <dbl>, trainability <dbl>, energy <dbl>,
## #   barking <dbl>, stimulation_needs <dbl>, good_w_children <dbl>,
## #   good_w_other_dogs <dbl>, grooming_freq <dbl>

12 / 40

The str_detect() function searches for the presence of a pattern in a string and returns a logical vector that's TRUE if the pattern is detected, or FALSE if it's not. That makes it a very powerful function in combination with filter().

In the example code, we keep only the rows where the sequence "Terrier" is found in the breed column, and drop the rest.
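To see what filter() receives, here is a minimal sketch of str_detect() on its own. The breeds vector is a small made-up example for illustration, not the real breed_traits data.

```r
library(stringr)

# A small made-up vector of breed names (an assumption for illustration)
breeds <- c("Yorkshire Terriers", "Poodles", "Boston Terriers", "Bulldogs")

# str_detect() returns one TRUE/FALSE per element of the input
is_terrier <- str_detect(breeds, pattern = "Terrier")
is_terrier

# The logical vector can be used to subset directly;
# filter() applies the same idea to the rows of a data frame
breeds[is_terrier]
```

Note that matching is case-sensitive by default, so the pattern "terrier" would not match "Yorkshire Terriers".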

stringr functions

Character manipulation

str_sub("Introduction to the tidyverse", 21, 24)

## [1] "tidy"

13 / 40

In addition to pattern matching, you can use stringr to manipulate strings in a variety of ways. I'll show just a couple examples.

We can extract substrings from a vector using str_sub(), in this case by extracting the 21st through 24th characters which form the word "tidy".
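As a small extension of the slide example, str_sub() also accepts negative positions, which count backwards from the end of the string. The variable names below are only for illustration.

```r
library(stringr)

title <- "Introduction to the tidyverse"

# Positive positions count from the start of the string
first_word <- str_sub(title, 1, 12)

# Negative positions count from the end: -1 is the last character
last_word <- str_sub(title, -9, -1)

first_word
last_word
```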

stringr functions

Whitespace tools

str_trim("   Introduction to the tidyverse          ")

## [1] "Introduction to the tidyverse"

14 / 40

We can trim whitespace from a string using str_trim(), which can be a quick and easy data cleaning step.

These are just a couple examples of the many ways you can use stringr to manipulate strings.
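If you want a bit more control over whitespace, the sketch below shows two related options: the side argument of str_trim(), and str_squish(), which also collapses repeated spaces inside the string. The messy string is made up for illustration.

```r
library(stringr)

messy <- "   Introduction   to the   tidyverse   "

# Trim only the left side; trailing whitespace is kept
left_trimmed <- str_trim(messy, side = "left")

# Trim both ends and reduce internal runs of whitespace to a single space
squished <- str_squish(messy)

left_trimmed
squished
```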

Your Turn 1

Use str_subset() to subset the elements of the fruit vector that are made up of two or more words.

# preview `fruit`, which is loaded along with stringr
library(stringr)
fruit

##  [1] "apple"             "apricot"           "avocado"          
##  [4] "banana"            "bell pepper"       "bilberry"         
##  [7] "blackberry"        "blackcurrant"      "blood orange"     
## [10] "blueberry"         "boysenberry"       "breadfruit"       
## [13] "canary melon"      "cantaloupe"        "cherimoya"        
## [16] "cherry"            "chili pepper"      "clementine"       
## [19] "cloudberry"        "coconut"           "cranberry"        
## [22] "cucumber"          "currant"           "damson"           
## [25] "date"              "dragonfruit"       "durian"           
## [28] "eggplant"          "elderberry"        "feijoa"           
## [31] "fig"               "goji berry"        "gooseberry"       
## [34] "grape"             "grapefruit"        "guava"            
## [37] "honeydew"          "huckleberry"       "jackfruit"        
## [40] "jambul"            "jujube"            "kiwi fruit"       
## [43] "kumquat"           "lemon"             "lime"             
## [46] "loquat"            "lychee"            "mandarine"        
## [49] "mango"             "mulberry"          "nectarine"        
## [52] "nut"               "olive"             "orange"           
## [55] "pamelo"            "papaya"            "passionfruit"     
## [58] "peach"             "pear"              "persimmon"        
## [61] "physalis"          "pineapple"         "plum"             
## [64] "pomegranate"       "pomelo"            "purple mangosteen"
## [67] "quince"            "raisin"            "rambutan"         
## [70] "raspberry"         "redcurrant"        "rock melon"       
## [73] "salal berry"       "satsuma"           "star fruit"       
## [76] "strawberry"        "tamarillo"         "tangerine"        
## [79] "ugli fruit"        "watermelon"

15 / 40

We're looking for fruits like "bell pepper", "blood orange", etc.

Hint: Look up the help page for str_subset()

Hint: What character indicates that a string contains more than one word?

Solution: str_subset(fruit, " ")

Your Turn 1 Solution

str_subset(fruit, " ")

##  [1] "bell pepper"       "blood orange"      "canary melon"     
##  [4] "chili pepper"      "goji berry"        "kiwi fruit"       
##  [7] "purple mangosteen" "rock melon"        "salal berry"      
## [10] "star fruit"        "ugli fruit"
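One way to connect this back to the earlier slides: str_subset() behaves like using str_detect() to build a logical vector and then subsetting with it. A quick sketch to check that on the fruit vector:

```r
library(stringr)

# The solution from the slide
two_words_a <- str_subset(fruit, " ")

# The same result, built from str_detect() and ordinary subsetting
two_words_b <- fruit[str_detect(fruit, " ")]

identical(two_words_a, two_words_b)
length(two_words_a)
```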

16 / 40

Factors

17 / 40

Meet forcats

forcats hex sticker

https://forcats.tidyverse.org/

18 / 40

"factor" is just another name for categorical data in R

library(forcats)
gss_cat

## # A tibble: 21,483 × 9
##    year marital         age race  rincome        partyid     relig denom tvhours
##   <int> <fct>         <int> <fct> <fct>          <fct>       <fct> <fct>   <int>
## 1  2000 Never married    26 White $8000 to 9999  Ind,near r… Prot… Sout…      12
## 2  2000 Divorced         48 White $8000 to 9999  Not str re… Prot… Bapt…      NA
## 3  2000 Widowed          67 White Not applicable Independent Prot… No d…       2
## 4  2000 Never married    39 White Not applicable Ind,near r… Orth… Not …       4
## 5  2000 Divorced         25 White Not applicable Not str de… None  Not …       1
## 6  2000 Married          25 White $20000 - 24999 Strong dem… Prot… Sout…      NA
## 7  2000 Never married    36 White $25000 or more Not str re… Chri… Not …       3
## 8  2000 Divorced         44 White $7000 to 7999  Ind,near d… Prot… Luth…      NA
## # ℹ 21,475 more rows

19 / 40

The General Social Survey is a large-scale survey designed to monitor changes in social characteristics and attitudes over time in the United States.

The forcats package has a subset of this data built-in, gss_cat.

EDA - continuous

gss_cat %>% 
  ggplot(aes(x=tvhours)) +
  geom_histogram() +
  labs(x = "TV Hours")

20 / 40

To get a sense of the distribution of this data, we can use a histogram.

EDA - categorical

gss_cat %>% 
  ggplot(aes(x=marital)) +
  geom_bar() + 
  labs(x = "Marital Status")

21 / 40

Now let's look at a categorical variable, marital status. In this case, we can use a bar plot to display the number of observations in each category of marital status.

What are we trying to show?

22 / 40

But it would be more informative to see a plot like this, which shows the categories in descending order of frequency.

Factors have an ordering

Factors are stored with an order even if there is no inherent meaning to the ordering

levels(gss_cat$marital)

## [1] "No answer"     "Never married" "Separated"     "Divorced"     
## [5] "Widowed"       "Married"

23 / 40

We call each of the categories of a factor a "level." Here is the default ordering of the levels of marital status. This is the same ordering we saw in our first bar plot.

Reorder factor levels

fct_inorder(): by the order in which they first appear.

fct_infreq(): by number of observations within each level (largest first)

fct_inseq(): by numeric value of level.

https://forcats.tidyverse.org/reference/fct_inorder.html

24 / 40

Forcats provides some helper functions for changing the order of factor levels when we need to do so. Here are a few examples (but there are more!)

Example

f <- factor(c("b", "b", "a", "c", "c", "c"))
f

## [1] b b a c c c
## Levels: a b c

fct_inorder(f)

## [1] b b a c c c
## Levels: b a c

25 / 40

Let's look at a simple character vector that we've converted to a factor. By default the levels are in alphabetical order. But we can use fct_inorder() to reorder the levels according to the order in which they first appear in the vector.

Since "b" appears first, that becomes the first level.
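For comparison, here is a short sketch applying fct_infreq() to the same factor. It orders the levels by how often each one appears, with the most frequent level first.

```r
library(forcats)

f <- factor(c("b", "b", "a", "c", "c", "c"))

# Default: levels in alphabetical order
levels(f)

# By first appearance: "b" comes first
levels(fct_inorder(f))

# By frequency: "c" appears three times, "b" twice, "a" once
levels(fct_infreq(f))
```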

Your Turn 2

Use ggplot and one of these forcats functions to recreate the plot of gss_cat:

fct_inorder()
fct_infreq()
fct_inseq()

26 / 40

Your Turn 2 Solution

gss_cat %>%
  ggplot(
    aes(
      x = fct_infreq(marital)
    )
  ) +
  geom_bar() + 
  labs(x = "Marital Status")

fct_infreq() displays the bars by order of how frequently each category of marital status appears in the data set.

27 / 40

Dates

28 / 40

Meet lubridate

lubridate hex sticker

https://lubridate.tidyverse.org/

29 / 40

30 / 40

Lubridate is a package that makes it easier to work with dates and datetimes. These are two standard formats for storing time-related information. A date is what it sounds like: the year, month, and day.

A datetime stores all of that as well as hours, minutes, seconds, and time zone.

Creating Dates and Datetimes

31 / 40

When working with time-related information, often the first step is to get your data into a date or datetime format.

For example, you may be trying to read in time-related information that uses dashes to separate values. Or maybe spaces, periods, or no spacing at all.

Lubridate functions will handle all of these formats automatically. A function called ymd for year/month/day can read all of these different formats and will output the Date object shown on the right.
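A minimal sketch of that behavior, using one made-up date written four different ways:

```r
library(lubridate)

# The same date with dashes, spaces, periods, and no separators
raw <- c("2019-02-28", "2019 02 28", "2019.02.28", "20190228")

dates <- ymd(raw)
dates

# The result is a Date vector, not a character vector
class(dates)
```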

Creating Dates and Datetimes

32 / 40

There are a number of functions that lubridate includes for creating a date or datetime from almost any format. Many of them are listed in this table. There are many permutations on y, m, and d, that are designed to read in time-related information that is stored in different orders.
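As a rough illustration, the function name tells lubridate the order of the components in the input. The input strings here are made up for the example.

```r
library(lubridate)

# Month-day-year
d1 <- mdy("02/28/2019")

# Day-month-year
d2 <- dmy("28-02-2019")

# Year-month-day plus hours, minutes, seconds gives a datetime (UTC by default)
dt <- ymd_hms("2019-02-28 23:00:00")

d1
d2
dt
```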

Extract Information

33 / 40

Once we have our data in a date or datetime format, we are able to easily access all of its components -- such as the year, month, day, hour, minute, etc.
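Here is a short sketch of those accessor functions on a single made-up datetime:

```r
library(lubridate)

ts <- ymd_hms("2019-02-28 23:00:00")

# Each accessor pulls out one component of the datetime
year(ts)
month(ts)
day(ts)
hour(ts)
minute(ts)
```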

Extract Information

34 / 40

And, we can even extract additional information such as the quarter, week of the year, day of the year, or day of the week.
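A sketch of these derived values for one made-up date. The week_start argument is set explicitly here so that Sunday counts as day 1, independent of any session options.

```r
library(lubridate)

d <- ymd("2019-02-28")

quarter(d)              # quarter of the year
week(d)                 # week of the year
yday(d)                 # day of the year
wday(d, week_start = 7) # day of the week as a number, with Sunday = 1
```

Adding label = TRUE to wday() returns the day name instead of a number, as in the exercise later in this session.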

Other tasks with lubridate

do math with dates and datetimes
convert between time zones
work with time intervals
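A brief sketch of each of these three tasks, using made-up dates. It assumes the "US/Pacific" time zone name is available on your system, as in the bike traffic example.

```r
library(lubridate)

start <- ymd("2019-02-01")
end <- ymd("2019-02-28")

# Date math: add one day (2019 is not a leap year)
next_day <- end + days(1)

# Time zones: same instant, displayed in a different zone
pacific <- ymd_hms("2019-02-28 23:00:00", tz = "US/Pacific")
utc <- with_tz(pacific, tzone = "UTC")

# Intervals: the span between two dates, measured in days
span <- interval(start, end)
n_days <- time_length(span, unit = "day")

next_day
utc
n_days
```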

35 / 40

Bike Traffic data set

bike_traffic

## # A tibble: 85,810 × 5
##   date                   crossing           direction bike_count ped_count
##   <fct>                  <fct>              <fct>          <int>     <int>
## 1 02/28/2019 11:00:00 PM Burke Gilman Trail North              0         0
## 2 02/28/2019 10:00:00 PM Burke Gilman Trail North              0         0
## 3 02/28/2019 09:00:00 PM Burke Gilman Trail North              2         0
## 4 02/28/2019 08:00:00 PM Burke Gilman Trail North              2         1
## 5 02/28/2019 07:00:00 PM Burke Gilman Trail North              6         0
## 6 02/28/2019 06:00:00 PM Burke Gilman Trail North             13         5
## 7 02/28/2019 05:00:00 PM Burke Gilman Trail North             19        15
## 8 02/28/2019 04:00:00 PM Burke Gilman Trail North             26        23
## # ℹ 85,802 more rows

36 / 40

Let's look at an example using bike traffic data. The date column is stored as a factor with hours listed in AM and PM and is not currently in the standardized datetime format.

In the current form, we cannot take advantage of the many time-related tools that exist for dates and datetimes.

Get into a `datetime` format

bike_traffic %>%
  mutate(
    timestamp = mdy_hms(date, tz = "US/Pacific"), 
    .before = date
  )

## # A tibble: 85,810 × 6
##   timestamp           date               crossing direction bike_count ped_count
##   <dttm>              <fct>              <fct>    <fct>          <int>     <int>
## 1 2019-02-28 23:00:00 02/28/2019 11:00:… Burke G… North              0         0
## 2 2019-02-28 22:00:00 02/28/2019 10:00:… Burke G… North              0         0
## 3 2019-02-28 21:00:00 02/28/2019 09:00:… Burke G… North              2         0
## 4 2019-02-28 20:00:00 02/28/2019 08:00:… Burke G… North              2         1
## 5 2019-02-28 19:00:00 02/28/2019 07:00:… Burke G… North              6         0
## # ℹ 85,805 more rows

37 / 40

So, we will use a lubridate function to turn this column into a datetime.

Because our data is in the order month-day-year-hour-minute-second, we use the mdy_hms() function to turn the column of values into datetimes.

Now that we have this stored as a datetime object, we can start to explore temporal patterns in this data.
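As a small self-contained sketch of that next step, the example below parses two made-up strings in the same format as the date column (this is not the real bike_traffic data) and then extracts the hour, which is the kind of value we could group by to look at daily patterns.

```r
library(lubridate)

# Made-up strings in the same format as the date column
raw <- c("02/28/2019 11:00:00 PM", "02/28/2019 09:00:00 AM")

# mdy_hms() reads the AM/PM marker and converts to a 24-hour datetime
timestamp <- mdy_hms(raw, tz = "US/Pacific")
timestamp

# Hour of the day, on a 24-hour clock
hour(timestamp)
```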

Your Turn 3

What day of the week was the moon landing (July 20, 1969)?

💡 Hint: The wday() function returns the day of the week

ymd(____) %>%
  ____(label = TRUE)

38 / 40

Your Turn 3 Solution

What day of the week was the moon landing (July 20, 1969)?

ymd("1969-07-20") %>%
  wday(label = TRUE)

## [1] Sun
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

39 / 40

Remember your cheatsheets!

40 / 40

This really just scratches the surface of the tools available within these packages for working with data types. Think about what types of data you most commonly work with -- which of these packages would you like to explore further?

There are some optional tutorials in your campsite if you'd like to do a deeper dive into any of these topics.

Cheatsheets are especially helpful when working with these packages.

Data types

Session 1

1 / 40

Conf - Data types

2 / 40

Go ahead and open the milestone labeled "Conf - Data types" in the campsite. We'll be using this for a few exercises as we go along.

Data types

Strings, factors and dates

3 / 40

Strings

4 / 40

Meet stringr

stringr hex sticker

https://stringr.tidyverse.org/

5 / 40

The stringr package is your tidyverse companion to all things strings.

It's part of the core tidyverse, along with packages like dplyr and ggplot2, so stringr functions play really nicely with dplyr functions like filter() and mutate().

Let's look at a concrete example.

Breed Traits data set

breed_traits

## # A tibble: 195 × 15
##   breed          affection shedding drooling openness playfulness protectiveness
##   <chr>              <dbl>    <dbl>    <dbl>    <dbl>       <dbl>          <dbl>
## 1 Retrievers (L…         5        4        2        5           5              3
## 2 French Bulldo…         5        3        3        5           5              3
## 3 German Shephe…         5        4        2        3           4              5
## 4 Retrievers (G…         5        4        2        5           4              3
## 5 Bulldogs               4        3        3        4           4              3
## 6 Poodles                5        1        1        5           5              5
## # ℹ 189 more rows
## # ℹ 8 more variables: adaptability <dbl>, trainability <dbl>, energy <dbl>,
## #   barking <dbl>, stimulation_needs <dbl>, good_w_children <dbl>,
## #   good_w_other_dogs <dbl>, grooming_freq <dbl>

6 / 40

In our analysis, want to compare traits across terrier breeds only, of which there are many types.

Cartoon of 18 dog breeds

7 / 40

To make this very clear, we have 195 dog breeds (with 18 very good boys and girls are pictured here as an example)...

Cartoon of 18 dog breeds with only four terrier breeds highlighted

8 / 40

(The four highlighted breeds, from top to bottom, left to right, are Scottish, Bull, Boston, and Russell terriers.)

Sniffing out terrier breeds

breed_traits %>% 
  filter(breed == "Yorkshire Terriers")

## # A tibble: 1 × 15
##   breed          affection shedding drooling openness playfulness protectiveness
##   <chr>              <dbl>    <dbl>    <dbl>    <dbl>       <dbl>          <dbl>
## 1 Yorkshire Ter…         5        1        1        5           4              5
## # ℹ 8 more variables: adaptability <dbl>, trainability <dbl>, energy <dbl>,
## #   barking <dbl>, stimulation_needs <dbl>, good_w_children <dbl>,
## #   good_w_other_dogs <dbl>, grooming_freq <dbl>

9 / 40

When I say "subset", alarm bells are probably going off in your head that we we'll be using the filter() function.

Using what we've already know how to do, we can print the breed_traits table and scan through the paginated results in RMarkdown to find our first match — Yorkshire terriers.

We'll use the == operator to match the string, and get one row in the output.

Sniffing out terrier breeds

breed_traits %>% 
  filter(breed %in% c("Yorkshire Terriers", "Boston Terriers"))

## # A tibble: 2 × 15
##   breed          affection shedding drooling openness playfulness protectiveness
##   <chr>              <dbl>    <dbl>    <dbl>    <dbl>       <dbl>          <dbl>
## 1 Yorkshire Ter…         5        1        1        5           4              5
## 2 Boston Terrie…         5        2        1        5           5              3
## # ℹ 8 more variables: adaptability <dbl>, trainability <dbl>, energy <dbl>,
## #   barking <dbl>, stimulation_needs <dbl>, good_w_children <dbl>,
## #   good_w_other_dogs <dbl>, grooming_freq <dbl>

10 / 40

And then our second match — Boston terriers.

This time, we'll use the %in% operator to match a vector of strings, and get two rows in the output.

You can where this is going...

Sniffing out terrier breeds ... with pattern matching!11 / 40

That's where stringr comes in.

Filtering with `str_detect()`

breed_traits %>% 
  filter(str_detect(breed, pattern = "Terrier"))

## # A tibble: 36 × 15
##    breed         affection shedding drooling openness playfulness protectiveness
##    <chr>             <dbl>    <dbl>    <dbl>    <dbl>       <dbl>          <dbl>
##  1 Yorkshire Te…         5        1        1        5           4              5
##  2 Boston Terri…         5        2        1        5           5              3
##  3 West Highlan…         5        3        1        4           5              5
##  4 Scottish Ter…         5        2        2        3           4              5
##  5 Soft Coated …         5        1        2        3           3              3
##  6 Airedale Ter…         3        1        1        3           3              5
##  7 Bull Terriers         4        3        1        4           4              3
##  8 Russell Terr…         5        3        1        5           5              4
##  9 Cairn Terrie…         4        2        1        3           4              4
## 10 Staffordshir…         5        2        3        4           4              5
## # ℹ 26 more rows
## # ℹ 8 more variables: adaptability <dbl>, trainability <dbl>, energy <dbl>,
## #   barking <dbl>, stimulation_needs <dbl>, good_w_children <dbl>,
## #   good_w_other_dogs <dbl>, grooming_freq <dbl>

12 / 40

In the example code, we keep only the rows where the sequence "Terrier" is found in the breed column, and drop the rest.

stringr functions

Character manipulation

str_sub("Introduction to the tidyverse", 21, 24)

## [1] "tidy"

13 / 40

In addition to pattern matching, you can use stringr to manipulate strings in a variety of ways. I'll show just a couple examples.

We can extract substrings from a vector using str_sub(), in this case by extracting the 21st through 24th characters which form the word "tidy".

stringr functions

Whitespace tools

str_trim("   Introduction to the tidyverse          ")

## [1] "Introduction to the tidyverse"

14 / 40

We can trim whitespace from a string using str_trim(), which can be a quick and easy data cleaning step.

These are just a couple examples of the many ways you can use stringr to manipulate strings.

Your Turn 1

Use str_subset() to subset the elements of the fruit vector that are made up of two or more words.

# preview `fruit`, which is loaded along with stringr
library(stringr)
fruit

##  [1] "apple"             "apricot"           "avocado"          
##  [4] "banana"            "bell pepper"       "bilberry"         
##  [7] "blackberry"        "blackcurrant"      "blood orange"     
## [10] "blueberry"         "boysenberry"       "breadfruit"       
## [13] "canary melon"      "cantaloupe"        "cherimoya"        
## [16] "cherry"            "chili pepper"      "clementine"       
## [19] "cloudberry"        "coconut"           "cranberry"        
## [22] "cucumber"          "currant"           "damson"           
## [25] "date"              "dragonfruit"       "durian"           
## [28] "eggplant"          "elderberry"        "feijoa"           
## [31] "fig"               "goji berry"        "gooseberry"       
## [34] "grape"             "grapefruit"        "guava"            
## [37] "honeydew"          "huckleberry"       "jackfruit"        
## [40] "jambul"            "jujube"            "kiwi fruit"       
## [43] "kumquat"           "lemon"             "lime"             
## [46] "loquat"            "lychee"            "mandarine"        
## [49] "mango"             "mulberry"          "nectarine"        
## [52] "nut"               "olive"             "orange"           
## [55] "pamelo"            "papaya"            "passionfruit"     
## [58] "peach"             "pear"              "persimmon"        
## [61] "physalis"          "pineapple"         "plum"             
## [64] "pomegranate"       "pomelo"            "purple mangosteen"
## [67] "quince"            "raisin"            "rambutan"         
## [70] "raspberry"         "redcurrant"        "rock melon"       
## [73] "salal berry"       "satsuma"           "star fruit"       
## [76] "strawberry"        "tamarillo"         "tangerine"        
## [79] "ugli fruit"        "watermelon"

15 / 40

We're looking for fruits like "bell pepper", "blood orange", etc.

Hint: Look up the help page for str_subset()

Hint: What character indicates that a string contains more than one word?

Solution: str_subset(fruit, " ")

Your Turn 1 Solution

str_subset(fruit, " ")

##  [1] "bell pepper"       "blood orange"      "canary melon"     
##  [4] "chili pepper"      "goji berry"        "kiwi fruit"       
##  [7] "purple mangosteen" "rock melon"        "salal berry"      
## [10] "star fruit"        "ugli fruit"

16 / 40

Factors

17 / 40

Meet forcats

stringr hex sticker

https://forcats.tidyverse.org/

18 / 40

"factor" is just another name for categorical data in R

library(forcats)
gss_cat

## # A tibble: 21,483 × 9
##    year marital         age race  rincome        partyid     relig denom tvhours
##   <int> <fct>         <int> <fct> <fct>          <fct>       <fct> <fct>   <int>
## 1  2000 Never married    26 White $8000 to 9999  Ind,near r… Prot… Sout…      12
## 2  2000 Divorced         48 White $8000 to 9999  Not str re… Prot… Bapt…      NA
## 3  2000 Widowed          67 White Not applicable Independent Prot… No d…       2
## 4  2000 Never married    39 White Not applicable Ind,near r… Orth… Not …       4
## 5  2000 Divorced         25 White Not applicable Not str de… None  Not …       1
## 6  2000 Married          25 White $20000 - 24999 Strong dem… Prot… Sout…      NA
## 7  2000 Never married    36 White $25000 or more Not str re… Chri… Not …       3
## 8  2000 Divorced         44 White $7000 to 7999  Ind,near d… Prot… Luth…      NA
## # ℹ 21,475 more rows

19 / 40

The General Social Survey is a large-scale survey designed to monitor changes in social characteristics and attitudes over time in the United States.

The forcats package has a subset of this data built-in, gss_cat.

EDA - continuous

gss_cat %>% 
  ggplot(aes(x=tvhours)) +
  geom_histogram() +
  labs(x = "TV Hours")

20 / 40

To get a sense of the distribution of this data, we can use a histogram.

EDA - categorical

gss_cat %>% 
  ggplot(aes(x=marital)) +
  geom_bar() + 
  labs(x = "Marital Status")

21 / 40

Now let's look at a categorical variable, marital status. In this case, we can use a bar plot to display the number of observations in each category of marital status.

What are we trying to show?

22 / 40

But it would be more informative to see a plot like this, which shows the categories in descending order of frequency.

Factors have an ordering

Factors are stored with an order even if there is no inherent meaning to the ordering

levels(gss_cat$marital)

## [1] "No answer"     "Never married" "Separated"     "Divorced"     
## [5] "Widowed"       "Married"

23 / 40

We call each of the categories of a factor a "level." Here is the default ordering of the levels of marital status. This is the same ordering we saw in our first bar plot.

Reorder factor levels

fct_inorder(): by the order in which they first appear.

fct_infreq(): by number of observations within each level (largest first)

fct_inseq(): by numeric value of level.

https://forcats.tidyverse.org/reference/fct_inorder.html

24 / 40

Forcats provides some helper functions for changing the order of factor levels when we need to do so. Here are a few examples (but there are more!)

Example

f <- factor(c("b", "b", "a", "c", "c", "c"))
f

## [1] b b a c c c
## Levels: a b c

fct_inorder(f)

## [1] b b a c c c
## Levels: b a c

25 / 40

Since "b" appears first, that becomes the first level.

Your Turn 2

Use ggplot and one of these forcats functions to recreate the plot of gss_cat:

fct_inorder()
fct_infreq()
fct_inseq()

26 / 40

Your Turn 2 Solution

gss %>%
  ggplot(
    aes(
      x = fct_infreq(marital)
    )
  ) +
  geom_bar() + 
  labs(x = "Marital Status")

???

fct_infreq() displays the bars by order of how frequently each category of martial status appears in the data set.

27 / 40

Dates

28 / 40

Meet lubridate

stringr hex sticker

https://lubridate.tidyverse.org/

29 / 40

30 / 40

Lubridate is a package that makes it easier to work dates and datetimes. These are two standard formats for storing time-related information. A date is what is sounds like - the year, month, and day.

A datetime stores all of that as well as hours, minutes, seconds, and time zone.

Creating Dates and Datetimes31 / 40

When working with time-related information, often the first step is to get your data into a date or datetime format.

For example, you may be trying to read in time-related information that uses dashes to separate values. Or maybe spaces, periods, or no spacing at all.

Creating Dates and Datetimes32 / 40

Extract Information33 / 40

Once we have our data in a date or datetime format, we are able to easily access all of its components -- such as the year, month, day, hour, minute, etc.

Extract Information34 / 40

And, we can even extract additional information such as the quarter, week of the year, day of the year, or day of the week.

Other tasks with lubridate

do math with dates and datetimes
convert between time zones
work with time intervals

35 / 40

Bike Traffic data set

bike_traffic

## # A tibble: 85,810 × 5
##   date                   crossing           direction bike_count ped_count
##   <fct>                  <fct>              <fct>          <int>     <int>
## 1 02/28/2019 11:00:00 PM Burke Gilman Trail North              0         0
## 2 02/28/2019 10:00:00 PM Burke Gilman Trail North              0         0
## 3 02/28/2019 09:00:00 PM Burke Gilman Trail North              2         0
## 4 02/28/2019 08:00:00 PM Burke Gilman Trail North              2         1
## 5 02/28/2019 07:00:00 PM Burke Gilman Trail North              6         0
## 6 02/28/2019 06:00:00 PM Burke Gilman Trail North             13         5
## 7 02/28/2019 05:00:00 PM Burke Gilman Trail North             19        15
## 8 02/28/2019 04:00:00 PM Burke Gilman Trail North             26        23
## # ℹ 85,802 more rows

36 / 40

Let's look at an example using bike traffic data. The date column is stored as a factor, with hours listed in AM and PM, rather than in a standardized datetime format.

In its current form, we cannot take advantage of the many time-related tools that exist for dates and datetimes.

Get into a `datetime` format

bike_traffic %>%
  mutate(
    timestamp = mdy_hms(date, tz = "US/Pacific"), 
    .before = date
  )

## # A tibble: 85,810 × 6
##   timestamp           date               crossing direction bike_count ped_count
##   <dttm>              <fct>              <fct>    <fct>          <int>     <int>
## 1 2019-02-28 23:00:00 02/28/2019 11:00:… Burke G… North              0         0
## 2 2019-02-28 22:00:00 02/28/2019 10:00:… Burke G… North              0         0
## 3 2019-02-28 21:00:00 02/28/2019 09:00:… Burke G… North              2         0
## 4 2019-02-28 20:00:00 02/28/2019 08:00:… Burke G… North              2         1
## 5 2019-02-28 19:00:00 02/28/2019 07:00:… Burke G… North              6         0
## # ℹ 85,805 more rows

37 / 40

So, we will use a lubridate function to turn this column into a datetime.

Because our data was in the order Month-Day-Year-Hour-Minute-Second, we used the mdy_hms() function to turn the column of values into datetimes.

Now that we have this stored as a datetime object, we can start to explore temporal patterns in this data.
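One possible first step, sketched on five rows copied from the printed output above rather than the full data set: parse the text, pull out the hour, and look at counts by hour of day. The names `evening` and `by_hour` are hypothetical.

```r
library(dplyr)
library(lubridate)

# Five rows copied from the bike_traffic printout (4 PM through 8 PM)
evening <- tibble(
  date = c(
    "02/28/2019 08:00:00 PM", "02/28/2019 07:00:00 PM",
    "02/28/2019 06:00:00 PM", "02/28/2019 05:00:00 PM",
    "02/28/2019 04:00:00 PM"
  ),
  bike_count = c(2L, 6L, 13L, 19L, 26L)
)

by_hour <- evening %>%
  mutate(
    timestamp = mdy_hms(date, tz = "US/Pacific"),  # text to datetime
    hour = hour(timestamp)                         # 24-hour clock
  ) %>%
  arrange(hour) %>%
  select(hour, bike_count)

by_hour
```

In these rows, the bike count drops steadily from hour 16 to hour 20. On the full data set, the same hour column could feed group_by() and summarise().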

Your Turn 3

What day of the week was the moon landing (July 20, 1969)?

💡 Hint: The wday() function returns the day of the week

ymd(____) %>%
  ____(label = TRUE)

38 / 40

Your Turn 3 Solution

What day of the week was the moon landing (July 20, 1969)?

ymd("1969-07-20") %>%
  wday(label = TRUE)

## [1] Sun
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

39 / 40

Remember your cheatsheets!

40 / 40

There are some optional tutorials in your campsite if you'd like to do a deeper dive into any of these topics.

Cheatsheets are especially helpful when working with these packages.
