We're going to start off with some review of what we've learned for the past several weeks.
Work together with your neighbors
There are often several different ways of getting to the right answer.
After 1-2 minutes, we'll go over the answer together and then move on to the next question.
You'll use the sticky system to signal that you're done or that you need help.
Look for data/seattle_pets.csv in your Files pane.
Source: https://data.seattle.gov/Community/Seattle-Pet-Licenses/jguv-t9rb/about_data
This data was retrieved from Seattle's Open Data Portal. It was last updated in July 2024.
It contains a list of current Seattle pet licenses, including animal type (species), pet's name, breed and the owner's ZIP code.
Read in the data saved in data/seattle_pets.csv and explore it. Can you recreate output that looks like this?
💡 Hint: What function from dplyr gives you a quick glimpse of your data?
## Rows: 43,683
## Columns: 7
## $ license_issue_date <chr> "December 18 2015", "June 14 2016", "August 04 2016…
## $ license_number     <chr> "S107948", "S116503", "S119301", "962273", "S133113…
## $ animal_name        <chr> "Zen", "Misty", "Lyra", "Veronica", "Spider", "Maxx…
## $ species            <chr> "Cat", "Cat", "Cat", "Cat", "Cat", "Cat", "Cat", "C…
## $ primary_breed      <chr> "Domestic Longhair", "Siberian", "Mix", "Domestic L…
## $ secondary_breed    <chr> "Mix", NA, NA, NA, NA, NA, "Mix", "Mix", "Mix", "Mi…
## $ zip_code           <dbl> 98117, 98117, 98121, 98107, 98115, 98125, 98103, 98…
library(tidyverse)
seattle_pets <- read_csv("data/seattle_pets.csv")
glimpse(seattle_pets)
How many different species are represented in seattle_pets? How many pets of each species are there?
💡 Hint: What function from dplyr lets you count the unique values of one or more variables?
seattle_pets |> count(species)
or...
seattle_pets |> group_by(species) |> summarize(n = n())
## # A tibble: 4 × 2
##   species     n
##   <chr>   <int>
## 1 Cat     13935
## 2 Dog     29729
## 3 Goat       16
## 4 Pig         3
Because I was curious... what does one name a pet pig?
seattle_pets |> filter(species == "Pig") |> pull(animal_name)
## [1] "Millie" "Calvin" "Waffles Olivia McHart"
What is the most popular pet name in this data set?
💡 Hint: Try using slice_max() from dplyr in your solution. Look up the help docs with ?slice_max.
seattle_pets |> count(animal_name) |> slice_max(order_by = n)
or...
seattle_pets |> count(animal_name, sort = TRUE) |> head(1)
or...
seattle_pets |> count(animal_name) |> filter(n == max(n))
## # A tibble: 1 × 2
##   animal_name     n
##   <chr>       <int>
## 1 Luna          410
What are the top 10 most popular primary dog breeds?
💡 Hint: Try using count() and slice_max() again in your solution -- which argument to slice_max() specifies the number of rows to return?
seattle_pets |> filter(species == "Dog") |> count(primary_breed) |> slice_max(order_by = n, n = 10)
## # A tibble: 10 × 2
##    primary_breed                                      n
##    <chr>                                          <int>
##  1 Retriever, Labrador                             3025
##  2 Retriever, Golden                               1498
##  3 Chihuahua, Short Coat                           1485
##  4 German Shepherd                                  989
##  5 Poodle, Miniature                                889
##  6 Poodle, Standard                                 818
##  7 Terrier                                          814
##  8 Mixed Breed, Medium (up to 44 lbs fully grown)   787
##  9 Australian Shepherd                              726
## 10 Mixed Breed, Large (over 44 lbs fully grown)     717
Visualize the top 10 dog breeds, re-creating the plot below.
💡 Hint: Pay close attention to the x and y axes
💡 Hint: Start with your code from the previous exercise, and pipe this code to ggplot():
seattle_pets |> filter(species == "Dog") |> count(primary_breed) |> slice_max(order_by = n, n = 10) |> ____ # add code here
seattle_pets |> filter(species == "Dog") |> count(primary_breed) |> slice_max(order_by = n, n = 10) |> ggplot(aes(x = n, y = primary_breed)) + geom_col()
or ...
seattle_pets |> filter(species == "Dog") |> count(primary_breed) |> slice_max(order_by = n, n = 10) |> ggplot(aes(x = primary_breed, y = n)) + geom_col() + coord_flip()
seattle_pets further...
What if we wanted to visualize popular dog breeds in descending order?
We would need to handle factors (categorical variables).
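As a rough sketch of what handling factors could look like here: the slides do not say which function they have in mind, so this assumes forcats::fct_reorder() (loaded with the tidyverse) and uses a small hypothetical tibble, top_breeds, built from the first five rows of the top-10 table rather than the full data.

```r
library(tidyverse)

# Hypothetical stand-in for the top-10 result: first five rows of the table above
top_breeds <- tibble(
  primary_breed = c("Retriever, Labrador", "Retriever, Golden",
                    "Chihuahua, Short Coat", "German Shepherd",
                    "Poodle, Miniature"),
  n = c(3025, 1498, 1485, 989, 889)
)

# fct_reorder() converts the character column to a factor whose levels are
# sorted by n (ascending), so the most common breed becomes the last level
ordered_breeds <- top_breeds |>
  mutate(primary_breed = fct_reorder(primary_breed, n))

levels(ordered_breeds$primary_breed)

# ggplot draws the first level at the bottom of the y axis and the last at the
# top, so the bars read in descending order from top to bottom
breed_plot <- ggplot(ordered_breeds, aes(x = n, y = primary_breed)) +
  geom_col()
```

Printing breed_plot draws the chart. Without the mutate() step, ggplot would order the character values alphabetically instead.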
seattle_pets further...
What if we wanted to visualize trends in the number of new pet licenses by month?
We would need to handle dates.
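One possible way to handle these dates, sketched under assumptions: license_issue_date is stored as text such as "December 18 2015", so it has to be parsed before it can be grouped by month. The example assumes lubridate's mdy() and floor_date() are the tools of choice (the slides do not name them), an English locale for the month names, and a hypothetical three-row tibble, licenses, taken from the glimpse() output.

```r
library(tidyverse)
library(lubridate)  # attached by recent tidyverse releases; loaded explicitly to be safe

# Hypothetical stand-in: first three license_issue_date values from glimpse()
licenses <- tibble(
  license_issue_date = c("December 18 2015", "June 14 2016", "August 04 2016")
)

monthly <- licenses |>
  mutate(
    issue_date  = mdy(license_issue_date),         # month-day-year text -> Date
    issue_month = floor_date(issue_date, "month")  # round down to the 1st of the month
  ) |>
  count(issue_month)                               # licenses issued per month

monthly
```

On the full data, monthly could then be piped to ggplot() with issue_month on the x axis and n on the y axis.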
seattle_pets further...
What if we wanted to explore pet names with a particular pattern?
e.g. Pets with "Sir" somewhere in their name.
We would need to handle strings.
## # A tibble: 22 × 2
##    animal_name                          species
##    <chr>                                <chr>
##  1 Sir Pounce                           Cat
##  2 Sir Furcifer                         Cat
##  3 Sir Thomas Sharpe                    Cat
##  4 Sir Digby Chicken Caesar             Cat
##  5 Sir Herlock Sholmes                  Cat
##  6 Sir Daniel                           Cat
##  7 Sir Dapplesox                        Cat
##  8 Sir Robin Cashmoney Pouncealot       Cat
##  9 Sir Mill                             Cat
## 10 Sir Tuna                             Cat
## 11 Sir Loafsalot                        Cat
## 12 Sir Tater Tot                        Dog
## 13 Sir CottonBall                       Dog
## 14 Sir Walter Leroy Phillips            Dog
## 15 Sir Waggleton                        Dog
## 16 Ravindale's Sir Tristan              Dog
## 17 Sir Francis                          Dog
## 18 Sir Oliver Grayson                   Dog
## 19 Sir Oliver                           Dog
## 20 Sir Sammy Haralson Lawrence III Esq. Dog
## 21 Sir Maximillion                      Dog
## 22 Sir Roman Snoopy II                  Dog
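A table like the one above could plausibly come from a string filter. The slides do not show the code, so this is only a sketch: it assumes stringr's str_detect() and a hypothetical five-row tibble, pets, built from names that appear earlier in the slides.

```r
library(tidyverse)

# Hypothetical stand-in: three "Sir" names from the table above, plus two
# names from the glimpse() output that should not match
pets <- tibble(
  animal_name = c("Sir Pounce", "Zen", "Ravindale's Sir Tristan",
                  "Misty", "Sir Tater Tot"),
  species     = c("Cat", "Cat", "Dog", "Cat", "Dog")
)

# str_detect() returns TRUE wherever the pattern occurs anywhere in the string;
# the match is case-sensitive
sir_pets <- pets |>
  filter(str_detect(animal_name, "Sir")) |>
  select(animal_name, species)

sir_pets
```

Because the pattern matches anywhere in the name, it would also pick up a name such as "Sirius"; a stricter pattern would be needed to match "Sir" only as a whole word.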
Fortunately, the tidyverse provides us with tools to work with these different types of data...