Day 4: Data loading and tidying with tidyverse

2023-10-24

Refresher: vectors

  • Recall that a vector is a series of values that are all the same type
# A numeric vector:
c(1, 2, 3)
[1] 1 2 3
# A character vector:
c("banana", "orange", "apple")
[1] "banana" "orange" "apple" 
# A logical vector:
c(TRUE, FALSE, TRUE)
[1]  TRUE FALSE  TRUE

About dataframes

  • A dataframe is a collection of vectors
    • basically like a spreadsheet in Excel

Image of a data frame

  • Today we will learn to load and tidy (or “wrangle”) dataframes

About data tidying (“wrangling”)

  • Data often come to us in a form that is not suitable for analysis

  • Data tidying (or “wrangling”) is the process of re-formatting the data so it can be analyzed

Getty Images

data wrangling monsters

Image by Allison Horst

Set up project

  • Create a new project called gapminder-analysis in the data-analysis-course folder on your Desktop
    • We will use the “gapminder” dataset for the rest of the course
  • Inside that project, create a folder called data_raw to store raw data
    • You should never edit raw data directly (treat it as read-only)
  • Also create a file to write today’s code. You could call it data-tidying-practice.R

Set up project

About R Packages

  • The functions we have used so far are included in R (called “Base R”)

  • But sometimes we need to do something that is not available in Base R

  • Packages are collections of code that extend the functionality of R

    • There are nearly 200,000 packages currently available!

Install tidyverse

  • Today we will use the tidyverse set of packages for working with data

  • Install tidyverse using install.packages():

install.packages("tidyverse")

Loading packages

  • The install.packages() function downloads the package to your computer

    • You only need to do this once per package (or when you want to update the package)
  • But R isn’t able to use the package yet

  • The library() function tells R to load a package so that we can use it

Load tidyverse

  • Let’s load the tidyverse package with library()
    • Note that you don’t need to use quotation marks for the package name
library(tidyverse)

Load tidyverse

  • tidyverse actually includes multiple packages (dplyr, tidyr, readr, etc.)
    • These are all packages that are useful for loading and cleaning data

Load data

  • Use the read_csv() function to load a spreadsheet as a dataframe
    • There are other functions for other file formats like .xls, .xlsx, .tsv, etc.
gapminder <- read_csv("data_raw/gapminder.csv")

What is this dataset about?

  • What do you think this dataset is about?

  • What do you think the columns mean?

Gapminder

  • This is a dataset of economic statistics from various countries over time, from https://gapminder.org

  • The meaning of some columns is obvious (country, continent, year), but not others

    • pop: Population
    • lifeExp: Life expectancy (寿命)
    • gdpPercap: GDP per capita (一人当たりの国内総生産)

Sort data with arrange()

  • First provide the name of the dataframe, then the column to sort by
arrange(gapminder, lifeExp)
# A tibble: 1,704 × 6
  country      continent  year lifeExp     pop gdpPercap
  <chr>        <chr>     <dbl>   <dbl>   <dbl>     <dbl>
1 Rwanda       Africa     1992    23.6 7290203      737.
2 Afghanistan  Asia       1952    28.8 8425333      779.
3 Gambia       Africa     1952    30    284320      485.
4 Angola       Africa     1952    30.0 4232095     3521.
5 Sierra Leone Africa     1952    30.3 2143249      880.
6 Afghanistan  Asia       1957    30.3 9240934      821.
# ℹ 1,698 more rows

Sort data with arrange()

  • The default setting is to sort from small to large
    • To sort in the reverse (descending) direction, use desc()
arrange(gapminder, desc(lifeExp))
# A tibble: 1,704 × 6
  country          continent  year lifeExp       pop gdpPercap
  <chr>            <chr>     <dbl>   <dbl>     <dbl>     <dbl>
1 Japan            Asia       2007    82.6 127467972    31656.
2 Hong Kong, China Asia       2007    82.2   6980412    39725.
3 Japan            Asia       2002    82   127065841    28605.
4 Iceland          Europe     2007    81.8    301931    36181.
5 Switzerland      Europe     2007    81.7   7554661    37506.
6 Hong Kong, China Asia       2002    81.5   6762476    30209.
# ℹ 1,698 more rows

Sort data with arrange()

  • You can sort on multiple columns
    • Ties will be sorted by the next column
arrange(gapminder, continent, lifeExp)
# A tibble: 1,704 × 6
  country      continent  year lifeExp     pop gdpPercap
  <chr>        <chr>     <dbl>   <dbl>   <dbl>     <dbl>
1 Rwanda       Africa     1992    23.6 7290203      737.
2 Gambia       Africa     1952    30    284320      485.
3 Angola       Africa     1952    30.0 4232095     3521.
4 Sierra Leone Africa     1952    30.3 2143249      880.
5 Mozambique   Africa     1952    31.3 6446316      469.
6 Sierra Leone Africa     1957    31.6 2295678     1004.
# ℹ 1,698 more rows

Challenge

  • Sort the data by year, first going from smallest to biggest, then from biggest to smallest.

  • What is the earliest year in the data?

  • What is the latest year in the data?

Narrow down columns with select()

  • First provide the name of the dataframe, then the columns to select
select(gapminder, year, country, gdpPercap)
# A tibble: 1,704 × 3
   year country     gdpPercap
  <dbl> <chr>           <dbl>
1  1952 Afghanistan      779.
2  1957 Afghanistan      821.
3  1962 Afghanistan      853.
4  1967 Afghanistan      836.
5  1972 Afghanistan      740.
6  1977 Afghanistan      786.
# ℹ 1,698 more rows

Saving your output

  • Notice that although we have used several functions, gapminder is still the same:
gapminder
# A tibble: 1,704 × 6
  country     continent  year lifeExp      pop gdpPercap
  <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
# ℹ 1,698 more rows

Saving your output

  • This is because we have not saved any of the output

  • To do that, you need to create a new object with <-

    • You can call the object whatever you want, but use a name that is easy to remember
gapminder_gdp <- select(gapminder, year, country, gdpPercap)

Challenge

  • Select only year, country, and population, then sort the data by population

About pipes

  • During the last challenge, you needed to save the output of the select() function, then use it as input into the arrange() function

  • There is a simpler way to do this, called the “pipe”

  • The pipe is written like this: %>%

    • Newer versions of R also let you write it like this: |>

About pipes

  • The pipe takes the output from one function and passes it to the input of the next function

  • You can think of it as saying “and then”

    • Do this and then do this, and then do this…
    • Do this %>% do this, %>% do this…

About pipes

  • We can even use the pipe just with one function:
    • Read this as “start with gapminder and then select only year, country, and population”
gapminder %>% select(year, country, pop)
# A tibble: 1,704 × 3
   year country          pop
  <dbl> <chr>          <dbl>
1  1952 Afghanistan  8425333
2  1957 Afghanistan  9240934
3  1962 Afghanistan 10267083
4  1967 Afghanistan 11537966
5  1972 Afghanistan 13079460
6  1977 Afghanistan 14880372
# ℹ 1,698 more rows

About pipes

  • This becomes very useful when we want to do multiple steps
    • Read this as “start with gapminder, and then select only year, country, and population, and then arrange by year”
gapminder %>% select(year, country, pop) %>% arrange(year)
# A tibble: 1,704 × 3
   year country         pop
  <dbl> <chr>         <dbl>
1  1952 Afghanistan 8425333
2  1952 Albania     1282697
3  1952 Algeria     9279525
4  1952 Angola      4232095
# ℹ 1,700 more rows

About pipes

  • We can make it easier to read by putting each step on its own line
gapminder %>%
  select(year, country, pop) %>%
  arrange(year)
# A tibble: 1,704 × 3
   year country          pop
  <dbl> <chr>          <dbl>
1  1952 Afghanistan  8425333
2  1952 Albania      1282697
3  1952 Algeria      9279525
4  1952 Angola       4232095
5  1952 Argentina   17876956
6  1952 Australia    8691212
# ℹ 1,698 more rows

About pipes

  • This way, you don’t have to save each intermediate step

  • This is a very useful way to manipulate data

  • I will now use the pipe (%>%) for the rest of the lesson

Subset rows with filter()

  • Use the filter() function to only keep rows that meet a certain condition

  • For example, let’s only keep the data in Europe:

gapminder %>% filter(continent == "Europe")
# A tibble: 360 × 6
  country continent  year lifeExp     pop gdpPercap
  <chr>   <chr>     <dbl>   <dbl>   <dbl>     <dbl>
1 Albania Europe     1952    55.2 1282697     1601.
2 Albania Europe     1957    59.3 1476505     1942.
3 Albania Europe     1962    64.8 1728137     2313.
4 Albania Europe     1967    66.2 1984060     2760.
5 Albania Europe     1972    67.7 2263554     3313.
6 Albania Europe     1977    68.9 2509048     3533.
# ℹ 354 more rows

Challenge

Using the pipe (%>%) and the functions we have learned so far, filter the data to Asia, then select only the country, year, and population columns.

Modify data with mutate()

  • For example, we could change the units of population to millions of people:
gapminder %>% mutate(pop = pop/1000000)
# A tibble: 1,704 × 6
  country     continent  year lifeExp   pop gdpPercap
  <chr>       <chr>     <dbl>   <dbl> <dbl>     <dbl>
1 Afghanistan Asia       1952    28.8  8.43      779.
2 Afghanistan Asia       1957    30.3  9.24      821.
3 Afghanistan Asia       1962    32.0 10.3       853.
4 Afghanistan Asia       1967    34.0 11.5       836.
5 Afghanistan Asia       1972    36.1 13.1       740.
6 Afghanistan Asia       1977    38.4 14.9       786.
# ℹ 1,698 more rows

Modify data with mutate()

  • If we provide a new column name, that column will be added
gapminder %>% mutate(pop_mil = pop/1000000)
# A tibble: 1,704 × 7
  country     continent  year lifeExp      pop gdpPercap pop_mil
  <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>   <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.    8.43
2 Afghanistan Asia       1957    30.3  9240934      821.    9.24
3 Afghanistan Asia       1962    32.0 10267083      853.   10.3 
4 Afghanistan Asia       1967    34.0 11537966      836.   11.5 
5 Afghanistan Asia       1972    36.1 13079460      740.   13.1 
6 Afghanistan Asia       1977    38.4 14880372      786.   14.9 
# ℹ 1,698 more rows

Challenge

  • The gdpPercap column stands for Gross Domestic Product (GDP) per capita (per person)

  • Calculate the total GDP using mutate()

    • hint: You need to use both the pop and gdpPercap columns

Calculate summary statistics with summarize()

  • For example, let’s calculate the overall mean population:
gapminder %>% summarize(mean_pop = mean(pop))
# A tibble: 1 × 1
   mean_pop
      <dbl>
1 29601212.

Calculate summary statistics with summarize()

  • Another example: the total population over all the data:
gapminder %>% summarize(total_pop = sum(pop))
# A tibble: 1 × 1
    total_pop
        <dbl>
1 50440465801

Use group_by() to do calculations per group

  • However, it is often more useful to calculate such summary statistics for particular groups

  • To do this, first specify the groups with group_by():

gapminder %>% group_by(continent)
# A tibble: 1,704 × 6
# Groups:   continent [5]
  country     continent  year lifeExp      pop gdpPercap
  <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
# ℹ 1,700 more rows

Use group_by() to do calculations per group

  • Next, use summarize() to calculate the summary statistic:
gapminder %>%
  group_by(continent) %>%
  summarize(mean_pop = mean(pop))
# A tibble: 5 × 2
  continent  mean_pop
  <chr>         <dbl>
1 Africa     9916003.
2 Americas  24504795.
3 Asia      77038722.
4 Europe    17169765.
5 Oceania    8874672.

Difference between mutate() and summarize()

  • mutate() calculates a new column (or over-writes an existing one) and keeps the number of rows the same

  • summarize() calculates a new column and decreases the number of rows

Summary

  • Sort data with arrange()
  • Narrow down columns with select()
  • Filter rows with filter()
  • Modify data with mutate()
  • Summarize data with summarize()
  • Group data with group_by()
  • Use the pipe (%>%) to chain steps together