Day 4: Data loading and tidying with tidyverse

2023-10-24

Refresher: vectors

Recall that a vector is a series of values that are all the same type

# A numeric vector:
c(1, 2, 3)

[1] 1 2 3

# A character vector:
c("banana", "orange", "apple")

[1] "banana" "orange" "apple"

# A logical vector:
c(TRUE, FALSE, TRUE)

[1]  TRUE FALSE  TRUE

About dataframes

A dataframe is a collection of vectors
- basically like a spreadsheet in Excel

Image of a data frame

Today we will learn to load and tidy (or “wrangle”) dataframes

About data tidying (“wrangling”)

Data often come to us in a form that is not suitable for analysis
Data tidying (or “wrangling”) is the process of re-formatting the data so it can be analyzed

Getty Images

Image by Allison Horst

Set up project

Create a new project called gapminder-analysis in the data-analysis-course folder on your Desktop
- We will use the “gapminder” dataset for the rest of the course
Inside that project, create a folder called data_raw to store raw data
- You should never edit raw data directly (treat it as read-only)
Also create a file to write today’s code. You could call it data-tidying-practice.R

Set up project

Download the gapminder.csv file from this link, and put it in data_raw
- https://www.dropbox.com/s/fdirlsnxlzy53qq/gapminder.csv?dl=0

About R Packages

The functions we have used so far are included in R (called “Base R”)
But sometimes we need to do something that is not available in Base R
Packages are collections of code that extend the functionality of R
- There are nearly 200,000 packages currently available!

Install `tidyverse`

Today we will use the tidyverse set of packages for working with data
Install tidyverse using install.packages():

install.packages("tidyverse")

Loading packages

The install.packages() function downloads the package to your computer
- You only need to do this once per package (or when you want to update the package)
But R isn’t able to use the package yet
The library() function tells R to load a package so that we can use it

Load `tidyverse`

Let’s load the tidyverse package with library()
- Note that you don’t need to use quotation marks for the package name

library(tidyverse)

Load `tidyverse`

tidyverse actually includes multiple packages (dplyr, tidyr, readr, etc.)
- These are all packages that are useful for loading and cleaning data

Load data

Use the read_csv() function to load a spreadsheet as a dataframe
- There are other functions for other file formats like .xls, .xlsx, .tsv, etc.

gapminder <- read_csv("data_raw/gapminder.csv")

Print out the data

Inspect the data just like any other object (variable) in R: type the name of the object into the console

gapminder

# A tibble: 1,704 × 6
  country     continent  year lifeExp      pop gdpPercap
  <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
# ℹ 1,698 more rows

Print out the data

Notice that R tells us the number of rows and columns, as well as the data type of each column
- The tidyverse package refers to dataframes as “tibbles”

What is this dataset about?

What do you think this dataset is about?
What do you think the columns mean?

Gapminder

This is a dataset of economic statistics from various countries over time, from https://gapminder.org
The meaning of some columns is obvious (country, continent, year), but not others
- pop: Population
- lifeExp: Life expectancy (寿命)
- gdpPercap: GDP per capita (一人当たりの国内総生産)

Sort data with `arrange()`

First provide the name of the dataframe, then the column to sort by

arrange(gapminder, lifeExp)

# A tibble: 1,704 × 6
  country      continent  year lifeExp     pop gdpPercap
  <chr>        <chr>     <dbl>   <dbl>   <dbl>     <dbl>
1 Rwanda       Africa     1992    23.6 7290203      737.
2 Afghanistan  Asia       1952    28.8 8425333      779.
3 Gambia       Africa     1952    30    284320      485.
4 Angola       Africa     1952    30.0 4232095     3521.
5 Sierra Leone Africa     1952    30.3 2143249      880.
6 Afghanistan  Asia       1957    30.3 9240934      821.
# ℹ 1,698 more rows

Sort data with `arrange()`

The default setting is to sort from small to large
- To sort in the reverse (descending) direction, use desc()

arrange(gapminder, desc(lifeExp))

# A tibble: 1,704 × 6
  country          continent  year lifeExp       pop gdpPercap
  <chr>            <chr>     <dbl>   <dbl>     <dbl>     <dbl>
1 Japan            Asia       2007    82.6 127467972    31656.
2 Hong Kong, China Asia       2007    82.2   6980412    39725.
3 Japan            Asia       2002    82   127065841    28605.
4 Iceland          Europe     2007    81.8    301931    36181.
5 Switzerland      Europe     2007    81.7   7554661    37506.
6 Hong Kong, China Asia       2002    81.5   6762476    30209.
# ℹ 1,698 more rows

Sort data with `arrange()`

You can sort on multiple columns
- Ties will be sorted by the next column

arrange(gapminder, continent, lifeExp)

# A tibble: 1,704 × 6
  country      continent  year lifeExp     pop gdpPercap
  <chr>        <chr>     <dbl>   <dbl>   <dbl>     <dbl>
1 Rwanda       Africa     1992    23.6 7290203      737.
2 Gambia       Africa     1952    30    284320      485.
3 Angola       Africa     1952    30.0 4232095     3521.
4 Sierra Leone Africa     1952    30.3 2143249      880.
5 Mozambique   Africa     1952    31.3 6446316      469.
6 Sierra Leone Africa     1957    31.6 2295678     1004.
# ℹ 1,698 more rows

Challenge

Sort the data by year, first going from smallest to biggest, then from biggest to smallest.
What is the earliest year in the data?
What is the latest year in the data?

Narrow down columns with `select()`

First provide the name of the dataframe, then the columns to select

select(gapminder, year, country, gdpPercap)

# A tibble: 1,704 × 3
   year country     gdpPercap
  <dbl> <chr>           <dbl>
1  1952 Afghanistan      779.
2  1957 Afghanistan      821.
3  1962 Afghanistan      853.
4  1967 Afghanistan      836.
5  1972 Afghanistan      740.
6  1977 Afghanistan      786.
# ℹ 1,698 more rows

Saving your output

Notice that although we have used several functions, gapminder is still the same:

gapminder

# A tibble: 1,704 × 6
  country     continent  year lifeExp      pop gdpPercap
  <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
# ℹ 1,698 more rows

Saving your output

This is because we have not saved any of the output
To do that, you need to create a new object with <-
- You can call the object whatever you want, but use a name that is easy to remember

gapminder_gdp <- select(gapminder, year, country, gdpPercap)

Challenge

Select only year, country, and population, then sort the data by population

About pipes

During the last challenge, you needed to save the output of the select() function, then use it as input into the arrange() function
There is a simpler way to do this, called the “pipe”
The pipe is written like this: %>%
- Newer versions of R also let you write it like this: |>

About pipes

The pipe takes the output from one function and passes it to the input of the next function
You can think of it as saying “and then”
- Do this and then do this, and then do this…
- Do this %>% do this, %>% do this…

About pipes

We can even use the pipe just with one function:
- Read this as “start with gapminder and then select only year, country, and population”

gapminder %>% select(year, country, pop)

# A tibble: 1,704 × 3
   year country          pop
  <dbl> <chr>          <dbl>
1  1952 Afghanistan  8425333
2  1957 Afghanistan  9240934
3  1962 Afghanistan 10267083
4  1967 Afghanistan 11537966
5  1972 Afghanistan 13079460
6  1977 Afghanistan 14880372
# ℹ 1,698 more rows

About pipes

This becomes very useful when we want to do multiple steps
- Read this as “start with gapminder, and then select only year, country, and population, and then arrange by year”

gapminder %>% select(year, country, pop) %>% arrange(year)

# A tibble: 1,704 × 3
   year country         pop
  <dbl> <chr>         <dbl>
1  1952 Afghanistan 8425333
2  1952 Albania     1282697
3  1952 Algeria     9279525
4  1952 Angola      4232095
# ℹ 1,700 more rows

About pipes

We can make it easier to read by putting each step on its own line

gapminder %>%
  select(year, country, pop) %>%
  arrange(year)

# A tibble: 1,704 × 3
   year country          pop
  <dbl> <chr>          <dbl>
1  1952 Afghanistan  8425333
2  1952 Albania      1282697
3  1952 Algeria      9279525
4  1952 Angola       4232095
5  1952 Argentina   17876956
6  1952 Australia    8691212
# ℹ 1,698 more rows

About pipes

This way, you don’t have to save each intermediate step
This is a very useful way to manipulate data
I will now use the pipe (%>%) for the rest of the lesson

Subset rows with `filter()`

Use the filter() function to only keep rows that meet a certain condition
For example, let’s only keep the data in Europe:

gapminder %>% filter(continent == "Europe")

# A tibble: 360 × 6
  country continent  year lifeExp     pop gdpPercap
  <chr>   <chr>     <dbl>   <dbl>   <dbl>     <dbl>
1 Albania Europe     1952    55.2 1282697     1601.
2 Albania Europe     1957    59.3 1476505     1942.
3 Albania Europe     1962    64.8 1728137     2313.
4 Albania Europe     1967    66.2 1984060     2760.
5 Albania Europe     1972    67.7 2263554     3313.
6 Albania Europe     1977    68.9 2509048     3533.
# ℹ 354 more rows

Challenge

Using the pipe (%>%) and the functions we have learned so far, filter the data to Asia, then select only the country, year, and population columns.

Modify data with `mutate()`

For example, we could change the units of population to millions of people:

gapminder %>% mutate(pop = pop/1000000)

# A tibble: 1,704 × 6
  country     continent  year lifeExp   pop gdpPercap
  <chr>       <chr>     <dbl>   <dbl> <dbl>     <dbl>
1 Afghanistan Asia       1952    28.8  8.43      779.
2 Afghanistan Asia       1957    30.3  9.24      821.
3 Afghanistan Asia       1962    32.0 10.3       853.
4 Afghanistan Asia       1967    34.0 11.5       836.
5 Afghanistan Asia       1972    36.1 13.1       740.
6 Afghanistan Asia       1977    38.4 14.9       786.
# ℹ 1,698 more rows

Modify data with `mutate()`

If we provide a new column name, that column will be added

gapminder %>% mutate(pop_mil = pop/1000000)

# A tibble: 1,704 × 7
  country     continent  year lifeExp      pop gdpPercap pop_mil
  <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>   <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.    8.43
2 Afghanistan Asia       1957    30.3  9240934      821.    9.24
3 Afghanistan Asia       1962    32.0 10267083      853.   10.3 
4 Afghanistan Asia       1967    34.0 11537966      836.   11.5 
5 Afghanistan Asia       1972    36.1 13079460      740.   13.1 
6 Afghanistan Asia       1977    38.4 14880372      786.   14.9 
# ℹ 1,698 more rows

Challenge

The gdpPercap column stands for Gross Domestic Product (GDP) per capita (per person)
Calculate the total GDP using mutate()
- hint: You need to use both the pop and gdpPercap columns

Calculate summary statistics with `summarize()`

For example, let’s calculate the overall mean population:

gapminder %>% summarize(mean_pop = mean(pop))

# A tibble: 1 × 1
   mean_pop
      <dbl>
1 29601212.

Calculate summary statistics with `summarize()`

Another example: the total population over all the data:

gapminder %>% summarize(total_pop = sum(pop))

# A tibble: 1 × 1
    total_pop
        <dbl>
1 50440465801

Use `group_by()` to do calculations per group

However, it is often more useful to calculate such summary statistics for particular groups
To do this, first specify the groups with group_by():

gapminder %>% group_by(continent)

# A tibble: 1,704 × 6
# Groups:   continent [5]
  country     continent  year lifeExp      pop gdpPercap
  <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
# ℹ 1,700 more rows

Use `group_by()` to do calculations per group

Next, use summarize() to calculate the summary statistic:

gapminder %>%
  group_by(continent) %>%
  summarize(mean_pop = mean(pop))

# A tibble: 5 × 2
  continent  mean_pop
  <chr>         <dbl>
1 Africa     9916003.
2 Americas  24504795.
3 Asia      77038722.
4 Europe    17169765.
5 Oceania    8874672.

Difference between `mutate()` and `summarize()`

mutate() calculates a new column (or over-writes an existing one) and keeps the number of rows the same
summarize() calculates a new column and decreases the number of rows

Summary

Sort data with arrange()
Narrow down columns with select()
Filter rows with filter()
Modify data with mutate()
Summarize data with summarize()
Group data with group_by()
Use the pipe (%>%) to chain steps together

Day 4: Data loading and tidying with tidyverse

Refresher: vectors

About dataframes

About data tidying (“wrangling”)

Set up project

Set up project

About R Packages

Install tidyverse

Loading packages

Load tidyverse

Load tidyverse

Load data

Print out the data

Print out the data

What is this dataset about?

Gapminder

Sort data with arrange()

Sort data with arrange()

Sort data with arrange()

Challenge

Narrow down columns with select()

Saving your output

Saving your output

Challenge

About pipes

About pipes

About pipes

About pipes

About pipes

About pipes

Subset rows with filter()

Challenge

Modify data with mutate()

Modify data with mutate()

Challenge

Calculate summary statistics with summarize()

Calculate summary statistics with summarize()

Use group_by() to do calculations per group

Use group_by() to do calculations per group

Difference between mutate() and summarize()

Summary

Install `tidyverse`

Load `tidyverse`

Load `tidyverse`

Sort data with `arrange()`

Sort data with `arrange()`

Sort data with `arrange()`

Narrow down columns with `select()`

Subset rows with `filter()`

Modify data with `mutate()`

Modify data with `mutate()`

Calculate summary statistics with `summarize()`

Calculate summary statistics with `summarize()`

Use `group_by()` to do calculations per group

Use `group_by()` to do calculations per group

Difference between `mutate()` and `summarize()`