Teaching an Introductory Data Science course with tidyverse workshop

# Teaching an Introductory Data Science course with tidyverse workshop
## Data Wrangling
### Dr. Mine Dogucu
### 2021-12-15

---

<div class="my-header"></div>

<div class="my-footer"> 
 CC BY-NC-ND 4.0 <a href="https://mdogucu.ics.uci.edu">Mine Dogucu</a></div>

---

## Goals

- The pipe operator
- Subsetting data frames
- Changing variables
- Aggregating Data

---

---

What is the average of 4, 8, 16 approximately?

---

1.What is the average of **4, 8, 16** approximately?

---

2.What is the **average** of 4, 8, 16 approximately?

---

3.What is the average of 4, 8, 16 **approximately**?

---

---

```r
c(4, 8, 16)
```

```
## [1]  4  8 16
```

<hr>

```r
mean(c(4, 8, 16))
```

```
## [1] 9.333333
```

<hr>

```r
round(mean(c(4, 8, 16)))
```

```
## [1] 9
```

---

**Problem with writing functions within functions**

Things will get messy and more difficult to read and debug as we deal with more complex operations on data.

---

---

```r
numbers <- c(4, 8, 16)
numbers
```

```
## [1]  4  8 16
```

<hr>

```r
avg_number <- mean(numbers)
avg_number
```

```
## [1] 9.333333
```

<hr>

```r
round(avg_number)
```

```
## [1] 9
```

---

**Problem with creating many objects**

We will end up with too many objects in `Environment`.

---

---

---
class: middle

```r
c(4, 8, 16) %>% 
  mean() %>% 
  round()
```

```
## [1] 9
```
]

Combine 4, 8, and 16 `and then`  
Take the mean   `and then`  
Round the output

]

---

The output of the first function is the first argument of the second function.

---

Do you recall composite functions such as `$f \circ g(x)$`?

Now we have `$f \circ g \circ h (x)$` or `round(mean(c(4, 8, 16)))`

```r
h(x) %>% 
  g() %>% 
  f()
```

]

```r
c(4, 8, 16) %>% 
  mean() %>% 
  round()
```

]

---

---

```r
glimpse(lapd)
```

```
## Rows: 68,564
## Columns: 35
## $ `Row ID`                       <chr> "3-1000027830ctFu", "3-…
## $ Year                           <dbl> 2013, 2013, 2013, 2013,…
## $ `Department Title`             <chr> "Police (LAPD)", "Polic…
## $ `Payroll Department`           <dbl> 4301, 4302, 4301, 4301,…
## $ `Record Number`                <dbl> 1000027830, 1000155488,…
## $ `Job Class Title`              <chr> "Police Detective II", …
## $ `Employment Type`              <chr> "Full Time", "Full Time…
## $ `Hourly or Event Rate`         <dbl> 53.16, 23.77, 60.80, 60…
## $ `Projected Annual Salary`      <dbl> 110998.08, 49623.67, 12…
## $ `Q1 Payments`                  <dbl> 24931.20, 11343.96, 241…
## $ `Q2 Payments`                  <dbl> 29181.61, 13212.37, 283…
## $ `Q3 Payments`                  <dbl> 26545.80, 11508.36, 287…
## $ `Q4 Payments`                  <dbl> 29605.30, 13442.53, 332…
## $ `Payments Over Base Pay`       <dbl> 4499.12, 1844.82, 13192…
## $ `% Over Base Pay`              <dbl> 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Total Payments`               <dbl> 110263.91, 49507.22, 11…
## $ `Base Pay`                     <dbl> 105764.79, 47662.40, 10…
## $ `Permanent Bonus Pay`          <dbl> 3174.12, 0.00, 7363.95,…
## $ `Longevity Bonus Pay`          <dbl> 0.00, 1310.82, 0.00, 0.…
## $ `Temporary Bonus Pay`          <dbl> 1325.00, 0.00, 1205.00,…
## $ `Lump Sum Pay`                 <dbl> 0.00, 0.00, 2133.18, 0.…
## $ `Overtime Pay`                 <dbl> 0.00, 0.00, 4424.32, 98…
## $ `Other Pay & Adjustments`      <dbl> 0.00, 534.00, -1934.02,…
## $ `Other Pay (Payroll Explorer)` <dbl> 4499.12, 1844.82, 8768.…
## $ MOU                            <chr> "24", "3", "24", "24", …
## $ `MOU Title`                    <chr> "POLICE OFFICERS UNIT",…
## $ `FMS Department`               <dbl> 70, 70, 70, 70, 70, 70,…
## $ `Job Class`                    <chr> "2223", "1358", "2227",…
## $ `Pay Grade`                    <chr> "2", "0", "1", "1", "0"…
## $ `Average Health Cost`          <dbl> 11651.40, 10710.24, 116…
## $ `Average Dental Cost`          <dbl> 898.08, 405.24, 898.08,…
## $ `Average Basic Life`           <dbl> 191.04, 11.40, 191.04, …
## $ `Average Benefit Cost`         <dbl> 12740.52, 11126.88, 127…
## $ `Benefits Plan`                <chr> "Police", "City", "Poli…
## $ `Job Class Link`               <chr> "http://per.lacity.org/…
```

---

Before we work with this data it is a good idea to change variable names to tidy style.

```r
lapd <- clean_names(lapd)
glimpse(lapd)
```

```
## Rows: 68,564
## Columns: 35
## $ row_id                     <chr> "3-1000027830ctFu", "3-1000…
## $ year                       <dbl> 2013, 2013, 2013, 2013, 201…
## $ department_title           <chr> "Police (LAPD)", "Police (L…
## $ payroll_department         <dbl> 4301, 4302, 4301, 4301, 430…
## $ record_number              <dbl> 1000027830, 1000155488, 100…
## $ job_class_title            <chr> "Police Detective II", "Cle…
## $ employment_type            <chr> "Full Time", "Full Time", "…
## $ hourly_or_event_rate       <dbl> 53.16, 23.77, 60.80, 60.98,…
## $ projected_annual_salary    <dbl> 110998.08, 49623.67, 126950…
## $ q1_payments                <dbl> 24931.20, 11343.96, 24184.0…
## $ q2_payments                <dbl> 29181.61, 13212.37, 28327.2…
## $ q3_payments                <dbl> 26545.80, 11508.36, 28744.2…
## $ q4_payments                <dbl> 29605.30, 13442.53, 33224.8…
## $ payments_over_base_pay     <dbl> 4499.12, 1844.82, 13192.43,…
## $ percent_over_base_pay      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_payments             <dbl> 110263.91, 49507.22, 114480…
## $ base_pay                   <dbl> 105764.79, 47662.40, 101287…
## $ permanent_bonus_pay        <dbl> 3174.12, 0.00, 7363.95, 708…
## $ longevity_bonus_pay        <dbl> 0.00, 1310.82, 0.00, 0.00, …
## $ temporary_bonus_pay        <dbl> 1325.00, 0.00, 1205.00, 132…
## $ lump_sum_pay               <dbl> 0.00, 0.00, 2133.18, 0.00, …
## $ overtime_pay               <dbl> 0.00, 0.00, 4424.32, 9839.3…
## $ other_pay_adjustments      <dbl> 0.00, 534.00, -1934.02, -21…
## $ other_pay_payroll_explorer <dbl> 4499.12, 1844.82, 8768.11, …
## $ mou                        <chr> "24", "3", "24", "24", "12"…
## $ mou_title                  <chr> "POLICE OFFICERS UNIT", "CL…
## $ fms_department             <dbl> 70, 70, 70, 70, 70, 70, 70,…
## $ job_class                  <chr> "2223", "1358", "2227", "22…
## $ pay_grade                  <chr> "2", "0", "1", "1", "0", "2…
## $ average_health_cost        <dbl> 11651.40, 10710.24, 11651.4…
## $ average_dental_cost        <dbl> 898.08, 405.24, 898.08, 898…
## $ average_basic_life         <dbl> 191.04, 11.40, 191.04, 191.…
## $ average_benefit_cost       <dbl> 12740.52, 11126.88, 12740.5…
## $ benefits_plan              <chr> "Police", "City", "Police",…
## $ job_class_link             <chr> "http://per.lacity.org/pers…
```

---

## subsetting variables/columns

`select()`

---

## subsetting observations/rows

`slice()` and `filter()`

---

`select` is used to select certain variables in the data frame.

```r
select(lapd, year, base_pay)
```

```
## # A tibble: 68,564 × 2
##     year base_pay
##    <dbl>    <dbl>
##  1  2013  105765.
##  2  2013   47662.
##  3  2013  101288.
##  4  2013  118087.
##  5  2013   90322.
##  6  2013   62770.
##  7  2013   93718.
##  8  2013       0 
##  9  2013   51246.
## 10  2013   74227.
## # … with 68,554 more rows
```
]

```r
lapd %>% 
  select(year, base_pay)
```

]

---

`select` can also be used to drop certain variables if used with a negative sign.

```r
select(lapd, -row_id, -department_title)
```

```
## # A tibble: 68,564 × 33
##     year payroll_department record_number job_class_title       
##    <dbl>              <dbl>         <dbl> <chr>                 
##  1  2013               4301    1000027830 Police Detective II   
##  2  2013               4302    1000155488 Clerk Typist          
##  3  2013               4301    1000194958 Police Sergeant I     
##  4  2013               4301    1000232317 Police Lieutenant I   
##  5  2013               4302    1000329284 Principal Storekeeper 
##  6  2013               4302    1001124320 Police Service Repres…
##  7  2013               4301    1001221822 Police Officer III    
##  8  2013               4301    1001243583 Police Sergeant I     
##  9  2013               4301    1001317832 Police Officer II     
## 10  2013               4301     100162910 Police Officer II     
## # … with 68,554 more rows, and 29 more variables:
## #   employment_type <chr>, hourly_or_event_rate <dbl>,
## #   projected_annual_salary <dbl>, q1_payments <dbl>,
## #   q2_payments <dbl>, q3_payments <dbl>, q4_payments <dbl>,
## #   payments_over_base_pay <dbl>, percent_over_base_pay <dbl>,
## #   total_payments <dbl>, base_pay <dbl>,
## #   permanent_bonus_pay <dbl>, longevity_bonus_pay <dbl>, …
```

---

## Selection helpers

`starts_with()`  
`ends_with()`  
`contains()`

---

```r
select(lapd, starts_with("q"))
```

```
## # A tibble: 68,564 × 4
##    q1_payments q2_payments q3_payments q4_payments
##          <dbl>       <dbl>       <dbl>       <dbl>
##  1      24931.      29182.      26546.      29605.
##  2      11344.      13212.      11508.      13443.
##  3      24184       28327.      28744.      33225.
##  4      29391.      36591.      32905.      37234.
##  5      20813       24136       21518.      25231.
##  6      16057.      17927.      14150.      17052.
##  7      22162.      25664.      23404.      24586.
##  8          0           0         331.          0 
##  9      11941.      14330.      13404.      14537.
## 10      17046.      20457.      18777.      21371.
## # … with 68,554 more rows
```

---

```r
select(lapd, ends_with("pay"))
```

```
## # A tibble: 68,564 × 8
##    payments_over_ba… percent_over_bas… base_pay permanent_bonus…
##                <dbl>             <dbl>    <dbl>            <dbl>
##  1             4499.                 0  105765.            3174.
##  2             1845.                 0   47662.               0 
##  3            13192.                 0  101288.            7364.
##  4            18035.                 0  118087.            7087.
##  5             1376.                 0   90322.               0 
##  6             2415.                 0   62770.               0 
##  7             2099.                 0   93718.             866.
##  8              331.                 0       0                0 
##  9             2967.                 0   51246.            1540.
## 10             3424.                 0   74227.            2233.
## # … with 68,554 more rows, and 4 more variables:
## #   longevity_bonus_pay <dbl>, temporary_bonus_pay <dbl>,
## #   lump_sum_pay <dbl>, overtime_pay <dbl>
```

---

```r
select(lapd, contains("pay"))
```

```
## # A tibble: 68,564 × 17
##    payroll_department q1_payments q2_payments q3_payments
##                 <dbl>       <dbl>       <dbl>       <dbl>
##  1               4301      24931.      29182.      26546.
##  2               4302      11344.      13212.      11508.
##  3               4301      24184       28327.      28744.
##  4               4301      29391.      36591.      32905.
##  5               4302      20813       24136       21518.
##  6               4302      16057.      17927.      14150.
##  7               4301      22162.      25664.      23404.
##  8               4301          0           0         331.
##  9               4301      11941.      14330.      13404.
## 10               4301      17046.      20457.      18777.
## # … with 68,554 more rows, and 13 more variables:
## #   q4_payments <dbl>, payments_over_base_pay <dbl>,
## #   percent_over_base_pay <dbl>, total_payments <dbl>,
## #   base_pay <dbl>, permanent_bonus_pay <dbl>,
## #   longevity_bonus_pay <dbl>, temporary_bonus_pay <dbl>,
## #   lump_sum_pay <dbl>, overtime_pay <dbl>,
## #   other_pay_adjustments <dbl>, …
```

---

## subsetting variables/columns

`select()`

---

## subsetting observations/rows

`slice()` and `filter()`

---

The data below include all the rows from third to seventh. Including third and seventh.

```r
slice(lapd, 3:7)
```

```
## # A tibble: 5 × 35
##   row_id    year department_title payroll_departm… record_number
##   <chr>    <dbl> <chr>                       <dbl>         <dbl>
## 1 3-10001…  2013 Police (LAPD)                4301    1000194958
## 2 3-10002…  2013 Police (LAPD)                4301    1000232317
## 3 3-10003…  2013 Police (LAPD)                4302    1000329284
## 4 3-10011…  2013 Police (LAPD)                4302    1001124320
## 5 3-10012…  2013 Police (LAPD)                4301    1001221822
## # … with 30 more variables: job_class_title <chr>,
## #   employment_type <chr>, hourly_or_event_rate <dbl>,
## #   projected_annual_salary <dbl>, q1_payments <dbl>,
## #   q2_payments <dbl>, q3_payments <dbl>, q4_payments <dbl>,
## #   payments_over_base_pay <dbl>, percent_over_base_pay <dbl>,
## #   total_payments <dbl>, base_pay <dbl>,
## #   permanent_bonus_pay <dbl>, longevity_bonus_pay <dbl>, …
```

]

`filter()` subsetting rows based on a condition.

The data below includes rows when the recorded year is 2018.

```r
filter(lapd, year == 2018)
```

```
## # A tibble: 14,824 × 35
##    row_id   year department_title payroll_departm… record_number
##    <chr>   <dbl> <chr>                       <dbl>         <dbl>
##  1 8-1000…  2018 Police (LAPD)                4301    1000027830
##  2 8-1000…  2018 Police (LAPD)                4301    1000194958
##  3 8-1000…  2018 Police (LAPD)                4301    1000232317
##  4 8-1001…  2018 Police (LAPD)                4302    1001124320
##  5 8-1001…  2018 Police (LAPD)                4301    1001221822
##  6 8-1001…  2018 Police (LAPD)                4301    1001317832
##  7 8-1001…  2018 Police (LAPD)                4301     100162910
##  8 8-1001…  2018 Police (LAPD)                4301    1001675957
##  9 8-1001…  2018 Police (LAPD)                4302    1001884819
## 10 8-1001…  2018 Police (LAPD)                4302    1001893163
## # … with 14,814 more rows, and 30 more variables:
## #   job_class_title <chr>, employment_type <chr>,
## #   hourly_or_event_rate <dbl>, projected_annual_salary <dbl>,
## #   q1_payments <dbl>, q2_payments <dbl>, q3_payments <dbl>,
## #   q4_payments <dbl>, payments_over_base_pay <dbl>,
## #   percent_over_base_pay <dbl>, total_payments <dbl>,
## #   base_pay <dbl>, permanent_bonus_pay <dbl>, …
```

]

---

### Relational Operators in R

| Operator | Description              |
|----------|--------------------------|
| <        | Less than                |
| >        | Greater than             |
| <=       | Less than or equal to    |
| >=       | Greater than or equal to |
| ==       | Equal to                 |
| !=       | Not equal to             |

]

### Logical Operators in R

| Operator | Description |
|----------|-------------|
| &        | and         |
| &#124;   | or          |

]

---

Q. According to [datausa.io](https://datausa.io/profile/geo/los-angeles-ca#:~:text=In%202018%2C%20Los%20Angeles%2C%20CA,%2462%2C474%2C%20a%203.78%25%20increase.) Los Angeles had a median household income of $62474 in 2018. How many LAPD staff members had a base pay higher than $62474 in year 2018 according to this data?

.footnote[Median household income is **not** the same thing as median employee income. Our aim is data wrangling and not necessarily statistical analysis for now.]

---

```r
lapd %>% 
  filter(year == 2018 & base_pay > 62474)
```

```
## # A tibble: 11,690 × 35
##    row_id   year department_title payroll_departm… record_number
##    <chr>   <dbl> <chr>                       <dbl>         <dbl>
##  1 8-1000…  2018 Police (LAPD)                4301    1000027830
##  2 8-1000…  2018 Police (LAPD)                4301    1000194958
##  3 8-1000…  2018 Police (LAPD)                4301    1000232317
##  4 8-1001…  2018 Police (LAPD)                4302    1001124320
##  5 8-1001…  2018 Police (LAPD)                4301    1001221822
##  6 8-1001…  2018 Police (LAPD)                4301    1001317832
##  7 8-1001…  2018 Police (LAPD)                4301     100162910
##  8 8-1001…  2018 Police (LAPD)                4301    1001675957
##  9 8-1001…  2018 Police (LAPD)                4302    1001884819
## 10 8-1001…  2018 Police (LAPD)                4302    1001893163
## # … with 11,680 more rows, and 30 more variables:
## #   job_class_title <chr>, employment_type <chr>,
## #   hourly_or_event_rate <dbl>, projected_annual_salary <dbl>,
## #   q1_payments <dbl>, q2_payments <dbl>, q3_payments <dbl>,
## #   q4_payments <dbl>, payments_over_base_pay <dbl>,
## #   percent_over_base_pay <dbl>, total_payments <dbl>,
## #   base_pay <dbl>, permanent_bonus_pay <dbl>, …
```

---

```r
lapd %>% 
  filter(year == 2018 & base_pay > 62474) %>% 
  nrow()
```

```
## [1] 11690
```

---

Q. How many observations are available between 2013 and 2015 including 2013 and 2015?

```r
lapd %>% 
  filter(year >= 2013 & year <= 2015)
```

```
## # A tibble: 40,227 × 35
##    row_id   year department_title payroll_departm… record_number
##    <chr>   <dbl> <chr>                       <dbl>         <dbl>
##  1 3-1000…  2013 Police (LAPD)                4301    1000027830
##  2 3-1000…  2013 Police (LAPD)                4302    1000155488
##  3 3-1000…  2013 Police (LAPD)                4301    1000194958
##  4 3-1000…  2013 Police (LAPD)                4301    1000232317
##  5 3-1000…  2013 Police (LAPD)                4302    1000329284
##  6 3-1001…  2013 Police (LAPD)                4302    1001124320
##  7 3-1001…  2013 Police (LAPD)                4301    1001221822
##  8 3-1001…  2013 Police (LAPD)                4301    1001243583
##  9 3-1001…  2013 Police (LAPD)                4301    1001317832
## 10 3-1001…  2013 Police (LAPD)                4301     100162910
## # … with 40,217 more rows, and 30 more variables:
## #   job_class_title <chr>, employment_type <chr>,
## #   hourly_or_event_rate <dbl>, projected_annual_salary <dbl>,
## #   q1_payments <dbl>, q2_payments <dbl>, q3_payments <dbl>,
## #   q4_payments <dbl>, payments_over_base_pay <dbl>,
## #   percent_over_base_pay <dbl>, total_payments <dbl>,
## #   base_pay <dbl>, permanent_bonus_pay <dbl>, …
```

---

Q. How many observations are available between 2013 and 2015 including 2013 and 2015?

```r
lapd %>% 
  filter(year >= 2013 & year <= 2015) %>% 
  nrow()
```

```
## [1] 40227
```

---

Q. How many LAPD staff were employed full time in 2018?

```r
lapd %>% 
  filter(employment_type == "Full Time" & year == 2018) %>% 
  nrow()
```

```
## [1] 14664
```

---
class: middle

We have done all sorts of selections, slicing, filtering on `lapd` but it has not changed at all. Why do you think so?

```r
glimpse(lapd)
```

---

Moving forward we are only going to focus on year 2018, and use `job_class_title`, `employment_type`, and `base_pay`. Let's clean our data accordingly and move on with the smaller `lapd` data that we need.

---

```r
lapd %>% 
  filter(year == 2018) %>% 
  select(job_class_title, 
         employment_type, 
         base_pay)
```

```
## # A tibble: 14,824 × 3
##    job_class_title                  employment_type base_pay
##    <chr>                            <chr>              <dbl>
##  1 Police Detective II              Full Time        119322.
##  2 Police Sergeant I                Full Time        113271.
##  3 Police Lieutenant II             Full Time        148116 
##  4 Police Service Representative II Full Time         78677.
##  5 Police Officer III               Full Time        109374.
##  6 Police Officer II                Full Time         95002.
##  7 Police Officer II                Full Time         95379.
##  8 Police Officer II                Full Time         95388.
##  9 Equipment Mechanic               Full Time         80496 
## 10 Detention Officer                Full Time         69640 
## # … with 14,814 more rows
```

---

```r
lapd <- 
  lapd %>% 
  filter(year == 2018) %>% 
  select(job_class_title, 
         employment_type, 
         base_pay)
```

---

```r
glimpse(lapd)
```

```
## Rows: 14,824
## Columns: 3
## $ job_class_title <chr> "Police Detective II", "Police Sergean…
## $ employment_type <chr> "Full Time", "Full Time", "Full Time",…
## $ base_pay        <dbl> 119321.60, 113270.70, 148116.00, 78676…
```

**Goal**:

Create a new variable called `base_pay_k` that represents `base_pay` in thousand dollars.

---

```r
lapd %>% 
  mutate(base_pay_k = base_pay/1000)
```

```
## # A tibble: 14,824 × 4
##    job_class_title           employment_type base_pay base_pay_k
##    <chr>                     <chr>              <dbl>      <dbl>
##  1 Police Detective II       Full Time        119322.      119. 
##  2 Police Sergeant I         Full Time        113271.      113. 
##  3 Police Lieutenant II      Full Time        148116       148. 
##  4 Police Service Represent… Full Time         78677.       78.7
##  5 Police Officer III        Full Time        109374.      109. 
##  6 Police Officer II         Full Time         95002.       95.0
##  7 Police Officer II         Full Time         95379.       95.4
##  8 Police Officer II         Full Time         95388.       95.4
##  9 Equipment Mechanic        Full Time         80496        80.5
## 10 Detention Officer         Full Time         69640        69.6
## # … with 14,814 more rows
```

---

```r
glimpse(lapd)
```

**Goal**:

Create a new variable called `base_pay_level` which has `Less Than 0`, `No Income`, `Less than Median and Greater than 0` and `Greater than Median`. We will consider $62474 as the median (from previous lecture).

---

Let's first check to see there is anyone earning exactly the median value.

```r
lapd %>% 
  filter(base_pay == 62474)
```

```
## # A tibble: 0 × 3
## # … with 3 variables: job_class_title <chr>,
## #   employment_type <chr>, base_pay <dbl>
```

---

```r
lapd %>% 
  mutate(base_pay_level = case_when(
    base_pay < 0 ~ "Less than 0", 
    base_pay == 0 ~ "No Income",
    base_pay < 62474 & base_pay > 0 ~ "Less than Median, Greater than 0",
    base_pay > 62474 ~ "Greater than Median")) 
```

```
## # A tibble: 14,824 × 4
##    job_class_title      employment_type base_pay base_pay_level 
##    <chr>                <chr>              <dbl> <chr>          
##  1 Police Detective II  Full Time        119322. Greater than M…
##  2 Police Sergeant I    Full Time        113271. Greater than M…
##  3 Police Lieutenant II Full Time        148116  Greater than M…
##  4 Police Service Repr… Full Time         78677. Greater than M…
##  5 Police Officer III   Full Time        109374. Greater than M…
##  6 Police Officer II    Full Time         95002. Greater than M…
##  7 Police Officer II    Full Time         95379. Greater than M…
##  8 Police Officer II    Full Time         95388. Greater than M…
##  9 Equipment Mechanic   Full Time         80496  Greater than M…
## 10 Detention Officer    Full Time         69640  Greater than M…
## # … with 14,814 more rows
```

---

```
## # A tibble: 14,824 × 1
##    base_pay_level     
##    <chr>              
##  1 Greater than Median
##  2 Greater than Median
##  3 Greater than Median
##  4 Greater than Median
##  5 Greater than Median
##  6 Greater than Median
##  7 Greater than Median
##  8 Greater than Median
##  9 Greater than Median
## 10 Greater than Median
## # … with 14,814 more rows
```

---

We can use pipes with ggplot too!

---

```r
glimpse(lapd)
```

**Goal**:

Make `job_class_title` and `employment_type` factor variables.

---

```r
lapd %>% 
  mutate(employment_type = as.factor(employment_type),
         job_class_title = as.factor(job_class_title)) 
```

```
## # A tibble: 14,824 × 3
##    job_class_title                  employment_type base_pay
##    <fct>                            <fct>              <dbl>
##  1 Police Detective II              Full Time        119322.
##  2 Police Sergeant I                Full Time        113271.
##  3 Police Lieutenant II             Full Time        148116 
##  4 Police Service Representative II Full Time         78677.
##  5 Police Officer III               Full Time        109374.
##  6 Police Officer II                Full Time         95002.
##  7 Police Officer II                Full Time         95379.
##  8 Police Officer II                Full Time         95388.
##  9 Equipment Mechanic               Full Time         80496 
## 10 Detention Officer                Full Time         69640 
## # … with 14,814 more rows
```

---

`as.factor()` - makes a vector factor  
`as.numeric()` - makes a vector numeric  
`as.integer()` - makes a vector integer  
`as.double()` - makes a vector double  
`as.character()` - makes a vector character

---

Once again we did not "save"
anything into `lapd`. As we work on data cleaning it makes sense not to "save" the data frames. Once we see the final data frame we want then we can "save" (i.e. overwrite) it.

---

In your lecture notes, you can do all the changes in this lecture in one long set of piped code. That's the beauty of piping!

```r
lapd <- 
  lapd %>% 
  clean_names() %>% 
  filter(year == 2018) %>% 
  select(job_class_title, 
         employment_type, 
         base_pay) %>% 
    mutate(employment_type = as.factor(employment_type),
           job_class_title = as.factor(job_class_title),
           base_pay_level = case_when(
             base_pay < 0 ~ "Less than 0", 
             base_pay == 0 ~ "No Income",
             base_pay < 62474 & base_pay > 0 ~ "Less than Median, Greater than 0",
             base_pay > 62474 ~ "Greater than Median")) 
```

---

## Word of caution

The functions `clean_names()`, `select()`, `filter()`, `mutate()` all take a data frame as the first argument. Even though we do not see it, the data frame is piped through from the previous step of code at each step. 
When we use these functions without the `%>%` we have to include the data frame explicitly.

```r
clean_names(lapd)
```

```r
lapd %>% 
  clean_names()
```

---

---
class: inverse middle

---

```r
lapd %>% 
  count(employment_type)
```

```
## # A tibble: 3 × 2
##   employment_type     n
##   <chr>           <int>
## 1 Full Time       14664
## 2 Part Time         132
## 3 Per Event          28
```

---

```r
lapd %>% 
  count(employment_type) %>% 
  mutate(prop = n/sum(n))
```

```
## # A tibble: 3 × 3
##   employment_type     n    prop
##   <chr>           <int>   <dbl>
## 1 Full Time       14664 0.989  
## 2 Part Time         132 0.00890
## 3 Per Event          28 0.00189
```

---

## Mean

```r
summarize(lapd, 
          mean_base_pay = mean(base_pay))
```

```
## # A tibble: 1 × 1
##   mean_base_pay
##           <dbl>
## 1        85149.
```

---

## Median

```r
summarize(lapd, 
          median(base_pay))
```

```
## # A tibble: 1 × 1
##   `median(base_pay)`
##                <dbl>
## 1             97601.
```
]

```r
median(lapd$base_pay)
```

```
## [1] 97600.66
```

]

---

## Quantiles

```r
summarize(lapd, quantile(base_pay, c(0.25, 0.50, 0.75)))
```

```
## # A tibble: 3 × 1
##   `quantile(base_pay, c(0.25, 0.5, 0.75))`
##                                      <dbl>
## 1                                   67266.
## 2                                   97601.
## 3                                  109368.
```

---

We can use multiple functions such as `mean()` and `median` within the summarize function.

```r
summarize(lapd,
          mean(base_pay),
          median(base_pay))
```

```
## # A tibble: 1 × 2
##   `mean(base_pay)` `median(base_pay)`
##              <dbl>              <dbl>
## 1           85149.             97601.
```

---

In order to display the variable names more legibly in the output, we can assign variable names to numerical summaries (e.g. `mean_base_pay`).

```r
summarize(lapd,
          mean_base_pay = mean(base_pay),
          med_base_pay = median(base_pay))
```

```
## # A tibble: 1 × 2
##   mean_base_pay med_base_pay
##           <dbl>        <dbl>
## 1        85149.       97601.
```

---

Some tther useful functions that `summarize()` can handle are:

`min()`  
`max()`  
`sd()`  
`var()`

---

---

`group_by()`

`group_by()` separates the data frame by the groups. Any action following `group_by()` will be completed for each group separately.
---

Q. What is the median salary for each employment type?

---

```r
lapd %>% 
  group_by(employment_type)
```

```
## # A tibble: 14,824 × 3
## # Groups:   employment_type [3]
##    job_class_title                  employment_type base_pay
##    <chr>                            <chr>              <dbl>
##  1 Police Detective II              Full Time        119322.
##  2 Police Sergeant I                Full Time        113271.
##  3 Police Lieutenant II             Full Time        148116 
##  4 Police Service Representative II Full Time         78677.
##  5 Police Officer III               Full Time        109374.
##  6 Police Officer II                Full Time         95002.
##  7 Police Officer II                Full Time         95379.
##  8 Police Officer II                Full Time         95388.
##  9 Equipment Mechanic               Full Time         80496 
## 10 Detention Officer                Full Time         69640 
## # … with 14,814 more rows
```

---

```r
lapd %>% 
  group_by(employment_type) %>% 
  summarize(med_base_pay = median(base_pay))
```

```
## # A tibble: 3 × 2
##   employment_type med_base_pay
##   <chr>                  <dbl>
## 1 Full Time             97996.
## 2 Part Time             14474.
## 3 Per Event              4275
```

---

We can also remind ourselves how many staff members there were in each group.

```r
lapd %>% 
  group_by(employment_type) %>% 
  summarize(med_base_pay = median(base_pay),
            count = n())
```

```
## # A tibble: 3 × 3
##   employment_type med_base_pay count
##   <chr>                  <dbl> <int>
## 1 Full Time             97996. 14664
## 2 Part Time             14474.   132
## 3 Per Event              4275     28
```

Note that `n()` does not take any arguments.

---

## Schedule for the Day

__10:00 - 10:15 Introduction and Setup__  
__10:15 - 11:15 Introduction to Toolkit and Data Basics__  
__11:20 - 12:30 Data Visualization__  
__1:00 - 1:45 Data Wrangling__   
1:45 - 2:15 Packages and External Datasets  
2:15 - 2:30 Wrap Up