class: center, middle, inverse, title-slide

# Teaching an Introductory Data Science course with tidyverse workshop
## Introduction to the Toolkit
### Dr. Mine Dogucu
### 2021-12-15

---
layout: true

<!-- This file by Mine Dogucu is licensed under an Attribution-ShareAlike 2.5 Generic License (CC BY-SA 2.5) More information about the license can be found at https://creativecommons.org/licenses/by-sa/2.5/ -->

<div class="my-header"></div>

<div class="my-footer"> CC BY-NC-ND 4.0 <a href="https://mdogucu.ics.uci.edu">Mine Dogucu</a></div>

---
class: middle

- R & RStudio
- R Markdown
- Getting to know data
- Getting to know variables

---
class: inverse center middle

.font75[Introduction to R & RStudio]

---
class: middle center

<video width="80%" height="45%" align = "center" controls>
  <source src="screencast/01a-hello-world.mp4" type="video/mp4">
</video>

---
class: inverse middle center

.font100[R review]

---
class: middle

## Object assignment operator

```r
birth_year <- 1950
```

--

|                            | Windows        | Mac              |
|----------------------------|----------------|------------------|
| Shortcut                   | Alt + -        | Option + -       |

---
class: middle

## R is case-sensitive

```r
my_age <- 2020 - birth_year

My_age
```

```
## Error in eval(expr, envir, enclos): object 'My_age' not found
```

---
class: middle

Anything in quotes is a character string, not the name of an object defined in R.

```r
ages <- c(25, my_age, 32)

names <- c("Menglin", "Mine", "Rafael")

data.frame(age = ages, name = names)
```

```
##   age    name
## 1  25 Menglin
## 2  70    Mine
## 3  32  Rafael
```

---

## Vocabulary

```r
do(something)
```

`do()` is a function;

`something` is the argument of the function.

--

```r
do(something, colorful)
```

`do()` is a function;

`something` is the first argument of the function;

`colorful` is the second argument of the function.

---
class: middle

## Getting Help

To get help, type `?` followed by the function (or object) name.

```r
?c
```

---

## tidyverse_style_guide

canyoureadthissentence?

--

.pull-right[

```r
age <- c(6, 9, 15)

data.frame(age_kid = age)
```

[Tidyverse style guide link](https://style.tidyverse.org/)

]

--

.pull-left[

After function names do not leave any spaces.

Before and after operators (e.g. `<-`, `=`) leave spaces.

Put a space after a comma, **not** before.

Object names are all lower case, with words separated by an underscore.

]

---
class: middle center

#### RStudio Setup

<video width="80%" height="45%" align = "center" controls>
  <source src="screencast/01b-rstudio-setup.mp4" type="video/mp4">
</video>

---
class: inverse middle center

.font150[R Markdown]

---
class: inverse middle center

.font150[~~R~~ Markdown]

---

## markdown
<br>

.pull-left[

```
_Hello world_

__Hello world__

~~Hello world~~
```

]

.pull-right[

_Hello world_

__Hello world__

~~Hello world~~

]

---
class: inverse middle

.font100[
= .R file] .font100[
= .md file] .font100[
+
= .Rmd file]

---
class: center middle

#### R Markdown

<video width="80%" height="45%" align = "center" controls>
  <source src="screencast/01c-intro-rmarkdown.mp4" type="video/mp4">
</video>

---

<img src="img/rmd-parts.jpeg" width="100%" />

---
class: center middle

## Add Chunk

<img src="img/code-chunk.png" width="50%" />

---
class: center middle

## Run the Current Chunk

<img src="img/run-code.png" width="20%" />

**Always** remember to run the code that I have provided for you before going over lecture notes and/or doing assignments.

---
class: center middle

## Knit

<img src="img/knit.png" width="50%" />

---
class: center middle

## Knit

<img src="img/viewer-pane.png" width="50%" />

Having output documents open in the viewer pane helps when teaching.

---
class: middle center

## Shortcuts

|                            | Windows          | Mac              |
|----------------------------|------------------|------------------|
| add chunk                  | Ctrl + Alt + I   | Cmd + Option + I |
| run the current chunk      | Ctrl + Alt + C   | Cmd + Option + C |
| run current line/selection | Ctrl + Enter     | Cmd + Return     |
| knit                       | Ctrl + Shift + K | Cmd + Shift + K  |

---
class: middle

## Slides for this workshop

The slides that you are currently looking at are also written in R Markdown. We will take a look at them at the end of the workshop.

If you prepare your teaching slides with R Markdown then you will 1) get more practice with R Markdown and 2) get to be a role model for your students.

---
class: middle

## Data Frames

Context: [Dear Mona, Which State Has the Worst Drivers?](https://fivethirtyeight.com/features/which-state-has-the-worst-drivers/)

---
class: middle

## Data Frame `bad_drivers`

<img src="img/data-matrix.png" width="100%" /><img src="img/data-matrix-tail.png" width="100%" />

---
class: middle

## Data Frame `bad_drivers`

- The data frame has 8 __variables__ (`state`, `num_drivers`, `perc_speeding`, `perc_alcohol`, `perc_not_distracted`, `perc_no_previous`, `insurance_premiums`, `losses`).

- The data frame has 51 __cases__.
Each case represents a US state (or the District of Columbia).

---
class: inverse center middle

.font100[functions for data frames]

---
class: middle

```r
head(bad_drivers)
```

```
## # A tibble: 6 × 8
##   state  num_drivers perc_speeding perc_alcohol perc_not_distra…
##   <chr>        <dbl>         <int>        <int>            <int>
## 1 Alaba…        18.8            39           30               96
## 2 Alaska        18.1            41           25               90
## 3 Arizo…        18.6            35           28               84
## 4 Arkan…        22.4            18           26               94
## 5 Calif…        12              35           28               91
## 6 Color…        13.6            37           28               79
## # … with 3 more variables: perc_no_previous <int>,
## #   insurance_premiums <dbl>, losses <dbl>
```

---
class: middle

```r
tail(bad_drivers)
```

```
## # A tibble: 6 × 8
##   state  num_drivers perc_speeding perc_alcohol perc_not_distra…
##   <chr>        <dbl>         <int>        <int>            <int>
## 1 Vermo…        13.6            30           30               96
## 2 Virgi…        12.7            19           27               87
## 3 Washi…        10.6            42           33               82
## 4 West …        23.8            34           28               97
## 5 Wisco…        13.8            36           33               39
## 6 Wyomi…        17.4            42           32               81
## # … with 3 more variables: perc_no_previous <int>,
## #   insurance_premiums <dbl>, losses <dbl>
```

---
class: middle

```r
glimpse(bad_drivers)
```

```
## Rows: 51
## Columns: 8
## $ state               <chr> "Alabama", "Alaska", "Arizona", "A…
## $ num_drivers         <dbl> 18.8, 18.1, 18.6, 22.4, 12.0, 13.6…
## $ perc_speeding       <int> 39, 41, 35, 18, 35, 37, 46, 38, 34…
## $ perc_alcohol        <int> 30, 25, 28, 26, 28, 28, 36, 30, 27…
## $ perc_not_distracted <int> 96, 90, 84, 94, 91, 79, 87, 87, 10…
## $ perc_no_previous    <int> 80, 94, 96, 95, 89, 95, 82, 99, 10…
## $ insurance_premiums  <dbl> 784.55, 1053.48, 899.47, 827.34, 8…
## $ losses              <dbl> 145.08, 133.93, 110.35, 142.39, 16…
```

---
class: middle

```r
ncol(bad_drivers)
```

```
## [1] 8
```

---
class: middle

```r
nrow(bad_drivers)
```

```
## [1] 51
```

---
class: center middle

## Getting to Know the Data Frame in Action

<video width="80%" height="45%" align = "center" controls>
  <source src="screencast/01i-data-interface.mp4" type="video/mp4">
</video>

---
class: middle

Using the starter code for this section of the workshop, can you write a sentence noting down the number of
variables and number of observations in the `bad_drivers` data frame?

---
class: middle

## Data Frame for You to Try Out `candy_rankings`

<img src="img/data-candy.png" width="100%" /><img src="img/data-candy-tail.png" width="100%" />

---
class: center middle

## Bob Ross

<iframe width="560" height="315" src="https://www.youtube.com/embed/zIbR5TAz2xQ" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
class: middle

```r
glimpse(bob_ross)
```

```
## Rows: 403
## Columns: 71
## $ episode            <chr> "S01E01", "S01E02", "S01E03", "S01E…
## $ season             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ episode_num        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, …
## $ title              <chr> "A WALK IN THE WOODS", "MT. MCKINLE…
## $ apple_frame        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ aurora_borealis    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ barn               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ beach              <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ boat               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bridge             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ building           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bushes             <int> 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,…
## $ cabin              <int> 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ cactus             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ circle_frame       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ cirrus             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ cliff              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ clouds             <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,…
## $ conifer            <int> 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1,…
## $ cumulus            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ deciduous          <int> 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,…
## $ diane_andre        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ dock               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ double_oval_frame  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ farm               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ fence              <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ fire               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ florida_frame      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ flowers            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ fog                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ framed             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ grass              <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ guest              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ half_circle_frame  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ half_oval_frame    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ hills              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lake               <int> 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
## $ lakes              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lighthouse         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mill               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ moon               <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ mountain           <int> 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1,…
## $ mountains          <int> 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,…
## $ night              <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ ocean              <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ oval_frame         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ palm_trees         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ path               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ person             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ portrait           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ rectangle_3d_frame <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ rectangular_frame  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ river              <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ rocks              <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ seashell_frame     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ snow               <int> 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ snowy_mountain     <int> 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1,…
## $ split_frame        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ steve_ross         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ structure          <int> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ sun                <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ tomb_frame         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ tree               <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,…
## $ trees              <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,…
## $ triple_frame       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ waterfall          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ waves              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ windmill           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ window_frame       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ winter             <int> 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ wood_framed        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
```

---
class: center middle

## `candy_rankings` vs `bob_ross`

False - 0

True - 1

---
class: middle

## Variables

<img src="img/data-candy.png" width="100%" /><img src="img/data-candy-tail.png" width="100%" />

---
class: middle

<img src="img/diagram_small.png" width="377" style="display: block; margin: auto;" />

---
class: middle

## Variables

Variables `n_kids` (number of kids), `height`, and `winpercent` are __numerical variables__.

--

We can do certain analyses using these variables such as finding an average `winpercent` or the maximum or minimum `winpercent`.

--

Not everything represented by numbers is a numeric quantity, e.g. student ID or cell phone number.

---
class: middle

## Variables

Variables such as `chocolate`, `fruity`, and `class_year` (first-year, sophomore, junior, senior) are __categorical variables__.

--

Categorical variables have __levels__. For instance, `chocolate` and `fruity` each have two levels, `TRUE` and `FALSE`, and `class_year` has four levels.
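---
class: middle

A minimal sketch of how the levels of a categorical variable could be inspected in base R. The `class_year` values below are made up for illustration and are not part of the workshop data.

```r
# Hypothetical class_year values, invented for this example
class_year <- factor(
  c("junior", "first-year", "senior", "sophomore", "junior"),
  levels = c("first-year", "sophomore", "junior", "senior")
)

levels(class_year)   # the possible categories, in the order given above
nlevels(class_year)  # how many levels there are
table(class_year)    # how many cases fall into each level
```

Listing the levels explicitly in `factor()` keeps them in a meaningful order rather than the default alphabetical one.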
---
class: middle

```r
glimpse(candy_rankings)
```

```
## Rows: 85
## Columns: 6
## $ competitorname <chr> "100 Grand", "3 Musketeers", "One dime"…
## $ chocolate      <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, …
## $ fruity         <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE…
## $ sugarpercent   <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.46…
## $ pricepercent   <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.76…
## $ winpercent     <dbl> 66.97173, 67.60294, 32.26109, 46.11650,…
```

---
class: middle

```r
glimpse(mariokart)
```

```
## Rows: 143
## Columns: 12
## $ id          <dbl> 150377422259, 260483376854, 320432342985, …
## $ duration    <int> 3, 7, 3, 3, 1, 3, 1, 1, 3, 7, 1, 1, 1, 1, …
## $ n_bids      <int> 20, 13, 16, 18, 20, 19, 13, 15, 29, 8, 15,…
## $ cond        <fct> new, used, new, new, new, new, used, new, …
## $ start_pr    <dbl> 0.99, 0.99, 0.99, 0.99, 0.01, 0.99, 0.01, …
## $ ship_pr     <dbl> 4.00, 3.99, 3.50, 0.00, 0.00, 4.00, 0.00, …
## $ total_pr    <dbl> 51.55, 37.04, 45.50, 44.00, 71.00, 45.00, …
## $ ship_sp     <fct> standard, firstClass, firstClass, standard…
## $ seller_rate <int> 1580, 365, 998, 7, 820, 270144, 7284, 4858…
## $ stock_photo <fct> yes, yes, no, yes, yes, yes, yes, yes, yes…
## $ wheels      <int> 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 2, 2, 2, 2, …
## $ title       <fct> "~~ Wii MARIO KART &amp; WHEEL ~ NINTENDO …
```

---
class: middle

`character`: takes string values (e.g. a person's name, address)

--

`integer`: whole number (stored as a 32-bit integer)

--

`double`: floating-point number (double precision)

--

`numeric`: integer or double

--

`factor`: categorical variables with different levels

--

`logical`: TRUE (1), FALSE (0)

---
class: inverse middle

As a data scientist it is .font30[**your**] job to check the type(s) of data that you are working with. Do .font30[**not**] assume you will work with clean data frames, with clean names, labels, and types.
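---
class: middle

One possible way to check types, sketched in base R on a small made-up data frame. The name `candy_small` and its columns are hypothetical and are not part of `candy_rankings`.

```r
# A small made-up data frame, invented for this example
candy_small <- data.frame(
  name       = c("A", "B", "C"),
  chocolate  = c(TRUE, FALSE, TRUE),
  n_sold     = c(10L, 4L, 7L),       # the L suffix stores whole numbers as integer
  winpercent = c(66.9, 32.3, 46.1)
)

class(candy_small$chocolate)      # "logical"
typeof(candy_small$n_sold)        # "integer"
typeof(candy_small$winpercent)    # "double"
is.numeric(candy_small$n_sold)    # TRUE: integers count as numeric
sum(candy_small$chocolate)        # 2, because each TRUE counts as 1

sapply(candy_small, class)        # the class of every column at once
```

Running a check like `sapply(candy_small, class)` right after importing data is a quick way to spot columns that were read in with an unexpected type.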
---
class: middle

## Schedule for the Day

__10:00 - 10:15 Introduction and Setup__

__10:15 - 11:15 Introduction to Toolkit and Data Basics__

11:20 - 12:30 Data Visualization

1:00 - 1:45 Data Wrangling

1:45 - 2:15 Packages and External Datasets

2:15 - 2:30 Wrap Up