This is the take-home part of the final exam. You must work on it individually. Do not ask questions about it on Piazza. Please use office hours for any difficulties you may be having.
In this section I will review and introduce some R functions that might be helpful for you to complete the final.
library(tidyverse)
library(nycflights13)
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
The `flights` dataset from the nycflights13 package contains information on flights departing from the NYC airports (JFK, LGA, and EWR). If we were interested in flights arriving at SNA (John Wayne Airport), we could use the `filter()` function.
flights %>%
filter(dest == "SNA") %>%
glimpse()
## Observations: 825
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 7, 7, 8, 8,...
## $ dep_time <int> 646, 1143, 1832, 727, 1143, 1828, 701, 1823, 70...
## $ sched_dep_time <int> 645, 1145, 1828, 645, 1145, 1828, 705, 1819, 70...
## $ dep_delay <dbl> 1, -2, 4, 42, -2, 0, -4, 4, 2, 8, 101, 16, 7, 3...
## $ arr_time <int> 1023, 1512, 2144, 1024, 1435, 2153, 946, 2045, ...
## $ sched_arr_time <int> 1030, 1507, 2144, 1028, 1507, 2144, 1034, 2138,...
## $ arr_delay <dbl> -7, 5, 0, -4, -32, 9, -48, -53, -37, -19, 55, 1...
## $ carrier <chr> "UA", "UA", "UA", "UA", "UA", "UA", "UA", "UA",...
## $ flight <int> 1496, 1010, 1075, 277, 1010, 1075, 1455, 593, 1...
## $ tailnum <chr> "N38727", "N39726", "N18220", "N820UA", "N33714...
## $ origin <chr> "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR"...
## $ dest <chr> "SNA", "SNA", "SNA", "SNA", "SNA", "SNA", "SNA"...
## $ air_time <dbl> 380, 371, 342, 338, 334, 350, 326, 288, 327, 31...
## $ distance <dbl> 2434, 2434, 2434, 2434, 2434, 2434, 2434, 2434,...
## $ hour <dbl> 6, 11, 18, 6, 11, 18, 7, 18, 7, 18, 18, 18, 7, ...
## $ minute <dbl> 45, 45, 28, 45, 45, 28, 5, 19, 5, 19, 24, 19, 5...
## $ time_hour <dttm> 2013-01-01 06:00:00, 2013-01-01 11:00:00, 2013...
We can see that there are fewer observations and that every observation has `dest` equal to SNA. The `==` in the R code tests whether two values are equal. It is a comparison operator, and comparison operators are often combined with logical operators such as `&` inside `filter()`. The following is a partial list of these operators in R:
Operator | Description |
---|---|
== | equal to |
!= | not equal to |
> | greater than |
>= | greater than or equal to |
< | less than |
<= | less than or equal to |
& | and |
\| | or |
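To see what these operators produce, here is a minimal sketch on a small made-up vector (`hours` and the other object names are hypothetical, not part of the flights data). Each comparison returns one `TRUE` or `FALSE` per element, and `filter()` keeps the rows where the result is `TRUE`.

```r
# Hypothetical departure hours, made up for illustration
hours <- c(5, 11, 12, 18)

is_five       <- hours == 5               # TRUE only where the value equals 5
not_five      <- hours != 5               # the opposite of ==
before_noon   <- hours < 12               # strictly less than 12
morning       <- hours >= 5 & hours < 12  # both conditions must hold
early_or_late <- hours < 6 | hours >= 18  # at least one condition must hold

print(morning)
print(early_or_late)
```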
For instance, the code below shows flights that arrive at SNA or at LAX. Note the output for the `dest` variable.
flights %>%
filter(dest == "SNA" | dest == "LAX") %>%
glimpse()
## Observations: 16,999
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 558, 628, 646, 658, 702, 743, 828, 829, 856, 85...
## $ sched_dep_time <int> 600, 630, 645, 700, 700, 730, 823, 830, 900, 90...
## $ dep_delay <dbl> -2, -2, 1, -2, 2, 13, 5, -1, -4, -1, 21, -4, 32...
## $ arr_time <int> 924, 1016, 1023, 1027, 1058, 1107, 1150, 1152, ...
## $ sched_arr_time <int> 917, 947, 1030, 1025, 1014, 1100, 1143, 1200, 1...
## $ arr_delay <dbl> 7, 29, -7, 2, 44, 7, 7, -8, 6, -2, 10, 2, 39, 1...
## $ carrier <chr> "UA", "UA", "UA", "VX", "B6", "AA", "UA", "UA",...
## $ flight <int> 194, 1665, 1496, 399, 671, 33, 1506, 443, 1, 40...
## $ tailnum <chr> "N29129", "N33289", "N38727", "N627VA", "N779JB...
## $ origin <chr> "JFK", "EWR", "EWR", "JFK", "JFK", "JFK", "EWR"...
## $ dest <chr> "LAX", "LAX", "SNA", "LAX", "LAX", "LAX", "LAX"...
## $ air_time <dbl> 345, 366, 380, 361, 381, 358, 359, 360, 358, 35...
## $ distance <dbl> 2475, 2454, 2434, 2475, 2475, 2475, 2454, 2475,...
## $ hour <dbl> 6, 6, 6, 7, 7, 7, 8, 8, 9, 9, 9, 9, 9, 10, 11, ...
## $ minute <dbl> 0, 30, 45, 0, 0, 30, 23, 30, 0, 0, 0, 45, 21, 3...
## $ time_hour <dttm> 2013-01-01 06:00:00, 2013-01-01 06:00:00, 2013...
The code below does not return any flights, because a single flight cannot arrive at both SNA and LAX: its `dest` value cannot equal both at once.
flights %>%
filter(dest == "SNA" & dest == "LAX")
## # A tibble: 0 x 19
## # ... with 19 variables: year <int>, month <int>, day <int>,
## # dep_time <int>, sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
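As a side note, chaining `|` gets long when the same variable is compared against several values. One alternative, not used elsewhere in this review, is R's `%in%` operator. The sketch below uses a small made-up tibble (`toy_flights` is a hypothetical name) rather than the full flights data, and contrasts "or", `%in%`, and "and" on the same column.

```r
library(dplyr)

# Made-up data for illustration; not taken from nycflights13
toy_flights <- tibble(
  flight = c(1, 2, 3, 4),
  dest   = c("SNA", "LAX", "ORD", "LAX")
)

# Chained "or": keep rows where dest is SNA or LAX
with_or <- toy_flights %>%
  filter(dest == "SNA" | dest == "LAX")

# The same rows, selected with %in%
with_in <- toy_flights %>%
  filter(dest %in% c("SNA", "LAX"))

# "and" on the same variable: no single dest can equal both values
with_and <- toy_flights %>%
  filter(dest == "SNA" & dest == "LAX")

nrow(with_or)
nrow(with_in)
nrow(with_and)
```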
Lastly, we are going to use `mutate()` and `case_when()` to flag morning flights. If a flight is scheduled to depart from 5 am up to (but not including) noon, we will call it a morning flight.
flights %>%
mutate(morning = case_when(hour >= 5 & hour < 12 ~ "yes",
hour < 5 | hour >= 12 ~ "no")) %>%
glimpse()
## Observations: 336,776
## Variables: 20
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
## $ morning <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes"...
Notice that `mutate()` is used to create a new variable called `morning`. Inside the `case_when()` function, we used logical statements to specify that if the `hour` variable is greater than or equal to 5 and less than 12, then the `morning` variable is "yes". Similarly, we specified when `morning` is "no". If we had not defined the second condition, the `morning` variable would be `NA` for every flight that does not meet the first condition.
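The effect of leaving out the second condition can be checked on a small made-up tibble (`toy`, `morning_one`, and `morning_all` are hypothetical names). This sketch also shows a common alternative to writing out the complementary condition: a final `TRUE ~ "no"` line that catches every row not matched earlier. Be aware that this catch-all would also label rows with a missing `hour` as "no".

```r
library(dplyr)

# Made-up scheduled departure hours for illustration
toy <- tibble(hour = c(5, 9, 12, 23))

toy_labeled <- toy %>%
  mutate(
    # Only one condition: rows that do not match it become NA
    morning_one = case_when(hour >= 5 & hour < 12 ~ "yes"),
    # A final TRUE condition catches every row not matched earlier
    morning_all = case_when(hour >= 5 & hour < 12 ~ "yes",
                            TRUE ~ "no")
  )

toy_labeled
```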
flights %>%
mutate(morning = if_else(hour >= 5 & hour < 12, "yes", "no")) %>%
glimpse()
## Observations: 336,776
## Variables: 20
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
## $ morning <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes"...
`case_when()` is a useful function when you have more than two categories. Since we have only two, morning or not, we can also use the `if_else()` function, as in the code above. We first define a condition, then the result if the condition is true, and then the result if the condition is false.
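For a case with more than two categories, here is a minimal sketch on made-up hours. The names `toy`, `toy_parts`, and `part_of_day`, as well as the noon-to-6-pm cutoff for "afternoon", are assumptions chosen only for illustration. Conditions are checked in order, and each row gets the label of the first condition it meets.

```r
library(dplyr)

# Made-up scheduled departure hours for illustration
toy <- tibble(hour = c(3, 5, 11, 12, 17, 21))

toy_parts <- toy %>%
  mutate(part_of_day = case_when(
    hour >= 5 & hour < 12  ~ "morning",    # 5 am up to, not including, noon
    hour >= 12 & hour < 18 ~ "afternoon",  # assumed cutoff: noon up to 6 pm
    TRUE                   ~ "other"       # everything not matched above
  ))

toy_parts
```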
Metro Bike Share is a bike sharing system that operates in Los Angeles. Metro has been administering this bike share system since July 7, 2016. The way it works is riders can pick up a bike from one of the bike stations, ride it, and return it to a bike station. Make sure to read their homepage for more information about how the pricing works.
Metro provides data on rides here. For this part of the final exam, we will be using the data from the third quarter of 2019. Download the dataset and answer the following questions based on this dataset.
1. Read the data file into R. Call your data frame object `bike`. Glimpse at the data frame. How many observations are there? How many variables are there? What does each row of the data frame represent? Note that the `duration` variable is measured in minutes.
2. Can you calculate how much money riders paid to Metro in the third quarter of 2019? If yes, calculate the value. If not, explain why it cannot be calculated.
3. If you look closely at `passholder_type`, you will see that some of the rides were test rides. Eliminate every test ride from the dataset.
4. Three types of bikes are used in the Metro bike system: standard, electric, and smart. We want to compare standard bikes with the other two types. Make a new variable called `standard`. This variable should have "yes" values for bikes that are standard and "no" values for bikes that are electric or smart.
5. We want to know whether nonstandard bike rides are longer than standard bike rides. Write out hypotheses to test this question.
6. What is the point estimate for the difference in biking duration between nonstandard bike rides and standard bike rides?
7. What is the standard error?
8. Calculate the test statistic and p-value, and state your conclusion on whether nonstandard bike rides are significantly longer than standard bike rides at the 0.05 significance level.
9. Can we make inference based on the rides in this dataset? Why or why not? Check conditions.
Question | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
Number of Points | 9 | 4 | 6 | 6 | 6 | 5 | 5 | 6 | 3 |
Submit your final on Gradescope in PDF format. Make sure that each question is on a separate page. If it is not, you will lose 5 points from the total score.